刘昊辰 (2025-12-01 09:56):
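#paper Superhuman AI for Stratego Using Self-Play Reinforcement Learning and Test-Time Search. The research team built Ataraxos, a superhuman Stratego AI that uses self-play reinforcement learning and test-time search to overcome the game's massive amount of hidden information. Training cost only a few thousand dollars (16 H100s for 1 week plus 4 H100s for 4 days, under $8,000 in total). In a 20-game match it beat Pim Niemeijer, the most accomplished Stratego player of all time, with 15 wins, 1 loss, and 4 draws (an 85% effective win rate), and in a demonstration at the 2025 Stratego World Championship it achieved a 95% effective win rate against ordinary players. The core innovations are self-play reinforcement learning with dynamic damping (coordinating regularization strength, policy update size, and policy strength), separate setup and move networks (both Transformer-based), and test-time search based on a belief network. A GPU-accelerated simulator (about 10 million state updates per second) and data-handling optimizations (such as the bfloat16 data type and zero-retrieval (零检索) data transfer) keep training cheap and efficient, and the result far exceeds earlier approaches such as DeepNash in both performance and cost. Download: https://arxiv.org/pdf/2511.07312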
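The two headline figures in the summary above can be sanity-checked with a few lines of Python. This is only a rough sketch: it assumes "effective win rate" counts a draw as half a win (which reproduces the quoted 85%), and the per-GPU-hour price is merely the upper bound implied by the "under $8,000" figure, not a number stated in the post. The function names are hypothetical.

```python
def effective_win_rate(wins: int, losses: int, draws: int) -> float:
    """Effective win rate, assuming a draw counts as half a win."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for num_gpus running continuously for the given days."""
    return num_gpus * days * 24


# Match result quoted in the post: 15 wins, 1 loss, 4 draws over 20 games.
match_rate = effective_win_rate(wins=15, losses=1, draws=4)

# Training budget quoted in the post: 16 H100s for 1 week plus 4 H100s for 4 days.
total_gpu_hours = gpu_hours(16, 7) + gpu_hours(4, 4)

# Upper bound on the hourly price implied by a total cost under 8000 USD.
max_price_per_gpu_hour = 8000 / total_gpu_hours

print(f"effective win rate: {match_rate:.0%}")
print(f"total GPU-hours: {total_gpu_hours:.0f}")
print(f"implied price ceiling: {max_price_per_gpu_hour:.2f} USD per GPU-hour")
```

Under these assumptions the match works out to 17 points from 20 games, and the training run to 3,072 H100-hours, so the stated budget corresponds to roughly $2.60 per GPU-hour or less.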
arXiv, 2025-11-10T17:13:41Z. DOI: 10.48550/arXiv.2511.07312
Superhuman AI for Stratego Using Self-Play Reinforcement Learning and Test-Time Search
Abstract:
Few classical games have been regarded as such significant benchmarks of artificial intelligence as to have justified training costs in the millions of dollars. Among these, Stratego -- a board wargame exemplifying the challenge of strategic decision making under massive amounts of hidden information -- stands apart as a case where such efforts failed to produce performance at the level of top humans. This work establishes a step change in both performance and cost for Stratego, showing that it is now possible not only to reach the level of top humans, but to achieve vastly superhuman level -- and that doing so requires not an industrial budget, but merely a few thousand dollars. We achieved this result by developing general approaches for self-play reinforcement learning and test-time search under imperfect information.