林海onrush
(2026-01-31 23:55):
#paper, DOI: arXiv:2406.03816, ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search. This paper proposes ReST-MCTS*, an LLM self-training framework that combines process rewards with a modified Monte Carlo tree search (MCTS*), aiming to address the problem that existing self-training methods rely only on final-answer correctness and therefore tend to admit low-quality intermediate reasoning. Given only the final correct answer, the method uses multiple rollouts during tree search to automatically infer, for each intermediate reasoning step, the probability that it contributes to reaching the correct solution, thereby producing high-quality process reward signals used to train both the policy model and the process reward model. Experiments show that, under the same search budget, ReST-MCTS* achieves higher reasoning accuracy than Best-of-N, Tree-of-Thought, and other baselines, and that it continues to improve model performance over multiple self-training iterations, clearly surpassing existing self-training paradigms such as ReST^EM and Self-Rewarding LM, demonstrating its effectiveness at obtaining high-quality reasoning traces and achieving stable self-improvement.
arXiv, 2024-06-06T07:40:00Z.
DOI: 10.48550/arXiv.2406.03816
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Abstract:
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM. We release all code at https://github.com/THUDM/ReST-MCTS.
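As a rough illustration of the rollout-based per-step value estimation described above, here is a minimal Python sketch (not the authors' implementation; rollout_fn and is_correct are hypothetical placeholders for an LLM completion call and an oracle answer check). For each prefix of a reasoning trace, it runs several rollouts and takes the fraction that reach the correct final answer as a Monte Carlo estimate of that step's process reward.

```python
from typing import Callable, List

def estimate_step_values(
    steps: List[str],
    rollout_fn: Callable[[List[str]], str],   # hypothetical: completes a partial trace into a final answer
    is_correct: Callable[[str], bool],        # hypothetical: oracle check against the known correct answer
    num_rollouts: int = 8,
) -> List[float]:
    """Monte Carlo estimate of per-step process rewards: for each prefix of the
    reasoning trace, the fraction of rollouts from that prefix that end in the
    correct final answer."""
    values: List[float] = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(is_correct(rollout_fn(prefix)) for _ in range(num_rollouts))
        values.append(hits / num_rollouts)
    return values
```

In the framework described by the paper, per-step values of this kind serve two roles: they act as training targets for the process reward model and as a filter for selecting high-quality traces for policy self-training.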