Vincent
(2025-10-31 16:28):
#paper https://doi.org/10.48550/arXiv.2510.14901 arXiv. 2025. Reasoning with Sampling: Your Base Model is Smarter Than You Think. Large language models (LLMs) post-trained with reinforcement learning (RL) have shown strong reasoning across many domains, and prior work has largely focused on whether RL endows base models with capabilities they did not originally have. This paper takes a different angle and asks a thought-provoking question: can pure sampling alone, without any additional training, elicit reasoning from a base model comparable to an RL-trained policy? Building on the model's own likelihoods, the authors propose a simple iterative sampling method based on Markov chain Monte Carlo (MCMC). Experiments show that across multiple base models the method matches and in some cases exceeds RL algorithms. More importantly, it avoids the diversity collapse common with RL, and requires no extra data or training, suggesting broad applicability across domains.
arXiv, 2025-10-16T17:18:11Z.
DOI: 10.48550/arXiv.2510.14901
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Aayush Karan,
Yilun Du
Abstract:
Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
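The core idea — MCMC sampling from a sharpened distribution p(x)^α using only the base model's likelihoods — can be illustrated with a toy sketch. This is not the paper's actual algorithm: the "base model" here is a two-token bigram model standing in for an LLM, the whole-sequence-resample proposal is a simplification, and names like `alpha` and `mh_sample` are illustrative. With an independent proposal drawn from the base model p itself, the Metropolis-Hastings acceptance ratio for the target p(x)^α reduces to (p(x')/p(x))^(α−1).

```python
import math
import random

# Toy stand-in for an LLM: a bigram model over the vocabulary {0, 1}
# that slightly prefers repeating the previous token. All names here
# (alpha, mh_sample, etc.) are illustrative assumptions, not the paper's API.

def token_prob(prev, tok):
    """Toy next-token probability: 0.7 to repeat `prev`, 0.3 to switch."""
    return 0.7 if tok == prev else 0.3

def sample_seq(length=3, rng=random):
    """Draw a sequence from the toy base model (prev initialized to 0)."""
    seq, prev = [], 0
    for _ in range(length):
        tok = 0 if rng.random() < token_prob(prev, 0) else 1
        seq.append(tok)
        prev = tok
    return seq

def log_p(seq):
    """Base-model log-likelihood of a full sequence."""
    lp, prev = 0.0, 0
    for tok in seq:
        lp += math.log(token_prob(prev, tok))
        prev = tok
    return lp

def mh_sample(alpha=4.0, steps=200, rng=random):
    """Metropolis-Hastings targeting the sharpened distribution p(x)^alpha.

    The proposal is an independent draw from the base model, so the
    acceptance ratio simplifies to (p(x_new) / p(x))^(alpha - 1):
    only base-model likelihoods are needed, no training or verifier.
    """
    x = sample_seq(rng=rng)
    for _ in range(steps):
        x_new = sample_seq(rng=rng)
        log_ratio = (alpha - 1.0) * (log_p(x_new) - log_p(x))
        if math.log(rng.random()) < log_ratio:
            x = x_new
    return x
```

With α > 1 the chain concentrates on high-likelihood sequences (here the mode `[0, 0, 0]`) far more than naive sampling from p would, which is the "sharpening" effect the abstract refers to; α = 1 recovers ordinary sampling from the base model.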