林海onrush (2025-12-31 21:49):
#paper, Superposition Yields Robust Neural Scaling, DOI: 10.48550/arXiv.2505.10465. Runner-up paper award at NeurIPS 2025; AI work from a team with an MIT physics background. The paper proposes that the power-law neural scaling of loss with model size (wider models / higher dimension → lower loss) may stem mainly from a representation-level "superposition" mechanism: when the number of features to be represented far exceeds the hidden dimension, the model packs many features into the same set of dimensions, so their representation vectors overlap and interfere with each other; as the dimension m grows, random geometry makes the average strength of this overlap fall roughly as 1/m, yielding a robust L ∝ 1/m power law. Using a controllable toy model, the authors contrast weak and strong superposition: under weak superposition, scaling depends on a power-law tail in the data's feature frequencies, whereas under strong superposition a scaling exponent close to 1 emerges generically. They further measure, across a range of real LLMs, that the overlap between token output-weight vectors decays roughly as 1/m with width, with a width exponent of about 0.9, supporting the picture that large models operate in the strong-superposition regime and that geometric interference drives the scaling.
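The geometric part of the argument is easy to see directly: nearly random unit vectors in m dimensions have typical squared overlap of about 1/m. A minimal numpy sketch (not the authors' code; the feature count and dimensions here are arbitrary) illustrating that decay:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 2048  # hypothetical number of features packed into each hidden space

for m in (64, 128, 256, 512, 1024):
    V = rng.standard_normal((n_features, m))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit representation vectors
    G = V @ V.T                                     # pairwise overlaps (Gram matrix)
    off_diag = G[~np.eye(n_features, dtype=bool)]   # ignore self-overlaps
    print(f"m={m:5d}  mean squared overlap={np.mean(off_diag**2):.5f}  1/m={1/m:.5f}")
```

The mean squared overlap tracks 1/m closely, which is the interference term the paper argues dominates the loss under strong superposition.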
arXiv, 2025-05-15T16:18:13Z. DOI: 10.48550/arXiv.2505.10465
Superposition Yields Robust Neural Scaling
Abstract:
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling inversely with model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
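A minimal sketch of the kind of setup the abstract describes: an Anthropic-style toy model with n sparse features compressed into m < n hidden dimensions, where the weight-decay strength is the knob that pushes the model toward weak or strong superposition. All sizes, the frequency distribution, and the optimizer settings below are hypothetical choices for illustration, not the paper's exact configuration:

```python
import torch

n, m = 512, 64                                       # assumed: many more features than dimensions
W = torch.nn.Parameter(0.01 * torch.randn(n, m))     # one representation vector per feature
b = torch.nn.Parameter(torch.zeros(n))
# weight_decay controls how strongly rarely-used features are suppressed,
# i.e. the degree of superposition (value here is arbitrary).
opt = torch.optim.AdamW([W, b], lr=1e-3, weight_decay=1e-2)

freqs = 1.0 / torch.arange(1, n + 1).float()         # assumed power-law feature frequencies
for step in range(2000):
    active = (torch.rand(1024, n) < freqs).float()   # sparse activations, feature i active w.p. freqs[i]
    x = active * torch.rand(1024, n)
    h = x @ W                                        # compress n features into m dimensions
    x_hat = torch.relu(h @ W.T + b)                  # reconstruct features from the compressed code
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```

Sweeping m with this kind of toy model, while holding the frequency distribution and regularization fixed, is how one would reproduce the loss-versus-dimension curves the abstract refers to.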