Paper summary
Field: ML
Authors: Zhiyuan Zhai, Xinkai You, Wenjing Yan
Published: 2026-05-26
arXiv: 2505.21642
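Abstract (translated from Chinese)
Large language models with reasoning ability solve hard problems by emitting long chains of thought, at a high cost in latency, GPU time, and energy. A casual look at their traces shows a great deal of reformulation, verification, and circular self-reflection, but how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper fills both gaps.

We formalise reasoning redundancy directly in the reasoning model itself: the redundancy of a correct reasoning trace is the largest fraction of its trailing segmented steps that can be truncated while the model π, forced to stop thinking and output a final answer, still answers correctly. A large-scale quantification over four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high: between 61% and 93% across the 8 (model, benchmark) conditions studied, with a median critical prefix of a single segmented step in six of them. The finding is robust to the choice of judge family, and although ρ decreases with problem difficulty on MATH-500, all four models remain substantially redundant even on the hardest Level-5 problems (ρ ∈ [46%, 85%]).

We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards rather than a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of the RL algorithm, base model, or data distribution, and regardless of whether the policy is obtained through RL or distillation. Over-thinking is therefore not a bug to patch in individual models but a structural property of how current reasoning models are trained.

Original abstract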
Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while π, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high -- between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions -- that the finding is robust to the choice of judge family, and that although ρ decreases with problem difficulty on MATH-500, all four models remain substantially redundant (ρ∈[46%, 85%]) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained.
Code: this https URL

---
*Automatically collected on 2026-05-27*
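
Illustration: the redundancy definition in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `critical_prefix_length`, `redundancy`, `is_correct_after`, and `min_prefix` are hypothetical. `is_correct_after` stands in for "force π to stop thinking after this prefix, emit a final answer, and judge it", and the sketch assumes the shortest prefix tried contains one step.

```python
from typing import Callable, Sequence


def critical_prefix_length(
    steps: Sequence[str],
    is_correct_after: Callable[[Sequence[str]], bool],
    min_prefix: int = 1,  # assumption: the shortest prefix tried has one step
) -> int:
    """Return the smallest k such that answering after steps[:k] is correct."""
    for k in range(min_prefix, len(steps) + 1):
        if is_correct_after(steps[:k]):
            return k
    raise ValueError("the full trace never yields a correct answer")


def redundancy(
    steps: Sequence[str],
    is_correct_after: Callable[[Sequence[str]], bool],
    min_prefix: int = 1,
) -> float:
    """Largest fraction of trailing steps that can be truncated safely."""
    n = len(steps)
    if n == 0:
        raise ValueError("trace has no steps")
    k = critical_prefix_length(steps, is_correct_after, min_prefix)
    return (n - k) / n


if __name__ == "__main__":
    # Toy stand-in for the model and judge: correct once three steps are kept.
    toy_trace = [f"step {i}" for i in range(1, 11)]
    print(redundancy(toy_trace, lambda prefix: len(prefix) >= 3))
```

For the toy trace of ten steps that becomes correct from three steps onward, the critical prefix is 3 and the redundancy is 7/10. Scanning from the shortest prefix upward matters because correctness need not be monotone in prefix length, and the definition asks for the largest truncatable fraction.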
#paper #arXiv #ML #ReasoningRedundancy #ChainOfThought #小凯