Paper Summary
Field: NLP | Authors: Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh | Published: 2026-05-01 | arXiv: 2605.00817
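Summary
Large language models (LLMs) perform strongly on reasoning benchmarks, but final-answer accuracy alone cannot reveal whether they faithfully execute the procedure given in the prompt. This paper studies the question with a controlled diagnostic benchmark for procedural execution: the model is given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value.
The benchmark uses only simple arithmetic operations, but raises complexity through algorithm length and look-back dependencies on intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy falls from 61% on 5-step procedures to 20% on 95-step procedures.
Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings indicate that apparent reasoning ability can mask serious weaknesses in faithful instruction execution.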
Original Abstract
Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
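To make the task format concrete, here is a minimal Python sketch of what a step-wise arithmetic procedure with look-back dependencies might look like, together with a reference executor that produces the expected final value. It is not the authors' generator. The names `Step`, `make_procedure`, `execute`, and `render`, the variable naming scheme (`a`, `b`, `v1`, `v2`, ...), the choice of operations, and the modulus used to keep values small are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Assumption: results are reduced modulo MOD so values stay small over long
# procedures; the abstract does not say how the benchmark bounds its values.
MOD = 1000

# Assumption: the "simple arithmetic operations" are +, -, and *.
OPS = {
    "+": lambda x, y: (x + y) % MOD,
    "-": lambda x, y: (x - y) % MOD,
    "*": lambda x, y: (x * y) % MOD,
}


@dataclass(frozen=True)
class Step:
    """One line of the procedure: target = left op right (hypothetical format)."""
    target: str
    left: str
    op: str
    right: str


def make_procedure(n_steps: int, max_look_back: int, seed: int) -> list[Step]:
    """Build a random procedure whose operands come from recent variables.

    Each step may only read one of the last `max_look_back` defined names,
    which is one simple way to model a look-back dependency.
    """
    rng = random.Random(seed)
    names = ["a", "b"]  # the two numeric inputs
    steps = []
    for i in range(1, n_steps + 1):
        window = names[-max_look_back:]
        step = Step(
            target=f"v{i}",
            left=rng.choice(window),
            op=rng.choice(sorted(OPS)),
            right=rng.choice(window),
        )
        steps.append(step)
        names.append(step.target)
    return steps


def execute(steps: list[Step], a: int, b: int) -> int:
    """Reference executor: run every step in order and return the last value."""
    env = {"a": a, "b": b}
    for s in steps:
        env[s.target] = OPS[s.op](env[s.left], env[s.right])
    return env[steps[-1].target]


def render(steps: list[Step]) -> str:
    """Render the procedure as numbered text lines, as a prompt might show it."""
    return "\n".join(
        f"Step {i}: {s.target} = {s.left} {s.op} {s.right}"
        for i, s in enumerate(steps, 1)
    )
```

Under these assumptions, `render(make_procedure(5, 2, seed=0))` gives the text of a 5-step procedure and `execute(...)` gives the value a faithful executor should return. A model's first stated answer could then be compared against that value. Increasing `n_steps` or `max_look_back` corresponds to the two complexity axes the abstract names.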
--- *Automatically collected on 2026-05-05*
#paper #arXiv #NLP #小凯