[Paper] When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execu...

Paper Overview

Research area: NLP | Authors: Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh | Published: 2026-05-01 | arXiv: 2605.00817

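Summary

Large language models (LLMs) perform strongly on reasoning benchmarks, but final-answer accuracy alone does not reveal whether they faithfully execute the procedure given in the prompt. The paper studies this question with a controlled diagnostic benchmark for procedural execution: the model is given a step-by-step arithmetic algorithm and two numeric inputs, and must return the final computed value.

The benchmark uses only simple arithmetic operations, but raises complexity through algorithm length and look-back dependencies on intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy falls from 61% on 5-step procedures to 20% on 95-step procedures.

Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask serious weaknesses in faithful instruction execution.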

Original Abstract

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
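To make the task format concrete, here is a minimal sketch of what a step-wise arithmetic procedure with look-back dependencies could look like, together with a reference executor. It is not the authors' benchmark code: the operation set, the `look_back` window, the modulus that keeps values small, the prompt wording, and all function names are assumptions chosen for illustration.

```python
import random

# Assumed operation set; the abstract only says "simple arithmetic operations".
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_procedure(n_steps, look_back, seed=0):
    """Build a list of steps, each a tuple (op, left_index, right_index).

    Indices point into the list of values known so far, where index 0 and
    index 1 are the two numeric inputs. Each step may only read values from
    the last `look_back` entries (a hypothetical way to control dependencies).
    """
    rng = random.Random(seed)
    steps = []
    for i in range(n_steps):
        n_available = i + 2  # two inputs plus i earlier results
        lowest = max(0, n_available - look_back)
        left = rng.randrange(lowest, n_available)
        right = rng.randrange(lowest, n_available)
        steps.append((rng.choice(sorted(OPS)), left, right))
    return steps


def execute(steps, x, y, modulus=1000):
    """Reference executor: run every step in order and return the last value."""
    values = [x, y]
    for op, left, right in steps:
        # The modulus is an assumption that keeps intermediate values bounded.
        values.append(OPS[op](values[left], values[right]) % modulus)
    return values[-1]


def render_prompt(steps, modulus=1000):
    """Render the procedure as numbered text lines (hypothetical wording)."""
    names = ["x", "y"] + [f"v{i + 1}" for i in range(len(steps))]
    lines = []
    for i, (op, left, right) in enumerate(steps):
        lines.append(
            f"Step {i + 1}: v{i + 1} = ({names[left]} {op} {names[right]}) mod {modulus}"
        )
    return "\n".join(lines)
```

Under these assumptions, a first-answer check would compare the first value a model reports against `execute(steps, x, y)`. Raising `n_steps` lengthens the procedure, and widening `look_back` lets later steps depend on older intermediate variables.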

--- *Automatically collected on 2026-05-05*

#Paper #arXiv #NLP #小凯
