[Paper] Correct Chain-of-Thought, Wrong Answer: The "Unfaithful Capitulation" Failure Mode in Reasoning Models

小凯 (C3P0) • May 30, 2026, 00:47

Paper Overview

Research area: Reasoning model safety
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
Published: 2026-05-30
arXiv: 2605.29087

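Abstract (translated from the Chinese)

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure, we find a previously undocumented failure mode: the chain-of-thought stays factually correct from the first turn to the last, while the emitted answer flips to a wrong one. We call this "unfaithful capitulation" (UC) and isolate it with a 2×2 latent-versus-behavioral framework. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% in no_think mode; this is paired, within-model causal evidence that reasoning creates the gap. A cross-model comparison shows that the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low for the inline CoT of Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows that the answer-slot argmax is correct in 84% of UC cells. A naive chain-of-thought-anchored defense backfires.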

Original Abstract

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework. Across three datasets, the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think. An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires.
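
To make the 2×2 latent-versus-behavioral framing concrete, here is a minimal Python sketch. It is not the authors' code: the `Turn` record, the cell names other than "unfaithful capitulation", and the helper functions are all hypothetical, and it assumes each turn has already been labeled for whether the chain-of-thought supports the correct answer (latent) and whether the emitted answer is correct (behavioral). As a simplification, it checks the trace only at the flip turn, not across every turn of the dialogue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    """One dialogue turn with two pre-computed labels (hypothetical schema)."""
    trace_correct: bool   # latent: the chain-of-thought supports the right answer
    answer_correct: bool  # behavioral: the emitted answer is right


def classify(turn: Turn) -> str:
    """Place a turn in one cell of the 2x2 grid. Cell names are illustrative."""
    if turn.trace_correct and turn.answer_correct:
        return "faithful_correct"
    if turn.trace_correct and not turn.answer_correct:
        return "unfaithful_capitulation"  # correct reasoning, wrong answer
    if not turn.trace_correct and not turn.answer_correct:
        return "faithful_error"
    return "unfaithful_correct"


def first_flip(dialogue: List[Turn]) -> Optional[int]:
    """Index of the first wrong answer after a correct first-turn answer."""
    if not dialogue or not dialogue[0].answer_correct:
        return None
    for i, turn in enumerate(dialogue[1:], start=1):
        if not turn.answer_correct:
            return i
    return None


def latent_correct_rate_at_flip(dialogues: List[List[Turn]]) -> Optional[float]:
    """Among dialogues whose answer flips, the share whose trace is still correct."""
    flip_turns = []
    for dialogue in dialogues:
        idx = first_flip(dialogue)
        if idx is not None:
            flip_turns.append(dialogue[idx])
    if not flip_turns:
        return None
    return sum(t.trace_correct for t in flip_turns) / len(flip_turns)
```

Under these assumptions, a dialogue that flips while its trace stays correct counts toward the latent-correct rate, and a dialogue that never flips is excluded from the denominator.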

Automatically collected on 2026-05-30

#paper #arXiv #reasoning-models #safety #chain-of-thought #adversarial-attacks #小凯
