Paper Summary
Research area: Reasoning model safety
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
Published: 2026-05-30
arXiv: 2605.29087
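Abstract (translated from Chinese)
Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure, we find a previously undocumented failure mode: the chain-of-thought stays factually correct from the first turn to the last, while the emitted answer flips to a wrong one. We call this "unfaithful capitulation" (UC) and isolate it with a 2×2 latent-versus-behavioral framework. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% in no_think mode; this is paired, within-model causal evidence that reasoning creates the gap. A cross-model comparison shows that the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in the inline CoT of Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows that the answer-slot argmax is correct in 84% of UC cells. A naive trace-anchored defense backfires.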
Original abstract
Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework. Across three datasets, the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think. An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires.
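To make the 2×2 latent-versus-behavioral framing concrete, here is a minimal Python sketch. It is not the authors' code. The `Turn` record, the cell names other than "unfaithful capitulation", and the definition of a "flip" (the first turn whose answer is wrong after a correct first-turn answer) are assumptions made for illustration; the abstract does not specify how traces are graded or how flips are detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    """One graded dialogue turn (hypothetical record, not from the paper)."""
    trace_correct: bool   # latent: the chain-of-thought supports the gold answer
    answer_correct: bool  # behavioral: the emitted answer is the gold answer


def classify_cell(turn: Turn) -> str:
    """Map a turn to one cell of the 2x2 grid (cell names are assumptions)."""
    if turn.trace_correct and turn.answer_correct:
        return "correct_trace_correct_answer"
    if turn.trace_correct and not turn.answer_correct:
        return "unfaithful_capitulation"  # the UC cell described in the abstract
    if not turn.trace_correct and not turn.answer_correct:
        return "wrong_trace_wrong_answer"
    return "wrong_trace_correct_answer"


def first_flip(dialogue: List[Turn]) -> Optional[int]:
    """Index of the first wrong answer, given a correct first-turn answer."""
    if not dialogue or not dialogue[0].answer_correct:
        return None
    for i, turn in enumerate(dialogue[1:], start=1):
        if not turn.answer_correct:
            return i
    return None


def latent_correct_rate_at_flip(dialogues: List[List[Turn]]) -> Optional[float]:
    """Share of behavioral flips whose trace is still correct at the flip turn."""
    at_flip = []
    for dialogue in dialogues:
        i = first_flip(dialogue)
        if i is not None:
            at_flip.append(dialogue[i].trace_correct)
    if not at_flip:
        return None
    return sum(at_flip) / len(at_flip)
```

Under these assumptions, a rate near 0.5 would correspond to the think-mode figure quoted above, and a rate of 0.11 to 0.15 to the no_think figure. The sketch checks the trace only at the flip turn, whereas the abstract describes traces that stay correct from the first turn to the last, so a stricter variant would require `trace_correct` on every turn up to the flip.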
Automatically collected on 2026-05-30
#paper #arXiv #reasoning-models #safety #chain-of-thought #adversarial-attacks #小凯