[Paper] Pessimism's Paradox: Conservative Offline Training Amplifies Reward Ha...

Paper Overview

Research area: Alignment | Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain | Published: 2026-07-01 | arXiv: 2507.00007

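Chinese Abstract (translated)

Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy with Direct Preference Optimisation (DPO) at three levels of conservatism (β ∈ {β_lo, β_mid, β_hi}, derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble (3× Qwen3-1.7B) while measuring true performance as GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman ρ = 1.0 across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-β DPO compresses policy entropy; (ii) low-entropy policies generate less diverse responses, concentrated in a narrow region of the reward model's training distribution (lower pairwise cosine distance); (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with β and is exploited faster during online optimisation. We further fit a power-law curve to the (β, AUGC) data and identify a practical optimal conservatism level β* that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs calibrated conservatism rather than maximal conservatism.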
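The abstract names the Goodhart gap, its area under the curve (AUGC), and a Spearman rank correlation, but does not spell out how they are computed. The sketch below shows one plausible reading, not the authors' implementation. It assumes the gap is the proxy reward minus true performance at each online step and that AUGC is the trapezoidal area under that gap curve. All function names are hypothetical, and the Spearman helper assumes there are no tied values.

```python
from typing import List, Sequence


def goodhart_gap(proxy: Sequence[float], true: Sequence[float]) -> List[float]:
    """Per-step gap between proxy reward and true performance.

    Assumption: gap = proxy - true; the abstract does not give the exact form.
    """
    if len(proxy) != len(true):
        raise ValueError("proxy and true curves must have the same length")
    return [p - t for p, t in zip(proxy, true)]


def augc(steps: Sequence[float], gap: Sequence[float]) -> float:
    """Area under the gap curve, integrated with the trapezoidal rule."""
    if len(steps) != len(gap):
        raise ValueError("steps and gap must have the same length")
    return sum(
        0.5 * (gap[i] + gap[i + 1]) * (steps[i + 1] - steps[i])
        for i in range(len(steps) - 1)
    )


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks 1..n in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via the no-ties closed form."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

With only three β levels, any strictly increasing relationship between β and AUGC yields ρ = 1.0, so the reported correlation establishes the ordering of the three conditions rather than the shape of the trend.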

Original Abstract

Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism (β ∈ {β_lo, β_mid, β_hi} derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble (3× Qwen3-1.7B) while measuring true performance on GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearma...

--- *Automatically collected on 2026-07-01*

#Paper #arXiv #Alignment #小凯
