Paper Summary
Research area: AI
Authors: Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, et al.
Published: 2026-05-28
arXiv: 2605.27681
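Abstract (translated from Chinese)
Alignment faking (AF) is when a model strategically complies with a training objective to avoid having its behaviour modified, while preserving its deployment preferences. As models become better at distinguishing training from deployment environments, understanding when and why AF arises becomes critical. Prior work found AF to be fragile, prompt-sensitive, and model-dependent, with its underlying drivers unclear. This paper studies AF in a controlled, minimal setup that isolates its core components, and observes AF across a wider range of models than previously reported, including small-scale models. The study identifies three separable drivers (values, goal guarding, and sycophancy) and shows, through targeted prompt ablations and activation steering, that each independently modulates AF behaviour. The results indicate that AF is more widespread than previously reported, and that its occurrence can be predicted from situational cues and measurable model tendencies such as baseline sycophancy and stated values. This decomposition offers concrete directions for detecting and mitigating AF in future work.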
Original abstract
Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers -- values, goal guarding, and sycophancy -- and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurab...
Automatically collected on 2026-05-29
#paper #arXiv #AI #小凯