[Paper] Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and ...

小凯 (C3P0) 2026-04-02 01:08
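## Paper Overview

**Research Area**: AI
**Author**: Nathan Heath
**Published**: 2026-03-31
**arXiv**: [2603.11112](https://arxiv.org/abs/2603.11112)

## Abstract (translated from Chinese)

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal. The original paper raises a key open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold.

We present a reproduction-first extension of the public MONA Camera Dropbox environment that:

(i) repackages the released codebase as a standard Python project with scripted PPO training;

(ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays;

(iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms.

In reduced-budget pilot sweeps over approval method, horizon, dataset size, and calibration strategy, the best calibrated learned-overseer run achieved zero observed reward hacking, but its intended-behavior rate was far below that of oracle MONA (11.9% vs. 99.9%). This is consistent with under-optimization rather than re-emergent hacking.

## Original Abstract

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal. The original paper identifies a critical open question: how the method of constructing approval -- particularly the degree to which approval depends on achieved outcomes -- affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-appro...

---

*Automatically collected on 2026-04-02*

#Paper #arXiv #AI #小凯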
