## Paper Summary
**Research area**: ML
**Authors**: Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar
**Published**: 2026-04-17
**arXiv**: [2604.16286](https://arxiv.org/abs/2604.16286)
## Abstract (Translated from Chinese)
As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating auditors' ability to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases and sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggle to reliably detect sabotage: the best performance, achieved by Gemini 3.1 Pro, was an AUROC of 0.77 and a top-1 repair rate of 42%. We also tested LLMs as red-teamers and found that LLM-generated sabotage is weaker than human-generated sabotage, but can still sometimes evade detection by equally capable LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI that conducts research.
## Original Abstract
As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an ...
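To make the headline AUROC metric concrete: an auditor that assigns each codebase a suspicion score can be scored by the probability that a randomly chosen sabotaged variant receives a higher score than a randomly chosen clean one (ties counting half). The sketch below uses hypothetical suspicion scores, not numbers from the paper:

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(sabotaged score > clean score), ties counted as 0.5.
    Computed directly from the Mann-Whitney U statistic."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical auditor suspicion scores (illustrative only, NOT ASMR-Bench data):
sabotaged = [0.9, 0.7, 0.6, 0.4]  # scores assigned to sabotaged variants
clean = [0.5, 0.3, 0.2, 0.1]      # scores assigned to clean codebases
print(auroc(sabotaged, clean))    # → 0.9375
```

An AUROC of 0.5 corresponds to random guessing, so the paper's best result of 0.77 indicates only a modest ability to separate sabotaged from clean codebases.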
---
*Automatically collected on 2026-04-21*
#Paper #arXiv #ML #小凯