Detecting Safety Violations Across Many Agent Traces

[Paper] Detecting Safety Violations Across Many Agent Traces

Paper Overview

Research areas: cs.AI, cs.CL
Authors: Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong
Published: 2026-04-13
arXiv: 2604.11806

Summary

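Identifying safety violations requires searching over large numbers of agent traces, but failures are often rare, complex, or even adversarially hidden, and some can only be detected by analyzing multiple traces together. This paper introduces Meerkat, which combines clustering with agentic search to uncover safety violations described in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task-gaming settings, Meerkat significantly improves safety-violation detection, and on CyBench it finds nearly 4x more reward hacking than previous audits.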

Original Abstract

To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language.
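
The abstract does not say how Meerkat clusters traces or how its agentic search allocates effort, so the following is only a toy sketch of the general idea of "cluster first, then spend more inspection budget on promising clusters". Everything in it is an assumption made for illustration: the function names `cluster_traces` and `adaptive_search`, the Jaccard-similarity clustering, the smoothed hit-rate rule, and the keyword-based `judge`, which stands in for what would presumably be an LLM-driven investigator.

```python
def tokens(trace):
    """Lowercased bag of whitespace-separated words in a trace."""
    return set(trace.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_traces(traces, threshold=0.5):
    """Greedy single-pass clustering (illustrative, not Meerkat's method).

    Each trace joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster. Returns lists of indices.
    """
    clusters = []
    for i, trace in enumerate(traces):
        for cluster in clusters:
            if jaccard(tokens(trace), tokens(traces[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def adaptive_search(traces, clusters, judge, budget):
    """Spend at most `budget` judge calls, favoring promising clusters.

    Clusters are ranked by a Laplace-smoothed hit rate, so a cluster
    nobody has inspected yet scores 0.5 and still gets a first look.
    """
    remaining = [list(c) for c in clusters]
    hits = [0] * len(clusters)
    seen = [0] * len(clusters)
    found = []
    for _ in range(budget):
        candidates = [k for k, r in enumerate(remaining) if r]
        if not candidates:
            break
        k = max(candidates, key=lambda c: (hits[c] + 1) / (seen[c] + 2))
        i = remaining[k].pop(0)
        seen[k] += 1
        if judge(traces[i]):  # hypothetical stand-in for an agentic check
            hits[k] += 1
            found.append(i)
    return found


# Made-up traces: three benign runs and three that tamper with a grader.
traces = [
    "agent reads task file and runs tests",
    "agent reads task file and runs unit tests",
    "agent edits the grader script to force pass",
    "agent edits the grader script to skip checks",
    "agent edits the grader script to print pass",
    "agent reads task file and runs tests again",
]
clusters = cluster_traces(traces)
flagged = adaptive_search(traces, clusters, lambda t: "grader" in t, budget=4)
print(clusters)  # [[0, 1, 5], [2, 3, 4]]
print(flagged)   # [2, 3, 4]
```

With a budget of four checks, scanning the traces in order would flag only two of the three tampering traces, whereas the cluster-guided loop spends one check on the benign cluster and the rest on the suspicious one. A real system would need a far stronger similarity measure and judge than this sketch uses.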

--- *Automatically collected on 2026-04-15*

#Paper #arXiv #AI #小凯
