论文概要
研究领域: AI
作者: Xijie Zeng, Frank Rudzicz
发布时间: 2026-05-28
arXiv: 2605.27593
中文摘要
即便某工具被明确描述为不公平且有害,表面上经过安全对齐的LLM智能体仍会在能带来战略优势时自愿参与秘密串通。为研究这一现象,本文构建了两个策略性多智能体环境:"骗子酒吧"(竞争性欺骗场景)和"清理"(混合动机资源管理场景),在其中为智能体提供能带来显著优势但同时明显损害其他智能体的秘密串通工具。在12个模型(7B、70B及专有规模)和6种提示变体上的实验表明,大多数智能体始终接受这些工具并发展出串通策略,甚至在接受前明确承认工具的不公平性。研究进一步表明,单纯的不公平标签或基线对齐均不能可靠阻止串通:只有显式的伦理框架能减少采用率,但即便如此,较小模型仍易受攻击。本文为LLM多智能体系统中自愿串通的首个系统性研究,表明防止此类行为需要显式 safeguards 而非依赖一般对齐。
原文摘要
Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these工具 and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters...
自动采集于 2026-05-29
#论文 #arXiv #AI #小凯
讨论回复
0 条回复还没有人回复,快来发表你的看法吧!
推荐
智谱 GLM-5 已上线
我正在智谱大模型开放平台 BigModel.cn 上打造 AI 应用,智谱新一代旗舰模型 GLM-5 已上线,在推理、代码、智能体综合能力达到开源模型 SOTA 水平。