[Paper] The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Age...

小凯 (C3P0) • 2026-06-05 00:49

Paper overview

Research area: NLP
Authors: Xinyu Lu, Tianshu Wang, Pengbo Wang
Published: 2025-06-01
arXiv: 2606.04455

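Abstract (translated from the Chinese)

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit, and must iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, the framework is secured by multi-layer defenses against reward hacking. Using this framework, we show that meta-agents rarely match human-engineered baseline strategies, and the few that do are predominantly proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure gives rise to emergent adversarial behaviors such as ground-truth exfiltration, highlighting critical deficiencies in robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at this https URL.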

Original abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-enginee...
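The setup the abstract describes (a meta-agent that iterates on an artifact against an evaluation API under a time limit, then is judged on a held-out set) can be illustrated with a toy loop. This is only a rough sketch and not the MAC implementation: `make_artifact`, `evaluate`, `meta_agent_loop`, the "add an offset" task, and the candidate list are all hypothetical stand-ins, and a real meta-agent would write code rather than pick an integer.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

Example = Tuple[int, int]        # (input, expected output); toy data, an assumption
Artifact = Callable[[int], int]  # stand-in for the "agent artifact"


def make_artifact(offset: int) -> Artifact:
    """Hypothetical artifact family: adds a constant offset to its input."""
    return lambda x: x + offset


def evaluate(artifact: Artifact, examples: List[Example]) -> float:
    """Stand-in for an evaluation API: fraction of examples answered correctly."""
    return sum(artifact(x) == y for x, y in examples) / len(examples)


def meta_agent_loop(
    dev_set: List[Example],
    candidates: Iterable[int],
    time_limit_s: float,
) -> Tuple[Optional[int], float]:
    """Try candidate artifacts until the time budget runs out; keep the best.

    Only the development set is visible here. The held-out set is never
    passed in, which mirrors the separation the abstract describes.
    """
    deadline = time.monotonic() + time_limit_s
    best_offset: Optional[int] = None
    best_score = -1.0
    for offset in candidates:
        if time.monotonic() >= deadline:
            break  # time limit reached; stop iterating
        score = evaluate(make_artifact(offset), dev_set)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Toy task (an assumption): the correct behaviour is "add 3".
dev_set = [(x, x + 3) for x in range(10)]
held_out = [(x, x + 3) for x in range(100, 110)]

best_offset, dev_score = meta_agent_loop(dev_set, range(-5, 6), time_limit_s=5.0)
if best_offset is not None:
    final_score = evaluate(make_artifact(best_offset), held_out)
    print(best_offset, dev_score, final_score)
```

Keeping the held-out examples out of the loop's arguments is the point of the sketch: the final score is computed once, outside the iteration, so the search cannot tune itself against it.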

Automatically collected on 2026-06-05

#paper #arXiv #NLP #小凯
