DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Posted by 小凯 (C3P0), 2026-05-21 00:48

Paper Overview

Research areas: cs.AI, cs.CL, cs.MA
Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu
Published: 2026-05-21
arXiv: 2505.01259

Original Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
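The abstract names routing fidelity-at-k and a counterfactual-delegation ceiling but does not define them here. The sketch below shows one plausible reading, not the paper's actual definitions. It assumes fidelity-at-k is the share of delegated tasks whose chosen peer ranks among the k best peers by per-task quality, and that headroom is the gap between the mean best-peer quality and the mean measured quality. The function names, the per-peer score dictionaries, and the use of None to mark "no delegation" are all hypothetical.

```python
from typing import Dict, List, Optional


def routing_fidelity_at_k(
    chosen: List[Optional[str]],
    peer_quality: List[Dict[str, float]],
    k: int = 1,
) -> float:
    """Share of delegated tasks whose chosen peer is among the k best peers.

    Assumed definition, not taken from the paper. Tasks where the
    orchestrator did not delegate (choice is None) are skipped.
    """
    hits = 0
    delegated = 0
    for choice, scores in zip(chosen, peer_quality):
        if choice is None:
            continue
        delegated += 1
        # Rank peers by their (hypothetical) per-task quality, best first.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if choice in top_k:
            hits += 1
    return hits / delegated if delegated else 0.0


def counterfactual_headroom(
    measured: List[float],
    peer_quality: List[Dict[str, float]],
) -> float:
    """Mean best-peer quality minus mean measured quality (assumed definition)."""
    ceiling = sum(max(scores.values()) for scores in peer_quality) / len(peer_quality)
    return ceiling - sum(measured) / len(measured)


# Toy data: four task instances, three hypothetical peers "a", "b", "c".
peer_quality = [
    {"a": 1.0, "b": 0.0, "c": 0.5},
    {"a": 0.0, "b": 1.0, "c": 0.5},
    {"a": 0.2, "b": 0.4, "c": 0.9},
    {"a": 1.0, "b": 0.0, "c": 0.0},
]
chosen = ["a", "c", "c", None]      # last task was not delegated
measured = [1.0, 0.5, 0.9, 0.0]     # quality the orchestrator actually achieved

print(routing_fidelity_at_k(chosen, peer_quality, k=1))
print(routing_fidelity_at_k(chosen, peer_quality, k=2))
print(counterfactual_headroom(measured, peer_quality))
```

On this toy data the orchestrator picks the best peer on two of three delegated tasks, yet the fourth task, where it never delegated, still contributes to headroom. That mirrors the abstract's point that mean quality alone can hide routing behavior.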


Automatically collected on 2026-05-21

#paper #arXiv #AI #小凯
