DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Paper Overview

Subject areas: cs.AI, cs.CL, cs.MA | Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu | Published: 2026-05-21 | arXiv: 2505.01259

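Abstract (translated from the Chinese version)

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models from 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity@k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep over the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) at nearly equal mean quality, routing fidelity@1 ranges from 7.5% to 29.5% across conditions, and the delivery channel (on-demand tool vs. preloaded description) matters more than the description content; (iii) the counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, which locates large unrealized headroom for future orchestration methods. We release the substrate, the annotation layer, the reference intervention suite, the analysis pipeline, and 220 per-condition run archives.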

Original Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
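The abstract names routing fidelity-at-k and a counterfactual-delegation ceiling but does not define them. As a minimal sketch, assume that fidelity-at-k is the fraction of delegations whose chosen peer is among the k best-scoring peers for that task, and that the ceiling gap is the mean quality of an oracle router (always picking the best peer) minus the mean measured quality. The function names, the per-task score dictionaries, and the toy data below are all hypothetical illustrations and are not taken from the DecisionBench release.

```python
from typing import Dict, List


def routing_fidelity_at_k(
    chosen: List[str], peer_scores: List[Dict[str, float]], k: int = 1
) -> float:
    """Fraction of delegations whose chosen peer ranks in the top k by score.

    Assumed definition; the paper's exact formula is not given in the abstract.
    """
    if not chosen:
        return 0.0
    hits = 0
    for peer, scores in zip(chosen, peer_scores):
        # Rank peers by their (hypothetical) per-task quality, best first.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += peer in top_k
    return hits / len(chosen)


def counterfactual_headroom(
    measured: List[float], peer_scores: List[Dict[str, float]]
) -> float:
    """Mean oracle quality (best peer per task) minus mean measured quality."""
    oracle = [max(scores.values()) for scores in peer_scores]
    return sum(oracle) / len(oracle) - sum(measured) / len(measured)


# Toy data: four tasks, three hypothetical peers, no ties in the scores.
peer_scores = [
    {"model_a": 0.9, "model_b": 0.4, "model_c": 0.6},
    {"model_a": 0.2, "model_b": 0.8, "model_c": 0.5},
    {"model_a": 0.3, "model_b": 0.1, "model_c": 0.7},
    {"model_a": 0.6, "model_b": 0.9, "model_c": 0.4},
]
chosen = ["model_a", "model_c", "model_c", "model_a"]  # peer actually called
measured = [0.9, 0.5, 0.7, 0.6]  # quality obtained with the chosen peer

fidelity_1 = routing_fidelity_at_k(chosen, peer_scores, k=1)
fidelity_2 = routing_fidelity_at_k(chosen, peer_scores, k=2)
headroom = counterfactual_headroom(measured, peer_scores)
print(f"fidelity@1={fidelity_1:.2f} fidelity@2={fidelity_2:.2f} headroom={headroom:.2f}")
```

In this toy example the orchestrator picks the best peer on two of four tasks, so fidelity-at-1 is 0.5, while the oracle gap is about 0.15. The two numbers measure different things: fidelity counts how often routing was right, and the gap measures how much quality the wrong choices cost.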

--- *Automatically collected on 2026-05-21*

#Paper #arXiv #AI #小凯
