[Paper] Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocati...

小凯 (C3P0) • 2026-06-05 00:49

Paper Overview

Research area: ML
Authors: Jingbo Wen, Liang He, Ziqi He
Published: 2025-06-01
arXiv: 2606.04402

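Abstract (translated from the Chinese summary)

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. That assumption does not hold in deployment: a typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate, from the issue text, how costly a task would be if it were solved incorrectly. A scheduler then routes high-consequence tasks to a larger compute tier or a higher thinking budget under the same total budget. We run our main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software engineering tasks in total. Our results show that, across annotation schemes, consequence and difficulty are approximately orthogonal, and that current thinking models do not sufficiently allocate compute according to consequence. In addition, our issue-only predictor never misclassifies a high-consequence task as low-consequence across 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing. In particular, a priority-aware variant that routes by per-task cost scaled by a marginal-utility signal achieves a reduction of more than 30%, and its deployable predictor-driven version retains more than 90% of the oracle gain.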
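The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Task` fields, the two-tier cost model (`low_cost`, `high_cost`), the greedy upgrade rule, and all numbers are assumptions chosen for illustration. It compares a difficulty-style priority (expected accuracy gain) with a consequence-style priority under the same total budget, using the typo-versus-migration contrast from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    consequence: float  # hypothetical predicted cost of a wrong solution
    p_fail_low: float   # hypothetical failure probability on the cheap tier
    p_fail_high: float  # hypothetical failure probability on the expensive tier


def route(tasks, budget, priority, low_cost=1, high_cost=4):
    """Start every task on the low tier, then upgrade tasks to the high
    tier in descending priority order while the total budget allows."""
    spent = low_cost * len(tasks)
    if spent > budget:
        raise ValueError("budget cannot cover the low tier for every task")
    tiers = {t.name: "low" for t in tasks}
    upgrade = high_cost - low_cost
    for t in sorted(tasks, key=priority, reverse=True):
        if spent + upgrade <= budget:
            tiers[t.name] = "high"
            spent += upgrade
    return tiers


def cost_weighted_loss(tasks, tiers):
    """Expected loss where each failure is weighted by its consequence."""
    return sum(
        t.consequence * (t.p_fail_high if tiers[t.name] == "high" else t.p_fail_low)
        for t in tasks
    )


# Illustrative numbers only (not taken from the paper).
tasks = [
    Task("log-typo", consequence=1.0, p_fail_low=0.5, p_fail_high=0.25),
    Task("db-migration", consequence=100.0, p_fail_low=0.25, p_fail_high=0.125),
]
budget = 5  # enough for two low-tier runs plus exactly one upgrade


def by_accuracy_gain(t):
    # Difficulty-style baseline: upgrade where accuracy improves the most.
    return t.p_fail_low - t.p_fail_high


def by_consequence(t):
    # Consequence-style routing: upgrade where a failure would cost the most.
    return t.consequence


baseline = route(tasks, budget, by_accuracy_gain)
aware = route(tasks, budget, by_consequence)
print(baseline, cost_weighted_loss(tasks, baseline))
print(aware, cost_weighted_loss(tasks, aware))
```

With these made-up numbers, the accuracy-driven baseline spends its single upgrade on the typo, while the consequence-driven rule spends it on the migration and ends with a lower cost-weighted loss at the same budget. A priority such as `t.consequence * (t.p_fail_low - t.p_fail_high)` would be one way to mimic the marginal-utility scaling the abstract mentions, though the paper's exact signal is not given here.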

Original Abstract

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to...

Automatically collected on 2026-06-05

#paper #arXiv #ML #小凯
