## Paper Overview
**Research Area**: AI
**Authors**: Shilin Yan, Jintao Tong, Hongwei Xue
**Published**: 2025-04-10
**arXiv**: [2504.07082](https://arxiv.org/abs/2504.07082)
## Abstract
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum: it compels the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
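The abstract's core mechanism, conditional advantage estimation over two decoupled channels, can be sketched in a few lines. The following is a minimal illustration rather than the paper's actual implementation: the GRPO-style group normalization, the equal-weight combination of the two channels, and all function names here are assumptions for exposition.

```python
import statistics

def normalize(xs, eps=1e-6):
    """Group-normalize a list of scalars: (x - mean) / (std + eps)."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / (std + eps) for x in xs]

def hdpo_advantages(correct, tool_calls):
    """Decoupled advantages for one rollout group (illustrative sketch).

    correct:    list[bool], whether each trajectory answered correctly
    tool_calls: list[int], number of tool invocations per trajectory
    """
    # Accuracy channel: normalized over the whole group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: strictly conditional -- computed only among the
    # *correct* trajectories, so the tool penalty can never suppress tool
    # use that is needed for correctness, and it is never drowned out by
    # the accuracy reward's variance.
    eff_adv = [0.0] * len(correct)
    idx = [i for i, c in enumerate(correct) if c]
    if len(idx) >= 2:  # need at least two samples to normalize
        econ = normalize([-tool_calls[i] for i in idx])  # fewer calls = better
        for i, a in zip(idx, econ):
            eff_adv[i] = a

    # Combine the two orthogonal channels (equal weighting is an assumption).
    return [a + e for a, e in zip(acc_adv, eff_adv)]
```

With a group of four rollouts where two are correct, the correct-and-frugal trajectory is rewarded above the correct-but-tool-heavy one, while the two incorrect trajectories receive only the (negative) accuracy advantage, untouched by the efficiency channel.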
---
*Auto-collected on 2025-04-11*
#Paper #arXiv #AI #小凯