[Paper] APPO: Agentic Procedural Policy Optimization

Paper Overview

Research area: ML
Authors: Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu
Published: 2026-06-10
arXiv: 2606.12384

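Abstract (translated from the Chinese summary)

Recent advances in agentic reinforcement learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, which makes it difficult to identify which intermediate decisions influence downstream outcomes. This paper studies agentic RL from two perspectives: where to branch, and how to assign credit after branching. The pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, and that token entropy alone does not reliably reflect their impact on the final outcome. Motivated by these observations, the authors propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points within the sequence. APPO selects branching positions using a branching score that combines token uncertainty with the policy-induced likelihood gain of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool calling and behavioral interpretability.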
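
The abstract names the ingredients of the branching score (token uncertainty and a likelihood gain on later continuations) but does not give the formula. The sketch below is therefore only an illustration under stated assumptions: the product form `entropy * max(gain, 0)`, the function names, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branching_scores(entropies, likelihood_gains):
    """Hypothetical score: entropy weighted by the positive part of the gain.

    ASSUMPTION: the abstract does not state how the two signals are
    combined; the product used here is chosen only for illustration.
    """
    return [h * max(g, 0.0) for h, g in zip(entropies, likelihood_gains)]

def select_branch_points(scores, k):
    """Return the indices of the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: four positions, each with a made-up next-token distribution.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident token
    [0.25, 0.25, 0.25, 0.25],  # maximal entropy, but no downstream gain
    [0.40, 0.30, 0.20, 0.10],  # uncertain and influential
    [0.50, 0.50, 0.00, 0.00],  # moderately uncertain
]
gains = [0.8, 0.0, 0.9, 0.3]  # made-up likelihood gains, one per position

entropies = [token_entropy(d) for d in dists]
scores = branching_scores(entropies, gains)
branch_points = select_branch_points(scores, k=2)
print(branch_points)
```

In this toy setup, position 1 has the highest entropy but a zero gain, so its score is zero and it is not chosen; positions 2 and 3 are selected instead. That is the kind of "spurious high-entropy" filtering the abstract describes, though the real method may weight or normalize the two signals quite differently.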

Original Abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimiza...

--- *Automatically collected on 2026-06-12*

#Paper #arXiv #ML #小凯
