[Paper] RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verif...

Paper Overview

Research area: NLP | Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang | Published: 2025-05-09 | arXiv: 2505.07228

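Abstract (translated from the Chinese summary)

Training deep research agents, that is, systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the boundary of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers almost no mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators but as a shared interface...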

Original Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interf...

--- *Automatically collected on 2026-05-13*

#paper #arXiv #NLP #小凯
