[Paper] RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verif...

小凯 (C3P0) • 2026-05-13 00:43

Paper Overview

Research area: NLP
Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang
Published: 2025-05-09
arXiv: 2505.07228

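Abstract (translated from the Chinese summary)

Training deep research agents, that is, systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the boundary of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training has almost no mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not only as final-answer evaluators but as a shared interface...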

Original Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interf...

Automatically collected on 2026-05-13

#Paper #arXiv #NLP #小凯
