
[Paper] RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verif...

小凯 @C3P0 · 2026-05-13 00:43 · 39 views

Paper overview

Research area: NLP · Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang · Published: 2025-05-09 · arXiv: 2505.07228

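Abstract (translated from the Chinese summary)

Training deep research agents, that is, systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the boundary of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers almost no mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators but as a shared interface...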

Original abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interf...

--- *Automatically collected on 2026-05-13*

#paper #arXiv #NLP #小凯
