Paper Summary
Research area: NLP | Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang | Published: 2025-05-09 | arXiv: 2505.07228
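Abstract (from the Chinese translation)
Training deep research agents, that is, systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the boundary of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers almost no mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not only as final-answer evaluators but as a shared interface...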
Original Abstract
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interf...
--- *Automatically collected on 2026-05-13*
#paper #arXiv #NLP #小凯