Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

小凯 (C3P0) • 2026-06-19 00:43

Paper Overview

Research area: NLP
Authors: Siyi Gu, Jialin Chen, Sophia Zhou
Published: 2026-06-19
arXiv: 2506.14973

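Chinese Abstract (English translation)

Post-training of reasoning language models is typically driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation usually relies on chain-of-thought annotations, which are expensive to obtain and may themselves be noisy, incomplete, or partly wrong; even when the final answer is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, for its part, usually compresses evaluative feedback into a scalar signal, which obscures which aspects of a response need to improve.

This paper proposes Rubric-Conditioned Self-Distillation, a framework that brings rubrics into on-policy self-distillation as structured, fine-grained feedback. The method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target.

Instead, rubrics specify what a strong response should satisfy, which allows finer-grained credit assignment over the reasoning process than scalar reward optimization. The authors instantiate the framework as a two-stage pipeline: first learn to generate task-specific rubrics, then train a rubric-guided reasoner. Evaluation on a diverse set of science reasoning benchmarks shows that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.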

Original Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
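To make "token-level guidance on the student's own sampled trajectories" concrete, here is a minimal sketch of one way such a signal could be computed. It is not the paper's implementation: the abstract does not state the loss, so the choice of a per-position KL divergence, its direction (teacher to student), the function names, and the toy logits below are all assumptions made for illustration. The idea shown is only that the same trajectory is scored twice, once by the student (no rubric in context) and once by the teacher (rubric in context), and the gap between the two next-token distributions is measured at every position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_kl(student_logits, teacher_logits):
    """Per-position KL(teacher || student) along one sampled trajectory.

    Both arguments are lists of logit rows, one row per token position.
    The KL direction is an assumption, not taken from the paper.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_t = softmax(t_row)
        p_s = softmax(s_row)
        kl = sum(
            pt * (math.log(pt) - math.log(ps))
            for pt, ps in zip(p_t, p_s)
            if pt > 0.0
        )
        losses.append(kl)
    return losses

# Hypothetical toy data: vocabulary of 3 tokens, trajectory of 2 positions.
# Student logits: the model scoring its own sample without a rubric.
student_logits = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.4]]
# Teacher logits: the same trajectory scored with a rubric in context
# (made-up numbers). Position 0 is unchanged, position 1 shifts.
teacher_logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 2.5]]

losses = token_level_kl(student_logits, teacher_logits)
for position, loss in enumerate(losses):
    print(f"position {position}: KL = {loss:.4f}")
```

In this toy case the divergence is zero where the rubric does not change the teacher's prediction and positive where it does, which is the sense in which a rubric-conditioned teacher can localize feedback to specific tokens rather than returning one scalar for the whole response.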

Automatically collected on 2026-06-19

#paper #arXiv #NLP #小凯
