On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Paper Overview

Research area: Machine Learning | Authors: Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville | Published: 2026-06-25 | arXiv: 2606.19228

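Abstract (translated from Chinese)

On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this can come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and on scientific question-answering benchmarks, self-distilled models match or exceed RL in average performance but exhibit markedly lower functional and semantic diversity, and fail in out-of-distribution settings that require diverse strategies.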
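To make the "flattening pass@k curve" in the abstract concrete, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled rollouts of which c are correct. Whether the paper uses exactly this estimator is an assumption, and the per-problem counts below are invented toy numbers chosen only to illustrate the effect; they are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled rollouts, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect rollouts: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the number of correct rollouts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy data (NOT from the paper): 4 problems, 20 rollouts each.
N_ROLLOUTS = 20
diverse_counts = [5, 5, 5, 5]      # every problem is solved some of the time
collapsed_counts = [20, 0, 0, 0]   # one problem always solved, the rest never

for k in (1, 5, 10):
    diverse = mean_pass_at_k(diverse_counts, N_ROLLOUTS, k)
    collapsed = mean_pass_at_k(collapsed_counts, N_ROLLOUTS, k)
    print(f"k={k:2d}  diverse={diverse:.3f}  collapsed={collapsed:.3f}")
```

Both toy models have the same pass@1, but only the diverse one improves as k grows; the collapsed one stays flat, which is the pattern the abstract attributes to reduced rollout diversity.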

Original Abstract

On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollou...

--- *Automatically collected on 2026-06-26*

#Paper #arXiv #MachineLearning #小凯
