[Paper] DOPD: Dual On-policy Distillation

小凯 (C3P0) • 2026-07-01 00:43

Paper Overview

Research area: Distillation
Authors: Xinlei Yu, Gen Li, Qingyi Si
Published: 2026-07-01
arXiv: 2507.00008

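Abstract (translated from Chinese)

On-policy distillation (OPD) offers superior capability transfer by supervising student-sampled trajectories with dense token-level signals. To provide high-quality supervision sources and raise the performance frontier of distillation, an intuitive direction is to inject privileged information into either the teacher or the student itself. However, this additional input introduces a potential failure mode that we call privilege illusion: a pattern that conflates the transferable capability gap the student is meant to close with the information asymmetry gap that can only be mimicked but never replicated. This problem is further amplified by the inherent non-uniformity of token-level supervision, in which only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between the privileged teacher and privileged student policies based on the advantage gap and relative probability. Each token receives supervision of a different strength, target, and strategy from either the teacher or the student itself, which transfers trustworthy capability while also receiving auxiliary signals to mitigate privilege illusion. Extensive experiments in large language model (LLM) and vision-language model (VLM) settings show that DOPD consistently outperforms vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.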
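To make the idea of token-level routing concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract only says that routing depends on an advantage gap and a relative probability, without giving the rule. The function names, the thresholds, the definition of relative probability as p_teacher / (p_teacher + p_self), and the "route to the teacher only when both signals favor it" rule are all assumptions made for illustration.

```python
import math


def relative_probability(teacher_logprob, self_logprob):
    """Assumed definition: p_teacher / (p_teacher + p_self) for the sampled token.

    Computed as a sigmoid of the log-probability difference.
    """
    return 1.0 / (1.0 + math.exp(self_logprob - teacher_logprob))


def route_tokens(teacher_logprobs, self_logprobs, advantage_gaps,
                 gap_threshold=0.0, prob_threshold=0.5):
    """Hypothetical router: pick a supervision source for each sampled token.

    A token is routed to the privileged teacher only when its advantage gap
    exceeds gap_threshold AND the teacher's relative probability reaches
    prob_threshold; otherwise it falls back to the privileged student ("self").
    Both thresholds and the rule itself are illustrative assumptions.
    """
    routes = []
    for lt, ls, gap in zip(teacher_logprobs, self_logprobs, advantage_gaps):
        rel = relative_probability(lt, ls)
        if gap > gap_threshold and rel >= prob_threshold:
            routes.append("teacher")
        else:
            routes.append("self")
    return routes


# Toy inputs: log-probabilities of three student-sampled tokens under each
# policy, plus made-up advantage gaps.
teacher_lp = [math.log(0.8), math.log(0.1), math.log(0.6)]
self_lp = [math.log(0.2), math.log(0.7), math.log(0.5)]
gaps = [1.0, 1.0, -0.5]
print(route_tokens(teacher_lp, self_lp, gaps))
```

In this toy setup only the first token goes to the teacher. The second has a low teacher relative probability and the third has a negative advantage gap, so both fall back to self-supervision. According to the abstract, the actual method also varies the strength, target, and strategy of supervision per token, which this sketch does not model.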

Original Abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose ...

Automatically collected on 2026-07-01

#paper #arXiv #distillation #小凯
