## Paper Summary
**Research Area**: NLP
**Authors**: Gongbo Zhang, Wen Wang, Ye Tian
**Published**: 2025-04-30
**arXiv**: [2504.20715](https://arxiv.org/abs/2504.20715)
## Chinese Abstract (translated)
Diffusion large language models (dLLMs) support parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters to reach competitive performance. While existing dLLM distillation methods can reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, where the teacher and student differ in architecture, attention mechanism, and tokenizer. We propose TIDE, the first cross-architecture distillation framework for dLLMs, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength over training progress and diffusion timestep to track the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; (3) Reverse CALM, a cross-tokenizer objective that reverses block-level likelihood matching, yielding bounded gradients and two-sided noise filtering. Distilling an 8B dense model and a 16B MoE model into a 0.6B student through two heterogeneous pipelines, TIDE outperforms baselines by 1.53 points on average across eight benchmarks and reaches 48.78 on HumanEval for code generation, far above the AR baseline's 32.3.
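As an illustration of how a noise- and progress-dependent distillation weight might look, here is a minimal Python sketch. The abstract does not give TIDAL's actual schedule, so the functional form, constants, and the name `distillation_weight` are assumptions, not the paper's method:

```python
import math

def distillation_weight(train_progress: float, timestep: float,
                        max_weight: float = 1.0) -> float:
    """Hypothetical TIDAL-style weight for the distillation loss term.

    train_progress: fraction of training completed, in [0, 1].
    timestep: diffusion noise level, in [0, 1] (1.0 = fully masked).
    Both the functional form and the constants are illustrative assumptions.
    """
    # Down-weight heavily noised inputs, where the teacher is assumed less reliable.
    noise_term = math.cos(0.5 * math.pi * timestep) ** 2
    # Ramp up over the first 10% of training, then anneal gently afterwards.
    warmup = min(train_progress / 0.1, 1.0)
    anneal = 1.0 - 0.5 * train_progress
    return max_weight * noise_term * warmup * anneal
```

Under these assumptions, the total loss would combine the ground-truth term with the weighted teacher term, e.g. `loss = ce_loss + distillation_weight(step / total_steps, t) * kd_loss`.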
## Original Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve pred...
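A minimal sketch of what complementary mask splitting could look like for a discrete-masking dLLM teacher; the mask-token id, function name, and even 50/50 split below are illustrative assumptions rather than details from the paper:

```python
import random
from typing import List, Tuple

MASK_ID = 0  # hypothetical mask token id


def complementary_splits(tokens: List[int], masked_pos: List[int],
                         seed: int = 0) -> Tuple[List[int], List[int]]:
    """Build two teacher inputs whose masked positions are complementary.

    `tokens` holds the clean reference token ids and `masked_pos` the
    positions hidden from the student. Each teacher copy keeps only half of
    those positions masked, so every position is predicted from a richer
    (partially revealed) context in exactly one of the two copies.
    """
    rng = random.Random(seed)
    shuffled = masked_pos[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    split_a, split_b = set(shuffled[:half]), set(shuffled[half:])

    def apply_mask(keep_masked: set) -> List[int]:
        # Mask only the chosen subset; all other positions stay as ground truth.
        return [MASK_ID if i in keep_masked else tok
                for i, tok in enumerate(tokens)]

    return apply_mask(split_a), apply_mask(split_b)
```

In such a setup, the teacher's predictions on the first copy would supervise the positions in `split_a` and those on the second copy the positions in `split_b`, so each masked position is predicted under a lighter mask than the student sees.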
---
*Automatically collected on 2026-05-01*
#paper #arXiv #NLP #小凯