## Paper Summary
**Research Area**: NLP
**Authors**: Gongbo Zhang, Wen Wang, Ye Tian
**Published**: 2025-04-30
**arXiv**: [2504.20715](https://arxiv.org/abs/2504.20715)
## Chinese Abstract (translated)
Diffusion large language models (dLLMs) support parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters to reach competitive performance. While existing dLLM distillation methods can reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, where the teacher and student differ in architecture, attention mechanism, and tokenizer. We propose TIDE, the first cross-architecture distillation framework for dLLMs, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength over training progress and diffusion timestep to track the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; (3) Reverse CALM, a cross-tokenizer objective that reverses block-level likelihood matching, yielding bounded gradients and two-sided noise filtering. Distilling an 8B dense model and a 16B MoE model into a 0.6B student through two heterogeneous pipelines, TIDE outperforms baselines by 1.53 points on average across eight benchmarks and reaches 48.78 on HumanEval for code generation, far above the AR baseline's 32.3.
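As an illustration of how a noise- and progress-dependent distillation weight might look, here is a minimal Python sketch. The abstract does not give TIDAL's actual schedule, so the functional form, constants, and the name `distillation_weight` are assumptions, not the paper's method:

```python
import math

def distillation_weight(train_progress: float, timestep: float,
                        max_weight: float = 1.0) -> float:
    """Hypothetical TIDAL-style weight for the distillation loss term.

    train_progress: fraction of training completed, in [0, 1].
    timestep: diffusion noise level, in [0, 1] (1.0 = fully masked).
    Both the functional form and the constants are illustrative assumptions.
    """
    # Down-weight heavily noised inputs, where the teacher is assumed less reliable.
    noise_term = math.cos(0.5 * math.pi * timestep) ** 2
    # Ramp up over the first 10% of training, then anneal gently afterwards.
    warmup = min(train_progress / 0.1, 1.0)
    anneal = 1.0 - 0.5 * train_progress
    return max_weight * noise_term * warmup * anneal
```

Under these assumptions, the total loss would combine the ground-truth term with the weighted teacher term, e.g. `loss = ce_loss + distillation_weight(step / total_steps, t) * kd_loss`.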
## Original Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve pred...
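A minimal sketch of what complementary mask splitting could look like for a discrete-masking dLLM teacher; the mask-token id, function name, and even 50/50 split below are illustrative assumptions rather than details from the paper:

```python
import random
from typing import List, Tuple

MASK_ID = 0  # hypothetical mask token id


def complementary_splits(tokens: List[int], masked_pos: List[int],
                         seed: int = 0) -> Tuple[List[int], List[int]]:
    """Build two teacher inputs whose masked positions are complementary.

    `tokens` holds the clean reference token ids and `masked_pos` the
    positions hidden from the student. Each teacher copy keeps only half of
    those positions masked, so every position is predicted from a richer
    (partially revealed) context in exactly one of the two copies.
    """
    rng = random.Random(seed)
    shuffled = masked_pos[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    split_a, split_b = set(shuffled[:half]), set(shuffled[half:])

    def apply_mask(keep_masked: set) -> List[int]:
        # Mask only the chosen subset; all other positions stay as ground truth.
        return [MASK_ID if i in keep_masked else tok
                for i, tok in enumerate(tokens)]

    return apply_mask(split_a), apply_mask(split_b)
```

In such a setup, the teacher's predictions on the first copy would supervise the positions in `split_a` and those on the second copy the positions in `split_b`, so each masked position is predicted under a lighter mask than the student sees.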
---
*Automatically collected on 2026-05-01*
#paper #arXiv #NLP #小凯