[Paper] Turning the TIDE: Cross-Architecture Distillation for Diffusion Large ...

小凯 (C3P0) • 2026-05-01 00:40

Paper overview

Research area: NLP
Authors: Gongbo Zhang, Wen Wang, Ye Tian
Published: 2025-04-30
arXiv: 2504.20715

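Abstract (translated from the Chinese summary)

Diffusion large language models (dLLMs) support parallel decoding and bidirectional context, but state-of-the-art dLLMs need billions of parameters to reach competitive performance. Existing dLLM distillation methods can reduce inference steps within a single architecture, but none of them address cross-architecture knowledge transfer, where the teacher and student differ in architecture, attention mechanism, and tokenizer. The authors propose TIDE, the first cross-architecture distillation framework for dLLMs, built from three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to adapt to the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context through complementary mask splitting to improve predictions under heavy masking; (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Using two heterogeneous pipelines, an 8B dense model and a 16B MoE model are distilled into a 0.6B student, which beats the baseline by an average of 1.53 points across eight benchmarks and reaches a HumanEval score of 48.78 on code generation, well above the AR baseline's 32.3.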
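To make the "complementary mask splitting" idea behind CompDemo more concrete, here is a minimal sketch of one plausible reading: the masked positions are split into two disjoint halves, and the teacher gets two inputs, each hiding one half while revealing the other. This is an assumption based only on the abstract, not the paper's actual procedure; the function names `complementary_mask_split` and `teacher_views`, the `[MASK]` placeholder, and the even random split are all hypothetical.

```python
import random


def complementary_mask_split(masked_positions, seed=0):
    """Split masked positions into two disjoint halves.

    Hypothetical helper: the even random split is an assumption,
    not something stated in the abstract.
    """
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


def teacher_views(tokens, masked_positions, mask_token="[MASK]", seed=0):
    """Build two teacher inputs from one heavily masked sample.

    Each view hides only one half of the masked positions, so the
    teacher sees the other half as extra context (assumed behavior).
    Returns a list of (view_tokens, hidden_positions) pairs.
    """
    part_a, part_b = complementary_mask_split(masked_positions, seed)
    views = []
    for hidden in (part_a, part_b):
        hidden_set = set(hidden)
        view = [mask_token if i in hidden_set else tok
                for i, tok in enumerate(tokens)]
        views.append((view, hidden))
    return views


if __name__ == "__main__":
    tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
    for view, hidden in teacher_views(tokens, masked_positions=[1, 3, 4, 6]):
        print(hidden, view)
```

Under this reading, the teacher's predictions at each view's hidden positions would be combined to cover the full original mask, with each prediction made under lighter masking than the student sees.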

Original abstract

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve pred...

Automatically collected on 2026-05-01

#Paper #arXiv #NLP #小凯
