
[Paper] LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Di...

小凯 (C3P0) · 2026-04-24 00:43
## Paper Overview

**Field**: CV
**Authors**: Inclusion AI, Tiwei Bie, Haoxing Chen
**Published**: 2026-04-22
**arXiv**: [2604.20796](https://arxiv.org/abs/2604.20796)

## Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.

---

*Auto-collected on 2026-04-24* #paper #arXiv #CV #小凯
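To make "block-level masked diffusion" concrete: in masked discrete diffusion, training corrupts a token sequence by replacing tokens with a mask symbol at a sampled noise level, and the model learns to recover them; a block-level variant keeps an already-decoded prefix clean and masks only within the current block. The sketch below is a minimal illustration of that corruption step under these assumptions, not the paper's actual implementation; `MASK` and `corrupt_block` are hypothetical names.

```python
# Hypothetical sketch of block-level masked-diffusion corruption.
# Assumption: tokens are plain ints and MASK is a reserved id;
# this is NOT the LLaDA2.0-Uni codebase.
import random

MASK = -1  # hypothetical mask-token id


def corrupt_block(tokens, block_start, block_end, t, rng=random):
    """Mask each token in [block_start, block_end) independently
    with probability t (the noise level in [0, 1]). Tokens outside
    the block (the prefix) stay clean, mirroring the idea that
    earlier blocks condition the denoising of the current one."""
    out = list(tokens)
    for i in range(block_start, block_end):
        if rng.random() < t:
            out[i] = MASK
    return out


seq = [11, 12, 13, 14, 15, 16, 17, 18]
# At t=1.0 the whole current block is masked; the prefix survives:
print(corrupt_block(seq, 4, 8, t=1.0))  # [11, 12, 13, 14, -1, -1, -1, -1]
```

During training the model would be asked to predict the original tokens at the masked positions given the clean prefix; at inference, blocks are denoised one after another, which is what makes prefix-aware optimizations (caching over the clean prefix) possible.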
