[Paper] RepFusion: Leveraging Multimodal Priors for Denoising in Representatio...

Paper Overview

Research area: CV; Authors: Xichen Pan, Aashu Singh, Satya Narayan Shukla; Published: 2026-06-12; arXiv: 2606.14700

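Abstract (translated from the Chinese)

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, the authors repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. They present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that allocate comparable capacity to a newly initialized denoiser. These results suggest that MLLMs provide a strong prior for denoising visual representations, and that, by conditioning on the evolving noisy representation, test-time compute can be spent effectively on repeated MLLM conditioning in modern T2I systems.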
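The control flow the abstract describes (re-encoding the current noisy representation with the MLLM at every denoising step and feeding the result to a diffusion transformer as conditioning) can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: `mllm_encode`, `dit_velocity`, and `sample` are hypothetical names, both "models" are trivial stand-ins, and the straight-line noise-to-data path with Euler steps is an assumed sampler that the abstract does not specify.

```python
import random


def mllm_encode(prompt_emb, noisy_rep, t):
    # Hypothetical stand-in for the MLLM used as a noisy-representation encoder.
    # It reads the prompt and the *current* noisy representation. The timestep t
    # is accepted only to mirror a timestep-aware interface; this toy ignores it.
    residual = max(abs(p - x) for p, x in zip(prompt_emb, noisy_rep))
    # Toy "prior": pretend the clean representation equals the prompt embedding.
    return {"clean_guess": list(prompt_emb), "residual": residual}


def dit_velocity(noisy_rep, cond, t):
    # Hypothetical stand-in for the diffusion transformer.
    # On a straight path from noise (t=0) to data (t=1), the velocity that
    # reaches a clean target c from the current point x is (c - x) / (1 - t).
    return [(c - x) / (1.0 - t) for c, x in zip(cond["clean_guess"], noisy_rep)]


def sample(prompt_emb, steps=8, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in prompt_emb]  # start from pure noise
    residuals = []
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # The conditioning is recomputed from the evolving noisy representation
        # at every step, rather than encoded once from the text alone.
        cond = mllm_encode(prompt_emb, x, t)
        residuals.append(cond["residual"])
        v = dit_velocity(x, cond, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler update
    return x, residuals
```

Calling `sample([0.5, -1.0, 2.0])` runs the stand-in encoder once per step, so the cost of conditioning grows with the number of denoising steps. That is the trade-off the abstract refers to as spending test-time compute on repeated MLLM conditioning.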

Original Abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at...

--- *Automatically collected on 2026-06-16*

#Paper #arXiv #CV #小凯
