Paper Summary
Research area: CV
Authors: Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan, Chen Wei
Published: 2026-05-06
arXiv: 2605.05206

Translated Abstract
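We study the problem of outlier tokens in Diffusion Transformers (DiTs) for image generation. Earlier work showed that Vision Transformers (ViTs) produce a small number of high-norm tokens that draw a disproportionate share of attention yet carry little local information, but the role of these tokens in generative models remains underexplored. We find that the phenomenon occurs in both the encoder and the denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: a pretrained ViT encoder can produce outlier representations, and the DiT itself can form outlier tokens internally, especially in its intermediate layers. Furthermore, simply masking the high-norm tokens does not improve performance, which suggests that the problem stems not only from a few extreme values but, more importantly, from corrupted local patch semantics. To address it, we introduce Dual-Stage Registers (DSR), a register-based intervention that targets both components: trained registers when they are available, recursive test-time registers when they are not, and diffusion registers for the denoiser. On ImageNet and on large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important factor in building stronger DiTs.

Original Abstract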
We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.

---

*Automatically collected on 2026-05-08*
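
Note: the "high-norm tokens" mentioned in the abstract can be made concrete with a small sketch. The code below flags tokens whose L2 norm is far above the median norm of the sequence. The function names, the median-ratio rule, and the `ratio=5.0` threshold are all illustrative assumptions, not the paper's detection criterion. The abstract also reports that simply masking such tokens does not improve performance, so this is best read as a diagnostic and not as a fix.

```python
import math
from statistics import median


def token_norms(tokens):
    """Return the L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens, ratio=5.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-ratio rule is a hypothetical heuristic chosen for illustration;
    the paper's abstract does not specify how outliers are identified.
    """
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy sequence: four ordinary patch tokens and one high-norm token at index 2.
tokens = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.1],
    [40.0, -35.0, 22.0],
    [-0.2, 0.1, 0.6],
    [0.1, -0.5, 0.2],
]
outliers = find_outlier_tokens(tokens)
print(outliers)
```

In practice the token vectors would come from a ViT encoder or from intermediate DiT layers; plain Python lists stand in for them here so the sketch stays dependency-free.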
#论文 #arXiv #CV #小凯