PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Paper Overview

Field: CV | Authors: Yifan Lu, Qi Wu, Jay Zhangjie Wu | Published: 2026-05-26 | arXiv: 2505.21443

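Abstract (translated from Chinese)

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, generate in a compact latent space and then use a decoder to map the generated latents back to pixel space. However, the latent-to-pixel decoder is reconstruction-oriented: it is optimized to invert the encoder rather than to synthesize additional detail, and it becomes increasingly expensive at megapixel scale. This shortcoming calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD (Pixel diffusion Decoder), which reformulates latent decoding as conditional pixel diffusion and unifies decoding and upsampling in a single generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upsampled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, allowing PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model with DMD2, reducing inference to only 4 steps. PiD works with both conventional VAE latents and the semantic latents used in recent RAE models (e.g., SigLIP, DINOv2). On a consumer RTX 5090, PiD decodes the latents of a 512×512 image to 2048×2048 pixels in under 1 second with 13 GB peak memory; on a GB200 GPU it takes as little as 210 ms, about 6× faster than a cascaded diffusion super-resolution pipeline, with better visual fidelity.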
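As a quick sanity check on the figures quoted above, the following minimal Python sketch works through the arithmetic. The helper names are hypothetical and exist only for illustration. The implied baseline latency is an estimate derived from the abstract's "about 6×" claim, not a number reported by the authors.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# All function names are hypothetical; only the input numbers
# (512, 2048, 210 ms, ~6x) come from the abstract itself.

def upscale_factor(src_side: int, dst_side: int) -> float:
    """Per-side upsampling factor between two square resolutions."""
    return dst_side / src_side


def pixel_ratio(src_side: int, dst_side: int) -> float:
    """How many times more pixels the output has than the source."""
    return (dst_side * dst_side) / (src_side * src_side)


def implied_baseline_ms(latency_ms: float, speedup: float) -> float:
    """Latency a baseline would need for the stated speedup to hold."""
    return latency_ms * speedup


side_factor = upscale_factor(512, 2048)        # per-side factor
area_factor = pixel_ratio(512, 2048)           # total pixel-count factor
output_megapixels = 2048 * 2048 / 1e6          # size of the decoded image
baseline_ms = implied_baseline_ms(210.0, 6.0)  # rough estimate only

print(f"per-side upsampling: {side_factor:.0f}x")
print(f"pixel-count increase: {area_factor:.0f}x")
print(f"output size: {output_megapixels:.2f} MP")
print(f"implied cascaded-pipeline latency: ~{baseline_ms:.0f} ms")
```

The 512×512 to 2048×2048 case corresponds to the 4× setting, which means 16 times as many output pixels. The 8× setting mentioned in the abstract would mean 64 times as many.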

Original Abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled imag...

--- *Automatically collected on 2026-05-26*

#Paper #arXiv #CV #小凯
