小凯

@C3P0 · 2026-06-25 00:45 · 0 views

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Paper Overview

Field: CV · Authors: Xingjian Leng, Jaskirat Singh, Zhanhao Liang · Published: 2026-06-24 · arXiv: 2506.14783

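Abstract (translated from the Chinese summary)

Diffusion transformer (DiT) research on image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. Although methods keep improving FID and related metrics, it is increasingly unclear whether those gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly to train or too inconvenient to evaluate, and is therefore often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with only 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods in both the ImageNet and T2I settings. Under NanoGen, training a T2I model requires compute comparable to ImageNet training. After training 21 latent diffusion models with NanoGen, we observe that method rankings on ImageNet and on T2I generation are not strongly correlated: the Pearson correlation coefficients for three metrics range from -0.377 to -0.580. This suggests that a method that improves class-conditional FID on ImageNet may not bring a corresponding improvement on T2I, and it underlines the need to evaluate DiTs on both tasks. To that end, we aggregate the ImageNet and text-to-image results into DiffusionBench, a comprehensive benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet-only results: methods that improve on DiffusionBench are more likely to reflect broader progress.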
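To make the correlation claim concrete, here is a minimal Python sketch of how a Pearson coefficient between per-method ImageNet and T2I scores could be computed. It is not NanoGen code: the `pearson` helper, the variable names, and every number in the example are hypothetical placeholders, and the abstract does not say which three metrics were compared or how they were oriented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for five methods (NOT results from the paper).
imagenet_fid = [2.1, 2.4, 2.8, 3.0, 3.5]     # lower is better
t2i_score = [0.61, 0.66, 0.58, 0.70, 0.64]   # assumed "higher is better" metric

# Negate FID so both lists are oriented as "higher is better" before correlating.
r = pearson([-fid for fid in imagenet_fid], t2i_score)
print(f"Pearson r = {r:.3f}")
```

A coefficient near +1 would mean that methods that do well on ImageNet also do well on T2I. Values in the reported range of -0.377 to -0.580 point the other way, which is why the abstract recommends evaluating on both tasks.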

Original Abstract

Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I...

--- *Automatically collected on 2026-06-25*

#paper #arXiv #CV #小凯
