[Paper] Modality Forcing for Scalable Spatial Generation

Paper Overview

Field: CV | Authors: Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski | Published: 2025-06-13 | arXiv: 2506.10667

Abstract (translated)

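Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe that uses a single DiT trained on sparse depth data to perform joint image-depth generation. By assigning a separate noise level to each modality, Modality Forcing enables conditional and joint generation of image and depth in any permutation. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a family of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generation models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception.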
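The abstract does not spell out the training recipe, so the following is only a minimal sketch of the two ideas it names: one noise level per modality, and evaluation on sparse depth with AbsRel. The linear noising schedule, the uniform sampling of noise levels, and every function name here (`add_noise`, `training_inputs`, `initial_noise_levels`, `masked_abs_rel`) are illustrative assumptions, not the authors' implementation. AbsRel itself is the standard metric, the mean of |pred - gt| / gt over pixels with valid ground truth.

```python
import random

def add_noise(clean, noise, t):
    """Blend clean values with noise; t=0 keeps the input, t=1 is pure noise.
    ASSUMPTION: a linear schedule, chosen only for illustration."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]

def training_inputs(image, depth, rng):
    """Noise each modality with its OWN independently sampled level.
    ASSUMPTION: levels are drawn uniformly from [0, 1)."""
    t_img = rng.random()
    t_depth = rng.random()
    noisy_img = add_noise(image, [rng.gauss(0.0, 1.0) for _ in image], t_img)
    noisy_depth = add_noise(depth, [rng.gauss(0.0, 1.0) for _ in depth], t_depth)
    return noisy_img, noisy_depth, t_img, t_depth

def initial_noise_levels(task):
    """Starting (image, depth) noise levels at sampling time.
    A modality held at level 0.0 is clean and acts as the condition."""
    levels = {
        "joint": (1.0, 1.0),            # generate image and depth together
        "image_to_depth": (0.0, 1.0),   # depth prediction from a clean image
        "depth_to_image": (1.0, 0.0),   # image generation from clean depth
    }
    return levels[task]

def masked_abs_rel(pred, target, valid):
    """AbsRel over valid pixels only, so sparse ground truth is usable."""
    errs = [abs(p - t) / t for p, t, v in zip(pred, target, valid) if v]
    return sum(errs) / len(errs)
```

Under these assumptions, depth prediction is just the `"image_to_depth"` setting: the image stays clean while only the depth channel is denoised, which is how a single model can cover every conditioning direction.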

Original Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I...

--- *Automatically collected on 2026-06-14*

#paper #arXiv #CV #小凯
