[Paper] Modality Forcing for Scalable Spatial Generation

Paper Overview

Field: CV
Authors: Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park
Published: 2026-06-11
arXiv: 2606.13676

Abstract (Chinese summary, translated)

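Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior work adapts T2I models to exploit this prior for depth prediction, but requires dense depth data and involves complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning a separate noise level to each modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a family of T2I models from scratch (from 370 million to 3.3 billion parameters), we find that larger models trained on more image data yield more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generation models. These results provide strong evidence for image generation as a scalable pre-training objective for spatial perception.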
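
For readers unfamiliar with the metric, AbsRel (absolute relative error) is the mean of |predicted depth - true depth| / true depth over pixels that have a ground-truth value. The sketch below computes it on sparse depth by skipping pixels without a measurement. The function name `abs_rel`, the flat-list layout, and the convention that 0.0 marks a missing measurement are assumptions made for illustration, not details taken from the paper.

```python
from typing import Optional, Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> Optional[float]:
    """Mean of |pred - gt| / gt over pixels with valid ground truth.

    Assumed convention: gt <= 0 marks a pixel with no depth measurement,
    so those pixels are skipped. Returns None if no pixel is valid.
    """
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        return None
    return sum(errors) / len(errors)


# Toy example: only the first and third pixels have ground-truth depth.
gt = [2.0, 0.0, 4.0, 0.0]
pred = [2.2, 5.0, 3.0, 1.0]
score = abs_rel(pred, gt)  # averages 0.2 / 2.0 and 1.0 / 4.0
```

Under this definition, a 57% reduction means the new AbsRel is 0.43 times the AbsRel of the model it is compared against.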

Original Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I...
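
The idea of assigning a separate noise level to each modality can be illustrated with a small sketch. The abstract does not specify the noise schedule, the sampling distribution, or how conditioning is implemented, so everything below is an assumption: noise levels lie in [0, 1], a level of 0 means the modality is given clean (it acts as the condition), and noising is a plain linear interpolation. The names `sample_noise_levels` and `add_noise` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def sample_noise_levels(mode: str, rng: random.Random) -> Dict[str, float]:
    """Return one noise level per modality (0 = clean, 1 = pure noise).

    Assumed modes, for illustration only:
      joint          - both modalities are noised independently
      image_to_depth - image is clean (the condition), depth is noised
      depth_to_image - depth is clean (the condition), image is noised
    """
    if mode == "joint":
        return {"image": rng.random(), "depth": rng.random()}
    if mode == "image_to_depth":
        return {"image": 0.0, "depth": rng.random()}
    if mode == "depth_to_image":
        return {"image": rng.random(), "depth": 0.0}
    raise ValueError(f"unknown mode: {mode}")


def add_noise(clean: Sequence[float], noise: Sequence[float], level: float) -> List[float]:
    """Linearly interpolate between clean values and noise (assumed schedule)."""
    return [(1.0 - level) * c + level * n for c, n in zip(clean, noise)]


rng = random.Random(0)
levels = sample_noise_levels("image_to_depth", rng)
image_in = add_noise([1.0, 2.0], [9.0, 9.0], levels["image"])  # unchanged: level is 0
depth_in = add_noise([3.0, 4.0], [9.0, 9.0], levels["depth"])  # partially noised
```

In this sketch a single model would receive both noised inputs together with their two noise levels, so depth prediction, depth-conditioned image generation, and joint generation differ only in which levels are set to zero.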

--- *Automatically collected on 2026-06-15*

#paper #arXiv #CV #小凯
