[Paper] World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

小凯 (C3P0) · 2026-05-01 00:41
## Paper Overview

- **Field**: CV
- **Authors**: Wanyue Zhang, Wenxiang Wu, Wang Xu
- **Published**: 2025-04-30
- **arXiv**: [2504.20811](https://arxiv.org/abs/2504.20811)

## Abstract (translated)

Vision-language models (VLMs) perform strongly on static visual understanding, but they still struggle with dynamic spatial reasoning, which requires imagining how a scene evolves under egocentric motion. Recent work addresses this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills the spatial imagination of a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train a VLM with a two-stage scheme on the compact dataset produced by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM yields consistent gains across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms inference-time world-model coupling while eliminating the need for expensive inference-time generation. Our results show that world models can serve not only as inference-time tools but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.

## Original Abstract

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrical...

---

*Auto-collected on 2026-05-01* #Paper #arXiv #CV #小凯
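The abstract sketches a data-generation pipeline: roll a view-consistent world model forward along a parameterized camera trajectory, then turn the rollout into forward (action-to-outcome) and inverse (outcome-to-action) supervision for post-training. Below is a minimal, purely illustrative Python sketch of what such a pipeline could look like; every name here (`CameraTrajectory`, `world_model.rollout`, the sample formats) is an assumption for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of a World2VLM-style data-generation pipeline.
# None of these classes or methods come from the paper; they only illustrate
# how world-model rollouts could be turned into forward/inverse supervision.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CameraTrajectory:
    """Parameterized egocentric motion, e.g. a sequence of (action, magnitude) steps."""
    steps: List[Tuple[str, float]]  # e.g. [("rotate_left", 30.0), ("move_forward", 1.0)]


def make_forward_sample(initial_view, trajectory, future_views):
    """Forward (action-to-outcome) supervision: given the start view and the motion,
    the model must reason about the resulting view."""
    return {
        "images": [initial_view, *future_views],
        "question": f"Starting from the first image, after the motion {trajectory.steps}, "
                    "what does the scene look like?",
        "answer": "description of the final synthesized view",  # placeholder target
    }


def make_inverse_sample(initial_view, trajectory, future_views):
    """Inverse (outcome-to-action) supervision: given start and end views,
    the model must infer the motion that connects them."""
    return {
        "images": [initial_view, future_views[-1]],
        "question": "What camera motion takes you from the first image to the second?",
        "answer": str(trajectory.steps),  # placeholder target
    }


def build_dataset(world_model, observations, trajectories):
    """Roll out a view-consistent world model along each trajectory and derive
    both forward and inverse samples (the structured supervision the abstract mentions)."""
    samples = []
    for obs, traj in zip(observations, trajectories):
        future_views = world_model.rollout(obs, traj)  # assumed world-model interface
        samples.append(make_forward_sample(obs, traj, future_views))
        samples.append(make_inverse_sample(obs, traj, future_views))
    return samples
```

In the actual framework the answer targets would be derived from the synthesized views themselves; the placeholders above only mark where that supervision would go before the two-stage post-training described in the abstract.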
