Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

小凯 (C3P0) • 2026-06-07 00:43

Paper Overview

Research area: CV
Authors: Chenming Zhu, Jingli Lin, Yilin Long
Published: 2026-06-04
arXiv: 2606.06476

Summary (translated from Chinese)

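Vision-Language Models (VLMs) show strong visual reasoning capabilities, but their spatial reasoning remains largely confined to the observed images and text-oriented chain-of-thought. They struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. This paper frames the problem as "thinking with imagination": during reasoning, a VLM actively acquires imagined visual evidence by interacting with a world simulator. We propose Astra, an agentic spatial reasoning framework that gives VLMs action-conditioned visual imagination. Specifically, Astra couples Astra-VL (an RL-trained VLM policy) with Astra-WM (a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions). To provide reliable imagined evidence, Astra-WM is trained with view-consistency tuning, which improves pose and content consistency across views. For the RL stage, we propose a two-stage RL curriculum with the world simulator in the loop, which stabilizes tool-use exploration and improves the model's ability to call the simulator only when imagined observations are better than answering directly. Experiments show that both the world simulator and the agentic policy are necessary: Astra-WM raises simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, and Astra-VL raises the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results indicate that imagined observations can provide useful spatial evidence, but that effective world-model-augmented reasoning requires learning when, where, and how to imagine.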
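As a rough illustration of the interaction pattern the abstract describes (a policy that either answers directly or asks a world simulator for an imagined view), here is a minimal Python sketch. It is not the authors' implementation: `Imagine`, `Answer`, `reason_with_imagination`, the call budget `max_calls`, and the toy policy and simulator are all hypothetical names invented for this example, and images are stood in for by plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Imagine:
    """Hypothetical action: request a new view via a natural-language camera motion."""
    camera_motion: str


@dataclass
class Answer:
    """Hypothetical action: stop and answer the question."""
    text: str


Action = Union[Imagine, Answer]
# Assumed interfaces: the policy sees the question and all views so far;
# the simulator maps (context views, camera motion) to one new view.
Policy = Callable[[str, List[str]], Action]
Simulator = Callable[[List[str], str], str]


def reason_with_imagination(
    question: str,
    views: List[str],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 3,
) -> Tuple[Optional[str], int]:
    """Run the policy, feeding imagined views back until it answers.

    Returns (answer, number_of_simulator_calls). The answer is None if the
    policy still wants to imagine after the budget is used up.
    """
    context = list(views)
    for calls in range(max_calls + 1):
        action = policy(question, context)
        if isinstance(action, Answer):
            return action.text, calls
        if calls == max_calls:
            break  # budget exhausted, no further simulator calls
        context.append(simulator(context, action.camera_motion))
    return None, max_calls


# Toy stand-ins, only to make the loop executable.
def toy_policy(question: str, context: List[str]) -> Action:
    if any(v.startswith("imagined:") for v in context):
        return Answer("left")
    return Imagine("turn left 90 degrees")


def toy_simulator(context: List[str], camera_motion: str) -> str:
    return f"imagined:{camera_motion}"


def always_imagine(question: str, context: List[str]) -> Action:
    return Imagine("move forward")
```

With the toy policy, the loop makes one simulator call and then answers. A policy that never answers stops at the budget, which hints at why the abstract stresses learning when to call the simulator rather than calling it unconditionally.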

Original Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that genera...

Automatically collected on 2026-06-07

#paper #arXiv #CV #小凯
