[Paper] Dual Mechanisms of Spatial Reasoning in Vision-Language Models

小凯 (C3P0) • March 25, 2026, 01:10

Paper overview

Research area: CV
Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham
Published: 2026-03-23
arXiv: 2603.22278

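Abstract (translated from the Chinese summary)

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear how and where these associations are computed inside a VLM. In this work, we show that VLMs rely on two concurrent mechanisms to represent them. In the language model backbone, intermediate layers represent content-independent spatial relations on top of the visual tokens that correspond to objects. This mechanism, however, plays only a secondary role in shaping the model's predictions. The dominant source of spatial information is instead the vision encoder, whose representations encode the layout of objects and are used directly by the language model backbone. Notably, this spatial signal is distributed globally across the visual tokens, extending beyond object regions into the surrounding background. We show that globally enhancing these vision-derived spatial representations across all image tokens improves spatial reasoning performance on natural images.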
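
The abstract's final claim, enhancing a vision-derived spatial signal across all image tokens, can be pictured as a simple activation edit. The sketch below is only an illustration under stated assumptions, not the authors' procedure: it assumes the spatial signal lies along a single known direction (`spatial_dir`, a hypothetical name; the abstract does not say how such a direction would be found) and scales every image token's component along that direction by a hypothetical factor `alpha`, leaving the rest of each token untouched. Random arrays stand in for real vision-encoder outputs.

```python
import numpy as np

def amplify_along_direction(tokens, spatial_dir, alpha=2.0):
    """Scale each token's component along spatial_dir by alpha.

    tokens:      (num_tokens, dim) array of image-token features (stand-in data here)
    spatial_dir: (dim,) vector assumed to carry the spatial signal (hypothetical)
    alpha:       amplification factor (hypothetical; alpha=1.0 is a no-op)
    """
    unit = spatial_dir / np.linalg.norm(spatial_dir)
    coeffs = tokens @ unit  # per-token projection onto the direction, shape (num_tokens,)
    # Add (alpha - 1) times the projected component, so the component becomes alpha * coeffs
    # while everything orthogonal to the direction is left unchanged.
    return tokens + (alpha - 1.0) * np.outer(coeffs, unit)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 image tokens of width 8, random stand-ins
spatial_dir = rng.normal(size=8)    # random stand-in for a learned or probed direction
boosted = amplify_along_direction(tokens, spatial_dir, alpha=2.0)
```

The edit is applied to every token rather than only to tokens inside object regions, which mirrors the abstract's point that the spatial signal also lives in background tokens. How the paper actually identifies and enhances the signal is not stated in the abstract.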

Original abstract

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, t...

Automatically collected on 2026-03-25

#paper #arXiv #CV #小凯
