[Paper] SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

小凯 (C3P0) • 2026-06-14 00:40

Paper Overview

Research area: CV
Authors: Seokju Cho, Ryo Hachiuma, Abhishek Badki
Published: 2025-06-13
arXiv: 2506.10665

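Abstract (translated from Chinese)

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, but their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free spatial reasoning framework that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel preloaded with the input frames and a set of perception and geometry primitives, and lets a VLM-backed agent write one executable cell per step, conditioned on all prior outputs. This allows the agent to flexibly compose and manipulate perception results and to adapt its analysis to intermediate textual and visual observations and to the needs of each question. Evaluated on 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming recent spatial agents by 11.2 percentage points, with consistent gains across six VLM backbones from two model families and without any benchmark- or model-specific adaptation.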
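The "one executable cell per step, conditioned on all prior outputs" loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: `StatefulKernel`, `run_agent`, `scripted_policy`, and the preloaded `boxes_x` data are all hypothetical names invented for illustration, and a hard-coded policy function stands in for the VLM. The only point it shows is that variables defined in one cell persist into the next, so each step can build on what earlier steps produced.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: all cells share one namespace."""

    def __init__(self, preloaded=None):
        # Hypothetical preloaded state (frames, primitives) would live here.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return everything it printed, including errors."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception as exc:
                # Errors are returned as text so the agent can react to them.
                print(f"{type(exc).__name__}: {exc}")
        return buffer.getvalue()


def run_agent(kernel, policy, max_steps=5):
    """Ask the policy for one cell per step, given the full (cell, output) history."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:  # the policy decides it is done
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


def scripted_policy(history):
    """Hard-coded stand-in for a VLM; a real agent would generate these cells."""
    if not history:
        return (
            "centers = {name: sum(xs) / len(xs) for name, xs in boxes_x.items()}\n"
            "print(sorted(centers))"
        )
    if len(history) == 1 and "cup" in history[-1][1]:
        # Reuses `centers`, which the previous cell left in the kernel namespace.
        return "print('left' if centers['cup'] < centers['laptop'] else 'right')"
    return None


# Hypothetical horizontal box extents (x_min, x_max) for two detected objects.
kernel = StatefulKernel({"boxes_x": {"cup": [10, 30], "laptop": [50, 90]}})
history = run_agent(kernel, scripted_policy)
for cell, output in history:
    print(output, end="")
```

In a single-pass design, both cells would have to be written before the first output is seen; here the second cell is only chosen after the policy has read the first cell's output.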

Original Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both des...

Automatically collected on 2026-06-14

#Paper #arXiv #CV #小凯

