[Paper] ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perce...

Paper Overview

Research area: NLP · Authors: Yining Hong, Jiageng Liu, Han Yin · Published: 2026-05-19 · arXiv: 2505.14305

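Abstract (translated from the Chinese summary)

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations and reason about how those observations change with their actions. Rather than passively processing what they see, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality, none of which can be resolved by passive sensing alone. We move beyond prior definitions of spatial intelligence that assume oracle observations and recast the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence that spans 10 task categories and 29 subcategories, is built on OmniGibson, and is grounded in Spelke's core knowledge systems. Agents must decide which abilities to deploy (perception, locomotion, and manipulation) and how to sequence them to actively accumulate task-relevant evidence. In extensive experiments on state-of-the-art MLLMs, we find that active exploration significantly outperforms its passive counterparts, that agents spontaneously discover emergent spatial strategies without explicit instruction, and that random multi-view observation, despite consuming more images, often adds noise rather than signal. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn trigger cascading errors. Although explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representations are more harmful than a 2D baseline because they distort spatial relations. A human study further shows that, whereas humans seek out falsifying viewpoints and revise their beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.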

Original Abstract

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence t...

--- *Automatically collected on 2026-05-20*

#Paper #arXiv #NLP #小凯
