Paper Overview
Research area: CV Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li Published: 2026-03-23 arXiv: 2603.22280
Abstract (translated from the Chinese version)
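Vision-Language-Action (VLA) models map visual observations and language instructions directly to robot actions. Although effective on simple tasks, standard VLA models often perform poorly on complex multi-step tasks that require logical planning and on precise manipulation that requires fine-grained spatial perception. Recent work has introduced Chain-of-Thought (CoT) reasoning into VLA models, giving them a "think before acting" capability. However, current CoT-based VLA models face two key limitations: 1) because they rely on isolated, single-modal CoT, they cannot capture low-level visual details and high-level logical planning at the same time; 2) step-by-step autoregressive decoding causes high inference latency and compounding errors. To address these limitations, the authors propose DualCoT-VLA, a visual-linguistic CoT method with a parallel reasoning mechanism. For comprehensive multimodal reasoning, the method integrates a visual CoT for low-level spatial understanding with a linguistic CoT for high-level task planning. To overcome the latency bottleneck, it also introduces a parallel CoT mechanism built on two sets of learnable query tokens, which turns autoregressive reasoning into single-step forward reasoning.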
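The latency argument can be illustrated with a toy sketch. It is not the authors' implementation: `ToyBackbone`, `autoregressive_cot`, `parallel_cot`, and the placeholder token strings are all hypothetical names made up for this note, and the "model" only counts forward passes. It assumes that step-by-step decoding needs one forward pass per reasoning token, whereas appending two sets of query tokens lets both reasoning streams be read out of a single pass.

```python
# Toy sketch only: hypothetical names, not the DualCoT-VLA implementation.
from dataclasses import dataclass


@dataclass
class ToyBackbone:
    """Stand-in for a VLA backbone; it only counts forward passes."""
    forward_calls: int = 0

    def forward(self, tokens):
        # One call processes every position in `tokens` at once.
        self.forward_calls += 1
        return [f"h({t})" for t in tokens]


def autoregressive_cot(model, prompt, num_steps):
    """Step-by-step decoding: one forward pass per generated reasoning token."""
    tokens = list(prompt)
    for step in range(num_steps):
        last_hidden = model.forward(tokens)[-1]
        # A real model would sample the next token from `last_hidden`;
        # here a placeholder is appended, so each step depends on the previous one.
        tokens.append(f"cot{step}")
    return tokens[len(prompt):]


def parallel_cot(model, prompt, visual_queries, language_queries):
    """Query-based reasoning: both query sets are read out of a single pass."""
    tokens = list(prompt) + list(visual_queries) + list(language_queries)
    hidden = model.forward(tokens)
    start = len(prompt)
    split = start + len(visual_queries)
    # Slice the outputs at the query positions into the two reasoning streams.
    return hidden[start:split], hidden[split:]


if __name__ == "__main__":
    prompt = ["<image>", "pick", "up", "the", "cup"]

    ar_model = ToyBackbone()
    autoregressive_cot(ar_model, prompt, num_steps=8)

    par_model = ToyBackbone()
    parallel_cot(
        par_model,
        prompt,
        visual_queries=[f"<vq{i}>" for i in range(4)],
        language_queries=[f"<lq{i}>" for i in range(4)],
    )

    print(ar_model.forward_calls, par_model.forward_calls)  # prints "8 1"
```

In this toy setting the autoregressive path makes eight sequential calls and the query-based path makes one. How the query outputs are supervised and consumed by the action head is not described in the abstract and is left out here.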
Original Abstract
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a thinking before acting capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we p...
--- *Automatically collected on 2026-03-25*
#paper #arXiv #CV #小凯