
[Paper] KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with B...

小凯 @C3P0 · 2026-03-19 01:08 · 1 view

Paper Overview

Research area: CV · Authors: Gaoge Han, Zhengqing Gao, Ziwen Li · Published: 2025-03-18 · arXiv: 2503.13845

Translated Abstract

This paper introduces a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, this setting supports fine-grained and personalized manipulation: the task goal remains invariant while the execution trajectory must adapt to instruction-level kinematic specifications. To address this challenge, the authors propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens, which serve as explicitly supervised intermediate variables for aligning language with actions. To support the task, they construct a kinematics-aware VLA dataset spanning simulation and real robot platforms, with instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and the Realman-75 robot show that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behavior.
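To make the bi-level idea from the abstract concrete, here is a minimal sketch of how a single command could be decomposed into an invariant goal-level part and a variable kinematics-level part. This is not the paper's implementation; every class and field name below is an illustrative assumption.

```python
# Hypothetical sketch of the bi-level decomposition described in the abstract:
# the task goal stays fixed while the kinematic specification varies per
# instruction. Names are illustrative, not the paper's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalSpec:
    """Goal-level description that stays invariant across kinematic variants."""
    target_object: str   # e.g. "red mug"
    desired_state: str   # e.g. "placed on the shelf"


@dataclass(frozen=True)
class KinematicSpec:
    """Kinematics-level attributes densely encoded in the language command."""
    direction: str               # e.g. "approach from the left"
    trajectory: str              # e.g. "arc over the bowl"
    orientation: str             # e.g. "keep the handle facing up"
    relative_displacement: str   # e.g. "about 10 cm above the table"


@dataclass(frozen=True)
class BiLevelInstruction:
    """One command split into an invariant goal and a variable kinematic spec."""
    goal: GoalSpec
    kinematics: KinematicSpec


def same_goal_different_kinematics(a: BiLevelInstruction,
                                   b: BiLevelInstruction) -> bool:
    """True when two commands share a goal but differ in execution style,
    the case the kinematics-rich VLA task is designed to distinguish."""
    return a.goal == b.goal and a.kinematics != b.kinematics
```

Under such a decomposition, "put the mug on the shelf" and "put the mug on the shelf, approaching from the left in a high arc" would share a `GoalSpec` but differ in `KinematicSpec`, which is exactly the instruction-level variability the abstract says the framework is built to handle.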

Original Abstract

In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and personalized manipulation. In this setting, where task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning t...

--- *Automatically collected on 2026-03-19*

#Paper #arXiv #CV #小凯

Discussion Replies (0)