Learning Action Priors for Cross-embodiment Robot Manipulation

Paper Overview

Research area: Robotics | Authors: Dong Jing, Tianqi Zhang, Jiaqi Liu | Published: 2026-06-25 | arXiv: 2606.19233

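Abstract (translated from Chinese)

Most vision-language-action (VLA) models are built by attaching an action module to a vision-language model (VLM) backbone and jointly optimizing the full policy. This design inherits the VLM's strong visual and linguistic priors but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to discover temporal action dynamics and cross-modal alignment at the same time, a challenge that is further amplified in cross-embodiment settings. This work proposes pretraining the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In the first stage, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure from unconditional action trajectories alone, without processing visual or language tokens. In the second stage, the learned prior is transferred to VLA training through decoder reuse and early latent distillation, aligning vision-language features with the action embedding space while still allowing end-to-end policy optimization. In addition, the trained encoder serves as a compact history compressor that summarizes the state-action history into a single temporal context token, enabling history-aware modeling at very low cost. Extensive experiments on 13 diverse cross-embodiment tasks across simulated and real platforms validate the effectiveness of our method. Compared with VLA training without action priors, our model converges faster, achieves higher success rates, and performs markedly better on data-scarce real-world tasks. Moreover, scaling up the first-stage action data yields a more generalizable action prior that directly improves downstream VLA performance.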

Original Abstract

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. I...
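
The abstract says the first stage trains a flow-matching-based action module on action trajectories alone. As a rough illustration of what a flow-matching training target looks like in general, here is a minimal sketch in plain Python. It is not the authors' implementation: the straight-line path between noise and data, the time convention (t = 0 is noise, t = 1 is data), and the helper names `flow_matching_sample` and `mse` are all assumptions made for this example, and the velocity-predicting network itself is omitted.

```python
import random


def flow_matching_sample(actions, rng):
    """Build one flow-matching training example from an action trajectory.

    actions: list of per-step action vectors (horizon x action_dim).
    Assumed convention (not from the paper): t = 0 is pure noise, t = 1 is data,
    and the path between them is a straight line.
    """
    t = rng.random()  # random time in [0, 1)
    noise = [[rng.gauss(0.0, 1.0) for _ in step] for step in actions]
    # Point on the straight line between the noise sample and the trajectory.
    x_t = [[(1.0 - t) * n + t * a for n, a in zip(n_step, a_step)]
           for n_step, a_step in zip(noise, actions)]
    # Regression target: the constant velocity along that line.
    target = [[a - n for n, a in zip(n_step, a_step)]
              for n_step, a_step in zip(noise, actions)]
    return x_t, t, target


def mse(pred, target):
    """Mean squared error between two horizon x action_dim nested lists."""
    diffs = [(p - q) ** 2
             for p_step, q_step in zip(pred, target)
             for p, q in zip(p_step, q_step)]
    return sum(diffs) / len(diffs)


# Toy 4-step, 2-dimensional trajectory (made-up numbers).
rng = random.Random(0)
actions = [[0.1 * k, -0.1 * k] for k in range(4)]
x_t, t, target = flow_matching_sample(actions, rng)
```

In a real training loop, a model would take `x_t` and `t` (plus, in the second stage, vision-language features) and be trained to minimize `mse(model_output, target)`. Because no images or text are needed to build this target, it is consistent with the abstract's claim that the first stage can run on action data alone.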

--- *Automatically collected on 2026-06-26*

#paper #arXiv #robotics #小凯
