InSight: Self-Guided Skill Acquisition via Steerable VLAs

Paper overview

Research area: ML
Authors: Maggie Wang, Lars Osterberg, Stephen Tian
Published: 2026-06-24
arXiv: 2506.14748

Abstract (translated from Chinese)

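Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are limited to the range of skills present in the training data. This paper presents InSight, a framework that enables autonomous skill acquisition by making VLAs steerable at the level of primitive actions (for example, "move the gripper to the bowl", "lift upward", "pour the bottle"). InSight has two main stages: (1) an automated segmentation pipeline that uses VLM plan decomposition and end-effector poses to split demonstrations into labeled primitives, which is what makes the VLA steerable by primitive; and (2) a VLM-guided data flywheel that identifies the primitives missing for a new task, autonomously attempts to demonstrate them using low-level control proposed by the VLM, and automatically labels and stores the successful demonstrations and integrates them into the VLA training set. The authors evaluate InSight on simulated and real-world manipulation tasks, including flipping a block, closing a drawer, sweeping, twisting, and pouring; none of these target skills requires human demonstrations. Once learned, the primitives can be composed to execute new long-horizon tasks without additional human demonstrations. The findings suggest that primitive-level steerability provides a practical foundation for continual skill acquisition in VLA policies.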
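The control flow of the second stage (the data flywheel) can be illustrated with a small sketch. This is not the authors' implementation: the names `Flywheel`, `Demo`, `propose`, and `succeeded` are hypothetical, the VLM proposer and success check are stand-in callables, trajectories are reduced to lists of strings, and the VLA retraining step is omitted. It only shows the loop described above: find the primitives a plan needs but the training set lacks, attempt each one, and keep the attempts that succeed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Demo:
    primitive: str      # language label, e.g. "lift upward"
    actions: List[str]  # placeholder for a low-level trajectory (assumption)


@dataclass
class Flywheel:
    # Maps a primitive label to the demonstrations stored for it.
    training_set: Dict[str, List[Demo]] = field(default_factory=dict)

    def missing(self, plan: List[str]) -> List[str]:
        """Primitives in the plan that have no demonstrations yet, in plan order."""
        out: List[str] = []
        for prim in plan:
            if prim not in self.training_set and prim not in out:
                out.append(prim)
        return out

    def acquire(
        self,
        plan: List[str],
        propose: Callable[[str, int], List[str]],  # stand-in for VLM-proposed control
        succeeded: Callable[[Demo], bool],         # stand-in for the success check
        max_attempts: int = 3,
    ) -> List[str]:
        """Attempt each missing primitive; store the first successful demo."""
        learned: List[str] = []
        for prim in self.missing(plan):
            for attempt in range(max_attempts):
                demo = Demo(prim, propose(prim, attempt))
                if succeeded(demo):
                    self.training_set.setdefault(prim, []).append(demo)
                    learned.append(prim)
                    break
        return learned
```

In use, a plan such as `["move gripper to the bowl", "lift upward", "pour the bottle"]` would be checked against the stored primitives, and only the ones without demonstrations would be attempted. Primitives that never succeed within `max_attempts` simply stay missing and can be retried later.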

Original abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automat...

--- *Automatically collected on 2026-06-25*

#paper #arXiv #ML #小凯
