[Paper] RepWAM: World Action Modeling with Representation Visual-Action Tokeni...

Paper Overview

Research area: CV Authors: Junke Wang, Qihang Zhang, Shuai Yang Published: 2025-06-13 arXiv: 2506.10666

Abstract (translated from the Chinese summary)

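This work presents RepWAM, a representation-centric world action model (WAM) built on a representation visual-action tokenizer. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning the instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain the WAM to jointly model future visual states and the latent actions that connect them, conditioned on language instructions, and subsequently adapt it to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM performs strongly across a variety of manipulation settings, and ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.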
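
To make the data flow in the abstract more concrete, here is a toy sketch of how frames might be turned into visual tokens and latent action tokens, then serialized into one sequence for joint modeling. Everything in it is an assumption for illustration: the names `tokenize_pair` and `build_sequence`, the scalar quantizer, the mean-difference "latent action", and the interleaved layout are hypothetical and are not taken from the RepWAM paper or its code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # (kind, id); kind is "text", "vis", or "act"


@dataclass
class FramePairTokens:
    visual: List[int]    # toy visual tokens for the earlier frame
    latent_action: int   # toy latent action token linking the two frames


def quantize(x: float, bins: int = 8) -> int:
    """Map a value in [0, 1] to an integer id in [0, bins - 1]."""
    clipped = min(max(x, 0.0), 1.0)
    return min(int(clipped * bins), bins - 1)


def tokenize_pair(frame_t: Sequence[float], frame_next: Sequence[float],
                  bins: int = 8) -> FramePairTokens:
    """Hypothetical tokenizer: visual tokens plus one latent action token."""
    if len(frame_t) != len(frame_next) or not frame_t:
        raise ValueError("frames must be non-empty and of equal length")
    visual = [quantize(v, bins) for v in frame_t]
    # Assumption: the "latent action" is the mean change between frames,
    # rescaled from [-1, 1] to [0, 1] and quantized. A real tokenizer
    # would learn this from data.
    mean_delta = sum(b - a for a, b in zip(frame_t, frame_next)) / len(frame_t)
    return FramePairTokens(visual, quantize((mean_delta + 1.0) / 2.0, bins))


def build_sequence(instruction_ids: Sequence[int],
                   frames: Sequence[Sequence[float]],
                   bins: int = 8) -> List[Token]:
    """Serialize as: instruction, V_0, A_0, V_1, A_1, ..., V_T (assumed layout)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to define an action")
    seq: List[Token] = [("text", i) for i in instruction_ids]
    for frame_t, frame_next in zip(frames, frames[1:]):
        pair = tokenize_pair(frame_t, frame_next, bins)
        seq += [("vis", v) for v in pair.visual]
        seq.append(("act", pair.latent_action))
    # The final frame has no outgoing action, so only its visual tokens appear.
    seq += [("vis", quantize(v, bins)) for v in frames[-1]]
    return seq
```

A sequence model trained for next-token prediction on such a layout would, under these assumptions, predict both the next visual state and the latent action connecting states, given the instruction tokens at the front.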

Original Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under...

--- *Automatically collected on 2026-06-13*

#paper #arXiv #CV #小凯
