## Paper Summary
**Research area**: CV
**Authors**: Haoyu Zhen, Zixian Gao, Qiao Sun
**Published**: 2025-04-08
**arXiv**: [2504.06262](https://arxiv.org/abs/2504.06262)
## Abstract (Translated)
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing methods typically rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. This paper presents Action Images, a unified world action model that formulates policy learning as multi-view video generation. Instead of encoding control as low-dimensional tokens, 7-DoF robot actions are translated into interpretable action images: multi-view action videos grounded in 2D pixels that explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, with no separate policy head or action module. In RLBench and real-world evaluations, the model achieves the strongest zero-shot success rates and surpasses prior video-space world models in joint video-action generation quality, indicating that interpretable action images are a promising path for policy learning.
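The core of the pixel-grounding idea can be illustrated with a minimal sketch (not the authors' implementation): project the 3D end-effector waypoints of an action into each camera view with a standard pinhole model, then rasterize the resulting 2D track onto a per-view "action image". The camera intrinsics `K`, the extrinsic matrix, and the image size below are hypothetical placeholders.

```python
import numpy as np

def project_point(p_world, world_to_cam, K):
    """Pinhole projection of a 3D world point to 2D pixel coordinates."""
    p_cam = (world_to_cam @ np.append(p_world, 1.0))[:3]  # world -> camera frame
    uvw = K @ (p_cam / p_cam[2])                          # perspective divide
    return uvw[:2]

def render_action_image(waypoints, world_to_cam, K, hw=(128, 128)):
    """Rasterize a gripper trajectory into a single-channel 'action image'."""
    img = np.zeros(hw, dtype=np.uint8)
    for p in waypoints:
        u, v = np.round(project_point(p, world_to_cam, K)).astype(int)
        if 0 <= v < hw[0] and 0 <= u < hw[1]:
            img[v, u] = 255  # mark the projected end-effector position
    return img

# Hypothetical setup: one camera at the world origin looking down +z.
K = np.array([[100.0,   0.0, 64.0],
              [  0.0, 100.0, 64.0],
              [  0.0,   0.0,  1.0]])
world_to_cam = np.eye(4)
traj = [np.array([0.0, 0.0, 1.0]), np.array([0.1, 0.0, 1.0])]
img = render_action_image(traj, world_to_cam, K)
```

Repeating this per camera yields the multi-view action frames; in the paper these are generated jointly with future observation frames by the video backbone, rather than rendered from known geometry as in this sketch.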
## Original Abstract
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion.
---
*Automatically collected on 2026-04-09*
#paper #arXiv #CV #小凯