[Paper] iMaC: Translating Actions into Motion and Contact Images for Embodied ...

小凯 (C3P0) • 2026-06-10 00:47

Paper Overview

Research area: CV
Authors: Zhenyu Wu, Xiuwei Xu, Yukun Zhou
Published: 2025-06-06
arXiv: 2506.04834

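Abstract (translated from the Chinese summary)

Embodied world models have become a key paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across different embodiments, and unnatural dynamic modeling of complex physical interactions. To address these limitations, this paper proposes iMaC (Image as Action Control), a novel unified control paradigm that treats raw visual images as the native action representation for embodied world models. iMaC formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intent, geometric interaction constraints, and subtle physical dynamics. We build a dual-branch embodied architecture consisting of an image action encoder and a dynamic world predictor: the encoder compresses goal-driven visual images into compact action embeddings, and the predictor learns environment transition rules conditioned on image actions, enabling high-fidelity future state prediction and closed-loop embodied control. Extensive experiments on public embodied manipulation benchmarks and in real-world robot scenarios show that iMaC outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization. In addition, our image action design removes the dependence on manually defined action spaces, enabling flexible, general-purpose control of heterogeneous embodied agents.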
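To make the dual-branch idea concrete, here is a toy sketch of the data flow the abstract describes: an encoder turns an action image into a compact embedding, and a predictor maps the current state plus that embedding to the next state. This is not the authors' implementation. The class names `ImageActionEncoder` and `WorldPredictor`, the random linear layers, the residual update, and all dimensions are assumptions chosen purely for illustration; the paper's actual networks, tokenization, and training procedure are not given in this summary.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows of n_in weights (untrained placeholder)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ImageActionEncoder:
    """Hypothetical stand-in: flattens an action image and projects it to an embedding."""

    def __init__(self, height, width, embed_dim, rng):
        self.weights = make_linear(height * width, embed_dim, rng)

    def encode(self, action_image):
        flat = [pixel for row in action_image for pixel in row]
        return apply_linear(self.weights, flat)


class WorldPredictor:
    """Hypothetical stand-in: predicts the next state from state and action embedding."""

    def __init__(self, state_dim, embed_dim, rng):
        self.weights = make_linear(state_dim + embed_dim, state_dim, rng)

    def predict(self, state, action_embedding):
        # Assumed residual update: next_state = state + f([state, embedding]).
        delta = apply_linear(self.weights, state + action_embedding)
        return [s + d for s, d in zip(state, delta)]


def rollout(encoder, predictor, state, action_images):
    """Apply one image-conditioned transition per action image; return all states."""
    states = [state]
    for image in action_images:
        state = predictor.predict(state, encoder.encode(image))
        states.append(state)
    return states


# Toy usage: 4x4 grayscale action images, 3-dim embeddings, 5-dim state.
rng = random.Random(0)
encoder = ImageActionEncoder(height=4, width=4, embed_dim=3, rng=rng)
predictor = WorldPredictor(state_dim=5, embed_dim=3, rng=rng)
images = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(2)]
states = rollout(encoder, predictor, [0.0] * 5, images)
print(len(states), len(states[-1]))
```

The point of the sketch is only the interface: the action enters the predictor as an embedding derived from an image rather than as a hand-defined vector of joint angles or end-effector poses, so the same predictor signature could in principle serve different embodiments.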

Original Abstract

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMaC (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMaC formulates continuous visual manipulation as image-based action tokens, which inherently encapsul...

Automatically collected on 2026-06-10

#Paper #arXiv #CV #小凯

