[Paper] AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Obse...

小凯 (C3P0) • 2026-06-10 00:47

Paper Overview

Research area: CV
Authors: Jisong Cai, Long Ling, Shiwei Chu
Published: 2025-06-06
arXiv: 2506.04831

Abstract (translated from Chinese)

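World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains a rolling key-value memory of past observations and exposes reusable layer-wise latent context encoding long-horizon scene evolution, while a high-frequency action DiT queries this context through layer-wise joint attention to execute short action chunks in closed loop. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video Context Routing (OVCR), which together let the action expert exploit long-horizon world context while staying responsive to the real-time execution state, without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM reaches state-of-the-art performance without any robot-data pretraining, achieving a 92.80% average success rate on RoboTwin and a 78.3% success rate on 4 real-world tasks, while running closed-loop control at 24.17 Hz, 4.59× faster than Fast-WAM.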
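The asynchronous scheduling idea in the abstract (a slow world planner whose context is reused by a fast action loop) can be illustrated with a toy sketch. This is not the authors' implementation: the class names `SlowWorldPlanner` and `FastActionPolicy`, the `replan_every` parameter, and the scalar "observations" are all hypothetical stand-ins, and simple arithmetic replaces the DiT models, the key-value memory, and the joint attention.

```python
from collections import deque


class SlowWorldPlanner:
    """Hypothetical stand-in for the low-frequency video branch."""

    def __init__(self, memory_size=4):
        # Rolling memory of past observations (a toy analogue of the
        # rolling key-value memory; here it just stores raw numbers).
        self.memory = deque(maxlen=memory_size)
        self.calls = 0

    def plan(self, observation):
        self.memory.append(observation)
        self.calls += 1
        # Toy "context": the mean of the remembered observations.
        return sum(self.memory) / len(self.memory)


class FastActionPolicy:
    """Hypothetical stand-in for the high-frequency action branch."""

    def __init__(self):
        self.calls = 0

    def act(self, observation, context):
        self.calls += 1
        # Toy action: the offset from the current observation to the
        # cached context. The context is reused, not recomputed.
        return context - observation


def run_async_loop(observations, replan_every=4):
    """Run the planner every `replan_every` steps and the policy every step."""
    planner = SlowWorldPlanner()
    policy = FastActionPolicy()
    context = None
    actions = []
    for step, obs in enumerate(observations):
        if step % replan_every == 0:
            context = planner.plan(obs)  # slow path, runs rarely
        actions.append(policy.act(obs, context))  # fast path, runs every step
    return actions, planner.calls, policy.calls


if __name__ == "__main__":
    acts, n_plan, n_act = run_async_loop([float(i) for i in range(8)])
    print(acts, n_plan, n_act)
```

The point of the sketch is only the call pattern: the planner runs once per `replan_every` control steps, while the policy reads the fresh observation and the cached context on every step.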

Original Abstract

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiate...

Automatically collected on 2026-06-10

#paper #arXiv #CV #小凯
