## Paper Summary
**Research Area**: AI/CV
**Authors**: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski
**Published**: 2026-04-02
**arXiv**: [2604.02330](https://arxiv.org/abs/2604.02330)
## Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
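The abstract's core mechanism — per-subject state tokens jointly attended with video latents under a spatial bias — can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's actual implementation: the function name `spatially_biased_attention`, the Gaussian-distance form of the bias, and the identity Q/K/V projections are all hypothetical stand-ins (the real model would use learned projections inside a diffusion transformer).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatially_biased_attention(video_latents, state_tokens, subject_pos,
                               grid_hw, sigma=2.0):
    """One joint attention pass over [video patches; subject state tokens].

    video_latents: (P, d) flattened video patch latents, P = H*W
    state_tokens:  (S, d) one persistent latent per subject
    subject_pos:   (S, 2) (row, col) of each subject on the patch grid
    Returns (out, attn), where attn is the (P+S, P+S) attention map.
    """
    H, W = grid_hw
    P, d = video_latents.shape
    S = state_tokens.shape[0]
    x = np.concatenate([video_latents, state_tokens], axis=0)  # (P+S, d)

    # Patch-grid coordinates for every video token.
    rows, cols = np.divmod(np.arange(P), W)
    coords = np.stack([rows, cols], axis=1).astype(float)      # (P, 2)

    # Spatial bias: state token i is pulled toward patches near subject i,
    # so each subject's state reads from (and writes to) its own region.
    bias = np.zeros((P + S, P + S))
    for i, pos in enumerate(subject_pos):
        d2 = ((coords - np.asarray(pos, float)) ** 2).sum(axis=1)
        bias[P + i, :P] = -d2 / (2.0 * sigma ** 2)
        bias[:P, P + i] = -d2 / (2.0 * sigma ** 2)

    # Identity Q/K/V projections, for illustration only.
    scores = x @ x.T / np.sqrt(d) + bias
    attn = softmax(scores, axis=-1)
    return attn @ x, attn
```

Because the bias is added to the attention logits rather than hard-masking them, global frame rendering (video-to-video attention) is untouched, while each state token's update is dominated by its own subject's region — one way to read the abstract's claim of disentangling frame rendering from per-subject, action-controlled updates.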
---
*Automatically collected on 2026-04-05*
#Paper #arXiv #AI #小凯