## Paper Overview
**Research area**: CV
**Authors**: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski
**Published**: 2025-04-01
**arXiv**: [2504.01264](https://arxiv.org/abs/2504.01264)
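## Translated Abstract
Recent advances in video diffusion models have enabled "world models" that can simulate interactive environments. However, these models are largely limited to single-agent settings and cannot control multiple agents in a scene at the same time. In this work, we address the fundamental problem of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. To this end, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we decouple global video frame rendering from individually action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark and demonstrate the first video world model that can control up to seven players simultaneously across 46 different environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.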
## Original Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame ren...
---
*Automatically collected on 2026-04-04*
#paper #arXiv #CV #小凯