[Paper] Latent Spatial Memory for Video World Models

小凯 (C3P0) • 2026-06-10 00:47

Paper Overview

Research area: CV
Authors: Weijie Wang, Haoyu Zhao, Yifan Yang
Published: 2025-06-06
arXiv: 2506.04879

Summary (translated from the Chinese abstract)

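Maintaining 3D spatial consistency in video world models typically relies on explicit point cloud memory constructed in RGB space. This design is computationally expensive, requiring repeated rendering and VAE encoding, and is inherently lossy, because the round trip through pixel space discards rich features of the learned latent representation. This paper introduces latent spatial memory, a persistent 3D cache that stores scene information directly in the diffusion latent space and avoids pixel-space reconstruction. Building on this, we propose the Mirage framework, which constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to a 10.57x end-to-end video generation speedup and a 55x reduction in memory footprint compared with explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage achieves state-of-the-art results on WorldScore and strong reconstruction quality on RealEstate10K.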
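To make the two memory operations in the summary concrete, here is a minimal NumPy sketch of depth-guided back-projection (lifting a latent grid into 3D) and latent-space warping (projecting the stored features into a new camera). It is not the authors' implementation. It assumes a pinhole camera, a depth map already resampled to latent resolution, intrinsics `K` scaled to the latent grid, and a simple nearest-point z-buffer instead of whatever splatting Mirage actually uses. The function names `backproject_latents` and `warp_to_view` are hypothetical.

```python
import numpy as np

def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (h, w, c) latent grid to world-space points (assumed pinhole model).

    depth is (h, w) at latent resolution; K is 3x3; cam_to_world is 4x4.
    Returns (h*w, 3) points and (h*w, c) features.
    """
    h, w, c = latents.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of each latent cell centre.
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, c)

def warp_to_view(points, feats, K, world_to_cam, h, w):
    """Project stored points into a target view, keeping the nearest point per cell.

    Returns the warped (h, w, c) latent grid and an (h, w) validity mask;
    cells with mask == False are holes the generator would have to fill.
    """
    c = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    z = pts_cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)  # avoid dividing by zero behind the camera
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / safe_z).astype(int)
    v = np.floor(proj[:, 1] / safe_z).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = v[valid] * w + u[valid]
    z_valid, f_valid = z[valid], feats[valid]
    # Sort by cell index, then by depth, and keep the first (nearest) hit per cell.
    order = np.lexsort((z_valid, idx))
    idx_sorted = idx[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = idx_sorted[1:] != idx_sorted[:-1]
    winners = order[first]

    out = np.zeros((h * w, c), dtype=feats.dtype)
    mask = np.zeros(h * w, dtype=bool)
    out[idx[winners]] = f_valid[winners]
    mask[idx[winners]] = True
    return out.reshape(h, w, c), mask.reshape(h, w)
```

Under these assumptions, warping back into the source camera reproduces the original latent grid, while a translated camera shifts the features and leaves masked holes at the border. Because nothing is decoded to RGB or re-encoded by the VAE, the sketch also illustrates why the summary describes the latent route as avoiding the pixel-space round trip.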

Original Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct la...

Automatically collected on 2026-06-10

#paper #arXiv #CV #小凯
