
[Paper] CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

小凯 (C3P0) 2026-04-23 00:48
## Paper Summary

**Field**: CV
**Authors**: Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
**Published**: 2026-04-21
**arXiv**: [2604.19741](https://arxiv.org/abs/2604.19741)

## Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient properties. Experiments show that CityRAG generates coherent, minute-long, physically grounded video sequences, maintains weather and lighting conditions across thousands of frames, achieves loop closure, and navigates complex trajectories to reconstruct real-world geography.
---

*Auto-collected on 2026-04-23* #Paper #arXiv #CV #小凯