[Paper] ShotStream: Streaming Multi-Shot Video Generation for Interactive Stor...

Paper Overview

Research area: CV | Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue | Published: 2026-03-26 | arXiv: 2603.25746

Abstract (Translated from Chinese)

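Multi-shot video generation is crucial for long-form narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream lets users dynamically steer an ongoing narrative through streaming prompts.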
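The next-shot formulation above can be pictured as a simple autoregressive loop over prompts. The following is a minimal sketch, not the authors' implementation: `Shot`, `stream_shots`, and `toy_generator` are hypothetical names, frames are plain strings standing in for image tensors, and the toy generator only labels frames so the control flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

Frame = str  # placeholder; a real system would use image tensors


@dataclass
class Shot:
    prompt: str
    frames: List[Frame]


def stream_shots(
    prompts: Iterable[str],
    generate_shot: Callable[[str, List[Shot]], List[Frame]],
) -> Iterator[Shot]:
    """Yield one shot per incoming prompt, conditioning on all earlier shots."""
    history: List[Shot] = []
    for prompt in prompts:
        # Each new shot sees the prompt plus the shots generated so far.
        frames = generate_shot(prompt, history)
        shot = Shot(prompt=prompt, frames=frames)
        history.append(shot)
        yield shot  # available to the caller before the next prompt is read


def toy_generator(prompt: str, history: List[Shot], frames_per_shot: int = 3) -> List[Frame]:
    """Hypothetical stand-in for a video model: labels frames by shot index."""
    idx = len(history)
    return [f"shot{idx}:{prompt}:f{i}" for i in range(frames_per_shot)]


if __name__ == "__main__":
    for shot in stream_shots(["a cat wakes up", "the cat jumps"], toy_generator):
        print(shot.prompt, shot.frames)
```

Because `stream_shots` is a generator, the caller can supply prompts lazily (for example from user input), which mirrors the idea of instructing the narrative while it is being generated.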

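We first fine-tune a text-to-video model into a bidirectional next-shot generator, then distill it into a causal student model via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations: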

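1. Dual-cache memory mechanism: maintains visual coherence. A global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity.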

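2. Two-stage distillation strategy: training starts with intra-shot self-forcing conditioned on ground-truth historical shots, then progressively extends to inter-shot self-forcing that uses self-generated histories, effectively bridging the train-test gap.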
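To make the dual-cache idea in item 1 more concrete, here is a rough sketch under stated assumptions. The abstract does not say how conditional frames are selected or how the RoPE discontinuity indicator is realized; this sketch assumes the last few frames of a finished shot are promoted to the global cache, and it models the discontinuity as a fixed jump (`rope_gap`) in position indices between the two caches. `DualCache` and all its parameters are hypothetical names, and frames are strings standing in for cached features.

```python
from collections import deque
from typing import List, Tuple


class DualCache:
    """Toy dual-cache memory: global (cross-shot) and local (current shot)."""

    def __init__(self, max_global: int = 4, max_local: int = 8, rope_gap: int = 100):
        self.global_frames = deque(maxlen=max_global)  # conditional frames from earlier shots
        self.local_frames = deque(maxlen=max_local)    # frames generated in the current shot
        self.rope_gap = rope_gap                       # assumed position jump between caches

    def add_generated(self, frame: str) -> None:
        """Record a frame generated inside the current shot."""
        self.local_frames.append(frame)

    def end_shot(self, keep: int = 2) -> None:
        """Assumption: promote the last `keep` local frames as conditional frames."""
        for frame in list(self.local_frames)[-keep:]:
            self.global_frames.append(frame)
        self.local_frames.clear()

    def context(self) -> List[Tuple[int, str, str]]:
        """Return (position, source, frame) triples for the next generation step."""
        out = [(i, "global", f) for i, f in enumerate(self.global_frames)]
        # Local positions start after a gap, so the two caches never share
        # contiguous positions and can be told apart by position alone.
        start = len(self.global_frames) + self.rope_gap
        out += [(start + j, "local", f) for j, f in enumerate(self.local_frames)]
        return out


if __name__ == "__main__":
    cache = DualCache()
    for f in ["a0", "a1", "a2"]:
        cache.add_generated(f)
    cache.end_shot(keep=2)
    cache.add_generated("b0")
    print(cache.context())
```

The bounded `deque` objects keep memory constant as the story grows, which is one plausible way to read the role of the two caches in a streaming setting.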

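Extensive experiments show that ShotStream generates coherent multi-shot videos with sub-second latency, reaching 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling.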

Original Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two...

--- *Automatically collected on 2026-03-28*

#Paper #arXiv #CV #小凯