PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Paper Overview

Research area: Computer Vision; Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang; Published: 2026-03-26; arXiv: 2603.25730v1

Abstract (translated from the Chinese)

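Autoregressive video diffusion models have made remarkable progress, but they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we divide the historical context into three types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve massive spatiotemporal compression (32x token reduction) via a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, which are kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, paired with a continuous Temporal RoPE adjustment that seamlessly realigns the position gaps left by dropped tokens at negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute videos at 832x480 resolution and 16 FPS on a single H200 GPU. It achieves a bounded KV cache of only 4 GB and supports a remarkable 24x temporal extrapolation (5 s to 120 s), operating effectively either zero-shot or when trained on only 5-second clips. Extensive results on VBench show state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), demonstrating that short-video supervision is sufficient for high-quality long-video synthesis.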
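The three-partition cache described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical `Block` record per cached chunk, a relevance `score` that is simply given (the abstract does not say how mid tokens are ranked), integer division as a stand-in for the learned 32x compression network, and one position index per block rather than per token. It shows only the bookkeeping: keep the sink and recent blocks in full, shrink and top-k filter the mid blocks, then reassign contiguous positions so that dropped blocks leave no gaps.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One cached chunk of history (hypothetical structure, for illustration)."""
    frame_index: int     # original temporal index of the chunk
    tokens: int          # number of KV tokens the chunk occupies
    score: float = 0.0   # relevance score for top-k selection (assumed given)


def build_context(history: List[Block], n_sink: int, n_recent: int,
                  top_k: int, compression: int = 32) -> List[Tuple[str, Block, int]]:
    """Return (kind, block, new_position) triples for a bounded context."""
    sink = history[:n_sink]
    recent = history[max(n_sink, len(history) - n_recent):]
    mid = history[n_sink:len(history) - len(recent)]

    # Stand-in for compression: the token count shrinks by the given factor.
    packed = [Block(b.frame_index, max(1, b.tokens // compression), b.score)
              for b in mid]

    # Keep the top-k mid blocks by score, then restore temporal order.
    best = sorted(packed, key=lambda b: b.score, reverse=True)[:top_k]
    kept = sorted(best, key=lambda b: b.frame_index)

    context = ([("sink", b) for b in sink]
               + [("mid", b) for b in kept]
               + [("recent", b) for b in recent])

    # Reassign contiguous positions so dropped blocks leave no gaps.
    return [(kind, b, pos) for pos, (kind, b) in enumerate(context)]
```

Because the number of sink, selected mid, and recent blocks is fixed, the total token count of the returned context stays bounded no matter how long `history` grows, which is the property the abstract attributes to the bounded KV cache.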

Original Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to e...

--- *Automatically collected on 2026-03-28*

#Paper #arXiv #ComputerVision #小凯