Paper Overview
Research area: CV
Authors: George Stoica, Sayak Paul, Matthew Wallingford, Vivek Ramanujan, Abhay Nori, Winson Han, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Published: 2026-05-01
arXiv: 2605.00825
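Summary
Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior distribution to a complex data distribution. For high-dimensional images, however, each training sample supervises only a single trajectory and a single intermediate point, so the training signal is extremely sparse and has high variance. This under-constrained supervision can cause flow collapse: the learned dynamics memorize specific source-target pairings, map diverse inputs to overly similar outputs, and fail to generalize.
This paper proposes Posterior-Augmented Flow Matching (PAFM), which replaces single-target supervision with an expectation over an approximate posterior of valid target completions, given an intermediate state and a condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate state under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to build a mixture over multiple candidate targets. The authors prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing the variance of training gradients, because each intermediate point aggregates information from many plausible continuation trajectories.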
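The factorization into (i) and (ii) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common linear path x_t = (1 - t) x_0 + t x_1 with a standard Gaussian source x_0, under which the likelihood of x_t given an endpoint x_1 is N(t x_1, (1 - t)^2 I) and the conditional velocity is (x_1 - x_t) / (1 - t). The function names, the candidate list, and the default uniform prior are hypothetical; the abstract does not specify how candidate targets are drawn.

```python
import math


def posterior_weights(x_t, t, candidates, log_prior=None):
    """Self-normalized weights w_i proportional to p(x_t | x1_i) * p(x1_i | c).

    Assumes x_t | x1 ~ N(t * x1, (1 - t)^2 I), which holds for the linear
    path with a standard Gaussian source. Requires 0 <= t < 1.
    """
    if log_prior is None:
        # Assumption: uniform prior over the candidate endpoints.
        log_prior = [0.0] * len(candidates)
    sigma2 = (1.0 - t) ** 2
    logits = []
    for x1, lp in zip(candidates, log_prior):
        sq_dist = sum((a - t * b) ** 2 for a, b in zip(x_t, x1))
        # The Gaussian normalizer is identical for all candidates, so it cancels.
        logits.append(-0.5 * sq_dist / sigma2 + lp)
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def posterior_target(x_t, t, candidates, log_prior=None):
    """Posterior-weighted mixture of the per-candidate conditional velocities."""
    weights = posterior_weights(x_t, t, candidates, log_prior)
    target = [0.0] * len(x_t)
    for w, x1 in zip(weights, candidates):
        for d in range(len(x_t)):
            target[d] += w * (x1[d] - x_t[d]) / (1.0 - t)
    return target
```

In a training loop, a target like `posterior_target(x_t, t, candidates)` would stand in for the single-sample velocity as the regression target at `x_t`. Candidates whose scaled endpoint `t * x1` lies closer to `x_t` receive more weight.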
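Experiments show that PAFM improves over FM by up to 3.4 FID50K across model scales (SiT-B/2 and SiT-XL/2), architectures (SiT and MMDiT), and both class-conditional and text-conditional benchmarks (ImageNet and CC12M), with negligible additional compute overhead.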
Original Abstract
Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead.
--- *Automatically collected on 2026-05-05*
#paper #arXiv #CV #小凯