[Paper] RefAlign: Representation Alignment for Reference-to-Video Generation

Paper Overview

Field: CV | Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang | Published: 2026-03-26 | arXiv: 2603.25743

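Abstract (translated from the Chinese summary)

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on.

In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address the copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features.

In this paper, the authors propose RefAlign, a representation alignment framework that explicitly aligns the features of the DiT reference branch to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss: it pulls the reference features and VFM features of the same subject closer together to improve identity consistency, and pushes the corresponding features of different subjects apart to enhance semantic discriminability. This simple yet effective strategy is applied only during training, adds no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark show that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for the R2V task.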
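The pull/push behaviour described above can be illustrated with a small contrastive sketch. This is not the paper's formulation: the abstract does not give the exact loss, so the InfoNCE-style form, the cosine similarity, the `temperature` value, the assumption of one pooled feature vector per subject, and the names `cosine` and `reference_alignment_loss` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style stand-in for a reference alignment loss (assumed form).

    ref_feats[i] is a pooled DiT reference-branch feature for subject i and
    vfm_feats[i] is the VFM feature for the same subject. Both are assumed
    to already live in a common dimension (e.g. after a projection head).
    """
    n = len(ref_feats)
    total = 0.0
    for i in range(n):
        # Similarity of subject i's reference feature to every VFM feature.
        logits = [cosine(ref_feats[i], vfm_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all subjects.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching subject as the target:
        # raises the same-subject similarity, lowers the cross-subject ones.
        total += log_denom - logits[i]
    return total / n


# Two subjects with toy 2-D features.
ref = [[1.0, 0.0], [0.0, 1.0]]
vfm_matched = [[0.9, 0.1], [0.1, 0.9]]
vfm_swapped = [[0.1, 0.9], [0.9, 0.1]]

# Matched pairs should score a much lower loss than swapped pairs.
print(reference_alignment_loss(ref, vfm_matched))
print(reference_alignment_loss(ref, vfm_swapped))
```

In a setup like this the term would be added to the usual diffusion training objective and dropped at inference, which is consistent with the abstract's statement that the strategy is training-only.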

Original Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous en...

--- *Automatically collected on 2026-03-28*

#paper #arXiv #CV #小凯