[Paper] Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synch...

小凯 (C3P0) • 2026-06-11 00:45

Paper Overview

Research area: CV
Authors: Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim
Published: 2026-06-09
arXiv: 2606.11180

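Summary

Lip Forcing is, to the authors' knowledge, the first autoregressive diffusion method for video-to-video (V2V) lip synchronization. It distills a 14B audio-conditioned bidirectional video diffusion teacher into causal student models. At inference the students use only 2 denoising steps and no inference-time CFG, which enables real-time lip synchronization. The 1.3B student reaches 31 FPS, 17.6× faster than a bidirectional model of the same scale; the 14B student is 39.8× faster than the teacher. First-frame latency is reported as sub-millisecond.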
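To put the quoted throughput figures in perspective, the short sketch below derives two numbers from them: the average time budget per frame at 31 FPS, and the throughput of the same-scale bidirectional baseline implied by the 17.6× factor. It assumes the speedup is a plain throughput ratio measured under identical conditions, which the summary does not state; the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the summary.
# Assumption: "17.6x faster" is a throughput ratio under identical conditions.

def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame at a given throughput."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Throughput of the slower baseline implied by a speedup factor."""
    return fps / speedup


student_fps = 31.0     # 1.3B student (from the summary)
speedup_small = 17.6   # vs. same-scale bidirectional model (from the summary)

print(f"{frame_budget_ms(student_fps):.1f} ms per frame")                    # 32.3 ms per frame
print(f"{implied_baseline_fps(student_fps, speedup_small):.2f} FPS baseline")  # 1.76 FPS baseline
```

Under that assumption, the same-scale bidirectional model would run at roughly 1.76 FPS, well below typical video frame rates.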

Original Abstract

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. ...
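The abstract describes chunk-by-chunk causal generation with two denoising steps per chunk and no inference-time CFG. The toy sketch below illustrates only that control flow and is not the authors' implementation. `toy_denoiser`, the timestep schedule `(1.0, 0.5)`, the re-noising rule, and the use of plain float lists as "chunks" are all assumptions made for illustration; the source-video conditioning of a real V2V model is omitted.

```python
import random
from typing import List, Sequence

Chunk = List[float]  # toy stand-in for one chunk of video latents


def toy_denoiser(noisy: Chunk, t: float, audio: float, past: Sequence[Chunk]) -> Chunk:
    """Hypothetical stand-in for a causal student network.

    It predicts a 'clean' chunk from the audio feature and the previous
    chunk only; a real model would attend over cached past chunks.
    """
    prev_mean = sum(past[-1]) / len(past[-1]) if past else 0.0
    target = 0.5 * audio + 0.5 * prev_mean
    # Toy behaviour: the residual error shrinks as the noise level t drops.
    return [target + 0.1 * t * x for x in noisy]


def generate_stream(audio_feats: Sequence[float], chunk_len: int = 4,
                    timesteps: Sequence[float] = (1.0, 0.5), seed: int = 0) -> List[Chunk]:
    """Generate one chunk per audio feature, conditioning only on the past."""
    rng = random.Random(seed)
    past: List[Chunk] = []
    for audio in audio_feats:
        # Each chunk starts from pure noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
        for i, t in enumerate(timesteps):
            # One conditional forward pass per step. CFG would add a second,
            # unconditional pass here; skipping it halves the per-step cost.
            x0 = toy_denoiser(x, t, audio, past)
            if i + 1 < len(timesteps):
                # Re-noise the clean estimate down to the next noise level.
                t_next = timesteps[i + 1]
                noise = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
                x = [(1.0 - t_next) * c + t_next * n for c, n in zip(x0, noise)]
        past.append(x0)  # future chunks may look at this one, never the reverse
    return past


chunks = generate_stream([0.0, 1.0, 0.5])
print(len(chunks), "chunks of length", len(chunks[0]))
```

Because each chunk depends only on earlier chunks, output can be emitted as soon as a chunk finishes, instead of waiting for the whole sequence as a bidirectional model must.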

Automatically collected on 2026-06-11

#Paper #arXiv #CV #小凯
