[Paper] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
## Paper Overview
**Research Area**: cs.CV
**Authors**: Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
**Published**: 2026-04-13
**arXiv**: [2604.11804](https://arxiv.org/abs/2604.11804)
## Summary
This paper studies Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. The task has clear practical value in applications such as e-commerce demonstrations, short-video production, and interactive entertainment. The authors propose OmniShow, an end-to-end framework that coordinates multimodal conditions and delivers industrial-grade performance. It introduces unified channel conditioning for efficient injection of image and pose signals, and gated local-context attention to ensure precise audio-visual synchronization. To cope with data scarcity, the authors develop a decoupled-then-joint training strategy, and they additionally establish HOIVG-Bench, a dedicated benchmark for this task.
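The summary names two conditioning mechanisms: unified channel conditioning for image/pose injection and gated local-context attention for audio-visual synchronization. The paper's actual implementation is not given here, so the PyTorch sketch below is only an illustration under stated assumptions: that channel conditioning concatenates reference-image and pose latents with the video latent along the channel axis, and that the gated attention is audio cross-attention scaled by a zero-initialized learnable gate. All module names, shapes, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class ChannelConditioning(nn.Module):
    """Illustrative sketch: fuse video, reference-image, and pose latents by
    concatenating along the channel axis, then projecting back to video width."""
    def __init__(self, video_ch: int, image_ch: int, pose_ch: int):
        super().__init__()
        self.proj = nn.Conv3d(video_ch + image_ch + pose_ch, video_ch, kernel_size=1)

    def forward(self, video_lat, image_lat, pose_lat):
        # All latents are assumed to share (B, C, T, H, W) temporal/spatial dims.
        fused = torch.cat([video_lat, image_lat, pose_lat], dim=1)
        return self.proj(fused)

class GatedAudioCrossAttention(nn.Module):
    """Illustrative sketch: cross-attention from video tokens to audio tokens,
    scaled by a learnable gate that starts at zero so the audio branch is
    blended in gradually during training."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, N_v, D), audio_tokens: (B, N_a, D)
        attended, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        return video_tokens + torch.tanh(self.gate) * attended

if __name__ == "__main__":
    cond = ChannelConditioning(video_ch=16, image_ch=16, pose_ch=4)
    v = torch.randn(1, 16, 8, 32, 32)
    i = torch.randn(1, 16, 8, 32, 32)
    p = torch.randn(1, 4, 8, 32, 32)
    fused = cond(v, i, p)                 # (1, 16, 8, 32, 32)

    attn = GatedAudioCrossAttention(dim=64)
    vt = torch.randn(1, 128, 64)
    at = torch.randn(1, 32, 64)
    out = attn(vt, at)                    # (1, 128, 64)
    print(fused.shape, out.shape)
```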
## Original Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment.
---
*Collected automatically on 2026-04-15*
#Paper #arXiv #AI #小凯