Paper Summary
Research area: CV Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu Published: 2026-03-23 arXiv: 2603.22282
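Abstract (translated from Chinese)
We present UniMotion, to our knowledge the first unified framework that simultaneously understands and generates human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., motion-text or static pose-image) and rely predominantly on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through one core principle: treating motion as a continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without relying on images at inference time, we propose Dual-Posterior KL Alignment (DPA), which distills the richer posterior of the vision-fused encoder into the motion-only encoder. To address the cold-start problem, in which text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as explicit conditions to jointly calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance on seven tasks spanning any-to-any understanding, generation, and editing across the three modalities.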
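The abstract names a KL-based alignment between two encoder posteriors (DPA) but does not give its formula. As a rough illustration only, the sketch below computes the closed-form KL divergence between two diagonal Gaussians, which is one common way such a posterior-alignment term is written. The function name `gaussian_kl`, the example values, the diagonal-Gaussian form, and the KL direction (vision-fused posterior first, motion-only posterior second) are all assumptions, not details taken from the paper.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension;
    variances are passed as log-variances.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# Hypothetical posteriors for a 2-dimensional latent (illustrative values only).
fused_mu, fused_logvar = [0.5, -0.2], [0.0, 0.0]    # assumed vision-fused encoder output
motion_mu, motion_logvar = [0.0, 0.0], [0.0, 0.0]   # assumed motion-only encoder output

# Assumed direction: KL(vision-fused || motion-only).
alignment_loss = gaussian_kl(fused_mu, fused_logvar, motion_mu, motion_logvar)
print(alignment_loss)
```

Minimizing a term like this with respect to the motion-only encoder pulls its posterior toward the vision-fused one, so images would no longer be needed at inference time. The actual loss in the paper may differ in direction, weighting, or parameterization.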
Original Abstract
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiri...
--- *Automatically collected on 2026-03-25*
#paper #arXiv #CV #小凯