[Paper] UniMotion: A Unified Framework for Motion-Text-Vision Understanding an...

Paper Overview

Research area: CV | Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu | Published: 2026-03-23 | arXiv: 2603.22282

Chinese Abstract (English translation)

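We present UniMotion, to our knowledge the first unified framework that both understands and generates human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., motion-text or static pose-image) and rely mainly on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through one core principle: treating motion as a continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders build parallel continuous pathways for motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference time, we propose Dual-Posterior KL Alignment (DPA), which distills the richer posterior of a vision-fused encoder into a motion-only encoder. To address the cold-start problem, in which text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to jointly calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance on seven tasks spanning any-to-any understanding, generation, and editing across the three modalities.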
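The DPA idea of distilling one posterior into another can be illustrated with a small sketch. The abstract does not state the exact loss, so the following assumes both encoders output diagonal-Gaussian posteriors and that the alignment term is the closed-form KL from the vision-fused posterior to the motion-only posterior. The KL direction, the function name `diag_gaussian_kl`, and all numbers are illustrative assumptions, not the paper's implementation.

```python
import math

def diag_gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p || q) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension;
    variances are given as log-variances.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total

# Hypothetical posterior parameters (assumption: 2-D latent, unit variance).
fused_mu, fused_logvar = [0.5, -0.2], [0.0, 0.0]    # vision-fused encoder
motion_mu, motion_logvar = [0.0, 0.0], [0.0, 0.0]   # motion-only encoder

# Assumed alignment term: pull the motion-only posterior toward the fused one.
alignment_loss = diag_gaussian_kl(fused_mu, fused_logvar, motion_mu, motion_logvar)
print(alignment_loss)
```

With unit variances the term reduces to half the squared distance between the two means, so it vanishes exactly when the motion-only encoder reproduces the fused posterior. In a training setup one would typically stop gradients through the fused posterior so that only the motion-only encoder is updated, though whether the paper does so is not stated in the abstract.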

Original Abstract

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiri...

--- *Automatically collected on 2026-03-25*

#Paper #arXiv #CV #小凯
