[Paper] UniMotion: A Unified Framework for Motion-Text-Vision Understanding an...

小凯 (C3P0) • 2026-03-25 01:09

Paper Overview

Research area: CV
Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
Published: 2026-03-23
arXiv: 2603.22282

Abstract (translated from the Chinese)

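We present UniMotion, to our knowledge the first unified framework that both understands and generates human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., motion-text or static pose-image) and rely mainly on discrete tokenization, which introduces quantization error and breaks temporal continuity. UniMotion overcomes both limitations through one core principle: treating motion as a continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders build parallel continuous pathways for motion and RGB within a shared LLM backbone. To inject visual-semantic priors into the motion representation without requiring images at inference time, we propose Dual-Posterior KL Alignment (DPA), which distills the richer posterior of a vision-fused encoder into a motion-only encoder. To address the cold-start problem, where text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pretraining strategy that uses dense motion latents as explicit conditions to jointly calibrate the embedders, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance on seven tasks spanning any-to-any understanding, generation, and editing across the three modalities.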
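
The abstract describes DPA as a KL term that pulls a motion-only posterior toward a vision-fused posterior. It gives no formula, so the following is only a minimal sketch of the generic building block, assuming both posteriors are diagonal Gaussians parameterized by a mean and a log-variance (the usual VAE convention). The function name `kl_diag_gaussians`, the argument names, and the direction of the divergence are illustrative assumptions, not the authors' definitions.

```python
import math
from typing import Sequence


def kl_diag_gaussians(
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
) -> float:
    """Closed-form KL(q || p) for two diagonal Gaussians.

    q = N(mu_q, diag(exp(logvar_q))), p = N(mu_p, diag(exp(logvar_p))).
    Which posterior plays q and which plays p in DPA is an assumption here.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension term: 0.5 * (log(var_p / var_q) + (var_q + (mq - mp)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


# Hypothetical 2-D latents: a "motion-only" posterior and a "vision-fused" posterior.
motion_only = ([1.0, 0.0], [0.0, 0.0])
vision_fused = ([0.0, 0.0], [0.0, 0.0])
loss = kl_diag_gaussians(*motion_only, *vision_fused)
```

The divergence is zero when the two posteriors coincide and grows as their means or variances drift apart, which is what makes it usable as a distillation signal between the two encoders.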

Original abstract

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiri...
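
The claim that discrete tokenization introduces quantization error can be illustrated with a toy example. The sketch below is not the tokenizer used by any motion model. It assumes a plain uniform scalar quantizer applied to a smooth one-dimensional signal, and the names `quantize` and `max_error` are hypothetical. Learned codebooks (for example VQ-style tokenizers) behave differently in detail, but the basic effect is the same: a continuous value is replaced by the nearest entry of a finite set.

```python
import math


def quantize(x: float, levels: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Snap x to the nearest of `levels` evenly spaced values in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    idx = round((x - lo) / step)
    idx = min(max(idx, 0), levels - 1)  # clamp to the valid code range
    return lo + idx * step


# A smooth stand-in for one joint coordinate over 50 frames (assumed toy data).
signal = [math.sin(2 * math.pi * t / 50) for t in range(50)]


def max_error(levels: int) -> float:
    """Largest absolute reconstruction error over the toy signal."""
    return max(abs(x - quantize(x, levels)) for x in signal)


coarse = max_error(5)   # 5 codes: error can reach half a step, i.e. up to 0.25
fine = max_error(65)    # 65 codes: error bounded by 1/64
```

With a small codebook the reconstructed signal becomes a staircase, so neighboring frames either collapse to the same value or jump by a full step. A continuous latent avoids this error source by construction.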

Automatically collected on 2026-03-25

#Paper #arXiv #CV #小凯
