## Paper Summary
**Research Area**: CV/Graphics
**Authors**: Junxuan Li, Rawal Khirodkar, Chengan He
**Published**: 2026-04-02
**arXiv**: [2604.02320](https://arxiv.org/abs/2604.02320)
## Original Abstract
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
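The pre/post-training recipe the abstract describes can be caricatured as a generic two-stage optimization schedule: a long pretraining pass over large-scale in-the-wild data, followed by a shorter, lower-learning-rate pass over curated data. The sketch below is purely illustrative and is not from the LCA paper: `sgd_stage`, `two_stage_training`, the toy quadratic objective, and all learning rates are invented here, with a 1-D scalar fit standing in for the avatar model and its losses.

```python
# Illustrative two-stage (pretrain -> post-train) schedule. This is NOT
# the LCA implementation: a toy 1-D model fit with plain SGD stands in
# for the avatar model and its training objectives.

def sgd_stage(params, batches, grad_fn, lr):
    """Run one SGD pass over `batches`, returning updated parameters."""
    for batch in batches:
        grads = grad_fn(params, batch)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

def two_stage_training(params, wild_batches, curated_batches, grad_fn,
                       lr_pretrain=1e-1, lr_posttrain=1e-2):
    # Stage 1: pretrain on large-scale "in-the-wild" data (broad priors).
    params = sgd_stage(params, wild_batches, grad_fn, lr_pretrain)
    # Stage 2: post-train on a smaller curated set at a lower learning
    # rate, refining fidelity without discarding the pretrained prior.
    params = sgd_stage(params, curated_batches, grad_fn, lr_posttrain)
    return params

# Toy quadratic objective: each "batch" is a scalar target t, and the
# gradient of (p - t)^2 with respect to p is 2 * (p - t).
def quad_grad(params, target):
    return [2.0 * (p - target) for p in params]

if __name__ == "__main__":
    fitted = two_stage_training([0.0], [1.0] * 50, [1.0] * 20, quad_grad)
    print(fitted[0])  # converges close to the target 1.0
```

The lower post-training learning rate mirrors the standard intuition behind such schedules: the second stage should refine the model on high-quality data rather than overwrite what the large-scale stage learned.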
---
*Automatically collected on 2026-04-05*
#Paper #arXiv #CV #小凯