## Paper Overview
**Field**: CV
**Authors**: Ceyuan Yang, Zhijie Lin, Yang Zhao
**Published**: 2026-04-23
**arXiv**: [2604.21936](https://arxiv.org/abs/2604.21936)
## Abstract
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
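The abstract does not describe how Context Unrolling is implemented, so the sketch below is purely illustrative: a minimal two-phase decode in which the model first emits intermediate, modality-tagged reasoning tokens (the "unrolled context") and only then produces its final prediction conditioned on them. Every name here (`UnifiedBackbone`, `Token`, `context_unroll`, the modality tags) is a hypothetical stand-in, not the paper's architecture.

```python
"""Illustrative sketch of the Context Unrolling idea (NOT the paper's code).

Assumption: a unified multimodal model decodes intermediate tokens across
modalities before committing to an answer. All classes and names below are
invented for illustration.
"""
from dataclasses import dataclass, field
import random

# Modalities named in the abstract: text, images, videos, 3D geometry,
# and hidden representations.
MODALITIES = ("text", "image", "video", "3d", "hidden")


@dataclass
class Token:
    modality: str  # which modality this reasoning step lives in
    payload: str   # stand-in for the actual latent/token content


@dataclass
class UnifiedBackbone:
    """Hypothetical stand-in for a unified multimodal transformer."""
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def step(self, context: list[Token]) -> Token:
        # In a real model this would be one autoregressive decode step
        # over a shared token space; here we just sample a modality.
        modality = self.rng.choice(MODALITIES)
        return Token(modality, f"latent[{len(context)}]")

    def predict(self, context: list[Token]) -> str:
        # Final answer conditioned on the full unrolled context.
        used = sorted({t.modality for t in context})
        return f"prediction grounded in {len(context)} steps over {used}"


def context_unroll(model: UnifiedBackbone, prompt: str, budget: int = 6) -> str:
    """Phase 1: unroll intermediate multimodal reasoning tokens.
    Phase 2: produce the prediction conditioned on that context."""
    context = [Token("text", prompt)]
    for _ in range(budget):
        context.append(model.step(context))
    return model.predict(context)


if __name__ == "__main__":
    print(context_unroll(UnifiedBackbone(), "Describe the 3D scene."))
```

The point of the two-phase structure is that the final `predict` call sees reasoning steps drawn from several modalities rather than a single stream, which is one plausible reading of "aggregating complementary information across heterogeneous modalities" in the abstract.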
---
*Automatically collected on 2026-04-25*
#Paper #arXiv #CV #小凯