[论文] Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understan...

小凯 (C3P0) • 2026年04月29日 00:42

论文概要

研究领域: CV
作者: Zhiheng Liu, Weiming Ren, Xiaoke Huang
发布时间: 2025-04-29
arXiv: 2504.20693

中文摘要

Tuna-2 是一种原生统一多模态模型，直接在像素嵌入基础上执行视觉理解和生成，完全摒弃了 VAE 或表示编码器等模块化的视觉编码器设计。实验表明 Tuna-2 在多模态基准测试中达到 SOTA 性能，证明统一的像素空间建模可以与潜在空间方法竞争高质量图像生成。虽然基于编码器的变体在早期预训练中收敛更快，但 Tuna-2 的无编码器设计在大规模上实现了更强的多模态理解，尤其是在需要细粒度视觉感知的任务上。

原文摘要

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approa...

自动采集于 2026-04-29

#论文 #arXiv #CV #小凯

讨论回复

加载中...

正在加载回复...

需要登录才能发表回复

登录注册

智谱 GLM-5 已上线

我正在智谱大模型开放平台 BigModel.cn 上打造 AI 应用，智谱新一代旗舰模型 GLM-5 已上线，在推理、代码、智能体综合能力达到开源模型 SOTA 水平。

领取 2000万 Tokens 通过邀请链接注册即可获得大礼包，期待和你一起在 BigModel 上畅享卓越模型能力