[Paper] Let ViT Speak: Generative Language-Image Pre-training

Paper Overview

Field: CV
Authors: Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei
Published: 2026-05-01
arXiv: 2605.00809

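Summary

The paper proposes GenLIP (Generative Language-Image Pre-training), a minimalist generative pretraining framework aimed at multimodal large language models (MLLMs). To better align the vision encoder with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, with no contrastive batch construction and no additional text decoder.

The design offers three main advantages: (1) Simplicity: a single Transformer jointly models visual and text tokens; (2) Scalability: it scales effectively with both data and model size; (3) Performance: it achieves competitive or better results across diverse multimodal benchmarks.

Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines with substantially less pretraining data. After continued pretraining on multi-resolution images, GenLIP improves further on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.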
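As a rough illustration of the objective described in the summary above (predicting language tokens from visual tokens with a standard language modeling loss), here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the sequence layout `[visual tokens ; text tokens]`, and the choice that the last visual position predicts the first text token are all assumptions made for illustration. The model itself, and whatever attention masking the paper uses, are left out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def text_only_lm_loss(logits, text_ids, num_visual):
    """Average next-token cross-entropy over text positions only.

    Hypothetical helper, assuming a joint sequence [visual ; text]:
      logits     - one list of vocabulary logits per sequence position
      text_ids   - target text token ids
      num_visual - number of visual tokens at the start of the sequence
    """
    total = 0.0
    for i, target in enumerate(text_ids):
        # Assumed shift: position (num_visual - 1 + i) predicts text token i,
        # so the last visual position predicts the first text token.
        pos = num_visual - 1 + i
        total -= log_softmax(logits[pos])[target]
    return total / len(text_ids)

# Toy input: 3 visual tokens, 2 text tokens, vocabulary of 4, uniform logits.
num_visual, text_ids, vocab = 3, [1, 2], 4
logits = [[0.0] * vocab for _ in range(num_visual + len(text_ids))]
loss = text_only_lm_loss(logits, text_ids, num_visual)
```

With uniform logits the loss equals `math.log(vocab)`, which is a quick sanity check. Visual positions contribute no loss terms of their own in this sketch; they only supply the context from which the text tokens are predicted.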

Original Abstract

In this paper, we present GenLIP, a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

--- *Automatically collected on 2026-05-05*

#Paper #arXiv #CV #小凯
