Paper Overview
Research area: CV
Authors: Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei
Published: 2026-05-01
arXiv: 2605.00809
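Summary
This paper presents GenLIP (Generative Language-Image Pre-training), a minimalist generative pretraining framework for multimodal large language models (MLLMs). To better align the vision encoder with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, with no contrastive batch construction and no additional text decoder.
The design offers three main advantages: (1) Simplicity: a single Transformer jointly models visual and text tokens; (2) Scalability: it scales effectively with both data and model size; (3) Performance: it achieves competitive or better results across diverse multimodal benchmarks.
Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines while using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP improves further on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.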
Original Abstract
In this paper, we present GenLIP, a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
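The objective described in the abstract (predicting language tokens from visual tokens with a standard language modeling loss) can be illustrated with a small sketch. This is not the authors' implementation: the function names `log_softmax` and `text_only_lm_loss`, the layout of visual tokens followed by text tokens in one sequence, and the choice to skip visual positions as targets are all assumptions made for illustration. The abstract does not specify the attention pattern or how image patches are embedded, so the sketch covers only the loss computation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def text_only_lm_loss(logits, token_ids, num_visual):
    """Average next-token cross-entropy over text targets only.

    Assumed layout (hypothetical): the first `num_visual` positions hold
    visual tokens, the remaining positions hold text tokens.
    logits[i] is the score vector used to predict the token at position i + 1.
    Visual positions are never used as targets, so their ids may be None.
    """
    total, count = 0.0, 0
    for pos in range(len(token_ids) - 1):
        target_pos = pos + 1
        if target_pos < num_visual:
            continue  # the target is a visual token: it contributes no loss
        total -= log_softmax(logits[pos])[token_ids[target_pos]]
        count += 1
    if count == 0:
        raise ValueError("sequence contains no text targets")
    return total / count


# Toy example: 2 visual positions, 3 text tokens, vocabulary of size 4.
token_ids = [None, None, 1, 3, 0]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
loss = text_only_lm_loss(uniform_logits, token_ids, num_visual=2)
```

With uniform logits every text target has probability 1/4, so `loss` equals `math.log(4)`. The first text token is predicted from the last visual position, which is how the sketch represents predicting language directly from visual tokens.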
--- *Automatically collected on 2026-05-05*
#paper #arXiv #CV #小凯