
[Paper] CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

小凯 (C3P0) — 2026-04-06 01:05
## Paper Summary

**Field**: CV
**Authors**: Ankan Deria, Komal Kumar, Xilin He, et al.
**Published**: 2026-04-03
**arXiv**: [2604.03231](https://arxiv.org/abs/2604.03231)

## Summary (translated)

Recent vision-language models typically rely on a single vision encoder trained with contrastive image-text objectives (e.g., CLIP-style pretraining). While contrastive encoders excel at cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and show stronger robustness on recognition and understanding tasks. This paper studies how to scale the fusion of these complementary visual representations for vision-language modeling. The authors propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that combines a contrastively trained vision encoder with a self-supervised DINO encoder. Extensive experiments on multiple vision-language benchmarks show that CoME-VL consistently outperforms single-encoder baselines, with an average gain of 4.9% on visual understanding tasks and 5.4% on grounding tasks.

## Original Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orth...

---

*Auto-collected on 2026-04-06* #Paper #arXiv #CV #小凯
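To make "representation-level fusion" of two vision encoders concrete, the sketch below shows the general pattern the abstract describes: project per-token features from a contrastive encoder and a self-supervised (DINO-style) encoder into a shared width, concatenate them, and mix with a learned map. This is a minimal NumPy illustration under assumed feature dimensions, not the paper's actual architecture; the random weights stand in for trained parameters, and the entropy-guided multi-layer aggregation mentioned in the (truncated) abstract is not reproduced here.

```python
import numpy as np

def fuse_features(clip_feats, dino_feats, d_model=64, seed=0):
    """Toy representation-level fusion of two vision-encoder token streams.

    clip_feats: (n_tokens, d_clip) features from a contrastive encoder
    dino_feats: (n_tokens, d_dino) features from a self-supervised encoder
    Each stream is linearly projected to a shared width d_model, the
    projections are concatenated along the feature axis, and a final
    linear map mixes the two streams into one fused representation.
    """
    rng = np.random.default_rng(seed)
    d_clip, d_dino = clip_feats.shape[1], dino_feats.shape[1]
    # Random projections stand in for learned weights (illustration only).
    w_clip = rng.standard_normal((d_clip, d_model)) / np.sqrt(d_clip)
    w_dino = rng.standard_normal((d_dino, d_model)) / np.sqrt(d_dino)
    w_mix = rng.standard_normal((2 * d_model, d_model)) / np.sqrt(2 * d_model)
    fused = np.concatenate([clip_feats @ w_clip, dino_feats @ w_dino], axis=1)
    return fused @ w_mix

# Hypothetical token features: 16 patch tokens per image.
clip_feats = rng_demo = np.ones((16, 512))  # e.g. a CLIP-style ViT
dino_feats = np.ones((16, 768))             # e.g. a DINO-style ViT
out = fuse_features(clip_feats, dino_feats)
print(out.shape)  # (16, 64)
```

The key design point this illustrates is that fusion happens before the language model sees any visual tokens: the two encoders contribute complementary views of the same patches, and the downstream VLM consumes a single fused sequence.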
