No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Paper Overview

Research area: Computer Vision | Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez | Published: 2026-03-26 | arXiv: 2603.25722

Translated Abstract

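Contrastive vision-language (V&L) models remain popular across a wide range of applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods typically address this limitation by generating custom training data to obtain hard negatives. Hard negatives have been shown to improve performance on compositionality tasks, but they are often specific to a single benchmark, do not generalize, and can severely degrade basic V&L capabilities such as zero-shot or retrieval performance, which makes them impractical. In this work, we take a different approach. We identify two root causes that limit the compositional performance of V&L models: 1) long training captions do not require a compositional representation; 2) the final global pooling in the text and image encoders causes a complete loss of the information needed to learn binding. As a remedy, we propose two simple solutions: 1) we use standard NLP software to obtain short, concept-centric caption pieces and align them with the image; 2) we introduce parameter-free cross-modal attention pooling to obtain concept-centric visual embeddings from the image encoder. With these changes and a simple auxiliary contrastive loss, we achieve SOTA performance on standard compositionality benchmarks while maintaining or improving strong zero-shot and retrieval capabilities. This adds no inference cost. Code: https://github.com/SamsungLabs/concept_centric_clip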
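To make the idea of parameter-free cross-modal attention pooling more concrete, here is a minimal sketch in plain Python. It is only one plausible reading of the abstract, not the authors' implementation: the function name `cross_modal_attention_pool`, the `temperature` parameter, and the choice of scaled dot-product scoring are all assumptions made for illustration. The text embedding of a concept acts as the query, and the image patch tokens act as both keys and values, so no learned projection weights are involved.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention_pool(concept_emb, patch_tokens, temperature=1.0):
    """Pool image patch tokens using a text concept embedding as the query.

    Hypothetical sketch: the query is the concept embedding itself and the
    patch tokens serve as keys and values, so there are no learned weights.
    Returns the pooled visual embedding and the attention weights.
    """
    dim = len(concept_emb)
    scale = temperature * math.sqrt(dim)
    scores = [dot(concept_emb, patch) / scale for patch in patch_tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * patch[i] for w, patch in zip(weights, patch_tokens))
        for i in range(dim)
    ]
    return pooled, weights


# Toy example: three 2-d patch tokens and one concept embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
concept = [1.0, 0.0]
pooled, weights = cross_modal_attention_pool(concept, patches)
print(pooled, weights)
```

In this toy setup the patch most similar to the concept embedding receives the largest weight, so the pooled vector leans toward the image regions that match the concept. Because the pooling has no parameters of its own, it would only be needed where concept-level embeddings are computed, which is consistent with the abstract's claim of no added inference cost, though the exact mechanism in the paper may differ.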

Original Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional repr...

--- *Automatically collected on 2026-03-28*

#Paper #arXiv #ComputerVision #小凯
