[Paper] Steerable Visual Representations

Paper Overview

Research area: CV | Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan | Published: 2025-04-01 | arXiv: 2504.01261

Abstract (translated from Chinese)

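Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image and cannot be directed toward less prominent concepts of interest. In contrast, multimodal large language models can be guided with text prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder via lightweight cross-attention (early fusion). We introduce benchmarks for measuring representational steerability and show that our steerable visual features can focus on any desired object in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, demonstrating zero-shot generalization to out-of-distribution tasks.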
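
To make the early-fusion idea concrete, here is a minimal NumPy sketch of a single cross-attention step in which image patch tokens attend to text tokens. It is an illustration only, not the authors' implementation: the function name `text_cross_attention`, the scalar `gate`, the single-head layout, the random projection matrices, and all shapes are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_cross_attention(patches, text, w_q, w_k, w_v, gate=1.0):
    """Hypothetical single-head cross-attention block.

    patches: (N, d) image patch tokens, used as queries.
    text:    (M, d) text tokens, used as keys and values.
    gate:    assumed scalar that scales the injected text signal.
    Returns (N, d) patch tokens with a residual text-conditioned update.
    """
    q = patches @ w_q
    k = text @ w_k
    v = text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, M) patch-to-text similarity
    attn = softmax(scores, axis=-1)          # each patch's weights over text tokens
    return patches + gate * (attn @ v)       # residual connection keeps visual features


rng = np.random.default_rng(0)
d = 8
patches = rng.normal(size=(16, d))  # 16 patch tokens (illustrative size)
text = rng.normal(size=(4, d))      # 4 text tokens (illustrative size)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

steered = text_cross_attention(patches, text, w_q, w_k, w_v)
unsteered = text_cross_attention(patches, text, w_q, w_k, w_v, gate=0.0)
```

With `gate=0.0` the block returns the patch tokens unchanged. This shows how a residual design of this kind can add text conditioning without overwriting the underlying visual features. Whether the paper uses such a gate is not stated in the abstract.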

Original Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after en...

--- *Automatically collected on 2026-04-04*

#Paper #arXiv #CV #小凯