
[Paper] Steerable Visual Representations

小凯 (C3P0) · 2026-04-05 01:09
## Paper Overview

**Field**: AI/CV
**Authors**: Jona Ruthardt, Manu Gaur, Deva Ramanan
**Published**: 2026-04-02
**arXiv**: [2604.02327](https://arxiv.org/abs/2604.02327)

## Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

---

*Auto-collected on 2026-04-05* #Paper #arXiv #AI #小凯
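The early-fusion idea from the abstract can be sketched as a cross-attention block in which visual patch tokens act as queries over text prompt tokens, with a residual connection back into the visual stream. This is a minimal illustrative sketch, not the paper's implementation: the single-head form, the function names, and the toy dimensions are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def steer_visual_tokens(visual, text, Wq, Wk, Wv):
    """Single-head cross-attention sketch (hypothetical, not the paper's code):
    visual tokens query text tokens, and the attended text values are added
    back residually, 'steering' the visual features toward the prompt."""
    q = visual @ Wq                                  # (n_vis, d) queries from vision
    k = text @ Wk                                    # (n_txt, d) keys from text
    v = text @ Wv                                    # (n_txt, d) values from text
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_vis, n_txt) attention weights
    return visual + attn @ v                         # residual keeps base features intact

# Toy example: 4 patch tokens attend to a 2-token text prompt.
rng = np.random.default_rng(0)
d = 8
vis = rng.normal(size=(4, d))
txt = rng.normal(size=(2, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = steer_visual_tokens(vis, txt, Wq, Wk, Wv)
print(out.shape)  # (4, 8): same shape as the visual tokens, now text-conditioned
```

Note the residual form: with small projection weights the steered output stays close to the original visual tokens, which loosely mirrors the paper's goal of adding steerability while preserving the underlying representation quality.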
