[Paper] Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Paper Overview

Field: CV Authors: Ruozhen He, Nisarg A. Shah, Qihua Dong Published: 2025-04-01 arXiv: 2504.01259

Abstract (translated from Chinese)

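Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting, scenario-based visual grounding, in which the target must be inferred from roles, intentions, and relational context rather than from explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. Its queries are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects, and they often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position; these tags expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting and combines supervised warm-up with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.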
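
To make the "fine-grained analysis" enabled by difficulty tags concrete, the sketch below groups grounding results by one tag and reports per-level accuracy. Everything in it is an illustrative assumption rather than the RSC release format or the authors' evaluation code: the `Instance` record, the tag level names (`"low"`, `"high"`), the `(x1, y1, x2, y2)` box convention, and the IoU ≥ 0.5 hit criterion (a common choice in visual grounding, not confirmed by the abstract).

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical box convention: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


@dataclass
class Instance:
    """Hypothetical record for one scenario-grounding example."""
    query: str                 # paragraph-length scenario text
    target_box: Box            # ground-truth box of the intended object
    tags: dict[str, str] = field(default_factory=dict)  # e.g. {"clutter": "high"}


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_tag(instances, predictions, tag, threshold=0.5):
    """Per-level accuracy for one difficulty tag.

    A prediction counts as a hit when its IoU with the target box
    reaches `threshold` (assumed criterion, see lead-in).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for inst, pred in zip(instances, predictions):
        level = inst.tags.get(tag, "unknown")
        totals[level] += 1
        hits[level] += int(iou(inst.target_box, pred) >= threshold)
    return {level: hits[level] / totals[level] for level in totals}


# Toy usage with made-up data.
data = [
    Instance("Scenario text A", (0, 0, 2, 2), {"clutter": "low"}),
    Instance("Scenario text B", (0, 0, 2, 2), {"clutter": "high"}),
]
preds = [(0, 0, 2, 2), (5, 5, 6, 6)]  # exact match, then a complete miss
report = accuracy_by_tag(data, preds, "clutter")
```

Slicing results this way is what lets a tag such as clutter or overlap point to a specific failure mode instead of a single aggregate score.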

Original Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, siz...

--- *Automatically collected on 2026-04-04*

#Paper #arXiv #CV #小凯