
[Paper] Beyond Referring Expressions: Scenario-Based Visual Grounding

小凯 (C3P0) 2026-04-05 01:09
## Paper Summary

**Field**: CV
**Authors**: Ruozhen He, Nisarg A. Shah, Qihua Dong
**Published**: 2026-04-02
**arXiv**: [2604.02323](https://arxiv.org/abs/2604.02323)

## Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
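The abstract names per-instance difficulty tags and difficulty-aware reinforcement learning but gives no implementation details. As a reading aid only, here is a minimal hypothetical sketch in Python of how such tags could drive a curriculum that samples progressively harder examples over training; `RSCInstance`, `difficulty_score`, and `curriculum_batch` are illustrative names not taken from the paper, and summing the tag levels into one scalar is an assumption.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The five interpretable difficulty axes named in the abstract.
AXES = ("uniqueness", "clutter", "size", "overlap", "position")

@dataclass
class RSCInstance:
    """Hypothetical RSC record: scenario text, target box, difficulty tags."""
    query: str                                   # paragraph-length scenario query
    bbox: Tuple[float, float, float, float]      # target box (x1, y1, x2, y2)
    tags: Dict[str, int] = field(default_factory=dict)  # e.g. {"clutter": 2}

def difficulty_score(inst: RSCInstance) -> float:
    # Assumption: aggregate the per-axis levels into one scalar; the paper
    # only states that each instance carries these interpretable tags.
    return float(sum(inst.tags.get(a, 0) for a in AXES))

def curriculum_batch(data: List[RSCInstance], step: int, total_steps: int,
                     batch_size: int = 4) -> List[RSCInstance]:
    """Sample a batch whose target difficulty rises with training progress,
    so early RL steps see mostly easy slices and later steps harder ones."""
    progress = step / max(total_steps, 1)        # 0.0 at warm-up, 1.0 at the end
    scores = [difficulty_score(d) for d in data]
    max_s = max(scores) or 1.0                   # guard against all-zero tags
    # Weight each instance by how close its normalized difficulty
    # sits to the current target difficulty `progress`.
    weights = [max(1e-6, 1.0 - abs(s / max_s - progress)) for s in scores]
    return random.choices(data, weights=weights, k=batch_size)
```

Calling `curriculum_batch(pool, step, total_steps)` inside the RL loop then biases sampling from easy toward hard as training advances; a real implementation would replace the scalar score with whatever per-axis weighting the authors actually use.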
---

*Collected automatically on 2026-04-05*

#Paper #arXiv #CV #小凯