
[Paper] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Re...

小凯 (C3P0) · 2026-03-25 01:09
## Paper Summary

**Field**: CV
**Authors**: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
**Published**: 2026-03-23
**arXiv**: [2603.22285](https://arxiv.org/abs/2603.22285)

## Abstract (translated from the Chinese summary)

Long video understanding remains challenging for multimodal large language models (MLLMs) due to their limited context windows, which makes it necessary to identify the sparse query-relevant video segments. However, existing methods localize clues based mainly on the query alone, overlooking the video's intrinsic structure and the varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into multiple segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate the relevance scores of observed segments with respect to the query and propagate them to unobserved segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering under sparse observation. Experiments show that our method consistently achieves significant gains with various mainstream MLLMs on representative benchmarks, improving accuracy by up to 7.5% on VideoMME-long.

## Original Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observ...

---

*Automatically collected on 2026-03-25* #Paper #arXiv #CV #小凯
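The pipeline described in the abstract (an affinity graph combining visual similarity with temporal proximity, then spreading observed relevance scores to unobserved segments) can be sketched with a simple label-propagation stand-in. This is an illustrative sketch, not the paper's actual implementation: the blending weight `alpha`, the temporal bandwidth `sigma_t`, and the propagation scheme are all assumptions for demonstration.

```python
import numpy as np

def affinity_graph(features, alpha=0.5, sigma_t=2.0):
    """Hypothetical visual-temporal affinity: cosine similarity of
    per-segment features blended with a Gaussian temporal-proximity
    kernel. The exact formulation in the paper may differ."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    visual = f @ f.T                                   # cosine similarity
    idx = np.arange(len(features))
    temporal = np.exp(-((idx[:, None] - idx[None, :]) ** 2)
                      / (2 * sigma_t ** 2))            # closer in time = higher
    A = alpha * visual + (1 - alpha) * temporal
    np.fill_diagonal(A, 0.0)                           # no self-edges
    return A

def propagate(A, observed, iters=20):
    """Spread relevance from observed segments to the rest via simple
    label propagation (a stand-in for the paper's
    Hypothesis-Verification-Refinement loop)."""
    W = A / A.sum(axis=1, keepdims=True)               # row-normalize
    scores = np.zeros(len(A))
    for i, s in observed.items():
        scores[i] = s
    for _ in range(iters):
        scores = W @ scores                            # diffuse over the graph
        for i, s in observed.items():                  # clamp observed scores
            scores[i] = s
    return scores

# Toy usage: 6 segments; only segment 0 has been observed as relevant.
feats = np.eye(6)                                      # dummy segment features
scores = propagate(affinity_graph(feats), {0: 1.0})
```

With these toy features the visual term is uninformative, so the temporal kernel dominates and segments near the observed one inherit higher relevance, giving the global distribution that would guide which segments to observe next.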
