[Paper] TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reas...

Paper Overview

Research area: CV | Authors: Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan | Published: 2025-06-20 | arXiv: 2506.16808

Summary (translated from Chinese)

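Long Video Question Answering (LVQA) requires locating sparse, query-relevant evidence within hours-long untrimmed videos. Existing methods either process videos densely with large vision-language models (VLMs), at high computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence.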

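This paper introduces TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first uses lightweight modules to generate action-grounded candidate answer-evidence hypotheses, then invokes the expensive VLM only for targeted verification. At the core of the framework is the Action-based Candidate Evidence (ACE) module, which uses lightweight LLM reasoning to convert temporally localized actions into query-conditioned candidate answers and supporting evidence windows. The authors also introduce OpenTSUBench (OTB), an open-ended benchmark for evaluating temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios.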
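The propose-then-verify flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Action`, `Hypothesis`, `propose`, `propose_then_verify`, the `pad` and `max_calls` parameters) is hypothetical, a simple keyword overlap stands in for the lightweight LLM reasoning of the ACE module, and a stub function stands in for the expensive VLM verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """A temporally localized action (hypothetical structure)."""
    label: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Hypothesis:
    """A candidate answer paired with its supporting evidence window."""
    answer: str
    window: Tuple[float, float]


def propose(query: str, actions: List[Action], pad: float = 2.0) -> List[Hypothesis]:
    # Assumption: keyword overlap replaces the lightweight LLM used by ACE.
    terms = set(query.lower().split())
    hypotheses = []
    for action in actions:
        if terms & set(action.label.lower().split()):
            window = (max(0.0, action.start - pad), action.end + pad)
            hypotheses.append(Hypothesis(action.label, window))
    return hypotheses


def propose_then_verify(
    query: str,
    actions: List[Action],
    verify: Callable[[str, Hypothesis], bool],
    max_calls: int = 3,
) -> Tuple[Optional[Hypothesis], int]:
    # The expensive verifier only sees proposed windows, never the whole video.
    calls = 0
    for hypothesis in propose(query, actions)[:max_calls]:
        calls += 1
        if verify(query, hypothesis):
            return hypothesis, calls
    return None, calls


def stub_vlm(query: str, hypothesis: Hypothesis) -> bool:
    # Stand-in for a real VLM call; accepts one fixed answer for the demo.
    return hypothesis.answer == "drink water"


actions = [
    Action("open fridge", 12.0, 15.0),
    Action("pour water", 40.0, 46.0),
    Action("drink water", 50.0, 53.0),
]
result, calls = propose_then_verify("when does the person drink water", actions, stub_vlm)
print(result, calls)
```

In this toy run the verifier is invoked on two short windows instead of the full timeline, which is the intuition behind the reduction in VLM calls; the actual proposal and verification logic in the paper may differ substantially.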

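Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3% while reducing VLM calls by 75% and inference cost by 93%. Without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and it reaches state-of-the-art results when augmented with a grounding VLM.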

Original Abstract

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditio...

--- *Automatically collected on 2026-06-20*

#Paper #arXiv #CV #小凯
