[Paper] TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reas...
Paper Overview
Field: CV Authors: Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan Published: 2025-06-20 arXiv: 2506.16808
Abstract (translated from the Chinese summary)
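Long Video Question Answering (LVQA) requires locating sparse, query-relevant evidence within hours-long untrimmed videos. Existing methods either process videos densely with large vision-language models (VLMs), which is computationally expensive, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence.
This paper introduces TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first uses lightweight modules to generate action-grounded candidate answer-evidence hypotheses, then invokes an expensive VLM only for targeted verification. The core of the framework is the Action-based Candidate Evidence (ACE) module, which uses lightweight LLM reasoning to convert temporally localized actions into query-conditioned candidate answers and supporting evidence windows. The authors also introduce OpenTSUBench (OTB), an open-ended benchmark for evaluating temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios.
Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3% while reducing VLM calls by 75% and inference cost by 93%. Without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and it reaches state-of-the-art results when augmented with a grounding VLM.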
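The propose-then-verify flow described above can be illustrated with a small Python sketch. This is not the authors' implementation: `Action`, `Hypothesis`, `propose`, `verify`, and the stub `toy_vlm` are hypothetical names, and the word-overlap ranking is only a stand-in for the lightweight LLM reasoning that the ACE module is said to perform. The point is the control flow: a cheap stage narrows the video to a few candidate answer-evidence windows, and the expensive verifier is called only on those.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """A temporally localized action (hypothetical input format)."""
    label: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Hypothesis:
    """A candidate answer paired with its supporting evidence window."""
    answer: str
    window: Tuple[float, float]
    score: float


def propose(query: str, actions: List[Action], top_k: int = 2) -> List[Hypothesis]:
    """Cheap stage: rank actions by word overlap with the query.

    Assumption: word overlap replaces the lightweight LLM reasoning
    described in the abstract, purely to keep the sketch self-contained.
    """
    query_words = set(query.lower().split())
    hypotheses = []
    for action in actions:
        overlap = len(query_words & set(action.label.lower().split()))
        if overlap > 0:
            hypotheses.append(
                Hypothesis(action.label, (action.start, action.end), float(overlap))
            )
    hypotheses.sort(key=lambda h: h.score, reverse=True)
    return hypotheses[:top_k]


def verify(
    query: str,
    hypotheses: List[Hypothesis],
    vlm: Callable[[str, Hypothesis], bool],
) -> Tuple[Optional[Hypothesis], int]:
    """Expensive stage: call the verifier only on proposed windows.

    Returns the first accepted hypothesis and the number of verifier calls.
    """
    calls = 0
    for hypothesis in hypotheses:
        calls += 1
        if vlm(query, hypothesis):
            return hypothesis, calls
    return None, calls


def toy_vlm(query: str, hypothesis: Hypothesis) -> bool:
    """Stub verifier (assumption): accepts only the 'pour water' window."""
    return hypothesis.answer == "pour water"


actions = [
    Action("open fridge", 10.0, 15.0),
    Action("pour water", 120.0, 130.0),
    Action("drink water", 140.0, 150.0),
    Action("watch tv", 300.0, 900.0),
]
query = "when does the person pour water"
candidates = propose(query, actions)
best, calls = verify(query, candidates, toy_vlm)
print(best.answer, best.window, calls)  # pour water (120.0, 130.0) 1
```

In this toy run the verifier is called once, whereas checking every localized action would take four calls. The savings reported in the abstract come from the paper's own experiments and are not reproduced by this sketch.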
Original Abstract
Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditio...
--- *Automatically collected on 2026-06-20*
#paper #arXiv #CV #小凯