SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Paper Overview

Research area: Computer Vision
Authors: Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi
Published: 2026-03-26
arXiv: 2603.25733v1

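Abstract (translated from the Chinese summary)

Multimodal Large Language Models (MLLMs) perform strongly on Video Temporal Grounding (VTG). However, their coarse-grained recognition is insufficient for fine-grained temporal understanding, which makes task-specific fine-tuning indispensable. This fine-tuning leads models to memorize dataset-specific shortcuts instead of grounding faithfully in the actual visual content, resulting in poor out-of-domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing methods require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, with object priors from a self-supervised vision model encouraging semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks shows that the method significantly improves OOD robustness while maintaining competitive in-domain (ID) performance, with minimal overhead.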
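
The decompose-then-reconstruct step described above can be illustrated with a minimal NumPy sketch of slot attention. This is not the authors' implementation: the function name `slot_adapter`, the random projection matrices standing in for learned weights, the slot count, and the iteration count are all assumptions made for illustration. The GRU/MLP slot update of standard slot attention and the object priors from a self-supervised vision model are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_adapter(tokens, num_slots=4, iters=3, seed=0):
    """Hypothetical simplified slot adapter.

    tokens: (N, D) array of visual tokens.
    Returns (slots, attn, recon) with shapes (K, D), (N, K), (N, D).
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # Random projections stand in for learned weights (assumption).
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))
    k, v = tokens @ w_k, tokens @ w_v
    for _ in range(iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)  # (N, K) token-to-slot affinities
        # Softmax over the slot axis: slots compete for each token.
        attn = softmax(logits, axis=1)
        # Normalize per slot over tokens, then take a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ v  # (K, D)
    # Rebuild the sequence: each token is an attention-weighted mix of slots.
    recon = attn @ slots  # (N, D)
    return slots, attn, recon

tokens = np.random.default_rng(1).normal(size=(16, 8))
slots, attn, recon = slot_adapter(tokens)
print(slots.shape, attn.shape, recon.shape)
```

Because the softmax runs over slots rather than over tokens, every token distributes a total weight of 1 across the slots, which is what pushes slots to specialize on different parts of the input. The reconstructed sequence has the same shape as the input, so in principle an adapter of this kind could sit between a vision encoder and a language model without changing the token interface.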

Original Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight sl...

--- *Automatically collected on 2026-03-28*

#paper #arXiv #computer-vision #小凯
