[Paper] OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structure...

Paper Overview

Research area: CV Authors: Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang Published: 2026-06-12 arXiv: 2606.14702

Abstract (translated from the Chinese summary)

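Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically split videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent link between sounds and their visual sources, and processing clips independently often leads to inconsistent descriptions of the same entity across clips. In addition, coupling long-text comprehension and QA synthesis into a single step tends to restrict models to local events, producing questions that lack long-range temporal connections and deep cross-modal reasoning. To address these problems, we propose an automated data engine with two mechanisms: (1) Entity-Anchored Video Scripting converts a video into a structured script comprising a summary, a list of main entities, and segment-level audio-visual descriptions. The entity list serves as a global prior that keeps cross-segment references consistent and rebuilds audio-visual associations. (2) Clue-Guided QA Generation prompts the model to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Using this pipeline, we build the instruction-tuning dataset OmniVideo-100K and the human-verified test set OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and also shows strong generalization on benchmarks such as Daily-Omni and JointAVBench (gains of up to 12.64%).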
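To make the structured script concrete, here is a minimal Python sketch of what such a record might look like. The abstract only states that a script contains a summary, a main entity list, and segmented audio-visual descriptions; every class name, field name, and helper below (`VideoScript`, `entity_ids`, `unknown_references`, `cross_segment_entities`) is a hypothetical illustration, not the paper's actual schema or code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical fields: a stable ID plus one canonical description.
    entity_id: str
    description: str


@dataclass
class Segment:
    # Hypothetical fields: time span, a joint audio-visual description,
    # and the IDs of the entities the description refers to.
    start_s: float
    end_s: float
    audio_visual_description: str
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: list[Entity]
    segments: list[Segment]

    def unknown_references(self) -> list[tuple[int, str]]:
        """Return (segment index, entity ID) pairs missing from the entity list."""
        known = {e.entity_id for e in self.entities}
        return [
            (i, eid)
            for i, seg in enumerate(self.segments)
            for eid in seg.entity_ids
            if eid not in known
        ]

    def cross_segment_entities(self) -> list[str]:
        """Return IDs of entities referenced in at least two segments."""
        counts: dict[str, int] = {}
        for seg in self.segments:
            for eid in set(seg.entity_ids):
                counts[eid] = counts.get(eid, 0) + 1
        return sorted(eid for eid, n in counts.items() if n >= 2)


# Toy data, invented for illustration only.
script = VideoScript(
    summary="A woman walks her dog through a park.",
    entities=[
        Entity("E1", "woman in a red coat"),
        Entity("E2", "small brown dog"),
    ],
    segments=[
        Segment(0.0, 10.0, "E1 hums while walking.", ["E1"]),
        Segment(10.0, 20.0, "E2 barks; E1 laughs.", ["E1", "E2"]),
        Segment(20.0, 30.0, "E2 chases E3 across the grass.", ["E2", "E3"]),
    ],
)

print(script.unknown_references())
print(script.cross_segment_entities())
```

In this sketch the shared entity list plays the role of the global prior: a segment that mentions an ID absent from the list is flagged as inconsistent, and entities that recur across segments are natural starting points for cross-segment clues.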

Original Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting trans...

--- *Automatically collected on 2026-06-16*

#Paper #arXiv #CV #小凯
