Native Active Perception as Reasoning for Omni-Modal Understanding

Paper Overview

Research area: CV | Authors: Zhenghao Xing, Ruiyang Xu, Yuxuan Wang | Published: 2026-06-19 | arXiv: 2506.14987

Summary (translated from Chinese)

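Passive models for long-video understanding typically follow a "watch-it-all" paradigm: they process every frame uniformly regardless of query difficulty, so computational cost grows with video duration. Interactive frameworks have emerged, but they often depend on a global pre-scan, and their context cost still scales with video length.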

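The paper proposes OmniAgent, which the authors describe as the first native omni-modal agent. It formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes actions on demand and selectively distills audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration.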
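The Observation-Thought-Action cycle with a persistent textual memory can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `run_agent`, `Memory`, `toy_policy`, `toy_perceive`, and the caption dictionary standing in for a video are all hypothetical names invented for this example. The point it shows is that the policy only ever sees the question and the accumulated text notes, never the raw video.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Persistent textual memory: one short note per inspected segment."""
    notes: List[str] = field(default_factory=list)

    def as_context(self) -> str:
        return "\n".join(self.notes)


# Hypothetical action format: ("inspect", (start, end)) or ("answer", text).
Action = Tuple[str, object]


def run_agent(
    question: str,
    policy: Callable[[str, str], Action],
    perceive: Callable[[int, int], str],
    max_turns: int = 8,
) -> Tuple[Optional[str], int, Memory]:
    """Iterate think -> act -> observe until the policy answers or turns run out."""
    memory = Memory()
    for turn in range(1, max_turns + 1):
        # Thought + Action: the policy conditions only on text, not on frames.
        kind, arg = policy(question, memory.as_context())
        if kind == "answer":
            return str(arg), turn, memory
        start, end = arg
        # Observation: look at one segment on demand and distill it to text.
        observation = perceive(start, end)
        memory.notes.append(f"[{start}-{end}s] {observation}")
    return None, max_turns, memory


# Toy stand-in for a video: captions keyed by (start, end) in seconds.
captions = {
    (0, 60): "a man enters a kitchen",
    (60, 120): "a timer rings and he removes a cake from the oven",
}


def toy_perceive(start: int, end: int) -> str:
    return captions[(start, end)]


def toy_policy(question: str, context: str) -> Action:
    # Answer as soon as the memory contains the needed cue; otherwise
    # inspect the next unseen 60-second segment.
    if "cake" in context:
        return ("answer", "a cake")
    next_start = 60 * len(context.splitlines())
    return ("inspect", (next_start, next_start + 60))


answer, turns, memory = run_agent(
    "What does he take out of the oven?", toy_policy, toy_perceive
)
```

In this toy run the context handed to the policy grows by one short note per inspected segment, so its size depends on how many segments were inspected rather than on the length of the video.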

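To achieve this, the authors introduce two key components:
1. Agentic Supervised Fine-Tuning: bootstraps native active perception through best-of-N trajectory synthesis with dual-stage quality control.
2. Agentic Reinforcement Learning: uses TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns.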
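The abstract does not give the TAURA formula, so the following is only a hedged illustration of the general idea of rescaling a trajectory-level advantage with turn-level entropy. Everything here is an assumption made for the example: the function names, the interpolation parameter `alpha`, the choice to upweight higher-entropy turns, and the normalization that keeps the mean weight at 1. It should not be read as the paper's method.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def turn_entropies(turns: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Mean token entropy per turn; each turn is a list of token distributions."""
    return [sum(token_entropy(d) for d in turn) / len(turn) for turn in turns]


def rescale_advantage(
    advantage: float, entropies: Sequence[float], alpha: float = 1.0
) -> List[float]:
    """Spread one trajectory-level advantage over turns, weighted by entropy.

    Assumption (not from the paper): weight_t = (1 - alpha) + alpha * H_t / mean(H),
    so the weights average to 1 and alpha = 0 recovers uniform credit.
    """
    assert 0.0 <= alpha <= 1.0
    mean_h = sum(entropies) / len(entropies)
    if mean_h == 0.0:
        # No uncertainty anywhere: fall back to uniform credit.
        return [advantage for _ in entropies]
    return [advantage * ((1.0 - alpha) + alpha * h / mean_h) for h in entropies]


# Toy trajectory with three turns of two tokens each.
toy_turns = [
    [[1.0, 0.0], [1.0, 0.0]],  # fully confident turn
    [[0.5, 0.5], [0.5, 0.5]],  # maximally uncertain turn
    [[0.9, 0.1], [0.9, 0.1]],  # mildly uncertain turn
]
entropies = turn_entropies(toy_turns)
per_turn = rescale_advantage(1.0, entropies, alpha=1.0)
uniform = rescale_advantage(1.0, entropies, alpha=0.0)
```

Under these assumptions the second turn, whose token distributions are the most uncertain, receives the largest share of the advantage, while the total credit across turns is unchanged.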

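Crucially, OmniAgent exhibits positive test-time scaling: performance improves as the number of reasoning turns increases, which the authors take as validation of active perception. Experiments on ten benchmarks (e.g., VideoMME, LVBench) show that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench the 7B-parameter OmniAgent outperforms the roughly 10x larger Qwen2.5-VL-72B (50.5% vs. 47.3%).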

Original Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10x larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

--- *Automatically collected on 2026-06-19*

#paper #arXiv #CV #小凯
