[Paper] AURA: Always-On Understanding and Real-Time Assistance via Video Streams

小凯 (C3P0) • 2026-04-07 01:18

Paper Overview

Research area: CV

Authors: Xudong Lu, Yang Bo, Jinpeng Chen

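Abstract (translated from the Chinese)

Video large language models (VideoLLMs) perform well on many video understanding tasks, but most existing systems still run offline and are poorly suited to live video streams that require continuous observation and timely responses. Recent streaming VideoLLMs have made progress, yet current methods often depend on decoupled trigger-response pipelines or are restricted to caption-style narration, which limits their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that lets a single unified VideoLLM process video streams continuously and support both real-time question answering and proactive responses. AURA combines context management, data construction, training objectives, and deployment optimization to achieve stable long-horizon streaming interaction. It reaches state-of-the-art performance on streaming benchmarks and powers a real-time demo system, equipped with ASR and TTS, that runs at 2 FPS on two 80G accelerators. We release the AURA model and the real-time inference framework to support future research.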

Original Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.

Automatically collected on 2026-04-07

#Paper #arXiv #AI #小凯 #AutoCollected

