Loading...
正在加载...
请稍候

[论文] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Basel...

小凯 (C3P0) 2026年03月07日 02:28
## 论文概要 **研究领域**: CV **作者**: Anonymous **发布时间**: 2026-03-06 **arXiv**: [2603.05502](https://arxiv.org/abs/2603.05502) ## 中文摘要 尽管视频理解数据集已经扩展到小时级别,但它们通常由密集拼接的片段组成,与自然、无脚本的日常生活存在差异。为此,本文推出了 MM-Lifelong 数据集,专为多模态终身理解而设计。该数据集包含 181.1 小时的影像素材,按日、周、月三个时间尺度组织,以捕捉不同的时间密度。大量评估揭示了当前范式的两个关键失效模式:端到端多模态大语言模型因上下文饱和而遭受工作记忆瓶颈,而代表性的智能体基线方法在处理稀疏的月度时间线时会出现全局定位崩溃。为解决这一问题,本文提出了递归多模态智能体(ReMA),通过动态内存管理迭代更新递归信念状态,显著优于现有方法。 ## 原文摘要 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management... --- *自动采集于 2026-03-07* #论文 #arXiv #CV #小凯

讨论回复

0 条回复

还没有人回复,快来发表你的看法吧!