[Paper] Personal Visual Context Learning in Large Multimodal Models

Paper Overview

Research area: CV | Authors: Zihui Xue, Ami Baid, Sangho Kim | Published: 2025-05-09 | arXiv: 2505.07244

Chinese Abstract (translated)

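As wearable devices such as smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants depends on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the ability to use user-specific visual context at prompt time to resolve personalized queries. To evaluate it systematically, we present Personal-VCL-Bench, a comprehensive benchmark that captures the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs finds a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as those for aggregating multiple visual observations, remain severely inadequate. Motivated by these findings, we propose Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and applies query-adaptive evidence selection. Our baseline consistently outperforms standard context prompting schemes across tasks and evaluation backbones, showing a practical path toward future personalized LMMs.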
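
To make the idea of a memory bank with query-adaptive evidence selection more concrete, here is a toy Python sketch. It is not the paper's Agentic Context Bank: the abstract gives no implementation details, so the `Entry` schema, the `ContextBank` class, the merge-based "refinement", and the word-overlap scoring are all hypothetical stand-ins chosen for illustration. A real system would presumably store visual evidence and use an LMM for both steps.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One remembered item from the user's visual world (hypothetical schema)."""
    entity: str
    category: str  # e.g. "person", "object", or "behavior"
    notes: list[str] = field(default_factory=list)


class ContextBank:
    """Toy memory bank: merges observations per entity, selects evidence per query."""

    def __init__(self) -> None:
        self.entries: dict[str, Entry] = {}

    def add(self, entity: str, category: str, note: str) -> None:
        # Assumption: "refinement" is reduced to merging notes under one
        # entity key and skipping exact duplicates.
        entry = self.entries.setdefault(entity, Entry(entity, category))
        if note not in entry.notes:
            entry.notes.append(note)

    def select(self, query: str, k: int = 2) -> list[Entry]:
        # Assumption: query-adaptive selection is approximated by counting
        # words shared between the query and each entry's text.
        words = set(query.lower().split())

        def score(e: Entry) -> int:
            text = " ".join([e.entity, e.category, *e.notes]).lower()
            return len(words & set(text.split()))

        ranked = sorted(self.entries.values(), key=score, reverse=True)
        # Keep at most k entries, and only those that share a word with the query.
        return [e for e in ranked[:k] if score(e) > 0]


bank = ContextBank()
bank.add("mug", "object", "blue ceramic mug on the desk")
bank.add("mug", "object", "usually left near the laptop")
bank.add("alex", "person", "wears round glasses")
selected = bank.select("where is my mug", k=1)
```

Only the selected entries would then be placed in the prompt, instead of the user's full visual history. That is the contrast with standard context prompting that the abstract draws.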

Original Abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as a...

--- *Automatically collected on 2026-05-13*

#Paper #arXiv #CV #小凯