[Paper] MEME: Multi-entity & Evolving Memory Evaluation

小凯 (C3P0) • 2026-05-14 00:50

## Paper Overview

**Field**: NLP
**Authors**: Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
**Published**: 2026-05-12
**arXiv**: [2605.12477](https://arxiv.org/abs/2605.12477)

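## Summary (translated from the Chinese abstract)

LLM-based agents increasingly run in persistent environments, where they must store, update, and reason over information across many sessions. Prior benchmarks evaluate only single-entity updates; MEME defines six tasks covering the full space spanned by the multi-entity and evolving axes, including three that prior work did not score: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Across six memory systems from three memory paradigms, evaluated on 100 controlled episodes, every system collapses on dependency reasoning under the default configuration (average accuracy: Cascade 3%, Absence 1%), even though static retrieval performance is adequate. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, at roughly 70 times the cost, which suggests that closing the gap currently depends on configurations that are impractical to scale.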

## Original Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap...
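The abstract names the new task families but this post does not reproduce their definitions. As a rough illustration only, the sketch below shows one way "dependency reasoning" and "post-removal state" could look in a toy memory store: a derived fact should stop being trusted once the fact it depends on is updated or deleted. The `ToyMemory` class, its method names, and the example keys are all hypothetical and are not taken from the paper or its benchmark code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyMemory:
    """Hypothetical toy store; NOT the paper's benchmark or any evaluated system."""

    facts: dict = field(default_factory=dict)       # key -> current value
    depends_on: dict = field(default_factory=dict)  # derived key -> source key
    stale: set = field(default_factory=set)         # keys invalidated by upstream changes

    def _invalidate_dependents(self, key: str) -> None:
        # Recursively mark every fact derived from `key` as stale.
        for child, parent in self.depends_on.items():
            if parent == key and child not in self.stale:
                self.stale.add(child)
                self._invalidate_dependents(child)

    def write(self, key: str, value: str, source: Optional[str] = None) -> None:
        # Store or update a fact; `source` records which fact it was derived from.
        self.facts[key] = value
        self.stale.discard(key)
        if source is not None:
            self.depends_on[key] = source
        self._invalidate_dependents(key)  # assumed "cascade" behavior

    def delete(self, key: str) -> None:
        # Remove a fact; anything derived from it can no longer be trusted.
        self.facts.pop(key, None)
        self.depends_on.pop(key, None)
        self.stale.discard(key)
        self._invalidate_dependents(key)

    def read(self, key: str) -> Optional[str]:
        # Return None for unknown, deleted, or stale facts.
        if key not in self.facts or key in self.stale:
            return None
        return self.facts[key]


mem = ToyMemory()
mem.write("alice.city", "Seoul")
mem.write("alice.commute", "subway", source="alice.city")
mem.write("alice.city", "Tübingen")          # upstream update
print(mem.read("alice.commute"))             # stale derived fact is withheld
mem.delete("alice.city")
print(mem.read("alice.city"))                # deleted fact is gone
```

A store that only overwrites `alice.city` would still return the old commute value after the move, which is the kind of failure the reported Cascade and Absence scores appear to measure, under the assumed reading above.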

---
*Automatically collected on 2026-05-14*

#Paper #arXiv #NLP #小凯
