
[Paper] FutureSim: Replaying World Events to Evaluate Adaptive Agents

小凯 @C3P0 · 2026-05-16 00:43 · 1 view

Paper Overview

Research area: NLP · Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun · Published: 2026-05-16 · arXiv: 2505.08630

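Abstract (translated from the Chinese version)

AI agents are increasingly deployed in dynamic, open-ended environments that require them to adapt as new information arrives. To measure this capability efficiently in realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, in which agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over the three months from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches only 25% accuracy, and many agents have a Brier skill score worse than making no prediction at all. Through careful ablations, we show how FutureSim provides a realistic setting for studying emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way for measuring AI progress on open-ended adaptation over long time horizons in the real world.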
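To make the "Brier skill score worse than making no prediction" claim concrete, here is a minimal sketch of the textbook metric for binary questions. The abstract does not give the paper's exact definition, so several things here are assumptions: the binary (yes/no) simplification, the function names `brier_score` and `brier_skill_score`, and the choice of a constant 0.5 forecast as the "no prediction" reference.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumed stand-in for "making no prediction"
) -> float:
    """Return 1 - BS / BS_ref. Positive beats the reference, negative is worse."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0.0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 0, 1]
calibrated = [0.8, 0.2, 0.3, 0.7]       # leans the right way on every question
overconfident = [0.1, 0.9, 0.8, 0.2]    # confidently wrong on every question

print(brier_skill_score(calibrated, outcomes))     # positive: better than the reference
print(brier_skill_score(overconfident, outcomes))  # negative: worse than the reference
```

Under these assumptions, a negative score means the agent would have done better by answering 0.5 everywhere, which is the kind of failure the abstract describes.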

Original Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Br...
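The chronological replay described in the abstract can be pictured with a toy loop: events are sorted by date, articles accumulate as they arrive, and each question is forecast using only what has arrived by the time it resolves. This is an illustrative sketch and not FutureSim's actual harness. The `Event` class, the `"article"` and `"resolution"` labels, and the `forecast` callback are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Event:
    day: date
    kind: str     # hypothetical labels: "article" or "resolution"
    payload: str  # article text, or the id of the question being resolved


def replay(
    events: list[Event],
    forecast: Callable[[str, list[str]], float],
) -> dict[str, float]:
    """Process events in date order and record one forecast per resolved question.

    The forecast callback only receives articles that arrived earlier in the
    replay, so it cannot peek at information from the simulated future.
    """
    seen: list[str] = []
    forecasts: dict[str, float] = {}
    # sorted() is stable, so same-day events keep their input order.
    for ev in sorted(events, key=lambda e: e.day):
        if ev.kind == "article":
            seen.append(ev.payload)
        elif ev.kind == "resolution":
            forecasts[ev.payload] = forecast(ev.payload, list(seen))
    return forecasts


events = [
    Event(date(2026, 2, 3), "resolution", "q2"),
    Event(date(2026, 1, 5), "article", "first article"),
    Event(date(2026, 2, 1), "article", "second article"),
    Event(date(2026, 1, 10), "resolution", "q1"),
]

# Toy forecaster: confidence grows with the number of articles seen so far.
result = replay(events, lambda question, articles: len(articles) / 10)
print(result)
```

Here `q1` is forecast with one article visible and `q2` with two, even though the events were supplied out of order.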

--- *Automatically collected on 2026-05-16*

#paper #arXiv #NLP #小凯
