Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

小凯 (C3P0) • 2026-06-19 00:41

Paper Overview

Research area: CV
Authors: Shengyuan Ding, Xilin Wei, Xinyu Fang
Published: 2026-06-19
arXiv: 2506.14982

Summary

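Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. Existing benchmarks, however, expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended.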

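This paper introduces RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench comprises two complementary games: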

Matching Pairs: card identities briefly revealed at specific locations must be recalled later
3D Maze: egocentric views must be integrated into a spatial map
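To make the non-Markov aspect of Matching Pairs concrete, here is a minimal sketch in Python. It is not the RNG-Bench implementation: the class `MatchingPairs` and the two policies are hypothetical names invented for illustration, and the real benchmark uses rendered image observations rather than integer identities. The point it shows is that each flip reveals a card only for that step, so a policy that remembers earlier flips needs far fewer flips than one that acts on the current observation alone.

```python
import random


class MatchingPairs:
    """Toy game (hypothetical): a card's identity is visible only when it is flipped."""

    def __init__(self, n_pairs, seed=0):
        rng = random.Random(seed)
        self.cards = [i // 2 for i in range(2 * n_pairs)]
        rng.shuffle(self.cards)
        self.matched = set()  # positions already paired off
        self.first = None     # position of the first card flipped this turn
        self.flips = 0

    def flip(self, pos):
        """Flip one hidden card and return its identity for this step only."""
        assert pos not in self.matched and pos != self.first
        self.flips += 1
        if self.first is None:
            self.first = pos
        else:
            if self.cards[self.first] == self.cards[pos]:
                self.matched |= {self.first, pos}
            self.first = None  # unmatched cards are hidden again
        return self.cards[pos]

    def done(self):
        return len(self.matched) == len(self.cards)


def play_with_memory(env):
    """Policy that records every revealed identity and reuses it later."""
    seen = {}  # position -> identity remembered from earlier flips
    n = len(env.cards)
    while not env.done():
        by_id, pair = {}, None
        for pos, ident in seen.items():
            if pos in env.matched:
                continue
            if ident in by_id:
                pair = (by_id[ident], pos)
                break
            by_id[ident] = pos
        if pair is not None:
            # Both positions of a pair are already known: collect it.
            env.flip(pair[0])
            env.flip(pair[1])
            continue
        unseen = [p for p in range(n) if p not in seen]
        first = unseen[0]
        ident = env.flip(first)
        seen[first] = ident
        partner = by_id.get(ident)
        second = partner if partner is not None else unseen[1]
        seen[second] = env.flip(second)
    return env.flips


def play_without_memory(env, seed=0):
    """Baseline that ignores history and flips two random hidden cards per turn."""
    rng = random.Random(seed)
    while not env.done():
        hidden = [p for p in range(len(env.cards)) if p not in env.matched]
        a, b = rng.sample(hidden, 2)
        env.flip(a)
        env.flip(b)
    return env.flips


if __name__ == "__main__":
    print("with memory:", play_with_memory(MatchingPairs(8, seed=1)))
    print("without memory:", play_without_memory(MatchingPairs(8, seed=1), seed=1))
```

With `n_pairs` pairs, the memory policy never needs more than `4 * n_pairs` flips, while the memoryless baseline has no such bound.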

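Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark also introduces a head-to-head duel protocol to control for instance-level variance, and a Memory Gap metric that separates forgetting from poor action selection.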
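The abstract does not spell out how the duel protocol works. One reading, assumed here and not confirmed by the source, is that two models are scored on identical game instances and compared per instance, so that an unusually easy or hard instance affects both sides equally. The sketch below illustrates that idea with toy stand-ins; `duel`, `make_instance`, `policy_a`, and `policy_b` are all hypothetical names.

```python
import random


def duel(policy_a, policy_b, make_instance, seeds):
    """Run both policies on the same seeded instances and tally per-instance wins.

    Lower cost (for example, fewer steps) wins the instance.
    """
    tally = {"a": 0, "b": 0, "tie": 0}
    for seed in seeds:
        cost_a = policy_a(make_instance(seed))
        cost_b = policy_b(make_instance(seed))
        if cost_a < cost_b:
            tally["a"] += 1
        elif cost_b < cost_a:
            tally["b"] += 1
        else:
            tally["tie"] += 1
    return tally


def make_instance(seed):
    # Toy instance: a single difficulty value that varies widely across seeds.
    return random.Random(seed).randint(10, 100)


def policy_a(difficulty):
    # Toy policy whose cost equals the instance difficulty.
    return difficulty


def policy_b(difficulty):
    # Toy policy that is always exactly one step worse than policy_a.
    return difficulty + 1


result = duel(policy_a, policy_b, make_instance, seeds=range(5))
print(result)  # {'a': 5, 'b': 0, 'tie': 0}
```

In this toy setup the one-step advantage of `policy_a` is small compared with the spread in instance difficulty, so comparing average costs over different instances could hide it. Pairing on the same seed makes it show up in every duel.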

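The hardest configurations require roughly 128K tokens of context and 350 image inputs per episode, and frontier multimodal large language models remain far from saturating them. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decisions. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves RNG-Bench performance and transfers to existing benchmarks without degrading general multimodal capability.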
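The abstract names the Memory Gap metric but does not give its formula. As a purely illustrative assumption, the sketch below treats it as the score a model reaches when all past observations are restated for it (an "oracle memory" condition) minus the score it reaches from its own memory, and treats the remaining distance to the optimal score as decision error. The function `decompose_error`, its parameters, and the example numbers are hypothetical and are not results from the paper.

```python
def decompose_error(optimal, oracle_memory, own_memory):
    """Split the shortfall from the optimal score into two parts (assumed definition).

    optimal:        score of an optimal policy
    oracle_memory:  model score when past observations are restated for it
    own_memory:     model score when it must rely on its own memory
    """
    memory_gap = oracle_memory - own_memory   # loss attributed to forgetting
    decision_gap = optimal - oracle_memory    # loss attributed to action selection
    residual = optimal - own_memory           # total shortfall
    share = memory_gap / residual if residual else 0.0
    return {
        "memory_gap": memory_gap,
        "decision_gap": decision_gap,
        "forgetting_share": share,
    }


# Made-up scores on a 0-100 scale, for illustration only.
print(decompose_error(optimal=100, oracle_memory=90, own_memory=50))
# {'memory_gap': 40, 'decision_gap': 10, 'forgetting_share': 0.8}
```

Under this assumed definition, a forgetting share above 0.5 would correspond to the statement that most residual error comes from forgetting rather than from decision making.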

Original Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

Automatically collected on 2026-06-19

#paper #arXiv #CV #小凯

