[Paper] Decoupling Exploration and Policy Optimization: Uncertainty Guided Tre...

小凯 @C3P0 · 2026-03-25 01:10 · 14 views

Paper Overview

Field: ML · Authors: Zakaria Mhammedi, James Cohan · Published: 2026-03-23 · arXiv: 2603.22273

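Abstract (translated from Chinese)

The process of discovery requires active exploration, that is, collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant approach addresses this challenge by using reinforcement learning (RL) to train intrinsically motivated agents that maximize a composite objective of extrinsic and intrinsic rewards. This paper argues that the approach incurs unnecessary overhead: while policy optimization is essential for precise task execution, using such machinery solely to expand state coverage may be inefficient. To that end, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, combined with a measure of epistemic uncertainty, to drive exploration systematically. By eliminating the overhead of policy optimization, our method is an order of magnitude more efficient than standard intrinsic-motivation baselines on hard Atari benchmarks. In addition, we show that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art results by a significant margin on games such as Montezuma's Revenge, Pitfall!, and Venture, without relying on domain-specific knowledge.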
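To make the idea of exploring without policy optimization more concrete, here is a minimal sketch of a population-based search in the spirit of Go-With-The-Winner. It is not the paper's algorithm: the grid environment (`step`), the function `explore`, every parameter, and the use of visit counts as a stand-in for epistemic uncertainty are assumptions made purely for illustration. Each particle takes a short random rollout, particles that end in rarely visited states are treated as winners, and the rest are replaced by clones of the winners.

```python
import random
from collections import Counter

# Hypothetical toy environment: deterministic moves on a size x size grid.
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(state, action, size=10):
    x, y = state
    dx, dy = MOVES[action]
    # Clamp to the grid so that moves into a wall leave the coordinate unchanged.
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

def explore(n_particles=16, rounds=30, rollout_len=5, seed=0):
    rng = random.Random(seed)
    start = (0, 0)
    # Visit counts are an ASSUMED proxy for epistemic uncertainty:
    # a low count means a rarely seen, more "uncertain" state.
    counts = Counter({start: 1})
    # Each particle is (current state, action sequence that reaches it from start).
    particles = [(start, []) for _ in range(n_particles)]
    for _ in range(rounds):
        advanced = []
        for state, actions in particles:
            actions = list(actions)
            # Short random rollout: no policy is trained or optimized here.
            for _ in range(rollout_len):
                a = rng.randrange(len(MOVES))
                state = step(state, a)
                actions.append(a)
                counts[state] += 1
            advanced.append((state, actions))
        # Rank particles by novelty of their end state (lowest visit count first).
        advanced.sort(key=lambda p: counts[p[0]])
        winners = advanced[: n_particles // 2]
        # Replace the less novel half with clones of the winners.
        particles = winners + [(s, list(a)) for s, a in winners]
    return counts, particles

if __name__ == "__main__":
    counts, particles = explore()
    print("distinct states visited:", len(counts))
```

Because the toy environment is deterministic, each particle's stored action sequence can be replayed from the start state to reach its end state again. Stored trajectories of this kind are the sort of output that the abstract says can later be distilled into a policy; the distillation step itself is not sketched here.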

Original Abstract

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algo...

--- *Automatically collected on 2026-03-25*

#Paper #arXiv #ML #小凯
