UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

小凯 (C3P0) 2026-06-19 00:43

Paper Overview

Research area: ML
Authors: Mohamed Nabail, Leo Cheng, Jingmin Wang
Published: 2026-06-19
arXiv: 2506.14974

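Abstract (translated from Chinese)

Preference-based reinforcement learning (preference-based RL) offers a way to learn reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and are sample-inefficient, especially in the early stages of learning.

This paper proposes UBP2 (Uncertainty-Balanced Preference Planning), a model-based method that actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions. UBP2 uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition, with no need for ad hoc exploration heuristics.

Under standard regularity assumptions, the authors establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Experiments on the Meta-World benchmark show that UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.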

Original Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
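
To make the idea of a unified score concrete, here is a minimal sketch of how candidate trajectories might be ranked from ensemble predictions. The abstract does not give the exact formula, so everything specific here is an assumption for illustration: the additive form of the score, the use of the ensemble standard deviation as a proxy for epistemic uncertainty, the weight `beta`, and the function names `ubp2_style_score` and `select_candidate`. The sketch also assumes that each ensemble member has already rolled out each candidate under its own dynamics model, so dynamics disagreement shows up in the per-member reward predictions. It is not the authors' implementation.

```python
import numpy as np


def ubp2_style_score(reward_preds, value_preds, beta=1.0):
    """Score candidate trajectories from ensemble predictions.

    reward_preds: array of shape (n_models, n_candidates, horizon) holding
                  the per-step rewards each ensemble member predicts.
    value_preds:  array of shape (n_models, n_candidates) holding the
                  terminal value each ensemble member predicts.
    beta:         hypothetical weight on the uncertainty bonus (assumption).
    """
    # Per-member return estimate: summed rewards plus terminal value.
    returns = reward_preds.sum(axis=2) + value_preds  # (n_models, n_candidates)

    # Exploitation term: ensemble-mean return of each candidate.
    expected = returns.mean(axis=0)

    # Information term: ensemble disagreement, used here as a simple
    # stand-in for epistemic uncertainty (assumption).
    uncertainty = returns.std(axis=0)

    return expected + beta * uncertainty


def select_candidate(reward_preds, value_preds, beta=1.0):
    """Return the index of the highest-scoring candidate and all scores."""
    scores = ubp2_style_score(reward_preds, value_preds, beta)
    return int(np.argmax(scores)), scores


# Two ensemble members, two candidates, horizon of two steps.
# Both members agree on candidate 0; they disagree on candidate 1.
reward_preds = np.array([
    [[1.0, 1.0], [0.0, 0.0]],  # member 0
    [[1.0, 1.0], [2.0, 1.0]],  # member 1
])
value_preds = np.zeros((2, 2))

greedy_choice, _ = select_candidate(reward_preds, value_preds, beta=0.0)
optimistic_choice, _ = select_candidate(reward_preds, value_preds, beta=1.0)
```

With `beta=0.0` the planner in this sketch simply prefers the candidate with the highest mean return; raising `beta` shifts the choice toward candidates on which the ensemble disagrees, which is one simple way to express a tradeoff between exploitation and information acquisition.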


Automatically collected on 2026-06-19

#paper #arXiv #ML #小凯
