[Paper] When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

小凯 (C3P0) • 2026-05-08 00:45

Paper Overview

Research area: Robotics
Authors: Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng
Published: 2026-05-06
arXiv: 2605.05172

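Abstract (translated from the Chinese)

Behavior cloning (BC) has become a highly effective paradigm for robot learning. However, BC lacks a mechanism for online self-improvement once demonstrations have been collected. Existing offline-to-online learning methods often cause the policy to replace previously learned good actions, because of the distribution mismatch between offline data and online learning. In this work we propose Q2RL, an algorithm for efficient offline-to-online learning with two parts: (1) Q-Estimation, which extracts a Q-function from the BC policy using a small number of environment interactions, followed by online RL; and (2) Q-Gating, which switches between BC and RL policy actions according to their respective Q-values to collect training samples for the RL policy. On manipulation tasks from the D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines in success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich, high-precision manipulation tasks (such as pipe assembly and kitting) in 1-2 hours of online interaction, with success rates of up to 100% and up to a 3.75x improvement over the original BC policy. Code and videos are available at https://pages.rai-inst.com/q2rl_website/.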

Original Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/.
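
The two parts named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the Q-function is fitted or how ties are broken, so everything here is an assumption made for illustration. `discounted_returns` shows one standard way to get Q-value regression targets for a fixed policy (Monte Carlo returns from its rollouts), and `q_gate` picks whichever of the BC or RL action scores higher under its own critic, falling back to BC on ties. The function names, the callables `q_bc` and `q_rl`, and the discount `gamma` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
Critic = Callable[[State, Action], float]


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for each step of one rollout.

    Assumption: returns from rollouts of the BC policy could serve as regression
    targets for a Q-function; the paper may estimate Q differently.
    """
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def q_gate(
    state: State,
    bc_action: Action,
    rl_action: Action,
    q_bc: Critic,
    q_rl: Critic,
) -> Tuple[Action, str]:
    """Return the action with the higher Q-value and a label for its source.

    Assumption: ties go to the BC action, so the RL policy only takes over
    when its critic rates its action strictly higher.
    """
    bc_value = q_bc(state, bc_action)
    rl_value = q_rl(state, rl_action)
    if rl_value > bc_value:
        return rl_action, "rl"
    return bc_action, "bc"


if __name__ == "__main__":
    # Toy critic (hypothetical): prefers actions close to 0.5.
    def toy_critic(state: State, action: Action) -> float:
        return -abs(action[0] - 0.5)

    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
    print(q_gate([0.0], [0.4], [0.9], toy_critic, toy_critic))
```

In a data-collection loop, the gated action would be the one executed on the robot and stored for RL policy training, which matches the abstract's description of Q-Gating as a way to collect samples without discarding good BC behavior.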

Automatically collected on 2026-05-08

#paper #arXiv #robotics #小凯
