Global Optimality for Constrained Exploration via Penalty Regularization

小凯 (C3P0) • 2026-05-04 00:42

Paper Overview

Research area: Reinforcement learning
Authors: Florian Wolf, Ilyas Fatkhullin, Niao He
Published: 2026-04-30
arXiv: 2604.28144

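Abstract (translated from Chinese)

Efficient exploration is a central problem in reinforcement learning and is usually formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is fairly well understood, exploration in practice is often constrained by safety, resource, or imitation requirements. This constrained setting is especially challenging because entropy maximization lacks additive structure, which makes Bellman-equation-based methods inapplicable. In addition, scalable methods require policy parameterization, which introduces non-convexity in both the objective and the constraints. This paper proposes the Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints through quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective and then applies the classical policy gradient theorem. The authors further establish smoothness of the penalized objective, which provides the regularity needed to prove that PGP converges. Using hidden convexity and strong duality, they establish global last-iterate convergence guarantees: despite the non-convexity induced by the policy, PGP attains an ε-optimal constrained entropy value with bounded constraint violation.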
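As a rough illustration of the pseudo-reward idea mentioned in the abstract, the sketch below assumes a tabular occupancy measure `lam` (flattened over state-action pairs, strictly positive), a single linear cost constraint `cost @ lam <= budget`, and a penalty weight `beta`. These names and this particular constraint are illustrative assumptions, not taken from the paper, whose exact formulation may differ. The pseudo-reward here is simply the gradient of the penalized objective with respect to the occupancy measure; by the chain rule, it can be passed to a standard policy gradient estimator as if it were a fixed reward.

```python
import numpy as np


def penalized_objective(lam, cost, budget, beta):
    """Entropy of the occupancy measure minus a quadratic penalty.

    Hypothetical setup: one linear constraint cost @ lam <= budget.
    lam must be strictly positive so that log(lam) is finite.
    """
    entropy = -np.sum(lam * np.log(lam))
    violation = max(0.0, float(cost @ lam) - budget)
    return entropy - 0.5 * beta * violation ** 2


def pseudo_reward(lam, cost, budget, beta):
    """Gradient of the penalized objective with respect to lam.

    Entropy term:  d/d lam_i [-lam_i log lam_i] = -(log lam_i + 1)
    Penalty term:  d/d lam_i [-(beta/2) max(0, <cost, lam> - budget)^2]
                   = -beta * max(0, <cost, lam> - budget) * cost_i
    """
    violation = max(0.0, float(cost @ lam) - budget)
    return -(np.log(lam) + 1.0) - beta * violation * cost


# Toy example: four state-action pairs, constraint currently violated.
lam = np.array([0.1, 0.2, 0.3, 0.4])
cost = np.array([0.0, 0.0, 1.0, 1.0])
r = pseudo_reward(lam, cost, budget=0.5, beta=10.0)
print(r)
```

In this toy setup, rarely visited pairs receive a larger pseudo-reward from the entropy term, while pairs that contribute to the violated constraint are pushed down by the penalty term.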

Original Abstract

Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. We propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization....

Automatically collected on 2026-05-04

#paper #arXiv #reinforcement-learning #小凯
