Global Optimality for Constrained Exploration via Penalty Regularization

Paper Overview

Field: Reinforcement Learning
Authors: Florian Wolf, Ilyas Fatkhullin, Niao He
Published: 2026-04-30
arXiv: 2604.28144

Abstract (Translated from Chinese)

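Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, which makes Bellman-equation-based methods inapplicable. Moreover, scalable methods require policy parameterization, which introduces non-convexity in both the objective and the constraints. We propose the Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints through quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective and then applies the classical policy gradient theorem. We further establish smoothness of the penalized objective, which provides the regularity needed to prove that PGP converges. Leveraging hidden convexity and strong duality, we establish global last-iterate convergence guarantees: despite the non-convexity induced by the policy parameterization, PGP attains an ε-optimal constrained entropy value with bounded constraint violation.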
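The pseudo-reward idea in the abstract can be illustrated with a small tabular sketch. This is not the paper's algorithm or code. It assumes a known finite MDP, a softmax policy, a single linear constraint ⟨c, λ⟩ ≤ b on the discounted occupancy measure λ, and exact (not sampled) gradients. The names `c`, `b`, `beta`, and all function names are hypothetical. The penalized objective is taken to be F(λ) = H(λ) − (β/2)·max(0, ⟨c, λ⟩ − b)², and the pseudo-reward is its gradient with respect to λ, which is then plugged into the standard policy gradient formula.

```python
import numpy as np

def softmax_policy(theta):
    # theta has shape (S, A); each row becomes a distribution over actions.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def occupancy(pi, P, rho, gamma):
    # Normalized discounted state-action occupancy lambda(s, a).
    # P has shape (S, A, S); rho is the initial state distribution.
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return d[:, None] * pi

def penalized_objective(lam, c, b, beta):
    # Entropy minus a quadratic penalty on the violation of <c, lam> <= b.
    # (Assumed form; the paper allows general convex constraints.)
    entropy = -np.sum(lam * np.log(lam))
    violation = max(0.0, float(np.sum(c * lam)) - b)
    return entropy - 0.5 * beta * violation ** 2

def pseudo_reward(lam, c, b, beta):
    # Gradient of the penalized objective with respect to lam.
    violation = max(0.0, float(np.sum(c * lam)) - b)
    return -(np.log(lam) + 1.0) - beta * violation * c

def policy_gradient(theta, P, rho, gamma, c, b, beta):
    # Exact gradient of the penalized objective w.r.t. softmax logits:
    # freeze the pseudo-reward r, then apply the policy gradient theorem,
    # which for tabular softmax gives lam(s, a) * advantage(s, a).
    pi = softmax_policy(theta)
    lam = occupancy(pi, P, rho, gamma)
    r = pseudo_reward(lam, c, b, beta)
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    return lam * (Q - V[:, None])
```

A single-loop ascent step in this toy setting would be `theta = theta + step_size * policy_gradient(theta, P, rho, gamma, c, b, beta)`, with `step_size` and `beta` chosen by hand. The sketch requires `rho` to have full support so that every entry of λ is positive and the logarithm is defined.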

Original Abstract

Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. We propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization....

--- *Automatically collected on 2026-05-04*

#Paper #arXiv #ReinforcementLearning #小凯
