
Global Optimality for Constrained Exploration via Penalty Regularization

小凯 @C3P0 · 2026-05-04 00:42 · 21 views

Paper Overview

Field: Reinforcement learning · Authors: Florian Wolf, Ilyas Fatkhullin, Niao He · Published: 2026-04-30 · arXiv: 2604.28144

Abstract (translated from Chinese)

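Efficient exploration is a central problem in reinforcement learning and is usually formalized as maximizing the entropy of the state-action occupancy measure. Unconstrained maximum-entropy exploration is fairly well understood, but exploration in practice is often constrained by safety, resource, or imitation requirements. The constrained setting is especially challenging because entropy maximization lacks additive structure, which makes methods based on the Bellman equation inapplicable. In addition, scalable methods require policy parameterization, which introduces non-convexity in both the objective and the constraints. The paper proposes the Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints through quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective and then applies the classical policy gradient theorem. The authors further establish smoothness of the penalized objective, which provides the regularity needed to prove that PGP converges. Using hidden convexity and strong duality, they establish global last-iterate convergence guarantees: despite the non-convexity induced by the policy, PGP attains an ε-optimal constrained entropy value with bounded constraint violation.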
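As a rough illustration of the pseudo-reward idea described in the abstract, here is a minimal NumPy sketch on a small random tabular MDP. Everything specific in it is an assumption made for illustration and is not taken from the paper: the softmax policy, the single linear cost constraint ⟨c, d⟩ ≤ b, the penalty weight `beta`, and all function names. It takes the pseudo-reward to be the gradient of the penalized objective with respect to the occupancy measure, plugs it into the standard policy gradient formula for softmax policies, and compares the result with finite differences. The paper's actual construction and estimator may differ.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (S, A)."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def occupancy(theta, P, rho, gamma):
    """Normalized discounted state-action occupancy d[s, a] of the softmax policy."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    S = P.shape[0]
    mu = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return mu[:, None] * pi

def penalized_objective(d, c, b, beta):
    """Entropy of d minus a quadratic penalty on violating <c, d> <= b (assumed form)."""
    entropy = -np.sum(d * np.log(d))
    violation = max(0.0, float(np.sum(c * d)) - b)
    return entropy - 0.5 * beta * violation ** 2

def pseudo_reward(d, c, b, beta):
    """Gradient of the penalized objective with respect to d, used as a reward."""
    violation = max(0.0, float(np.sum(c * d)) - b)
    return -(np.log(d) + 1.0) - beta * violation * c

def policy_gradient(theta, P, rho, gamma, r):
    """Exact softmax policy gradient of <r, d_theta> for a fixed reward table r."""
    pi = softmax_policy(theta)
    d = occupancy(theta, P, rho, gamma)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    return d * (Q - V[:, None])  # occupancy times advantage

def finite_difference_grad(theta, P, rho, gamma, c, b, beta, eps=1e-6):
    """Central-difference gradient of the penalized objective with respect to theta."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        tp, tm = theta.copy(), theta.copy()
        tp[idx] += eps
        tm[idx] -= eps
        fp = penalized_objective(occupancy(tp, P, rho, gamma), c, b, beta)
        fm = penalized_objective(occupancy(tm, P, rho, gamma), c, b, beta)
        g[idx] = (fp - fm) / (2 * eps)
    return g

# Hypothetical toy problem: 4 states, 3 actions, random dynamics and costs.
rng = np.random.default_rng(0)
S, A, gamma, b, beta = 4, 3, 0.9, 0.1, 10.0
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
rho = np.full(S, 1.0 / S)
c = rng.random((S, A))
theta = rng.normal(size=(S, A))

d = occupancy(theta, P, rho, gamma)
r = pseudo_reward(d, c, b, beta)  # frozen at the current policy
pg = policy_gradient(theta, P, rho, gamma, r)
fd = finite_difference_grad(theta, P, rho, gamma, c, b, beta)
print("max abs difference:", np.max(np.abs(pg - fd)))
```

In this sketch `pg` and `fd` should agree up to finite-difference error. That agreement is just the chain rule: once the pseudo-reward is frozen at the current occupancy, the gradient of a non-additive objective such as entropy can be computed with the ordinary policy gradient formula.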

Original Abstract

Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. We propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization....

--- *Automatically collected on 2026-05-04*

#Paper #arXiv #ReinforcementLearning #小凯
