[Paper] Power Reinforcement Post-Training of Text-to-Image Models with Super-L...

小凯 (C3P0) • 2026-05-13 00:42

## Paper Overview

**Research area**: CV
**Authors**: Haoyuan Sun, Jing Wang, Yuxin Song
**Published**: 2025-05-09
**arXiv**: [2505.07245](https://arxiv.org/abs/2505.07245)

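## Abstract (Translated from Chinese)

Recently, post-training methods based on reinforcement learning, in particular Group Relative Policy Optimization (GRPO), have emerged as a robust paradigm for further advancing text-to-image (T2I) models. However, these methods are often prone to reward hacking, in which the model exploits biases in imperfect reward functions rather than producing genuine performance gains. In this work, we find that normalization can lead to miscalibration, and that directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signal from noise. To address these issues, we revisit the functional update from an information-geometric perspective and propose Super-Linear Advantage Shaping (SLAS). By extending the Fisher-Rao information metric with an advantage-dependent weighting, SLAS introduces a nonlinear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening constraints in low-advantage regions to suppress spurious gradients. In addition, batch-level normalization is applied to stabilize training across different reward scales. Extensive evaluations show that SLAS consistently outperforms the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and greater robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generation.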
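To make the normalization argument in the abstract concrete, here is a minimal sketch in plain Python. It compares three advantage computations: GRPO-style prompt-level standardization, mean-centering with the standard deviation term removed, and a super-linear reshaping followed by batch-level normalization. The abstract does not give the SLAS formula, so the sign-preserving power function `sign(A) * |A|**power` is an illustrative assumption, not the authors' definition. All function names and the toy rewards are hypothetical as well.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center, then divide by the prompt-level std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def centered_advantages(rewards):
    """Remove the prompt-level std term: only subtract the group mean."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


def superlinear_shape(advantages, power=2.0):
    """ASSUMED shaping (not from the paper): sign(A) * |A|**power, power > 1."""
    return [math.copysign(abs(a) ** power, a) for a in advantages]


def batch_normalize(groups, eps=1e-8):
    """Divide every advantage by one std computed over the whole batch."""
    flat = [a for group in groups for a in group]
    std = statistics.pstdev(flat)
    return [[a / (std + eps) for a in group] for group in groups]


def slas_like_advantages(reward_groups, power=2.0):
    """Hypothetical pipeline: center per prompt, shape, normalize per batch."""
    centered = [centered_advantages(g) for g in reward_groups]
    shaped = [superlinear_shape(g, power) for g in centered]
    return batch_normalize(shaped)


def max_abs(values):
    return max(abs(v) for v in values)


# Toy rewards for two prompts, four sampled images each.
noisy = [0.50, 0.51, 0.49, 0.50]  # nearly identical rewards, mostly noise
clear = [0.10, 0.90, 0.20, 0.80]  # clearly separated rewards

for name, groups in [
    ("prompt-level std", [grpo_advantages(noisy), grpo_advantages(clear)]),
    ("mean-centered", [centered_advantages(noisy), centered_advantages(clear)]),
    ("shaped + batch norm", slas_like_advantages([noisy, clear])),
]:
    print(f"{name}: noisy={max_abs(groups[0]):.4f} clear={max_abs(groups[1]):.4f}")
```

With prompt-level standardization, the near-constant group receives advantages of about the same magnitude as the well-separated group, which is the kind of miscalibration the abstract describes. Mean-centering alone keeps the two groups in proportion to their reward spread, and the assumed power shaping widens that gap further. The final batch-level division makes the result insensitive to the overall reward scale.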

## Original Abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information...

---
*Automatically collected on 2026-05-13*

#Paper #arXiv #CV #小凯
