## Paper Summary
**Field**: AI
**Authors**: Wenbo Hu, Xin Chen, Yan Gao-Tian
**Published**: 2025-04-10
**arXiv**: [2504.07072](https://arxiv.org/abs/2504.07072)
## Abstract
Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G^2RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, N(0,1), G^2RPO theoretically ensures inter-task gradient equity, mitigates vulnerability to heavy-tailed outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G^2RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
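The abstract contrasts GRPO's linear scaling of group rewards with G^2RPO's non-linear matching of the advantage distribution to N(0,1), but does not spell out the transform. As an illustration only, the sketch below compares standard z-scoring with one plausible distributional-matching scheme, a rank-based inverse-normal transform; the function names, the epsilon, and the mid-quantile construction are assumptions of this sketch, not the paper's actual method.

```python
from statistics import NormalDist
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO-style advantages: linear z-scoring of group rewards.
    A single outlier reward passes through the linear map unchanged,
    so heavy-tailed rewards produce heavy-tailed advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards zero variance

def gaussian_matched_advantages(rewards):
    """Hypothetical non-linear distributional matching: map each reward to
    its within-group mid-quantile rank, then through the inverse standard
    normal CDF, so the empirical advantage distribution follows N(0,1)
    regardless of the task's reward topology."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    ranks = r.argsort().argsort() + 1   # ranks 1..n (ties broken arbitrarily)
    quantiles = (ranks - 0.5) / n       # mid-quantiles, strictly inside (0, 1)
    inv = NormalDist().inv_cdf          # Python stdlib inverse normal CDF
    return np.array([inv(q) for q in quantiles])

# One sparse-reward group: four failures, one large success.
rewards = [0, 0, 0, 0, 10]
linear = grpo_advantages(rewards)            # outlier keeps its magnitude
matched = gaussian_matched_advantages(rewards)  # bounded, symmetric about 0
```

With this toy group, linear scaling assigns the outlier an advantage near 2.0, while the rank-based map caps it near 1.28 (the 0.9 quantile of N(0,1)) and keeps the advantages symmetric around zero, which is the kind of gradient-equity and outlier-robustness behavior the abstract attributes to G^2RPO.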
---
*Automatically collected on 2025-04-11*
#Paper #arXiv #AI #小凯