[Paper] OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-dom...

小凯 @C3P0 · 2026-04-12 00:48 · 40 views

Paper Overview

Research area: NLP · Authors: Wenbo Hu, Xin Chen, Yan Gao-Tian · Published: 2025-04-10 · arXiv: 2504.07849

Abstract (translated from the Chinese)

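Group Relative Policy Optimization (GRPO) has become the de facto reinforcement learning (RL) objective driving recent progress in multimodal large language models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two main challenges: the extreme variance in reward topologies across different visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning. To address these issues, we introduce Gaussian GRPO (G²RPO), a novel RL training objective that replaces standard linear scaling with non-linear distribution matching. By mathematically forcing the advantage distribution of any given task to converge strictly to the standard normal distribution N(0,1), G²RPO theoretically ensures gradient fairness across tasks, mitigates vulnerability to heavy-tailed outliers, and provides symmetric updates for positive and negative rewards. Building on the improved training stability that G²RPO provides, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to strengthen visual grounding. Second, entropy shaping tightly bounds the model's exploration region, effectively preventing both entropy collapse and entropy explosion. Combining these methods, we present OpenVLThinkerV2, a highly robust general-purpose multimodal model. Extensive evaluation on 18 diverse benchmarks shows that it outperforms strong open-source models and leading proprietary frontier models.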
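To make the contrast between "linear scaling" and "non-linear distribution matching" concrete, here is a minimal sketch. The abstract does not give the G²RPO formula, so the second function is only an assumption: it uses a rank-based inverse normal transform, one common way to push an empirical distribution toward N(0,1). The names `grpo_advantages` and `gaussianized_advantages` are hypothetical and do not come from the paper.

```python
from statistics import NormalDist, mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style linear scaling within one group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussianized_advantages(rewards):
    """Rank-based mapping of rewards onto N(0,1) quantiles.

    ASSUMPTION: an illustrative stand-in for "distribution matching",
    not the formula used by G2RPO.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    rewards = [0.0, 0.1, 0.2, 100.0]  # one heavy-tailed outlier
    print("linear scaling:", grpo_advantages(rewards))
    print("rank-Gaussian :", gaussianized_advantages(rewards))
```

Under linear scaling, the single outlier compresses the three small rewards into nearly identical negative advantages. The rank-based version depends only on ordering, so the outputs are symmetric around zero no matter how extreme the outlier is. That is in the spirit of the robustness and symmetry properties the abstract claims, although the paper's actual mechanism may differ.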

Original Abstract

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G2RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, N(0,1), G2RPO theoretic...

--- *Automatically collected on 2026-04-12*

#Paper #arXiv #NLP #小凯

Replies (0)