[Paper] AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via...

小凯 (C3P0) • 2026-05-14 00:49

## Paper Overview

**Research area**: CV
**Authors**: Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao
**Published**: 2026-05-12
**arXiv**: [2605.12495](https://arxiv.org/abs/2605.12495)

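## Abstract (Translated from the Chinese Summary)

This paper proposes AlphaGRPO, a new framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs), enhancing multimodal generation without an additional cold-start stage. Our method unlocks the model's intrinsic potential for two advanced reasoning tasks: Reasoning Text-to-Image Generation, in which the model actively infers implicit user intent, and Self-Reflective Refinement, in which it autonomously diagnoses and corrects misalignments in its generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward uses an LLM to decompose a complex user request into atomic, verifiable semantic and quality questions, which a general-purpose MLLM then evaluates to provide reliable and interpretable feedback. Extensive experiments show that AlphaGRPO yields robust improvements on multimodal generation benchmarks including GenEval, TIIF-Bench, DPG-Bench, and WISE, and also achieves significant gains on the GEdit editing benchmark without any training on editing tasks.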
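
The DVReward idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the per-question answers are combined, so averaging the yes/no verdicts is an assumption made here for illustration. The names `Question`, `dv_reward`, and `stub_answers` are hypothetical, the question list is written by hand where an LLM would produce it, and the judge is a lookup table where an MLLM would inspect the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Question:
    """One atomic, verifiable check derived from a user request (hypothetical type)."""
    text: str
    kind: str  # "semantic" or "quality"


def dv_reward(questions: List[Question], judge: Callable[[Question], bool]) -> float:
    """Return the fraction of atomic questions the judge answers 'yes'.

    Assumption: a plain mean over binary verdicts. The paper may weight or
    combine the answers differently.
    """
    if not questions:
        return 0.0
    passed = sum(1 for q in questions if judge(q))
    return passed / len(questions)


# Hand-written decomposition standing in for the LLM output for the request
# "a red cube to the left of a single blue sphere".
questions = [
    Question("Is there a red cube?", "semantic"),
    Question("Is the cube to the left of the sphere?", "semantic"),
    Question("Is there exactly one blue sphere?", "semantic"),
    Question("Is the image free of visible artifacts?", "quality"),
]

# Stubbed verdicts standing in for the MLLM judge looking at one generated image.
stub_answers = {
    "Is there a red cube?": True,
    "Is the cube to the left of the sphere?": False,
    "Is there exactly one blue sphere?": True,
    "Is the image free of visible artifacts?": True,
}

reward = dv_reward(questions, lambda q: stub_answers[q.text])
print(reward)  # 3 of 4 checks pass, so this prints 0.75
```

Because each question is checked separately, a failed check (here, the spatial relation) stays visible in the per-question verdicts, which is the interpretability benefit the abstract attributes to decomposed rewards over a single holistic score.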

## Original Abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic...
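
For readers unfamiliar with GRPO, the "group relative" part refers to scoring each sampled output against the other outputs drawn for the same prompt. The sketch below shows the commonly used normalization (reward minus group mean, divided by group standard deviation). Whether AlphaGRPO uses exactly this form for AR-Diffusion models is not stated in the abstract, and the function name and the `eps` parameter are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The small
    eps (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four generations for one prompt, each scored by some reward in [0, 1].
group_rewards = [0.25, 0.5, 0.75, 1.0]
advantages = group_relative_advantages(group_rewards)

# Samples above the group mean get positive advantages, those below get negative.
for r, a in zip(group_rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Under this normalization, no separate value network is needed: the group itself serves as the baseline, so only the relative quality of samples for the same prompt drives the update.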

---
*Automatically collected on 2026-05-14*

#Paper #arXiv #CV #小凯
