[Paper] AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via...

Paper Overview

Research area: CV | Authors: Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao | Published: 2026-05-12 | arXiv: 2605.12495

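Abstract (translated from the Chinese summary)

This paper proposes AlphaGRPO, a new framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs), enhancing multimodal generation without an additional cold-start stage. Our approach unlocks the model's intrinsic potential for two advanced reasoning tasks: Reasoning Text-to-Image Generation, in which the model actively infers implicit user intent, and Self-Reflective Refinement, in which it autonomously diagnoses and corrects misalignments in its generated outputs. To provide stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward uses an LLM to decompose a complex user request into atomic, verifiable semantic and quality questions, which a general-purpose MLLM then evaluates to give reliable and interpretable feedback. Extensive experiments show that AlphaGRPO yields robust improvements on multimodal generation benchmarks including GenEval, TIIF-Bench, DPG-Bench, and WISE, and also achieves significant gains on the GEdit editing benchmark without any training on editing tasks.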
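To make the decomposed-reward idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how per-question verdicts are combined, so the sketch assumes binary yes/no verdicts averaged into a score in [0, 1]. The names `AtomicQuestion`, `dv_reward`, and `stub_judge`, along with the example questions, are hypothetical. In practice the decomposition would come from an LLM and the judge would be an MLLM looking at the generated image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicQuestion:
    text: str  # a yes/no question about the generated image
    kind: str  # "semantic" or "quality" (labels assumed for illustration)


def dv_reward(questions: List[AtomicQuestion],
              judge: Callable[[str], bool]) -> float:
    """Return the fraction of atomic questions the judge answers 'yes'.

    Assumption: a plain average of binary verdicts. The paper may
    weight or combine the answers differently.
    """
    if not questions:
        return 0.0
    verdicts = [judge(q.text) for q in questions]
    return sum(verdicts) / len(verdicts)


# Hypothetical decomposition of the request
# "a red cube on top of a blue sphere, photorealistic".
questions = [
    AtomicQuestion("Is there a red cube in the image?", "semantic"),
    AtomicQuestion("Is the cube on top of a blue sphere?", "semantic"),
    AtomicQuestion("Is the image free of obvious artifacts?", "quality"),
    AtomicQuestion("Is the lighting consistent?", "quality"),
]


def stub_judge(question: str) -> bool:
    # Stand-in for an MLLM call: answers "no" only to the sphere question.
    return "sphere" not in question


reward = dv_reward(questions, stub_judge)
print(reward)  # 0.75 (three of four checks pass)
```

Because each verdict is tied to a single question, a low score can be traced back to the specific checks that failed, which is the interpretability benefit the abstract attributes to DVReward over a single holistic scalar.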

Original Abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic...
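For readers unfamiliar with GRPO, the "group relative" part can be sketched as follows: several outputs are sampled for the same prompt, each is scored, and each reward is normalized against the group's own mean and standard deviation instead of a learned value baseline. This is the generic GRPO advantage, shown under assumptions. The abstract does not state whether AlphaGRPO modifies it, and the function name, the use of population standard deviation, and the `eps` value are illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps for
    numerical stability; implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt.
rewards = [0.25, 0.5, 0.75, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update favors the better outputs within each group.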

--- *Automatically collected on 2026-05-14*

#Paper #arXiv #CV #小凯