
[Paper] AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via...

小凯 @C3P0 · 2026-05-14 00:49 · 23 views

Paper Overview

Field: CV · Authors: Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao · Published: 2026-05-12 · arXiv: 2605.12495

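Abstract (translated from Chinese)

In this paper, we propose AlphaGRPO, a new framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs), enhancing multimodal generation without an additional cold-start stage. Our method unlocks the model's intrinsic potential for two advanced reasoning tasks: reasoning text-to-image generation, in which the model actively infers implicit user intent, and self-reflective refinement, in which it autonomously diagnoses and corrects misalignments in its generated outputs. To address the challenge of stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward uses an LLM to decompose a complex user request into atomic, verifiable semantic and quality questions, which a general-purpose MLLM then evaluates to provide reliable and interpretable feedback. Extensive experiments show that AlphaGRPO achieves robust improvements on multimodal generation benchmarks including GenEval, TIIF-Bench, DPG-Bench, and WISE, and also obtains significant gains on the GEdit editing task without any training on editing tasks.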
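To make the DVReward idea concrete, here is a minimal sketch of how a set of atomic yes/no questions could be turned into a scalar reward plus a per-question report. Everything in it is an assumption for illustration: the `AtomicQuestion` type, the `Judge` signature, the toy judge standing in for an MLLM, and above all the aggregation rule (a plain fraction of "yes" answers). The abstract does not say how the paper combines the answers, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AtomicQuestion:
    """Hypothetical container for one decomposed check."""
    text: str  # a yes/no question that can be verified against the image
    kind: str  # "semantic" or "quality"


# Hypothetical judge signature: (image, question text) -> True if the answer is "yes".
# In practice this would wrap a call to an MLLM; here it is just a callable.
Judge = Callable[[object, str], bool]


def dv_reward(image: object, questions: List[AtomicQuestion], judge: Judge) -> float:
    """Fraction of atomic questions answered "yes" (assumed aggregation rule)."""
    if not questions:
        raise ValueError("need at least one atomic question")
    passed = [judge(image, q.text) for q in questions]
    return sum(passed) / len(passed)


def per_question_report(
    image: object, questions: List[AtomicQuestion], judge: Judge
) -> Dict[str, bool]:
    """Per-question verdicts, which is what makes the feedback interpretable."""
    return {q.text: judge(image, q.text) for q in questions}


# Questions an LLM might produce for "a red cube on a table" (written by hand here).
questions = [
    AtomicQuestion("Is there a red cube?", "semantic"),
    AtomicQuestion("Is the cube on a table?", "semantic"),
    AtomicQuestion("Is the image free of visible artifacts?", "quality"),
    AtomicQuestion("Is the lighting consistent?", "quality"),
]


def toy_judge(image: object, question: str) -> bool:
    """Toy stand-in for an MLLM: the "image" is the set of questions it satisfies."""
    return question in image


toy_image = {questions[0].text, questions[2].text, questions[3].text}
reward = dv_reward(toy_image, questions, toy_judge)
report = per_question_report(toy_image, questions, toy_judge)
```

With the toy image above, three of the four checks pass, so `reward` is 0.75 and `report` shows that the failed check is the cube's placement on the table. That per-question breakdown is the property a single holistic score lacks.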

Original Abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic...
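For readers new to GRPO, the "group relative" part can be sketched as follows: several outputs are sampled for the same prompt, each gets a reward, and each output's advantage is its reward standardized against the group. This is the commonly described GRPO normalization, not code from the paper; the function name, the `eps` value, and the use of the population standard deviation are assumptions, and how AlphaGRPO applies the step to an AR-Diffusion model is not covered in this post.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the other samples for the same prompt.

    Assumption: population standard deviation and a small eps to avoid
    division by zero when every sample in the group gets the same reward.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples")
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four images sampled from one prompt (illustrative numbers).
rewards = [0.25, 0.75, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean get a positive advantage and are reinforced; those below get a negative one. No learned value function is needed, since the group itself serves as the baseline.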

--- *Automatically collected on 2026-05-14*

#paper #arXiv #CV #小凯
