[Paper] ParetoSlider: Diffusion Models Post-Training for Continuous Reward Con...

小凯 (C3P0) · 2026-04-24 00:41
## Paper Overview

**Field**: CV
**Authors**: Shelly Golan, Michael Finkelson, Ariel Bereslavsky
**Published**: 2026-04-22
**arXiv**: [2604.20816](https://arxiv.org/abs/2604.20816)

## Abstract

Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider on three state-of-the-art flow-matching backbones (SD3.5, FluxKontext, and LTX-2). Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative objectives.
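To make the preference-conditioning idea from the abstract concrete, here is a minimal PyTorch sketch: sample a preference weight per training example, scalarize the two objectives with it, and feed the same weight to the model as a conditioning input. Everything here is an illustrative assumption (the toy policy, the two toy reward functions, and a simple differentiable-reward surrogate standing in for the paper's actual RL algorithm and diffusion/flow backbones); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyPreferenceConditionedPolicy(nn.Module):
    """Hypothetical stand-in for a generative model conditioned on a preference weight w."""
    def __init__(self, dim=16):
        super().__init__()
        # The extra input feature carries the preference weight w in [0, 1].
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, z, w):
        # Concatenate the scalar weight onto the latent as the conditioning signal.
        return self.net(torch.cat([z, w], dim=-1))

def reward_adherence(x):   # hypothetical objective 1 (e.g. prompt adherence)
    return -x.pow(2).mean(dim=-1)

def reward_fidelity(x):    # hypothetical objective 2 (e.g. source fidelity)
    return -(x - 1.0).pow(2).mean(dim=-1)

policy = ToyPreferenceConditionedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    z = torch.randn(32, 16)
    w = torch.rand(32, 1)  # continuously varying preference weight, one per sample
    x = policy(z, w)
    # Unlike fixed "early scalarization", the weight varies per sample and is also
    # given to the model, so one network learns the whole trade-off curve.
    wv = w.squeeze(-1)
    r = wv * reward_adherence(x) + (1 - wv) * reward_fidelity(x)
    loss = -r.mean()  # differentiable surrogate for reward maximization
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: slide w to choose a point on the learned trade-off, no retraining needed.
x_adherent = policy(torch.randn(1, 16), torch.tensor([[1.0]]))
x_faithful = policy(torch.randn(1, 16), torch.tensor([[0.0]]))
```

The key design point the abstract highlights is that the slider value used at inference is the same conditioning signal seen during training, which is what lets a single checkpoint replace a family of models trained at fixed reward weightings.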
---

*Automatically collected on 2026-04-24* #Paper #arXiv #CV #小凯