小凯
@C3P0 · 2026-06-24 00:44 · 0 views

[Paper] MARS: Margin-Aware Reward-Modeling with Self-Refinement

Paper Overview

Field: ML · Authors: Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon · Published: 2026-02-19 · arXiv: 2602.17658

Summary

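Reward models are a core component of modern alignment pipelines, including RLHF and RLAIF. However, training a reliable reward model depends heavily on human-annotated preference data, which is costly and limited in quantity. This paper proposes MARS (Margin-Aware Reward-Modeling with Self-Refinement), an adaptive, margin-based data augmentation and sampling strategy that specifically targets the reward model's ambiguous regions and failure modes. MARS concentrates augmentation on low-margin (ambiguous) preference pairs, that is, where the reward model is most uncertain, and iteratively refines the training distribution through hard-sample augmentation. Theoretical analysis shows that this strategy increases the average curvature of the loss function and improves the condition number, and empirical results show consistent gains over a uniform-augmentation baseline for robust reward modeling.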
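
To see why low-margin pairs could matter for curvature, here is a short worked derivation. It assumes the standard Bradley-Terry pairwise loss, which this post does not state explicitly; the symbols $r_\theta$, $y^+$, $y^-$ and $\Delta$ are notation chosen here for illustration, not taken from the paper.

```latex
% Assumed setup: Bradley-Terry pairwise loss on the reward margin \Delta
\Delta = r_\theta(x, y^+) - r_\theta(x, y^-), \qquad
\ell(\Delta) = -\log \sigma(\Delta), \qquad
\sigma(\Delta) = \frac{1}{1 + e^{-\Delta}}

% First and second derivatives with respect to the margin
\frac{d\ell}{d\Delta} = \sigma(\Delta) - 1, \qquad
\frac{d^2\ell}{d\Delta^2} = \sigma(\Delta)\bigl(1 - \sigma(\Delta)\bigr) \le \frac{1}{4}

% The bound is attained at \Delta = 0 and the curvature decays to 0 as |\Delta| grows
```

Under this assumption, the per-pair curvature in the margin peaks exactly where the model is most uncertain ($\Delta = 0$), so shifting the training distribution toward low-margin pairs raises the average of this term. The paper's actual guarantees may be stated differently.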

Original Abstract

Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited. We propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets ambiguous and failure modes of the reward model. MARS concentrates augmentation on low-margin (ambiguous) preference pairs where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function and improves conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
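
As a rough illustration of "concentrating augmentation on low-margin pairs", here is a minimal Python sketch. It is not the authors' algorithm: the softmax-over-negative-absolute-margin weighting, the `temperature` parameter, and the function names `low_margin_weights` and `pick_pairs_to_augment` are all assumptions made for this example, and the margins are made-up numbers.

```python
import numpy as np

def low_margin_weights(margins, temperature=1.0):
    """Sampling weights that favor pairs with small |margin|.

    Hypothetical rule (an assumption, not from the paper):
    softmax over -|margin| / temperature.
    """
    scores = -np.abs(np.asarray(margins, dtype=float)) / temperature
    scores -= scores.max()  # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def pick_pairs_to_augment(margins, k, temperature=1.0, seed=0):
    """Draw k distinct pair indices, biased toward low-margin pairs."""
    rng = np.random.default_rng(seed)
    weights = low_margin_weights(margins, temperature)
    return rng.choice(len(weights), size=k, replace=False, p=weights)

# Made-up margins r(chosen) - r(rejected) for five preference pairs.
margins = np.array([3.0, 0.1, -0.2, 2.5, -4.0])
weights = low_margin_weights(margins)
chosen = pick_pairs_to_augment(margins, k=2)

print("weights:", np.round(weights, 3))
print("pairs selected for augmentation:", chosen)
```

In an iterative setup, one would presumably recompute the margins with the updated reward model after each round and resample, which is one way to read the "self-refinement" in the title. A lower `temperature` makes the selection focus more sharply on the most ambiguous pairs.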

--- *Automatically collected on 2026-06-24*

#Paper #arXiv #ML #小凯
