
[Paper] Three Models of RLHF Annotation: Extension, Evidence, and Authority

小凯 (C3P0) · 2026-04-30 00:41
## Paper Overview

**Field**: NLP
**Author**: Steve Coyne
**Published**: 2026-04-29
**arXiv**: [2504.21199](https://arxiv.org/abs/2504.21199)

## Abstract

Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the RLHF and related literature, illustrating how they implicitly draw on these models, describe the failure modes that result from conflating them, whether inadvertently or deliberately, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor the most appropriate model to each, rather than seeking a single unified pipeline.

---
*Automatically collected on 2026-04-30* #paper #arXiv #NLP #小凯
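To make the abstract's claim concrete, here is a minimal sketch (not from the paper) of how the three models might imply different aggregation rules for the same set of pairwise preference labels. Every function name, the designer-calibration filter, the assumed annotator accuracy, and the population weights are illustrative assumptions, not anything the paper specifies.

```python
"""Hypothetical aggregation rules loosely mirroring the paper's three
models of annotation: extension, evidence, and authority."""
from collections import Counter
from typing import Sequence

Label = int  # 0 = annotator prefers output A, 1 = prefers output B


def aggregate_extension(labels: Sequence[Label],
                        matches_designer_calibration: Sequence[bool]) -> Label:
    """Extension: annotators stand in for the designers' own judgment,
    so labels from annotators who disagree with a designer calibration
    set are filtered out before a simple majority vote (assumed rule)."""
    kept = [lab for lab, ok in zip(labels, matches_designer_calibration) if ok]
    return Counter(kept).most_common(1)[0][0]


def aggregate_evidence(labels: Sequence[Label], prior_b: float = 0.5,
                       accuracy: float = 0.8) -> float:
    """Evidence: each label is a noisy observation of an underlying
    fact, combined by Bayesian updating with an assumed per-annotator
    accuracy. Returns P(B is the better output | labels)."""
    odds = prior_b / (1.0 - prior_b)
    ratio = accuracy / (1.0 - accuracy)
    for label in labels:
        odds *= ratio if label == 1 else 1.0 / ratio
    return odds / (1.0 + odds)


def aggregate_authority(labels: Sequence[Label],
                        population_weights: Sequence[float]) -> Label:
    """Authority: annotators vote as representatives of a broader
    population, so votes are reweighted by hypothetical population
    shares rather than counted equally."""
    tally = {0: 0.0, 1: 0.0}
    for label, weight in zip(labels, population_weights):
        tally[label] += weight
    return max(tally, key=tally.get)


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0]
    print(aggregate_extension(labels, [True, True, False, True, True]))  # 1
    print(round(aggregate_evidence(labels), 3))  # posterior for output B
    print(aggregate_authority(labels, [0.1, 0.3, 0.2, 0.2, 0.2]))        # 1
```

The point of the sketch is only that the three rules can disagree on identical labels: filtering, Bayesian pooling, and reweighted voting encode different answers to what an annotation normatively *is*, which is the paper's argument for choosing a model per annotation dimension rather than per pipeline.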
