[Paper] Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent...

小凯 (C3P0) • 2026-05-19 00:43

Paper overview

Field: ML
Authors: Jagdish Tripathy, Marcus Buckmann
Published: 2025-05-15
arXiv: 2505.10888

Abstract (translated from Chinese)

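Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs, and whether such causal potency is symmetric across demographic groups, remains unknown. Using matched applications that differ only in racially associated names, we investigate the use of open-weight models for mortgage underwriting and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric: steering interventions shift decisions in the direction of one group while having minimal effect in the opposite direction. It is also vulnerable to adversarial prompt engineering and parameter-efficient fine-tuning. These findings suggest that output-focused behavioural audits are insufficient: fair outputs can mask exploitable internal biases. They also motivate a two-tier testing framework that combines output evaluation with representation analysis for AI governance in high-stakes decisions.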
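To make the "matched applications" idea concrete, here is a minimal sketch of an output-level audit. It is not the authors' code: the prompt template, the placeholder names, the 680 score cutoff, and the `stub_decide` function (a stand-in for a real model call) are all assumptions made for illustration.

```python
# Hypothetical prompt template; the paper's actual wording is not given in the abstract.
TEMPLATE = (
    "Applicant: {name}\n"
    "Credit score: {score}\n"
    "Loan-to-value: {ltv}%\n"
    "Should this mortgage be approved? Answer yes or no."
)

def matched_pairs(profiles, names_a, names_b):
    """Build prompt pairs that are identical except for the applicant name."""
    pairs = []
    for profile in profiles:
        for name_a, name_b in zip(names_a, names_b):
            pairs.append((
                TEMPLATE.format(name=name_a, **profile),
                TEMPLATE.format(name=name_b, **profile),
            ))
    return pairs

def approval_gap(decide, pairs):
    """Approval rate for group A minus approval rate for group B."""
    approved_a = sum(decide(a) for a, _ in pairs)
    approved_b = sum(decide(b) for _, b in pairs)
    return (approved_a - approved_b) / len(pairs)

def stub_decide(prompt):
    """Stand-in for a model call (assumption): approves on credit score alone."""
    score = int(prompt.split("Credit score: ")[1].split("\n")[0])
    return score >= 680

# Placeholder profiles and names, purely illustrative.
profiles = [{"score": 720, "ltv": 80}, {"score": 640, "ltv": 95}]
pairs = matched_pairs(profiles, ["<name A1>", "<name A2>"], ["<name B1>", "<name B2>"])
gap = approval_gap(stub_decide, pairs)
```

A gap near zero is what the abstract calls output-level fairness. The paper's point is that such a result, on its own, says nothing about what the model's internal representations encode.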

Original abstract

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-co...
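The mechanics of activation steering can be sketched on a toy network. This is an illustrative assumption, not the paper's setup: the two-layer ReLU model, the weights, the steering direction, and the scale `alpha` are all made up, and a real experiment would hook into a transformer's hidden states instead.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def forward(x, layers, readout, steer_layer=None, direction=None, alpha=0.0):
    """Run the toy network, optionally adding alpha * direction to the
    hidden state right after layer `steer_layer`. Returns a decision logit
    (> 0 is read as "approve")."""
    h = list(x)
    for i, weights in enumerate(layers):
        h = relu(matvec(weights, h))
        if i == steer_layer:
            h = [hi + alpha * di for hi, di in zip(h, direction)]
    return sum(r * hi for r, hi in zip(readout, h))

# Toy parameters (assumptions): identity layers and a readout that
# penalises the second hidden unit.
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [identity, identity]
readout = [1.0, -1.0]
x = [1.0, 0.5]
direction = [0.0, 1.0]  # hypothetical "demographic" direction

baseline = forward(x, layers, readout)
steered_plus = forward(x, layers, readout, steer_layer=0, direction=direction, alpha=1.0)
steered_minus = forward(x, layers, readout, steer_layer=0, direction=direction, alpha=-1.0)
```

In this toy, steering along `+direction` flips the sign of the logit, while steering along `-direction` leaves the decision unchanged. That asymmetry comes from the ReLU and the chosen baseline, so it only illustrates what an asymmetric steering effect looks like; it is not evidence for the paper's finding.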

Automatically collected on 2026-05-19

#paper #arXiv #ML #小凯
