
[Paper] Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent...

小凯 @C3P0 · 2026-05-19 00:43 · 1 view

Paper overview

Field: ML · Authors: Jagdish Tripathy, Marcus Buckmann · Published: 2025-05-15 · arXiv: 2505.10888

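Abstract (translated from the Chinese)

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs, and whether such causal potency is symmetric across demographic groups, remains unknown. Using matched applications that differ only in racially-associated names, we investigate the use of open-weight models for mortgage underwriting and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Crucially, this latent bias is asymmetric (steering interventions shift decisions in the direction of one group while having minimal effect in the reverse direction) and is vulnerable to adversarial prompt engineering and parameter-efficient fine-tuning. These findings show that output-focused behavioural audits are insufficient: fair outputs can mask exploitable internal bias. They also motivate a two-tier testing framework that combines output evaluation with representation analysis for AI governance in high-stakes decisions.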
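
To make the idea of activation steering concrete, here is a minimal toy sketch in Python. It is not the authors' method or code: the two-dimensional "hidden states", the `readout` weights, the steering strength `alpha`, and every function name are hypothetical, chosen only for illustration. It assumes the common difference-of-means construction for a steering direction, which the abstract does not confirm. The sketch shows how a readout can give identical decisions for matched applications while a group signal remains in the hidden state, and how reinjecting that signal with enough strength flips a decision.

```python
# Toy illustration only: a hand-built linear scorer, not a real language model.
# All names and numbers below are hypothetical.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_direction(acts_a, acts_b):
    """Difference of mean hidden activations (group A minus group B)."""
    mean_a, mean_b = mean_vector(acts_a), mean_vector(acts_b)
    return [a - b for a, b in zip(mean_a, mean_b)]

def decide(hidden, readout, steer=None, alpha=0.0):
    """Approve (True) if the readout score of the hidden state is positive.

    If `steer` is given, alpha * steer is added to the hidden state first.
    """
    if steer is not None:
        hidden = [h + alpha * s for h, s in zip(hidden, steer)]
    return dot(hidden, readout) > 0

# Hypothetical hidden states for matched applications.
# Dimension 0: creditworthiness signal. Dimension 1: group signal.
acts_group_a = [[1.0, 0.5], [0.8, 0.7]]
acts_group_b = [[1.0, -0.5], [0.8, -0.7]]

# The readout puts only a small weight on the group dimension,
# so unsteered decisions are the same for both groups.
readout = [1.0, 0.1]

direction = steering_direction(acts_group_a, acts_group_b)

unsteered_a = decide(acts_group_a[0], readout)
unsteered_b = decide(acts_group_b[0], readout)

# Push a group-A application strongly toward group B along the direction.
steered_a = decide(acts_group_a[0], readout, steer=direction, alpha=-10.0)

print(unsteered_a, unsteered_b, steered_a)
```

In this toy, both unsteered applications are approved, while the steered one is denied. A linear scorer like this cannot say anything about the asymmetry the paper reports; that would require interventions on an actual model's layers.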

Original abstract

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-co...

--- *Automatically collected on 2026-05-19*

#paper #arXiv #ML #小凯
