[Paper] Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent...

Paper overview

Field: ML Authors: Jagdish Tripathy, Marcus Buckmann Published: 2025-05-15 arXiv: 2505.10888

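Abstract (translated from the Chinese)

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs, and whether such causal potency is symmetric across demographic groups, remains unknown. Using matched applications that differ only in racially-associated names, we investigate the use of open-weight models for mortgage underwriting and reveal a critical disconnect: models show no bias at the output level, yet retain and amplify demographic representations across layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric (steering interventions affect decisions in one group direction while producing minimal effects in the reverse direction) and is vulnerable to adversarial prompt engineering and parameter-efficient fine-tuning. These findings show that output-focused behavioural audits are insufficient: fair outputs can mask exploitable internal biases. They also motivate a two-tiered testing framework that combines output evaluation with representational analysis for AI governance in high-stakes decisions.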

Original abstract

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-co...
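
As a rough illustration of what "activation steering" means in the abstract, the sketch below adds a scaled direction vector to a hidden activation and checks whether a downstream decision changes. Everything here is an assumption made for illustration: the difference-of-means direction, the toy activations, the linear `readout` head, and the `alpha` scale are hypothetical and are not taken from the paper, which does not describe its procedure in this excerpt.

```python
# Toy illustration only: hand-written vectors, not a real language model
# and not the authors' method.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(acts_group_a, acts_group_b):
    """Difference-of-means direction pointing from group B toward group A."""
    mean_a = mean_vector(acts_group_a)
    mean_b = mean_vector(acts_group_b)
    return [a - b for a, b in zip(mean_a, mean_b)]

def steer(hidden, direction, alpha):
    """Return the hidden vector shifted by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def readout(hidden, weights, bias=0.0):
    """Hypothetical linear decision head: 'approve' if the score is positive."""
    score = sum(h * w for h, w in zip(hidden, weights)) + bias
    return "approve" if score > 0 else "deny"

# Hypothetical activations collected at one layer for two name groups.
acts_a = [[1.0, 0.2, 0.0], [0.8, 0.4, 0.0]]
acts_b = [[0.0, 0.2, 0.0], [0.2, 0.4, 0.0]]
direction = steering_direction(acts_a, acts_b)

# Hypothetical decision head and a single application's hidden state.
head_weights = [-1.0, 1.0, 0.5]
hidden = [0.1, 0.5, 0.2]

baseline = readout(hidden, head_weights)
steered = readout(steer(hidden, direction, alpha=2.0), head_weights)
print(baseline, steered)
```

In this toy setup the unsteered input is approved and the steered one is denied, which mirrors in miniature the idea that reinjecting a suppressed direction can flip a decision. With a real model, the same addition would typically be applied to a layer's residual activations through a forward hook.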

--- *Automatically collected on 2026-05-19*

#Paper #arXiv #ML #小凯
