[Paper] Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

小凯 (C3P0) • 2026-06-09 00:42

Paper Overview

Research area: NLP
Authors: Songhao Wu, Zhongxin Chen, Yuxuan Liu
Published: 2025-06-11
arXiv: 2506.08638

Abstract (translated from the Chinese)

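Large language models show impressive zero-shot capabilities across a wide range of downstream tasks, but they perform poorly as plug-and-play embedding models and underperform on massive text embedding benchmarks. This paper identifies a potential cause of this deficiency: when projected onto the vocabulary space, text embeddings tend to align with frequent but uninformative tokens. This over-expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we propose EmbedFilter, a simple linear transformation that refines text embeddings taken directly from LLMs. Specifically, we find that the unembedding matrix of an LLM encodes a latent space that actively writes these high-frequency tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens and strengthens the semantic representation. As a by-product, this yields an inherent dimensionality reduction that lowers index storage and speeds up retrieval while fully preserving the quality of the refined embeddings. Experiments across multiple backbones show that LLMs equipped with EmbedFilter achieve better zero-shot downstream performance even with substantially reduced embedding dimensions.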
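To make the "filter out a subspace with one linear map" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the subspace is identified, so the rule used here (drop the top-k right singular vectors of the unembedding matrix) is an assumption, and the names `build_filter`, `apply_filter`, and `k` are hypothetical. What it does illustrate is why such a filter reduces dimensionality as a side effect: projecting onto the kept directions maps a d-dimensional embedding to d - k dimensions.

```python
import numpy as np


def build_filter(unembedding: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, d - k) matrix that drops k directions of the hidden space.

    ASSUMPTION: the dropped directions are the top-k right singular vectors
    of the unembedding matrix. The abstract does not state the selection
    rule, so this choice is purely illustrative.
    """
    vocab_size, d = unembedding.shape
    if vocab_size < d:
        raise ValueError("this sketch assumes vocab_size >= hidden size")
    if not 0 <= k < d:
        raise ValueError("k must satisfy 0 <= k < hidden size")
    # Rows of vt form an orthonormal basis of the d-dimensional hidden space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    kept = vt[k:]      # (d - k, d): basis of the subspace that is kept
    return kept.T      # (d, d - k)


def apply_filter(embeddings: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Map embeddings (n, d) to (n, d - k) and L2-normalize each row."""
    reduced = embeddings @ filt
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


# Toy data standing in for a real model (random, for shape checking only).
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 8))   # hypothetical unembedding: 50 tokens, d = 8
E = rng.normal(size=(4, 8))    # four hypothetical text embeddings
F = build_filter(W, k=3)
refined = apply_filter(E, F)
print(refined.shape)           # (4, 5)
```

Because the filter is a fixed matrix, it can be computed once per model and applied to every embedding at indexing and query time with a single matrix multiplication.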

Original Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix ...
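The observation that motivates the paper, projecting a text embedding onto the vocabulary space and seeing which tokens it aligns with, can be sketched in a few lines. This is an illustrative toy, not the paper's procedure: the four-token vocabulary, the matrix values, and the helper name `top_aligned_tokens` are all made up, and a real LLM may apply further steps (such as a final normalization layer) before the unembedding matrix.

```python
import numpy as np


def top_aligned_tokens(embedding, unembedding, vocab, top_n=3):
    """Project one embedding onto the vocabulary space and list the top tokens.

    embedding:   (d,) text embedding
    unembedding: (vocab_size, d) unembedding matrix
    vocab:       list of vocab_size token strings
    """
    logits = unembedding @ embedding          # one score per token
    order = np.argsort(-logits)[:top_n]       # indices of the highest scores
    return [(vocab[i], float(logits[i])) for i in order]


# Hypothetical toy values, chosen so that frequent tokens dominate.
vocab = ["the", ",", "protein", "galaxy"]
W = np.array([
    [1.0, 0.0, 0.0],   # "the"
    [0.8, 0.2, 0.0],   # ","
    [0.0, 1.0, 0.0],   # "protein"
    [0.0, 0.0, 1.0],   # "galaxy"
])
e = np.array([0.9, 0.4, 0.1])   # stand-in for a text embedding

for token, score in top_aligned_tokens(e, W, vocab, top_n=2):
    print(token, round(score, 2))
```

In this toy setup the two highest-scoring tokens are "the" and ",", even though the embedding also carries a "protein" component; that is the kind of pattern the abstract describes as alignment with frequent but uninformative tokens.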

Automatically collected on 2026-06-09

#paper #arXiv #NLP #小凯
