[Paper] EMO: Pretraining Mixture of Experts for Emergent Modularity

Paper Overview

Research area: NLP | Authors: Ryan Wang, Akshita Bhagia, Sewon Min | Published: 2026-05-06 | arXiv: 2505.03483

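Abstract (translated from Chinese)

Large language models are typically deployed as monolithic systems: the full model must be loaded even when an application needs only a narrow subset of capabilities, such as code, math, or domain-specific knowledge. Mixture-of-Experts (MoE) models seemingly offer an alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity, meaning the independent use and composition of expert subsets, without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document usually share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint lets coherent expert groupings emerge during pretraining using document boundaries alone. We pretrain an EMO with 1B active and 14B total parameters on 1T tokens. As a full model, it matches the performance of a standard MoE. Crucially, it supports selective expert use: keeping only 25% (12.5%) of the experts costs just 1% (3%) in absolute performance, whereas a standard MoE collapses completely under the same setting. We also find that expert subsets in EMO specialize at the semantic level (for example, math or code domains), in contrast to the low-level syntactic specialization observed in standard MoEs. Overall, our results demonstrate a path toward modular, memory-efficient deployment of large sparse models and open up new opportunities for composable architectures.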
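The document-level constraint described in the abstract (tokens in one document pick experts from a shared pool) might look roughly like the NumPy sketch below. This is an illustration only, not the authors' implementation. The abstract does not say how a document's pool is chosen, so the rule used here (take the experts with the highest mean router score over the document) is an assumption. The names `route_with_document_pool`, `pool_size`, and `top_k` are hypothetical.

```python
import numpy as np

def route_with_document_pool(logits, pool_size, top_k):
    """Route the tokens of ONE document through a shared expert pool.

    logits: array of shape (num_tokens, num_experts) with router scores.
    Returns (pool, chosen, weights).
    """
    logits = np.asarray(logits, dtype=float)
    num_tokens, num_experts = logits.shape
    assert top_k <= pool_size <= num_experts

    # ASSUMPTION: the pool is the `pool_size` experts with the highest
    # mean router score over the document. The abstract only states that
    # a shared pool exists, not how it is selected.
    doc_scores = logits.mean(axis=0)
    pool = np.sort(np.argsort(-doc_scores)[:pool_size])

    # Mask every expert outside the pool so no token can select it.
    masked = np.full_like(logits, -np.inf)
    masked[:, pool] = logits[:, pool]

    # Ordinary per-token top-k routing, now confined to the pool.
    chosen = np.argsort(-masked, axis=1)[:, :top_k]

    # Softmax over the chosen scores gives the mixing weights.
    chosen_logits = np.take_along_axis(masked, chosen, axis=1)
    weights = np.exp(chosen_logits - chosen_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return pool, chosen, weights

# Example: 6 tokens, 8 experts, a pool of 4 experts, 2 experts per token.
rng = np.random.default_rng(0)
pool, chosen, weights = route_with_document_pool(
    rng.normal(size=(6, 8)), pool_size=4, top_k=2
)
```

Because each document computes its own pool, two documents from different domains can end up using disjoint sets of experts, which is the grouping effect the abstract attributes to document boundaries.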

Original Abstract

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens withi...
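Restricting inference to a subset of experts, as the abstract describes, can be pictured as slicing both the expert weights and the router columns down to the retained experts. The sketch below is a toy under stated assumptions, not the paper's procedure. Each expert is simplified to a single linear map, the list of retained experts is supplied by hand, and `prune_experts` and `moe_forward` are hypothetical helper names.

```python
import numpy as np

def prune_experts(expert_weights, router_weight, keep):
    """Keep only the experts listed in `keep`.

    expert_weights: (num_experts, d_model, d_model), one linear map per
                    expert (a simplification of a real expert MLP).
    router_weight:  (d_model, num_experts).
    """
    keep = np.asarray(keep)
    return expert_weights[keep], router_weight[:, keep]

def moe_forward(x, expert_weights, router_weight, top_k):
    """Top-k MoE layer over whichever experts are present."""
    logits = x @ router_weight                        # (tokens, experts)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]   # best experts per token
    chosen_logits = np.take_along_axis(logits, chosen, axis=1)
    w = np.exp(chosen_logits - chosen_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # softmax over chosen
    out = np.zeros((x.shape[0], expert_weights.shape[2]))
    for t in range(x.shape[0]):
        for j in range(top_k):
            out[t] += w[t, j] * (x[t] @ expert_weights[chosen[t, j]])
    return out

# Example: keep 2 of 8 experts (25%), chosen by hand for illustration.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 4, 4))
router = rng.normal(size=(4, 8))
small_experts, small_router = prune_experts(experts, router, keep=[1, 5])
y = moe_forward(rng.normal(size=(3, 4)), small_experts, small_router, top_k=2)
```

In this toy, the retained expert weights take a quarter of the original expert memory. Whether quality survives such pruning depends on how the model was trained, which is the property the abstract claims for EMO.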

--- *Automatically collected on 2026-05-09*

#Paper #arXiv #NLP #小凯
