[Paper] UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Paper Overview

Research area: ML | Authors: Minbin Huang, Han Shi, Chuanyang Zheng | Published: 2026-05-06 | arXiv: 2505.03485

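Abstract (translated from the Chinese summary)

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse, scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30 billion tokens from The Pile, UniPool consistently outperforms matched vanilla MoE baselines in both validation loss and perplexity. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvements, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants that use only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.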
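To make the shared-pool idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact form of the pool-level auxiliary loss or of NormRouter, so this sketch assumes a plain softmax top-k router and a Switch-Transformer-style balance term computed over the whole pool instead of per layer. The class `SharedPoolMoE`, the function `expert_budget_ratio`, and the "expert as a per-dimension scale vector" simplification are all hypothetical names and choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def expert_budget_ratio(pool_size, num_layers, experts_per_layer):
    """Shared-pool expert count divided by the per-layer (vanilla) expert count.

    Assumes every expert has the same parameter count, so the ratio of
    expert counts equals the ratio of expert-parameter budgets.
    """
    return pool_size / (num_layers * experts_per_layer)


class SharedPoolMoE:
    """Toy model: ONE expert pool shared by all layers, ONE router per layer."""

    def __init__(self, num_layers, pool_size, dim, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.pool_size = pool_size
        # Assumption: each "expert" is a per-dimension scale vector standing in
        # for a feed-forward block. The pool is created once, not once per layer.
        self.experts = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(pool_size)]
        # Each layer owns only its router (a pool_size x dim weight matrix).
        self.routers = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                         for _ in range(pool_size)]
                        for _ in range(num_layers)]

    def forward(self, tokens):
        counts = [0] * self.pool_size        # top-k selections per expert, all layers
        prob_sums = [0.0] * self.pool_size   # summed router probabilities, all layers
        decisions = 0                        # number of (token, layer) routing decisions
        outputs = []
        for x in tokens:
            h = list(x)
            for router in self.routers:
                logits = [sum(w * v for w, v in zip(row, h)) for row in router]
                probs = softmax(logits)
                chosen = sorted(range(self.pool_size),
                                key=probs.__getitem__, reverse=True)[: self.k]
                norm = sum(probs[e] for e in chosen)
                update = [0.0] * len(h)
                for e in chosen:
                    counts[e] += 1
                    gate = probs[e] / norm   # renormalize gates over the chosen experts
                    for d, scale in enumerate(self.experts[e]):
                        update[d] += gate * scale * h[d]
                for e in range(self.pool_size):
                    prob_sums[e] += probs[e]
                decisions += 1
                h = [a + b for a, b in zip(h, update)]  # residual connection
            outputs.append(h)
        decisions = max(decisions, 1)        # avoid division by zero on empty input
        # Assumed balance term: pool_size * sum_i f_i * P_i, where f_i is the
        # fraction of selections and P_i the mean router probability for
        # expert i, both aggregated over every layer that reads the pool.
        frac = [c / (decisions * self.k) for c in counts]
        mean_prob = [s / decisions for s in prob_sums]
        aux = self.pool_size * sum(f * p for f, p in zip(frac, mean_prob))
        return outputs, aux, counts


if __name__ == "__main__":
    model = SharedPoolMoE(num_layers=4, pool_size=8, dim=4, k=2)
    batch = [[0.5, -1.0, 0.25, 2.0], [1.0, 1.0, -0.5, 0.0]]
    _, aux_loss, usage = model.forward(batch)
    print("pool-level balance term:", aux_loss)
    print("selections per expert across all layers:", usage)
    # Hypothetical sizes: a pool of 64 experts versus 12 layers x 8 experts.
    print("expert budget ratio:", expert_budget_ratio(64, 12, 8))
```

The point of the sketch is the bookkeeping: utilization statistics are accumulated across all layers before the balance term is computed, which is what distinguishes a pool-level loss from the usual per-layer one. The budget helper shows why pool size acts as a separate knob: in a vanilla MoE the expert count is layers times experts per layer, whereas here the pool size can be chosen independently of depth.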

Original Abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balance...

--- *Automatically collected on 2026-05-09*

#Paper #arXiv #ML #小凯
