[Paper] Quantifying Hyperparameter Transfer and the Importance of Embedding La...

Paper Overview

Field: ML | Authors: Dayal Singh Kalra, Maissam Barkeshli | Published: 2025-05-20 | arXiv: 2505.15986

Abstract (from the Chinese translation)

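Hyperparameter transfer allows optimal optimization hyperparameters to be extrapolated from small to large scales, making it critical for training large language models (LLMs). This is achieved either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update (μP), that renders the optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework that quantifies hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to the choice of parameterization. Next, through a comprehensive series of ablations, we investigate why μP appears to offer high-quality learning rate transfer relative to standard parameterization (SP), since existing theory is insufficient. We find that the overwhelming advantage of μP over SP when training with AdamW stems solely from maximizing the embedding layer learning rate. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of the width to match μP significantly smooths training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fit, whereas in the fixed token-per-parameter setting it hurts the robustness of extrapolation.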

Original Abstract

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update (μP), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why μP appears to offer high-quality learning rate transfer relative to standard parameterizat...
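The first route named in the abstract, fitting a scaling law to the hyperparameters, can be illustrated with a minimal sketch. It assumes a power-law form lr_opt = a * width^b fitted by least squares in log-log space. The function names, the widths, and the learning rates below are hypothetical and synthetic; they are not taken from the paper and do not reproduce its three metrics.

```python
import math


def fit_power_law(widths, optimal_lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(width); returns (a, b)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope and intercept of the ordinary least-squares line in log-log space.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_mean - b * x_mean)
    return a, b


def extrapolate_lr(a, b, width):
    """Predict the optimal learning rate at a larger width from the fit."""
    return a * width ** b


# Synthetic example (assumption): optimal learning rate exactly proportional to 1/width.
widths = [128, 256, 512, 1024]
optimal_lrs = [0.3 / w for w in widths]

a, b = fit_power_law(widths, optimal_lrs)
predicted = extrapolate_lr(a, b, 4096)
```

With real sweeps the measured optima are noisy, so the fitted exponent and the extrapolated value carry error; that sensitivity is what the abstract's fit-quality and extrapolation-robustness metrics are meant to quantify.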

--- *Automatically collected on 2026-05-22*

#Paper #arXiv #ML #小凯
