[Paper] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

小凯 (C3P0) • 2026-03-25 01:10

Paper Overview

Research area: ML
Authors: Alexandra Zelenin, Alexandra Zhuravlyova
Published: 2026-03-23
arXiv: 2603.22276

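Abstract (translated from the Chinese summary)

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, and every major framework we surveyed computes it by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, the norm for a single module requires about 512 MB of transient working memory in bf16, which makes high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms that can be computed through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, cutting memory traffic by about 4x, and use a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. On three NVIDIA GPUs (RTX 6000 PRO, H200, B200), across six 8-32B vision-language models (VLMs) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation.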
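To make the factored norm concrete, here is a minimal sketch of the algebra the abstract describes. It is not the authors' implementation. It assumes B has shape [d_out, r] and A has shape [r, d_in] (so that BA is [d_out, d_in], as stated above), and the function names are hypothetical. For row i, the squared norm expands as ||W_i + s B_i A||^2 = ||W_i||^2 + 2s (W A^T)_i · B_i + s^2 (B G)_i · B_i with G = A A^T, so only [d_out, r] and [r, r] intermediates are needed. Plain Python lists are used to keep the example dependency-free; a real implementation would use a tensor library.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product of X [n, k] and Y [k, m]."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*X)]


def dense_row_norms(W, B, A, s):
    """Reference path: materializes the dense [d_out, d_in] product BA."""
    BA = matmul(B, A)
    return [
        math.sqrt(sum((w + s * d) ** 2 for w, d in zip(w_row, ba_row)))
        for w_row, ba_row in zip(W, BA)
    ]


def factored_row_norms(W, B, A, s):
    """Same row norms from base, cross, and Gram terms.

    No [d_out, d_in] temporary is created beyond W itself; the
    intermediates are W A^T and B G ([d_out, r]) and G ([r, r]).
    Shapes of B and A are an assumption, see the lead-in.
    """
    At = transpose(A)      # [d_in, r]
    WAt = matmul(W, At)    # [d_out, r], feeds the cross term
    G = matmul(A, At)      # [r, r] Gram matrix A A^T
    BG = matmul(B, G)      # [d_out, r], feeds the Gram term
    norms = []
    for w_row, b_row, wat_row, bg_row in zip(W, B, WAt, BG):
        base = sum(w * w for w in w_row)
        cross = sum(x * b for x, b in zip(wat_row, b_row))
        gram = sum(x * b for x, b in zip(bg_row, b_row))
        sq = base + 2.0 * s * cross + s * s * gram
        # Clamp guards against tiny negative values from rounding.
        norms.append(math.sqrt(max(sq, 0.0)))
    return norms
```

Both functions should agree up to floating-point rounding. The factored form trades one dense [d_out, d_in] temporary for three small products, which is where the O(d_out r + r^2) intermediate footprint comes from.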

Original Abstract

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition ...

Automatically collected on 2026-03-25

#paper #arXiv #ML #小凯
