[Paper] A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtim...

Paper Overview

Field: ML; Authors: Ruitao Liu, Xinyang Tian, Shuo Chen; Published: 2026-05-19; arXiv: 2505.14309

Abstract (translated from the Chinese summary)

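Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, a stage may wait for work that is not yet ready even though other executable work is available, which causes stage misalignment, idle bubbles, and reduced utilization. This paper presents RRFP (Runtime-Readiness-First Pipeline), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order used to rank the work that is currently ready. To support this mode, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination that keeps collectives consistent, and ready-set arbitration for low-overhead scheduling. The authors implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads on up to 128 GPUs. RRFP outperforms fixed-order pipeline baselines in all settings. With BFW hints, RRFP achieves up to 1.77x speedup on language-only workloads and up to 2.77x speedup on multimodal workloads. In a cross-framework comparison, RRFP with the default BF hint outperforms the fastest external system by 1.84x while preserving training correctness.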
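The difference between following a schedule as a fixed order and using it only as a hint can be illustrated with a toy single-stage simulation. This is a minimal sketch and not the paper's implementation: the `Task` type, the `ready_at` arrival times, and both function names are hypothetical. It models readiness as a single arrival time per task and ignores tensor-parallel coordination and asynchronous messaging entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    ready_at: float  # hypothetical time at which the task's inputs arrive
    duration: float  # hypothetical compute time on this stage


def run_fixed_order(hint, tasks):
    """Execute strictly in hint order, idling until each task is ready."""
    now, order = 0.0, []
    for name in hint:
        task = tasks[name]
        now = max(now, task.ready_at) + task.duration
        order.append(name)
    return order, now


def run_readiness_first(hint, tasks):
    """Among tasks that are ready now, run the one ranked earliest by the hint."""
    rank = {name: i for i, name in enumerate(hint)}
    pending = set(hint)
    now, order = 0.0, []
    while pending:
        ready = [n for n in pending if tasks[n].ready_at <= now]
        if not ready:
            # Nothing is executable: idle only until the next arrival.
            now = min(tasks[n].ready_at for n in pending)
            continue
        chosen = min(ready, key=rank.__getitem__)
        pending.remove(chosen)
        order.append(chosen)
        now += tasks[chosen].duration
    return order, now


# Toy workload: the input of F1 is delayed, while B0 is ready early.
tasks = {
    "F0": Task("F0", ready_at=0.0, duration=1.0),
    "F1": Task("F1", ready_at=5.0, duration=1.0),
    "B0": Task("B0", ready_at=1.0, duration=2.0),
}
hint = ["F0", "F1", "B0"]

fixed_order, fixed_makespan = run_fixed_order(hint, tasks)
ready_order, ready_makespan = run_readiness_first(hint, tasks)
print(fixed_order, fixed_makespan)  # ['F0', 'F1', 'B0'] 8.0
print(ready_order, ready_makespan)  # ['F0', 'B0', 'F1'] 6.0
```

In this toy case the fixed-order stage idles from time 1 to time 5 waiting for F1, while the readiness-first stage fills that gap with B0. When every task is ready on time, both functions produce the same order, which is the sense in which the hint still guides execution.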

Original Abstract

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint ord...

--- *Automatically collected on 2026-05-20*

#paper #arXiv #ML #小凯
