Support MoE for pipeline models #5338
Conversation
Please note that Megatron-DeepSpeed PR #373 (https://github.com/microsoft/Megatron-DeepSpeed/pull/373) depends on this PR.
Currently, MoE uses Megatron-DeepSpeed APIs to get tensor-parallel info (rank, world_size, group). To enable MoE for PipelineModule, switch to backward-compatible methods that can use either the current Megatron APIs, the DeepSpeed Topology, or the old Megatron APIs. Since MoE is not part of the DeepSpeed runtime, move the backward-compatible methods to deepspeed.utils and modify imports as required. Signed-off-by: Moshe Island <[email protected]>
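For context, a minimal sketch of what such a backward-compatible accessor can look like. The helper name and the exact attribute checks are illustrative assumptions, not necessarily the code this PR adds to deepspeed.utils:

```python
def bwc_tensor_model_parallel_world_size(mpu=None):
    """Return the tensor-model-parallel world size across old/new APIs (sketch)."""
    if mpu is None:
        # No model-parallel unit provided: no tensor parallelism.
        return 1
    if hasattr(mpu, 'get_tensor_model_parallel_world_size'):
        # Current Megatron / Megatron-DeepSpeed naming.
        return mpu.get_tensor_model_parallel_world_size()
    if hasattr(mpu, 'get_slice_parallel_world_size'):
        # DeepSpeed topology-style naming (assumed fallback for this sketch).
        return mpu.get_slice_parallel_world_size()
    # Legacy Megatron naming.
    return mpu.get_model_parallel_world_size()
```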
Currently, only "total_loss" is displayed. If model has additional losses (e.g. MoE loss), display them as well. Similar to "total loss", additional losses are displayed for the full batch after mean reduced across DP ranks. Signed-off-by: Moshe Island <[email protected]>
Currently, when using no-drop tokens, we compute the capacity locally and then all-reduce (op=max) over the world group. This fails when using pipeline parallelism (with micro-batches), since workers on different stages are handling different model layers (or, during warmup, first-stage workers are processing while last-stage workers are idle). Fix this by running the all-reduce over the expert group. Signed-off-by: Moshe Island <[email protected]>
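A minimal sketch of the fixed collective, assuming `expert_group` is the expert-parallel process group (the function name is illustrative):

```python
import torch
import torch.distributed as dist

def no_drop_capacity(tokens_per_expert, expert_group):
    # Local capacity: the busiest expert on this worker.
    capacity = torch.max(tokens_per_expert).to(torch.int64)
    # Take the max over the expert-parallel group (the fix), not the world
    # group, so workers on other pipeline stages never join this collective.
    dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_group)
    return capacity
```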
This commit enhances expert group creation for both modes:
- DP + PP + EP
- DP + TP + PP + EP

Signed-off-by: Moshe Island <[email protected]>
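For reference, a simplified sketch of how expert-parallel groups can be formed when DP, TP and PP coexist. The rank ordering (TP fastest, then DP, then PP) and the helper name are assumptions for this sketch, not the exact layout used by deepspeed.utils.groups:

```python
import torch.distributed as dist

def create_expert_parallel_groups(world_size, tp_size, pp_size, ep_size):
    """Build expert-parallel groups that vary only along the DP dimension."""
    dp_size = world_size // (tp_size * pp_size)
    assert dp_size % ep_size == 0, "expert parallel size must divide DP size"
    ep_groups = []
    for pp in range(pp_size):
        for dp_chunk in range(dp_size // ep_size):
            for tp in range(tp_size):
                # Ranks sharing the same PP stage and TP slice, spanning
                # ep_size consecutive DP replicas.
                ranks = [
                    pp * dp_size * tp_size
                    + (dp_chunk * ep_size + ep) * tp_size
                    + tp
                    for ep in range(ep_size)
                ]
                # new_group is collective: every rank creates every group.
                ep_groups.append(dist.new_group(ranks=ranks))
    return ep_groups
```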
When using MoE with MoE-TP disabled, use the pipeline parallel group to max- or sum-reduce MoE gradients. This also fixes the behavior for the following configuration: no pipeline, TP enabled, MoE-TP disabled. Signed-off-by: Moshe Island <[email protected]>
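A hedged sketch of the group selection described above; the groups are passed in explicitly to keep the example self-contained, whereas the real code obtains them from the runtime:

```python
import torch.distributed as dist

def reduce_moe_grads(moe_grads, enable_expert_tp, tp_group, pp_group,
                     op=dist.ReduceOp.SUM):
    # When MoE tensor parallelism is disabled, reduce MoE gradients over the
    # pipeline-parallel group; otherwise keep using the tensor-parallel group.
    group = tp_group if enable_expert_tp else pp_group
    for grad in moe_grads:
        dist.all_reduce(grad, op=op, group=group)
```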
This is incredibly great work. Thank you for the amazing contribution!
I left a comment about a very small part but have already approved this PR.
Done
This PR enhances DeepSpeed to support MoE for pipeline models (e.g. GPTModelPipe from Megatron-DeepSpeed).

Main changes:
- Enhance expert group creation for pipeline models (both flavors: DP/PP/EP and DP/TP/PP/EP).
- Fix MoE save/load checkpoint for PipelineModule-based models (see the sketch below).
- Display MoE loss for PipelineModule-based models.
- Support gradient reduction with BF16_Optimizer for PipelineModule. Note that the same commit also fixes a gradient reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer, even for a dense (no-MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) over the expert parallel group instead of the world group.

---------

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>
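As referenced in the checkpoint bullet above, a generic illustration of the idea behind per-rank MoE checkpoints for a pipeline model: each expert-parallel rank saves only the experts it owns, tagged by pipeline stage and expert-parallel rank. The file naming and function are hypothetical, not DeepSpeed's actual checkpoint layout:

```python
import os
import torch

def save_moe_checkpoint(save_dir, expert_state_dict, pp_rank, ep_rank):
    """Save this rank's expert parameters only; dense parameters are saved
    per pipeline stage through the usual PipelineModule path."""
    os.makedirs(save_dir, exist_ok=True)
    fname = f"expert_pp{pp_rank:02d}_ep{ep_rank:02d}_model_states.pt"
    torch.save(expert_state_dict, os.path.join(save_dir, fname))
```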