
Release overlap_comm & contiguous_gradients restrictions for ZeRO 1 #4887

Merged: 2 commits into microsoft:master from li-plus:opt-zero1 on Jan 5, 2024

Conversation

li-plus (Contributor) commented on Dec 30, 2023

The `overlap_comm` and `contiguous_gradients` options have been ignored in ZeRO stage 1 since #1246. At that time, ZeRO 1 and ZeRO 2 were implemented separately (see https://github.com/microsoft/DeepSpeed/tree/6ae756c03f12674f17aef90622e7664a8af9d2af/deepspeed/runtime/zero), and ZeRO 1 did not register gradient hooks to overlap the backward pass with gradient all-reduce, so ignoring `overlap_comm` and `contiguous_gradients` was fine. In the current implementation, however, ZeRO 1 and 2 share almost the same code (`stage_1_and_2.py`), so features like `overlap_comm` and `contiguous_gradients` can also be enabled for ZeRO 1 (please correct me if I made a mistake).
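For context, here is a minimal sketch of the kind of configuration this change makes usable for stage 1. `overlap_comm` and `contiguous_gradients` are the `zero_optimization` keys discussed in this PR; the batch size, optimizer settings, and toy model below are placeholders, and the script is assumed to be launched with the `deepspeed` launcher so the distributed environment is already set up.

```python
import torch
import deepspeed

ds_config = {
    "train_batch_size": 32,  # placeholder
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},  # placeholder
    "zero_optimization": {
        "stage": 1,                    # ZeRO-1, which previously ignored the two flags below
        "overlap_comm": True,          # overlap gradient all-reduce with backward
        "contiguous_gradients": True,  # copy gradients into a flat buffer
    },
}

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```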

With this PR, turning on `overlap_comm` and `contiguous_gradients` for ZeRO 1 on the [SFT task](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning) produces exactly the same training curve as the latest master.

![image](https://github.com/microsoft/DeepSpeed/assets/39846316/bda3be7b-c236-4e08-b687-b3cd01f5cc73)

I also see a ~1.05x end-to-end speedup from overlapping backward with the gradient all-reduce. The trace confirms that backward and all-reduce do overlap, and that the individual gradients are indeed copied into a flat buffer, so these options are effective for ZeRO 1 as well.

![image](https://github.com/microsoft/DeepSpeed/assets/39846316/5f876296-e1b4-404b-8b33-03cee8e5e6b2)

![image](https://github.com/microsoft/DeepSpeed/assets/39846316/9654f6be-5c7a-401a-b0bc-413ecd3f4e6b)
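For readers unfamiliar with the mechanism being measured above, the following is a minimal, illustrative sketch in plain PyTorch, not DeepSpeed's actual `stage_1_and_2.py` code: per-parameter backward hooks copy each gradient into its slice of one flat buffer and launch an asynchronous all-reduce, so communication runs while backward continues for earlier layers. It assumes `torch.distributed` is already initialized and omits the bucketing and gradient averaging that the real implementation performs.

```python
import torch
import torch.distributed as dist

def attach_overlap_hooks(model: torch.nn.Module):
    params = [p for p in model.parameters() if p.requires_grad]
    # One contiguous buffer holding all gradients back to back ("contiguous_gradients").
    flat = torch.zeros(sum(p.numel() for p in params),
                       dtype=params[0].dtype, device=params[0].device)
    offsets, handles, off = {}, [], 0
    for p in params:
        offsets[p] = off
        off += p.numel()

    def make_hook(p):
        def hook(grad):
            # Copy this parameter's gradient into its slice of the flat buffer.
            dst = flat.narrow(0, offsets[p], p.numel())
            dst.copy_(grad.reshape(-1))
            # Launch the all-reduce without waiting ("overlap_comm"): backward for
            # earlier layers keeps running while communication is in flight.
            handles.append(dist.all_reduce(dst, async_op=True))
        return hook

    for p in params:
        p.register_hook(make_hook(p))

    def wait_for_all_reduces():
        for h in handles:
            h.wait()
        handles.clear()

    return flat, wait_for_all_reduces
```

A training loop using such a helper would call `wait_for_all_reduces()` after `loss.backward()` and before the optimizer step; at that point the flat buffer holds the summed (not yet averaged) gradients.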

Related issue: #2295

tjruwase (Contributor) commented on Jan 5, 2024

@li-plus, thanks so much for this PR. Your analysis of the problem is accurate. Previously, we did not have bandwidth to evaluate the correctness of enabling those optimizations for ZeRO-1. This is a great contribution. Thanks so much.

tjruwase added this pull request to the merge queue on Jan 5, 2024
Merged via the queue into microsoft:master with commit af03383 on Jan 5, 2024
14 checks passed
li-plus deleted the opt-zero1 branch on Jan 6, 2024
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024