Release overlap_comm & contiguous_gradients restrictions for ZeRO 1 #4887
The `overlap_comm` and `contiguous_gradients` options have been ignored in ZeRO stage 1 since #1246. At that time, ZeRO 1 and 2 were implemented separately (see https://github.com/microsoft/DeepSpeed/tree/6ae756c03f12674f17aef90622e7664a8af9d2af/deepspeed/runtime/zero). ZeRO 1 did not register gradient hooks to overlap the backward pass with gradient all-reduce, so ignoring `overlap_comm` and `contiguous_gradients` was fine. In the current implementation, however, ZeRO 1 and 2 share almost the same code (`stage_1_and_2.py`), so features like `overlap_comm` and `contiguous_gradients` can also be enabled for ZeRO 1 (please correct me if I made a mistake).

With this PR, turning on `overlap_comm` and `contiguous_gradients` for ZeRO 1 on the SFT task produces exactly the same training curve as the latest master. I also see a ~1.05x e2e speedup from overlapping backward and gradient all-reduce. The trace confirms that backward and all-reduce do overlap and that the separate gradients are indeed copied to a flat buffer, so these options are also effective for ZeRO 1.
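For reference, a minimal sketch of a DeepSpeed config that turns both options on for ZeRO stage 1. The batch size value and the `zero_flags` helper are placeholders chosen for illustration, not taken from the SFT run described above; only the `stage`, `overlap_comm`, and `contiguous_gradients` keys relate to this PR.

```python
import json

# Illustrative config only: the batch size is an assumed placeholder value.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 1,
        # Overlap gradient all-reduce with the backward pass.
        "overlap_comm": True,
        # Copy gradients into a contiguous flat buffer as they are produced.
        "contiguous_gradients": True,
    },
}


def zero_flags(config):
    """Hypothetical helper: return (stage, overlap_comm, contiguous_gradients)."""
    zero = config.get("zero_optimization", {})
    return (
        zero.get("stage", 0),
        zero.get("overlap_comm", False),
        zero.get("contiguous_gradients", False),
    )


if __name__ == "__main__":
    # Print the JSON form, which could be saved and passed as a DeepSpeed config file.
    print(json.dumps(ds_config, indent=2))
```

The same dictionary can be written to a JSON file or passed directly as the config when initializing DeepSpeed.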
Related issue: #2295