add zero3 module_granularity_threshold to zero optimization. #6649
base: master
Conversation
@inkcherry there are some errors in the CI workflows. Are they related to your change?
@delock - there were issues with the nv-accelerate and nv-torch workflows, but both of those should be resolved now.
@inkcherry this PR looks very promising. On which model did you benchmark the performance?
@nelyahu The model I'm testing has 64 experts per MoE layer, with each expert containing 3 linear layers. Including the non-expert parameters, each MoE layer consists of 197 parameter tensors (all weights, no biases). There are 48 layers in total. I think it might be similar in style to the open-source model.
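For context, a minimal sketch of what such a fine-grained layer could look like. The class name, dimensions, and layout below are illustrative assumptions, not the benchmarked model:

```python
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative only: 64 experts x 3 bias-free linears = 192 expert weight
    tensors per layer, plus a gate and a few other non-expert parameters."""

    def __init__(self, hidden=1024, ffn=2048, num_experts=64):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, ffn, bias=False),
                nn.Linear(ffn, ffn, bias=False),
                nn.Linear(ffn, hidden, bias=False),
            )
            for _ in range(num_experts)
        )
```

A model built from many such small submodules is what the discussion below calls "fine-grained": a large number of tiny parameter tensors, each of which would otherwise get its own fetch hook.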
@inkcherry, thanks for this PR. Can you clarify the difference between coalesced params and the leaf modules? I notice that this implementation relies on the leaf-modules code.
thanks for the review @tjruwase. I found that it is also helpful on GPU (although the gain is not as obvious as on HPU) in such cases, so I think it is suitable to add it to the comm-optimization config, and I renamed it, because personally I think z3_leaf_module is more suitable as an attribute or API name. The reduce-hook-overhead scenario is one use case of z3_leaf_module (another case seems aimed at fixing the issue where prefetch cannot accurately predict the parameters used, since the parameters in the model's forward pass may differ from those in the trace). Adding an independent switch might also facilitate conditional operations in the future.
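For comparison, the existing leaf-module API referenced in this exchange can be used roughly as follows. This is a hedged sketch: `ToyMoELayer` is the hypothetical class from the earlier snippet, and the call assumes the `set_z3_leaf_modules` helper in `deepspeed.utils`:

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules

# ToyMoELayer is the hypothetical MoE layer from the sketch above.
model = nn.Sequential(*(ToyMoELayer() for _ in range(4)))

# Mark every MoE layer as a ZeRO-3 leaf: its parameters are fetched as one
# unit, and no per-submodule hooks are installed beneath it.
set_z3_leaf_modules(model, [ToyMoELayer])
```

The config option in this PR aims at the same hook-overhead scenario, but lets ZeRO-3 pick the modules automatically instead of requiring the user to name classes.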
@inkcherry, thanks for the explanation. I agree that avoiding the recursive fetch overhead is worthwhile. I will be glad to hear your thoughts. Also, can you please share some unit tests to demonstrate usage?
@tjruwase, thank you for your suggestions. Yes, I agree with your concerns. Initially I used the config because I felt this API was difficult for users to be aware of (unless they encountered a related issue and searched the issue tracker), or they might recognize the API but be unable to determine its performance impact (compared with the other fetch-related optimization options in the config, such as overlap_comm, bucket_size, etc.). I discussed this with @delock, and I changed it to an int variable that represents the size of the model granularity:
"stage3_coalesced_fetch_threshold": stage3_coalesced_fetch_threshold | ||
} | ||
} | ||
if get_accelerator().is_fp16_supported(): |
Please use `preferred_dtype()` instead of `is_*_supported()` to ensure consistency with line 231.
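A minimal sketch of the suggested change, assuming `get_accelerator().preferred_dtype()` returns a `torch.dtype` and that `config_dict` is the UT config being built above:

```python
import torch
from deepspeed.accelerator import get_accelerator

config_dict = {"train_micro_batch_size_per_gpu": 1}  # rest of the UT config elided

# Choose the half-precision section from the accelerator's preferred dtype
# rather than probing per-dtype support.
if get_accelerator().preferred_dtype() == torch.float16:
    config_dict["fp16"] = {"enabled": True}
elif get_accelerator().preferred_dtype() == torch.bfloat16:
    config_dict["bf16"] = {"enabled": True}
```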
fixed
```python
def test_finegrained_optimization(self):
    hidden_dim = 128
    num_block = 16
    stage3_coalesced_fetch_threshold = 12000
```
This UT should have coverage of different (e.g., corner-case) values of `stage3_coalesced_fetch_threshold`, instead of a fixed value.
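A hedged sketch of what that coverage could look like; the corner-case values below are illustrative, not necessarily the four the author added:

```python
import pytest

@pytest.mark.parametrize("threshold", [0, 1, 12000, 10_000_000])
def test_finegrained_optimization(threshold):
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "stage3_coalesced_fetch_threshold": threshold,
        },
    }
    ...  # build the fine-grained model and run a few steps, as in the UT above
```

Zero disables the feature entirely, a tiny threshold should classify almost nothing as fine-grained, and a huge one should coalesce everything, so the three regimes are each exercised.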
added 4 examples.
deepspeed/runtime/zero/config.py (outdated)
```diff
@@ -21,6 +21,7 @@
     "stage3_max_live_parameters" : 1000000000,
     "stage3_max_reuse_distance" : 1000000000,
     "stage3_use_all_reduce_for_fetch_params": [true|false],
+    "stage3_coalesced_fetch_threshold": 0,
```
Add documentation of this here: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
added.
deepspeed/runtime/zero/config.py (outdated)
```diff
@@ -245,6 +246,15 @@ class DeepSpeedZeroConfig(DeepSpeedConfigModel):
     this option is enabled and then saves the fp16 model weights.
     """

+    coalesced_fetch_threshold: int = Field(pp_int(0), alias="stage3_coalesced_fetch_threshold")
+    """
+    The ratio of a module's number of parameters/required forward passes (layers)
```
The description of this option is unclear to me. It seems to be related to a new concept of module granularity. I notice that the UT uses `fine-grained` vs `coarse-grained`, which I could understand, but those terms are not used in this description. Based on my understanding of this feature, I recommend rewriting the description to emphasize the following points:

- Granularity of a module is computed recursively as the ratio of `parameter_count` / (1 + `descendant count`); see the sketch after this list.
- ZeRO3 classifies each module as `fine-grained` vs `coarse-grained` by comparing its granularity value to the `coalesced_fetch_threshold` value. `Fine-grained` modules are treated as integral units (along with their descendants) for parameter-fetching purposes.

Also, I recommend renaming `coalesced_fetch_threshold` to `module_granularity_threshold` to indicate that this is how users can control the classification of `fine-grained` vs `coarse-grained`. Please let me know what I am getting wrong. Thanks!
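A minimal sketch of the classification rule described above; the helper names are illustrative assumptions, not DeepSpeed's internal implementation:

```python
import torch.nn as nn

def module_granularity(module: nn.Module) -> float:
    """parameter_count / (1 + descendant count), over the module's subtree."""
    parameter_count = sum(p.numel() for p in module.parameters())
    descendant_count = sum(1 for _ in module.modules()) - 1  # exclude module itself
    return parameter_count / (1 + descendant_count)

def is_fine_grained(module: nn.Module, module_granularity_threshold: float) -> bool:
    # Fine-grained modules (granularity below the threshold) are fetched as
    # integral units together with all of their descendants.
    return module_granularity(module) <= module_granularity_threshold
```

Intuitively, a module with many small submodules (an MoE layer full of tiny linears) has a low granularity value, while a module whose parameters are concentrated in a few large tensors has a high one.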
Thanks for your valuable suggestion! I’ve implemented the changes.
add zero3 module_granularity_threshold to zero optimization.
This PR adds Z3 coalesced fetch to ZeRO optimization. Currently some of the existing logic can be reused, but it is difficult for users to discover it as an optimization choice (I only found this logic while trying to implement the feature).
The benefit of this approach is reducing host overhead (many hooks are removed) during the process of recursively fetching parameters, especially in fine-grained models such as those with a large number of MoE experts. This is particularly helpful for host-sensitive devices (such as HPU), where it achieved a 40% performance improvement in our customer workloads.
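A hedged usage sketch with the renamed option; the threshold value is illustrative and model-dependent, and `model` is assumed to be an already-built module:

```python
import deepspeed

# model: any already-built nn.Module with many fine-grained submodules.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # modules whose granularity falls below this value are fetched as a
        # single coalesced unit together with their descendants
        "stage3_module_granularity_threshold": 12000,
    },
}
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```

Setting the threshold to 0 (the default shown in the diff above) leaves the existing per-module fetch behavior unchanged.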
FYI @delock @deepcharm