add zero3 module_granularity_threshold to zero optimization. #6649
base: master
Conversation
@inkcherry there are some errors in the CI workflows. Are they related to your change?
@delock - there were issues with the nv-accelerate and nv-torch workflows, but both of those should be resolved now.
@inkcherry this PR looks very promising. On which model did you benchmark the performance?
@nelyahu The model I'm testing has 64 experts per MoE layer, with each expert containing 3 linear layers. Including the non-expert parameters, each MoE layer consists of 197 parameter tensors (all weights, no biases). There are 48 layers in total. I think it might be similar in style to the open-source model.
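For context, a minimal sketch of what such a fine-grained layer could look like. The class name, dimensions, and layout below are illustrative assumptions, not the benchmarked model:

```python
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative only: 64 experts x 3 bias-free linears = 192 expert weight
    tensors per layer, plus a gate and a few other non-expert parameters."""

    def __init__(self, hidden=1024, ffn=2048, num_experts=64):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, ffn, bias=False),
                nn.Linear(ffn, ffn, bias=False),
                nn.Linear(ffn, hidden, bias=False),
            )
            for _ in range(num_experts)
        )
```

A model built from many such small submodules is what the discussion below calls "fine-grained": a large number of tiny parameter tensors, each of which would otherwise get its own fetch hook.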
@inkcherry, thanks for this PR. Can you clarify the difference between coalesced params and the leaf modules? I notice that this implementation relies on the leaf-modules code.
thanks for the review @tjruwase. I found that it is also helpful on GPU (although the gain is not as obvious as on HPU) in such cases, so I think it is suitable to add it to the comm-optimization config, and I renamed it, because personally I think z3_leaf_module is more suitable as an attribute or API name. The reduce-hook-overhead scenario is one use case of z3_leaf_module (another case seems aimed at fixing the issue where prefetch cannot accurately predict the parameters used, since the parameters in the model's forward pass may differ from those in the trace). Adding an independent switch might also facilitate conditional operations in the future.
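For comparison, the existing leaf-module API referenced in this exchange can be used roughly as follows. This is a hedged sketch: `ToyMoELayer` is the hypothetical class from the earlier snippet, and the call assumes the `set_z3_leaf_modules` helper in `deepspeed.utils`:

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules

# ToyMoELayer is the hypothetical MoE layer from the sketch above.
model = nn.Sequential(*(ToyMoELayer() for _ in range(4)))

# Mark every MoE layer as a ZeRO-3 leaf: its parameters are fetched as one
# unit, and no per-submodule hooks are installed beneath it.
set_z3_leaf_modules(model, [ToyMoELayer])
```

The config option in this PR aims at the same hook-overhead scenario, but lets ZeRO-3 pick the modules automatically instead of requiring the user to name classes.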
@inkcherry, thanks for the explanation. I agree that avoiding the recursive fetch overhead is worthwhile. I will be glad to hear your thoughts. Also, can you please share some unit tests to demonstrate usage?
@tjruwase, thank you for your suggestions. Yes, I agree with your concerns. Initially I used the config because I felt this API was difficult for users to be aware of (unless they encountered a related issue and searched the issue tracker), or they might recognize the API but be unable to determine its performance impact (compared with the other fetch-related optimization options in the config, such as overlap_comm, bucket_size, etc.). I discussed this with @delock, and I changed it to an int variable that represents the size of the model granularity:
"stage3_coalesced_fetch_threshold": stage3_coalesced_fetch_threshold | ||
} | ||
} | ||
if get_accelerator().is_fp16_supported(): |
Please use `preferred_dtype()` instead of `is_*_supported()` to ensure consistency with line 231.
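A minimal sketch of the suggested change, assuming `get_accelerator().preferred_dtype()` returns a `torch.dtype` and that `config_dict` is the UT config being built above:

```python
import torch
from deepspeed.accelerator import get_accelerator

config_dict = {"train_micro_batch_size_per_gpu": 1}  # rest of the UT config elided

# Choose the half-precision section from the accelerator's preferred dtype
# rather than probing per-dtype support.
if get_accelerator().preferred_dtype() == torch.float16:
    config_dict["fp16"] = {"enabled": True}
elif get_accelerator().preferred_dtype() == torch.bfloat16:
    config_dict["bf16"] = {"enabled": True}
```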
fixed
```python
def test_finegrained_optimization(self):
    hidden_dim = 128
    num_block = 16
    stage3_coalesced_fetch_threshold = 12000
```
This UT should have coverage of different (e.g., corner-case) values of `stage3_coalesced_fetch_threshold`, instead of a fixed value.
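A hedged sketch of what that coverage could look like; the corner-case values below are illustrative, not necessarily the four the author added:

```python
import pytest

@pytest.mark.parametrize("threshold", [0, 1, 12000, 10_000_000])
def test_finegrained_optimization(threshold):
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "stage3_coalesced_fetch_threshold": threshold,
        },
    }
    ...  # build the fine-grained model and run a few steps, as in the UT above
```

Zero disables the feature entirely, a tiny threshold should classify almost nothing as fine-grained, and a huge one should coalesce everything, so the three regimes are each exercised.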
added 4 examples.
deepspeed/runtime/zero/config.py (outdated)
```diff
@@ -21,6 +21,7 @@
     "stage3_max_live_parameters" : 1000000000,
     "stage3_max_reuse_distance" : 1000000000,
     "stage3_use_all_reduce_for_fetch_params": [true|false],
+    "stage3_coalesced_fetch_threshold": 0,
```
Add documentation of this here: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
added.
deepspeed/runtime/zero/config.py (outdated)
```diff
@@ -245,6 +246,15 @@ class DeepSpeedZeroConfig(DeepSpeedConfigModel):
     this option is enabled and then saves the fp16 model weights.
     """

+    coalesced_fetch_threshold: int = Field(pp_int(0), alias="stage3_coalesced_fetch_threshold")
+    """
+    The ratio of a module's number of parameters/required forward passes (layers)
```
The description of this option is unclear to me. It seems to be related to a new concept of module granularity. I notice that the UT uses `fine-grained` vs `coarse-grained`, which I could understand, but those terms are not used in this description. Based on my understanding of this feature, I recommend rewriting the description to emphasize the following points:

- Granularity of a module is computed recursively as the ratio of `parameter_count` / (1 + `descendant count`); see the sketch after this list.
- ZeRO3 classifies each module as `fine-grained` vs `coarse-grained` by comparing its granularity value to the `coalesced_fetch_threshold` value. `Fine-grained` modules are treated as integral units (along with their descendants) for parameter-fetching purposes.

Also, I recommend renaming `coalesced_fetch_threshold` to `module_granularity_threshold` to indicate that this is how users can control the classification of `fine-grained` vs `coarse-grained`. Please let me know what I am getting wrong. Thanks!
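A minimal sketch of the classification rule described above; the helper names are illustrative assumptions, not DeepSpeed's internal implementation:

```python
import torch.nn as nn

def module_granularity(module: nn.Module) -> float:
    """parameter_count / (1 + descendant count), over the module's subtree."""
    parameter_count = sum(p.numel() for p in module.parameters())
    descendant_count = sum(1 for _ in module.modules()) - 1  # exclude module itself
    return parameter_count / (1 + descendant_count)

def is_fine_grained(module: nn.Module, module_granularity_threshold: float) -> bool:
    # Fine-grained modules (granularity below the threshold) are fetched as
    # integral units together with all of their descendants.
    return module_granularity(module) <= module_granularity_threshold
```

Intuitively, a module with many small submodules (an MoE layer full of tiny linears) has a low granularity value, while a module whose parameters are concentrated in a few large tensors has a high one.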
Thanks for your valuable suggestion! I’ve implemented the changes.
add zero3 module_granularity_threshold to zero optimization.
This PR adds Z3 coalesced fetch to ZeRO optimization. Currently some of the existing logic can be reused, but it is difficult for users to discover it as an optimization choice (I only found this logic while trying to implement the feature).
The benefit of this approach is reducing host overhead (many hooks are removed) during the process of recursively fetching parameters, especially in fine-grained models such as those with a large number of MoE experts. This is particularly helpful for host-sensitive devices (such as HPU), where it achieved a 40% performance improvement in our customer workloads.
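A hedged usage sketch with the renamed option; the threshold value is illustrative and model-dependent, and `model` is assumed to be an already-built module:

```python
import deepspeed

# model: any already-built nn.Module with many fine-grained submodules.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # modules whose granularity falls below this value are fetched as a
        # single coalesced unit together with their descendants
        "stage3_module_granularity_threshold": 12000,
    },
}
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```

Setting the threshold to 0 (the default shown in the diff above) leaves the existing per-module fetch behavior unchanged.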
FYI @delock @deepcharm