Recipe changes for performance #11763
base: main
Conversation
Signed-off-by: Guyue Huang <[email protected]>
Conflicts: nemo/lightning/run/plugins.py
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
…ability
Signed-off-by: Guyue Huang <[email protected]>
…ections
Signed-off-by: Guyue Huang <[email protected]>
Conflicts: nemo/lightning/run/plugins.py
Signed-off-by: guyueh1 <[email protected]>
LGTM
…e and use default" This reverts commit 2b8114b.
Signed-off-by: Guyue Huang <[email protected]>
@@ -168,3 +182,17 @@ class TransformerLayerTPOverlapCfg:
    proj_fprop=PipelineOverlapCfg(num_sm=24, cga_size=2, num_splits=4, set_sm_margin=True, fp8_buf=True),
    fc2_fprop=RingExchangeOverlapCfg(num_sm=1, set_sm_margin=True),
)

# Nemotron 340B
userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096 = TransformerLayerTPOverlapCfg(
Is this an overlap config for Hopper or Blackwell?
nemo/lightning/run/plugins.py (outdated)
if tp_size > 1 or cp_size > 1:
    executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
@erhoo82 This method won't work because it's run on the cluster frontend node, not after the Slurm allocation. We need to find another way.
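One possible direction (a sketch under stated assumptions, not the PR's final fix): detect the GPU architecture inside the allocated job itself, before torch initializes a CUDA context. NVML can query the compute capability without creating a CUDA context, so the env var still takes effect:

```python
import os

import pynvml  # NVML bindings; queries the GPU without creating a CUDA context

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
major, _minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
pynvml.nvmlShutdown()

# Must run before any CUDA context exists, since CUDA_DEVICE_MAX_CONNECTIONS
# is read at context creation time. The tp/cp condition from the diff would
# wrap this in the real code.
if major <= 9:  # Hopper and older
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```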
Signed-off-by: Guyue Huang <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: guyueh1 <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
…nce scripts in a separate PR
Signed-off-by: Guyue Huang <[email protected]>
This PR is ready, @erhoo82 please review.
@erhoo82 this PR is ready and it's needed by 25.02; let's review and merge it.
    os.environ.pop('CUDA_DEVICE_MAX_CONNECTIONS')
else:
    if tp_size > 1 or cp_size > 1:
        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = "1"
It could also be good to add a docstring for this condition:
Set the device connections to 1 to enforce that the kernel queuing order from the host matches the execution order on the GPU. This is needed to schedule a communication kernel before the overlapping persistent GEMM kernel; otherwise, the communication kernel is pushed to the end of the GEMM kernel, failing to overlap the kernels.
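For illustration, a minimal sketch of how that explanation could sit next to the condition (the helper name and surrounding structure are assumptions, not the PR's actual code):

```python
import os

def _set_device_max_connections(tp_size: int, cp_size: int, major: int) -> None:
    # Hypothetical helper mirroring the plugin logic in this diff.
    if major > 9:
        # Newer architectures do not need the cap; remove it if present.
        os.environ.pop("CUDA_DEVICE_MAX_CONNECTIONS", None)
    elif tp_size > 1 or cp_size > 1:
        # Set the device connections to 1 to enforce that the kernel queuing
        # order from the host matches the execution order on the GPU. This is
        # needed to schedule a communication kernel before the overlapping
        # persistent GEMM kernel; otherwise the communication kernel is pushed
        # to the end of the GEMM kernel, failing to overlap the two.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```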
if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # Default is 8, but for this case, we need extra connections
        # to avoid serialization of streams
Suggested change:
"Default is 8, but for this case, we need extra connections to avoid serialization of streams"
to
"We need extra connections to avoid serialization of streams, so we use the max connections of 32 instead of the default device connection of 8."
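Applied to the branch above, the revised comment might read as follows (a sketch; the value 32 follows the reviewer's wording, not a confirmed line from the PR):

```python
if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # We need extra connections to avoid serialization of streams, so we
        # use the max connections of 32 instead of the default device
        # connection of 8.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"
```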
Signed-off-by: Guyue Huang <[email protected]>
LGTM
Signed-off-by: Guyue Huang <[email protected]>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint; the following annotations have been identified:
Mitigation guide:
By applying these rules, we reduce the occurrence of this message in the future. Thank you for improving NeMo's documentation!
What does this PR do?
Recipe changes for performance in 25.01 release
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
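No snippet was provided in the PR; below is a hypothetical sketch of launching a recipe with the performance plugin touched by this change. The recipe, executor, and plugin names are assumptions based on the NeMo 2.0 API and this PR's file paths; consult the NeMo docs for the exact interface.

```python
import nemo_run as run
from nemo.collections import llm
from nemo.lightning.run.plugins import PerfEnvPlugin

# Assumed recipe entry point; any NeMo 2.0 pretrain recipe would do here.
recipe = llm.llama3_8b.pretrain_recipe(num_nodes=1, num_gpus_per_node=8)
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")

# PerfEnvPlugin is where this PR's CUDA_DEVICE_MAX_CONNECTIONS logic lives.
run.run(recipe, executor=executor, plugins=[PerfEnvPlugin()])
```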
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information