Recipe changes for performance #11763
base: main
Conversation
Signed-off-by: Guyue Huang <[email protected]>
Conflicts: nemo/lightning/run/plugins.py
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
…ability
Signed-off-by: Guyue Huang <[email protected]>
…ections
Signed-off-by: Guyue Huang <[email protected]>
Conflicts: nemo/lightning/run/plugins.py
Signed-off-by: guyueh1 <[email protected]>
LGTM
…e and use default" This reverts commit 2b8114b.
Signed-off-by: Guyue Huang <[email protected]>
@@ -168,3 +182,17 @@ class TransformerLayerTPOverlapCfg:
    proj_fprop=PipelineOverlapCfg(num_sm=24, cga_size=2, num_splits=4, set_sm_margin=True, fp8_buf=True),
    fc2_fprop=RingExchangeOverlapCfg(num_sm=1, set_sm_margin=True),
)

# Nemotron 340B
userbuffers_bf16_h100_h18432_tp8_mbs1_seqlen4096 = TransformerLayerTPOverlapCfg(
Is this an overlap config for Hopper or Blackwell?
nemo/lightning/run/plugins.py (outdated)
if tp_size > 1 or cp_size > 1:
    executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
@erhoo82 This method won't work because it's run on the cluster frontend node, not after the Slurm allocation. We need to find another way.
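One possible direction (a sketch under stated assumptions, not the PR's final fix): detect the GPU architecture inside the allocated job itself, before torch initializes a CUDA context. NVML can query the compute capability without creating a CUDA context, so the env var still takes effect:

```python
import os

import pynvml  # NVML bindings; queries the GPU without creating a CUDA context

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
major, _minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
pynvml.nvmlShutdown()

# Must run before any CUDA context exists, since CUDA_DEVICE_MAX_CONNECTIONS
# is read at context creation time. The tp/cp condition from the diff would
# wrap this in the real code.
if major <= 9:  # Hopper and older
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```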
Signed-off-by: Guyue Huang <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: guyueh1 <[email protected]>
Signed-off-by: Guyue Huang <[email protected]>
…nce scripts in a separate PR
Signed-off-by: Guyue Huang <[email protected]>
This PR is ready, @erhoo82 please review.
@erhoo82 this PR is ready and it's needed by 25.02; let's review and merge it.
    os.environ.pop('CUDA_DEVICE_MAX_CONNECTIONS')
else:
    if tp_size > 1 or cp_size > 1:
        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = "1"
It could also be good to add a docstring for this condition:
Set the device connections to 1 to enforce that the kernel queuing order from the host matches the execution order on the GPU. This is needed to schedule a communication kernel before the overlapping persistent GEMM kernel; otherwise, the communication kernel is pushed to the end of the GEMM kernel, failing to overlap the kernels.
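For illustration, a minimal sketch of how that explanation could sit next to the condition (the helper name and surrounding structure are assumptions, not the PR's actual code):

```python
import os

def _set_device_max_connections(tp_size: int, cp_size: int, major: int) -> None:
    # Hypothetical helper mirroring the plugin logic in this diff.
    if major > 9:
        # Newer architectures do not need the cap; remove it if present.
        os.environ.pop("CUDA_DEVICE_MAX_CONNECTIONS", None)
    elif tp_size > 1 or cp_size > 1:
        # Set the device connections to 1 to enforce that the kernel queuing
        # order from the host matches the execution order on the GPU. This is
        # needed to schedule a communication kernel before the overlapping
        # persistent GEMM kernel; otherwise the communication kernel is pushed
        # to the end of the GEMM kernel, failing to overlap the two.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```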
if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # Default is 8, but for this case, we need extra connections
        # to avoid serialization of streams
Suggested change:
"Default is 8, but for this case, we need extra connections to avoid serialization of streams"
to
"We need extra connections to avoid serialization of streams, so we use the max connections of 32 instead of the default device connection of 8."
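Applied to the branch above, the revised comment might read as follows (a sketch; the value 32 follows the reviewer's wording, not a confirmed line from the PR):

```python
if major > 9:
    if (tp_size > 1 or cp_size > 1) and (dp_size > 1 or pp_size > 1):
        # We need extra connections to avoid serialization of streams, so we
        # use the max connections of 32 instead of the default device
        # connection of 8.
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"
```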
Signed-off-by: Guyue Huang <[email protected]>
LGTM
Signed-off-by: Guyue Huang <[email protected]>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint; the following annotations have been identified:
Mitigation guide:
By applying these rules, we reduce the occurrence of this message in the future. Thank you for improving NeMo's documentation!
What does this PR do?
Recipe changes for performance in 25.01 release
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
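No snippet was provided in the PR; below is a hypothetical sketch of launching a recipe with the performance plugin touched by this change. The recipe, executor, and plugin names are assumptions based on the NeMo 2.0 API and this PR's file paths; consult the NeMo docs for the exact interface.

```python
import nemo_run as run
from nemo.collections import llm
from nemo.lightning.run.plugins import PerfEnvPlugin

# Assumed recipe entry point; any NeMo 2.0 pretrain recipe would do here.
recipe = llm.llama3_8b.pretrain_recipe(num_nodes=1, num_gpus_per_node=8)
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")

# PerfEnvPlugin is where this PR's CUDA_DEVICE_MAX_CONNECTIONS logic lives.
run.run(recipe, executor=executor, plugins=[PerfEnvPlugin()])
```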
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information