Recipe changes for performance #11763
Status: Open. guyueh1 wants to merge 24 commits into NVIDIA:main from guyueh1:recipe_for_25.01.
Changes from 8 commits

Commits (24):
- f069a19  [Nemo2] allow setting CUDA_DEVICE_MAX_CONNECTIONS (guyueh1)
- 87f1d43  Add a tp2 ub config (guyueh1)
- 3bdda64  Recipe tuning for mixtral, nemotron4
- 43f45fa  Revert mixtral config change (guyueh1)
- 8420b22  Decide cuda device max connections based on torch.cuda.get_device_cap… (guyueh1)
- 633b903  Rename custom_cuda_device_max_connections to num_cuda_device_max_conn… (guyueh1)
- 88c16c3  Merge branch 'main' into recipe_for_25.01 (guyueh1)
- 5ca96db  Apply isort and black reformatting (guyueh1)
- 2b8114b  Remove explicit config of align_param_gather in mixtral recipe and us… (guyueh1)
- 93cb713  Merge branch 'recipe_for_25.01' of github.com:guyueh1/NeMo into recip… (guyueh1)
- 7c5530b  Revert "Remove explicit config of align_param_gather in mixtral recip… (guyueh1)
- e234588  Rename ub config; change proj to ring exchange for nemotron 340b (guyueh1)
- 43d6e12  Merge branch 'main' into recipe_for_25.01 (erhoo82)
- 9d5cb11  Update the logic to set cuda_device_max_connections (guyueh1)
- 0fd838e  Revert changes to PerfEnvPlugin (guyueh1)
- 441036c  Move setup of CUDA_DEVICE_MAX_CONNECTIONS to MegatronCommOverlapCallback (guyueh1)
- b18ac96  Apply isort and black reformatting (guyueh1)
- 1f2ff68  Add b200 tp overlap configs for gpt3 and llama3 models (guyueh1)
- 5bf3f74  Merge branch 'recipe_for_25.01' of github.com:guyueh1/NeMo into recip… (guyueh1)
- 6a218f1  Revert changes to nemotron recipe; will put those changes in performa… (guyueh1)
- c0d3777  Merge branch 'main' into recipe_for_25.01
- 544cd5a  Add two docstrings (guyueh1)
- 83d35d5  Merge branch 'main' into recipe_for_25.01 (erhoo82)
- 530719a  Fix os.environ.pop (guyueh1)
Diff of the `PerfEnvPlugin` changes:

```diff
@@ -19,6 +19,7 @@
 from typing import Callable, Optional

 import nemo_run as run
+import torch
 import yaml
 from lightning.pytorch import Callback
 from lightning.pytorch.loggers import WandbLogger
@@ -27,7 +28,6 @@
 from nemo.lightning.pytorch.callbacks import NsysCallback, PreemptionCallback
 from nemo.lightning.pytorch.strategies.megatron_strategy import MegatronStrategy
 from nemo.utils import logging
-
 from nemo.utils.import_utils import safe_import

 res_module, HAVE_RES = safe_import('nvidia_resiliency_ext.ptl_resiliency')
@@ -315,6 +315,7 @@ class PerfEnvPlugin(run.Plugin):
     layernorm_sm_margin: int = 16
     enable_vboost: bool = False
     nccl_pp_comm_chunksize: Optional[int] = None
+    num_cuda_device_max_connections: int = None

     def get_vboost_srun_cmd(self, nodes, job_dir):
         "Create the vboost `sudo nvidia-smi boost-slider --vboost 1` command"
@@ -341,11 +342,24 @@ def setup(self, task: run.Partial | run.Script, executor: run.Executor):
         """Enable the performance environment settings"""

         if task.trainer.strategy.__fn_or_cls__ == MegatronStrategy:
-            # Force program order kernel launch for TP, CP overlap
-            tp_size = task.trainer.strategy.tensor_model_parallel_size
-            cp_size = task.trainer.strategy.context_parallel_size
-            if tp_size > 1 or cp_size > 1:
-                executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
+            if torch.cuda.is_available():
+                major, _ = torch.cuda.get_device_capability()
```
Review comment on the `torch.cuda.get_device_capability()` line: @erhoo82 This method won't work because it runs on the cluster frontend node, not after the Slurm allocation. We need to find another way.
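Later commits in this PR (0fd838e "Revert changes to PerfEnvPlugin" and 441036c "Move setup of CUDA_DEVICE_MAX_CONNECTIONS to MegatronCommOverlapCallback") address this by deferring the decision to code that runs inside the allocated job. The sketch below only illustrates that idea; the class name, constructor arguments, and the way the parallel sizes are passed in are assumptions, not NeMo's actual MegatronCommOverlapCallback implementation.

```python
# Illustrative sketch only: defer the CUDA_DEVICE_MAX_CONNECTIONS decision until the
# process runs on a compute node, where torch sees the real training GPUs.
import os
from typing import Optional

import torch
from lightning.pytorch import Callback, LightningModule, Trainer


class CudaMaxConnectionsCallback(Callback):
    """Hypothetical callback that applies the same capability-based rule as the diff above."""

    def __init__(self, tp_size: int, cp_size: int, user_override: Optional[int] = None):
        self.tp_size = tp_size
        self.cp_size = cp_size
        self.user_override = user_override

    def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
        # Runs inside the Slurm allocation, so the capability query reflects the actual GPUs.
        # Note: the variable only takes effect if set before the CUDA context is created.
        major, _ = torch.cuda.get_device_capability()
        if major > 9:  # Blackwell and newer
            if self.user_override is not None:
                os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = str(self.user_override)
        elif self.tp_size > 1 or self.cp_size > 1:  # Hopper and older with TP/CP overlap
            os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
```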
The hunk continues:

```diff
+                if major > 9:
+                    if self.num_cuda_device_max_connections is not None:
+                        executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(self.num_cuda_device_max_connections)
+                else:
+                    # When TP or CP size is larger than 1, need to use a single cuda device connection to enforce
+                    # the kernel queuing order of the host to GPU for their execution. This is needed for the optimal
+                    # overlap between communication and computation kernels.
+                    tp_size = task.trainer.strategy.tensor_model_parallel_size
+                    cp_size = task.trainer.strategy.context_parallel_size
+                    if tp_size > 1 or cp_size > 1:
+                        executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"
+            else:
+                if self.num_cuda_device_max_connections is not None:
+                    executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(
+                        self.num_cuda_device_max_connections
+                    )

         # Set LayerNorm SM margin to support the overlap with LayerNorm kernel
         if self.enable_layernorm_sm_margin:
```
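To summarize the new behavior in `setup()`: on GPUs with compute capability major version above 9 (Blackwell and newer), CUDA_DEVICE_MAX_CONNECTIONS is only exported when the user sets `num_cuda_device_max_connections`; on older GPUs it is forced to 1 whenever TP or CP is greater than 1; when no GPU is visible, the user override (if any) is used. The standalone sketch below restates that decision; the helper name and the way the TP/CP sizes are passed in are illustrative, not part of the PR.

```python
# Standalone restatement of the selection logic added to PerfEnvPlugin.setup().
# The function name and argument passing are illustrative only.
from typing import Optional

import torch


def resolve_cuda_device_max_connections(
    tp_size: int, cp_size: int, user_override: Optional[int] = None
) -> Optional[str]:
    """Return the value to export as CUDA_DEVICE_MAX_CONNECTIONS, or None to leave it unset."""
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major > 9:
            # Blackwell (compute capability 10.x) and newer: only honor an explicit user setting.
            return str(user_override) if user_override is not None else None
        # Hopper and older: a single connection enforces host-to-GPU kernel queuing order,
        # which the TP/CP communication-computation overlap relies on.
        if tp_size > 1 or cp_size > 1:
            return "1"
        return None
    # No GPU visible where this runs (e.g. a cluster frontend node): fall back to the override.
    return str(user_override) if user_override is not None else None


# Example: TP=2 on a Hopper node yields "1"; on a B200 node it yields the override or None.
print(resolve_cuda_device_max_connections(tp_size=2, cp_size=1, user_override=None))
```

The branch on `major > 9` encodes the PR's assumption that Blackwell-class GPUs no longer need the single-connection restriction for TP/CP overlap, so on those devices the variable is only set when explicitly requested.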
Review comment: Is this an overlap config for Hopper or Blackwell?
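Commit 1f2ff68 adds B200 TP-overlap configs alongside the existing Hopper ones, so the per-architecture distinction matters when a recipe picks a config. A hypothetical way to branch on the running GPU's generation is shown below; this is not NeMo's API, and the config arguments are opaque placeholders.

```python
# Hypothetical helper: choose a per-architecture TP-overlap config at runtime.
# The config arguments are opaque placeholders, not actual NeMo config objects.
import torch


def select_tp_overlap_cfg(hopper_cfg, blackwell_cfg):
    """Return blackwell_cfg on compute capability 10.x+ GPUs (e.g. B200), hopper_cfg otherwise."""
    major, _ = torch.cuda.get_device_capability()
    return blackwell_cfg if major > 9 else hopper_cfg
```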