NT Nano cfg update (#2662) #2681

Merged
malay-nagda merged 1 commit into r0.3.0 from malay/cp_2622_nt_nano_cfg
Mar 6, 2026

Conversation

Contributor

@malay-nagda commented Mar 6, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features
    • Added MOE router load balancing configuration for model optimization.
    • Introduced hardware-specific configuration variants for GB300, GB200, B300, B200, and H100 systems with support for multiple precision formats (BF16, FP8, NVFP4).
    • Added CUDA graph implementation support with configurable scope options for enhanced performance.

Signed-off-by: Malay Nagda <malayn@nvidia.com>

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Configuration update for Nemotron 3 Nano model training that adds MOE router load balancing, creates new pretrain configuration variants for multiple hardware platforms with adjusted CUDA graph settings, and implements model-specific environment variable handling to preserve cuDNN LayerNorm support.

Changes

Cohort / File(s) | Summary

  • Nemotron 3 Nano Pretrain Configuration — scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
    Adds cfg.model.moe_router_force_load_balancing = True to set_nemotron_3_nano_common_configs to enable load balancing for the MOE router.
  • Nemotron 3 Nano Workload Base Configurations — scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
    Refactors BASE_NEMOTRON_3_NANO_CONFIG by removing micro_batch_size and adding cuda_graph_impl and cuda_graph_scope fields. Creates 14 new pretrain config variants across GB300, GB200, B300, B200, and H100 hardware with specific micro_batch_size and CUDA graph settings. Updates the H100 base config to use the transformer_engine CUDA graph implementation and expands all exports.
  • Performance Plugin Environment Setup — scripts/performance/perf_plugins.py
    Adds a conditional branch in _set_model_specific_environment_variables to preserve cuDNN LayerNorm support (del_cudnn_ln = False) specifically for the nemotron_3_nano recipe.
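The hardware-variant pattern the walkthrough describes can be sketched with `dataclasses.replace`. The dataclass below and its field values are illustrative assumptions based on the summary, not the repository's actual definitions:

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for the workload base config; field names mirror the
# PR summary (num_gpus, batch sizes, CUDA graph settings), values are made up.
@dataclass(frozen=True)
class PretrainConfig:
    num_gpus: int
    global_batch_size: int
    micro_batch_size: int
    cuda_graph_impl: str = "local"
    cuda_graph_scope: str = "full"

BASE = PretrainConfig(num_gpus=8, global_batch_size=512, micro_batch_size=1)

# A hardware-specific variant: replace() copies BASE and overrides only the
# named fields; anything not named (here, cuda_graph_scope) is inherited.
H100_BF16 = replace(
    BASE,
    num_gpus=16,
    global_batch_size=1024,
    cuda_graph_impl="transformer_engine",
)

print(H100_BF16.cuda_graph_scope)  # → full (inherited from BASE)
```

Each of the 14 per-hardware, per-precision variants in the PR would then be one such `replace(...)` call over the shared base.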

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Dsv3 Recipe Update #2152 — Both modify perf_plugins._set_model_specific_environment_variables to preserve cuDNN LayerNorm support for MoE recipes
  • nemotron3_nano_h100_fix_260201 #2617 — Both update nemotron_3_nano_workload_base_configs.py with H100 pretrain configuration refinements
  • NT Nano cfg update #2662 — Both introduce identical changes across all three modified files (MOE router load balancing flag, workload config restructuring, and cuDNN LayerNorm preservation)

Suggested labels

performance, performance/optimize, r0.3.0, cherry-pick

Suggested reviewers

  • tomlifu
  • erhoo82
  • ko3n1g
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning
    Explanation: The PR contains significant changes affecting performance and numerics (MOE router load balancing and CUDA graph configurations), but the PR description contains only GitHub template placeholders, with no test results or validation information documented.
    Resolution: Complete the PR description with test results: confirm unit tests pass, provide performance benchmarks for the new hardware configurations, validate convergence with no numerical regressions, and include performance comparisons.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title 'NT Nano cfg update' accurately describes the main changes: configuration updates for Nemotron 3 Nano across multiple config files.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

89-100: ⚠️ Potential issue | 🟡 Minor

__all__ is missing the NVFP4 variants.

The NVFP4 config variants are defined (lines 47, 54, 61, 68) but not exported in __all__. This creates an inconsistency with the FP8_MX variants which are exported.

Proposed fix
 __all__ = [
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_FP8_CS_V1",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
around lines 89 - 100, The __all__ list omits the NVFP4 config symbols; add the
four NVFP4 exports (the constants named with the pattern
NEMOTRON_3_NANO_PRETRAIN_CONFIG_*_NVFP4_V1) to the __all__ array so the NVFP4
variants defined earlier (the ones at lines where
NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1,
NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_NVFP4_V1,
NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1, and
NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_NVFP4_V1 are declared) are exported
alongside the FP8_MX entries.
🧹 Nitpick comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

70-76: Redundant cuda_graph_impl assignment.

Line 75 sets cuda_graph_impl="transformer_engine", but BASE_NEMOTRON_3_NANO_CONFIG (line 38) already sets this. The replace() call inherits the value automatically.

Suggested cleanup
 _NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 = replace(
     BASE_NEMOTRON_3_NANO_CONFIG,
     num_gpus=16,
     global_batch_size=1024,
     micro_batch_size=1,
-    cuda_graph_impl="transformer_engine",
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
around lines 70 - 76, The replace call creating
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 redundantly reassigns cuda_graph_impl even
though BASE_NEMOTRON_3_NANO_CONFIG already defines it; remove the
cuda_graph_impl="transformer_engine" argument from the replace(...) call so the
new config inherits the value from BASE_NEMOTRON_3_NANO_CONFIG, keeping only the
overridden fields (num_gpus, global_batch_size, micro_batch_size) in the
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 definition.
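The cleanup rests on a standard `dataclasses.replace` property, isolated here with a toy `Cfg` class (not the repository's actual type): fields not named in the call are copied from the source object, so re-passing an unchanged value is redundant.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cfg:
    num_gpus: int
    cuda_graph_impl: str

base = Cfg(num_gpus=8, cuda_graph_impl="transformer_engine")

# cuda_graph_impl is not named, so replace() copies it from base unchanged.
derived = replace(base, num_gpus=16)

print(derived.cuda_graph_impl)  # → transformer_engine
```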

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8053d25-c528-449b-95f2-2e8e01439ee8

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1ebe6 and 5fbdf2a.

📒 Files selected for processing (3)
  • scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
  • scripts/performance/perf_plugins.py

malay-nagda merged commit bce688d into r0.3.0 on Mar 6, 2026
48 of 49 checks passed
malay-nagda deleted the malay/cp_2622_nt_nano_cfg branch on March 6, 2026 at 14:41
