NT Nano cfg update (#2662) #2681

Merged
malay-nagda merged 1 commit into r0.3.0 from malay/cp_2622_nt_nano_cfg
Mar 6, 2026

Conversation

Contributor

@malay-nagda commented Mar 6, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features
    • Added MOE router load balancing configuration for model optimization.
    • Introduced hardware-specific configuration variants for GB300, GB200, B300, B200, and H100 systems with support for multiple precision formats (BF16, FP8, NVFP4).
    • Added CUDA graph implementation support with configurable scope options for enhanced performance.

Signed-off-by: Malay Nagda <malayn@nvidia.com>

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Configuration update for Nemotron 3 Nano model training that adds MOE router load balancing, creates new pretrain configuration variants for multiple hardware platforms with adjusted CUDA graph settings, and implements model-specific environment variable handling to preserve cuDNN LayerNorm support.

Changes

Cohort / File(s) | Summary

  • Nemotron 3 Nano Pretrain Configuration — scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
    Adds cfg.model.moe_router_force_load_balancing = True to set_nemotron_3_nano_common_configs to enable load balancing for the MOE router.
  • Nemotron 3 Nano Workload Base Configurations — scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
    Refactors BASE_NEMOTRON_3_NANO_CONFIG by removing micro_batch_size and adding cuda_graph_impl and cuda_graph_scope fields. Creates 14 new pretrain config variants across GB300, GB200, B300, B200, and H100 hardware with specific micro_batch_size and CUDA graph settings. Updates the H100 base config to use the transformer_engine CUDA graph implementation and expands all exports.
  • Performance Plugin Environment Setup — scripts/performance/perf_plugins.py
    Adds a conditional branch in _set_model_specific_environment_variables to preserve cuDNN LayerNorm support (del_cudnn_ln = False) specifically for the nemotron_3_nano recipe.
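The hardware-variant pattern the walkthrough describes can be sketched with `dataclasses.replace`. The dataclass below and its field values are illustrative assumptions based on the summary, not the repository's actual definitions:

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for the workload base config; field names mirror the
# PR summary (num_gpus, batch sizes, CUDA graph settings), values are made up.
@dataclass(frozen=True)
class PretrainConfig:
    num_gpus: int
    global_batch_size: int
    micro_batch_size: int
    cuda_graph_impl: str = "local"
    cuda_graph_scope: str = "full"

BASE = PretrainConfig(num_gpus=8, global_batch_size=512, micro_batch_size=1)

# A hardware-specific variant: replace() copies BASE and overrides only the
# named fields; anything not named (here, cuda_graph_scope) is inherited.
H100_BF16 = replace(
    BASE,
    num_gpus=16,
    global_batch_size=1024,
    cuda_graph_impl="transformer_engine",
)

print(H100_BF16.cuda_graph_scope)  # → full (inherited from BASE)
```

Each of the 14 per-hardware, per-precision variants in the PR would then be one such `replace(...)` call over the shared base.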

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Dsv3 Recipe Update #2152 — Both modify perf_plugins._set_model_specific_environment_variables to preserve cuDNN LayerNorm support for MoE recipes
  • nemotron3_nano_h100_fix_260201 #2617 — Both update nemotron_3_nano_workload_base_configs.py with H100 pretrain configuration refinements
  • NT Nano cfg update #2662 — Both introduce identical changes across all three modified files (MOE router load balancing flag, workload config restructuring, and cuDNN LayerNorm preservation)

Suggested labels

performance, performance/optimize, r0.3.0, cherry-pick

Suggested reviewers

  • tomlifu
  • erhoo82
  • ko3n1g
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning
    Explanation: The PR contains significant changes affecting performance and numerics (MOE router load balancing and CUDA graph configurations), but the PR description contains only GitHub template placeholders, with no test results or validation information documented.
    Resolution: Complete the PR description with test results: confirm unit tests pass, provide performance benchmarks for the new hardware configurations, validate convergence with no numerical regressions, and include performance comparisons.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title 'NT Nano cfg update' accurately describes the main changes: configuration updates for Nemotron 3 Nano across multiple config files.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

89-100: ⚠️ Potential issue | 🟡 Minor

__all__ is missing the NVFP4 variants.

The NVFP4 config variants are defined (lines 47, 54, 61, 68) but not exported in __all__. This creates an inconsistency with the FP8_MX variants which are exported.

Proposed fix
 __all__ = [
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_FP8_MX_V1",
+    "NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_NVFP4_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_BF16_V1",
     "NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100_FP8_CS_V1",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
around lines 89 - 100, The __all__ list omits the NVFP4 config symbols; add the
four NVFP4 exports (the constants named with the pattern
NEMOTRON_3_NANO_PRETRAIN_CONFIG_*_NVFP4_V1) to the __all__ array so the NVFP4
variants defined earlier (the ones at lines where
NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1,
NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB200_NVFP4_V1,
NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1, and
NEMOTRON_3_NANO_PRETRAIN_CONFIG_B200_NVFP4_V1 are declared) are exported
alongside the FP8_MX entries.
🧹 Nitpick comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

70-76: Redundant cuda_graph_impl assignment.

Line 75 sets cuda_graph_impl="transformer_engine", but BASE_NEMOTRON_3_NANO_CONFIG (line 38) already sets this. The replace() call inherits the value automatically.

Suggested cleanup
 _NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 = replace(
     BASE_NEMOTRON_3_NANO_CONFIG,
     num_gpus=16,
     global_batch_size=1024,
     micro_batch_size=1,
-    cuda_graph_impl="transformer_engine",
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
around lines 70 - 76, The replace call creating
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 redundantly reassigns cuda_graph_impl even
though BASE_NEMOTRON_3_NANO_CONFIG already defines it; remove the
cuda_graph_impl="transformer_engine" argument from the replace(...) call so the
new config inherits the value from BASE_NEMOTRON_3_NANO_CONFIG, keeping only the
overridden fields (num_gpus, global_batch_size, micro_batch_size) in the
_NEMOTRON_3_NANO_PRETRAIN_CONFIG_H100 definition.
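The cleanup rests on a standard `dataclasses.replace` property, isolated here with a toy `Cfg` class (not the repository's actual type): fields not named in the call are copied from the source object, so re-passing an unchanged value is redundant.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cfg:
    num_gpus: int
    cuda_graph_impl: str

base = Cfg(num_gpus=8, cuda_graph_impl="transformer_engine")

# cuda_graph_impl is not named, so replace() copies it from base unchanged.
derived = replace(base, num_gpus=16)

print(derived.cuda_graph_impl)  # → transformer_engine
```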

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8053d25-c528-449b-95f2-2e8e01439ee8

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1ebe6 and 5fbdf2a.

📒 Files selected for processing (3)
  • scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
  • scripts/performance/perf_plugins.py

malay-nagda merged commit bce688d into r0.3.0 on Mar 6, 2026
48 of 49 checks passed
malay-nagda deleted the malay/cp_2622_nt_nano_cfg branch on March 6, 2026 at 14:41
