Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
📝 Walkthrough

These changes introduce NVFP4 quantization support for DeepSeek V3 pretraining on GB300 hardware. Updates include two new configuration variants with specific batch sizing and topology parameters, and the addition of a configurable pipeline-parallelism layout field on the workload base config.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/performance/utils/overrides.py (1)
359-364: Normalize compute_dtype case before the overlap safety guard. Line 360 compares against lowercase literals ("fp8_mx", "nvfp4"), but config-based execution paths (e.g., setup_experiment.py) pass uppercase values via precision.upper() or hardcoded "FP8_CS" literals. When an uppercase compute_dtype arrives, the condition compute_dtype not in ("fp8_mx", "nvfp4") evaluates true and enables overlap, reintroducing the NaN grad norm issue the NOTE warns about. Also update the NOTE to explicitly mention nvfp4, not just fp8_mx.
🔧 Proposed fix

```diff
- ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx. Disabling it until the issue is resolved.
- if dp > 1 and pp > 1 and vp > 1 and compute_dtype not in ("fp8_mx", "nvfp4"):
+ ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx/nvfp4. Disabling it until the issue is resolved.
+ compute_dtype_l = compute_dtype.lower()
+ if dp > 1 and pp > 1 and vp > 1 and compute_dtype_l not in ("fp8_mx", "nvfp4"):
```
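A minimal, self-contained sketch of the case-insensitive guard the review suggests, assuming the variable names used in overrides.py (`compute_dtype`, `dp`, `pp`, `vp`); the function wrapper itself is a hypothetical stand-in, not the project's real code:

```python
# Hypothetical sketch of the suggested guard: normalize compute_dtype case so
# uppercase values ("NVFP4", "FP8_MX") from config-based paths are matched too.
def should_overlap_param_gather(compute_dtype: str, dp: int, pp: int, vp: int) -> bool:
    """Enable overlap only for dtypes not affected by the NaN grad-norm issue."""
    dtype = compute_dtype.lower()
    return dp > 1 and pp > 1 and vp > 1 and dtype not in ("fp8_mx", "nvfp4")

print(should_overlap_param_gather("NVFP4", dp=2, pp=2, vp=2))  # False
print(should_overlap_param_gather("bf16", dp=2, pp=2, vp=2))   # True
```

Without the `.lower()` call, the uppercase `"NVFP4"` value would slip past the membership check and re-enable the overlap.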
/ok to test 1afc195
```python
num_gpus=256,
global_batch_size=2048,
micro_batch_size=2,
pipeline_model_parallel_size=2,
virtual_pipeline_model_parallel_size=8,
pp_layout="Et*4|(t*4|)*14tmL",
expert_model_parallel_size=32,
moe_flex_dispatcher_backend="hybridep",
moe_a2a_overlap=False,
cuda_graph_scope=[],
recompute_modules=["mla_up_proj"],
```
Can we override only the changed values? Most of the lines remain the same as the baseline.
Thanks, updated
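A hedged illustration of the override-only-the-deltas pattern the reviewer asked for, assuming a frozen dataclass baseline; the `PretrainConfig` class and its fields are illustrative stand-ins borrowed from the diff above, not the project's real config class:

```python
# Hypothetical sketch: derive a variant config by overriding only the fields
# that differ from the baseline, instead of restating every value.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PretrainConfig:
    num_gpus: int = 256
    global_batch_size: int = 2048
    micro_batch_size: int = 1
    moe_a2a_overlap: bool = True

baseline = PretrainConfig()
# Only the deltas are spelled out; everything else inherits the baseline value.
variant = replace(baseline, micro_batch_size=2, moe_a2a_overlap=False)

print(variant.num_gpus)          # 256 (inherited from baseline)
print(variant.micro_batch_size)  # 2 (overridden)
```

This keeps variant configs short and makes the actual differences from the baseline easy to review.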
```python
peft: Optional[str] = None

# Pipeline parallelism layout
pp_layout: Optional[str] = None
```
This is a nice change to make the PP layout configurable.
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
What does this PR do?
This PR adds DeepSeek V3 NVFP4 recipe configuration for GB300.
Achieves 1185 TFLOPs (V1) and 1220 TFLOPs (V2).
Update: Fast Math improves V2 to 1240 TFLOPs.
Still to be tested on GB200.
This PR also adds a pp_layout field to WorkloadBaseConfig to support custom pipeline parallelism layouts. This is needed because different workload configs (e.g., V1, V2, large_scale) share the same deepseek_v3_pretrain_config_gb300 function but may require different PP layouts: for instance, large-scale runs prefer FSDP and work better with pp_layout=None, so this must be configurable at the workload base config level.

Additional Information