
DSV3 NVFP4 recipe on GB300#2076

Merged
erhoo82 merged 4 commits into NVIDIA-NeMo:main from dingqingy-nv:dsv3-nvfp4-recipe
Jan 28, 2026

Conversation

@dingqingy-nv
Contributor

@dingqingy-nv dingqingy-nv commented Jan 27, 2026

What does this PR do ?

This PR adds DeepSeek V3 NVFP4 recipe configuration for GB300.

Achieves 1185 TFLOPs (V1) and 1220 TFLOPs (V2).

Update: Fast Math improves V2 to 1240 TFLOPs.

Still to test: GB200.

This PR also adds a pp_layout field to WorkloadBaseConfig to support custom pipeline parallelism layouts. This is needed because different workload configs (e.g., V1, V2, large_scale) share the same deepseek_v3_pretrain_config_gb300 function but may require different PP layouts. For instance, large-scale runs prefer FSDP and work better with pp_layout=None, so this must be configurable at the workload base config level.
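The sharing described above can be sketched as follows. This is a minimal sketch, not the actual NeMo code: the pp_layout field and the V1 layout string come from this PR's diff, while the other fields and the simplified builder body are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkloadBaseConfig:
    # Existing fields elided; only the context and the field added by this PR are shown.
    peft: Optional[str] = None
    # Pipeline parallelism layout (new in this PR); None lets large-scale
    # FSDP runs avoid pinning a fixed layout.
    pp_layout: Optional[str] = None


def deepseek_v3_pretrain_config_gb300(base_cfg: WorkloadBaseConfig) -> dict:
    # Hypothetical simplified builder: the point is that the layout is now
    # propagated from the workload base config (layout=base_cfg.pp_layout)
    # instead of being hardcoded to None.
    return {"layout": base_cfg.pp_layout}


# Two workload variants can now share one builder while differing in layout:
v1_cfg = WorkloadBaseConfig(pp_layout="Et*4|(t*4|)*14tmL")
large_scale_cfg = WorkloadBaseConfig(pp_layout=None)
```

With this, `deepseek_v3_pretrain_config_gb300(v1_cfg)` carries the custom layout downstream, while the large-scale config leaves it unset.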

Additional Information

Summary by CodeRabbit

  • New Features

    • Added GB300 NVFP4 pretraining configurations (V1 and V2 variants) with configurable batch sizes and pipeline topology.
    • Introduced pipeline parallelism layout configuration support.
  • Improvements

    • Extended compute dtype support to include NVFP4 for optimizer parameter gathering optimization.


Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

These changes introduce NVFP4 quantization support for DeepSeek V3 pretraining on GB300 hardware. Updates include two new configuration variants with specific batch sizing and topology parameters, addition of a pp_layout field to the base configuration structure, propagation of layout parameters through the config chain, and adjustment of optimizer overlap logic to account for NVFP4 dtype compatibility.

Changes

Cohort / File(s) Summary
DeepSeek GB300 NVFP4 Configurations
scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
Added two new pretraining variants: DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_NVFP4_V1 (global_batch_size=2048) and V2 (global_batch_size=4096), both configured with num_gpus=256, pp_layout="Et*4|(t*4|)*14tmL", expert_model_parallel_size=32, moe_flex_dispatcher_backend="hybridep", and recompute_modules=["mla_up_proj"].
Configuration Infrastructure
scripts/performance/utils/utils.py
Added new optional field pp_layout: Optional[str] = None to WorkloadBaseConfig to represent Pipeline parallelism layout.
Layout Parameter Propagation
scripts/performance/configs/deepseek/deepseek_llm_pretrain.py
Changed layout parameter in deepseek_v3_pretrain_config_gb300 from layout=None to layout=base_cfg.pp_layout to propagate layout configuration downstream.
Optimizer Overlap Logic
scripts/performance/utils/overrides.py
Modified overlap_param_gather_with_optimizer_step condition to disable overlap for both "fp8_mx" and "nvfp4" compute_dtype values instead of fp8_mx only.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title 'DSV3 NVFP4 recipe on GB300' directly and accurately summarizes the main change: adding a DeepSeek V3 NVFP4 recipe configuration for the GB300 platform.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Test Results For Major Changes: ✅ Passed. PR includes major changes with measured performance results: 1185 TFLOPs (V1) and 1220 TFLOPs (V2) for the new NVFP4 GB300 recipe configuration, with documented test context.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/utils/overrides.py (1)

359-364: Normalize compute_dtype case before the overlap safety guard.

Line 360 compares against lowercase literals ("fp8_mx", "nvfp4"), but config-based execution paths (e.g., setup_experiment.py) pass uppercase values via precision.upper() or hardcoded "FP8_CS" literals. When uppercase compute_dtype arrives, the condition compute_dtype not in ("fp8_mx", "nvfp4") evaluates true and enables overlap, reintroducing the NaN grad norm issue the NOTE warns about.

Also update the NOTE to explicitly mention nvfp4, not just fp8_mx.

🔧 Proposed fix
-    ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx. Disabling it until the issue is resolved.
-    if dp > 1 and pp > 1 and vp > 1 and compute_dtype not in ("fp8_mx", "nvfp4"):
+    ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx/nvfp4. Disabling it until the issue is resolved.
+    compute_dtype_l = compute_dtype.lower()
+    if dp > 1 and pp > 1 and vp > 1 and compute_dtype_l not in ("fp8_mx", "nvfp4"):
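The pitfall the review describes can be demonstrated in isolation. This is a standalone sketch, not the code in overrides.py: the function names and the simplified guard are hypothetical, reproducing only the membership test and the proposed case normalization.

```python
def overlap_enabled(dp: int, pp: int, vp: int, compute_dtype: str) -> bool:
    # Unpatched guard: case-sensitive membership test against lowercase literals.
    return dp > 1 and pp > 1 and vp > 1 and compute_dtype not in ("fp8_mx", "nvfp4")


def overlap_enabled_fixed(dp: int, pp: int, vp: int, compute_dtype: str) -> bool:
    # Patched guard: normalize case first, as the review suggests.
    return dp > 1 and pp > 1 and vp > 1 and compute_dtype.lower() not in ("fp8_mx", "nvfp4")


# An uppercase dtype (as passed via precision.upper() on some config paths)
# slips past the unpatched guard and wrongly enables overlap:
print(overlap_enabled(2, 2, 2, "NVFP4"))        # True: overlap wrongly enabled
print(overlap_enabled_fixed(2, 2, 2, "NVFP4"))  # False: overlap correctly disabled
```

The lowercase spelling is caught either way; only the normalized version also catches the uppercase variants.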

@ko3n1g
Contributor

ko3n1g commented Jan 27, 2026

/ok to test 1afc195

Comment on lines +59 to +69
num_gpus=256,
global_batch_size=2048,
micro_batch_size=2,
pipeline_model_parallel_size=2,
virtual_pipeline_model_parallel_size=8,
pp_layout="Et*4|(t*4|)*14tmL",
expert_model_parallel_size=32,
moe_flex_dispatcher_backend="hybridep",
moe_a2a_overlap=False,
cuda_graph_scope=[],
recompute_modules=["mla_up_proj"],
Contributor


Could we override only the changed fields?
Most of these lines are the same as the baseline.

Contributor Author


Thanks, updated

peft: Optional[str] = None

# Pipeline parallelism layout
pp_layout: Optional[str] = None
Contributor


This is a nice change to make the PP layout configurable.

sanandaraj5597
sanandaraj5597 previously approved these changes Jan 27, 2026
Contributor

@sanandaraj5597 sanandaraj5597 left a comment


LGTM.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>