Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
📝 Walkthrough

These changes introduce NVFP4 quantization support for DeepSeek V3 pretraining on GB300 hardware. Updates include two new configuration variants with specific batch sizing and topology parameters, and the addition of a configurable pipeline-parallelism layout field on the workload base config.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/performance/utils/overrides.py (1)
359-364: Normalize compute_dtype case before the overlap safety guard. Line 360 compares against lowercase literals ("fp8_mx", "nvfp4"), but config-based execution paths (e.g., setup_experiment.py) pass uppercase values via precision.upper() or hardcoded "FP8_CS" literals. When an uppercase compute_dtype arrives, the condition compute_dtype not in ("fp8_mx", "nvfp4") evaluates true and enables overlap, reintroducing the NaN grad norm issue the NOTE warns about. Also update the NOTE to explicitly mention nvfp4, not just fp8_mx.
🔧 Proposed fix

```diff
- ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx. Disabling it until the issue is resolved.
- if dp > 1 and pp > 1 and vp > 1 and compute_dtype not in ("fp8_mx", "nvfp4"):
+ ## NOTE: overlap_param_gather_with_optimizer_step causes NaN grad norm for fp8_mx/nvfp4. Disabling it until the issue is resolved.
+ compute_dtype_l = compute_dtype.lower()
+ if dp > 1 and pp > 1 and vp > 1 and compute_dtype_l not in ("fp8_mx", "nvfp4"):
```
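A minimal, self-contained sketch of the case-insensitive guard the review suggests, assuming the variable names used in overrides.py (`compute_dtype`, `dp`, `pp`, `vp`); the function wrapper itself is a hypothetical stand-in, not the project's real code:

```python
# Hypothetical sketch of the suggested guard: normalize compute_dtype case so
# uppercase values ("NVFP4", "FP8_MX") from config-based paths are matched too.
def should_overlap_param_gather(compute_dtype: str, dp: int, pp: int, vp: int) -> bool:
    """Enable overlap only for dtypes not affected by the NaN grad-norm issue."""
    dtype = compute_dtype.lower()
    return dp > 1 and pp > 1 and vp > 1 and dtype not in ("fp8_mx", "nvfp4")

print(should_overlap_param_gather("NVFP4", dp=2, pp=2, vp=2))  # False
print(should_overlap_param_gather("bf16", dp=2, pp=2, vp=2))   # True
```

Without the `.lower()` call, the uppercase `"NVFP4"` value would slip past the membership check and re-enable the overlap.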
/ok to test 1afc195
```python
num_gpus=256,
global_batch_size=2048,
micro_batch_size=2,
pipeline_model_parallel_size=2,
virtual_pipeline_model_parallel_size=8,
pp_layout="Et*4|(t*4|)*14tmL",
expert_model_parallel_size=32,
moe_flex_dispatcher_backend="hybridep",
moe_a2a_overlap=False,
cuda_graph_scope=[],
recompute_modules=["mla_up_proj"],
```
Can we override only the changed values? Most of the lines remain the same as the baseline.
Thanks, updated
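A hedged illustration of the override-only-the-deltas pattern the reviewer asked for, assuming a frozen dataclass baseline; the `PretrainConfig` class and its fields are illustrative stand-ins borrowed from the diff above, not the project's real config class:

```python
# Hypothetical sketch: derive a variant config by overriding only the fields
# that differ from the baseline, instead of restating every value.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PretrainConfig:
    num_gpus: int = 256
    global_batch_size: int = 2048
    micro_batch_size: int = 1
    moe_a2a_overlap: bool = True

baseline = PretrainConfig()
# Only the deltas are spelled out; everything else inherits the baseline value.
variant = replace(baseline, micro_batch_size=2, moe_a2a_overlap=False)

print(variant.num_gpus)          # 256 (inherited from baseline)
print(variant.micro_batch_size)  # 2 (overridden)
```

This keeps variant configs short and makes the actual differences from the baseline easy to review.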
```python
peft: Optional[str] = None

# Pipeline parallelism layout
pp_layout: Optional[str] = None
```
This is a nice change to make the PP layout configurable.
Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
What does this PR do?
This PR adds DeepSeek V3 NVFP4 recipe configuration for GB300.
Achieves 1185 TFLOPs (V1) and 1220 TFLOPs (V2).
Update: Fast Math improves V2 to 1240 TFLOPs.
Still to be tested on GB200.
This PR also adds a pp_layout field to WorkloadBaseConfig to support custom pipeline parallelism layouts. This is needed because different workload configs (e.g., V1, V2, large_scale) share the same deepseek_v3_pretrain_config_gb300 function but may require different PP layouts: for instance, large-scale runs prefer FSDP and work better with pp_layout=None, so this must be configurable at the workload base config level.

Additional Information