Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad norm #2209

Merged
erhoo82 merged 1 commit into NVIDIA-NeMo:main from dingqingy-nv:qwen3_mxfp8_recipe_update
Feb 4, 2026


@dingqingy-nv dingqingy-nv commented Feb 4, 2026

What does this PR do?

  • Update the Qwen3 235B A22B MXFP8 GB200/300 recipe for better performance.
  • Drop the virtual pipeline (VP) mapping, which resolves the NaN grad norm.

Additional Information

Summary by CodeRabbit

  • Chores
    • Updated QWEN3_235B model training configurations with adjusted parallelism settings and optimization parameters for GB300 and GB200 accelerator types.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
@dingqingy-nv dingqingy-nv added this to the 26.02 milestone Feb 4, 2026
@dingqingy-nv dingqingy-nv added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 4, 2026

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

This PR updates QWEN3 model pretraining workload configurations, adjusting expert model parallelism from 16 to 32, removing virtual pipeline model parallelism settings, and adding CUDA graph optimization scopes for attention and mixture-of-experts operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| QWEN3 Workload Configuration<br>`scripts/performance/configs/qwen/qwen3_workload_base_configs.py` | Updated GB300 FP8 CS V2 config: increased `expert_model_parallel_size` to 32, removed `virtual_pipeline_model_parallel_size=12`, and added `cuda_graph_scope` for attn/moe operations. Added `expert_model_parallel_size=32` to GB200 FP8 CS V2 config. |
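
The GB300 change summarized above can be sketched as a before/after pair. This is an illustration only: `WorkloadSketch` and `changed_fields` are hypothetical names, not the classes or helpers in `qwen3_workload_base_configs.py`, and the `cuda_graph_scope` strings are placeholders taken from the "attn/moe" wording of the summary rather than the identifiers the file actually uses.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkloadSketch:
    """Hypothetical stand-in for one workload config entry (not the repo's class)."""

    expert_model_parallel_size: int
    virtual_pipeline_model_parallel_size: Optional[int] = None
    cuda_graph_scope: Tuple[str, ...] = ()


# GB300 FP8 CS V2 values before this PR, as described in the walkthrough.
before = WorkloadSketch(
    expert_model_parallel_size=16,
    virtual_pipeline_model_parallel_size=12,
)

# Values after this PR. The scope strings are placeholders (assumption),
# not the real identifiers from the configuration file.
after = replace(
    before,
    expert_model_parallel_size=32,
    virtual_pipeline_model_parallel_size=None,
    cuda_graph_scope=("attn", "moe"),
)


def changed_fields(old: WorkloadSketch, new: WorkloadSketch) -> dict:
    """Return {field name: (old value, new value)} for every field that differs."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    }


for name, (old_value, new_value) in changed_fields(before, after).items():
    print(f"{name}: {old_value!r} -> {new_value!r}")
```

Per the summary, the GB200 FP8 CS V2 entry only gains `expert_model_parallel_size=32`; the sketch above covers the GB300 entry.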

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Suggested reviewers

  • erhoo82

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR contains major changes affecting numerical stability and performance but provides no test results, performance benchmarks, or numerical validation to support claims. | Add detailed test results including convergence validation showing NaN gradient norm resolution, before-and-after performance metrics with hardware/configuration details, and evidence of no negative impact on training quality. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main changes: updating Qwen3 235B A22B MXFP8 recipe for GB200/300 and resolving NaN gradient norm issues, which aligns with the actual modifications to the configuration file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
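
The failed check asks for evidence that the NaN gradient norm is resolved. One minimal way to produce such evidence, assuming the per-step grad norms have already been collected into a list of floats (the PR does not describe how they are logged), is to scan for the first non-finite value. `first_non_finite_step` is a hypothetical helper, not part of the repository.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(grad_norms: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite grad norm, or None if all are finite."""
    for step, norm in enumerate(grad_norms):
        if not math.isfinite(norm):
            return step
    return None


# Made-up numbers for illustration only; these are not measurements from this PR.
example_with_nan = [1.25, 0.98, float("nan"), 0.91]
example_all_finite = [1.25, 0.98, 0.93, 0.91]

print(first_non_finite_step(example_with_nan))
print(first_non_finite_step(example_all_finite))
```

Running the same scan on a before and an after run would show whether the divergence step disappears with the updated recipe.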


Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.




erhoo82 commented Feb 4, 2026

/ok to test 0a8447a

@erhoo82 erhoo82 merged commit 8a10995 into NVIDIA-NeMo:main Feb 4, 2026
50 checks passed
ko3n1g pushed a commit that referenced this pull request Feb 4, 2026
…rm (#2209)

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
sowmen pushed a commit to sowmen/Megatron-Bridge that referenced this pull request Feb 11, 2026
…rm (NVIDIA-NeMo#2209)

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
Signed-off-by: sowmen <sowmendipta@gmail.com>
dingqingy-nv added a commit to dingqingy-nv/Megatron-Bridge that referenced this pull request Mar 10, 2026
…rm (NVIDIA-NeMo#2209)

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch
