Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad norm by dingqingy-nv · Pull Request #2209 · NVIDIA-NeMo/Megatron-Bridge

dingqingy-nv · 2026-02-04T05:38:49Z

What does this PR do ?

Update the Qwen3 235B A22B MXFP8 GB200/300 recipe for better performance.
Use a mapping without virtual pipeline parallelism (VP), which resolves the NaN grad norm.

Additional Information

Related to [Bug] Qwen3 235B A22B MXFP8 NaN grad norm #2096

Summary by CodeRabbit

Chores
- Updated QWEN3_235B model training configurations with adjusted parallelism settings and optimization parameters for GB300 and GB200 accelerator types.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

coderabbitai · 2026-02-04T05:43:13Z

📝 Walkthrough

This PR updates QWEN3 model pretraining workload configurations, adjusting expert model parallelism from 16 to 32, removing virtual pipeline model parallelism settings, and adding CUDA graph optimization scopes for attention and mixture-of-experts operations.
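As a rough illustration of what moving expert model parallelism from 16 to 32 means per rank, the sketch below divides the routed experts evenly across expert-parallel ranks. The expert count of 128 is an assumption taken from the published Qwen3-235B-A22B model card, not from this PR, and `experts_per_rank` is a hypothetical helper, not a Megatron-Bridge function.

```python
# Assumption: 128 routed experts per MoE layer (Qwen3-235B-A22B model card).
NUM_EXPERTS = 128


def experts_per_rank(num_experts: int, expert_model_parallel_size: int) -> int:
    """Return how many experts each expert-parallel rank holds.

    Hypothetical helper for illustration; it assumes experts are split evenly.
    """
    if num_experts % expert_model_parallel_size != 0:
        raise ValueError("expert count must be divisible by the EP size")
    return num_experts // expert_model_parallel_size


before = experts_per_rank(NUM_EXPERTS, 16)  # old mapping, EP=16
after = experts_per_rank(NUM_EXPERTS, 32)   # new mapping, EP=32
print(f"experts per rank: EP=16 -> {before}, EP=32 -> {after}")
```

Under that assumption, doubling the expert-parallel size halves the number of experts resident on each rank.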

Changes

| Cohort / File(s) | Summary |
|---|---|
| QWEN3 Workload Configuration `scripts/performance/configs/qwen/qwen3_workload_base_configs.py` | Updated GB300 FP8 CS V2 config: increased expert_model_parallel_size to 32, removed virtual_pipeline_model_parallel_size=12, and added cuda_graph_scope for attn/moe operations. Added expert_model_parallel_size=32 to GB200 FP8 CS V2 config. |
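The GB300 change summarized above can be sketched as a before/after pair. This is only an illustration: `WorkloadConfigSketch` is a hypothetical stand-in for whatever config class the repository uses, and the `cuda_graph_scope` strings `"attn"` and `"moe"` are placeholders derived from the summary wording, not the exact values in the diff.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkloadConfigSketch:
    """Hypothetical stand-in holding only the three fields named in the summary."""

    expert_model_parallel_size: int
    virtual_pipeline_model_parallel_size: Optional[int] = None
    cuda_graph_scope: Tuple[str, ...] = ()


# GB300 FP8 CS V2 before the PR: EP=16 with VP=12 (values from the summary).
gb300_before = WorkloadConfigSketch(
    expert_model_parallel_size=16,
    virtual_pipeline_model_parallel_size=12,
)

# After the PR: EP=32, no virtual pipeline, CUDA graphs scoped to attn/moe.
gb300_after = replace(
    gb300_before,
    expert_model_parallel_size=32,
    virtual_pipeline_model_parallel_size=None,
    cuda_graph_scope=("attn", "moe"),  # placeholder scope names
)


def changed_fields(old: WorkloadConfigSketch, new: WorkloadConfigSketch) -> dict:
    """Map each field that differs to its (old, new) pair."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    }


for name, (old_value, new_value) in changed_fields(gb300_before, gb300_after).items():
    print(f"{name}: {old_value!r} -> {new_value!r}")
```

The GB200 entry is left out because the summary only says expert_model_parallel_size=32 was added and does not state the previous value.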

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Possibly related PRs

cp: Dsv3 Recipe Update (2152) into r0.3.0 #2186: Modifies the same model configuration parameters (expert_model_parallel_size, virtual_pipeline_model_parallel_size, cuda_graph_scope) across workload config files.

Suggested reviewers

erhoo82

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR contains major changes affecting numerical stability and performance but provides no test results, performance benchmarks, or numerical validation to support claims. | Add detailed test results including convergence validation showing NaN gradient norm resolution, before-and-after performance metrics with hardware/configuration details, and evidence of no negative impact on training quality. |
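One lightweight way to produce the requested evidence is to scan the logged gradient norms of a run for non-finite values. The sketch below assumes the per-step grad norms have already been extracted from training logs into a list. The helper name and the sample values are invented for illustration and are not results from this PR.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(grad_norms: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite grad norm, or None if all are finite."""
    for step, value in enumerate(grad_norms):
        if not math.isfinite(value):
            return step
    return None


# Made-up sample data, not measurements from this PR.
healthy_run = [1.8, 1.5, 1.2]
broken_run = [1.8, float("nan"), 1.2]

print("healthy run:", first_non_finite_step(healthy_run))
print("broken run:", first_non_finite_step(broken_run))
```

Running the same check over logs from the old and new mappings would give a simple before-and-after comparison for the NaN grad norm claim.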

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main changes: updating Qwen3 235B A22B MXFP8 recipe for GB200/300 and resolving NaN gradient norm issues, which aligns with the actual modifications to the configuration file. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


erhoo82 · 2026-02-04T05:53:00Z

/ok to test 0a8447a

…rm (NVIDIA-NeMo#2209) Signed-off-by: Dingqing Yang <dingqingy@nvidia.com> Signed-off-by: sowmen <sowmendipta@gmail.com>

update qwen3 235b mxfp8 gb recipe and resolves nan grad norm

0a8447a

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

dingqingy-nv added this to the 26.02 milestone Feb 4, 2026

dingqingy-nv requested review from erhoo82 and ko3n1g February 4, 2026 05:38

dingqingy-nv added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 4, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 05:39 Inactive

copy-pr-bot bot temporarily deployed to test February 4, 2026 05:39 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 05:48 Inactive

erhoo82 approved these changes Feb 4, 2026

erhoo82 enabled auto-merge (squash) February 4, 2026 06:08

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 06:23 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 06:35 Inactive

erhoo82 merged commit 8a10995 into NVIDIA-NeMo:main Feb 4, 2026
50 checks passed

ko3n1g pushed a commit that referenced this pull request Feb 4, 2026

Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad no…

cecb77e

…rm (#2209) Signed-off-by: Dingqing Yang <dingqingy@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai bot mentioned this pull request Feb 4, 2026

Update Deepseek V3 MXFP8 GB200 mapping #2215

Merged

coderabbitai bot mentioned this pull request Feb 11, 2026

Revert Qwen3 235B GB300 MXFP8 large scale mapping #2338

Merged

This was referenced Feb 28, 2026

Onboard NVFP4 and MXFP8 recipes #2600

Merged

Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs #2669

Merged

Unify bf16 gb300 qwen3 235b mapping #2670

Merged

dingqingy-nv added a commit to dingqingy-nv/Megatron-Bridge that referenced this pull request Mar 10, 2026

Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad no…

15ace11

…rm (NVIDIA-NeMo#2209) Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

erhoo82 merged 1 commit into NVIDIA-NeMo:main from dingqingy-nv:qwen3_mxfp8_recipe_update

2 participants
