cp: `Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad norm (2209)` into `r0.3.0` by ko3n1g · Pull Request #2210 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-02-04T07:08:16Z

beep boop [🤖]: Hi @dingqingy-nv 👋,

we've cherry-picked #2209 into `r0.3.0` for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Chores
- Updated pretraining configurations for model training to adjust parallelization and GPU execution parameters, optimizing training efficiency.

ko3n1g · 2026-02-04T07:08:19Z

/ok to test cecb77e

copy-pr-bot · 2026-02-04T07:08:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-02-04T07:11:44Z

📝 Walkthrough

Modified Qwen3 model pretraining configurations on GB300 and GB200 GPUs. Removed virtual pipeline model parallel sizing, increased expert model parallel size to 32, and enabled CUDA graph optimization for attention, MoE router, and MoE preprocessing operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Qwen3 Config Updates `scripts/performance/configs/qwen/qwen3_workload_base_configs.py` | Removed `virtual_pipeline_model_parallel_size=12`, increased `expert_model_parallel_size` from 16 to 32 for GB300 config, set `expert_model_parallel_size=32` for GB200 config, and added `cuda_graph_scope=["attn", "moe_router", "moe_preprocess"]` for GPU optimization. |
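The GB300 change summarized above can be sketched as a before/after pair. This is only an illustration: `ParallelismSketch` is a hypothetical stand-in dataclass, not the configuration class used in `qwen3_workload_base_configs.py`, and only the three field names and values quoted in the summary come from the PR.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class ParallelismSketch:
    """Hypothetical container for the three settings named in the summary."""

    expert_model_parallel_size: int
    virtual_pipeline_model_parallel_size: Optional[int] = None
    cuda_graph_scope: List[str] = field(default_factory=list)


# GB300 values before the change, as reported in the summary.
gb300_before = ParallelismSketch(
    expert_model_parallel_size=16,
    virtual_pipeline_model_parallel_size=12,
)

# After the change: virtual pipeline sizing removed (modeled here as None),
# expert parallelism raised to 32, CUDA graph scope added.
gb300_after = replace(
    gb300_before,
    expert_model_parallel_size=32,
    virtual_pipeline_model_parallel_size=None,
    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
)

print(gb300_after)
```

Modeling the removed `virtual_pipeline_model_parallel_size` as `None` is an assumption of this sketch; the summary only says the setting was removed.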

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Megatron-Bridge#2209: Modifies identical QWEN3 configurations in the same file with overlapping parameter changes (removing virtual pipeline parallelism, increasing expert model parallelism to 32, and adding CUDA graph scope).

Suggested reviewers

dingqingy-nv
thomasdhc

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR contains major changes affecting model numerics (NaN gradient norm resolution) and performance (expert parallelism and cuda_graph optimization) but lacks test results and validation metrics in the description. | Add test results, performance metrics, and numerical stability validation demonstrating the NaN issue resolution to the PR description. |
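One way to produce the numerical stability evidence this check asks for is to scan the logged gradient norms of a run for non-finite values. The following is a minimal sketch under stated assumptions: the helper name `find_non_finite_steps` and the sample values are made up for illustration, and how gradient norms are actually extracted from a Megatron-Bridge run is not shown in this PR.

```python
import math
from typing import Iterable, List


def find_non_finite_steps(grad_norms: Iterable[float]) -> List[int]:
    """Return the zero-based step indices whose grad norm is NaN or infinite."""
    return [
        step
        for step, norm in enumerate(grad_norms)
        if not math.isfinite(norm)
    ]


# Made-up values standing in for grad norms parsed from a training log.
logged_grad_norms = [1.8, 1.6, float("nan"), 1.5]

bad_steps = find_non_finite_steps(logged_grad_norms)
if bad_steps:
    print(f"non-finite grad norm at steps: {bad_steps}")
else:
    print("all grad norms are finite")
```

An empty result over a sufficiently long run with the updated recipe would be the kind of validation the warning requests.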

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: updating Qwen3 235B A22B MXFP8 recipe for GB200/300 and resolving a NaN grad norm issue, which aligns with the configuration modifications shown in the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad no…

cecb77e

…rm (#2209) Signed-off-by: Dingqing Yang <dingqingy@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

ko3n1g requested a review from dingqingy-nv February 4, 2026 07:08

ko3n1g added the cherry-pick and Run CICD labels Feb 4, 2026

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 07:08 Inactive

copy-pr-bot bot temporarily deployed to test February 4, 2026 07:09 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 4, 2026 07:11 Inactive

dingqingy-nv approved these changes Feb 4, 2026

copy-pr-bot bot had a problem deploying to nemo-ci February 4, 2026 07:18 Failure

ko3n1g merged commit d7a13b1 into r0.3.0 Feb 6, 2026
18 of 20 checks passed

ko3n1g deleted the cherry-pick-2209-r0.3.0 branch February 6, 2026 18:15

ko3n1g added a commit that referenced this pull request Feb 7, 2026

Revert "cp: `Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolv…

34aec47

…e NaN grad norm (2209)` into `r0.3.0` (#2210)" This reverts commit d7a13b1.

ko3n1g added a commit that referenced this pull request Feb 8, 2026

Reapply "cp: `Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resol…

f2fee27

…ve NaN grad norm (2209)` into `r0.3.0` (#2210)" This reverts commit 34aec47.

coderabbitai bot mentioned this pull request Feb 13, 2026

cp: Update Deepseek V3 MXFP8 GB200 mapping (2215) into r0.3.0 #2378

Merged

This was referenced Feb 22, 2026

Update Qwen3 30B H100 Base Configs with HybridEP #2477

Merged

[perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher #2499

Merged

This was referenced Mar 5, 2026

Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs #2669

Merged

Unify bf16 gb300 qwen3 235b mapping #2670

Merged

ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2209-r0.3.0
2 participants