
[perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher #2499

Merged
malay-nagda merged 1 commit into NVIDIA-NeMo:main from
tomlifu:perf/fix-qwen3-30b-a3b-b200-config
Feb 27, 2026

Conversation

@tomlifu
Contributor

@tomlifu tomlifu commented Feb 23, 2026

What does this PR do ?

This PR fixes Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher

Changelog

The B200 workload configs for qwen3_30b_a3b were missing several performance-critical settings compared to other GPU variants (GB200, GB300, B300, H100):

  1. moe_flex_dispatcher_backend was not set to "hybridep" (inherited "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized hybrid EP dispatcher from being used.

  2. moe_token_dispatcher_type was hardcoded to "alltoall" in qwen3_30b_a3b_pretrain_config_b200(), which silently disables the flex dispatcher backend even when the override is passed via CLI. Changed to "flex" to match all other GPU variants.

  3. micro_batch_size was left unset; it is now explicitly set to 4, matching the validated configuration.

  4. "attn" was missing from cuda_graph_scope, losing CUDA graph coverage for attention kernels.
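Taken together, the four fixes above follow the `dataclasses.replace` override pattern the workload configs use. The sketch below is a minimal, self-contained illustration — `WorkloadConfig`, its field names, and its defaults are hypothetical stand-ins modeled on the settings named in this PR, not the real `BASE_QWEN3_30B_A3B_CONFIG` dataclass:

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional

# Hypothetical stand-in for BASE_QWEN3_30B_A3B_CONFIG; only the fields
# discussed in this PR are modeled, with the pre-fix values as defaults.
@dataclass(frozen=True)
class WorkloadConfig:
    num_gpus: int = 8
    micro_batch_size: Optional[int] = None          # was left unset (fix 3)
    moe_token_dispatcher_type: str = "alltoall"     # hardcoded (fix 2)
    moe_flex_dispatcher_backend: str = "deepep"     # inherited (fix 1)
    cuda_graph_scope: List[str] = field(
        default_factory=lambda: ["moe_router", "moe_preprocess"]  # no "attn" (fix 4)
    )

BASE = WorkloadConfig()

# After this PR, the B200 config overrides all four fields to match the
# other GPU variants (GB200, GB300, B300, H100):
B200_V1 = replace(
    BASE,
    micro_batch_size=4,
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="hybridep",
    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
)

print(B200_V1.moe_flex_dispatcher_backend)  # hybridep
print(B200_V1.cuda_graph_scope[0])          # attn
```

Because `replace` builds a new frozen instance, the base config keeps its old defaults while each GPU variant carries only its own overrides.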

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:

  • BF16: ~546 TFLOP/s/GPU
  • FP8_MX: ~555 TFLOP/s/GPU

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Updated Qwen model training configurations for improved performance and efficiency.
    • Adjusted batch processing parameters and token distribution strategies to optimize training throughput.
    • Extended computation optimization scope to include additional model components, enhancing training performance.
    • Standardized core parameters across training configuration variants for consistency.

@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tomlifu tomlifu changed the title [perf,recipe] fix: Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher [perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher Feb 23, 2026
@tomlifu tomlifu force-pushed the perf/fix-qwen3-30b-a3b-b200-config branch from f96789d to 789bd67 Compare February 23, 2026 23:18
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Configuration updates to QWEN3 training parameters on B200 hardware, including a change of the MOE token dispatcher type from "alltoall" to "flex" and updates to micro-batch size, flex dispatcher backend, and CUDA graph scope settings across three B200 FP8 workload configurations.

Changes

Cohort / File(s) Summary
MOE Token Dispatcher Update
scripts/performance/configs/qwen/qwen3_llm_pretrain.py
Changed the MOE token dispatcher type from "alltoall" to "flex" in the B200 pretrain configuration.
Workload Configuration Updates
scripts/performance/configs/qwen/qwen3_workload_base_configs.py
Updated three B200 FP8 configs (V1, CS_V1, MX_V1) to use micro_batch_size=4, moe_flex_dispatcher_backend="hybridep", and extended cuda_graph_scope to include "attn". Converted MX_V1 from alias to explicit config definition.
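The "alias vs. explicit config definition" distinction noted above matters because an alias keeps two configs in lockstep, while an explicit copy can silently drift. A minimal sketch of the two patterns (hypothetical class and field names, simplified from the real configs):

```python
from dataclasses import dataclass, replace

# Hypothetical simplified config; not the real dataclass from the repo.
@dataclass(frozen=True)
class Cfg:
    micro_batch_size: int = 1
    backend: str = "deepep"

BASE = Cfg()

# Explicit definitions: two independent objects that happen to match today.
CS_V1 = replace(BASE, micro_batch_size=4, backend="hybridep")
MX_V1_explicit = replace(BASE, micro_batch_size=4, backend="hybridep")

# Alias pattern: one object, two names. Any edit to CS_V1's definition
# automatically applies to the alias, so the pair cannot diverge.
MX_V1_alias = CS_V1

assert MX_V1_alias is CS_V1             # same object
assert MX_V1_explicit is not CS_V1      # equal today, free to drift tomorrow
assert MX_V1_explicit == CS_V1
```

This is the tradeoff behind the reviewer's later suggestion to keep MX as an alias of CS where the two are meant to stay in sync.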

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

performance, performance/optimize

Suggested reviewers

  • ko3n1g
  • yaoyu-33
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR includes after-deployment performance numbers (BF16: ~546 TFLOP/s/GPU, FP8_MX: ~555 TFLOP/s/GPU) but lacks a before-deployment baseline for comparison. Resolution: add baseline performance numbers from the original configuration to demonstrate that the dispatcher backend and type changes actually improve performance.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately and specifically summarizes the main change: fixing the B200 performance config to use the hybridep+flex dispatcher for Qwen3 30B A3B.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
scripts/performance/configs/qwen/qwen3_workload_base_configs.py (1)

411-418: B200_FP8_MX_V1 content is identical to B200_FP8_CS_V1 — and differs from GB200_FP8_MX_V1.

Two things worth noting:

  1. Duplication: the new explicit replace(BASE_QWEN3_30B_A3B_CONFIG, ...) body is byte-for-byte identical to QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1 (lines 401–408). B300 and GB300 handle this case with a simple alias (MX = CS). If CS and MX are meant to stay in sync here, an alias would avoid silent divergence.

    ♻️ Optional: revert to alias pattern (consistent with B300/GB300)
    -QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = replace(
    -    BASE_QWEN3_30B_A3B_CONFIG,
    -    num_gpus=8,
    -    micro_batch_size=4,
    -    moe_flex_dispatcher_backend="hybridep",
    -    cuda_graph_impl="transformer_engine",
    -    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
    -)
    +QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1
  2. Asymmetry with GB200 FP8 MX: QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1 (line 364) uses cuda_graph_scope=["moe_router", "moe_preprocess"] — it intentionally excludes "attn". The new B200 FP8 MX includes "attn". Please confirm this divergence from the GB200 FP8 MX pattern is deliberate (e.g., due to different MX kernel support on B200 standalone vs. the GB200 NVLink array).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/qwen/qwen3_workload_base_configs.py` around lines
411 - 418, The new QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 is byte-for-byte
identical to QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1, creating duplication
and risking silent drift; either make B200_FP8_MX_V1 an alias of B200_FP8_CS_V1
(like the B300/GB300 pattern) or deliberately keep the explicit replace but
document why; also verify whether the cuda_graph_scope for B200_FP8_MX_V1 should
include "attn" (it currently does) or match
QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1 which omits "attn"—adjust the
cuda_graph_scope accordingly or add a comment explaining the intentional
divergence from GB200_FP8_MX_V1.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a64507 and f96789d.

📒 Files selected for processing (2)
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py

malay-nagda
malay-nagda previously approved these changes Feb 24, 2026
…+flex dispatcher

The B200 workload configs for qwen3_30b_a3b were missing several
performance-critical settings compared to other GPU variants (GB200,
GB300, B300, H100):

1. `moe_flex_dispatcher_backend` was not set to "hybridep" (inherited
   "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized
   hybrid EP dispatcher from being used.

2. `moe_token_dispatcher_type` was hardcoded to "alltoall" in
   `qwen3_30b_a3b_pretrain_config_b200()`, which silently disables the
   flex dispatcher backend even when the override is passed via CLI.
   Changed to "flex" to match all other GPU variants.

3. `micro_batch_size=4` was not set, leaving the field unset.

4. `"attn"` was missing from `cuda_graph_scope`, losing CUDA graph
   coverage for attention kernels.

5. FP8_CS_V1 and FP8_MX_V1 are now aliases of BF16_V1 to reduce
   redundancy, matching the pattern used for B300 and GB300 variants.

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:
- BF16:    ~546 TFLOP/s/GPU
- FP8_MX:  ~555 TFLOP/s/GPU

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
@tomlifu tomlifu force-pushed the perf/fix-qwen3-30b-a3b-b200-config branch from 789bd67 to d622b37 Compare February 24, 2026 17:32
@erhoo82 erhoo82 added the r0.3.0 Cherry-pick label for r0.3.0 release branch label Feb 24, 2026
@erhoo82 erhoo82 requested a review from ko3n1g February 24, 2026 17:35
@erhoo82
Contributor

erhoo82 commented Feb 26, 2026

/ok to test d622b37

Contributor

@ko3n1g ko3n1g left a comment


Did we test this internally and update golden values?

Blocking, but @malay-nagda can override

@malay-nagda malay-nagda self-requested a review February 27, 2026 09:40
@malay-nagda
Contributor

Did we test this internally and update golden values?

Blocking, but @malay-nagda can override

yes. created a PR for that too.

@malay-nagda malay-nagda merged commit 0ec1df7 into NVIDIA-NeMo:main Feb 27, 2026
56 checks passed
malay-nagda pushed a commit that referenced this pull request Feb 27, 2026
… dispatcher (#2499)

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
@malay-nagda malay-nagda added performance performance/release Performance items related with NeMo release labels Feb 27, 2026
copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
… dispatcher (#2499)

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>

Labels

  • performance
  • performance/release — Performance items related with NeMo release
  • r0.3.0 — Cherry-pick label for r0.3.0 release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants