
[perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher #2499

Merged
malay-nagda merged 1 commit into NVIDIA-NeMo:main from
tomlifu:perf/fix-qwen3-30b-a3b-b200-config
Feb 27, 2026

Conversation

@tomlifu
Contributor

@tomlifu tomlifu commented Feb 23, 2026

What does this PR do ?

This PR fixes Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher

Changelog

The B200 workload configs for qwen3_30b_a3b were missing several performance-critical settings compared to other GPU variants (GB200, GB300, B300, H100):

  1. moe_flex_dispatcher_backend was not set to "hybridep" (inherited "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized hybrid EP dispatcher from being used.

  2. moe_token_dispatcher_type was hardcoded to "alltoall" in qwen3_30b_a3b_pretrain_config_b200(), which silently disables the flex dispatcher backend even when the override is passed via CLI. Changed to "flex" to match all other GPU variants.

  3. micro_batch_size was left unset; it is now explicitly set to 4, matching the validated configuration.

  4. "attn" was missing from cuda_graph_scope, losing CUDA graph coverage for attention kernels.
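Taken together, the four fixes above follow the `dataclasses.replace` override pattern the workload configs use. The sketch below is a minimal, self-contained illustration — `WorkloadConfig`, its field names, and its defaults are hypothetical stand-ins modeled on the settings named in this PR, not the real `BASE_QWEN3_30B_A3B_CONFIG` dataclass:

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional

# Hypothetical stand-in for BASE_QWEN3_30B_A3B_CONFIG; only the fields
# discussed in this PR are modeled, with the pre-fix values as defaults.
@dataclass(frozen=True)
class WorkloadConfig:
    num_gpus: int = 8
    micro_batch_size: Optional[int] = None          # was left unset (fix 3)
    moe_token_dispatcher_type: str = "alltoall"     # hardcoded (fix 2)
    moe_flex_dispatcher_backend: str = "deepep"     # inherited (fix 1)
    cuda_graph_scope: List[str] = field(
        default_factory=lambda: ["moe_router", "moe_preprocess"]  # no "attn" (fix 4)
    )

BASE = WorkloadConfig()

# After this PR, the B200 config overrides all four fields to match the
# other GPU variants (GB200, GB300, B300, H100):
B200_V1 = replace(
    BASE,
    micro_batch_size=4,
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="hybridep",
    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
)

print(B200_V1.moe_flex_dispatcher_backend)  # hybridep
print(B200_V1.cuda_graph_scope[0])          # attn
```

Because `replace` builds a new frozen instance, the base config keeps its old defaults while each GPU variant carries only its own overrides.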

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:

  • BF16: ~546 TFLOP/s/GPU
  • FP8_MX: ~555 TFLOP/s/GPU

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Updated Qwen model training configurations for improved performance and efficiency.
    • Adjusted batch processing parameters and token distribution strategies to optimize training throughput.
    • Extended computation optimization scope to include additional model components, enhancing training performance.
    • Standardized core parameters across training configuration variants for consistency.

@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tomlifu tomlifu changed the title [perf,recipe] fix: Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher [perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher Feb 23, 2026
@tomlifu tomlifu force-pushed the perf/fix-qwen3-30b-a3b-b200-config branch from f96789d to 789bd67 Compare February 23, 2026 23:18
@coderabbitai
Contributor

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Configuration updates to QWEN3 training parameters on B200 hardware, including a change of the MOE token dispatcher type from "alltoall" to "flex" and updates to micro-batch size, flex dispatcher backend, and CUDA graph scope settings across three B200 FP8 workload configurations.

Changes

Cohort / File(s) Summary
MOE Token Dispatcher Update
scripts/performance/configs/qwen/qwen3_llm_pretrain.py
Changed the MOE token dispatcher type from "alltoall" to "flex" in the B200 pretrain configuration.
Workload Configuration Updates
scripts/performance/configs/qwen/qwen3_workload_base_configs.py
Updated three B200 FP8 configs (V1, CS_V1, MX_V1) to use micro_batch_size=4, moe_flex_dispatcher_backend="hybridep", and extended cuda_graph_scope to include "attn". Converted MX_V1 from alias to explicit config definition.
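The "alias vs. explicit config definition" distinction noted above matters because an alias keeps two configs in lockstep, while an explicit copy can silently drift. A minimal sketch of the two patterns (hypothetical class and field names, simplified from the real configs):

```python
from dataclasses import dataclass, replace

# Hypothetical simplified config; not the real dataclass from the repo.
@dataclass(frozen=True)
class Cfg:
    micro_batch_size: int = 1
    backend: str = "deepep"

BASE = Cfg()

# Explicit definitions: two independent objects that happen to match today.
CS_V1 = replace(BASE, micro_batch_size=4, backend="hybridep")
MX_V1_explicit = replace(BASE, micro_batch_size=4, backend="hybridep")

# Alias pattern: one object, two names. Any edit to CS_V1's definition
# automatically applies to the alias, so the pair cannot diverge.
MX_V1_alias = CS_V1

assert MX_V1_alias is CS_V1             # same object
assert MX_V1_explicit is not CS_V1      # equal today, free to drift tomorrow
assert MX_V1_explicit == CS_V1
```

This is the tradeoff behind the reviewer's later suggestion to keep MX as an alias of CS where the two are meant to stay in sync.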

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

performance, performance/optimize

Suggested reviewers

  • ko3n1g
  • yaoyu-33
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning: the PR includes after-deployment performance numbers (BF16: ~546 TFLOP/s/GPU, FP8_MX: ~555 TFLOP/s/GPU) but lacks a before-deployment baseline for comparison. Resolution: add baseline performance numbers from the original configuration to demonstrate that the dispatcher backend and type changes actually improve performance.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately and specifically summarizes the main change: fixing the B200 performance config to use the hybridep+flex dispatcher for Qwen3 30B A3B.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
scripts/performance/configs/qwen/qwen3_workload_base_configs.py (1)

411-418: B200_FP8_MX_V1 content is identical to B200_FP8_CS_V1 — and differs from GB200_FP8_MX_V1.

Two things worth noting:

  1. Duplication: the new explicit replace(BASE_QWEN3_30B_A3B_CONFIG, ...) body is byte-for-byte identical to QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1 (lines 401–408). B300 and GB300 handle this case with a simple alias (MX = CS). If CS and MX are meant to stay in sync here, an alias would avoid silent divergence.

    ♻️ Optional: revert to alias pattern (consistent with B300/GB300)
    -QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = replace(
    -    BASE_QWEN3_30B_A3B_CONFIG,
    -    num_gpus=8,
    -    micro_batch_size=4,
    -    moe_flex_dispatcher_backend="hybridep",
    -    cuda_graph_impl="transformer_engine",
    -    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
    -)
    +QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1
  2. Asymmetry with GB200 FP8 MX: QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1 (line 364) uses cuda_graph_scope=["moe_router", "moe_preprocess"] — it intentionally excludes "attn". The new B200 FP8 MX includes "attn". Please confirm this divergence from the GB200 FP8 MX pattern is deliberate (e.g., due to different MX kernel support on B200 standalone vs. the GB200 NVLink array).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/configs/qwen/qwen3_workload_base_configs.py` around lines
411 - 418, The new QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 is byte-for-byte
identical to QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1, creating duplication
and risking silent drift; either make B200_FP8_MX_V1 an alias of B200_FP8_CS_V1
(like the B300/GB300 pattern) or deliberately keep the explicit replace but
document why; also verify whether the cuda_graph_scope for B200_FP8_MX_V1 should
include "attn" (it currently does) or match
QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1 which omits "attn"—adjust the
cuda_graph_scope accordingly or add a comment explaining the intentional
divergence from GB200_FP8_MX_V1.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a64507 and f96789d.

📒 Files selected for processing (2)
  • scripts/performance/configs/qwen/qwen3_llm_pretrain.py
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py

malay-nagda
malay-nagda previously approved these changes Feb 24, 2026
…+flex dispatcher

The B200 workload configs for qwen3_30b_a3b were missing several
performance-critical settings compared to other GPU variants (GB200,
GB300, B300, H100):

1. `moe_flex_dispatcher_backend` was not set to "hybridep" (inherited
   "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized
   hybrid EP dispatcher from being used.

2. `moe_token_dispatcher_type` was hardcoded to "alltoall" in
   `qwen3_30b_a3b_pretrain_config_b200()`, which silently disables the
   flex dispatcher backend even when the override is passed via CLI.
   Changed to "flex" to match all other GPU variants.

3. `micro_batch_size=4` was not set, leaving the field unset.

4. `"attn"` was missing from `cuda_graph_scope`, losing CUDA graph
   coverage for attention kernels.

5. FP8_CS_V1 and FP8_MX_V1 are now aliases of BF16_V1 to reduce
   redundancy, matching the pattern used for B300 and GB300 variants.

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:
- BF16:    ~546 TFLOP/s/GPU
- FP8_MX:  ~555 TFLOP/s/GPU

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
@tomlifu tomlifu force-pushed the perf/fix-qwen3-30b-a3b-b200-config branch from 789bd67 to d622b37 Compare February 24, 2026 17:32
@erhoo82 erhoo82 added the r0.3.0 Cherry-pick label for r0.3.0 release branch label Feb 24, 2026
@erhoo82 erhoo82 requested a review from ko3n1g February 24, 2026 17:35
@erhoo82
Contributor

erhoo82 commented Feb 26, 2026

/ok to test d622b37

Contributor

@ko3n1g ko3n1g left a comment


Did we test this internally and update golden values?

Blocking, but @malay-nagda can override

@malay-nagda malay-nagda self-requested a review February 27, 2026 09:40
@malay-nagda
Contributor

Did we test this internally and update golden values?

Blocking, but @malay-nagda can override

yes. created a PR for that too.

@malay-nagda malay-nagda merged commit 0ec1df7 into NVIDIA-NeMo:main Feb 27, 2026
56 checks passed
malay-nagda pushed a commit that referenced this pull request Feb 27, 2026
… dispatcher (#2499)

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
@malay-nagda malay-nagda added performance performance/release Performance items related with NeMo release labels Feb 27, 2026
copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
… dispatcher (#2499)

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>

Labels

  • performance
  • performance/release — Performance items related with NeMo release
  • r0.3.0 — Cherry-pick label for r0.3.0 release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants