[perf,recipe] Fix Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher #2499
Conversation
f96789d force-pushed to 789bd67
📝 Walkthrough

Configuration updates to Qwen3 training parameters on B200 hardware, including a change of the MoE token dispatcher type from "alltoall" to "flex" and updates to the micro-batch size, flex dispatcher backend, and CUDA graph scope settings across three B200 FP8 workload configurations.
🧹 Nitpick comments (1)
scripts/performance/configs/qwen/qwen3_workload_base_configs.py (1)
411-418: `B200_FP8_MX_V1` content is identical to `B200_FP8_CS_V1` — and differs from `GB200_FP8_MX_V1`. Two things worth noting:
Duplication: the new explicit `replace(BASE_QWEN3_30B_A3B_CONFIG, ...)` body is byte-for-byte identical to `QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1` (lines 401–408). B300 and GB300 handle this case with a simple alias (MX = CS). If CS and MX are meant to stay in sync here, an alias would avoid silent divergence.

♻️ Optional: revert to alias pattern (consistent with B300/GB300)
```diff
-QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = replace(
-    BASE_QWEN3_30B_A3B_CONFIG,
-    num_gpus=8,
-    micro_batch_size=4,
-    moe_flex_dispatcher_backend="hybridep",
-    cuda_graph_impl="transformer_engine",
-    cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
-)
+QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 = QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1
```
Asymmetry with GB200 FP8 MX: `QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1` (line 364) uses `cuda_graph_scope=["moe_router", "moe_preprocess"]` — it intentionally excludes `"attn"`. The new B200 FP8 MX includes `"attn"`. Please confirm this divergence from the GB200 FP8 MX pattern is deliberate (e.g., due to different MX kernel support on B200 standalone vs. the GB200 NVLink array).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen/qwen3_workload_base_configs.py` around lines 411-418: the new QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_MX_V1 is byte-for-byte identical to QWEN3_30B_A3B_PRETRAIN_CONFIG_B200_FP8_CS_V1, creating duplication and risking silent drift; either make B200_FP8_MX_V1 an alias of B200_FP8_CS_V1 (like the B300/GB300 pattern) or deliberately keep the explicit replace but document why; also verify whether the cuda_graph_scope for B200_FP8_MX_V1 should include "attn" (it currently does) or match QWEN3_30B_A3B_PRETRAIN_CONFIG_GB200_FP8_MX_V1 which omits "attn" — adjust the cuda_graph_scope accordingly or add a comment explaining the intentional divergence from GB200_FP8_MX_V1.
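The alias pattern the review recommends can be sketched with a minimal stand-in config. The `WorkloadConfig` dataclass and its field defaults below are hypothetical simplifications for illustration, not the real classes in the repo; only the aliasing mechanics matter here:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkloadConfig:
    # Hypothetical stand-in for the fields in BASE_QWEN3_30B_A3B_CONFIG
    num_gpus: int = 8
    micro_batch_size: int = 1
    moe_token_dispatcher_type: str = "flex"
    moe_flex_dispatcher_backend: str = "deepep"
    cuda_graph_impl: str = "none"
    cuda_graph_scope: tuple = ()


BASE_CONFIG = WorkloadConfig()

# Explicit variant, analogous to the B200 FP8 CS config
CS_V1 = replace(
    BASE_CONFIG,
    micro_batch_size=4,
    moe_flex_dispatcher_backend="hybridep",
    cuda_graph_impl="transformer_engine",
    cuda_graph_scope=("attn", "moe_router", "moe_preprocess"),
)

# Alias pattern: MX is the same object as CS, so it can never silently
# drift out of sync the way a duplicated replace(...) body can.
MX_V1 = CS_V1

print(MX_V1 is CS_V1)          # True
print(MX_V1.micro_batch_size)  # 4
```

With the alias, any future tuning of CS_V1 automatically carries over to MX_V1; with a copied `replace(...)` body, someone has to remember to edit both.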
📒 Files selected for processing (2)

- scripts/performance/configs/qwen/qwen3_llm_pretrain.py
- scripts/performance/configs/qwen/qwen3_workload_base_configs.py
…+flex dispatcher

The B200 workload configs for qwen3_30b_a3b were missing several performance-critical settings compared to other GPU variants (GB200, GB300, B300, H100):

1. `moe_flex_dispatcher_backend` was not set to "hybridep" (inherited "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized hybrid EP dispatcher from being used.
2. `moe_token_dispatcher_type` was hardcoded to "alltoall" in `qwen3_30b_a3b_pretrain_config_b200()`, which silently disables the flex dispatcher backend even when the override is passed via CLI. Changed to "flex" to match all other GPU variants.
3. `micro_batch_size=4` was not set, leaving the field unset.
4. `"attn"` was missing from `cuda_graph_scope`, losing CUDA graph coverage for attention kernels.
5. FP8_CS_V1 and FP8_MX_V1 are now aliases of BF16_V1 to reduce redundancy, matching the pattern used for B300 and GB300 variants.

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:

- BF16: ~546 TFLOP/s/GPU
- FP8_MX: ~555 TFLOP/s/GPU

Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
789bd67 force-pushed to d622b37
/ok to test d622b37
ko3n1g left a comment:
Did we test this internally and update golden values?
Blocking, but @malay-nagda can override
Yes, created a PR for that too.
… dispatcher (#2499) Signed-off-by: Lifu Zhang <lifuz@nvidia.com>
What does this PR do ?
This PR fixes Qwen3 30B A3B B200 perf config to use hybridep+flex dispatcher
Changelog
The B200 workload configs for qwen3_30b_a3b were missing several performance-critical settings compared to other GPU variants (GB200, GB300, B300, H100):
- `moe_flex_dispatcher_backend` was not set to "hybridep" (inherited "deepep" from BASE_QWEN3_30B_A3B_CONFIG), preventing the optimized hybrid EP dispatcher from being used.
- `moe_token_dispatcher_type` was hardcoded to "alltoall" in `qwen3_30b_a3b_pretrain_config_b200()`, which silently disables the flex dispatcher backend even when the override is passed via CLI. Changed to "flex" to match all other GPU variants.
- `micro_batch_size=4` was not set, leaving the field unset.
- `"attn"` was missing from `cuda_graph_scope`, losing CUDA graph coverage for attention kernels.

Validated on 8x B200 single-node with mbs=4, gbs=512, seq=4096:

- BF16: ~546 TFLOP/s/GPU
- FP8_MX: ~555 TFLOP/s/GPU
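The "silently disables" interaction described above can be illustrated with a toy gating function. This is a hypothetical simplification of how the dispatcher is selected, using the same config key names as the workload configs; it is not the actual selection logic in Megatron:

```python
def effective_dispatcher(cfg: dict) -> str:
    """Toy model of dispatcher selection: the flex backend only takes
    effect when the token dispatcher type is "flex"; with any other
    type, the backend setting is silently ignored."""
    if cfg["moe_token_dispatcher_type"] == "flex":
        return cfg["moe_flex_dispatcher_backend"]
    return cfg["moe_token_dispatcher_type"]


# Before the fix: the backend override is present but has no effect,
# because the dispatcher type was hardcoded to "alltoall".
before = {"moe_token_dispatcher_type": "alltoall",
          "moe_flex_dispatcher_backend": "hybridep"}

# After the fix: the dispatcher type is "flex", so "hybridep" is used.
after = {"moe_token_dispatcher_type": "flex",
         "moe_flex_dispatcher_backend": "hybridep"}

print(effective_dispatcher(before))  # alltoall
print(effective_dispatcher(after))   # hybridep
```

This is why setting `moe_flex_dispatcher_backend="hybridep"` alone was not enough: the hardcoded dispatcher type short-circuited the backend choice entirely.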
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information