refactor(moe): centralize post-experts all-reduce skip predicate #23748
Merged
Kangyan-Zhou merged 2 commits on Apr 27, 2026
Conversation
The post-experts EP and TP all-reduce paths in qwen3_moe both gate on the same set of "downstream will absorb the all-reduce" predicates:

- should_allreduce_fusion (LayerCommunicator fuses with next layer)
- use_reduce_scatter (LayerCommunicator's post-attention reduce-scatter)
- should_use_dp_reduce_scatterv() (DP reduce-scatterv combine path)
- should_use_flashinfer_cutlass_moe_fp4_allgather() (TP path only)

Each new skip path has been added by sweeping every model file by hand, and EP-vs-TP drift has caused two recent correctness bugs (sgl-project#23729 and the follow-up fixed by sgl-project#23734).

Centralize the predicate so adding a new skip reason is a one-line change in one place, and EP and TP can no longer drift apart by accident. The helper is byte-identical to the existing qwen3_moe predicates -- verified by enumerating all 16 truth-table combinations of the four inputs.

hunyuan_v3 (the only other model with a separate EP-then-TP post-experts all-reduce pattern) is intentionally not migrated here: its current predicate is a strict subset, and switching it to the helper would silently add a flashinfer guard, which is a behavioral change that belongs in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
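For context, a minimal sketch of what a centralized predicate along these lines could look like; the parameter names, defaults, and exact shape here are assumptions based on this description, not the PR's actual code:

```python
# Hypothetical sketch of the centralized skip predicate described above.
# The four "downstream will absorb the all-reduce" inputs mirror the bullets
# in this commit message; False defaults let callers omit flags they lack.
def should_skip_post_experts_all_reduce(
    should_allreduce_fusion: bool = False,      # fused with next layer's all-reduce
    use_reduce_scatter: bool = False,           # post-attention reduce-scatter absorbs it
    use_dp_reduce_scatterv: bool = False,       # DP reduce-scatterv combine path
    use_flashinfer_fp4_allgather: bool = False, # flashinfer cutlass FP4 allgather
    is_tp_path: bool = False,                   # flashinfer guard applies only on TP
) -> bool:
    skip = should_allreduce_fusion or use_reduce_scatter or use_dp_reduce_scatterv
    if is_tp_path:
        skip = skip or use_flashinfer_fp4_allgather
    return skip
```

Under a shape like this, a new skip reason becomes one extra parameter and one extra clause in a single place, rather than a hand-edit in every model file.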
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
…_reduce
Migrate every MoE model with a post-experts all-reduce gated on the
"downstream will absorb the all-reduce" predicates to the helper:
bailing_moe, bailing_moe_linear, deepseek_v2, exaone_moe, glm4_moe,
hunyuan_v3, llada2, llama4, mimo_v2_flash, minimax_m2, qwen2_moe,
sarvam_moe, sdar_moe, step3p5
All migrations were verified by enumerating the 16 truth-table combos
of the four input flags. Categorization:
A. Byte-identical refactor -- the original predicate already covered all four flags (should_allreduce_fusion, use_reduce_scatter, dp_reduce_scatterv, flashinfer_cutlass_moe_fp4_allgather):
bailing_moe, deepseek_v2 (x2 sites), glm4_moe (x2), minimax_m2, mimo_v2_flash, qwen2_moe, sarvam_moe (x2), sdar_moe, step3p5.
B. Adds flashinfer guard (latent fix): the original predicate was a
strict subset of the helper's TP path. Helper now also skips when
should_use_flashinfer_cutlass_moe_fp4_allgather() is True; since
that predicate is gated on global FP4 + flashinfer cutlass runner,
it returns False outside that config (no-op there) and correctly
skips when active (avoiding double-reduce):
bailing_moe_linear, exaone_moe, llada2, llama4, hunyuan_v3 (TP).
C. Byte-identical on the EP path (the helper omits the flashinfer check on EP): hunyuan_v3.
Models without should_allreduce_fusion / use_reduce_scatter in their forward signature get the helper defaults, which are no-ops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
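To make the Category B claim concrete, here is a small check written against the hypothetical helper shape sketched under commit 1 above; the "old" predicate below is a stand-in for a strict-subset model, not copied from any real file:

```python
# Category B sketch: the old predicate ignored the flashinfer flag; the helper
# honors it on the TP path. Outside the FP4 + flashinfer cutlass config the
# flag is False, so behavior is unchanged; when True, skipping avoids the
# double-reduce described above.
from itertools import product

def old_subset_predicate(fusion, reduce_scatter, dp_rsv, fi_fp4_allgather):
    return fusion or reduce_scatter or dp_rsv  # flashinfer flag never consulted

for fusion, reduce_scatter, dp_rsv, fi_fp4 in product([False, True], repeat=4):
    old = old_subset_predicate(fusion, reduce_scatter, dp_rsv, fi_fp4)
    new = should_skip_post_experts_all_reduce(
        fusion, reduce_scatter, dp_rsv, fi_fp4, is_tp_path=True)
    if not fi_fp4:
        assert old == new  # no-op outside the FP4 config
    else:
        assert new         # helper skips, avoiding the double-reduce
```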
Collaborator
Author
/tag-run-ci-label
Collaborator
/rerun-failed-ci
1 similar comment
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026:
…-project#23748) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Motivation
The post-experts EP and TP all-reduce paths in MoE models gate on the same growing list of "downstream will absorb the all-reduce" predicates:
- `should_allreduce_fusion` -- LayerCommunicator will fuse with the next layer's residual all-reduce
- `use_reduce_scatter` -- LayerCommunicator's post-attention scatter does reduce-scatter
- `should_use_dp_reduce_scatterv()` -- DP reduce-scatterv combine path (Replace all-reduce + dp_scatter with reduce_scatterv for DP attention, #22642)
- `should_use_flashinfer_cutlass_moe_fp4_allgather()` -- the flashinfer kernel absorbs the TP all-reduce (TP path only)

Each new skip path has historically been added by sweeping every model file by hand. EP-vs-TP drift on this single predicate has produced two recent correctness bugs:
- The `should_use_dp_reduce_scatterv` guard was added to the TP path only, causing double-reduce on the EP path -- fixed by "Fix Qwen3 MoE double-reduce when DP attention + EP + reduce_scatterv (#23729)" (#23731).
- The `use_reduce_scatter` guard was on TP only, missed by the EP/TP split in "Support Qwen3 MoE context parallel" (#18233) and never caught by the partial restore in "Enable the qwen3 test" (#21195).

This PR centralizes the predicate so:
- adding a new skip reason is a one-line change in one place, and
- EP and TP can no longer drift apart by accident: the only difference between the paths is `is_tp_path`, which selects the TP-only flashinfer guard.

Modifications
Commit 1 -- Add helper. Add `should_skip_post_experts_all_reduce` to `python/sglang/srt/layers/moe/utils.py` and export it from `python/sglang/srt/layers/moe/__init__.py`. Refactor `Qwen3MoeSparseMoeBlock.forward_normal` to use it.

Commit 2 -- Migrate every other MoE model. Sweep across all models with a post-experts all-reduce gated on these predicates (a sketch of the shared call-site pattern follows the model list):
`bailing_moe`, `bailing_moe_linear`, `deepseek_v2`, `exaone_moe`, `glm4_moe`, `hunyuan_v3`, `llada2`, `llama4`, `mimo_v2_flash`, `minimax_m2`, `qwen2_moe`, `sarvam_moe`, `sdar_moe`, `step3p5`.
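As a rough illustration of the call-site pattern (again using the hypothetical helper shape sketched earlier; the function names below are placeholders, not the PR's actual diff), both combine paths ask the same question and differ only in `is_tp_path`:

```python
# Sketch: before, each path hand-rolled its own skip condition and could drift;
# after, both delegate to the shared helper and only is_tp_path differs.

def skip_ep_before(fusion, reduce_scatter, dp_rsv):
    # A drifted EP predicate of the kind that caused the double-reduce bug:
    # the dp reduce-scatterv flag was only ever added on the TP side.
    return fusion or reduce_scatter

def skip_tp_before(fusion, reduce_scatter, dp_rsv, fi_fp4_allgather):
    return fusion or reduce_scatter or dp_rsv or fi_fp4_allgather

def skip_ep_after(fusion, reduce_scatter, dp_rsv, fi_fp4_allgather):
    return should_skip_post_experts_all_reduce(
        fusion, reduce_scatter, dp_rsv, fi_fp4_allgather, is_tp_path=False)

def skip_tp_after(fusion, reduce_scatter, dp_rsv, fi_fp4_allgather):
    return should_skip_post_experts_all_reduce(
        fusion, reduce_scatter, dp_rsv, fi_fp4_allgather, is_tp_path=True)
```

With the shared helper, a flag forgotten on one path can no longer silently diverge from the other.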
Validation

Every migration was verified by enumerating all 16 truth-table combinations of the four input flags. Categorization:
A. Byte-identical refactor -- the original predicate already covered all four flags: `bailing_moe`, `deepseek_v2` (x2 sites), `glm4_moe` (x2), `minimax_m2`, `mimo_v2_flash`, `qwen2_moe`, `sarvam_moe` (x2), `sdar_moe`, `step3p5`.
B. Adds flashinfer guard (latent fix) -- the original predicate was a strict subset. The helper now also skips when `should_use_flashinfer_cutlass_moe_fp4_allgather()` is True. That predicate gates on the global FP4 + flashinfer cutlass runner config, so it returns False outside that config (no-op) and correctly skips when active (avoiding double-reduce): `bailing_moe_linear`, `exaone_moe`, `llada2`, `llama4`, `hunyuan_v3` (TP path).

The `hunyuan_v3` EP path is byte-identical (the helper omits the flashinfer check on EP). Models without `should_allreduce_fusion` / `use_reduce_scatter` in their forward signature get the helper defaults, which are no-ops.
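A sketch of the kind of exhaustive check described here, assuming the hypothetical helper shape sketched earlier is in scope; the predicate below stands in for a Category A model's pre-existing TP condition, and the real verification may have been done differently:

```python
# Enumerate all 16 combinations of the four flags and check that a Category A
# model's original TP predicate and the centralized helper always agree.
from itertools import product

def original_tp_predicate(fusion, reduce_scatter, dp_rsv, fi_fp4_allgather):
    # Category A: the original condition already covered all four flags.
    return fusion or reduce_scatter or dp_rsv or fi_fp4_allgather

for combo in product([False, True], repeat=4):
    assert original_tp_predicate(*combo) == should_skip_post_experts_all_reduce(
        *combo, is_tp_path=True), combo
```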
Checklist

🤖 Generated with Claude Code