Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731) #23732
Merged
ByronHsu merged 2 commits into sgl-project:main on Apr 26, 2026
Conversation
Contributor
Code Review
This pull request introduces the should_use_dp_reduce_scatterv() guard across several Mixture-of-Experts (MoE) model implementations, including Bailing, DeepSeek-V2, GLM-4, Hunyuan-V3, MIMO-V2, MiniMax-M2, Sarvam, SDAR, and Step3.5. This guard is integrated into the forward pass logic to conditionally skip the final tensor model parallel or expert parallel all-reduce operations when a fused reduction is expected to be handled by an external communicator. I have no feedback to provide.
Force-pushed from d3e2694 to 4a1fbbb
Collaborator
/tag-and-rerun-ci again
Follow-up to sgl-project#23731 (Qwen3 MoE). PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator, but only patched qwen2_moe.py to skip the model-side tensor_model_parallel_all_reduce when the fast path is active. Every other MoE model that does the same post-experts all-reduce double-reduces under DP attention + EP, exactly as Qwen3 did. Reported in sgl-project#23431 with a real GSM8K nightly: 0.951 pre-sgl-project#22642 → 0.002–0.010 post → 0.980 with the guard.

Mirror the guard onto the affected MoE models:

- bailing_moe.py
- bailing_moe_linear.py
- deepseek_v2.py (forward_normal + dual-stream variant; forward_cpu intentionally untouched since the CPU path doesn't trigger the fast path)
- exaone_moe.py
- glm4_moe.py (both forward_normal and dual-stream)
- hunyuan_v3.py (uses moe_expert_parallel_all_reduce + moe_tensor_model_parallel_all_reduce like qwen3_moe; both branches must be skipped when the fast path is active)
- llada2.py
- llama4.py
- mimo_v2_flash.py
- minimax_m2.py
- sarvam_moe.py (forward_normal + dual-stream)
- sdar_moe.py
- step3p5.py

Each file gains the same one-line `and not should_use_dp_reduce_scatterv()` guard alongside the existing `should_use_flashinfer_cutlass_moe_fp4_allgather` guard (or its equivalent), matching the pattern used in qwen2_moe.py and qwen3_moe.py.

Supersedes sgl-project#23431 (same diff for the 12 files there) and adds hunyuan_v3.py.

Refs sgl-project#23729 sgl-project#23731 sgl-project#23431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
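The guard in question is a single boolean clause in the condition that gates the post-experts all-reduce. A minimal, self-contained sketch of the pattern (the helpers here are stubs standing in for the real sglang utilities, and the module structure is illustrative rather than the actual model code):

```python
import torch


def should_use_dp_reduce_scatterv() -> bool:
    # Stub for the real helper from sgl-project#22642: True when
    # LayerCommunicator will fuse the post-MoE all-reduce with dp_scatter
    # into a single reduce_scatterv.
    return False


def should_use_flashinfer_cutlass_moe_fp4_allgather() -> bool:
    # Stub for the pre-existing sibling guard named in the commit message.
    return False


def tensor_model_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Stub: the real collective sums x across the tensor-parallel group.
    return x


class MoEBlock(torch.nn.Module):
    def __init__(self, tp_size: int):
        super().__init__()
        self.tp_size = tp_size

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        final_hidden_states = hidden_states  # stand-in for the experts' partial output
        if (
            self.tp_size > 1
            and not should_use_flashinfer_cutlass_moe_fp4_allgather()
            # The one-line fix: skip the model-side reduce when the fused
            # reduce_scatterv fast path will perform the reduction instead.
            and not should_use_dp_reduce_scatterv()
        ):
            final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
        return final_hidden_states
```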
Force-pushed from 4a1fbbb to 9cbd0f8
Merged
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026

…llow-up to sgl-project#23731) (sgl-project#23732)

Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Motivation
Follow-up to #23731 (Qwen3 MoE). Supersedes #23431 (same diff for the 12 files there) by also fixing `hunyuan_v3.py`.

PR #22642 introduced `should_use_dp_reduce_scatterv()`, which fuses the post-MoE all-reduce with `dp_scatter` into a single `reduce_scatterv` call inside `LayerCommunicator`. To avoid a double-reduce, the model-side `tensor_model_parallel_all_reduce` (or `moe_*_all_reduce`) on `final_hidden_states` must be skipped when this fast path is active. That PR added the guard only to `qwen2_moe.py`; #23731 fixed `qwen3_moe.py`. Every other MoE model that does the same post-experts all-reduce silently double-reduces when running with DP attention + EP + `moe_a2a_backend="none"`, the same regression pattern as #23729.
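For intuition on why the double-reduce destroys accuracy rather than merely wasting bandwidth: both the model-side all-reduce and the fused `reduce_scatterv` sum the partial expert outputs across the group, so running both scales the activations by the group size. A toy single-process sketch of that arithmetic (plain tensor math standing in for the collectives; no distributed setup needed):

```python
import torch

world_size = 4
# Each rank's partial expert output before any reduction.
partials = [torch.ones(2) for _ in range(world_size)]

# One reduction (either collective) yields the correct sum.
reduced = torch.stack(partials).sum(dim=0)                       # tensor([4., 4.])

# Reducing again sums the already-identical copies held by every rank,
# multiplying activations by world_size.
double_reduced = torch.stack([reduced] * world_size).sum(dim=0)  # tensor([16., 16.])

print(reduced, double_reduced)  # the model sees 16.0 where it should see 4.0
```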
#23431 surfaced this in the nightly suite via the `test_minimax_m25.py` variant TP8+DP8+EP8+DPAttn.

Modifications
13 files (~16 reduce sites). The 12 files from #23431 plus `hunyuan_v3.py`:

- `bailing_moe.py`
- `bailing_moe_linear.py`
- `deepseek_v2.py`: `forward_normal` + dual-stream; `forward_cpu` intentionally untouched (the CPU path doesn't trigger the fast path)
- `exaone_moe.py`
- `glm4_moe.py`: `forward_normal` + dual-stream
- `hunyuan_v3.py`: `moe_expert_parallel_all_reduce` + `moe_tensor_model_parallel_all_reduce` (same shape as `qwen3_moe.py`); both branches gated
- `llada2.py`
- `llama4.py`
- `mimo_v2_flash.py`
- `minimax_m2.py`
- `sarvam_moe.py`: `forward_normal` + dual-stream
- `sdar_moe.py`
- `step3p5.py`

Each `and not should_use_flashinfer_cutlass_moe_fp4_allgather()` guard gets a sibling `and not should_use_dp_reduce_scatterv()` line, matching the pattern from `qwen2_moe.py` and `qwen3_moe.py`. `hunyuan_v3.py` does not have the fp4 guard, so a `skip_post_reduce = should_use_dp_reduce_scatterv()` local short-circuits both reduces; see the sketch below.
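As referenced above, a sketch of the `hunyuan_v3.py`-style change, where a single local flag gates both reduce branches (stub helpers again; the real branch conditions in the model may differ):

```python
import torch


def should_use_dp_reduce_scatterv() -> bool:
    return False  # stub for the real sglang helper


def moe_expert_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # stub: the real collective sums x across the EP group


def moe_tensor_model_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # stub: the real collective sums x across the TP group


def post_experts_reduce(
    final_hidden_states: torch.Tensor, ep_size: int, tp_size: int
) -> torch.Tensor:
    # With no fp4-allgather guard to extend in this file, one local flag
    # short-circuits both branches.
    skip_post_reduce = should_use_dp_reduce_scatterv()
    if ep_size > 1 and not skip_post_reduce:
        final_hidden_states = moe_expert_parallel_all_reduce(final_hidden_states)
    elif tp_size > 1 and not skip_post_reduce:
        final_hidden_states = moe_tensor_model_parallel_all_reduce(final_hidden_states)
    return final_hidden_states
```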
Validation

`minimax_m2` from "Fix DP-Attention reduce_scatterv missing guard in MiniMax/Bailing MoE" (#23431): 0.060 → 0.980 on H200 4×GPU (TP=4 DP=4 EP=4) with all `should_use_dp_reduce_scatterv()` conditions satisfied.

Checklist
cc @YAMY1234 (PR #22642 author)
Refs #23729 #23731 #23431