Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731) #23732
Merged
ByronHsu merged 2 commits into sgl-project:main on Apr 26, 2026
Conversation
Contributor
Code Review
This pull request introduces the should_use_dp_reduce_scatterv() guard across several Mixture-of-Experts (MoE) model implementations, including Bailing, DeepSeek-V2, GLM-4, Hunyuan-V3, MIMO-V2, MiniMax-M2, Sarvam, SDAR, and Step3.5. This guard is integrated into the forward pass logic to conditionally skip the final tensor model parallel or expert parallel all-reduce operations when a fused reduction is expected to be handled by an external communicator. I have no feedback to provide.
Force-pushed from d3e2694 to 4a1fbbb
Collaborator
/tag-and-rerun-ci again
Follow-up to sgl-project#23731 (Qwen3 MoE). PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator, but only patched qwen2_moe.py to skip the model-side tensor_model_parallel_all_reduce when the fast path is active. Every other MoE model that does the same post-experts all-reduce double-reduces under DP attention + EP, exactly as Qwen3 did. Reported in sgl-project#23431 with a real GSM8K nightly: 0.951 pre-sgl-project#22642 → 0.002–0.010 post → 0.980 with the guard.

Mirror the guard onto the affected MoE models:

- bailing_moe.py
- bailing_moe_linear.py
- deepseek_v2.py (forward_normal + dual-stream variant; forward_cpu intentionally untouched since the CPU path doesn't trigger the fast path)
- exaone_moe.py
- glm4_moe.py (both forward_normal and dual-stream)
- hunyuan_v3.py (uses moe_expert_parallel_all_reduce + moe_tensor_model_parallel_all_reduce like qwen3_moe; both branches must be skipped when the fast path is active)
- llada2.py
- llama4.py
- mimo_v2_flash.py
- minimax_m2.py
- sarvam_moe.py (forward_normal + dual-stream)
- sdar_moe.py
- step3p5.py

Each file gains the same one-line `and not should_use_dp_reduce_scatterv()` guard alongside the existing `should_use_flashinfer_cutlass_moe_fp4_allgather` guard (or its equivalent), matching the pattern used in qwen2_moe.py and qwen3_moe.py.

Supersedes sgl-project#23431 (same diff for the 12 files there) and adds hunyuan_v3.py.

Refs sgl-project#23729 sgl-project#23731 sgl-project#23431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
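The guard in question is a single boolean clause in the condition that gates the post-experts all-reduce. A minimal, self-contained sketch of the pattern (the helpers here are stubs standing in for the real sglang utilities, and the module structure is illustrative rather than the actual model code):

```python
import torch


def should_use_dp_reduce_scatterv() -> bool:
    # Stub for the real helper from sgl-project#22642: True when
    # LayerCommunicator will fuse the post-MoE all-reduce with dp_scatter
    # into a single reduce_scatterv.
    return False


def should_use_flashinfer_cutlass_moe_fp4_allgather() -> bool:
    # Stub for the pre-existing sibling guard named in the commit message.
    return False


def tensor_model_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Stub: the real collective sums x across the tensor-parallel group.
    return x


class MoEBlock(torch.nn.Module):
    def __init__(self, tp_size: int):
        super().__init__()
        self.tp_size = tp_size

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        final_hidden_states = hidden_states  # stand-in for the experts' partial output
        if (
            self.tp_size > 1
            and not should_use_flashinfer_cutlass_moe_fp4_allgather()
            # The one-line fix: skip the model-side reduce when the fused
            # reduce_scatterv fast path will perform the reduction instead.
            and not should_use_dp_reduce_scatterv()
        ):
            final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
        return final_hidden_states
```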
Force-pushed from 4a1fbbb to 9cbd0f8
Merged
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026

…llow-up to sgl-project#23731) (sgl-project#23732)

Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Motivation
Follow-up to #23731 (Qwen3 MoE). Supersedes #23431 (same diff for the 12 files there) by also fixing `hunyuan_v3.py`.

PR #22642 introduced `should_use_dp_reduce_scatterv()`, which fuses the post-MoE all-reduce with `dp_scatter` into a single `reduce_scatterv` call inside `LayerCommunicator`. To avoid a double-reduce, the model-side `tensor_model_parallel_all_reduce` (or `moe_*_all_reduce`) on `final_hidden_states` must be skipped when this fast path is active. That PR added the guard only to `qwen2_moe.py`; #23731 fixed `qwen3_moe.py`. Every other MoE model that does the same post-experts all-reduce silently double-reduces when running with DP attention + EP + `moe_a2a_backend="none"`, the same regression pattern as #23729.
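For intuition on why the double-reduce destroys accuracy rather than merely wasting bandwidth: both the model-side all-reduce and the fused `reduce_scatterv` sum the partial expert outputs across the group, so running both scales the activations by the group size. A toy single-process sketch of that arithmetic (plain tensor math standing in for the collectives; no distributed setup needed):

```python
import torch

world_size = 4
# Each rank's partial expert output before any reduction.
partials = [torch.ones(2) for _ in range(world_size)]

# One reduction (either collective) yields the correct sum.
reduced = torch.stack(partials).sum(dim=0)                       # tensor([4., 4.])

# Reducing again sums the already-identical copies held by every rank,
# multiplying activations by world_size.
double_reduced = torch.stack([reduced] * world_size).sum(dim=0)  # tensor([16., 16.])

print(reduced, double_reduced)  # the model sees 16.0 where it should see 4.0
```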
#23431 surfaced this in the nightly suite via the `test_minimax_m25.py` variant TP8+DP8+EP8+DPAttn.

Modifications
13 files (~16 reduce sites). The 12 files from #23431 plus `hunyuan_v3.py`:

- `bailing_moe.py`
- `bailing_moe_linear.py`
- `deepseek_v2.py`: `forward_normal` + dual-stream; `forward_cpu` intentionally untouched (the CPU path doesn't trigger the fast path)
- `exaone_moe.py`
- `glm4_moe.py`: `forward_normal` + dual-stream
- `hunyuan_v3.py`: `moe_expert_parallel_all_reduce` + `moe_tensor_model_parallel_all_reduce` (same shape as `qwen3_moe.py`); both branches gated
- `llada2.py`
- `llama4.py`
- `mimo_v2_flash.py`
- `minimax_m2.py`
- `sarvam_moe.py`: `forward_normal` + dual-stream
- `sdar_moe.py`
- `step3p5.py`

Each `and not should_use_flashinfer_cutlass_moe_fp4_allgather()` guard gets a sibling `and not should_use_dp_reduce_scatterv()` line, matching the pattern from `qwen2_moe.py` and `qwen3_moe.py`. `hunyuan_v3.py` does not have the fp4 guard, so a `skip_post_reduce = should_use_dp_reduce_scatterv()` local short-circuits both reduces; see the sketch below.
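As referenced above, a sketch of the `hunyuan_v3.py`-style change, where a single local flag gates both reduce branches (stub helpers again; the real branch conditions in the model may differ):

```python
import torch


def should_use_dp_reduce_scatterv() -> bool:
    return False  # stub for the real sglang helper


def moe_expert_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # stub: the real collective sums x across the EP group


def moe_tensor_model_parallel_all_reduce(x: torch.Tensor) -> torch.Tensor:
    return x  # stub: the real collective sums x across the TP group


def post_experts_reduce(
    final_hidden_states: torch.Tensor, ep_size: int, tp_size: int
) -> torch.Tensor:
    # With no fp4-allgather guard to extend in this file, one local flag
    # short-circuits both branches.
    skip_post_reduce = should_use_dp_reduce_scatterv()
    if ep_size > 1 and not skip_post_reduce:
        final_hidden_states = moe_expert_parallel_all_reduce(final_hidden_states)
    elif tp_size > 1 and not skip_post_reduce:
        final_hidden_states = moe_tensor_model_parallel_all_reduce(final_hidden_states)
    return final_hidden_states
```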
Validation

`minimax_m2` from "Fix DP-Attention reduce_scatterv missing guard in MiniMax/Bailing MoE" (#23431): 0.060 → 0.980 on H200 4×GPU (TP=4 DP=4 EP=4) with all `should_use_dp_reduce_scatterv()` conditions satisfied.

Checklist
cc @YAMY1234 (PR #22642 author)
Refs #23729 #23731 #23431