feat: [2/2][DeepEP] Add waterfill load balancing for shared expert dispatch #19290
Merged
ch-wan merged 118 commits on May 14, 2026
This commit implements waterfill load balancing for the shared expert using the
DeepEP dispatch mechanism. The key idea is to treat the shared expert as a virtual
9th expert and dispatch it through DeepEP along with the routed experts.
Design principles:
1. Each token's shared expert can be sent to:
- One of the ranks it already routes to (no extra communication)
- Or stay at source rank for local computation
2. Waterfill algorithm selects the lowest-loaded rank from candidates
3. Shared expert weight = 1.0 / routed_scaling_factor (for correct combine)
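The selection rule in the design principles can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `waterfill_assign` is a hypothetical name, the per-token greedy update of the load and the tie-break in favour of the source rank are assumptions, and the real code operates on tensors rather than lists.

```python
from typing import List, Sequence


def waterfill_assign(
    topk_ranks: Sequence[Sequence[int]],
    rank_load: Sequence[int],
    source_rank: int,
) -> List[int]:
    """Pick, for each token, the least-loaded rank among its candidates.

    Candidates are the ranks the token already routes to (no extra
    communication) plus the source rank (local computation).
    """
    load = list(rank_load)  # mutable copy, updated as tokens are assigned
    destinations = []
    for ranks in topk_ranks:
        candidates = set(ranks) | {source_rank}
        # Lowest load wins; on a tie prefer the source rank, then the lower id.
        # (The tie-break is an assumption made for this sketch.)
        best = min(candidates, key=lambda r: (load[r], r != source_rank, r))
        load[best] += 1  # "water" fills the lowest level first
        destinations.append(best)
    return destinations
```

For example, with routed loads `[5, 0, 0, 0]` on four ranks and source rank 0, three tokens routed to ranks `{1, 2}`, `{1, 2}` and `{2, 3}` are spread over ranks 1, 2 and 3 rather than piling onto the already loaded rank 0.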
New files:
- python/sglang/srt/layers/moe/deepep_waterfill.py: Waterfill algorithm and helpers
Modified files:
- python/sglang/srt/server_args.py: Add --enable-deepep-waterfill flag
- python/sglang/srt/models/deepseek_v2.py: Add forward_deepep_waterfill method
Usage:
python -m sglang.launch_server --model-path <model> --tp 8 --ep 8 \
--moe-a2a-backend deepep --enable-deepep-waterfill
This script runs benchmarks to compare:
- Experiment 1: DeepEP baseline (no waterfill)
- Experiment 2: DeepEP + Waterfill
- Experiment 3: DeepEP + Waterfill with debug logging
Usage:
bash test/run_deepep_waterfill_benchmark.sh
Bugs fixed:
1. Used wrong rank function (get_tensor_model_parallel_rank -> get_moe_expert_parallel_rank)
2. expand_topk_for_shared_expert didn't use shared_destination parameter
3. Simplified implementation: all shared experts computed locally
4. Added alt_stream optimization for parallel shared expert computation
5. Added debug logging for load distribution analysis
This is a simplified implementation where shared experts are computed locally
on the source rank in parallel with DeepEP dispatch/combine. True cross-rank
waterfill (dispatching shared expert to already-routed ranks) requires DeepEP
protocol modifications and is left as future work.
Current flow:
1. Router + topk computation
2. Shared expert on alt_stream (parallel)
3. DeepEP dispatch for routed experts
4. MoE computation
5. DeepEP combine
6. Add shared expert result
…expert
Key Design:
1. Shared expert treated as virtual 9th routed expert
2. Virtual expert ID = target_rank * experts_per_rank (routes to correct rank)
3. Waterfill only assigns to ranks token already routes to (no extra comm)
4. Receiver identifies shared tokens via virtual ID and computes separately
5. Shared weight = 1/routed_scaling_factor for correct final scaling
Flow:
1. Router + topk(8)
2. AllReduce global routed counts
3. Waterfill assigns shared destination
4. Expand topk to 9 columns
5. DeepEP dispatch with topk=9
6. Receiver: MoE(8 cols) + shared expert + merge
7. DeepEP combine with topk=9
8. Apply routed_scaling_factor
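The virtual-ID and topk-expansion steps can be illustrated with a small list-based sketch. The function names `expand_topk_with_shared` and `rank_of_expert` are hypothetical, and the sketch assumes a contiguous expert layout in which rank r owns expert IDs r * experts_per_rank up to (r + 1) * experts_per_rank - 1; the shared entry is told apart from a routed one only by sitting in the extra last column.

```python
from typing import List, Sequence, Tuple


def rank_of_expert(expert_id: int, experts_per_rank: int) -> int:
    """Rank that owns an expert ID under the assumed contiguous layout."""
    return expert_id // experts_per_rank


def expand_topk_with_shared(
    topk_ids: Sequence[Sequence[int]],
    topk_weights: Sequence[Sequence[float]],
    shared_dest: Sequence[int],
    experts_per_rank: int,
    routed_scaling_factor: float,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Append one column per token carrying the shared expert.

    The appended ID is target_rank * experts_per_rank, so dispatch sends the
    token to target_rank; the appended weight is 1 / routed_scaling_factor.
    """
    shared_weight = 1.0 / routed_scaling_factor
    out_ids, out_weights = [], []
    for ids, weights, dest in zip(topk_ids, topk_weights, shared_dest):
        virtual_id = dest * experts_per_rank
        out_ids.append(list(ids) + [virtual_id])
        out_weights.append(list(weights) + [shared_weight])
    return out_ids, out_weights
```

With 32 experts per rank, a token whose shared expert is assigned to rank 1 gets the extra ID 32, and `rank_of_expert(32, 32)` maps it back to rank 1 on the dispatch side.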
Improvements:
1. LOCAL_SHARED_MARKER (-1): tokens compute shared expert locally
2. MIN_BATCH_FOR_BALANCE (64): small batches compute all shared locally
3. alt_stream optimization: local shared expert parallel with dispatch
4. Separate handling of local vs remote shared expert computation
Flow:
1. Router + topk(8)
2. AllReduce global routed counts
3. Waterfill assigns destination (local or remote rank)
4. Expand topk to 9 cols (LOCAL_SHARED_MARKER or virtual ID)
5. Local shared expert on alt_stream (parallel)
6. DeepEP dispatch with topk=9
7. Receiver: MoE(8 cols) + remote shared expert
8. DeepEP combine
9. Add local shared expert output
10. Apply routed_scaling_factor
Fix: rsf should only be applied to routed experts, not shared experts.
Before (wrong order):
    combined += local_shared * (1/rsf)
    combined *= rsf  # rsf affects local_shared!
After (correct order):
    combined *= rsf  # only affects routed
    combined += local_shared  # not affected by rsf
Weight handling:
- Local shared: weight = 1.0 (added AFTER rsf multiplication)
- Remote shared: weight = 1/rsf (added BEFORE combine, rsf cancels out)
Final result: routed * rsf + shared
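The two weight-handling paths can be checked with scalar arithmetic. The sketch below uses hypothetical helper names and plain floats in place of tensors; it only shows that both paths arrive at routed * rsf + shared.

```python
import math


def combine_with_local_shared(routed_sum: float, shared_out: float, rsf: float) -> float:
    """Local path: scale the routed sum first, then add shared with weight 1.0."""
    combined = routed_sum * rsf
    combined += shared_out
    return combined


def combine_with_remote_shared(routed_sum: float, shared_out: float, rsf: float) -> float:
    """Remote path: shared enters with weight 1/rsf, then rsf scales the total."""
    combined = routed_sum + shared_out * (1.0 / rsf)
    return combined * rsf


routed_sum, shared_out, rsf = 0.8, 0.3, 2.5
expected = routed_sum * rsf + shared_out
local_result = combine_with_local_shared(routed_sum, shared_out, rsf)
remote_result = combine_with_remote_shared(routed_sum, shared_out, rsf)
assert math.isclose(local_result, expected)
assert math.isclose(remote_result, expected)
```

The remote path relies on the 1/rsf weight cancelling exactly against the later multiplication, which is why the shared column in the expanded topk carries that weight.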
If a remote rank would receive fewer than MIN_TOKENS_PER_RANK (16) tokens for
shared expert computation, redirect those tokens to local computation. This
avoids sending only a few tokens to a remote rank, which would have more
overhead than computing locally.
Thresholds:
- MIN_BATCH_FOR_BALANCE = 64: small batches compute all shared locally
- MIN_TOKENS_PER_RANK = 16: sparse destinations redirected to local
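A list-based sketch of the two thresholds might look like this. The function name `redirect_sparse_destinations` is hypothetical, and counting tokens per destination on the sending rank only is an assumption of the sketch.

```python
from collections import Counter
from typing import List, Sequence

LOCAL_SHARED_MARKER = -1
MIN_BATCH_FOR_BALANCE = 64
MIN_TOKENS_PER_RANK = 16


def redirect_sparse_destinations(
    destinations: Sequence[int],
    source_rank: int,
    min_batch: int = MIN_BATCH_FOR_BALANCE,
    min_per_rank: int = MIN_TOKENS_PER_RANK,
) -> List[int]:
    """Replace sparse or local destinations with LOCAL_SHARED_MARKER."""
    n = len(destinations)
    if n < min_batch:
        # Small batch: balancing is not worth it, compute everything locally.
        return [LOCAL_SHARED_MARKER] * n
    counts = Counter(destinations)
    result = []
    for dest in destinations:
        if dest == source_rank or counts[dest] < min_per_rank:
            # Staying local, or too few tokens to justify a remote send.
            result.append(LOCAL_SHARED_MARKER)
        else:
            result.append(dest)
    return result
```

With 70 tokens on source rank 0, of which 60 target rank 1, 4 target rank 2 and 6 stay local, the 4 tokens for rank 2 fall under the threshold and are computed locally together with the other 6.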
Tests cover:
1. count_routed_per_rank_pytorch - token counting per rank
2. assign_shared_destination_pytorch - waterfill assignment
3. assign_shared_destination - source rank preference
4. expand_topk_with_shared_expert - topk expansion to 9 cols
5. identify_shared_expert_tokens - receiver side identification
6. compute_local_shared_expert - local computation
7. DeepEPWaterfillBalancer - small batch optimization
8. DeepEPWaterfillBalancer - sparse destination redirect
9. End-to-end scenario
10. shared_weight calculation
All 10 tests pass on CPU.
Additional tests added:
- Empty batch handling
- Single token handling
- All tokens route to same rank
- Waterfill load balancing effectiveness
- MIN_TOKENS_PER_RANK threshold
- identify_shared_expert_tokens with all local markers
- identify_shared_expert_tokens mixed scenarios
- compute_local_shared_expert with no local tokens
- Virtual ID to rank mapping
- Weight preservation in topk expansion
- Routed count accuracy
- Consistency across repeated calls
Total: 22 tests, all passing
Changes:
1. Increase MIN_TOKENS_PER_RANK from 16 to 128 (tile size)
2. Redirect local shared tokens to remote if count < 128
Before: Multiple ranks received <128 shared tokens (wasted tiles)
After: All ranks receive 0 or >=128 shared tokens (no waste)
Load balance improvement: 15-39% reduction in imbalance ratio
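The tile-size argument can be made concrete with a little arithmetic. The sketch below assumes, for illustration only, that the grouped GEMM pads each group of tokens up to a multiple of the 128-row tile; `padded_rows` and `tile_utilization` are hypothetical helpers.

```python
TILE_M = 128


def padded_rows(num_tokens: int, tile_m: int = TILE_M) -> int:
    """Rows actually processed when a group is padded to whole tiles."""
    return -(-num_tokens // tile_m) * tile_m  # ceiling division


def tile_utilization(num_tokens: int, tile_m: int = TILE_M) -> float:
    """Fraction of processed rows that hold real tokens (1.0 for an empty group)."""
    if num_tokens == 0:
        return 1.0  # nothing is launched, so nothing is wasted
    return num_tokens / padded_rows(num_tokens, tile_m)


for n in (0, 16, 128, 200):
    print(n, padded_rows(n), round(tile_utilization(n), 3))
```

Under that assumption, a rank receiving 16 shared tokens still pays for a full 128-row tile (12.5% utilization), whereas any count of 128 or more keeps utilization above 50%.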
DeepEP Normal mode and Low Latency mode handle topk_weights differently:
- Normal mode: run_moe_core applies weights, combine does NOT
- Low Latency mode: run_moe_core does NOT apply weights, combine DOES
Fixed remote shared expert weight application:
- Normal mode: Apply weight (1/rsf) before combine
- Low Latency mode: Let combine handle weight multiplication
Also verified:
- DeepGEMM tile size (BLOCK_M) = 128 (confirms MIN_TOKENS_PER_RANK = 128)
- DeepEP topk_ids=-1 means no selection (confirms LOCAL_SHARED_MARKER = -1)
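The mode-dependent weighting of the remote shared expert can be sketched with scalars. The names below are hypothetical and the combine step is reduced to a single multiplication; the point is only that the 1/rsf weight must be applied exactly once, either before combine or by combine.

```python
import math

NORMAL = "normal"
LOW_LATENCY = "low_latency"


def remote_shared_contribution(shared_out: float, rsf: float, mode: str) -> float:
    """Value the remote shared expert contributes after combine."""
    weight = 1.0 / rsf
    if mode == NORMAL:
        # Compute step applies the weight; combine only sums.
        pre_combine = shared_out * weight
        return pre_combine
    if mode == LOW_LATENCY:
        # Compute step leaves the output unweighted; combine multiplies by weight.
        pre_combine = shared_out
        return pre_combine * weight
    raise ValueError(f"unknown mode: {mode}")


normal_value = remote_shared_contribution(0.3, 2.5, NORMAL)
low_latency_value = remote_shared_contribution(0.3, 2.5, LOW_LATENCY)
assert math.isclose(normal_value, low_latency_value)
```

Applying the weight in both places would scale the shared output by 1/rsf twice, and applying it in neither would leave it rsf times too large after the final scaling.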
Added test_deepep_waterfill_comprehensive.py with 15 test cases:
- count_routed_per_rank accuracy
- assign_shared_destination correctness
- expand_topk_with_shared_expert
- identify_shared_expert_tokens
- Virtual ID to rank mapping
- MIN_BATCH_FOR_BALANCE optimization
- MIN_TOKENS_PER_RANK redirect
- Shared weight calculation (1/rsf)
- Empty batch handling
- compute_local_shared_expert
- Weights preservation
- Waterfill load balancing effectiveness
- Invalid expert ID handling
- Large batch performance
All 15 tests pass.
Key optimizations:
1. assign_shared_destination: Replace for-loop with scatter-based vectorized ops
- Old: O(topk) loop iterations, each with indexing
- New: Single scatter operation for all topk values
- Speedup: 2.6x - 4.2x depending on batch size
2. expand_topk_with_shared_expert: Pre-allocate output tensors
- Avoid torch.cat overhead by pre-allocating and copying
- Reduce memory allocation operations
3. prepare_dispatch: Vectorized sparse rank redirect
- Replace for-loop with lookup table approach
Benchmark results (CPU):
- batch=128: 0.11ms -> 0.03ms (4.21x faster)
- batch=4096: 0.70ms -> 0.18ms (3.79x faster)
- batch=8192: 1.09ms -> 0.31ms (3.47x faster)
- Add assign_shared_destination_triton() kernel for GPU
- Auto-select Triton on GPU, fallback to PyTorch on CPU
- Update benchmark to compare PyTorch vs Triton on GPU
Triton kernel processes one token per thread block, iterating over topk
experts to find the minimum-load destination rank.
Major changes:
1. New fused kernel: _waterfill_expand_topk_fused_kernel
- Combines waterfill assignment + topk expansion in single pass
- Reduces kernel launches from 3 to 1
- Eliminates intermediate tensor allocations
2. Kernel design:
- Each thread block handles BLOCK_SIZE=256 tokens
- Loop over topk experts to find minimum-load rank
- Write expanded topk_ids, weights, and local_mask in-place
3. Vectorized post-processing:
- Sparse rank redirect: use boolean indexing instead of for-loop
- Local count redirect: single tensor operation
- Minimal GPU-CPU synchronization
Performance (CPU):
- assign_shared_destination: 3.4-4.3x speedup vs loop version
- prepare_dispatch: 0.28ms for 4096 tokens
GPU benefits (when Triton available):
- Single kernel launch vs multiple PyTorch ops
- No intermediate memory allocation
- Better memory coalescing
- Map checkpoint routed expert_ids (0..255) using the old experts_per_rank when Waterfill expands num_experts by +ep_size
- Add unit test to prevent EP-rank mis-mapping regression
…red expert when SGLANG_DEEPEP_WATERFILL_FIXED_LOCAL is set. This change allows for fixed local computation of shared experts, improving control over load balancing behavior.
… moe-ep all_reduce group
…location for input_tensor and m_indices
It does not use self. Allows direct call without an instance for future adapter / external reuse.
Match the staticmethod intent at the call site.
ch-wan approved these changes on May 9, 2026
# Conflicts: # test/registered/unit/server_args/test_server_args.py
Fridge003 pushed a commit that referenced this pull request on May 14, 2026:
…spatch (#19290)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Co-authored-by: root <aichenf@nvidia.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Motivation
In DeepSeek V3/R1 with expert parallelism (EP), each rank processes a subset of routed experts plus the full shared expert. Since every token must visit the shared expert, it creates a fixed compute load on all ranks regardless of routed expert distribution. When routed expert load is imbalanced across ranks (common with skewed token distributions), the shared expert amplifies the bottleneck on already-overloaded ranks.
Waterfill addresses this by treating the shared expert as an additional routed expert (the "9th expert" per rank) and dispatching it to the least-loaded rank via DeepEP, effectively filling the load gap like water filling a container. This converts the shared expert from a fixed per-rank cost into a dynamic load-balancing lever.
coauthor: @AichenF
Thanks to @ch-wan for the discussion and suggestions.
This is the second half of the waterfill feature split (pending on #20089):
Modifications
New file:
python/sglang/srt/layers/moe/deepep_waterfill.py (514 lines)
- _count_routed_per_rank_kernel: counts routed tokens per EP rank from topk_ids.
- _waterfill_kernel: assigns each token's shared expert to the least-loaded rank using a waterfill algorithm, with local-rank preference (1.1x bias).
- DeepEPWaterfillBalancer class: orchestrates the waterfill dispatch with two modes:
  - rank_load from EPLB expert distribution data (no runtime all_reduce).
- expand_topk(): Expands topk output by one column (shared expert slot) with computed expert IDs and scaling factors.
Modified files:
- models/deepseek_v2.py (+112/-21): Creates DeepEPWaterfillBalancer in MoE init; calls expand_topk() after TopK selection; skips separate shared expert forward when waterfill is enabled (shared expert is now part of MoE dispatch/compute/combine).
- layers/moe/fused_moe_triton/layer.py (+40/-2): Adjusts weight loading to map checkpoint expert IDs correctly when waterfill expands the expert count by ep_size (one shared slot per rank).
- eplb/expert_location.py (+33/-1): Adds rank_load field to ExpertLocationMetadata and _compute_rank_load() to derive per-rank load from logical expert counts + physical-to-logical mapping.
- server_args.py (+29/-1): Adds --enable-deepep-waterfill flag with validation.
Accuracy Tests
MMLU accuracy on DeepSeek-V3 (FP8, 2-node 16xH20, EP16, DP16):
No accuracy degradation observed.
Benchmarking and Profiling
Throughput benchmark on DeepSeek-V3 (FP8, 2-node 16×H20, EP16, DP16, input_len ≈ 600 (MMLU prompts), output_len=1, 256 concurrent requests × multiple rounds, 1000 prompts / round):
[Table: throughput, waterfill vs. Baseline]