Single Batch Overlap for MoE Models #9660

Merged
Fridge003 merged 41 commits into sgl-project:main from antgroup:dev/sbo.public
Dec 3, 2025

Conversation

@Sulfur6 (Contributor) commented Aug 26, 2025

1. Motivation

The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (e.g., the H20), for two main reasons. First, on the Hopper architecture the WGMMA block_m is 64, so when TBO is enabled with a small Decode batch size, the MLP GEMM suffers from redundant computation: for example, a decode batch of 32 split into two micro-batches of 16 still computes a full 64-row tile per micro-batch, wasting three-quarters of each tile. A positive throughput gain is therefore only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.

Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.

We implement SBO for DeepSeek V3/R1: the overlap of the Shared Expert with Dispatch Recv requires only SGLang changes, while the overlap of the Down GEMM with Combine Send is implemented by modifying DeepEP and DeepGEMM.

The detailed DeepEP and DeepGEMM implementation is available in the branches below:

2. Overlap Design

SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1: one overlaps the Shared Expert computation with the Dispatch Recv communication, and the other overlaps the Down GEMM computation with the Combine Send communication.
[figure: SBO overlap design]
The interaction between the Down GEMM and Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for these block_m tokens and atomically increments the signal unit as it completes each portion of the work. The Combine Send polls this signal unit; once the value reaches a threshold, it sends the corresponding block_m tokens.
[figure: Down GEMM / Combine Send signaling]
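
Below is a minimal Python sketch of this producer-consumer protocol, with CPU threads standing in for the Down GEMM and Combine Send kernels (all names, counts, and timings are illustrative, not the actual DeepGEMM/DeepEP code):

```python
import threading
import time

BLOCK_M = 64     # tokens covered by one signal unit
NUM_BLOCKS = 4   # signal units for one local expert
THRESHOLD = 1    # increments required before a block may be sent

signals = [0] * NUM_BLOCKS
lock = threading.Lock()  # stands in for the GPU-side atomic increment

def down_gemm():
    # Producer: after computing each block_m-token tile, bump its signal.
    for blk in range(NUM_BLOCKS):
        time.sleep(0.01)  # compute the tile for this block of tokens
        with lock:
            signals[blk] += 1  # the kernel uses atomicAdd here

def combine_send():
    # Consumer: poll each signal unit; send its tokens once the threshold is hit.
    for blk in range(NUM_BLOCKS):
        while True:
            with lock:
                ready = signals[blk] >= THRESHOLD
            if ready:
                break
        print(f"send tokens [{blk * BLOCK_M}, {(blk + 1) * BLOCK_M})")

producer = threading.Thread(target=down_gemm)
consumer = threading.Thread(target=combine_send)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Because the consumer only waits on the signal units, the send of earlier blocks proceeds while later blocks are still being computed.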

3. Modifications

Dockerfile

When building the Docker image for SBO on Hopper, install DeepEP from the specific branch mentioned above.

Server Arguments

When SBO is enabled, --moe-a2a-backend must be "deepep" and --moe-runner-backend must be "deep_gemm"; otherwise the SBO flag has no effect.
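
A minimal sketch of that constraint check (attribute names follow the CLI flags; the actual check's location in SGLang may differ):

```python
from types import SimpleNamespace

def validate_sbo_args(server_args):
    """Illustrative check: SBO piggybacks on the DeepEP dispatcher and the
    DeepGEMM MoE runner, so both backends are hard requirements."""
    if not server_args.enable_single_batch_overlap:
        return
    if server_args.moe_a2a_backend != "deepep":
        raise ValueError("--enable-single-batch-overlap requires --moe-a2a-backend deepep")
    if server_args.moe_runner_backend != "deep_gemm":
        raise ValueError("--enable-single-batch-overlap requires --moe-runner-backend deep_gemm")

validate_sbo_args(SimpleNamespace(
    enable_single_batch_overlap=True,
    moe_a2a_backend="deepep",
    moe_runner_backend="deep_gemm",
))
```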

SBO

Add condition checks and argument calculation for SBO on Hopper.

Model

Add a DeepEP hook for SBO on Hopper in DeepseekV2MoE of the deepseek_v2 model.

MoE Runner

  • Add meta_overlap_args to the MoE runner's running state so overlap arguments can be modified dynamically during the forward pass (a sketch of this state follows the list below).
  • Add overlap arguments to the deep_gemm runner.
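
A minimal sketch of what such per-forward overlap state could look like (field names are hypothetical, not the PR's exact definitions):

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class MetaOverlapArgs:
    """Overlap state carried in the runner's running state for one forward pass."""
    enable_sbo: bool = False
    # Signal buffer shared by the Down GEMM (producer) and Combine Send
    # (consumer): one int32 counter per (local expert, block_m tile).
    signals: Optional[torch.Tensor] = None
    block_m: int = 64      # tokens covered by one signal unit
    threshold: int = 1     # increments required before a tile may be sent
```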

Deepep Token Dispatcher

  • Add a new hook to overlap the shared experts with Dispatch Recv (see the sketch after this list).
  • Add overlap arguments in Combine Send on Hopper.
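
A hedged sketch of the hook pattern (the real DeepEP dispatcher API in this PR may differ; the separate `dispatch_send`/`dispatch_recv` halves and the callback name are assumptions):

```python
def dispatch_with_shared_expert_overlap(dispatcher, hidden_states, topk_ids,
                                        shared_expert_fn):
    """Run the shared expert between dispatch send and dispatch recv.

    With IBGDA, dispatch_send returns as soon as tokens are handed to the
    NIC, so the shared expert computes while tokens are still in flight.
    """
    handle = dispatcher.dispatch_send(hidden_states, topk_ids)  # async send
    shared_output = shared_expert_fn(hidden_states)             # overlapped compute
    recv_states = dispatcher.dispatch_recv(handle)              # poll arrival flags
    return recv_states, shared_output
```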

DeepGEMM Wrapper

Modify the masked GEMM wrapper to support SBO on top of deep_gemm.
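
To illustrate the idea behind the wrapper, here is a pure-PyTorch stand-in for an SBO-aware masked grouped GEMM; the real DeepGEMM kernel does this tile-by-tile on the GPU with an atomicAdd, and all names here are illustrative:

```python
import torch

def masked_grouped_gemm_with_signals(a, b, masked_m, signals=None, block_m=64):
    """Pure-torch stand-in for an SBO-aware masked grouped GEMM.

    a: [num_experts, max_m, k], b: [num_experts, n, k], masked_m: [num_experts].
    When `signals` is given, the counter for each (expert, block_m tile) is
    incremented as that tile's output becomes ready, mirroring the
    kernel-side atomicAdd that Combine Send polls.
    """
    num_experts, max_m, _ = a.shape
    out = torch.zeros(num_experts, max_m, b.shape[1], dtype=a.dtype)
    for e in range(num_experts):
        m = int(masked_m[e])
        for start in range(0, m, block_m):
            end = min(start + block_m, m)
            out[e, start:end] = a[e, start:end] @ b[e].T
            if signals is not None:
                signals[e, start // block_m] += 1
    return out

# Usage: two experts with 96 and 64 valid tokens out of max_m=128.
a = torch.randn(2, 128, 32)
b = torch.randn(2, 16, 32)
signals = torch.zeros(2, 2, dtype=torch.int32)
masked_m = torch.tensor([96, 64])
out = masked_grouped_gemm_with_signals(a, b, masked_m, signals)
print(signals)  # tensor([[1, 1], [1, 0]], dtype=torch.int32)
```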

4. Evaluation

4.1. Experiment Setup

  • 5 nodes with 8 × H20 GPUs each: 3 prefill nodes (TP8 per node) and 2 decode nodes (DP-Attention 16 + EP16).
  • Input length 4096, output length 1536.

4.2. Performance Evaluation

  • bs 32, origin
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2359.16
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672509
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17778.82
Output token throughput (tok/s):         6667.06
Total token throughput (tok/s):          24445.88
Concurrency:                             490.01
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   112892.31
Median E2E Latency (ms):                 113847.19
---------------Time to First Token----------------
Mean TTFT (ms):                          640.62
Median TTFT (ms):                        545.06
P99 TTFT (ms):                           1543.37
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.11
Median ITL (ms):                         71.81
P95 ITL (ms):                            86.02
P99 ITL (ms):                            155.32
Max ITL (ms):                            1543.26
==================================================
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2357.80
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673361
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17789.05
Output token throughput (tok/s):         6670.89
Total token throughput (tok/s):          24459.95
Concurrency:                             490.83
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   113015.97
Median E2E Latency (ms):                 113951.58
---------------Time to First Token----------------
Mean TTFT (ms):                          724.98
Median TTFT (ms):                        624.73
P99 TTFT (ms):                           1693.64
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.13
Median ITL (ms):                         71.84
P95 ITL (ms):                            86.57
P99 ITL (ms):                            155.21
Max ITL (ms):                            1081.95
==================================================
  • bs 32, sbo
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2211.76
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673456
Request throughput (req/s):              4.63
Input token throughput (tok/s):          18963.67
Output token throughput (tok/s):         7111.38
Total token throughput (tok/s):          26075.05
Concurrency:                             481.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104017.64
Median E2E Latency (ms):                 105363.65
---------------Time to First Token----------------
Mean TTFT (ms):                          606.28
Median TTFT (ms):                        508.61
P99 TTFT (ms):                           1475.44
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.35
Median ITL (ms):                         66.10
P95 ITL (ms):                            81.58
P99 ITL (ms):                            141.96
Max ITL (ms):                            1422.74
==================================================
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2194.12
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672577
Request throughput (req/s):              4.67
Input token throughput (tok/s):          19116.14
Output token throughput (tok/s):         7168.55
Total token throughput (tok/s):          26284.70
Concurrency:                             487.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104545.42
Median E2E Latency (ms):                 105483.50
---------------Time to First Token----------------
Mean TTFT (ms):                          619.03
Median TTFT (ms):                        511.23
P99 TTFT (ms):                           1504.27
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.68
Median ITL (ms):                         66.44
P95 ITL (ms):                            82.13
P99 ITL (ms):                            142.48
Max ITL (ms):                            1024.85
==================================================

4.3. Accuracy Tests

  • bs 32, origin
#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:20<00:00, 12.41it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 80.802 s
Output throughput: 1183.468 token/s
  • bs 32, sbo
#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:17<00:00, 12.87it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 78.056 s
Output throughput: 1217.443 token/s

4.4. Repro Script

#------------------------------------------- Variables For PD start -------------------------------------------#
# Configuration for PD disaggregation.
MODEL_PATH=""  # Your own model path.
WORK_DIR=""  # Your own work dir.

IB_DEVICE="bond_0,bond_1,bond_2,bond_3"  # Your own IB device names.

NUM_PREFILL=3  # Number of prefill nodes.
NUM_DECODE=2  # Number of decode nodes.
GPUS_PER_NODE=8  # GPUs per node; used in the size calculations below.

# Example array of IP addresses for the cluster nodes.
NODE_IPS=( \
  "192.168.1.1" \
  "192.168.1.2"  \
  "192.168.1.3" \
  "192.168.1.4" \
  "192.168.1.5"
)

PREFILL_MAIN_IP="192.168.1.1"  # IP of the main node for prefill.
DECODE_MAIN_IP="192.168.1.4"  # IP of the main node for decode (offset by NUM_PREFILL).

DEFAULT_PORT=61001  # Default port for SGLang services.
MAIN_PORT=62001  # Port for the main node communication.
MINI_LB_PORT=8000  # Port for the load balancer.
MAX_RUNNING_REQUEST_PER_GPU=32  # Maximum concurrent requests per GPU.
CHUNKED_PREFILL_SIZE_PER_DP_RANK=4096  # Size of chunked prefill per data-parallel rank.

DP_SIZE_PER_PREFILL_NODE=8  # Data-parallel size for each prefill node.
DP_SIZE_PER_DECODE_NODE=8  # Data-parallel size for each decode node.

PAGE_SIZE=64
PREFILL_ATTENTION_BACKEND="fa3"  # Attention backend for prefill.
DECODE_ATTENTION_BACKEND="flashmla"  # Attention backend for decode.

# Calculate maximum requests per data-parallel rank based on GPU capacity.
MAX_RUNNING_REQUEST_PER_DP_RANK=$(( MAX_RUNNING_REQUEST_PER_GPU * GPUS_PER_NODE / DP_SIZE_PER_DECODE_NODE ))
CUDA_GRAPH_MAX_BATCH_SIZE=$MAX_RUNNING_REQUEST_PER_DP_RANK  # Set CUDA graph batch size to match max requests per rank.

#---------------------------------------- For Prefill Nodes Start -------------------------------------------#
# Run this block on each of the NUM_PREFILL prefill nodes.
SGL_ENABLE_JIT_DEEPGEMM=1 \
nohup python3 -m sglang.launch_server \
--trust-remote-code \
--model-path ${MODEL_PATH} \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device ${IB_DEVICE} \
--host 0.0.0.0 \
--port ${DEFAULT_PORT} \
--tp-size 8 \
--page-size ${PAGE_SIZE} \
--attention-backend ${PREFILL_ATTENTION_BACKEND} \
--mem-fraction-static 0.92 \
--chunked-prefill-size 32768 \
--max-running-requests $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * MAX_RUNNING_REQUEST_PER_DP_RANK)) \
--max-total-tokens 131076 \
--context-length 65535 \
--enable-cache-report \
--log-level info \
> ${WORK_DIR}/stdout.log 2>&1 &

#---------------------------------------- For Decode Main Node Start -------------------------------------------#
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS=3 \
SGLANG_DEEPGEMM_ON_H20=1 \
nohup python3 -m sglang.launch_server \
--model-path ${MODEL_PATH} \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device ${IB_DEVICE} \
--attention-backend ${DECODE_ATTENTION_BACKEND} \
--host 0.0.0.0 \
--port ${DEFAULT_PORT} \
--trust-remote-code \
--dist-init-addr ${DECODE_MAIN_IP}:${MAIN_PORT} \
--nnodes ${NUM_DECODE} \
--node-rank 0 \
--tp-size $((GPUS_PER_NODE * NUM_DECODE)) \
--dp-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE)) \
--enable-dp-attention \
--mem-fraction-static 0.88 \
--chunked-prefill-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * CHUNKED_PREFILL_SIZE_PER_DP_RANK)) \
--max-running-requests $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * MAX_RUNNING_REQUEST_PER_DP_RANK)) \
--context-length 32768 \
--log-level info \
--decode-log-interval 50 \
--page-size ${PAGE_SIZE} \
--schedule-conservativeness 0.3 \
--enable-cache-report \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-max-bs ${CUDA_GRAPH_MAX_BATCH_SIZE} \
--load-balance-method minimum_tokens \
--moe-a2a-backend deepep \
--moe-runner-backend deep_gemm \
--enable-single-batch-overlap \
> ${WORK_DIR}/stdout.log 2>&1 &

#---------------------------------------- For Decode Worker Node Start -------------------------------------------#
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS=3 \
SGLANG_DEEPGEMM_ON_H20=1 \
nohup python3 -m sglang.launch_server \
--model-path ${MODEL_PATH} \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device ${IB_DEVICE} \
--attention-backend ${DECODE_ATTENTION_BACKEND} \
--host 0.0.0.0 \
--port ${DEFAULT_PORT} \
--trust-remote-code \
--dist-init-addr ${DECODE_MAIN_IP}:${MAIN_PORT} \
--nnodes ${NUM_DECODE} \
--node-rank 1 \
--tp-size $((GPUS_PER_NODE * NUM_DECODE)) \
--dp-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE)) \
--enable-dp-attention \
--mem-fraction-static 0.88 \
--chunked-prefill-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * CHUNKED_PREFILL_SIZE_PER_DP_RANK)) \
--max-running-requests $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * MAX_RUNNING_REQUEST_PER_DP_RANK)) \
--context-length 32768 \
--log-level info \
--decode-log-interval 50 \
--page-size ${PAGE_SIZE} \
--schedule-conservativeness 0.3 \
--enable-cache-report \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-max-bs ${CUDA_GRAPH_MAX_BATCH_SIZE} \
--load-balance-method minimum_tokens \
--moe-a2a-backend deepep \
--moe-runner-backend deep_gemm \
--enable-single-batch-overlap \
> ${WORK_DIR}/stdout.log 2>&1 &

#--------------------------------------- For sglang router start -------------------------------------------#
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --host 0.0.0.0 \
  --port 8000 \
  --decode http://192.168.1.4:61001 \
  --prefill http://192.168.1.1:61001 \
  --prefill http://192.168.1.2:61001 \
  --prefill http://192.168.1.3:61001 &

@fzyzcjy (Collaborator) commented Aug 26, 2025

Interesting, I am also doing some overlapping recently. Do we need to use a changed DeepEP or DeepGEMM?

@Sulfur6 (Contributor, Author) commented Aug 27, 2025

> Interesting, I am also doing some overlapping recently. Do we need to use a changed DeepEP or DeepGEMM?

@fzyzcjy Thanks for the comments. To overlap the Down GEMM with the Combine Send, we modified DeepGEMM and DeepEP respectively. As for the Shared Expert and Dispatch Recv overlap, that only required modifications to SGLang. We are currently cleaning up the code for DeepGEMM and DeepEP and will submit PRs within the next two days.

@fzyzcjy (Collaborator) commented Aug 27, 2025

Sure, I mean you need to paste the corresponding deepgemm/deepep branches as well (when ready).

@Sulfur6 (Contributor, Author) commented Aug 27, 2025

> Sure, I mean you need to paste the corresponding deepgemm/deepep branches as well (when ready).

@fzyzcjy We updated the PR and added our modified DeepEP and DeepGEMM branches:

@fzyzcjy (Collaborator) commented Aug 27, 2025

Got it. BTW, the speedup looks like only ~1%, so I am curious whether that is because the overlappable region is tiny or because the overhead of overlap is large, and also how much SBO improves over the simple standard overlap-shared-with-communication. Could you please share a pair of profiles (one w/o overlap, one w/ overlap)?

@Sulfur6 (Contributor, Author) commented Aug 27, 2025

> Got it. BTW, the speedup looks like only ~1%, so I am curious whether that is because the overlappable region is tiny or because the overhead of overlap is large, and also how much SBO improves over the simple standard overlap-shared-with-communication. Could you please share a pair of profiles (one w/o overlap, one w/ overlap)?

@fzyzcjy Thank you for your reminder. We pasted the wrong result when creating the draft, and have now updated it to the correct one.

@Sulfur6 (Contributor, Author) commented Aug 27, 2025

> Got it. BTW, the speedup looks like only ~1%, so I am curious whether that is because the overlappable region is tiny or because the overhead of overlap is large, and also how much SBO improves over the simple standard overlap-shared-with-communication. Could you please share a pair of profiles (one w/o overlap, one w/ overlap)?

@fzyzcjy We recorded the profiles with and without overlap when the batch size was 32. Below is a screenshot of the profile of a single DeepseekV2Decoder layer on DP0_TP0_EP0 on the decode node:

  • w/o overlap: [profile screenshot]
  • w/ overlap: [profile screenshot]

@fzyzcjy (Collaborator) commented Aug 27, 2025

I see, yes, that looks reasonable on your H20 hardware (I do not have an H20 and thus don't know the per-kernel timings there).

@fzyzcjy (Collaborator) commented Aug 28, 2025

BTW, I briefly glanced at the code and it seems atomicAdd is used to send signals, so I am curious whether this memory ordering and signal location are strong enough.


@Sulfur6 (Contributor, Author) commented Aug 28, 2025

Since SGLang has merged PR #9340 to upgrade to DeepGEMM v2, we are working on the corresponding adaptation.

@fzyzcjy (Collaborator) commented Sep 2, 2025

This change looks great, but I am still a bit worried: (1) shall we use atomicAdd (the doc says relaxed ordering) or release ordering? (2) will the extra tma store wait make that warp group slower (i.e., shall we signal on the next existing tma store wait)?

FYI my naive implementations are in flashinfer-ai/flashinfer#1569 (have not tested it since the nvfp4 code path has not arrived yet...)


@Sulfur6 (Contributor, Author) commented Sep 2, 2025

> This change looks great, but I am still a bit worried: (1) shall we use atomicAdd (the doc says relaxed ordering) or release ordering? (2) will the extra tma store wait make that warp group slower (i.e., shall we signal on the next existing tma store wait)?
>
> FYI my naive implementations are in flashinfer-ai/flashinfer#1569 (have not tested it since the nvfp4 code path has not arrived yet...)

@fzyzcjy For (1), we will conduct a more in-depth investigation. For (2), in our tests, tma_store_wait<0>() does not add overhead; we speculate this is because tma_store_wait<0>() must be executed before tma_store_arrive() anyway. However, __threadfence() does introduce a ~4% performance overhead on the Down GEMM (in the EP16 scenario). We believe this is necessary to ensure memory ordering.

@fzyzcjy (Collaborator) commented Sep 3, 2025

FYI I am waiting for the refactored DeepGEMM (Hopper), since I need to implement DeepGEMM Blackwell and want to align with your style to avoid two conflicting styles.

@Sulfur6 (Contributor, Author) commented Sep 3, 2025

> FYI I am waiting for the refactored DeepGEMM (Hopper), since I need to implement DeepGEMM Blackwell and want to align with your style to avoid two conflicting styles.

@fzyzcjy We have submitted a pull request to DeepGEMM v2 (deepseek-ai/DeepGEMM#183), which contains the GEMM interface and implementation required for overlap. We would welcome any suggestions for modification.

@fzyzcjy (Collaborator) commented Sep 3, 2025

@Sulfur6 I made a tiny nit comment there

@AniZpZ changed the title from "Single Batch Overlap for MoE Models" to "[WIP] Single Batch Overlap for MoE Models" on Sep 3, 2025
@AniZpZ marked this pull request as ready for review on September 3, 2025 09:40
@Sulfur6 (Contributor, Author) commented Dec 2, 2025

The current version of sgl-kernel's DeepGEMM does not include the modifications in sgl-project/DeepGEMM#14, which causes the deep_gemm moe runner to malfunction. I have fixed this issue in the latest version of this PR and am currently conducting CI tests. @Fridge003

@Fridge003 (Collaborator) commented:

@Sulfur6 Please fix the conflicts, thanks

@Sulfur6 (Contributor, Author) commented Dec 3, 2025

> @Sulfur6 Please fix the conflicts, thanks

@Fridge003 We fixed the conflicts, and the CI test is now running.

@AniZpZ (Collaborator) commented Dec 3, 2025

@ch-wan @Fridge003 I think this PR is ready for merge, could you please review it again?

@Fridge003 enabled auto-merge (squash) on December 3, 2025 18:07
@Fridge003 disabled auto-merge on December 3, 2025 18:07
@Fridge003 merged commit 20aad5b into sgl-project:main on Dec 3, 2025
133 of 163 checks passed
@programmer-lxj commented:

@Sulfur6 @Zqy11 @fzyzcjy I'm currently running the deepseek-R1-w4a8 model, but this model's --moe-runner-backend can only be set to cutlass; using deep_gemm results in an error. I've already used the --enable-single-batch-overlap and --moe-a2a-backend deepep parameters, as well as the deepep, deepgemm, and sglang 0.5.6.post2 images you mentioned. Running with --enable-single-batch-overlap --moe-a2a-backend deepep --moe-runner-backend cutlass works, but the performance is the same as without SBO, showing no improvement.

I'd like to ask: is this because --moe-runner-backend isn't using deep_gemm? If only cutlass can be used with the w4a8 model and not deep_gemm, how can I achieve a performance improvement with SBO?

Using deep_gemm gives this error:

Warning: DeepEP timeout for combine receive, rank 3, local_expert_idx 28, src_rank 1
Warning: DeepEP timeout for combine receive, rank 3, local_expert_idx 27, src_rank 1
TMA Desc Addr: 0x7ffffe757c40
format 0
dim 2
gmem_address 0x7fa74bcbc400
globalDim (1,1,1,1,1)
globalStrides (1,0,0,0,0)
boxDim (128,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 719

@Sulfur6 (Contributor, Author) commented Dec 18, 2025

@programmer-lxj Currently, SBO on Hopper requires setting moe_runner_backend to deep_gemm; otherwise the SBO path is not entered. However, w4a8 does not currently support deep_gemm as the moe_runner_backend, so SBO cannot be enabled for w4a8 models on Hopper at present.

@programmer-lxj commented:

@Sulfur6 Thank you very much!

@lixiuhong commented:

Awesome work. It looks similar to FlashOverlap, a project that enables overlap between a GEMM and its dependent communication (https://github.com/infinigence/FlashOverlap).

@Zqy11 (Contributor) commented Dec 23, 2025

> Awesome work. It looks similar to FlashOverlap, a project that enables overlap between a GEMM and its dependent communication (https://github.com/infinigence/FlashOverlap).

Thanks! FlashOverlap is also an awesome work, and SBO drew some inspiration from it during its initial design.

@Huixxi commented Jan 17, 2026

Does SBO only work for DeepSeek?

@Zqy11 (Contributor) commented Jan 19, 2026

> Does SBO only work for DeepSeek?

SBO currently only supports DeepSeek V3/R1. However, it can be adapted to other MoE models; you can refer to the forward_deepep function in deepseek_v2.py when adapting it to your model.

@lizhiqihhh commented Jan 21, 2026

> Shared Expert and Dispatch Recv overlap

Hi there, I have a question regarding the SBO feature. When SBO is enabled, is the overlap between the shared expert and the dispatch mechanism automatically disabled? If so, would it be possible to enable both the "shared expert overlap with dispatch" and the "combine overlap with down gemm" simultaneously?

@Zqy11 (Contributor) commented Jan 21, 2026

> Hi there, I have a question regarding the SBO feature. When SBO is enabled, is the overlap between the shared expert and the dispatch mechanism automatically disabled? If so, would it be possible to enable both the "shared expert overlap with dispatch" and the "combine overlap with down gemm" simultaneously?

On Hopper, enabling SBO with --enable-single-batch-overlap enables both the overlap of the Shared Expert with Dispatch Recv and the overlap of the Down GEMM with Combine Send. For Blackwell, please refer to #10422.

@lizhiqihhh commented Jan 21, 2026

> On Hopper, enabling SBO with --enable-single-batch-overlap enables both the overlap of the Shared Expert with Dispatch Recv and the overlap of the Down GEMM with Combine Send. For Blackwell, please refer to #10422.

Thank you for the quick response. I'm currently deploying on Hopper devices. I generated a profile based on the "1node-Prefill" and "1node-Decode" configurations, and I can observe the overlap between "combine" and "down-GEMM". However, I'm concerned about the purple block representing dispatch and the green block representing shared-expert computation, as they do not appear to be overlapped. The config for the decode node is:

tp: 8
ep: 8
dp: 8
enable-dp-attention: true
prefill-round-robin-balance: true

moe-a2a-backend: deepep
deepep-mode: low_latency
enable-single-batch-overlap: True
moe-runner-backend: deep_gemm

page-size: 64

disaggregation-mode: decode
disaggregation-ib-device: mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7

speculative-algorithm: EAGLE
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

max-running-requests: 128
cuda-graph-max-bs: 32
mem-fraction-static: 0.8
watchdog-timeout: 1000000
enable-cache-report: true
enable-dp-lm-head: true
moe-dense-tp-size: 1

[profile screenshot]

@Zqy11 (Contributor) commented Jan 21, 2026

> Thank you for the quick response. I'm currently deploying on Hopper devices. I generated a profile based on the "1node-Prefill" and "1node-Decode" configurations, and I can observe the overlap between "combine" and "down-GEMM". However, I'm concerned about the purple block representing dispatch and the green block representing shared-expert computation, as they do not appear to be overlapped.

DeepEP's low-latency mode uses IBGDA, where Dispatch Send asynchronously submits tokens to the NIC and Dispatch Recv then busy-polls flags to receive them. Until all tokens arrive, the busy-polling in Dispatch Recv is wasted time.
For the overlap of the Shared Expert with Dispatch Recv, the Shared Expert is moved to execute after Dispatch Send, and Dispatch Recv runs afterwards. By that point the tokens have essentially all arrived, saving the polling time. The profile does show only a single stream, but the Shared Expert and the token transfer are indeed overlapped.
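
To make the effect concrete, here is a toy Python model of the polling savings (all latencies are invented for illustration):

```python
import time

TRANSFER_LATENCY = 0.005    # illustrative NIC transfer time after dispatch send
SHARED_EXPERT_TIME = 0.004  # illustrative shared-expert compute time

def dispatch_send():
    """IBGDA-style send: returns immediately; tokens arrive later."""
    return time.monotonic() + TRANSFER_LATENCY

def dispatch_recv(arrival_time):
    """Busy-poll the arrival flags until all tokens have landed."""
    polls = 0
    while time.monotonic() < arrival_time:
        polls += 1
    return polls

# Without overlap: recv busy-polls for the full transfer latency.
print("polls w/o overlap:", dispatch_recv(dispatch_send()))

# With overlap: the shared expert runs between send and recv, so most of
# the transfer latency is hidden behind compute and polling is brief.
arrival = dispatch_send()
time.sleep(SHARED_EXPERT_TIME)  # shared expert overlapped with the transfer
print("polls w/ overlap:", dispatch_recv(arrival))
```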
