[Feat] Single Batch Overlap (SBO): Overlapping of Down GEMM with Combine Send #14
Conversation
Co-authored-by: Zqy11 <[email protected]> Co-authored-by: AniZpZ <[email protected]>
LGTM

I'm using H20, with one machine running a P node and another running a D node. When starting the D node, I added the following three parameters:

@Sulfur6 The problem is as described above; I forgot to add the "@" symbol earlier.

@programmer-lxj I left a comment in the conversation of sgl-project/sglang#9660.

@Sulfur6 Thank you very much!

from deepseek-ai#183
1. Motivation
The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (e.g., the H20), for two main reasons. First, on the Hopper architecture the WGMMA block_m is 64, so when TBO splits a small Decode batch into even smaller micro-batches, the MLP GEMM pads each micro-batch up to a full 64-row tile and much of its compute is wasted on padding; a positive throughput gain is only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.
Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.
We implement SBO for DeepSeek-V3/R1 by modifying DeepEP and DeepGEMM, covering the overlap of Shared Expert with Dispatch Recv as well as the overlap of Down GEMM with Combine Send.
The overlap of Down GEMM with Combine Send is implemented by modifying SGLang, DeepEP, and DeepGEMM, with the detailed implementation available in the PRs below:
We also conducted integration and evaluation in SGLang: sgl-project/sglang#9660.
2. Overlap Design
SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1: one overlaps the Shared Expert computation with the Dispatch Recv communication, and the other overlaps the Down GEMM computation with the Combine Send communication.


The interaction between Down GEMM and Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for those block_m tokens and atomically increments the signal unit as each portion of the work completes. The Combine Send polls the signal unit; once its value reaches a threshold, it sends the corresponding block_m tokens.
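To make the protocol concrete, below is a minimal, self-contained CUDA demo of the pattern. This is a sketch, not DeepGEMM/DeepEP code: a producer kernel stands in for the Down GEMM and publishes each block_m chunk with a fence-plus-atomic, while a consumer kernel stands in for the Combine Send and polls each chunk's signal. All names and sizes are illustrative, and the demo assumes both kernels fit on the GPU concurrently.

```cuda
// sbo_signal_demo.cu -- toy producer/consumer signal protocol (illustrative only)
#include <cuda_runtime.h>
#include <cstdio>

constexpr int kBlockM    = 64;  // tokens per signal unit (WGMMA block_m on Hopper)
constexpr int kNumChunks = 8;   // number of block_m-sized chunks
constexpr int kThreshold = 1;   // producer partials per chunk; real kernels may use >1

// Producer: stands in for the Down GEMM epilogue. After "computing" a chunk's
// block_m tokens, it makes the data visible device-wide, then bumps the signal.
__global__ void down_gemm_stub(float* out, int* signals) {
    int chunk = blockIdx.x;
    for (int i = threadIdx.x; i < kBlockM; i += blockDim.x)
        out[chunk * kBlockM + i] = chunk + 1.0f;   // pretend GEMM result
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();                // publish the chunk's data first...
        atomicAdd(&signals[chunk], 1);  // ...then increment the chunk's signal
    }
}

// Consumer: stands in for the Combine Send. It spins on each chunk's signal and
// "sends" (here: sums) the chunk as soon as the threshold is reached.
__global__ void combine_send_stub(const float* out, volatile int* signals,
                                  float* chunk_sums) {
    int chunk = blockIdx.x;
    if (threadIdx.x == 0) {
        while (signals[chunk] < kThreshold) { /* spin until chunk is ready */ }
        __threadfence();                // pair with the producer's fence
        float s = 0.f;
        for (int i = 0; i < kBlockM; ++i) s += out[chunk * kBlockM + i];
        chunk_sums[chunk] = s;          // real code would send the tokens here
    }
}

int main() {
    float *out, *sums; int* signals;
    cudaMalloc(&out, kNumChunks * kBlockM * sizeof(float));
    cudaMalloc(&sums, kNumChunks * sizeof(float));
    cudaMalloc(&signals, kNumChunks * sizeof(int));
    cudaMemset(signals, 0, kNumChunks * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    // Launch the consumer first to show it genuinely blocks on the signals.
    combine_send_stub<<<kNumChunks, 32, 0, s1>>>(out, signals, sums);
    down_gemm_stub<<<kNumChunks, 32, 0, s2>>>(out, signals);
    cudaDeviceSynchronize();

    float host[kNumChunks];
    cudaMemcpy(host, sums, sizeof(host), cudaMemcpyDeviceToHost);
    for (int c = 0; c < kNumChunks; ++c)
        printf("chunk %d sum = %.0f\n", c, host[c]);  // expect 64 * (c + 1)
    return 0;
}
```

In the real integration the ordering is driven by the modified DeepGEMM epilogue and DeepEP send kernels rather than by two toy kernels on separate streams; the demo only illustrates the signal handshake itself.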
3. Modifications
- Modified the m_grouped_fp8_gemm_nt_masked Python interface and the sm90_m_grouped_fp8_gemm_masked_1d2d implementation, adding a return value and parameters to support overlapping Down GEMM with Combine Send.
- Modified the sm90_fp8_gemm_1d2d_impl kernel, adding parameters for overlap and using atom.add.release.gpu.global.s32 to write the signal after the corresponding block_m tokens are computed (see the sketch below).
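The signal write can be fused into a single GPU-scope release atomic, which is the atom.add.release.gpu.global.s32 instruction named above. A device-side helper for this pattern might look like the following sketch; it is not the PR's actual code, only an illustration of the instruction's use.

```cuda
// Sketch: one release-semantics atomic replaces the __threadfence() +
// atomicAdd pair from the demo above. The "release.gpu" qualifiers make all
// prior global writes by this thread visible device-wide before the updated
// signal value becomes visible to the polling Combine Send.
__device__ __forceinline__ int atomic_add_release_global(int* ptr, int value) {
    int old;
    asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"(ptr), "r"(value)
                 : "memory");
    return old;  // previous signal count; the GEMM epilogue can ignore it
}
```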
4. Evaluation
We integrated the modified DeepEP and DeepGEMM into SGLang for performance evaluation.
4.1. Experiment Setup
4.2. Performance Evaluation
4.3. Accuracy Tests
4.4. Repro Script
Please refer to sgl-project/sglang#9660.