Single Batch Overlap for MoE Models #9660
Conversation
Interesting, I have also been doing some overlap work recently. Do we need to use a modified DeepEP or DeepGEMM?
@fzyzcjy Thanks for the comments. To overlap the Down GEMM with the Combine Send, we have modified both DeepGEMM and DeepEP. As for the Shared Expert and Dispatch Recv overlap, that only required modifications to SGLang. We are currently cleaning up the code for DeepGEMM and DeepEP and will submit PRs within the next two days.
Force-pushed from 01299d2 to 878daf3.
Sure, I mean you need to paste the corresponding DeepGEMM/DeepEP branches as well (when ready).
@fzyzcjy We updated the PR and added our modified DeepEP and DeepGEMM branches:
Got it. BTW, the speedup looks like only ~1%, so I am curious whether that is because the overlappable region is tiny or because the overhead of the overlap is large, and also how much SBO improves over the simple standard overlap of shared experts with communication. Could you please share a pair of profiles (one w/o overlap, one w/ overlap) for them?
@fzyzcjy Thank you for your reminder. We pasted the wrong result when creating the draft, and have now updated it to the correct one. |
@fzyzcjy We recorded the profiles with and without overlap when the batch size was 32. Below is a screenshot of the profile of a single DeepseekV2Decoder layer on DP0_TP0_EP0 on the decode node:
I see, yes, that looks reasonable on your H20 hardware (I do not have an H20 and thus didn't know how long each kernel takes).
Since SGLang has merged PR #9340 to upgrade to DeepGEMM v2, we are working on the corresponding adaptation.
This change looks great, but I am still a bit worried: (1) shall we use atomicAdd (the doc says relaxed ordering) or use release ordering? (2) will the extra TMA store wait make that warp group slower (i.e., shall we signal on the next existing TMA store wait)? FYI, my naive implementations are in flashinfer-ai/flashinfer#1569 (I have not tested them since the nvfp4 code path has not arrived yet...).
@fzyzcjy For (1), we will conduct a more in-depth investigation. For (2), after our tests, |
FYI, I am waiting for the refactored DeepGEMM (Hopper), since I need to implement DeepGEMM Blackwell and want to stay aligned with your style to avoid two conflicting styles.
@fzyzcjy We have submitted a pull request to DeepGEMM v2, deepseek-ai/DeepGEMM#183, which contains the GEMM interface and implementation required for the overlap. We would like to know if you have any suggestions for modification.
@Sulfur6 I made a tiny nit comment there |
The current version of sgl-kernel's DeepGEMM does not include the modifications in sgl-project/DeepGEMM#14, which causes the deep_gemm moe runner to malfunction. I have fixed this issue in the latest version of this PR and am currently conducting CI tests. @Fridge003
@Sulfur6 Please fix the conflicts, thanks |
@Fridge003 We fixed the conflicts, and the CI test is now running.
@ch-wan @Fridge003 I think this PR is ready for merge, could you please review it again?
Co-authored-by: Cheng Wan <wan4ch@gmail.com> Co-authored-by: Zqy11 <841971412@qq.com> Co-authored-by: AniZpZ <aniz1905@gmail.com> Co-authored-by: TianyuZhang1214 <tianyuzhang1214@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
@Sulfur6 @Zqy11 @fzyzcjy I'm currently running the deepseek-R1-w4a8 model, but this model's |
@programmer-lxj Currently, SBO on Hopper requires setting |
@Sulfur6 Thank you very much! |
Awesome work. It looks like FlashOverlap, a project that enables overlap between a GEMM and its dependent downstream communication (https://github.com/infinigence/FlashOverlap).
Thanks! FlashOverlap is also awesome work, and SBO drew some inspiration from it during its initial design.
Does SBO only work for DeepSeek?
SBO currently only supports DeepSeek v3/R1. However, it can be adapted for other MoE models. You can refer to the |
Hi there, I have a question regarding the SBO feature. When SBO is enabled, is the overlap between the shared expert and the dispatch mechanism automatically disabled? |
On Hopper, enabling SBO with |
Thank you for the quick response. I'm currently deploying on Hopper devices with:
tp: 8
moe-a2a-backend: deepep
page-size: 64
disaggregation-mode: decode
speculative-algorithm: EAGLE
max-running-requests: 128
DeepEP's |







1. Motivation
The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (e.g., the H20). This is due to two main factors. First, on the Hopper architecture, the WGMMA block_m is 64, so when TBO splits a small Decode batch into two micro-batches, each MLP GEMM tile is padded up to 64 rows and suffers from redundant computation; a positive throughput gain is only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.
Therefore, a solution is needed that improves Decode throughput even at small batch sizes. Single Batch Overlap (SBO) is such a solution.
We implement SBO for DeepSeek V3/R1 by modifying SGLang, DeepEP, and DeepGEMM; it covers the overlap of the Shared Expert with the Dispatch Recv, as well as the overlap of the Down GEMM with the Combine Send.
The overlap of Down GEMM with Combine Send is implemented by modifying DeepEP and DeepGEMM, with the detailed implementation available in the branches below:
2. Overlap Design
SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1: one overlaps the Shared Expert computation with the Dispatch Recv communication, and the other overlaps the Down GEMM computation with the Combine Send communication.
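To make the first overlap concrete, here is a small, self-contained toy sketch (CPU-only, not the SGLang/DeepEP code): a thread pool plays the role of the asynchronous Dispatch Recv, a matmul plays the role of the Shared Expert, and the recv result is only waited on when the routed experts would need it. All names, shapes, and timings are illustrative placeholders.

```python
# Toy illustration only: a thread pool stands in for the async Dispatch Recv,
# and a matmul stands in for the Shared Expert MLP. Names and sizes are
# placeholders, not the real SGLang/DeepEP interfaces.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def dispatch_recv(tokens: np.ndarray) -> np.ndarray:
    """Pretend to receive routed tokens from other ranks (simulated latency)."""
    time.sleep(0.01)  # stands in for the network/RDMA transfer time
    return tokens.copy()


def shared_expert_mlp(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stand-in for the shared expert computation."""
    return x @ w


def moe_layer(x: np.ndarray, w_shared: np.ndarray) -> np.ndarray:
    with ThreadPoolExecutor(max_workers=1) as pool:
        recv_future = pool.submit(dispatch_recv, x)   # start the Dispatch Recv
        shared_out = shared_expert_mlp(x, w_shared)   # overlaps with the recv
        routed_tokens = recv_future.result()          # block only when needed
    routed_out = routed_tokens                        # (routed experts omitted)
    return shared_out + routed_out


if __name__ == "__main__":
    x = np.random.randn(32, 128).astype(np.float32)
    w = np.random.randn(128, 128).astype(np.float32)
    print(moe_layer(x, w).shape)
```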


The interaction between the Down GEMM and the Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for these block_m tokens and atomically increments the signal unit each time it completes a portion of the work. The Combine Send polls this signal unit; once its value reaches a threshold, it sends the corresponding block_m tokens.
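The toy sketch below illustrates this producer-consumer handshake on the CPU, with threads standing in for the Down GEMM and Combine Send kernels. The counter layout, threshold, and helper names are illustrative assumptions; the real mechanism lives in the modified DeepGEMM epilogue (atomic increments) and the DeepEP combine kernel (polling).

```python
# Toy producer-consumer sketch of the Down GEMM / Combine Send handshake.
# One signal counter per block of block_m tokens: the producer ("Down GEMM")
# increments it as partial work for that block finishes; the consumer
# ("Combine Send") polls until the counter reaches a threshold, then "sends"
# the block. All names and numbers here are illustrative placeholders.
import threading
import time

BLOCK_M = 64          # tokens covered by one signal unit
NUM_BLOCKS = 4        # number of block_m-sized token groups for one expert
PARTS_PER_BLOCK = 3   # partial results the GEMM produces per block (threshold)

signals = [0] * NUM_BLOCKS
lock = threading.Lock()  # plays the role of the GPU atomic


def down_gemm() -> None:
    """Producer: finish partial results block by block and bump the signal."""
    for block in range(NUM_BLOCKS):
        for _ in range(PARTS_PER_BLOCK):
            time.sleep(0.001)  # stands in for computing a slice of the block
            with lock:
                signals[block] += 1  # atomicAdd on the signal unit


def combine_send() -> None:
    """Consumer: poll each block's signal and send once it hits the threshold."""
    for block in range(NUM_BLOCKS):
        while True:
            with lock:
                ready = signals[block] >= PARTS_PER_BLOCK
            if ready:
                break  # busy-wait, like the polling in the combine kernel
        print(f"send tokens [{block * BLOCK_M}, {(block + 1) * BLOCK_M})")


if __name__ == "__main__":
    producer = threading.Thread(target=down_gemm)
    consumer = threading.Thread(target=combine_send)
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
```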
3. Modifications
Dockerfile
When building the Docker image for SBO on Hopper, install DeepEP from the specific branch mentioned above.
Server Arguments
--enable-single-batch-overlap: Enables SBO (Single Batch Overlap) on Hopper; this argument was previously added in Support single batch overlap #10422. When SBO is enabled, --moe-a2a-backend must be "deepep" and --moe-runner-backend must be "deep_gemm"; otherwise this argument has no effect. An example launch command is sketched below.
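For reference, a launch invocation consistent with these constraints could look like the following sketch. Only the three SBO-related flags come from this PR's description; the model path and parallelism settings are placeholders for your own deployment.

```python
# Hypothetical launch sketch. Only --enable-single-batch-overlap,
# --moe-a2a-backend deepep, and --moe-runner-backend deep_gemm come from this
# PR's description; the model path and tp size are placeholders.
import subprocess

subprocess.run([
    "python3", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-R1",   # placeholder model
    "--tp", "8",                                 # placeholder parallelism
    "--moe-a2a-backend", "deepep",               # required when SBO is enabled
    "--moe-runner-backend", "deep_gemm",         # required when SBO is enabled
    "--enable-single-batch-overlap",             # the new SBO flag (#10422)
], check=True)
```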
SBO
Add condition checks and argument calculation for SBO on Hopper.
Model
Add a DeepEP hook for SBO on Hopper in DeepseekV2MoE of the deepseek_v2 model.
MoE Runner
Add meta_overlap_args to the MoE runner running states to dynamically modify overlap arguments during the forward process (a hypothetical sketch of such a container appears at the end of this section).
Deepep Token Dispatcher
DeepGEMM Wrapper
Modify the masked GEMM wrapper based on deep_gemm to support SBO.
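As a rough illustration of the overlap arguments mentioned under MoE Runner above, a hypothetical container could look like the sketch below. The field names and helper are invented for illustration and are not claimed to match the actual definitions in this PR.

```python
# Hypothetical sketch of an overlap-arguments container threaded through the
# MoE runner's running state. Field names are invented for illustration and
# are not the PR's actual definition.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MetaOverlapArgs:
    enabled: bool = False                  # whether SBO is active this forward
    block_m: int = 64                      # tokens covered by one signal unit
    threshold: int = 1                     # signal value at which a block is "ready"
    signal: Optional[torch.Tensor] = None  # per-(expert, block) int32 counters


def make_overlap_args(num_local_experts: int, max_tokens: int,
                      block_m: int = 64, device: str = "cuda") -> MetaOverlapArgs:
    """Allocate zeroed signal counters for one forward pass (illustrative only)."""
    num_blocks = (max_tokens + block_m - 1) // block_m
    signal = torch.zeros(num_local_experts, num_blocks,
                         dtype=torch.int32, device=device)
    return MetaOverlapArgs(enabled=True, block_m=block_m, signal=signal)
```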
4. Evaluation
4.1. Experiment Setup
4.2. Performance Evaluation
4.3. Accuracy Tests
4.4. Repro Script