[Feat] Single Batch Overlap (SBO): Overlapping of Down GEMM with Combine Send #14
Conversation
Co-authored-by: Zqy11 <[email protected]> Co-authored-by: AniZpZ <[email protected]>
LGTM

I'm using H20, with one machine running a P node and another running a D node. When starting the D node, I added the following three parameters:

@Sulfur6 The problem is as described above; I forgot to add the "@" symbol earlier.

@programmer-lxj I left a comment in the conversation of sgl-project/sglang#9660.

@Sulfur6 Thank you very much!

from deepseek-ai#183
1. Motivation
The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (e.g., the H20), for two main reasons. First, on the Hopper architecture the WGMMA block_m is 64, so when TBO splits a small Decode batch into even smaller micro-batches, the MLP GEMM pads each micro-batch up to a full 64-row tile and much of its compute is wasted on padding; a positive throughput gain is only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.
Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.
We implement SBO for DeepSeek-V3/R1 by modifying DeepEP and DeepGEMM, covering the overlap of Shared Expert with Dispatch Recv as well as the overlap of Down GEMM with Combine Send.
The overlap of Down GEMM with Combine Send is implemented by modifying SGLang, DeepEP, and DeepGEMM, with the detailed implementation available in the PRs below:
We also conducted integration and evaluation in SGLang: sgl-project/sglang#9660.
2. Overlap Design
SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1: one overlaps the Shared Expert computation with the Dispatch Recv communication, and the other overlaps the Down GEMM computation with the Combine Send communication.


The interaction between Down GEMM and Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for those block_m tokens and atomically increments the signal unit as each portion of the work completes. The Combine Send polls the signal unit; once its value reaches a threshold, it sends the corresponding block_m tokens.
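To make the protocol concrete, below is a minimal, self-contained CUDA demo of the pattern. This is a sketch, not DeepGEMM/DeepEP code: a producer kernel stands in for the Down GEMM and publishes each block_m chunk with a fence-plus-atomic, while a consumer kernel stands in for the Combine Send and polls each chunk's signal. All names and sizes are illustrative, and the demo assumes both kernels fit on the GPU concurrently.

```cuda
// sbo_signal_demo.cu -- toy producer/consumer signal protocol (illustrative only)
#include <cuda_runtime.h>
#include <cstdio>

constexpr int kBlockM    = 64;  // tokens per signal unit (WGMMA block_m on Hopper)
constexpr int kNumChunks = 8;   // number of block_m-sized chunks
constexpr int kThreshold = 1;   // producer partials per chunk; real kernels may use >1

// Producer: stands in for the Down GEMM epilogue. After "computing" a chunk's
// block_m tokens, it makes the data visible device-wide, then bumps the signal.
__global__ void down_gemm_stub(float* out, int* signals) {
    int chunk = blockIdx.x;
    for (int i = threadIdx.x; i < kBlockM; i += blockDim.x)
        out[chunk * kBlockM + i] = chunk + 1.0f;   // pretend GEMM result
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();                // publish the chunk's data first...
        atomicAdd(&signals[chunk], 1);  // ...then increment the chunk's signal
    }
}

// Consumer: stands in for the Combine Send. It spins on each chunk's signal and
// "sends" (here: sums) the chunk as soon as the threshold is reached.
__global__ void combine_send_stub(const float* out, volatile int* signals,
                                  float* chunk_sums) {
    int chunk = blockIdx.x;
    if (threadIdx.x == 0) {
        while (signals[chunk] < kThreshold) { /* spin until chunk is ready */ }
        __threadfence();                // pair with the producer's fence
        float s = 0.f;
        for (int i = 0; i < kBlockM; ++i) s += out[chunk * kBlockM + i];
        chunk_sums[chunk] = s;          // real code would send the tokens here
    }
}

int main() {
    float *out, *sums; int* signals;
    cudaMalloc(&out, kNumChunks * kBlockM * sizeof(float));
    cudaMalloc(&sums, kNumChunks * sizeof(float));
    cudaMalloc(&signals, kNumChunks * sizeof(int));
    cudaMemset(signals, 0, kNumChunks * sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    // Launch the consumer first to show it genuinely blocks on the signals.
    combine_send_stub<<<kNumChunks, 32, 0, s1>>>(out, signals, sums);
    down_gemm_stub<<<kNumChunks, 32, 0, s2>>>(out, signals);
    cudaDeviceSynchronize();

    float host[kNumChunks];
    cudaMemcpy(host, sums, sizeof(host), cudaMemcpyDeviceToHost);
    for (int c = 0; c < kNumChunks; ++c)
        printf("chunk %d sum = %.0f\n", c, host[c]);  // expect 64 * (c + 1)
    return 0;
}
```

In the real integration the ordering is driven by the modified DeepGEMM epilogue and DeepEP send kernels rather than by two toy kernels on separate streams; the demo only illustrates the signal handshake itself.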
3. Modifications
- Modified the m_grouped_fp8_gemm_nt_masked Python interface and the sm90_m_grouped_fp8_gemm_masked_1d2d implementation, adding a return value and parameters to support overlapping Down GEMM with Combine Send.
- Modified the sm90_fp8_gemm_1d2d_impl kernel, adding parameters for overlap and using atom.add.release.gpu.global.s32 to write the signal after the corresponding block_m tokens are computed (see the sketch below).
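The signal write can be fused into a single GPU-scope release atomic, which is the atom.add.release.gpu.global.s32 instruction named above. A device-side helper for this pattern might look like the following sketch; it is not the PR's actual code, only an illustration of the instruction's use.

```cuda
// Sketch: one release-semantics atomic replaces the __threadfence() +
// atomicAdd pair from the demo above. The "release.gpu" qualifiers make all
// prior global writes by this thread visible device-wide before the updated
// signal value becomes visible to the polling Combine Send.
__device__ __forceinline__ int atomic_add_release_global(int* ptr, int value) {
    int old;
    asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"(ptr), "r"(value)
                 : "memory");
    return old;  // previous signal count; the GEMM epilogue can ignore it
}
```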
4. Evaluation
We integrated the modified DeepEP and DeepGEMM into SGLang for performance evaluation.
4.1. Experiment Setup
4.2. Performance Evaluation
4.3. Accuracy Tests
4.4. Repro Script
Please refer to sgl-project/sglang#9660.