[DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache #13959
Fridge003 merged 13 commits into sgl-project:main
Conversation
@Fridge003 @ch-wan @lixiaolx Please help review the PR.
ch-wan left a comment:
Could you add some test cases? I will have a closer check tomorrow.
@ch-wan done!
Maybe we can combine this with PP to gain the best performance: #11852
Out of curiosity, may I ask whether the performance before the PR was measured after tuning?
@yhyang201 Do you mean the tuning of fused MoE? Before this PR, fused MoE was not supported. The performance improvement in this PR mainly comes from the optimized fused MoE, which delivers better performance compared to DeepEP.
use mha bugfix fix
This reverts commit 2265f1d.
Let's merge after the new version release.
/rerun-failed-ci
@xu-yfei Can you please pull the latest branch?
done~
Just verified this feature on a local H200. It should be correct.
…ch and fp8 kvcache (sgl-project#13959)
Hi @xu-yfei, does this Context Parallelism support PD deployment?
@xu-yfei Reporting a performance regression on H800 with the new options. cc @Fridge003
@yiakwy-xpu-ml-framework-team Sorry, I didn't quite get what you meant. Could you clarify which scenarios are being compared, and which specific performance metric has degraded? What are the exact values of this metric before and after the degradation? Also, what are the input length and output length in question?
Motivation
The original default token splitting scheme of CP does not support multi-batch prefill. A new token splitting method is introduced to enable multi-batch support, fused MoE compatibility, and FP8 KV cache support. Compared with the original DeepEP scheme, the combination of the tuned fused MoE backend and the new token splitting method reduces TTFT by 8.9% (for inputs ≥16K tokens) to 32% (for 1K-token inputs) on 8× H20 (141GB). The H20-3e fused MoE tuning configurations will be submitted in the next PR.
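As a minimal standalone illustration of the round-robin scheme (hypothetical helper names; the actual SGLang helpers are `cp_split_and_rebuild_data` and friends, which operate on tensors), token `i` goes to rank `i % cp_size`, and the inverse interleaves the per-rank shards back into the original order:

```python
# Illustrative sketch only, not the SGLang implementation.
from typing import List


def round_robin_split(tokens: List[int], cp_size: int) -> List[List[int]]:
    """Assign token i to CP rank i % cp_size, balancing load across indexers."""
    shards: List[List[int]] = [[] for _ in range(cp_size)]
    for idx, tok in enumerate(tokens):
        shards[idx % cp_size].append(tok)
    return shards


def round_robin_gather(shards: List[List[int]]) -> List[int]:
    """Inverse of the split: token j of rank r came from position j * cp_size + r."""
    cp_size = len(shards)
    out = [None] * sum(len(s) for s in shards)
    for rank, shard in enumerate(shards):
        for j, tok in enumerate(shard):
            out[j * cp_size + rank] = tok
    return out
```

Because every rank receives either `ceil(n / cp_size)` or `floor(n / cp_size)` tokens regardless of how the batch is composed, the scheme stays balanced across multiple prefill batches, unlike a per-sequence split.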
Activate via the `--nsa-prefill-cp-mode round-robin-split` flag (default: `in-seq-split`, which uses the original token splitting scheme). Tokens are evenly distributed, with each token's rank calculated as `token_idx % cp_size`, ensuring balanced computation across all indexers.

Modifications
- Support for the new token splitting scheme: compatibility adaptations for `cp_split_and_rebuild_data`, `cp_split_and_rebuild_position`, and `cp_all_gather_rerange_output`.
- Support for fused MoE: compatibility optimizations for `communicator_nsa_cp.py` to accommodate both the `ScatterMode.SCATTERED` DeepEP and the `ScatterMode.FULL` fused MoE implementations. Fused MoE support requires `dp-size=1`.
- Support for FP8 KV cache: when the attention TP size is not equal to 1, `nsa_cache_seqlens_int32` requires additional padding.
- Optimization of the decode function in the MTP target-verify scenario: draft tokens are counted in the batch size for CUDA Graph capture; during target verify, the forward-extend implementation for NSA uses `--nsa-decode-backend` instead of `--nsa-prefill-backend`.

Accuracy Tests
- DeepEP after this PR
- DeepEP with the new token splitting scheme
- FusedMoE with the new token splitting scheme
- FusedMoE with the new token splitting scheme, FP8 KV cache
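The FP8 KV cache modification above pads `nsa_cache_seqlens_int32` when the attention TP size is not 1. A minimal sketch of that kind of padding, with a hypothetical helper name and a hypothetical pad-to-multiple rule (the actual rule lives in SGLang and may differ):

```python
# Hypothetical sketch: pad a per-request sequence-length array so its
# length is a multiple of the attention TP size. Not the SGLang code.
import math
from typing import List


def pad_seqlens_for_attn_tp(seqlens: List[int], attn_tp_size: int,
                            pad_value: int = 0) -> List[int]:
    if attn_tp_size == 1:
        return list(seqlens)  # no padding needed when attention TP is 1
    target = math.ceil(len(seqlens) / attn_tp_size) * attn_tp_size
    return list(seqlens) + [pad_value] * (target - len(seqlens))
```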
Benchmarking and Profiling
On 8× H20 (141GB), the mean TTFT under different configurations:

- for DeepEP: `--moe-a2a-backend deepep --ep-size 8`
- for the round-robin split cp-mode: `--nsa-prefill-cp-mode round-robin-split`
- for FP8 KV cache: `--kv-cache-dtype fp8_e4m3`

Checklist
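For reference, a hypothetical launch command combining the flags quoted in this PR. Only those flags come from the source; the entry point, model path, and `--tp` value are illustrative assumptions:

```shell
# Illustrative only: model path and tp size are placeholders, not from the PR.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --tp 8 \
  --nsa-prefill-cp-mode round-robin-split \
  --kv-cache-dtype fp8_e4m3
```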