
[DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache#13959

Merged
Fridge003 merged 13 commits into sgl-project:main from antgroup:xyf/cp_opt
Jan 2, 2026

Conversation

@xu-yfei
Contributor

@xu-yfei xu-yfei commented Nov 26, 2025

Motivation

The original default token-splitting scheme of CP does not support multi-batch prefill. This PR introduces a new token-splitting method that enables multi-batch support, fused MoE compatibility, and FP8 KV cache support. Compared with the original DeepEP scheme, the combination of the tuned fused MoE backend and the new token-splitting method reduces TTFT by 8.9% (for inputs ≥16K tokens) to 32% (for 1K-token inputs) on 8× H20 (141GB). H20-3e fused MoE tuning configurations will be submitted in the next PR.

Activate via the --nsa-prefill-cp-mode round-robin-split flag (default: in-seq-split, the original token-splitting scheme). Tokens are distributed evenly, with each token assigned to the rank token_idx % cp_size, which keeps computation balanced across all indexers.
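
The two splitting schemes can be illustrated with a minimal sketch. This is not SGLang's implementation; both helper names are invented for illustration only:

```python
# Illustrative sketch of the two prefill-CP token-splitting schemes.
# These helpers are invented for this example and are not SGLang's API.

def in_seq_split(tokens, cp_size):
    """Original scheme: each CP rank gets one contiguous chunk."""
    chunk = (len(tokens) + cp_size - 1) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def round_robin_split(tokens, cp_size):
    """New scheme: token i goes to rank i % cp_size, so per-rank
    workloads stay balanced across the indexers."""
    return [tokens[r::cp_size] for r in range(cp_size)]

tokens = list(range(10))
print(in_seq_split(tokens, 4))       # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(round_robin_split(tokens, 4))  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Note that the rebuild/gather helpers mentioned below (e.g. cp_all_gather_rerange_output) have to invert this interleaved mapping to restore the original token order.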

Modifications

  1. Support for the new token splitting scheme: compatibility adaptations for cp_split_and_rebuild_data, cp_split_and_rebuild_position, and cp_all_gather_rerange_output.

  2. Support for fused MoE: compatibility optimizations in communicator_nsa_cp.py to accommodate both the ScatterMode.SCATTERED DeepEP and ScatterMode.FULL fused MoE implementations. Fused MoE support requires dp-size=1.

  3. Support for FP8 KV cache: when the attention TP size is not equal to 1, nsa_cache_seqlens_int32 requires additional padding.

  4. Optimize the decode path in the MTP target-verify scenario: draft tokens are counted in the batch size for CUDA Graph capture, and during target verify the NSA forward-extend implementation uses --nsa-decode-backend instead of --nsa-prefill-backend.
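
The padding in item 3 can be sketched roughly as follows. This is a hypothetical illustration, not SGLang's code: the helper name and the zero padding value are assumptions; the actual implementation pads nsa_cache_seqlens_int32 inside the NSA attention backend.

```python
import math

def pad_seqlens(seqlens, attn_tp_size):
    """Pad a per-request sequence-length list to a multiple of
    attn_tp_size (illustrative stand-in for nsa_cache_seqlens_int32)."""
    if attn_tp_size == 1:
        return list(seqlens)  # no padding needed when attention TP size is 1
    target = math.ceil(len(seqlens) / attn_tp_size) * attn_tp_size
    return list(seqlens) + [0] * (target - len(seqlens))

print(pad_seqlens([5, 7, 9], 2))  # [5, 7, 9, 0]
```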

Accuracy Tests

DeepEP after PR

# gsm8k
Accuracy: 0.961
Invalid: 0.000
Latency: 553.991 s
Output throughput: 221.052 token/s

DeepEP with the new token splitting scheme

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 414.659 s
Output throughput: 295.325 token/s

Fused MoE with the new token splitting scheme

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 330.346 s
Output throughput: 368.333 token/s
# mmlu
subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.954
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.640
subject: college_computer_science, #q:100, acc: 0.880
subject: college_mathematics, #q:100, acc: 0.820
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.882
subject: computer_security, #q:100, acc: 0.910
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.816
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.740
subject: high_school_biology, #q:310, acc: 0.961
subject: high_school_chemistry, #q:203, acc: 0.882
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.928
subject: high_school_mathematics, #q:270, acc: 0.785
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.854
subject: high_school_psychology, #q:545, acc: 0.965
subject: high_school_statistics, #q:216, acc: 0.875
subject: high_school_us_history, #q:204, acc: 0.951
subject: high_school_world_history, #q:237, acc: 0.954
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.939
subject: machine_learning, #q:112, acc: 0.821
subject: management, #q:103, acc: 0.922
subject: marketing, #q:234, acc: 0.953
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.876
subject: moral_scenarios, #q:895, acc: 0.800
subject: nutrition, #q:306, acc: 0.935
subject: philosophy, #q:311, acc: 0.916
subject: prehistory, #q:324, acc: 0.944
subject: professional_accounting, #q:282, acc: 0.887
subject: professional_law, #q:1534, acc: 0.717
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.917
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.894
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.936
Total latency: 1132.070
Average accuracy: 0.881
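
Relative to the DeepEP run above, the Fused MoE run's gsm8k improvement can be checked with a couple of lines (figures taken from this description):

```python
def pct_change(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100

# Fused MoE + new splitting scheme vs. DeepEP after PR (gsm8k runs above)
print(round(pct_change(221.052, 368.333), 1))  # throughput: 66.6 (% higher)
print(round(pct_change(553.991, 330.346), 1))  # latency: -40.4 (% lower)
```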

Fused MoE with the new token splitting scheme, FP8 KV cache

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 334.842 s
Output throughput: 366.755 token/s
# mmlu
subject: abstract_algebra, #q:100, acc: 0.810
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.954
subject: business_ethics, #q:100, acc: 0.860
subject: clinical_knowledge, #q:265, acc: 0.925
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.880
subject: college_mathematics, #q:100, acc: 0.850
subject: college_medicine, #q:173, acc: 0.890
subject: college_physics, #q:102, acc: 0.892
subject: computer_security, #q:100, acc: 0.900
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.781
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.947
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.872
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.928
subject: high_school_mathematics, #q:270, acc: 0.785
subject: high_school_microeconomics, #q:238, acc: 0.971
subject: high_school_physics, #q:151, acc: 0.861
subject: high_school_psychology, #q:545, acc: 0.965
subject: high_school_statistics, #q:216, acc: 0.880
subject: high_school_us_history, #q:204, acc: 0.951
subject: high_school_world_history, #q:237, acc: 0.954
subject: human_aging, #q:223, acc: 0.839
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.950
subject: jurisprudence, #q:108, acc: 0.926
subject: logical_fallacies, #q:163, acc: 0.945
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.953
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.959
subject: moral_disputes, #q:346, acc: 0.867
subject: moral_scenarios, #q:895, acc: 0.792
subject: nutrition, #q:306, acc: 0.925
subject: philosophy, #q:311, acc: 0.920
subject: prehistory, #q:324, acc: 0.948
subject: professional_accounting, #q:282, acc: 0.890
subject: professional_law, #q:1534, acc: 0.714
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.912
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.886
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.936
Total latency: 1169.962
Average accuracy: 0.880

Benchmarking and Profiling

On 8× H20 (141GB), the mean TTFT (ms) under different configurations:

export SGL_ENABLE_JIT_DEEPGEMM=1
export TORCHINDUCTOR_CACHE_DIR=/home/admin/inductor_root_cache
export SGLANG_TORCH_PROFILER_DIR=/home/admin/torch_profiler

MODEL_PATH=/home/models/DeepSeek-V3.2-Exp/

python3 -m sglang.launch_server --model-path $MODEL_PATH --dp 1 \
--enable-dp-attention --trust-remote-code --port 8000 --host 0.0.0.0 \
--attention-backend nsa --nsa-prefill-backend flashmla_sparse \
--nsa-decode-backend flashmla_sparse --enable-metrics --mem-fraction-static 0.8 \
--max-running-requests 128 --enable-cache-report --page-size 64 \
--tp-size 8 --moe-dense-tp-size 1 \
--disable-radix-cache \
--chunked-prefill-size 16384 \
--enable-nsa-prefill-context-parallel
# optional, depending on the configuration under test:
# --moe-a2a-backend deepep --ep-size 8
# --nsa-prefill-cp-mode round-robin-split
# --kv-cache-dtype fp8_e4m3

- for DeepEP: add `--moe-a2a-backend deepep --ep-size 8`
- for round-robin-split CP mode: add `--nsa-prefill-cp-mode round-robin-split`
- for FP8 KV cache: add `--kv-cache-dtype fp8_e4m3`

i=1
python3 sglang/python/sglang/bench_serving.py --model /home/models/DeepSeek-V3.2-Exp/ --base-url http://127.0.0.1:8000  --dataset-name random --num-prompts 100 --random-input-len $((i*1024)) --random-output-len 1 --request-rate 1000 --random-range-ratio 1.0  --max-concurrency 1 --dataset-path /home/ShareGPT_V3_unfiltered_cleaned_split.json

for((i=8;i<=64;i+=8)); do
python3 sglang/python/sglang/bench_serving.py --model /home/models/DeepSeek-V3.2-Exp/ --base-url http://127.0.0.1:8000  --dataset-name random --num-prompts 100 --random-input-len $((i*1024)) --random-output-len 1 --request-rate 1000 --random-range-ratio 1.0  --max-concurrency 1 --dataset-path /home/ShareGPT_V3_unfiltered_cleaned_split.json 
done
| Input Length (KB) | Before PR (DeepEP) | After PR (DeepEP, Round-Robin Split) | TTFT Change vs. Before PR | After PR (Tuned Fused MoE, Round-Robin Split) | TTFT Change vs. Before PR | After PR (Tuned Fused MoE, KV FP8, Round-Robin Split) | TTFT Change vs. Before PR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 303.49 | 257.7 | -15.09% | 205.6 | -32.25% | 220.2 | -27.44% |
| 8 | 900.3 | 869.54 | -3.42% | 778.63 | -13.51% | 789.83 | -12.27% |
| 16 | 1858.31 | 1851.36 | -0.37% | 1692.13 | -8.94% | 1722.28 | -7.32% |
| 24 | 2636.58 | 2571.06 | -2.49% | 2351.67 | -10.81% | 2392.25 | -9.27% |
| 32 | 3669.27 | 3619.3 | -1.36% | 3322.56 | -9.45% | 3382.81 | -7.81% |
| 40 | 4517.65 | 4413.67 | -2.30% | 4055.31 | -10.23% | 4125.17 | -8.69% |
| 48 | 5605.21 | 5530.09 | -1.34% | 5096.42 | -9.08% | 5181.6 | -7.56% |
| 56 | 6477.39 | 6334.2 | -2.21% | 5837.79 | -9.87% | 5935.96 | -8.36% |
| 64 | 7695.75 | 7580.04 | -1.50% | 7011.81 | -8.89% | 7116.35 | -7.53% |
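
The percentage columns can be reproduced from the raw TTFT values, e.g. for the 1K-input row:

```python
def ttft_change(before_ms, after_ms):
    """TTFT change relative to the pre-PR DeepEP baseline, in percent."""
    return (after_ms - before_ms) / before_ms * 100

print(round(ttft_change(303.49, 257.7), 2))  # DeepEP, round-robin: -15.09
print(round(ttft_change(303.49, 205.6), 2))  # tuned Fused MoE: -32.25
```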

Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation deepseek labels Nov 26, 2025
@xu-yfei
Contributor Author

xu-yfei commented Nov 26, 2025

@Fridge003 @ch-wan @lixiaolx Please help review the PR.

@xu-yfei xu-yfei changed the title [DeeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache [DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache Nov 26, 2025
Collaborator

@ch-wan ch-wan left a comment


Could you add some test cases? I will have a closer check tomorrow.

@xu-yfei
Contributor Author

xu-yfei commented Nov 26, 2025

Could you add some test cases? I will have a closer check tomorrow.

@ch-wan done!

@whybeyoung
Collaborator

Maybe we can combine this with PP to gain the best performance: #11852

@yhyang201
Collaborator

Out of curiosity, may I ask whether the performance before the PR was measured after tuning?
If not, could you please provide the performance metrics before the PR but after tuning?
This would help better highlight the performance improvements brought by the change.
Thank you very much!

@xu-yfei
Contributor Author

xu-yfei commented Nov 28, 2025

Out of curiosity, may I ask whether the performance before the PR was measured after tuning? If not, could you please provide the performance metrics before the PR but after tuning? This would help better highlight the performance improvements brought by the change. Thank you very much!

@yhyang201 Do you mean the tuning of fused MoE? Before this PR, fused MoE was not supported. The performance improvement in this PR mainly comes from the optimized fused MoE, which delivers better performance compared to DeepEP.

@Fridge003
Collaborator

Let's merge after new version release

@Fridge003
Collaborator

/rerun-failed-ci

@Fridge003
Collaborator

@xu-yfei Can you please pull the latest branch?

@xu-yfei
Contributor Author

xu-yfei commented Jan 2, 2026

@xu-yfei Can you please pull the latest branch

done~

@Fridge003
Collaborator

Just verified this feature on local H200. It should be correct

@Fridge003 Fridge003 merged commit 0d24411 into sgl-project:main Jan 2, 2026
195 of 201 checks passed
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Jan 4, 2026
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
@llc-kc
Contributor

llc-kc commented Jan 26, 2026

Hi, @xu-yfei , does this Context Parallelism support PD deployment?
Currently we get significant accuracy drop when we deploy deepseek v3.2 PD with prefill CP+MTP.

@xu-yfei
Contributor Author

xu-yfei commented Jan 26, 2026

Hi, @xu-yfei , does this Context Parallelism support PD deployment? Currently we get significant accuracy drop when we deploy deepseek v3.2 PD with prefill CP+MTP.

@llc-kc This should be supported. Please specify your testing method. If possible, let’s discuss it via an issue.

@yiakwy-xpu-ml-framework-team
Contributor

yiakwy-xpu-ml-framework-team commented Mar 3, 2026

@xu-yfei reports a performance regression on H800 with the new options:

#12065 (comment)

cc @Fridge003

@xu-yfei
Contributor Author

xu-yfei commented Mar 5, 2026

H800

@yiakwy-xpu-ml-framework-team Sorry, I didn't quite get what you meant. Could you clarify which scenarios are being compared, and which specific performance metric has degraded? What are the exact values of this metric before and after the degradation? Also, what are the input length and output length in question?


Labels

deepseek documentation Improvements or additions to documentation npu run-ci


9 participants