
[DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache#13959

Merged
Fridge003 merged 13 commits into sgl-project:main from antgroup:xyf/cp_opt
Jan 2, 2026

Conversation

@xu-yfei
Contributor

@xu-yfei xu-yfei commented Nov 26, 2025

Motivation

The original default token-splitting scheme of CP does not support multi-batch prefill. This PR introduces a new token-splitting method that enables multi-batch support, fused MoE compatibility, and FP8 KV cache support. Compared with the original DeepEP scheme, the combination of the tuned fused MoE backend and the new token-splitting method reduces TTFT by 8.9% (for inputs ≥16K tokens) to 32% (for 1K-token inputs) on 8× H20 (141GB). H20-3e fused MoE tuning configurations will be submitted in the next PR.

Activate via the --nsa-prefill-cp-mode round-robin-split flag (default: in-seq-split, the original token-splitting scheme). Tokens are distributed evenly, with each token assigned to the rank token_idx % cp_size, which keeps computation balanced across all indexers.
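
The two splitting schemes can be illustrated with a minimal sketch. This is not SGLang's implementation; both helper names are invented for illustration only:

```python
# Illustrative sketch of the two prefill-CP token-splitting schemes.
# These helpers are invented for this example and are not SGLang's API.

def in_seq_split(tokens, cp_size):
    """Original scheme: each CP rank gets one contiguous chunk."""
    chunk = (len(tokens) + cp_size - 1) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def round_robin_split(tokens, cp_size):
    """New scheme: token i goes to rank i % cp_size, so per-rank
    workloads stay balanced across the indexers."""
    return [tokens[r::cp_size] for r in range(cp_size)]

tokens = list(range(10))
print(in_seq_split(tokens, 4))       # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(round_robin_split(tokens, 4))  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Note that the rebuild/gather helpers mentioned below (e.g. cp_all_gather_rerange_output) have to invert this interleaved mapping to restore the original token order.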

Modifications

  1. Support for the new token splitting scheme: compatibility adaptations for cp_split_and_rebuild_data, cp_split_and_rebuild_position, and cp_all_gather_rerange_output.

  2. Support for fused MoE: compatibility optimizations in communicator_nsa_cp.py to accommodate both the ScatterMode.SCATTERED DeepEP and ScatterMode.FULL fused MoE implementations. Fused MoE support requires dp-size=1.

  3. Support for FP8 KV cache: when the attention TP size is not equal to 1, nsa_cache_seqlens_int32 requires additional padding.

  4. Optimize the decode path in the MTP target-verify scenario: draft tokens are counted in the batch size for CUDA Graph capture, and during target verify the NSA forward-extend implementation uses --nsa-decode-backend instead of --nsa-prefill-backend.
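
The padding in item 3 can be sketched roughly as follows. This is a hypothetical illustration, not SGLang's code: the helper name and the zero padding value are assumptions; the actual implementation pads nsa_cache_seqlens_int32 inside the NSA attention backend.

```python
import math

def pad_seqlens(seqlens, attn_tp_size):
    """Pad a per-request sequence-length list to a multiple of
    attn_tp_size (illustrative stand-in for nsa_cache_seqlens_int32)."""
    if attn_tp_size == 1:
        return list(seqlens)  # no padding needed when attention TP size is 1
    target = math.ceil(len(seqlens) / attn_tp_size) * attn_tp_size
    return list(seqlens) + [0] * (target - len(seqlens))

print(pad_seqlens([5, 7, 9], 2))  # [5, 7, 9, 0]
```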

Accuracy Tests

DeepEP after PR

# gsm8k
Accuracy: 0.961
Invalid: 0.000
Latency: 553.991 s
Output throughput: 221.052 token/s

DeepEP with the new token splitting scheme

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 414.659 s
Output throughput: 295.325 token/s

Fused MoE with the new token splitting scheme

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 330.346 s
Output throughput: 368.333 token/s
# mmlu
subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.954
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.640
subject: college_computer_science, #q:100, acc: 0.880
subject: college_mathematics, #q:100, acc: 0.820
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.882
subject: computer_security, #q:100, acc: 0.910
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.816
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.740
subject: high_school_biology, #q:310, acc: 0.961
subject: high_school_chemistry, #q:203, acc: 0.882
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.928
subject: high_school_mathematics, #q:270, acc: 0.785
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.854
subject: high_school_psychology, #q:545, acc: 0.965
subject: high_school_statistics, #q:216, acc: 0.875
subject: high_school_us_history, #q:204, acc: 0.951
subject: high_school_world_history, #q:237, acc: 0.954
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.939
subject: machine_learning, #q:112, acc: 0.821
subject: management, #q:103, acc: 0.922
subject: marketing, #q:234, acc: 0.953
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.876
subject: moral_scenarios, #q:895, acc: 0.800
subject: nutrition, #q:306, acc: 0.935
subject: philosophy, #q:311, acc: 0.916
subject: prehistory, #q:324, acc: 0.944
subject: professional_accounting, #q:282, acc: 0.887
subject: professional_law, #q:1534, acc: 0.717
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.917
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.894
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.936
Total latency: 1132.070
Average accuracy: 0.881
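
Relative to the DeepEP run above, the Fused MoE run's gsm8k improvement can be checked with a couple of lines (figures taken from this description):

```python
def pct_change(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100

# Fused MoE + new splitting scheme vs. DeepEP after PR (gsm8k runs above)
print(round(pct_change(221.052, 368.333), 1))  # throughput: 66.6 (% higher)
print(round(pct_change(553.991, 330.346), 1))  # latency: -40.4 (% lower)
```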

Fused MoE with the new token splitting scheme, FP8 KV cache

# gsm8k
Accuracy: 0.958
Invalid: 0.000
Latency: 334.842 s
Output throughput: 366.755 token/s
# mmlu
subject: abstract_algebra, #q:100, acc: 0.810
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.954
subject: business_ethics, #q:100, acc: 0.860
subject: clinical_knowledge, #q:265, acc: 0.925
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.880
subject: college_mathematics, #q:100, acc: 0.850
subject: college_medicine, #q:173, acc: 0.890
subject: college_physics, #q:102, acc: 0.892
subject: computer_security, #q:100, acc: 0.900
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.781
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.947
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.872
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.928
subject: high_school_mathematics, #q:270, acc: 0.785
subject: high_school_microeconomics, #q:238, acc: 0.971
subject: high_school_physics, #q:151, acc: 0.861
subject: high_school_psychology, #q:545, acc: 0.965
subject: high_school_statistics, #q:216, acc: 0.880
subject: high_school_us_history, #q:204, acc: 0.951
subject: high_school_world_history, #q:237, acc: 0.954
subject: human_aging, #q:223, acc: 0.839
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.950
subject: jurisprudence, #q:108, acc: 0.926
subject: logical_fallacies, #q:163, acc: 0.945
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.953
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.959
subject: moral_disputes, #q:346, acc: 0.867
subject: moral_scenarios, #q:895, acc: 0.792
subject: nutrition, #q:306, acc: 0.925
subject: philosophy, #q:311, acc: 0.920
subject: prehistory, #q:324, acc: 0.948
subject: professional_accounting, #q:282, acc: 0.890
subject: professional_law, #q:1534, acc: 0.714
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.912
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.886
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.936
Total latency: 1169.962
Average accuracy: 0.880

Benchmarking and Profiling

On 8× H20 (141GB), the mean TTFT (ms) under different configurations:

export SGL_ENABLE_JIT_DEEPGEMM=1
export TORCHINDUCTOR_CACHE_DIR=/home/admin/inductor_root_cache
export SGLANG_TORCH_PROFILER_DIR=/home/admin/torch_profiler

MODEL_PATH=/home/models/DeepSeek-V3.2-Exp/

python3 -m sglang.launch_server --model-path $MODEL_PATH --dp 1 \
--enable-dp-attention --trust-remote-code --port 8000 --host 0.0.0.0 \
--attention-backend nsa --nsa-prefill-backend flashmla_sparse \
--nsa-decode-backend flashmla_sparse --enable-metrics --mem-fraction-static 0.8 \
--max-running-requests 128 --enable-cache-report --page-size 64 \
--tp-size 8 --moe-dense-tp-size 1 \
--disable-radix-cache \
--chunked-prefill-size 16384 \
--enable-nsa-prefill-context-parallel
# optional, depending on the configuration under test:
# --moe-a2a-backend deepep --ep-size 8
# --nsa-prefill-cp-mode round-robin-split
# --kv-cache-dtype fp8_e4m3

- for DeepEP: add `--moe-a2a-backend deepep --ep-size 8`
- for round-robin-split CP mode: add `--nsa-prefill-cp-mode round-robin-split`
- for FP8 KV cache: add `--kv-cache-dtype fp8_e4m3`

i=1
python3 sglang/python/sglang/bench_serving.py --model /home/models/DeepSeek-V3.2-Exp/ --base-url http://127.0.0.1:8000  --dataset-name random --num-prompts 100 --random-input-len $((i*1024)) --random-output-len 1 --request-rate 1000 --random-range-ratio 1.0  --max-concurrency 1 --dataset-path /home/ShareGPT_V3_unfiltered_cleaned_split.json

for((i=8;i<=64;i+=8)); do
python3 sglang/python/sglang/bench_serving.py --model /home/models/DeepSeek-V3.2-Exp/ --base-url http://127.0.0.1:8000  --dataset-name random --num-prompts 100 --random-input-len $((i*1024)) --random-output-len 1 --request-rate 1000 --random-range-ratio 1.0  --max-concurrency 1 --dataset-path /home/ShareGPT_V3_unfiltered_cleaned_split.json 
done
| Input Length (KB) | Before PR (DeepEP) | After PR (DeepEP, Round-Robin Split) | TTFT Change vs. Before PR | After PR (Tuned Fused MoE, Round-Robin Split) | TTFT Change vs. Before PR | After PR (Tuned Fused MoE, KV FP8, Round-Robin Split) | TTFT Change vs. Before PR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 303.49 | 257.7 | -15.09% | 205.6 | -32.25% | 220.2 | -27.44% |
| 8 | 900.3 | 869.54 | -3.42% | 778.63 | -13.51% | 789.83 | -12.27% |
| 16 | 1858.31 | 1851.36 | -0.37% | 1692.13 | -8.94% | 1722.28 | -7.32% |
| 24 | 2636.58 | 2571.06 | -2.49% | 2351.67 | -10.81% | 2392.25 | -9.27% |
| 32 | 3669.27 | 3619.3 | -1.36% | 3322.56 | -9.45% | 3382.81 | -7.81% |
| 40 | 4517.65 | 4413.67 | -2.30% | 4055.31 | -10.23% | 4125.17 | -8.69% |
| 48 | 5605.21 | 5530.09 | -1.34% | 5096.42 | -9.08% | 5181.6 | -7.56% |
| 56 | 6477.39 | 6334.2 | -2.21% | 5837.79 | -9.87% | 5935.96 | -8.36% |
| 64 | 7695.75 | 7580.04 | -1.50% | 7011.81 | -8.89% | 7116.35 | -7.53% |
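
The percentage columns can be reproduced from the raw TTFT values, e.g. for the 1K-input row:

```python
def ttft_change(before_ms, after_ms):
    """TTFT change relative to the pre-PR DeepEP baseline, in percent."""
    return (after_ms - before_ms) / before_ms * 100

print(round(ttft_change(303.49, 257.7), 2))  # DeepEP, round-robin: -15.09
print(round(ttft_change(303.49, 205.6), 2))  # tuned Fused MoE: -32.25
```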

Checklist

@github-actions github-actions bot added documentation Improvements or additions to documentation deepseek labels Nov 26, 2025
@xu-yfei
Contributor Author

xu-yfei commented Nov 26, 2025

@Fridge003 @ch-wan @lixiaolx Please help review the PR.

@xu-yfei xu-yfei changed the title [DeeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache [DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache Nov 26, 2025
Collaborator

@ch-wan ch-wan left a comment


Could you add some test cases? I will have a closer check tomorrow.

@xu-yfei
Contributor Author

xu-yfei commented Nov 26, 2025

Could you add some test cases? I will have a closer check tomorrow.

@ch-wan done!

@whybeyoung
Collaborator

Maybe we can combine this with PP to gain the best performance: #11852

@yhyang201
Collaborator

Out of curiosity, may I ask whether the performance before the PR was measured after tuning?
If not, could you please provide the performance metrics before the PR but after tuning?
This would help better highlight the performance improvements brought by the change.
Thank you very much!

@xu-yfei
Contributor Author

xu-yfei commented Nov 28, 2025

Out of curiosity, may I ask whether the performance before the PR was measured after tuning? If not, could you please provide the performance metrics before the PR but after tuning? This would help better highlight the performance improvements brought by the change. Thank you very much!

@yhyang201 Do you mean the tuning of fused MoE? Before this PR, fused MoE was not supported. The performance improvement in this PR mainly comes from the optimized fused MoE, which delivers better performance compared to DeepEP.

@Fridge003
Collaborator

Let's merge after new version release

@Fridge003
Collaborator

/rerun-failed-ci

@Fridge003
Collaborator

@xu-yfei Can you please pull the latest branch?

@xu-yfei
Contributor Author

xu-yfei commented Jan 2, 2026

@xu-yfei Can you please pull the latest branch

done~

@Fridge003
Collaborator

Just verified this feature on local H200. It should be correct

@Fridge003 Fridge003 merged commit 0d24411 into sgl-project:main Jan 2, 2026
195 of 201 checks passed
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Jan 4, 2026
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
@llc-kc
Contributor

llc-kc commented Jan 26, 2026

Hi, @xu-yfei , does this Context Parallelism support PD deployment?
Currently we get significant accuracy drop when we deploy deepseek v3.2 PD with prefill CP+MTP.

@xu-yfei
Contributor Author

xu-yfei commented Jan 26, 2026

Hi, @xu-yfei , does this Context Parallelism support PD deployment? Currently we get significant accuracy drop when we deploy deepseek v3.2 PD with prefill CP+MTP.

@llc-kc This should be supported. Please specify your testing method. If possible, let’s discuss it via an issue.

@yiakwy-xpu-ml-framework-team
Contributor

yiakwy-xpu-ml-framework-team commented Mar 3, 2026

@xu-yfei reports a performance regression on H800 with the new options:

#12065 (comment)

cc @Fridge003

@xu-yfei
Contributor Author

xu-yfei commented Mar 5, 2026

H800

@yiakwy-xpu-ml-framework-team Sorry, I didn't quite get what you meant. Could you clarify which scenarios are being compared, and which specific performance metric has degraded? What are the exact values of this metric before and after the degradation? Also, what are the input length and output length in question?


Labels

deepseek documentation Improvements or additions to documentation npu run-ci


9 participants