[DeepSeek 3.2] Support and optimize pipeline parallelism when context pipeline enabled #16380
Fridge003 merged 5 commits into sgl-project:main
@Fridge003 Could you please help review this PR?

Nice job

/tag-and-rerun-ci

Hi @ShangmingCai @xu-yfei, when I tried to launch GLM5 with PP2 and TP8 CP8 (sglang v0.5.9), I got this error:

This is a bug introduced by the context parallelism refactor, and we are fixing it now. If you are in a hurry, you can try an older release that does not include that commit.
Motivation
For issue #15358. CC @whybeyoung
Support pipeline parallelism for the Prefill CP scenario (`--enable-nsa-prefill-context-parallel`). In this scenario, the attention linear layers use replicated weights and the received input is already scattered, so the attention TP group does not need to perform an `all_gather` operation during PP send/recv.

Optimize the indexer. With PP, part of the model execution may run concurrently with the PP `recv` operation, which occupies one additional SM. The number of SMs allocated to the indexer's `deep_gemm.fp8_mqa_logits` must therefore be reduced by 1.

As shown in the figure below, on the first node (PP rank = 0) `deep_gemm.fp8_mqa_logits` takes 1.312 ms when it does not overlap with the PP `recv` operation, and 2.211 ms when it does.

Analysis shows that `deep_gemm.fp8_mqa_logits` occupies all SMs by default; on H20 the grid is `(78,)`. Because the `recv` operator already holds 1 SM, the latency of `deep_gemm.fp8_mqa_logits` increases significantly.

Solution: in the PP scenario, reduce the number of SMs used by `deep_gemm.fp8_mqa_logits` by calling `deep_gemm.set_num_sms`, which preserves the kernel's performance.

On 2 × 8 × H20 (96 GB) with 48.5K inputs, TTFT is 3.11 s.
Modifications
Accuracy Tests
On 2 × 8 × H20 (96 GB).
Benchmarking and Profiling
Before optimizing the indexer:


After optimizing the indexer:

