
[PP] Refactor PP to async mode #11852

Merged
ShangmingCai merged 42 commits into sgl-project:main from openanolis:Xuchun/pp-dev
Dec 12, 2025

Conversation

@XucSh
Collaborator

@XucSh XucSh commented Oct 20, 2025

Motivation

See #11857.

User can test with below command now:

python3 -m sglang.launch_server --model /opt/models/Qwen/Qwen3-8b --pp-size 4 --tp 2 --pp-async-batch-depth 1

Co-authors: @ShangmingCai @merrymercy @alpha-baby

Cc: @ShangmingCai @merrymercy @whybeyoung @bluecoffee8

Checklist

merrymercy and others added 3 commits October 20, 2025 15:07
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@XucSh XucSh marked this pull request as draft October 20, 2025 07:29
@gemini-code-assist
Contributor

Summary of Changes

Hello @XucSh, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the pipeline parallelism (PP) implementation by introducing asynchronous processing. The primary goal is to enhance efficiency by allowing GPU computation and CPU processing to overlap, particularly for the last rank in the pipeline, thereby mitigating potential performance bottlenecks. The changes involve a substantial refactoring of the PP scheduling logic into a new mixin and the addition of a configurable parameter to fine-tune the asynchronous batching behavior.

Highlights

  • Asynchronous Pipeline Parallelism (PP) Support: Introduced asynchronous capabilities for pipeline parallelism, allowing for better overlap of computation and communication, especially for the last PP rank.
  • Refactored PP Logic: The core pipeline parallelism event loop (event_loop_pp) has been extracted from scheduler.py into a new dedicated mixin, SchedulerPPMixin, improving modularity and maintainability.
  • Configurable Asynchronous Batch Depth: A new command-line argument --pp-async-batch-depth was added to server_args.py, enabling users to specify the depth of asynchronous batching for PP.
  • Asynchronous Point-to-Point Communication: The point_to_point_pyobj utility function was updated to support asynchronous sending, crucial for non-blocking communication in the new PP implementation.
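The depth-limited batching idea behind --pp-async-batch-depth can be sketched in pure Python. This is only a minimal illustration, not sglang's actual SchedulerPPMixin logic; run_stage, send_async, and wait below are hypothetical stand-ins for the forward pass, a non-blocking P2P send, and its completion wait:

```python
from collections import deque

# Minimal sketch of depth-limited asynchronous batching, loosely modeled on
# the --pp-async-batch-depth idea. The real logic lives in SchedulerPPMixin;
# `send_async` and `wait` here are hypothetical stand-ins.

def run_stage(batches, depth):
    """Process batches, keeping at most `depth` async sends in flight."""
    in_flight = deque()          # pending (hypothetical) send handles
    completed = []

    def send_async(result):      # stand-in for a non-blocking P2P send
        return result            # a real impl would return a waitable handle

    def wait(handle):            # stand-in for handle.wait()
        completed.append(handle)

    for batch in batches:
        result = batch * 2       # stand-in for the forward pass
        in_flight.append(send_async(result))
        # Throttle: never exceed `depth` outstanding sends, so CPU-side work
        # can overlap with at most `depth` pending micro-batches.
        while len(in_flight) > depth:
            wait(in_flight.popleft())

    while in_flight:             # drain remaining sends at the end
        wait(in_flight.popleft())
    return completed

print(run_stage([1, 2, 3, 4], depth=1))  # → [2, 4, 6, 8]
```

A larger depth allows more overlap at the cost of buffering more in-flight micro-batches.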
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the pipeline parallelism (PP) logic into a new SchedulerPPMixin and adds support for asynchronous operations to improve performance by overlapping communication and computation. The changes are quite extensive, introducing a new pp_async_batch_depth server argument and modifying the point_to_point_pyobj utility for async sends.

While the overall direction is good, I've found a few critical issues in the new event_loop_pp implementation within scheduler_pp_mixin.py where essential logic for receiving data from previous pipeline stages appears to be commented out, which would break the pipeline. I've also pointed out some dead code that should be cleaned up for better maintainability. Please see the detailed comments for suggestions on how to fix these issues.
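For intuition on what sending a Python object between pipeline stages involves: the object is serialized and the payload is framed so the receiver knows how many bytes to read. The sketch below shows only this generic length-prefixed framing; it is not sglang's actual wire format for point_to_point_pyobj, and a real async path would hand the bytes to a non-blocking transport (e.g. torch.distributed.isend):

```python
import pickle
import struct

# Generic length-prefixed framing for shipping a Python object as bytes.
# NOT sglang's actual point_to_point_pyobj wire format -- just a sketch of
# the serialize-then-frame idea behind object-level P2P communication.

def frame_pyobj(obj) -> bytes:
    payload = pickle.dumps(obj)
    # 8-byte big-endian length header, then the pickled payload.
    return struct.pack("!Q", len(payload)) + payload

def unframe_pyobj(buf: bytes):
    (length,) = struct.unpack("!Q", buf[:8])
    return pickle.loads(buf[8 : 8 + length])

msg = {"mb_id": 3, "tokens": [101, 102]}
assert unframe_pyobj(frame_pyobj(msg)) == msg
```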

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
ShangmingCai and others added 6 commits October 21, 2025 11:44
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@XucSh XucSh marked this pull request as ready for review October 21, 2025 06:35
XucSh added 2 commits October 22, 2025 11:15
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@XucSh XucSh changed the title from "[PP] support async PP" to "[PP] Refactor PP to async mode" on Oct 22, 2025
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@whybeyoung
Collaborator

whybeyoung commented Oct 22, 2025

Here is the benchmark result on A800 80G ×8.
Model: qwen3-8b
SGLang server command: python -m sglang.launch_server --model-path /work/models/qwen8b --disable-radix-cache --pp-size 4 --trust-remote --host 0.0.0.0 --port 8001 --mem-fraction-static 0.8 --tokenizer-worker-num 8 --tp-size 2 --pp-async-batch-depth 1 --torch-compile-max-bs 8 --max-running-requests 20
Benchmark command: python -m sglang.bench_serving --port 8001 --dataset-name random-ids --num-prompts 128 --random-input-len 1000 --random-output-len 1000 --random-range-ratio 0.9 --disable-stream
Before (main branch, commit 01f14a7):

#Input tokens: 121127
#Output tokens: 121703
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████| 128/128 [03:00<00:00,  1.41s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     128       
Benchmark duration (s):                  180.53    
Total input tokens:                      121127    
Total input text tokens:                 121127    
Total input vision tokens:               0         
Total generated tokens:                  121703    
Total generated tokens (retokenized):    120510    
Request throughput (req/s):              0.71      
Input token throughput (tok/s):          670.94    
Output token throughput (tok/s):         674.13    
Total token throughput (tok/s):          1345.08   
Concurrency:                             71.00     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   100144.97 
Median E2E Latency (ms):                 105706.80 
---------------Time to First Token----------------
Mean TTFT (ms):                          100145.09 
Median TTFT (ms):                        105706.89 
P99 TTFT (ms):                           179895.75 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

After (commit 8fec316):

#Input tokens: 121127
#Output tokens: 121703
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████| 128/128 [01:41<00:00,  1.26it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     128       
Benchmark duration (s):                  101.86    
Total input tokens:                      121127    
Total input text tokens:                 121127    
Total input vision tokens:               0         
Total generated tokens:                  121703    
Total generated tokens (retokenized):    120204    
Request throughput (req/s):              1.26      
Input token throughput (tok/s):          1189.17   
Output token throughput (tok/s):         1194.83   
Total token throughput (tok/s):          2384.00   
Concurrency:                             68.74     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   54700.18  
Median E2E Latency (ms):                 58085.45  
---------------Time to First Token----------------
Mean TTFT (ms):                          54700.27  
Median TTFT (ms):                        58085.52  
P99 TTFT (ms):                           101474.87 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00    
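Putting the two runs side by side (all numbers taken from the benchmark tables above):

```python
# Speedup computed from the two benchmark runs above (same 128-prompt load).
before_tput, after_tput = 1345.08, 2384.00   # total token throughput (tok/s)
before_e2e, after_e2e = 100144.97, 54700.18  # mean E2E latency (ms)

tput_speedup = after_tput / before_tput
latency_reduction = 1 - after_e2e / before_e2e

print(f"throughput: {tput_speedup:.2f}x")            # → throughput: 1.77x
print(f"mean E2E latency cut: {latency_reduction:.0%}")  # → mean E2E latency cut: 45%
```

So the async refactor delivers roughly a 1.77× throughput gain and a 45% mean end-to-end latency reduction on this workload.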

@ShangmingCai
Collaborator

@ShangmingCai @ByronHsu @zhyncs Could you review this? We found that this PR significantly improves SGLang's PP performance. Thanks!

@nvpohanh We are close to finalizing the design and will merge this into main ASAP. Thanks for the testing and performance verification.

XucSh and others added 7 commits November 21, 2025 15:02
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: bluecoffee8 <jasperli2002@gmail.com>

Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>

Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
mbs[next_mb_id], mb_metadata[next_mb_id], next_pp_outputs
)
d2h_event = torch.cuda.Event()
d2h_event.record(torch.cuda.current_stream())


Why the d2h event and copy stream? There is no copy op in _pp_prep_batch_result, is there?

@MichoChan

would hang when return_logprob=True

@XucSh
Collaborator Author

XucSh commented Dec 1, 2025

would hang when return_logprob=True

Thanks for your feedback; I will dig into it. Could you provide your test command?

@MichoChan

would hang when return_logprob=True

thanks for your feedback. will dig into it. Could you provide your test command?
It only hangs with tp=8, pp=2, nodes=2; with pp=4, tp=8, nodes=4 it is OK.

@weireweire
Contributor

Could we do another rebase so I can run this on torch 2.9 / CUDA 13?

@ShangmingCai
Collaborator

ShangmingCai commented Dec 10, 2025

/rerun-failed-ci 3

@ShangmingCai
Collaborator

/tag-and-rerun-ci

@ShangmingCai
Collaborator

ShangmingCai commented Dec 11, 2025

/rerun-failed-ci 2

Collaborator

@ShangmingCai ShangmingCai left a comment


We think this PR is ready for public testing now. Please ping me in the comments of #11857 (or in the Slack channel, or DM me on Slack) if you find any bugs or compatibility issues with this PR. We will follow up with PRs to fix them ASAP.

The failed CI checks are unrelated to this PR.

@ShangmingCai ShangmingCai merged commit c01b2ee into sgl-project:main Dec 12, 2025
282 of 352 checks passed
@ShangmingCai
Collaborator

Update: @alpha-baby and @liusy58 also put a lot of effort into experimenting with and testing this PR. Even though no related commits of theirs are included, they contributed a great deal as well. Sorry for forgetting to add them as co-authors. My bad.

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: bluecoffee8 <jasperli2002@gmail.com>
Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: bluecoffee8 <jasperli2002@gmail.com>
Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>