[PP] Refactor PP to async mode #11852
Conversation
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Summary of Changes
Hello @XucSh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the pipeline parallelism (PP) implementation by introducing asynchronous processing. The primary goal is to enhance efficiency by allowing GPU computation and CPU processing to overlap, particularly for the last rank in the pipeline, thereby mitigating potential performance bottlenecks. The changes involve a substantial refactoring of the PP scheduling logic into a new mixin and the addition of a configurable parameter to fine-tune the asynchronous batching behavior.
Highlights
Code Review
This pull request refactors the pipeline parallelism (PP) logic into a new SchedulerPPMixin and adds support for asynchronous operations to improve performance by overlapping communication and computation. The changes are quite extensive, introducing a new pp_async_batch_depth server argument and modifying the point_to_point_pyobj utility for async sends.
While the overall direction is good, I've found a few critical issues in the new event_loop_pp implementation within scheduler_pp_mixin.py where essential logic for receiving data from previous pipeline stages appears to be commented out, which would break the pipeline. I've also pointed out some dead code that should be cleaned up for better maintainability. Please see the detailed comments for suggestions on how to fix these issues.
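The core idea the review describes, overlapping GPU computation with CPU-side result processing while bounding how many microbatches are in flight, can be illustrated with a small sketch. This is purely illustrative: the name `pp_async_batch_depth` mirrors the new server argument, but the loop body, the `* 2` / `+ 1` stand-in operations, and the thread pool are hypothetical and do not reflect the actual `event_loop_pp` implementation.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def event_loop_pp_sketch(microbatches, pp_async_batch_depth=1):
    """Illustrative sketch: overlap 'GPU' compute with 'CPU' post-processing.

    At most `pp_async_batch_depth` result-preparation tasks may be
    outstanding before the loop blocks and drains the oldest one.
    """
    results = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=2) as cpu_pool:
        for mb in microbatches:
            gpu_out = mb * 2  # stand-in for the forward pass on the GPU
            # Hand CPU-side result preparation to a worker thread so the
            # main loop can immediately launch the next microbatch.
            in_flight.append(cpu_pool.submit(lambda x: x + 1, gpu_out))
            # Bound the number of outstanding async batches.
            while len(in_flight) > pp_async_batch_depth:
                results.append(in_flight.popleft().result())
        # Drain whatever is still in flight, preserving submission order.
        while in_flight:
            results.append(in_flight.popleft().result())
    return results


print(event_loop_pp_sketch([1, 2, 3], pp_async_batch_depth=1))  # → [3, 5, 7]
```

A depth of 1 (the default suggested in the PR description below) lets exactly one microbatch's post-processing overlap with the next forward pass; larger values trade more overlap for more buffered results.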
Here are the benchmark results on an A800 80G*8 node.
After:
@nvpohanh We are pretty close to finishing and determining the final design, and will merge this into main ASAP. Thanks for the testing and performance verification.
Co-authored-by: bluecoffee8 <jasperli2002@gmail.com> Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Signed-off-by: Shangming Cai <csmthu@gmail.com>
```python
    mbs[next_mb_id], mb_metadata[next_mb_id], next_pp_outputs
)
d2h_event = torch.cuda.Event()
d2h_event.record(torch.cuda.current_stream())
```
Why the d2h event and copy stream? There is no copy op in `_pp_prep_batch_result`.
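For context on the question above: recording a CUDA event on the current stream lets the CPU later block only until that point in the stream, instead of waiting for everything queued after it. The following threading analogy is purely illustrative and has no relation to the actual sglang code; `copy_done` plays the role of `d2h_event`.

```python
import threading


def stream_worker(copy_done, log):
    """Simulated GPU stream: work items execute in queue order."""
    log.append("kernel")        # simulated GPU kernel
    log.append("d2h_copy")      # simulated device-to-host copy
    copy_done.set()             # analogous to d2h_event.record(...)
    log.append("later_kernel")  # more work queued on the same stream


copy_done = threading.Event()
log = []
worker = threading.Thread(target=stream_worker, args=(copy_done, log))
worker.start()

copy_done.wait()  # analogous to synchronizing on d2h_event only
# CPU-side batch-result preparation could start here, overlapping with
# "later_kernel" that may still be running on the stream.
worker.join()
print(log)  # → ['kernel', 'd2h_copy', 'later_kernel']
```

The point of the pattern is that the waiter is released as soon as the copy completes, which is why an event (rather than a full stream synchronize) is the usual tool for this kind of overlap.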
This would hang when `return_logprob=True`.
Thanks for your feedback, I will dig into it. Could you provide your test command?
Could we do another rebase so I can run this on torch2.9/cuda13?
Signed-off-by: Shangming Cai <csmthu@gmail.com>
/rerun-failed-ci 3

/tag-and-rerun-ci

/rerun-failed-ci 2
ShangmingCai left a comment
We think this PR is ready for public testing now. Please ping me in the comments of #11857 (or in the Slack channel, or DM me on Slack) if you find any bugs or compatibility issues with this PR. We will come up with follow-up PRs to fix them ASAP.
Update: @alpha-baby and @liusy58 also put a lot of effort into experimenting with and testing this PR. Even though no related commits are included, they contributed a lot to it as well. Sorry for forgetting to manually add them as co-authors. My bad.
Signed-off-by: Shangming Cai <csmthu@gmail.com> Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: bluecoffee8 <jasperli2002@gmail.com> Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>



Motivation
See #11857.
Users can test with the command below now:

```shell
python3 -m sglang.launch_server --model /opt/models/Qwen/Qwen3-8b --pp-size 4 --tp 2 --pp-async-batch-depth 1
```
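As a quick sanity check on those flags (the helper below is hypothetical, not part of sglang): each pipeline stage holds a tensor-parallel shard, so the pipeline-parallel and tensor-parallel sizes multiply to the required GPU count, and `--pp-size 4 --tp 2` matches the 8-GPU A800 node used for the benchmarks above.

```python
def required_gpus(pp_size: int, tp_size: int) -> int:
    # Each of the pp_size pipeline stages holds a tp_size-way
    # tensor-parallel shard of the model, one shard per GPU.
    return pp_size * tp_size


print(required_gpus(4, 2))  # → 8, i.e. the full A800 80G*8 node
```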
Co-authors: @ShangmingCai @merrymercy @alpha-baby
Cc: @ShangmingCai @merrymercy @whybeyoung @bluecoffee8
Checklist