
[PD] Support PD with context parallel after refactor#19504

Merged
ShangmingCai merged 7 commits into main from support_pd_cp on Feb 28, 2026
Conversation

@ShangmingCai
Collaborator

@ShangmingCai ShangmingCai commented Feb 27, 2026

Motivation

Support CP in the PD module after #17213

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

ShangmingCai and others added 2 commits February 27, 2026 17:24
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for context parallelism (CP) within the KV cache disaggregation framework. It extends the system's ability to manage and transfer KV caches across distributed environments by incorporating CP rank information into server registration, client communication, and data transfer coordination. This enhancement allows for more flexible and robust distributed inference setups, particularly when dealing with varying parallelism strategies.

Highlights

  • Context Parallelism (CP) Integration: Introduced comprehensive support for context parallelism (CP) across the KV cache disaggregation system, including tracking CP ranks and sizes in connection managers and server information.
  • Bootstrap Server and Client Updates: Modified the bootstrap server and client communication protocols to incorporate CP rank information for more precise routing and registration of prefill servers.
  • Dummy CP Rank Handling: Implemented logic to identify and manage 'dummy' CP ranks, allowing them to participate in the control plane while optimizing data transfer by skipping intermediate KV chunk transfers.
  • Refined KV Cache Transfer Polling: Updated the KV cache transfer polling mechanism to perform all-reduce operations sequentially across both the attention tensor parallelism (TP) and context parallelism (CP) groups, ensuring a consistent state across distributed ranks.
  • Code Clarity and Consistency: Renamed the is_last flag to is_last_chunk in KV chunk transfer structures and logic for improved readability and accuracy.
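
The dummy-CP-rank behavior in the highlights above can be summarized as: a dummy rank stays in the control plane but skips intermediate KV chunk transfers. A minimal sketch of that decision logic (names and signature are illustrative only; the real method is `send` in `mooncake/conn.py` and differs):

```python
from dataclasses import dataclass

@dataclass
class TransferKVChunk:
    room: int
    is_last_chunk: bool  # renamed from `is_last` in this PR

def should_transfer(chunk: TransferKVChunk, is_dummy_cp_rank: bool) -> bool:
    """Hypothetical helper: dummy CP ranks skip intermediate chunks
    but still handle the final chunk so the control plane stays in sync."""
    if is_dummy_cp_rank and not chunk.is_last_chunk:
        return False  # skip intermediate KV chunk transfers on dummy ranks
    return True  # normal ranks, and final chunks on any rank, are sent
```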


Changelog
  • python/sglang/srt/disaggregation/common/conn.py
    • Added get_attention_cp_rank and get_attention_cp_size imports.
    • Updated PrefillServerInfo dataclass to include attn_cp_size.
    • Introduced PrefillRankInfo dataclass for rank IP and port.
    • Modified CommonKVManager to store attn_cp_size and attn_cp_rank.
    • Implemented is_dummy_cp_rank logic to conditionally register to the bootstrap server.
    • Updated _fetch_prefill_server_info and register_to_bootstrap to include prefill_cp_rank in URLs and payloads.
    • Adjusted required_prefill_response_num and target_cp_ranks calculation based on CP sizes.
    • Modified _setup_bootstrap_infos and _get_bootstrap_info_from_server to incorporate prefill_cp_rank in connection keys and URLs.
    • Updated BootstrapServer to manage attn_cp_size and attn_cp_rank, and restructured prefill_port_table to include CP rank.
    • Modified _handle_route_get to query using prefill_cp_rank and return PrefillRankInfo as a dictionary.
  • python/sglang/srt/disaggregation/mooncake/conn.py
    • Renamed is_last field to is_last_chunk in TransferKVChunk dataclass.
    • Updated references to is_last to is_last_chunk in transfer_worker and add_transfer_request methods.
    • Added conditional logic in the send method to handle is_dummy_cp_rank, allowing dummy ranks to skip intermediate KV chunk transfers.
  • python/sglang/srt/disaggregation/prefill.py
    • Replaced poll_and_all_reduce with poll_and_all_reduce_attn_cp_tp_group in pop_bootstrapped, process_disagg_prefill_inflight_queue, and get_transferred_rids.
    • Passed self.scheduler.attn_cp_cpu_group as an argument to the new polling function.
  • python/sglang/srt/disaggregation/utils.py
    • Added a new function poll_and_all_reduce_attn_cp_tp_group to perform sequential all-reduce operations across both attention TP and CP groups.
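
To illustrate the semantics of the new `poll_and_all_reduce_attn_cp_tp_group` helper, here is a pure-Python sketch of the two sequential reductions (the real function uses `torch.distributed` all-reduces on CPU groups; a MIN reduction over the status codes is assumed here, and the enum names are hypothetical):

```python
from enum import IntEnum

class KVPoll(IntEnum):
    # Hypothetical status ordering: lower value = less advanced,
    # so a MIN reduction yields the least-advanced state across ranks.
    Failed = 0
    Bootstrapping = 1
    Transferring = 2
    Success = 3

def poll_and_all_reduce_attn_cp_tp_group(status_grid):
    """status_grid[cp][tp] holds each rank's local poll result.
    Models two sequential MIN all-reduces: first across the attention
    TP group (each row), then across the CP group (the row minima),
    so every rank acts on the same, least-advanced status."""
    tp_reduced = [min(row) for row in status_grid]  # reduce over attn TP group
    return min(tp_reduced)                          # reduce over CP group
```

A request only counts as transferred once every rank in both groups reports `Success`, which is exactly what the nested MIN reduction guarantees.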

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai ShangmingCai changed the title Support pd cp [PD] Support PD with context parallel after refactor Feb 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for context parallelism (CP) in the prefill/decode disaggregation feature. The changes span several files to handle CP group initialization, communication, and status synchronization. Key changes include updating data structures like PrefillServerInfo to include CP information, modifying the bootstrap process to handle CP ranks, and implementing hierarchical status polling across TP and CP groups. A new PrefillRankInfo dataclass is introduced for better type safety. Logic for dummy CP ranks in MLA backends is also added. Overall, the changes are comprehensive for adding CP support. I have found one critical issue in _setup_bootstrap_infos that could lead to a deadlock, which I've detailed in a specific comment.

Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-4-gpu-gb200

@github-actions
Contributor

🔗 View workflow run

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-gb200 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-4-gpu-gb200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-gb200 to run independently (skipping dependencies).

@ShangmingCai
Collaborator Author

/tag-and-rerun-ci

@github-actions
Contributor

🔗 View workflow run

@vladnosiv
Contributor

LGTM, I'll stress test it next week, thanks!

Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

ShangmingCai commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.
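For illustration, a minimal sketch of the two mapping strategies discussed above (all names are hypothetical and not from this PR; the many-to-one case assumes prefill CP size is a multiple of decode CP size):

```python
def target_prefill_cp_ranks(decode_cp_rank: int,
                            prefill_cp_size: int,
                            decode_cp_size: int,
                            one_to_one: bool) -> list[int]:
    """Hypothetical helper: which prefill CP ranks a decode CP rank
    pulls KV cache from.
    one_to_one: each decode rank pulls from exactly one prefill rank
    (requires equal CP sizes); otherwise each decode rank pulls from
    the contiguous group of prefill ranks covering its share."""
    if one_to_one:
        assert prefill_cp_size == decode_cp_size
        return [decode_cp_rank]
    assert prefill_cp_size % decode_cp_size == 0
    ratio = prefill_cp_size // decode_cp_size
    return list(range(decode_cp_rank * ratio, (decode_cp_rank + 1) * ratio))
```

With prefill CP = 4 and decode CP = 2, decode rank 0 maps to prefill ranks [0, 1] and decode rank 1 to [2, 3] under the many-to-one scheme.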

@whybeyoung
Collaborator

Nice of you

@vladnosiv
Contributor

vladnosiv commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.

The idea was that any decode CP rank could take the KV cache from any one prefill CP rank (because the KV cache is identical on every prefill CP rank), and using the remaining ranks could be a load-balancing optimization rather than a requirement; this approach ensured correctness for the case with prefill CP > 1 and decode CP = 1.

Perhaps, after the inclusion of the CP rank in the bootstrap info, this is no longer required, because previously the idea was to register one non-dummy rank per TP rank and ensure correctness with minimal changes.

This reverts commit d6596fe.
@ShangmingCai
Collaborator Author

ShangmingCai commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.

The idea was that any decode CP rank could take the KV cache from any one prefill CP rank (because the KV cache is identical on every prefill CP rank), and using the remaining ranks could be a load-balancing optimization rather than a requirement; this approach ensured correctness for the case with prefill CP > 1 and decode CP = 1.

Perhaps, after the inclusion of the CP rank in the bootstrap info, this is no longer required, because previously the idea was to register one non-dummy rank per TP rank and ensure correctness with minimal changes.

Yeah, maybe we can do it in the next pr.

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

/rerun-failed-ci

@llc-kc
Contributor

llc-kc commented Feb 28, 2026

@ShangmingCai When using CP+PD, should both prefill and decode enable CP? I see the code checks that the P/D CP sizes are equal.

@ShangmingCai
Collaborator Author

@ShangmingCai When using CP+PD, should both prefill and decode enable CP? I see the code checks that the P/D CP sizes are equal.

@llc-kc Not necessarily; we now support prefill CP + decode without CP. The rank mapping in this PR is not used in any case yet; I pre-implemented it to prepare the KV transfer module for future usage.

@ShangmingCai
Collaborator Author

CI has passed.

Since this PR won't break any current usage, we can merge it first.

I am also collaborating with @whybeyoung on some fixes for NSA, and we have verified that, with those changes and this PR, we can fix DPSK V3.2 and make GLM 5 runnable (PP2 CP8 TP8 x PD). We will open another PR for those changes later.

@ShangmingCai ShangmingCai merged commit b01f359 into main Feb 28, 2026
311 of 337 checks passed
@ShangmingCai ShangmingCai deleted the support_pd_cp branch February 28, 2026 05:11
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>