
[PD] Support PD with context parallel after refactor#19504

Merged
ShangmingCai merged 7 commits into main from support_pd_cp on Feb 28, 2026
Conversation

@ShangmingCai
Collaborator

@ShangmingCai ShangmingCai commented Feb 27, 2026

Motivation

Support CP in the PD module after #17213

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

ShangmingCai and others added 2 commits February 27, 2026 17:24
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for context parallelism (CP) within the KV cache disaggregation framework. It extends the system's ability to manage and transfer KV caches across distributed environments by incorporating CP rank information into server registration, client communication, and data transfer coordination. This enhancement allows for more flexible and robust distributed inference setups, particularly when dealing with varying parallelism strategies.

Highlights

  • Context Parallelism (CP) Integration: Introduced comprehensive support for context parallelism (CP) across the KV cache disaggregation system, including tracking CP ranks and sizes in connection managers and server information.
  • Bootstrap Server and Client Updates: Modified the bootstrap server and client communication protocols to incorporate CP rank information for more precise routing and registration of prefill servers.
  • Dummy CP Rank Handling: Implemented logic to identify and manage 'dummy' CP ranks, allowing them to participate in the control plane while optimizing data transfer by skipping intermediate KV chunk transfers.
  • Refined KV Cache Transfer Polling: Updated the KV cache transfer polling mechanism to perform all-reduce operations sequentially across both the attention tensor parallelism (TP) and context parallelism (CP) groups, ensuring a consistent state across distributed ranks.
  • Code Clarity and Consistency: Renamed the is_last flag to is_last_chunk in KV chunk transfer structures and logic for improved readability and accuracy.
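
The dummy-CP-rank behavior in the highlights above can be summarized as: a dummy rank stays in the control plane but skips intermediate KV chunk transfers. A minimal sketch of that decision logic (names and signature are illustrative only; the real method is `send` in `mooncake/conn.py` and differs):

```python
from dataclasses import dataclass

@dataclass
class TransferKVChunk:
    room: int
    is_last_chunk: bool  # renamed from `is_last` in this PR

def should_transfer(chunk: TransferKVChunk, is_dummy_cp_rank: bool) -> bool:
    """Hypothetical helper: dummy CP ranks skip intermediate chunks
    but still handle the final chunk so the control plane stays in sync."""
    if is_dummy_cp_rank and not chunk.is_last_chunk:
        return False  # skip intermediate KV chunk transfers on dummy ranks
    return True  # normal ranks, and final chunks on any rank, are sent
```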


Changelog
  • python/sglang/srt/disaggregation/common/conn.py
    • Added get_attention_cp_rank and get_attention_cp_size imports.
    • Updated PrefillServerInfo dataclass to include attn_cp_size.
    • Introduced PrefillRankInfo dataclass for rank IP and port.
    • Modified CommonKVManager to store attn_cp_size and attn_cp_rank.
    • Implemented is_dummy_cp_rank logic to conditionally register to the bootstrap server.
    • Updated _fetch_prefill_server_info and register_to_bootstrap to include prefill_cp_rank in URLs and payloads.
    • Adjusted required_prefill_response_num and target_cp_ranks calculation based on CP sizes.
    • Modified _setup_bootstrap_infos and _get_bootstrap_info_from_server to incorporate prefill_cp_rank in connection keys and URLs.
    • Updated BootstrapServer to manage attn_cp_size and attn_cp_rank, and restructured prefill_port_table to include CP rank.
    • Modified _handle_route_get to query using prefill_cp_rank and return PrefillRankInfo as a dictionary.
  • python/sglang/srt/disaggregation/mooncake/conn.py
    • Renamed is_last field to is_last_chunk in TransferKVChunk dataclass.
    • Updated references to is_last to is_last_chunk in transfer_worker and add_transfer_request methods.
    • Added conditional logic in the send method to handle is_dummy_cp_rank, allowing dummy ranks to skip intermediate KV chunk transfers.
  • python/sglang/srt/disaggregation/prefill.py
    • Replaced poll_and_all_reduce with poll_and_all_reduce_attn_cp_tp_group in pop_bootstrapped, process_disagg_prefill_inflight_queue, and get_transferred_rids.
    • Passed self.scheduler.attn_cp_cpu_group as an argument to the new polling function.
  • python/sglang/srt/disaggregation/utils.py
    • Added a new function poll_and_all_reduce_attn_cp_tp_group to perform sequential all-reduce operations across both attention TP and CP groups.
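
To illustrate the semantics of the new `poll_and_all_reduce_attn_cp_tp_group` helper, here is a pure-Python sketch of the two sequential reductions (the real function uses `torch.distributed` all-reduces on CPU groups; a MIN reduction over the status codes is assumed here, and the enum names are hypothetical):

```python
from enum import IntEnum

class KVPoll(IntEnum):
    # Hypothetical status ordering: lower value = less advanced,
    # so a MIN reduction yields the least-advanced state across ranks.
    Failed = 0
    Bootstrapping = 1
    Transferring = 2
    Success = 3

def poll_and_all_reduce_attn_cp_tp_group(status_grid):
    """status_grid[cp][tp] holds each rank's local poll result.
    Models two sequential MIN all-reduces: first across the attention
    TP group (each row), then across the CP group (the row minima),
    so every rank acts on the same, least-advanced status."""
    tp_reduced = [min(row) for row in status_grid]  # reduce over attn TP group
    return min(tp_reduced)                          # reduce over CP group
```

A request only counts as transferred once every rank in both groups reports `Success`, which is exactly what the nested MIN reduction guarantees.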

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai ShangmingCai changed the title Support pd cp [PD] Support PD with context parallel after refactor Feb 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for context parallelism (CP) in the prefill/decode disaggregation feature. The changes span several files to handle CP group initialization, communication, and status synchronization. Key changes include updating data structures like PrefillServerInfo to include CP information, modifying the bootstrap process to handle CP ranks, and implementing hierarchical status polling across TP and CP groups. A new PrefillRankInfo dataclass is introduced for better type safety. Logic for dummy CP ranks in MLA backends is also added. Overall, the changes are comprehensive for adding CP support. I have found one critical issue in _setup_bootstrap_infos that could lead to a deadlock, which I've detailed in a specific comment.

Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-4-gpu-gb200

@github-actions
Contributor

🔗 View workflow run

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-gb200 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-4-gpu-gb200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-gb200 to run independently (skipping dependencies).

@ShangmingCai
Collaborator Author

/tag-and-rerun-ci

@github-actions
Contributor

🔗 View workflow run

@vladnosiv
Contributor

LGTM, I'll stress test it next week, thanks!

Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

ShangmingCai commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.
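For illustration, a minimal sketch of the two mapping strategies discussed above (all names are hypothetical and not from this PR; the many-to-one case assumes prefill CP size is a multiple of decode CP size):

```python
def target_prefill_cp_ranks(decode_cp_rank: int,
                            prefill_cp_size: int,
                            decode_cp_size: int,
                            one_to_one: bool) -> list[int]:
    """Hypothetical helper: which prefill CP ranks a decode CP rank
    pulls KV cache from.
    one_to_one: each decode rank pulls from exactly one prefill rank
    (requires equal CP sizes); otherwise each decode rank pulls from
    the contiguous group of prefill ranks covering its share."""
    if one_to_one:
        assert prefill_cp_size == decode_cp_size
        return [decode_cp_rank]
    assert prefill_cp_size % decode_cp_size == 0
    ratio = prefill_cp_size // decode_cp_size
    return list(range(decode_cp_rank * ratio, (decode_cp_rank + 1) * ratio))
```

With prefill CP = 4 and decode CP = 2, decode rank 0 maps to prefill ranks [0, 1] and decode rank 1 to [2, 3] under the many-to-one scheme.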

@whybeyoung
Collaborator

Nice of you

@vladnosiv
Contributor

vladnosiv commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.

The idea was that any decode CP rank could take the KV cache from any one prefill CP rank (because the KV cache is identical on every prefill CP rank), and using the remaining ranks could be a load-balancing optimization rather than a requirement; this approach ensured correctness for the case with prefill CP > 1 and decode CP = 1.

Perhaps, after the inclusion of the CP rank in the bootstrap info, this is no longer required, because previously the idea was to register one non-dummy rank per TP rank and ensure correctness with minimal changes.

This reverts commit d6596fe.
@ShangmingCai
Collaborator Author

ShangmingCai commented Feb 27, 2026

On second thought, the dummy CP rank handling is not fully correct; it will break the case where prefill CP size == decode CP size.

I will add a flag or an env var to control whether to map multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.

The idea was that any decode CP rank could take the KV cache from any one prefill CP rank (because the KV cache is identical on every prefill CP rank), and using the remaining ranks could be a load-balancing optimization rather than a requirement; this approach ensured correctness for the case with prefill CP > 1 and decode CP = 1.

Perhaps, after the inclusion of the CP rank in the bootstrap info, this is no longer required, because previously the idea was to register one non-dummy rank per TP rank and ensure correctness with minimal changes.

Yeah, maybe we can do it in the next pr.

@ShangmingCai
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@ShangmingCai
Collaborator Author

/rerun-failed-ci

@llc-kc
Contributor

llc-kc commented Feb 28, 2026

@ShangmingCai When using CP+PD, should both prefill and decode enable CP? I see the code checks that the P/D CP sizes are equal.

@ShangmingCai
Collaborator Author

@ShangmingCai When using CP+PD, should both prefill and decode enable CP? I see the code checks that the P/D CP sizes are equal.

@llc-kc Not necessarily; we now support prefill CP + decode without CP. The rank mapping in this PR is not used in any case yet; I pre-implemented it to prepare the KV transfer module for future usage.

@ShangmingCai
Collaborator Author

CI has passed.

Since this PR won't break any current usage, we can merge it first.

I am also collaborating with @whybeyoung on some fixes for NSA, and we have verified that, with those changes and this PR, we can fix DPSK V3.2 and make GLM 5 runnable (PP2 CP8 TP8 x PD). We will open another PR for those changes later.

@ShangmingCai ShangmingCai merged commit b01f359 into main Feb 28, 2026
311 of 337 checks passed
@ShangmingCai ShangmingCai deleted the support_pd_cp branch February 28, 2026 05:11
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>