[PD] Support PD with context parallel after refactor #19504
ShangmingCai merged 7 commits into main
Conversation
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Signed-off-by: Shangming Cai <csmthu@gmail.com>
Signed-off-by: Shangming Cai <csmthu@gmail.com>
/rerun-stage stage-c-test-8-gpu-h20
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive support for context parallelism (CP) within the KV cache disaggregation framework. It extends the system's ability to manage and transfer KV caches across distributed environments by incorporating CP rank information into server registration, client communication, and data transfer coordination. This enhancement allows for more flexible and robust distributed inference setups, particularly when dealing with varying parallelism strategies.
Activity
✅ Triggered
Code Review
This pull request adds support for context parallelism (CP) in the prefill/decode disaggregation feature. The changes span across several files to handle CP group initialization, communication, and status synchronization. Key changes include updating data structures like PrefillServerInfo to include CP information, modifying the bootstrap process to handle CP ranks, and implementing hierarchical status polling across TP and CP groups. A new PrefillRankInfo dataclass is introduced for better type safety. The logic for dummy CP ranks in MLA backends is also added. Overall, the changes are comprehensive for adding CP support. I have found one critical issue in _setup_bootstrap_infos that could lead to a deadlock, which I've detailed in a specific comment.
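The `PrefillRankInfo` dataclass mentioned in the review could look roughly like the following. This is a hypothetical sketch, not the PR's actual code: the field and property names are illustrative, and the dummy-rank rule is inferred from the MLA discussion below (KV cache replicated across prefill CP ranks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillRankInfo:
    """Typed record of a prefill worker's parallel ranks, carried in
    bootstrap registration (illustrative sketch, names are assumptions)."""

    tp_rank: int  # tensor-parallel rank of this prefill worker
    tp_size: int  # tensor-parallel world size
    cp_rank: int  # context-parallel rank (the new field in this PR)
    cp_size: int  # context-parallel world size

    @property
    def is_dummy_cp_rank(self) -> bool:
        # For MLA backends the KV cache is replicated across CP ranks,
        # so only cp_rank == 0 needs to register as a "real" KV source.
        return self.cp_rank != 0


info = PrefillRankInfo(tp_rank=0, tp_size=8, cp_rank=1, cp_size=2)
print(info.is_dummy_cp_rank)  # -> True
```

A frozen dataclass like this gives the "better type safety" the review mentions: rank fields can no longer be mixed up positionally in an untyped tuple or dict.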
Signed-off-by: Shangming Cai <csmthu@gmail.com>
/rerun-stage stage-c-test-8-gpu-h20
✅ Triggered
/rerun-stage stage-c-test-4-gpu-gb200
✅ Triggered
/rerun-stage stage-c-test-4-gpu-gb200
✅ Triggered
/tag-and-rerun-ci
LGTM, I'll stress test it next week, thanks!
On second thought, the dummy CP rank is not fully correct: it will break the case where prefill cp size == decode cp size. I will add a flag or an env var to control whether to use multiple prefill CP ranks -> 1 decode CP rank, or 1 prefill CP rank -> 1 decode CP rank for MLA.
Nice of you |
The idea was that any CP rank of the decode could take the KV cache from any one CP rank of the prefill (because the KV cache is identical on every prefill CP rank), and using the remaining ranks could be an optimization for load balancing, not a requirement. This approach ensured correctness for the case with prefill CP > 1 and decode CP = 1. Perhaps, after the inclusion of the cp-rank in the bootstrap info, this is no longer required, because previously the idea was to register one non-dummy rank per TP rank and ensure correctness with minimal changes.
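The load-balancing idea described above can be sketched as follows. This is an illustrative sketch with hypothetical names, not the PR's implementation: since the KV cache is replicated across prefill CP ranks under MLA, any valid prefill rank is correct, and a modulo mapping merely spreads transfer load across them.

```python
def pick_prefill_cp_rank(decode_cp_rank: int, prefill_cp_size: int) -> int:
    """Choose which prefill CP rank a decode CP rank pulls KV cache from.

    Correctness only requires *some* valid prefill rank (the caches are
    replicated); the modulo is purely a load-balancing optimization.
    """
    return decode_cp_rank % prefill_cp_size


# Prefill CP = 4, Decode CP = 1: the single decode rank reads from rank 0,
# which is the correctness-critical case mentioned above.
print(pick_prefill_cp_rank(0, 4))  # -> 0

# Prefill CP = 4, Decode CP = 2: decode ranks spread across prefill ranks.
print([pick_prefill_cp_rank(r, 4) for r in range(2)])  # -> [0, 1]
```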
This reverts commit d6596fe.
Yeah, maybe we can do it in the next pr. |
/rerun-stage stage-c-test-8-gpu-h20
✅ Triggered
/rerun-failed-ci
@ShangmingCai When using CP+PD, should both prefill and decode enable CP? I see the code checks that the P/D cp sizes are equal.
@llc-kc Not necessarily; we support prefill CP + decode without CP now. The rank mapping in this PR is temporarily not used for any case; I just pre-implemented it to prepare the KV transfer module for future usage. |
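The supported combinations discussed in this exchange could be expressed as a small validation helper. This is a hedged sketch inferred from the thread (equal CP sizes are checked in the code; prefill CP with decode CP = 1 is also supported), not an actual function in the PR.

```python
def is_supported_cp_combo(prefill_cp_size: int, decode_cp_size: int) -> bool:
    """Return True if this prefill/decode CP combination is supported.

    Rule inferred from the discussion: either both sides use the same CP
    size, or the decode side runs without CP (decode_cp_size == 1).
    """
    return prefill_cp_size == decode_cp_size or decode_cp_size == 1


print(is_supported_cp_combo(4, 4))  # -> True  (equal CP sizes)
print(is_supported_cp_combo(4, 1))  # -> True  (prefill CP + decode no CP)
print(is_supported_cp_combo(2, 4))  # -> False (mismatched, decode CP > 1)
```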
Since this PR won't break any current usage, we can merge it first. I am also collaborating with @whybeyoung on some fixes for NSA, and we have verified that with those changes and this PR, we can fix DPSK V3.2 and make GLM 5 runnable (PP2 CP8 TP8 x PD). We will open another PR for those changes later.
Signed-off-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

Motivation
Support CP in the PD module after #17213
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci