[Feature][Attention][PCP] Support PCP (Prefill Context Parallel) with MLA #28988
FENP wants to merge 11 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for Prefill Context Parallelism (PCP) to the MLA (Multi-head Latent Attention) backend. The changes are extensive: they refactor the context-parallelism logic to be more generic, add new metadata and utility functions for PCP, and implement the PCP attention logic based on the Dual-Chunk-Swap strategy.
My review has identified a critical issue in the attention correction logic where DCP and PCP corrections are applied in the wrong order, which will lead to incorrect results. I have also pointed out a significant performance issue related to nested communication calls that should be optimized. Overall, the PR is a good step towards enabling PCP, but these critical issues need to be addressed.
cur_allgather_kvcache.copy_(
    get_pcp_group().all_gather(
        get_dcp_group().all_gather(local_gathered_kvcache, dim=0),
        dim=0,
    )
)
The nested all_gather calls, first over the DCP group and then over the PCP group, are inefficient as they introduce extra communication overhead and synchronization points. This should be optimized into a single all_gather operation.
To achieve this, a new communication group that combines the ranks from both DCP and PCP should be created during initialization. Then, a single all_gather can be performed over this combined "context parallel" (CP) group. This will be more performant. The TODO comment already acknowledges this, and this comment serves to emphasize its importance for performance.
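The equivalence behind this suggestion is worth making concrete. Assuming the combined CP group's ranks are ordered PCP-major over the DCP shards (an assumption about how such a group would be built, not something this PR defines), a pure-Python sketch — no `torch.distributed`, shards modeled as plain lists — shows that one gather over the combined group reproduces the nested result:

```python
# Sketch (not vLLM code): model each rank's KV shard as a list and show that
# a single all_gather over a combined PCP x DCP group matches nested gathers,
# provided the combined group's rank order is PCP-major, DCP-minor.

def all_gather(shards):
    # Concatenate shards in rank order, mimicking all_gather(dim=0).
    out = []
    for s in shards:
        out.extend(s)
    return out

pcp_size, dcp_size = 2, 2
# shards[p][d] is the KV shard held by PCP rank p, DCP rank d.
shards = [[[f"kv_p{p}d{d}"] for d in range(dcp_size)] for p in range(pcp_size)]

# Nested: gather across DCP first, then across PCP (as in the diff above).
nested = all_gather([all_gather(shards[p]) for p in range(pcp_size)])

# Single gather over a combined CP group with PCP-major rank ordering.
combined = all_gather([shards[p][d] for p in range(pcp_size) for d in range(dcp_size)])

assert nested == combined
print(nested)  # ['kv_p0d0', 'kv_p0d1', 'kv_p1d0', 'kv_p1d1']
```

Because the two orderings coincide, the single-gather version saves one synchronization point without changing the gathered layout.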
💡 Codex Review
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/common.py#L212-L215
Importing undefined get_pcp_group
Lines 212‑215 import get_pcp_group from vllm.distributed.parallel_state, but that module still only exposes get_dcp_group (the commit merely introduced a _CP variable without any getter). Importing common.py will therefore immediately raise ImportError: cannot import name 'get_pcp_group', so none of the new PCP code paths can even be instantiated.
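For reference, the missing accessor would follow the same pattern as the existing `get_dcp_group`. A minimal sketch of that pattern (the `DummyCoordinator` class here is a stand-in for vLLM's `GroupCoordinator`, and the init function is hypothetical):

```python
# Sketch of the getter pattern parallel_state would need to expose for _PCP;
# DummyCoordinator is a placeholder, not vLLM's GroupCoordinator.
from typing import Optional

class DummyCoordinator:
    def __init__(self, group_name: str) -> None:
        self.group_name = group_name

_PCP: Optional[DummyCoordinator] = None  # set during distributed init

def init_pcp_group() -> None:
    global _PCP
    _PCP = DummyCoordinator("pcp")

def get_pcp_group() -> DummyCoordinator:
    # Mirror get_dcp_group: fail loudly if the group was never initialized.
    assert _PCP is not None, "prefill context parallel group is not initialized"
    return _PCP

init_pcp_group()
print(get_pcp_group().group_name)  # pcp
```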
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/flashattn_mla.py#L79-L86
FlashAttn builder now passes nonexistent kwarg
The call to super().__init__(…, supports_cp_with_varlen=True) in FlashAttnMLAMetadataBuilder.__init__ (lines 79‑86) will raise TypeError: __init__() got an unexpected keyword argument 'supports_cp_with_varlen' because MLACommonMetadataBuilder.__init__ still only accepts supports_dcp_with_varlen. This prevents the FlashAttn MLA backend from constructing at all.
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/common.py#L572-L574
Referencing cp_kv_cache_interleave_size attribute that does not exist
Lines 572‑574 now read self.cp_local_block_size = parallel_config.cp_kv_cache_interleave_size, but ParallelConfig (vllm/config/parallel.py) defines only dcp_kv_cache_interleave_size. As soon as MLACommonMetadataBuilder is constructed this access raises AttributeError: 'ParallelConfig' object has no attribute 'cp_kv_cache_interleave_size', so the MLA backend cannot even initialize.
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/utils.py#L1118-L1124
New utils annotation causes NameError on import
The new helper pcp_kv_allgather_and_restore (lines 1118‑1124) annotates pcp_group: GroupCoordinator, but GroupCoordinator is only imported inside the TYPE_CHECKING block and there is no from __future__ import annotations. When Python evaluates these annotations at import time it looks up GroupCoordinator, fails to find the name, and raises NameError, breaking vllm.v1.attention.backends.utils for every runtime import.
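This failure mode is easy to reproduce without vLLM. An annotation naming something that only exists under `TYPE_CHECKING` raises `NameError` when it is evaluated, while a string annotation (or `from __future__ import annotations` at module top) defers safely:

```python
# Minimal repro: an annotation referencing an undefined name raises NameError
# when evaluated (at definition time on Python <= 3.13, lazily on access from
# 3.14 on), while a string annotation stays a plain string.

try:
    def broken(pcp_group: GroupCoordinator) -> None:  # noqa: F821
        pass
    _ = broken.__annotations__  # forces lazy evaluation on Python 3.14+
    outcome = "ok"
except NameError:
    outcome = "NameError"

# Quoting the annotation keeps it a string, so the definition always succeeds.
def fixed(pcp_group: "GroupCoordinator") -> None:  # noqa: F821
    pass

print(outcome, fixed.__annotations__["pcp_group"])
```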
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Hi @chaunceyjiang , I've resolved the conflicts—feel free to try it out and share your feedback!
Hi @FENP, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
…vice tensor Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
### What this PR does / why we need it?

Since the [PR](vllm-project/vllm#28988) for PCP modifications to `GPUModelRunner` has not yet been merged into vLLM, this PR temporarily requires adjustments to certain buffer sizes. These changes can be reverted once the original [PR](vllm-project/vllm#28988) is merged.

### Does this PR introduce _any_ user-facing change?

No

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@FENP is this ready for another round of review? happy to start reviewing whenever it is ready

@LucasWilkinson It's ready for your review. Thanks in advance!
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
if self.pcp_world_size > 1:
    max_num_scheduled_tokens = int(num_scheduled_tokens_np.max())
    num_tokens_unpadded = scheduler_output.total_num_scheduled_tokens
I find it quite messy that `_prepare_inputs` modifies `num_scheduled_tokens_np` and `scheduler_output` internally. It's not very clear to the reader that that's what's happening, or why this recomputation/re-assignment is required.
I think we should try harder to keep PCP more isolated for now. I'm working on an idea here (vibe-coded and not tested yet): FENP#4, which leaves `_prepare_inputs` untouched and just shuffles the inputs after preparation to select the tokens this PCP rank cares about. It means some duplicated/wasted work, but I think that's better for the initial implementation; we can do broader refactors on the model runner later to support this more naturally with less duplicated/wasted work, potentially by breaking up prepare-inputs. Thoughts? (Sorry, not a complete review yet, will continue tomorrow.)
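The post-preparation shuffle described here can be sketched in isolation. Assuming each request's prefill tokens are split into contiguous per-rank chunks (the actual split strategy in FENP#4 may well differ — this helper and its name are hypothetical):

```python
# Hypothetical sketch, not the FENP#4 code: after input preparation builds the
# full flat token list, each PCP rank keeps only its slice of every request's
# prefill tokens. Requests are split into pcp_world_size contiguous chunks.

def select_pcp_tokens(token_ids, req_lens, pcp_rank, pcp_world_size):
    out = []
    start = 0
    for n in req_lens:
        req_tokens = token_ids[start:start + n]
        # Ceil-divide so trailing ranks may get a shorter (possibly empty) chunk.
        chunk = -(-n // pcp_world_size)
        out.extend(req_tokens[pcp_rank * chunk:(pcp_rank + 1) * chunk])
        start += n
    return out

tokens = list(range(10))  # two requests, lengths 6 and 4
lens = [6, 4]
rank0 = select_pcp_tokens(tokens, lens, 0, 2)
rank1 = select_pcp_tokens(tokens, lens, 1, 2)
print(rank0, rank1)  # [0, 1, 2, 6, 7] [3, 4, 5, 8, 9]
```

Each rank's work after this point is unchanged, which is what keeps `_prepare_inputs` itself untouched at the cost of preparing the full inputs on every rank first.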
Purpose
Ref to issue #25749. Enable PCP for MLA models.
This PR mainly includes the following changes:
- `vllm/v1/worker/cp_utils.py` for the model runner.
- `vllm/v1/worker/gpu_model_runner.py` for the PCP splitting logic for tokens.
- `vllm/v1/attention/backends/mla/common.py` to adapt the MLA backend to PCP.
- `vllm/v1/attention/backends/utils.py` and `vllm/attention/ops/common.py`.

Test Plan
Test Result
Benchmark
In addition to reducing GPU memory redundancy and increasing KV cache capacity, PCP can also reduce the all-reduce communication overhead of `o_proj` and lower TTFT. We evaluated the performance of PCP and TP on DeepSeek-R1 and Kimi-K2 using 4K-length inputs on the H20-3e.

DeepSeek-R1

vllm bench serve --backend vllm --model deepseek-ai/DeepSeek-R1/ --endpoint /v1/completions --dataset-name random --random-input 4096 --random-output 1 --max-concurrency 1 --num-prompt 10 --ignore-eos --metric-percentiles "50,90,99"

Kimi-K2

vllm bench serve --backend vllm --model moonshotai/Kimi-K2-Instruct/ --endpoint /v1/completions --dataset-name random --random-input 4096 --random-output 1 --max-concurrency 1 --num-prompt 10 --ignore-eos --metric-percentiles "50,90,99" --trust-remote-code

Of course, PCP additionally introduces communication overhead from the KV all-gather and kernel-launch overhead from the index-select used to restore KV. Further tuning is still needed to improve performance.
Limitations
Although the current PCP logic is fully compatible with decoding, the lack of splitting of decode tokens means that every PCP rank holds the full set of decode tokens, leading to significant redundant communication and computation (including attention and MoE). We therefore recommend enabling PCP on prefill (P) instances in a P/D-disaggregation setup.
Future work
These items will be tackled in follow-up PRs; community contributions are warmly welcomed.
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.