[Feature] Prefill Context Parallel (PCP) basic support #28718
LucasWilkinson merged 4 commits into vllm-project:main
Conversation
💡 Codex Review
https://github.com/vllm-project/vllm/blob/14870a720161273d7583493580f81e14ab45199f/vllm/engine/arg_utils.py#L762-L772
The CLI now calls
https://github.com/vllm-project/vllm/blob/14870a720161273d7583493580f81e14ab45199f/vllm/v1/worker/gpu_worker.py#L726-L737
Code Review
This pull request introduces basic support for Prefill Context Parallelism (PCP), aligning with the RFC. The changes are consistently applied across the codebase, including updates to parallel configuration, KV cache management, and attention backends. Renaming dcp_kv_cache_interleave_size to cp_kv_cache_interleave_size generalizes the context parallelism KV cache interleaving logic to support both decode and prefill context parallelism. Compatibility checks are in place to temporarily disable PCP for certain features like full CUDA graphs and hybrid attention, indicating a phased rollout of full support. The integration appears thorough and well-considered for the initial support phase.
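To make the compatibility gating concrete, here is a minimal sketch of the kind of check described above. The field and flag names (`prefill_context_parallel_size`, `use_full_cuda_graph`, `is_hybrid_attention`) are illustrative assumptions, not necessarily the names used in the PR.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    # Field names below are illustrative assumptions, not necessarily vLLM's.
    prefill_context_parallel_size: int = 1
    decode_context_parallel_size: int = 1
    cp_kv_cache_interleave_size: int = 1  # generalized from dcp_kv_cache_interleave_size

    def verify_pcp_support(self, use_full_cuda_graph: bool, is_hybrid_attention: bool) -> None:
        """Reject configurations that the basic PCP support does not yet cover."""
        if self.prefill_context_parallel_size <= 1:
            return  # PCP disabled, nothing to validate
        if use_full_cuda_graph:
            raise NotImplementedError("PCP basic support does not yet cover full CUDA graphs.")
        if is_hybrid_attention:
            raise NotImplementedError("PCP basic support does not yet cover hybrid attention models.")
```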
Thanks for the PR! Can you share a few combinations of PCP x DCP x TP in your summary?
If we set PCP=2 and TP=8 but no DCP, what is the KV cache split strategy? Is the input prompt segmented between ranks, while the decoded output tokens' KV cache is replicated?
Different PCP ranks hold different slices of the sequence dimension and therefore distinct KV caches, while copies may exist within the same PCP rank (depending on the MLA or GQA configuration, specifically the relationship between num_kv_heads and tp_size). In your example, assuming an MLA model, all KV caches belonging to PCP0 are identical. The KV caches of PCP0 and PCP1 are different, following the same distribution pattern as DCP: tokens are stored in an interleaved layout with granularity cp_kv_cache_interleave_size.
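For illustration, here is a minimal sketch of the interleaved token-to-rank mapping described above. It is a generic reconstruction rather than vLLM's actual block-table code; `cp_world_size` and `interleave_size` stand in for the CP world size and `cp_kv_cache_interleave_size`.

```python
def token_to_cp_rank(token_idx: int, cp_world_size: int, interleave_size: int) -> int:
    """Return the CP rank whose KV cache stores this token's KV entry."""
    return (token_idx // interleave_size) % cp_world_size


# Example: 2 CP ranks with interleave size 4 -> tokens 0-3 land on rank 0,
# tokens 4-7 on rank 1, tokens 8-11 back on rank 0, and so on.
ranks = [token_to_cp_rank(i, cp_world_size=2, interleave_size=4) for i in range(12)]
assert ranks == [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```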
Does that mean that in the decoding phase PCP acts similarly to DCP, which requires gathering Q, computing local attention, correcting with the LSE, and so on, and sharding the newly generated tokens' KV cache in an interleaved fashion?
In #28988, it seems DCP and PCP are merged into a single CP concept, so they do the same thing in the decoding phase?
Yes, in the decoding phase they act similarly for the KV cache. If I remember correctly, there are only two differences: one is that DCP introduces an all-gather op for Q along the head dimension, and the other is that DCP needs a reduce-scatter after updating the output using the LSE.
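As a rough illustration of the LSE-based correction mentioned here (a generic reconstruction, not vLLM's kernel or backend code), each CP rank produces a partial attention output plus the log-sum-exp (LSE) of its local scores, and the partial outputs are combined with weights derived from those LSEs:

```python
import torch


def merge_cp_partial_outputs(partial_outs: torch.Tensor, lses: torch.Tensor) -> torch.Tensor:
    """Combine per-rank partial attention outputs using their LSEs.

    partial_outs: [cp_size, num_tokens, head_dim] locally normalized attention outputs.
    lses:         [cp_size, num_tokens] log-sum-exp of each rank's local scores.
    """
    # Softmax over the CP dimension gives each rank's share of the global
    # normalizer, i.e. exp(lse_r - logsumexp over ranks).
    weights = torch.softmax(lses, dim=0)                      # [cp_size, num_tokens]
    return (weights.unsqueeze(-1) * partial_outs).sum(dim=0)  # [num_tokens, head_dim]
```

In DCP, as noted above, this merged result would then be reduce-scattered back to the owning ranks.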
Purpose
This PR, split out from the full PR #26864, adds basic support for the Prefill Context Parallelism (PCP) feature, which is the prefill-stage counterpart of DCP. For specific implementation details, please refer to RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.
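As a toy illustration of the idea (not the actual scheduling logic, which also has to balance causal-attention work and respect the interleaved KV cache layout), splitting a prompt along the sequence dimension so that each PCP rank prefills its own shard could look like this:

```python
def split_prompt_for_pcp(prompt_token_ids: list[int], pcp_size: int) -> list[list[int]]:
    """Partition a prompt along the sequence dimension into pcp_size shards."""
    shard_len = (len(prompt_token_ids) + pcp_size - 1) // pcp_size  # ceiling division
    return [prompt_token_ids[r * shard_len:(r + 1) * shard_len] for r in range(pcp_size)]


# Example: a 10-token prompt with PCP=2 gives rank 0 tokens 0-4 and rank 1 tokens 5-9.
shards = split_prompt_for_pcp(list(range(10)), pcp_size=2)
assert shards == [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```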
The current implementation primarily includes the following changes:
- Modify `block_tables.py` to extend the KV cache storage based on DCP & PCP;
- Add `pcp_group` for PCP;

CC @LookAround0301 @FENP @LucasWilkinson