
[WIP] pcp alternative impl #33403

Draft

LucasWilkinson wants to merge 1 commit into main from pcp-alt

Conversation

@LucasWilkinson (Collaborator) commented Jan 30, 2026

Alternative implementation to: #28988

Makes the DCP/PCP interface conform to #25749 (comment)

Rank layout:

================================================================================
LEGEND
================================================================================
  q0-3   = query heads 0 through 3
  k0     = KV head 0
  s0     = KV shard 0 (decode sequence shard)
  p0     = prefill sequence shard 0
  [0, 2] = ranks in group

================================================================================
TP=2, PCP=2, DCP=2  (8 query heads, 2 KV heads)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k1/s0/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k1/s1/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k1, s0, p0 | 3: q4-7/k1, s1, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k1, s0, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k1, s1, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  DCP1: [1, 3]  1: q4-7/k1, s0, p0 | 3: q4-7/k1, s1, p1
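The PCP/TP/DCP group memberships above follow a regular pattern over the TP x PCP grid. The sketch below (illustrative only; `build_groups` and its rank-numbering assumption `rank = pcp_rank * TP + tp_rank` are not the PR's actual API) reconstructs the three kinds of groups for the two supported cases:

```python
# Hypothetical helper reconstructing the process groups shown in the
# diagrams, assuming rank id = pcp_rank * TP + tp_rank.

def build_groups(tp: int, pcp: int, dcp: int):
    assert dcp in (pcp, tp * pcp), "only DCP == PCP or DCP == TP * PCP supported"
    # PCP groups: one per TP position, spanning the PCP axis (grid columns)
    pcp_groups = [[j * tp + t for j in range(pcp)] for t in range(tp)]
    # TP groups: one per PCP position, spanning the TP axis (grid rows)
    tp_groups = [[j * tp + t for t in range(tp)] for j in range(pcp)]
    if dcp == pcp:
        # Case 1: DCP groups coincide with the PCP groups (same TP position)
        dcp_groups = [list(g) for g in pcp_groups]
    else:
        # Case 2: one DCP group over the whole grid, ordered by KV shard
        # index s = tp_rank * PCP + pcp_rank (matches the DCP=4 diagrams)
        by_shard = [(t * pcp + j, j * tp + t) for j in range(pcp) for t in range(tp)]
        dcp_groups = [[rank for _, rank in sorted(by_shard)]]
    return pcp_groups, tp_groups, dcp_groups
```

For TP=2, PCP=2, DCP=2 this yields PCP groups [0, 2] and [1, 3], TP groups [0, 1] and [2, 3], and DCP groups equal to the PCP groups, matching the tables above.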


================================================================================
TP=2, PCP=2, DCP=2  (8 query heads, 1 KV head - MQA)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k0/s0/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k0/s1/p1  |
     +-------------------+-------------------+

NOTE: (k0, s0) and (k0, s1) are each replicated once (since num_gpus = TP * PCP and num_gpus / DCP > num_kv_heads)

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k0, s0, p0 | 3: q4-7/k0, s1, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k0, s0, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k0, s1, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  DCP1: [1, 3]  1: q4-7/k0, s0, p0 | 3: q4-7/k0, s1, p1


================================================================================
TP=2, PCP=2, DCP=4  (8 query heads, 2 KV heads)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k1/s2/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k1/s3/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k1, s2, p0 | 3: q4-7/k1, s3, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k1, s2, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k1, s3, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2, 1, 3]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1 | 1: q4-7/k1, s2, p0 | 3: q4-7/k1, s3, p1


================================================================================
TP=2, PCP=2, DCP=4  (8 query heads, 1 KV head - MQA)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k0/s2/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k0/s3/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k0, s2, p0 | 3: q4-7/k0, s3, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k0, s2, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k0, s3, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2, 1, 3]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1 | 1: q4-7/k0, s2, p0 | 3: q4-7/k0, s3, p1
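All four layouts above follow one per-rank assignment rule. As a sanity check, this sketch (hypothetical code, not from the PR; `rank_layout` is an illustrative name) derives each rank's query heads, KV head, KV shard, and prefill shard, including the MQA replication case:

```python
# Hypothetical sketch reproducing the rank layouts above for a TP x PCP
# grid, assuming rank id = pcp_rank * TP + tp_rank.

def rank_layout(tp: int, pcp: int, dcp: int, num_q_heads: int, num_kv_heads: int):
    """Map each rank to (query heads, KV head, KV shard, prefill shard)."""
    assert dcp in (pcp, tp * pcp), "only DCP == PCP or DCP == TP * PCP supported"
    q_per_rank = num_q_heads // tp
    layout = {}
    for pcp_rank in range(pcp):
        for tp_rank in range(tp):
            rank = pcp_rank * tp + tp_rank
            q = list(range(tp_rank * q_per_rank, (tp_rank + 1) * q_per_rank))
            # KV heads replicate when TP > num_kv_heads (the MQA case)
            kv = tp_rank * num_kv_heads // tp
            # KV shard: shared along the TP row when DCP == PCP, unique per
            # rank when DCP spans the whole grid (DCP == TP * PCP)
            s = pcp_rank if dcp == pcp else tp_rank * pcp + pcp_rank
            layout[rank] = (q, kv, s, pcp_rank)
    return layout

for rank, (q, kv, s, p) in sorted(rank_layout(2, 2, 4, 8, 2).items()):
    print(f"rank {rank}: q{q[0]}-{q[-1]}/k{kv}/s{s}/p{p}")
```

Running it for TP=2, PCP=2, DCP=4 with 2 KV heads reproduces the third table (rank 1 gets q4-7/k1/s2/p0, rank 3 gets q4-7/k1/s3/p1).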

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces Prefill Context Parallelism (PCP), a significant feature for distributing the computational load of prefill requests across multiple GPUs. The implementation is extensive, touching on attention mechanisms, worker logic, and configuration. Key changes include the introduction of a PCPManager to handle input partitioning and output restoration, a fused Triton kernel for optimized QKV selection, and a generalization of context parallelism code from dcp to a more generic cp. The changes are well-structured, with necessary updates to tests and compatibility checks, such as disabling full CUDA graphs when PCP is active. The PR also includes improvements for asynchronous pipeline parallelism and multimodal input handling. Overall, this is a solid and well-thought-out implementation of a complex feature.

@mergify bot commented Jan 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

(The mergify bot posted the same merge-conflict notice again on Jan 31 and Feb 5, 2026.)

@mergify bot commented Mar 2, 2026

Documentation preview: https://vllm--33403.org.readthedocs.build/en/33403/

@mergify mergify bot added documentation Improvements or additions to documentation and removed needs-rebase labels Mar 2, 2026
@LucasWilkinson force-pushed the pcp-alt branch 3 times, most recently from dc1e212 to b3c3f97 on March 2, 2026 at 05:31
…nication

This PR adds Prefill Context Parallelism (PCP) support for splitting prefill
tokens across ranks using a DualChunkSwap pattern, and integrates an All-to-All
communication backend for Decode Context Parallelism (DCP).

Key changes:
- Add PCP with DualChunkSwap token partitioning for balanced prefill computation
- Add All-to-All DCP communication backend reducing NCCL calls from 3 to 2
- Restrict DCP+PCP to two clean configurations:
  - Case 1: DCP = PCP (same TP position, all-reduce only)
  - Case 2: DCP = TP × PCP (full TP all-gather, all-reduce + slice)
- Add PCPManager for buffer management and input partitioning
- Update attention backends (FlashAttention, FlashInfer, MLA) for PCP support
- Add comprehensive tests for DCP operations

Co-Authored-By: QiuChunshuo <qiuchunshuo@huawei.com>
Co-Authored-By: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-Authored-By: FENP <yuanyongjie.yyj@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
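The DualChunkSwap partitioning named in the commit message above can be sketched as follows. This is a guess at the pattern based on the common zigzag load-balancing scheme for causal attention (every name here, including `dual_chunk_swap`, is illustrative, not the PR's kernel): the token range is split into 2 * PCP chunks, and rank r takes chunks r and 2*PCP - 1 - r, pairing one cheap "early" chunk with one expensive "late" chunk so causal-attention work balances across ranks.

```python
# Hypothetical DualChunkSwap-style token partition (assumed zigzag scheme,
# not the PR's actual implementation).

def dual_chunk_swap(num_tokens: int, pcp: int, pcp_rank: int) -> list[int]:
    """Return the token indices assigned to `pcp_rank` out of `pcp` ranks."""
    chunk = (num_tokens + 2 * pcp - 1) // (2 * pcp)  # ceil-divided chunk size
    lo = pcp_rank                  # early chunk: short causal context, cheap
    hi = 2 * pcp - 1 - pcp_rank   # late chunk: long causal context, expensive
    ids: list[int] = []
    for c in (lo, hi):
        ids.extend(range(c * chunk, min((c + 1) * chunk, num_tokens)))
    return ids
```

With 8 tokens and PCP=2, rank 0 gets tokens [0, 1, 6, 7] and rank 1 gets [2, 3, 4, 5]; together the ranks cover every token exactly once.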
@mergify bot commented Mar 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels: documentation (Improvements or additions to documentation), needs-rebase, nvidia, rocm (Related to AMD ROCm), speculative-decoding, v1
