
[WIP] pcp alternative impl #33403

Draft

LucasWilkinson wants to merge 1 commit into main from pcp-alt

Conversation

@LucasWilkinson (Collaborator) commented Jan 30, 2026

Alternative implementation to: #28988

Makes the DCP/PCP interface conform to #25749 (comment)

Rank layout:

================================================================================
LEGEND
================================================================================
  q0-3   = query heads 0 through 3
  k0     = KV head 0
  s0     = KV shard 0 (decode sequence shard)
  p0     = prefill sequence shard 0
  [0, 2] = ranks in group

================================================================================
TP=2, PCP=2, DCP=2  (8 query heads, 2 KV heads)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k1/s0/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k1/s1/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k1, s0, p0 | 3: q4-7/k1, s1, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k1, s0, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k1, s1, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  DCP1: [1, 3]  1: q4-7/k1, s0, p0 | 3: q4-7/k1, s1, p1
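The PCP/TP/DCP group memberships above follow a regular pattern over the TP x PCP grid. The sketch below (illustrative only; `build_groups` and its rank-numbering assumption `rank = pcp_rank * TP + tp_rank` are not the PR's actual API) reconstructs the three kinds of groups for the two supported cases:

```python
# Hypothetical helper reconstructing the process groups shown in the
# diagrams, assuming rank id = pcp_rank * TP + tp_rank.

def build_groups(tp: int, pcp: int, dcp: int):
    assert dcp in (pcp, tp * pcp), "only DCP == PCP or DCP == TP * PCP supported"
    # PCP groups: one per TP position, spanning the PCP axis (grid columns)
    pcp_groups = [[j * tp + t for j in range(pcp)] for t in range(tp)]
    # TP groups: one per PCP position, spanning the TP axis (grid rows)
    tp_groups = [[j * tp + t for t in range(tp)] for j in range(pcp)]
    if dcp == pcp:
        # Case 1: DCP groups coincide with the PCP groups (same TP position)
        dcp_groups = [list(g) for g in pcp_groups]
    else:
        # Case 2: one DCP group over the whole grid, ordered by KV shard
        # index s = tp_rank * PCP + pcp_rank (matches the DCP=4 diagrams)
        by_shard = [(t * pcp + j, j * tp + t) for j in range(pcp) for t in range(tp)]
        dcp_groups = [[rank for _, rank in sorted(by_shard)]]
    return pcp_groups, tp_groups, dcp_groups
```

For TP=2, PCP=2, DCP=2 this yields PCP groups [0, 2] and [1, 3], TP groups [0, 1] and [2, 3], and DCP groups equal to the PCP groups, matching the tables above.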


================================================================================
TP=2, PCP=2, DCP=2  (8 query heads, 1 KV head - MQA)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k0/s0/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k0/s1/p1  |
     +-------------------+-------------------+

NOTE: (k0, s0) and (k0, s1) are each replicated once (since num_gpus = TP * PCP and num_gpus / DCP > num_kv_heads)

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k0, s0, p0 | 3: q4-7/k0, s1, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k0, s0, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k0, s1, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  DCP1: [1, 3]  1: q4-7/k0, s0, p0 | 3: q4-7/k0, s1, p1


================================================================================
TP=2, PCP=2, DCP=4  (8 query heads, 2 KV heads)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k1/s2/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k1/s3/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k1, s2, p0 | 3: q4-7/k1, s3, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k1, s2, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k1, s3, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2, 1, 3]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1 | 1: q4-7/k1, s2, p0 | 3: q4-7/k1, s3, p1


================================================================================
TP=2, PCP=2, DCP=4  (8 query heads, 1 KV head - MQA)
================================================================================

Rank Layout (query heads / KV head / KV shard / prefill shard):
      tp0                    tp1
     +-------------------+-------------------+
pcp0 | 0: q0-3/k0/s0/p0  | 1: q4-7/k0/s2/p0  |
     +-------------------+-------------------+
pcp1 | 2: q0-3/k0/s1/p1  | 3: q4-7/k0/s3/p1  |
     +-------------------+-------------------+

PCP Groups (shard sequence during prefill):
  PCP0: [0, 2]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1
  PCP1: [1, 3]  1: q4-7/k0, s2, p0 | 3: q4-7/k0, s3, p1

TP Groups (shard heads):
  TP0:  [0, 1]  0: q0-3/k0, s0, p0 | 1: q4-7/k0, s2, p0
  TP1:  [2, 3]  2: q0-3/k0, s1, p1 | 3: q4-7/k0, s3, p1

DCP Groups (shard KV during decode):
  DCP0: [0, 2, 1, 3]  0: q0-3/k0, s0, p0 | 2: q0-3/k0, s1, p1 | 1: q4-7/k0, s2, p0 | 3: q4-7/k0, s3, p1
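All four layouts above follow one per-rank assignment rule. As a sanity check, this sketch (hypothetical code, not from the PR; `rank_layout` is an illustrative name) derives each rank's query heads, KV head, KV shard, and prefill shard, including the MQA replication case:

```python
# Hypothetical sketch reproducing the rank layouts above for a TP x PCP
# grid, assuming rank id = pcp_rank * TP + tp_rank.

def rank_layout(tp: int, pcp: int, dcp: int, num_q_heads: int, num_kv_heads: int):
    """Map each rank to (query heads, KV head, KV shard, prefill shard)."""
    assert dcp in (pcp, tp * pcp), "only DCP == PCP or DCP == TP * PCP supported"
    q_per_rank = num_q_heads // tp
    layout = {}
    for pcp_rank in range(pcp):
        for tp_rank in range(tp):
            rank = pcp_rank * tp + tp_rank
            q = list(range(tp_rank * q_per_rank, (tp_rank + 1) * q_per_rank))
            # KV heads replicate when TP > num_kv_heads (the MQA case)
            kv = tp_rank * num_kv_heads // tp
            # KV shard: shared along the TP row when DCP == PCP, unique per
            # rank when DCP spans the whole grid (DCP == TP * PCP)
            s = pcp_rank if dcp == pcp else tp_rank * pcp + pcp_rank
            layout[rank] = (q, kv, s, pcp_rank)
    return layout

for rank, (q, kv, s, p) in sorted(rank_layout(2, 2, 4, 8, 2).items()):
    print(f"rank {rank}: q{q[0]}-{q[-1]}/k{kv}/s{s}/p{p}")
```

Running it for TP=2, PCP=2, DCP=4 with 2 KV heads reproduces the third table (rank 1 gets q4-7/k1/s2/p0, rank 3 gets q4-7/k1/s3/p1).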

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces Prefill Context Parallelism (PCP), a significant feature for distributing the computational load of prefill requests across multiple GPUs. The implementation is extensive, touching on attention mechanisms, worker logic, and configuration. Key changes include the introduction of a PCPManager to handle input partitioning and output restoration, a fused Triton kernel for optimized QKV selection, and a generalization of context parallelism code from dcp to a more generic cp. The changes are well-structured, with necessary updates to tests and compatibility checks, such as disabling full CUDA graphs when PCP is active. The PR also includes improvements for asynchronous pipeline parallelism and multimodal input handling. Overall, this is a solid and well-thought-out implementation of a complex feature.

@mergify bot commented Jan 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

(The mergify bot posted the same merge-conflict notice again on Jan 31 and Feb 5, 2026.)

@mergify bot commented Mar 2, 2026

Documentation preview: https://vllm--33403.org.readthedocs.build/en/33403/

@mergify mergify bot added documentation Improvements or additions to documentation and removed needs-rebase labels Mar 2, 2026
@LucasWilkinson force-pushed the pcp-alt branch 3 times, most recently from dc1e212 to b3c3f97 on March 2, 2026 at 05:31
…nication

This PR adds Prefill Context Parallelism (PCP) support for splitting prefill
tokens across ranks using a DualChunkSwap pattern, and integrates an All-to-All
communication backend for Decode Context Parallelism (DCP).

Key changes:
- Add PCP with DualChunkSwap token partitioning for balanced prefill computation
- Add All-to-All DCP communication backend reducing NCCL calls from 3 to 2
- Restrict DCP+PCP to two clean configurations:
  - Case 1: DCP = PCP (same TP position, all-reduce only)
  - Case 2: DCP = TP × PCP (full TP all-gather, all-reduce + slice)
- Add PCPManager for buffer management and input partitioning
- Update attention backends (FlashAttention, FlashInfer, MLA) for PCP support
- Add comprehensive tests for DCP operations

Co-Authored-By: QiuChunshuo <qiuchunshuo@huawei.com>
Co-Authored-By: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-Authored-By: FENP <yuanyongjie.yyj@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
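The DualChunkSwap partitioning named in the commit message above can be sketched as follows. This is a guess at the pattern based on the common zigzag load-balancing scheme for causal attention (every name here, including `dual_chunk_swap`, is illustrative, not the PR's kernel): the token range is split into 2 * PCP chunks, and rank r takes chunks r and 2*PCP - 1 - r, pairing one cheap "early" chunk with one expensive "late" chunk so causal-attention work balances across ranks.

```python
# Hypothetical DualChunkSwap-style token partition (assumed zigzag scheme,
# not the PR's actual implementation).

def dual_chunk_swap(num_tokens: int, pcp: int, pcp_rank: int) -> list[int]:
    """Return the token indices assigned to `pcp_rank` out of `pcp` ranks."""
    chunk = (num_tokens + 2 * pcp - 1) // (2 * pcp)  # ceil-divided chunk size
    lo = pcp_rank                  # early chunk: short causal context, cheap
    hi = 2 * pcp - 1 - pcp_rank   # late chunk: long causal context, expensive
    ids: list[int] = []
    for c in (lo, hi):
        ids.extend(range(c * chunk, min((c + 1) * chunk, num_tokens)))
    return ids
```

With 8 tokens and PCP=2, rank 0 gets tokens [0, 1, 6, 7] and rank 1 gets [2, 3, 4, 5]; together the ranks cover every token exactly once.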
@mergify bot commented Mar 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels: documentation (Improvements or additions to documentation), needs-rebase, nvidia, rocm (Related to AMD ROCm), speculative-decoding, v1
