[Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer#28723
pisceskkk wants to merge 7 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for Prefill Context Parallelism (PCP) for GQA with flashinfer, which is a significant feature for enhancing long-sequence inference. The changes are extensive, touching configuration, parallel state management, attention backends, and the model runner. Overall, the implementation looks solid, but I've identified a few critical issues that need to be addressed. These include a duplicated command-line argument, a syntax error, a typo in a variable name, and incorrect tensor indexing, all of which could lead to runtime errors or prevent the code from running.
Force-pushed from c0f45f9 to 489b6c5
Force-pushed from 8bc261d to 58cbd8f
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Force-pushed from 44f658e to 1cac317
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Force-pushed from b9ed205 to d6bbe6d
Force-pushed from d6bbe6d to df36e76
LucasWilkinson left a comment:
Left some comments on #28988 which I think similarly apply here.
LucasWilkinson left a comment:
Thanks for the contribution! A few more comments.
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Force-pushed from 28e2d1a to 07e78b1
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Purpose
This PR, split out from the full PR #26864, adds support for Prefill Context Parallelism (PCP) with the GQA FlashInfer backend, following PR #28718. For implementation details, please refer to the RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.
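As a rough illustration of that sequence-dimension split, here is an assumed block-interleaved sharding scheme; `pcp_partition_tokens`, its signature, and the `block` default are hypothetical, loosely mirroring the "interleave 8" setting used in the test matrix below, and do not reproduce the PR's actual partitioning code:

```python
def pcp_partition_tokens(num_tokens: int, pcp_size: int,
                         rank: int, block: int = 8) -> list[int]:
    """Assign prefill token indices to one PCP rank by dealing out
    fixed-size blocks of tokens round-robin across the pcp_size ranks
    (hypothetical sketch, not the PR's implementation)."""
    indices = []
    step = pcp_size * block  # distance between this rank's blocks
    for start in range(rank * block, num_tokens, step):
        indices.extend(range(start, min(start + block, num_tokens)))
    return indices
```

Interleaving blocks (rather than giving each rank one contiguous chunk) balances the causal-attention workload across ranks, since later tokens attend to more KV positions than earlier ones.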
The current implementation primarily includes the following changes:
- `ModelRunner.py`: PCP partitioning logic for tokens;
- `flashinfer.py`: adapts the FlashInfer backend for GQA to PCP;
- `PrefillContextParallelMetadata`: shared across attention backends.

Test Plan
Qwen/Qwen2.5-3B
Test Result
gsm8k eval, over the following configurations:
- tp4 17c540a
- tp4 dcp2 interleave 8
- tp4 pcp2 interleave 8
- tp4 dcp2 pcp2 interleave 8
CC @LookAround0301 @FENP @gjc0824 @LucasWilkinson