[Feature] Support Prefill Context Parallel (PCP) for GQA flashinfer#28723
pisceskkk wants to merge 7 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for Prefill Context Parallelism (PCP) for GQA with flashinfer, which is a significant feature for enhancing long-sequence inference. The changes are extensive, touching configuration, parallel state management, attention backends, and the model runner. Overall, the implementation looks solid, but I've identified a few critical issues that need to be addressed. These include a duplicated command-line argument, a syntax error, a typo in a variable name, and incorrect tensor indexing, all of which could lead to runtime errors or prevent the code from running.
Force-pushed from c0f45f9 to 489b6c5
Force-pushed from 8bc261d to 58cbd8f
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Force-pushed from 44f658e to 1cac317
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Force-pushed from b9ed205 to d6bbe6d
Force-pushed from d6bbe6d to df36e76
LucasWilkinson left a comment:
Left some comments on #28988 which I think similarly apply here.
LucasWilkinson left a comment:
Thanks for the contribution! A few more comments.
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Force-pushed from 28e2d1a to 07e78b1
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Purpose
This PR, split out from the full PR #26864, adds support for Prefill Context Parallelism (PCP) with the GQA FlashInfer backend, following PR #28718. For implementation details, please refer to the RFC #25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning the sequence dimension during the prefill stage.
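As a rough illustration of that sequence-dimension split, here is an assumed block-interleaved sharding scheme; `pcp_partition_tokens`, its signature, and the `block` default are hypothetical, loosely mirroring the "interleave 8" setting used in the test matrix below, and do not reproduce the PR's actual partitioning code:

```python
def pcp_partition_tokens(num_tokens: int, pcp_size: int,
                         rank: int, block: int = 8) -> list[int]:
    """Assign prefill token indices to one PCP rank by dealing out
    fixed-size blocks of tokens round-robin across the pcp_size ranks
    (hypothetical sketch, not the PR's implementation)."""
    indices = []
    step = pcp_size * block  # distance between this rank's blocks
    for start in range(rank * block, num_tokens, step):
        indices.extend(range(start, min(start + block, num_tokens)))
    return indices
```

Interleaving blocks (rather than giving each rank one contiguous chunk) balances the causal-attention workload across ranks, since later tokens attend to more KV positions than earlier ones.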
The current implementation primarily includes the following changes:
- `ModelRunner.py`: PCP partitioning logic for tokens;
- `flashinfer.py`: adapts the FlashInfer backend for GQA to PCP;
- `PrefillContextParallelMetadata`: shared across attention backends.

Test Plan
Qwen/Qwen2.5-3B
Test Result
gsm8k eval, over the following configurations:
- tp4 17c540a
- tp4 dcp2 interleave 8
- tp4 pcp2 interleave 8
- tp4 dcp2 pcp2 interleave 8
CC @LookAround0301 @FENP @gjc0824 @LucasWilkinson