Code Review
This pull request introduces Prefill Context Parallelism (PCP), a significant feature for distributing the computational load of prefill requests across multiple GPUs. The implementation is extensive, touching on attention mechanisms, worker logic, and configuration. Key changes include the introduction of a PCPManager to handle input partitioning and output restoration, a fused Triton kernel for optimized QKV selection, and a generalization of context parallelism code from dcp to a more generic cp. The changes are well-structured, with necessary updates to tests and compatibility checks, such as disabling full CUDA graphs when PCP is active. The PR also includes improvements for asynchronous pipeline parallelism and multimodal input handling. Overall, this is a solid and well-thought-out implementation of a complex feature.
|
This pull request has merge conflicts that must be resolved before it can be |
|
Documentation preview: https://vllm--33403.org.readthedocs.build/en/33403/ |
…nication

This PR adds Prefill Context Parallelism (PCP) support for splitting prefill tokens across ranks using a DualChunkSwap pattern, and integrates an All-to-All communication backend for Decode Context Parallelism (DCP).

Key changes:
- Add PCP with DualChunkSwap token partitioning for balanced prefill computation
- Add All-to-All DCP communication backend reducing NCCL calls from 3 to 2
- Restrict DCP+PCP to two clean configurations:
  - Case 1: DCP = PCP (same TP position, all-reduce only)
  - Case 2: DCP = TP × PCP (full TP all-gather, all-reduce + slice)
- Add PCPManager for buffer management and input partitioning
- Update attention backends (FlashAttention, FlashInfer, MLA) for PCP support
- Add comprehensive tests for DCP operations

Co-Authored-By: QiuChunshuo <qiuchunshuo@huawei.com>
Co-Authored-By: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-Authored-By: FENP <yuanyongjie.yyj@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Alternative implementation to: #28988
Makes DCP/PCP interface conform to: #25749 (comment)
Rank layout: