Support CP with query length larger than 1 #93
Merged
LucasWilkinson merged 9 commits into vllm-project:main on Oct 5, 2025
Signed-off-by: Ming Yang <minos.future@gmail.com>
hopper/flash_api.cpp
Outdated
bool const packgqa_override = params.arch >= 90 && (params.h / params.h_k) == 8 &&
    params.is_local &&
Collaborator
Do you mind removing the unrelated formatting changes? We are trying to stay as close to upstream as possible.
  : std::max(n_block_min,
-     cute::ceil_div(m_idx_max + seqlen_k - seqlen_q - params.window_size_left, kBlockN));
+     cute::ceil_div(m_idx_max +
+         params.cp_world_size * seqlen_k -
Collaborator
Can we use cp_tot_seqlen_k to skip the multiplication here? Should we branch in the non-CP case to save the multiplication?
Collaborator
We could make cp_tot_seqlen_k == seqlen_k in the params.cp_world_size == 1 case.
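The fallback suggested in this comment can be sketched in Python as follows. This is only an illustration of the idea, not the code that was merged: the helper name `resolve_cp_tot_seqlen_k` and its argument layout are assumptions, loosely modeled on the `cp_tot_seqused_k[bidb]` lookup in the hopper/seqlen.h snippet.

```python
def resolve_cp_tot_seqlen_k(seqlen_k, cp_world_size, cp_tot_seqused_k=None, bidb=0):
    """Return the K length to use in mask and block-bound arithmetic.

    Hypothetical helper: in the non-CP case (cp_world_size == 1, or no
    per-batch totals supplied) fall back to the local seqlen_k, so callers
    can use one expression for both cases without an extra multiply or branch.
    """
    if cp_world_size == 1 or cp_tot_seqused_k is None:
        return seqlen_k
    # CP case: total K length across all ranks for batch element `bidb`.
    return cp_tot_seqused_k[bidb]


# Non-CP: the "total" length is just the local length.
print(resolve_cp_tot_seqlen_k(seqlen_k=5, cp_world_size=1))
# CP with two ranks: use the per-batch total supplied by the caller.
print(resolve_cp_tot_seqlen_k(seqlen_k=5, cp_world_size=2, cp_tot_seqused_k=[11, 9], bidb=1))
```

With a fallback like this, downstream bounds can always read the resolved total rather than recomputing `cp_world_size * seqlen_k`.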
hopper/seqlen.h
Outdated
, cp_world_size(cp_world_size)
, cp_tot_seqlen_k(cp_tot_seqused_k == nullptr
      ? 0
      : cp_tot_seqused_k[bidb])
LucasWilkinson (Collaborator) left a comment:
Awesome work! Left a few comments, but it's looking really good!
This PR implements the causal mask for interleaved context parallelism (CP) so that query lengths greater than 1 are supported.
The solution follows the discussion between @LucasWilkinson, @youkaichao, and @youzhedian on Slack.
Key illustration by @LucasWilkinson:
In the DCP case, the K/V tokens are distributed across ranks in an interleaved fashion; see vllm-project/vllm#23734.

Therefore, in the example above, KV tokens 0, 2, 4 are on rank 0 and KV tokens 1, 3, 5 are on rank 1. The mask shape is no longer a bottom-right triangle.
FA therefore needs to know the CP world size and the CP rank in order to determine the causal mask.
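A minimal Python sketch of the per-rank causal mask, to make the interleaving concrete. It is a reference-style illustration, not the kernel code from this PR. The function names are hypothetical, and it assumes that local key `j` on rank `r` holds global token `j * cp_world_size + r`, and that the queries are the last `seqlen_q` tokens of the global sequence.

```python
def interleaved_kv_indices(tot_seqlen_k, cp_world_size, cp_rank):
    """Global KV positions held by `cp_rank` under round-robin interleaving."""
    return list(range(cp_rank, tot_seqlen_k, cp_world_size))


def cp_causal_mask(seqlen_q, tot_seqlen_k, cp_world_size, cp_rank):
    """mask[i][j] is True when local query i may attend to local key j.

    Assumption: query i sits at global position tot_seqlen_k - seqlen_q + i,
    and it may see every key whose global position is not in its future.
    """
    kv_global = interleaved_kv_indices(tot_seqlen_k, cp_world_size, cp_rank)
    q_offset = tot_seqlen_k - seqlen_q
    return [[k <= q_offset + i for k in kv_global] for i in range(seqlen_q)]


# Six KV tokens over two ranks, three queries (global positions 3, 4, 5).
for rank in range(2):
    print("rank", rank, "holds", interleaved_kv_indices(6, 2, rank))
    for row in cp_causal_mask(3, 6, 2, rank):
        print("  ", ["x" if visible else "." for visible in row])
```

On rank 0 the first query already sees two of the three local keys, so the visible region is wider than the bottom-right triangle a non-CP causal mask would give for a 3x3 problem.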
The block tiling implementation also needs to be updated. As illustrated below, we now need to process block tile (0,1) in the CP case, whereas it could previously be skipped in the normal case.
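The effect on block tiling can be checked with a brute-force Python sketch. It is not how the kernel computes its block bounds; it simply marks a tile as needed if any (query, key) pair inside it is unmasked. The sizes (4 queries, 4 local keys, 2x2 tiles, 2 ranks) and all function names are assumptions chosen for illustration, as is the interleaving rule that local key `j` on rank `r` is global token `j * cp_world_size + r`.

```python
def local_causal_mask(seqlen_q, seqlen_k):
    """Non-CP causal mask, aligned to the bottom-right corner."""
    offset = seqlen_k - seqlen_q
    return [[j <= offset + i for j in range(seqlen_k)] for i in range(seqlen_q)]


def cp_causal_mask(seqlen_q, tot_seqlen_k, cp_world_size, cp_rank):
    """Causal mask on one CP rank with interleaved KV (assumed layout)."""
    q_offset = tot_seqlen_k - seqlen_q
    kv_global = range(cp_rank, tot_seqlen_k, cp_world_size)
    return [[k <= q_offset + i for k in kv_global] for i in range(seqlen_q)]


def tiles_to_process(mask, block_m, block_n):
    """Set of (m_block, n_block) tiles containing at least one visible entry."""
    tiles = set()
    for i, row in enumerate(mask):
        for j, visible in enumerate(row):
            if visible:
                tiles.add((i // block_m, j // block_n))
    return tiles


normal = tiles_to_process(local_causal_mask(4, 4), block_m=2, block_n=2)
cp_rank0 = tiles_to_process(cp_causal_mask(4, 8, 2, 0), block_m=2, block_n=2)
print("normal:", sorted(normal))
print("CP rank 0:", sorted(cp_rank0))
print("extra tiles in CP:", sorted(cp_rank0 - normal))
```

In this configuration tile (0,1) is fully masked in the normal case but contains a visible entry on CP rank 0, which is the situation described above.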
Tests
Added and passed unit tests for CP.