UPSTREAM PR #19115: CUDA: fix padding of GQA to power of 2 in FA (#1041)
Open
Mirrored from ggml-org/llama.cpp#19115
Fixes ggml-org/llama.cpp#19112; the issue was introduced by ggml-org/llama.cpp#19092.
The MMA CUDA FlashAttention kernel uses a stream-k decomposition that treats the four-dimensional input tensors as one contiguous dimension to be split across streaming multiprocessors. However, in conjunction with the GQA-specific optimizations in the MMA kernel, this is only correct if the number of Q columns per CUDA block exactly divides n_gqa. Otherwise the wrong Q and K/V heads are associated with each other and the result is wrong (with only a single K/V head this doesn't matter, which is why it was not detected in testing).

This PR extends the 4D space on master to a 5D space by splitting the "z" dimension holding the number of Q heads into one dimension for the number of K/V heads and another dimension for the number of Q heads per K/V head. This then makes it possible to simply pad the Q columns per CUDA block to a power of 2.
I modified one of the test cases in test-backend-ops to check for this fix. On master, n_gqa is set to 1, 4, and 16. I chose these values to check for no GQA optimizations, GQA optimizations with a single CUDA block in the z direction, and GQA optimizations with >1 CUDA blocks in the z direction. By changing the last value from 16 to 12, the test still covers that case while also checking the padding logic.