UPSTREAM PR #19115: CUDA: fix padding of GQA to power of 2 in FA #1041

Open
loci-dev wants to merge 1 commit into main from
upstream-PR19115-branch_JohannesGaessler-cuda-fa-fix-gqa-padding

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19115

Fixes ggml-org/llama.cpp#19112; the issue was introduced with ggml-org/llama.cpp#19092.

The MMA CUDA FlashAttention kernel uses a stream-k decomposition that treats the four-dimensional input tensors as one contiguous dimension to be split across streaming multiprocessors. However, in conjunction with the GQA-specific optimizations in the MMA kernel, this is only correct if the number of Q columns per CUDA block exactly divides n_gqa. Otherwise the wrong Q and K/V heads are associated with each other and the result is wrong (with only a single K/V head this does not matter, which is why the bug was not detected in testing).

This PR extends the 4D space on master to a 5D space by splitting the "z" dimension (the number of Q heads) into one dimension for the number of K/V heads and another dimension for the number of Q heads per K/V head. This makes it possible to simply pad the number of Q columns per CUDA block to a power of 2.
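The block-to-head mapping described above can be sketched as follows. This is an illustrative host-side sketch with hypothetical helper names, not the actual kernel code; it assumes GQA groups of n_gqa consecutive Q heads sharing one K/V head, and CUDA blocks that each process cols_per_block Q columns.

```cpp
#include <cassert>

// Flawed mapping on master: derive the K/V head from the first Q column
// covered by the block. This is only correct when cols_per_block exactly
// divides n_gqa; otherwise a block straddles two K/V heads and its later
// columns are paired with the wrong K/V head.
int kv_head_4d(int block, int cols_per_block, int n_gqa) {
    return block * cols_per_block / n_gqa;
}

// Fixed mapping in this PR: split the z dimension into (K/V head, Q head
// within the group) and pad the group size up to a multiple of
// cols_per_block (a power of 2), so a block can never straddle two K/V
// heads.
int kv_head_5d(int block, int cols_per_block, int n_gqa) {
    const int n_gqa_padded     = (n_gqa + cols_per_block - 1) / cols_per_block * cols_per_block;
    const int blocks_per_group = n_gqa_padded / cols_per_block;
    return block / blocks_per_group;
}
```

For example, with n_gqa = 12 and cols_per_block = 8, block 1 covers Q columns 8..15: the flawed 4D mapping assigns all of them to K/V head 0 even though columns 12..15 belong to K/V head 1. The padded 5D mapping assigns blocks 0 and 1 to K/V head 0 (the trailing columns of block 1 become padding) and block 2 to K/V head 1.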

I modified one of the test cases in test-backend-ops to check for this fix. On master n_gqa is set to 1, 4, and 16; I chose these values to cover no GQA optimizations, GQA optimizations with a single CUDA block in the z direction, and GQA optimizations with more than one CUDA block in the z direction. Changing the last value from 16 to 12 still covers that case while also checking the padding logic.
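The choice of 12 can be motivated by a quick check: the original values 1, 4, and 16 are all powers of 2, while 12 is not, so with n_gqa = 12 a power-of-2 number of Q columns per block cannot tile the GQA groups evenly and the new padding path is exercised. The helper name below is illustrative, not from the codebase.

```cpp
// Standard bit trick: x is a power of 2 iff it is positive and clearing
// its lowest set bit leaves zero.
bool is_power_of_2(int x) {
    return x > 0 && (x & (x - 1)) == 0;
}
```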

@loci-review

loci-review bot commented Jan 26, 2026

No summary available at this time. Visit Version Insights to review detailed analysis.

@loci-dev force-pushed the main branch 27 times, most recently from 10471d1 to e11b5e5 on January 29, 2026 at 15:17
@loci-dev force-pushed the main branch 30 times, most recently from 6b41339 to 1a5cc77 on February 1, 2026 at 01:40