UPSTREAM PR #19115: CUDA: fix padding of GQA to power of 2 in FA (#1041)
Open
Mirrored from ggml-org/llama.cpp#19115
Fixes ggml-org/llama.cpp#19112; the issue was introduced by ggml-org/llama.cpp#19092.
The MMA CUDA FlashAttention kernel uses a stream-k decomposition that treats the four-dimensional input tensors as one contiguous dimension to be split across streaming multiprocessors. However, in conjunction with the GQA-specific optimizations in the MMA kernel, this is only correct if the number of Q columns per CUDA block exactly divides n_gqa. Otherwise the wrong Q and K/V heads are associated with each other and the result is wrong (with only a single K/V head this doesn't matter, which is why it was not detected in testing).

This PR extends the 4D space on master to a 5D space by splitting the "z" dimension holding the number of Q heads into one dimension for the number of K/V heads and another dimension for the number of Q heads per K/V head. This then makes it possible to simply pad the Q columns per CUDA block to a power of 2.
I modified one of the test cases in test-backend-ops to check for this fix. On master, n_gqa is set to 1, 4, and 16. I chose these values to check for no GQA optimizations, GQA optimizations with a single CUDA block in the z direction, and GQA optimizations with >1 CUDA blocks in the z direction. By changing the last value from 16 to 12, the test still covers that case while also checking the padding logic.