CUDA: fix padding of GQA to power of 2 in FA #19115
Merged
JohannesGaessler merged 1 commit into ggml-org:master (Jan 26, 2026)
Conversation
ggerganov approved these changes (Jan 26, 2026)
Contributor
Confirmed working well now with https://huggingface.co/bartowski/MiniMaxAI_MiniMax-M2.1-GGUF at Q6.
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request (Feb 6, 2026)
alielfilali01 added a commit to alielfilali01/llama.cpp that referenced this pull request (Feb 12, 2026)
alielfilali01 added a commit to alielfilali01/llama.cpp that referenced this pull request (Feb 19, 2026)
CISC added a commit that referenced this pull request (Feb 19, 2026)
* model: add JAIS-2 architecture support

  Add support for the JAIS-2 family of Arabic-English bilingual models from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

  Architecture characteristics:
  - LayerNorm (not RMSNorm) with biases
  - ReLU² (ReLU squared) activation function
  - Separate Q/K/V projections with biases
  - Simple MLP without gate projection (up -> act -> down)
  - RoPE positional embeddings
  - GPT-2 BPE tokenizer

  Supported model sizes:
  - Jais-2-8B (32 layers, 26 heads, 3328 hidden)
  - Jais-2-70B (68 layers, 56 heads, 7168 hidden)

  Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

  Note: JAIS-2 requires F32 precision accumulators for numerical stability and uses standard attention (not flash attention) on CUDA backends.

* fix: run convert_hf_to_gguf_update.py for jais-2 tokenizer hash
* fix: use NEOX RoPE type for JAIS2
* fix: remove Q/K permutation (NEOX RoPE doesn't need it)
* fix: enable flash attention for JAIS2 (fixed by #19115)
* fix: add dedicated JAIS2 pre-tokenizer type and control vector support
  - Add LLAMA_VOCAB_PRE_TYPE_JAIS2 with cascading whitespace regex
  - Include original regex from tokenizer.json as comment
  - Add build_cvec call for control vector support
* no longer necessary to override set_vocab

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
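For illustration only (not code from the referenced commit, which builds a ggml compute graph), a minimal sketch of the gateless up -> act -> down MLP with ReLU² activation described above, using made-up plain-float buffers:

```cpp
// Hypothetical sketch of a gateless MLP with ReLU^2 activation (up -> act -> down).
// Shapes and buffer layout are illustrative only.
#include <cstddef>
#include <vector>

// y = W_down * relu(W_up * x)^2, row-major weights, biases omitted for brevity
std::vector<float> mlp_relu2(const std::vector<float> & x,
                             const std::vector<float> & w_up,   // [n_ff x n_embd]
                             const std::vector<float> & w_down, // [n_embd x n_ff]
                             size_t n_embd, size_t n_ff) {
    std::vector<float> h(n_ff, 0.0f);
    for (size_t i = 0; i < n_ff; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n_embd; ++j) {
            acc += w_up[i*n_embd + j] * x[j];
        }
        const float r = acc > 0.0f ? acc : 0.0f; // ReLU
        h[i] = r*r;                              // squared
    }
    std::vector<float> y(n_embd, 0.0f);
    for (size_t i = 0; i < n_embd; ++i) {
        for (size_t j = 0; j < n_ff; ++j) {
            y[i] += w_down[i*n_ff + j] * h[j];
        }
    }
    return y;
}
```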
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request (Feb 23, 2026)
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request (Mar 2, 2026)
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request (Mar 3, 2026)
Fixes #19112; the issue was introduced with #19092.
The MMA CUDA FlashAttention kernel uses a stream-k decomposition that treats the four-dimensional input tensors as one contiguous dimension to split across streaming multiprocessors. However, in conjunction with the GQA-specific optimizations in the MMA kernel this is only correct if the number of Q columns per CUDA block exactly divides `n_gqa`. Otherwise the wrong Q and K/V heads are associated with each other and the result is wrong (with only a single K/V head this doesn't matter, which is why it was not detected in testing).

This PR extends the 4D space on master to a 5D space by splitting the "z" dimension holding the number of Q heads into one dimension for the number of K/V heads and another dimension for the number of Q heads per K/V head. This then makes it possible to simply pad the Q columns per CUDA block to a power of 2.
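To illustrate the indexing idea, here is a minimal host-side sketch (not the actual kernel code; `n_head_kv`, `n_gqa`, and the padded column count are made-up example values). With a separate K/V-head dimension, a padded block of Q columns can never spill over into the Q heads that belong to the next K/V head:

```cpp
// Hypothetical host-side illustration of the 5D indexing idea, not kernel code.
#include <algorithm>
#include <cstdio>

int main() {
    const int n_head_kv   = 4;  // number of K/V heads (example value)
    const int n_gqa       = 12; // Q heads per K/V head (example value)
    const int cols_padded = 16; // Q columns per CUDA block, padded to a power of 2

    // blocks needed in the "z" direction for each K/V head after padding
    const int blocks_per_kv = (n_gqa + cols_padded - 1) / cols_padded;

    // iterate K/V heads and the Q-head groups inside them as separate dimensions
    for (int kv = 0; kv < n_head_kv; ++kv) {
        for (int b = 0; b < blocks_per_kv; ++b) {
            const int q_first = kv*n_gqa + b*cols_padded;
            const int q_last  = kv*n_gqa + std::min((b + 1)*cols_padded, n_gqa) - 1;
            printf("K/V head %d, block %d -> Q heads %d..%d\n", kv, b, q_first, q_last);
        }
    }
    return 0;
}
```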
I modified one of the test cases in `test-backend-ops` to check for this fix. On master `n_gqa` is set to 1, 4, and 16. I chose these values to check for no GQA optimizations, GQA optimizations with a single CUDA block in the z direction, and GQA optimizations with more than one CUDA block in the z direction. By changing the last value from 16 to 12 it will still cover that case while also checking for the correct padding logic.
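As a rough illustration of why 12 exercises the new path (a sketch, not the actual test code): padding to the next power of 2 only changes the effective block size when `n_gqa` itself is not a power of 2.

```cpp
// Minimal sketch, not the actual test-backend-ops code: shows which n_gqa
// values force the Q columns per block to be padded past n_gqa.
#include <cstdio>
#include <initializer_list>

// round x up to the next power of 2 (x > 0 assumed)
static int next_pow2(int x) {
    int p = 1;
    while (p < x) p *= 2;
    return p;
}

int main() {
    for (int n_gqa : {1, 4, 12, 16}) {
        const int padded = next_pow2(n_gqa);
        printf("n_gqa = %2d -> padded to %2d (%s)\n",
               n_gqa, padded, padded == n_gqa ? "no padding" : "padding logic exercised");
    }
    return 0;
}
```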