
llama : rotate activations for better quantization#21038

Merged
ggerganov merged 12 commits into `master` from `gg/attn-rot` on Apr 1, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Mar 26, 2026

Overview

In anticipation of the incoming flood of vibe-generated PRs implementing TurboQuant, I'm raising the baseline a bit with a very simple interpretation of the idea of using a Hadamard transform to reduce outliers in the attention and improve quantization quality:

  • Rotate input Q, K, V and store in cache
  • Perform attention with the rotated Q, K, V
  • Rotate back the output vector of the attention

This works because:

  • Rotation does not affect the dot product of 2 vectors
  • The output vector is a linear combination of rotated vectors
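The first point can be sketched with a small, self-contained example (a minimal sketch, not the PR's code; any orthonormal matrix works, a 2x2 rotation is used here for brevity):

```python
import math

# A hypothetical 2x2 rotation matrix (orthonormal) -- any angle works.
theta = 0.7
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def rotate(R, v):
    # Matrix-vector product R @ v
    return [sum(R[i][j] * v[j] for j in range(len(v))) for i in range(len(R))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [3.0, -1.0]
k = [0.5, 2.0]

# dot(Rq, Rk) == dot(q, k) because R^T R = I
assert abs(dot(rotate(R, q), rotate(R, k)) - dot(q, k)) < 1e-9
```

Since attention scores are dot products between Q and K rows, rotating both before caching leaves the scores unchanged up to quantization error.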

The implementation is very simple and backend-agnostic: use a Hadamard matrix of size n x n, normalized by 1/sqrt(n) so that it is orthonormal and can be used for both the forward and backward rotation. Technically any rotation matrix (and its inverse) should work - I just think this is what is commonly used due to its simplicity. The implementation does not introduce new types and is compatible with all existing quantizations. It adds 4 matrix multiplication operators in the attention, though I think some of them can be fused into the attention weights (similar to QuaRot).
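For illustration, here is a minimal sketch (not the actual llama.cpp code) of the Sylvester construction of a normalized Hadamard matrix, verifying that it is orthonormal and, being symmetric, its own inverse - which is why the same matrix works for both the forward and backward rotation:

```python
import math

def normalized_hadamard(n):
    # Sylvester construction: H_{2m} = [[H_m, H_m], [H_m, -H_m]], n a power of 2.
    assert n > 0 and n & (n - 1) == 0
    H = [[1.0]]
    while len(H) < n:
        m = len(H)
        H = [[H[i % m][j % m] * (-1.0 if (i >= m and j >= m) else 1.0)
              for j in range(2 * m)] for i in range(2 * m)]
    s = 1.0 / math.sqrt(n)  # normalize so rows are unit-length
    return [[h * s for h in row] for row in H]

H = normalized_hadamard(8)

# Orthonormal and symmetric => H @ H == I, so H is its own inverse.
for i in range(8):
    for j in range(8):
        acc = sum(H[i][t] * H[t][j] for t in range(8))
        assert abs(acc - (1.0 if i == j else 0.0)) < 1e-9
```

Every entry has the same magnitude 1/sqrt(n), which is what spreads outlier energy evenly across dimensions before quantization.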

I don't know what the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.) is. They could be important and can potentially improve further on top of this. In any case, having a better baseline at almost zero cost won't hurt. Based on the PPL data below alone, I think this should never be worse than before, though it needs a bit more evaluation.

Note: MLA is not supported

Additional information

Here are some PPL results before and after, using base models. Ideally there would be KLD data too, but I'm leaving that for people to play with and see if it looks good.

./bin/llama-perplexity -m model.gguf -f wiki.test.raw --chunks 128 -ctk f16 -ctv f16

Model: Qwen3 0.6B BF16

https://huggingface.co/Qwen/Qwen3-0.6B-Base

Baseline F16 cache: PPL = 13.6711 +/- 0.21422

| type | master | PR |
|------|--------|-----|
| q8_0 | 13.9115 | 13.6713 |
| q5_1 | 61.6992 | 14.1452 |
| q5_0 | 17.2805 | 14.2171 |
| q4_1 | 212.479 | 22.2816 |
| q4_0 | 62.0161 | 46.2503 |

Model: Qwen3 8B BF16

https://huggingface.co/Qwen/Qwen3-8B-Base

Baseline F16 cache: PPL = 7.3203 +/- 0.09901

| type | master | PR |
|------|--------|-----|
| q8_0 | 7.3172 | 7.3195 |
| q5_0 | 7.3793 | 7.3323 |
| q4_0 | 7.6451 | 7.5012 |

Model: Gemma3 4B Q8_0

https://huggingface.co/google/gemma-3-4b-pt

Baseline F16 cache: PPL = 7.6905 +/- 0.10483

| type | master | PR |
|------|--------|-----|
| q8_0 | 7.6928 | 7.6914 |
| q5_1 | 7.7133 | 7.7052 |
| q5_0 | 7.7544 | 7.7182 |
| q4_1 | 7.8095 | 7.7531 |
| q4_0 | 7.8535 | 7.7928 |

Model: Qwen3.5 4B F16

https://huggingface.co/Qwen/Qwen3.5-4B-Base

Baseline F16 cache: PPL = 8.3266 +/- 0.11623

| type | master | PR |
|------|--------|-----|
| q8_0 | 8.3272 | 8.3261 |
| q5_0 | 8.3359 | 8.3339 |
| q4_0 | 8.3475 | 8.3448 |

TODOs

  • Cache shift support ff76c67
  • Disable rotations with env variable (LLAMA_ATTN_ROT_DISABLE)

Next PRs

  • Add randomized Hadamard matrices (f0fea26)


@ubergarm
Contributor

ubergarm commented Mar 27, 2026

This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0.

I ran everything with the CPU-only backend so as to also compare https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 from @Aaryan-Kapoor and see how that specific tq3_0 implementation fares.

For reference:

  • f16 = 16 bpw
  • q4_0 = 4.5 bpw
  • tq3_0 = 3.5 bpw

Data Results: mainline llama.cpp PPL wiki.test.raw

| k-cache | v-cache | baseline master@6861f6509 | PR gg/attn-rot@e5aa067d6 | Aaryan-Kapoor/turboquant-tq3_0@1fb1fb3ab |
|---------|---------|---------------------------|---------------------------|------------------------------------------|
| f16 | f16 | 6.5788 +/- 0.04196, 688.81 tok/sec | 6.5788 +/- 0.04196, 721.01 tok/sec | |
| q4_0 | q4_0 | 6.6148 +/- 0.04216, 703.91 tok/sec | 6.5962 +/- 0.04208, 540.63 tok/sec | |
| tq3_0 | tq3_0 | | | 6.6911 +/- 0.04273, 481 tok/sec |

Experiment

Measure perplexity against wiki.test.raw varying kv-cache quantization.

Test Quant

Test Rig

  • AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (~538 GB/s via mlc)
  • Compiled CPU-only backend
```shell
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_VULKAN=OFF
cmake --build build --config Release -j $(nproc)
```

Command

```shell
# seed does nothing as no sampling here, but is fun
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    #-ctk f16 -ctv f16 \
    -ctk q4_0 -ctv q4_0 \
    --seed 1337 \
    --ctx-size 512 \
    -ub 512 -b 2048 \
    --no-mmap \
    --numa numactl \
    --threads 96 \
    --threads-batch 128
```

@AesSedai
Contributor

I've run a pretty big matrix of cache-type permutations for the Qwen3.5-9B model against the master branch and this PR branch.

The following images are the ratios vs F16/F16:

Master branch PPL and KLD
*(PPL and KLD heatmaps)*

and this PR PPL and KLD
*(PPL and KLD heatmaps)*

Here's the table of data comparing master vs this PR
| k-cache | v-cache | master PPL | master mean KLD | attn-rot PPL | attn-rot KLD |
|---------|---------|------------|-----------------|--------------|--------------|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 |
| f16 | q4_1 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 |
| f16 | q5_0 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 |
| f16 | q5_1 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 |
| f16 | q8_0 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 |
| q4_0 | f16 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 |
| q4_0 | q4_0 | 8.227048 ± 0.055766 | 0.007979 ± 0.000352 | 8.211705 ± 0.055631 | 0.005059 ± 0.000224 |
| q4_0 | q4_1 | 8.223944 ± 0.055773 | 0.007163 ± 0.000298 | 8.209285 ± 0.055635 | 0.004664 ± 0.000196 |
| q4_0 | q5_0 | 8.217618 ± 0.055691 | 0.006165 ± 0.000307 | 8.201787 ± 0.055561 | 0.003708 ± 0.000187 |
| q4_0 | q5_1 | 8.210686 ± 0.055628 | 0.006011 ± 0.000332 | 8.201859 ± 0.055564 | 0.003656 ± 0.000203 |
| q4_0 | q8_0 | 8.209199 ± 0.055617 | 0.005336 ± 0.000272 | 8.198971 ± 0.055537 | 0.003334 ± 0.000160 |
| q4_1 | f16 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 |
| q4_1 | q4_0 | 8.232653 ± 0.055833 | 0.006612 ± 0.000263 | 8.213936 ± 0.055656 | 0.004702 ± 0.000223 |
| q4_1 | q4_1 | 8.235108 ± 0.055910 | 0.006014 ± 0.000266 | 8.210796 ± 0.055641 | 0.004252 ± 0.000183 |
| q4_1 | q5_0 | 8.229521 ± 0.055827 | 0.004827 ± 0.000253 | 8.204419 ± 0.055595 | 0.003456 ± 0.000191 |
| q4_1 | q5_1 | 8.221232 ± 0.055750 | 0.004739 ± 0.000252 | 8.203722 ± 0.055583 | 0.003297 ± 0.000171 |
| q4_1 | q8_0 | 8.219816 ± 0.055743 | 0.004005 ± 0.000209 | 8.201606 ± 0.055565 | 0.002968 ± 0.000156 |
| q5_0 | f16 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 |
| q5_0 | q4_0 | 8.213422 ± 0.055674 | 0.005025 ± 0.000243 | 8.211162 ± 0.055646 | 0.003654 ± 0.000244 |
| q5_0 | q4_1 | 8.217085 ± 0.055749 | 0.004181 ± 0.000185 | 8.206240 ± 0.055612 | 0.003055 ± 0.000165 |
| q5_0 | q5_0 | 8.214116 ± 0.055692 | 0.002849 ± 0.000143 | 8.200742 ± 0.055571 | 0.002062 ± 0.000124 |
| q5_0 | q5_1 | 8.202209 ± 0.055575 | 0.002657 ± 0.000170 | 8.200578 ± 0.055565 | 0.002028 ± 0.000141 |
| q5_0 | q8_0 | 8.201038 ± 0.055579 | 0.002310 ± 0.000168 | 8.197975 ± 0.055550 | 0.001553 ± 0.000095 |
| q5_1 | f16 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 |
| q5_1 | q4_0 | 8.208856 ± 0.055624 | 0.004611 ± 0.000216 | 8.210008 ± 0.055633 | 0.003729 ± 0.000258 |
| q5_1 | q4_1 | 8.210093 ± 0.055686 | 0.003773 ± 0.000176 | 8.206406 ± 0.055621 | 0.002889 ± 0.000152 |
| q5_1 | q5_0 | 8.207489 ± 0.055626 | 0.002754 ± 0.000220 | 8.200513 ± 0.055562 | 0.001825 ± 0.000103 |
| q5_1 | q5_1 | 8.195263 ± 0.055514 | 0.002626 ± 0.000226 | 8.200874 ± 0.055575 | 0.001821 ± 0.000121 |
| q5_1 | q8_0 | 8.194897 ± 0.055520 | 0.001957 ± 0.000178 | 8.198389 ± 0.055547 | 0.001750 ± 0.000182 |
| q8_0 | f16 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 |
| q8_0 | q4_0 | 8.209297 ± 0.055632 | 0.003642 ± 0.000159 | 8.208469 ± 0.055623 | 0.002935 ± 0.000204 |
| q8_0 | q4_1 | 8.210675 ± 0.055697 | 0.002920 ± 0.000116 | 8.202587 ± 0.055587 | 0.002519 ± 0.000149 |
| q8_0 | q5_0 | 8.207077 ± 0.055629 | 0.001715 ± 0.000095 | 8.197825 ± 0.055538 | 0.001263 ± 0.000053 |
| q8_0 | q5_1 | 8.196398 ± 0.055535 | 0.001484 ± 0.000113 | 8.200065 ± 0.055566 | 0.001187 ± 0.000055 |
| q8_0 | q8_0 | 8.196226 ± 0.055531 | 0.000866 ± 0.000048 | 8.196227 ± 0.055532 | 0.000757 ± 0.000039 |

And attached here is the full raw run data for every invocation:
analysis-results.zip

@Dampfinchen

Dampfinchen commented Mar 27, 2026

> This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0. [...]

Interesting results, which indicate that Q4_0, even before this PR, is superior to tq3_0. However, as a word of caution, this is likely due to the very experimental and early implementation in that specific fork and not indicative of the actual performance of TurboQuant, which is supposed to be effectively lossless compared to the fp16 KV cache. That is not the case here.

@ggerganov
Member Author

ggerganov commented Mar 27, 2026

I think we can actually rotate the V tensor using smaller matrices (64 x 64) which should result in better quality of the V cache. We cannot do that for the Q and K because it would not preserve the dot product.

Pushed a change to do that 832e326 and just from a quick PPL sanity check it looks slightly better.

Note, using 64 instead of 32 because the Metal matrix multiplication kernels require ne00 to be at least 64.

Edit:

> We cannot do that for the Q and K because it would not preserve the dot product.

On second thought, I think it does preserve it. Something to try too. Here is the patch:

More rotations for Q and K
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 8dfc92b71..84bcf26be 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2096,8 +2096,9 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2105,7 +2106,7 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             ggml_is_quantized(mctx_cur->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
@@ -2453,8 +2454,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2462,7 +2464,7 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             ggml_is_quantized(mctx_cur->get_base()->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
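The block-wise idea can be sanity-checked with a small sketch (hypothetical helper names; block size 4 instead of 64 for brevity): applying the same orthonormal matrix independently to each block of Q and K forms a block-diagonal orthonormal transform, which still preserves the full dot product.

```python
import math
import random

def normalized_hadamard(n):
    # Sylvester construction, scaled by 1/sqrt(n) -- n must be a power of 2.
    H = [[1.0]]
    while len(H) < n:
        m = len(H)
        H = [[H[i % m][j % m] * (-1.0 if (i >= m and j >= m) else 1.0)
              for j in range(2 * m)] for i in range(2 * m)]
    s = 1.0 / math.sqrt(n)
    return [[h * s for h in row] for row in H]

def rotate_blocks(v, R):
    # Apply the same rotation R to each consecutive block of len(R) elements.
    b = len(R)
    out = []
    for off in range(0, len(v), b):
        blk = v[off:off + b]
        out.extend(sum(R[i][j] * blk[j] for j in range(b)) for i in range(b))
    return out

random.seed(0)
R = normalized_hadamard(4)  # stand-in for the 64x64 matrix in the PR
q = [random.uniform(-1, 1) for _ in range(8)]
k = [random.uniform(-1, 1) for _ in range(8)]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
assert abs(dot(rotate_blocks(q, R), rotate_blocks(k, R)) - dot(q, k)) < 1e-9
```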

@Rotatingxenomorph

> This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0. [...]

> Interesting results, which indicate that Q4_0, even before this PR, is superior to tq3_0. [...]

It can't approach turboquant's performance because it's missing the QJL part of turboquant. (I think?)

@AesSedai
Contributor

I re-ran the suite against the updated commit; it looks like the KLD has generally improved a little across the board!

Heatmap results:

*(heatmap images)*

Details data table
| k-cache | v-cache | PPL | mean KLD |
|---------|---------|-----|----------|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.205849 ± 0.055607 | 0.002518 ± 0.000094 |
| f16 | q4_1 | 8.206960 ± 0.055627 | 0.002308 ± 0.000088 |
| f16 | q5_0 | 8.197224 ± 0.055540 | 0.001275 ± 0.000052 |
| f16 | q5_1 | 8.198303 ± 0.055550 | 0.001126 ± 0.000043 |
| f16 | q8_0 | 8.196163 ± 0.055536 | 0.000838 ± 0.000058 |
| q4_0 | f16 | 8.202777 ± 0.055567 | 0.003475 ± 0.000182 |
| q4_0 | q4_0 | 8.208411 ± 0.055610 | 0.004782 ± 0.000194 |
| q4_0 | q4_1 | 8.210197 ± 0.055641 | 0.004592 ± 0.000184 |
| q4_0 | q5_0 | 8.199608 ± 0.055539 | 0.003880 ± 0.000205 |
| q4_0 | q5_1 | 8.200301 ± 0.055547 | 0.003644 ± 0.000190 |
| q4_0 | q8_0 | 8.199085 ± 0.055545 | 0.003346 ± 0.000183 |
| q4_1 | f16 | 8.202827 ± 0.055570 | 0.003155 ± 0.000182 |
| q4_1 | q4_0 | 8.214026 ± 0.055670 | 0.004670 ± 0.000214 |
| q4_1 | q4_1 | 8.213453 ± 0.055686 | 0.004174 ± 0.000173 |
| q4_1 | q5_0 | 8.200692 ± 0.055565 | 0.003446 ± 0.000189 |
| q4_1 | q5_1 | 8.199583 ± 0.055558 | 0.003457 ± 0.000206 |
| q4_1 | q8_0 | 8.200708 ± 0.055569 | 0.002958 ± 0.000168 |
| q5_0 | f16 | 8.199265 ± 0.055551 | 0.001553 ± 0.000099 |
| q5_0 | q4_0 | 8.207457 ± 0.055617 | 0.003247 ± 0.000161 |
| q5_0 | q4_1 | 8.208826 ± 0.055647 | 0.002947 ± 0.000144 |
| q5_0 | q5_0 | 8.200150 ± 0.055558 | 0.002054 ± 0.000158 |
| q5_0 | q5_1 | 8.199639 ± 0.055568 | 0.001954 ± 0.000120 |
| q5_0 | q8_0 | 8.197875 ± 0.055554 | 0.001521 ± 0.000093 |
| q5_1 | f16 | 8.200392 ± 0.055557 | 0.001585 ± 0.000137 |
| q5_1 | q4_0 | 8.210025 ± 0.055650 | 0.003052 ± 0.000151 |
| q5_1 | q4_1 | 8.207874 ± 0.055632 | 0.002765 ± 0.000128 |
| q5_1 | q5_0 | 8.200399 ± 0.055565 | 0.001849 ± 0.000126 |
| q5_1 | q5_1 | 8.199001 ± 0.055561 | 0.001839 ± 0.000146 |
| q5_1 | q8_0 | 8.199596 ± 0.055562 | 0.001544 ± 0.000125 |
| q8_0 | f16 | 8.196933 ± 0.055526 | 0.000837 ± 0.000047 |
| q8_0 | q4_0 | 8.205234 ± 0.055600 | 0.002568 ± 0.000141 |
| q8_0 | q4_1 | 8.206657 ± 0.055621 | 0.002230 ± 0.000084 |
| q8_0 | q5_0 | 8.197552 ± 0.055547 | 0.001229 ± 0.000046 |
| q8_0 | q5_1 | 8.197543 ± 0.055546 | 0.001300 ± 0.000117 |
| q8_0 | q8_0 | 8.196556 ± 0.055535 | 0.000799 ± 0.000042 |

Full run archives:
attn-rot-7711b3a36.zip

@Dampfinchen

> I re-ran the suite against the updated new commit, looks like the KLD has generally improved a little bit across the board! [...]

What about higher context, for example 100K? I believe that's where KV cache quantization actually harms the model, due to accumulated multiplication errors.

@ggerganov
Member Author

@AesSedai Thanks for the results. It seems important to track the KLD rather than PPL (maybe more significant for Qwen3.5).

I did some additional tests with randomized Hadamard matrices (as suggested by @sashkboos in #6444 (comment)). Will follow-up in next PRs.

Comment thread on `src/llama-kv-cache.cpp`:

```cpp
auto * data = (float *) tensor->data;

data[0*n + 0] = 1.0 / sqrtf(n);
```
Member

This always gets me, at least this one makes esthetic sense. :)

@am17an
Contributor

am17an commented Mar 29, 2026

There is a slowdown, which is expected; however, we should probably have a flag to opt out?

GGML_CUDA=ON ./scripts/compare-commits.sh upstream/master gg/attn-rot llama-bench -ctk q4_0 -ctv q4_0 -m /opt/models/qwen_3-30b3a-q4_0.gguf -fa 1 -d 0,2048,4096,8192,16384,32768

| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|-------|------|-------------|-----------------|---------|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7603.68 | 7409.35 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7085.71 | 6912.28 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6586.79 | 6449.26 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5844.15 | 5694.56 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4704.12 | 4612.89 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3398.99 | 3346.66 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 229.42 | 202.37 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 210.24 | 188.03 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 197.10 | 177.06 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 175.60 | 159.45 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 145.26 | 134.28 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 101.33 | 94.13 | 0.93 |

@handpickencounter

handpickencounter commented Mar 29, 2026

I don't know what is the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.). They could be important and can potentially improve further on top of this.

PolarQuant is just a rotation by a random matrix that spreads the energy of any outliers across all dimensions prior to quantization. A normalized Hadamard matrix may or may not produce similar results.

QJL is extremely important, however. PolarQuant (and your Hadamard) gives biased attention scores (despite the good-seeming MSE), which results in rank flips. QJL uses the 'unreasonable effectiveness of random projections' to give unbiased results. The theory (more accurately, the lemma) is that randomly projecting a vector to lower dimensions preserves pairwise distances (and therefore dot products) in expectation. TurboQuant quantizes the error residual all the way down to 1 bit (the sign), which somehow works well enough.
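A rough illustration of the unbiasedness claim (a sketch of the underlying lemma, not the actual QJL algorithm): projecting with a random Gaussian matrix whose entries are N(0, 1/k) preserves dot products in expectation, so averaging over many random projections recovers the true dot product.

```python
import random

random.seed(42)
d, k, trials = 16, 8, 4000

x = [random.uniform(-1, 1) for _ in range(d)]
y = [random.uniform(-1, 1) for _ in range(d)]
true_dot = sum(a * b for a, b in zip(x, y))

est = 0.0
for _ in range(trials):
    # Random projection S with N(0, 1/k) entries: E[<Sx, Sy>] = <x, y>
    S = [[random.gauss(0.0, 1.0 / k ** 0.5) for _ in range(d)] for _ in range(k)]
    Sx = [sum(S[i][j] * x[j] for j in range(d)) for i in range(k)]
    Sy = [sum(S[i][j] * y[j] for j in range(d)) for i in range(k)]
    est += sum(a * b for a, b in zip(Sx, Sy))
est /= trials

# The estimator is unbiased: its mean concentrates on the true dot product.
assert abs(est - true_dot) < 0.2
```

Each individual projection is noisy, but there is no systematic bias, which is the property QJL exploits for unbiased attention scores.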

The best technical yet intuitive explanation I saw on the subject so far is by some AI Researcher from Amazon: https://darshanfofadiya.com/research-papers/turboquant/

@CISC
Member

CISC commented Mar 29, 2026

There is a slowdown which is expected, however probably we should have a flag to opt out?

I don't think we should. The only reason to quantize the KV cache is to save memory; if that comes at the cost of speed, so be it. An option to reduce quality does not make sense (unless you think this is not a general quality improvement).

@am17an
Contributor

am17an commented Mar 29, 2026

My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues. This change alters the outputs of the LLM (even if they are materially better), so, just as an example, someone with downstream tests would see those tests break.

the only reason to quantize kv-cache is to save memory

I think when using -nkvo it will be faster to transfer a quantized cache to the GPU when offloading, so it can be used for performance reasons too. Also, the CUDA fattn vec kernel uses dp4a for q8_0, which has 2x the throughput vs f16. So that statement is just wrong IMO.


@CISC
Member

CISC commented Mar 29, 2026

> My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues.

You could say we are in the business of making breaking changes and that git bisect is our option, but I digress. :)

If anything, I guess adding an env-var for some grace period would be sufficient.

> This change changes outputs of the LLM (even if they are materially better), so just as an example someone having tests downstream would see tests break.

Well, until you pointed out the below I would have said that no such test were likely to exist. :)

> I think when using -nkvo, it will be faster to transfer a quantized cache to the GPU when offloading, so it can be used for a performance reason also. Also the CUDA fattn vec kernel uses dp4a for q8_0 which is 2x the throughput vs f16. So that statement is just wrong IMO.

@Dampfinchen

> There is a slowdown which is expected, however probably we should have a flag to opt out?

> I don't think we should, the only reason to quantize kv-cache is to save memory, if that comes at the cost of speed, so be it, an option to reduce quality does not make sense (unless you think this is not a general quality improvement).

I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

@CISC
Member

CISC commented Mar 29, 2026

> I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

You'd most likely gain much more than you lose, so calling it counterproductive is perhaps a bit far fetched.

@am17an
Contributor

am17an commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

Actually, I'm not sure it would be worth it to be on by default for q8_0, since the benefits are much smaller compared to q4 and the performance loss is significant.

Details
| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|-------|------|-------------|-----------------|---------|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7576.02 | 7372.24 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7045.27 | 6838.27 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6580.31 | 6435.85 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5792.41 | 5661.65 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4660.49 | 4584.87 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3349.27 | 3300.92 | 0.99 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 230.59 | 203.59 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 212.31 | 189.41 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 199.79 | 179.40 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 179.48 | 162.80 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 149.03 | 137.47 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 105.50 | 98.55 | 0.93 |

@CISC
Member

CISC commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

> Actually I'm not sure it would be worth it to be on by default for q8_0 since the benefits are much lesser compared to q4 and the performance loss is significant

Granted, it's probably pointless for q8_0, so perhaps add a check.

@ggerganov
Member Author

ggerganov commented Mar 29, 2026

It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. I did a few runs today with gpt-oss-20b on low reasoning (for faster evaluation):

Here is the table with the hyperlinks preserved for each score:

| eval | KV type | score (no rot) | score (rot) |
|------|---------|----------------|-------------|
| AIME25 x8 | F16 | 37.9% | |
| AIME25 x8 | Q8_0 | 31.7% | 37.1% |
| AIME25 x8 | Q5_1 | 30.8% | 32.5% |
| AIME25 x8 | Q5_0 | 25.4% | 32.5% |
| AIME25 x8 | Q4_1 | 18.3% | 28.3% |
| AIME25 x8 | Q4_0 | 2.0% | 21.7% |

Used the new llama-eval script to perform these evals: #21152

It's interesting that with Q4_0 KV type without applying rotations, the model completely breaks down - it cannot solve even 1 problem from the 240 attempts. Note that it is still reasoning (i.e. it's not producing garbage outputs). The final answers and logic are just incorrect.

Example of reasoning trace with KV type `Q4_0` without rotations:

*(image)*

Regarding Q8_0 - I think there is some non-negligible quality benefit of applying the rotation here too (31.7% -> 37.1%). Though we need a bit more stats to confirm that. Atm, I am not convinced that disabling the rotations for Q8_0 is the better option, but I'm still considering it.

For reference, the expected score of this eval per the gpt-oss model card is 37.1%:

*(image)*

https://arxiv.org/pdf/2508.10925

@Dampfinchen

> It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. I did a few runs today with gpt-oss-20b on low reasoning (for faster evaluation): [...]

Yeah, based on the score there is a clear benefit from rotating activations for q8_0, and I think it makes sense to have them on by default: the purpose of Q8_0 is to save memory for the KV cache at the same quality, and with attn-rot it is noticeably closer to that goal. Furthermore, I'm not sure if AIME25 tests at long context as well; quality differences might become even more noticeable at longer context.

Performance is a concern, however. In the end, I think the best course of action would be to make attn-rot an opt-in/opt-out option with a simple flag.

didlawowo added a commit to didlawowo/llama.cpp that referenced this pull request Apr 4, 2026
- Register GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 in kv_cache_types
  so --cache-type-k turbo3 / --cache-type-v turbo3 are recognized
- Fix double V un-rotation: upstream PR ggml-org#21038 (attn_rot_v) already
  handles Hadamard rotation for quantized KV cache types including
  turbo. Make TurboQuant WHT fallback only when upstream rotation
  is not active (else if instead of sequential if blocks)
icex added a commit to icex/llama.cpp that referenced this pull request Apr 5, 2026
Includes:
- fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)
- memory: respect unified KV cache in hybrid memory for eval tasks (ggml-org#21224)
- fix: CUDA FA kernel selection, head dimension 512 support
- rotate activations for better quantization (ggml-org#21038)
- Various parser, jinja, webui, and CI fixes

Conflicts resolved:
- llama-kv-cache.cpp: keep TurboQuant InnerQ stubs + upstream Hadamard helpers
- llama-graph.cpp: keep TurboQuant V-padding + upstream self_v_rot
- fattn-tile.cu: add upstream D=512 before TurboQuant HIP guard
- fattn.cu: combine D=512 (upstream) + D=640 (TurboQuant) exclusions
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Apr 6, 2026
Notable upstream changes:
- ggml-org#21038: rotate activations for better KV cache quantization
- ggml-org#21074: generic NVFP4 MMQ kernel
- ggml-org#21271: fix FA kernel selection logic
- ggml-org#21238: fix mmvq mmid kernel selection
- ggml-org#20998: FA support for head dim 512
- ggml-org#20920: docker cuda12 bump to 12.9.1

Conflicts resolved:
- fattn.cu: took upstream (adds head_dim 512 exclusion)
- mmq.cu/mmq.cuh: kept both TQ3 types + upstream additions
- llama-graph.cpp: kept both turbo WHT + upstream v_rot
- docker.yml: took upstream
- tests/CMakeLists.txt: took upstream
@Dampfinchen

It appears attention rotations are currently disabled for the Gemma 4 architecture.

#21394

spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
…V cache quants

Adds graph-level Hadamard rotation for Q/K/V when using standard quant types
(q4_0, q8_0, etc.) in the KV cache. Dramatically improves quantized KV quality.
Our turbo V un-rotation takes priority when turbo types are used.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
…tion

Turbo types have their own FWHT rotation in set-rows/fattn. The generic
rotation from PR ggml-org#21038 would double-rotate, corrupting results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
* llama : rotate activations for better quantization

* cont : rotate V more + refactor

* cont : rotate caches separately + support non-power-of-2 head sizes

* cont : simplify

* cont : add reference for V rotation

* cont : refactor

* cont : support context shift

* cont : consolidate

* cont : dedup + allow different types for the rotation matrix

* cont : add env variable to disable rotation

* cont : simplify attn rot kv cache logic + rename env

* cont : pre-compute the Hadamard matrices
mmtmu pushed a commit to mmtmu/llama.cpp that referenced this pull request Apr 20, 2026
The online Hadamard rotation (PR ggml-org#21038) exists to suppress quantization
error. On an F16 KV cache it is a no-op and costs one extra mul_mat per
layer. Remove the force-enable for GLM_DSA and DEEPSEEK2 and let the KV
cache follow the standard rule (quantized type + aligned head dim).
Guard the Hadamard mul_mats in the builder and the set_input_k_rot call
on self_k_rot_lid being non-null, since it can legitimately be null now.
@vektorprime

vektorprime commented Apr 26, 2026

Just wanted to leave this image here since this post is referenced quite a bit:

The following results come from the official SLANG website, but they show GPTQ and AIME to be sensitive to KV quant. It's NVFP4 but still very relevant, I think. @ggerganov maybe you can incorporate that into your future testing suite like you did with the python-based AIME testing.

[image]

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
dogkeeper886 added a commit to dogkeeper886/ollama37 that referenced this pull request May 3, 2026
Wires the Hadamard scaffold (landed earlier on this branch) into
ml/nn/attention.go. When OLLAMA_KV_ROTATE=1 and the head dim is a
multiple of 64, Q/K/V are rotated by a fixed normalized 64×64 Sylvester
matrix before cache.Put; the attention output is rotated again to undo
V's rotation (Sylvester H is symmetric so H·H = I).

Math: orthogonal H satisfies (QH)(KH)^T = QK^T so attention scores are
preserved. Storing rotated K/V in the cache is the whole point — it
smooths the per-coordinate distribution before quantization, which is
what closes the q4_0/q8_0 quality gap to fp16 (upstream PR
ggml-org/llama.cpp#21038, commit 744c0c73).
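The two properties the message above relies on (the normalized Sylvester matrix is symmetric, so H·H = I, and orthogonal rotation preserves dot products) can be checked numerically; this quick sketch is independent of the Go helpers:

```python
import numpy as np

def sylvester(n):
    # Normalized Sylvester-Hadamard matrix; entries are +-1/sqrt(n).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = sylvester(64)
assert np.allclose(H, H.T)             # symmetric
assert np.allclose(H @ H, np.eye(64))  # so applying it twice is the identity

rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)
# Orthogonal rotation preserves dot products: (Hq) . (Hk) == q . k,
# which is why attention scores are unchanged by rotating Q and K.
assert np.isclose((H @ q) @ (H @ k), q @ k)
print("ok")
```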

Helpers added in ml/nn/hadamard.go:
- hadamard64 — package-level cached float slice (computed once)
- blockRotate(ctx, t, h) — applies H_64 block-diagonally along Dim(0)
- hadamardTensor(ctx) — materializes hadamard64 as a backend tensor
- noteIncompatibleHeadDim — one-shot slog.Warn per head_dim when the
  flag is set but the dim isn't compatible

Default off: when OLLAMA_KV_ROTATE is unset (the default), behavior is
byte-identical to before — every gate short-circuits on the envconfig
check before any rotation work happens.

Design doc updated to reflect what landed and to note two known
limitations: (1) the gate reads the env each call, so mid-process toggle
would mix rotated/unrotated entries; (2) the gate checks Q's head dim
only, so head-dim asymmetry in V is not yet handled.

llamarunner cherry-pick of upstream commit 744c0c73 deferred to a
separate follow-up.
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
Investigation of the upstream llama.cpp attention rotation (ggml-org#21038) in our
fork, ending in a per-side env-knob policy (LLAMA_ATTN_ROT_K_OVERRIDE /
LLAMA_ATTN_ROT_V_OVERRIDE). The default is unchanged from the original fork
(rotation OFF on both sides); the contribution is per-side control plus a
matrix documenting which models want which knob.

Headline finding while running the matrix: corpus PPL on wikitext-2 is
unreliable for KV-quantization evaluation on gemma-class instruct models.
Quantized KV scores 7-42% BELOW the fp16-KV baseline on three gemma-4-it
GGUFs; KLD against the fp16-KV reference points the opposite direction.
Reproduced at q8/q8 and q8/turbo4, on wikitext-2 and wikitext-103, ctx=512
and ctx=2048. Cleanest case: t4/t4 K-only PPL +52.7% but KLD -4.9% (best
on row).

The PPL/KLD ranking inversion is a continuum (small magnitude on Qwen2.5-7B,
large on gemma-4 26B-A4B), not gemma-only. Recommendation: use KLD against
fp16-KV logits as the oracle for KV-quant evaluation on this model class.

Independently observed prior: vektorprime in ggml-org#21394, AesSedai's tables in
ggml-org#21038, ggerganov's "track KLD rather than PPL" comment in the same PR,
and localbench. This write-up adds the controlled per-side measurement, the
cross-format / cross-corpus reproduction, the bit-exact-zero KLD noise
floor on Metal, and the resulting engineering policy.
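For reference, the per-token KLD oracle advocated above is cheap to compute once fp16-KV logits are saved; the sketch below uses hypothetical 32-way logits, not output from the llama.cpp tools:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kld(p, q):
    # KL(p || q) for one token's next-token distribution.
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
ref  = softmax(rng.normal(size=32))          # stand-in for fp16-KV logits
test = softmax(rng.normal(size=32) * 1.05)   # stand-in for quantized-KV logits
print(kld(ref, ref) == 0.0)   # True: identical distributions diverge by zero
print(kld(ref, test) >= 0.0)  # True: Gibbs' inequality
```

Unlike corpus PPL, this compares the quantized run directly against the fp16 reference distribution token by token, so it cannot be fooled by a quantized model that is confidently different.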