
llama : rotate activations for better quantization#21038

Merged
ggerganov merged 12 commits into `master` from `gg/attn-rot` on Apr 1, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Mar 26, 2026

Overview

In anticipation of the incoming flood of vibe-generated PRs implementing TurboQuant, I'm raising the baseline a bit with a very simple interpretation of the idea of using a Hadamard transform to reduce outliers in the attention and improve quantization quality:

  • Rotate input Q, K, V and store in cache
  • Perform attention with the rotated Q, K, V
  • Rotate back the output vector of the attention

This works because:

  • Rotation does not affect the dot product of 2 vectors
  • The output vector is a linear combination of rotated vectors
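The first point can be sketched with a small, self-contained example (a minimal sketch, not the PR's code; any orthonormal matrix works, a 2x2 rotation is used here for brevity):

```python
import math

# A hypothetical 2x2 rotation matrix (orthonormal) -- any angle works.
theta = 0.7
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def rotate(R, v):
    # Matrix-vector product R @ v
    return [sum(R[i][j] * v[j] for j in range(len(v))) for i in range(len(R))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [3.0, -1.0]
k = [0.5, 2.0]

# dot(Rq, Rk) == dot(q, k) because R^T R = I
assert abs(dot(rotate(R, q), rotate(R, k)) - dot(q, k)) < 1e-9
```

Since attention scores are dot products between Q and K rows, rotating both before caching leaves the scores unchanged up to quantization error.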

The implementation is very simple and backend-agnostic: use a Hadamard matrix of size n x n, normalized by 1/sqrt(n) so that it is orthonormal and can be used for both the forward and backward rotation. Technically any rotation matrix (and its inverse) should work - I just think this is what is commonly used due to its simplicity. The implementation does not introduce new types and is compatible with all existing quantizations. It adds 4 matrix multiplication operators in the attention, though I think some of them can be fused into the attention weights (similar to QuaRot).
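For illustration, here is a minimal sketch (not the actual llama.cpp code) of the Sylvester construction of a normalized Hadamard matrix, verifying that it is orthonormal and, being symmetric, its own inverse - which is why the same matrix works for both the forward and backward rotation:

```python
import math

def normalized_hadamard(n):
    # Sylvester construction: H_{2m} = [[H_m, H_m], [H_m, -H_m]], n a power of 2.
    assert n > 0 and n & (n - 1) == 0
    H = [[1.0]]
    while len(H) < n:
        m = len(H)
        H = [[H[i % m][j % m] * (-1.0 if (i >= m and j >= m) else 1.0)
              for j in range(2 * m)] for i in range(2 * m)]
    s = 1.0 / math.sqrt(n)  # normalize so rows are unit-length
    return [[h * s for h in row] for row in H]

H = normalized_hadamard(8)

# Orthonormal and symmetric => H @ H == I, so H is its own inverse.
for i in range(8):
    for j in range(8):
        acc = sum(H[i][t] * H[t][j] for t in range(8))
        assert abs(acc - (1.0 if i == j else 0.0)) < 1e-9
```

Every entry has the same magnitude 1/sqrt(n), which is what spreads outlier energy evenly across dimensions before quantization.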

I don't know what the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.) is. They could be important and can potentially improve further on top of this. In any case, having a better baseline at almost zero cost won't hurt. Based on the PPL data below alone, I think this should never be worse than before, though it needs a bit more evaluation.

Note: MLA is not supported

Additional information

Here are some PPL results before and after, using base models. Ideally there would be KLD data too, but I'm leaving that for people to play with and see if it looks good.

./bin/llama-perplexity -m model.gguf -f wiki.test.raw --chunks 128 -ctk f16 -ctv f16

Model: Qwen3 0.6B BF16

https://huggingface.co/Qwen/Qwen3-0.6B-Base

Baseline F16 cache: PPL = 13.6711 +/- 0.21422

| type | master | PR |
|------|--------|-----|
| q8_0 | 13.9115 | 13.6713 |
| q5_1 | 61.6992 | 14.1452 |
| q5_0 | 17.2805 | 14.2171 |
| q4_1 | 212.479 | 22.2816 |
| q4_0 | 62.0161 | 46.2503 |

Model: Qwen3 8B BF16

https://huggingface.co/Qwen/Qwen3-8B-Base

Baseline F16 cache: PPL = 7.3203 +/- 0.09901

| type | master | PR |
|------|--------|-----|
| q8_0 | 7.3172 | 7.3195 |
| q5_0 | 7.3793 | 7.3323 |
| q4_0 | 7.6451 | 7.5012 |

Model: Gemma3 4B Q8_0

https://huggingface.co/google/gemma-3-4b-pt

Baseline F16 cache: PPL = 7.6905 +/- 0.10483

| type | master | PR |
|------|--------|-----|
| q8_0 | 7.6928 | 7.6914 |
| q5_1 | 7.7133 | 7.7052 |
| q5_0 | 7.7544 | 7.7182 |
| q4_1 | 7.8095 | 7.7531 |
| q4_0 | 7.8535 | 7.7928 |

Model: Qwen3.5 4B F16

https://huggingface.co/Qwen/Qwen3.5-4B-Base

Baseline F16 cache: PPL = 8.3266 +/- 0.11623

| type | master | PR |
|------|--------|-----|
| q8_0 | 8.3272 | 8.3261 |
| q5_0 | 8.3359 | 8.3339 |
| q4_0 | 8.3475 | 8.3448 |

TODOs

  • Cache shift support ff76c67
  • Disable rotations with env variable (LLAMA_ATTN_ROT_DISABLE)

Next PRs

  • Add randomized Hadamard matrices (f0fea26)


@ubergarm
Contributor

ubergarm commented Mar 27, 2026

This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0.

I ran everything with the CPU-only backend so as to also compare https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 from @Aaryan-Kapoor and see how that specific tq3_0 implementation fares.

For reference:

  • f16 = 16 bpw
  • q4_0 = 4.5 bpw
  • tq3_0 = 3.5 bpw

Data Results: mainline llama.cpp PPL wiki.test.raw

| k-cache | v-cache | baseline master@6861f6509 | PR gg/attn-rot@e5aa067d6 | Aaryan-Kapoor/turboquant-tq3_0@1fb1fb3ab |
|---------|---------|---------------------------|---------------------------|------------------------------------------|
| f16 | f16 | 6.5788 +/- 0.04196, 688.81 tok/sec | 6.5788 +/- 0.04196, 721.01 tok/sec | |
| q4_0 | q4_0 | 6.6148 +/- 0.04216, 703.91 tok/sec | 6.5962 +/- 0.04208, 540.63 tok/sec | |
| tq3_0 | tq3_0 | | | 6.6911 +/- 0.04273, 481 tok/sec |

Experiment

Measure perplexity against wiki.test.raw varying kv-cache quantization.

Test Quant

Test Rig

  • AMD EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s NPS1 Single Socket (~538 GB/s via mlc)
  • Compiled CPU-only backend
```shell
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_VULKAN=OFF
cmake --build build --config Release -j $(nproc)
```

Command

```shell
# seed does nothing as no sampling here, but is fun
model=/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    #-ctk f16 -ctv f16 \
    -ctk q4_0 -ctv q4_0 \
    --seed 1337 \
    --ctx-size 512 \
    -ub 512 -b 2048 \
    --no-mmap \
    --numa numactl \
    --threads 96 \
    --threads-batch 128
```

@AesSedai
Contributor

I've run a pretty big matrix of cache-type permutations for the Qwen3.5-9B model against the master branch and this PR branch.

The following images are the ratios vs F16/F16:

Master branch PPL and KLD
*(PPL and KLD heatmaps)*

and this PR PPL and KLD
*(PPL and KLD heatmaps)*

Here's the table of data comparing master vs this PR
| k-cache | v-cache | master PPL | master mean KLD | attn-rot PPL | attn-rot KLD |
|---------|---------|------------|-----------------|--------------|--------------|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 | 8.208920 ± 0.055630 | 0.003747 ± 0.000182 |
| f16 | q4_1 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 | 8.212052 ± 0.055711 | 0.002985 ± 0.000140 |
| f16 | q5_0 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 | 8.206032 ± 0.055617 | 0.001752 ± 0.000111 |
| f16 | q5_1 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 | 8.195687 ± 0.055524 | 0.001393 ± 0.000111 |
| f16 | q8_0 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 | 8.196533 ± 0.055536 | 0.000804 ± 0.000037 |
| q4_0 | f16 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 | 8.212304 ± 0.055635 | 0.005426 ± 0.000271 |
| q4_0 | q4_0 | 8.227048 ± 0.055766 | 0.007979 ± 0.000352 | 8.211705 ± 0.055631 | 0.005059 ± 0.000224 |
| q4_0 | q4_1 | 8.223944 ± 0.055773 | 0.007163 ± 0.000298 | 8.209285 ± 0.055635 | 0.004664 ± 0.000196 |
| q4_0 | q5_0 | 8.217618 ± 0.055691 | 0.006165 ± 0.000307 | 8.201787 ± 0.055561 | 0.003708 ± 0.000187 |
| q4_0 | q5_1 | 8.210686 ± 0.055628 | 0.006011 ± 0.000332 | 8.201859 ± 0.055564 | 0.003656 ± 0.000203 |
| q4_0 | q8_0 | 8.209199 ± 0.055617 | 0.005336 ± 0.000272 | 8.198971 ± 0.055537 | 0.003334 ± 0.000160 |
| q4_1 | f16 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 | 8.219674 ± 0.055730 | 0.004174 ± 0.000239 |
| q4_1 | q4_0 | 8.232653 ± 0.055833 | 0.006612 ± 0.000263 | 8.213936 ± 0.055656 | 0.004702 ± 0.000223 |
| q4_1 | q4_1 | 8.235108 ± 0.055910 | 0.006014 ± 0.000266 | 8.210796 ± 0.055641 | 0.004252 ± 0.000183 |
| q4_1 | q5_0 | 8.229521 ± 0.055827 | 0.004827 ± 0.000253 | 8.204419 ± 0.055595 | 0.003456 ± 0.000191 |
| q4_1 | q5_1 | 8.221232 ± 0.055750 | 0.004739 ± 0.000252 | 8.203722 ± 0.055583 | 0.003297 ± 0.000171 |
| q4_1 | q8_0 | 8.219816 ± 0.055743 | 0.004005 ± 0.000209 | 8.201606 ± 0.055565 | 0.002968 ± 0.000156 |
| q5_0 | f16 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 | 8.202763 ± 0.055572 | 0.002223 ± 0.000157 |
| q5_0 | q4_0 | 8.213422 ± 0.055674 | 0.005025 ± 0.000243 | 8.211162 ± 0.055646 | 0.003654 ± 0.000244 |
| q5_0 | q4_1 | 8.217085 ± 0.055749 | 0.004181 ± 0.000185 | 8.206240 ± 0.055612 | 0.003055 ± 0.000165 |
| q5_0 | q5_0 | 8.214116 ± 0.055692 | 0.002849 ± 0.000143 | 8.200742 ± 0.055571 | 0.002062 ± 0.000124 |
| q5_0 | q5_1 | 8.202209 ± 0.055575 | 0.002657 ± 0.000170 | 8.200578 ± 0.055565 | 0.002028 ± 0.000141 |
| q5_0 | q8_0 | 8.201038 ± 0.055579 | 0.002310 ± 0.000168 | 8.197975 ± 0.055550 | 0.001553 ± 0.000095 |
| q5_1 | f16 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 | 8.195862 ± 0.055514 | 0.001843 ± 0.000165 |
| q5_1 | q4_0 | 8.208856 ± 0.055624 | 0.004611 ± 0.000216 | 8.210008 ± 0.055633 | 0.003729 ± 0.000258 |
| q5_1 | q4_1 | 8.210093 ± 0.055686 | 0.003773 ± 0.000176 | 8.206406 ± 0.055621 | 0.002889 ± 0.000152 |
| q5_1 | q5_0 | 8.207489 ± 0.055626 | 0.002754 ± 0.000220 | 8.200513 ± 0.055562 | 0.001825 ± 0.000103 |
| q5_1 | q5_1 | 8.195263 ± 0.055514 | 0.002626 ± 0.000226 | 8.200874 ± 0.055575 | 0.001821 ± 0.000121 |
| q5_1 | q8_0 | 8.194897 ± 0.055520 | 0.001957 ± 0.000178 | 8.198389 ± 0.055547 | 0.001750 ± 0.000182 |
| q8_0 | f16 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 | 8.197374 ± 0.055531 | 0.000923 ± 0.000075 |
| q8_0 | q4_0 | 8.209297 ± 0.055632 | 0.003642 ± 0.000159 | 8.208469 ± 0.055623 | 0.002935 ± 0.000204 |
| q8_0 | q4_1 | 8.210675 ± 0.055697 | 0.002920 ± 0.000116 | 8.202587 ± 0.055587 | 0.002519 ± 0.000149 |
| q8_0 | q5_0 | 8.207077 ± 0.055629 | 0.001715 ± 0.000095 | 8.197825 ± 0.055538 | 0.001263 ± 0.000053 |
| q8_0 | q5_1 | 8.196398 ± 0.055535 | 0.001484 ± 0.000113 | 8.200065 ± 0.055566 | 0.001187 ± 0.000055 |
| q8_0 | q8_0 | 8.196226 ± 0.055531 | 0.000866 ± 0.000048 | 8.196227 ± 0.055532 | 0.000757 ± 0.000039 |

And attached here is the full raw run data for every invocation:
analysis-results.zip

@Dampfinchen

Dampfinchen commented Mar 27, 2026

> This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0. [...]

Interesting results, which indicate that Q4_0, even before this PR, is superior to tq3_0. However, as a word of caution, this is likely due to the very experimental and early implementation in that specific fork and not indicative of the actual performance of TurboQuant, which is supposed to be effectively lossless compared to the fp16 KV cache. That is not the case here.

@ggerganov
Member Author

ggerganov commented Mar 27, 2026

I think we can actually rotate the V tensor using smaller matrices (64 x 64) which should result in better quality of the V cache. We cannot do that for the Q and K because it would not preserve the dot product.

Pushed a change to do that 832e326 and just from a quick PPL sanity check it looks slightly better.

Note, using 64 instead of 32 because the Metal matrix multiplication kernels require ne00 to be at least 64.

Edit:

> We cannot do that for the Q and K because it would not preserve the dot product.

On second thought, I think it does preserve it. Something to try too. Here is the patch:

More rotations for Q and K
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 8dfc92b71..84bcf26be 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2096,8 +2096,9 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2105,7 +2106,7 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             ggml_is_quantized(mctx_cur->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
@@ -2453,8 +2454,9 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
         const bool can_rot =
             !hparams.is_n_embd_k_gqa_variable() &&
             !hparams.is_n_embd_v_gqa_variable() &&
-            ggml_is_power_of_2(hparams.n_embd_head_k()) &&
+          //ggml_is_power_of_2(hparams.n_embd_head_k()) &&
           //ggml_is_power_of_2(hparams.n_embd_head_v()) &&
+            hparams.n_embd_head_k() % 64 == 0 &&
             hparams.n_embd_head_v() % 64 == 0 &&
             hparams.n_embd_head_k() >= 64 &&
             hparams.n_embd_head_v() >= 64 &&
@@ -2462,7 +2464,7 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             ggml_is_quantized(mctx_cur->get_base()->type_v());
 
         if (can_rot) {
-            const auto nk = hparams.n_embd_head_k();
+            const auto nk = 64;
             const auto nv = 64;
 
             inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nk, nk);
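The block-wise idea can be sanity-checked with a small sketch (hypothetical helper names; block size 4 instead of 64 for brevity): applying the same orthonormal matrix independently to each block of Q and K forms a block-diagonal orthonormal transform, which still preserves the full dot product.

```python
import math
import random

def normalized_hadamard(n):
    # Sylvester construction, scaled by 1/sqrt(n) -- n must be a power of 2.
    H = [[1.0]]
    while len(H) < n:
        m = len(H)
        H = [[H[i % m][j % m] * (-1.0 if (i >= m and j >= m) else 1.0)
              for j in range(2 * m)] for i in range(2 * m)]
    s = 1.0 / math.sqrt(n)
    return [[h * s for h in row] for row in H]

def rotate_blocks(v, R):
    # Apply the same rotation R to each consecutive block of len(R) elements.
    b = len(R)
    out = []
    for off in range(0, len(v), b):
        blk = v[off:off + b]
        out.extend(sum(R[i][j] * blk[j] for j in range(b)) for i in range(b))
    return out

random.seed(0)
R = normalized_hadamard(4)  # stand-in for the 64x64 matrix in the PR
q = [random.uniform(-1, 1) for _ in range(8)]
k = [random.uniform(-1, 1) for _ in range(8)]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
assert abs(dot(rotate_blocks(q, R), rotate_blocks(k, R)) - dot(q, k)) < 1e-9
```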

@Rotatingxenomorph

> This PR does seem to improve PPL as compared to tip of master using -ctk q4_0 -ctv q4_0. [...]

> Interesting results, which indicate that Q4_0, even before this PR, is superior to tq3_0. [...]

It can't approach turboquant's performance because it's missing the QJL part of turboquant. (I think?)

@AesSedai
Contributor

I re-ran the suite against the updated commit; it looks like the KLD has generally improved a little across the board!

Heatmap results:

*(heatmap images)*

Details data table
| k-cache | v-cache | PPL | mean KLD |
|---------|---------|-----|----------|
| f16 | f16 | 8.195983 ± 0.055532 | 0.000782 ± 0.000040 |
| f16 | q4_0 | 8.205849 ± 0.055607 | 0.002518 ± 0.000094 |
| f16 | q4_1 | 8.206960 ± 0.055627 | 0.002308 ± 0.000088 |
| f16 | q5_0 | 8.197224 ± 0.055540 | 0.001275 ± 0.000052 |
| f16 | q5_1 | 8.198303 ± 0.055550 | 0.001126 ± 0.000043 |
| f16 | q8_0 | 8.196163 ± 0.055536 | 0.000838 ± 0.000058 |
| q4_0 | f16 | 8.202777 ± 0.055567 | 0.003475 ± 0.000182 |
| q4_0 | q4_0 | 8.208411 ± 0.055610 | 0.004782 ± 0.000194 |
| q4_0 | q4_1 | 8.210197 ± 0.055641 | 0.004592 ± 0.000184 |
| q4_0 | q5_0 | 8.199608 ± 0.055539 | 0.003880 ± 0.000205 |
| q4_0 | q5_1 | 8.200301 ± 0.055547 | 0.003644 ± 0.000190 |
| q4_0 | q8_0 | 8.199085 ± 0.055545 | 0.003346 ± 0.000183 |
| q4_1 | f16 | 8.202827 ± 0.055570 | 0.003155 ± 0.000182 |
| q4_1 | q4_0 | 8.214026 ± 0.055670 | 0.004670 ± 0.000214 |
| q4_1 | q4_1 | 8.213453 ± 0.055686 | 0.004174 ± 0.000173 |
| q4_1 | q5_0 | 8.200692 ± 0.055565 | 0.003446 ± 0.000189 |
| q4_1 | q5_1 | 8.199583 ± 0.055558 | 0.003457 ± 0.000206 |
| q4_1 | q8_0 | 8.200708 ± 0.055569 | 0.002958 ± 0.000168 |
| q5_0 | f16 | 8.199265 ± 0.055551 | 0.001553 ± 0.000099 |
| q5_0 | q4_0 | 8.207457 ± 0.055617 | 0.003247 ± 0.000161 |
| q5_0 | q4_1 | 8.208826 ± 0.055647 | 0.002947 ± 0.000144 |
| q5_0 | q5_0 | 8.200150 ± 0.055558 | 0.002054 ± 0.000158 |
| q5_0 | q5_1 | 8.199639 ± 0.055568 | 0.001954 ± 0.000120 |
| q5_0 | q8_0 | 8.197875 ± 0.055554 | 0.001521 ± 0.000093 |
| q5_1 | f16 | 8.200392 ± 0.055557 | 0.001585 ± 0.000137 |
| q5_1 | q4_0 | 8.210025 ± 0.055650 | 0.003052 ± 0.000151 |
| q5_1 | q4_1 | 8.207874 ± 0.055632 | 0.002765 ± 0.000128 |
| q5_1 | q5_0 | 8.200399 ± 0.055565 | 0.001849 ± 0.000126 |
| q5_1 | q5_1 | 8.199001 ± 0.055561 | 0.001839 ± 0.000146 |
| q5_1 | q8_0 | 8.199596 ± 0.055562 | 0.001544 ± 0.000125 |
| q8_0 | f16 | 8.196933 ± 0.055526 | 0.000837 ± 0.000047 |
| q8_0 | q4_0 | 8.205234 ± 0.055600 | 0.002568 ± 0.000141 |
| q8_0 | q4_1 | 8.206657 ± 0.055621 | 0.002230 ± 0.000084 |
| q8_0 | q5_0 | 8.197552 ± 0.055547 | 0.001229 ± 0.000046 |
| q8_0 | q5_1 | 8.197543 ± 0.055546 | 0.001300 ± 0.000117 |
| q8_0 | q8_0 | 8.196556 ± 0.055535 | 0.000799 ± 0.000042 |

Full run archives:
attn-rot-7711b3a36.zip

@Dampfinchen

> I re-ran the suite against the updated new commit, looks like the KLD has generally improved a little bit across the board! [...]

What about higher context, for example 100K? I believe that's where KV cache quantization actually harms the model, due to accumulated multiplication errors.

@ggerganov
Member Author

@AesSedai Thanks for the results. It seems important to track the KLD rather than PPL (maybe more significant for Qwen3.5).

I did some additional tests with randomized Hadamard matrices (as suggested by @sashkboos in #6444 (comment)). Will follow-up in next PRs.

Comment thread on `src/llama-kv-cache.cpp`:

```cpp
auto * data = (float *) tensor->data;

data[0*n + 0] = 1.0 / sqrtf(n);
```
Member

This always gets me, at least this one makes esthetic sense. :)

@am17an
Contributor

am17an commented Mar 29, 2026

There is a slowdown, which is expected; however, we should probably have a flag to opt out?

GGML_CUDA=ON ./scripts/compare-commits.sh upstream/master gg/attn-rot llama-bench -ctk q4_0 -ctv q4_0 -m /opt/models/qwen_3-30b3a-q4_0.gguf -fa 1 -d 0,2048,4096,8192,16384,32768

| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|-------|------|-------------|-----------------|---------|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7603.68 | 7409.35 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7085.71 | 6912.28 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6586.79 | 6449.26 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5844.15 | 5694.56 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4704.12 | 4612.89 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3398.99 | 3346.66 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 229.42 | 202.37 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 210.24 | 188.03 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 197.10 | 177.06 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 175.60 | 159.45 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 145.26 | 134.28 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 101.33 | 94.13 | 0.93 |

@handpickencounter

handpickencounter commented Mar 29, 2026

I don't know what is the impact of the remaining techniques explained in TurboQuant (PolarQuant, QJL, etc.). They could be important and can potentially improve further on top of this.

PolarQuant is just a rotation by a random matrix that spreads the energy of any outliers across all dimensions prior to quantization. A normalized Hadamard matrix may or may not produce similar results.

QJL is extremely important, however. PolarQuant (and your Hadamard) gives biased attention scores (despite the good-seeming MSE), which results in rank flips. QJL uses the 'unreasonable effectiveness of random projections' to give unbiased results. The theory (more accurately, the lemma) is that randomly projecting a vector to lower dimensions preserves pairwise distances (and therefore dot products) in expectation. TurboQuant quantizes the error residual all the way down to 1 bit (the sign), which somehow works well enough.
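A rough illustration of the unbiasedness claim (a sketch of the underlying lemma, not the actual QJL algorithm): projecting with a random Gaussian matrix whose entries are N(0, 1/k) preserves dot products in expectation, so averaging over many random projections recovers the true dot product.

```python
import random

random.seed(42)
d, k, trials = 16, 8, 4000

x = [random.uniform(-1, 1) for _ in range(d)]
y = [random.uniform(-1, 1) for _ in range(d)]
true_dot = sum(a * b for a, b in zip(x, y))

est = 0.0
for _ in range(trials):
    # Random projection S with N(0, 1/k) entries: E[<Sx, Sy>] = <x, y>
    S = [[random.gauss(0.0, 1.0 / k ** 0.5) for _ in range(d)] for _ in range(k)]
    Sx = [sum(S[i][j] * x[j] for j in range(d)) for i in range(k)]
    Sy = [sum(S[i][j] * y[j] for j in range(d)) for i in range(k)]
    est += sum(a * b for a, b in zip(Sx, Sy))
est /= trials

# The estimator is unbiased: its mean concentrates on the true dot product.
assert abs(est - true_dot) < 0.2
```

Each individual projection is noisy, but there is no systematic bias, which is the property QJL exploits for unbiased attention scores.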

The best technical yet intuitive explanation I saw on the subject so far is by some AI Researcher from Amazon: https://darshanfofadiya.com/research-papers/turboquant/

@CISC
Member

CISC commented Mar 29, 2026

There is a slowdown which is expected, however probably we should have a flag to opt out?

I don't think we should. The only reason to quantize the KV cache is to save memory; if that comes at the cost of speed, so be it. An option to reduce quality does not make sense (unless you think this is not a general quality improvement).

@am17an
Contributor

am17an commented Mar 29, 2026

My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues. This change alters the outputs of the LLM (even if they are materially better), so, just as an example, someone with downstream tests would see those tests break.

the only reason to quantize kv-cache is to save memory

I think when using -nkvo it will be faster to transfer a quantized cache to the GPU when offloading, so it can be used for performance reasons too. Also, the CUDA fattn vec kernel uses dp4a for q8_0, which has 2x the throughput vs f16. So that statement is just wrong IMO.


@CISC
Member

CISC commented Mar 29, 2026

> My opinion is that any new breaking change should have an option to turn it off, in case of any unforeseen issues.

You could say we are in the business of making breaking changes and that git bisect is our option, but I digress. :)

If anything, I guess adding an env-var for some grace period would be sufficient.

> This change changes outputs of the LLM (even if they are materially better), so just as an example someone having tests downstream would see tests break.

Well, until you pointed out the below I would have said that no such test were likely to exist. :)

> I think when using -nkvo, it will be faster to transfer a quantized cache to the GPU when offloading, so it can be used for a performance reason also. Also the CUDA fattn vec kernel uses dp4a for q8_0 which is 2x the throughput vs f16. So that statement is just wrong IMO.

@Dampfinchen

> There is a slowdown which is expected, however probably we should have a flag to opt out?

> I don't think we should, the only reason to quantize kv-cache is to save memory, if that comes at the cost of speed, so be it, an option to reduce quality does not make sense (unless you think this is not a general quality improvement).

I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

@CISC
Member

CISC commented Mar 29, 2026

> I disagree. You could use quantized KV Cache to save memory in order to put more layers on the GPU. In that case, the purpose of it would be to increase speed and this change would be counterproductive to that goal.

You'd most likely gain much more than you lose, so calling it counterproductive is perhaps a bit far fetched.

@am17an
Contributor

am17an commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

Actually, I'm not sure it would be worth it to be on by default for q8_0, since the benefits are much smaller compared to q4 and the performance loss is significant.

Details
| Model | Test | t/s 51a84ef | t/s gg/attn-rot | Speedup |
|-------|------|-------------|-----------------|---------|
| qwen3moe 30B.A3B Q4_0 | pp512 | 7576.02 | 7372.24 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d2048 | 7045.27 | 6838.27 | 0.97 |
| qwen3moe 30B.A3B Q4_0 | pp512@d4096 | 6580.31 | 6435.85 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d8192 | 5792.41 | 5661.65 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d16384 | 4660.49 | 4584.87 | 0.98 |
| qwen3moe 30B.A3B Q4_0 | pp512@d32768 | 3349.27 | 3300.92 | 0.99 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 230.59 | 203.59 | 0.88 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 212.31 | 189.41 | 0.89 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 199.79 | 179.40 | 0.90 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 179.48 | 162.80 | 0.91 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 149.03 | 137.47 | 0.92 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 105.50 | 98.55 | 0.93 |

@CISC
Member

CISC commented Mar 29, 2026

> If anything, I guess adding an env-var for some grace period would be sufficient.

> Actually I'm not sure it would be worth it to be on by default for q8_0 since the benefits are much lesser compared to q4 and the performance loss is significant

Granted, it's probably pointless for q8_0, so perhaps add a check.

@ggerganov
Member Author

ggerganov commented Mar 29, 2026

It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. I did a few runs today with gpt-oss-20b on low reasoning (for faster evaluation):

Here is the table with the hyperlinks preserved for each score:

| eval | KV type | score (no rot) | score (rot) |
|------|---------|----------------|-------------|
| AIME25 x8 | F16 | 37.9% | |
| AIME25 x8 | Q8_0 | 31.7% | 37.1% |
| AIME25 x8 | Q5_1 | 30.8% | 32.5% |
| AIME25 x8 | Q5_0 | 25.4% | 32.5% |
| AIME25 x8 | Q4_1 | 18.3% | 28.3% |
| AIME25 x8 | Q4_0 | 2.0% | 21.7% |

Used the new llama-eval script to perform these evals: #21152

It's interesting that with Q4_0 KV type without applying rotations, the model completely breaks down - it cannot solve even 1 problem from the 240 attempts. Note that it is still reasoning (i.e. it's not producing garbage outputs). The final answers and logic are just incorrect.

Example of reasoning trace with KV type `Q4_0` without rotations:

*(image)*

Regarding Q8_0 - I think there is some non-negligible quality benefit of applying the rotation here too (31.7% -> 37.1%). Though we need a bit more stats to confirm that. Atm, I am not convinced that disabling the rotations for Q8_0 is the better option, but I'm still considering it.

For reference, the expected score of this eval per the gpt-oss model card is 37.1%:

*(image)*

https://arxiv.org/pdf/2508.10925

@Dampfinchen

> It seems that evaluating AIME25 could be another sensitivity test for confirming the improvement of rotating the activations. I did a few runs today with gpt-oss-20b on low reasoning (for faster evaluation): [...]

Yeah, based on the score there is a clear benefit from rotating activations for q8_0, and I think it makes sense to have them on by default: the purpose of Q8_0 is to save memory for the KV cache at the same quality, and with attn-rot it is noticeably closer to that goal. Furthermore, I'm not sure if AIME25 tests at long context as well; quality differences might become even more noticeable at longer context.

Performance is a concern, however. In the end, I think the best course of action would be to make attn-rot an opt-in/opt-out option with a simple flag.

didlawowo added a commit to didlawowo/llama.cpp that referenced this pull request Apr 4, 2026
- Register GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 in kv_cache_types
  so --cache-type-k turbo3 / --cache-type-v turbo3 are recognized
- Fix double V un-rotation: upstream PR ggml-org#21038 (attn_rot_v) already
  handles Hadamard rotation for quantized KV cache types including
  turbo. Make TurboQuant WHT fallback only when upstream rotation
  is not active (else if instead of sequential if blocks)
icex added a commit to icex/llama.cpp that referenced this pull request Apr 5, 2026
Includes:
- fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)
- memory: respect unified KV cache in hybrid memory for eval tasks (ggml-org#21224)
- fix: CUDA FA kernel selection, head dimension 512 support
- rotate activations for better quantization (ggml-org#21038)
- Various parser, jinja, webui, and CI fixes

Conflicts resolved:
- llama-kv-cache.cpp: keep TurboQuant InnerQ stubs + upstream Hadamard helpers
- llama-graph.cpp: keep TurboQuant V-padding + upstream self_v_rot
- fattn-tile.cu: add upstream D=512 before TurboQuant HIP guard
- fattn.cu: combine D=512 (upstream) + D=640 (TurboQuant) exclusions
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Apr 6, 2026
Notable upstream changes:
- ggml-org#21038: rotate activations for better KV cache quantization
- ggml-org#21074: generic NVFP4 MMQ kernel
- ggml-org#21271: fix FA kernel selection logic
- ggml-org#21238: fix mmvq mmid kernel selection
- ggml-org#20998: FA support for head dim 512
- ggml-org#20920: docker cuda12 bump to 12.9.1

Conflicts resolved:
- fattn.cu: took upstream (adds head_dim 512 exclusion)
- mmq.cu/mmq.cuh: kept both TQ3 types + upstream additions
- llama-graph.cpp: kept both turbo WHT + upstream v_rot
- docker.yml: took upstream
- tests/CMakeLists.txt: took upstream
@Dampfinchen

It appears attention rotations are currently disabled for the Gemma 4 architecture.

#21394

spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
…V cache quants

Adds graph-level Hadamard rotation for Q/K/V when using standard quant types
(q4_0, q8_0, etc.) in the KV cache. Dramatically improves quantized KV quality.
Our turbo V un-rotation takes priority when turbo types are used.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 6, 2026
…tion

Turbo types have their own FWHT rotation in set-rows/fattn. The generic
rotation from PR ggml-org#21038 would double-rotate, corrupting results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
* llama : rotate activations for better quantization

* cont : rotate V more + refactor

* cont : rotate caches separately + support non-power-of-2 head sizes

* cont : simplify

* cont : add reference for V rotation

* cont : refactor

* cont : support context shift

* cont : consolidate

* cont : dedup + allow different types for the rotation matrix

* cont : add env variable to disable rotation

* cont : simplify attn rot kv cache logic + rename env

* cont : pre-compute the Hadamard matrices
mmtmu pushed a commit to mmtmu/llama.cpp that referenced this pull request Apr 20, 2026
The online Hadamard rotation (PR ggml-org#21038) exists to suppress quantization
error. On an F16 KV cache it is a no-op and costs one extra mul_mat per
layer. Remove the force-enable for GLM_DSA and DEEPSEEK2 and let the KV
cache follow the standard rule (quantized type + aligned head dim).
Guard the Hadamard mul_mats in the builder and the set_input_k_rot call
on self_k_rot_lid being non-null, since it can legitimately be null now.
@vektorprime

vektorprime commented Apr 26, 2026

Just wanted to leave this image here since this post is referenced quite a bit:

The following results come from the official SLANG website, but they show GPTQ and AIME to be sensitive to KV quant. It's NVFP4 but still very relevant, I think. @ggerganov maybe you can incorporate that into your future testing suite like you did with the python-based AIME testing.

[image]

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
dogkeeper886 added a commit to dogkeeper886/ollama37 that referenced this pull request May 3, 2026
Wires the Hadamard scaffold (landed earlier on this branch) into
ml/nn/attention.go. When OLLAMA_KV_ROTATE=1 and the head dim is a
multiple of 64, Q/K/V are rotated by a fixed normalized 64×64 Sylvester
matrix before cache.Put; the attention output is rotated again to undo
V's rotation (Sylvester H is symmetric so H·H = I).

Math: orthogonal H satisfies (QH)(KH)^T = QK^T so attention scores are
preserved. Storing rotated K/V in the cache is the whole point — it
smooths the per-coordinate distribution before quantization, which is
what closes the q4_0/q8_0 quality gap to fp16 (upstream PR
ggml-org/llama.cpp#21038, commit 744c0c73).
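The two properties the message above relies on (the normalized Sylvester matrix is symmetric, so H·H = I, and orthogonal rotation preserves dot products) can be checked numerically; this quick sketch is independent of the Go helpers:

```python
import numpy as np

def sylvester(n):
    # Normalized Sylvester-Hadamard matrix; entries are +-1/sqrt(n).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = sylvester(64)
assert np.allclose(H, H.T)             # symmetric
assert np.allclose(H @ H, np.eye(64))  # so applying it twice is the identity

rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)
# Orthogonal rotation preserves dot products: (Hq) . (Hk) == q . k,
# which is why attention scores are unchanged by rotating Q and K.
assert np.isclose((H @ q) @ (H @ k), q @ k)
print("ok")
```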

Helpers added in ml/nn/hadamard.go:
- hadamard64 — package-level cached float slice (computed once)
- blockRotate(ctx, t, h) — applies H_64 block-diagonally along Dim(0)
- hadamardTensor(ctx) — materializes hadamard64 as a backend tensor
- noteIncompatibleHeadDim — one-shot slog.Warn per head_dim when the
  flag is set but the dim isn't compatible

Default off: when OLLAMA_KV_ROTATE is unset (the default), behavior is
byte-identical to before — every gate short-circuits on the envconfig
check before any rotation work happens.

Design doc updated to reflect what landed and to note two known
limitations: (1) the gate reads the env each call, so mid-process toggle
would mix rotated/unrotated entries; (2) the gate checks Q's head dim
only, so head-dim asymmetry in V is not yet handled.

llamarunner cherry-pick of upstream commit 744c0c73 deferred to a
separate follow-up.
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
Investigation of the upstream llama.cpp attention rotation (ggml-org#21038) in our
fork, ending in a per-side env-knob policy (LLAMA_ATTN_ROT_K_OVERRIDE /
LLAMA_ATTN_ROT_V_OVERRIDE). The default is unchanged from the original fork
(rotation OFF on both sides); the contribution is per-side control plus a
matrix documenting which models want which knob.

Headline finding while running the matrix: corpus PPL on wikitext-2 is
unreliable for KV-quantization evaluation on gemma-class instruct models.
Quantized KV scores 7-42% BELOW the fp16-KV baseline on three gemma-4-it
GGUFs; KLD against the fp16-KV reference points the opposite direction.
Reproduced at q8/q8 and q8/turbo4, on wikitext-2 and wikitext-103, ctx=512
and ctx=2048. Cleanest case: t4/t4 K-only PPL +52.7% but KLD -4.9% (best
on row).

The PPL/KLD ranking inversion is a continuum (small magnitude on Qwen2.5-7B,
large on gemma-4 26B-A4B), not gemma-only. Recommendation: use KLD against
fp16-KV logits as the oracle for KV-quant evaluation on this model class.

Independently observed prior: vektorprime in ggml-org#21394, AesSedai's tables in
ggml-org#21038, ggerganov's "track KLD rather than PPL" comment in the same PR,
and localbench. This write-up adds the controlled per-side measurement, the
cross-format / cross-corpus reproduction, the bit-exact-zero KLD noise
floor on Metal, and the resulting engineering policy.
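For reference, the per-token KLD oracle advocated above is cheap to compute once fp16-KV logits are saved; the sketch below uses hypothetical 32-way logits, not output from the llama.cpp tools:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kld(p, q):
    # KL(p || q) for one token's next-token distribution.
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
ref  = softmax(rng.normal(size=32))          # stand-in for fp16-KV logits
test = softmax(rng.normal(size=32) * 1.05)   # stand-in for quantized-KV logits
print(kld(ref, ref) == 0.0)   # True: identical distributions diverge by zero
print(kld(ref, test) >= 0.0)  # True: Gibbs' inequality
```

Unlike corpus PPL, this compares the quantized run directly against the fp16 reference distribution token by token, so it cannot be fooled by a quantized model that is confidently different.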