
fix(kv-cache): per-side env-knob control for upstream attn rotation (default OFF) #111

Merged

TheTom merged 2 commits into feature/turboquant-kv-cache from fix/enable-attn-rot-by-default on May 1, 2026

Conversation

@TheTom (Owner) commented Apr 29, 2026

Summary

Fixes TheTom/turboquant_plus#88 reported by @erazortt.

The fork was suppressing upstream's activation-rotation pre-quantization step from ml-explore/llama.cpp#21038 ("llama : rotate activations for better quantization") via a LLAMA_ATTN_ROT_DISABLE=true default. That was correct for symmetric turbo (where the kernel-level WHT handles rotation end-to-end) but wrong for asymmetric configs like q8_0-K + turbo*-V: the K side stayed in the un-rotated coordinate space and lost the quality boost upstream rotation gives Q8_0 / Q4_0. The user-facing symptom is that asymmetric q8_0-K / turbo-V PPL was slightly worse than upstream q8_0/q8_0 on the same model.

This patch flips the default. Master's k_rot / v_rot matrices and turbo's kernel-level WHT are independent rotations (different basis, different invert sites) and compose cleanly, so enabling both does not double-rotate.
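
For illustration, a minimal sketch of what the v2 gate amounts to in the KV cache constructor. Helper and field names (env_flag, k_is_quantized, attn_rot_k) are hypothetical stand-ins, not the fork's actual identifiers; only the LLAMA_ATTN_ROT_DISABLE env var is taken from this PR:

    #include <cstdlib>
    #include <cstring>

    // Hypothetical helper: treat "1"/"true" as set, anything else as unset.
    static bool env_flag(const char * name) {
        const char * v = std::getenv(name);
        return v && (std::strcmp(v, "1") == 0 || std::strcmp(v, "true") == 0);
    }

    // v2 gate: rotation ON by default for quantized KV types on both sides;
    // LLAMA_ATTN_ROT_DISABLE=1 remains the global escape hatch.
    static void pick_rotation_v2(bool k_is_quantized, bool v_is_quantized,
                                 bool & attn_rot_k, bool & attn_rot_v) {
        const bool disabled = env_flag("LLAMA_ATTN_ROT_DISABLE");
        attn_rot_k = !disabled && k_is_quantized;
        attn_rot_v = !disabled && v_is_quantized;
    }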

Empirical validation

Qwen3.5-2B-Q8_0 weights, q8_0-K + turbo4-V cache, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

config                                    attn_rot_k   attn_rot_v   PPL
pre-fix default (rotation off)            0            0            10.9170 ± 0.233
v2 fix default (rotation on)              1            1            10.8819 ± 0.235
LLAMA_ATTN_ROT_DISABLE=1 (escape hatch)   0            0            10.9170

Δ = -0.32% PPL, with 15 of 16 chunks favorable. The absolute change is small, but the direction is consistent and matches the symmetric-case improvement @erazortt observed against upstream master q8_0/q8_0.

Asks for testers

  • @erazortt — original reporter, please confirm on your test setup and your model(s). Particularly interested in any larger model class where the absolute delta widens.
  • Anyone running asymmetric KV (q8_0 or q4_0 on K with turbo* on V) — reinstall (brew uninstall && brew install) or rebuild from source on this branch, then re-run your favorite eval. PPL on wikitext is the cleanest signal.

Escape hatch preserved

LLAMA_ATTN_ROT_DISABLE=1 still globally disables the upstream rotation if a model hits graph-node hash-table overflow (Phi-4 was the prior known failure case). No symptoms expected on the supported model set, but the env var stays for safety.

Out of scope

  • The companion fix in src/llama-graph.cpp (where the pre-rotate-queries WHT is gated on K being a turbo type) is unaffected — it remains correct for the turbo path. This change only touches the upstream-rotation gate in the KV cache constructor.
  • Phi-4 hash-table overflow: not retested. If anyone hits it, set LLAMA_ATTN_ROT_DISABLE=1 and report.

@erazortt

Well, it seems to work for the native KV quants (Q8_0, Q4_0), but I am getting completely different results when using turboquants. I would have assumed that the now-enabled rotation would not affect the turboquants. The model I tested on was: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf

branch feature/turboquant-kv-cache:
/d/sources/llama.cpp-my/build-turbo-non-rot/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk turbo4 -ctv turbo4 -fa on > turbo-non-rot-t4-t4.txt

====== Perplexity statistics ======
Mean PPL(Q) : 12081.266227 ± 236.359360
Mean PPL(base) : 8620.224392 ± 139.300324
Cor(ln(PPL(Q)), ln(PPL(base))): 86.28%
Mean ln(PPL(Q)/PPL(base)) : 0.337545 ± 0.009917
Mean PPL(Q)/PPL(base) : 1.401503 ± 0.013899
Mean PPL(Q)-PPL(base) : 3461.041835 ± 135.851787

====== KL divergence statistics ======
Mean KLD: 2.250782 ± 0.009766
Maximum KLD: 47.711288
99.9% KLD: 27.624287
99.0% KLD: 17.616028
95.0% KLD: 10.180385
90.0% KLD: 6.817993
Median KLD: 0.647133
10.0% KLD: 0.003063
5.0% KLD: 0.000338
1.0% KLD: 0.000006
0.1% KLD: 0.000000
Minimum KLD: -0.000008

====== Token probability statistics ======
Mean Δp: 1.449 ± 0.047 %
Maximum Δp: 100.000%
99.9% Δp: 99.973%
99.0% Δp: 88.276%
95.0% Δp: 24.133%
90.0% Δp: 5.378%
75.0% Δp: 0.037%
Median Δp: -0.000%
25.0% Δp: -0.002%
10.0% Δp: -1.385%
5.0% Δp: -11.222%
1.0% Δp: -64.421%
0.1% Δp: -99.642%
Minimum Δp: -100.000%
RMS Δp : 18.155 ± 0.099 %
Same top p: 55.923 ± 0.130 %

branch fix/enable-attn-rot-by-default:
/d/sources/llama.cpp-my/build-turbo/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk turbo4 -ctv turbo4 -fa on > turbo-t4-t4.txt

====== Perplexity statistics ======
Mean PPL(Q) : 8171.644854 ± 159.821344
Mean PPL(base) : 8620.224392 ± 139.300324
Cor(ln(PPL(Q)), ln(PPL(base))): 84.28%
Mean ln(PPL(Q)/PPL(base)) : -0.053441 ± 0.010533
Mean PPL(Q)/PPL(base) : 0.947962 ± 0.009985
Mean PPL(Q)-PPL(base) : -448.579538 ± 86.153966

====== KL divergence statistics ======
Mean KLD: 2.588906 ± 0.010435
Maximum KLD: 49.253040
99.9% KLD: 27.660925
99.0% KLD: 18.427866
95.0% KLD: 11.055090
90.0% KLD: 7.729677
Median KLD: 0.866591
10.0% KLD: 0.004266
5.0% KLD: 0.000446
1.0% KLD: 0.000008
0.1% KLD: 0.000000
Minimum KLD: -0.000017

====== Token probability statistics ======
Mean Δp: 2.931 ± 0.052 %
Maximum Δp: 100.000%
99.9% Δp: 99.990%
99.0% Δp: 94.913%
95.0% Δp: 37.536%
90.0% Δp: 10.724%
75.0% Δp: 0.135%
Median Δp: 0.000%
25.0% Δp: -0.000%
10.0% Δp: -0.795%
5.0% Δp: -8.835%
1.0% Δp: -61.325%
0.1% Δp: -99.375%
Minimum Δp: -100.000%
RMS Δp : 20.122 ± 0.099 %
Same top p: 52.962 ± 0.130 %

@TheTom force-pushed the fix/enable-attn-rot-by-default branch from 6c55e84 to cf5c3db on April 29, 2026 19:06
@TheTom (Owner, Author) commented Apr 29, 2026

@erazortt great catch — your symmetric turbo4/turbo4 result is conclusive (KLD 2.25 → 2.59, same-top-p 55.9% → 52.9%). Reverted the broad enable and went per-side instead; force-pushed cf5c3db09 to the same branch.

New gating: enable attn_rot_k / attn_rot_v only for non-turbo quantized types, decided per side (a sketch of the gate follows the table):

config           attn_rot_k   attn_rot_v
q8_0 / q8_0      1            1   (matches master)
q8_0 / turbo*    1            0   (helps Q8 K, leaves turbo V alone)
turbo* / q8_0    0            1
turbo* / turbo*  0            0   (avoids your regression)
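
In sketch form, with a hypothetical kv_type enum and predicates standing in for the fork's real internals:

    #include <cstdlib>
    #include <cstring>

    // Illustrative stand-ins for the fork's cache-type predicates.
    enum class kv_type { f16, q8_0, q4_0, turbo4, turbo8 };

    static bool is_quantized(kv_type t) { return t != kv_type::f16; }
    static bool is_turbo(kv_type t)     { return t == kv_type::turbo4 || t == kv_type::turbo8; }

    static bool env_flag(const char * name) {
        const char * v = std::getenv(name);
        return v && (std::strcmp(v, "1") == 0 || std::strcmp(v, "true") == 0);
    }

    // v3 gate: rotate a side only when its type is quantized and non-turbo,
    // and the global escape hatch is not set.
    static void pick_rotation_v3(kv_type type_k, kv_type type_v,
                                 bool & attn_rot_k, bool & attn_rot_v) {
        const bool disabled = env_flag("LLAMA_ATTN_ROT_DISABLE");
        attn_rot_k = !disabled && is_quantized(type_k) && !is_turbo(type_k);
        attn_rot_v = !disabled && is_quantized(type_v) && !is_turbo(type_v);
    }

Plugging the four K/V type combinations from the table into pick_rotation_v3 reproduces the 1/1, 1/0, 0/1, 0/0 rows.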

Re-validated on Qwen3.5-2B-Q8_0 wikitext-2 16 chunks @ ctx=2048:

q8_0 / turbo4    rot_k=1 rot_v=0  PPL 10.9159
turbo4 / turbo4  rot_k=0 rot_v=0  PPL 10.9194

turbo4/turbo4 now sits at the rotation-off baseline (no regression), and q8_0/turbo4 still picks up the K-side rotation that helps Q8 quality. Would appreciate one more pass of your gemma-4 turbo4/turbo4 KLD eval to confirm it's back to the rotation-off numbers.

Asymmetric path note: my Qwen3.5-2B q8_0/turbo4 delta was within SE (10.9159 vs 10.9170), so the K-only rotation may not help as much as full rotation does on your test setup — would also be useful to see q8_0/turbo4 numbers from your matrix if you have them. Trying to land the right shape of fix.

@TheTom (Owner, Author) commented Apr 29, 2026

Local proof-of-safety pass on Qwen3.5-2B-Q8_0, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

config                                                 attn_rot_k   attn_rot_v   PPL
q8_0 / q8_0, rotation off (LLAMA_ATTN_ROT_DISABLE=1)   0            0            10.7897
q8_0 / q8_0, v3 default (master parity)                1            1            10.7877
q8_0 / turbo4, rotation off                            0            0            10.9170
q8_0 / turbo4, v3 default (k-only rotation)            1            0            10.9159
turbo4 / turbo4, v3 default                            0            0            10.9194

Three things v3 does correctly:

  1. No regression on symmetric turbo/turbo — rot_k/rot_v default to 0 for turbo types, and PPL stays at the rotation-off baseline. The gemma-4 26B-A4B KLD regression you observed will not reproduce.
  2. Master parity on q8_0 / q8_0 — both rotations on by default; behaviour matches upstream master.
  3. K-only rotation on the asymmetric path — q8_0 K is rotated, turbo* V is left alone for the turbo encode path to handle.

What's NOT fully proven on this model: the absolute magnitude of the asymmetric quality win. Master rotation barely moves Qwen3.5-2B Q8_0 PPL (10.7897 → 10.7877, well within ±0.23 SE), so q8_0/turbo4 partial rotation also barely moves (10.9170 → 10.9159). The structural shape of the fix is correct and safe; the magnitude is model-dependent. Your KLD-based eval on gemma-4 26B-A4B is the gold standard for sizing the actual win — particularly the q8_0/turbo4 cell if you have it on the same model.

@erazortt

Great, now it seems to work correctly! The turbo4/turbo4 result is now like before, while the q8/turbo4 results are as follows:

/d/sources/llama.cpp-my/build-turbo-non-rot/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo-non-rot-q8-t4.txt

====== Perplexity statistics ======
Mean PPL(Q) : 14326.867168 ± 281.289693
Mean PPL(base) : 8620.224392 ± 139.300324
Cor(ln(PPL(Q)), ln(PPL(base))): 88.45%
Mean ln(PPL(Q)/PPL(base)) : 0.508025 ± 0.009241
Mean PPL(Q)/PPL(base) : 1.662006 ± 0.015358
Mean PPL(Q)-PPL(base) : 5706.642776 ± 170.926823

====== KL divergence statistics ======
Mean KLD: 1.780215 ± 0.008535
Maximum KLD: 52.474285
99.9% KLD: 25.558065
99.0% KLD: 15.853672
95.0% KLD: 8.555811
90.0% KLD: 5.407770
Median KLD: 0.425377
10.0% KLD: 0.001950
5.0% KLD: 0.000212
1.0% KLD: 0.000004
0.1% KLD: -0.000000
Minimum KLD: -0.000004

====== Token probability statistics ======
Mean Δp: 1.047 ± 0.041 %
Maximum Δp: 100.000%
99.9% Δp: 99.895%
99.0% Δp: 76.169%
95.0% Δp: 17.408%
90.0% Δp: 3.648%
75.0% Δp: 0.022%
Median Δp: -0.000%
25.0% Δp: -0.001%
10.0% Δp: -1.054%
5.0% Δp: -8.704%
1.0% Δp: -55.163%
0.1% Δp: -99.356%
Minimum Δp: -100.000%
RMS Δp : 15.792 ± 0.096 %
Same top p: 60.418 ± 0.128 %

/d/sources/llama.cpp-my/build-turbo-2/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo2-q8-t4.txt

====== Perplexity statistics ======
Mean PPL(Q) : 15288.244248 ± 300.226430
Mean PPL(base) : 8620.224392 ± 139.300324
Cor(ln(PPL(Q)), ln(PPL(base))): 89.00%
Mean ln(PPL(Q)/PPL(base)) : 0.572973 ± 0.009049
Mean PPL(Q)/PPL(base) : 1.773532 ± 0.016049
Mean PPL(Q)-PPL(base) : 6668.019856 ± 187.334781

====== KL divergence statistics ======
Mean KLD: 1.678579 ± 0.008187
Maximum KLD: 43.678173
99.9% KLD: 25.100157
99.0% KLD: 15.344507
95.0% KLD: 8.072799
90.0% KLD: 5.027307
Median KLD: 0.396494
10.0% KLD: 0.001770
5.0% KLD: 0.000192
1.0% KLD: 0.000003
0.1% KLD: -0.000000
Minimum KLD: -0.000006

====== Token probability statistics ======
Mean Δp: 0.868 ± 0.040 %
Maximum Δp: 100.000%
99.9% Δp: 99.885%
99.0% Δp: 71.754%
95.0% Δp: 16.066%
90.0% Δp: 3.199%
75.0% Δp: 0.018%
Median Δp: -0.000%
25.0% Δp: -0.001%
10.0% Δp: -1.047%
5.0% Δp: -8.549%
1.0% Δp: -54.589%
0.1% Δp: -99.082%
Minimum Δp: -100.000%
RMS Δp : 15.297 ± 0.096 %
Same top p: 61.403 ± 0.127 %

@erazortt commented Apr 29, 2026

Though I now see that for q8/turbo4 the PPL got worse, while KLD and same-top-p got better.

@TheTom (Owner, Author) commented Apr 29, 2026

I'm considering options and running experiments.

@TheTom (Owner, Author) commented Apr 29, 2026

Expanded the experiment matrix to triangulate this: ran a 13-cell PPL sweep on M5 Max, wikitext-2 32 chunks @ ctx=512, with per-side rotation overrides via two new debug env knobs (LLAMA_ATTN_ROT_K_OVERRIDE / LLAMA_ATTN_ROT_V_OVERRIDE).

Gemma-4 26B-A4B Q8_0 (SWA + MoE)

config                      rot k/v   PPL     Δ vs OFF
q8 / q8 OFF                 0/0       9979
q8 / q8 v3 default          1/1       10118   +1.4%
q8 / q8 K-only              1/0       10153   +1.7%
q8 / q8 V-only              0/1       10451   +4.7%
q8 / turbo4 OFF             0/0       6273
q8 / turbo4 v3 (k=1, v=0)   1/0       6700    +6.8%
q8 / turbo4 broad (k+v)     1/1       6176    -1.6%
q8 / turbo4 V-only          0/1       6027    -3.9%
t4 / t4 OFF                 0/0       5785
t4 / t4 broad               1/1       7031    +21.5%
t4 / t4 K-only              1/0       8831    +52.7% ⚠⚠

Qwen2.5 1.5B Q4_K_M (pure-global, dense)

config                 rot k/v   PPL
q8 / q8 OFF            0/0       9.2889
q8 / q8 v3             1/1       9.2687 (within SE)
q8 / turbo4 OFF        0/0       9.3799
q8 / turbo4 v3 (k=1)   1/0       9.3772 (within SE)
t4 / t4 OFF            0/0       6300 (broken)
t4 / t4 broad          1/1       4711 (rescues partially)

Observations

  1. v3's per-side gating is the WORST asymmetric config on gemma-4: q8/turbo4 K-only (current v3 logic) sits at +6.8% PPL, while V-only is the BEST at -3.9%. The empirically right asymmetric policy on this family is the direct opposite of what v3 picked.
  2. K-side rotation when K is turbo is catastrophic (t4/t4 K-only +52.7%). Confirms the symmetric turbo regression is K-side. V-side master rotation on turbo V actually helps on gemma-4.
  3. Qwen2.5 dense behaves differently. All q8 variants within SE; t4/t4 is fundamentally broken regardless of rotation (PPL > 4700 either way). Different family, different patterns.
  4. Master's rotation barely helps on gemma-4 q8/q8 (+1.4% with rotation vs OFF). Even master's intended use case isn't a clear win on this architecture.

What this changes

The "skip rotation for turbo types" gating in v3 is wrong on the gemma family. Per-family rotation policies appear genuinely different (gemma SWA+MoE vs Qwen2.5 pure-global vs Qwen3.5 hybrid all show distinct optima), so a single universal default that's correct for all is unlikely.

Leaning toward this PR shape:

  • Default: rotation OFF on all paths. Avoids the +52.7% catastrophe and the +6.8% v3 regression. Matches pre-fix behavior.
  • First-class per-side env knobs: LLAMA_ATTN_ROT_K_OVERRIDE=0/1 and LLAMA_ATTN_ROT_V_OVERRIDE=0/1 for opt-in.
  • README section with family-tested recommendations: e.g. "gemma-4 q8/turbo4: set LLAMA_ATTN_ROT_V_OVERRIDE=1 for ~4% PPL improvement; gemma-4 t4/t4: leave default off (avoid +21% regression)."

Open to running more families (Phi-4, Mistral-Small, Llama-3) before locking the PR shape if that helps.

@TheTom force-pushed the fix/enable-attn-rot-by-default branch from cf5c3db to 9ecb4f2 on April 29, 2026 21:15
…default OFF

Replaces the prior approach (auto-enable rotation for non-turbo quantized
types) with explicit per-side opt-in via LLAMA_ATTN_ROT_K_OVERRIDE and
LLAMA_ATTN_ROT_V_OVERRIDE. Default behavior: rotation OFF on both sides
across all KV types.

Background: TheTom/turboquant_plus#88 surfaced that asymmetric q8_0/turbo*
configs were missing the upstream activation rotation from
ml-explore/llama.cpp#21038. Multiple iterations (broad enable, per-side
gating with turbo skip) tried to find a smart default that balances quality
across model families.

Empirical PPL+KLD testing on 7 model families (gemma-4 26B-A4B / 31B / E2B,
Qwen2.5-7B, Qwen3.5-2B, Mistral-Small-24B, phi-4) showed the optimal
rotation policy is highly model-and-quant specific. No single default is
correct everywhere, including within the same architecture family
(gemma-4 26B-A4B Q8, 31B Q8, and E2B Q4_K_L showed three distinct optima).

phi-4 V-side rotation crashes with graph-node hash overflow, ruling out any
default-on policy that touches V rotation across model families.

Default OFF avoids regressing any tested model. Env-knob opt-in lets
power users tune for their specific config based on documented per-model
findings (see README/docs follow-up).

LLAMA_ATTN_ROT_DISABLE remains as a no-op alias for historical scripts.

Co-Authored-By: tturney@psyguard.ai
@TheTom changed the title from "fix(kv-cache): enable upstream attention rotation by default" to "fix(kv-cache): per-side env-knob control for upstream attn rotation (default OFF)" on Apr 29, 2026
@TheTom force-pushed the fix/enable-attn-rot-by-default branch from 9ecb4f2 to db3595a on April 29, 2026 21:15
@TheTom (Owner, Author) commented Apr 29, 2026

Update: v4 — default OFF + per-side env knobs

After expanding the experiment matrix to 7 model families and discovering that:

  • the optimal rotation policy varies wildly within the same architecture family (3 distinct optima across gemma-4 sizes)
  • phi-4 crashes when V-side rotation is enabled (graph-node hash overflow, historical issue)
  • the most dramatic apparent "wins" turn out to be PPL eval artifacts specific to instruction-tuned gemma models

I'm reshaping the fix from "smart per-side gating" (v3) to "default OFF + opt-in env knobs" (v4). Force-pushed db3595a75.

v4 behavior

  • attn_rot_k defaults to false, regardless of KV type
  • attn_rot_v defaults to false, regardless of KV type
  • LLAMA_ATTN_ROT_K_OVERRIDE=1 enables K-side rotation (subject to quantized + head_dim%64 guard)
  • LLAMA_ATTN_ROT_V_OVERRIDE=1 enables V-side rotation (same guards)
  • LLAMA_ATTN_ROT_DISABLE=1 retained as a no-op alias for historical scripts (the full gate is sketched below)
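
A minimal sketch of the v4 gate as described by the bullets above (not the actual diff; env_flag and the quantized/head_dim predicates are hypothetical stand-ins, and only the env var names come from this PR):

    #include <cstdlib>
    #include <cstring>

    static bool env_flag(const char * name) {
        const char * v = std::getenv(name);
        return v && (std::strcmp(v, "1") == 0 || std::strcmp(v, "true") == 0);
    }

    // v4 gate: default OFF on both sides regardless of KV type; per-side
    // opt-in via env knobs, each still subject to the quantized + head_dim%64
    // guards. LLAMA_ATTN_ROT_DISABLE is kept only as a no-op alias here.
    static void pick_rotation_v4(bool k_is_quantized, bool v_is_quantized, int head_dim,
                                 bool & attn_rot_k, bool & attn_rot_v) {
        attn_rot_k = false;
        attn_rot_v = false;
        const bool head_dim_ok = head_dim % 64 == 0;
        if (env_flag("LLAMA_ATTN_ROT_K_OVERRIDE") && k_is_quantized && head_dim_ok) attn_rot_k = true;
        if (env_flag("LLAMA_ATTN_ROT_V_OVERRIDE") && v_is_quantized && head_dim_ok) attn_rot_v = true;
    }

With neither override set, both flags stay false for every KV type, which is the "default OFF on all paths" behavior the matrix below validates.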

Why default OFF — the data

Expanded matrix on M5 Max wikitext-2 32 chunks @ ctx=512, q8_0/turbo4 KV across 7 model families:

model                   OFF      v3 (k=1)   V-only   broad
gemma-4 26B-A4B Q8      6273     6700       6027     6176
gemma-4 31B Q8          8685     9009       4924     4818
gemma-4 E2B Q4_K_L      114.8    115.4      122.4    122.4
Qwen2.5-7B Q8           6.140    6.146      6.135    6.116
Qwen3.5-2B Q8           10.794   10.791     10.692   10.692
Mistral-Small-24B Q4    5.317    5.318      5.326    5.326
phi-4 Q8                5.824    5.818      CRASH    CRASH

Three things this matrix kills

  1. Per-arch policy in code is dead. Even within the gemma-4 family, three sizes show three different optima. Hardcoding LLM_ARCH_GEMMA → V-only would silently regress E2B Q4_K_L users by +6.7%.

  2. phi-4 V-rotation crashes. Confirmed historical hash-table-overflow. Any default-on policy that touches V is unshippable.

  3. The dramatic gemma-4 31B "-43% PPL win" is a metric artifact. Sanity-checked against the FP16 baseline:

    gemma-4 31B Q8 ctx=512 PPL
    f16/f16 KV (no quant) 5320
    q8/turbo4 OFF 8685
    q8/turbo4 V-only 4924 ← below the FP16 baseline?!
    q8/turbo4 broad 4818 ← also below FP16

    Quantization should never score better than FP16. The fact that V-only / broad land below the f16/f16 baseline means PPL on wikitext-2 is unreliable for gemma-4-it models — likely chat-template or special-token handling artifacts in the eval. The 26B-A4B numbers are in the same suspicious 4-digit-PPL regime, smaller but probably the same artifact.

What the matrix DOES tell us reliably

  • Most non-gemma model families (Qwen2.5, Qwen3.5, Mistral) are within standard error regardless of rotation. No clear default applies.
  • gemma-4 E2B Q4_K_L (PPL 114, in the believable range) shows rotation hurts by +6.7%.
  • phi-4 crashes on V-side rotation. Hard ceiling.
  • erazortt's KLD eval on gemma-4 26B-A4B Q6_K_XL (the gold-standard signal) showed rotation improves base-model fidelity (KLD ↓, same-top-p ↑) at PPL cost. That's the right metric for KV-quant evaluation.

Path forward

This PR ships the env-knob infrastructure. Per-model recommendations belong in a follow-up README section, populated as the community reports KLD-based findings on their specific configs.

Welcome any further testing. The ask for testers is now: run KLD eval (--kl-divergence + --kl-divergence-base) on YOUR model and report whether LLAMA_ATTN_ROT_K_OVERRIDE=1 and/or LLAMA_ATTN_ROT_V_OVERRIDE=1 improves KLD without crashing.

…disable=false)

Previous state: db3595a added LLAMA_ATTN_ROT_K_OVERRIDE / _V_OVERRIDE per-side
opt-in knobs but kept attn_rot_disable defaulting to TRUE for legacy
LLAMA_ATTN_ROT_DISABLE compatibility. The override branches included
`&& !attn_rot_disable` guards, so when LLAMA_ATTN_ROT_DISABLE is unset
(default true) the per-side env knobs were silently no-ops. Users could not
opt into rotation without also setting LLAMA_ATTN_ROT_DISABLE=0.

Fix: flip attn_rot_disable default to false. Rotation is still OFF by default
because attn_rot_k/v default to false. LLAMA_ATTN_ROT_DISABLE=1 still acts as
a hard lock-out that blocks the per-side overrides for users who want a
single switch to guarantee no rotation.

Caught while running the cross-format KLD matrix for the rotation/PPL
investigation paper — V-only override appeared to silently fail. Confirmed
with logs that attn_rot_v stayed 0 even with LLAMA_ATTN_ROT_V_OVERRIDE=1
until this default flip.
@TheTom (Owner, Author) commented Apr 29, 2026

Update: critical bug fix + cross-format KLD overturns the t4/t4 K-only narrative

Pushed 817e913ec on top of db3595a75. Here's what this update covers.

1. Critical bug fix: per-side overrides were silently no-ops

db3595a75 left attn_rot_disable defaulting to TRUE for legacy LLAMA_ATTN_ROT_DISABLE compatibility. The new per-side override branches include && !attn_rot_disable guards, so when LLAMA_ATTN_ROT_DISABLE is unset (default true), LLAMA_ATTN_ROT_K_OVERRIDE=1 and LLAMA_ATTN_ROT_V_OVERRIDE=1 were silently no-ops. Users could not opt into rotation without also setting LLAMA_ATTN_ROT_DISABLE=0. Caught while running the post-merge KLD matrix below — V-only override looked like it had no effect, traced to this guard.

817e913ec flips the default to false. Rotation is still OFF by default (because attn_rot_k/v default to false). LLAMA_ATTN_ROT_DISABLE=1 is preserved as a hard lock-out that blocks the per-side overrides for users who want one switch to guarantee no rotation.
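
In sketch form, paraphrasing the guards described above (not verbatim source; env_flag is the same hypothetical helper as in the earlier sketches, K side shown, V side is analogous):

    #include <cstdlib>
    #include <cstring>

    static bool env_flag(const char * name) {
        const char * v = std::getenv(name);
        return v && (std::strcmp(v, "1") == 0 || std::strcmp(v, "true") == 0);
    }

    static void pick_rotation_db3595a(bool & attn_rot_k) {
        // Buggy: the legacy flag defaults to TRUE when LLAMA_ATTN_ROT_DISABLE
        // is unset, so `&& !attn_rot_disable` silently vetoes the override.
        bool attn_rot_disable = true;
        if (const char * v = std::getenv("LLAMA_ATTN_ROT_DISABLE")) {
            attn_rot_disable = std::strcmp(v, "0") != 0;
        }
        // Always false unless the user ALSO sets LLAMA_ATTN_ROT_DISABLE=0:
        attn_rot_k = env_flag("LLAMA_ATTN_ROT_K_OVERRIDE") && !attn_rot_disable;
    }

    static void pick_rotation_817e913(bool & attn_rot_k) {
        // Fixed: the flag defaults to FALSE. Rotation stays OFF by default
        // because attn_rot_k/v default to false; DISABLE=1 remains a hard
        // lock-out over the per-side overrides.
        const bool attn_rot_disable = env_flag("LLAMA_ATTN_ROT_DISABLE");
        attn_rot_k = env_flag("LLAMA_ATTN_ROT_K_OVERRIDE") && !attn_rot_disable;
    }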

If you tested env-knob behavior on db3595a75 and saw "no change," please retest on 817e913ec — your override was probably blocked.

2. Cross-format KLD overturns the "t4/t4 K-only catastrophic +52.7%" framing

The earlier comment called t4/t4 K-only catastrophic by PPL. After running the cross-format KLD matrix on gemma-4 26B-A4B Q8 (vs fp16-KV reference, ctx=512, 32 chunks):

KV format     config   PPL Δ vs OFF   KLD Δ vs OFF
q8 / q8       V-only   +4.7%          −19.4%
q8 / q8       broad    +1.4%          −22.7%
q8 / turbo4   V-only   −3.9%          +6.1%
q8 / turbo4   broad    −1.6%          +6.8%
t4 / t4       K-only   +52.7%         −4.9%
t4 / t4       broad    +21.5%         +9.7%

PPL and KLD point opposite directions on five of six rows. The "+52.7% K-only catastrophic" PPL number is itself an artifact in the same family as the q8/turbo4 V-only "win" — KLD on the same row shows K-only is the closest-to-fp16 configuration. Master's PR ggml-org#21038 is in fact doing what it claims on q8/q8 (KLD drops 20%+), even though PPL reads it as a regression.

3. Headline: PPL is unreliable on gemma-class instruct, KLD ranks correctly

Headline trio at 256 chunks (8× more data, ~3× tighter CIs), scored against the fp16-KV reference (PPL 20045.71 ± 604.41):

config             PPL(Q)                       KL divergence
q8/turbo4 OFF      11673.35 ± 342.77            1.7067 ± 0.0125
q8/turbo4 V-only   11160.84 ± 332.03 (−4.4%)    1.8943 ± 0.0132 (+11.0%)
q8/turbo4 broad    10785.84 ± 320.35 (−7.6%)    1.9193 ± 0.0133 (+12.5%)

V-only KLD penalty is 14σ above OFF. PPL ranks the configs backward; KLD ranks them in the order intuition predicts.

4. KLD reference noise floor on Metal: bit-exact zero

Built the fp16-KV reference twice on identical inputs, scored run #2 against run #1: KLD = 0.000000 ± 0.000000, RMS Δp = 0.000%. Every KLD delta in this PR thread is real signal, not nondeterminism floor. CUDA/HIP not measured; backend testers should re-measure on their build before relying on small KLD deltas.

5. Cross-corpus reproduction (closes the "wikitext-2-specific" objection)

corpus               KV format    fp16-KV PPL   q8/* PPL   Δ
wikitext-2 test      q8/turbo4    10813.67      6273.74    −42.0%
wikitext-103 train   q8/turbo4    33845.11      19902.90   −41.2%
wikitext-2 test      q8/q8        10813.67      9979.55    −7.7%
wikitext-103 train   q8/q8        33845.11      31818.69   −6.0%

Same direction, near-identical magnitude on a 516MB different corpus. Artifact is not wikitext-2-specific.

6. Downstream completion probe — model is not actually broken

3 short factual prompts on gemma-4 26B-A4B, fp16-KV vs q8/turbo4 OFF (the headline −42% PPL config), greedy generation:

prompt                       fp16-KV              q8/turbo4 OFF
"The capital of France is"   "Paris"              "Paris"
"Two plus two equals"        "Four"               "four"
"List three colors:"         "Red, Blue, Green"   "Red, Blue, Yellow"

Both configs answer correctly. The "−42% PPL improvement" config is not an obviously-broken model in the practical sense — it produces correct answers on prompts where many continuations are acceptable. KLD captures real distribution drift; downstream task accuracy is preserved despite the drift because the answer space is small. PPL alone tells you neither.

7. Engineering policy unchanged: default OFF + per-side opt-in

The matrix above does not change the default decision (rotation OFF on both sides — same default the fork shipped with pre-investigation). It changes the language in the recommendation table from "K-only catastrophic" to "PPL says catastrophic, KLD says best — measure with KLD on your model before opting in."

8. Full write-up

Full investigation, all data, all mechanism candidates, limitations, and reproducibility instructions are in the companion paper:

docs/papers/attn-rotation-and-ppl-artifact.md: "When Quantized Beats fp16: A KV-Rotation Investigation, and Why PPL Lies on gemma-class Instruct Models."

Ask for testers

If you're testing this PR:

  1. Build at 817e913ec (not db3595a75 — overrides were broken there).
  2. Run --kl-divergence-base + --kl-divergence on YOUR model with the four configs (OFF, K-only, V-only, broad) on q8/turbo4 KV.
  3. Report KLD numbers (not just PPL). PPL alone is unreliable on gemma-class.

CC: @erazortt @ggerganov (your "track KLD rather than PPL" comment in ggml-org#21038 is now a reproduced demonstration with controlled per-side measurements — happy to chat if you want any of this upstreamed)

@TheTom merged commit e0954d1 into feature/turboquant-kv-cache on May 1, 2026
22 of 50 checks passed
@TheTom deleted the fix/enable-attn-rot-by-default branch on May 1, 2026 13:45