fix(kv-cache): per-side env-knob control for upstream attn rotation (default OFF) #111
Conversation
Well, it seems to work for the native KV quants (Q8_0, Q4_0), but I am getting completely different results when using turboquants. I would have assumed that the now-enabled rotation would not affect the turboquants. The model I tested on was: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf

branch feature/turboquant-kv-cache:

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======

branch fix/enable-attn-rot-by-default:

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======
force-pushed 6c55e84 to cf5c3db
@erazortt great catch — your symmetric turbo4/turbo4 result is conclusive (KLD 2.25 → 2.59, same-top-p 55.9% → 52.9%). Reverted the broad enable and went per-side instead. Force-pushed new gating: enable
Re-validated on Qwen3.5-2B-Q8_0 wikitext-2 16 chunks @ ctx=2048: turbo4/turbo4 now sits at the rotation-off baseline (no regression), while q8_0/turbo4 still picks up the K-side rotation that helps Q8 quality. Would appreciate one more pass of your gemma-4 turbo4/turbo4 KLD eval to confirm it's back to the rotation-off numbers.

Asymmetric path note: my Qwen3.5-2B q8_0/turbo4 delta was within SE (10.9159 vs 10.9170), so the K-only rotation may not help as much as full rotation does on your test setup — it would also be useful to see q8_0/turbo4 numbers from your matrix if you have them. Trying to land the right shape of fix.
Local proof-of-safety pass on Qwen3.5-2B-Q8_0, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

Three things v3 does correctly:
What's NOT fully proven on this model: the absolute magnitude of the asymmetric quality win. Master rotation barely moves Qwen3.5-2B Q8_0 PPL (10.7897 → 10.7877, well within ±0.23 SE), so q8_0/turbo4 partial rotation also barely moves it (10.9170 → 10.9159). The structural shape of the fix is correct and safe; the magnitude is model-dependent. Your KLD-based eval on gemma-4 26B-A4B is the gold standard for sizing the actual win — particularly the q8_0/turbo4 cell if you have it on the same model.
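The "within SE" reasoning above can be sketched as a one-line check. This is a hypothetical helper for illustration (the numbers are the ones quoted in the comment; the function name is mine, not anything in the fork):

```python
# Hypothetical helper: a PPL delta only counts as signal if it exceeds
# the standard error reported by llama-perplexity.
def within_se(ppl_a: float, ppl_b: float, se: float) -> bool:
    """True if the PPL difference is smaller than the standard error."""
    return abs(ppl_a - ppl_b) < se

# Numbers from the comment: master rotation on Qwen3.5-2B Q8_0 (SE ~0.23).
print(within_se(10.7897, 10.7877, 0.23))  # rotation delta is noise
print(within_se(10.9170, 10.9159, 0.23))  # q8_0/turbo4 delta is noise too
```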
Great, now it seems to work correctly! Now the turbo4/turbo4 result is like before, while the q8/turbo4 results are as follows:

/d/sources/llama.cpp-my/build-turbo-non-rot/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo-non-rot-q8-t4.txt

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======

/d/sources/llama.cpp-my/build-turbo-2/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo2-q8-t4.txt

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======
Though I now see that for q8/turbo4 the PPL got worse, but KLD and same-top-p got better.
I'm considering options and running experiments.
Expanded the experiment matrix to triangulate this: ran a 13-cell PPL sweep on M5 Max, wikitext-2 32 chunks @ ctx=512, with per-side rotation overrides via two new debug env knobs. Models:

Gemma-4 26B-A4B Q8_0 (SWA + MoE)
Qwen2.5 1.5B Q4_K_M (pure-global, dense)
Observations
What this changes

The "skip rotation for turbo types" gating in v3 is wrong on the gemma family. Per-family rotation policies appear genuinely different (gemma SWA+MoE vs Qwen2.5 pure-global vs Qwen3.5 hybrid all show distinct optima), so a single universal default that's correct for all is unlikely. Leaning toward this PR shape:
Open to running more families (Phi-4, Mistral-Small, Llama-3) before locking the PR shape if that helps.
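A sweep like the one above could be scripted as below. This is a sketch only: the knob names follow the later v4 commit (`LLAMA_ATTN_ROT_K_OVERRIDE` / `LLAMA_ATTN_ROT_V_OVERRIDE`), the binary and model paths are placeholders, and it enumerates the full 16-cell cross where the comment's 13-cell sweep pruned some combinations:

```python
import itertools

# Placeholder paths, not the actual build used in the thread.
BIN = "./build/bin/llama-perplexity"
MODEL = "model.gguf"

def sweep_cells(kv_types=("q8_0", "turbo4"), rot=("0", "1")):
    """Yield (env_overrides, argv) pairs for every KV-type x rotation cell."""
    for ctk, ctv in itertools.product(kv_types, repeat=2):
        for k_rot, v_rot in itertools.product(rot, repeat=2):
            env = {"LLAMA_ATTN_ROT_K_OVERRIDE": k_rot,
                   "LLAMA_ATTN_ROT_V_OVERRIDE": v_rot}
            argv = [BIN, "-m", MODEL, "-c", "512",
                    "-ctk", ctk, "-ctv", ctv, "-fa", "on"]
            yield env, argv

cells = list(sweep_cells())
print(len(cells))  # 4 KV combos x 4 rotation combos = 16 cells
```

Each `(env, argv)` pair would be run via `subprocess` with the env merged into `os.environ`; the actual sweep pruned cells that duplicate each other.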
force-pushed cf5c3db to 9ecb4f2
…default OFF

Replaces the prior approach (auto-enable rotation for non-turbo quantized types) with explicit per-side opt-in via LLAMA_ATTN_ROT_K_OVERRIDE and LLAMA_ATTN_ROT_V_OVERRIDE. Default behavior: rotation OFF on both sides across all KV types.

Background: TheTom/turboquant_plus#88 surfaced that asymmetric q8_0/turbo* configs were missing the upstream activation rotation from ml-explore/llama.cpp#21038. Multiple iterations (broad enable, per-side gating with turbo skip) tried to find a smart default that balances quality across model families.

Empirical PPL+KLD testing on 7 model families (gemma-4 26B-A4B / 31B / E2B, Qwen2.5-7B, Qwen3.5-2B, Mistral-Small-24B, phi-4) showed the optimal rotation policy is highly model- and quant-specific. No single default is correct everywhere, including within the same architecture family (gemma-4 26B-A4B Q8, 31B Q8, and E2B Q4_K_L showed three distinct optima). phi-4 V-side rotation crashes with a graph-node hash overflow, ruling out any default-on policy that touches V rotation across model families.

Default OFF avoids regressing any tested model. Env-knob opt-in lets power users tune for their specific config based on documented per-model findings (see README/docs follow-up). LLAMA_ATTN_ROT_DISABLE remains as a no-op alias for historical scripts.

Co-Authored-By: tturney@psyguard.ai
force-pushed 9ecb4f2 to db3595a
Update: v4 — default OFF + per-side env knobs

After expanding the experiment matrix to 7 model families and discovering that:
I'm reshaping the fix from "smart per-side gating" (v3) to "default OFF + opt-in env knobs" (v4). Force-pushed.

v4 behavior
Why default OFF — the data

Expanded matrix on M5 Max wikitext-2 32 chunks @ ctx=512, q8_0/turbo4 KV across 7 model families:
Three things this matrix kills
What the matrix DOES tell us reliably
Path forward

This PR ships the env-knob infrastructure. Per-model recommendations belong in a follow-up README section, populated as the community reports KLD-based findings on their specific configs. Welcome any further testing. The ask for testers is now: run KLD eval (
…disable=false)

Previous state: db3595a added LLAMA_ATTN_ROT_K_OVERRIDE / _V_OVERRIDE per-side opt-in knobs but kept attn_rot_disable defaulting to TRUE for legacy LLAMA_ATTN_ROT_DISABLE compatibility. The override branches included `&& !attn_rot_disable` guards, so when LLAMA_ATTN_ROT_DISABLE is unset (default true) the per-side env knobs were silently no-ops. Users could not opt into rotation without also setting LLAMA_ATTN_ROT_DISABLE=0.

Fix: flip the attn_rot_disable default to false. Rotation is still OFF by default because attn_rot_k/v default to false. LLAMA_ATTN_ROT_DISABLE=1 still acts as a hard lock-out that blocks the per-side overrides, for users who want a single switch to guarantee no rotation.

Caught while running the cross-format KLD matrix for the rotation/PPL investigation paper — the V-only override appeared to silently fail. Confirmed with logs that attn_rot_v stayed 0 even with LLAMA_ATTN_ROT_V_OVERRIDE=1 until this default flip.
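The buggy and fixed gating described in the commit can be sketched as a pure function. Names mirror the commit text (`attn_rot_disable`, the per-side overrides); this is an illustration of the boolean logic, not the fork's actual C++ code:

```python
# Per-side rotation fires only if the user opted in AND the hard
# lock-out is not set; mirrors the `override && !attn_rot_disable` guard.
def rotation_enabled(override: bool, attn_rot_disable: bool) -> bool:
    return override and not attn_rot_disable

# Buggy state: attn_rot_disable defaulted to True, so opting in did nothing.
print(rotation_enabled(override=True, attn_rot_disable=True))   # False: silent no-op
# Fixed state: default flipped to False; the opt-in now takes effect.
print(rotation_enabled(override=True, attn_rot_disable=False))  # True
# Rotation still OFF by default because the overrides default to False.
print(rotation_enabled(override=False, attn_rot_disable=False)) # False
```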
Update: critical bug fix + cross-format KLD overturns the t4/t4 K-only narrative

Pushed.

1. Critical bug fix: per-side overrides were silently no-ops
If you tested env-knob behavior on

2. Cross-format KLD overturns the "t4/t4 K-only catastrophic +52.7%" framing

The earlier comment called
PPL and KLD point opposite directions on five of six rows. The "+52.7% K-only catastrophic" PPL number is itself an artifact in the same family as the q8/turbo4 V-only "win" — KLD on the same row shows K-only is the closest-to-fp16 configuration. Master's PR ggml-org#21038 is in fact doing what it claims on

3. Headline: PPL is unreliable on gemma-class instruct, KLD ranks correctly

Headline trio at 256 chunks (8× more data, ~3× tighter CIs vs fp16-KV reference PPL 20045.71 ± 604.41):
V-only KLD penalty is 14σ above OFF. PPL ranks the configs backward; KLD ranks them in the order intuition predicts.

4. KLD reference noise floor on Metal: bit-exact zero

Built the fp16-KV reference twice on identical inputs, scored run #2 against run #1: KLD = 0.000000 ± 0.000000, RMS Δp = 0.000%. Every KLD delta in this PR thread is real signal, not nondeterminism floor. CUDA/HIP not measured; backend testers should re-measure on their build before relying on small KLD deltas.

5. Cross-corpus reproduction (closes the "wikitext-2 specific" hammer)
Same direction, near-identical magnitude on a 516MB different corpus. The artifact is not wikitext-2-specific.

6. Downstream completion probe — model is not actually broken

3 short factual prompts on gemma-4 26B-A4B, fp16-KV vs q8/turbo4 OFF (the headline −42% PPL config), greedy generation:
Both configs answer correctly. The "−42% PPL improvement" config is not an obviously-broken model in the practical sense — it produces correct answers on prompts where many continuations are acceptable. KLD captures real distribution drift; downstream task accuracy is preserved despite the drift because the answer space is small. PPL alone tells you neither.

7. Engineering policy unchanged: default OFF + per-side opt-in

The matrix above does not change the default decision (rotation OFF on both sides — same default the fork shipped with pre-investigation). It changes the language in the recommendation table from "K-only catastrophic" to "PPL says catastrophic, KLD says best — measure with KLD on your model before opting in."

8. Full write-up

Full investigation, all data, all mechanism candidates, limitations, and reproducibility instructions are in the companion paper:
Ask for testers

If you're testing this PR:
CC: @erazortt @ggerganov (your "track KLD rather than PPL" comment in ggml-org#21038 is now a reproduced demonstration with controlled per-side measurements — happy to chat if you want any of this upstreamed)
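The PPL-vs-KLD disagreement above has a simple mechanism: per-token PPL sees only the probability assigned to the observed token, while KLD compares the whole predicted distribution against the reference. A toy example with synthetic probabilities (not numbers from any eval in this thread) shows the two metrics ranking two configs in opposite order:

```python
import math

def kld(p, q):
    """KL divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

ref = [0.50, 0.25, 0.15, 0.10]   # fp16 reference distribution over 4 tokens
a   = [0.60, 0.05, 0.05, 0.30]   # boosts the observed token, warps the tail
b   = [0.48, 0.26, 0.16, 0.10]   # slightly lower on the token, shape preserved

tok = 0                          # index of the observed token
ppl_a, ppl_b = 1 / a[tok], 1 / b[tok]   # single-token perplexity
print(ppl_a < ppl_b)             # True: PPL prefers config a
print(kld(ref, a) > kld(ref, b)) # True: KLD prefers config b
```

Config `a` "wins" on PPL while drifting far from the reference distribution, which is exactly the shape of the q8/turbo4 V-only "win" described above.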
Summary
Fixes TheTom/turboquant_plus#88 reported by @erazortt.
The fork was suppressing upstream's activation-rotation pre-quantization step from
ml-explore/llama.cpp#21038 ("llama : rotate activations for better quantization") via a `LLAMA_ATTN_ROT_DISABLE=true` default. That was correct for symmetric turbo (where the kernel-level WHT handles rotation end-to-end) but wrong for asymmetric configs like `q8_0`-K + `turbo*`-V: the K side stayed in the un-rotated coordinate space and lost the quality boost upstream rotation gives Q8_0 / Q4_0. The user-facing symptom is that asymmetric `q8_0`-K / `turbo`-V PPL was slightly worse than upstream `q8_0`/`q8_0` on the same model.

This patch flips the default. Master's `k_rot`/`v_rot` matrices and turbo's kernel-level WHT are independent rotations (different basis, different invert sites) and compose cleanly, so enabling both does not double-rotate.

Empirical validation
Qwen3.5-2B-Q8_0 weights, `q8_0`-K + `turbo4`-V cache, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

`LLAMA_ATTN_ROT_DISABLE=1` (escape hatch)

Δ = -0.32% PPL, 15 of 16 chunks favorable. Small absolute, but consistent direction; matches the symmetric case @erazortt observed against upstream master `q8_0`/`q8_0`.

Asks for testers
`q8_0` or `q4_0` on K with `turbo*` on V) — `brew uninstall && brew install` or rebuild from source on this branch and re-run your favorite eval. PPL on wikitext is the cleanest signal.

Escape hatch preserved
`LLAMA_ATTN_ROT_DISABLE=1` still globally disables the upstream rotation if a model hits graph-node hash-table overflow (Phi-4 was the prior known failure case). No symptoms expected on the supported model set, but the env var stays for safety.

Out of scope
`src/llama-graph.cpp` (where the pre-rotate-queries WHT is gated on K being a turbo type) is unaffected — it remains correct for the turbo path. This change only touches the upstream-rotation gate in the KV cache constructor.

`LLAMA_ATTN_ROT_DISABLE=1` and report.
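The "compose cleanly, no double-rotate" claim in the summary can be sanity-checked numerically on the algebra alone. In this toy sketch, `H` stands in for a kernel-level WHT and `R` for an upstream rotation matrix; both are illustrative stand-ins, not the actual llama.cpp matrices, and the quantization step between the rotations is omitted:

```python
import math

# Orthonormal 4x4 Walsh-Hadamard matrix (H = H4 / 2, so H @ H.T = I).
H = [[0.5 * s for s in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

# An independent orthogonal rotation: block-diagonal 2D Givens rotations.
c, s = math.cos(0.7), math.sin(0.7)
R = [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, c, -s], [0, 0, s, c]]

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

x = [0.3, -1.2, 2.5, 0.7]
y = matvec(R, matvec(H, x))                           # WHT, then upstream rotation
back = matvec(transpose(H), matvec(transpose(R), y))  # invert in reverse order
print(max(abs(a - b) for a, b in zip(back, x)) < 1e-9)  # True: exact round-trip
```

Since both transforms are orthogonal, inverting each at its own site recovers the input exactly; the quality question in the thread is about which rotation better conditions the intermediate quantization, not about the composition itself.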