fix(kv-cache): per-side env-knob control for upstream attn rotation (default OFF) #111
Conversation
Well, it seems to work for the native KV quants (Q8_0, Q4_0), but I am getting completely different results when using turboquants. I would have assumed that the now-enabled rotation would not affect the turboquants. The model I tested on was: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf

branch feature/turboquant-kv-cache:

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======

branch fix/enable-attn-rot-by-default:

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======
force-pushed 6c55e84 to cf5c3db
@erazortt great catch — your symmetric turbo4/turbo4 result is conclusive (KLD 2.25 → 2.59, same-top-p 55.9% → 52.9%). Reverted the broad enable and went per-side instead. Force-pushed new gating: enable
Re-validated on Qwen3.5-2B-Q8_0 wikitext-2 16 chunks @ ctx=2048: turbo4/turbo4 now sits at the rotation-off baseline (no regression), while q8_0/turbo4 still picks up the K-side rotation that helps Q8 quality. Would appreciate one more pass of your gemma-4 turbo4/turbo4 KLD eval to confirm it's back to the rotation-off numbers.

Asymmetric path note: my Qwen3.5-2B q8_0/turbo4 delta was within SE (10.9159 vs 10.9170), so the K-only rotation may not help as much as full rotation does on your test setup — it would also be useful to see q8_0/turbo4 numbers from your matrix if you have them. Trying to land the right shape of fix.
Local proof-of-safety pass on Qwen3.5-2B-Q8_0, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

Three things v3 does correctly:
What's NOT fully proven on this model: the absolute magnitude of the asymmetric quality win. Master rotation barely moves Qwen3.5-2B Q8_0 PPL (10.7897 → 10.7877, well within ±0.23 SE), so q8_0/turbo4 partial rotation also barely moves it (10.9170 → 10.9159). The structural shape of the fix is correct and safe; the magnitude is model-dependent. Your KLD-based eval on gemma-4 26B-A4B is the gold standard for sizing the actual win — particularly the q8_0/turbo4 cell if you have it on the same model.
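The "within SE" reasoning above can be sketched as a one-line check. This is a hypothetical helper for illustration (the numbers are the ones quoted in the comment; the function name is mine, not anything in the fork):

```python
# Hypothetical helper: a PPL delta only counts as signal if it exceeds
# the standard error reported by llama-perplexity.
def within_se(ppl_a: float, ppl_b: float, se: float) -> bool:
    """True if the PPL difference is smaller than the standard error."""
    return abs(ppl_a - ppl_b) < se

# Numbers from the comment: master rotation on Qwen3.5-2B Q8_0 (SE ~0.23).
print(within_se(10.7897, 10.7877, 0.23))  # rotation delta is noise
print(within_se(10.9170, 10.9159, 0.23))  # q8_0/turbo4 delta is noise too
```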
Great, now it seems to work correctly! Now the turbo4/turbo4 result is like before, while the q8/turbo4 results are as follows:

/d/sources/llama.cpp-my/build-turbo-non-rot/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo-non-rot-q8-t4.txt

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======

/d/sources/llama.cpp-my/build-turbo-2/bin/Release/llama-perplexity.exe -m /f/LLM/models/gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf --kl-divergence --kl-divergence-base kld --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 -ngl 99 -c 512 -ctk q8_0 -ctv turbo4 -fa on > turbo2-q8-t4.txt

====== Perplexity statistics ======
====== KL divergence statistics ======
====== Token probability statistics ======
Though I now see that for q8/turbo4 the PPL got worse, but KLD and same-top-p got better.
I'm considering options and running experiments.
Expanded the experiment matrix to triangulate this: ran a 13-cell PPL sweep on M5 Max, wikitext-2 32 chunks @ ctx=512, with per-side rotation overrides via two new debug env knobs. Models:

Gemma-4 26B-A4B Q8_0 (SWA + MoE)
Qwen2.5 1.5B Q4_K_M (pure-global, dense)
Observations
What this changes

The "skip rotation for turbo types" gating in v3 is wrong on the gemma family. Per-family rotation policies appear genuinely different (gemma SWA+MoE vs Qwen2.5 pure-global vs Qwen3.5 hybrid all show distinct optima), so a single universal default that's correct for all is unlikely. Leaning toward this PR shape:
Open to running more families (Phi-4, Mistral-Small, Llama-3) before locking the PR shape if that helps.
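A sweep like the one above could be scripted as below. This is a sketch only: the knob names follow the later v4 commit (`LLAMA_ATTN_ROT_K_OVERRIDE` / `LLAMA_ATTN_ROT_V_OVERRIDE`), the binary and model paths are placeholders, and it enumerates the full 16-cell cross where the comment's 13-cell sweep pruned some combinations:

```python
import itertools

# Placeholder paths, not the actual build used in the thread.
BIN = "./build/bin/llama-perplexity"
MODEL = "model.gguf"

def sweep_cells(kv_types=("q8_0", "turbo4"), rot=("0", "1")):
    """Yield (env_overrides, argv) pairs for every KV-type x rotation cell."""
    for ctk, ctv in itertools.product(kv_types, repeat=2):
        for k_rot, v_rot in itertools.product(rot, repeat=2):
            env = {"LLAMA_ATTN_ROT_K_OVERRIDE": k_rot,
                   "LLAMA_ATTN_ROT_V_OVERRIDE": v_rot}
            argv = [BIN, "-m", MODEL, "-c", "512",
                    "-ctk", ctk, "-ctv", ctv, "-fa", "on"]
            yield env, argv

cells = list(sweep_cells())
print(len(cells))  # 4 KV combos x 4 rotation combos = 16 cells
```

Each `(env, argv)` pair would be run via `subprocess` with the env merged into `os.environ`; the actual sweep pruned cells that duplicate each other.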
force-pushed cf5c3db to 9ecb4f2
…default OFF

Replaces the prior approach (auto-enable rotation for non-turbo quantized types) with explicit per-side opt-in via LLAMA_ATTN_ROT_K_OVERRIDE and LLAMA_ATTN_ROT_V_OVERRIDE. Default behavior: rotation OFF on both sides across all KV types.

Background: TheTom/turboquant_plus#88 surfaced that asymmetric q8_0/turbo* configs were missing the upstream activation rotation from ml-explore/llama.cpp#21038. Multiple iterations (broad enable, per-side gating with turbo skip) tried to find a smart default that balances quality across model families.

Empirical PPL+KLD testing on 7 model families (gemma-4 26B-A4B / 31B / E2B, Qwen2.5-7B, Qwen3.5-2B, Mistral-Small-24B, phi-4) showed the optimal rotation policy is highly model- and quant-specific. No single default is correct everywhere, including within the same architecture family (gemma-4 26B-A4B Q8, 31B Q8, and E2B Q4_K_L showed three distinct optima). phi-4 V-side rotation crashes with a graph-node hash overflow, ruling out any default-on policy that touches V rotation across model families.

Default OFF avoids regressing any tested model. Env-knob opt-in lets power users tune for their specific config based on documented per-model findings (see README/docs follow-up). LLAMA_ATTN_ROT_DISABLE remains as a no-op alias for historical scripts.

Co-Authored-By: tturney@psyguard.ai
force-pushed 9ecb4f2 to db3595a
Update: v4 — default OFF + per-side env knobs

After expanding the experiment matrix to 7 model families and discovering that:
I'm reshaping the fix from "smart per-side gating" (v3) to "default OFF + opt-in env knobs" (v4). Force-pushed.

v4 behavior
Why default OFF — the data

Expanded matrix on M5 Max wikitext-2 32 chunks @ ctx=512, q8_0/turbo4 KV across 7 model families:
Three things this matrix kills
What the matrix DOES tell us reliably
Path forward

This PR ships the env-knob infrastructure. Per-model recommendations belong in a follow-up README section, populated as the community reports KLD-based findings on their specific configs. Welcome any further testing. The ask for testers is now: run KLD eval (
…disable=false)

Previous state: db3595a added LLAMA_ATTN_ROT_K_OVERRIDE / _V_OVERRIDE per-side opt-in knobs but kept attn_rot_disable defaulting to TRUE for legacy LLAMA_ATTN_ROT_DISABLE compatibility. The override branches included `&& !attn_rot_disable` guards, so when LLAMA_ATTN_ROT_DISABLE is unset (default true) the per-side env knobs were silently no-ops. Users could not opt into rotation without also setting LLAMA_ATTN_ROT_DISABLE=0.

Fix: flip the attn_rot_disable default to false. Rotation is still OFF by default because attn_rot_k/v default to false. LLAMA_ATTN_ROT_DISABLE=1 still acts as a hard lock-out that blocks the per-side overrides, for users who want a single switch to guarantee no rotation.

Caught while running the cross-format KLD matrix for the rotation/PPL investigation paper — the V-only override appeared to silently fail. Confirmed with logs that attn_rot_v stayed 0 even with LLAMA_ATTN_ROT_V_OVERRIDE=1 until this default flip.
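The buggy and fixed gating described in the commit can be sketched as a pure function. Names mirror the commit text (`attn_rot_disable`, the per-side overrides); this is an illustration of the boolean logic, not the fork's actual C++ code:

```python
# Per-side rotation fires only if the user opted in AND the hard
# lock-out is not set; mirrors the `override && !attn_rot_disable` guard.
def rotation_enabled(override: bool, attn_rot_disable: bool) -> bool:
    return override and not attn_rot_disable

# Buggy state: attn_rot_disable defaulted to True, so opting in did nothing.
print(rotation_enabled(override=True, attn_rot_disable=True))   # False: silent no-op
# Fixed state: default flipped to False; the opt-in now takes effect.
print(rotation_enabled(override=True, attn_rot_disable=False))  # True
# Rotation still OFF by default because the overrides default to False.
print(rotation_enabled(override=False, attn_rot_disable=False)) # False
```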
Update: critical bug fix + cross-format KLD overturns the t4/t4 K-only narrative

Pushed.

1. Critical bug fix: per-side overrides were silently no-ops
If you tested env-knob behavior on

2. Cross-format KLD overturns the "t4/t4 K-only catastrophic +52.7%" framing

The earlier comment called
PPL and KLD point opposite directions on five of six rows. The "+52.7% K-only catastrophic" PPL number is itself an artifact in the same family as the q8/turbo4 V-only "win" — KLD on the same row shows K-only is the closest-to-fp16 configuration. Master's PR ggml-org#21038 is in fact doing what it claims on

3. Headline: PPL is unreliable on gemma-class instruct, KLD ranks correctly

Headline trio at 256 chunks (8× more data, ~3× tighter CIs vs fp16-KV reference PPL 20045.71 ± 604.41):
V-only KLD penalty is 14σ above OFF. PPL ranks the configs backward; KLD ranks them in the order intuition predicts.

4. KLD reference noise floor on Metal: bit-exact zero

Built the fp16-KV reference twice on identical inputs, scored run #2 against run #1: KLD = 0.000000 ± 0.000000, RMS Δp = 0.000%. Every KLD delta in this PR thread is real signal, not nondeterminism floor. CUDA/HIP not measured; backend testers should re-measure on their build before relying on small KLD deltas.

5. Cross-corpus reproduction (closes the "wikitext-2 specific" hammer)
Same direction, near-identical magnitude on a 516MB different corpus. The artifact is not wikitext-2-specific.

6. Downstream completion probe — model is not actually broken

3 short factual prompts on gemma-4 26B-A4B, fp16-KV vs q8/turbo4 OFF (the headline −42% PPL config), greedy generation:
Both configs answer correctly. The "−42% PPL improvement" config is not an obviously-broken model in the practical sense — it produces correct answers on prompts where many continuations are acceptable. KLD captures real distribution drift; downstream task accuracy is preserved despite the drift because the answer space is small. PPL alone tells you neither.

7. Engineering policy unchanged: default OFF + per-side opt-in

The matrix above does not change the default decision (rotation OFF on both sides — same default the fork shipped with pre-investigation). It changes the language in the recommendation table from "K-only catastrophic" to "PPL says catastrophic, KLD says best — measure with KLD on your model before opting in."

8. Full write-up

Full investigation, all data, all mechanism candidates, limitations, and reproducibility instructions are in the companion paper:
Ask for testers

If you're testing this PR:
CC: @erazortt @ggerganov (your "track KLD rather than PPL" comment in ggml-org#21038 is now a reproduced demonstration with controlled per-side measurements — happy to chat if you want any of this upstreamed)
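The PPL-vs-KLD disagreement above has a simple mechanism: per-token PPL sees only the probability assigned to the observed token, while KLD compares the whole predicted distribution against the reference. A toy example with synthetic probabilities (not numbers from any eval in this thread) shows the two metrics ranking two configs in opposite order:

```python
import math

def kld(p, q):
    """KL divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

ref = [0.50, 0.25, 0.15, 0.10]   # fp16 reference distribution over 4 tokens
a   = [0.60, 0.05, 0.05, 0.30]   # boosts the observed token, warps the tail
b   = [0.48, 0.26, 0.16, 0.10]   # slightly lower on the token, shape preserved

tok = 0                          # index of the observed token
ppl_a, ppl_b = 1 / a[tok], 1 / b[tok]   # single-token perplexity
print(ppl_a < ppl_b)             # True: PPL prefers config a
print(kld(ref, a) > kld(ref, b)) # True: KLD prefers config b
```

Config `a` "wins" on PPL while drifting far from the reference distribution, which is exactly the shape of the q8/turbo4 V-only "win" described above.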
Summary
Fixes TheTom/turboquant_plus#88 reported by @erazortt.
The fork was suppressing upstream's activation-rotation pre-quantization step from
ml-explore/llama.cpp#21038 ("llama : rotate activations for better quantization") via a `LLAMA_ATTN_ROT_DISABLE=true` default. That was correct for symmetric turbo (where the kernel-level WHT handles rotation end-to-end) but wrong for asymmetric configs like `q8_0`-K + `turbo*`-V: the K side stayed in the un-rotated coordinate space and lost the quality boost upstream rotation gives Q8_0 / Q4_0. The user-facing symptom is that asymmetric `q8_0`-K / `turbo`-V PPL was slightly worse than upstream `q8_0`/`q8_0` on the same model.

This patch flips the default. Master's `k_rot`/`v_rot` matrices and turbo's kernel-level WHT are independent rotations (different basis, different invert sites) and compose cleanly, so enabling both does not double-rotate.

Empirical validation
Qwen3.5-2B-Q8_0 weights, `q8_0`-K + `turbo4`-V cache, wikitext-2 16 chunks @ ctx=2048, M5 Max Metal:

`LLAMA_ATTN_ROT_DISABLE=1` (escape hatch)

Δ = -0.32% PPL, 15 of 16 chunks favorable. Small absolute, but consistent direction; matches the symmetric case @erazortt observed against upstream master `q8_0`/`q8_0`.

Asks for testers
`q8_0` or `q4_0` on K with `turbo*` on V) — `brew uninstall && brew install` or rebuild from source on this branch and re-run your favorite eval. PPL on wikitext is the cleanest signal.

Escape hatch preserved
`LLAMA_ATTN_ROT_DISABLE=1` still globally disables the upstream rotation if a model hits graph-node hash-table overflow (Phi-4 was the prior known failure case). No symptoms expected on the supported model set, but the env var stays for safety.

Out of scope
`src/llama-graph.cpp` (where the pre-rotate-queries WHT is gated on K being a turbo type) is unaffected — it remains correct for the turbo path. This change only touches the upstream-rotation gate in the KV cache constructor.

`LLAMA_ATTN_ROT_DISABLE=1` and report.
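The "compose cleanly, no double-rotate" claim in the summary can be sanity-checked numerically on the algebra alone. In this toy sketch, `H` stands in for a kernel-level WHT and `R` for an upstream rotation matrix; both are illustrative stand-ins, not the actual llama.cpp matrices, and the quantization step between the rotations is omitted:

```python
import math

# Orthonormal 4x4 Walsh-Hadamard matrix (H = H4 / 2, so H @ H.T = I).
H = [[0.5 * s for s in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

# An independent orthogonal rotation: block-diagonal 2D Givens rotations.
c, s = math.cos(0.7), math.sin(0.7)
R = [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, c, -s], [0, 0, s, c]]

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

x = [0.3, -1.2, 2.5, 0.7]
y = matvec(R, matvec(H, x))                           # WHT, then upstream rotation
back = matvec(transpose(H), matvec(transpose(R), y))  # invert in reverse order
print(max(abs(a - b) for a, b in zip(back, x)) < 1e-9)  # True: exact round-trip
```

Since both transforms are orthogonal, inverting each at its own site recovers the input exactly; the quality question in the thread is about which rotation better conditions the intermediate quantization, not about the composition itself.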