docs: final investigation log — 77.7 tok/s, 91% of q8_0 #1

Closed

KGardevoir wants to merge 431 commits into master from claude/rebase-triattention-kv-cache-k1pO6

Conversation

@KGardevoir
Owner

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TheTom and others added 30 commits April 15, 2026 14:42
Pre-computed turbo_wht_signs1_h4[32] and turbo_wht_signs2_h4[32] as
constant half4 arrays. Eliminates per-element float→half conversion
and reduces constant memory reads from 4 per half4 to 1.

Marginal improvement (~1%) — Metal compiler already optimized the
constant reads. But cleaner code and consistent with the half4 WHT.

PPL: 6.195 (unchanged)
Codex: no issues (included in Exp1 review scope)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
THE BIG WIN: moved WHT rotation from per-block dequant to graph-level
ggml_mul_mat ops. 47% speedup over previous best.

Prefill: 2095 tok/s (0.78x q8_0, was 1424 = 0.53x)
PPL: 6.201 (within 0.01 of 6.195 baseline)
Compression: 4.9x (unchanged)

Key insight: applying WHT in build_attn (after RoPE, before build_attn_mha)
matches the K quantize pipeline exactly. K stores WHT(RoPE(K)) from SET_ROWS,
Q becomes WHT(RoPE(Q)) from graph mul_mat. Dot products preserved.
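
In symbols: the signed WHT is an orthogonal matrix R, so rotating both sides leaves every attention score unchanged:

$$(R\,\mathrm{RoPE}(q))^\top (R\,\mathrm{RoPE}(k)) = \mathrm{RoPE}(q)^\top R^\top R\,\mathrm{RoPE}(k) = \mathrm{RoPE}(q)^\top \mathrm{RoPE}(k), \qquad R^\top R = I.$$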

Changes:
- llama-graph.cpp: Q forward rotation (R @ q) and V un-rotation (R^T @ cur)
  in the llm_graph_input_attn_kv build_attn overload
- ggml-metal.metal: stripped WHT from turbo3_dequantize_full_block
  (returns centroid * norm in rotated space, graph handles un-rotation)

Codex review: pipeline point correct, reshape dims correct, lifecycle OK.
Noted: only covers one build_attn overload (sufficient for Qwen3MoE).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
THE BREAKTHROUGH: block-32 with graph-side WHT rotation reaches q8_0 parity.

Prefill: 2747 tok/s (1.02x q8_0, was 0.78x with block-128)
PPL: 5.460 (32-chunk) / 6.193 (8-chunk) — within noise of baseline
Compression: 4.6x (slightly less than 4.9x due to per-block norm overhead)

Changes:
- QK_TURBO3: 128 → 32 (matches q4_0 block size for GPU parallelism)
- dequantize_turbo3_0: simple centroid lookup + norm scale (no WHT, no full-block)
- dequantize_turbo3_0_t4: same simple path (no SIMD shuffle needed)
- Flash attention nl: 8→2 (non-vec), 32→8 (vec) matching new block size

Why this works: with graph-side WHT rotation, dequant no longer needs the
128-element WHT butterfly. Each 32-element block can be decoded independently.
Smaller blocks = more GPU parallelism = faster flash attention.
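
A minimal CPU-side sketch of what the simplified per-block path does. The struct layout (2-bit magnitude index per element in qs[], 1 sign bit per element in signs[]) and the codebook values are assumptions for illustration, not the actual llama.cpp definitions:

```cpp
#include <cstdint>

// hypothetical 4-entry magnitude codebook (real values come from training)
static const float kCentroid[4] = {0.25f, 0.70f, 1.20f, 2.00f};

struct block_turbo3_sketch {   // one 32-element block, decoded independently
    float   norm;              // per-block norm scale (fp16 in the real layout)
    uint8_t qs[8];             // 32 x 2-bit magnitude index
    uint8_t signs[4];          // 32 x 1-bit sign
};

// centroid lookup + norm scale only; no WHT butterfly and no cross-block
// state, since the graph-side rotation has already been applied
void dequant_block32(const block_turbo3_sketch * b, float * y) {
    for (int i = 0; i < 32; ++i) {
        const int   idx = (b->qs[i >> 2] >> ((i & 3) * 2)) & 0x3;
        const float mag = kCentroid[idx];
        y[i] = (((b->signs[i >> 3] >> (i & 7)) & 1) ? -mag : mag) * b->norm;
    }
}
```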

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Added TURBO_LAYER_ADAPTIVE env var for per-layer cache type selection:
  0 = uniform (default)
  1 = q8_0 for first+last 4 layers, turbo3 for middle 32
  2 = q8_0 for last 8 layers, turbo3 for first 32

Results (Qwen3.5-35B-A3B, 8 chunks):
  uniform turbo3:  PPL = 6.193 (+1.3% vs q8_0)
  mode 1:          PPL = 6.185 (+1.2% vs q8_0)
  mode 2:          PPL = 6.110 (+0.0% vs q8_0!!!)

Mode 2 achieves q8_0 quality (PPL 6.110 vs 6.111) while compressing
32 of 40 layers at turbo3 (4.6x). Only the last 8 layers use q8_0.
Effective compression: ~3.5x overall vs 2.0x uniform q8_0.
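
The selection rule reduces to a per-layer switch at KV-cache setup. A hedged C++ sketch; the function name is illustrative, and the turbo type enum is assumed to match this branch's naming:

```cpp
#include "ggml.h"   // for ggml_type

// mode follows TURBO_LAYER_ADAPTIVE as documented above
static ggml_type kv_type_for_layer(uint32_t il, uint32_t n_layer, int mode) {
    switch (mode) {
        case 1:  // q8_0 for first+last 4 layers, turbo3 in between
            return (il < 4 || il + 4 >= n_layer) ? GGML_TYPE_Q8_0
                                                 : GGML_TYPE_TURBO3_0;
        case 2:  // q8_0 for the last 8 layers only
            return (il + 8 >= n_layer) ? GGML_TYPE_Q8_0 : GGML_TYPE_TURBO3_0;
        default: // mode 0: uniform turbo3
            return GGML_TYPE_TURBO3_0;
    }
}
```

Writing the comparisons as `il + 8 >= n_layer` keeps the unsigned arithmetic free of the underflow fixed in the follow-up commit below.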

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…ow guard

1. Thread-safe static init via C++ lambda (was data race on static int)
2. Guard n_layer >= 8 to prevent unsigned underflow on small models
3. Use const local for n_layer and is_turbo check

PPL verified: mode 2 still gives 6.1095 (matching q8_0 baseline)
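
Both fixes follow standard C++ patterns; a minimal sketch, with hypothetical names:

```cpp
#include <cstdlib>

// Fix 1: a function-local static initialized by an immediately-invoked
// lambda. C++11 guarantees the initializer runs exactly once, so the
// previous data race on a plain static int disappears.
static int turbo_layer_adaptive_mode() {
    static const int mode = [] {
        const char * env = std::getenv("TURBO_LAYER_ADAPTIVE");
        return env ? std::atoi(env) : 0;
    }();
    return mode;
}

// Fix 2: guard before subtracting, since n_layer is unsigned and
// n_layer - 8 would wrap around on models with fewer than 8 layers.
static bool use_q8_tail(uint32_t il, uint32_t n_layer) {
    return n_layer >= 8 && il >= n_layer - 8;
}
```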

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…n data

Part of TheTom#32: turbo3 prefill degrades relative to q8_0 with context length.

Changes so far:
- Skip ggml_cont when tensors already contiguous (+1%, minimal)
- Generated 32x32 rotation matrices (turbo-rotation-data-32.h) for
  reduced group size approach (16x less matmul compute)
- Fixed V un-rotation to check v->type not k->type

Next: update QK_TURBO3_GROUP, Metal WHT kernel, and KV cache for d=32.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Reducing WHT rotation group from 128 to 32 elements degrades quality.
Python kurtosis test showed 3.06 (good) on random data, but real Qwen3.5
KV tensors need 128-element groups for proper Gaussianization.

Group-32 also didn't help speed — actually slower at all context sizes.
This approach is a dead end.

Next: custom GGML_OP_TURBO_WHT for O(d log d) rotation without dense matmul.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Adds a new ggml operation for applying WHT rotation to 128-element groups.
Replaces the previous dense ggml_mul_mat(128x128, ...) approach.

Implementation:
- ggml.h: new op enum + ggml_turbo_wht(tensor, direction) API
- ggml.c: constructor with direction param in op_params
- ggml-cpu/ops.cpp: CPU impl (fp32 butterfly, parallel over groups)
- ggml-metal.metal: Metal kernel (fp16 half4 vectorized butterfly)
- ggml-metal-device: pipeline getter, supports_op
- ggml-metal-ops: dispatch with threadgroup-per-group layout
- llama-graph.cpp: uses ggml_turbo_wht instead of mul_mat+reshape

Results:
- PPL: 6.211 (within tolerance of 6.19 baseline)
- Context scaling: same as dense matmul (~8% gap at 4k vs q8_0)
- The matmul was NOT the bottleneck; the per-KV-position dequant is

The custom op is still valuable: eliminates rotation tensor storage,
cleaner graph (no reshape/cont), and correct O(d log d) complexity.
The context scaling regression comes from flash attention dequant cost,
not the graph rotation.
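
For reference, the core of an O(d log d) in-place butterfly on a 128-element group looks like this (fp32 sketch; the actual op additionally applies the precomputed turbo_wht sign flips, omitted here):

```cpp
#include <cmath>

// In-place 128-point Walsh-Hadamard butterfly: 7 stages of paired
// add/sub, then an orthonormal 1/sqrt(128) scale. With this scaling
// the plain transform is its own inverse.
void fwht_128(float * x) {
    for (int h = 1; h < 128; h <<= 1) {          // log2(128) = 7 stages
        for (int i = 0; i < 128; i += h << 1) {  // butterfly pairs
            for (int j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
    const float scale = 1.0f / std::sqrt(128.0f);
    for (int j = 0; j < 128; ++j) {
        x[j] *= scale;
    }
}
```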

Codex review: fixed missing OP_NAME table entry. Noted CPU fp32 vs
Metal fp16 precision difference (acceptable, Metal is the target).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Unrolled dequant with batched byte reads. Each 4-element group reads
qs and signs bytes ONCE instead of per-element. Codex-verified bit indexing.

Context scaling results:
  ctx=1024: 0.981x q8_0 (was 0.976x)
  ctx=2048: 0.989x q8_0 (was 0.960x)
  ctx=4096: 0.981x q8_0 (was 0.921x)

The ratio now stays FLAT at ~98% vs q8_0 across all context sizes.
Previous 7.9% gap at 4k context reduced to 1.9%.

PPL: 6.211 (within tolerance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Checks both:
1. PPL within 5% of q8_0 baseline (8-chunk wikitext-2)
2. Context scaling ratio > 0.95 at 4K context

Both must pass. Run: bash scripts/turbo-quality-gate.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Half-precision centroid table in vec flash attention dequant.
Reduces constant cache pressure at high access volumes.

Decode improvements (tok/s):
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Half LUT for cache pressure + float4 * scalar norm (1 multiply vs 4).
Verified on main: PPL 6.211, decode 78.4 tok/s (short) / 68.3 tok/s (8K).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
llama-bench had a hardcoded ggml_type_from_name() that didn't include
turbo types. Now turbo3 and turbo4 work with -ctk/-ctv flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Replace single 8-entry constant half LUT with two 4-entry LUTs
(one for positive, one for negative centroids). Each lookup now
has only 4 possible constant addresses instead of 8, reducing
divergent constant cache access that causes 10x decode slowdown
on M1 hardware.

Codex review caught sign-mapping bug in initial magnitude+sign
approach — the sorted centroid LUT has reversed magnitude order
for negative values. Split LUT avoids this by keeping the original
index mapping within each half.

PPL: 6.2109 (identical to main)
Decode M5: 74.0 tok/s (vs 77.4 main — 4.4% regression on M5)
Target: significant improvement on M1 where constant cache is the bottleneck

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Signs can mix per element within a thread's 4-element dequant: each
element independently selects from the positive or negative LUT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Port of @spiritbuun's norm correction from CUDA to Metal SET_ROWS.
After quantizing all 128 elements in a group, compute the L2 norm of
the centroid reconstruction vector and store:
  corrected_norm = original_norm / ||centroid_vector||
instead of raw original_norm.

This corrects systematic norm shrinkage from codebook quantization.
Zero decode cost — dequant code is unchanged, just reads a better
stored norm value. Only adds 128 FMAs to the quantizer (not hot path).
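
In plain C++ the correction is just an extra reduction in the quantizer. A sketch; names are illustrative:

```cpp
#include <cmath>

// After choosing centroids for all 128 elements of a group, rescale the
// stored norm so the reconstruction's L2 norm matches the original's.
// centroid_vec holds the 128 signed centroid values at unit norm scale.
static float corrected_norm(const float * centroid_vec, float original_norm) {
    float sumsq = 0.0f;
    for (int i = 0; i < 128; ++i) {
        sumsq += centroid_vec[i] * centroid_vec[i]; // the extra 128 FMAs
    }
    const float cn = std::sqrt(sumsq);
    return cn > 0.0f ? original_norm / cn : original_norm;
}
```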

Results (Qwen3.5-35B-A3B, wikitext-2):
  Before: PPL 6.2109 (8-chunk), 5.4714 (32-chunk) — +1.6% vs q8_0
  After:  PPL 6.1756 (8-chunk), 5.4451 (32-chunk) — +1.1% vs q8_0
  q8_0:   PPL 6.1109 (8-chunk), 5.4145 (32-chunk)

~0.5% quality improvement at zero decode-speed cost.

Original CUDA implementation:
  github.com/spiritbuun/llama-cpp-turboquant-cuda (commit 721880c)

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…text overflow

Two bugs that caused turbo3 to silently fail on pre-M5 Apple Silicon:

1. turbo3/turbo4 require flash attention for the dequant path, but
   llama-bench defaults to flash_attn=disabled. Auto-enable FA when
   turbo cache types are detected, with a warning log message. This
   fixes context creation failures on M2 Pro/Max and similar hardware.

2. KV cache ggml context was sized for exactly K/V tensors per layer,
   but turbo types add 2 rotation matrix tensors (turbo_rotation and
   turbo_rotation_inv) that weren't accounted for. Add +2 tensor
   overhead to prevent GGML_ASSERT(obj_new) failure.

Tested on M5 Max (Apple9/has_tensor=true) and M2 Pro (Apple8/has_tensor=false).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ported @spiritbuun's register centroid×norm LUT from CUDA to Metal.
On CUDA: 96-97% of q8_0 decode (big win).
On Metal: 75.2 tok/s vs 77.4 main (SLOWER — register spill).

The cn[8] float array spills to device memory on Metal's smaller
register file, making it slower than constant memory access.
Reverted to proven constant half LUT + float norm broadcast.

This is a fundamental Metal vs CUDA architecture difference:
- CUDA: 255 registers per thread, cn[8] fits easily
- Metal: smaller register file, 8 floats cause spill

The split-LUT approach (2x4 half entries) was also tested earlier
and showed similar regression (74.0 tok/s). Constant half[8] with
float norm broadcast remains the fastest vec dequant on Apple Silicon.

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…ling (Issue TheTom#29)

Three bugs from the block-size-32 refactor:

1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4 — split into
   separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels.
   turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.

2. Integer division in n_groups = nk0 / blocks_per_group silently dropped
   tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling
   division with tail-group bounds checking in turbo3, and GGML_ASSERT in
   WHT dispatch to catch non-128-aligned tensors.

3. TURBO_D constant was semantically coupled to QK_TURBO4 — replaced with
   TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that
   QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.

Closes TheTom#29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stack

turbo_init_rotation() allocated a 128x128 float array (64KB) on the stack
to generate the random Gaussian matrix, then memcpy'd it to the static
turbo_rotation[]. llama.cpp worker threads have reduced stack sizes,
causing segfault on first turbo4 quantize call.

Fix: generate directly into the static turbo_rotation[] array, eliminating
the intermediate stack allocation entirely. The Gram-Schmidt QR
decomposition already runs in-place on turbo_rotation[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unroll qs/signs extraction into separate variables before centroid
lookup. Helps Metal compiler schedule device reads ahead of ALU.
Ported from spiritbuun's CUDA batched load pattern.

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
TURBO_PROFILE_MODE env var (0-4):
  0 = full dequant (batched extract, production)
  1 = no-op (zeros) — decode ceiling without dequant cost
  2 = norm only — isolate norm read overhead
  3 = norm + qs, skip signs — isolate signs byte cost
  4 = full read, constant centroid — isolate LUT indexing cost

Set at runtime: TURBO_PROFILE_MODE=1 ./build/bin/llama-bench ...
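
The modes nest naturally as early-outs in the dequant. A host-side C++ skeleton of the idea only; the real switch lives in the Metal kernel, and all names here are hypothetical:

```cpp
// Each mode strips one more stage from the dequant so the cost of the
// removed stage can be read off the decode-rate delta.
float dequant_profiled(int mode, float norm, int idx, bool neg,
                       const float * lut) {
    if (mode == 1) return 0.0f;                       // no-op ceiling
    if (mode == 2) return norm;                       // norm read only
    const float mag  = (mode == 4) ? 1.0f : lut[idx]; // 4: skip LUT indexing
    const bool  sign = (mode == 3) ? false : neg;     // 3: skip signs byte
    return (sign ? -mag : mag) * norm;                // 0: full path
}
```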

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
TURBO_FORCE_NONVEC=1 forces turbo3 to use the non-vec FA kernel
(nl=2, 16 elements/call) instead of vec (nl=8, 4 elements/call).
Hypothesis: nl=8 loop overhead is the dominant decode cost on M2.

M5 Max: non-vec 78.0 vs vec 76.7 (+1.7% — FASTER even on M5!)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Runtime detection: has_tensor=false (M1/M2/M3/M4) → TURBO_USE_4MAG=1
M5+ with efficient constant cache → full 8-entry LUT (unchanged)

4-mag LUT: 4-entry magnitude constant LUT + XOR sign reversal.
Halves constant cache divergence from 8 to 4 addresses.

M2 Pro results:
  8K:  10.95 → 15.1 tok/s (+38%)
  16K:  8.0  → 11.6 tok/s (+45%)

M5 Max: 76.5 tok/s (no regression from main 77.4, within noise)

Profiling showed constant memory LUT costs 25% on M2 vs 14% on M5.
4-mag reduces this by using half the constant addresses + ALU sign.
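
Sketch of the 4-mag decode step in C++. The Metal kernel flips the fp16 sign bit with an XOR; shown here with a ternary for clarity, and with hypothetical table values:

```cpp
#include <cstdint>

static const float kMag[4] = {0.25f, 0.70f, 1.20f, 2.00f}; // magnitudes only

inline float dequant_4mag(uint8_t code3, float norm) {
    // code3: 3-bit code, low 2 bits index the magnitude, high bit is sign.
    // Only 4 distinct constant addresses are ever requested per lookup,
    // halving divergent constant-cache traffic vs an 8-entry signed LUT.
    const float mag = kMag[code3 & 0x3];
    return ((code3 & 0x4) ? -mag : mag) * norm; // sign resolved in ALU
}
```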

Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Deferred norm: 12.9 tok/s at 8K (worse)
Per-element norm: 15.1 tok/s at 8K (best)
The per-element multiply provides ALU work that hides constant
memory latency via instruction-level parallelism.

Full experiment log (M2 Pro, 8K decode, tok/s):
  4-mag + per-elem norm: 15.1 (BEST)
  Batched extract (8-LUT): 13.7
  2-pair half2 LUT: 12.0
  Deferred norm: 12.9
  Select chain: 11.9
  Bit-arithmetic: 11.6
  Main (8-LUT): 10.95
  No-op ceiling: 24.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
SongTonyLi and others added 8 commits May 4, 2026 08:28
…869) (ggml-org#22267)

* server: clamp n_discard to non-negative at JSON parse boundary (CVE-2026-21869)

A negative n_discard from client JSON causes heap-buffer-overflow in
update_slots() context-shift loop (CWE-787, CVSS 8.8). Clamp to 0 at
ingress; n_discard=0 already triggers auto-discard (n_left/2).

Ref: GHSA-8947-pfff-2f3c

* cont : cleaner

* cont : cleanerer

* cont : cleanest

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
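
The clamp described above amounts to one line at the JSON boundary. A sketch using nlohmann::json (as the server does), with the surrounding slot-parsing code assumed:

```cpp
#include <nlohmann/json.hpp>
#include <algorithm>
#include <cstdint>

// Negative n_discard would drive the context-shift loop out of bounds
// (CWE-787); clamp at parse time so downstream code never sees it.
// n_discard == 0 keeps the existing auto-discard behavior (n_left/2).
static int32_t parse_n_discard(const nlohmann::json & body) {
    const int32_t n_discard = body.value("n_discard", 0);
    return std::max(n_discard, 0);
}
```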
…d-clamp

security: cherry-pick CVE-2026-21869 (n_discard heap-buffer-overflow in server)
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack
   then memcpy'd into the static turbo_rotation array. This segfaults on
   llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB
   some Linux). Fix: generate the Gaussian matrix directly into
   turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always
   match (both represent the rotation group size). Add static_assert to
   catch silent divergence between CPU reference and GPU kernels.
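
Item 2 is a one-liner next to the definitions (values as given in this log, both 128):

```cpp
#define TURBO_D         128  // CPU reference rotation group size
#define QK_TURBO3_GROUP 128  // GPU kernel rotation group size
static_assert(TURBO_D == QK_TURBO3_GROUP,
              "rotation group size drifted between CPU reference and GPU kernels");
```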

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements TriAttention-based KV cache eviction for llama.cpp, enabling
long-context inference with bounded memory via importance-scored pruning.

Core files:
- src/llama-triattention.h/.cpp  CPU scoring engine with RoPE inversion
- ggml/src/ggml-cuda/triattention-score.cu  CUDA kernel (~1000x vs CPU)
- src/llama-kv-cache.cpp/.h      pruning hook, prefix/recent protection
- src/llama-context.cpp          state wired into llama_context lifecycle
- include/llama.h                public C API
- common/arg.cpp                 13 new --triattention-* CLI flags

Performance on Qwen3-8B Q4_K_M, RTX 3080, -c 512:
  CPU prune: ~5900ms/event  GPU prune: ~4-9ms/event
  Generation: 17.5 tok/s (no budget) → 75.0 tok/s (GPU, budget=256)
Adds --triattention-calibrate PATH, which collects pre-RoPE Q statistics
during normal inference and writes a .triattention binary file on context
teardown.  No Python or HF transformers required; calibration runs on the
quantized GGUF so stats match the actual runtime.

Mechanism:
- build_qkv() emits a dedicated cb("Qcur_pre_rope", il) hook immediately
  after the post-linear reshape, before any RoPE application.
- graph_get_cb() marks those tensors as ggml outputs and registers them in
  tria_cal->pending_q.  Pending map is cleared on graph rebuild.
- After each graph_compute(), triattention_calibrate_process_batch() reads
  the device tensors back (stride-aware for fused QKV paths) and accumulates
  per-(layer, head, freq) sums of Re(q_f), Im(q_f), |q_f| (sketched below).
- On llama_context destruction the .triattention binary is written via
  triattention_calibrate_write(), which divides by n_tokens and appends the
  r_f validation field.

New files: src/llama-triattention-calibrate.{h,cpp}
New API:   llama_triattention_calibrate_start(ctx, path)
New flag:  --triattention-calibrate PATH  (SERVER + CLI examples)
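
The accumulation step reduces each read-back Q tensor to three running sums per (layer, head, freq) slot. A hedged sketch with illustrative names, not the actual llama-triattention-calibrate API:

```cpp
#include <cmath>
#include <complex>

struct freq_stats {                 // one slot per (layer, head, freq)
    double sum_re = 0.0, sum_im = 0.0, sum_abs = 0.0;
};

// q_f: the (re, im) pair of pre-RoPE Q at rotary frequency f
static void accumulate(freq_stats & s, std::complex<float> q_f) {
    s.sum_re  += q_f.real();
    s.sum_im  += q_f.imag();
    s.sum_abs += std::abs(q_f);
}
// On context teardown each sum is divided by n_tokens and written to the
// .triattention file along with the r_f validation field.
```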

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
…file

1. ggml_set_output on a reshape-view or view_3d does not protect the
   backing tensor from being reused within the same graph.  Walk the
   view chain to find the actual data tensor and mark that as output
   too (see the sketch after this list).

2. ggml_backend_sched_graph_compute_async returns while GPU kernels are
   still running; add an explicit synchronize() before reading Q tensors
   so the device data is fully written before ggml_backend_tensor_get.

3. For fused-QKV models the token stride nb2 is wider than one Q row,
   so nb2*ne2 exceeds ggml_nbytes(tensor) and triggers an assertion abort.
   Use ggml_nbytes(cur) as the read size instead; stride-based indexing
   inside the accumulation loop is unaffected.
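
Fix 1 as code. ggml tensors expose view_src; the helper name is hypothetical:

```cpp
#include "ggml.h"

// Mark not just the view but every tensor down the view chain as a graph
// output, so the backing buffer can't be recycled within the same graph.
static void set_output_through_views(struct ggml_tensor * t) {
    ggml_set_output(t);
    while (t->view_src != nullptr) {
        t = t->view_src;            // reshape/view_3d -> backing tensor
        ggml_set_output(t);
    }
}
```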

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
llama-cli exits immediately when --no-conversation is passed, which
the previous help-text example used.  The flag was also not registered
for LLAMA_EXAMPLE_COMPLETION, so switching to llama-completion (as
llama-cli instructs) still didn't surface it.

- add LLAMA_EXAMPLE_COMPLETION to the flag's set_examples() list
- update the example command in the description to use llama-completion

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
@KGardevoir force-pushed the claude/rebase-triattention-kv-cache-k1pO6 branch from 7510276 to 0225469 on May 5, 2026 15:47
claude and others added 5 commits May 5, 2026 21:02
Adds stderr diagnostics at each stage so we can see exactly where
the pipeline breaks when no .triattention file is produced:

  1. graph_get_cb: prints when Qcur_pre_rope fires for layer 0
     (shape + type), so we know the callback is being reached
  2. process_ubatch: prints WARNING if pending_q is empty after
     graph_compute (means the callback never fired); otherwise prints
     cumulative token count after each batch
  3. triattention_calibrate_write: prints entry with output path and
     token count before attempting the file write

Also adds -no-cnv -n 0 to the help-text example so conversation mode
is disabled and only the prompt pass runs (no token generation).

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
The calibration callback only fired for models that call build_qkv.
Models that build QKV inline (olmo2, olmoe, openelm, gemma4-iswa,
step35-iswa, minimax-m2, plamo2, plamo3) were silently skipped,
producing the 'pending_q is empty' warning.

Add cb(Qcur, "Qcur_pre_rope", il) between the Q reshape/norm and the
ggml_rope_ext call in each of these model files, matching the same
insertion point used by build_qkv.

MLA models (deepseek2, minicpm3, plm) are left untouched — their Q is
split into q_nope + q_pe before attention, so the calibration concept
does not directly apply.

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
All Qwen3 models apply RMSNorm to Q after build_qkv returns.  The
hook in build_qkv fired on the pre-norm Q, which is wrong for two
reasons: it's not the tensor that actually enters RoPE, and the
second cb(Qcur, "Qcur_pre_rope") in the model file overwrites it
with the correct post-norm tensor anyway.

Add cb(Qcur, "Qcur_pre_rope", il) after cb(Qcur, "Qcur_normed")
in: qwen3, qwen3moe, qwen3next, qwen35, qwen35moe.

Also add a one-shot stderr breadcrumb in graph_get_cb showing the
first tensor name seen and whether tria_cal is non-null, to confirm
the callback wiring before we reach the Qcur_pre_rope check.

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
Four correctness bugs fixed:

1. TURBO4_0 missing WHT inverse (main fix): dequantize_row_turbo4_0
   leaves values in WHT-rotated space (by design for attention, where Q
   is also rotated). TriAttention needs unrotated K to correctly invert
   RoPE before scoring. need_wht_inv now includes GGML_TYPE_TURBO4_0 in
   both the CPU and GPU paths.

2. padded_hd removed: the ((hd+127)/128)*128 rounding was wrong for
   Q8_0/F16/F32 caches on models with head_dim < 128 (e.g. 64-dim heads).
   All functions now use hd (= cal->head_dim) directly. For TurboQuant
   types this is a no-op since they require head_dim=128.

3. WHT inverse replaced: matvec_128(TURBO_ROTATION_RT, ...) (O(n²),
   64 KB static matrix included via turbo-rotation-data.h) replaced by
   turbo_cpu_fwht_inverse() (O(n log n), in-place, supports 64 and 128
   element groups). Removes the 64 KB static data from the binary.

4. Interleaved rope inversion fixed: triattention_invert_rope for
   rope_style=1 (interleaved input) was writing output in interleaved
   layout, but triattention_score_keys always reads half layout
   [re_0..re_{fc-1} | im_0..im_{fc-1}]. Fixed to always emit half layout.
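
The layout fix in item 4 reduces to a de-interleave before scoring (illustrative names):

```cpp
// Convert interleaved RoPE pairs (re_0, im_0, re_1, im_1, ...) into the
// half layout [re_0..re_{fc-1} | im_0..im_{fc-1}] that
// triattention_score_keys expects, regardless of rope_style.
static void emit_half_layout(const float * in, float * out, int fc) {
    for (int f = 0; f < fc; ++f) {
        out[f]      = in[2*f];      // real part
        out[fc + f] = in[2*f + 1];  // imaginary part
    }
}
```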

Also updates comments: removes the incorrect claim that turbo4 dequant
applies R^T internally, clarifies that callers needing unrotated K must
apply WHT_inv themselves.

https://claude.ai/code/session_019VVHgA9W2JALWFEhSxixUH
@KGardevoir closed this on May 7, 2026