perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT by signalnine · Pull Request #36 · TheTom/llama-cpp-turboquant

signalnine · 2026-03-31T18:22:46Z

Summary

CUDA flash attention kernel optimizations, auto-asymmetric KV for
quantized models, and SET_ROWS warp shuffle WHT. Combined effect:
turbo2 now beats q8_0 decode speed at 32K context.

Performance (tg128 @ d32768, Qwen3.5 35B Q4_K_M, RTX 5090)

Type	Before	After	Improvement	vs q8_0
q8_0	156	181	+16%	baseline
turbo2	148	185	+25%	+2.5% faster
turbo3	127	171	+35%	-5.5%
turbo4	101	117	+16%	-35%

Changes

1. Shared-memory LUT for turbo KQ scoring (+15% turbo3, +4% turbo2)

Precompute Q[d] × centroid[c] into shared memory once per decode
step. The KQ inner loop does a single shmem read per element instead
of centroid lookup + multiply. 8-wide processing with bank-conflict-
free stride. turbo4 excluded (16 centroids too large for shmem).

2. General VEC FA optimizations (+16% q8_0 baseline)

__launch_bounds__ occupancy 1→3 (3 blocks/SM, better latency hiding)
__expf fast-math softmax (~3.7%)
L2 prefetch hints for next K+V blocks
Aggressive sparse V threshold (5e-3 for turbo3/4, 1e-2 for turbo2)
Remove turbo from V_is_unquantized (quantized V dequant path)

3. q8_1 Q path for turbo types

Turbo K types now use q8_1-quantized Q (int8 packed) instead of
float2, reducing Q register footprint 4×. KQ dot product rewritten
to process 4 elements per iteration with integer Q.

4. Auto-asymmetric KV for quantized models

When user requests symmetric turbo K+V on a quantized-weight model,
auto-downgrades K to q8_0. Prevents catastrophic PPL on outlier
models (Qwen 2.5: 4015→8.85). Skipped for mismatched K/V head dims.
Override: TURBO_SYMMETRIC=1.

5. Warp shuffle WHT (SET_ROWS)

Replace 5 of 7 __syncthreads() with __shfl_xor_sync() for
intra-warp butterfly stages in all turbo SET_ROWS kernels.

Quality

PPL unchanged across all optimizations:

Model	q8_0	turbo3	turbo4
Qwen 3.5 Q4_K_M	6.18	6.24 (auto-asym)	6.23 (sym)
Qwen 2.5 Q2_K	8.82	8.85 (auto-asym)	—

Test plan

Coherence: 4 models × 4 KV combos (16/16)
PPL: Qwen 3.5 + Qwen 2.5 (3 types each)
NIAH: turbo3 + turbo2 at 4K + 32K (10-11/11)
Server smoke: 3 types × 3 requests (9/9)
CPU-only: Mixtral + DeepSeek (2/2)
Head-to-head vs Madreag/turbo3-cuda (matched or exceeded)

Credit: FA optimizations based on Madreag/turbo3-cuda.
Warp shuffle WHT based on seanrasch/perf/ftz-and-wht-shuffle.

🤖 Generated with Claude Code

New types: GGML_TYPE_TURBO3_0 (3-bit) and GGML_TYPE_TURBO4_0 (4-bit) Implements PolarQuant + QJL compression per the ICLR 2026 paper. Block size = 128 (matching head_dim for optimal rotation Gaussianization) turbo3: 52 bytes per 128 values = 3.25 bits/value (4.9× vs fp16) turbo4: 68 bytes per 128 values = 4.25 bits/value (3.8× vs fp16) Status: - ✅ Type definitions in ggml.h - ✅ Block structures in ggml-common.h - ✅ Quantize/dequantize C implementation in ggml-turbo-quant.c - ✅ Registered in ggml.c type traits - ✅ Added to kv_cache_types in arg.cpp - ✅ Builds successfully - ✅ Shows in --help output - ❌ Metal SET_ROWS kernel not implemented (blocks GPU inference) - ❌ Needs Metal dequantize kernels for attention computation Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Added Metal shader implementations: - quantize_turbo3_0 / quantize_turbo4_0 (per-block quantization) - dequantize_turbo3_0 / dequantize_turbo4_0 (type4x4 and type4 variants) - kernel_set_rows_turbo template (128-element block size) - Flash attention instantiations for all dk/dv variants Added TURBO3_0/TURBO4_0 to Metal device SET_ROWS validation. Builds successfully. Testing with Qwen 3.5 35B-A3B MoE on M5 Max. Note: Initial version uses simplified quantization (no rotation matrix) for Metal compatibility. Full rotation requires custom kernel with extra buffer bindings — tracked for follow-up. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Embedded pre-computed 128×128 rotation and QJL matrices (256KB constant memory) directly in the Metal shader. Both quantize and dequantize now perform the full TurboQuant algorithm: Quantize: normalize → rotate → codebook → inverse rotate → residual → QJL Dequantize: codebook → inverse rotate → QJL correction → rescale Previous version (no rotation) produced garbage. This should produce meaningful output since the rotation Gaussianizes the KV distribution. Note: dequantize does full 128-element rotation per chunk (8× work). Optimization possible with caching or restructured kernel in follow-up. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eTom#21 - Inlined turbo-matrices.h directly into ggml-metal.metal (256KB) to fix JIT compilation failure with #include - Added C round-trip test (test-turbo-quant.c): turbo3 cosine=0.906, turbo4 cosine=0.966 — matches Python prototype - Metal library loads successfully ("loaded in 5.9 sec") - Model runs on Metal but output quality needs debugging (Metal quantize/dequantize may have a bug vs the working C version) C round-trip PROVES the algorithm works in C. Metal shader needs debugging — likely an issue with the dequantize chunk addressing or the large constant arrays in thread-local memory. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m#23 Codex review found: 1. Stale duplicate code in dequantize_turbo3_0_t4 (compile would fail) 2. thread static is risky/non-portable in MSL Fixed: removed thread static caching, using plain thread locals. Speed unchanged (2.4 tok/s) — the static caching wasn't actually working on Metal. True optimization needs architectural change in flash attention kernel to dequantize once per block, not per chunk. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…heTom#26 Massive reduction in constant memory and compute: - 256KB of dense matrices → 512 bytes of sign arrays - O(d²) = 16,384 ops → O(d log d) = 896 ops per rotation - Metal shader file: 1.5MB → 432KB Speed: still 2.4 tok/s. WHT reduced per-rotation cost but the bottleneck is redundant calls (8-32× per block from flash attention). The dequantize function is called per 4/16-element chunk, each time doing the full 128-element WHT. Need to modify the flash attention kernel to dequantize once per block. Quality: WHT+signs gives BETTER quality than dense QR on real KV tensors (cosine 0.94 vs 0.79 at 2-bit). Sub-Gaussian distribution (kurtosis 1.53) means fewer outliers hitting extreme centroids. Reviewed by Codex: WHT butterfly correct, inverse order verified, QJL correction matches reference C implementation. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…heTom#23 Root cause analysis: 8-32× redundant full-block dequantize per block from flash attention template. Four approaches documented with expected speedups and risk levels. Plan: D (reduce overhead) → A/B (eliminate redundant calls) Target: 2.4 tok/s → 20-40 tok/s Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…om#23 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…heTom#23 No-op dequant test: even returning all zeros from dequantize, turbo3 runs at 2.4 tok/s (same as with full WHT rotation). The bottleneck is NOT in the attention dequantize path. New hypothesis: the SET_ROWS (quantize) path is the bottleneck. The Metal quantize_turbo3_0 function does 3 WHT rotations per KV write, totaling ~3200 ops per block × 224 blocks per token. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CRITICAL BUG: The #include "turbo-wht.h" caused Metal JIT compilation to fail at runtime. The model silently fell back to CPU for ALL ops. ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU. After inlining the header: - MoE gen: 2.4 → 10.7 tok/s (4.5× improvement, now actually on Metal) - MoE prompt: 4.2 → 60.9 tok/s (14.5× improvement) Remaining gap vs q8_0: 85 → 10.7 tok/s (8× slower, down from 35×) This is the SAME bug we hit with turbo-matrices.h earlier. Rule: NEVER use #include in ggml-metal.metal — always inline. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m#23 Previous 2.4 tok/s was CPU fallback. Real Metal numbers: MoE: 10.7 tok/s gen (8× slower than q8_0, was thought to be 35×) Qwopus: 5.3 tok/s gen (3.3× slower than q8_0) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m#27 Full investigation log with all tests, results, and the root cause. Upstream TurboQuant activity tracked in TheTom#27. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…om#28 Key findings from Dejan.ai, unixsysdev, and mudler: 1. QJL naively added back destroys quality (cosine 0.69) 2. Pre-rotate queries eliminates rotation from dequant path 3. WHT abandoned by everyone — dense QR or no rotation preferred 4. unixsysdev gets -0.8% speed loss with fused CUDA kernel 5. We're the only Metal implementation Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…in) TheTom#23 Removing WHT rotation from dequant (quality broken, speed test only): gen: 10.7 → 49.1 tok/s (4.6× improvement, 57% of q8_0) prompt: 67.3 → 162.6 tok/s Confirms pre-rotate-queries would deliver ~49 tok/s. Remaining gap (49 vs 85) is block size + QJL overhead. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Speed ceiling confirmed: stripping rotation from dequant gives 49.1 tok/s (vs 10.7 with rotation, vs 85.5 q8_0 baseline). Implementation plan: store rotation matrix in KV cache, apply to Q in graph builder, strip from Metal dequant. 6 files to modify. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m#23 Instead of inverse-rotating every K during dequant, rotate Q once before attention. Math: <q, R^T*c[idx]> = <R*q, c[idx]>. Changes: - Store rotation matrix (R^T) in KV cache, filled after buffer clear - Apply ggml_mul_mat(R_T, q) in build_attn_mha after permute - Strip turbo_rotate_inverse from Metal dequant - Dynamic cast to access rotation from mctx Results: - MoE gen: 10.7 → 51.4 tok/s (4.8× speedup) - MoE prompt: 67.3 → 160.3 tok/s (2.4× speedup) - Now at 60% of q8_0 speed with 4.9× compression - Model produces coherent output Codex review: fixed buffer clear ordering (was zeroing rotation after init). Verified: rotation point is correct (after 4d reshape + permute, ne[0]=128). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…heTom#23 Full investigation log documenting every test, every dead end, and every breakthrough. 21× total improvement from CPU fallback to pre-rotate-queries. Key lessons: no #include in Metal, no-op testing, pre-rotate-queries, buffer clear ordering, codex+roast catch real bugs. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Validated on real Qwen3 KV tensors: cosine sim 0.9508 → 0.9831 (+3.2%) MSE-only better on 99.3% of vectors including p1 tails. 3-bit index split: lower 2 bits in qs[], upper 1 bit in signs[]. No QJL stage in quantize or dequant. Results: - MoE gen: 51.4 → 62.2 tok/s (73% of q8_0, was 60%) - MoE prompt: 160 → 200 tok/s (90% of q8_0) - Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0, was 83%) - Qwopus prompt: 67 → 83 tok/s (100% of q8_0!) Codex verified: bit packing correct, quantize/dequant consistent. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Speed ceiling without Q rotation: 61.3 tok/s (vs 62.2 with it). The 128×128 ggml_mul_mat adds <1% overhead on Metal. Remaining gap is structural (block size + dequant complexity). Final: MoE 62.2 tok/s (73%), Qwopus 15.5 tok/s (88%). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diagnostic benchmark proves the 26% gap is entirely from block size 128. q4_0 (block 32, 4-bit quantization) runs at 84.2 tok/s = identical to q8_0. Next: turbo3 with block size 32. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Changed QK_TURBO3 from 128 to 32 (storage block size). Rotation still operates on 128-element groups (QK_TURBO3_GROUP=128). SET_ROWS kernel processes 4 blocks per rotation group. Flash attention nl_k changed from 32 to 8 (matching q4_0). Block struct: 14 bytes per 32 values = 3.5 bits/val → 4.6× compression. Results: - MoE gen: 62.2 → 77.7 tok/s (91% of q8_0 at 85.5) - MoE prompt: 200 → 218.5 tok/s (98% of q8_0) - Qwopus gen: 15.5 → 17.0 tok/s (97% of q8_0 at 17.6) - Qwopus prompt: 83 → 89.5 tok/s (108% of q8_0 — FASTER) Target was 75+ tok/s. Exceeded. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed TheTom#3 (TURBO_D). TheTom#1 and TheTom#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Tom#30 Perplexity benchmarking reveals catastrophic quality failure: - f16: 6.121, q8_0: 6.111, q4_0: 6.142 - turbo3: 165.6 (27× worse) Speed benchmarks were meaningless — fast garbage. Root cause investigation needed before any quality claims. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. V cache returns rotated-space values (cosine=0.02 vs correct 0.987) 2. dynamic_cast to llama_kv_cache_context fails for MoE models (uses llama_memory_hybrid_context, not kv_cache_context) → Q rotation and V inverse rotation NEVER executed Fix: store rotation tensors in llm_graph_context, not KV cache. Or access through hybrid memory interface. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…heTom#31 Block 128: PPL=165.6 (same as block 32) Disabled Q rotation: PPL=165.6 (same) Root cause: dynamic_cast fails for MoE hybrid memory context. Q rotation and V inverse rotation never execute. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…eTom#31 TheTom#30 ROOT CAUSE: pre-rotate-queries never executed because: 1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128 2. mctx dynamic_cast failed for MoE hybrid memory FIX: put inverse WHT rotation back in dequantize_full_block. This is slower (10.7 tok/s vs 77.7) but produces CORRECT results. PERPLEXITY RESULTS: - f16: 6.121 - q8_0: 6.111 - q4_0: 6.142 - turbo3: 6.194 (+1.2% vs q8_0) ✅ The speed optimization (pre-rotate-queries) needs to be reimplemented to work with GQA head layout and hybrid memory types. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Quality confirmed: PPL 6.194 (+1.4% of q8_0) Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries) Previous speed claims (51-77 tok/s) were invalid — measured garbage output speed. Key lessons documented for future reference. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Validated: CUDA-only change (set-rows.cu), Metal path untouched. Fix is backwards compatible with QK=32. PPL confirmed identical on RTX 3090 sm_86.

1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack then memcpy'd into the static turbo_rotation array. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate the Gaussian matrix directly into turbo_rotation, eliminating both the stack allocation and the memcpy. 2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always match (both represent the rotation group size). Add static_assert to catch silent divergence between CPU reference and GPU kernels. Fixes: TheTom#29 (remaining items from PR TheTom#18 review) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sparse V: now enabled by default on all Metal (was M5+ only). Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0. Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set. Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V. 37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The HIP build was missing 9 turbo cross-type flash attention vec instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists. Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP since those instance files are already excluded from the HIP build (they exceed HIP's 65536-byte local memory limit). Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

…ache fix: add missing TurboQuant FA template instances for HIP/ROCm build

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: stack overflow in turbo_init_rotation + TURBO_D static_assert

When user requests symmetric turbo K+V on a quantized model, auto- downgrade K to q8_0 while keeping V as turbo. This prevents catastrophic PPL on outlier-sensitive models (Qwen 2.5: 4015→8.85) and actually improves quality on all models tested (Qwen 3.5: 6.31→6.24). V compression is virtually lossless across all architectures (+0.3%). K compression is model-sensitive and compounds with weight quantization error. Asymmetric q8_0-K + turbo-V is the safe default. Detection: checks tok_embd tensor type. If quantized (Q2-Q6), auto- switches. F16/F32/BF16 models keep symmetric turbo (no stacking risk). Override: TURBO_SYMMETRIC=1 forces symmetric. PPL (wikitext-2, ctx=512): Qwen 2.5 Q2_K: 4015 symmetric → 8.85 auto-asymmetric (+0.3% vs q8_0) Qwen 3.5 Q4_K_M: 6.31 symmetric → 6.24 auto-asymmetric (+1.0% vs q8_0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace shared-memory butterfly stages h=1,2,4,8,16 with __shfl_xor_sync() in turbo3, turbo2, and turbo4 SET_ROWS kernels. Pairs with distance < 32 are always in the same warp — no barrier needed. Only h=32 and h=64 (cross-warp) retain __syncthreads(). Mathematically identical: __shfl_xor_sync(mask, v, h) gives thread j the value from j^h, then (j & h) ? (other - v) : (v + other) is the exact butterfly computation. Zero PPL regression: turbo3 6.24, turbo4 6.23 (unchanged). Credit to seanrasch (perf/ftz-and-wht-shuffle branch). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Auto-asymmetric (K→q8_0) creates FA dimension mismatch on models where head_dim_k != head_dim_v (e.g. DeepSeek: K=192, V=128). The q8_0 K at non-standard D has no CUDA FA kernel, falling to slow CPU FA. Fix: when K and V head dims differ, keep symmetric turbo which pads both K and V consistently for CUDA FA. Note: turbo4 symmetric on DeepSeek still falls to CPU FA (21 t/s) because padded K D=256 != V D=128. This is a pre-existing FA D matching limitation, not a regression. turbo3 on DeepSeek works at full CUDA speed (172 t/s) because the VEC kernel handles asymmetric D natively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Port initial optimizations from Madreag/turbo3-cuda: - Remove turbo types from K_is_unquantized — Q is now q8_1 quantized (int8 packed) for turbo types, reducing Q register footprint 4× - Keep nthreads_KQ=8 for turbo (same ILP as f16) via K_is_turbo flag - Rewrite turbo3/turbo2/turbo4 vec_dot_KQ to process 4 elements per iteration with q8_1 Q (packed int32 + scale) - Replace 5 expf() with __expf() in softmax (~3.7% at long context) PPL unchanged (6.31). Speed unchanged on this model — the main speedup in Madreag's fork comes from a shared-memory LUT approach (precompute Q×centroid for all positions) which eliminates the multiply in the hot loop. That optimization needs separate porting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Precompute Q[d] × centroid[c] into shared memory LUT once per decode step, then the KQ inner loop does a single shmem read per element instead of centroid lookup + multiply. - turbo3: 127→146 t/s at 32K depth (+15%) - turbo2: 148→154 t/s at 32K depth (+4%) - Only for ncols==1 (decode path, not prefill) - turbo4 excluded: 16 centroids × D exceeds shmem budget - LUT stride = n_centroids+1 to avoid bank conflicts - 8-wide processing (2 qs bytes + 1 signs byte per iteration) - L2 prefetch hints for next K block PPL unchanged (6.31). Zero quality impact — the LUT stores the exact same Q×centroid values, just precomputed. Based on Madreag/turbo3-cuda (release/cuda-optimized branch). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Four general VEC flash attention optimizations: 1. __launch_bounds__ occupancy 1→3: allows 3 blocks per SM, better latency hiding. This is the biggest win (~16% on q8_0 baseline). 2. V_is_unquantized: remove turbo types from V unquantized path, matching the K change. Turbo V uses quantized dequant path. 3. Aggressive sparse V threshold: 5e-3 for turbo3/4 (was 1e-6), 1e-2 for turbo2. Validated zero PPL impact per Madreag. 4. L2 prefetch: add V block prefetch alongside existing K prefetch. Results (tg128 @ d32768, Qwen3.5 35B Q4_K_M, RTX 5090): | Type | Before | After | vs q8_0 | |------|--------|-------|---------| | q8_0 | 156 | 181 | baseline | | turbo2 | 148 | 185 | +2.5% faster | | turbo3 | 127 | 171 | -5.5% | | turbo4 | 101 | 117 | -35% (no LUT) | turbo2 at 7.5x compression is now FASTER than q8_0 at 32K context. PPL unchanged (6.31). Based on Madreag/turbo3-cuda optimizations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- ops.cpp: add TURBO3_0/TURBO4_0/TURBO2_0 to clamp switch to fix -Werror=switch on GCC/Clang CI - ggml-rpc.h: bump RPC_PROTO_PATCH_VERSION 1→2 and update GGML_OP_COUNT assert 96→97 (GGML_OP_TURBO_WHT added) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Precompute Q[d] × centroid[c] into shared memory once per decode step. The KQ inner loop does a single shmem read per element instead of centroid lookup + multiply. Bank-conflict-free stride (n_centroids+1). turbo3: 8-entry LUT (8 centroids × D half values) turbo2: 4-entry LUT (4 centroids × D half values) turbo4: excluded (16 centroids × D exceeds shmem budget) Combined with autoresearch optimizations: 236.6 → 272.8 t/s (+15.3%). PPL unchanged (7.06 asymmetric q8_0/turbo3). Ported from PR TheTom#36 (credit: Madreag/turbo3-cuda).

signalnine · 2026-04-04T19:55:31Z

Closing in favor of PR #53, which includes the autoresearch-discovered FA optimizations (+7%), the shmem LUT from this PR (+8%), the auto-asymmetric GQA fix, and the HIP build fix. Combined: 236.6 → 272.8 t/s (+15.3%) on turbo3 decode. The warp shuffle WHT from this PR was already merged upstream.

Move Q forward rotation from graph-level ggml_turbo_wht op into FA kernels to eliminate a separate kernel launch per layer during decode: - Vec kernel (decode): shared memory FWHT with 64-thread parallel butterfly, zero extra kernel launches, CUDA graph compatible - Prefill MMA: separate k_turbo_fwht_forward kernel with persistent cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay) - V inverse rotation remains at graph level for CUDA graph compat Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%), PPL identical to baseline (19.7152 on 10-chunk test). Also adds temporal decay test (experiment #36) and benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

First-kernel-execution bug: pass 2 was applying the inverse WHT on the FA output, but llama-graph.cpp:1894 unconditionally appends a standalone ggml_turbo_wht(cur, direction=1, ...) op immediately after ggml_flash_attn_ext whenever V is turbo-rotated. The two rotations compounded, leaving the output in rotated-squared space and producing garbage tokens. Symptom observed on Ampere (sm_86) Qwen3 8B Q4_K_M: TURBO_FLASH=1: "<think>\ighth儋urosewiseHintsanmar伎单项..." TURBO_FLASH=0: "[Start thinking] Okay, the user is asking..." Fix: pass 2 now writes the merged, softmax-normalized output to dst directly in the rotated domain. The graph op handles un-rotation, matching the contract the existing CUDA VEC path already satisfies. Verified: - Asymmetric q8_0-K / turbo3-V decode: bit-identical to VEC fallback at greedy sampling over 30 tokens ("Okay, the user is asking for the capital of France. Let me think. I know that France is a country in Europe. The capital" — matches VEC exactly). - Symmetric turbo3/turbo3 decode: coherent output, diverges from VEC at the 19th token ("I remember" vs "I know") — normal numerical drift from different K-dequant ordering. Open perf gap (separate issue, for future tuning): | context | TurboFlash tg | VEC tg | ratio | |---------|---------------|--------|-------| | 512 | 95.6 | 114.0 | 0.84x | | 2K | 100.8 | 114.6 | 0.88x | | 8K | 100.4 | 114.7 | 0.88x | | 16K | 101.2 | 113.4 | 0.89x | TurboFlash is ~12% slower than signalnine's optimized VEC kernel across all tested contexts. The Metal architecture's main advantage (replacing single-threadgroup-per-head serial-over-tokens decode) doesn't exist on CUDA because VEC is already well-parallelized post PR TheTom#36. Achieving a CUDA perf win would require larger thread blocks (4+ warps vs the current 1), fused pass 1+2 for short context, or tensor-core K·Q — none of which are line-for-line portable from the Metal reference. Metal-parity note: the Metal pass 2 kernel *does* apply an inverse WHT internally. If that's correct on Metal, then either (a) Metal's OP_TURBO_WHT dispatcher special-cases TurboFlash-sourced tensors, or (b) there's a silent double-rotation bug on Metal that nobody has noticed yet. Worth raising with TheTom. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The fused Gated Delta Net kernels crash on ROCm/RDNA4 (gfx1201) during the test decode in slot initialization. Gate enablement on CUDA-only by checking the backend registry name before resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TheTom and others added 30 commits March 26, 2026 12:15

docs: log simd_broadcast attempt — no speed improvement TheTom#23

283441c

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: log threadgroup attempt — no speed improvement, rethinking TheT…

aede1bb

…om#23 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: final investigation log — 77.7 tok/s, 91% of q8_0

c3a7afd

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: perplexity 6.194 confirmed — 1.4% of q8_0 TheTom#30

d9b9725

Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TheTom and others added 7 commits March 30, 2026 17:54

Merge pull request TheTom#32 from HyperionMS2040/fix/cuda-block-size-128

7b75078

Validated: CUDA-only change (set-rows.cu), Metal path untouched. Fix is backwards compatible with QK=32. PPL confirmed identical on RTX 3090 sm_86.

Merge pull request TheTom#35 from terrysimons/feature/turboquant-kv-c…

b4b588a

…ache fix: add missing TurboQuant FA template instances for HIP/ROCm build

Remove unused CENTROIDS_1BIT constant

013d9b2

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request TheTom#23 from seanrasch/fix/stack-overflow-v3

8ad0f00

fix: stack overflow in turbo_init_rotation + TURBO_D static_assert

github-actions Bot added Nvidia GPU ggml labels Mar 31, 2026

signalnine and others added 3 commits March 31, 2026 20:01

signalnine force-pushed the feature/turboquant-kv-cache branch from b00a40b to cfb57af Compare April 1, 2026 03:06

signalnine and others added 3 commits April 1, 2026 18:35

signalnine force-pushed the feature/turboquant-kv-cache branch from 338990c to f575431 Compare April 2, 2026 02:07

signalnine changed the title ~~perf: block-128, warp shuffle WHT, auto-asymmetric KV, mismatched-D fix~~ perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT Apr 2, 2026

signalnine marked this pull request as draft April 2, 2026 02:25

signalnine marked this pull request as ready for review April 2, 2026 03:30

TheTom force-pushed the feature/turboquant-kv-cache branch from 63b832b to e9c54d5 Compare April 3, 2026 16:14

signalnine closed this Apr 4, 2026

signalnine mentioned this pull request Apr 4, 2026

perf: turbo VEC flash attention — +9% decode on CUDA via autoresearch #53

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT#36

perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT#36
signalnine wants to merge 136 commits into
TheTom:feature/turboquant-kv-cachefrom
signalnine:feature/turboquant-kv-cache

signalnine commented Mar 31, 2026 •

edited

Loading

Uh oh!

signalnine commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

signalnine commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Performance (tg128 @ d32768, Qwen3.5 35B Q4_K_M, RTX 5090)

Changes

1. Shared-memory LUT for turbo KQ scoring (+15% turbo3, +4% turbo2)

2. General VEC FA optimizations (+16% q8_0 baseline)

3. q8_1 Q path for turbo types

4. Auto-asymmetric KV for quantized models

5. Warp shuffle WHT (SET_ROWS)

Quality

Test plan

Uh oh!

signalnine commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

signalnine commented Mar 31, 2026 •

edited

Loading