
Fix/turbo4 wht dequant #43

Merged

TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from Dubascudes:fix/turbo4-wht-dequant on Apr 2, 2026

Conversation

@Dubascudes

Summary

Fixes the C reference path for #17. The CUDA port (PR #24) added WHT-based set_rows/dequant, but the C reference quantize_row_turbo4_0_ref and dequantize_row_turbo4_0 were not updated to match, causing turbo4 to produce garbage whenever the C path is used.

turbo4 produces gibberish on all models and platforms (CUDA + CPU). turbo3 works fine.
Root cause: the C reference quantize_row_turbo4_0_ref and dequantize_row_turbo4_0 use a dense random rotation matrix (matvec(turbo_rotation, ...)), while the CUDA k_set_rows_turbo4 uses WHT. This creates a transform mismatch that corrupts all turbo4 KV cache data.
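For reference, a fast Walsh-Hadamard transform of the kind turbo_cpu_fwht implements can be sketched as follows. This is a minimal illustration of the transform the fix switches to, not the project's actual code; the function name and normalization choice are assumptions:

```c
#include <stddef.h>

// In-place fast Walsh-Hadamard transform over a length-n buffer
// (n must be a power of two). This is the unnormalized butterfly
// form; a real implementation may fold the 1/sqrt(n) scale into
// the quantizer's per-row norm instead.
static void fwht_inplace(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; j++) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Unlike a dense random rotation, this runs in O(n log n) with no stored matrix, which is why a quantize path using matvec(turbo_rotation, ...) and a kernel using the WHT produce incompatible rotated domains.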

Changes (one file, two fixes)

ggml/src/ggml-turbo-quant.c

  1. quantize_row_turbo4_0_ref: replaced matvec(turbo_rotation, normalized, rotated, d) with memcpy + turbo_cpu_fwht(rotated, d) to match the CUDA set_rows WHT path.

  2. dequantize_row_turbo4_0: replaced the centroid lookup → matvec(turbo_rotation_t, ...) → scale-by-norm sequence with centroid * norm (no inverse transform). The dequant must stay in the rotated domain because Q is WHT-rotated by the graph (GGML_OP_TURBO_WHT, direction=0), and the inverse WHT is applied to the attention output by a separate graph op (direction=1).
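The second change can be illustrated with a minimal sketch. The block layout, field names, and centroid table below are hypothetical placeholders (the real turbo4 format packs its 4-bit indices), but the shape of the fix is the same: look up, rescale, and stop — no inverse rotation:

```c
#include <stddef.h>

// Hypothetical, simplified turbo4 block: one unpacked 4-bit centroid
// index per element plus a per-row norm saved at quantize time.
typedef struct {
    float norm;
    unsigned char idx[8];
} turbo4_block_sketch;

// After the fix, dequant stays in the rotated (WHT) domain: just look
// up the centroid and rescale by the stored norm. The inverse WHT is
// applied later by a separate graph op (direction=1), so applying an
// inverse transform here would undo the rotation twice.
static void dequant_row_sketch(const turbo4_block_sketch * blk,
                               const float centroids[16],
                               float * dst, int d) {
    for (int i = 0; i < d; i++) {
        dst[i] = centroids[blk->idx[i]] * blk->norm;
    }
}
```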

Why turbo3 was unaffected

turbo3's C reference (quantize_row_turbo3_0_ref) already uses turbo_cpu_fwht. Its dequant uses the separate legacy 3-bit+QJL path. Only turbo4's C code still used the dense rotation.

Testing

  • Before fix: turbo4 produces gibberish on Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct (Q4_K_M), both CUDA and CPU
  • After fix: turbo4 produces coherent output matching turbo3 quality
  • Hardware: RTX 4070 Ti Super (sm_89), CUDA 12.6, Ubuntu Linux

Reproduction

# turbo3 works (before and after fix):
./build/bin/llama-cli --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on \
  -p "Explain vector quantization in one paragraph" -n 100

# turbo4 broken before fix, works after:
./build/bin/llama-cli --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo4 --cache-type-v turbo4 --flash-attn on \
  -p "Explain vector quantization in one paragraph" -n 100


  quantize_row_turbo4_0_ref used matvec(turbo_rotation, ...) (dense rotation)
  while CUDA k_set_rows_turbo4 uses WHT — transform mismatch causing garbage
  output on all platforms.

  dequantize_row_turbo4_0 applied inverse dense rotation, but the dequant must
  stay in the rotated domain since Q is WHT-rotated by the graph.

  Fix: replace matvec with turbo_cpu_fwht in quantize, remove inverse transform
  from dequant (return centroid * norm directly).

  Tested on RTX 4070 Ti Super with Llama 3.1 8B Instruct (Q4_K_M).
@Dubascudes Dubascudes changed the base branch from master to feature/turboquant-kv-cache April 1, 2026 18:02
@TheTom (Owner)

TheTom commented Apr 2, 2026

Tested and validated. Thank you @Dubascudes for the clean fix and thorough analysis.

Code Review

The fix is correct. The C reference quantize_row_turbo4_0_ref was using dense matrix rotation (matvec) while CUDA and Metal use WHT. The dequant was applying inverse rotation when it should stay in the rotated domain (the graph handles inverse WHT via GGML_OP_TURBO_WHT). Both changes match the existing turbo3 C reference path, the CUDA set_rows kernel, and the Metal FA dequant.

Test Matrix (M5 Max, Qwen2.5-1.5B Q8_0)

All tests pass. No regressions.

| Test | Config | Path | PP | TG | Status |
|------|--------|------|----|----|--------|
| 1 | q8_0/turbo4 | GPU, Metal FA | pp32=2349 | tg8=134 | PASS |
| 2 | q8_0/turbo4 | CPU-only, C dequant (fixed) | pp32=560 | tg8=119 | PASS |
| 3 | turbo4/turbo4 | CPU-only, C dequant (fixed) | pp32=528 | tg8=118 | PASS |
| 4 | turbo3/turbo3 | CPU-only, control (unaffected) | pp32=520 | tg8=118 | PASS |
| 5 | turbo4/turbo4 | GPU, full Metal FA regression | pp512=9945 | tg128=112 | PASS |

KL Divergence (CPU path, turbo4 with fix vs q8_0 baseline)

| Metric | Value |
|--------|-------|
| Mean KLD | 0.0141 |
| Median KLD | 0.0080 |
| Max KLD | 0.685 |
| Mean PPL ratio | 1.021 (+2.1%) |

KLD of 0.014 is consistent with normal turbo4 compression loss. Confirms the fix produces correct output distributions on the CPU path.

Concerns Investigated

The dequant change means dequantize_row_turbo4_0 now outputs values in the rotated domain (no inverse WHT). This is correct for all inference paths because the graph applies the inverse WHT separately. The round-trip unit test in test-turbo-quant.c will need updating to account for the rotated output, but this is a cosmetic test issue, not a correctness issue.
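An updated round-trip check could lean on the WHT's involution property: applying the unnormalized transform twice yields n times the identity, so the graph's direction=1 pass exactly undoes the quantize-side rotation. A self-contained sketch of that property (illustrative only, not test-turbo-quant.c):

```c
#include <math.h>
#include <stddef.h>

// Unnormalized in-place WHT (n must be a power of two).
static void wht(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1)
        for (size_t i = 0; i < n; i += h << 1)
            for (size_t j = i; j < i + h; j++) {
                const float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
}

// Round trip: rotate, (quantize/dequant would happen here, staying in
// the rotated domain), then apply the transform again and divide by n
// to recover the original vector -- mirroring the graph's inverse op.
static int roundtrip_ok(const float * src, float * tmp, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = src[i];
    wht(tmp, n);  // quantize-side rotation (direction=0)
    wht(tmp, n);  // graph-side inverse pass (direction=1)
    for (size_t i = 0; i < n; i++) {
        if (fabsf(tmp[i] / (float)n - src[i]) > 1e-5f) return 0;
    }
    return 1;
}
```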

Merging.

@TheTom TheTom merged commit 63b832b into TheTom:feature/turboquant-kv-cache Apr 2, 2026
1 check passed
TheTom added a commit that referenced this pull request Apr 2, 2026
iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
