
Fix/turbo4 wht dequant #43

Merged

TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache from Dubascudes:fix/turbo4-wht-dequant on Apr 2, 2026

Conversation

@Dubascudes

Summary

Fixes the C reference path for #17. The CUDA port (PR #24) added WHT-based set_rows/dequant, but the C reference quantize_row_turbo4_0_ref and dequantize_row_turbo4_0 were not updated to match, causing turbo4 to produce garbage whenever the C path is used.

turbo4 produces gibberish on all models and platforms (CUDA + CPU). turbo3 works fine.
Root cause: the C reference quantize_row_turbo4_0_ref and dequantize_row_turbo4_0 use a dense random rotation matrix (matvec(turbo_rotation, ...)), while the CUDA k_set_rows_turbo4 uses WHT. This creates a transform mismatch that corrupts all turbo4 KV cache data.
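For reference, a fast Walsh-Hadamard transform of the kind turbo_cpu_fwht implements can be sketched as follows. This is a minimal illustration of the transform the fix switches to, not the project's actual code; the function name and normalization choice are assumptions:

```c
#include <stddef.h>

// In-place fast Walsh-Hadamard transform over a length-n buffer
// (n must be a power of two). This is the unnormalized butterfly
// form; a real implementation may fold the 1/sqrt(n) scale into
// the quantizer's per-row norm instead.
static void fwht_inplace(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; j++) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Unlike a dense random rotation, this runs in O(n log n) with no stored matrix, which is why a quantize path using matvec(turbo_rotation, ...) and a kernel using the WHT produce incompatible rotated domains.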

Changes (one file, two fixes)

ggml/src/ggml-turbo-quant.c

  1. quantize_row_turbo4_0_ref: replaced matvec(turbo_rotation, normalized, rotated, d) with memcpy + turbo_cpu_fwht(rotated, d) to match the CUDA set_rows WHT path.

  2. dequantize_row_turbo4_0: replaced the centroid lookup → matvec(turbo_rotation_t, ...) → scale-by-norm sequence with centroid * norm (no inverse transform). The dequant must stay in the rotated domain because Q is WHT-rotated by the graph (GGML_OP_TURBO_WHT, direction=0), and the inverse WHT is applied to the attention output by a separate graph op (direction=1).
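The second change can be illustrated with a minimal sketch. The block layout, field names, and centroid table below are hypothetical placeholders (the real turbo4 format packs its 4-bit indices), but the shape of the fix is the same: look up, rescale, and stop — no inverse rotation:

```c
#include <stddef.h>

// Hypothetical, simplified turbo4 block: one unpacked 4-bit centroid
// index per element plus a per-row norm saved at quantize time.
typedef struct {
    float norm;
    unsigned char idx[8];
} turbo4_block_sketch;

// After the fix, dequant stays in the rotated (WHT) domain: just look
// up the centroid and rescale by the stored norm. The inverse WHT is
// applied later by a separate graph op (direction=1), so applying an
// inverse transform here would undo the rotation twice.
static void dequant_row_sketch(const turbo4_block_sketch * blk,
                               const float centroids[16],
                               float * dst, int d) {
    for (int i = 0; i < d; i++) {
        dst[i] = centroids[blk->idx[i]] * blk->norm;
    }
}
```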

Why turbo3 was unaffected

turbo3's C reference (quantize_row_turbo3_0_ref) already uses turbo_cpu_fwht. Its dequant uses the separate legacy 3-bit+QJL path. Only turbo4's C code still used the dense rotation.

Testing

  • Before fix: turbo4 produces gibberish on Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct (Q4_K_M), both CUDA and CPU
  • After fix: turbo4 produces coherent output matching turbo3 quality
  • Hardware: RTX 4070 Ti Super (sm_89), CUDA 12.6, Ubuntu Linux

Reproduction

# turbo3 works (before and after fix):
./build/bin/llama-cli --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on \
  -p "Explain vector quantization in one paragraph" -n 100

# turbo4 broken before fix, works after:
./build/bin/llama-cli --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --hf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo4 --cache-type-v turbo4 --flash-attn on \
  -p "Explain vector quantization in one paragraph" -n 100


  quantize_row_turbo4_0_ref used matvec(turbo_rotation, ...) (dense rotation)
  while CUDA k_set_rows_turbo4 uses WHT — transform mismatch causing garbage
  output on all platforms.

  dequantize_row_turbo4_0 applied inverse dense rotation, but the dequant must
  stay in the rotated domain since Q is WHT-rotated by the graph.

  Fix: replace matvec with turbo_cpu_fwht in quantize, remove inverse transform
  from dequant (return centroid * norm directly).

  Tested on RTX 4070 Ti Super with Llama 3.1 8B Instruct (Q4_K_M).
@Dubascudes Dubascudes changed the base branch from master to feature/turboquant-kv-cache April 1, 2026 18:02
@TheTom (Owner)

TheTom commented Apr 2, 2026

Tested and validated. Thank you @Dubascudes for the clean fix and thorough analysis.

Code Review

The fix is correct. The C reference quantize_row_turbo4_0_ref was using dense matrix rotation (matvec) while CUDA and Metal use WHT. The dequant was applying inverse rotation when it should stay in the rotated domain (the graph handles inverse WHT via GGML_OP_TURBO_WHT). Both changes match the existing turbo3 C reference path, the CUDA set_rows kernel, and the Metal FA dequant.

Test Matrix (M5 Max, Qwen2.5-1.5B Q8_0)

All tests pass. No regressions.

| Test | Config | Path | PP | TG | Status |
|------|--------|------|----|----|--------|
| 1 | q8_0/turbo4 | GPU, Metal FA | pp32=2349 | tg8=134 | PASS |
| 2 | q8_0/turbo4 | CPU-only, C dequant (fixed) | pp32=560 | tg8=119 | PASS |
| 3 | turbo4/turbo4 | CPU-only, C dequant (fixed) | pp32=528 | tg8=118 | PASS |
| 4 | turbo3/turbo3 | CPU-only, control (unaffected) | pp32=520 | tg8=118 | PASS |
| 5 | turbo4/turbo4 | GPU, full Metal FA regression | pp512=9945 | tg128=112 | PASS |

KL Divergence (CPU path, turbo4 with fix vs q8_0 baseline)

| Metric | Value |
|--------|-------|
| Mean KLD | 0.0141 |
| Median KLD | 0.0080 |
| Max KLD | 0.685 |
| Mean PPL ratio | 1.021 (+2.1%) |

KLD of 0.014 is consistent with normal turbo4 compression loss. Confirms the fix produces correct output distributions on the CPU path.

Concerns Investigated

The dequant change means dequantize_row_turbo4_0 now outputs values in the rotated domain (no inverse WHT). This is correct for all inference paths because the graph applies the inverse WHT separately. The round-trip unit test in test-turbo-quant.c will need updating to account for the rotated output, but this is a cosmetic test issue, not a correctness issue.
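An updated round-trip check could lean on the WHT's involution property: applying the unnormalized transform twice yields n times the identity, so the graph's direction=1 pass exactly undoes the quantize-side rotation. A self-contained sketch of that property (illustrative only, not test-turbo-quant.c):

```c
#include <math.h>
#include <stddef.h>

// Unnormalized in-place WHT (n must be a power of two).
static void wht(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1)
        for (size_t i = 0; i < n; i += h << 1)
            for (size_t j = i; j < i + h; j++) {
                const float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
}

// Round trip: rotate, (quantize/dequant would happen here, staying in
// the rotated domain), then apply the transform again and divide by n
// to recover the original vector -- mirroring the graph's inverse op.
static int roundtrip_ok(const float * src, float * tmp, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = src[i];
    wht(tmp, n);  // quantize-side rotation (direction=0)
    wht(tmp, n);  // graph-side inverse pass (direction=1)
    for (size_t i = 0; i < n; i++) {
        if (fabsf(tmp[i] / (float)n - src[i]) > 1e-5f) return 0;
    }
    return 1;
}
```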

Merging.

@TheTom TheTom merged commit 63b832b into TheTom:feature/turboquant-kv-cache Apr 2, 2026
1 check passed
TheTom added a commit that referenced this pull request Apr 2, 2026
iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
