Fix/turbo4 wht dequant#43
Conversation
…ation quantize_row_turbo4_0_ref used matvec(turbo_rotation, ...) (dense rotation) while CUDA k_set_rows_turbo4 uses WHT — transform mismatch causing garbage output on all platforms. dequantize_row_turbo4_0 applied inverse dense rotation, but the dequant must stay in the rotated domain since Q is WHT-rotated by the graph. Fix: replace matvec with turbo_cpu_fwht in quantize, remove inverse transform from dequant (return centroid * norm directly). Tested on RTX 4070 Ti Super with Llama 3.1 8B Instruct (Q4_K_M).
|
Tested and validated. Thank you @Dubascudes for the clean fix and thorough analysis. Code ReviewThe fix is correct. The C reference Test Matrix (M5 Max, Qwen2.5-1.5B Q8_0)All tests pass. No regressions.
KL Divergence (CPU path, turbo4 with fix vs q8_0 baseline)
KLD of 0.014 is consistent with normal turbo4 compression loss. Confirms the fix produces correct output distributions on the CPU path. Concerns InvestigatedThe dequant change means Merging. |
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Fix/turbo4 wht dequant
Summary
Fixes the C reference path for #17, the CUDA port (PR #24) added WHT-based set_rows/dequant but the C reference
quantize_row_turbo4_0_refanddequantize_row_turbo4_0were not updated to match, causing turbo4 to produce garbage when the C path is used.turbo4 produces gibberish on all models and platforms (CUDA + CPU). turbo3 works fine.
Root cause: the C reference
quantize_row_turbo4_0_refanddequantize_row_turbo4_0use a dense random rotation matrix (matvec(turbo_rotation, ...)), while the CUDAk_set_rows_turbo4uses WHT. This creates a transform mismatch that corrupts all turbo4 KV cache data.Changes (one file, two fixes)
ggml/src/ggml-turbo-quant.cquantize_row_turbo4_0_ref: replaced
matvec(turbo_rotation, normalized, rotated, d)withmemcpy + turbo_cpu_fwht(rotated, d)to match the CUDA set_rows WHT path.dequantize_row_turbo4_0: replaced
centroid lookup → matvec(turbo_rotation_t, ...) -> scale by normwithcentroid * norm(no inverse transform). The dequant must stay in the rotated domain because Q is WHT-rotated by the graph (GGML_OP_TURBO_WHT, direction=0), and the inverse WHT is applied to the attention output by a separate graph op (direction=1).Why turbo3 was unaffected
turbo3's C reference (
quantize_row_turbo3_0_ref) already usesturbo_cpu_fwht. Its dequant uses the separate legacy 3-bit+QJL path. Only turbo4's C code still used the dense rotation.Testing
Reproduction