feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv) WHT-rotated weight compression types. These achieve 27-37% model size reduction at +1.0-1.9% PPL on the Qwen/Phi families. Base types + Metal + CPU quantize/dequant from TheTom's PR TheTom#45.

CUDA additions:
- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels), sign array for the 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from mmvq/mmq (uses the cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization: pre-rotate activations via a warp-shuffle WHT (same pattern as the KV cache Q rotation) to eliminate the per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
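For readers new to the RHT types, here is a minimal sketch of the per-block inverse transform described above: 5 butterfly stages over 32 elements, a 1/sqrt(32) normalize, then undoing the random signs. The function and argument names are placeholders, not the PR's actual identifiers, and the centroid/scale lookup that precedes it is only indicated in a comment.

```cuda
// Sketch of a 32-element inverse randomized Hadamard step (assumed shape of the
// TQ4_1S dequant described above; names are placeholders, not the PR's code).
// Caller is assumed to have already mapped the packed 4-bit/3-bit indices through
// the centroid table and applied the block scale, producing v[0..31].
__device__ __forceinline__ void inverse_rht32(float * v, const int8_t * signs) {
    // 5 butterfly stages of the fast Walsh-Hadamard transform (strides 1,2,4,8,16)
    #pragma unroll
    for (int stride = 1; stride < 32; stride <<= 1) {
        for (int i = 0; i < 32; i += 2 * stride) {
            for (int j = i; j < i + stride; ++j) {
                const float a = v[j];
                const float b = v[j + stride];
                v[j]          = a + b;
                v[j + stride] = a - b;
            }
        }
    }
    // normalize (the 32-point Hadamard is self-inverse up to a factor of 32)
    // and undo the random sign flips applied at quantization time
    const float inv_norm = 0.1767766953f; // 1/sqrt(32)
    #pragma unroll
    for (int j = 0; j < 32; ++j) {
        v[j] = v[j] * inv_norm * (float) signs[j];
    }
}
```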
Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)

Speed — No Regressions

All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)

All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict

ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs.
Force-pushed 6c3e503 to cb8bddc
turbo4 SET_ROWS was using turbo3's shared template with wrong 2+1 bit packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed indices + QJL signs. PPL: 679 → 6.19. Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756 (identical to turbo3). QJL correction hurts quality in attention context. Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on the latest upstream master.

Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization.

Fix: disabled upstream rotation by default in our fork. Users can still re-enable it.

Regression test (M5 Max, rebased branch cb8bddc)

All tests pass. No regressions. Phi-4 crash resolved.
CUDA port available on our branch: signalnine/llama-cpp-turboquant

What's implemented:

Results (Qwen2.5-7B TQ4_1S, RTX 5090):

The fused kernel pre-rotates the activation vector once per mul_mat via a warp-shuffle WHT.

Happy to iterate on this if you have ideas for closing the CUDA gap further.
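For reference, the warp-shuffle WHT that the pre-rotation idea refers to can be sketched roughly as below, assuming one 32-element block per warp and one element per lane. This is only an illustrative sketch under those assumptions, not the branch's actual kernel.

```cuda
// Illustrative sketch: 32-point Walsh-Hadamard transform done in registers with
// warp shuffles, assuming one element per lane and one 32-element block per warp.
// Applying this once to the activation block lets the quantized weights be used
// directly in rotated space, instead of inverse-rotating every weight block.
__device__ __forceinline__ float wht32_warp(float x) {
    #pragma unroll
    for (int mask = 1; mask < 32; mask <<= 1) {
        const float y = __shfl_xor_sync(0xffffffff, x, mask, 32);
        // lanes whose 'mask' bit is 0 take a+b, lanes whose bit is 1 take b-a
        x = (threadIdx.x & mask) ? (y - x) : (x + y);
    }
    return x * 0.1767766953f; // 1/sqrt(32) keeps the transform orthonormal
}
```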
This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness. One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S and existing quant types run at the same speed as before the PR.
CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)

Medium priority

Skip / low value

Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):

The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.
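To make the data-reuse point concrete, here is a rough sketch (not the branch's kernel) of the shape being suggested: stage the 32-wide activation tile in shared memory once and amortize it over NR0 output rows per CTA. Weights are plain f32 here as a stand-in for the dequantized TQ blocks; NR0, the grid layout, and the atomic reduction are placeholder choices.

```cuda
// Sketch only: activation-tile reuse across NR0 rows per CTA (not the PR's kernel).
// Assumes ncols is a multiple of 32 and dst is zero-initialized before launch.
template <int NR0>
__global__ void mul_mat_rows_shared(const float * __restrict__ x,   // activations, ncols
                                    const float * __restrict__ w,   // weights, nrows x ncols row-major
                                    float       * __restrict__ dst, // outputs, nrows
                                    const int ncols, const int nrows) {
    __shared__ float x_tile[32];

    const int row0 = blockIdx.x * NR0;   // first output row owned by this CTA
    float sum[NR0] = {0.0f};

    for (int col0 = 0; col0 < ncols; col0 += 32) {
        if (threadIdx.x < 32) {
            x_tile[threadIdx.x] = x[col0 + threadIdx.x];   // one load, reused NR0 times
        }
        __syncthreads();

        for (int j = threadIdx.x; j < 32; j += blockDim.x) {
            #pragma unroll
            for (int r = 0; r < NR0; ++r) {
                if (row0 + r < nrows) {
                    // in the real kernel this would read a dequantized TQ4_1S block
                    sum[r] += x_tile[j] * w[(size_t)(row0 + r) * ncols + col0 + j];
                }
            }
        }
        __syncthreads();
    }

    #pragma unroll
    for (int r = 0; r < NR0; ++r) {
        if (row0 + r < nrows) {
            atomicAdd(&dst[row0 + r], sum[r]);   // short-cut reduction for the sketch
        }
    }
}
```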
Full Regression Test — PR #45 (cb8bddc)

Hardware

Quantize tool verification

M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0

Phi-4 14B Q8_0 (crash fix verification)

No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0

Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M

Summary

All tests pass. PR is safe for review.
Here is my llama-bench of the 27B; quantization produced the same model size as reported.

nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-q8_0.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 99 | 1 | pp512 | 986.70 ± 2.48 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 99 | 1 | tg128 | 22.18 ± 0.03 |
build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 19.13 GiB | 26.90 B | CUDA | 99 | 1 | pp512 | 947.65 ± 1.47 |
| qwen35 27B Q8_0 | 19.13 GiB | 26.90 B | CUDA | 99 | 1 | tg128 | 24.12 ± 0.04 |
build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -ctk q8_0 -ctv turbo4 -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 19.13 GiB | 26.90 B | CUDA | 99 | q8_0 | turbo4 | 1 | pp512 | 938.75 ± 0.75 |
| qwen35 27B Q8_0 | 19.13 GiB | 26.90 B | CUDA | 99 | q8_0 | turbo4 | 1 | tg128 | 23.79 ± 0.03 |
build: bc05a6803 (8793)
Hey Tom — first Gemma 4 Blackwell numbers from an RTX 5090 on this commit.

Good news: loads clean, runs stable, and all tests completed cleanly at 64k and 128k.

Throughput:

Memory / KV:

128k

So, on this setup, savings look smaller than on dense full-context models, which makes sense here since Gemma 4's hybrid sliding-window layout already limits full-context KV to 5 of 30 layers.

Happy to run 256k or perplexity next if that's useful.

gemma4_26b_5090_tqkv_bench_4k_32k_64k_2026-04-04_11-57-07.log
Finished the Gemma 4 26B Config I weight-compression run on RTX 5090 and packaged the logs/results.

Setup

Config used

Size

Bench

Config I only

Config I + safe KV

PPL

Confirmation run

So on this setup, Config I came out:

Attaching the packaged zip with summary + raw quant / bench / PPL logs.

gemma4_26b_config_i_results_2026-04-04.zip
Hi, thanks for all the tests and write-ups @TheTom!

Looking at the Qwen 3.5 9B and 4B models, which each have 32 layers, the first 3 layers are not self-attention layers but Gated Delta Nets. So you're not really preserving any self-attention layers if you skip e.g. the first 2 layers.

The self-attention layers are otherwise changed correctly.

Edit: the 27B works similarly.
Good catch @swfsql. I verified this on Qwen3.5-27B-Q8_0. The layer pattern is a 1:3 interleave — only 16 of 64 layers are self-attention with split attn_q/attn_k/attn_v tensors. With boundary=2, blk.0-1 are protected, but those are delta net layers — no split attention tensors to protect. The first real self-attention layer (blk.3) is unprotected.

I tested boundary=2 vs boundary=4 on M5 Max to measure the impact:

boundary=4 protects blk.3 (the first real self-attention layer) and gives a 0.3% PPL improvement at the cost of 500 MiB. Real but small. The delta net layers are accidentally safe.

For the default Config I recommendation, I am keeping boundary=2 since the practical difference is marginal. But this is useful context for anyone tuning Qwen 3.5 specifically — boundary=4 is a free 0.3% if you can spare the 500 MiB.
@TheTom interesting, I just started trying this and am basically limited to 4B models. I'm mentioning this because Config I currently doesn't quantize the token input/output weights, and on Qwen 3.5 9B and 4B they are quite heavy.
Blackwell workstation follow-up for PR #45

I ran a bounded local follow-up on an RTX PRO 6000 Blackwell Workstation Edition 96 GB across four surfaces:

Important caveat up front: cross-family promotion remains blocked on a direct qualitative task set. Local perplexity is used here as a reproducibility anchor within a family, not as a direct cross-family ranking metric.

Two provenance notes matter:

1. Qwen3.5-27B PR45 validation and 32K KV follow-up

Local quality anchor
| KV mode | pp32768 tok/s | tg128 tok/s | KV buffer | Working set | Read |
|---|---|---|---|---|---|
| f16 | 3535.14 | 52.46 | 2048.00 MiB | 25.503 GiB | reference |
| turbo3 | 3441.67 | 51.18 | 400.13 MiB | 23.894 GiB | best compressed-KV trade here |
| turbo4 | 3388.16 | 51.29 | 544.13 MiB | 24.034 GiB | valid, but weaker than turbo3 |
Carried lane vs Q8_0 alternatives
| Lane | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|
| Config I + turbo3 | 3555.99 | 51.01 | 23.894 GiB | carried balanced Qwen default |
| Q8_0 + turbo3 | 3497.44 | 47.38 | 26.384 GiB | worse speed, higher footprint |
| Q8_0 + turbo4 | 3448.25 | 47.25 | 26.524 GiB | worse speed, higher footprint |
2. Native vLLM 0.18 sidecar vs the carried Qwen lane
| Surface | native vLLM 0.18 (Qwen3.5-27B-FP8) | Config I + turbo3 | Delta |
|---|---|---|---|
| Long-prompt prefill | 7789.745 tok/s | 3555.99 tok/s | +119.060% for vLLM |
| Short decode | 35.64 tok/s | 51.01 tok/s | -30.131% for vLLM |
| GPU working set / load delta | 86.897 GiB | 23.894 GiB | 3.637x larger for vLLM |
| Context in this pass | 32K | 32K | same ceiling in this comparison |
Read: native vLLM 0.18 earned a real prefill-specialist role on this workstation, but not the balanced-default role because decode was slower and footprint was much larger.
3. APEX Qwen3.5-35B-A3B follow-up
First-pass variant benchmark
| Variant | File size | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|---|
| Compact | 15.811 GiB | 6578.17 | 218.24 | 16.858 GiB | low-footprint fallback |
| Balanced | 23.650 GiB | 6736.65 | 215.02 | 24.698 GiB | dominated |
| Quality | 21.306 GiB | 6919.55 | 216.53 | 22.354 GiB | carry-forward |
Quality validation
| Check | Result |
|---|---|
| Local PPL vs Compact | 15.2822 vs 15.3825 (-0.652%) |
| Best 32K KV mode | f16 |
| Why not compressed KV? | turbo3 and turbo4 both missed the local tg128 regression window |
Quality + f16 vs carried Qwen default
| Lane | Local quality note | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|---|
| Config I + turbo3 | same-family Qwen baseline | 3555.99 | 51.01 | 23.894 GiB | balanced default |
| Quality + f16 | +20.980% PPL regression vs Config I + turbo3 | 6749.54 | 208.79 | 22.354 GiB | throughput specialist |
Read: APEX Quality + f16 is operationally strong as a throughput specialist, but the local same-family quality gap is too large to justify replacing Config I + turbo3 as the balanced Qwen default.
4. Gemma 4 TurboQuant+ branch bring-up and carry-forward
Clean canary branch proof
| Item | Result |
|---|---|
| Source | TheTom/llama-cpp-turboquant |
| Required ref | feature/turboquant-kv-cache |
| Measured commit | bc05a6803e48f17e0f2c7a99fce9b50d03882de7 |
| Important build flag | -DLLAMA_OPENSSL=ON |
| Why it mattered | master rejected turbo4; non-SSL build could not use the -hf path |
| First working canary posture | Gemma 4 Q4_K_M, -ctk q8_0, -ctv turbo4, -c 32768, -ngl 99, -fa on |
| First measured load delta | 18.604 GiB |
| First proof request throughput | prompt 308.83 tok/s, completion 195.45 tok/s |
Temp=0 KV sanity sweep
| Candidate | Probes passed | Mean decode tok/s | Mean prompt tok/s | Memory delta | Outcome |
|---|---|---|---|---|---|
| q8_0/turbo3 | 2 / 3 | 181.588 | 597.918 | 18.552 GiB | rejected for visible-answer miss |
| q8_0/turbo4 | 3 / 3 | 181.859 | 546.791 | 18.604 GiB | asymmetric winner |
| turbo3/turbo3 | 3 / 3 | 178.122 | 583.902 | 18.299 GiB | symmetric fallback |
| turbo4/turbo4 | 3 / 3 | 176.657 | 600.757 | 18.405 GiB | valid, but weaker than turbo3/turbo3 |
Benchmark carry-forward
| Lane | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|
| q8_0/turbo4 | 8626.64 | 202.35 | 16.392 GiB | carry-forward |
| turbo3/turbo3 | 8558.52 | 191.80 | 16.259 GiB | fallback only |
Validation
| Check | Result |
|---|---|
| Local PPL anchor | 222.2524 on the shared repo-local corpus |
| Validated contexts | 64K, 128K, 262K |
| Highest-context GPU delta | 20.642 GiB at 262K |
| KV growth read | non-SWA KV grew 510 MiB -> 1020 MiB -> 2040 MiB; SWA KV stayed bounded at 358.59 MiB |
5. Current role split after the Qwen/APEX/Gemma synthesis
| Role | Lane | Why |
|---|---|---|
| Balanced Qwen default | Config I + turbo3 | still the only carried Qwen lane backed by same-family quality evidence |
| Qwen throughput specialist | Quality + f16 | much faster than the carried Qwen default, but too much PPL regression to become the balanced default |
| Prefill-first specialist | native vLLM 0.18 (Qwen3.5-27B-FP8) | strongest long-prompt prefill sidecar, but too large and too slow on decode to become the default |
| Preferred interactive long-context canary | Gemma 4 q8_0/turbo4 | strongest measured GGUF lane in this tested set on this workstation for speed + footprint + context, with validated 262K |
6. Gemma vs the current carried Qwen role split
| Compare | Prefill delta | Decode delta | Working-set delta | Context delta | Read |
|---|---|---|---|---|---|
| Gemma q8_0/turbo4 vs Config I + turbo3 | +142.595% | +296.687% | -7.502 GiB | 262K vs 32K | Gemma leads the measured speed/footprint/context envelope |
| Gemma q8_0/turbo4 vs Quality + f16 | +27.811% | -3.084% | -5.962 GiB | 262K vs 32K | APEX keeps a small decode edge, but Gemma is stronger overall operationally |
| Gemma q8_0/turbo4 vs native vLLM 0.18 | +10.744% | +467.761% | -70.506 GiB | 262K vs 32K | directional only; Gemma is much easier to carry locally |
7. Limitations
- These are single-workstation measurements on one exact Blackwell host.
- The runtime surfaces are not identical: TurboQuant PR45 llama.cpp, native vLLM 0.18, APEX GGUF, and the Gemma TurboQuant+ branch.
- Same-family quality evidence is stronger here than cross-family comparison evidence.
- Gemma local perplexity is treated only as a reproducibility anchor, not as a direct cross-family ranking metric.
- Cross-family promotion remains blocked on a direct qualitative task set.
8. Bottom line
- Config I + turbo3 remains the balanced Qwen default.
- Quality + f16 earned a real throughput-specialist role.
- native vLLM 0.18 earned a real prefill-specialist role.
- Gemma 4 q8_0/turbo4 is the strongest measured interactive long-context GGUF lane in this tested set on this workstation.
- I would not auto-replace the balanced Qwen default with Gemma yet, because I still do not have an apples-to-apples cross-family qualitative gate.
9. Artifact note
The raw workstation docs/receipts behind this summary are currently stored in a separate local evaluation repo rather than in this PR branch, so I am intentionally not pasting broken repo-relative links here. If useful, I can provide the exact local receipt/doc path list or extract a cleaner public artifact bundle.
@tryingET very interesting comparison with regard to vLLM. It would be interesting to see how sglang would be positioned here with its radix-tree approach.
Here's my report :)

PR #45 Benchmark Report

GPU: AMD Radeon RX 9060 XT (ROCm, gfx1200), 16GB VRAM

Model under test:

Bench settings used:

1) Q8_0 baseline
2) Config I (TQ4_1S)
3) Config I + TurboQuant KV
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments #31, #39, #42, #45, #49, #50, #51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I guess you're done here anyway, but I wanted to let you know that the compression described in Getting Started seems to either be missing something or isn't very clear — at least in my case I wasn't able to replicate the "Quick Test": the output model (Qwen 3.5 35B) always came out at nearly the same size with config_i, except for a few MBs. Or do I need to change config_i for the 35B model somewhat?

str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
thank you for the continued benchmarks! |
Good catch. Qwen3.5-35B is a hybrid model — only 16 of 64 layers have split attn_q/attn_k/attn_v tensors (blk.3, 7, 11, 15, ..., 63). The other 48 are Gated Delta Net layers with fused tensors, so the stock config_i patterns only match a small part of the model. You need to adjust the tensor patterns in the config for this architecture.
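To make "adjust the config" concrete, here is a rough sketch of what a per-tensor override file for a hybrid model like this might look like. The file format (one pattern=type rule per line), the lowercase type name tq4_1s, and the exact tensor patterns are assumptions on my part, not taken from this PR — check the getting-started doc for the real syntax before using it.

```
# Hypothetical override file for a hybrid Qwen3.5 model.
# Assumed format: one "pattern=type" rule per line; "tq4_1s" as the type name
# is also an assumption — use whatever the fork's quantize tool actually accepts.
attn_q=tq4_1s
attn_k=tq4_1s
attn_v=tq4_1s
attn_output=tq4_1s
ffn_up=tq4_1s
ffn_gate=tq4_1s
ffn_down=tq4_1s
# MoE expert tensors in llama.cpp use the *_exps names; include them explicitly
ffn_up_exps=tq4_1s
ffn_gate_exps=tq4_1s
ffn_down_exps=tq4_1s
```

Such a file would then be passed via the --allow-requantize --tensor-type-file config.txt flags mentioned in the PR summary.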
I'm working on a Vulkan port of this. PR soon.
Summary
- simd_shuffle_xor, NR0=8
- llama-quantize --allow-requantize --tensor-type-file config.txt

Tested Models
Llama Note
Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.
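As an illustration of the Hybrid recipe above, under an assumed pattern=type override-file format (type names are also assumptions — verify against the getting-started doc), the split might look like this:

```
# Hypothetical Hybrid config sketch: compressed attention, k-quant FFN.
# Format and type names are assumptions, not taken from this PR.
attn_q=tq4_1s
attn_k=tq4_1s
attn_v=tq4_1s
attn_output=tq4_1s
ffn_up=q4_k
ffn_gate=q4_k
ffn_down=q4_k
```

For a Premium-style config, the same idea applies with the FFN lines switched to q5_k/q6_k.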
What's needed before merge
Metal only
The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md
🤖 Generated with Claude Code