
feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45

Merged

TheTom merged 0 commits into feature/turboquant-kv-cache from pr/tq4-weight-compression on Apr 3, 2026

Conversation

TheTom commented Apr 2, 2026

Summary

  • TQ3_1S (3-bit, 4.0 BPW) and TQ4_1S (4-bit, 5.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids
  • V2.1 fused Metal kernel: zero threadgroup memory, cooperative SIMD rotation via simd_shuffle_xor, NR0=8
  • Post-training quantization — no retraining, calibration data, or model modification required
  • Quantize via `llama-quantize --allow-requantize --tensor-type-file config.txt`
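
For readers new to the format, here is a minimal host-side sketch of what quantizing one 32-element block involves, assuming an orthonormal 32-point WHT, a per-position sign vector, and a 16-level Lloyd-Max codebook fitted to N(0,1). The names and the RMS scale choice are illustrative assumptions, not the exact code in this PR:

```cuda
#include <cmath>
#include <cstdint>

// Orthonormal 32-point Walsh-Hadamard transform (its own inverse).
static void wht32(float x[32]) {
    for (int len = 1; len < 32; len <<= 1)
        for (int i = 0; i < 32; i += len << 1)
            for (int j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    for (int j = 0; j < 32; ++j)
        x[j] *= 0.1767766953f; // 1/sqrt(32) keeps the rotation orthonormal
}

// Illustrative block quantizer: sign-flip, rotate, pick a scale, then snap
// each rotated value to the nearest of 16 Lloyd-Max centroids for N(0,1).
static void quantize_block_tq4(const float w[32], const int8_t sign[32],
                               const float centroids[16],
                               uint8_t idx[32], float *scale_out) {
    float r[32];
    for (int j = 0; j < 32; ++j) r[j] = w[j] * (float) sign[j];
    wht32(r); // rotated weights are roughly Gaussian, matching the codebook

    float ss = 0.0f;
    for (int j = 0; j < 32; ++j) ss += r[j] * r[j];
    const float scale = std::sqrt(ss / 32.0f) + 1e-12f; // hypothetical RMS scale

    for (int j = 0; j < 32; ++j) {
        const float v = r[j] / scale;
        int best = 0;
        for (int k = 1; k < 16; ++k)
            if (std::fabs(v - centroids[k]) < std::fabs(v - centroids[best]))
                best = k;
        idx[j] = (uint8_t) best; // 4-bit index, packed two per byte on disk
    }
    *scale_out = scale;
}
```

Dequantization runs the same steps in reverse, which is what the fused kernels fold into the matmul.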

Tested Models

| Model | Config | Size Reduction | PPL Delta | Decode | NIAH |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | Config I | -27% | +1.9% | 96% | 6/6 |
| Qwen3.5-27B | Config I | -28% | +1.3% | 99% | 3/3 |
| Qwen3.5-35B MoE | Config I | -37% | +1.4% | 102% | |
| Qwen2.5-72B | Config I | -38% | +3.9% | 95% | 3/3 |
| Phi-4 14B | Config I | -36% | +1.0% | 254% | 3/3 |
| Llama 3.1 70B | Premium | -29% | +5.8% | fast | 3/3 |
| Llama 3.1 70B | Hybrid | -42% | +16% | 133% | 3/3 |

Llama Note

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.

What's needed before merge

  • CUDA port of V2.1 kernel (calling @signalnine 👀)
  • HIP/ROCm testing
  • Regression tests on existing TurboQuant KV functionality
  • Community validation on untested model families

Metal only

The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

🤖 Generated with Claude Code

signalnine added a commit to signalnine/llama-cpp-turboquant that referenced this pull request Apr 2, 2026
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv)
WHT-rotated weight compression types. These achieve 27-37% model size
reduction at +1.0-1.9% PPL on Qwen/Phi families.

Base types + Metal + CPU quantize/dequant from TheTom's PR TheTom#45.
CUDA additions:

- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels),
  sign array for 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block
  inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from
  mmvq/mmq (uses cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization:
pre-rotate activations via warp shuffle WHT (same pattern as KV cache
Q rotation) to eliminate per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
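
For anyone skimming, here is a minimal sketch of the per-block dequant structure that commit describes: centroid lookup, the 5-stage inverse WHT with normalization, then the sign un-flip. Function and parameter names are hypothetical, not the actual turbo-quant.cuh / dequantize.cuh code:

```cuda
#include <cstdint>

// 32-point WHT; orthonormal, so the same butterflies invert the rotation.
__device__ __forceinline__ void wht32_inplace(float x[32]) {
    for (int len = 1; len < 32; len <<= 1)          // 5 butterfly stages
        for (int i = 0; i < 32; i += len << 1)
            for (int j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    for (int j = 0; j < 32; ++j)
        x[j] *= 0.1767766953f;                      // normalize by 1/sqrt(32)
}

// Hypothetical per-block dequant: 4-bit indices -> centroids -> inverse RHT.
__device__ void dequant_tq4_1s_block(const uint8_t *qs,      // 16 bytes, two 4-bit indices each
                                     float scale,
                                     const float *centroids, // 16 Lloyd-Max levels for N(0,1)
                                     const int8_t *sign,     // +/-1 per position
                                     float out[32]) {
    for (int j = 0; j < 32; ++j) {
        const int idx = (qs[j >> 1] >> ((j & 1) * 4)) & 0x0F;
        out[j] = centroids[idx] * scale;            // centroid lookup + scale
    }
    wht32_inplace(out);                             // inverse WHT
    for (int j = 0; j < 32; ++j)
        out[j] *= (float) sign[j];                  // unsign
}
```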
TheTom commented Apr 2, 2026

Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)
Branch: pr/tq4-weight-compression (commit 6c3e503)

Speed — No Regressions

| Model | Hardware | Config | pp512 | tg128 |
|---|---|---|---:|---:|
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/q8_0 | 10,787 | 198 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo4 | 10,460 | 141 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo3 | 10,468 | 138 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/q8_0 | 1,052 | 33.7 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/turbo4 | 1,051 | 30.9 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/q8_0 | 408 | 17.6 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo4 | 497 | 17.1 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo3 | 487 | 17.0 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/q8_0 | 2,920 | 76.6 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/turbo4 | 2,878 | 69.6 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/q8_0 | 352 | 34.9 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo4 | 351 | 30.8 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo3 | 350 | 30.0 |
| Qwen2.5-7B Q4_K_M | M2 Pro | turbo3/turbo3 | 346 | 26.4 |

All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)

| Model | q8_0/q8_0 | q8_0/turbo4 | q8_0/turbo3 |
|---|---|---|---|
| Qwen2.5-1.5B | 10.31 | 10.45 (+1.4%) | 10.55 (+2.4%) |
| Phi-4 14B | 6.54 | 6.55 (+0.2%) | |
| Qwen3.5-27B | 6.87 | 6.89 (+0.3%) | |
| Qwen3.5-35B MoE | 6.53 | 6.56 (+0.5%) | |

All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict

ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs. The ggml-metal-ops.cpp restructuring (MUL_MAT + MUL_MAT_ID dispatch) did not break any existing functionality. Safe for review.

TheTom force-pushed the pr/tq4-weight-compression branch from 6c3e503 to cb8bddc on April 2, 2026 at 04:39
TheTom added a commit that referenced this pull request Apr 2, 2026
turbo4 SET_ROWS was using turbo3's shared template with wrong 2+1 bit
packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed
indices + QJL signs. PPL: 679 → 6.19.

Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756
(identical to turbo3). QJL correction hurts quality in attention context.
Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom commented Apr 2, 2026

Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on latest ggml-org/llama.cpp master (106 upstream commits). 3 commits on top:

  1. 187a2c1 feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
  2. 9aaa054 fix: add post-unrotate memory barrier for in-layer mixing safety
  3. cb8bddc fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization (llama : rotate activations for better quantization). This feature:

  • Crashes on Phi-4 (graph hash table overflow from extra rotation nodes)
  • Is redundant with our kernel-level WHT rotation (which is more efficient — no extra graph nodes)

Fix: disabled upstream rotation by default in our fork. Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0. Our TurboQuant KV rotation is unaffected.

Regression test (M5 Max, rebased branch cb8bddc)

| Test | What | Result | Status |
|---|---|---|---|
| Config I quantize + speed | Qwen 1.5B | 202 tg/s | Matches expected |
| Config I PPL | Qwen 1.5B | 10.77 (8ch) | Within noise |
| Turbo4 KV speed | Qwen 1.5B | 148 tg/s | Matches expected |
| Turbo4 KV PPL | Qwen 1.5B | 10.74 (8ch) | Within noise |
| Phi-4 + turbo4 | | 31.5 tg/s, no crash | ✅ Fixed |
| Large model turbo4 | Qwen 27B | 17.4 tg/s | Matches expected |

All tests pass. No regressions. Phi-4 crash resolved.

@signalnine

CUDA port available on our branch: signalnine/llama-cpp-turboquant feature/tq4-weight-cuda

What's implemented:

  • CUDA dequant for TQ4_1S/TQ3_1S (convert path + cuBLAS MUL_MAT)
  • Fused mul_mat_vec kernel with pre-rotated activations (warp shuffle WHT)
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types

Results (Qwen2.5-7B TQ4_1S, RTX 5090):

| Metric | cuBLAS path | Fused kernel |
|---|---|---|
| Decode tg128 | 20 t/s | 69 t/s |
| vs q8_0 (177 t/s) | 11% | 39% |
| PPL | 8.82 | 8.82 |

The fused kernel pre-rotates the activation vector once per mul_mat via __shfl_xor_sync (5 butterfly stages, 32-element blocks), then the mmvq kernel just does centroid[idx] × scale with no per-block WHT. 13 kernel variants tested — the gap to q4_0 (275 t/s) is from dp4a integer intrinsics that TQ4_1S can't use (centroid lookup requires float). The gap to your Metal (85-99% of q8_0) is from Apple Silicon's cooperative SIMD efficiency.
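
For reference, the cooperative rotation described above comes down to something like the following sketch, with the lane math and names assumed rather than taken from the actual kernel:

```cuda
// One warp rotates a 32-element block in-register: lane j holds element j.
// Each butterfly stage exchanges with the lane whose id differs in one bit.
__device__ __forceinline__ float warp_wht32(float v) {
    const int lane = threadIdx.x & 31;
    #pragma unroll
    for (int m = 1; m < 32; m <<= 1) {               // 5 butterfly stages
        const float partner = __shfl_xor_sync(0xffffffffu, v, m);
        v = (lane & m) ? (partner - v) : (v + partner);
    }
    return v * 0.1767766953f;                        // 1/sqrt(32) normalization
}
```

Once the activation is rotated a single time up front, the per-block hot loop is just centroid[idx] * scale against the rotated activation, with no butterflies left in it.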

Happy to iterate on this if you have ideas for closing the CUDA gap further.

TheTom commented Apr 2, 2026

This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness.

One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S and existing quant types run at the same speed as before the PR.

TheTom commented Apr 2, 2026

CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)

  1. NR0 multi-row CTA with shared activation reuse — Currently each warp handles one row independently, re-reading vy_rot from global memory. On Metal, NR0=8 (8 rows sharing one activation tile) was the single biggest optimization (decode went from 70% to 99% of baseline). On CUDA the benefit is shared reuse of vy_rot across rows per CTA. Realistic target: 69 → 110-140 t/s.

  2. Hot loop load dedup — d0/d1 are loaded per-lane but have only 2 unique values per block. Broadcast via __shfl_sync once per half-warp. Similarly, qs[lane/2] is loaded by both even and odd lanes — load once, broadcast to partner (see the sketch after this list).

  3. __restrict__ qualifiers + vectorized loads — The kernel pointers lack __restrict__ (compiler can't prove no aliasing). Adding it enables better instruction scheduling. Also consider float4 loads for the pre-rotated activation to hit 128-bit memory transactions.
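
A minimal sketch of the load dedup from item 2, under the assumption that the two block scales are stored as consecutive half values; the names are hypothetical:

```cuda
#include <cuda_fp16.h>

// Each block carries two half-precision scales (d0, d1); 16 lanes share each.
// Instead of 32 redundant global loads, the half-warp leader loads and broadcasts.
__device__ __forceinline__ float load_scale_dedup(const half *d, int lane) {
    float v = 0.0f;
    if ((lane & 15) == 0)                 // lanes 0 and 16 are the leaders
        v = __half2float(d[lane >> 4]);   // leader 0 reads d0, leader 16 reads d1
    return __shfl_sync(0xffffffffu, v, lane & ~15); // broadcast within the half-warp
}
```

The same pattern covers qs[lane/2]: let the even lane load the byte and hand it to its odd partner with one __shfl_sync.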

Medium priority

  1. Small-batch kernels (ne[1]=2..8) — Real serving hits batch > 1 frequently. A tuned small-batch path before falling to cuBLAS would help.

  2. CTA shape sweep per architecture — NWARPS=8 was tested on 5090. Ampere/Ada may prefer different occupancy points. Worth parameterizing.

  3. Fused prefill path (MMQ-style) — Currently falls to dequant→cuBLAS for ne[1] > 1. A tiled matmul with shared-memory activation rotation would be faster for medium batch.

Skip / low value

  • dp4a / fp16 LUT — The pipeline is float activation × float centroid × float scale. Forcing int8 paths is awkward and unlikely to win vs a well-tuned float path.
  • Scratch buffer cleanup — Lifecycle issue (static cudaMalloc leak), not a throughput lever. Fix eventually but not urgent.

Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):

  • Ampere: ~55-70% of q8_0
  • Ada: ~60-75%
  • Blackwell: ~70-85%

The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.

TheTom commented Apr 2, 2026

Full Regression Test — PR #45 (cb8bddc)

Hardware

  • M5 Max 128GB (Apple Silicon)
  • Mac Mini M2 Pro 32GB (Apple Silicon)

Quantize tool verification

| Format | Quantize result | Loads / Generates |
|---|---|---|
| TQ4_1S Config I | 1312 MiB (6.20 BPW) | ✅ 198 tg |
| TQ3_1S attn-only | 1730 MiB (8.17 BPW) | |
| Q4_K_M (standard) | 1060 MiB (5.00 BPW) | |

M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 10,742 | 209 | 10.31 |
| q8_0/turbo4 | 10,604 | 148 | 10.45 |
| q8_0/turbo3 | 10,537 | 141 | 10.55 |

Phi-4 14B Q8_0 (crash fix verification)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 1,083 | 34.0 | 6.54 |
| q8_0/turbo4 | 1,088 | 31.1 | 6.55 |

No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 554 | 17.5 | 6.87 |
| q8_0/turbo4 | 498 | 16.9 | 6.89 |
| q8_0/turbo3 | 501 | 16.0 | |

Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 2,837 | 77.0 | 6.53 |
| q8_0/turbo4 | 2,826 | 77.8 | 6.56 |

M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| Config I | 7,610 | 198 | 10.53 |
| Config I + turbo4 KV | 7,394 | 142 | |

Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M

| Config | pp512 | tg128 |
|---|---:|---:|
| q8_0/q8_0 | 316 ± 4.6 | 33.6 ± 0.2 |
| q8_0/turbo4 | 348 ± 0.5 | 30.3 ± 0.3 |
| q8_0/turbo3 | 346 ± 0.3 | 28.7 ± 0.4 |
| turbo3/turbo3 | 338 ± 1.2 | 25.7 ± 0.3 |

Summary

  • 5 models tested across 2 hardware platforms
  • All PPL values match known-good (within measurement noise)
  • All speed values normal — no regressions
  • Phi-4 crash fixed (upstream attn_rot disabled)
  • MUL_MAT_ID (MoE) verified
  • Weight compression quantize + inference verified (TQ4_1S, TQ3_1S, Q4_K_M)
  • TurboQuant KV configs verified (q8_0/turbo4, q8_0/turbo3, turbo3/turbo3)

All tests pass. PR is safe for review.

nenkoru commented Apr 4, 2026

Here is my llama-bench of the 27B; quantization produced the same model size as reported. Tested on two Nvidia V100 (SM70) vGPU profiles.

nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-q8_0.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |        986.70 ± 2.48 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         22.18 ± 0.03 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |        947.65 ± 1.47 |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         24.12 ± 0.04 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -ctk q8_0 -ctv turbo4 -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |   q8_0 | turbo4 |  1 |           pp512 |        938.75 ± 0.75 |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |   q8_0 | turbo4 |  1 |           tg128 |         23.79 ± 0.03 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$

@theaaronhughes

Hey Tom — first Gemma 4 Blackwell numbers from an RTX 5090 on commit bc05a6803, using Gemma-4-26B-A4B-it-UD-Q4_K_XL.

Good news: loads clean, runs stable, and all tests completed cleanly at 64k and 128k.

Throughput:
Baseline q8_0/q8_0 stayed fastest, but turbo3 and turbo4 were only a few percent behind across 4k, 32k, and 64k.

Memory / KV:
64k

  • baseline: 18,584 MiB GPU / KV 839.38 MiB
  • turbo4: 18,372 MiB GPU / KV 629.53 MiB
  • turbo3: 18,318 MiB GPU / KV 573.98 MiB

128k

  • baseline: 19,264 MiB GPU / KV 1519.38 MiB
  • turbo4: 18,882 MiB GPU / KV 1139.53 MiB
  • turbo3: 18,782 MiB GPU / KV 1038.98 MiB

So on this setup, turbo3 saves ~266 MiB at 64k and ~482 MiB at 128k vs baseline. turbo4 saves ~212 MiB at 64k and ~382 MiB at 128k.

Savings look smaller than on dense full-context models, which makes sense here since Gemma 4’s hybrid sliding-window layout already limits full-context KV to 5 of 30 layers.

Happy to run 256k or perplexity next if that’s useful.

gemma4_26b_5090_tqkv_bench_4k_32k_64k_2026-04-04_11-57-07.log
gemma4_26b_5090_tqkv_bench_4k_32k_64k_2026-04-04_11-57-07.md
gemma4_26b_5090_tqkv_memory_64k_2026-04-04_12-27-33.log
gemma4_26b_5090_tqkv_memory_64k_2026-04-04_12-27-33.md
gemma4_26b_5090_tqkv_memory_128k_2026-04-04_12-31-59.log
gemma4_26b_5090_tqkv_memory_128k_2026-04-04_12-31-59.md

theaaronhughes commented Apr 4, 2026

Finished the Gemma 4 26B Config I weight-compression run on RTX 5090 and packaged the logs/results.

Setup

  • GPU: RTX 5090 32GB
  • Build: bc05a6803
  • Source: gemma-4-26B-A4B-it-Q8_0.gguf
  • Output: gemma-4-26B-A4B-it-Config-I.gguf

Config used

  • n_layers = 30
  • boundary = 2
  • attn_q, attn_k, attn_v, attn_output, ffn_gate, ffn_up => tq4_1s
  • ffn_down requested q4_k, but incompatible tensors fell back to q5_0 during quantization

Size

  • source: ~25.00 GiB
  • Config I: ~24.43 GiB
  • reduction: ~2.28%

Bench
Baseline Q8_0

  • pp512: 9349.90 ± 151.69
  • tg128: 172.53 ± 0.89

Config I only

  • pp512: 9371.40 ± 71.86
  • tg128: 173.18 ± 3.10

Config I + safe KV (-ctk q8_0 -ctv turbo4 -fa on)

  • pp512: 9004.91 ± 148.02
  • tg128: 159.17 ± 1.03

PPL
First run (--chunks 20)

  • baseline Q8_0: 303.9771 ± 25.21109
  • Config I: 242.7204 ± 19.86373

Confirmation run (--chunks 100)

  • baseline Q8_0: 194.8263 ± 6.85970
  • Config I: 190.3490 ± 6.72620

So on this setup, Config I came out:

  • smaller than the source model
  • neutral / slightly better in the quick bench when run alone
  • slower when combined with q8_0/turbo4 KV
  • modestly better on the 100-chunk PPL confirmation pass (~2.30% lower PPL)

Attaching the packaged zip with summary + raw quant / bench / PPL logs.

gemma4_26b_config_i_results_2026-04-04.zip
gemma4_26b_config_i_summary_2026-04-04.md

swfsql commented Apr 4, 2026

Hi, thanks for all the tests and writings @TheTom!
I'm new to this, but I think I noticed something.

Looking at the Qwen 3.5 9B and 4B models, which each have 32 layers, the first 3 layers are not self-attention layers but Gated Delta Nets. So you're not really preserving any self-attention layers if you skip e.g. the first 2 layers (blk.0 and blk.1) - you're only protecting the feed-forward part from quantization. Preserving the first two {q,k,v}'s would require preserving blk.3 and blk.7.

The self-attention layers are otherwise correctly changed with blk.{i}.attn_{q,k,v}. The other layers (blk.{i}.attn_qkv) are currently not being quantized, but maybe they could be, since they roughly correspond to self-attention QKV; asymmetric quants would be awkward, though, since they aren't split into three parts like {q,k,v} but are a single tensor. You would need three separate fields to start, and perhaps you could just concat them at the right place and leave everything else as is.

Edit: 27B works similarly, where blk.3 is the first self-attention layer with split {q,k,v}.

TheTom commented Apr 4, 2026

Good catch @swfsql. I verified this on Qwen3.5-27B-Q8_0.

The layer pattern is a 1:3 interleave — only 16 of 64 layers are self-attention with split attn_q/attn_k/attn_v (blk.3, 7, 11, 15, ..., 63). The other 48 are Gated Delta Net layers with fused attn_qkv + SSM tensors.

With boundary=2, blk.0-1 are protected but those are delta net layers — no split attention tensors to protect. The first real self-attention layer (blk.3) is unprotected. The fused attn_qkv tensors are not in our config pattern, so they stay at q8_0 by default regardless.

I tested boundary=2 vs boundary=4 on M5 Max to measure the impact:

| Config | PPL (8 chunks) | Delta vs Q8_0 | Size |
|---|---|---|---|
| Q8_0 baseline | 6.8879 | | 27.3G |
| Config I boundary=2 | 7.0564 | +2.44% | 19.1G |
| Config I boundary=4 | 7.0349 | +2.13% | 19.6G |

boundary=4 protects blk.3 (first real self-attention) and gives a 0.3% PPL improvement at the cost of 500 MiB. Real but small.

The delta net layers are accidentally safe — their fused attn_qkv tensors do not match the attn_q/attn_k/attn_v patterns in the config file, so they remain at q8_0 regardless of boundary setting.

For the default Config I recommendation, I am keeping boundary=2 since the practical difference is marginal. But this is useful context for anyone tuning Qwen 3.5 specifically — boundary=4 is a free 0.3% if you can spare the 500 MiB.

Re: compressing the fused attn_qkv tensors in delta net layers — that would require adding attn_qkv as a target in the config file. Worth exploring but would need separate testing since the delta net attention mechanism may have different sensitivity characteristics than standard self-attention.

swfsql commented Apr 4, 2026

@TheTom interesting, I just started trying and am basically limited to 4B models.

On Qwen 3.5 9B I just noted that token_embd.weight and output.weight are quite heavy (~20% of the entire original model, and ~35% of a quantized model). Maybe those could also be compressed. I don't know how the theory works, but if the rotation is un-applied after the logit softmax, could output.weight also be compressed with TQ?

For the 4B model, token_embd.weight appears to be tied (and weighs ~15% of the original and ~25% after quant), but maybe this would be harder to quantize. Or perhaps they could just be quantized to Q4_K.

I mention this because config_i currently doesn't quantize those token input/output weights, and they are quite heavy.

tryingET commented Apr 5, 2026

Blackwell workstation follow-up for PR #45

I ran a bounded local follow-up on an RTX PRO 6000 Blackwell Workstation Edition 96 GB across four surfaces:

  1. TurboQuant PR45 llama.cpp lanes for Qwen3.5-27B
  2. native vLLM 0.18 sidecar for Qwen3.5-27B-FP8
  3. APEX Qwen3.5-35B-A3B GGUF follow-up
  4. Gemma 4 TurboQuant+ follow-up on TheTom’s feature/turboquant-kv-cache branch

Important caveat up front: cross-family promotion remains blocked on a direct qualitative task set. Local perplexity is used here as a reproducibility anchor within a family, not as a direct cross-family ranking metric.

Two provenance notes matter:

| Scope | Provenance |
|---|---|
| Qwen/APEX | PR45 canary build commit 3b8a01a92, build number 8771 |
| Gemma TurboQuant+ follow-up | TheTom/llama-cpp-turboquant, ref feature/turboquant-kv-cache, commit bc05a6803e48f17e0f2c7a99fce9b50d03882de7 |

1. Qwen3.5-27B PR45 validation and 32K KV follow-up

Local quality anchor

| Lane | Local PPL | Prompt eval tok/s | Approx CUDA working set | Read |
|---|---|---|---|---|
| Q8_0 | 12.6259 | 7584.82 | 26.566 GiB | same-family quality reference |
| Config I (tq4_1s) | 12.6320 | 7932.73 | 24.076 GiB | carried candidate |
| Delta (Config I - Q8_0) | +0.0061 / +0.048% | +4.59% | -2.49 GiB | near-flat quality, lower footprint |

32K KV check on Config I

| KV mode | pp32768 tok/s | tg128 tok/s | KV buffer | Working set | Read |
|---|---|---|---|---|---|
| f16 | 3535.14 | 52.46 | 2048.00 MiB | 25.503 GiB | reference |
| turbo3 | 3441.67 | 51.18 | 400.13 MiB | 23.894 GiB | best compressed-KV trade here |
| turbo4 | 3388.16 | 51.29 | 544.13 MiB | 24.034 GiB | valid, but weaker than turbo3 |

Carried lane vs Q8_0 alternatives

| Lane | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|
| Config I + turbo3 | 3555.99 | 51.01 | 23.894 GiB | carried balanced Qwen default |
| Q8_0 + turbo3 | 3497.44 | 47.38 | 26.384 GiB | worse speed, higher footprint |
| Q8_0 + turbo4 | 3448.25 | 47.25 | 26.524 GiB | worse speed, higher footprint |

2. Native vLLM 0.18 sidecar vs the carried Qwen lane

| Surface | native vLLM 0.18 (Qwen3.5-27B-FP8) | Config I + turbo3 | Delta |
|---|---|---|---|
| Long-prompt prefill | 7789.745 tok/s | 3555.99 tok/s | +119.060% for vLLM |
| Short decode | 35.64 tok/s | 51.01 tok/s | -30.131% for vLLM |
| GPU working set / load delta | 86.897 GiB | 23.894 GiB | 3.637x larger for vLLM |
| Context in this pass | 32K | 32K | same ceiling in this comparison |

Read: native vLLM 0.18 earned a real prefill-specialist role on this workstation, but not the balanced-default role because decode was slower and footprint was much larger.

3. APEX Qwen3.5-35B-A3B follow-up

First-pass variant benchmark

| Variant | File size | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|---|
| Compact | 15.811 GiB | 6578.17 | 218.24 | 16.858 GiB | low-footprint fallback |
| Balanced | 23.650 GiB | 6736.65 | 215.02 | 24.698 GiB | dominated |
| Quality | 21.306 GiB | 6919.55 | 216.53 | 22.354 GiB | carry-forward |

Quality validation

| Check | Result |
|---|---|
| Local PPL vs Compact | 15.2822 vs 15.3825 (-0.652%) |
| Best 32K KV mode | f16 |
| Why not compressed KV? | turbo3 and turbo4 both missed the local tg128 regression window |

Quality + f16 vs carried Qwen default

| Lane | Local quality note | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|---|
| Config I + turbo3 | same-family Qwen baseline | 3555.99 | 51.01 | 23.894 GiB | balanced default |
| Quality + f16 | +20.980% PPL regression vs Config I + turbo3 | 6749.54 | 208.79 | 22.354 GiB | throughput specialist |

Read: APEX Quality + f16 is operationally strong as a throughput specialist, but the local same-family quality gap is too large to justify replacing Config I + turbo3 as the balanced Qwen default.

4. Gemma 4 TurboQuant+ branch bring-up and carry-forward

Clean canary branch proof

| Item | Result |
|---|---|
| Source | TheTom/llama-cpp-turboquant |
| Required ref | feature/turboquant-kv-cache |
| Measured commit | bc05a6803e48f17e0f2c7a99fce9b50d03882de7 |
| Important build flag | -DLLAMA_OPENSSL=ON |
| Why it mattered | master rejected turbo4; non-SSL build could not use the -hf path |
| First working canary posture | Gemma 4 Q4_K_M, -ctk q8_0, -ctv turbo4, -c 32768, -ngl 99, -fa on |
| First measured load delta | 18.604 GiB |
| First proof request throughput | prompt 308.83 tok/s, completion 195.45 tok/s |

Temp=0 KV sanity sweep

| Candidate | Probes passed | Mean decode tok/s | Mean prompt tok/s | Memory delta | Outcome |
|---|---|---|---|---|---|
| q8_0/turbo3 | 2 / 3 | 181.588 | 597.918 | 18.552 GiB | rejected for visible-answer miss |
| q8_0/turbo4 | 3 / 3 | 181.859 | 546.791 | 18.604 GiB | asymmetric winner |
| turbo3/turbo3 | 3 / 3 | 178.122 | 583.902 | 18.299 GiB | symmetric fallback |
| turbo4/turbo4 | 3 / 3 | 176.657 | 600.757 | 18.405 GiB | valid, but weaker than turbo3/turbo3 |

Benchmark carry-forward

| Lane | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|
| q8_0/turbo4 | 8626.64 | 202.35 | 16.392 GiB | carry-forward |
| turbo3/turbo3 | 8558.52 | 191.80 | 16.259 GiB | fallback only |

Validation

| Check | Result |
|---|---|
| Local PPL anchor | 222.2524 on the shared repo-local corpus |
| Validated contexts | 64K, 128K, 262K |
| Highest-context GPU delta | 20.642 GiB at 262K |
| KV growth read | non-SWA KV grew 510 MiB -> 1020 MiB -> 2040 MiB; SWA KV stayed bounded at 358.59 MiB |

5. Current role split after the Qwen/APEX/Gemma synthesis

| Role | Lane | Why |
|---|---|---|
| Balanced Qwen default | Config I + turbo3 | still the only carried Qwen lane backed by same-family quality evidence |
| Qwen throughput specialist | Quality + f16 | much faster than the carried Qwen default, but too much PPL regression to become the balanced default |
| Prefill-first specialist | native vLLM 0.18 (Qwen3.5-27B-FP8) | strongest long-prompt prefill sidecar, but too large and too slow on decode to become the default |
| Preferred interactive long-context canary | Gemma 4 q8_0/turbo4 | strongest measured GGUF lane in this tested set on this workstation for speed + footprint + context, with validated 262K |

6. Gemma vs the current carried Qwen role split

| Compare | Prefill delta | Decode delta | Working-set delta | Context delta | Read |
|---|---|---|---|---|---|
| Gemma q8_0/turbo4 vs Config I + turbo3 | +142.595% | +296.687% | -7.502 GiB | 262K vs 32K | Gemma leads the measured speed/footprint/context envelope |
| Gemma q8_0/turbo4 vs Quality + f16 | +27.811% | -3.084% | -5.962 GiB | 262K vs 32K | APEX keeps a small decode edge, but Gemma is stronger overall operationally |
| Gemma q8_0/turbo4 vs native vLLM 0.18 | +10.744% | +467.761% | -70.506 GiB | 262K vs 32K | directional only; Gemma is much easier to carry locally |

7. Limitations

  1. These are single-workstation measurements on one exact Blackwell host.
  2. The runtime surfaces are not identical: TurboQuant PR45 llama.cpp, native vLLM 0.18, APEX GGUF, and Gemma TurboQuant+ branch.
  3. Same-family quality evidence is stronger here than cross-family comparison evidence.
  4. Gemma local perplexity is treated only as a reproducibility anchor, not as a direct cross-family ranking metric.
  5. Cross-family promotion remains blocked on a direct qualitative task set.

8. Bottom line

  1. Config I + turbo3 remains the balanced Qwen default.
  2. Quality + f16 earned a real throughput-specialist role.
  3. native vLLM 0.18 earned a real prefill-specialist role.
  4. Gemma 4 q8_0/turbo4 is the strongest measured interactive long-context GGUF lane in this tested set on this workstation.
  5. I would not auto-replace the balanced Qwen default with Gemma yet, because I still do not have an apples-to-apples cross-family qualitative gate.

9. Artifact note

The raw workstation docs/receipts behind this summary are currently stored in a separate local evaluation repo rather than in this PR branch, so I am intentionally not pasting broken repo-relative links here. If useful, I can provide the exact local receipt/doc path list or extract a cleaner public artifact bundle.

volschin commented Apr 5, 2026

@tryingET very interesting comparison with regard to vLLM. It would be interesting to see how sglang, with its radix-tree approach, would be positioned here.

@Michael-Z-Freeman

Here's my report :)

PR #45 Benchmark Report

GPU: AMD Radeon RX 9060 XT (ROCm, gfx1200), 16GB VRAM
Backend: HIP/ROCm
Build: e9c54d5 (8772)

Model under test:

  • Source: qwen2.5-7b-instruct-q8_0.gguf
  • Compressed (Config I): qwen2.5-7b-instruct-config-i.gguf
  • Config I quantization result: 7.54 GiB -> 5.16 GiB

Bench settings used:

  • -p 512 -n 128 -fa 1 -ngl 99 -b 256 -ub 128 -r 1

1) Q8_0 baseline

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   7.54 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           pp512 |       1163.73 ± 0.00 |
| qwen2 7B Q8_0                  |   7.54 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           tg128 |         38.15 ± 0.00 |

build: e9c54d557 (8772)

2) Config I (TQ4_1S)

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           pp512 |       1330.85 ± 0.00 |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           tg128 |         42.33 ± 0.00 |

build: e9c54d557 (8772)

3) Config I + TurboQuant KV (-ctk q8_0 -ctv turbo4)

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |   q8_0 | turbo4 |  1 |           pp512 |       1303.25 ± 0.00 |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |   q8_0 | turbo4 |  1 |           tg128 |         41.57 ± 0.00 |

build: e9c54d557 (8772)


spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments #31, #39, #42, #45, #49, #50, #51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bobylein commented Apr 7, 2026

I guess you're done here anyway, but I wanted to let you know that the compression as described in Getting Started either misses something or isn't very clear: in my case I wasn't able to replicate the "Quick Test", as the output model (Qwen 3.5 35B) always turned out nearly the same size with config_i, except for a few MBs.

Or do I need to change config_i for the 35B model somewhat?

str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[...]
llama_tensor_get_type: blk.15.attn_k.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_output.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_q.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_v.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[...]
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[ 1/ 733] output.weight - [ 2048, 248320, 1, 1], type = q8_0, size = 515.312 MiB
[ 2/ 733] output_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 3/ 733] token_embd.weight - [ 2048, 248320, 1, 1], type = q8_0, size = 515.312 MiB
[ 4/ 733] blk.0.attn_gate.weight - [ 2048, 4096, 1, 1], type = q8_0, size = 8.500 MiB
[ 5/ 733] blk.0.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 6/ 733] blk.0.attn_qkv.weight - [ 2048, 8192, 1, 1], type = q8_0, size = 17.000 MiB
[...]
[ 419/ 733] blk.22.ssm_a - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 420/ 733] blk.22.ssm_alpha.weight - [ 2048, 32, 1, 1], type = q8_0, size = 0.066 MiB
[ 421/ 733] blk.22.ssm_beta.weight - [ 2048, 32, 1, 1], type = q8_0, size = 0.066 MiB
[ 422/ 733] blk.22.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MiB
[ 423/ 733] blk.22.ssm_dt.bias - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 424/ 733] blk.22.ssm_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MiB
[ 425/ 733] blk.22.ssm_out.weight - [ 4096, 2048, 1, 1], type = q8_0, size = 8.500 MiB
[ 426/ 733] blk.23.attn_k.weight - [ 2048, 512, 1, 1], type = q8_0, converting to tq4_1s .. size = 1.06 MiB -> 0.62 MiB
[ 427/ 733] blk.23.attn_k_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 428/ 733] blk.23.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 429/ 733] blk.23.attn_output.weight - [ 4096, 2048, 1, 1], type = q8_0, converting to tq4_1s .. size = 8.50 MiB -> 5.00 MiB
[ 430/ 733] blk.23.attn_q.weight - [ 2048, 8192, 1, 1], type = q8_0, converting to tq4_1s .. size = 17.00 MiB -> 10.00 MiB
[ 431/ 733] blk.23.attn_q_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 432/ 733] blk.23.attn_v.weight - [ 2048, 512, 1, 1], type = q8_0, converting to tq4_1s .. size = 1.06 MiB -> 0.62 MiB
[...]
[ 729/ 733] blk.39.ffn_gate_inp_shexp.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 730/ 733] blk.39.ffn_gate_shexp.weight - [ 2048, 512, 1, 1], type = q8_0, size = 1.062 MiB
[ 731/ 733] blk.39.ffn_up_exps.weight - [ 2048, 512, 256, 1], type = q8_0, size = 272.000 MiB
[ 732/ 733] blk.39.ffn_up_shexp.weight - [ 2048, 512, 1, 1], type = q8_0, size = 1.062 MiB
[ 733/ 733] blk.39.post_attention_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
llama_model_quantize_impl: model size = 35183.10 MiB (8.52 BPW)
llama_model_quantize_impl: quant size = 35069.35 MiB (8.49 BPW)

TheTom commented Apr 7, 2026

Thank you for the continued benchmarks!

TheTom commented Apr 7, 2026

> I guess you're done here anyway, but I wanted to let you know that the compression as described in Getting Started either misses something or isn't very clear: in my case I wasn't able to replicate the "Quick Test", as the output model (Qwen 3.5 35B) always turned out nearly the same size with config_i, except for a few MBs.
>
> Or do I need to change config_i for the 35B model somewhat?


Good catch. Qwen3.5-35B is a hybrid model — only 16 of 64 layers have split attn_q/attn_k/attn_v tensors (blk.3, 7, 11, 15, ..., 63). The other 48 are Gated Delta Net layers with fused attn_qkv tensors that don't match the Config I pattern, so they stay at q8_0.

You need to adjust n_layers to 64 for this model, but the real issue is that the config pattern only hits the self-attention layers. The fused attn_qkv and SSM tensors in delta net layers aren't targeted, so compression on this model will be smaller than expected (~30% vs ~37% on pure attention models like Qwen2.5-27B).

Compressing fused attn_qkv would need adding it as a target in the config file. Haven't validated that yet — delta net attention may have different sensitivity. I'll check on this more tomorrow.

iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
@Titaniumtown

I'm working on a Vulkan port of this. PR soon.

KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
