
feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45

Merged

TheTom merged 0 commits into feature/turboquant-kv-cache from pr/tq4-weight-compression on Apr 3, 2026

Conversation

TheTom commented Apr 2, 2026

Summary

  • TQ3_1S (3-bit, 4.0 BPW) and TQ4_1S (4-bit, 5.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids
  • V2.1 fused Metal kernel: zero threadgroup memory, cooperative SIMD rotation via simd_shuffle_xor, NR0=8
  • Post-training quantization — no retraining, calibration data, or model modification required
  • Quantize via `llama-quantize --allow-requantize --tensor-type-file config.txt`
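
For readers new to the format, here is a minimal host-side sketch of what quantizing one 32-element block involves, assuming an orthonormal 32-point WHT, a per-position sign vector, and a 16-level Lloyd-Max codebook fitted to N(0,1). The names and the RMS scale choice are illustrative assumptions, not the exact code in this PR:

```cuda
#include <cmath>
#include <cstdint>

// Orthonormal 32-point Walsh-Hadamard transform (its own inverse).
static void wht32(float x[32]) {
    for (int len = 1; len < 32; len <<= 1)
        for (int i = 0; i < 32; i += len << 1)
            for (int j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    for (int j = 0; j < 32; ++j)
        x[j] *= 0.1767766953f; // 1/sqrt(32) keeps the rotation orthonormal
}

// Illustrative block quantizer: sign-flip, rotate, pick a scale, then snap
// each rotated value to the nearest of 16 Lloyd-Max centroids for N(0,1).
static void quantize_block_tq4(const float w[32], const int8_t sign[32],
                               const float centroids[16],
                               uint8_t idx[32], float *scale_out) {
    float r[32];
    for (int j = 0; j < 32; ++j) r[j] = w[j] * (float) sign[j];
    wht32(r); // rotated weights are roughly Gaussian, matching the codebook

    float ss = 0.0f;
    for (int j = 0; j < 32; ++j) ss += r[j] * r[j];
    const float scale = std::sqrt(ss / 32.0f) + 1e-12f; // hypothetical RMS scale

    for (int j = 0; j < 32; ++j) {
        const float v = r[j] / scale;
        int best = 0;
        for (int k = 1; k < 16; ++k)
            if (std::fabs(v - centroids[k]) < std::fabs(v - centroids[best]))
                best = k;
        idx[j] = (uint8_t) best; // 4-bit index, packed two per byte on disk
    }
    *scale_out = scale;
}
```

Dequantization runs the same steps in reverse, which is what the fused kernels fold into the matmul.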

Tested Models

| Model | Config | Size Reduction | PPL Delta | Decode | NIAH |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | Config I | -27% | +1.9% | 96% | 6/6 |
| Qwen3.5-27B | Config I | -28% | +1.3% | 99% | 3/3 |
| Qwen3.5-35B MoE | Config I | -37% | +1.4% | 102% | |
| Qwen2.5-72B | Config I | -38% | +3.9% | 95% | 3/3 |
| Phi-4 14B | Config I | -36% | +1.0% | 254% | 3/3 |
| Llama 3.1 70B | Premium | -29% | +5.8% | fast | 3/3 |
| Llama 3.1 70B | Hybrid | -42% | +16% | 133% | 3/3 |

Llama Note

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.

What's needed before merge

  • CUDA port of V2.1 kernel (calling @signalnine 👀)
  • HIP/ROCm testing
  • Regression tests on existing TurboQuant KV functionality
  • Community validation on untested model families

Metal only

The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

🤖 Generated with Claude Code

signalnine added a commit to signalnine/llama-cpp-turboquant that referenced this pull request Apr 2, 2026
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv)
WHT-rotated weight compression types. These achieve 27-37% model size
reduction at +1.0-1.9% PPL on Qwen/Phi families.

Base types + Metal + CPU quantize/dequant from TheTom's PR TheTom#45.
CUDA additions:

- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels),
  sign array for 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block
  inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from
  mmvq/mmq (uses cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization:
pre-rotate activations via warp shuffle WHT (same pattern as KV cache
Q rotation) to eliminate per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
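
For anyone skimming, here is a minimal sketch of the per-block dequant structure that commit describes: centroid lookup, the 5-stage inverse WHT with normalization, then the sign un-flip. Function and parameter names are hypothetical, not the actual turbo-quant.cuh / dequantize.cuh code:

```cuda
#include <cstdint>

// 32-point WHT; orthonormal, so the same butterflies invert the rotation.
__device__ __forceinline__ void wht32_inplace(float x[32]) {
    for (int len = 1; len < 32; len <<= 1)          // 5 butterfly stages
        for (int i = 0; i < 32; i += len << 1)
            for (int j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    for (int j = 0; j < 32; ++j)
        x[j] *= 0.1767766953f;                      // normalize by 1/sqrt(32)
}

// Hypothetical per-block dequant: 4-bit indices -> centroids -> inverse RHT.
__device__ void dequant_tq4_1s_block(const uint8_t *qs,      // 16 bytes, two 4-bit indices each
                                     float scale,
                                     const float *centroids, // 16 Lloyd-Max levels for N(0,1)
                                     const int8_t *sign,     // +/-1 per position
                                     float out[32]) {
    for (int j = 0; j < 32; ++j) {
        const int idx = (qs[j >> 1] >> ((j & 1) * 4)) & 0x0F;
        out[j] = centroids[idx] * scale;            // centroid lookup + scale
    }
    wht32_inplace(out);                             // inverse WHT
    for (int j = 0; j < 32; ++j)
        out[j] *= (float) sign[j];                  // unsign
}
```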
TheTom commented Apr 2, 2026

Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)
Branch: pr/tq4-weight-compression (commit 6c3e503)

Speed — No Regressions

| Model | Hardware | Config | pp512 | tg128 |
|---|---|---|---:|---:|
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/q8_0 | 10,787 | 198 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo4 | 10,460 | 141 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo3 | 10,468 | 138 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/q8_0 | 1,052 | 33.7 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/turbo4 | 1,051 | 30.9 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/q8_0 | 408 | 17.6 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo4 | 497 | 17.1 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo3 | 487 | 17.0 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/q8_0 | 2,920 | 76.6 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/turbo4 | 2,878 | 69.6 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/q8_0 | 352 | 34.9 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo4 | 351 | 30.8 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo3 | 350 | 30.0 |
| Qwen2.5-7B Q4_K_M | M2 Pro | turbo3/turbo3 | 346 | 26.4 |

All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)

| Model | q8_0/q8_0 | q8_0/turbo4 | q8_0/turbo3 |
|---|---|---|---|
| Qwen2.5-1.5B | 10.31 | 10.45 (+1.4%) | 10.55 (+2.4%) |
| Phi-4 14B | 6.54 | 6.55 (+0.2%) | |
| Qwen3.5-27B | 6.87 | 6.89 (+0.3%) | |
| Qwen3.5-35B MoE | 6.53 | 6.56 (+0.5%) | |

All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict

ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs. The ggml-metal-ops.cpp restructuring (MUL_MAT + MUL_MAT_ID dispatch) did not break any existing functionality. Safe for review.

TheTom force-pushed the pr/tq4-weight-compression branch from 6c3e503 to cb8bddc on April 2, 2026 at 04:39
TheTom added a commit that referenced this pull request Apr 2, 2026
turbo4 SET_ROWS was using turbo3's shared template with wrong 2+1 bit
packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed
indices + QJL signs. PPL: 679 → 6.19.

Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756
(identical to turbo3). QJL correction hurts quality in attention context.
Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom commented Apr 2, 2026

Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on latest ggml-org/llama.cpp master (106 upstream commits). 3 commits on top:

  1. 187a2c1 feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
  2. 9aaa054 fix: add post-unrotate memory barrier for in-layer mixing safety
  3. cb8bddc fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization (llama : rotate activations for better quantization). This feature:

  • Crashes on Phi-4 (graph hash table overflow from extra rotation nodes)
  • Is redundant with our kernel-level WHT rotation (which is more efficient — no extra graph nodes)

Fix: disabled upstream rotation by default in our fork. Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0. Our TurboQuant KV rotation is unaffected.

Regression test (M5 Max, rebased branch cb8bddc)

| Test | What | Result | Status |
|---|---|---|---|
| Config I quantize + speed | Qwen 1.5B | 202 tg/s | Matches expected |
| Config I PPL | Qwen 1.5B | 10.77 (8ch) | Within noise |
| Turbo4 KV speed | Qwen 1.5B | 148 tg/s | Matches expected |
| Turbo4 KV PPL | Qwen 1.5B | 10.74 (8ch) | Within noise |
| Phi-4 + turbo4 | | 31.5 tg/s, no crash | ✅ Fixed |
| Large model turbo4 | Qwen 27B | 17.4 tg/s | Matches expected |

All tests pass. No regressions. Phi-4 crash resolved.

@signalnine

CUDA port available on our branch: signalnine/llama-cpp-turboquant feature/tq4-weight-cuda

What's implemented:

  • CUDA dequant for TQ4_1S/TQ3_1S (convert path + cuBLAS MUL_MAT)
  • Fused mul_mat_vec kernel with pre-rotated activations (warp shuffle WHT)
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types

Results (Qwen2.5-7B TQ4_1S, RTX 5090):

| Metric | cuBLAS path | Fused kernel |
|---|---|---|
| Decode tg128 | 20 t/s | 69 t/s |
| vs q8_0 (177 t/s) | 11% | 39% |
| PPL | 8.82 | 8.82 |

The fused kernel pre-rotates the activation vector once per mul_mat via __shfl_xor_sync (5 butterfly stages, 32-element blocks), then the mmvq kernel just does centroid[idx] × scale with no per-block WHT. 13 kernel variants tested — the gap to q4_0 (275 t/s) is from dp4a integer intrinsics that TQ4_1S can't use (centroid lookup requires float). The gap to your Metal (85-99% of q8_0) is from Apple Silicon's cooperative SIMD efficiency.
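
For reference, the cooperative rotation described above comes down to something like the following sketch, with the lane math and names assumed rather than taken from the actual kernel:

```cuda
// One warp rotates a 32-element block in-register: lane j holds element j.
// Each butterfly stage exchanges with the lane whose id differs in one bit.
__device__ __forceinline__ float warp_wht32(float v) {
    const int lane = threadIdx.x & 31;
    #pragma unroll
    for (int m = 1; m < 32; m <<= 1) {               // 5 butterfly stages
        const float partner = __shfl_xor_sync(0xffffffffu, v, m);
        v = (lane & m) ? (partner - v) : (v + partner);
    }
    return v * 0.1767766953f;                        // 1/sqrt(32) normalization
}
```

Once the activation is rotated a single time up front, the per-block hot loop is just centroid[idx] * scale against the rotated activation, with no butterflies left in it.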

Happy to iterate on this if you have ideas for closing the CUDA gap further.

TheTom commented Apr 2, 2026

This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness.

One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S and existing quant types run at the same speed as before the PR.

TheTom commented Apr 2, 2026

CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)

  1. NR0 multi-row CTA with shared activation reuse — Currently each warp handles one row independently, re-reading vy_rot from global memory. On Metal, NR0=8 (8 rows sharing one activation tile) was the single biggest optimization (decode went from 70% to 99% of baseline). On CUDA the benefit is shared reuse of vy_rot across rows per CTA. Realistic target: 69 → 110-140 t/s.

  2. Hot loop load dedup — d0/d1 are loaded per-lane but have only 2 unique values per block. Broadcast via __shfl_sync once per half-warp. Similarly, qs[lane/2] is loaded by both even and odd lanes — load once, broadcast to partner (see the sketch after this list).

  3. __restrict__ qualifiers + vectorized loads — The kernel pointers lack __restrict__ (compiler can't prove no aliasing). Adding it enables better instruction scheduling. Also consider float4 loads for the pre-rotated activation to hit 128-bit memory transactions.
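
A minimal sketch of the load dedup from item 2, under the assumption that the two block scales are stored as consecutive half values; the names are hypothetical:

```cuda
#include <cuda_fp16.h>

// Each block carries two half-precision scales (d0, d1); 16 lanes share each.
// Instead of 32 redundant global loads, the half-warp leader loads and broadcasts.
__device__ __forceinline__ float load_scale_dedup(const half *d, int lane) {
    float v = 0.0f;
    if ((lane & 15) == 0)                 // lanes 0 and 16 are the leaders
        v = __half2float(d[lane >> 4]);   // leader 0 reads d0, leader 16 reads d1
    return __shfl_sync(0xffffffffu, v, lane & ~15); // broadcast within the half-warp
}
```

The same pattern covers qs[lane/2]: let the even lane load the byte and hand it to its odd partner with one __shfl_sync.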

Medium priority

  1. Small-batch kernels (ne[1]=2..8) — Real serving hits batch > 1 frequently. A tuned small-batch path before falling to cuBLAS would help.

  2. CTA shape sweep per architecture — NWARPS=8 was tested on 5090. Ampere/Ada may prefer different occupancy points. Worth parameterizing.

  3. Fused prefill path (MMQ-style) — Currently falls to dequant→cuBLAS for ne[1] > 1. A tiled matmul with shared-memory activation rotation would be faster for medium batch.

Skip / low value

  • dp4a / fp16 LUT — The pipeline is float activation × float centroid × float scale. Forcing int8 paths is awkward and unlikely to win vs a well-tuned float path.
  • Scratch buffer cleanup — Lifecycle issue (static cudaMalloc leak), not a throughput lever. Fix eventually but not urgent.

Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):

  • Ampere: ~55-70% of q8_0
  • Ada: ~60-75%
  • Blackwell: ~70-85%

The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.

TheTom commented Apr 2, 2026

Full Regression Test — PR #45 (cb8bddc)

Hardware

  • M5 Max 128GB (Apple Silicon)
  • Mac Mini M2 Pro 32GB (Apple Silicon)

Quantize tool verification

| Format | Quantize result | Loads / Generates |
|---|---|---|
| TQ4_1S Config I | 1312 MiB (6.20 BPW) | ✅ 198 tg |
| TQ3_1S attn-only | 1730 MiB (8.17 BPW) | |
| Q4_K_M (standard) | 1060 MiB (5.00 BPW) | |

M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 10,742 | 209 | 10.31 |
| q8_0/turbo4 | 10,604 | 148 | 10.45 |
| q8_0/turbo3 | 10,537 | 141 | 10.55 |

Phi-4 14B Q8_0 (crash fix verification)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 1,083 | 34.0 | 6.54 |
| q8_0/turbo4 | 1,088 | 31.1 | 6.55 |

No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 554 | 17.5 | 6.87 |
| q8_0/turbo4 | 498 | 16.9 | 6.89 |
| q8_0/turbo3 | 501 | 16.0 | |

Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| q8_0/q8_0 | 2,837 | 77.0 | 6.53 |
| q8_0/turbo4 | 2,826 | 77.8 | 6.56 |

M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

| Config | pp512 | tg128 | PPL (full) |
|---|---:|---:|---:|
| Config I | 7,610 | 198 | 10.53 |
| Config I + turbo4 KV | 7,394 | 142 | |

Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M

| Config | pp512 | tg128 |
|---|---:|---:|
| q8_0/q8_0 | 316 ± 4.6 | 33.6 ± 0.2 |
| q8_0/turbo4 | 348 ± 0.5 | 30.3 ± 0.3 |
| q8_0/turbo3 | 346 ± 0.3 | 28.7 ± 0.4 |
| turbo3/turbo3 | 338 ± 1.2 | 25.7 ± 0.3 |

Summary

  • 5 models tested across 2 hardware platforms
  • All PPL values match known-good (within measurement noise)
  • All speed values normal — no regressions
  • Phi-4 crash fixed (upstream attn_rot disabled)
  • MUL_MAT_ID (MoE) verified
  • Weight compression quantize + inference verified (TQ4_1S, TQ3_1S, Q4_K_M)
  • TurboQuant KV configs verified (q8_0/turbo4, q8_0/turbo3, turbo3/turbo3)

All tests pass. PR is safe for review.

nenkoru commented Apr 4, 2026

Here is my llama-bench of the 27B; quantization produced the same model size as reported. Tested on two Nvidia V100 (SM70) vGPU profiles.

nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-q8_0.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |        986.70 ± 2.48 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         22.18 ± 0.03 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |        947.65 ± 1.47 |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         24.12 ± 0.04 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$ ./build/bin/llama-bench -m models/qwen3.5-27b-config-i.gguf -ctk q8_0 -ctv turbo4 -fa 1 -ngl 99 -p 512 -n 128
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 40960 MiB):
  Device 0: GRID V100DX-32Q, compute capability 7.0, VMM: no, VRAM: 32768 MiB
  Device 1: GRID V100DX-8Q, compute capability 7.0, VMM: no, VRAM: 8192 MiB
| model                          |       size |     params | backend    | ngl | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |   q8_0 | turbo4 |  1 |           pp512 |        938.75 ± 0.75 |
| qwen35 27B Q8_0                |  19.13 GiB |    26.90 B | CUDA       |  99 |   q8_0 | turbo4 |  1 |           tg128 |         23.79 ± 0.03 |

build: bc05a6803 (8793)
nenkoru@bayfut-ubuntu-v100-vgpu:~/llama-cpp-turboquant$

@theaaronhughes

Hey Tom — first Gemma 4 Blackwell numbers from an RTX 5090 on commit bc05a6803, using Gemma-4-26B-A4B-it-UD-Q4_K_XL.

Good news: loads clean, runs stable, and all tests completed cleanly at 64k and 128k.

Throughput:
Baseline q8_0/q8_0 stayed fastest, but turbo3 and turbo4 were only a few percent behind across 4k, 32k, and 64k.

Memory / KV:
64k

  • baseline: 18,584 MiB GPU / KV 839.38 MiB
  • turbo4: 18,372 MiB GPU / KV 629.53 MiB
  • turbo3: 18,318 MiB GPU / KV 573.98 MiB

128k

  • baseline: 19,264 MiB GPU / KV 1519.38 MiB
  • turbo4: 18,882 MiB GPU / KV 1139.53 MiB
  • turbo3: 18,782 MiB GPU / KV 1038.98 MiB

So on this setup, turbo3 saves ~266 MiB at 64k and ~482 MiB at 128k vs baseline. turbo4 saves ~212 MiB at 64k and ~382 MiB at 128k.

Savings look smaller than on dense full-context models, which makes sense here since Gemma 4’s hybrid sliding-window layout already limits full-context KV to 5 of 30 layers.

Happy to run 256k or perplexity next if that’s useful.

gemma4_26b_5090_tqkv_bench_4k_32k_64k_2026-04-04_11-57-07.log
gemma4_26b_5090_tqkv_bench_4k_32k_64k_2026-04-04_11-57-07.md
gemma4_26b_5090_tqkv_memory_64k_2026-04-04_12-27-33.log
gemma4_26b_5090_tqkv_memory_64k_2026-04-04_12-27-33.md
gemma4_26b_5090_tqkv_memory_128k_2026-04-04_12-31-59.log
gemma4_26b_5090_tqkv_memory_128k_2026-04-04_12-31-59.md

theaaronhughes commented Apr 4, 2026

Finished the Gemma 4 26B Config I weight-compression run on RTX 5090 and packaged the logs/results.

Setup

  • GPU: RTX 5090 32GB
  • Build: bc05a6803
  • Source: gemma-4-26B-A4B-it-Q8_0.gguf
  • Output: gemma-4-26B-A4B-it-Config-I.gguf

Config used

  • n_layers = 30
  • boundary = 2
  • attn_q, attn_k, attn_v, attn_output, ffn_gate, ffn_up => tq4_1s
  • ffn_down requested q4_k, but incompatible tensors fell back to q5_0 during quantization

Size

  • source: ~25.00 GiB
  • Config I: ~24.43 GiB
  • reduction: ~2.28%

Bench
Baseline Q8_0

  • pp512: 9349.90 ± 151.69
  • tg128: 172.53 ± 0.89

Config I only

  • pp512: 9371.40 ± 71.86
  • tg128: 173.18 ± 3.10

Config I + safe KV (-ctk q8_0 -ctv turbo4 -fa on)

  • pp512: 9004.91 ± 148.02
  • tg128: 159.17 ± 1.03

PPL
First run (--chunks 20)

  • baseline Q8_0: 303.9771 ± 25.21109
  • Config I: 242.7204 ± 19.86373

Confirmation run (--chunks 100)

  • baseline Q8_0: 194.8263 ± 6.85970
  • Config I: 190.3490 ± 6.72620

So on this setup, Config I came out:

  • smaller than the source model
  • neutral / slightly better in the quick bench when run alone
  • slower when combined with q8_0/turbo4 KV
  • modestly better on the 100-chunk PPL confirmation pass (~2.30% lower PPL)

Attaching the packaged zip with summary + raw quant / bench / PPL logs.

gemma4_26b_config_i_results_2026-04-04.zip
gemma4_26b_config_i_summary_2026-04-04.md

swfsql commented Apr 4, 2026

Hi, thanks for all the tests and writings @TheTom!
I'm new to this, but I think I noticed something.

Looking at the Qwen 3.5 9B and 4B models, which each have 32 layers, the first 3 layers are not self-attention layers but Gated Delta Nets. So you're not really preserving any self-attention layers if you skip e.g. the first 2 layers (blk.0 and blk.1) - you're only protecting the feed-forward part from quantization. Preserving the first two {q,k,v}'s would require preserving blk.3 and blk.7.

The self-attention layers are otherwise correctly changed with blk.{i}.attn_{q,k,v}. The other layers (blk.{i}.attn_qkv) are currently not being quantized, but maybe they could be, since they roughly correspond to self-attention QKV; asymmetric quants would be awkward, though, since they aren't split into three parts like {q,k,v} but are a single tensor. You would need three separate fields to start, and perhaps you could just concat them at the right place and leave everything else as is.

Edit: 27B works similarly, where blk.3 is the first self-attention layer with split {q,k,v}.

TheTom commented Apr 4, 2026

Good catch @swfsql. I verified this on Qwen3.5-27B-Q8_0.

The layer pattern is a 1:3 interleave — only 16 of 64 layers are self-attention with split attn_q/attn_k/attn_v (blk.3, 7, 11, 15, ..., 63). The other 48 are Gated Delta Net layers with fused attn_qkv + SSM tensors.

With boundary=2, blk.0-1 are protected but those are delta net layers — no split attention tensors to protect. The first real self-attention layer (blk.3) is unprotected. The fused attn_qkv tensors are not in our config pattern, so they stay at q8_0 by default regardless.

I tested boundary=2 vs boundary=4 on M5 Max to measure the impact:

| Config | PPL (8 chunks) | Delta vs Q8_0 | Size |
|---|---|---|---|
| Q8_0 baseline | 6.8879 | | 27.3G |
| Config I boundary=2 | 7.0564 | +2.44% | 19.1G |
| Config I boundary=4 | 7.0349 | +2.13% | 19.6G |

boundary=4 protects blk.3 (first real self-attention) and gives a 0.3% PPL improvement at the cost of 500 MiB. Real but small.

The delta net layers are accidentally safe — their fused attn_qkv tensors do not match the attn_q/attn_k/attn_v patterns in the config file, so they remain at q8_0 regardless of boundary setting.

For the default Config I recommendation, I am keeping boundary=2 since the practical difference is marginal. But this is useful context for anyone tuning Qwen 3.5 specifically — boundary=4 is a free 0.3% if you can spare the 500 MiB.

Re: compressing the fused attn_qkv tensors in delta net layers — that would require adding attn_qkv as a target in the config file. Worth exploring but would need separate testing since the delta net attention mechanism may have different sensitivity characteristics than standard self-attention.

swfsql commented Apr 4, 2026

@TheTom interesting, I just started trying and am basically limited to 4B models.

On Qwen 3.5 9B I just noted that token_embd.weight and output.weight are quite heavy (~20% of the entire original model, and ~35% of a quantized model). Maybe those could also be compressed. I don't know how the theory works, but if the rotation is un-applied after the logit softmax, could output.weight also be compressed with TQ?

For the 4B model, token_embd.weight appears to be tied (and weighs ~15% of the original and ~25% after quant), but maybe this would be harder to quantize. Or perhaps they could just be quantized to Q4_K.

I mention this because config_i currently doesn't quantize those token input/output weights, and they are quite heavy.

tryingET commented Apr 5, 2026

Blackwell workstation follow-up for PR #45

I ran a bounded local follow-up on an RTX PRO 6000 Blackwell Workstation Edition 96 GB across four surfaces:

  1. TurboQuant PR45 llama.cpp lanes for Qwen3.5-27B
  2. native vLLM 0.18 sidecar for Qwen3.5-27B-FP8
  3. APEX Qwen3.5-35B-A3B GGUF follow-up
  4. Gemma 4 TurboQuant+ follow-up on TheTom’s feature/turboquant-kv-cache branch

Important caveat up front: cross-family promotion remains blocked on a direct qualitative task set. Local perplexity is used here as a reproducibility anchor within a family, not as a direct cross-family ranking metric.

Two provenance notes matter:

| Scope | Provenance |
|---|---|
| Qwen/APEX | PR45 canary build commit 3b8a01a92, build number 8771 |
| Gemma TurboQuant+ follow-up | TheTom/llama-cpp-turboquant, ref feature/turboquant-kv-cache, commit bc05a6803e48f17e0f2c7a99fce9b50d03882de7 |

1. Qwen3.5-27B PR45 validation and 32K KV follow-up

Local quality anchor

| Lane | Local PPL | Prompt eval tok/s | Approx CUDA working set | Read |
|---|---|---|---|---|
| Q8_0 | 12.6259 | 7584.82 | 26.566 GiB | same-family quality reference |
| Config I (tq4_1s) | 12.6320 | 7932.73 | 24.076 GiB | carried candidate |
| Delta (Config I - Q8_0) | +0.0061 / +0.048% | +4.59% | -2.49 GiB | near-flat quality, lower footprint |

32K KV check on Config I

| KV mode | pp32768 tok/s | tg128 tok/s | KV buffer | Working set | Read |
|---|---|---|---|---|---|
| f16 | 3535.14 | 52.46 | 2048.00 MiB | 25.503 GiB | reference |
| turbo3 | 3441.67 | 51.18 | 400.13 MiB | 23.894 GiB | best compressed-KV trade here |
| turbo4 | 3388.16 | 51.29 | 544.13 MiB | 24.034 GiB | valid, but weaker than turbo3 |

Carried lane vs Q8_0 alternatives

| Lane | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|
| Config I + turbo3 | 3555.99 | 51.01 | 23.894 GiB | carried balanced Qwen default |
| Q8_0 + turbo3 | 3497.44 | 47.38 | 26.384 GiB | worse speed, higher footprint |
| Q8_0 + turbo4 | 3448.25 | 47.25 | 26.524 GiB | worse speed, higher footprint |

2. Native vLLM 0.18 sidecar vs the carried Qwen lane

| Surface | native vLLM 0.18 (Qwen3.5-27B-FP8) | Config I + turbo3 | Delta |
|---|---|---|---|
| Long-prompt prefill | 7789.745 tok/s | 3555.99 tok/s | +119.060% for vLLM |
| Short decode | 35.64 tok/s | 51.01 tok/s | -30.131% for vLLM |
| GPU working set / load delta | 86.897 GiB | 23.894 GiB | 3.637x larger for vLLM |
| Context in this pass | 32K | 32K | same ceiling in this comparison |

Read: native vLLM 0.18 earned a real prefill-specialist role on this workstation, but not the balanced-default role because decode was slower and footprint was much larger.

3. APEX Qwen3.5-35B-A3B follow-up

First-pass variant benchmark

| Variant | File size | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|---|
| Compact | 15.811 GiB | 6578.17 | 218.24 | 16.858 GiB | low-footprint fallback |
| Balanced | 23.650 GiB | 6736.65 | 215.02 | 24.698 GiB | dominated |
| Quality | 21.306 GiB | 6919.55 | 216.53 | 22.354 GiB | carry-forward |

Quality validation

| Check | Result |
|---|---|
| Local PPL vs Compact | 15.2822 vs 15.3825 (-0.652%) |
| Best 32K KV mode | f16 |
| Why not compressed KV? | turbo3 and turbo4 both missed the local tg128 regression window |

Quality + f16 vs carried Qwen default

| Lane | Local quality note | pp32768 tok/s | tg128 tok/s | Working set | Read |
|---|---|---|---|---|---|
| Config I + turbo3 | same-family Qwen baseline | 3555.99 | 51.01 | 23.894 GiB | balanced default |
| Quality + f16 | +20.980% PPL regression vs Config I + turbo3 | 6749.54 | 208.79 | 22.354 GiB | throughput specialist |

Read: APEX Quality + f16 is operationally strong as a throughput specialist, but the local same-family quality gap is too large to justify replacing Config I + turbo3 as the balanced Qwen default.

4. Gemma 4 TurboQuant+ branch bring-up and carry-forward

Clean canary branch proof

| Item | Result |
|---|---|
| Source | TheTom/llama-cpp-turboquant |
| Required ref | feature/turboquant-kv-cache |
| Measured commit | bc05a6803e48f17e0f2c7a99fce9b50d03882de7 |
| Important build flag | -DLLAMA_OPENSSL=ON |
| Why it mattered | master rejected turbo4; non-SSL build could not use the -hf path |
| First working canary posture | Gemma 4 Q4_K_M, -ctk q8_0, -ctv turbo4, -c 32768, -ngl 99, -fa on |
| First measured load delta | 18.604 GiB |
| First proof request throughput | prompt 308.83 tok/s, completion 195.45 tok/s |

Temp=0 KV sanity sweep

| Candidate | Probes passed | Mean decode tok/s | Mean prompt tok/s | Memory delta | Outcome |
|---|---|---|---|---|---|
| q8_0/turbo3 | 2 / 3 | 181.588 | 597.918 | 18.552 GiB | rejected for visible-answer miss |
| q8_0/turbo4 | 3 / 3 | 181.859 | 546.791 | 18.604 GiB | asymmetric winner |
| turbo3/turbo3 | 3 / 3 | 178.122 | 583.902 | 18.299 GiB | symmetric fallback |
| turbo4/turbo4 | 3 / 3 | 176.657 | 600.757 | 18.405 GiB | valid, but weaker than turbo3/turbo3 |

Benchmark carry-forward

| Lane | pp32768 tok/s | tg128 tok/s | Working set | Outcome |
|---|---|---|---|---|
| q8_0/turbo4 | 8626.64 | 202.35 | 16.392 GiB | carry-forward |
| turbo3/turbo3 | 8558.52 | 191.80 | 16.259 GiB | fallback only |

Validation

| Check | Result |
|---|---|
| Local PPL anchor | 222.2524 on the shared repo-local corpus |
| Validated contexts | 64K, 128K, 262K |
| Highest-context GPU delta | 20.642 GiB at 262K |
| KV growth read | non-SWA KV grew 510 MiB -> 1020 MiB -> 2040 MiB; SWA KV stayed bounded at 358.59 MiB |

5. Current role split after the Qwen/APEX/Gemma synthesis

| Role | Lane | Why |
|---|---|---|
| Balanced Qwen default | Config I + turbo3 | still the only carried Qwen lane backed by same-family quality evidence |
| Qwen throughput specialist | Quality + f16 | much faster than the carried Qwen default, but too much PPL regression to become the balanced default |
| Prefill-first specialist | native vLLM 0.18 (Qwen3.5-27B-FP8) | strongest long-prompt prefill sidecar, but too large and too slow on decode to become the default |
| Preferred interactive long-context canary | Gemma 4 q8_0/turbo4 | strongest measured GGUF lane in this tested set on this workstation for speed + footprint + context, with validated 262K |

6. Gemma vs the current carried Qwen role split

| Compare | Prefill delta | Decode delta | Working-set delta | Context delta | Read |
|---|---|---|---|---|---|
| Gemma q8_0/turbo4 vs Config I + turbo3 | +142.595% | +296.687% | -7.502 GiB | 262K vs 32K | Gemma leads the measured speed/footprint/context envelope |
| Gemma q8_0/turbo4 vs Quality + f16 | +27.811% | -3.084% | -5.962 GiB | 262K vs 32K | APEX keeps a small decode edge, but Gemma is stronger overall operationally |
| Gemma q8_0/turbo4 vs native vLLM 0.18 | +10.744% | +467.761% | -70.506 GiB | 262K vs 32K | directional only; Gemma is much easier to carry locally |

7. Limitations

  1. These are single-workstation measurements on one exact Blackwell host.
  2. The runtime surfaces are not identical: TurboQuant PR45 llama.cpp, native vLLM 0.18, APEX GGUF, and Gemma TurboQuant+ branch.
  3. Same-family quality evidence is stronger here than cross-family comparison evidence.
  4. Gemma local perplexity is treated only as a reproducibility anchor, not as a direct cross-family ranking metric.
  5. Cross-family promotion remains blocked on a direct qualitative task set.

8. Bottom line

  1. Config I + turbo3 remains the balanced Qwen default.
  2. Quality + f16 earned a real throughput-specialist role.
  3. native vLLM 0.18 earned a real prefill-specialist role.
  4. Gemma 4 q8_0/turbo4 is the strongest measured interactive long-context GGUF lane in this tested set on this workstation.
  5. I would not auto-replace the balanced Qwen default with Gemma yet, because I still do not have an apples-to-apples cross-family qualitative gate.

9. Artifact note

The raw workstation docs/receipts behind this summary are currently stored in a separate local evaluation repo rather than in this PR branch, so I am intentionally not pasting broken repo-relative links here. If useful, I can provide the exact local receipt/doc path list or extract a cleaner public artifact bundle.

volschin commented Apr 5, 2026

@tryingET very interesting comparison with regard to vLLM. It would be interesting to see how sglang, with its radix-tree approach, would be positioned here.

@Michael-Z-Freeman

Here's my report :)

PR #45 Benchmark Report

GPU: AMD Radeon RX 9060 XT (ROCm, gfx1200), 16GB VRAM
Backend: HIP/ROCm
Build: e9c54d5 (8772)

Model under test:

  • Source: qwen2.5-7b-instruct-q8_0.gguf
  • Compressed (Config I): qwen2.5-7b-instruct-config-i.gguf
  • Config I quantization result: 7.54 GiB -> 5.16 GiB

Bench settings used:

  • -p 512 -n 128 -fa 1 -ngl 99 -b 256 -ub 128 -r 1

1) Q8_0 baseline

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   7.54 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           pp512 |       1163.73 ± 0.00 |
| qwen2 7B Q8_0                  |   7.54 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           tg128 |         38.15 ± 0.00 |

build: e9c54d557 (8772)

2) Config I (TQ4_1S)

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           pp512 |       1330.85 ± 0.00 |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |  1 |           tg128 |         42.33 ± 0.00 |

build: e9c54d557 (8772)

3) Config I + TurboQuant KV (-ctk q8_0 -ctv turbo4)

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
  Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32, VRAM: 16304 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |   q8_0 | turbo4 |  1 |           pp512 |       1303.25 ± 0.00 |
| qwen2 7B Q8_0                  |   5.16 GiB |     7.62 B | ROCm       |  99 |     256 |      128 |   q8_0 | turbo4 |  1 |           tg128 |         41.57 ± 0.00 |

build: e9c54d557 (8772)


spiritbuun referenced this pull request in spiritbuun/buun-llama-cpp Apr 6, 2026
- turbo4 K+V results on Qwen3.5-27B (-0.32% vs q8_0) and Qwen3-14B (+6.3%)
- Sparse V dequant benchmarks: MoE native dequant +10.9% at 8K
- Gemma-3 turbo3 results post-iSWA fix (+3.3%)
- KVLinC no-K-rotation negative result
- Speculative decoding negative result
- CUDA 13.2 compatibility verified
- Experiments #31, #39, #42, #45, #49, #50, #51 status updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bobylein commented Apr 7, 2026

I guess you're done here anyway, but I wanted to let you know that the compression as described in Getting Started either misses something or isn't very clear: in my case I wasn't able to replicate the "Quick Test", as the output model (Qwen 3.5 35B) always turned out nearly the same size with config_i, except for a few MBs.

Or do I need to change config_i for the 35B model somewhat?

str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[...]
llama_tensor_get_type: blk.15.attn_k.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_output.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_q.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
llama_tensor_get_type: blk.15.attn_v.weight - applying manual override: q8_0 -> tq4_1s
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[...]
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
[ 1/ 733] output.weight - [ 2048, 248320, 1, 1], type = q8_0, size = 515.312 MiB
[ 2/ 733] output_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 3/ 733] token_embd.weight - [ 2048, 248320, 1, 1], type = q8_0, size = 515.312 MiB
[ 4/ 733] blk.0.attn_gate.weight - [ 2048, 4096, 1, 1], type = q8_0, size = 8.500 MiB
[ 5/ 733] blk.0.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 6/ 733] blk.0.attn_qkv.weight - [ 2048, 8192, 1, 1], type = q8_0, size = 17.000 MiB
[...]
[ 419/ 733] blk.22.ssm_a - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 420/ 733] blk.22.ssm_alpha.weight - [ 2048, 32, 1, 1], type = q8_0, size = 0.066 MiB
[ 421/ 733] blk.22.ssm_beta.weight - [ 2048, 32, 1, 1], type = q8_0, size = 0.066 MiB
[ 422/ 733] blk.22.ssm_conv1d.weight - [ 4, 8192, 1, 1], type = f32, size = 0.125 MiB
[ 423/ 733] blk.22.ssm_dt.bias - [ 32, 1, 1, 1], type = f32, size = 0.000 MiB
[ 424/ 733] blk.22.ssm_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MiB
[ 425/ 733] blk.22.ssm_out.weight - [ 4096, 2048, 1, 1], type = q8_0, size = 8.500 MiB
[ 426/ 733] blk.23.attn_k.weight - [ 2048, 512, 1, 1], type = q8_0, converting to tq4_1s .. size = 1.06 MiB -> 0.62 MiB
[ 427/ 733] blk.23.attn_k_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 428/ 733] blk.23.attn_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 429/ 733] blk.23.attn_output.weight - [ 4096, 2048, 1, 1], type = q8_0, converting to tq4_1s .. size = 8.50 MiB -> 5.00 MiB
[ 430/ 733] blk.23.attn_q.weight - [ 2048, 8192, 1, 1], type = q8_0, converting to tq4_1s .. size = 17.00 MiB -> 10.00 MiB
[ 431/ 733] blk.23.attn_q_norm.weight - [ 256, 1, 1, 1], type = f32, size = 0.001 MiB
[ 432/ 733] blk.23.attn_v.weight - [ 2048, 512, 1, 1], type = q8_0, converting to tq4_1s .. size = 1.06 MiB -> 0.62 MiB
[...]
[ 729/ 733] blk.39.ffn_gate_inp_shexp.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
[ 730/ 733] blk.39.ffn_gate_shexp.weight - [ 2048, 512, 1, 1], type = q8_0, size = 1.062 MiB
[ 731/ 733] blk.39.ffn_up_exps.weight - [ 2048, 512, 256, 1], type = q8_0, size = 272.000 MiB
[ 732/ 733] blk.39.ffn_up_shexp.weight - [ 2048, 512, 1, 1], type = q8_0, size = 1.062 MiB
[ 733/ 733] blk.39.post_attention_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
llama_model_quantize_impl: model size = 35183.10 MiB (8.52 BPW)
llama_model_quantize_impl: quant size = 35069.35 MiB (8.49 BPW)

TheTom commented Apr 7, 2026

Thank you for the continued benchmarks!

TheTom commented Apr 7, 2026

> I guess you're done here anyway, but I wanted to let you know that the compression as described in Getting Started either misses something or isn't very clear: in my case I wasn't able to replicate the "Quick Test", as the output model (Qwen 3.5 35B) always turned out nearly the same size with config_i, except for a few MBs.
>
> Or do I need to change config_i for the 35B model somewhat?


Good catch. Qwen3.5-35B is a hybrid model — only 16 of 64 layers have split attn_q/attn_k/attn_v tensors (blk.3, 7, 11, 15, ..., 63). The other 48 are Gated Delta Net layers with fused attn_qkv tensors that don't match the Config I pattern, so they stay at q8_0.

You need to adjust n_layers to 64 for this model, but the real issue is that the config pattern only hits the self-attention layers. The fused attn_qkv and SSM tensors in delta net layers aren't targeted, so compression on this model will be smaller than expected (~30% vs ~37% on pure attention models like Qwen2.5-27B).

Compressing fused attn_qkv would need adding it as a target in the config file. Haven't validated that yet — delta net attention may have different sensitivity. I'll check on this more tomorrow.

iamwavecut pushed a commit to iamwavecut/llama-cpp-turboquant that referenced this pull request Apr 8, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 9, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026
@Titaniumtown

I'm working on a Vulkan port of this. PR soon.

KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 13, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 14, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 15, 2026
TheTom added a commit that referenced this pull request Apr 15, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 23, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 27, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
