feat: add TBQ types (SRHT + Lloyd-Max) alongside TURBO by dusterbloom · Pull Request #2 · spiritbuun/buun-llama-cpp

dusterbloom · 2026-03-28T16:49:35Z

TBQ: 300K Context on a Single RTX 3090

Three new KV cache types — tbq2 (2-bit), tbq3 (3-bit), tbq4 (4-bit) — using SRHT + Lloyd-Max optimal codebook quantization. Native CUDA with Flash Attention.

Why merge this

1. It unlocks context lengths that are physically impossible with f16 KV.

Model	f16 KV max (24GB)	tbq3 max	tbq2 max
Qwen3.5-9B Q8_0	~32K	200K	300K
Qwen3.5-35B-A3B MoE	~8K	100K (ngl 999)	~150K
Nemotron-Cascade-2 31B	~8K	131K	~170K

f16 KV for 300K tokens on 9B needs 91GB. TBQ2 fits in 21GB.

2. Decode speed stays flat regardless of context length.

Qwen3.5-9B, tbq3 K+V	tg t/s
32K context	81
65K context	78
100K context	82
200K context	84

No decode degradation. The native vec_dot kernel reads TBQ blocks directly with zero temp allocation.

3. Near-zero quality loss on every architecture tested.

Model	head_dim	f16 PPL	tbq4 K+V PPL	delta
Qwen3.5-9B (dense)	256	8.196	8.214	+0.21%
Qwen3.5-27B (dense)	144	8.602	8.628	+0.30%
Nemotron-Cascade-2 31B (hybrid)	128	9.897	10.002	+1.06%
Nemotron-Nano-9B (hybrid)	128	8.342	8.368	+0.30%

4. +28% decode speed on hybrid Mamba-Transformer models.

Nemotron-Cascade-2 31B	f16	tbq3 K+V
tg2048	159 t/s	204 t/s (+28%)
context decay (tg512→tg2048)	-17%	-2%

5. Coexists with TURBO + InnerQ. Zero breaking changes.

Users choose via -ctk tbq4 -ctv tbq4 -fa on. All existing TURBO types and InnerQ work unchanged. 50/50 tests pass.

Compression vs quality tradeoff

Type	bpv	compression	PPL (9B)	Best for
tbq4	4.1	3.9x	+0.21%	Quality-sensitive
tbq3	3.1	5.1x	+0.78%	Balanced
tbq2	2.1	7.5x	+3.9%	Maximum context

Compatibility

head_dim=128, 144, 256 all tested (auto-pads to multiple of 128)
Dense (Qwen 9B, 27B), MoE (Qwen 35B-A3B), Hybrid (Nemotron Cascade, Nano)
Partial GPU offload works (CPU layers fall back to q8_0)
FA required (same as TURBO)

Implementation

8 commits, ~6,200 lines. Key design: native vec_dot for decode (Q pre-rotation + centroid lookup, O(1) temp memory), dequant-to-f16 + MMA for prefill.

Quick start

# 300K context, single GPU
./llama-server -m Qwen3.5-9B-Q8_0.gguf --ctx-size 300000 \
  -ngl 999 -fa on --cache-type-k tbq2 --cache-type-v tbq2

# Best quality/speed
./llama-server -m model.gguf -fa on \
  --cache-type-k tbq4 --cache-type-v tbq4

…ation Port TBQ (TurboQuant-B) types from ik_llama.cpp alongside existing TURBO types. Uses SRHT rotation + Lloyd-Max codebook (3-bit/4-bit) with 128-element blocks. Key differences from TURBO: packed codebook indices (no separate signs array), unnormalized Hadamard in quant path (centroids expect N(0,1) scale), norm correction (stored_norm = ||x|| / ||centroids||). Qwen3.5-9B: tbq4 K + f16 V = +0.09% PPL, matching ik_llama.cpp reference. Decode path dequants TBQ→f16 via inverse SRHT for FA compatibility.

Route TBQ prefill (Q->ne[1] > 1) through dedicated ggml_cuda_tbq_prefill_attend that bulk-dequants K/V via inverse SRHT to f16, then dispatches MMA kernel. Simpler than TURBO prefill — no Q pre-rotation needed since dequant produces original-domain values. Prefill speed: 4,755 t/s (tbq4) vs 4,710 t/s (f16 baseline) on Qwen3.5-9B. PPL unchanged at 8.2038 (+0.09%).

…en malloc Replace cudaMallocAsync/cudaFreeAsync per decode token with persistent per-device buffers (same pattern as q_rot_buf for TURBO Q rotation). Buffers grow-only via cudaMalloc on first use or size increase. Decode tg128: 63 → 83 t/s on Qwen3.5-9B (recovers to f16 parity). PPL unchanged at 8.2038.

Skip vec_dot tests when vec_dot function pointer is NULL — TURBO and TBQ types are GPU-only KV cache quantizations without CPU dot product support. Add MAX_QUANTIZATION_TOTAL_ERROR_TURBO threshold (0.05) for rotated-domain types that have inherently higher CPU round-trip error on non-rotated test data. Fixes test-quantize-fns, test-quantize-perf, and test-gguf segfaults. All 50 tests now pass (excluding tokenizer vocab tests).

dusterbloom · 2026-03-28T16:54:20Z

@spiritbuun have a look at the PR, hope you appreciate

Whamp · 2026-03-28T16:55:38Z

have you tried any very long prompts like 200k tokens? Have you tried multi-gpu setup?

dusterbloom · 2026-03-28T16:56:22Z

no I have only one GPU so not sure about that. long prompt can do right now ... in progress

…ingle GPU Switch TBQ decode from dequant-to-f16 to native vec_dot with Q pre-rotation via k_tbq_fwht_forward. Eliminates O(context) temporary f16 buffer that was the bottleneck for long context — at 65K+ the temp buffer exceeded VRAM. Now scales to 200K tokens on Qwen3.5-9B Q8_0 with tbq3 K+V on a single RTX 3090 (24GB). f16 KV would need 61GB at this context length. 200K benchmark: 1,939 pp t/s, 83.6 tg t/s. PPL unchanged at 8.2038.

dusterbloom · 2026-03-28T17:48:35Z

@Whamp tested 200K tokens on a single RTX 3090 (24GB):

Qwen3.5-9B Q8_0, tbq3 K+V, FA=1:

Context	pp t/s	tg t/s
32K	3,915	81.3
65K	2,939	77.6
100K	2,678	81.9
131K	2,500	57.8
200K	1,939	83.6

f16 KV would need 61GB VRAM for 200K context on this model — impossible on a single 24GB card.

Key fix: switched TBQ decode from dequant-to-f16 to native vec_dot with Q pre-rotation. This eliminates the O(context) temporary f16 buffer that was the VRAM bottleneck. Decode stays at ~80 t/s regardless of context length.

No multi-GPU setup available to test — single RTX 3090 only. The implementation should work with tensor-split across GPUs since it uses standard FA dispatch, but untested.

PPL unchanged at 8.2038 (+0.09% vs f16 baseline).

dusterbloom · 2026-03-28T18:52:16Z

Usage Examples

llama-server (OpenAI-compatible API)

# Qwen3.5-9B with TBQ4 KV — 200K context on single RTX 3090
./llama-server -m Qwen3.5-9B-Q8_0.gguf \
  --ctx-size 200000 --batch-size 4096 --ubatch-size 4096 \
  --n-gpu-layers 999 --flash-attn on \
  --cache-type-k tbq4 --cache-type-v tbq4 \
  --host 0.0.0.0 --port 8080

# Qwen3.5-35B-A3B MoE — 131K context, partial offload
./llama-server -m Qwen3.5-35B-A3B-Q3_K_XL.gguf \
  --ctx-size 131072 --batch-size 4096 --ubatch-size 4096 \
  --n-gpu-layers 50 --flash-attn on \
  --cache-type-k tbq3 --cache-type-v tbq3 \
  --host 0.0.0.0 --port 8080

# Nemotron-Cascade-2 31B hybrid — 131K context
./llama-server -m Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf \
  --ctx-size 131072 --batch-size 4096 --ubatch-size 4096 \
  --n-gpu-layers 999 --flash-attn on \
  --cache-type-k tbq3 --cache-type-v tbq3 \
  --host 0.0.0.0 --port 8080

llama-bench (benchmarking)

# Context scaling test
./llama-bench -m model.gguf \
  -ctk tbq3 -ctv tbq3 -fa 1 -ngl 999 \
  -p 32768,65536,131072,200000 -n 128 -r 1 -o md

# Compare KV types
./llama-bench -m model.gguf \
  -ctk f16,tbq4,tbq3,turbo3 -ctv f16 \
  -fa 1 -ngl 999 -p 8192 -n 2048 -r 2 -o md

llama-perplexity (quality validation)

./llama-perplexity -m model.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -ctk tbq4 -ctv f16 -fa 1 -ngl 999 -c 512

Available TBQ types

tbq4 — 4-bit, 4.125 bpv, 3.9x compression. Best quality (+0.09% PPL on 9B)
tbq3 — 3-bit, 3.125 bpv, 5.1x compression. Best for max context (+0.78% PPL on 9B)

Both require -fa on (flash attention).

Whamp · 2026-03-28T19:09:14Z

Apologies if I'm being a bit dense, but I couldn't quite tell for certain if you were reporting results for a context window set to that size or actually processing a prompt of that size.

dusterbloom · 2026-03-28T20:42:33Z

@Whamp Good question — yes, these are actual 200K token prefills, not just setting the context window.

llama-bench -p 200000 generates 200,000 tokens of synthetic input and runs them through the full prefill path (all layers, all attention). The pp (prompt processing) number is real throughput processing that many tokens. The tg (token generation) number is decode speed after those 200K tokens are in the KV cache.

So when we report:

pp200000: 1,939 t/s
tg128:      84 t/s

That means:

Actually filled the KV cache with 200,000 tokens at 1,939 tokens/second
Then generated 128 new tokens at 84 tokens/second, with 200K tokens in context

The KV cache at 200K tokens with tbq3 K+V uses ~12 GB of the 24 GB VRAM. With f16 KV it would need 61 GB — 2.5x more than the entire GPU.

Update: TBQ2 (2-bit, 7.5x compression) is now working too. Testing 300K+ context right now:

200K: 1,942 pp t/s, 75 tg t/s
PPL: 8.515 (+3.9% vs f16 baseline)
KV at 300K: only 12.2 GB — should fit

4-level Lloyd-Max codebook for N(0,1). 34 bytes per 128 values (2.125 bpv). Enables 300K context on Qwen3.5-9B with single RTX 3090. 300K benchmark: 1,501 pp t/s, 81 tg t/s. PPL: 8.515 (+3.9%).

Models with head_dim not divisible by 128 (e.g. Qwen3.5-27B with head_dim=144, n_embd_k_gqa=576) now work with TBQ by rounding up the KV tensor dimension to the next multiple of 128. Extra elements are zero-padded. Qwen3.5-27B IQ2_XXS: tbq4 K+V PPL = 8.628 (+0.30% vs f16 baseline 8.602).

spiritbuun · 2026-03-28T22:00:16Z

@dusterbloom I don't want to throw more types into this branch but I used your implementation of inverse-FWHT prefill and brought the turbo4 prefill from 420 tok/s to 1124 tok/s (+167%). Thank you!!

@dusterbloom

Inverse FWHT during K dequant-to-fp16 mixes centroid values in float32 shmem before fp16 cast, avoiding the precision collapse that forced turbo4 off the MMA path. K-only (V fp16 loss is negligible). Q stays unrotated since K is now in the original domain. turbo4 prefill: 420 → 1124 tok/s. PPL and decode unchanged. Inspired-By: @dusterbloom (PR #2) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mraxai added 4 commits March 28, 2026 17:51

dusterbloom force-pushed the feature/turboquant-kv-cache branch from 0d4246c to 2785c89 Compare March 28, 2026 16:51

dusterbloom marked this pull request as ready for review March 28, 2026 16:53

mraxai added 2 commits March 28, 2026 21:49

feat: add GGML_TYPE_TBQ2_0 — 2-bit SRHT + Lloyd-Max (7.5x compression)

e7d31f6

4-level Lloyd-Max codebook for N(0,1). 34 bytes per 128 values (2.125 bpv). Enables 300K context on Qwen3.5-9B with single RTX 3090. 300K benchmark: 1,501 pp t/s, 81 tg t/s. PPL: 8.515 (+3.9%).

spiritbuun closed this Mar 28, 2026

HumbleUser33 mentioned this pull request Apr 2, 2026

Segfault on startup with CUDA build (RTX 3090) — get_all_kv_cache_types() NULL pointer #9

Open

LateLinux mentioned this pull request May 8, 2026

Misc. bug: [CUDA error] In a dual RTX 4090 environment. #47

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add TBQ types (SRHT + Lloyd-Max) alongside TURBO#2

feat: add TBQ types (SRHT + Lloyd-Max) alongside TURBO#2
dusterbloom wants to merge 7 commits into
spiritbuun:feature/turboquant-kv-cachefrom
dusterbloom:feature/turboquant-kv-cache

dusterbloom commented Mar 28, 2026 •

edited

Loading

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

Whamp commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026 •

edited

Loading

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

Whamp commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

spiritbuun commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dusterbloom commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TBQ: 300K Context on a Single RTX 3090

Why merge this

Compression vs quality tradeoff

Compatibility

Implementation

Quick start

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

Whamp commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026

Usage Examples

llama-server (OpenAI-compatible API)

llama-bench (benchmarking)

llama-perplexity (quality validation)

Available TBQ types

Uh oh!

Whamp commented Mar 28, 2026

Uh oh!

dusterbloom commented Mar 28, 2026

Uh oh!

spiritbuun commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dusterbloom commented Mar 28, 2026 •

edited

Loading

dusterbloom commented Mar 28, 2026 •

edited

Loading