feat: Lloyd-Max optimal codebooks for KV cache quantization by influenist · Pull Request #182 · turboderp-org/exllamav3

influenist · 2026-04-01T19:13:03Z

Summary

Adds CacheLayer_lloyd_max — drop-in alternative to CacheLayer_quant with three stacked improvements:

Lloyd-Max codebooks: k-means optimal centroids for post-WHT distribution
Sub-block scales (per-8): 4 scales per 32-element block instead of 1
Asymmetric K quantization: scale + zero-point captures per-channel bias

All hardware-validated on RTX 3090 with real TinyLlama-1.1B KV projections.

Results (RTX 3090, TinyLlama-1.1B)

Bits	Stock Uniform	+ Lloyd-Max	+ Sub-8 scales	+ Asymmetric	Total gain
2	8.19 dB	9.51 dB	9.88 dB	12.05 dB	+3.86 dB
3	14.49 dB	15.51 dB	16.80 dB	19.47 dB	+4.98 dB
4	20.63 dB	21.81 dB	23.72 dB	26.09 dB	+5.46 dB
5	26.69 dB	28.05 dB	30.17 dB	32.38 dB	+5.69 dB
6	32.73 dB	34.19 dB	36.37 dB	38.54 dB	+5.81 dB

Latency: identical to stock (all improvements are in codebook values and scale storage, not algorithm complexity).

Per-technique contribution (4-bit)

Technique	SQNR gain	MSE reduction
Lloyd-Max codebooks	+1.19 dB	24%
Sub-block scales (per-8)	+1.90 dB	—
Asymmetric (scale + zero-point)	+2.37 dB	—
Combined	+5.46 dB	~71%

Why asymmetric matters

Analysis of TinyLlama-1.1B K/V projections shows K vectors have fundamentally different statistics than V:

Metric	K	V	K/V ratio
Per-channel bias	0.095	0.024	4x
Channel variance heterogeneity	0.731	0.077	9.5x
Kurtosis (heavy tails)	1.77	0.26	6.8x
Dynamic range	14.08	2.68	5.3x

K has systematic per-channel non-zero mean and outlier channels. Symmetric quantization wastes half the code range. Asymmetric (scale + zero-point) captures this bias directly.

Supported by KIVI (ICML 2024): per-channel K + per-token V is optimal.

Usage

from exllamav3.cache import CacheLayer_lloyd_max

# Full stack: Lloyd-Max + sub-8 + asymmetric
cache = Cache(model, max_tokens,
    layer_type=CacheLayer_lloyd_max,
    k_bits=4, v_bits=4,
    sub_scale_size=8,
    asymmetric=True)

Files (5 commits, clean branch from master)

File	Purpose
`lloyd_max_codebooks.cuh`	k-means optimal codebooks (2-8 bit)
`lm_cache_kernels.cuh`	All kernel variants: standard, sub-block, asymmetric
`lm_cache.cu/.cuh`	Wrappers: cont + paged × {standard, sub, asym}
`cache/lloyd_max.py`	`CacheLayer_lloyd_max(sub_scale_size, asymmetric)`
`compute_kmeans_codebooks.py`	Reproducible codebook computation
`bench_lloyd_max_v2.py`	Benchmark script

References

TurboQuant (ICLR 2026) — WHT + Lloyd-Max
TQ3_4S — sub-block scale insight
KIVI (ICML 2024) — asymmetric K/V
Lloyd 1957 / Max 1960 — the algorithm

🤖 Generated with Claude Code

Adds CacheLayer_lloyd_max — drop-in replacement for CacheLayer_quant using k-means optimal codebooks instead of uniform spacing. Codebooks computed via k-means (Lloyd's algorithm) on the actual post-WHT, post-max-normalized distribution (10M samples, bounded [-1,1], std=0.43). NOT Gaussian-optimal — tuned for the real data. Hardware-validated results (RTX 3090, CUDA kernels): 2-bit: +1.31 dB SQNR, 26.0% MSE reduction 3-bit: +0.98 dB SQNR, 20.3% MSE reduction 4-bit: +1.12 dB SQNR, 22.7% MSE reduction 5-bit: +1.40 dB SQNR, 27.6% MSE reduction 6-bit: +1.39 dB SQNR, 27.4% MSE reduction 7-bit: +1.51 dB SQNR, 29.3% MSE reduction 8-bit: +0.81 dB SQNR, 17.0% MSE reduction Latency: 1.02-1.03x (negligible overhead) Implementation: - lloyd_max_codebooks.cuh: k-means boundaries/centroids for 2-8 bit - lm_cache_kernels.cuh: quant_block_lm/dequant_block_lm templates - lm_cache.cu/.cuh: Cont + paged wrapper functions - cache/lloyd_max.py: CacheLayer_lloyd_max (drop-in) - Wire-compatible bitplane layout (same storage format) - compute_kmeans_codebooks.py: Reproducible codebook computation - bench_lloyd_max_v2.py: Self-contained benchmark (PyTorch only) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

influenist · 2026-04-01T21:40:55Z

Update: validated on real model weights (TinyLlama-1.1B)

Ran the Lloyd-Max kernels on actual KV projection outputs (hidden states projected through real K/V weight matrices, not random data):

RESULTS ON REAL TinyLlama-1.1B KV PROJECTIONS (RTX 3090)
  2-bit: Uniform=8.20dB  LloydMax=9.51dB  delta=+1.32dB  MSE_red=26.2%
  3-bit: Uniform=14.48dB  LloydMax=15.51dB  delta=+1.03dB  MSE_red=21.1%
  4-bit: Uniform=20.62dB  LloydMax=21.81dB  delta=+1.19dB  MSE_red=24.0%
  5-bit: Uniform=26.70dB  LloydMax=28.05dB  delta=+1.34dB  MSE_red=26.6%
  6-bit: Uniform=32.73dB  LloydMax=34.19dB  delta=+1.46dB  MSE_red=28.5%

Consistent with the random-data results. Full model perplexity (end-to-end inference with quantized cache) is still TODO.

Adds finer scale granularity following TQ3_4S pattern: 4 scales per 32-element block (one per 8 elements) instead of 1 scale per 32. WHT stays on full 32-element warp. Sub-group scale computation uses scoped warp shuffle (shuffle_max_sub). Bitplane packing unchanged. New functions: quant/dequant_lm_cache_{cont,paged}_sub Activated via: CacheLayer_lloyd_max(sub_scale_size=8) Default sub_scale_size=32 preserves existing behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instant TQ (LinearTQInstant): - WHT + Lloyd-Max quantization at model load time - No calibration data, takes seconds per layer - 4-bit cosine similarity vs FP16: 0.99 - Configurable bits (2-8) and sub-block scales Hot-swap (TQToEXL3HotSwap): - Background thread converts TQ layers to EXL3 progressively - Thread-safe layer swap with lock - Model serves inference the entire time - Uses q_fallback=True for round-to-nearest EXL3 (no Hessian needed) Usage: linear.convert_tq_instant(bits=4, sub_scale_size=8) # instant swapper = TQToEXL3HotSwap(model) swapper.start() # background upgrade Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

K vectors have 4-9x more statistical structure than V (per-channel bias, heterogeneous variance, heavy tails). Asymmetric quant (scale + zero-point) captures the per-channel DC offset that symmetric misses. Implementation: quant_block_lm_sub_asym / dequant_block_lm_sub_asym - K: asymmetric (scale + zero-point per sub-8 group) - V: symmetric Lloyd-Max (unchanged, already near-optimal for V) - Combined in single paged kernel launch Activated via: CacheLayer_lloyd_max(asymmetric=True, sub_scale_size=8) Based on KIVI (ICML 2024) finding: per-channel K + per-token V is optimal. Community consensus: 4-bit K + 2-bit V retains 98.3% accuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

influenist · 2026-04-03T12:12:44Z

Withdrawn — moved to separate project

influenist mentioned this pull request Apr 1, 2026

feat: TQ3 compressed ring all-reduce for TP bandwidth reduction #181

Closed

3 tasks

influenist and others added 4 commits April 2, 2026 08:50

fix: add missing 1/sqrt(32) factor in asymmetric dequant

99d2a25

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

influenist closed this Apr 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Lloyd-Max optimal codebooks for KV cache quantization#182

feat: Lloyd-Max optimal codebooks for KV cache quantization#182
influenist wants to merge 5 commits intoturboderp-org:masterfrom
influenist:feature/lloyd-max-kv-cache

influenist commented Apr 1, 2026 •

edited

Loading

Uh oh!

influenist commented Apr 1, 2026

Uh oh!

influenist commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

influenist commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results (RTX 3090, TinyLlama-1.1B)

Per-technique contribution (4-bit)

Why asymmetric matters

Usage

Files (5 commits, clean branch from master)

References

Uh oh!

influenist commented Apr 1, 2026

Uh oh!

influenist commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

influenist commented Apr 1, 2026 •

edited

Loading