Skip to content

feat: Lloyd-Max optimal codebooks for KV cache quantization#182

Closed
influenist wants to merge 5 commits intoturboderp-org:masterfrom
influenist:feature/lloyd-max-kv-cache
Closed

feat: Lloyd-Max optimal codebooks for KV cache quantization#182
influenist wants to merge 5 commits intoturboderp-org:masterfrom
influenist:feature/lloyd-max-kv-cache

Conversation

@influenist
Copy link
Copy Markdown

@influenist influenist commented Apr 1, 2026

Summary

Adds CacheLayer_lloyd_max — drop-in alternative to CacheLayer_quant with three stacked improvements:

  1. Lloyd-Max codebooks: k-means optimal centroids for post-WHT distribution
  2. Sub-block scales (per-8): 4 scales per 32-element block instead of 1
  3. Asymmetric K quantization: scale + zero-point captures per-channel bias

All hardware-validated on RTX 3090 with real TinyLlama-1.1B KV projections.

Results (RTX 3090, TinyLlama-1.1B)

Bits Stock Uniform + Lloyd-Max + Sub-8 scales + Asymmetric Total gain
2 8.19 dB 9.51 dB 9.88 dB 12.05 dB +3.86 dB
3 14.49 dB 15.51 dB 16.80 dB 19.47 dB +4.98 dB
4 20.63 dB 21.81 dB 23.72 dB 26.09 dB +5.46 dB
5 26.69 dB 28.05 dB 30.17 dB 32.38 dB +5.69 dB
6 32.73 dB 34.19 dB 36.37 dB 38.54 dB +5.81 dB

Latency: identical to stock (all improvements are in codebook values and scale storage, not algorithm complexity).

Per-technique contribution (4-bit)

Technique SQNR gain MSE reduction
Lloyd-Max codebooks +1.19 dB 24%
Sub-block scales (per-8) +1.90 dB
Asymmetric (scale + zero-point) +2.37 dB
Combined +5.46 dB ~71%

Why asymmetric matters

Analysis of TinyLlama-1.1B K/V projections shows K vectors have fundamentally different statistics than V:

Metric K V K/V ratio
Per-channel bias 0.095 0.024 4x
Channel variance heterogeneity 0.731 0.077 9.5x
Kurtosis (heavy tails) 1.77 0.26 6.8x
Dynamic range 14.08 2.68 5.3x

K has systematic per-channel non-zero mean and outlier channels. Symmetric quantization wastes half the code range. Asymmetric (scale + zero-point) captures this bias directly.

Supported by KIVI (ICML 2024): per-channel K + per-token V is optimal.

Usage

from exllamav3.cache import CacheLayer_lloyd_max

# Full stack: Lloyd-Max + sub-8 + asymmetric
cache = Cache(model, max_tokens,
    layer_type=CacheLayer_lloyd_max,
    k_bits=4, v_bits=4,
    sub_scale_size=8,
    asymmetric=True)

Files (5 commits, clean branch from master)

File Purpose
`lloyd_max_codebooks.cuh` k-means optimal codebooks (2-8 bit)
`lm_cache_kernels.cuh` All kernel variants: standard, sub-block, asymmetric
`lm_cache.cu/.cuh` Wrappers: cont + paged × {standard, sub, asym}
`cache/lloyd_max.py` `CacheLayer_lloyd_max(sub_scale_size, asymmetric)`
`compute_kmeans_codebooks.py` Reproducible codebook computation
`bench_lloyd_max_v2.py` Benchmark script

References

🤖 Generated with Claude Code

Adds CacheLayer_lloyd_max — drop-in replacement for CacheLayer_quant
using k-means optimal codebooks instead of uniform spacing.

Codebooks computed via k-means (Lloyd's algorithm) on the actual
post-WHT, post-max-normalized distribution (10M samples, bounded
[-1,1], std=0.43). NOT Gaussian-optimal — tuned for the real data.

Hardware-validated results (RTX 3090, CUDA kernels):

  2-bit: +1.31 dB SQNR, 26.0% MSE reduction
  3-bit: +0.98 dB SQNR, 20.3% MSE reduction
  4-bit: +1.12 dB SQNR, 22.7% MSE reduction
  5-bit: +1.40 dB SQNR, 27.6% MSE reduction
  6-bit: +1.39 dB SQNR, 27.4% MSE reduction
  7-bit: +1.51 dB SQNR, 29.3% MSE reduction
  8-bit: +0.81 dB SQNR, 17.0% MSE reduction

  Latency: 1.02-1.03x (negligible overhead)

Implementation:
- lloyd_max_codebooks.cuh: k-means boundaries/centroids for 2-8 bit
- lm_cache_kernels.cuh: quant_block_lm/dequant_block_lm templates
- lm_cache.cu/.cuh: Cont + paged wrapper functions
- cache/lloyd_max.py: CacheLayer_lloyd_max (drop-in)
- Wire-compatible bitplane layout (same storage format)
- compute_kmeans_codebooks.py: Reproducible codebook computation
- bench_lloyd_max_v2.py: Self-contained benchmark (PyTorch only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@influenist
Copy link
Copy Markdown
Author

Update: validated on real model weights (TinyLlama-1.1B)

Ran the Lloyd-Max kernels on actual KV projection outputs (hidden states projected through real K/V weight matrices, not random data):

RESULTS ON REAL TinyLlama-1.1B KV PROJECTIONS (RTX 3090)
  2-bit: Uniform=8.20dB  LloydMax=9.51dB  delta=+1.32dB  MSE_red=26.2%
  3-bit: Uniform=14.48dB  LloydMax=15.51dB  delta=+1.03dB  MSE_red=21.1%
  4-bit: Uniform=20.62dB  LloydMax=21.81dB  delta=+1.19dB  MSE_red=24.0%
  5-bit: Uniform=26.70dB  LloydMax=28.05dB  delta=+1.34dB  MSE_red=26.6%
  6-bit: Uniform=32.73dB  LloydMax=34.19dB  delta=+1.46dB  MSE_red=28.5%

Consistent with the random-data results. Full model perplexity (end-to-end inference with quantized cache) is still TODO.

influenist and others added 4 commits April 2, 2026 08:50
Adds finer scale granularity following TQ3_4S pattern: 4 scales per
32-element block (one per 8 elements) instead of 1 scale per 32.

WHT stays on full 32-element warp. Sub-group scale computation uses
scoped warp shuffle (shuffle_max_sub). Bitplane packing unchanged.

New functions: quant/dequant_lm_cache_{cont,paged}_sub
Activated via: CacheLayer_lloyd_max(sub_scale_size=8)
Default sub_scale_size=32 preserves existing behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instant TQ (LinearTQInstant):
- WHT + Lloyd-Max quantization at model load time
- No calibration data, takes seconds per layer
- 4-bit cosine similarity vs FP16: 0.99
- Configurable bits (2-8) and sub-block scales

Hot-swap (TQToEXL3HotSwap):
- Background thread converts TQ layers to EXL3 progressively
- Thread-safe layer swap with lock
- Model serves inference the entire time
- Uses q_fallback=True for round-to-nearest EXL3 (no Hessian needed)

Usage:
  linear.convert_tq_instant(bits=4, sub_scale_size=8)  # instant
  swapper = TQToEXL3HotSwap(model)
  swapper.start()  # background upgrade

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K vectors have 4-9x more statistical structure than V (per-channel bias,
heterogeneous variance, heavy tails). Asymmetric quant (scale + zero-point)
captures the per-channel DC offset that symmetric misses.

Implementation: quant_block_lm_sub_asym / dequant_block_lm_sub_asym
- K: asymmetric (scale + zero-point per sub-8 group)
- V: symmetric Lloyd-Max (unchanged, already near-optimal for V)
- Combined in single paged kernel launch

Activated via: CacheLayer_lloyd_max(asymmetric=True, sub_scale_size=8)

Based on KIVI (ICML 2024) finding: per-channel K + per-token V is optimal.
Community consensus: 4-bit K + 2-bit V retains 98.3% accuracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@influenist
Copy link
Copy Markdown
Author

Withdrawn — moved to separate project

@influenist influenist closed this Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant