feat: Lloyd-Max optimal codebooks for KV cache quantization#182
Closed
influenist wants to merge 5 commits intoturboderp-org:masterfrom
Closed
feat: Lloyd-Max optimal codebooks for KV cache quantization#182influenist wants to merge 5 commits intoturboderp-org:masterfrom
influenist wants to merge 5 commits intoturboderp-org:masterfrom
Conversation
Adds CacheLayer_lloyd_max — drop-in replacement for CacheLayer_quant using k-means optimal codebooks instead of uniform spacing. Codebooks computed via k-means (Lloyd's algorithm) on the actual post-WHT, post-max-normalized distribution (10M samples, bounded [-1,1], std=0.43). NOT Gaussian-optimal — tuned for the real data. Hardware-validated results (RTX 3090, CUDA kernels): 2-bit: +1.31 dB SQNR, 26.0% MSE reduction 3-bit: +0.98 dB SQNR, 20.3% MSE reduction 4-bit: +1.12 dB SQNR, 22.7% MSE reduction 5-bit: +1.40 dB SQNR, 27.6% MSE reduction 6-bit: +1.39 dB SQNR, 27.4% MSE reduction 7-bit: +1.51 dB SQNR, 29.3% MSE reduction 8-bit: +0.81 dB SQNR, 17.0% MSE reduction Latency: 1.02-1.03x (negligible overhead) Implementation: - lloyd_max_codebooks.cuh: k-means boundaries/centroids for 2-8 bit - lm_cache_kernels.cuh: quant_block_lm/dequant_block_lm templates - lm_cache.cu/.cuh: Cont + paged wrapper functions - cache/lloyd_max.py: CacheLayer_lloyd_max (drop-in) - Wire-compatible bitplane layout (same storage format) - compute_kmeans_codebooks.py: Reproducible codebook computation - bench_lloyd_max_v2.py: Self-contained benchmark (PyTorch only) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
Author
|
Update: validated on real model weights (TinyLlama-1.1B) Ran the Lloyd-Max kernels on actual KV projection outputs (hidden states projected through real K/V weight matrices, not random data): Consistent with the random-data results. Full model perplexity (end-to-end inference with quantized cache) is still TODO. |
Adds finer scale granularity following TQ3_4S pattern: 4 scales per
32-element block (one per 8 elements) instead of 1 scale per 32.
WHT stays on full 32-element warp. Sub-group scale computation uses
scoped warp shuffle (shuffle_max_sub). Bitplane packing unchanged.
New functions: quant/dequant_lm_cache_{cont,paged}_sub
Activated via: CacheLayer_lloyd_max(sub_scale_size=8)
Default sub_scale_size=32 preserves existing behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instant TQ (LinearTQInstant): - WHT + Lloyd-Max quantization at model load time - No calibration data, takes seconds per layer - 4-bit cosine similarity vs FP16: 0.99 - Configurable bits (2-8) and sub-block scales Hot-swap (TQToEXL3HotSwap): - Background thread converts TQ layers to EXL3 progressively - Thread-safe layer swap with lock - Model serves inference the entire time - Uses q_fallback=True for round-to-nearest EXL3 (no Hessian needed) Usage: linear.convert_tq_instant(bits=4, sub_scale_size=8) # instant swapper = TQToEXL3HotSwap(model) swapper.start() # background upgrade Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
K vectors have 4-9x more statistical structure than V (per-channel bias, heterogeneous variance, heavy tails). Asymmetric quant (scale + zero-point) captures the per-channel DC offset that symmetric misses. Implementation: quant_block_lm_sub_asym / dequant_block_lm_sub_asym - K: asymmetric (scale + zero-point per sub-8 group) - V: symmetric Lloyd-Max (unchanged, already near-optimal for V) - Combined in single paged kernel launch Activated via: CacheLayer_lloyd_max(asymmetric=True, sub_scale_size=8) Based on KIVI (ICML 2024) finding: per-channel K + per-token V is optimal. Community consensus: 4-bit K + 2-bit V retains 98.3% accuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
|
Withdrawn — moved to separate project |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
CacheLayer_lloyd_max— drop-in alternative toCacheLayer_quantwith three stacked improvements:All hardware-validated on RTX 3090 with real TinyLlama-1.1B KV projections.
Results (RTX 3090, TinyLlama-1.1B)
Latency: identical to stock (all improvements are in codebook values and scale storage, not algorithm complexity).
Per-technique contribution (4-bit)
Why asymmetric matters
Analysis of TinyLlama-1.1B K/V projections shows K vectors have fundamentally different statistics than V:
K has systematic per-channel non-zero mean and outlier channels. Symmetric quantization wastes half the code range. Asymmetric (scale + zero-point) captures this bias directly.
Supported by KIVI (ICML 2024): per-channel K + per-token V is optimal.
Usage
Files (5 commits, clean branch from master)
References
🤖 Generated with Claude Code