
feat: Add --kv-cache-bits flag for KV cache quantization#67

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/kv-cache-quantization

Conversation

@janhilgard
Collaborator

Summary

  • Adds optional --kv-cache-bits {4,8} and --kv-cache-group-size CLI flags to quantize KV cache entries in the prefix cache
  • 4-bit quantization saves ~75% memory, 8-bit saves ~50% — allowing significantly more cache entries to fit in memory
  • Quantization is transparent: applied on store, dequantized on fetch, so BatchGenerator always operates on full-precision KV states (no impact on inference quality)

Motivation

Requested in #17. The prefix cache can consume significant memory, especially with large context windows (128K+). This feature trades a small amount of compute (quantize/dequantize on cache store/fetch) for a major reduction in memory usage, enabling more concurrent users and longer context caching.
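
To put the savings in context, here is a back-of-envelope sizing for a single 128K-token sequence. The model dimensions below are illustrative assumptions, not taken from this PR:

```python
# Rough KV cache sizing for one 128K-token sequence.
# All model dimensions are illustrative assumptions.
n_layers = 32
n_kv_heads = 8
head_dim = 128
context = 128 * 1024
bytes_fp16 = 2  # float16 K and V entries

# 2x for keys and values
fp16_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_fp16
print(f"fp16:  {fp16_bytes / 2**30:.1f} GiB")
# 8-bit: ~1 byte per element (plus small per-group scale/bias overhead)
print(f"8-bit: {fp16_bytes / 2 / 2**30:.1f} GiB  (~50% savings)")
# 4-bit: ~0.5 byte per element
print(f"4-bit: {fp16_bytes / 4 / 2**30:.1f} GiB  (~75% savings)")
```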

Changes

| File | Change |
| --- | --- |
| `vllm_mlx/cli.py` | Add `--kv-cache-bits` and `--kv-cache-group-size` arguments |
| `vllm_mlx/scheduler.py` | Quantize cache on store, dequantize on fetch in all cache paths (memory-aware and legacy) |
| `vllm_mlx/memory_cache.py` | Add `quantize_kv_cache()` / `dequantize_kv_cache()` helpers; support `QuantizedKVCache` in memory estimation and cache trimming |

Usage

# 8-bit quantization (~50% memory savings, minimal quality impact)
vllm-mlx serve model --kv-cache-bits 8

# 4-bit quantization (~75% memory savings)
vllm-mlx serve model --kv-cache-bits 4

# Custom group size
vllm-mlx serve model --kv-cache-bits 8 --kv-cache-group-size 32

Implementation details

  • Uses mlx-lm's built-in QuantizedKVCache and KVCache.to_quantized() for quantization
  • Uses mx.dequantize() for converting back to full precision on cache fetch
  • Memory estimation (estimate_kv_cache_memory) updated to handle quantized 3-tuple format (data, scales, biases)
  • Cache trimming (_trim_cache_offset) updated to handle QuantizedKVCache objects
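
As a rough sketch of what the group-wise quantization step does to produce the (data, scales, biases) 3-tuple, here is a NumPy stand-in for mlx's `mx.quantize` / `mx.dequantize` kernels. The function names and the exact affine scheme are assumptions for illustration, not the PR's code:

```python
import numpy as np

def quantize_kv(x, bits=8, group_size=64):
    """Group-wise affine quantization along the last axis.

    Returns a (data, scales, biases) 3-tuple analogous in shape to
    what the quantized cache stores; a sketch, not mlx's kernel.
    """
    g = x.reshape(-1, group_size).astype(np.float32)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scales = (hi - lo) / (2**bits - 1)
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round((g - lo) / scales), 0, 2**bits - 1).astype(np.uint8)
    return q.reshape(x.shape), scales, lo  # (data, scales, biases)

def dequantize_kv(q, scales, biases, group_size=64):
    """Invert quantize_kv back to float32 (lossy roundtrip)."""
    g = q.reshape(-1, group_size).astype(np.float32)
    return (g * scales + biases).reshape(q.shape)

# Roundtrip check: 8-bit error stays small relative to the value range.
rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 256)).astype(np.float32)
q, s, b = quantize_kv(kv, bits=8)
kv_hat = dequantize_kv(q, s, b)
print(float(np.abs(kv - kv_hat).max()))
```

The roundtrip error is bounded by half a quantization step per group, which is why 8-bit caching has minimal quality impact while halving storage.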

Test plan

  • Verify --kv-cache-bits 8 starts without errors
  • Verify --kv-cache-bits 4 starts without errors
  • Confirm prefix cache hit/miss behavior unchanged
  • Compare inference quality with and without quantization
  • Verify memory savings via /health endpoint memory stats

🤖 Generated with Claude Code

Add optional KV cache quantization for prefix cache storage to reduce
memory usage. Supports 4-bit (~75% savings) and 8-bit (~50% savings)
quantization using mlx-lm's QuantizedKVCache.

New CLI flags:
  --kv-cache-bits {4,8}       Enable KV cache quantization
  --kv-cache-group-size N     Group size for quantization (default: 64)

Quantization is applied transparently when storing to prefix cache and
dequantized on fetch, so BatchGenerator always operates on full-precision
KV states. This allows significantly more cache entries to fit in memory
without changing inference behavior.

Changes:
- cli.py: Add --kv-cache-bits and --kv-cache-group-size arguments
- scheduler.py: Quantize on store, dequantize on fetch in all cache paths
- memory_cache.py: Add quantize/dequantize helpers, support QuantizedKVCache
  in memory estimation and cache trimming

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waybarrios
Owner

@janhilgard I think this PR is a duplicate of #62.

Take a look at that one.

@janhilgard
Collaborator Author

Closing as duplicate of #62 which implements the same feature with unit tests.
