
feat: Add --kv-cache-bits flag for KV cache quantization#67

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/kv-cache-quantization

Conversation

@janhilgard
Collaborator

Summary

  • Adds optional --kv-cache-bits {4,8} and --kv-cache-group-size CLI flags to quantize KV cache entries in the prefix cache
  • 4-bit quantization saves ~75% memory, 8-bit saves ~50% — allowing significantly more cache entries to fit in memory
  • Quantization is transparent: applied on store, dequantized on fetch, so BatchGenerator always operates on full-precision KV states (no impact on inference quality)

Motivation

Requested in #17. The prefix cache can consume significant memory, especially with large context windows (128K+). This feature trades a small amount of compute (quantize/dequantize on cache store/fetch) for a major reduction in memory usage, enabling more concurrent users and longer context caching.
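
To put the savings in context, here is a back-of-envelope sizing for a single 128K-token sequence. The model dimensions below are illustrative assumptions, not taken from this PR:

```python
# Rough KV cache sizing for one 128K-token sequence.
# All model dimensions are illustrative assumptions.
n_layers = 32
n_kv_heads = 8
head_dim = 128
context = 128 * 1024
bytes_fp16 = 2  # float16 K and V entries

# 2x for keys and values
fp16_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_fp16
print(f"fp16:  {fp16_bytes / 2**30:.1f} GiB")
# 8-bit: ~1 byte per element (plus small per-group scale/bias overhead)
print(f"8-bit: {fp16_bytes / 2 / 2**30:.1f} GiB  (~50% savings)")
# 4-bit: ~0.5 byte per element
print(f"4-bit: {fp16_bytes / 4 / 2**30:.1f} GiB  (~75% savings)")
```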

Changes

| File | Change |
| --- | --- |
| `vllm_mlx/cli.py` | Add `--kv-cache-bits` and `--kv-cache-group-size` arguments |
| `vllm_mlx/scheduler.py` | Quantize cache on store, dequantize on fetch in all cache paths (memory-aware and legacy) |
| `vllm_mlx/memory_cache.py` | Add `quantize_kv_cache()` / `dequantize_kv_cache()` helpers; support `QuantizedKVCache` in memory estimation and cache trimming |

Usage

# 8-bit quantization (~50% memory savings, minimal quality impact)
vllm-mlx serve model --kv-cache-bits 8

# 4-bit quantization (~75% memory savings)
vllm-mlx serve model --kv-cache-bits 4

# Custom group size
vllm-mlx serve model --kv-cache-bits 8 --kv-cache-group-size 32

Implementation details

  • Uses mlx-lm's built-in QuantizedKVCache and KVCache.to_quantized() for quantization
  • Uses mx.dequantize() for converting back to full precision on cache fetch
  • Memory estimation (estimate_kv_cache_memory) updated to handle quantized 3-tuple format (data, scales, biases)
  • Cache trimming (_trim_cache_offset) updated to handle QuantizedKVCache objects
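
As a rough sketch of what the group-wise quantization step does to produce the (data, scales, biases) 3-tuple, here is a NumPy stand-in for mlx's `mx.quantize` / `mx.dequantize` kernels. The function names and the exact affine scheme are assumptions for illustration, not the PR's code:

```python
import numpy as np

def quantize_kv(x, bits=8, group_size=64):
    """Group-wise affine quantization along the last axis.

    Returns a (data, scales, biases) 3-tuple analogous in shape to
    what the quantized cache stores; a sketch, not mlx's kernel.
    """
    g = x.reshape(-1, group_size).astype(np.float32)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scales = (hi - lo) / (2**bits - 1)
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round((g - lo) / scales), 0, 2**bits - 1).astype(np.uint8)
    return q.reshape(x.shape), scales, lo  # (data, scales, biases)

def dequantize_kv(q, scales, biases, group_size=64):
    """Invert quantize_kv back to float32 (lossy roundtrip)."""
    g = q.reshape(-1, group_size).astype(np.float32)
    return (g * scales + biases).reshape(q.shape)

# Roundtrip check: 8-bit error stays small relative to the value range.
rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 256)).astype(np.float32)
q, s, b = quantize_kv(kv, bits=8)
kv_hat = dequantize_kv(q, s, b)
print(float(np.abs(kv - kv_hat).max()))
```

The roundtrip error is bounded by half a quantization step per group, which is why 8-bit caching has minimal quality impact while halving storage.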

Test plan

  • Verify --kv-cache-bits 8 starts without errors
  • Verify --kv-cache-bits 4 starts without errors
  • Confirm prefix cache hit/miss behavior unchanged
  • Compare inference quality with and without quantization
  • Verify memory savings via /health endpoint memory stats

🤖 Generated with Claude Code

Add optional KV cache quantization for prefix cache storage to reduce
memory usage. Supports 4-bit (~75% savings) and 8-bit (~50% savings)
quantization using mlx-lm's QuantizedKVCache.

New CLI flags:
  --kv-cache-bits {4,8}       Enable KV cache quantization
  --kv-cache-group-size N     Group size for quantization (default: 64)

Quantization is applied transparently when storing to prefix cache and
dequantized on fetch, so BatchGenerator always operates on full-precision
KV states. This allows significantly more cache entries to fit in memory
without changing inference behavior.

Changes:
- cli.py: Add --kv-cache-bits and --kv-cache-group-size arguments
- scheduler.py: Quantize on store, dequantize on fetch in all cache paths
- memory_cache.py: Add quantize/dequantize helpers, support QuantizedKVCache
  in memory estimation and cache trimming

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waybarrios
Owner

@janhilgard I think this PR is a duplicate of #62.

Take a look at that one.

@janhilgard
Collaborator Author

Closing as duplicate of #62 which implements the same feature with unit tests.
