feat: Add --kv-cache-bits flag for KV cache quantization #67
Closed
janhilgard wants to merge 1 commit into waybarrios:main
Conversation
Add optional KV cache quantization for prefix cache storage to reduce
memory usage. Supports 4-bit (~75% savings) and 8-bit (~50% savings)
quantization using mlx-lm's QuantizedKVCache.
New CLI flags:
--kv-cache-bits {4,8} Enable KV cache quantization
--kv-cache-group-size N Group size for quantization (default: 64)
Quantization is applied transparently when storing to prefix cache and
dequantized on fetch, so BatchGenerator always operates on full-precision
KV states. This allows significantly more cache entries to fit in memory
without changing inference behavior.
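The store/fetch pattern described above can be illustrated with a toy per-group affine quantizer. This is a hedged sketch only: the class and function names (`PrefixCache`, `quantize_group`, `dequantize_group`) are hypothetical, standing in for mlx-lm's actual `QuantizedKVCache` machinery.

```python
# Illustrative sketch of transparent quantize-on-store / dequantize-on-fetch.
# All names here are hypothetical; the PR uses mlx-lm's QuantizedKVCache.

def quantize_group(values, bits=8):
    """Quantize one group of floats to bits-bit ints plus scale and bias."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo  # mirrors the (data, scales, biases) 3-tuple format

def dequantize_group(q, scale, bias):
    """Recover approximate full-precision values."""
    return [x * scale + bias for x in q]

class PrefixCache:
    """Stores quantized KV groups; callers only ever see full precision."""

    def __init__(self, bits=8, group_size=64):
        self.bits = bits
        self.group_size = group_size
        self._store = {}

    def put(self, key, kv_values):
        # Quantize on store: split into groups, keep compact int data.
        groups = [kv_values[i:i + self.group_size]
                  for i in range(0, len(kv_values), self.group_size)]
        self._store[key] = [quantize_group(g, self.bits) for g in groups]

    def get(self, key):
        # Dequantize on fetch: the caller always sees full-precision floats.
        out = []
        for q, scale, bias in self._store[key]:
            out.extend(dequantize_group(q, scale, bias))
        return out
```

The roundtrip error per value is bounded by half the group's quantization step, which is why the consumer (here, the generator) can keep operating on effectively full-precision KV states.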
Changes:
- cli.py: Add --kv-cache-bits and --kv-cache-group-size arguments
- scheduler.py: Quantize on store, dequantize on fetch in all cache paths
- memory_cache.py: Add quantize/dequantize helpers, support QuantizedKVCache
in memory estimation and cache trimming
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner
@janhilgard I think this PR is a duplicate; take a look at #62.
Collaborator (Author)
Closing as a duplicate of #62, which implements the same feature with unit tests.
Summary
Adds --kv-cache-bits {4,8} and --kv-cache-group-size CLI flags to quantize KV cache entries in the prefix cache.
Motivation
Requested in #17. The prefix cache can consume significant memory, especially with large context windows (128K+). This feature trades a small amount of compute (quantize/dequantize on cache store/fetch) for a major reduction in memory usage, enabling more concurrent users and longer context caching.
Changes
- vllm_mlx/cli.py: Add --kv-cache-bits and --kv-cache-group-size arguments
- vllm_mlx/scheduler.py: Quantize on store, dequantize on fetch in all cache paths
- vllm_mlx/memory_cache.py: Add quantize_kv_cache()/dequantize_kv_cache() helpers, support QuantizedKVCache in memory estimation and cache trimming
Usage
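The original usage snippet did not survive the page capture. As a stand-in, here is a minimal argparse sketch of how the two flags might be declared in cli.py; the parser setup and program name are assumptions, not the PR's actual code.

```python
# Hedged sketch of the two CLI flags; only the flag names, choices, and
# default (from the PR description) are taken from the source.
import argparse

parser = argparse.ArgumentParser(prog="vllm-mlx")
parser.add_argument(
    "--kv-cache-bits", type=int, choices=[4, 8], default=None,
    help="Enable KV cache quantization at 4 or 8 bits (off by default)")
parser.add_argument(
    "--kv-cache-group-size", type=int, default=64,
    help="Group size for KV cache quantization (default: 64)")

args = parser.parse_args(["--kv-cache-bits", "8"])
print(args.kv_cache_bits, args.kv_cache_group_size)  # -> 8 64
```

Leaving `--kv-cache-bits` unset keeps the cache in full precision, so quantization stays strictly opt-in.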
Implementation details
- Uses mlx-lm's QuantizedKVCache and KVCache.to_quantized() for quantization
- Uses mx.dequantize() for converting back to full precision on cache fetch
- Memory estimation (estimate_kv_cache_memory) updated to handle the quantized 3-tuple format (data, scales, biases)
- Cache trimming (_trim_cache_offset) updated to handle QuantizedKVCache objects
Test plan
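A back-of-the-envelope estimate shows where the claimed ~50%/~75% savings come from once the per-group fp16 scales and biases are accounted for. The model dimensions and the helper below are hypothetical; the PR's estimate_kv_cache_memory may count additional overheads.

```python
# Rough KV cache size estimate: quantized data plus per-group fp16
# scale and bias (the (data, scales, biases) 3-tuple). Dimensions are
# illustrative, not tied to any specific model.

def kv_bytes(tokens, layers, kv_heads, head_dim, bits, group_size=64):
    elems = 2 * tokens * layers * kv_heads * head_dim  # K and V tensors
    size = elems * bits / 8
    if bits < 16:
        groups = elems / group_size
        size += groups * 2 * 2  # fp16 scale + fp16 bias per group
    return size

fp16 = kv_bytes(4096, 32, 8, 128, bits=16)
q8 = kv_bytes(4096, 32, 8, 128, bits=8)
q4 = kv_bytes(4096, 32, 8, 128, bits=4)
print(f"8-bit saves {1 - q8 / fp16:.0%}, 4-bit saves {1 - q4 / fp16:.0%}")
# -> 8-bit saves 47%, 4-bit saves 72%
```

The per-group metadata is why the savings land slightly under the nominal 50%/75%; larger group sizes shrink that overhead at the cost of coarser quantization.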
- Server with --kv-cache-bits 8 starts without errors
- Server with --kv-cache-bits 4 starts without errors
- Verified /health endpoint memory stats
🤖 Generated with Claude Code