feat: min_quantize_tokens threshold + trim oversized KV buffers (#73)
- Add `_trim_to_offset()` to trim pre-allocated KV arrays to their actual used size before storage, saving memory in both FP16 and quantized paths
- Add `kv_min_quantize_tokens` config (default 256) to skip quantization for short sequences where overhead exceeds savings
- Thread the new config through `SchedulerConfig` and CLI arguments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move the duplicate entry check in `store()` before trim and quantize so repeated tokens skip the expensive work entirely.
- Rewrite `_trim_to_offset` to validate that `offset` is positive before slicing, use `KVCache()` instead of `__new__` to avoid skipping init, call `mx.eval` on the sliced arrays so the original large buffer gets freed and memory accounting stays accurate, and skip the function entirely when no layer actually needs trimming.
- Add validation for `kv_min_quantize_tokens` in `MemoryCacheConfig` so negative values are rejected at init time.
- Document the field in the class docstring and add Args and Returns to the `_trim_to_offset` docstring.
Pushed a commit with fixes for the `_trim_to_offset` function and the `store()` method.

The duplicate entry check in `store()` was happening after trim and quantize, so repeated tokens were doing expensive work for nothing. Moved it before those operations.

Rewrote `_trim_to_offset` to validate that `offset` is positive before slicing, which prevents silent corruption if `offset` ever goes negative. Replaced `KVCache.__new__()` with `KVCache()` so future mlx-lm versions that add attributes in `__init__` will not break silently. Added `mx.eval` on the sliced arrays so the original large buffer actually gets freed and memory tracking stays accurate instead of holding a lazy reference to the full allocation. Also added an early exit when no layer needs trimming, to avoid unnecessary object creation on every `store()` call.

Added validation for `kv_min_quantize_tokens` in `MemoryCacheConfig` so negative values get rejected at init time instead of silently forcing quantization on all sequences. Documented the field in the class docstring and added Args and Returns to the `_trim_to_offset` docstring.
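A minimal sketch of the trimming logic described above. Plain Python lists stand in for MLX arrays and the `KVCache` class here is an illustrative stub, not mlx-lm's; the real helper slices along the sequence axis and calls `mx.eval` so the oversized buffer is actually released.

```python
class KVCache:
    """Stub standing in for mlx-lm's KVCache (illustrative only)."""

    def __init__(self):
        self.keys = None
        self.values = None
        self.offset = 0  # number of positions actually used


def trim_to_offset(layers):
    """Return per-layer caches trimmed to their used length.

    Lists stand in for pre-allocated MLX arrays; `offset` marks how
    much of each buffer is actually populated.
    """
    # Early exit: avoid building new objects when nothing is oversized.
    if not any(
        layer.keys is not None and len(layer.keys) > layer.offset
        for layer in layers
    ):
        return layers

    trimmed = []
    for layer in layers:
        # Validate the offset before slicing; a non-positive offset
        # would silently produce an empty or corrupted cache entry.
        if layer.keys is None or layer.offset <= 0:
            trimmed.append(layer)
            continue
        new = KVCache()  # constructor, not __new__, so __init__ runs
        new.keys = layer.keys[: layer.offset]
        new.values = layer.values[: layer.offset]
        new.offset = layer.offset
        trimmed.append(new)
    return trimmed
```

The early-exit path returns the input list unchanged, so the common case (buffers already tight) allocates nothing.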
@janhilgard I’m handing over the quantized KV work to you. It would be great if you could refactor the code, clean up any inconsistencies, and organize everything into a clearer, more maintainable structure.
Squashed commits:

* Add KV cache quantization for prefix cache memory reduction

  Adds `--kv-cache-quantization` flag that uses mlx-lm's `QuantizedKVCache` to compress stored prefix cache entries (8-bit or 4-bit), reducing memory usage ~3.5x with minimal quality loss (~0.005 mean abs error). Quantization happens on store, dequantization on fetch, so active inference is unaffected. Includes CLI flags for `serve` and `bench` commands, config wiring through `SchedulerConfig`, and 16 tests. Closes #60

* Add bench-kv-cache command to benchmark quantization savings

  New CLI command that compares FP16, 8-bit, and 4-bit KV cache quantization using synthetic data, reporting memory usage, compression ratio, quality metrics, and quantize/dequantize latency. Usage: `vllm-mlx bench-kv-cache [--layers 32] [--seq-len 512]`

* Release FP16 cache reference after quantized store

  After storing the quantized cache in the prefix cache, the original FP16 reference on the request is no longer needed. Setting it to `None` allows the memory to be reclaimed sooner, preventing temporary memory spikes when quantization is enabled on long sequences.

* Fix _trim_cache_offset to handle QuantizedKVCache layers (#69)

  When KV cache quantization is enabled, prefix cache entries are stored as `QuantizedKVCache` objects. The `_trim_cache_offset` function (used for supersequence and LCP matches) was silently skipping these layers because `QuantizedKVCache.keys` returns a tuple, failing the `not isinstance(keys, (list, tuple))` guard. This caused the offset to remain untrimmed, so dequantized caches passed to `BatchGenerator` had their original (large) offset. The `BatchGenerator` then concatenated new tokens to the full buffer instead of the trimmed prefix, producing oversized KV arrays that negated all memory savings from quantization.

  Tested: after the fix, cache_mem=33MB (correct) vs 62MB (broken, same as the unquantized baseline) for the same 2-request workload.

* feat: min_quantize_tokens threshold + trim oversized KV buffers (#73) — includes the threshold and trim commits described above, a `black` formatting pass, and the `_trim_to_offset`/`store()` hardening commit

* Defer MLX import in _trim_to_offset to fix non-Apple CI

* Fix mock path in TestMemoryStats to match mx.get_active_memory API

Co-authored-by: Jan Hilgard <89418784+janhilgard@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
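The #69 pitfall above can be illustrated with stubs: mlx-lm's `QuantizedKVCache` keeps its keys as a tuple of packed data, scales, and biases, while a plain `KVCache` holds a single array, so a guard written for one layout silently skips the other. The classes and the trimming function below are simplified stand-ins, not the project's literal code.

```python
class KVCache:
    """Stub for mlx-lm's KVCache: keys is a single array-like object."""

    def __init__(self, keys=None, offset=0):
        self.keys = keys
        self.offset = offset


class QuantizedKVCache:
    """Stub for mlx-lm's QuantizedKVCache: keys is a
    (packed, scales, biases) tuple."""

    def __init__(self, keys=None, offset=0):
        self.keys = keys
        self.offset = offset


def trim_cache_offset(layers, new_offset):
    """Clamp every populated layer's offset, quantized or not.

    The broken version checked `not isinstance(keys, (list, tuple))`
    to detect unpopulated layers, which misclassified the quantized
    tuple and left those offsets untrimmed. Dispatching on the cache
    object's type instead covers both layouts.
    """
    for layer in layers:
        if layer.keys is None:
            continue  # layer never populated; nothing to clamp
        if isinstance(layer, (KVCache, QuantizedKVCache)):
            layer.offset = min(layer.offset, new_offset)
    return layers
```

With the fix, a dequantized cache handed to the batch generator carries the trimmed prefix offset, so new tokens are appended to the prefix rather than to the full pre-allocated buffer.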
## Summary
Addresses both requests from the PR #62 comment:
- `kv_min_quantize_tokens` (default 256) — sequences shorter than this skip quantization, since the overhead exceeds the memory savings for short entries
- `_trim_to_offset()` — trims pre-allocated KV arrays to their actual used size before storage. Applied unconditionally (both FP16 and quantized paths), since oversized buffers waste memory regardless of quantization

## Changes
- `vllm_mlx/memory_cache.py` — new `_trim_to_offset()` helper, `kv_min_quantize_tokens` config field, updated `store()` logic
- `vllm_mlx/scheduler.py` — `kv_cache_min_quantize_tokens` field in `SchedulerConfig`, threaded into `MemoryCacheConfig`
- `vllm_mlx/cli.py` — `--kv-cache-min-quantize-tokens` CLI argument for both `serve` and `bench` commands
- `tests/test_kv_cache_quantization.py` — 7 new tests covering trim and threshold behavior

## Test plan
- `python3 -m pytest tests/test_kv_cache_quantization.py -v` — all 23 tests pass (16 existing + 7 new)
- `ruff check` — no new lint errors introduced
- `--kv-cache-quantization --kv-cache-min-quantize-tokens 256`

🤖 Generated with Claude Code
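As a hedged illustration of the threshold behavior summarized above: the function name and the exact decision point inside `store()` are assumptions for this sketch, but the rule matches the description — quantize only when enabled and when the entry is at least `kv_min_quantize_tokens` long, and reject negative thresholds at config time.

```python
def should_quantize(num_tokens: int,
                    quantization_enabled: bool,
                    min_quantize_tokens: int = 256) -> bool:
    """Decide whether a prefix-cache entry is worth quantizing.

    Below the threshold, the fixed per-entry overhead of a quantized
    cache (scales and biases plus quantize/dequantize latency)
    outweighs the ~3.5x savings on the packed keys/values.
    """
    # Mirror the MemoryCacheConfig validation: negative thresholds are
    # rejected instead of silently forcing quantization everywhere.
    if min_quantize_tokens < 0:
        raise ValueError("kv_min_quantize_tokens must be non-negative")
    return quantization_enabled and num_tokens >= min_quantize_tokens
```

A threshold of 0 restores the old behavior (quantize every stored entry when the flag is on).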