
feat: min_quantize_tokens threshold + trim oversized KV buffers#73

Merged
waybarrios merged 3 commits into waybarrios:feat/kv-cache-quantization from janhilgard:threshold-and-trim
Feb 13, 2026
Conversation

@janhilgard
Collaborator

Summary

Addresses both requests from PR #62 comment:

  • Threshold-based quantization: Added kv_min_quantize_tokens (default 256) — sequences shorter than this skip quantization since the overhead exceeds memory savings for short entries
  • Trim oversized KV buffers: Added _trim_to_offset() which trims pre-allocated KV arrays to their actual used size before storage. Applied unconditionally (both FP16 and quantized paths) since oversized buffers waste memory regardless of quantization
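The two behaviors above can be sketched as follows. This is a hypothetical simplification using NumPy in place of MLX arrays; the names mirror the PR, but the signatures and the `MIN_QUANTIZE_TOKENS` constant are assumptions, not the actual `vllm_mlx/memory_cache.py` code.

```python
import numpy as np

MIN_QUANTIZE_TOKENS = 256  # mirrors the kv_min_quantize_tokens default

def trim_to_offset(keys: np.ndarray, values: np.ndarray, offset: int):
    """Trim pre-allocated KV buffers of shape (batch, heads, capacity,
    head_dim) down to the `offset` tokens actually written."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    if keys.shape[2] == offset:  # buffer already exact: skip the copy
        return keys, values
    # .copy() materializes the slice so the oversized backing buffer can
    # be freed -- the analogue of calling mx.eval on sliced MLX arrays.
    return keys[:, :, :offset, :].copy(), values[:, :, :offset, :].copy()

def should_quantize(num_tokens: int, min_tokens: int = MIN_QUANTIZE_TOKENS) -> bool:
    # Short entries skip quantization: per-entry overhead exceeds the savings.
    return num_tokens >= min_tokens
```

Trimming applies on both paths, while the threshold only gates the quantization step.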

Changes

  • vllm_mlx/memory_cache.py — new _trim_to_offset() helper, kv_min_quantize_tokens config field, updated store() logic
  • vllm_mlx/scheduler.py — kv_cache_min_quantize_tokens field in SchedulerConfig, threaded into MemoryCacheConfig
  • vllm_mlx/cli.py — --kv-cache-min-quantize-tokens CLI argument for both serve and bench commands
  • tests/test_kv_cache_quantization.py — 7 new tests covering trim and threshold behavior

Test plan

  • python3 -m pytest tests/test_kv_cache_quantization.py -v — all 23 tests pass (16 existing + 7 new)
  • ruff check — no new lint errors introduced
  • Manual test with server: --kv-cache-quantization --kv-cache-min-quantize-tokens 256

🤖 Generated with Claude Code

janhilgard and others added 2 commits February 11, 2026 18:08

feat: add min_quantize_tokens threshold and trim oversized KV buffers

- Add _trim_to_offset() to trim pre-allocated KV arrays to their actual
  used size before storage, saving memory in both FP16 and quantized paths
- Add kv_min_quantize_tokens config (default 256) to skip quantization
  for short sequences where overhead exceeds savings
- Thread the new config through SchedulerConfig and CLI arguments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
style: apply black formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Harden _trim_to_offset and store() in memory cache

Move the duplicate entry check in store() before trim and quantize
so repeated tokens skip the expensive work entirely. Rewrite
_trim_to_offset to validate that offset is positive before slicing,
use KVCache() instead of __new__ to avoid skipping init, call
mx.eval on the sliced arrays so the original large buffer gets freed
and memory accounting stays accurate, and skip the function entirely
when no layer actually needs trimming.

Add validation for kv_min_quantize_tokens in MemoryCacheConfig so
negative values are rejected at init time. Document the field in the
class docstring and add Args and Returns to the _trim_to_offset
docstring.
@waybarrios
Owner

Pushed a commit with fixes for the _trim_to_offset function and the store() method.

The duplicate entry check in store() was happening after trim and quantize, so repeated tokens were doing expensive work for nothing. Moved it before those operations.

Rewrote _trim_to_offset to validate that offset is positive before slicing, which prevents silent corruption if offset ever goes negative. Replaced KVCache.__new__() with KVCache() so future mlx-lm versions that add attributes in __init__ will not break silently. Added mx.eval on the sliced arrays so the original large buffer actually gets freed and memory tracking stays accurate instead of holding a lazy reference to the full allocation. Also added an early exit when no layer needs trimming to avoid unnecessary object creation on every store call.

Added validation for kv_min_quantize_tokens in MemoryCacheConfig so negative values get rejected at init time instead of silently forcing quantization on all sequences. Documented the field in the class docstring and added Args and Returns to the _trim_to_offset docstring.
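The validation and duplicate-check-first ordering can be sketched like this. These classes are illustrative stand-ins, not the real MemoryCacheConfig or cache implementation, which carry many more fields and do real trim/quantize work at the marked point.

```python
from dataclasses import dataclass

@dataclass
class MemoryCacheConfig:
    kv_min_quantize_tokens: int = 256

    def __post_init__(self):
        # Reject bad values at init time rather than silently forcing
        # quantization on every sequence later.
        if self.kv_min_quantize_tokens < 0:
            raise ValueError("kv_min_quantize_tokens must be >= 0")

class PrefixCache:
    def __init__(self, config: MemoryCacheConfig):
        self.config = config
        self._entries: dict = {}

    def store(self, token_key: tuple, cache) -> bool:
        # Duplicate check FIRST: a repeated sequence returns before any
        # trim or quantize work happens.
        if token_key in self._entries:
            return False
        # ... trim_to_offset() and optional quantization would run here ...
        self._entries[token_key] = cache
        return True
```

Returning early on a duplicate key is what lets repeated tokens skip the expensive work entirely.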

@waybarrios
Owner

@janhilgard I’m handing over the quantized KV work to you. It would be great if you could refactor the code, clean up any inconsistencies, and organize everything into a clearer, more maintainable structure.

@waybarrios waybarrios merged commit 08318f2 into waybarrios:feat/kv-cache-quantization Feb 13, 2026
waybarrios added a commit that referenced this pull request Feb 14, 2026
* Add KV cache quantization for prefix cache memory reduction

Adds --kv-cache-quantization flag that uses mlx-lm's QuantizedKVCache
to compress stored prefix cache entries (8-bit or 4-bit), reducing
memory usage ~3.5x with minimal quality loss (~0.005 mean abs error).

Quantization happens on store, dequantization on fetch, so active
inference is unaffected. Includes CLI flags for serve and bench
commands, config wiring through SchedulerConfig, and 16 tests.

Closes #60
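The quantize-on-store / dequantize-on-fetch round trip can be illustrated with a simple per-row affine 8-bit scheme. This is a NumPy stand-in only: mlx-lm's QuantizedKVCache uses grouped quantization with bit-packing, so the numbers here merely show why the quality loss stays small.

```python
import numpy as np

def quantize_8bit(x: np.ndarray):
    """Per-row affine 8-bit quantization (illustrative, not mlx-lm's scheme)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on flat rows
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_8bit(q: np.ndarray, scale: np.ndarray, lo: np.ndarray):
    # Reconstruction error is bounded by half the quantization step per element.
    return q.astype(np.float32) * scale + lo
```

Storing uint8 codes plus per-row scale/offset in place of FP16 values is where the memory reduction comes from.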

* Add bench-kv-cache command to benchmark quantization savings

New CLI command that compares FP16, 8-bit, and 4-bit KV cache
quantization using synthetic data, reporting memory usage, compression
ratio, quality metrics, and quantize/dequantize latency.

Usage: vllm-mlx bench-kv-cache [--layers 32] [--seq-len 512]

* Release FP16 cache reference after quantized store

After storing the quantized cache in the prefix cache, the original
FP16 reference on the request is no longer needed. Setting it to None
allows the memory to be reclaimed sooner, preventing temporary memory
spikes when quantization is enabled on long sequences.

* Fix _trim_cache_offset to handle QuantizedKVCache layers (#69)

When KV cache quantization is enabled, prefix cache entries are stored
as QuantizedKVCache objects. The _trim_cache_offset function (used for
supersequence and LCP matches) was silently skipping these layers
because QuantizedKVCache.keys returns a tuple, failing the
`not isinstance(keys, (list, tuple))` guard.

This caused the offset to remain untrimmed, so dequantized caches
passed to BatchGenerator had their original (large) offset. The
BatchGenerator then concatenated new tokens to the full buffer instead
of the trimmed prefix, producing oversized KV arrays that negated
all memory savings from quantization.

Tested: after fix, cache_mem=33MB (correct) vs 62MB (broken, same as
unquantized baseline) for the same 2-request workload.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
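The guard failure described in that commit can be sketched as follows. This is a hypothetical simplification: the real _trim_cache_offset iterates per-layer caches and does the actual trimming, while this sketch only shows the type check that was silently rejecting quantized layers.

```python
import numpy as np

def can_trim_layer(keys) -> bool:
    """Decide whether a layer's keys can be trimmed.

    A plain KVCache layer stores keys as a single array, while a
    QuantizedKVCache layer stores them as a (packed, scales, biases)
    tuple. The broken guard,
        keys is not None and not isinstance(keys, (list, tuple)),
    returned False for tuples, so quantized layers were skipped and
    their offsets stayed untrimmed.
    """
    if keys is None:
        return False
    if isinstance(keys, tuple):
        # Quantized layer: trimmable when all components are present.
        return all(part is not None for part in keys)
    return not isinstance(keys, list)
```

Accepting the tuple case is what lets supersequence and LCP matches trim quantized entries, keeping the BatchGenerator from concatenating onto the full oversized buffer.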

* feat: min_quantize_tokens threshold + trim oversized KV buffers (#73)

* feat: add min_quantize_tokens threshold and trim oversized KV buffers

- Add _trim_to_offset() to trim pre-allocated KV arrays to their actual
  used size before storage, saving memory in both FP16 and quantized paths
- Add kv_min_quantize_tokens config (default 256) to skip quantization
  for short sequences where overhead exceeds savings
- Thread the new config through SchedulerConfig and CLI arguments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply black formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Harden _trim_to_offset and store() in memory cache

Move the duplicate entry check in store() before trim and quantize
so repeated tokens skip the expensive work entirely. Rewrite
_trim_to_offset to validate that offset is positive before slicing,
use KVCache() instead of __new__ to avoid skipping init, call
mx.eval on the sliced arrays so the original large buffer gets freed
and memory accounting stays accurate, and skip the function entirely
when no layer actually needs trimming.

Add validation for kv_min_quantize_tokens in MemoryCacheConfig so
negative values are rejected at init time. Document the field in the
class docstring and add Args and Returns to the _trim_to_offset
docstring.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>

* Defer MLX import in _trim_to_offset to fix non-Apple CI

* Fix mock path in TestMemoryStats to match mx.get_active_memory API

---------

Co-authored-by: Jan Hilgard <89418784+janhilgard@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
