
feat: min_quantize_tokens threshold + trim oversized KV buffers#73

Merged
waybarrios merged 3 commits into waybarrios:feat/kv-cache-quantization from janhilgard:threshold-and-trim
Feb 13, 2026
Conversation

@janhilgard
Collaborator

Summary

Addresses both requests from PR #62 comment:

  • Threshold-based quantization: Added kv_min_quantize_tokens (default 256) — sequences shorter than this skip quantization since the overhead exceeds memory savings for short entries
  • Trim oversized KV buffers: Added _trim_to_offset() which trims pre-allocated KV arrays to their actual used size before storage. Applied unconditionally (both FP16 and quantized paths) since oversized buffers waste memory regardless of quantization
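The two behaviors above can be sketched as follows. This is a hypothetical simplification using NumPy in place of MLX arrays; the names mirror the PR, but the signatures and the `MIN_QUANTIZE_TOKENS` constant are assumptions, not the actual `vllm_mlx/memory_cache.py` code.

```python
import numpy as np

MIN_QUANTIZE_TOKENS = 256  # mirrors the kv_min_quantize_tokens default

def trim_to_offset(keys: np.ndarray, values: np.ndarray, offset: int):
    """Trim pre-allocated KV buffers of shape (batch, heads, capacity,
    head_dim) down to the `offset` tokens actually written."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    if keys.shape[2] == offset:  # buffer already exact: skip the copy
        return keys, values
    # .copy() materializes the slice so the oversized backing buffer can
    # be freed -- the analogue of calling mx.eval on sliced MLX arrays.
    return keys[:, :, :offset, :].copy(), values[:, :, :offset, :].copy()

def should_quantize(num_tokens: int, min_tokens: int = MIN_QUANTIZE_TOKENS) -> bool:
    # Short entries skip quantization: per-entry overhead exceeds the savings.
    return num_tokens >= min_tokens
```

Trimming applies on both paths, while the threshold only gates the quantization step.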

Changes

  • vllm_mlx/memory_cache.py — new _trim_to_offset() helper, kv_min_quantize_tokens config field, updated store() logic
  • vllm_mlx/scheduler.py — kv_cache_min_quantize_tokens field in SchedulerConfig, threaded into MemoryCacheConfig
  • vllm_mlx/cli.py — --kv-cache-min-quantize-tokens CLI argument for both serve and bench commands
  • tests/test_kv_cache_quantization.py — 7 new tests covering trim and threshold behavior

Test plan

  • python3 -m pytest tests/test_kv_cache_quantization.py -v — all 23 tests pass (16 existing + 7 new)
  • ruff check — no new lint errors introduced
  • Manual test with server: --kv-cache-quantization --kv-cache-min-quantize-tokens 256

🤖 Generated with Claude Code

janhilgard and others added 2 commits February 11, 2026 18:08

feat: add min_quantize_tokens threshold and trim oversized KV buffers

- Add _trim_to_offset() to trim pre-allocated KV arrays to their actual
  used size before storage, saving memory in both FP16 and quantized paths
- Add kv_min_quantize_tokens config (default 256) to skip quantization
  for short sequences where overhead exceeds savings
- Thread the new config through SchedulerConfig and CLI arguments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
style: apply black formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Harden _trim_to_offset and store() in memory cache

Move the duplicate entry check in store() before trim and quantize
so repeated tokens skip the expensive work entirely. Rewrite
_trim_to_offset to validate that offset is positive before slicing,
use KVCache() instead of __new__ to avoid skipping init, call
mx.eval on the sliced arrays so the original large buffer gets freed
and memory accounting stays accurate, and skip the function entirely
when no layer actually needs trimming.

Add validation for kv_min_quantize_tokens in MemoryCacheConfig so
negative values are rejected at init time. Document the field in the
class docstring and add Args and Returns to the _trim_to_offset
docstring.
@waybarrios
Owner

Pushed a commit with fixes for the _trim_to_offset function and the store() method.

The duplicate entry check in store() was happening after trim and quantize, so repeated tokens were doing expensive work for nothing. Moved it before those operations.

Rewrote _trim_to_offset to validate that offset is positive before slicing, which prevents silent corruption if offset ever goes negative. Replaced KVCache.__new__() with KVCache() so future mlx-lm versions that add attributes in __init__ will not break silently. Added mx.eval on the sliced arrays so the original large buffer actually gets freed and memory tracking stays accurate instead of holding a lazy reference to the full allocation. Also added an early exit when no layer needs trimming to avoid unnecessary object creation on every store call.

Added validation for kv_min_quantize_tokens in MemoryCacheConfig so negative values get rejected at init time instead of silently forcing quantization on all sequences. Documented the field in the class docstring and added Args and Returns to the _trim_to_offset docstring.
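The validation and duplicate-check-first ordering can be sketched like this. These classes are illustrative stand-ins, not the real MemoryCacheConfig or cache implementation, which carry many more fields and do real trim/quantize work at the marked point.

```python
from dataclasses import dataclass

@dataclass
class MemoryCacheConfig:
    kv_min_quantize_tokens: int = 256

    def __post_init__(self):
        # Reject bad values at init time rather than silently forcing
        # quantization on every sequence later.
        if self.kv_min_quantize_tokens < 0:
            raise ValueError("kv_min_quantize_tokens must be >= 0")

class PrefixCache:
    def __init__(self, config: MemoryCacheConfig):
        self.config = config
        self._entries: dict = {}

    def store(self, token_key: tuple, cache) -> bool:
        # Duplicate check FIRST: a repeated sequence returns before any
        # trim or quantize work happens.
        if token_key in self._entries:
            return False
        # ... trim_to_offset() and optional quantization would run here ...
        self._entries[token_key] = cache
        return True
```

Returning early on a duplicate key is what lets repeated tokens skip the expensive work entirely.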

@waybarrios
Owner

@janhilgard I’m handing over the quantized KV work to you. It would be great if you could refactor the code, clean up any inconsistencies, and organize everything into a clearer, more maintainable structure.

@waybarrios waybarrios merged commit 08318f2 into waybarrios:feat/kv-cache-quantization Feb 13, 2026
waybarrios added a commit that referenced this pull request Feb 14, 2026
* Add KV cache quantization for prefix cache memory reduction

Adds --kv-cache-quantization flag that uses mlx-lm's QuantizedKVCache
to compress stored prefix cache entries (8-bit or 4-bit), reducing
memory usage ~3.5x with minimal quality loss (~0.005 mean abs error).

Quantization happens on store, dequantization on fetch, so active
inference is unaffected. Includes CLI flags for serve and bench
commands, config wiring through SchedulerConfig, and 16 tests.

Closes #60
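The quantize-on-store / dequantize-on-fetch round trip can be illustrated with a simple per-row affine 8-bit scheme. This is a NumPy stand-in only: mlx-lm's QuantizedKVCache uses grouped quantization with bit-packing, so the numbers here merely show why the quality loss stays small.

```python
import numpy as np

def quantize_8bit(x: np.ndarray):
    """Per-row affine 8-bit quantization (illustrative, not mlx-lm's scheme)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on flat rows
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_8bit(q: np.ndarray, scale: np.ndarray, lo: np.ndarray):
    # Reconstruction error is bounded by half the quantization step per element.
    return q.astype(np.float32) * scale + lo
```

Storing uint8 codes plus per-row scale/offset in place of FP16 values is where the memory reduction comes from.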

* Add bench-kv-cache command to benchmark quantization savings

New CLI command that compares FP16, 8-bit, and 4-bit KV cache
quantization using synthetic data, reporting memory usage, compression
ratio, quality metrics, and quantize/dequantize latency.

Usage: vllm-mlx bench-kv-cache [--layers 32] [--seq-len 512]

* Release FP16 cache reference after quantized store

After storing the quantized cache in the prefix cache, the original
FP16 reference on the request is no longer needed. Setting it to None
allows the memory to be reclaimed sooner, preventing temporary memory
spikes when quantization is enabled on long sequences.

* Fix _trim_cache_offset to handle QuantizedKVCache layers (#69)

When KV cache quantization is enabled, prefix cache entries are stored
as QuantizedKVCache objects. The _trim_cache_offset function (used for
supersequence and LCP matches) was silently skipping these layers
because QuantizedKVCache.keys returns a tuple, failing the
`not isinstance(keys, (list, tuple))` guard.

This caused the offset to remain untrimmed, so dequantized caches
passed to BatchGenerator had their original (large) offset. The
BatchGenerator then concatenated new tokens to the full buffer instead
of the trimmed prefix, producing oversized KV arrays that negated
all memory savings from quantization.

Tested: after fix, cache_mem=33MB (correct) vs 62MB (broken, same as
unquantized baseline) for the same 2-request workload.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
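The guard failure described in that commit can be sketched as follows. This is a hypothetical simplification: the real _trim_cache_offset iterates per-layer caches and does the actual trimming, while this sketch only shows the type check that was silently rejecting quantized layers.

```python
import numpy as np

def can_trim_layer(keys) -> bool:
    """Decide whether a layer's keys can be trimmed.

    A plain KVCache layer stores keys as a single array, while a
    QuantizedKVCache layer stores them as a (packed, scales, biases)
    tuple. The broken guard,
        keys is not None and not isinstance(keys, (list, tuple)),
    returned False for tuples, so quantized layers were skipped and
    their offsets stayed untrimmed.
    """
    if keys is None:
        return False
    if isinstance(keys, tuple):
        # Quantized layer: trimmable when all components are present.
        return all(part is not None for part in keys)
    return not isinstance(keys, list)
```

Accepting the tuple case is what lets supersequence and LCP matches trim quantized entries, keeping the BatchGenerator from concatenating onto the full oversized buffer.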

* feat: min_quantize_tokens threshold + trim oversized KV buffers (#73)

* feat: add min_quantize_tokens threshold and trim oversized KV buffers

- Add _trim_to_offset() to trim pre-allocated KV arrays to their actual
  used size before storage, saving memory in both FP16 and quantized paths
- Add kv_min_quantize_tokens config (default 256) to skip quantization
  for short sequences where overhead exceeds savings
- Thread the new config through SchedulerConfig and CLI arguments

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply black formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Harden _trim_to_offset and store() in memory cache

Move the duplicate entry check in store() before trim and quantize
so repeated tokens skip the expensive work entirely. Rewrite
_trim_to_offset to validate that offset is positive before slicing,
use KVCache() instead of __new__ to avoid skipping init, call
mx.eval on the sliced arrays so the original large buffer gets freed
and memory accounting stays accurate, and skip the function entirely
when no layer actually needs trimming.

Add validation for kv_min_quantize_tokens in MemoryCacheConfig so
negative values are rejected at init time. Document the field in the
class docstring and add Args and Returns to the _trim_to_offset
docstring.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>

* Defer MLX import in _trim_to_offset to fix non-Apple CI

* Fix mock path in TestMemoryStats to match mx.get_active_memory API

---------

Co-authored-by: Jan Hilgard <89418784+janhilgard@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
