
Better caching in the server #911

Merged
angeloskath merged 6 commits into main from thinking-cache on Mar 6, 2026

Conversation

@angeloskath
Member

@angeloskath angeloskath commented Feb 19, 2026

This aims to solve the problem of non-trimmable caches resulting in no KV cache reuse, which is very common with linear attention and sliding-window attention models.

A standard agentic interaction typically looks like the following:

  1. System and user prompt
  2. Thinking and generation
  3. Tool
  4. Repeat 2-3 until done
  5. Repeat 1 (system prompt reminder)

For steps 2 and 3 we want to keep the thinking in the cache; for step 1 we don't, because the thinking will typically be removed from the history.
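To make this concrete, here is a toy sketch (made-up token sequences and a hypothetical `common_prefix_len` helper, not the actual server code) of how much cache each kind of follow-up request can reuse:

```python
def common_prefix_len(cached, prompt):
    """Length of the shared token prefix between a cached sequence and a
    new prompt, i.e. how much of the KV cache is in principle reusable."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

# Steps 2-3: the follow-up extends the cached history (thinking kept),
# so the entire cached sequence is a prefix and is fully reusable.
cached = [1, 2, 3, 4, 5]            # system + user + thinking + tool tokens
followup = [1, 2, 3, 4, 5, 6, 7]    # same history plus the tool result
assert common_prefix_len(cached, followup) == 5

# Step 5: the thinking was stripped from the history, so the match stops
# at the system+user prefix. A non-trimmable cache cannot rewind to that
# point and was previously unusable here.
fresh = [1, 2, 3, 9, 10]
assert common_prefix_len(cached, fresh) == 3
```

The step-5 case is exactly where a checkpoint saved at the end of the user message pays off: the shorter prefix is already cached on its own.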

This PR makes the following changes (they only apply to the batched generation for now):

  • Requests that end in a user message are treated specially and saved as a checkpoint
  • The cache does not extract on fetch but removes prefixes of trimmable caches on insert
  • Checkpoint caches have higher priority than normal caches, so we can run many of steps 2-3 (above) with cache hits and without evicting the cache from step 1.
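The two-tier eviction in the last bullet can be sketched as follows. This is a minimal illustration with a hypothetical `CheckpointLRU` class, not the mlx-lm implementation:

```python
from collections import OrderedDict

class CheckpointLRU:
    """Toy LRU with two priority tiers: entries marked as checkpoints
    are evicted only after all ordinary entries are gone."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.normal = OrderedDict()       # ordinary caches, LRU order
        self.checkpoints = OrderedDict()  # checkpoint caches, LRU order

    def put(self, key, value, checkpoint=False):
        tier = self.checkpoints if checkpoint else self.normal
        tier[key] = value
        tier.move_to_end(key)
        while len(self.normal) + len(self.checkpoints) > self.capacity:
            # Evict ordinary entries first; touch checkpoints only when
            # no ordinary entry is left.
            victim = self.normal if self.normal else self.checkpoints
            victim.popitem(last=False)

    def get(self, key):
        for tier in (self.normal, self.checkpoints):
            if key in tier:
                tier.move_to_end(key)
                return tier[key]
        return None

cache = CheckpointLRU(capacity=2)
cache.put("step1", "kv-after-user-msg", checkpoint=True)
cache.put("step2a", "kv-with-thinking")
cache.put("step2b", "kv-with-thinking-2")  # evicts step2a, not the checkpoint
assert cache.get("step1") == "kv-after-user-msg"
assert cache.get("step2a") is None
```

With this ordering, the step-1 checkpoint survives however many tool-call iterations churn through the normal tier.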

It isn't perfect and may require a bigger refactor to improve, but for now we can utilize the cache for non-trimmable KVs as well.

I will add some benchmarks and leave it for comments/reviews before merging.

@angeloskath angeloskath force-pushed the thinking-cache branch 6 times, most recently from d6c1148 to d2f6b0f Compare February 28, 2026 03:01
@angeloskath angeloskath marked this pull request as ready for review March 4, 2026 01:42
@angeloskath angeloskath requested a review from nastya236 March 4, 2026 10:05
@nastya236
Collaborator

Thank you! Looking into it now!

@angeloskath angeloskath merged commit 2105aaf into main Mar 6, 2026
2 checks passed
@angeloskath angeloskath deleted the thinking-cache branch March 6, 2026 21:42
omprashantjain added a commit to omprashantjain/vllm-mlx that referenced this pull request Mar 12, 2026
mlx-lm PR ml-explore/mlx-lm#911 ("Better caching in the server")
added a `prompt_checkpoints` field to the prompt tuples returned by
BatchGenerator.unprocessed_prompts. This causes the zip(*batch_prompts)
unpacking in _chunked_next() to fail with a ValueError when using
--continuous-batching, since the code expects exactly 6 elements but
now receives 7.

Add a *_extra catch-all to the tuple unpacking so it gracefully handles
any additional fields from upstream mlx-lm changes. This is backward-
compatible — older mlx-lm versions that return 6 elements still work
fine (the *_extra will simply be empty).

Fixes waybarrios#155
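The catch-all fix described above can be sketched as follows (field names are illustrative assumptions, not mlx-lm's actual tuple layout):

```python
# zip(*batch_prompts) transposes a list of per-prompt tuples into columns;
# hardcoding six names fails with a ValueError once a seventh field such
# as prompt_checkpoints appears.

def unpack(batch_prompts):
    uids, tokens, maxtoks, temps, top_ps, top_ks, *_extra = zip(*batch_prompts)
    # _extra is [] for 6-element tuples and holds the surplus columns
    # otherwise, so both old and new layouts unpack cleanly.
    return uids, tokens, _extra

old = [("u0", [1, 2], 256, 0.7, 0.9, 50)]       # pre-0.31.0 shape
new = [("u0", [1, 2], 256, 0.7, 0.9, 50, [])]   # 7th field appended

assert unpack(old)[2] == []       # older versions: catch-all stays empty
assert len(unpack(new)[2]) == 1   # extra column captured and ignored
```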
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 16, 2026
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as
a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked
prefill code in _install_chunked_prefill hardcoded a 6-element zip
unpacking which crashed with ValueError.

Fix: use *_extra catch-all in the zip unpacking so extra fields from
mlx-lm are captured but ignored. This is forward-compatible with any
future tuple additions.

The "small prompt" path already works because it passes tuples directly
to mlx-lm's own _process_prompts which handles 7 elements. Only the
chunked prefill path (total_tokens > budget) did its own unpacking.

Closes waybarrios#155
lyonsno added a commit to lyonsno/mlx-lm that referenced this pull request Mar 28, 2026
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:

- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019).
  Dev's rewind preflight/fail-closed logic integrated into the new
  PromptTrie-based LRUPromptCache.fetch_nearest_cache.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden can_trim_prompt_cache against missing is_trimmable.
- Adapt behavioral tests from dev's refcounting model to upstream's
  deepcopy-on-fetch model. Safety contracts preserved: fail-closed on
  partial rewind, preflight deepcopy avoidance, exact entry preservation.

All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
