
Better caching in the server #911

Merged
angeloskath merged 6 commits into main from thinking-cache on Mar 6, 2026

Conversation

@angeloskath
Member

@angeloskath angeloskath commented Feb 19, 2026

This aims to solve the problem of non-trimmable caches resulting in no KV cache reuse, which is very common with linear attention and sliding-window attention models.

A standard agentic interaction typically looks like the following:

  1. System and user prompt
  2. Thinking and generation
  3. Tool
  4. Repeat 2-3 until done
  5. Repeat 1 (system prompt reminder)

For steps 2 and 3 we want to keep the thinking in the cache; for step 1 we don't, because the thinking will typically be removed from the history.
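To make this concrete, here is a toy sketch (made-up token sequences and a hypothetical `common_prefix_len` helper, not the actual server code) of how much cache each kind of follow-up request can reuse:

```python
def common_prefix_len(cached, prompt):
    """Length of the shared token prefix between a cached sequence and a
    new prompt, i.e. how much of the KV cache is in principle reusable."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

# Steps 2-3: the follow-up extends the cached history (thinking kept),
# so the entire cached sequence is a prefix and is fully reusable.
cached = [1, 2, 3, 4, 5]            # system + user + thinking + tool tokens
followup = [1, 2, 3, 4, 5, 6, 7]    # same history plus the tool result
assert common_prefix_len(cached, followup) == 5

# Step 5: the thinking was stripped from the history, so the match stops
# at the system+user prefix. A non-trimmable cache cannot rewind to that
# point and was previously unusable here.
fresh = [1, 2, 3, 9, 10]
assert common_prefix_len(cached, fresh) == 3
```

The step-5 case is exactly where a checkpoint saved at the end of the user message pays off: the shorter prefix is already cached on its own.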

This PR makes the following changes (they only apply to the batched generation for now):

  • Requests that end in a user message are treated specially and saved as a checkpoint
  • The cache does not extract on fetch but removes prefixes of trimmable caches on insert
  • Checkpoint caches have higher priority than normal caches, so we can run many of steps 2-3 (above) with cache hits and without evicting the cache from step 1.
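The two-tier eviction in the last bullet can be sketched as follows. This is a minimal illustration with a hypothetical `CheckpointLRU` class, not the mlx-lm implementation:

```python
from collections import OrderedDict

class CheckpointLRU:
    """Toy LRU with two priority tiers: entries marked as checkpoints
    are evicted only after all ordinary entries are gone."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.normal = OrderedDict()       # ordinary caches, LRU order
        self.checkpoints = OrderedDict()  # checkpoint caches, LRU order

    def put(self, key, value, checkpoint=False):
        tier = self.checkpoints if checkpoint else self.normal
        tier[key] = value
        tier.move_to_end(key)
        while len(self.normal) + len(self.checkpoints) > self.capacity:
            # Evict ordinary entries first; touch checkpoints only when
            # no ordinary entry is left.
            victim = self.normal if self.normal else self.checkpoints
            victim.popitem(last=False)

    def get(self, key):
        for tier in (self.normal, self.checkpoints):
            if key in tier:
                tier.move_to_end(key)
                return tier[key]
        return None

cache = CheckpointLRU(capacity=2)
cache.put("step1", "kv-after-user-msg", checkpoint=True)
cache.put("step2a", "kv-with-thinking")
cache.put("step2b", "kv-with-thinking-2")  # evicts step2a, not the checkpoint
assert cache.get("step1") == "kv-after-user-msg"
assert cache.get("step2a") is None
```

With this ordering, the step-1 checkpoint survives however many tool-call iterations churn through the normal tier.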

It isn't perfect and may require a bigger refactor to improve, but for now we can utilize the cache for non-trimmable KVs as well.

I will add some benchmarks and leave it for comments/reviews before merging.

@angeloskath angeloskath force-pushed the thinking-cache branch 6 times, most recently from d6c1148 to d2f6b0f Compare February 28, 2026 03:01
@angeloskath angeloskath marked this pull request as ready for review March 4, 2026 01:42
@angeloskath angeloskath requested a review from nastya236 March 4, 2026 10:05
@nastya236
Collaborator

Thank you! Looking into it now!

@angeloskath angeloskath merged commit 2105aaf into main Mar 6, 2026
2 checks passed
@angeloskath angeloskath deleted the thinking-cache branch March 6, 2026 21:42
omprashantjain added a commit to omprashantjain/vllm-mlx that referenced this pull request Mar 12, 2026
mlx-lm PR ml-explore/mlx-lm#911 ("Better caching in the server")
added a `prompt_checkpoints` field to the prompt tuples returned by
BatchGenerator.unprocessed_prompts. This causes the zip(*batch_prompts)
unpacking in _chunked_next() to fail with a ValueError when using
--continuous-batching, since the code expects exactly 6 elements but
now receives 7.

Add a *_extra catch-all to the tuple unpacking so it gracefully handles
any additional fields from upstream mlx-lm changes. This is backward-
compatible — older mlx-lm versions that return 6 elements still work
fine (the *_extra will simply be empty).

Fixes waybarrios#155
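The catch-all fix described above can be sketched as follows (field names are illustrative assumptions, not mlx-lm's actual tuple layout):

```python
# zip(*batch_prompts) transposes a list of per-prompt tuples into columns;
# hardcoding six names fails with a ValueError once a seventh field such
# as prompt_checkpoints appears.

def unpack(batch_prompts):
    uids, tokens, maxtoks, temps, top_ps, top_ks, *_extra = zip(*batch_prompts)
    # _extra is [] for 6-element tuples and holds the surplus columns
    # otherwise, so both old and new layouts unpack cleanly.
    return uids, tokens, _extra

old = [("u0", [1, 2], 256, 0.7, 0.9, 50)]       # pre-0.31.0 shape
new = [("u0", [1, 2], 256, 0.7, 0.9, 50, [])]   # 7th field appended

assert unpack(old)[2] == []       # older versions: catch-all stays empty
assert len(unpack(new)[2]) == 1   # extra column captured and ignored
```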
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 16, 2026
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as
a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked
prefill code in _install_chunked_prefill hardcoded a 6-element zip
unpacking which crashed with ValueError.

Fix: use *_extra catch-all in the zip unpacking so extra fields from
mlx-lm are captured but ignored. This is forward-compatible with any
future tuple additions.

The "small prompt" path already works because it passes tuples directly
to mlx-lm's own _process_prompts which handles 7 elements. Only the
chunked prefill path (total_tokens > budget) did its own unpacking.

Closes waybarrios#155
lyonsno added a commit to lyonsno/mlx-lm that referenced this pull request Mar 28, 2026
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:

- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019).
  Dev's rewind preflight/fail-closed logic integrated into the new
  PromptTrie-based LRUPromptCache.fetch_nearest_cache.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden can_trim_prompt_cache against missing is_trimmable.
- Adapt behavioral tests from dev's refcounting model to upstream's
  deepcopy-on-fetch model. Safety contracts preserved: fail-closed on
  partial rewind, preflight deepcopy avoidance, exact entry preservation.

All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
