
Refactor LRUPromptCache #1019

Merged
angeloskath merged 3 commits into main from lru-cache-refactor
Mar 26, 2026

Conversation

@angeloskath
Member

Refactor the LRUPromptCache into models.cache. Closes #1013.

In that spirit, this PR also:

  • Refactors the trie out of the LRU cache
  • Prepares the LRUPromptCache for a system prompt cache. Essentially, the LRUPromptCache is now a multi-tier LRU cache where each tier is more persistent than the one before it.
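The multi-tier idea can be sketched with a toy model (hypothetical names — `TieredLRU`, `insert`, and `_evict_one` are illustrative, not the actual mlx-lm API; the tier names mirror the PR's default ordering, where "assistant" entries are the least persistent):

```python
from collections import OrderedDict

class TieredLRU:
    """Toy multi-tier LRU: earlier tiers are evicted before later ones.

    A sketch of the concept only; the real LRUPromptCache in
    models/cache.py tracks bytes, not entry counts.
    """

    def __init__(self, tiers=("assistant", "user", "system"), max_entries=4):
        # dict preserves insertion order, so eviction walks tiers
        # from least persistent ("assistant") to most ("system")
        self.tiers = {t: OrderedDict() for t in tiers}
        self.max_entries = max_entries

    def insert(self, tier, key, value):
        self.tiers[tier][key] = value
        self.tiers[tier].move_to_end(key)  # mark as most recently used
        while sum(len(t) for t in self.tiers.values()) > self.max_entries:
            self._evict_one()

    def _evict_one(self):
        # Evict the LRU entry of the first non-empty (least persistent) tier
        for t in self.tiers.values():
            if t:
                t.popitem(last=False)
                return
```

With this ordering, a burst of assistant-cache insertions evicts older assistant entries before ever touching the user or system tiers.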

@Thump604

Clean refactor. The N-tier CacheOrder generalization is the right move — I've been running a system prompt KV cache in vllm-mlx (PR #175) on a 122B MoE and the tiered eviction concept maps directly to what I needed: system prompts that survive across turns while user/assistant caches rotate. One thing to flag: the old insert_cache had a len(self._lru) > 1 guard on the max_bytes eviction loop that's been dropped — this means a single large cache entry that exceeds max_bytes will now be immediately evicted after insertion, where previously it would have been kept as the last entry. Might be intentional but worth a note.
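The behavioral difference from the dropped guard can be illustrated with a toy eviction loop (hypothetical names — `evict_over_budget` and `keep_last` are illustrative, not the actual mlx-lm code):

```python
def evict_over_budget(lru, sizes, max_bytes, keep_last=True):
    """Toy byte-budget eviction loop over an LRU list (oldest first).

    With keep_last=True (modeling the old `len(self._lru) > 1` guard),
    a single entry larger than max_bytes survives as the last item;
    with keep_last=False it is evicted immediately after insertion.
    """
    def total():
        return sum(sizes[k] for k in lru)

    while total() > max_bytes and (not keep_last or len(lru) > 1):
        lru.pop(0)  # evict the least recently used entry
    return lru
```

Under the old guard a lone oversized entry stays cached; without it, the loop drains the cache to empty whenever one entry alone exceeds the budget.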

@nastya236
Collaborator

Thank you! Very helpful refactoring.
This PR:

  • Moves LRUPromptCache and the related classes (CacheEntry, CacheOrder), along with all the search logic, out of the server.py module into models/cache.py (where it belongs)
  • Introduces PromptTrie, which encapsulates the prefix-tree handling that was in LRUPromptCache.

Minor changes:

  • Moves log_cache_stats() to _log_cache_stats() on ResponseGenerator so it is no longer part of the cache
  • Moves parse_size() to utils.py

I left a couple of questions, mainly for myself, to understand the code better.

        except ValueError:
            pass

    def pop(self):
Collaborator


A couple of questions about the eviction logic in CacheOrder.pop():

  1. With the default ordering ["assistant", "user", "system"], we evict from the assistant queue until its size drops below the user queue, then start evicting user entries. What's the reasoning behind this?
  2. Would it be correct to say that when the cache is trimmable, insert_cache calls pop_prefixes, so in that case the trie effectively has just one branch, making the eviction ordering irrelevant?

Member Author


  1. Yep, exactly this. It is done to ensure that the assistant cache doesn't evict the user cache. The reason we don't want that is that most (if not all) chat templates remove the thinking tokens before the last user message. They could even insert a system message reminder, and so on. The interaction looks like this (let's assume 1 slot per cache type):

1st message

<system> <--- system cache
<user> <--- user cache
<thinking>
<assistant>
<tool> <--- assistant cache
<thinking>
<assistant>
<tool> <--- assistant cache

You can see that the 2nd assistant cache will evict the user cache (actually the system cache, but bear with me since it is better for the example) if we simply treat them as a single LRU cache.

Moreover, the chat template will remove all the thinking when a second user message arrives, which renders the assistant caches useless (if not trimmable).

2nd message

<system> <--- system cache
<user> <--- user cache
<assistant>
<tool>
<assistant>
<tool>
<system reminder>
<user> <--- user cache
...

Since the messages changed after the user message, we have to process all the messages up to the next user message, which we put in the cache. You can see that we lost the two assistant cache entries; they are useless to us after the new user message.

  2. Yes. When the cache is trimmable, the user cache and system cache will always be included, so we only need to hold the different diverging branches. pop_prefixes ensures that.
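The eviction ordering described in the first answer might be sketched like this (a hypothetical `pop_candidate` helper; queue lengths stand in for the real byte accounting):

```python
def pop_candidate(queues):
    """Pick which tier to evict from, preferring less persistent tiers.

    Illustrative sketch of the ordering discussed above: evict from
    "assistant" while it is at least as large as "user", then from
    "user" while it is at least as large as "system", else "system".
    """
    order = ["assistant", "user", "system"]
    for less, more in zip(order, order[1:]):
        # only evict from a tier that is non-empty and has grown to
        # at least the size of the next, more persistent tier
        if len(queues[less]) >= len(queues[more]) and queues[less]:
            return less
    return order[-1]
```

This keeps a run of assistant insertions from ever pushing the (more valuable) user and system entries out of the cache.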

    def join(self):
        self._generation_thread.join()

    def _log_cache_stats(self):
Collaborator


The old version logged the latest user cache token count, while the new _log_cache_stats only logs the count and bytes. Just curious: how important is logging the token count?

Member Author


Yeah, not sure. I was kind of using it for debugging, but it isn't too useful either. Perhaps we'll expose some more stats later 🤷‍♂️

    def pop_prefixes(self, model: Any, tokens: List[int]):
        values = []
        current = self._trie[model]
        for i in range(len(tokens) - 1):
Collaborator


Here it's -1 because we want to pop prefixes but not the exact match, since we don't want to pop something we just inserted, right?

Member Author


Yep, pop_prefixes removes strict prefixes, not including this key.
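A minimal sketch of that strict-prefix behavior, using a plain nested dict in place of the real PromptTrie (`pop_strict_prefixes` and the `None`-key convention for stored values are illustrative assumptions):

```python
def pop_strict_prefixes(trie, tokens):
    """Collect and remove values stored at strict prefixes of `tokens`.

    The loop stops at len(tokens) - 1 so the entry at the full key
    (e.g. one we just inserted) is left untouched. `trie` is a nested
    dict whose values are stored under a None key at each node.
    """
    values = []
    node = trie
    for i in range(len(tokens) - 1):
        tok = tokens[i]
        if tok not in node:
            break  # no cached entry shares a longer prefix
        node = node[tok]
        if None in node:
            values.append(node.pop(None))  # remove the strict-prefix entry
    return values
```

Popping for tokens `[1, 2, 3]` removes an entry stored at `[1, 2]` but leaves the entry at `[1, 2, 3]` itself in place.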

@angeloskath angeloskath merged commit 4d3af3c into main Mar 26, 2026
2 checks passed
@angeloskath angeloskath deleted the lru-cache-refactor branch March 26, 2026 17:04
lyonsno added a commit to lyonsno/mlx-lm that referenced this pull request Mar 28, 2026
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:

- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019).
  Dev's rewind preflight/fail-closed logic integrated into the new
  PromptTrie-based LRUPromptCache.fetch_nearest_cache.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden can_trim_prompt_cache against missing is_trimmable.
- Adapt behavioral tests from dev's refcounting model to upstream's
  deepcopy-on-fetch model. Safety contracts preserved: fail-closed on
  partial rewind, preflight deepcopy avoidance, exact entry preservation.

All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Suggestion: move LRUPromptCache from server.py to models/cache.py
