Conversation

2397229 to 10808f2
Clean refactor. The N-tier
Thank you! Very helpful refactoring.
I left a couple of questions, mainly for myself, to understand the code better.
```python
        except ValueError:
            pass

    def pop(self):
```
A couple of questions about the eviction logic in `CacheOrder.pop()`:
- With the default ordering `["assistant", "user", "system"]`, we evict from the assistant queue until its size drops below the user queue's, then start evicting user entries. What's the reasoning behind this?
- Would it be correct to say that when the cache is trimmable, `insert_cache` calls `pop_prefixes`, so in that case the trie effectively has just one branch, making the eviction ordering irrelevant?
- Yep, exactly this. It's done to ensure that the assistant cache doesn't kick the user cache entry out. The reason we don't want that is that most (if not all) chat templates remove the thinking tokens before the last user message; they could even insert a system-reminder message, and so on. The interaction looks like this (let's assume 1 slot per cache type):
1st message

```
<system>     <--- system cache
<user>       <--- user cache
<thinking>
<assistant>
<tool>       <--- assistant cache
<thinking>
<assistant>
<tool>       <--- assistant cache
```
You can see that the 2nd assistant cache entry would evict the user cache entry (actually the system one, but bear with me, it's better for the example) if we simply treated them as a single LRU cache. Moreover, the chat template will remove all the thinking when a second user message arrives, which renders the assistant caches useless (if not trimmable).
2nd message

```
<system>            <--- system cache
<user>              <--- user cache
<assistant>
<tool>
<assistant>
<tool>
<system reminder>
<user>              <--- user cache
...
```
Since the messages changed after the first user message, we have to reprocess everything up to the next user message, which we then put in the cache. You can see that we lost the two assistant cache entries; they are useless to us after the new user message.
- Yes. When the cache is trimmable, the user cache and system cache will always be included, so we only need to hold the diverging branches. `pop_prefixes` ensures that.
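The priority-ordered eviction described above could be sketched roughly like this. This is a minimal illustration, not the PR's actual implementation: the class name `CacheOrder` and method `pop` match the diff, but the queue layout and `insert` helper are assumptions.

```python
from collections import OrderedDict


class CacheOrder:
    """Simplified sketch: one LRU queue per cache type, in eviction-priority order."""

    def __init__(self, order=("assistant", "user", "system")):
        self.order = order
        self.queues = {t: OrderedDict() for t in order}

    def insert(self, cache_type, key):
        # Insert (or refresh) a key as most-recently-used in its type's queue.
        self.queues[cache_type][key] = None
        self.queues[cache_type].move_to_end(key)

    def pop(self):
        # Evict from the highest-priority type while its queue is at least as
        # large as the next type's queue, so assistant entries never push out
        # the more valuable user or system entries.
        for t, nxt in zip(self.order, self.order[1:]):
            if len(self.queues[t]) >= len(self.queues[nxt]) and self.queues[t]:
                return self.queues[t].popitem(last=False)[0]
        # Otherwise evict from the last (lowest-priority) queue.
        last = self.queues[self.order[-1]]
        if last:
            return last.popitem(last=False)[0]
        return None
```

With one entry each in the user and system queues and two assistant entries, both assistant entries are evicted first, then the user entry, then the system entry, matching the intent that assistant caches are the most expendable.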
```python
    def join(self):
        self._generation_thread.join()

    def _log_cache_stats(self):
```
The old version logged the latest user cache token count, but the new `_log_cache_stats` only logs count and bytes. Just curious: how important is logging the token count?
Yeah, not sure. I was using it for debugging, but it isn't too useful either. Perhaps we'll expose some more stats later 🤷‍♂️
```python
    def pop_prefixes(self, model: Any, tokens: List[int]):
        values = []
        current = self._trie[model]
        for i in range(len(tokens) - 1):
```
The `-1` here is because we want to pop prefixes but not the exact match, since we don't want to pop something we just inserted, right?
Yep, `pop_prefixes` removes strict prefixes, not including the key itself.
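The strict-prefix behavior could be sketched as a trie walk like the one below. The node layout (`children` dict keyed by token, `value` holding the cache entry) is an assumption for illustration; only the `range(len(tokens) - 1)` loop mirrors the diff.

```python
class TrieNode:
    """Hypothetical trie node: children keyed by token, optional cache entry."""

    def __init__(self):
        self.children = {}
        self.value = None  # cache entry stored at this prefix, if any


def pop_prefixes(root, tokens):
    # Collect and remove entries stored at strict prefixes of `tokens`.
    # The loop stops one short of len(tokens), so the entry at the exact
    # key (the one we just inserted) is left untouched.
    values = []
    current = root
    for i in range(len(tokens) - 1):
        current = current.children.get(tokens[i])
        if current is None:
            break
        if current.value is not None:
            values.append(current.value)
            current.value = None  # pop: the entry leaves the trie
    return values
```

For a key `[1, 2, 3]`, this pops the entries stored at `[1]` and `[1, 2]` but leaves the entry at `[1, 2, 3]` in place, which is exactly why a freshly inserted key survives its own `pop_prefixes` call.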
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:
- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019). Dev's rewind preflight/fail-closed logic integrated into the new PromptTrie-based `LRUPromptCache.fetch_nearest_cache`.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden `can_trim_prompt_cache` against missing `is_trimmable`.
- Adapt behavioral tests from dev's refcounting model to upstream's deepcopy-on-fetch model.

Safety contracts preserved: fail-closed on partial rewind, preflight deepcopy avoidance, exact entry preservation. All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor the LRUPromptCache into `models.cache`. Closes #1013. In that spirit this PR also