Conversation

2397229 to 10808f2
Clean refactor. The N-tier
Thank you! Very helpful refactoring.
I left a couple of questions, mainly for myself, to understand the code better.
```python
        except ValueError:
            pass

    def pop(self):
```
A couple of questions about the eviction logic in `CacheOrder.pop()`:
- With the default ordering `["assistant", "user", "system"]`, we evict from the assistant queue until its size drops below the user queue's, then start evicting user entries. What's the reasoning behind this?
- Would it be correct to say that when the cache is trimmable, `insert_cache` calls `pop_prefixes`, so in that case the trie effectively has just one branch, making the eviction ordering irrelevant?
- Yep, exactly this. It's done to ensure that the assistant cache doesn't kick the user cache entry out. The reason we don't want that is that most (if not all) chat templates remove the thinking tokens before the last user message; they could even insert a system-reminder message, and so on. The interaction looks like this (let's assume 1 slot per cache type):
1st message

```
<system>     <--- system cache
<user>       <--- user cache
<thinking>
<assistant>
<tool>       <--- assistant cache
<thinking>
<assistant>
<tool>       <--- assistant cache
```
You can see that the 2nd assistant cache entry would evict the user cache entry (actually the system one, but bear with me, it's better for the example) if we simply treated them as a single LRU cache. Moreover, the chat template will remove all the thinking when a second user message arrives, which renders the assistant caches useless (if not trimmable).
2nd message

```
<system>            <--- system cache
<user>              <--- user cache
<assistant>
<tool>
<assistant>
<tool>
<system reminder>
<user>              <--- user cache
...
```
Since the messages changed after the first user message, we have to reprocess everything up to the next user message, which we then put in the cache. You can see that we lost the two assistant cache entries; they are useless to us after the new user message.
- Yes. When the cache is trimmable, the user cache and system cache will always be included, so we only need to hold the diverging branches. `pop_prefixes` ensures that.
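The priority-ordered eviction described above could be sketched roughly like this. This is a minimal illustration, not the PR's actual implementation: the class name `CacheOrder` and method `pop` match the diff, but the queue layout and `insert` helper are assumptions.

```python
from collections import OrderedDict


class CacheOrder:
    """Simplified sketch: one LRU queue per cache type, in eviction-priority order."""

    def __init__(self, order=("assistant", "user", "system")):
        self.order = order
        self.queues = {t: OrderedDict() for t in order}

    def insert(self, cache_type, key):
        # Insert (or refresh) a key as most-recently-used in its type's queue.
        self.queues[cache_type][key] = None
        self.queues[cache_type].move_to_end(key)

    def pop(self):
        # Evict from the highest-priority type while its queue is at least as
        # large as the next type's queue, so assistant entries never push out
        # the more valuable user or system entries.
        for t, nxt in zip(self.order, self.order[1:]):
            if len(self.queues[t]) >= len(self.queues[nxt]) and self.queues[t]:
                return self.queues[t].popitem(last=False)[0]
        # Otherwise evict from the last (lowest-priority) queue.
        last = self.queues[self.order[-1]]
        if last:
            return last.popitem(last=False)[0]
        return None
```

With one entry each in the user and system queues and two assistant entries, both assistant entries are evicted first, then the user entry, then the system entry, matching the intent that assistant caches are the most expendable.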
```python
    def join(self):
        self._generation_thread.join()

    def _log_cache_stats(self):
```
The old version logged the latest user cache token count, but the new `_log_cache_stats` only logs count and bytes. Just curious: how important is logging the token count?
Yeah, not sure. I was using it for debugging, but it isn't too useful either. Perhaps we'll expose some more stats later 🤷‍♂️
```python
    def pop_prefixes(self, model: Any, tokens: List[int]):
        values = []
        current = self._trie[model]
        for i in range(len(tokens) - 1):
```
The `-1` here is because we want to pop prefixes but not the exact match, since we don't want to pop something we just inserted, right?
Yep, `pop_prefixes` removes strict prefixes, not including the key itself.
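The strict-prefix behavior could be sketched as a trie walk like the one below. The node layout (`children` dict keyed by token, `value` holding the cache entry) is an assumption for illustration; only the `range(len(tokens) - 1)` loop mirrors the diff.

```python
class TrieNode:
    """Hypothetical trie node: children keyed by token, optional cache entry."""

    def __init__(self):
        self.children = {}
        self.value = None  # cache entry stored at this prefix, if any


def pop_prefixes(root, tokens):
    # Collect and remove entries stored at strict prefixes of `tokens`.
    # The loop stops one short of len(tokens), so the entry at the exact
    # key (the one we just inserted) is left untouched.
    values = []
    current = root
    for i in range(len(tokens) - 1):
        current = current.children.get(tokens[i])
        if current is None:
            break
        if current.value is not None:
            values.append(current.value)
            current.value = None  # pop: the entry leaves the trie
    return values
```

For a key `[1, 2, 3]`, this pops the entries stored at `[1]` and `[1, 2]` but leaves the entry at `[1, 2, 3]` in place, which is exactly why a freshly inserted key survives its own `pop_prefixes` call.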
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:
- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019). Dev's rewind preflight/fail-closed logic integrated into the new PromptTrie-based `LRUPromptCache.fetch_nearest_cache`.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden `can_trim_prompt_cache` against missing `is_trimmable`.
- Adapt behavioral tests from dev's refcounting model to upstream's deepcopy-on-fetch model.

Safety contracts preserved: fail-closed on partial rewind, preflight deepcopy avoidance, exact entry preservation. All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor the LRUPromptCache into `models.cache`. Closes #1013. In that spirit this PR also