Merged

The branch was force-pushed from d6c1148 to d2f6b0f, then from d2f6b0f to 5733d68.
Collaborator: Thank you! Looking into it now!
andresy approved these changes on Mar 6, 2026.

This was referenced on Mar 10, 2026.
omprashantjain added a commit to omprashantjain/vllm-mlx that referenced this pull request on Mar 12, 2026:
mlx-lm PR ml-explore/mlx-lm#911 ("Better caching in the server") added a `prompt_checkpoints` field to the prompt tuples returned by BatchGenerator.unprocessed_prompts. This causes the zip(*batch_prompts) unpacking in _chunked_next() to fail with a ValueError when using --continuous-batching, since the code expects exactly 6 elements but now receives 7.

Add a *_extra catch-all to the tuple unpacking so it gracefully handles any additional fields from upstream mlx-lm changes. This is backward-compatible: older mlx-lm versions that return 6 elements still work fine (the *_extra will simply be empty).

Fixes waybarrios#155
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request on Mar 16, 2026:
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked prefill code in _install_chunked_prefill hardcoded a 6-element zip unpacking, which crashed with a ValueError.

Fix: use a *_extra catch-all in the zip unpacking so extra fields from mlx-lm are captured but ignored. This is forward-compatible with any future tuple additions.

The "small prompt" path already works because it passes tuples directly to mlx-lm's own _process_prompts, which handles 7 elements. Only the chunked prefill path (total_tokens > budget) did its own unpacking.

Closes waybarrios#155
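The catch-all fix described in these commit messages can be sketched as follows. The function and variable names here are illustrative, not vllm-mlx's actual identifiers; the point is Python's starred assignment target, which absorbs any trailing columns so both old 6-element and new 7-element prompt tuples unpack without a ValueError.

```python
def unpack_prompts(batch_prompts):
    """Unpack per-prompt tuples, tolerating extra trailing fields."""
    # With exactly six names on the left, zip(*...) unpacking raises a
    # ValueError as soon as a seventh column (prompt_checkpoints)
    # appears.  The *_extra catch-all absorbs any additional columns,
    # so the known fields unpack cleanly either way.
    uids, prompts, max_tokens, samplers, logit_procs, caches, *_extra = zip(
        *batch_prompts
    )
    return uids, prompts, max_tokens, samplers, logit_procs, caches


# New-style 7-element tuples (with a prompt_checkpoints field) and
# old-style 6-element tuples both work:
new_style = [(1, "p1", 32, None, None, None, [])]
old_style = [(2, "p2", 64, None, None, None)]
print(unpack_prompts(new_style)[0])  # (1,)
print(unpack_prompts(old_style)[0])  # (2,)
```

With six names and no star target, the old-style input would still work but the new-style input would raise `ValueError: too many values to unpack`.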
lyonsno added a commit to lyonsno/mlx-lm that referenced this pull request on Mar 28, 2026:
Semantic merge of upstream/main (through 4d3af3c) into dev. Key changes:

- LRUPromptCache moved from server.py to cache.py (upstream ml-explore#1019). Dev's rewind preflight/fail-closed logic integrated into the new PromptTrie-based LRUPromptCache.fetch_nearest_cache.
- Presence/frequency penalties (upstream ml-explore#971).
- Better caching with CacheOrder and PromptTrie (upstream ml-explore#911, ml-explore#1019).
- Harden can_trim_prompt_cache against missing is_trimmable.
- Adapt behavioral tests from dev's refcounting model to upstream's deepcopy-on-fetch model.

Safety contracts preserved: fail-closed on partial rewind, preflight deepcopy avoidance, exact entry preservation. All 247 tests pass (+ 88 subtests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This aims to solve the problem of non-trimmable caches resulting in no KV cache reuse, which is very common with linear attention and sliding-window attention models.
The standard agentic interaction is roughly the following:

For 2 & 3 we want to keep the thinking in the cache; for 1 we don't, because the thinking is typically removed from the history.
This PR makes the following changes (for now they only apply to batched generation):
It isn't perfect and may require a bigger refactoring to improve, but at least for now we can utilize the cache for non-trimmable KV caches as well.
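One way to picture the checkpoint idea behind this PR: since a non-trimmable cache cannot be rewound past an over-long prefix, keep snapshot copies of the cache at chosen prompt offsets and reuse the longest checkpoint that is still a prefix of the new prompt. The sketch below uses a hypothetical `CheckpointedCache` class and plain deepcopy snapshots; it is not mlx-lm's actual implementation, only an illustration of the lookup logic under those assumptions.

```python
import copy


class CheckpointedCache:
    """Toy prefix cache for non-trimmable KV caches (illustrative only)."""

    def __init__(self):
        # Each entry pairs a token prefix with a snapshot of the cache
        # state taken right after that prefix was processed.
        self.checkpoints = []

    def save_checkpoint(self, tokens, cache_state):
        # deepcopy stands in for whatever copy mechanism a real KV
        # cache would use at a checkpoint boundary.
        self.checkpoints.append((list(tokens), copy.deepcopy(cache_state)))

    def fetch(self, prompt_tokens):
        # Find the checkpoint with the longest prefix match, and return
        # its snapshot plus the tokens that still need to be prefilled.
        best = None
        for prefix, state in self.checkpoints:
            if prompt_tokens[: len(prefix)] == prefix:
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, state)
        if best is None:
            return None, prompt_tokens
        prefix, state = best
        return copy.deepcopy(state), prompt_tokens[len(prefix):]


cache = CheckpointedCache()
cache.save_checkpoint([1, 2, 3], {"kv": "after-3"})
cache.save_checkpoint([1, 2, 3, 4, 5], {"kv": "after-5"})

# The new prompt diverges after token 4, so the shorter checkpoint is
# the longest one that still matches; tokens [4, 9, 9] get prefilled.
state, remaining = cache.fetch([1, 2, 3, 4, 9, 9])
print(state, remaining)  # {'kv': 'after-3'} [4, 9, 9]
```

This trades extra memory for the snapshots against avoiding a full re-prefill whenever the conversation history is edited, which matches the agentic pattern described above where thinking may be stripped from the history between turns.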
I will add some benchmarks and leave it for comments/reviews before merging.