Support prefix cache hits in unified prefill path#185
Closed
Kingwl wants to merge 1 commit intovllm-project:mainfrom
Conversation
Signed-off-by: kingwl <kingwenlu@gmail.com>
Collaborator
Thanks for the quick catch!
I think we should explicitly disable prefix caching for now, because we have to, and will, support prefix caching on the paged path. But prefix caching is not a one-day job. Please join the discussion in #182. Will take a closer look at your other fixes soon!
This PR fixes a crash in the Metal unified paged prefill path when vLLM core prefix caching schedules a new request with `num_computed_tokens > 0`.

The root cause was that the unified path only tracked the current prefill chunk, not the full request tokens, and it also assumed `start_pos == 0` for complete prefills on new requests. This PR updates the model runner to carry both chunk tokens and full request tokens through the unified prefill flow, correctly updates paged sequence lengths for a non-zero `start_pos`, and removes the incorrect assertion for new complete prefills with cached prefixes.

With this change, the long-context sonnet serve benchmark that previously failed with `AssertionError: new complete prefill with start_pos > 0 not supported` now completes successfully with prefix caching enabled.

Fixed #184
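The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the actual vLLM model-runner code: the names `PrefillChunk` and `plan_prefill_chunk` are hypothetical. The key point it demonstrates is that with a prefix-cache hit, the chunk starts at `num_computed_tokens` rather than 0, and the paged sequence length must include the cached prefix.

```python
# Hypothetical sketch of prefill planning with a prefix-cache hit.
# PrefillChunk and plan_prefill_chunk are illustrative names, not vLLM APIs.
from dataclasses import dataclass
from typing import List


@dataclass
class PrefillChunk:
    chunk_tokens: List[int]   # tokens to compute in this scheduling step
    full_tokens: List[int]    # all prompt tokens for the request
    start_pos: int            # offset of the chunk within the full prompt
    seq_len: int              # paged sequence length after this chunk


def plan_prefill_chunk(full_tokens: List[int],
                       num_computed_tokens: int,
                       max_chunk: int = 512) -> PrefillChunk:
    # Tokens before num_computed_tokens were already served by the
    # prefix cache, so the chunk starts at that offset (start_pos may
    # be non-zero even for a brand-new request).
    remaining = full_tokens[num_computed_tokens:]
    chunk = remaining[:max_chunk]
    return PrefillChunk(
        chunk_tokens=chunk,
        full_tokens=full_tokens,
        start_pos=num_computed_tokens,
        # The paged sequence length must count the cached prefix too,
        # not just the tokens computed in this chunk.
        seq_len=num_computed_tokens + len(chunk),
    )


prompt = list(range(1000))
plan = plan_prefill_chunk(prompt, num_computed_tokens=256)
print(plan.start_pos, len(plan.chunk_tokens), plan.seq_len)  # → 256 512 768
```

The crash fixed here corresponds to asserting `start_pos == 0` for new requests, which this planning step violates whenever the cache supplies a non-empty prefix.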