Support prefix cache hits in unified prefill path #185

Closed
Kingwl wants to merge 1 commit into vllm-project:main from Kingwl:fix/prefix-cache-engine-crash

Kingwl (Contributor) commented Mar 20, 2026

This PR fixes a crash in the Metal unified paged prefill path when vLLM core prefix caching schedules a new request with num_computed_tokens > 0.

The root cause was that the unified path only tracked the current prefill chunk, but not the full request tokens, and it also assumed start_pos == 0 for complete prefills on new requests. This PR updates the model runner to carry both chunk tokens and full request tokens through the unified prefill flow, correctly updates paged sequence lengths for non-zero start_pos, and removes the incorrect assertion for new complete prefills with cached prefixes.
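The bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual model runner code: `PrefillRequest`, `PrefillPlan`, and `plan_prefill` are hypothetical names, and the only identifiers taken from this PR are `num_computed_tokens` and `start_pos`. The idea is that the chunk to compute is sliced from the full request tokens starting at the cached prefix length, and the paged sequence length covers the cached prefix plus the chunk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrefillRequest:
    """Hypothetical request view; not the real vLLM or Metal runner type."""
    request_token_ids: list[int]  # full prompt tokens for the request
    num_computed_tokens: int      # tokens already covered by a prefix cache hit
    num_scheduled_tokens: int     # tokens scheduled for this prefill step


@dataclass
class PrefillPlan:
    """Hypothetical result carrying both the chunk and the full-sequence info."""
    chunk_token_ids: list[int]  # tokens actually run through the model this step
    start_pos: int              # position of the first chunk token in the request
    seq_len: int                # paged sequence length after this step
    is_complete: bool           # True when the whole prompt is covered


def plan_prefill(req: PrefillRequest) -> PrefillPlan:
    # With a prefix cache hit, a new request may start at a non-zero position.
    start_pos = req.num_computed_tokens
    end_pos = start_pos + req.num_scheduled_tokens
    if not 0 <= start_pos <= end_pos <= len(req.request_token_ids):
        raise ValueError("scheduled range falls outside the request tokens")

    # The chunk comes from the full request tokens, not from position 0.
    chunk = req.request_token_ids[start_pos:end_pos]

    # The sequence length counts the cached prefix plus the new chunk,
    # so no start_pos == 0 assumption is needed for a complete prefill.
    return PrefillPlan(
        chunk_token_ids=chunk,
        start_pos=start_pos,
        seq_len=end_pos,
        is_complete=end_pos == len(req.request_token_ids),
    )
```

For example, a 10-token request with a 4-token cached prefix and 6 scheduled tokens yields a 6-token chunk, `start_pos == 4`, and a sequence length of 10, which counts as a complete prefill.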

With this change, the long-context sonnet serve benchmark that previously failed with AssertionError: new complete prefill with start_pos > 0 not supported now completes successfully with prefix caching enabled.

Fixed #184

Signed-off-by: kingwl <kingwenlu@gmail.com>

WindChimeRan (Collaborator) commented

Thanks for the quick catch!

> explicitly disable incompatible vLLM core prefix caching when this path is active.

I think we should explicitly disable prefix caching for now, since we need to (and will) support prefix caching on the paged path itself.

But prefix caching is not a one-day job. Please join the discussion in #182.

Will take a closer look at your other fixes soon!

Kingwl closed this Mar 21, 2026

WindChimeRan pushed a commit that referenced this pull request Mar 21, 2026
This PR temporarily disables vLLM core prefix caching on Metal when
paged attention is enabled.

Per the comments on #185.
Fixed #184

Signed-off-by: kingwl <kingwenlu@gmail.com>
Development

Successfully merging this pull request may close these issues.

Engine crashes when prefix caching hits new complete prefill in Metal paged unified path