[KV Connector] Fix async connector prefix cache metrics #28585
Merged
robertgshaw2-redhat merged 1 commit into vllm-project:main on Nov 21, 2025
Code Review
This pull request fixes a bug where asynchronous KV connector prefix cache metrics were recorded twice, deflating the reported hit rate. The fix persists the number of externally computed tokens on the Request object and moves the metrics recording to a later stage of scheduling, so each request is counted once. The changes also adjust this value when KV block loads fail. The implementation appears correct and resolves the described issue. The new test case is currently commented out because it depends on another pull request, as noted in the code.
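The record-once pattern described in this review can be sketched roughly as follows. This is a minimal illustration, not vLLM's code: `SketchRequest`, `SketchStats`, `on_connector_match`, `on_load_finished`, and the `num_failed_tokens` parameter are all hypothetical names, and the way failed loads are subtracted is an assumption about what "adjusting the metric" could look like.

```python
from dataclasses import dataclass


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a scheduler request (not vLLM's Request)."""
    request_id: str
    num_queried_tokens: int
    # Persisted when the connector reports a match, so the value is still
    # available after the asynchronous load has finished.
    num_external_computed_tokens: int = 0


@dataclass
class SketchStats:
    """Hypothetical query/hit counters for the external prefix cache."""
    queries: int = 0
    hits: int = 0

    def record(self, num_tokens: int, num_hits: int) -> None:
        self.queries += num_tokens
        self.hits += num_hits


def on_connector_match(request: SketchRequest, matched_tokens: int) -> None:
    # First scheduling pass: remember the match, but do not record stats yet.
    request.num_external_computed_tokens = matched_tokens


def on_load_finished(request: SketchRequest, stats: SketchStats,
                     num_failed_tokens: int = 0) -> None:
    # Assumption: tokens whose KV blocks failed to load no longer count as hits.
    request.num_external_computed_tokens = max(
        0, request.num_external_computed_tokens - num_failed_tokens)
    # Record exactly once, after the transfer has completed.
    stats.record(request.num_queried_tokens,
                 request.num_external_computed_tokens)


req = SketchRequest("r1", num_queried_tokens=100)
stats = SketchStats()
on_connector_match(req, matched_tokens=100)
on_load_finished(req, stats)
```

Because nothing is recorded in `on_connector_match`, a second scheduling pass for the same request has no stats call to repeat.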
Currently we are recording async connector prefix cache queries and hits twice, for the same reason that `update_state_after_alloc()` is called twice:

> If get_num_new_matched_tokens previously returned True for a
> request, this function may be called twice for that same request -
> first when blocks are allocated for the connector tokens to be
> asynchronously loaded into, and second when any additional blocks
> are allocated, after the load/transfer is complete.

Worse, the second time we are recording with `num_external_computed_tokens=0` so effectively we are halving the hit rate.

Before

```
External prefix cache hit rate: 50.0%
```

After

```
External prefix cache hit rate: 100.0%
```

Borrows part of vllm-project#27569 to track `num_external_computed_tokens` for use when the KV transfer completes.

Will use vllm-project#28550 for testing this scenario.

Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
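To see why the second recording halves the rate, here is a small arithmetic sketch. The `hit_rate` helper and the token count of 1000 are illustrative assumptions, not values or code from the PR; each record is a `(queried_tokens, hit_tokens)` pair.

```python
def hit_rate(records: list[tuple[int, int]]) -> float:
    """Aggregate hit rate over (queried_tokens, hit_tokens) records."""
    queries = sum(q for q, _ in records)
    hits = sum(h for _, h in records)
    return hits / queries if queries else 0.0


num_tokens = 1000  # illustrative request size, fully served by the connector

# Double recording: the second call sees num_external_computed_tokens=0.
double_recorded = [(num_tokens, num_tokens), (num_tokens, 0)]
# Single recording, once the KV transfer has completed.
single_recorded = [(num_tokens, num_tokens)]

print(f"double recorded: {hit_rate(double_recorded):.1%}")  # 50.0%
print(f"single recorded: {hit_rate(single_recorded):.1%}")  # 100.0%
```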
20ec755 to 0a1a297
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
markmc added a commit to markmc/vllm that referenced this pull request Feb 4, 2026
…ueries

Somewhat related to vllm-project#28585

The `vllm:external_prefix_cache_queries` metric was double-counting queries by recording the total prompt tokens instead of only the tokens actually queried from the KV connector.

Delay the recording of connector prefix cache queries and hits until after KV load has succeeded or failed.

In the sync connector mode, we only know a KV load has succeeded after the model step has completed. So:

- We record the queried/hit token count on the request
- In update_from_output(), we record these stats from successful requests, and include them in the SchedulerStats for this iteration
- If a reset comes in, we note this so that it can also be included in the next SchedulerStats

Example scenario:

- Request with 1000 prompt tokens
- Local cache finds 600 tokens
- External cache finds 200 of the remaining 400 tokens
- Computed: 200 tokens

Metrics before this fix:

```
vllm:prefix_cache_queries: 1000
vllm:prefix_cache_hits: 600
vllm:external_prefix_cache_queries: 1000  # Double counting!
vllm:external_prefix_cache_hits: 200
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
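The arithmetic of the example scenario can be worked through as below. The variable names are illustrative, and the value of 400 for the external query count is a reading of "only the tokens actually queried from the KV connector" (the tokens left after the local cache hit), not a figure quoted from the commit.

```python
prompt_tokens = 1000
local_hits = 600     # found in the local prefix cache
external_hits = 200  # found by the KV connector

# Only tokens the local cache missed are actually queried from the connector.
external_queries_expected = prompt_tokens - local_hits
# Before the fix, the full prompt length was recorded as the external query count.
external_queries_before_fix = prompt_tokens

# Tokens that still have to be computed by the model.
computed_tokens = prompt_tokens - local_hits - external_hits

print("external queries (before fix):", external_queries_before_fix)
print("external queries (expected):  ", external_queries_expected)
print("external hit rate (before fix):",
      external_hits / external_queries_before_fix)
print("external hit rate (expected):  ",
      external_hits / external_queries_expected)
print("computed tokens:", computed_tokens)
```

Under this reading, the external hit rate for the scenario is 200/400 rather than 200/1000.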