[KV Connector] Fix async connector prefix cache metrics by markmc · Pull Request #28585 · vllm-project/vllm

markmc · 2025-11-12T19:48:56Z

Currently we are recording async connector prefix cache queries and hits twice, for the same reason that update_state_after_alloc() is called twice:

If get_num_new_matched_tokens previously returned True for a
request, this function may be called twice for that same request -
first when blocks are allocated for the connector tokens to be
asynchronously loaded into, and second when any additional blocks
are allocated, after the load/transfer is complete.

Worse, the second time we are recording with num_external_computed_tokens=0 so effectively we are halving the hit rate.

Before

External prefix cache hit rate: 100.0%

After

External prefix cache hit rate: 50.0%

Borrows part of #27569 to track num_external_computed_tokens for use when the KV transfer completes.

Will use #28550 for testing this scenario.

gemini-code-assist

Code Review

This pull request addresses a bug where asynchronous KV connector prefix cache metrics were being recorded incorrectly, leading to a deflated hit rate. The fix involves persisting the number of externally computed tokens on the Request object and moving the metrics recording logic to a later stage in the scheduling process to avoid double-counting. The changes also correctly handle adjustments to this metric when KV block loads fail. The implementation appears correct and effectively resolves the described issue. The new test case is currently commented out, pending a dependency on another pull request, which is noted in the code.

Currently we are recording async connector prefix cache queries and hits twice, for the same reason that update_state_after_alloc() is called twice: > If get_num_new_matched_tokens previously returned True for a > request, this function may be called twice for that same request - > first when blocks are allocated for the connector tokens to be > asynchronously loaded into, and second when any additional blocks > are allocated, after the load/transfer is complete. Worse, the second time we are recording with `num_external_computed_tokens=0` so effectively we are halving the hit rate. Before ``` External prefix cache hit rate: 100.0% ``` After ``` External prefix cache hit rate: 50.0% ``` Borrows part of vllm-project#27569 to track `num_external_computed_tokens` for use when the KV transfer completes. Will use vllm-project#28550 for testing this scenario. Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>

robertgshaw2-redhat

nice fix!

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

…ueries Somewhat related to vllm-project#28585 The `vllm:external_prefix_cache_queries` metric was double-counting queries by recording the total prompt tokens instead of only the tokens actually queried from the KV connector. Delay the recording of connector prefix cache queries and hits until after KV load has succeeded or failed. In the sync connector mode, we only know a KV load has succeeded after the model step has completed. So: - We record the queried/hit token count on the request - In update_from_output(), we record these stats from successful requests, and include them in the SchedulerStats for this iteration - If a reset comes in, we note this so that it can also be included in the next SchedulerStats Example scenario: - Request with 1000 prompt tokens - Local cache finds 600 tokens - External cache finds 200 of the remaining 400 tokens - Computed: 200 tokens Metrics before this fix: ``` vllm:prefix_cache_queries: 1000 vllm:prefix_cache_hits: 600 vllm:external_prefix_cache_queries: 1000 # Double counting! vllm:external_prefix_cache_hits: 200 ``` Signed-off-by: Mark McLoughlin <markmc@redhat.com>

markmc requested a review from NickLucche November 12, 2025 19:48

markmc requested review from ApostaC, WoosukKwon, alexm-redhat, comaniac, heheda12345, njhill, robertgshaw2-redhat and ywang96 as code owners November 12, 2025 19:48

mergify Bot added the v1 label Nov 12, 2025

gemini-code-assist Bot reviewed Nov 12, 2025

View reviewed changes

markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2025

markmc force-pushed the async-load-connector-prefix-cache-stats branch from 20ec755 to 0a1a297 Compare November 14, 2025 07:45

robertgshaw2-redhat approved these changes Nov 21, 2025

View reviewed changes

robertgshaw2-redhat merged commit c6fa389 into vllm-project:main Nov 21, 2025
43 checks passed

ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025

[KV Connector] Fix async connector prefix cache metrics (vllm-project…

3de69a7

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

[KV Connector] Fix async connector prefix cache metrics (vllm-project…

5b2139b

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

[KV Connector] Fix async connector prefix cache metrics (vllm-project…

24ab0cb

…#28585) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

markmc mentioned this pull request Dec 11, 2025

[KV Connector][Metrics] Do not count local prefix cache hits in connector queries #30522

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[KV Connector] Fix async connector prefix cache metrics#28585

[KV Connector] Fix async connector prefix cache metrics#28585
robertgshaw2-redhat merged 1 commit intovllm-project:mainfrom
markmc:async-load-connector-prefix-cache-stats

markmc commented Nov 12, 2025 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

robertgshaw2-redhat left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

markmc commented Nov 12, 2025 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

robertgshaw2-redhat left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

markmc commented Nov 12, 2025 •

edited by github-actions Bot

Loading