
[KV Connector] Fix async connector prefix cache metrics#28585

Merged
robertgshaw2-redhat merged 1 commit into vllm-project:main from markmc:async-load-connector-prefix-cache-stats on Nov 21, 2025

Conversation

Member

@markmc markmc commented Nov 12, 2025

Currently we are recording async connector prefix cache queries and hits twice, for the same reason that update_state_after_alloc() is called twice:

> If get_num_new_matched_tokens previously returned True for a
> request, this function may be called twice for that same request -
> first when blocks are allocated for the connector tokens to be
> asynchronously loaded into, and second when any additional blocks
> are allocated, after the load/transfer is complete.

Worse, the second time we are recording with `num_external_computed_tokens=0`, so we are effectively halving the hit rate.

Before

```
External prefix cache hit rate: 100.0%
```

After

```
External prefix cache hit rate: 50.0%
```

Borrows part of #27569 to track `num_external_computed_tokens` for use when the KV transfer completes.

Will use #28550 for testing this scenario.
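The halving mechanism can be seen with a little arithmetic. A minimal sketch (hypothetical stats class, not vLLM's actual metrics code) of what happens when stats are recorded on both `update_state_after_alloc()` calls:

```python
# Hypothetical sketch (not vLLM's actual metrics code) of how recording
# prefix-cache stats on both update_state_after_alloc() calls halves the
# observed external hit rate.

class ExternalPrefixCacheStats:
    def __init__(self) -> None:
        self.queries = 0
        self.hits = 0

    def record(self, num_queried: int, num_hit: int) -> None:
        self.queries += num_queried
        self.hits += num_hit

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

stats = ExternalPrefixCacheStats()

# First call: blocks are allocated for the async load; all 100 queried
# tokens were found externally.
stats.record(100, 100)
print(f"hit rate after first call: {stats.hit_rate:.1%}")   # 100.0%

# Second call: the load has completed, but the stats are recorded again
# with num_external_computed_tokens=0, doubling queries without adding hits.
stats.record(100, 0)
print(f"hit rate after second call: {stats.hit_rate:.1%}")  # 50.0%
```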

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a bug where asynchronous KV connector prefix cache metrics were being recorded incorrectly, leading to a deflated hit rate. The fix involves persisting the number of externally computed tokens on the Request object and moving the metrics recording logic to a later stage in the scheduling process to avoid double-counting. The changes also correctly handle adjustments to this metric when KV block loads fail. The implementation appears correct and effectively resolves the described issue. The new test case is currently commented out, pending a dependency on another pull request, which is noted in the code.
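The approach the review describes can be sketched roughly as follows. All names here (`Request`, `on_alloc`, `on_load_finished`) are illustrative stand-ins, not vLLM's real API: the token count is stashed on the request at allocation time, and metrics are recorded only once, when the load resolves, with tokens from failed block loads subtracted.

```python
# Illustrative sketch of the fix described above; these classes and
# functions are hypothetical stand-ins, not vLLM's real API.
from dataclasses import dataclass

@dataclass
class Request:
    num_prompt_tokens: int
    num_external_computed_tokens: int = 0  # persisted at allocation time

class ExternalPrefixCacheStats:
    def __init__(self) -> None:
        self.queries = 0
        self.hits = 0

stats = ExternalPrefixCacheStats()

def on_alloc(req: Request, num_external_computed_tokens: int) -> None:
    # May run twice per request; it only stashes the count, so the
    # second call (with 0) cannot clobber a nonzero first value.
    if num_external_computed_tokens:
        req.num_external_computed_tokens = num_external_computed_tokens

def on_load_finished(req: Request, num_failed_tokens: int = 0) -> None:
    # Record exactly once, after the async transfer has resolved,
    # backing out tokens whose block loads failed.
    stats.queries += req.num_prompt_tokens
    stats.hits += req.num_external_computed_tokens - num_failed_tokens

req = Request(num_prompt_tokens=1000)
on_alloc(req, 200)   # first update_state_after_alloc()
on_alloc(req, 0)     # second call, after the load completes: no-op
on_load_finished(req, num_failed_tokens=16)
print(stats.queries, stats.hits)  # 1000 184
```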

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2025
Currently we are recording async connector prefix cache queries
and hits twice, for the same reason that update_state_after_alloc()
is called twice:

> If get_num_new_matched_tokens previously returned True for a
> request, this function may be called twice for that same request -
> first when blocks are allocated for the connector tokens to be
> asynchronously loaded into, and second when any additional blocks
> are allocated, after the load/transfer is complete.

Worse, the second time we are recording with `num_external_computed_tokens=0`
so effectively we are halving the hit rate.

Before

```
External prefix cache hit rate: 100.0%
```

After

```
External prefix cache hit rate: 50.0%
```

Borrows part of vllm-project#27569 to track `num_external_computed_tokens`
for use when the KV transfer completes.

Will use vllm-project#28550 for testing this scenario.

Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the async-load-connector-prefix-cache-stats branch from 20ec755 to 0a1a297 Compare November 14, 2025 07:45
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat left a comment


nice fix!

@robertgshaw2-redhat robertgshaw2-redhat merged commit c6fa389 into vllm-project:main Nov 21, 2025
43 checks passed
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
…#28585)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…#28585)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…#28585)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…#28585)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
markmc added a commit to markmc/vllm that referenced this pull request Feb 4, 2026
…ueries

Somewhat related to vllm-project#28585

The `vllm:external_prefix_cache_queries` metric was double-counting
queries by recording the total prompt tokens instead of only the
tokens actually queried from the KV connector.

Delay the recording of connector prefix cache queries and hits until
after KV load has succeeded or failed. In the sync connector mode,
we only know a KV load has succeeded after the model step has
completed.

So:

- We record the queried/hit token count on the request
- In update_from_output(), we record these stats from successful
  requests, and include them in the SchedulerStats for this
  iteration
- If a reset comes in, we note this so that it can also be
  included in the next SchedulerStats

Example scenario:
- Request with 1000 prompt tokens
- Local cache finds 600 tokens
- External cache finds 200 of the remaining 400 tokens
- Computed: 200 tokens

Metrics before this fix:
```
vllm:prefix_cache_queries: 1000
vllm:prefix_cache_hits: 600
vllm:external_prefix_cache_queries: 1000 # Double counting!
vllm:external_prefix_cache_hits: 200
```

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
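The deferred-recording scheme that commit message lays out can be sketched as follows (simplified, hypothetical names; the real scheduler code differs, and the reset handling mentioned above is omitted): the queried/hit counts live on the request, and `update_from_output()` folds them into the iteration's `SchedulerStats` once the KV load has resolved.

```python
# Simplified sketch of deferring connector prefix-cache stats until
# update_from_output(); names are illustrative, not vLLM's real signatures.
from dataclasses import dataclass

@dataclass
class Request:
    # queried/hit token counts recorded on the request at schedule time
    num_external_tokens_queried: int = 0
    num_external_tokens_hit: int = 0

@dataclass
class SchedulerStats:
    external_prefix_cache_queries: int = 0
    external_prefix_cache_hits: int = 0

def update_from_output(loaded_ok: list) -> SchedulerStats:
    """Fold per-request stats from successful KV loads into this
    iteration's SchedulerStats."""
    stats = SchedulerStats()
    for req in loaded_ok:
        stats.external_prefix_cache_queries += req.num_external_tokens_queried
        stats.external_prefix_cache_hits += req.num_external_tokens_hit
    return stats

# Scenario from the commit message: 1000-token prompt, 600 tokens found in
# the local cache, 200 of the remaining 400 found externally. Only the 400
# actually queried from the connector are counted, avoiding double counting.
req = Request(num_external_tokens_queried=400, num_external_tokens_hit=200)
s = update_from_output([req])
print(s.external_prefix_cache_queries, s.external_prefix_cache_hits)  # 400 200
```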

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
