
[Core][Metrics] Remove vllm:prompt_tokens_recomputed metric #38709

Merged
orozery merged 1 commit into vllm-project:main from markmc:remove-prompt-tokens-recomputed
Apr 12, 2026

Conversation

@markmc
Member

@markmc markmc commented Apr 1, 2026

In the case of a full local prefix cache hit (prompt length N), we actually only use N-1 tokens. The `vllm:prompt_tokens_recomputed` metric was intended to count how many cached tokens we are effectively discarding because of this.

```
KVCacheManager.get_computed_blocks():
    ...
    # NOTE: When all tokens hit the cache, we must recompute the last token
    # to obtain logits. [...]
    max_cache_hit_length = request.num_tokens - 1
```

However, even here, we can't assume that the last token would have been a cache hit, so it isn't clear it should be counted as "recomputed". In retrospect, the metric seems quite misguided.
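
To make the accounting concrete, here is a minimal sketch, not the vLLM implementation; the helper name and signature are invented for illustration and simply mirror the N-1 cap from `KVCacheManager.get_computed_blocks()` quoted above:

```python
# Illustrative sketch only; not the vLLM implementation. The helper name is
# made up, but it mirrors the N-1 cap shown above.

def split_prompt_tokens(num_prompt_tokens: int, num_cached_tokens: int) -> tuple[int, int]:
    """Return (tokens served from cache, tokens still computed) for one request."""
    # Even on a full hit, cap the reusable prefix at N - 1 so the last
    # token is run through the model to produce logits.
    max_cache_hit_length = num_prompt_tokens - 1
    cached = min(num_cached_tokens, max_cache_hit_length)
    computed = num_prompt_tokens - cached
    return cached, computed


# Full hit on a 100-token prompt: 99 tokens come from the cache, 1 is computed.
assert split_prompt_tokens(100, 100) == (99, 1)
# The removed metric counted that trailing token as "recomputed", though as
# noted above we can't actually know it would have been a cache hit.
```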

The metric was added as a side effect of #33290, in order to make sense of the fact that:

```
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"}
```

will include a token that is recomputed. See this comment:

> Note: external_kv_transfer reports the actual number of tokens
> transferred (e.g., prompt length N), while prompt_tokens_cached_total
> reports the adjusted count (e.g., N-1). The last token is both
> transferred AND recomputed locally, so there's overlap.

However, it makes more sense for the `external_kv_transfer` count to reflect only tokens we actually used, not any recomputed tokens. This will be done in #37460.
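
As a worked example of the overlap (hypothetical numbers; this is plain arithmetic, not vLLM code), consider a 100-token prompt served entirely by an external KV transfer:

```python
# Hypothetical worked example of the overlap described above.
prompt_len = 100

# Today: the external_kv_transfer source reports every transferred token,
# but the last token is recomputed locally anyway, so the per-source counts
# overlap and sum to more than the prompt length.
tokens_from_external_kv_transfer = prompt_len       # 100
tokens_computed_locally = 1
assert tokens_from_external_kv_transfer + tokens_computed_locally == prompt_len + 1

# After the follow-up change referenced above, the transfer source would
# count only the tokens actually used, so the sources partition the prompt.
tokens_from_external_kv_transfer = prompt_len - 1   # 99
assert tokens_from_external_kv_transfer + tokens_computed_locally == prompt_len
```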

I'm not aware of any user demand for this metric, or anyone relying on it now. So it seems safe to remove it, rather than go through a deprecation period.

@markmc markmc requested a review from orozery April 1, 2026 09:07
@mergify mergify Bot added the v1 label Apr 1, 2026
@markmc
Member Author

markmc commented Apr 1, 2026

/cc @ZhanqiuHu

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the tracking and reporting of recomputed tokens from the metrics system. Specifically, it deletes the vllm:prompt_tokens_recomputed metric, simplifies the PromptTokenStats class by removing the recomputation logic and its associated invariants, and updates the relevant tests to reflect these changes. I have no feedback to provide.
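
For readers unfamiliar with the code being touched, here is a hypothetical sketch of the kind of simplification described in the review comment; the field and method names are assumptions for illustration, not the actual vLLM PromptTokenStats definition:

```python
from dataclasses import dataclass


@dataclass
class PromptTokenStats:
    # Before this PR the class also carried a `recomputed` counter, plus
    # invariants tying it to the cached count; after the PR only the
    # cached/computed split remains. Field and method names here are guesses.
    cached: int = 0
    computed: int = 0

    def record(self, num_prompt_tokens: int, num_cached_tokens: int) -> None:
        self.cached += num_cached_tokens
        self.computed += num_prompt_tokens - num_cached_tokens
```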

@markmc markmc force-pushed the remove-prompt-tokens-recomputed branch from bd96727 to e773216 on April 9, 2026 12:17
@mergify
Contributor

mergify Bot commented Apr 9, 2026

Deprecation notice: This pull request comes from a fork and was rebased using bot_account impersonation. This capability will be removed on July 1, 2026. After this date, the rebase action will no longer be able to rebase fork pull requests with this configuration. Please switch to the update action/command to ensure compatibility going forward.

@markmc markmc force-pushed the remove-prompt-tokens-recomputed branch from e773216 to a167d11 on April 10, 2026 11:48
@orozery orozery merged commit 72ff142 into vllm-project:main Apr 12, 2026
49 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in Metrics & Tracing Apr 12, 2026
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026

Labels

kv-connector, ready, v1

Projects

Status: Done
