[Core][Metrics] Remove vllm:prompt_tokens_recomputed metric #38709

orozery merged 1 commit into vllm-project:main
/cc @ZhanqiuHu

Code Review
This pull request removes the tracking and reporting of recomputed tokens from the metrics system. Specifically, it deletes the vllm:prompt_tokens_recomputed metric, simplifies the PromptTokenStats class by removing the recomputation logic and its associated invariants, and updates the relevant tests to reflect these changes. I have no feedback to provide.
In the case of a full local prefix cache hit (prompt length N),
we actually only use N-1 tokens. The `vllm:prompt_tokens_recomputed`
metric was intended to count how many cached tokens we are effectively
discarding because of this.
```
KVCacheManager.get_computed_blocks():
...
# NOTE: When all tokens hit the cache, we must recompute the last token
# to obtain logits. [...]
max_cache_hit_length = request.num_tokens - 1
```
However, even here, we can't assume the last token would have been
a cache hit, so counting it as "recomputed" is not justified. In
retrospect, the metric seems quite misguided.
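The N-1 capping behavior quoted above can be sketched as follows. This is an illustrative standalone function, not the actual `KVCacheManager` implementation; the name `max_cache_hit_length` is borrowed from the snippet, but the signature is hypothetical.

```python
def max_cache_hit_length(num_prompt_tokens: int, num_cached_tokens: int) -> int:
    """Return how many prompt tokens may be served from the prefix cache.

    Even when every prompt token is cached, the last token must be
    recomputed so the model produces logits for sampling, so the cache
    hit is capped at num_prompt_tokens - 1.
    """
    return min(num_cached_tokens, num_prompt_tokens - 1)


# Full hit: prompt of length 8 with all 8 tokens cached -> only 7 usable;
# the 8th is the token the metric would have counted as "recomputed".
print(max_cache_hit_length(8, 8))  # 7

# Partial hit: 5 of 8 tokens cached -> all 5 are usable, nothing recomputed.
print(max_cache_hit_length(8, 5))  # 5
```

Note that only in the full-hit case does the cap bite, which is why the "recomputed" count was at most 1 per request.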
The metric was added as a side-effect in vllm-project#33290 in order to make
sense of the fact that:
```
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"}
```
will include a token that is recomputed. See this comment:
> Note: external_kv_transfer reports the actual number of tokens
> transferred (e.g., prompt length N), while prompt_tokens_cached_total
> reports the adjusted count (e.g., N-1). The last token is both
> transferred AND recomputed locally, so there's overlap.
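The overlap described in that comment amounts to a one-token discrepancy, which a small worked example makes concrete (illustrative arithmetic only, not vLLM code):

```python
# For a full external KV transfer of a prompt with N tokens:
N = 8
tokens_transferred = N       # external_kv_transfer reports the raw transfer count
tokens_cached = N - 1        # cached count is adjusted: the last token is recomputed

# The single token that is both transferred AND recomputed locally.
overlap = tokens_transferred - tokens_cached
print(tokens_transferred, tokens_cached, overlap)  # 8 7 1
```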
However, it makes more sense for the `external_kv_transfer` count to
reflect only tokens we actually used, not any recomputed tokens. This
will be done in vllm-project#37460.
I'm not aware of any user demand for this metric, or anyone relying
on it now. So it seems safe to remove it, rather than go through
a deprecation period.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>