feat(metrics): Add prefill KV compute metric excluding cached tokens #30189

ApostaC merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces a new metric, vllm:request_prefill_kv_computed_tokens, to track the number of KV tokens computed during the prefill phase, excluding any tokens served from the cache. The changes are well-implemented, adding the num_cached_tokens field to FinishedRequestStats and plumbing it through from the output processor. A new histogram is added to the Prometheus logger to record this metric, correctly calculating it as the difference between prompt tokens and cached tokens. The inclusion of comprehensive unit tests covering various scenarios, including edge cases, ensures the reliability of this new feature. The code is clear, follows existing patterns, and improves the observability of cache effectiveness. Overall, this is a solid contribution.
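The plumbing described in this review can be sketched as follows. This is a simplified illustration based on the review's description, not the actual vLLM implementation; the real `FinishedRequestStats` has more fields, and the calculation lives in the Prometheus logger.

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    # Illustrative subset of fields; the real dataclass has more.
    num_prompt_tokens: int = 0
    # New field added by this PR: tokens served from the prefix cache.
    num_cached_tokens: int = 0


def prefill_kv_computed_tokens(stats: FinishedRequestStats) -> int:
    # KV tokens actually computed during prefill = prompt tokens minus
    # cached tokens, clamped so a negative cache count cannot inflate it.
    return stats.num_prompt_tokens - max(stats.num_cached_tokens, 0)
```

With a 10,000-token prompt and a 1,200-token prefix cache hit, this yields 8,800 computed tokens, matching the example in the PR description.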
Force-pushed from 17b00c9 to 9a5fc4d
Hi @ziliangpeng, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits,
Add new Prometheus metric `vllm:request_prefill_kv_computed_tokens` to track the number of new KV cache tokens computed during the prefill phase, excluding tokens served from the prefix cache. This metric helps measure the actual compute workload during prefill, accounting for prefix cache hits. It correctly handles:

- Prefix caching (excludes cached tokens)
- Chunked prefill (counts total prompt tokens, not per-chunk)
- Edge cases (negative values, no cache)

Changes:

- Add `num_cached_tokens` field to `FinishedRequestStats`
- Pass `num_cached_tokens` from `RequestState` through the stats pipeline
- Calculate prefill KV compute as `num_prompt_tokens - num_cached_tokens`
- Add Prometheus histogram metric with standard buckets
- Add comprehensive unit tests covering cache hits, no cache, and edge cases

Example:

- Request with 10,000 token prompt
- Prefix cache hit: 1,200 tokens
- Metric reports: 8,800 tokens (10,000 - 1,200)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Ziliang Peng <ziliang@character.ai>
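The histogram recording can be sketched in standalone Python. The bucket boundaries below are illustrative only (the commit message says merely "standard buckets"); vLLM's actual logger classes and bucket values differ.

```python
import bisect

# Illustrative "le" upper bounds; vLLM's real bucket values are an
# assumption here, not taken from the PR.
BUCKETS = [1, 10, 100, 1_000, 10_000, 100_000]
counts = [0] * (len(BUCKETS) + 1)  # final slot acts as the +Inf bucket


def observe_prefill_kv_computed(num_prompt_tokens: int,
                                num_cached_tokens: int) -> int:
    # Value recorded per finished request: prompt tokens minus cached
    # tokens, with negative cache counts clamped to zero.
    value = num_prompt_tokens - max(num_cached_tokens, 0)
    counts[bisect.bisect_left(BUCKETS, value)] += 1
    return value


# Example from the commit message: 10,000-token prompt, 1,200 cached.
observe_prefill_kv_computed(10_000, 1_200)  # records 8,800
```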
Force-pushed from 9a5fc4d to 34d07c5
@ApostaC - do you maintain the metrics subsystem? No. Please ensure that relevant reviewers have a chance to review the PR before merging.
Hey @robertgshaw2-redhat, apologies if this caused any inconvenience. I found this PR clear and easy to understand while doing PR on-call, so I did a review. @markmc, can you take a look at this and see if there is anything we don't want or that needs to be changed? We can revert if needed. @ziliangpeng Sorry for the confusion 🙏.
no problem! let's do what's best for the project.
I haven't reviewed the code. I'm sure Claude did just fine. But I do wonder how this relates to the collective, big picture of all of our metrics ... and that's not so trivial to think through. I've added an auto-generated list of all of our metrics here: https://docs.vllm.ai/en/latest/usage/metrics.html So, existing metrics that are relevant here are:
In a recent PR, I drew this as my mental model for the above metrics:

So what have we added here? The per-request equivalent of:

If I had seen this PR, I think I would have asked
Summary
This PR adds a new metric `vllm:request_prefill_kv_computed_tokens` that tracks the number of KV tokens computed during the prefill phase, excluding cached tokens.

Motivation
Currently, vLLM tracks total prompt tokens (`vllm:request_prompt_tokens`) but doesn't have per-request visibility into how many KV tokens were actually computed versus served from cache (local prefix cache or a remote KV cache like LMCache). This metric helps:

Changes
- Add `num_cached_tokens` field to the `FinishedRequestStats` dataclass
- Update `update_from_finished_request()` to accept a `num_cached_tokens` parameter
- Add histogram `vllm:request_prefill_kv_computed_tokens` in the metrics loggers
- Compute the value as `num_prompt_tokens - max(num_cached_tokens, 0)`

Testing
Unit tests added in `tests/v1/metrics/test_stats.py`. The metric correctly includes cache hits from both the local prefix cache and remote KV stores (KV connector, LMCache).
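The case coverage can be sketched as plain assertions. This is a hedged illustration of the scenarios described above; the real tests in `tests/v1/metrics/test_stats.py` exercise vLLM's own stats types rather than this hypothetical helper.

```python
def prefill_kv_computed(num_prompt_tokens: int, num_cached_tokens: int) -> int:
    # Same formula as the Changes list: clamp negative cache counts to 0.
    return num_prompt_tokens - max(num_cached_tokens, 0)


assert prefill_kv_computed(10_000, 1_200) == 8_800  # prefix cache hit
assert prefill_kv_computed(10_000, 0) == 10_000     # no cache
assert prefill_kv_computed(100, -5) == 100          # negative count clamped
```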
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>