
feat(metrics): Add prefill KV compute metric excluding cached tokens#30189

Merged

ApostaC merged 1 commit into vllm-project:main from ziliangpeng:feat-prefill-kv-metric on Dec 9, 2025

Conversation

@ziliangpeng
Contributor

Summary

This PR adds a new metric, `vllm:request_prefill_kv_computed_tokens`, that tracks the number of KV tokens computed during the prefill phase, excluding cached tokens.

Motivation

Currently, vLLM tracks total prompt tokens (`vllm:request_prompt_tokens`) but has no per-request visibility into how many KV tokens were actually computed versus served from cache (the local prefix cache or a remote KV cache such as LMCache). This metric helps:

  • Understand cache effectiveness on a per-request basis
  • Better estimate actual compute costs vs total prompt size
  • Debug and optimize caching strategies
  • Monitor workload characteristics more accurately

Changes

  • Added a `num_cached_tokens` field to the `FinishedRequestStats` dataclass
  • Updated `update_from_finished_request()` to accept a `num_cached_tokens` parameter
  • Added a new histogram metric, `vllm:request_prefill_kv_computed_tokens`, in the metrics loggers
  • Metric calculation: `num_prompt_tokens - max(num_cached_tokens, 0)`
  • Added comprehensive unit tests
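The calculation above can be sketched roughly as follows. This is a simplified illustration, not the actual vLLM implementation: the class and field names follow the PR description, but the surrounding stats-pipeline plumbing is elided.

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    """Per-request stats emitted when a request finishes (simplified sketch)."""
    num_prompt_tokens: int = 0
    num_cached_tokens: int = 0  # field added by this PR


def prefill_kv_computed_tokens(stats: FinishedRequestStats) -> int:
    # Tokens actually computed during prefill, excluding cache hits.
    # max(..., 0) guards against a negative cached-token count, so the
    # metric never exceeds the prompt size.
    return stats.num_prompt_tokens - max(stats.num_cached_tokens, 0)


stats = FinishedRequestStats(num_prompt_tokens=10_000, num_cached_tokens=1_200)
print(prefill_kv_computed_tokens(stats))  # 8800
```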

Testing

  • Added unit tests in tests/v1/metrics/test_stats.py:
    • Test with prefix cache hits
    • Test without cache
    • Test edge cases (negative values, all tokens cached)
  • Verified in production workloads showing expected cache effectiveness
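A minimal sketch of what those test cases check, using a hypothetical standalone helper (the real tests in `tests/v1/metrics/test_stats.py` exercise vLLM's actual stats classes):

```python
def computed_tokens(num_prompt_tokens: int, num_cached_tokens: int) -> int:
    """Prefill KV tokens actually computed, excluding cache hits."""
    return num_prompt_tokens - max(num_cached_tokens, 0)


def test_prefix_cache_hit():
    assert computed_tokens(1000, 300) == 700  # 300 tokens served from cache

def test_no_cache():
    assert computed_tokens(1000, 0) == 1000   # everything computed

def test_all_tokens_cached():
    assert computed_tokens(1000, 1000) == 0   # full prefix hit

def test_negative_cached_value_clamped():
    # A negative value is clamped to 0, so the metric
    # never exceeds the prompt size.
    assert computed_tokens(1000, -1) == 1000


for t in (test_prefix_cache_hit, test_no_cache,
          test_all_tokens_cached, test_negative_cached_value_clamped):
    t()
```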

The metric correctly includes cache hits from both local prefix cache and remote KV stores (KV connector, LMCache).

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

@ziliangpeng ziliangpeng requested a review from markmc as a code owner December 6, 2025 19:36
@mergify mergify Bot added the v1 label Dec 6, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new metric, vllm:request_prefill_kv_computed_tokens, to track the number of KV tokens computed during the prefill phase, excluding any tokens served from the cache. The changes are well-implemented, adding the num_cached_tokens field to FinishedRequestStats and plumbing it through from the output processor. A new histogram is added to the Prometheus logger to record this metric, correctly calculating it as the difference between prompt tokens and cached tokens. The inclusion of comprehensive unit tests covering various scenarios, including edge cases, ensures the reliability of this new feature. The code is clear, follows existing patterns, and improves the observability of cache effectiveness. Overall, this is a solid contribution.

@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 17b00c9 to 9a5fc4d Compare December 6, 2025 19:40
@mergify
Contributor

mergify Bot commented Dec 6, 2025

Hi @ziliangpeng, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Add new Prometheus metric `vllm:request_prefill_kv_computed_tokens` to
track the number of new KV cache tokens computed during the prefill
phase, excluding tokens served from prefix cache.

This metric helps measure actual compute workload during prefill,
accounting for prefix cache hits. It correctly handles:
- Prefix caching (excludes cached tokens)
- Chunked prefill (counts total prompt tokens, not per-chunk)
- Edge cases (negative values, no cache)

Changes:
- Add `num_cached_tokens` field to `FinishedRequestStats`
- Pass `num_cached_tokens` from `RequestState` through stats pipeline
- Calculate prefill KV compute as `num_prompt_tokens - num_cached_tokens`
- Add Prometheus histogram metric with standard buckets
- Add comprehensive unit tests covering cache hits, no cache, and edge cases

Example:
  Request with 10,000 token prompt
  Prefix cache hit: 1,200 tokens
  Metric reports: 8,800 tokens (10,000 - 1,200)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Ziliang Peng <ziliang@character.ai>
@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 9a5fc4d to 34d07c5 Compare December 6, 2025 19:50
Collaborator

@ApostaC ApostaC left a comment


LGTM!

@ApostaC ApostaC enabled auto-merge (squash) December 8, 2025 21:47
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 8, 2025
@ApostaC ApostaC merged commit f1599ca into vllm-project:main Dec 9, 2025
46 checks passed
@robertgshaw2-redhat
Collaborator

@ApostaC - do you maintain the metrics subsystem? No. Please ensure that relevant reviewers have a chance to review the PR before merging.

@ApostaC
Collaborator

ApostaC commented Dec 9, 2025

Hey @robertgshaw2-redhat, apologies for any inconvenience caused. I found this PR clear and easy to understand while doing on-call PR review, so I reviewed it.

@markmc Can you take a look at this and see if there is anything we don't want or that needs changing? We can revert if needed.

@ziliangpeng Sorry for the confusion 🙏.

@ziliangpeng
Contributor Author

> Hey @robertgshaw2-redhat, apologies for any inconvenience caused. I found this PR clear and easy to understand while doing on-call PR review, so I reviewed it.
>
> @markmc Can you take a look at this and see if there is anything we don't want or that needs changing? We can revert if needed.
>
> @ziliangpeng Sorry for the confusion 🙏.

no problem! let's do what's best for the project.

@markmc
Copy link
Copy Markdown
Member

markmc commented Dec 12, 2025

I haven't reviewed the code. I'm sure Claude did just fine. But I do wonder how this relates to the collective, big picture of all of our metrics ... and that's not so trivial to think through.

I've added an auto-generated list of all of our metrics here - https://docs.vllm.ai/en/latest/usage/metrics.html

So, existing metrics that are relevant here are:

  • vllm:prompt_tokens (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:prefix_cache_queries, vllm:prefix_cache_hits (Counters)
  • vllm:external_prefix_cache_queries, vllm:external_prefix_cache_hits (Counters)

In a recent PR, I drew this as my mental model for the above metrics:

prompt input
   ↓ [prompt tokens]
lookup internal prefix cache
   ↓ [tokens queried and found]
lookup external connector prefix cache
   ↓ [tokens queried and found]
....
   ↓ 
generated tokens output
   ↓ [generated tokens]

So what have we added here? The per-request equivalent of:

prompt_tokens - prefix_cache_hits - external_prefix_cache_hits

If I had seen this PR, I think I would have asked

  1. What is the justification for recording this per-request (given that it adds 10-15 individual time-series for the Histogram)? What actionable information will it give you beyond the prefix cache lookup rates (hits/queries), and what actions would you or another user take based on them? How niche/common is this use case?
  2. If per-request information is useful, why not add per-request prefix cache lookup rates? This would seem to be more consistent?
  3. Different naming might help users understand how these metrics relate to each other - e.g. vllm:request_prompt_cache_misses or something?
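The cardinality concern in question 1 can be made concrete: a Prometheus Histogram with N finite bucket edges exports N+1 `_bucket` time series (including the implicit `+Inf` bucket) plus the `_sum` and `_count` series. The bucket edges below are illustrative only; vLLM's actual edges for this metric are not specified here.

```python
# Illustrative bucket edges; vLLM's actual edges for this metric may differ.
bucket_edges = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]


def histogram_series_count(edges: list[int]) -> int:
    # One `_bucket` series per finite edge, one for `+Inf`,
    # plus the `_sum` and `_count` series.
    return len(edges) + 1 + 2


print(histogram_series_count(bucket_edges))  # 16
```

So every label combination on such a histogram multiplies roughly this many series, which is the per-request-histogram cost being weighed against plain hit/query counters.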

@github-project-automation github-project-automation Bot moved this from Backlog to Done in Metrics & Tracing Dec 19, 2025
@markmc markmc moved this from Done to Done 0.13 in Metrics & Tracing Feb 4, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done - 0.13
