
feat(metrics): Add prefill KV compute metric excluding cached tokens#30189

Merged

ApostaC merged 1 commit into vllm-project:main from ziliangpeng:feat-prefill-kv-metric on Dec 9, 2025

Conversation

@ziliangpeng
Contributor

Summary

This PR adds a new metric, `vllm:request_prefill_kv_computed_tokens`, that tracks the number of KV tokens computed during the prefill phase, excluding cached tokens.

Motivation

Currently, vLLM tracks total prompt tokens (`vllm:request_prompt_tokens`) but has no per-request visibility into how many KV tokens were actually computed versus served from cache (the local prefix cache or a remote KV cache such as LMCache). This metric helps:

  • Understand cache effectiveness on a per-request basis
  • Better estimate actual compute costs vs total prompt size
  • Debug and optimize caching strategies
  • Monitor workload characteristics more accurately

Changes

  • Added a `num_cached_tokens` field to the `FinishedRequestStats` dataclass
  • Updated `update_from_finished_request()` to accept a `num_cached_tokens` parameter
  • Added a new histogram metric, `vllm:request_prefill_kv_computed_tokens`, in the metrics loggers
  • Metric calculation: `num_prompt_tokens - max(num_cached_tokens, 0)`
  • Added comprehensive unit tests
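The calculation above can be sketched roughly as follows. This is a simplified illustration, not the actual vLLM implementation: the class and field names follow the PR description, but the surrounding stats-pipeline plumbing is elided.

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    """Per-request stats emitted when a request finishes (simplified sketch)."""
    num_prompt_tokens: int = 0
    num_cached_tokens: int = 0  # field added by this PR


def prefill_kv_computed_tokens(stats: FinishedRequestStats) -> int:
    # Tokens actually computed during prefill, excluding cache hits.
    # max(..., 0) guards against a negative cached-token count, so the
    # metric never exceeds the prompt size.
    return stats.num_prompt_tokens - max(stats.num_cached_tokens, 0)


stats = FinishedRequestStats(num_prompt_tokens=10_000, num_cached_tokens=1_200)
print(prefill_kv_computed_tokens(stats))  # 8800
```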

Testing

  • Added unit tests in tests/v1/metrics/test_stats.py:
    • Test with prefix cache hits
    • Test without cache
    • Test edge cases (negative values, all tokens cached)
  • Verified in production workloads showing expected cache effectiveness
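A minimal sketch of what those test cases check, using a hypothetical standalone helper (the real tests in `tests/v1/metrics/test_stats.py` exercise vLLM's actual stats classes):

```python
def computed_tokens(num_prompt_tokens: int, num_cached_tokens: int) -> int:
    """Prefill KV tokens actually computed, excluding cache hits."""
    return num_prompt_tokens - max(num_cached_tokens, 0)


def test_prefix_cache_hit():
    assert computed_tokens(1000, 300) == 700  # 300 tokens served from cache

def test_no_cache():
    assert computed_tokens(1000, 0) == 1000   # everything computed

def test_all_tokens_cached():
    assert computed_tokens(1000, 1000) == 0   # full prefix hit

def test_negative_cached_value_clamped():
    # A negative value is clamped to 0, so the metric
    # never exceeds the prompt size.
    assert computed_tokens(1000, -1) == 1000


for t in (test_prefix_cache_hit, test_no_cache,
          test_all_tokens_cached, test_negative_cached_value_clamped):
    t()
```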

The metric correctly includes cache hits from both local prefix cache and remote KV stores (KV connector, LMCache).

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

@ziliangpeng ziliangpeng requested a review from markmc as a code owner December 6, 2025 19:36
@mergify mergify Bot added the v1 label Dec 6, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new metric, vllm:request_prefill_kv_computed_tokens, to track the number of KV tokens computed during the prefill phase, excluding any tokens served from the cache. The changes are well-implemented, adding the num_cached_tokens field to FinishedRequestStats and plumbing it through from the output processor. A new histogram is added to the Prometheus logger to record this metric, correctly calculating it as the difference between prompt tokens and cached tokens. The inclusion of comprehensive unit tests covering various scenarios, including edge cases, ensures the reliability of this new feature. The code is clear, follows existing patterns, and improves the observability of cache effectiveness. Overall, this is a solid contribution.

@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 17b00c9 to 9a5fc4d Compare December 6, 2025 19:40
@mergify
Contributor

mergify Bot commented Dec 6, 2025

Hi @ziliangpeng, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Add new Prometheus metric `vllm:request_prefill_kv_computed_tokens` to
track the number of new KV cache tokens computed during the prefill
phase, excluding tokens served from prefix cache.

This metric helps measure actual compute workload during prefill,
accounting for prefix cache hits. It correctly handles:
- Prefix caching (excludes cached tokens)
- Chunked prefill (counts total prompt tokens, not per-chunk)
- Edge cases (negative values, no cache)

Changes:
- Add `num_cached_tokens` field to `FinishedRequestStats`
- Pass `num_cached_tokens` from `RequestState` through stats pipeline
- Calculate prefill KV compute as `num_prompt_tokens - num_cached_tokens`
- Add Prometheus histogram metric with standard buckets
- Add comprehensive unit tests covering cache hits, no cache, and edge cases

Example:
  Request with 10,000 token prompt
  Prefix cache hit: 1,200 tokens
  Metric reports: 8,800 tokens (10,000 - 1,200)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Ziliang Peng <ziliang@character.ai>
@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 9a5fc4d to 34d07c5 Compare December 6, 2025 19:50
Collaborator

@ApostaC ApostaC left a comment


LGTM!

@ApostaC ApostaC enabled auto-merge (squash) December 8, 2025 21:47
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 8, 2025
@ApostaC ApostaC merged commit f1599ca into vllm-project:main Dec 9, 2025
46 checks passed
@robertgshaw2-redhat
Collaborator

@ApostaC - do you maintain the metrics subsystem? No. Please ensure that relevant reviewers have a chance to review the PR before merging.

@ApostaC
Collaborator

ApostaC commented Dec 9, 2025

Hey @robertgshaw2-redhat, apologies for any inconvenience caused. I found this PR clear and easy to understand while doing on-call PR review, so I reviewed it.

@markmc Can you take a look at this and see if there is anything we don't want or that needs changing? We can revert if needed.

@ziliangpeng Sorry for the confusion 🙏.

@ziliangpeng
Contributor Author

> Hey @robertgshaw2-redhat, apologies for any inconvenience caused. I found this PR clear and easy to understand while doing on-call PR review, so I reviewed it.
>
> @markmc Can you take a look at this and see if there is anything we don't want or that needs changing? We can revert if needed.
>
> @ziliangpeng Sorry for the confusion 🙏.

no problem! let's do what's best for the project.

@markmc
Copy link
Copy Markdown
Member

markmc commented Dec 12, 2025

I haven't reviewed the code. I'm sure Claude did just fine. But I do wonder how this relates to the collective, big picture of all of our metrics ... and that's not so trivial to think through.

I've added an auto-generated list of all of our metrics here - https://docs.vllm.ai/en/latest/usage/metrics.html

So, existing metrics that are relevant here are:

  • vllm:prompt_tokens (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:prefix_cache_queries, vllm:prefix_cache_hits (Counters)
  • vllm:external_prefix_cache_queries, vllm:external_prefix_cache_hits (Counters)

In a recent PR, I drew this as my mental model for the above metrics:

prompt input
   ↓ [prompt tokens]
lookup internal prefix cache
   ↓ [tokens queried and found]
lookup external connector prefix cache
   ↓ [tokens queried and found]
....
   ↓ 
generated tokens output
   ↓ [generated tokens]

So what have we added here? The per-request equivalent of:

prompt_tokens - prefix_cache_hits - external_prefix_cache_hits

If I had seen this PR, I think I would have asked

  1. What is the justification for recording this per-request (given that it adds 10-15 individual time-series for the Histogram)? What actionable information will it give you beyond the prefix cache lookup rates (hits/queries), and what actions would you or another user take based on them? How niche/common is this use case?
  2. If per-request information is useful, why not add per-request prefix cache lookup rates? This would seem to be more consistent?
  3. Different naming might help users understand how these metrics relate to each other - e.g. vllm:request_prompt_cache_misses or something?
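The cardinality concern in question 1 can be made concrete: a Prometheus Histogram with N finite bucket edges exports N+1 `_bucket` time series (including the implicit `+Inf` bucket) plus the `_sum` and `_count` series. The bucket edges below are illustrative only; vLLM's actual edges for this metric are not specified here.

```python
# Illustrative bucket edges; vLLM's actual edges for this metric may differ.
bucket_edges = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]


def histogram_series_count(edges: list[int]) -> int:
    # One `_bucket` series per finite edge, one for `+Inf`,
    # plus the `_sum` and `_count` series.
    return len(edges) + 1 + 2


print(histogram_series_count(bucket_edges))  # 16
```

So every label combination on such a histogram multiplies roughly this many series, which is the per-request-histogram cost being weighed against plain hit/query counters.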

@github-project-automation github-project-automation Bot moved this from Backlog to Done in Metrics & Tracing Dec 19, 2025
@markmc markmc moved this from Done to Done 0.13 in Metrics & Tracing Feb 4, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done - 0.13
