
[Bugfix][P/D] Fix throughput stats in disaggregated setup #27569

Closed

NickLucche wants to merge 7 commits into vllm-project:main from NickLucche:fix-throughput-stats

Conversation

@NickLucche (Collaborator)

Fix prompt throughput stats in the CLI logger by only accounting for tokens that were prefilled locally.

In a P/D setup, the KV cache is copied over from P to D, which currently results in the following output on the decode side:

# Sending 2 different reqs one after another
(APIServer pid=3322894) INFO:     Started server process [3322894]
(APIServer pid=3322894) INFO:     Waiting for application startup.
(APIServer pid=3322894) INFO:     Application startup complete.
(APIServer pid=3322894) INFO:     127.0.0.1:38042 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3322894) INFO 10-27 11:21:40 [loggers.py:208] Engine 000: Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
(APIServer pid=3322894) INFO 10-27 11:21:50 [loggers.py:208] Engine 000: Avg prompt throughput: 10.3 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

which is plain wrong, given that we have not actually prefilled those tokens but just "copied" them over.

After this PR:

# Same setup
(APIServer pid=3318553) INFO:     Started server process [3318553]
(APIServer pid=3318553) INFO:     Waiting for application startup.
(APIServer pid=3318553) INFO:     Application startup complete.
(APIServer pid=3318553) INFO:     127.0.0.1:54610 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:05 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

(APIServer pid=3318553) INFO:     127.0.0.1:35638 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:15 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
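
Conceptually, the change just tracks the prompt tokens that were actually prefilled on this instance and derives the console throughput from that count. A minimal sketch of the idea (illustrative only; the class and the record_prefill helper are assumptions, not the actual diff):

```python
# Illustrative sketch only, not the actual vLLM change.
from dataclasses import dataclass


@dataclass
class IterationStatsSketch:
    num_prompt_tokens: int = 0        # all prompt tokens seen this iteration
    num_local_prompt_tokens: int = 0  # only the tokens prefilled on this instance

    def record_prefill(
        self, num_prompt_tokens: int, num_external_computed_tokens: int
    ) -> None:
        self.num_prompt_tokens += num_prompt_tokens
        # Tokens whose KV cache arrived over the connector were not prefilled
        # here, so they are excluded from the local count used for throughput.
        self.num_local_prompt_tokens += (
            num_prompt_tokens - num_external_computed_tokens
        )


stats = IterationStatsSketch()
stats.record_prefill(num_prompt_tokens=512, num_external_computed_tokens=512)
print(stats.num_local_prompt_tokens)  # 0 -> a pure decode instance reports 0.0 prompt tokens/s
```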

@gemini-code-assist (Contributor, Bot) left a comment


Code Review

This pull request addresses a bug in the prompt throughput stats calculation for disaggregated setups, where tokens copied from the prefill instance to the decode instance were incorrectly counted as prefilled tokens. The changes involve modifying the scheduler to track locally prefilled tokens and updating the logging to reflect the corrected throughput. The review focuses on ensuring the correctness of the fix and the clarity of the code changes.

Comment thread vllm/v1/metrics/loggers.py
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/metrics/stats.py
@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment on lines +1496 to +1497
# Prefill is to be recomputed locally.
request.num_external_computed_tokens = 0
Collaborator (Author)


@sdavidbd can you please double-check this? My understanding is that we have to re-compute the whole prefill now so we can track prompt throughput.

Member


Even from the docstring, it seems clear we don't always re-compute the whole prefill:

This method scans the given requests, detects those with invalid blocks and adjusts their num_computed_tokens to the longest valid prefix.

A few things:

  1. I think num_external_computed_tokens = 0 should only happen inside the not marked_invalid_block clause - here we're saying all externally computed blocks are invalid
  2. (Unrelated to this PR - an observation) Setting request.num_computed_tokens = request.num_cached_tokens on line 1489 doesn't make sense to me - since num_cached_tokens includes both local and external computed tokens?
  3. We should update num_external_computed_tokens at # Truncate the computed tokens at the first failed block - to something like request.num_computed_tokens - local_computed_tokens (but not obvious how we calculated local_computed_tokens)

Contributor


@NickLucche — as @markmc noted, we don’t recompute the entire prefill. Only the externally computed tokens starting from the first failed block are recomputed.

To correctly update num_external_computed_tokens, we should first determine how many externally computed tokens are affected. This can be derived from the delta between the original and truncated num_computed_tokens — the same tokens already aggregated in total_affected_tokens (lines 1473–1477):

# Truncate the computed tokens at the first failed block
request.num_computed_tokens = idx * self.block_size
num_affected_tokens = req_num_computed_tokens - request.num_computed_tokens
total_affected_tokens += num_affected_tokens
request.num_external_computed_tokens -= num_affected_tokens

Contributor


@markmc — regarding your points:

  1. The not marked_invalid_block condition covers the sync-loading edge case where a request is affected by externally computed tokens that failed to load but are shared with preceding requests that will handle their recomputation. In this situation, the affected request still treats those tokens as locally computed, so its num_external_computed_tokens remains unchanged.

For example, assuming block_size = 1 and the following prompts (with R1 preceding R2 in the batch):

R1: t1 t2 t3
R2: t1 t2 t4 t5

Suppose t1 is locally computed, t2 and t4 are externally computed, and t2 fails to load while t4 succeeds. Then:

Before failure

| Request | num_computed_tokens | num_external_computed_tokens |
|---------|---------------------|------------------------------|
| R1      | 2                   | 1                            |
| R2      | 3                   | 1                            |

After failure

| Request | num_computed_tokens | num_external_computed_tokens |
|---------|---------------------|------------------------------|
| R1      | 1                   | 0                            |
| R2      | 3                   | 1                            |

Both R1 and R2 are affected and will recompute t2, t3 and t5 respectively, but R2’s total number of computed tokens remains unchanged.

  2. Correct — num_cached_tokens represents the total number of computed tokens (both local and external). Setting num_computed_tokens = num_cached_tokens ensures that all new tokens are recomputed in the current iteration, since the previous num_computed_tokens value already included them.

  3. Agreed — see my suggested code changes above for how we update num_external_computed_tokens accordingly.

@@ -121,6 +121,8 @@ class EngineCoreOutput(
trace_headers: Mapping[str, str] | None = None
# The number of tokens with prefix cache hits.
Member


Yeah, this comment looks incorrect ... assuming "prefix cache" refers to the local cache?

    # Total computed tokens (local + external).
    num_computed_tokens = (
        num_new_local_computed_tokens + num_external_computed_tokens
    )
...
# Count the number of prefix cached tokens.
if request.num_cached_tokens < 0:
    request.num_cached_tokens = num_computed_tokens

Collaborator (Author)


I am not familiar with it, cc @chaunceyjiang.

Comment thread vllm/v1/request.py Outdated
Comment thread vllm/v1/metrics/stats.py
@@ -221,6 +221,8 @@ def __init__(self):
self.num_generation_tokens = 0
self.num_prompt_tokens = 0
self.num_preempted_reqs = 0
# Num of prompt tokens that have been computed locally.
Member


Is the naming here a bit confusing? By "computed locally" here we mean both computed and locally cached?

If you just tracked num_external_computed_tokens and then subtracted it in _track_iteration_stats() would that be more clear?
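
For concreteness, that alternative might look roughly like this (a hypothetical sketch; the class name and exact method body are assumptions, only the field names come from this thread):

```python
# Hypothetical sketch of the suggestion above, not vLLM's actual logger code.
class LoggingStatLoggerSketch:
    def __init__(self) -> None:
        self.num_generation_tokens = 0
        self.num_prompt_tokens = 0

    def _track_iteration_stats(self, iteration_stats) -> None:
        self.num_generation_tokens += iteration_stats.num_generation_tokens
        # Keep the raw prompt-token count in IterationStats and subtract the
        # externally computed tokens only where throughput is derived.
        self.num_prompt_tokens += (
            iteration_stats.num_prompt_tokens
            - iteration_stats.num_external_computed_tokens
        )
```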

Collaborator (Author)


By "computed locally" here we mean both computed and locally cached?

Yes, the behavior is unchanged: cached tokens would still result in higher throughput, even in a regular aggregated setup.

If you just tracked num_external_computed_tokens and then subtracted it in _track_iteration_stats() would that be more clear?

I think looking at the diff

self.num_prompt_tokens += iteration_stats.num_prompt_tokens
-->
self.num_prompt_tokens += iteration_stats.num_local_prompt_tokens

it is pretty clear that I just want to rule out the remote tokens, i.e. I assume this semantic was the intended one from the beginning; "local" just used to be redundant.

@@ -121,6 +121,8 @@ class EngineCoreOutput(
trace_headers: Mapping[str, str] | None = None
# The number of tokens with prefix cache hits.
num_cached_tokens: int = 0
# The number of tokens that have been computed remotely.
num_external_computed_tokens: int = 0
Member


I'd be tempted to refactor these two into a PrefillStats object ... and only include that in the ECO when the prefill completes ... especially if we ever wanted to also send like num_locally_cached_tokens too
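
For illustration, such a bundle might look roughly like this (hypothetical sketch; this type does not exist, and only the two fields discussed in this thread are shown):

```python
from dataclasses import dataclass


@dataclass
class PrefillStats:
    """Hypothetical grouping of prefill accounting, attached to the
    EngineCoreOutput only once the prefill completes."""

    num_cached_tokens: int = 0             # local prefix-cache hits
    num_external_computed_tokens: int = 0  # tokens received via KV transfer
    # Fields like num_locally_cached_tokens could be added here later
    # without widening EngineCoreOutput further.
```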

Collaborator (Author)


I don't have a strong opinion on this, tbh; we can probably wait until we have a few more things to bundle before executing the suggestion.

@@ -113,7 +113,7 @@ def _reset(self, now):

def _track_iteration_stats(self, iteration_stats: IterationStats):
Member


Presumably you want to update the Prometheus metric too?

Collaborator (Author)


@markmc which one? I intentionally left self.counter_prompt_tokens unchanged to avoid replacing the actual prompt count.
Should I just make a new one for local tokens?

@NickLucche (Collaborator, Author)

NickLucche commented Oct 29, 2025

@markmc @sdavidbd addressed the suggestions, thanks a lot for reviewing 🙏🏻

@NickLucche force-pushed the fix-throughput-stats branch from c3d6723 to 950baf4 on November 8, 2025 17:03
@NickLucche added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 9, 2025
@markmc (Member)

markmc commented Nov 11, 2025

Sorry for the delay in coming back to this.

I see more clearly now where you're coming from. On a decode instance, you want to see this:

Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s

i.e. "this vLLM instance isn't doing any prefill computation"

So, you're proposing that we subtract any prefilling done by a KV connector from the prompt throughput reported on the console.

If that's the desired behavior, the code changes lgtm now.

However, the whole thing raises a lot of other questions for me!

  • Why would you see zero on the console, but not in your Grafana dashboard showing vllm:prompt_tokens? Why should externally-computed tokens not be subtracted from vllm:iteration_tokens and vllm:request_prompt_tokens also?
  • Why do we include tokens pre-filled from the local prefix cache in Avg prompt throughput on the console?
  • Why would we exclude tokens pre-filled from the CPU-offloaded KV cache? Why are these "externally computed"?

On the other hand though, looks like you found a clear bug here:

External prefix cache hit rate: 50.0%

I have a fix for that, just need to update tests before submitting

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@NickLucche (Collaborator, Author)

I think these are all good questions, but as you noted I was really just proposing a fix for a very specific use-case.

Why would you see zero on the console, but not in your Grafana dashboard showing vllm:prompt_tokens? Why should externally-computed tokens not be subtracted from vllm:iteration_tokens and vllm:request_prompt_tokens also?

Don't have a strong opinion on Grafana. Open to addressing that in this or a follow-up PR if need be. In general my approach was: counting prompt tokens is not wrong, but trying to derive throughput from them is wrong, since the result is bonkers.

Why do we include tokens pre-filled from the local prefix cache in Avg prompt throughput on the console

To be honest I don't know and don't have a strong opinion on it; I see how, in practice, you want to show a signal rather than 0.0 throughput in the regular colocated setup. Perhaps it's better discussed in a separate issue; I was just trying to fix the disaggregated case.

Why would we exclude tokens pre-filled from the CPU-offloaded KV cache? Why are these "externally computed"?

I think in this case the rationale is actually the same as in this PR, even offloaded KV caches are not necessarily computed locally.

Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
@NickLucche (Collaborator, Author)

@markmc let me know if you have any other comments on this fix.

@markmc (Member)

markmc commented Nov 13, 2025

it's wrong trying to derive throughput from them since result is bonkers

The throughput figure is just expressing a counter as a rate. In a Grafana dashboard, we do the same with e.g.

          "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
          "legendFormat": "Prompt Tokens/Sec",

That's why I think "Prompt Throughput" in Prometheus/Grafana should mean the same thing as in the console log.

And that's just to say I think it's important that we can articulate a consistent mental model for what all of these metrics mean, whether in the console log or Prometheus.

Here's my mental model for how we're counting things now:

prompt input
   ↓ [prompt tokens]
lookup internal prefix cache
   ↓ [tokens queried and found]
lookup external connector prefix cache
   ↓ [tokens queried and found]
....
   ↓ 
generated tokens output
   ↓ [generated tokens]

With that mental model, this:

Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%

is not bonkers at all?

53.2 prompt tokens/s came in at the top, none were found in the local prefix cache, 100% were retrieved from the connector, and we generated 15 tokens/s.

If you see it differently, could you draw your mental model for these counters, in such a way that the same model can be applied consistently across use cases?

markmc added a commit to markmc/vllm that referenced this pull request Nov 14, 2025
Currently we are recording async connector prefix cache queries
and hits twice, for the same reason that update_state_after_alloc()
is called twice:

> If get_num_new_matched_tokens previously returned True for a
> request, this function may be called twice for that same request -
> first when blocks are allocated for the connector tokens to be
> asynchronously loaded into, and second when any additional blocks
> are allocated, after the load/transfer is complete.

Worse, the second time we are recording with `num_external_computed_tokens=0`
so effectively we are halving the hit rate.

Before

```
External prefix cache hit rate: 50.0%
```

After

```
External prefix cache hit rate: 100.0%
```

Borrows part of vllm-project#27569 to track `num_external_computed_tokens`
for use when the KV transfer completes.

Will use vllm-project#28550 for testing this scenario.

Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@smarterclayton (Contributor)

From an end user perspective, I would argue the top level outcome is:

I want a “prompt throughput” prometheus metric with various sources (a category / type / source label), and I want the sum of the series under that label to equal the observed throughput in terms of prompt tokens processed by this server. I want the label to be sufficiently scoped that I can identify:

  1. “only tokens computed for this request” (actually takes up compute time) - used to calculate actual load
  2. Tokens transferred in to satisfy this request from connector / logical disaggregation (plus connector / deeper source?) - used to verify kv transfer is happening
  3. Prompt tokens retrieved from cache, by tier (hbm, cpu, arbitrary tier labels) - distinguish how many tokens provided by my tiers for effectiveness

Then the metrics printed are a derivative of that end user observability outcome.

@markmc (Member)

markmc commented Dec 19, 2025

From an end user perspective, I would argue the top level outcome is:

Claude helped me out with this version of your proposal ... does it match?

(There's definitely still some information duplication with the prefix_cache_queries/hits, connector_prefix_cache_queries/hits metrics, but I could see the argument that this approach is better aligned with end-user goals)

Proposal: Labeled Prompt Token Counters

Prometheus Metrics

Replace single vllm:prompt_tokens_total with labeled counter:

vllm:prompt_tokens_total{source="local_compute"}      # Tokens prefilled locally
vllm:prompt_tokens_total{source="kv_transfer"}        # Tokens received via KV transfer
vllm:prompt_tokens_total{source="cache_hit_gpu"}      # Tokens from GPU cache
vllm:prompt_tokens_total{source="cache_hit_cpu"}      # Tokens from CPU cache

Grafana Queries

Total throughput (all tokens flowing through): rate(vllm:prompt_tokens_total[$__rate_interval])

Compute load (actual prefill work): rate(vllm:prompt_tokens_total{source="local_compute"}[$__rate_interval])

KV transfer verification: rate(vllm:prompt_tokens_total{source="kv_transfer"}[$__rate_interval])

Cache effectiveness: rate(vllm:prompt_tokens_total{source=~"cache_hit_.*"}[$__rate_interval])

CLI Logger

Display breakdown: Avg prompt throughput: 53.2 tokens/s (0.0 local, 53.2 transferred)

Benefits

  • No information loss: Both perspectives available
  • Flexible querying: Users choose what to measure
  • Clear semantics: Source label disambiguates token origin
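
For reference, a counter along those lines could be declared with prometheus_client roughly as follows (a sketch of the proposal above, not vLLM's existing metrics code; the label values are the proposed ones):

```python
from prometheus_client import Counter

# Sketch only: one prompt-token counter, disambiguated by a "source" label.
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total",
    "Prompt tokens processed, broken down by source.",
    labelnames=["source"],
)

# On a decode instance in a P/D setup, most prompt tokens arrive via KV transfer:
prompt_tokens_total.labels(source="kv_transfer").inc(512)
# ...while tokens actually prefilled on this instance increment a separate series:
prompt_tokens_total.labels(source="local_compute").inc(32)
```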

@smarterclayton (Contributor)

smarterclayton commented Jan 7, 2026

Roughly yes

vllm:prompt_tokens_total{source="cache_hit_gpu"}      # Tokens from GPU cache
vllm:prompt_tokens_total{source="cache_hit_cpu"}      # Tokens from CPU cache

We are likely to have more cache hit sources in the future. I'd suggest two labels, source=cache_hit and cache_source=cpu|gpu|<empty>. Prometheus convention would probably be to keep the labels orthogonal (i.e. cache_source and source are probably less preferable than two labels that don't share _source), but I don't have an alternative suggestion.

Otherwise, that structure looks like what I would expect

EDIT: One note, we should pick values for cache_source consistent with the actual source as defined in our cache hierarchy code, (cpu and gpu may be too generic, especially given that there are likely more levels of GPU hierarchy in future hardware generations and there may be multiple CPU tiers). Ideally, we use a constant defined by the cache plugin hierarchy itself for cache source.
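
As a sketch of that shape (label names and values are only illustrative; as noted above, the real tier values should come from the cache hierarchy code):

```python
from prometheus_client import Counter

# Sketch: orthogonal labels, with cache_source left empty for non-cache sources.
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total",
    "Prompt tokens processed, by source and cache tier.",
    labelnames=["source", "cache_source"],
)

prompt_tokens_total.labels(source="cache_hit", cache_source="gpu").inc(128)
prompt_tokens_total.labels(source="local_compute", cache_source="").inc(64)
```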

@NickLucche (Collaborator, Author)

@smarterclayton

I'm ok with fleshing out the current metric; I think end-user clarity would benefit from it. We can repurpose this PR to implement at least the compute/transfer split, I suppose.

Re: cache source, why would we move away from internal/external, where internal is co-located HBM w.r.t. the current EngineCore and external is anything else?

@smarterclayton (Contributor)

Re: cache source:

An operator who has multiple external sources would want to know which source it is. I'm not arguing for changing labels that exist, just ensuring that we can attribute the external contribution appropriately. I agree that preserving existing labels without disruption is better than removing them.

@markmc moved this from Backlog to P1 in Metrics & Tracing, Jan 13, 2026
@mergify (Bot) added the bug label (Something isn't working), Jan 14, 2026
@mergify (Bot, Contributor)

mergify Bot commented Jan 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ZhanqiuHu (Contributor)

I created a follow-up issue here #33289 (PR: #33290). Would appreciate any feedback!

@markmc (Member)

markmc commented Feb 4, 2026

#33290 has merged, so I think we can close this now. Cool!

@markmc closed this Feb 4, 2026
@markmc moved this from P1 to Done in Metrics & Tracing, Feb 4, 2026

Labels

bug (Something isn't working), kv-connector, needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

6 participants