
[Bugfix][P/D] Fix throughput stats in disaggregated setup #27569

Closed

NickLucche wants to merge 7 commits into vllm-project:main from NickLucche:fix-throughput-stats

Conversation

@NickLucche (Collaborator)

Fix prompt throughput stats in the CLI logger by only accounting for tokens that were prefilled locally.

In a P/D setup, the KV cache is copied over from P to D, which currently results in the following output on the decode side:

# Sending 2 different reqs one after another
(APIServer pid=3322894) INFO:     Started server process [3322894]
(APIServer pid=3322894) INFO:     Waiting for application startup.
(APIServer pid=3322894) INFO:     Application startup complete.
(APIServer pid=3322894) INFO:     127.0.0.1:38042 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3322894) INFO 10-27 11:21:40 [loggers.py:208] Engine 000: Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
(APIServer pid=3322894) INFO 10-27 11:21:50 [loggers.py:208] Engine 000: Avg prompt throughput: 10.3 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

which is plain wrong, given that we have not actually prefilled those tokens but just "copied" them over.

After this PR:

# Same setup
(APIServer pid=3318553) INFO:     Started server process [3318553]
(APIServer pid=3318553) INFO:     Waiting for application startup.
(APIServer pid=3318553) INFO:     Application startup complete.
(APIServer pid=3318553) INFO:     127.0.0.1:54610 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:05 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%

(APIServer pid=3318553) INFO:     127.0.0.1:35638 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=3318553) INFO 10-27 11:15:15 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 50.0%
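
Conceptually, the change just tracks the prompt tokens that were actually prefilled on this instance and derives the console throughput from that count. A minimal sketch of the idea (illustrative only; the class and the record_prefill helper are assumptions, not the actual diff):

```python
# Illustrative sketch only, not the actual vLLM change.
from dataclasses import dataclass


@dataclass
class IterationStatsSketch:
    num_prompt_tokens: int = 0        # all prompt tokens seen this iteration
    num_local_prompt_tokens: int = 0  # only the tokens prefilled on this instance

    def record_prefill(
        self, num_prompt_tokens: int, num_external_computed_tokens: int
    ) -> None:
        self.num_prompt_tokens += num_prompt_tokens
        # Tokens whose KV cache arrived over the connector were not prefilled
        # here, so they are excluded from the local count used for throughput.
        self.num_local_prompt_tokens += (
            num_prompt_tokens - num_external_computed_tokens
        )


stats = IterationStatsSketch()
stats.record_prefill(num_prompt_tokens=512, num_external_computed_tokens=512)
print(stats.num_local_prompt_tokens)  # 0 -> a pure decode instance reports 0.0 prompt tokens/s
```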

@gemini-code-assist (Contributor, Bot) left a comment


Code Review

This pull request addresses a bug in the prompt throughput stats calculation for disaggregated setups, where tokens copied from the prefill instance to the decode instance were incorrectly counted as prefilled tokens. The changes involve modifying the scheduler to track locally prefilled tokens and updating the logging to reflect the corrected throughput. The review focuses on ensuring the correctness of the fix and the clarity of the code changes.

Comment thread vllm/v1/metrics/loggers.py
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/metrics/stats.py
@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment thread vllm/v1/core/sched/scheduler.py Outdated
Comment on lines +1496 to +1497
# Prefill is to be recomputed locally.
request.num_external_computed_tokens = 0
Collaborator (Author)


@sdavidbd can you please double-check this? My understanding is that we have to re-compute the whole prefill now so we can track prompt throughput.

Member


Even from the docstring, it seems clear we don't always re-compute the whole prefill:

This method scans the given requests, detects those with invalid blocks and adjusts their num_computed_tokens to the longest valid prefix.

A few things:

  1. I think num_external_computed_tokens = 0 should only happen inside the not marked_invalid_block clause - here we're saying all externally computed blocks are invalid
  2. (Unrelated to this PR - an observation) Setting request.num_computed_tokens = request.num_cached_tokens on line 1489 doesn't make sense to me - since num_cached_tokens includes both local and external computed tokens?
  3. We should update num_external_computed_tokens at # Truncate the computed tokens at the first failed block - to something like request.num_computed_tokens - local_computed_tokens (but not obvious how we calculated local_computed_tokens)

Contributor


@NickLucche — as @markmc noted, we don’t recompute the entire prefill. Only the externally computed tokens starting from the first failed block are recomputed.

To correctly update num_external_computed_tokens, we should first determine how many externally computed tokens are affected. This can be derived from the delta between the original and truncated num_computed_tokens — the same tokens already aggregated in total_affected_tokens (lines 1473–1477):

# Truncate the computed tokens at the first failed block
request.num_computed_tokens = idx * self.block_size
num_affected_tokens = req_num_computed_tokens - request.num_computed_tokens
total_affected_tokens += num_affected_tokens
request.num_external_computed_tokens -= num_affected_tokens

Contributor


@markmc — regarding your points:

  1. The not marked_invalid_block condition covers the sync-loading edge case where a request is affected by externally computed tokens that failed to load but are shared with preceding requests that will handle their recomputation. In this situation, the affected request still treats those tokens as locally computed, so its num_external_computed_tokens remains unchanged.

For example, assuming block_size = 1 and the following prompts (with R1 preceding R2 in the batch):

R1: t1 t2 t3
R2: t1 t2 t4 t5

Suppose t1 is locally computed, t2 and t4 are externally computed, and t2 fails to load while t4 succeeds. Then:

Before failure

| Request | num_computed_tokens | num_external_computed_tokens |
|---------|---------------------|------------------------------|
| R1      | 2                   | 1                            |
| R2      | 3                   | 1                            |

After failure

| Request | num_computed_tokens | num_external_computed_tokens |
|---------|---------------------|------------------------------|
| R1      | 1                   | 0                            |
| R2      | 3                   | 1                            |

Both R1 and R2 are affected and will recompute t2, t3 and t5 respectively, but R2’s total number of computed tokens remains unchanged.

  2. Correct — num_cached_tokens represents the total number of computed tokens (both local and external). Setting num_computed_tokens = num_cached_tokens ensures that all new tokens are recomputed in the current iteration, since the previous num_computed_tokens value already included them.

  3. Agreed — see my suggested code changes above for how we update num_external_computed_tokens accordingly.

@@ -121,6 +121,8 @@ class EngineCoreOutput(
trace_headers: Mapping[str, str] | None = None
# The number of tokens with prefix cache hits.
Member


Yeah, this comment looks incorrect ... assuming "prefix cache" refers to the local cache?

    # Total computed tokens (local + external).
    num_computed_tokens = (
        num_new_local_computed_tokens + num_external_computed_tokens
    )
...
# Count the number of prefix cached tokens.
if request.num_cached_tokens < 0:
    request.num_cached_tokens = num_computed_tokens

Collaborator (Author)


I am not familiar with it, cc @chaunceyjiang.

Comment thread vllm/v1/request.py Outdated
Comment thread vllm/v1/metrics/stats.py
@@ -221,6 +221,8 @@ def __init__(self):
self.num_generation_tokens = 0
self.num_prompt_tokens = 0
self.num_preempted_reqs = 0
# Num of prompt tokens that have been computed locally.
Member


Is the naming here a bit confusing? By "computed locally" here we mean both computed and locally cached?

If you just tracked num_external_computed_tokens and then subtracted it in _track_iteration_stats() would that be more clear?
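
For concreteness, that alternative might look roughly like this (a hypothetical sketch; the class name and exact method body are assumptions, only the field names come from this thread):

```python
# Hypothetical sketch of the suggestion above, not vLLM's actual logger code.
class LoggingStatLoggerSketch:
    def __init__(self) -> None:
        self.num_generation_tokens = 0
        self.num_prompt_tokens = 0

    def _track_iteration_stats(self, iteration_stats) -> None:
        self.num_generation_tokens += iteration_stats.num_generation_tokens
        # Keep the raw prompt-token count in IterationStats and subtract the
        # externally computed tokens only where throughput is derived.
        self.num_prompt_tokens += (
            iteration_stats.num_prompt_tokens
            - iteration_stats.num_external_computed_tokens
        )
```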

Collaborator (Author)


By "computed locally" here we mean both computed and locally cached?

Yes, the behavior is unchanged: cached tokens would still result in higher throughput, even in a regular aggregated setup.

If you just tracked num_external_computed_tokens and then subtracted it in _track_iteration_stats() would that be more clear?

I think looking at the diff

self.num_prompt_tokens += iteration_stats.num_prompt_tokens
-->
self.num_prompt_tokens += iteration_stats.num_local_prompt_tokens

it is pretty clear that I just want to rule out the remote tokens, i.e. I assume this semantic was the intended one from the beginning; "local" just used to be redundant.

@@ -121,6 +121,8 @@ class EngineCoreOutput(
trace_headers: Mapping[str, str] | None = None
# The number of tokens with prefix cache hits.
num_cached_tokens: int = 0
# The number of tokens that have been computed remotely.
num_external_computed_tokens: int = 0
Member


I'd be tempted to refactor these two into a PrefillStats object ... and only include that in the ECO when the prefill completes ... especially if we ever wanted to also send like num_locally_cached_tokens too
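
For illustration, such a bundle might look roughly like this (hypothetical sketch; this type does not exist, and only the two fields discussed in this thread are shown):

```python
from dataclasses import dataclass


@dataclass
class PrefillStats:
    """Hypothetical grouping of prefill accounting, attached to the
    EngineCoreOutput only once the prefill completes."""

    num_cached_tokens: int = 0             # local prefix-cache hits
    num_external_computed_tokens: int = 0  # tokens received via KV transfer
    # Fields like num_locally_cached_tokens could be added here later
    # without widening EngineCoreOutput further.
```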

Collaborator (Author)


I don't have a strong opinion on this, tbh; we can probably wait until we have a few more things to bundle before executing the suggestion.

@@ -113,7 +113,7 @@ def _reset(self, now):

def _track_iteration_stats(self, iteration_stats: IterationStats):
Member


Presumably you want to update the Prometheus metric too?

Collaborator (Author)


@markmc which one? I intentionally left self.counter_prompt_tokens unchanged to avoid replacing the actual prompt count.
Should I just make a new one for local tokens?

@NickLucche (Collaborator, Author)

NickLucche commented Oct 29, 2025

@markmc @sdavidbd addressed the suggestions, thanks a lot for reviewing 🙏🏻

@NickLucche force-pushed the fix-throughput-stats branch from c3d6723 to 950baf4 on November 8, 2025 17:03
@NickLucche added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 9, 2025
@markmc (Member)

markmc commented Nov 11, 2025

Sorry for the delay in coming back to this.

I see more clearly now where you're coming from. On a decode instance, you want to see this:

Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s

i.e. "this vLLM instance isn't doing any prefill computation"

So, you're proposing that we subtract any prefilling done by a KV connector from the prompt throughput reported on the console.

If that's the desired behavior, the code changes lgtm now.

However, the whole thing raises a lot of other questions for me!

  • Why would you see zero on the console, but not in your Grafana dashboard showing vllm:prompt_tokens? Why should externally-computed tokens not be subtracted from vllm:iteration_tokens and vllm:request_prompt_tokens also?
  • Why do we include tokens pre-filled from the local prefix cache in Avg prompt throughput on the console?
  • Why would we exclude tokens pre-filled from the CPU-offloaded KV cache? Why are these "externally computed"?

On the other hand though, looks like you found a clear bug here:

External prefix cache hit rate: 50.0%

I have a fix for that, just need to update tests before submitting

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@NickLucche (Collaborator, Author)

I think these are all good questions, but as you noted I was really just proposing a fix for a very specific use-case.

Why would you see zero on the console, but not in your Grafana dashboard showing vllm:prompt_tokens? Why should externally-computed tokens not be subtracted from vllm:iteration_tokens and vllm:request_prompt_tokens also?

Don't have a strong opinion on Grafana. Open to addressing that in this or a follow-up PR if need be. In general my approach was: counting prompt tokens is not wrong, but trying to derive throughput from them is wrong, since the result is bonkers.

Why do we include tokens pre-filled from the local prefix cache in Avg prompt throughput on the console

To be honest I don't know and don't have a strong opinion on it; I see how, in practice, you want to show a signal rather than 0.0 throughput in the regular colocated setup. Perhaps it's better discussed in a separate issue; I was just trying to fix the disaggregated case.

Why would we exclude tokens pre-filled from the CPU-offloaded KV cache? Why are these "externally computed"?

I think in this case the rationale is actually the same as in this PR, even offloaded KV caches are not necessarily computed locally.

Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
@NickLucche (Collaborator, Author)

@markmc let me know if you have any other comments on this fix.

@markmc (Member)

markmc commented Nov 13, 2025

it's wrong trying to derive throughput from them since result is bonkers

The throughput figure is just expressing a counter as a rate. In a Grafana dashboard, we do the same with e.g.

          "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
          "legendFormat": "Prompt Tokens/Sec",

That's why I think "Prompt Throughput" in Prometheus/Grafana should mean the same thing as in the console log.

And that's just to say I think it's important that we can articulate a consistent mental model for what all of these metrics mean, whether in the console log or Prometheus.

Here's my mental model for how we're counting things now:

prompt input
   ↓ [prompt tokens]
lookup internal prefix cache
   ↓ [tokens queried and found]
lookup external connector prefix cache
   ↓ [tokens queried and found]
....
   ↓ 
generated tokens output
   ↓ [generated tokens]

With that mental model, this:

Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%

is not bonkers at all?

53.2 prompt tokens/s came in at the top, none were found in the local prefix cache, 100% were retrieved from the connector, and we generated 15 tokens/s.

If you see it differently, could you draw your mental model for these counters, in such a way that the same model can be applied consistently across use cases?

markmc added a commit to markmc/vllm that referenced this pull request Nov 14, 2025
Currently we are recording async connector prefix cache queries
and hits twice, for the same reason that update_state_after_alloc()
is called twice:

> If get_num_new_matched_tokens previously returned True for a
> request, this function may be called twice for that same request -
> first when blocks are allocated for the connector tokens to be
> asynchronously loaded into, and second when any additional blocks
> are allocated, after the load/transfer is complete.

Worse, the second time we are recording with `num_external_computed_tokens=0`
so effectively we are halving the hit rate.

Before

```
External prefix cache hit rate: 50.0%
```

After

```
External prefix cache hit rate: 100.0%
```

Borrows part of vllm-project#27569 to track `num_external_computed_tokens`
for use when the KV transfer completes.

Will use vllm-project#28550 for testing this scenario.

Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@smarterclayton (Contributor)

From an end user perspective, I would argue the top level outcome is:

I want a “prompt throughput” prometheus metric with various sources (a category / type / source label), and I want the sum of the series under that label to equal the observed throughput in terms of prompt tokens processed by this server. I want the label to be sufficiently scoped that I can identify:

  1. “only tokens computed for this request” (actually takes up compute time) - used to calculate actual load
  2. Tokens transferred in to satisfy this request from connector / logical disaggregation (plus connector / deeper source?) - used to verify kv transfer is happening
  3. Prompt tokens retrieved from cache, by tier (hbm, cpu, arbitrary tier labels) - distinguish how many tokens provided by my tiers for effectiveness

Then the metrics printed are a derivative of that end user observability outcome.

@markmc (Member)

markmc commented Dec 19, 2025

From an end user perspective, I would argue the top level outcome is:

Claude helped me out with this version of your proposal ... does it match?

(There's definitely still some information duplication with the prefix_cache_queries/hits, connector_prefix_cache_queries/hits metrics, but I could see the argument that this approach is better aligned with end-user goals)

Proposal: Labeled Prompt Token Counters

Prometheus Metrics

Replace single vllm:prompt_tokens_total with labeled counter:

vllm:prompt_tokens_total{source="local_compute"}      # Tokens prefilled locally
vllm:prompt_tokens_total{source="kv_transfer"}        # Tokens received via KV transfer
vllm:prompt_tokens_total{source="cache_hit_gpu"}      # Tokens from GPU cache
vllm:prompt_tokens_total{source="cache_hit_cpu"}      # Tokens from CPU cache

Grafana Queries

Total throughput (all tokens flowing through): rate(vllm:prompt_tokens_total[$__rate_interval])

Compute load (actual prefill work): rate(vllm:prompt_tokens_total{source="local_compute"}[$__rate_interval])

KV transfer verification: rate(vllm:prompt_tokens_total{source="kv_transfer"}[$__rate_interval])

Cache effectiveness: rate(vllm:prompt_tokens_total{source=~"cache_hit_.*"}[$__rate_interval])

CLI Logger

Display breakdown: Avg prompt throughput: 53.2 tokens/s (0.0 local, 53.2 transferred)

Benefits

  • No information loss: Both perspectives available
  • Flexible querying: Users choose what to measure
  • Clear semantics: Source label disambiguates token origin
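
For reference, a counter along those lines could be declared with prometheus_client roughly as follows (a sketch of the proposal above, not vLLM's existing metrics code; the label values are the proposed ones):

```python
from prometheus_client import Counter

# Sketch only: one prompt-token counter, disambiguated by a "source" label.
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total",
    "Prompt tokens processed, broken down by source.",
    labelnames=["source"],
)

# On a decode instance in a P/D setup, most prompt tokens arrive via KV transfer:
prompt_tokens_total.labels(source="kv_transfer").inc(512)
# ...while tokens actually prefilled on this instance increment a separate series:
prompt_tokens_total.labels(source="local_compute").inc(32)
```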

@smarterclayton (Contributor)

smarterclayton commented Jan 7, 2026

Roughly yes

vllm:prompt_tokens_total{source="cache_hit_gpu"}      # Tokens from GPU cache
vllm:prompt_tokens_total{source="cache_hit_cpu"}      # Tokens from CPU cache

We are likely to have more cache hit sources in the future. I'd suggest two labels, source=cache_hit and cache_source=cpu|gpu|<empty>. Prometheus convention would probably be to keep the labels orthogonal (i.e. cache_source and source are probably less preferable than two labels that don't share _source), but I don't have an alternative suggestion.

Otherwise, that structure looks like what I would expect

EDIT: One note, we should pick values for cache_source consistent with the actual source as defined in our cache hierarchy code, (cpu and gpu may be too generic, especially given that there are likely more levels of GPU hierarchy in future hardware generations and there may be multiple CPU tiers). Ideally, we use a constant defined by the cache plugin hierarchy itself for cache source.
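
As a sketch of that shape (label names and values are only illustrative; as noted above, the real tier values should come from the cache hierarchy code):

```python
from prometheus_client import Counter

# Sketch: orthogonal labels, with cache_source left empty for non-cache sources.
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total",
    "Prompt tokens processed, by source and cache tier.",
    labelnames=["source", "cache_source"],
)

prompt_tokens_total.labels(source="cache_hit", cache_source="gpu").inc(128)
prompt_tokens_total.labels(source="local_compute", cache_source="").inc(64)
```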

@NickLucche (Collaborator, Author)

@smarterclayton

I'm ok with fleshing out the current metric; I think end-user clarity would benefit from it. We can repurpose this PR to implement at least the compute/transfer split, I suppose.

Re: cache source, why would we move away from internal/external, where internal is co-located HBM w.r.t. the current EngineCore and external is anything else?

@smarterclayton (Contributor)

Re: cache source:

An operator who has multiple external sources would want to know which source it is. I'm not arguing for changing labels that exist, just ensuring that we can attribute the external contribution appropriately. I agree that preserving existing labels without disruption is better than removing them.

@markmc moved this from Backlog to P1 in Metrics & Tracing, Jan 13, 2026
@mergify (Bot) added the bug label (Something isn't working), Jan 14, 2026
@mergify (Bot, Contributor)

mergify Bot commented Jan 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ZhanqiuHu (Contributor)

I created a follow-up issue here #33289 (PR: #33290). Would appreciate any feedback!

@markmc (Member)

markmc commented Feb 4, 2026

#33290 has merged, so I think we can close this now. Cool!

@markmc closed this Feb 4, 2026
@markmc moved this from P1 to Done in Metrics & Tracing, Feb 4, 2026

Labels

bug (Something isn't working), kv-connector, needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

6 participants