
[Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats #37460

Merged

markmc merged 2 commits into vllm-project:main from markmc:prefill-stats on Apr 14, 2026

Conversation

@markmc
Member

@markmc markmc commented Mar 18, 2026

Related to the discussion in #36859 and the `Counters can only be incremented by non-negative amounts` error with the `vllm:prompt_tokens_by_source_total` metric.
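For context, the error comes from the Prometheus convention that counters are monotonic. A minimal stand-in for that check (an illustrative mimic, not the actual prometheus_client implementation) shows why any accounting that can go negative, such as a post-hoc correction to cached-token counts, blows up at metrics-recording time:

```python
# Minimal mimic of a Prometheus-style Counter, to illustrate the
# invariant behind the quoted error: counters are monotonic, so a
# negative increment is rejected outright.
class Counter:
    def __init__(self, name: str):
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts")
        self.value += amount


c = Counter("vllm:prompt_tokens_by_source_total")
c.inc(5)
try:
    c.inc(-3)  # e.g. a post-hoc correction after a failed KV transfer
    corrected = True
except ValueError:
    corrected = False  # the correction never lands; the metric raises instead
```

The only robust fix is to ensure the recorded amounts can never be negative in the first place, which is what recording the stats once, at first schedule, achieves.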

In OutputProcessor, we take the first EngineCoreOutput as a signal that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use PrefillStats to track the first scheduled prefill so that the stats can be returned to the frontend via EngineCoreOutput.

num_cached_tokens was previously used for KV transfer failure recovery, but this is no longer true as of #38096. We also no longer attempt to correct these prefill metrics if KV transfers failed, since this introduced unjustified brittleness to an already brittle code path.
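As an illustrative sketch of the scheme described above (field and function names here are modeled on this description, not copied from vLLM's actual vllm/v1/metrics/stats.py), the key invariant is that only the first scheduling of a prefill populates the stats, and the frontend consumes them exactly once:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefillStats:
    """Stats captured the first time a request's prefill is scheduled."""
    num_prompt_tokens: int = 0
    num_cached_tokens: int = 0
    num_external_computed_tokens: int = 0


class Request:
    def __init__(self, num_prompt_tokens: int):
        self.num_prompt_tokens = num_prompt_tokens
        self.prefill_stats: Optional[PrefillStats] = None


def record_first_schedule(request: Request, cached: int, external: int) -> None:
    # Only the *first* scheduling of the prefill is recorded; a later
    # recompute after preemption must not overwrite these stats.
    if request.prefill_stats is None:
        request.prefill_stats = PrefillStats(
            num_prompt_tokens=request.num_prompt_tokens,
            num_cached_tokens=cached,
            num_external_computed_tokens=external,
        )


def take_prefill_stats(request: Request) -> Optional[PrefillStats]:
    # Hand the stats to the frontend exactly once (via EngineCoreOutput
    # in the real code); subsequent calls return None.
    stats, request.prefill_stats = request.prefill_stats, None
    return stats
```

Because the stats are frozen at first schedule and handed over once, a preemption-and-recompute cycle can no longer produce a second, smaller count that would drive a counter increment negative.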

@markmc
Member Author

markmc commented Mar 18, 2026

I've just realized an obvious gap - the prefill stats aren't updated on the failed-KV-transfer path

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces PrefillStats to more accurately track and report prefill-related metrics, which is a valuable improvement for clarity and correctness, especially in scenarios involving preemption. The refactoring to use a dedicated PrefillStats object instead of separate numerical values is a good design choice. The changes are consistently applied across the scheduler, engine outputs, request objects, and tests. I have found one critical issue in the metrics calculation logic that leads to double-counting of tokens, for which I've provided a detailed comment and a suggested fix.

@mergify mergify Bot added the kv-connector label Mar 18, 2026
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@orozery
Collaborator

orozery commented Mar 19, 2026

> num_cached_tokens and num_external_computed_tokens remain in place and untouched for now - next we will look at what the remaining uses of these are.

Right now we basically duplicate these fields into prefill stats.
But we already have problems with these original fields, so I think it's better we set prefill stats directly, removing any connection to the old fields.

And I guess once we do that, we can think whether we want to keep the old fields (probably not?).

@markmc
Member Author

markmc commented Mar 19, 2026

> num_cached_tokens and num_external_computed_tokens remain in place and untouched for now - next we will look at what the remaining uses of these are.

> Right now we basically duplicate these fields into prefill stats. But we already have problems with these original fields, so I think it's better we set prefill stats directly, removing any connection to the old fields.

IMO, the PR basically does that (set the stats directly)?

> And I guess once we do that, we can think whether we want to keep the old fields (probably not?).

I'm going to take a look at this, and I expect I can remove/rework these fields without affecting PrefillStats.

@orozery
Collaborator

orozery commented Mar 19, 2026

> IMO, the PR basically does that (set the stats directly)?

Right, my bad.

@markmc markmc changed the title from [Metrics][Core] Add PrefillStats to EngineCoreOutputs to [Metrics][Core] Replace num_cached_tokens and num_external_computed_tokens with PrefillStats on Mar 24, 2026
@markmc
Member Author

markmc commented Mar 24, 2026

@orozery from #36859 (comment)

> The failure handling code use of this field is hacky IMO and I can think of alternative more robust ways to go without it. (I actually implemented a fix in one of the revisions of #35223: f02a5c8).

I've incorporated this approach from that earlier revision now, PTAL

@orozery
Collaborator

orozery commented Mar 25, 2026

> I've incorporated this approach from that earlier revision now, PTAL

I think it's better to create a separate PR for these changes in the failure recovery flow.
They stand on their own, and don't rely on the other changes in this PR.

@markmc markmc removed the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026
@markmc markmc marked this pull request as draft March 25, 2026 11:21
@markmc
Member Author

markmc commented Mar 25, 2026

Moved to draft since #38096 now needs to merge first

@mergify
Contributor

mergify Bot commented Apr 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 13, 2026
…s with PrefillStats

In OutputProcessor, we take the first EngineCoreOutput as a signal
that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills
that are scheduled but never completed, or we might need to recompute
an already completed prefill. To add clarity, we use PrefillStats
to track the first scheduled prefill so that the stats can be returned
to the frontend via EngineCoreOutput.

num_cached_tokens was previously used for KV transfer failure recovery,
but this is no longer true as of vllm-project#38096. We also no longer attempt to
correct these prefill metrics if KV transfers failed, since this
introduced unjustified brittleness to an already brittle code path.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc
Member Author

markmc commented Apr 13, 2026

FTR, this is a significant change in the most recent update:

We now track stats from the first time the prefill is scheduled (previously: the first time the prefill completed), since the proven-error-prone effort to update the stats in the KV-transfer error-handling path seems unjustified.

Or points out:

> we call take_prefill_stats at update_from_output in the same
> step the request was first scheduled to run. So I don't think
> there's a way the request will be preempted before we set
> Request.prefill_stats to None.

Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Collaborator

@orozery orozery left a comment


LGTM

@markmc markmc merged commit d3af8c1 into vllm-project:main Apr 14, 2026
56 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in Metrics & Tracing Apr 14, 2026
@ZhanqiuHu
Contributor

Note: This PR changed the prefill stats reported in the case of multiple KV connectors (NIXL connector + CPUOffloading connector). It impacts a couple of test cases in tests/v1/kv_connector/nixl_integration/run_multi_connector_edge_case_test.sh, which use the reported metrics to check whether the transferred KV cache and offloaded KV cache are as expected.

Will post a PR that updates the test cases accordingly.

@NickLucche
Collaborator

Yep, it broke CI for v1/kv_connector/nixl_integration/run_multi_connector_edge_case_test.sh

zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
…ed_tokens with PrefillStats (vllm-project#37460)

Related to `Counters can only be incremented by non-negative amounts`
error with the `vllm:prompt_tokens_by_source_total` metric.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 20, 2026
…fferent vllm version (#8426)

### What this PR does / why we need it?
This fixes vllm-project/vllm#37460
This PR introduces version-specific logic to handle `num_cached_tokens`
and `num_external_computed_tokens` in the scheduler, ensuring
compatibility with vLLM 0.19.0 and maintaining legacy support for older
versions via `prefill_stats`.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.19.0
- vLLM main:
vllm-project/vllm@6f786f2

Signed-off-by: wangli <wangli858794774@gmail.com>
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
njhill added a commit to Inferact/vllm-frontend-rs that referenced this pull request Apr 24, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026

Labels

bug (Something isn't working) · kv-connector · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done


5 participants