
[BugFix] Scheduler: Only set num_external_computed_tokens once #36859

Closed
orozery wants to merge 1 commit into vllm-project:main from orozery:stats-set-num-external-tokens-once

Conversation

@orozery
Collaborator

@orozery orozery commented Mar 12, 2026

Request.num_cached_tokens and Request.num_external_computed_tokens are two fields used for reporting request-level cache hit stats. While num_cached_tokens is set only the first time a request is scheduled, num_external_computed_tokens is re-set whenever the scheduler tries to schedule the request again, that is, after the request is preempted or when its initial allocation fails. This can leave the two fields inconsistent, which can produce a wrong value for the derived stat local_cache_hit, and vLLM can crash if that wrong value is negative.
This PR fixes it by setting both fields only when a request is scheduled for the first time (by checking Request.num_preemptions == 0).
After that, these fields may be updated only when the connector loading external tokens reports an error. We modify a scheduler unit test for preemptions with a KV connector to verify that these fields are set only once.
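A minimal sketch of the gating described above, for illustration only. The Request class here is a hypothetical stand-in with just the three fields mentioned in this PR, and record_cache_hit_stats is an invented helper name; the real scheduler code is more involved.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for the scheduler's request object."""
    num_cached_tokens: int = 0
    num_external_computed_tokens: int = 0
    num_preemptions: int = 0


def record_cache_hit_stats(request: Request, num_computed_tokens: int,
                           num_external_computed_tokens: int) -> None:
    """Set both stats together, and only while the request was never preempted."""
    if request.num_preemptions == 0:
        request.num_cached_tokens = num_computed_tokens
        request.num_external_computed_tokens = num_external_computed_tokens


req = Request()
# First scheduling attempt: both fields are recorded.
record_cache_hit_stats(req, num_computed_tokens=96, num_external_computed_tokens=32)
# The request is preempted, then re-scheduled with a larger external hit.
req.num_preemptions += 1
record_cache_hit_stats(req, num_computed_tokens=160, num_external_computed_tokens=128)
# The stats still reflect the first scheduling only.
print(req.num_cached_tokens, req.num_external_computed_tokens)
```

Because both fields are written in the same guarded branch, they cannot drift apart after a preemption.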

@mergify
Contributor

mergify Bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase and bug (Something isn't working) labels Mar 12, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a bug where request-level cache hit statistics (num_external_computed_tokens) were being incorrectly reset upon request rescheduling (e.g., after preemption). The fix ensures that both num_cached_tokens and num_external_computed_tokens are set only once, during the initial scheduling of a request, by checking if request.num_preemptions == 0. The logic for updating these stats upon external cache load failures has also been corrected to respect this new rule. The accompanying changes in the test suite correctly verify this new behavior. The changes appear correct and effectively resolve the described issue.

@orozery orozery force-pushed the stats-set-num-external-tokens-once branch from f4b6538 to a3ac960 Compare March 12, 2026 07:08
@mergify
Contributor

mergify Bot commented Mar 12, 2026

Hi @orozery, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify mergify Bot removed the needs-rebase label Mar 12, 2026
Request.num_cached_tokens and Request.num_external_computed_tokens
are two fields used for reporting request-level cache hit stats.
While num_cached_tokens is set only the first time a request is scheduled,
num_external_computed_tokens is re-set whenever the scheduler tries to schedule the request again,
that is, after the request is preempted or when its initial allocation fails.
This can leave the two fields inconsistent, which can produce a wrong value
for the derived stat local_cache_hit, and vLLM can crash if that wrong value
is negative.
This commit fixes it by setting both fields only when a request is scheduled for the first time
(by checking Request.num_preemptions == 0).
After that, these fields may be updated only when the connector loading external tokens reports an error.
We modify a scheduler unit test for preemptions with a KV connector to verify that these fields are set only once.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the stats-set-num-external-tokens-once branch from a3ac960 to 0889687 Compare March 12, 2026 07:32
@orozery orozery requested a review from markmc March 12, 2026 07:36
@markmc
Member

markmc commented Mar 12, 2026

xref #36533 and #36755

@markmc
Member

markmc commented Mar 16, 2026

Still looking, but some initial thoughts ...

Preemption scenario (#36533)

A request is scheduled for the first time -

  1. num_external_computed_tokens and num_cached_tokens are both set
  2. The request gets preempted due to memory pressure
  3. The request is re-scheduled later, num_external_computed_tokens is set, but num_cached_tokens is not

If a later num_external_computed_tokens is greater than the first num_cached_tokens, we will hit the "Counters can only be incremented by non-negative amounts" error described in #36533
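The three steps above can be sketched with plain integers. This is an illustration under assumptions: local_cache_hit is taken to be num_cached_tokens minus num_external_computed_tokens (which is what the inequality above implies), and Counter is a hypothetical stand-in for a metrics counter that rejects negative increments, not the real metrics class.

```python
class Counter:
    """Hypothetical stand-in for a metrics counter that rejects negative increments."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts.")
        self.value += amount


def local_cache_hit(num_cached_tokens: int,
                    num_external_computed_tokens: int) -> int:
    # Assumed derivation: tokens served by the local cache are whatever
    # part of the cached tokens did not come from the external source.
    return num_cached_tokens - num_external_computed_tokens


# Step 1: first scheduling sets both values.
num_cached_tokens = 96
num_external_computed_tokens = 32
# Steps 2-3: after preemption, only the external count is re-set.
num_external_computed_tokens = 128

counter = Counter()
try:
    counter.inc(local_cache_hit(num_cached_tokens, num_external_computed_tokens))
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

With the stale num_cached_tokens of 96 and a re-set external count of 128, the derived value is negative and the increment is rejected.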

Failed transfer scenario (#36638)

#36638 describes a scenario where num_external_computed_tokens is reduced on KV transfer failure. I’m less clear on how this could cause this same “non-negative amounts” error though, since reducing num_external_computed_tokens obviously can’t drive it above num_cached_tokens.

Reproducer

I have failed to find a reliable reproducer using KV offloading and llama-3.1-8b-instruct, and an automated sweep of variations like KV offloading size, long/short inputs, long/short outputs, random and sharegpt datasets, different levels of high concurrency, and more … it did reproduce once, but I couldn’t repeat it.

Of course, we could artificially recreate the scenario in a carefully controlled unit test, but that wasn’t my goal.

Metrics Purpose

Let’s consider the metrics flow in isolation, since the error relates to metrics updates.

Request.num_cached_tokens and Request.num_external_computed_tokens get sent back to the frontend in an EngineCoreOutput principally when prefill has completed (e.g. first new tokens produced).

On the frontend side, in the OutputProcessor.process_outputs() loop, we assume that the first EngineCoreOutput signifies that prefill has completed, and it is only at this point that we (in IterationStats.update_from_output()) consider Request.num_cached_tokens and Request.num_external_computed_tokens. (Streaming inputs is a recent change to this invariant)

Note that in the case of preemption, the frontend only considers prefill to have been completed once, and so these two values are irrelevant for metrics once that initial EngineCoreOutput has been sent.
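To make the "prefill completes once" assumption concrete, here is a rough sketch. Output and PrefillStats are hypothetical names invented for this illustration; they are not the actual EngineCoreOutput or IterationStats classes, and the real update path carries much more state.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical stand-in for one engine output belonging to a request."""
    request_id: str
    num_cached_tokens: int
    num_external_computed_tokens: int


class PrefillStats:
    """Records cache stats only from the first output seen per request."""

    def __init__(self) -> None:
        self.recorded: dict[str, tuple[int, int]] = {}

    def update_from_output(self, output: Output) -> None:
        # Later outputs (including those after a preemption) are ignored,
        # because prefill is considered complete after the first one.
        if output.request_id in self.recorded:
            return
        self.recorded[output.request_id] = (
            output.num_cached_tokens,
            output.num_external_computed_tokens,
        )


stats = PrefillStats()
stats.update_from_output(Output("req-0", 96, 32))
# After preemption and re-scheduling, the values on the request may differ,
# but they no longer affect what was recorded.
stats.update_from_output(Output("req-0", 96, 128))
print(stats.recorded["req-0"])
```

Under this assumption, only the values present when that first output is produced matter for metrics.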

This is all quite challenging to validate 100% from reading the code. It could make things a lot more clear if we separated any integral scheduler accounting use of these two values from the metrics-related information associated with the “prefill completed event”.

KV Transfer Failures

It seems like KV transfer failure handling is the other purpose for tracking these two values on Request, and the scope of an error here is beyond simply incorrect metrics tracking ... whereas the most important thing is to get the update to Request.num_computed_tokens correct. This too is very twisty to validate.

Minimal fix

Very similar to Or's proposed fix, the simplest thing we can do is to ensure both values only get updated once (except for KV transfer failure handling), and at the same time e.g.

if request.num_cached_tokens < 0:
    request.num_cached_tokens = num_computed_tokens
    request.num_external_computed_tokens = num_external_computed_tokens

I don't love request.num_preemptions == 0 as a "set only the first time" signal

But I also wonder whether we should just drop the "only set the first time" thing, since we do use the values in KV transfer failure handling, even after preemptions

@orozery
Collaborator Author

orozery commented Mar 16, 2026

> I don't love request.num_preemptions == 0 as a "set only the first time" signal

The reason I prefer this condition over initializing the fields to -1 is that -1 is not the true default value for these fields.
The default value is 0.
If for example a request gets aborted before even getting to the point where we set those fields, we will report -1 instead of 0.
But I agree request.num_preemptions == 0 is also somewhat fragile.
Maybe we can keep the -1 and just make sure we initialize to 0 before returning to the user if it wasn't set.
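A rough sketch of that last idea: keep -1 as an internal "not set yet" sentinel, but normalize to 0 when reporting. All names here (UNSET, set_stats_once, reported_stats) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

UNSET = -1  # internal sentinel meaning "stats were never recorded"


@dataclass
class Request:
    """Hypothetical stand-in for the scheduler's request object."""
    num_cached_tokens: int = UNSET
    num_external_computed_tokens: int = UNSET


def set_stats_once(request: Request, num_computed_tokens: int,
                   num_external_computed_tokens: int) -> None:
    """Record both stats together, only if they were never recorded."""
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = num_computed_tokens
        request.num_external_computed_tokens = num_external_computed_tokens


def reported_stats(request: Request) -> tuple[int, int]:
    """Map the sentinel to 0 so callers never see -1."""
    return (max(request.num_cached_tokens, 0),
            max(request.num_external_computed_tokens, 0))


aborted = Request()  # aborted before it was ever scheduled
scheduled = Request()
set_stats_once(scheduled, 96, 32)
set_stats_once(scheduled, 160, 128)  # re-scheduled: ignored
print(reported_stats(aborted), reported_stats(scheduled))
```

This keeps the guard independent of num_preemptions while still reporting 0 for a request that was aborted early.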

> But I also wonder whether we should just drop the "only set the first time" thing, since we do use the values in KV transfer failure handling, even after preemptions

These fields are used for reporting stats to the user.
I believe they were introduced before failure handling piggy-backed on them.
The failure-handling code's use of these fields is hacky IMO, and I can think of more robust alternatives that avoid it.
(I actually implemented a fix in one of the revisions of #35223: f02a5c8).

If we allow these fields to be re-set I think we lose the desirable semantics of these stats fields.

@markmc
Member

markmc commented Mar 18, 2026

> These fields are used for reporting stats to the user.

Agree, and I think this could be done in a way that more clearly reflects the intended semantics - see #37460

> I believe they were introduced before failure handling piggy-backed on it. The failure handling code use of this field is hacky IMO and I can think of alternative more robust ways to go without it. (I actually implemented a fix in one of the revisions of #35223: f02a5c8).

I agree it would be much better to not depend on these fields in the error handling code 👍

@markmc
Member

markmc commented Apr 8, 2026

We've since iterated on #37460 to resolve this, so closing

@markmc markmc closed this Apr 8, 2026
@markmc markmc moved this from In Review to Not planned in Metrics & Tracing Apr 8, 2026

Labels

bug Something isn't working v1

Projects

Status: Not planned


2 participants