
[Bugfix] Reset num_cached_tokens sentinel on request preemption #36752

Closed
xiaguan wants to merge 1 commit into vllm-project:main from xiaguan:fix/preempt-reset-num-cached-tokens

Conversation


xiaguan commented Mar 11, 2026

Description

When a request is preempted, _preempt_request resets num_computed_tokens = 0 but leaves num_cached_tokens at its previous value. On reschedule, the initialization guard:

```
if request.num_cached_tokens < 0:
    request.num_cached_tokens = num_computed_tokens
```

never fires again because num_cached_tokens is already ≥ 0. Meanwhile num_external_computed_tokens is re-queried from the KV connector (line 607) and can return a larger value than before (e.g. more blocks became available on a remote store). This breaks the invariant num_cached_tokens >= num_external_computed_tokens and produces a negative local_cache_hit in stats:

```
local_cache_hit = num_cached_tokens + recomputed - num_external_computed_tokens
                = 1500 + 0 - 2000 = -500  # BOOM
```

Root cause (_preempt_request, scheduler.py):

```
# Before fix
request.num_computed_tokens = 0
# num_cached_tokens NOT reset → stale value persists
```

Fix: reset num_cached_tokens to the sentinel value (-1) so the guard fires on the next schedule and both fields are initialised consistently from fresh values.
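A minimal sketch of the intended behavior, using stand-in types rather than the real vLLM scheduler (field names follow the description above; the actual code in `scheduler.py` is more involved):

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for the scheduler's request state (illustrative only)."""
    num_computed_tokens: int = 0
    num_cached_tokens: int = -1  # -1 is the "not yet initialised" sentinel


def preempt_request(request: Request) -> None:
    request.num_computed_tokens = 0
    # The fix: restore the sentinel so the initialization guard fires again
    # on the next schedule and both fields are re-derived from fresh values.
    request.num_cached_tokens = -1


def schedule(request: Request, num_computed_tokens: int) -> None:
    # Initialization guard from the description: only runs while the
    # sentinel is in place.
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = num_computed_tokens
```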

Test plan

Added test_preempt_resets_num_cached_tokens in tests/v1/core/test_scheduler.py:

  1. Schedule a request with a MockKVConnector → num_cached_tokens is set.
  2. Directly call _preempt_request → assert num_cached_tokens == -1.
  3. Re-schedule → assert num_cached_tokens >= num_external_computed_tokens (invariant holds).
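Roughly, the test logic reduces to the following (shown against the stand-in types from the sketch above; the real test drives the scheduler with a mock KV connector and its fixtures differ):

```python
def test_preempt_resets_num_cached_tokens() -> None:
    req = Request()

    # 1. First schedule: the guard fires and num_cached_tokens is set.
    schedule(req, num_computed_tokens=1500)
    assert req.num_cached_tokens == 1500

    # 2. Preemption restores the sentinel.
    preempt_request(req)
    assert req.num_cached_tokens == -1

    # 3. Reschedule with a larger connector answer: the guard fires again,
    #    so num_cached_tokens >= num_external_computed_tokens holds.
    num_external_computed_tokens = 2000
    schedule(req, num_computed_tokens=num_external_computed_tokens)
    assert req.num_cached_tokens >= num_external_computed_tokens
```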

🤖 Generated with Claude Code

When a request is preempted, `num_computed_tokens` is reset to 0 but
`num_cached_tokens` was not. On reschedule the guard
`if request.num_cached_tokens < 0` never fires again, so the stale
cached-token count is kept while `num_external_computed_tokens` is
re-queried from the connector and can grow larger. This produces a
negative `local_cache_hit` in stats:

  local_cache_hit = num_cached_tokens - num_external_computed_tokens < 0

Fix: reset `num_cached_tokens` to the sentinel value (-1) in
`_preempt_request` so the guard fires on the next schedule and both
fields are set consistently from fresh values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: xiaguan <751080330@qq.com>
Collaborator

orozery commented Mar 11, 2026

I think that num_cached_tokens is only used once in the request lifetime:

```
if is_prefilling:
```

Since is_prefilling will not be re-set to True upon preemption.
I actually think this is the desired behavior since we want to be able to characterize the kv cache hits stats per workload, which should be independent of whether preemptions occurred.

Contributor

gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a bug in the scheduler's preemption logic where num_cached_tokens was not reset, leading to stale values and potentially incorrect stats upon rescheduling. The fix correctly resets this field to its sentinel value of -1 in _preempt_request. The change is accompanied by a well-written and thorough unit test that validates the fix by simulating preemption and ensuring the relevant invariant holds on reschedule. The implementation is correct and effectively resolves the issue.

Author

xiaguan commented Mar 11, 2026

Thanks for the reply! That's a great point about the decode-phase preemption case — I agree is_prefilling correctly stays False there and the stats wouldn't be re-evaluated.

I guess the crash we're seeing is specifically with requests preempted during partial prefill — before they ever complete prefill and produce their first output. In that case, is_prefilling on the output_processor side is still True (since line 621 req_state.is_prefilling = False was never reached). When the request is rescheduled and eventually completes prefill, PromptTokenStats.update_from_output() does get called, but num_cached_tokens still holds the value from the original scheduling while num_external_computed_tokens has been refreshed by the connector re-query — which can result in the negative local_cache_hit.

Collaborator

orozery commented Mar 11, 2026

> I guess the crash we're seeing is specifically with requests preempted during partial prefill — before they ever complete prefill and produce their first output. In that case, is_prefilling on the output_processor side is still True (since line 621 req_state.is_prefilling = False was never reached). When the request is rescheduled and eventually completes prefill, PromptTokenStats.update_from_output() does get called, but num_cached_tokens still holds the value from the original scheduling while num_external_computed_tokens has been refreshed by the connector re-query — which can result in the negative local_cache_hit.

I see.
I think that num_external_computed_tokens should have the same "set-once" behavior as num_cached_tokens.
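A hedged sketch of that "set-once" idea, with illustrative names (not the actual vLLM code paths):

```python
class RequestState:
    """Illustrative only: records the connector's answer exactly once."""

    def __init__(self) -> None:
        self.num_external_computed_tokens = -1  # sentinel: not yet recorded

    def record_external_computed_tokens(self, connector_result: int) -> None:
        # Mirror the num_cached_tokens guard: accept the first answer only.
        if self.num_external_computed_tokens < 0:
            self.num_external_computed_tokens = connector_result
        # A later, larger re-query result is deliberately ignored, keeping
        # prefill stats consistent with the values captured at first schedule.
```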

Member

markmc commented Mar 11, 2026

xref #34079

markmc added a commit to markmc/vllm that referenced this pull request Mar 11, 2026
… non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and
`num_external_computed_tokens` accounting seems quite brittle currently -
with preemption reset bugs and P/D disaggregation accounting issues -
add a defensive check to detect and prevent instances of Prometheus
counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler
forces recomputation of the last token, so the check relaxes to:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic
details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and
future-proof, at which point we can remove this check (perhaps replacing
with a simple assertion).
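For illustration, a defensive check of this shape might look roughly like the following sketch (names and structure are illustrative, not the actual change):

```python
import logging

logger = logging.getLogger(__name__)
_warned_once = False


def safe_local_cache_hit(prompt_len: int, num_cached_tokens: int,
                         num_external_computed_tokens: int,
                         recomputed: int) -> int | None:
    """Return local_cache_hit, or None when the accounting looks suspect."""
    global _warned_once
    invariant_ok = (
        0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
        and num_cached_tokens <= prompt_len
    )
    if not invariant_ok:
        if not _warned_once:
            logger.warning(
                "Suspect cache accounting: prompt_len=%d cached=%d "
                "external=%d recomputed=%d; discarding cache metrics",
                prompt_len, num_cached_tokens,
                num_external_computed_tokens, recomputed)
            _warned_once = True
        # Discard rather than feed a negative amount to a Prometheus counter.
        return None
    return num_cached_tokens + recomputed - num_external_computed_tokens
```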

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc moved this from Backlog to In Review in Metrics & Tracing Apr 8, 2026
Member

markmc commented Apr 8, 2026

#37460 will resolve this, thank you

@markmc markmc closed this Apr 8, 2026
@markmc markmc moved this from In Review to Stale in Metrics & Tracing Apr 8, 2026
@markmc markmc moved this from Stale to Not planned in Metrics & Tracing Apr 8, 2026

Labels

bug (Something isn't working), v1

Projects

Status: Not planned

3 participants