
[Bugfix] Reset num_cached_tokens sentinel on request preemption #36752

Closed
xiaguan wants to merge 1 commit into vllm-project:main from xiaguan:fix/preempt-reset-num-cached-tokens

Conversation


xiaguan commented Mar 11, 2026

Description

When a request is preempted, _preempt_request resets num_computed_tokens = 0 but leaves num_cached_tokens at its previous value. On reschedule, the initialization guard:

```
if request.num_cached_tokens < 0:
    request.num_cached_tokens = num_computed_tokens
```

never fires again because num_cached_tokens is already ≥ 0. Meanwhile num_external_computed_tokens is re-queried from the KV connector (line 607) and can return a larger value than before (e.g. more blocks became available on a remote store). This breaks the invariant num_cached_tokens >= num_external_computed_tokens and produces a negative local_cache_hit in stats:

```
local_cache_hit = num_cached_tokens + recomputed - num_external_computed_tokens
                = 1500 + 0 - 2000 = -500  # BOOM
```

Root cause (_preempt_request, scheduler.py):

```
# Before fix
request.num_computed_tokens = 0
# num_cached_tokens NOT reset → stale value persists
```

Fix: reset num_cached_tokens to the sentinel value (-1) so the guard fires on the next schedule and both fields are initialised consistently from fresh values.
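A minimal sketch of the intended behavior, using stand-in types rather than the real vLLM scheduler (field names follow the description above; the actual code in `scheduler.py` is more involved):

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for the scheduler's request state (illustrative only)."""
    num_computed_tokens: int = 0
    num_cached_tokens: int = -1  # -1 is the "not yet initialised" sentinel


def preempt_request(request: Request) -> None:
    request.num_computed_tokens = 0
    # The fix: restore the sentinel so the initialization guard fires again
    # on the next schedule and both fields are re-derived from fresh values.
    request.num_cached_tokens = -1


def schedule(request: Request, num_computed_tokens: int) -> None:
    # Initialization guard from the description: only runs while the
    # sentinel is in place.
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = num_computed_tokens
```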

Test plan

Added test_preempt_resets_num_cached_tokens in tests/v1/core/test_scheduler.py:

  1. Schedule a request with a MockKVConnector → num_cached_tokens is set.
  2. Directly call _preempt_request → assert num_cached_tokens == -1.
  3. Re-schedule → assert num_cached_tokens >= num_external_computed_tokens (invariant holds).
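Roughly, the test logic reduces to the following (shown against the stand-in types from the sketch above; the real test drives the scheduler with a mock KV connector and its fixtures differ):

```python
def test_preempt_resets_num_cached_tokens() -> None:
    req = Request()

    # 1. First schedule: the guard fires and num_cached_tokens is set.
    schedule(req, num_computed_tokens=1500)
    assert req.num_cached_tokens == 1500

    # 2. Preemption restores the sentinel.
    preempt_request(req)
    assert req.num_cached_tokens == -1

    # 3. Reschedule with a larger connector answer: the guard fires again,
    #    so num_cached_tokens >= num_external_computed_tokens holds.
    num_external_computed_tokens = 2000
    schedule(req, num_computed_tokens=num_external_computed_tokens)
    assert req.num_cached_tokens >= num_external_computed_tokens
```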

🤖 Generated with Claude Code

When a request is preempted, `num_computed_tokens` is reset to 0 but
`num_cached_tokens` was not. On reschedule the guard
`if request.num_cached_tokens < 0` never fires again, so the stale
cached-token count is kept while `num_external_computed_tokens` is
re-queried from the connector and can grow larger. This produces a
negative `local_cache_hit` in stats:

  local_cache_hit = num_cached_tokens - num_external_computed_tokens < 0

Fix: reset `num_cached_tokens` to the sentinel value (-1) in
`_preempt_request` so the guard fires on the next schedule and both
fields are set consistently from fresh values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: xiaguan <751080330@qq.com>
Collaborator

orozery commented Mar 11, 2026

I think that num_cached_tokens is only used once in the request lifetime:

```
if is_prefilling:
```

Since is_prefilling will not be re-set to True upon preemption.
I actually think this is the desired behavior since we want to be able to characterize the kv cache hits stats per workload, which should be independent of whether preemptions occurred.

Contributor

gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a bug in the scheduler's preemption logic where num_cached_tokens was not reset, leading to stale values and potentially incorrect stats upon rescheduling. The fix correctly resets this field to its sentinel value of -1 in _preempt_request. The change is accompanied by a well-written and thorough unit test that validates the fix by simulating preemption and ensuring the relevant invariant holds on reschedule. The implementation is correct and effectively resolves the issue.

Author

xiaguan commented Mar 11, 2026

Thanks for the reply! That's a great point about the decode-phase preemption case — I agree is_prefilling correctly stays False there and the stats wouldn't be re-evaluated.

I guess the crash we're seeing is specifically with requests preempted during partial prefill — before they ever complete prefill and produce their first output. In that case, is_prefilling on the output_processor side is still True (since line 621 req_state.is_prefilling = False was never reached). When the request is rescheduled and eventually completes prefill, PromptTokenStats.update_from_output() does get called, but num_cached_tokens still holds the value from the original scheduling while num_external_computed_tokens has been refreshed by the connector re-query — which can result in the negative local_cache_hit.

Collaborator

orozery commented Mar 11, 2026

> I guess the crash we're seeing is specifically with requests preempted during partial prefill — before they ever complete prefill and produce their first output. In that case, is_prefilling on the output_processor side is still True (since line 621 req_state.is_prefilling = False was never reached). When the request is rescheduled and eventually completes prefill, PromptTokenStats.update_from_output() does get called, but num_cached_tokens still holds the value from the original scheduling while num_external_computed_tokens has been refreshed by the connector re-query — which can result in the negative local_cache_hit.

I see.
I think that num_external_computed_tokens should have the same "set-once" behavior as num_cached_tokens.
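A hedged sketch of that "set-once" idea, with illustrative names (not the actual vLLM code paths):

```python
class RequestState:
    """Illustrative only: records the connector's answer exactly once."""

    def __init__(self) -> None:
        self.num_external_computed_tokens = -1  # sentinel: not yet recorded

    def record_external_computed_tokens(self, connector_result: int) -> None:
        # Mirror the num_cached_tokens guard: accept the first answer only.
        if self.num_external_computed_tokens < 0:
            self.num_external_computed_tokens = connector_result
        # A later, larger re-query result is deliberately ignored, keeping
        # prefill stats consistent with the values captured at first schedule.
```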

Member

markmc commented Mar 11, 2026

xref #34079

markmc added a commit to markmc/vllm that referenced this pull request Mar 11, 2026
… non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and
`num_external_computed_tokens` accounting seems quite brittle currently -
with preemption reset bugs and P/D disaggregation accounting issues -
add a defensive check to detect and prevent instances of Prometheus
counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler
forces recomputation of the last token, so the check relaxes to:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic
details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and
future-proof, at which point we can remove this check (perhaps replacing
with a simple assertion).
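For illustration, a defensive check of this shape might look roughly like the following sketch (names and structure are illustrative, not the actual change):

```python
import logging

logger = logging.getLogger(__name__)
_warned_once = False


def safe_local_cache_hit(prompt_len: int, num_cached_tokens: int,
                         num_external_computed_tokens: int,
                         recomputed: int) -> int | None:
    """Return local_cache_hit, or None when the accounting looks suspect."""
    global _warned_once
    invariant_ok = (
        0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
        and num_cached_tokens <= prompt_len
    )
    if not invariant_ok:
        if not _warned_once:
            logger.warning(
                "Suspect cache accounting: prompt_len=%d cached=%d "
                "external=%d recomputed=%d; discarding cache metrics",
                prompt_len, num_cached_tokens,
                num_external_computed_tokens, recomputed)
            _warned_once = True
        # Discard rather than feed a negative amount to a Prometheus counter.
        return None
    return num_cached_tokens + recomputed - num_external_computed_tokens
```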

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc moved this from Backlog to In Review in Metrics & Tracing Apr 8, 2026
Member

markmc commented Apr 8, 2026

#37460 will resolve this, thank you

@markmc markmc closed this Apr 8, 2026
@markmc markmc moved this from In Review to Stale in Metrics & Tracing Apr 8, 2026
@markmc markmc moved this from Stale to Not planned in Metrics & Tracing Apr 8, 2026

Labels

bug (Something isn't working), v1

Projects

Status: Not planned

3 participants