[Bugfix] Reset num_cached_tokens sentinel on request preemption#36752
xiaguan wants to merge 1 commit into vllm-project:main
Conversation
When a request is preempted, `num_computed_tokens` is reset to 0 but `num_cached_tokens` was not. On reschedule, the guard `if request.num_cached_tokens < 0` never fires again, so the stale cached-token count is kept while `num_external_computed_tokens` is re-queried from the connector and can grow larger. This produces a negative `local_cache_hit` in stats:

```
local_cache_hit = num_cached_tokens - num_external_computed_tokens < 0
```

Fix: reset `num_cached_tokens` to the sentinel value (-1) in `_preempt_request` so the guard fires on the next schedule and both fields are set consistently from fresh values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: xiaguan <751080330@qq.com>
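To make the failure mode concrete, here is a minimal, self-contained sketch of the counter lifecycle described above. The names mirror the fields being discussed, but `Request`, `preempt_request`, and `schedule` are simplified stand-ins, not the actual vLLM scheduler code.

```python
from dataclasses import dataclass

SENTINEL = -1  # "not yet initialized" marker for num_cached_tokens


@dataclass
class Request:
    num_computed_tokens: int = 0
    num_cached_tokens: int = SENTINEL


def preempt_request(request: Request) -> None:
    # Before the fix, only num_computed_tokens was reset here; the stale
    # num_cached_tokens survived preemption.
    request.num_computed_tokens = 0
    request.num_cached_tokens = SENTINEL  # the fix: re-arm the guard


def schedule(request: Request, num_new_local_computed_tokens: int,
             num_external_computed_tokens: int) -> int:
    # The initialization guard fires only while the sentinel is in place.
    if request.num_cached_tokens < 0:
        request.num_cached_tokens = (num_new_local_computed_tokens
                                     + num_external_computed_tokens)
    # Stats derive local_cache_hit from the two counters; with a stale
    # num_cached_tokens and a larger re-queried external count, this
    # difference could go negative without the reset above.
    return request.num_cached_tokens - num_external_computed_tokens
```

With the reset in place, a reschedule after preemption re-initializes both counters from fresh values, so `local_cache_hit` stays non-negative even when the connector reports more external tokens than before.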
I think that Line 344 in a40ee48 Since
Code Review
This pull request addresses a bug in the scheduler's preemption logic where num_cached_tokens was not reset, leading to stale values and potentially incorrect stats upon rescheduling. The fix correctly resets this field to its sentinel value of -1 in _preempt_request. The change is accompanied by a well-written and thorough unit test that validates the fix by simulating preemption and ensuring the relevant invariant holds on reschedule. The implementation is correct and effectively resolves the issue.
Thanks for the reply! That's a great point about the decode-phase preemption case — I agree that `is_prefilling` correctly stays `False` there and the stats wouldn't be re-evaluated. I guess the crash we're seeing is specifically with requests preempted during partial prefill, before they ever complete prefill and produce their first output. In that case, `is_prefilling` on the output_processor side is still `True`.
I see.
xref #34079
… non-negative amounts"

Since the `num_computed_tokens`, `num_cached_tokens`, and `num_external_computed_tokens` accounting currently seems quite brittle - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing it with a simple assertion).

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
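As a rough illustration of what such a defensive check could look like (a sketch only: `sanitize_cache_stats`, its signature, and the warn-once flag are hypothetical, not the code from that commit):

```python
import logging

logger = logging.getLogger(__name__)
_warned_once = False  # log diagnostics only on the first violation


def sanitize_cache_stats(prompt_len: int,
                         num_cached_tokens: int,
                         num_external_computed_tokens: int,
                         num_recomputed_tokens: int = 1) -> tuple[int, int]:
    """Return the counters unchanged, or (0, 0) if the invariant is violated.

    Enforces prompt_len >= num_cached_tokens >= 0, and allows
    num_external_computed_tokens to exceed num_cached_tokens by at most
    num_recomputed_tokens (the scheduler may force recomputation of the
    last token when the whole prompt is cached).
    """
    global _warned_once
    ok = (0 <= num_cached_tokens <= prompt_len
          and 0 <= num_external_computed_tokens
          <= num_cached_tokens + num_recomputed_tokens)
    if not ok:
        if not _warned_once:
            logger.warning(
                "Suspect cache metrics: prompt_len=%d cached=%d external=%d",
                prompt_len, num_cached_tokens, num_external_computed_tokens)
            _warned_once = True
        # Discard suspect metrics rather than feed a negative increment
        # into a Prometheus Counter.
        return 0, 0
    return num_cached_tokens, num_external_computed_tokens
```

Discarding the metrics (rather than clamping them) keeps a visible gap in the dashboards that points at the accounting bug, instead of silently reporting a plausible-looking number.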
#37460 will resolve this, thank you
Description
When a request is preempted, `_preempt_request` resets `num_computed_tokens = 0` but leaves `num_cached_tokens` at its previous value. On reschedule, the initialization guard `if request.num_cached_tokens < 0` never fires again because `num_cached_tokens` is already ≥ 0. Meanwhile `num_external_computed_tokens` is re-queried from the KV connector (line 607) and can return a larger value than before (e.g. more blocks became available on a remote store). This breaks the invariant `num_cached_tokens >= num_external_computed_tokens` and produces a negative `local_cache_hit` in stats:

```
local_cache_hit = num_cached_tokens - num_external_computed_tokens < 0
```

Root cause (`_preempt_request`, scheduler.py): `num_cached_tokens` is never reset on preemption.

Fix: reset `num_cached_tokens` to the sentinel value (-1) so the guard fires on the next schedule and both fields are initialised consistently from fresh values.

Test plan
Added `test_preempt_resets_num_cached_tokens` in `tests/v1/core/test_scheduler.py`:

- `MockKVConnector` → `num_cached_tokens` is set.
- `_preempt_request` → assert `num_cached_tokens == -1`.
- `num_cached_tokens >= num_external_computed_tokens` (invariant holds).

🤖 Generated with Claude Code
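The test flow above can be mirrored in a condensed, self-contained sketch. `FakeRequest`, `fake_preempt`, and `fake_schedule` here are hypothetical stand-ins for the real `Scheduler` and `MockKVConnector` that the actual test drives:

```python
class FakeRequest:
    """Stand-in for vLLM's Request; only the two counters matter here."""

    def __init__(self) -> None:
        self.num_computed_tokens = 0
        self.num_cached_tokens = -1  # sentinel: not yet initialized


def fake_preempt(req: FakeRequest) -> None:
    req.num_computed_tokens = 0
    req.num_cached_tokens = -1  # behaviour under test: sentinel restored


def fake_schedule(req: FakeRequest, local: int, external: int) -> int:
    if req.num_cached_tokens < 0:  # initialization guard
        req.num_cached_tokens = local + external
    return external


def test_preempt_resets_num_cached_tokens() -> None:
    req = FakeRequest()
    fake_schedule(req, 8, 4)                  # connector hit: counter is set
    assert req.num_cached_tokens == 12
    fake_preempt(req)
    assert req.num_cached_tokens == -1        # sentinel restored
    external = fake_schedule(req, 0, 16)      # connector reports more blocks now
    assert req.num_cached_tokens >= external  # invariant holds on reschedule
```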