[WIP][Bugfix] Fix negative prompt token counter crash under KV offloading #36638
haosdent wants to merge 1 commit into vllm-project:main
Conversation
Clamp num_external_computed_tokens to non-negative in _update_requests_with_invalid_blocks, reset stale num_cached_tokens on preemption/retry paths, and add defensive guards in PromptTokenStats.update_from_output to prevent Prometheus Counter.inc() from receiving negative values.

Fixes vllm-project#36533

Signed-off-by: haosdent <haosdent@gmail.com>
Code Review
This pull request addresses a crash caused by negative token counters during KV offloading by introducing clamping in the scheduler and defensive guards in PromptTokenStats. It also correctly resets request.num_cached_tokens on preemption and KV failure to prevent using stale values. The changes are well-tested with new unit tests covering various edge cases. My review identified one potential logic issue in vllm/v1/metrics/stats.py where a variable is calculated before its dependent value is sanitized, which could lead to incorrect metrics. I've provided a recommendation to reorder the operations for improved robustness.
```python
# Guard against inconsistent values from KV load failures or
# stale num_cached_tokens after request retry/preemption.
num_cached_tokens = max(0, num_cached_tokens)
num_external_computed_tokens = max(0, num_external_computed_tokens)
num_external_computed_tokens = min(
    num_external_computed_tokens, num_cached_tokens + recomputed
)
```
There's a potential logic issue with the order of operations. The recomputed variable is calculated on line 280 using num_cached_tokens before it is sanitized on line 284. If num_cached_tokens is negative (e.g., the sentinel value -1), this could lead to an incorrect value for recomputed and subsequently incorrect metrics, although it may not cause a crash in most scenarios.
To ensure correctness and improve robustness, the recomputed calculation should occur after num_cached_tokens has been clamped to a non-negative value.
I recommend reordering the logic like this:
```python
# Guard against inconsistent values from KV load failures or
# stale num_cached_tokens after request retry/preemption.
num_cached_tokens = max(0, num_cached_tokens)
# When all tokens are cached, the scheduler reduces num_cached_tokens
# by 1 to force the model to recompute the last token, since the model
# needs at least one input token to run a forward pass.
recomputed = 1 if (num_cached_tokens + 1 == prompt_len) else 0
num_external_computed_tokens = max(0, num_external_computed_tokens)
num_external_computed_tokens = min(
    num_external_computed_tokens, num_cached_tokens + recomputed
)
```

This would involve moving line 280 to after line 284.
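For illustration, here is a minimal standalone sketch (not vLLM code; the variable names simply mirror the diff) showing how the ordering affects `recomputed` when the `-1` sentinel leaks through:

```python
def recomputed_before_clamp(num_cached_tokens: int, prompt_len: int) -> int:
    # Order in the current diff: recomputed is derived from the raw value,
    # and num_cached_tokens is only clamped afterwards.
    recomputed = 1 if (num_cached_tokens + 1 == prompt_len) else 0
    num_cached_tokens = max(0, num_cached_tokens)
    return recomputed


def recomputed_after_clamp(num_cached_tokens: int, prompt_len: int) -> int:
    # Suggested order: clamp first, then derive recomputed.
    num_cached_tokens = max(0, num_cached_tokens)
    return 1 if (num_cached_tokens + 1 == prompt_len) else 0


# With a stale -1 sentinel and an empty prompt, the unclamped order reports a
# spurious recomputed token (-1 + 1 == 0); the clamped order does not.
assert recomputed_before_clamp(-1, 0) == 1
assert recomputed_after_clamp(-1, 0) == 0
```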
gambletan left a comment
Good defensive fix for the negative prompt token counter crash. Clamping values at the stats layer is a pragmatic approach.
A few observations:
- `num_cached_tokens = -1` as sentinel (in `scheduler.py`): Using `-1` as a sentinel value for "needs to be re-captured" is functional but fragile. If any code path reads `num_cached_tokens` before it's re-set (e.g., during a race between scheduling and metrics collection), it will silently propagate `-1` into calculations. Consider using `Optional[int]` with `None` as the sentinel instead — this would cause a `TypeError` rather than silently producing wrong metrics if the value is used before being re-initialized (see the sketch after this list).

- Clamping order in `stats.py`: The clamping logic is:

  ```python
  num_cached_tokens = max(0, num_cached_tokens)
  num_external_computed_tokens = max(0, num_external_computed_tokens)
  num_external_computed_tokens = min(num_external_computed_tokens, num_cached_tokens + recomputed)
  ```

  Note that `recomputed` is computed before clamping `num_cached_tokens`. If `num_cached_tokens` was originally negative, `recomputed` could be 0 when it should be 1 (or vice versa). For example, if `num_cached_tokens=-1` and `prompt_len=0` (edge case), the `recomputed` check `num_cached_tokens + 1 == prompt_len` would be `True` (since `-1 + 1 == 0`), incorrectly setting `recomputed = 1`. Moving the clamping before the `recomputed` calculation would be safer.

- Test coverage: The parameterized test `test_prompt_token_stats_all_non_negative` is a nice touch — it ensures no field goes negative regardless of input. The case `(50, 100, 100)` (external > cached) is particularly useful for catching the overflow scenario.

- `_update_requests_with_invalid_blocks`: The `max(0, ...)` clamping here is the right fix for the root cause. Good that it's applied at the source rather than only at the metrics layer.
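As a standalone illustration of the first point above (hypothetical class and field names, not the actual vLLM `Request` object), a `None` sentinel makes a stale read fail loudly instead of silently feeding `-1` into metrics:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestSketch:
    # None means "not captured yet / reset after preemption or KV retry",
    # so arithmetic on a stale value raises instead of producing -1.
    num_cached_tokens: Optional[int] = None


req = RequestSketch()
try:
    _ = req.num_cached_tokens + 1  # type: ignore[operator]
except TypeError:
    print("stale num_cached_tokens read before re-capture")
```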
xref #34079

Thx @markmc, let me close mine
… non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and `num_external_computed_tokens` accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing it with a simple assertion).

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
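A rough sketch of the defensive check that commit message describes (standalone code with assumed function and variable names; the actual check lives alongside `PromptTokenStats` in vLLM):

```python
import logging

logger = logging.getLogger(__name__)
_warned_once = False


def sanitize_cache_metrics(
    prompt_len: int, num_cached_tokens: int, num_external_computed_tokens: int
) -> tuple[int, int]:
    """Enforce prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0,
    allowing the recomputed-last-token nuance; discard suspect values otherwise."""
    global _warned_once
    # When all tokens are cached, the scheduler recomputes the last token.
    recomputed = 1 if num_cached_tokens + 1 == prompt_len else 0
    valid = (
        0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
        and 0 <= num_cached_tokens <= prompt_len
    )
    if not valid:
        if not _warned_once:
            logger.warning(
                "Inconsistent prompt cache accounting: prompt_len=%d "
                "num_cached_tokens=%d num_external_computed_tokens=%d; "
                "discarding cache metrics for this request.",
                prompt_len, num_cached_tokens, num_external_computed_tokens,
            )
            _warned_once = True
        # Never let a negative or inconsistent value reach Counter.inc().
        return 0, 0
    return num_cached_tokens, num_external_computed_tokens
```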
Purpose
Fix engine crash (`ValueError: Counters can only be incremented by non-negative amounts`) that occurs under high concurrency with CPU KV offloading enabled (GitHub issue #36533).

Root cause: In `_update_requests_with_invalid_blocks` (scheduler.py), when KV cache blocks fail to load, `num_affected_tokens` — which includes both locally-cached and externally-loaded tokens — is subtracted entirely from `request.num_external_computed_tokens`, driving it negative. This negative value propagates through `EngineCoreOutput` → `PromptTokenStats` → `PrometheusStatLogger.record()` → `Counter.inc()`, causing the crash.

Secondary issue: `request.num_cached_tokens` is set once during initial scheduling and never reset when a request is freed for retry after KV failure or preemption. On reschedule, `num_external_computed_tokens` may be re-queried to a new value while `num_cached_tokens` stays stale, creating a mismatch that can also produce a negative `local_cache_hit`.
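As a simplified, self-contained illustration of the primary failure mode (hypothetical numbers; not the actual scheduler code):

```python
# One request whose offloaded KV blocks fail to load (hypothetical numbers).
num_external_computed_tokens = 96  # tokens loaded from the external KV store
num_affected_tokens = 160          # invalid span: external + locally-cached tokens

# Subtracting the full affected span drives the counter negative; that value
# later reaches Prometheus Counter.inc() and raises ValueError.
num_external_computed_tokens -= num_affected_tokens
assert num_external_computed_tokens == -64

# The fix clamps at the source, in _update_requests_with_invalid_blocks.
num_external_computed_tokens = max(0, num_external_computed_tokens)
assert num_external_computed_tokens == 0
```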
Fix:

- Clamp `num_external_computed_tokens` to non-negative after subtraction in `_update_requests_with_invalid_blocks`
- Reset `num_cached_tokens` to `-1` (sentinel) in preemption and KV failure retry paths so it is re-captured consistently on reschedule
- Add defensive guards in `PromptTokenStats.update_from_output` to clamp inputs and maintain invariants

Test Plan
- `test_prompt_token_stats_negative_external_clamped` — verifies negative `num_external_computed_tokens` is clamped to 0
- `test_prompt_token_stats_external_exceeds_cached` — verifies `num_external_computed_tokens > num_cached_tokens` is clamped
- `test_prompt_token_stats_negative_cached_clamped` — verifies negative `num_cached_tokens` is clamped to 0
- `test_prompt_token_stats_all_non_negative` (parametrized, 6 cases) — fuzz-style check that all `PromptTokenStats` fields remain non-negative across edge cases

Test Result
All 19 tests in `test_stats.py` pass. All 3 tests in `test_invalid_blocks_correctness.py` pass.
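For reference, the parametrized non-negativity check described in the Test Plan could look roughly like this (illustrative sketch only; the real tests exercise `PromptTokenStats.update_from_output`, whose exact signature is not shown here, so a stand-in `clamp` helper is used):

```python
import pytest


def clamp(prompt_len: int, num_cached: int, num_external: int) -> tuple[int, int]:
    # Stand-in for the clamping logic under test.
    num_cached = max(0, num_cached)
    recomputed = 1 if num_cached + 1 == prompt_len else 0
    num_external = max(0, min(num_external, num_cached + recomputed))
    return num_cached, num_external


@pytest.mark.parametrize(
    "prompt_len,num_cached,num_external",
    [(100, 50, 20), (100, -1, 0), (100, 50, -30), (50, 100, 100), (0, -1, -1), (1, 1, 1)],
)
def test_prompt_token_stats_all_non_negative(prompt_len, num_cached, num_external):
    cached, external = clamp(prompt_len, num_cached, num_external)
    assert cached >= 0
    assert external >= 0
    assert external <= cached + 1
```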