[Bugfix] Fix negative local_cache_hit in P/D disaggregation metrics #34079
Prowindy wants to merge 3 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--34079.org.readthedocs.build/en/34079/
Code Review
This pull request addresses a crash caused by a negative metric value being passed to a Prometheus counter. While the fix of clamping the value to be non-negative resolves the crash, it unfortunately introduces a new critical issue. The change breaks the documented invariants for token statistics, which will lead to inconsistent and incorrect metrics. I have added a critical review comment on vllm/v1/metrics/stats.py detailing this problem and suggesting a path towards a more robust solution that maintains metric integrity. The addition of new tests and documentation is a good practice.
```python
self.local_cache_hit += max(
    0, num_cached_tokens + recomputed - num_external_computed_tokens
)
```
This change correctly prevents a crash by ensuring the local_cache_hit increment is non-negative. However, it introduces a critical issue by breaking the accounting invariants for PromptTokenStats documented in the class docstring (lines 246-248).
The invariants are:
- `computed + local_cache_hit + external_kv_transfer - recomputed_tokens = total`
- `local_cache_hit + external_kv_transfer - recomputed_tokens = cached_tokens`
When num_cached_tokens + recomputed - num_external_computed_tokens is negative, clamping its contribution to local_cache_hit to 0 causes these identities to no longer hold, leading to inconsistent and incorrect metrics.
For example, let X = num_cached_tokens + recomputed - num_external_computed_tokens. With the original code, the change (delta) to both sides of the first invariant was prompt_len. With this new change, when X < 0, the delta of the left-hand side becomes prompt_len - X, which does not equal prompt_len (the delta of the right-hand side). This means the metrics no longer add up correctly.
A more robust fix would adjust other metrics like computed and cached_tokens to compensate for clamping local_cache_hit, thus preserving the invariants. This would require a more comprehensive change to the update logic in this method to ensure all metrics remain consistent.
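To make the suggestion concrete, here is a minimal sketch of the compensation idea. It is not the real `PromptTokenStats` API; the field and parameter names are assumptions that mirror the invariants quoted above, and the only point is that shifting the clamped-away shortfall out of `computed` and into `cached_tokens` keeps both identities exact:

```python
# Illustrative only: assumed field names, not vLLM's actual PromptTokenStats.
from dataclasses import dataclass


@dataclass
class _Stats:
    computed: int = 0
    local_cache_hit: int = 0
    external_kv_transfer: int = 0
    recomputed_tokens: int = 0
    cached_tokens: int = 0
    total: int = 0


def _record_prompt(s: _Stats, prompt_len: int, num_cached_tokens: int,
                   num_external_computed_tokens: int, recomputed: int) -> None:
    x = num_cached_tokens + recomputed - num_external_computed_tokens
    shortfall = max(0, -x)          # tokens that would have driven the counter negative
    s.local_cache_hit += max(0, x)  # never a negative increment
    s.external_kv_transfer += num_external_computed_tokens
    s.recomputed_tokens += recomputed
    # Compensation: move the shortfall out of `computed` and into `cached_tokens`
    # so the per-request delta on both sides of each invariant stays balanced.
    s.computed += prompt_len - num_cached_tokens - shortfall
    s.cached_tokens += num_cached_tokens + shortfall
    s.total += prompt_len
    # Both documented identities still hold after every update.
    assert (s.computed + s.local_cache_hit + s.external_kv_transfer
            - s.recomputed_tokens == s.total)
    assert (s.local_cache_hit + s.external_kv_transfer
            - s.recomputed_tokens == s.cached_tokens)
```

Whether attributing the shortfall to `cached_tokens` is the right semantic choice is up to the maintainers; the sketch only shows that the invariants can be preserved while keeping the Prometheus increment non-negative.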
Hi @Prowindy, the pre-commit checks have failed. Please run `uv pip install pre-commit`, `pre-commit install`, and `pre-commit run --all-files`, then commit the changes and push to your branch.
Can we remove these two docs? They're great analysis but I don't think we need to include them as part of the PR.
| """Test P/D disagg case where external tokens exceed cached tokens. | ||
|
|
||
| In P/D disaggregation, the decode instance may receive more tokens via | ||
| external KV transfer than it has cached locally. This previously caused | ||
| negative local_cache_hit values which crashed Prometheus counters. | ||
|
|
||
| See: https://github.com/vllm-project/vllm/issues/XXXXX |
Maybe you can create an issue with the two design docs attached and reference it here?
In P/D (Prefill/Decode) disaggregated deployments, the local_cache_hit metric could become negative when external KV transfer tokens exceed locally cached tokens. This caused Prometheus counter increment failures with ValueError: "Counters can only be incremented by non-negative amounts." The fix clamps local_cache_hit to non-negative values using max(0, ...).

Root cause:
- In P/D disagg, decode receives tokens via external KV transfer
- The calculation: local_cache_hit = num_cached_tokens - num_external_computed_tokens
- When external > cached, this goes negative
- Prometheus counters reject negative increments

Example scenario:
- Prefill sends 7000 tokens to decode via NIXL
- Decode has 0 local cache
- Old: local_cache_hit = 0 - 7000 = -7000 (CRASH!)
- New: local_cache_hit = max(0, 0 - 7000) = 0 (OK)

Fixes the regression introduced in vllm-project#33290.

Signed-off-by: Cong Chen <congc@meta.com>
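For reference, the failure mode the commit message describes can be reproduced outside vLLM with a plain prometheus_client counter; the counter name below is made up for the demo and is not a real vLLM metric:

```python
from prometheus_client import Counter

# Scenario from the commit message: decode has no local cache, and the full
# 7000-token prompt arrives via external KV transfer (NIXL).
num_cached_tokens = 0
num_external_computed_tokens = 7000

old_increment = num_cached_tokens - num_external_computed_tokens          # -7000
new_increment = max(0, num_cached_tokens - num_external_computed_tokens)  # 0

demo = Counter("local_cache_hit_demo", "Demo counter, not a real vLLM metric")
demo.inc(new_increment)    # accepted
# demo.inc(old_increment)  # raises ValueError: Counters can only be incremented
#                          # by non-negative amounts.
```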
Force-pushed from 09d60f2 to 3676a28.
Signed-off-by: Simon Mo <simon.mo@hey.com>
Hi @Prowindy, how do you reproduce this error? Maybe there's a case I missed; could you share how to reproduce it?
@ZhanqiuHu I encountered the error
Thanks! Cool!
👍 Agree - it's such a strong invariant that an assertion would make sense here, we definitely shouldn't try to just suppress this error
That suggests some issue with the logic in
This is Claude's theory. I haven't validated it yet
Just hit this one benchmarking a WideEP deployment, cranking the concurrency too high
…lid blocks
When two requests share an invalid block, the first request marks it
for recomputation and properly updates num_external_computed_tokens.
The second request takes the `not marked_invalid_block` path, which
reverts num_computed_tokens to num_cached_tokens but does NOT update
num_external_computed_tokens. This broke the invariant:
num_cached_tokens >= num_external_computed_tokens
When this invariant is violated, PromptTokenStats.local_cache_hit
(computed as num_cached_tokens - num_external_computed_tokens) goes
negative, crashing Prometheus counters with:
ValueError: Counters can only be incremented by non-negative amounts.
This is fatal - it kills the AsyncLLM output_handler, crashing the
entire engine process.
The fix reduces num_external_computed_tokens by the same
num_affected_tokens delta, matching the pattern already used in the
marked_invalid_block=True path (line 2126).
Reproduction: Run P/D disaggregation with NixlConnector. Restart
prefill pods while decode pods are running. The NIXL_ERR_REMOTE_DISCONNECT
triggers _update_requests_with_invalid_blocks, and any shared blocks
between concurrent requests hit this path.
Fixes: vllm-project#26372
Related: vllm-project#34079
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
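A hedged, simplified sketch of the delta this commit describes; the attribute names are assumptions and this is not the actual `_update_requests_with_invalid_blocks` code:

```python
def revert_shared_invalid_block(request, num_affected_tokens: int) -> None:
    """Sketch of the `not marked_invalid_block` path after the fix."""
    # Pre-fix behaviour: only the computed-token count was reverted here.
    request.num_computed_tokens = request.num_cached_tokens
    # The fix: apply the same delta to the external KV-transfer count, mirroring
    # the marked_invalid_block=True path, so that
    #     num_cached_tokens >= num_external_computed_tokens
    # keeps holding and local_cache_hit cannot go negative downstream.
    request.num_external_computed_tokens -= num_affected_tokens
```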
btw, this also occurs during high concurrency scenarios w/o PD disagg but instead native KV cache offloading (as of vLLM 0.16.0)
Regression Source
This bug was introduced in PR #33290 (commit 4403e3ed4).
Error Observed
From decode service crash logs:
```
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] AsyncLLM output_handler failed.
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
    self.counter_prompt_tokens_by_source[source][engine_idx].inc(
  File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
    raise ValueError('Counters can only be incremented by non-negative amounts.')
ValueError: Counters can only be incremented by non-negative amounts.
```
This results in 400 Bad Request responses for all incoming requests.

In P/D (Prefill/Decode) disaggregated deployments, the local_cache_hit metric could become negative when external KV transfer tokens exceed locally cached tokens. This caused Prometheus counter increment failures with ValueError: "Counters can only be incremented by non-negative amounts."

The fix clamps local_cache_hit to non-negative values using max(0, ...).

Root cause:
- In P/D disagg, decode receives tokens via external KV transfer
- The calculation: local_cache_hit = num_cached_tokens - num_external_computed_tokens
- When external > cached, this goes negative
- Prometheus counters reject negative increments

Example scenario:
- Prefill sends 7000 tokens to decode via NIXL
- Decode has 0 local cache
- Old: local_cache_hit = 0 - 7000 = -7000 (CRASH!)
- New: local_cache_hit = max(0, 0 - 7000) = 0 (OK)

Fixes the regression introduced in #33290.
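One possible shape for a regression test covering the scenario above; this is a self-contained sketch that only exercises the clamp arithmetic, not the real PromptTokenStats class or the test actually added in this PR:

```python
def test_external_tokens_exceeding_cached_never_negative():
    # Decode instance: nothing cached locally, entire prompt arrives via NIXL.
    num_cached_tokens = 0
    num_external_computed_tokens = 7000
    recomputed = 0

    increment = max(
        0, num_cached_tokens + recomputed - num_external_computed_tokens
    )
    # Pre-fix this evaluated to -7000 and crashed Counter.inc(); Prometheus
    # counters must only ever receive non-negative increments.
    assert increment == 0
```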
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.