
[Bugfix] Fix negative local_cache_hit in P/D disaggregation metrics#34079

Closed
Prowindy wants to merge 3 commits into vllm-project:main from Prowindy:fix/negative-prompt-token-stats-pd-disagg

Conversation

@Prowindy
Contributor

@Prowindy Prowindy commented Feb 8, 2026

Regression Source

This bug was introduced in PR #33290 (commit 4403e3ed4).

Error Observed

From decode service crash logs:
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] AsyncLLM output_handler failed.
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
self.counter_prompt_tokens_by_source[source][engine_idx].inc(
File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
raise ValueError('Counters can only be incremented by non-negative amounts.')
ValueError: Counters can only be incremented by non-negative amounts.

This results in 400 Bad Request responses for all incoming requests.

In P/D (Prefill/Decode) disaggregated deployments, the local_cache_hit metric could become negative when external KV transfer tokens exceed locally cached tokens. This caused Prometheus counter increment failures with ValueError: "Counters can only be incremented by non-negative amounts."

The fix clamps local_cache_hit to non-negative values using max(0, ...).

Root cause:

  • In P/D disagg, decode receives tokens via external KV transfer
  • The calculation: local_cache_hit = num_cached_tokens - num_external_computed_tokens
  • When external > cached, this goes negative
  • Prometheus counters reject negative increments

Example scenario:

  • Prefill sends 7000 tokens to decode via NIXL
  • Decode has 0 local cache
  • Old: local_cache_hit = 0 - 7000 = -7000 (CRASH!)
  • New: local_cache_hit = max(0, 0 - 7000) = 0 (OK)

Fixes the regression introduced in #33290.


@Prowindy Prowindy requested a review from markmc as a code owner February 8, 2026 05:58
@mergify
Contributor

mergify Bot commented Feb 8, 2026

Documentation preview: https://vllm--34079.org.readthedocs.build/en/34079/

@mergify mergify Bot added documentation Improvements or additions to documentation v1 bug Something isn't working kv-connector labels Feb 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a crash caused by a negative metric value being passed to a Prometheus counter. While the fix of clamping the value to be non-negative resolves the crash, it unfortunately introduces a new critical issue. The change breaks the documented invariants for token statistics, which will lead to inconsistent and incorrect metrics. I have added a critical review comment on vllm/v1/metrics/stats.py detailing this problem and suggesting a path towards a more robust solution that maintains metric integrity. The addition of new tests and documentation is a good practice.

Comment thread vllm/v1/metrics/stats.py
Comment on lines +280 to 282
self.local_cache_hit += max(
0, num_cached_tokens + recomputed - num_external_computed_tokens
)
Contributor

critical

This change correctly prevents a crash by ensuring the local_cache_hit increment is non-negative. However, it introduces a critical issue by breaking the accounting invariants for PromptTokenStats documented in the class docstring (lines 246-248).

The invariants are:

  1. computed + local_cache_hit + external_kv_transfer - recomputed_tokens = total
  2. local_cache_hit + external_kv_transfer - recomputed_tokens = cached_tokens

When num_cached_tokens + recomputed - num_external_computed_tokens is negative, clamping its contribution to local_cache_hit to 0 causes these identities to no longer hold, leading to inconsistent and incorrect metrics.

For example, let X = num_cached_tokens + recomputed - num_external_computed_tokens. With the original code, the change (delta) to both sides of the first invariant was prompt_len. With this new change, when X < 0, the delta of the left-hand side becomes prompt_len - X, which does not equal prompt_len (the delta of the right-hand side). This means the metrics no longer add up correctly.

A more robust fix would adjust other metrics like computed and cached_tokens to compensate for clamping local_cache_hit, thus preserving the invariants. This would require a more comprehensive change to the update logic in this method to ensure all metrics remain consistent.

Collaborator

@markmc any suggestion on this? Thank you!

@mergify
Contributor

mergify Bot commented Feb 8, 2026

Hi @Prowindy, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

Can we remove these two docs? They're great analysis but I don't think we need to include them as part of the PR.

Comment thread tests/v1/metrics/test_stats.py Outdated
Comment on lines +215 to +221
"""Test P/D disagg case where external tokens exceed cached tokens.

In P/D disaggregation, the decode instance may receive more tokens via
external KV transfer than it has cached locally. This previously caused
negative local_cache_hit values which crashed Prometheus counters.

See: https://github.com/vllm-project/vllm/issues/XXXXX
Member

Maybe you can create an issue with the two design docs attached and reference here?

Signed-off-by: Cong Chen <congc@meta.com>
@Prowindy Prowindy force-pushed the fix/negative-prompt-token-stats-pd-disagg branch from 09d60f2 to 3676a28 on February 8, 2026 16:55
Signed-off-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: Simon Mo <simon.mo@hey.com>
@ZhanqiuHu
Contributor

ZhanqiuHu commented Feb 9, 2026

Hi @Prowindy, how do you reproduce this error?

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

But maybe there's a case I missed, could you share how to reproduce this?

@Prowindy
Contributor Author

Hi @Prowindy, how do you reproduce this error?

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

But maybe there's a case I missed, could you share how to reproduce this?

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed. I'm currently focused on fixing the NIXL error; if that is resolved, it's possible this error won't happen either. I will keep you updated if it recurs despite the NIXL fix.

@markmc markmc moved this from Backlog to P1 in Metrics & Tracing Feb 10, 2026
@ZhanqiuHu
Contributor

Hi @Prowindy, how do you reproduce this error?
From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.
But maybe there's a case I missed, could you share how to reproduce this?

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed. I'm currently focused on fixing the NIXL error; if that is resolved, it's possible this error won't happen either. I will keep you updated if it recurs despite the NIXL fix.

Thanks! Cool!

@markmc
Member

markmc commented Feb 10, 2026

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

👍

Agree - it's such a strong invariant that an assertion would make sense here; we definitely shouldn't try to just suppress this error

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed.

That suggests some issue with the logic in _update_requests_with_invalid_blocks() that we need to address ... but it's not obvious to me what the problem is

@markmc
Member

markmc commented Feb 10, 2026

That suggests some issue with the logic in _update_requests_with_invalid_blocks() that we need to address ... but it's not obvious to me what the problem is

This is Claude's theory. I haven't validated it yet

  The Bug

  In _update_requests_with_invalid_blocks() at lines 2093-2103, there's a path where:

  1. Line 2094: if not marked_invalid_block: - This executes when all invalid blocks for this request are shared with previous requests (they'll recompute them)
  2. Line 2103: request.num_computed_tokens = request.num_cached_tokens - The code reverts to considering only cached tokens as computed
  3. THE BUG: request.num_external_computed_tokens is NOT updated on this path!

  Why This Breaks the Invariant

  Normally:
  - num_cached_tokens includes both local and external tokens (set at scheduler.py:786)
  - Therefore: num_cached_tokens >= num_external_computed_tokens

  But when NIXL error occurs and this path executes:
  - Line 2103 sets num_computed_tokens = num_cached_tokens
  - But num_external_computed_tokens keeps its old value (potentially larger)
  - This can result in: num_cached_tokens < num_external_computed_tokens ❌

  The Fix

  At scheduler.py:2103, when setting request.num_computed_tokens = request.num_cached_tokens, the code should also adjust request.num_external_computed_tokens:

  if not marked_invalid_block:
      # All invalid blocks of this request are shared with
      # previous requests and will be recomputed by them.
      # Revert to considering only cached tokens as computed.
      total_affected_tokens += (
          request.num_computed_tokens - request.num_cached_tokens
      )
      # BUG FIX: Also reduce num_external_computed_tokens
      request.num_external_computed_tokens -= (
          request.num_computed_tokens - request.num_cached_tokens
      )
      request.num_computed_tokens = request.num_cached_tokens

  This ensures that when we reduce num_computed_tokens, we also reduce num_external_computed_tokens by the same amount, maintaining the invariant that num_cached_tokens >= num_external_computed_tokens.

@tlrmchlsmth
Member

tlrmchlsmth commented Feb 26, 2026

Just hit this one benchmarking a WideEP deployment, cranking the concurrency too high

tlrmchlsmth added a commit to tlrmchlsmth/vllm that referenced this pull request Feb 26, 2026
…lid blocks

When two requests share an invalid block, the first request marks it
for recomputation and properly updates num_external_computed_tokens.
The second request takes the `not marked_invalid_block` path, which
reverts num_computed_tokens to num_cached_tokens but did NOT update
num_external_computed_tokens. This broke the invariant:

    num_cached_tokens >= num_external_computed_tokens

When this invariant is violated, PromptTokenStats.local_cache_hit
(computed as num_cached_tokens - num_external_computed_tokens) goes
negative, crashing Prometheus counters with:

    ValueError: Counters can only be incremented by non-negative amounts.

This is fatal - it kills the AsyncLLM output_handler, crashing the
entire engine process.

The fix reduces num_external_computed_tokens by the same
num_affected_tokens delta, matching the pattern already used in the
marked_invalid_block=True path (line 2126).

Reproduction: Run P/D disaggregation with NixlConnector. Restart
prefill pods while decode pods are running. The NIXL_ERR_REMOTE_DISCONNECT
triggers _update_requests_with_invalid_blocks, and any shared blocks
between concurrent requests hit this path.

Fixes: vllm-project#26372
Related: vllm-project#34079

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cquil11

cquil11 commented Mar 11, 2026

btw, this also occurs in high-concurrency scenarios without P/D disagg, but with native KV cache offloading instead (as of vLLM 0.16.0)

@markmc
Copy link
Copy Markdown
Member

markmc commented Apr 8, 2026

The issue has been fixed on main since #37160 introduced this band-aid:

        self.local_cache_hit += max(
            0, (num_cached_tokens + recomputed - num_external_computed_tokens)
        )

#37460 is the current candidate for a long-term fix

@markmc markmc closed this Apr 8, 2026
@markmc markmc moved this from In Review to Not planned in Metrics & Tracing Apr 8, 2026
@markmc markmc moved this from Not planned to Done in Metrics & Tracing Apr 8, 2026



7 participants