
[Bugfix] Fix negative local_cache_hit in P/D disaggregation metrics#34079

Closed
Prowindy wants to merge 3 commits into vllm-project:main from Prowindy:fix/negative-prompt-token-stats-pd-disagg

Conversation

@Prowindy
Contributor

@Prowindy Prowindy commented Feb 8, 2026

Regression Source

This bug was introduced in PR #33290 (commit 4403e3ed4).

Error Observed

From decode service crash logs:
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] AsyncLLM output_handler failed.
(ApiServer_0 pid=308) ERROR 02-08 03:01:13 [v1/engine/async_llm.py:698] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
self.counter_prompt_tokens_by_source[source][engine_idx].inc(
File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
raise ValueError('Counters can only be incremented by non-negative amounts.')
ValueError: Counters can only be incremented by non-negative amounts.

This results in 400 Bad Request responses for all incoming requests.

In P/D (Prefill/Decode) disaggregated deployments, the local_cache_hit metric could become negative when external KV transfer tokens exceed locally cached tokens. This caused Prometheus counter increment failures with ValueError: "Counters can only be incremented by non-negative amounts."

The fix clamps local_cache_hit to non-negative values using max(0, ...).

Root cause:

  • In P/D disagg, decode receives tokens via external KV transfer
  • The calculation: local_cache_hit = num_cached_tokens - num_external_computed_tokens
  • When external > cached, this goes negative
  • Prometheus counters reject negative increments

Example scenario:

  • Prefill sends 7000 tokens to decode via NIXL
  • Decode has 0 local cache
  • Old: local_cache_hit = 0 - 7000 = -7000 (CRASH!)
  • New: local_cache_hit = max(0, 0 - 7000) = 0 (OK)

Fixes the regression introduced in #33290.


@Prowindy Prowindy requested a review from markmc as a code owner February 8, 2026 05:58
@mergify
Contributor

mergify Bot commented Feb 8, 2026

Documentation preview: https://vllm--34079.org.readthedocs.build/en/34079/

@mergify mergify Bot added documentation Improvements or additions to documentation v1 bug Something isn't working kv-connector labels Feb 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a crash caused by a negative metric value being passed to a Prometheus counter. While the fix of clamping the value to be non-negative resolves the crash, it unfortunately introduces a new critical issue. The change breaks the documented invariants for token statistics, which will lead to inconsistent and incorrect metrics. I have added a critical review comment on vllm/v1/metrics/stats.py detailing this problem and suggesting a path towards a more robust solution that maintains metric integrity. The addition of new tests and documentation is a good practice.

Comment thread vllm/v1/metrics/stats.py
Comment on lines +280 to 282
self.local_cache_hit += max(
0, num_cached_tokens + recomputed - num_external_computed_tokens
)
Contributor

critical

This change correctly prevents a crash by ensuring the local_cache_hit increment is non-negative. However, it introduces a critical issue by breaking the accounting invariants for PromptTokenStats documented in the class docstring (lines 246-248).

The invariants are:

  1. computed + local_cache_hit + external_kv_transfer - recomputed_tokens = total
  2. local_cache_hit + external_kv_transfer - recomputed_tokens = cached_tokens

When num_cached_tokens + recomputed - num_external_computed_tokens is negative, clamping its contribution to local_cache_hit to 0 causes these identities to no longer hold, leading to inconsistent and incorrect metrics.

For example, let X = num_cached_tokens + recomputed - num_external_computed_tokens. With the original code, the change (delta) to both sides of the first invariant was prompt_len. With this new change, when X < 0, the delta of the left-hand side becomes prompt_len - X, which does not equal prompt_len (the delta of the right-hand side). This means the metrics no longer add up correctly.

A more robust fix would adjust other metrics like computed and cached_tokens to compensate for clamping local_cache_hit, thus preserving the invariants. This would require a more comprehensive change to the update logic in this method to ensure all metrics remain consistent.

Collaborator

@markmc any suggestion on this? Thank you!

@mergify
Contributor

mergify Bot commented Feb 8, 2026

Hi @Prowindy, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

Can we remove these two docs? They're great analysis but I don't think we need to include them as part of the PR.

Comment thread tests/v1/metrics/test_stats.py Outdated
Comment on lines +215 to +221
"""Test P/D disagg case where external tokens exceed cached tokens.

In P/D disaggregation, the decode instance may receive more tokens via
external KV transfer than it has cached locally. This previously caused
negative local_cache_hit values which crashed Prometheus counters.

See: https://github.com/vllm-project/vllm/issues/XXXXX
Member

Maybe you can create an issue with the two design docs attached and reference here?

Signed-off-by: Cong Chen <congc@meta.com>
@Prowindy Prowindy force-pushed the fix/negative-prompt-token-stats-pd-disagg branch from 09d60f2 to 3676a28 on February 8, 2026 16:55
Signed-off-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: Simon Mo <simon.mo@hey.com>
@ZhanqiuHu
Contributor

ZhanqiuHu commented Feb 9, 2026

Hi @Prowindy, how do you reproduce this error?

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

But maybe there's a case I missed, could you share how to reproduce this?

@Prowindy
Contributor Author

Hi @Prowindy, how do you reproduce this error?

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

But maybe there's a case I missed, could you share how to reproduce this?

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed. I'm currently focused on fixing the NIXL error; if that is resolved, it's possible this error won't happen either. I will keep you updated if it recurs despite the NIXL fix.

@markmc markmc moved this from Backlog to P1 in Metrics & Tracing Feb 10, 2026
@ZhanqiuHu
Contributor

Hi @Prowindy, how do you reproduce this error?
From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.
But maybe there's a case I missed, could you share how to reproduce this?

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed. I'm currently focused on fixing the NIXL error; if that is resolved, it's possible this error won't happen either. I will keep you updated if it recurs despite the NIXL fix.

Thanks! Cool!

@markmc
Member

markmc commented Feb 10, 2026

From my understanding, num_cached_tokens is updated at scheduler.py:L785-L786, so it will be computed as num_new_local_computed_tokens + num_external_computed_tokens in the sync case (L629-L631), or len(block_ids) * block_size (covering both local and external blocks) in the async case (L637, L1956). So normally it should always be >= num_external_computed_tokens.

👍

Agree - it's such a strong invariant that an assertion would make sense here; we definitely shouldn't try to just suppress this error

@ZhanqiuHu I encountered the error Counters can only be incremented by non-negative amounts. when KV transmission through NIXLConnector failed.

That suggests some issue with the logic in _update_requests_with_invalid_blocks() that we need to address ... but it's not obvious to me what the problem is

@markmc
Member

markmc commented Feb 10, 2026

That suggests some issue with the logic in _update_requests_with_invalid_blocks() that we need to address ... but it's not obvious to me what the problem is

This is Claude's theory. I haven't validated it yet

  The Bug

  In _update_requests_with_invalid_blocks() at lines 2093-2103, there's a path where:

  1. Line 2094: if not marked_invalid_block: - This executes when all invalid blocks for this request are shared with previous requests (they'll recompute them)
  2. Line 2103: request.num_computed_tokens = request.num_cached_tokens - The code reverts to considering only cached tokens as computed
  3. THE BUG: request.num_external_computed_tokens is NOT updated on this path!

  Why This Breaks the Invariant

  Normally:
  - num_cached_tokens includes both local and external tokens (set at scheduler.py:786)
  - Therefore: num_cached_tokens >= num_external_computed_tokens

  But when NIXL error occurs and this path executes:
  - Line 2103 sets num_computed_tokens = num_cached_tokens
  - But num_external_computed_tokens keeps its old value (potentially larger)
  - This can result in: num_cached_tokens < num_external_computed_tokens ❌

  The Fix

  At scheduler.py:2103, when setting request.num_computed_tokens = request.num_cached_tokens, the code should also adjust request.num_external_computed_tokens:

  if not marked_invalid_block:
      # All invalid blocks of this request are shared with
      # previous requests and will be recomputed by them.
      # Revert to considering only cached tokens as computed.
      total_affected_tokens += (
          request.num_computed_tokens - request.num_cached_tokens
      )
      # BUG FIX: Also reduce num_external_computed_tokens
      request.num_external_computed_tokens -= (
          request.num_computed_tokens - request.num_cached_tokens
      )
      request.num_computed_tokens = request.num_cached_tokens

  This ensures that when we reduce num_computed_tokens, we also reduce num_external_computed_tokens by the same amount, maintaining the invariant that num_cached_tokens >= num_external_computed_tokens.

@tlrmchlsmth
Member

tlrmchlsmth commented Feb 26, 2026

Just hit this one benchmarking a WideEP deployment, cranking the concurrency too high

tlrmchlsmth added a commit to tlrmchlsmth/vllm that referenced this pull request Feb 26, 2026
…lid blocks

When two requests share an invalid block, the first request marks it
for recomputation and properly updates num_external_computed_tokens.
The second request takes the `not marked_invalid_block` path, which
reverts num_computed_tokens to num_cached_tokens but did NOT update
num_external_computed_tokens. This broke the invariant:

    num_cached_tokens >= num_external_computed_tokens

When this invariant is violated, PromptTokenStats.local_cache_hit
(computed as num_cached_tokens - num_external_computed_tokens) goes
negative, crashing Prometheus counters with:

    ValueError: Counters can only be incremented by non-negative amounts.

This is fatal - it kills the AsyncLLM output_handler, crashing the
entire engine process.

The fix reduces num_external_computed_tokens by the same
num_affected_tokens delta, matching the pattern already used in the
marked_invalid_block=True path (line 2126).

Reproduction: Run P/D disaggregation with NixlConnector. Restart
prefill pods while decode pods are running. The NIXL_ERR_REMOTE_DISCONNECT
triggers _update_requests_with_invalid_blocks, and any shared blocks
between concurrent requests hit this path.

Fixes: vllm-project#26372
Related: vllm-project#34079

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cquil11

cquil11 commented Mar 11, 2026

btw, this also occurs in high-concurrency scenarios without P/D disagg, but with native KV cache offloading instead (as of vLLM 0.16.0)

@markmc
Copy link
Copy Markdown
Member

markmc commented Apr 8, 2026

The issue has been fixed on main since #37160 introduced this band-aid:

        self.local_cache_hit += max(
            0, (num_cached_tokens + recomputed - num_external_computed_tokens)
        )

#37460 is the current candidate for a long-term fix

@markmc markmc closed this Apr 8, 2026
@markmc markmc moved this from In Review to Not planned in Metrics & Tracing Apr 8, 2026
@markmc markmc moved this from Not planned to Done in Metrics & Tracing Apr 8, 2026



7 participants