
[Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"#36812

Closed
markmc wants to merge 1 commit into vllm-project:main from markmc:prompt-token-stats-negative-inc

Conversation

@markmc
Member

markmc commented Mar 11, 2026

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a warning once with diagnostic details and discard the suspect cache metrics.
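A defensive check along these lines might look like the following. This is a hypothetical sketch, not the actual patch; the function name and the once-only warning flag are made up for illustration:

```python
import logging

logger = logging.getLogger(__name__)
_invariant_warned = False  # warn only on the first violation (illustrative)


def cache_metrics_look_sane(
    prompt_len: int,
    num_cached_tokens: int,
    num_external_computed_tokens: int,
    recomputed: int,
) -> bool:
    """Return True if token accounting satisfies the invariant:

    prompt_len >= num_cached_tokens >= 0, and
    0 <= num_external_computed_tokens <= num_cached_tokens + recomputed,

    where `recomputed` accounts for the scheduler re-running the last
    token when the whole prompt was cached.
    """
    global _invariant_warned
    ok = (
        0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
        and 0 <= num_cached_tokens <= prompt_len
    )
    if not ok and not _invariant_warned:
        _invariant_warned = True
        logger.warning(
            "Discarding suspect cache metrics: prompt_len=%d "
            "num_cached_tokens=%d num_external_computed_tokens=%d recomputed=%d",
            prompt_len,
            num_cached_tokens,
            num_external_computed_tokens,
            recomputed,
        )
    return ok
```

A caller would skip the Prometheus counter updates when this returns False, rather than letting the metrics loop crash with the ValueError above.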

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.

@markmc
Member Author

markmc commented Mar 11, 2026

/cc @ZhanqiuHu

@mergify (bot) added the v1 label Mar 11, 2026
@markmc
Member Author

markmc commented Mar 11, 2026

To be clear, I don't like this; I see it as purely a temporary "stop the bleeding" band-aid that is also backportable to v0.16.0, where this first showed up.

We still need to solidify the num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting ... hopefully in a way that is easier to maintain.

@markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 11, 2026
@markmc moved this from Backlog to Ready in Metrics & Tracing Mar 11, 2026
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a defensive check to prevent crashes from metrics accounting bugs, specifically when Prometheus counters are incremented by negative values. The changes involve adding an invariant check for token counts and logging a warning when the invariant is violated. I've found a critical issue with the invariant check itself, as it doesn't fully prevent negative values in all cases. I've also identified an issue with the diagnostic logging that would make debugging harder. My review includes suggestions to address both of these points.


Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc force-pushed the prompt-token-stats-negative-inc branch from 6b2de38 to 04a4886 on March 11, 2026 at 19:28
@markmc
Member Author

markmc commented Mar 12, 2026

From Slack, @orozery's view is that #36859 is a correct, simpler fix and there's no need for a temporary, defensive check like this.

@markmc
Member Author

markmc commented Apr 8, 2026

The issue has been fixed on main since #37160 introduced this band-aid:

        self.local_cache_hit += max(
            0, (num_cached_tokens + recomputed - num_external_computed_tokens)
        )

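Isolating that clamp as a standalone helper (variable names follow the snippet above; the surrounding class is omitted and the function name is made up):

```python
def cache_hit_delta(num_cached_tokens: int,
                    recomputed: int,
                    num_external_computed_tokens: int) -> int:
    # Clamp at zero so a transient accounting glitch can never yield a
    # negative Prometheus counter increment.
    return max(0, num_cached_tokens + recomputed - num_external_computed_tokens)
```

When the external computed-token count transiently exceeds the cached-token count, the delta is reported as zero instead of going negative.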
#37460 is the current candidate for a long-term fix

@markmc closed this Apr 8, 2026
@markmc moved this from Ready to Not planned in Metrics & Tracing Apr 8, 2026
@ZhanqiuHu
Contributor

Quick question, do we have some metrics to distinguish CPU cache hit vs. GPU cache hit when CPU offloading is on?


Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Not planned

2 participants