Bugfix: zero-initialized metrics for failed/aborted/skipped requests throwing off histograms #32726
wseaton wants to merge 5 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the issue of zero-initialized metrics skewing histogram data for requests that did not fully complete. By introducing a type state pattern with ScheduledTiming and CompletedTiming dataclasses, the system now accurately reflects the progress of each request and only records relevant timing metrics. The changes in loggers.py correctly adapt to this new structure, ensuring that histograms are populated with meaningful data. The added test cases thoroughly validate the new timing logic, significantly improving the correctness and reliability of the metrics system.
class ScheduledTiming:
    """Timing for a request that was scheduled but didn't generate tokens.

    This occurs when a request is aborted during prefill or fails due to
    errors like KV load failures.
    """

    queued_time: float
The ScheduledTiming dataclass name, while technically accurate, could be made more explicit to immediately convey its purpose in the context of requests that did not complete token generation. Given that this PR addresses issues with zero-initialized metrics for failed/aborted/skipped requests, a name like ScheduledButIncompleteTiming or AbortedOrFailedScheduledTiming might more clearly communicate that this timing object represents a request that progressed to scheduling but did not reach full completion. This improves maintainability by making the intent clearer at a glance, reducing potential for misinterpretation of critical metrics.
vllm/v1/metrics/stats.py
elif was_queued and was_scheduled:
    # scheduled but no tokens (abort during prefill, KV error, etc.)
    queued_time = req_stats.scheduled_ts - req_stats.queued_ts
    timing = ScheduledTiming(queued_time=queued_time)
The question here is: does recording it this way throw off any other reporting, or muddy the waters semantically? Is it worth the extra cardinality of adding a label for finish reason?
Cursor Bugbot has reviewed your changes and found 1 potential issue.
vllm/v1/metrics/stats.py
    decode_time=decode_time,
    inference_time=inference_time,
    mean_time_per_output_token=mean_tpot,
)
Missing prerequisite checks in CompletedTiming branch
Medium Severity
The CompletedTiming branch only checks got_first_token and got_last_token before calculating queued_time = scheduled_ts - queued_ts, but doesn't verify that was_queued and was_scheduled are true. In contrast, the ScheduledTiming branch correctly checks both was_queued and was_scheduled. If queued_ts is 0 (not set) but tokens were generated, queued_time would incorrectly equal scheduled_ts (a potentially large monotonic timestamp), leading to wrong values in the histogram_queue_time_request histogram.
cc @NickLucche

Thanks, @wseaton, very nice.
I think this might also come into play and add incorrect stats for first_token_latency.
I believe is_prefilling can be True, for example, for requests which failed to load the KV cache. Consider changing the above if statement to
if is_prefilling and num_new_generation_tokens > 0:
#...and adding a comment above it that explains why we test both
pushed a fix and regression tests for this!
You can see in OutputProcessor.process_outputs() that we assume the first EngineCoreOutput we receive for a request means that prefilling has completed
for engine_core_output in engine_core_outputs:
req_id = engine_core_output.request_id
req_state = self.request_states.get(req_id)
if req_state is None:
# Ignore output for already-aborted request.
continue
# 1) Compute stats for this iteration.
self._update_stats_from_output(
req_state, engine_core_output, engine_core_timestamp, iteration_stats
)
new_token_ids = engine_core_output.new_token_ids
...
req_state.is_prefilling = False
I guess since #26813 that is obviously not a safe assumption anymore, but maybe also #9880?
In any case, I think we should start by addressing this assumption in OutputProcessor. And that feels like a separate PR
…oject#32726 review Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
actually generated Signed-off-by: Will Eaton <weaton@redhat.com>
# build timing based on how far the request progressed
timing: RequestTiming = None
was_queued = req_stats.queued_ts > 0
was_scheduled = req_stats.scheduled_ts > 0
Honestly, I'm pretty sceptical of all of this new logic to determine "how far the request progressed"
The fix, in the end, is nothing more than:
if req_stats.queued_ts > 0 and req_stats.scheduled_ts > 0:
queued_time = req_stats.scheduled_ts - req_stats.queued_ts
...
if finished_request.queued_time > 0:
self.histogram_queue_time_request[engine_idx].observe(finished_request.queued_time)
AFAICT, we don't actually use this "how far the request progressed" information ...
- CompletedTiming: generated at least one token (normal completion)
"""

finish_reason: "FinishReason"
I particularly don't like the way this data structure has this type-like field (finish_reason)
class FinishReason(enum.IntEnum):
STOP = 0
LENGTH = 1
ABORT = 2
ERROR = 3
which you might expect would determine which fields below are valid, but then we have this other timing = None|scheduled|completed thing which also indicates why/when the request was finished
We could make it all more coherent, but it doesn't seem like a big deal to just say some of the fields may not be valid, and we'll only observe the fields that are valid.
Some docs here explaining that some of the fields may not be valid would be nice though
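Such a doc note might look like the sketch below. The class name and field set are borrowed from the PR's intent but simplified for illustration (finish_reason is typed as a plain string here rather than the real FinishReason enum):

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    """Stats reported for a finished request.

    Not every field is valid for every request: timing fields such as
    prefill_time and decode_time are only meaningful when the request
    generated at least one token. Consumers should only observe the
    fields that apply to how far the request progressed.
    """

    finish_reason: str
    queued_time: float = 0.0   # valid once the request was scheduled
    prefill_time: float = 0.0  # valid only if a first token was produced
    decode_time: float = 0.0   # valid only if a last token was produced
```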
self.num_generation_tokens += num_new_generation_tokens
if is_prefilling:
    self.prompt_token_stats.update_from_output(
This also assumes the prefill has completed ... ?
    lora_states: "LoRARequestStates",
    lora_name: str | None,
):
    num_new_generation_tokens = len(output.new_token_ids)
I'd restructure this method so that we process events and maybe num_nans_in_logits, then just return if num_new_generation_tokens == 0
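The restructuring being suggested might look something like this. It is a sketch as a free function over a plain dict rather than the actual method on the stats object, so the names and signature are illustrative only:

```python
def update_stats(stats: dict, events: list[str], new_token_ids: list[int],
                 num_nans_in_logits: int) -> dict:
    """Handle per-iteration bookkeeping first, then return early when
    no tokens were generated, so token-based stats are never touched."""
    # Events and NaN counters apply even to iterations with no tokens.
    stats.setdefault("events", []).extend(events)
    stats["num_nans_in_logits"] = (
        stats.get("num_nans_in_logits", 0) + num_nans_in_logits
    )

    num_new_generation_tokens = len(new_token_ids)
    if num_new_generation_tokens == 0:
        return stats  # nothing generated: skip all token-based stats

    stats["num_generation_tokens"] = (
        stats.get("num_generation_tokens", 0) + num_new_generation_tokens
    )
    return stats
```

The early return keeps the token-based counters from ever being updated on iterations that produced no output, which is the behavior the reviewer is asking for.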
This fixes an issue reported by @kfirwolfson where the new finish reasons (and old ones like ABORT) cause incorrect stats to be recorded in histogram counters due to zero-initialized values. Now we use a type-state pattern to conditionally update them based on how far the request actually progressed through the pipeline.