
Bugfix: zero-initialized metrics for failed/aborted/skipped requests throwing off histograms #32726

Open
wseaton wants to merge 5 commits into vllm-project:main from wseaton:optional-metric-fields

Conversation


@wseaton wseaton commented Jan 20, 2026

This fixes an issue reported by @kfirwolfson where the new finish reasons (and old ones like ABORT) cause incorrect stats to be recorded in histogram counters due to zero-initialized values.

Now we use a type-state pattern to update them conditionally, based on how far the request actually progressed through the pipeline.

@wseaton wseaton requested a review from markmc as a code owner January 20, 2026 21:49
@mergify mergify bot added v1 bug Something isn't working labels Jan 20, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of zero-initialized metrics skewing histogram data for requests that did not fully complete. By introducing a type state pattern with ScheduledTiming and CompletedTiming dataclasses, the system now accurately reflects the progress of each request and only records relevant timing metrics. The changes in loggers.py correctly adapt to this new structure, ensuring that histograms are populated with meaningful data. The added test cases thoroughly validate the new timing logic, significantly improving the correctness and reliability of the metrics system.

Comment on lines +216 to +223

```python
class ScheduledTiming:
    """Timing for a request that was scheduled but didn't generate tokens.

    This occurs when a request is aborted during prefill or fails due to
    errors like KV load failures.
    """

    queued_time: float
```

High Severity

The ScheduledTiming dataclass name, while technically accurate, could be made more explicit to immediately convey its purpose in the context of requests that did not complete token generation. Given that this PR addresses issues with zero-initialized metrics for failed/aborted/skipped requests, a name like ScheduledButIncompleteTiming or AbortedOrFailedScheduledTiming might more clearly communicate that this timing object represents a request that progressed to scheduling but did not reach full completion. This improves maintainability by making the intent clearer at a glance, reducing potential for misinterpretation of critical metrics.

```python
elif was_queued and was_scheduled:
    # scheduled but no tokens (abort during prefill, KV error, etc.)
    queued_time = req_stats.scheduled_ts - req_stats.queued_ts
    timing = ScheduledTiming(queued_time=queued_time)
```
@wseaton (author):

The question here is: does recording this way throw off any other reporting, or muddy the waters semantically? Is it worth the extra cardinality of adding a label for finish reason?
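To make the cardinality question concrete, here is a back-of-the-envelope sketch. The bucket count and reason set are assumptions for illustration, not vLLM's actual configuration:

```python
# Illustrative cardinality math for adding a finish_reason label; the
# bucket count and reason list are assumptions, not vLLM's real config.
FINISH_REASONS = ["stop", "length", "abort", "error"]
HISTOGRAM_BUCKETS = 20  # assumed number of buckets per histogram

def series_count(num_engines: int, with_reason_label: bool) -> int:
    # A Prometheus histogram exposes one series per bucket plus _sum and _count.
    per_combo = HISTOGRAM_BUCKETS + 2
    combos = num_engines * (len(FINISH_REASONS) if with_reason_label else 1)
    return combos * per_combo
```

With 4 engines this works out to 88 series per histogram without the label versus 352 with it, which is the trade-off the comment is weighing.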


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


```python
    decode_time=decode_time,
    inference_time=inference_time,
    mean_time_per_output_token=mean_tpot,
)
```

Missing prerequisite checks in CompletedTiming branch

Medium Severity

The CompletedTiming branch only checks got_first_token and got_last_token before calculating queued_time = scheduled_ts - queued_ts, but doesn't verify that was_queued and was_scheduled are true. In contrast, the ScheduledTiming branch correctly checks both was_queued and was_scheduled. If queued_ts is 0 (not set) but tokens were generated, queued_time would incorrectly equal scheduled_ts (a potentially large monotonic timestamp), leading to wrong values in the histogram_queue_time_request histogram.


@wseaton

wseaton commented Jan 21, 2026

cc @NickLucche

@kfirwolfson

Thanks, @wseaton, very nice.

@markmc markmc moved this from Backlog to P1 in Metrics & Tracing Jan 21, 2026

@kfirwolfson kfirwolfson Jan 28, 2026


I think this might also come into play and add incorrect first_token_latency stats. I believe is_prefilling can be True, for example, for requests which failed to load the KV cache. Consider changing the above if statement to:

```python
if is_prefilling and num_new_generation_tokens > 0:
    # ...
```

and adding a comment above it explaining why we test both.

@wseaton (author):

pushed a fix and regression tests for this!

Member:

You can see in OutputProcessor.process_outputs() that we assume the first EngineCoreOutput we receive for a request means that prefilling has completed

```python
for engine_core_output in engine_core_outputs:
    req_id = engine_core_output.request_id
    req_state = self.request_states.get(req_id)
    if req_state is None:
        # Ignore output for already-aborted request.
        continue

    # 1) Compute stats for this iteration.
    self._update_stats_from_output(
        req_state, engine_core_output, engine_core_timestamp, iteration_stats
    )

    new_token_ids = engine_core_output.new_token_ids
    ...
    req_state.is_prefilling = False
```

I guess since #26813 that is obviously not a safe assumption anymore, but maybe also #9880?

In any case, I think we should start by addressing this assumption in OutputProcessor. And that feels like a separate PR

kfirwolfson pushed a commit to kfirwolfson/vllm that referenced this pull request Jan 28, 2026
…oject#32726 review

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
@wseaton wseaton force-pushed the optional-metric-fields branch from 9b69a5c to c739503 Compare February 4, 2026 13:31

@markmc markmc left a comment


Nice finds!

```python
# build timing based on how far the request progressed
timing: RequestTiming = None
was_queued = req_stats.queued_ts > 0
was_scheduled = req_stats.scheduled_ts > 0
```
@markmc (Member):

Honestly, I'm pretty sceptical of all of this new logic to determine "how far the request progressed"

The fix, in the end, is nothing more than:

```python
if req_stats.queued_ts > 0 and req_stats.scheduled_ts > 0:
    queued_time = req_stats.scheduled_ts - req_stats.queued_ts
...
if finished_request.queued_time > 0:
    self.histogram_queue_time_request[engine_idx].observe(
        finished_request.queued_time
    )
```

AFAICT, we don't actually use this "how far the request progressed" information ...

```python
    - CompletedTiming: generated at least one token (normal completion)
    """

    finish_reason: "FinishReason"
```
@markmc (Member):

I particularly don't like the way this data structure has this type-like field (finish_reason):

```python
class FinishReason(enum.IntEnum):
    STOP = 0
    LENGTH = 1
    ABORT = 2
    ERROR = 3
```

which you might expect would determine which fields below are valid, but then we have this other timing = None|scheduled|completed thing which also indicates why/when the request was finished.

We could make it all more coherent, but it doesn't seem like a big deal to just say some of the fields may not be valid, and we'll only observe the fields that are valid.

Some docs here explaining that some of the fields may not be valid would be nice, though.
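A sketch of the flatter shape this comment describes: one struct with documented sentinel values, where observers skip unset fields. The field names are illustrative, not the exact FinishedRequestStats layout:

```python
from dataclasses import dataclass

# Sketch of the flat-struct alternative: a single stats type whose docstring
# spells out the sentinel semantics. Field names are illustrative only.

@dataclass
class FinishedRequestStats:
    """Stats for a finished request.

    Timing fields default to 0.0, meaning "not measured": a request aborted
    before its first token has no prefill_time or decode_time. Observers
    must only record fields that are > 0.
    """
    queued_time: float = 0.0
    prefill_time: float = 0.0
    decode_time: float = 0.0

def observe_valid(stats: FinishedRequestStats) -> dict[str, float]:
    # Only observe the fields that were actually measured.
    return {name: value for name, value in vars(stats).items() if value > 0}
```

This keeps one data structure and pushes the "how far did it get" distinction into the observers, at the cost of sentinel conventions that the docstring has to carry.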


self.num_generation_tokens += num_new_generation_tokens
if is_prefilling:
self.prompt_token_stats.update_from_output(
@markmc (Member):

This also assumes the prefill has completed ... ?

```python
    lora_states: "LoRARequestStates",
    lora_name: str | None,
):
    num_new_generation_tokens = len(output.new_token_ids)
```
@markmc (Member):

I'd restructure this method so that we process events and maybe num_nans_in_logits, then just return if num_new_generation_tokens == 0
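A minimal sketch of that restructuring; the signature and the dict-based stats container are stand-ins for the PR's actual method, not vLLM code:

```python
def update_from_output(stats: dict, events: list[str], new_token_ids: list[int]) -> None:
    # 1) Work that applies regardless of generation: events, nan counts, etc.
    stats.setdefault("events", []).extend(events)

    # 2) Early return: nothing below is meaningful without generated tokens.
    if not new_token_ids:
        return

    # 3) Token-dependent stats only run for requests that actually generated.
    stats["num_generation_tokens"] = (
        stats.get("num_generation_tokens", 0) + len(new_token_ids)
    )
```

The early return keeps every token-dependent statistic in one unindented block, so a zero-token request can never feed zeros into them.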



Labels

bug Something isn't working v1


3 participants