Bugfix: zero-initialized metrics for failed/aborted/skipped requests throwing off histograms #32726
wseaton wants to merge 5 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses the issue of zero-initialized metrics skewing histogram data for requests that did not fully complete. By introducing a type state pattern with ScheduledTiming and CompletedTiming dataclasses, the system now accurately reflects the progress of each request and only records relevant timing metrics. The changes in loggers.py correctly adapt to this new structure, ensuring that histograms are populated with meaningful data. The added test cases thoroughly validate the new timing logic, significantly improving the correctness and reliability of the metrics system.
class ScheduledTiming:
    """Timing for a request that was scheduled but didn't generate tokens.

    This occurs when a request is aborted during prefill or fails due to
    errors like KV load failures.
    """

    queued_time: float
The ScheduledTiming dataclass name, while technically accurate, could be made more explicit to immediately convey its purpose in the context of requests that did not complete token generation. Given that this PR addresses issues with zero-initialized metrics for failed/aborted/skipped requests, a name like ScheduledButIncompleteTiming or AbortedOrFailedScheduledTiming might more clearly communicate that this timing object represents a request that progressed to scheduling but did not reach full completion. This improves maintainability by making the intent clearer at a glance, reducing potential for misinterpretation of critical metrics.
vllm/v1/metrics/stats.py
elif was_queued and was_scheduled:
    # scheduled but no tokens (abort during prefill, KV error, etc.)
    queued_time = req_stats.scheduled_ts - req_stats.queued_ts
    timing = ScheduledTiming(queued_time=queued_time)
The question here is: does recording it this way throw off any other reporting, or muddy the waters semantically? Is it worth the extra cardinality of adding a label for finish reason?
Cursor Bugbot has reviewed your changes and found 1 potential issue.
vllm/v1/metrics/stats.py
    decode_time=decode_time,
    inference_time=inference_time,
    mean_time_per_output_token=mean_tpot,
)
Missing prerequisite checks in CompletedTiming branch
Medium Severity
The CompletedTiming branch only checks got_first_token and got_last_token before calculating queued_time = scheduled_ts - queued_ts, but doesn't verify that was_queued and was_scheduled are true. In contrast, the ScheduledTiming branch correctly checks both was_queued and was_scheduled. If queued_ts is 0 (not set) but tokens were generated, queued_time would incorrectly equal scheduled_ts (a potentially large monotonic timestamp), leading to wrong values in the histogram_queue_time_request histogram.
cc @NickLucche

Thanks, @wseaton, very nice.
I think this might also come into play and add incorrect stats for first_token_latency.
I believe is_prefilling can be True, for example, for requests which failed to load the KV cache. Consider changing the above if statement to
if is_prefilling and num_new_generation_tokens > 0:
#...and adding a comment above it that explains why we test both
pushed a fix and regression tests for this!
You can see in OutputProcessor.process_outputs() that we assume the first EngineCoreOutput we receive for a request means that prefilling has completed
for engine_core_output in engine_core_outputs:
req_id = engine_core_output.request_id
req_state = self.request_states.get(req_id)
if req_state is None:
# Ignore output for already-aborted request.
continue
# 1) Compute stats for this iteration.
self._update_stats_from_output(
req_state, engine_core_output, engine_core_timestamp, iteration_stats
)
new_token_ids = engine_core_output.new_token_ids
...
req_state.is_prefilling = False
I guess since #26813 that is obviously not a safe assumption anymore, but maybe also #9880?
In any case, I think we should start by addressing this assumption in OutputProcessor. And that feels like a separate PR
…oject#32726 review Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
actually generated Signed-off-by: Will Eaton <weaton@redhat.com>
# build timing based on how far the request progressed
timing: RequestTiming = None
was_queued = req_stats.queued_ts > 0
was_scheduled = req_stats.scheduled_ts > 0
Honestly, I'm pretty sceptical of all of this new logic to determine "how far the request progressed"
The fix, in the end, is nothing more than:
if req_stats.queued_ts > 0 and req_stats.scheduled_ts > 0:
queued_time = req_stats.scheduled_ts - req_stats.queued_ts
...
if finished_request.queued_time > 0:
self.histogram_queue_time_request[engine_idx].observe(finished_request.queued_time)
AFAICT, we don't actually use this "how far the request progressed" information ...
- CompletedTiming: generated at least one token (normal completion)
"""

finish_reason: "FinishReason"
I particularly don't like the way this data structure has this type-like field (finish_reason)
class FinishReason(enum.IntEnum):
STOP = 0
LENGTH = 1
ABORT = 2
ERROR = 3
which you might expect would determine which fields below are valid, but then we have this other timing = None|scheduled|completed thing which also indicates why/when the request was finished
We could make it all more coherent, but it doesn't seem like a big deal to just say some of the fields may not be valid, and we'll only observe the fields that are valid.
Some docs here explaining that some of the fields may not be valid would be nice though
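Such a doc note might look like the sketch below. The class name and field set are borrowed from the PR's intent but simplified for illustration (finish_reason is typed as a plain string here rather than the real FinishReason enum):

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    """Stats reported for a finished request.

    Not every field is valid for every request: timing fields such as
    prefill_time and decode_time are only meaningful when the request
    generated at least one token. Consumers should only observe the
    fields that apply to how far the request progressed.
    """

    finish_reason: str
    queued_time: float = 0.0   # valid once the request was scheduled
    prefill_time: float = 0.0  # valid only if a first token was produced
    decode_time: float = 0.0   # valid only if a last token was produced
```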
self.num_generation_tokens += num_new_generation_tokens
if is_prefilling:
    self.prompt_token_stats.update_from_output(
This also assumes the prefill has completed ... ?
    lora_states: "LoRARequestStates",
    lora_name: str | None,
):
    num_new_generation_tokens = len(output.new_token_ids)
I'd restructure this method so that we process events and maybe num_nans_in_logits, then just return if num_new_generation_tokens == 0
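The restructuring being suggested might look something like this. It is a sketch as a free function over a plain dict rather than the actual method on the stats object, so the names and signature are illustrative only:

```python
def update_stats(stats: dict, events: list[str], new_token_ids: list[int],
                 num_nans_in_logits: int) -> dict:
    """Handle per-iteration bookkeeping first, then return early when
    no tokens were generated, so token-based stats are never touched."""
    # Events and NaN counters apply even to iterations with no tokens.
    stats.setdefault("events", []).extend(events)
    stats["num_nans_in_logits"] = (
        stats.get("num_nans_in_logits", 0) + num_nans_in_logits
    )

    num_new_generation_tokens = len(new_token_ids)
    if num_new_generation_tokens == 0:
        return stats  # nothing generated: skip all token-based stats

    stats["num_generation_tokens"] = (
        stats.get("num_generation_tokens", 0) + num_new_generation_tokens
    )
    return stats
```

The early return keeps the token-based counters from ever being updated on iterations that produced no output, which is the behavior the reviewer is asking for.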
This fixes an issue reported by @kfirwolfson where the new finish reasons (and old ones like ABORT) cause incorrect stats to be recorded in histogram counters due to zero-initialized values. Now we use a type-state pattern to conditionally update them based on how far the request actually progressed through the pipeline.