[Enhancement] Modify --log-stats #3069
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f47f63b9a8
```python
overall_summary = {
    "e2e_requests": int(self.e2e_count),
    "e2e_wall_time_ms": float(wall_time_ms),
    "request_wall_time_ms": float(wall_time_ms),
```
Derive request_wall_time_ms from accumulated request latency
build_and_log_summary sets request_wall_time_ms to the global run span (wall_time_ms) while input_preprocess_time_ms and engine_pipeline_time_ms are accumulated across finalized requests. In offline/batch runs with multiple overlapping requests, this makes the new timing decomposition inconsistent (request_wall_time_ms can be smaller than the sum of its components) and underestimates avg_request_wall_time_ms, which can mislead latency analysis and experiment comparisons. Compute request_wall_time_ms from per-request totals (or from input_preprocess_total_ms + engine_pipeline_total_ms) instead of run-span wall time.
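A minimal sketch of the suggested fix, assuming hypothetical accumulator names (`input_preprocess_total_ms`, `engine_pipeline_total_ms`); the actual class in the PR may be shaped differently:

```python
class SummaryBuilder:
    """Illustrative accumulator; not the PR's actual metrics class."""

    def __init__(self) -> None:
        self.e2e_count = 0
        self.input_preprocess_total_ms = 0.0
        self.engine_pipeline_total_ms = 0.0

    def add_request(self, preprocess_ms: float, pipeline_ms: float) -> None:
        self.e2e_count += 1
        self.input_preprocess_total_ms += preprocess_ms
        self.engine_pipeline_total_ms += pipeline_ms

    def build_summary(self, wall_time_ms: float) -> dict:
        # Derive request_wall_time_ms from accumulated per-request
        # components rather than the global run span, so the
        # decomposition stays consistent when requests overlap.
        request_wall_time_ms = (
            self.input_preprocess_total_ms + self.engine_pipeline_total_ms
        )
        return {
            "e2e_requests": int(self.e2e_count),
            "e2e_wall_time_ms": float(wall_time_ms),
            "request_wall_time_ms": float(request_wall_time_ms),
            "avg_request_wall_time_ms": (
                request_wall_time_ms / self.e2e_count if self.e2e_count else 0.0
            ),
        }
```

With two overlapping requests of ~1,050 ms each inside a 1,200 ms run span, the summary then reports 2,100 ms of request wall time instead of 1,200 ms, keeping the component sum consistent.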
```python
stage_metrics.handoff_to_stage_id = next_stage_id
stage_metrics.stage_handoff_time_ms = max(0.0, (next_submit_ts - handoff_start_ts) * 1000.0)
if stage_metrics.ar2diffusion_time_ms == 0.0:
```
Set handoff metrics before publishing stage metrics
The new handoff fields are assigned in _forward_to_next_stage after the stage output has already been enqueued for final-output stages, so consumers can observe stage_metrics before stage_handoff_time_ms/ar2diffusion_time_ms are populated. This is observable in pipelines where a stage is both user-visible and forwards to diffusion (e.g., bagel/hunyuan-image configs), producing nondeterministic zeros in --log-stats output depending on thread scheduling. Populate these fields before enqueueing, or enqueue an immutable copy after mutation.
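One way to realize the "enqueue an immutable copy after mutation" option, sketched with a hypothetical minimal `StageMetrics` shape (the PR's real class has more fields):

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class StageMetrics:
    # Hypothetical minimal shape for illustration only.
    stage_id: int
    handoff_to_stage_id: Optional[int] = None
    stage_handoff_time_ms: float = 0.0


def forward_to_next_stage(
    metrics: StageMetrics,
    next_stage_id: int,
    handoff_ms: float,
    queue: List[StageMetrics],
) -> None:
    # Build a fully populated, frozen snapshot *before* publishing,
    # so a concurrent consumer can never observe half-initialized
    # handoff fields.
    snapshot = replace(
        metrics,
        handoff_to_stage_id=next_stage_id,
        stage_handoff_time_ms=max(0.0, handoff_ms),
    )
    queue.append(snapshot)
```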
@hsliuustc0106 PTAL, thx.

Please provide the full test command and test results; try glm-image, qwen3-omni, WAN....
The quoted example summary table:

| Field | Value |
|-----------------------------|--------------|
| e2e_requests | 1 |
| request_wall_time_ms | 41,299.190 |
| input_preprocess_time_ms | 57.000 |
| engine_pipeline_time_ms | 41,299.133 |
| e2e_total_tokens | 5,202 |
| e2e_avg_time_per_request_ms | 41,299.190 |
| avg_request_wall_time_ms | 41,299.190 |
| e2e_avg_tokens_per_s | 125.959 |
| e2e_stage_0_wall_time_ms | 10,192.289 |
| e2e_stage_1_wall_time_ms | 30,541.409 |

Suggested renames: `e2e_requests` → `num_of_requests`, `e2e_total_tokens` → `total_tokens`, `e2e_avg_tokens_per_s` → `avg_tokens_per_s`, `e2e_stage_0_wall_time_ms` → `stage_0_wall_time_ms`, `e2e_stage_1_wall_time_ms` → `stage_1_wall_time_ms`; change all the rest accordingly.
Could this be used for high-concurrency cases? cc @amy-why-3459

@JaredforReal PTAL and have a try
Force-pushed: c1c33dd → 010e472
Added, plz check

@hsliuustc0106 Already modified per the review comments: the output is displayed in order with no redundant information, and a sum check was added to avoid manual calculations.

we can remove [request_id=chatcmpl-b33010b6d2e785cb]
```python
submit_ts = req_state.stage_submit_ts.get(stage_id, now)
stage_gen_time_ms = (now - submit_ts) * 1000.0
output_processor_time_ms = float(req_state.output_processor_time_ms.get(stage_id, 0.0))
stage_wall_time_ms = (now - submit_ts) * 1000.0
```
@bjf-frz I think we may need to standardize the stage info into a config class or dataclass in the following PRs

fix CI please
Force-pushed: 010e472 → 5bb8a84
Could you show the Metrics Summary when 100 requests are successfully completed?
Force-pushed: 6968280 → 7c13dce
Signed-off-by: bjf-frz <frz123db@gmail.com>
Updated in the PR introduction, plz check.

done

fix CI
# Conflicts:
#   tests/entrypoints/test_async_omni_abort.py
#   vllm_omni/engine/orchestrator.py
#   vllm_omni/metrics/stats.py
| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request. |
| `avg_output_processor_time_ms` | Average output processor time per completed request. |
| `avg_stage_handoff_total_time_ms` | Average summed inter-stage handoff time per completed request. |
| `avg_ar2diffusion_time_ms` | Average AR-to-diffusion conversion time per completed request. |
Will metrics still be emitted (e.g., as 0 ms) if they aren't applicable for a model, or will they just not be emitted?
They are omitted from the printed summary when the value is zero or not applicable; for example, with a diffusion model like Wan2.2, this field will not be printed.
```python
"avg_stage_handoff_total_time_ms",
"avg_ar2diffusion_time_ms",
"avg_final_output_time_ms",
"avg_breakdown_delta_time_ms",
```
Can you define these and the keys in overall_summary in a common place or more well-formed data structure? Otherwise it's easy to change the string in one place and accidentally miss it in another
Will do. I'll centralize these keys into a dataclass in a follow-up PR.
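A rough sketch of what that centralization could look like (the key strings below come from this PR; the dataclass and helper names are hypothetical):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AvgSummaryKeys:
    # Single source of truth for the avg_* summary key strings.
    stage_handoff_total: str = "avg_stage_handoff_total_time_ms"
    ar2diffusion: str = "avg_ar2diffusion_time_ms"
    final_output: str = "avg_final_output_time_ms"
    breakdown_delta: str = "avg_breakdown_delta_time_ms"


AVG_KEYS = AvgSummaryKeys()


def all_avg_keys() -> list:
    # Both the summary builder and the printer iterate this one list,
    # so a renamed key cannot silently diverge between them.
    return [getattr(AVG_KEYS, f.name) for f in fields(AVG_KEYS)]
```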
| `avg_input_preprocess_time_ms` | Average pre-submit request preparation time per completed request. |
| `avg_engine_pipeline_time_ms` | Average engine pipeline time per completed request. |
| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request. |
| `avg_output_processor_time_ms` | Average output processor time per completed request. |
Can you clarify that this is currently approximated by dividing across the requests in the batch, as opposed to individually timed per request and then averaged?
Yes, I'll clarify this in the docs. For batch/offline, some average fields are currently computed from aggregate batch totals divided by the number of completed requests.
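The difference between the two averaging schemes, with made-up numbers: when a batch shares work, dividing the batch total by the request count is not the same as averaging individually timed requests:

```python
# Two requests timed individually vs. one shared batch pass
# (illustrative numbers only).
per_request_ms = [10.0, 30.0]   # individually timed
batch_total_ms = 36.0           # one timed pass covering the whole batch

avg_individually = sum(per_request_ms) / len(per_request_ms)
avg_from_batch_total = batch_total_ms / len(per_request_ms)
```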
```python
    float(overall_summary.get("engine_pipeline_time_ms", 0.0)),
),
self._summary_line(
    "Sum check (ms):",
```
Are the sum check lines intentional or from debugging?
They are intentional. The sum check lines are meant to make the timing decomposition auditable in the log output, especially when comparing E2E time against the measured components and spotting missing overhead.
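A sketch of what such a sum-check line could compute (the field names and tolerance below are illustrative, not the PR's exact implementation):

```python
def sum_check_line(e2e_ms, components, tolerance_ms=1.0):
    # Compare end-to-end time against the sum of measured components
    # so unaccounted overhead is visible directly in the log.
    total = sum(components.values())
    delta = e2e_ms - total
    status = "ok" if abs(delta) <= tolerance_ms else "mismatch"
    parts = " + ".join(f"{name}={ms:.3f}" for name, ms in components.items())
    return (
        f"Sum check (ms): {parts} = {total:.3f} "
        f"vs e2e={e2e_ms:.3f} (delta={delta:+.3f}, {status})"
    )
```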
```python
"avg_final_output_time_ms": float(
    self.final_output_total_ms / self.e2e_count if self.e2e_count > 0 else 0.0
),
"avg_breakdown_delta_time_ms": float(breakdown_delta_ms / self.e2e_count if self.e2e_count > 0 else 0.0),
```
It would be nice if this could be cleaned up a bit or simplified. This function is pretty long, and a lot of these ternary conditions are the same
Agreed. I’ll clean this up by factoring the repeated average / optional-field handling into helpers so build_and_log_summary is easier to read and less error-prone.
```python
queue_wait_ms = max(0.0, (service_start_ts - submit_ts) * 1000.0)
service_time_ms = max(0.0, (end_ts - service_start_ts) * 1000.0)
execution_ms = max(0.0, service_time_ms - float(evt.output_processor_time_ms or 0.0))
evt.stage_latency_time_ms = latency_ms
```
Can you add a docstring saying that the metrics are set on the event by this method?
Also, I think it may be better to default these values to None instead of 0, so it's clearer if something tries to access them before this method is called.
Good suggestion. I’ll add a docstring to make it explicit that this method mutates the stage event with derived queue/execution metrics. I’ll also switch the derived timing fields to default to None where appropriate so it is clear whether they have been computed yet, instead of relying on 0.0 for both “not set” and “measured zero”.
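A sketch combining both suggestions: a docstring stating that the method mutates the event, and `None` defaults for the derived fields (field names are simplified from the quoted snippet above):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageEvent:
    # Raw timestamps recorded when the event happens (seconds).
    submit_ts: float
    service_start_ts: float
    end_ts: float
    output_processor_time_ms: float = 0.0
    # Derived fields default to None so "not yet computed" is
    # distinguishable from a measured 0.0.
    queue_wait_ms: Optional[float] = None
    execution_ms: Optional[float] = None


def finalize_timings(evt: StageEvent) -> None:
    """Derive queue-wait and execution metrics and set them on ``evt``.

    Mutates the event in place; ``queue_wait_ms`` and ``execution_ms``
    remain ``None`` until this has been called.
    """
    service_time_ms = max(0.0, (evt.end_ts - evt.service_start_ts) * 1000.0)
    evt.queue_wait_ms = max(0.0, (evt.service_start_ts - evt.submit_ts) * 1000.0)
    evt.execution_ms = max(0.0, service_time_ms - evt.output_processor_time_ms)
```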
Signed-off-by: bjf-frz <frz123db@gmail.com>

# Conflicts:
#   tests/e2e/online_serving/test_qwen3_omni.py
#   vllm_omni/engine/orchestrator.py
#   vllm_omni/engine/stage_init_utils.py
hsliuustc0106
left a comment
Review: [Enhancement] Modify --log-stats (#3069)
Summary
This PR significantly enhances the --log-stats output with richer timing breakdowns (queue wait vs execution, handoff, final output wrapping, AR-to-diffusion conversion), stage metadata (name/type), a new concise [OmniTiming] per-request log line, and a rewritten summary output with component sum checking. 13 files, ~1213 additions, ~258 deletions.
Gate Issue
mergeStateStatus: BLOCKED despite all visible checks passing (build, pre-commit, DCO all SUCCESS). There may be a required check that hasn't run yet — please investigate.
PR Size
13 files, >1000 LOC changed. Could you run the L3 tests locally and paste the results?
PR Description Issues
1. Description is sparse
The description says "This PR aims to modify --log-stats" but doesn't explain what was changed or why. The checklist items at the bottom are all unchecked. Please fill in:
- Summary of what changed
- Why the output format was redesigned (e.g., better debugging, new breakdowns)
- Test plan (commands run)
2. "Before vs after" test results
The test results show only the new output. A before/after comparison would help reviewers understand what changed and verify correctness.
Good Parts
- `_estimate_stage_wait_and_execution_times()` is a solid addition — breaking stage latency into queue wait + execution provides actionable insight for performance debugging.
- The `[OmniTiming]` log line is concise and useful (especially `ar2diffusion=...` for multi-stage pipelines).
- Stage metadata (name, type) flowing through to logs makes output much more readable.
- Component sum checking (`request_wall_time_ms = input_preprocess + engine_pipeline + final_output`) adds integrity verification.
- Documentation (`docs/contributing/metrics.md`) is comprehensively updated with new field names and example output.
- Removal of the `_format_table` code paths is a net simplification — the old tables were hard to read compared to the new formatted sections.
Concerns
3. Breaking change for programmatic consumers of build_and_log_summary()
Field renames are extensive:
| Old key | New key |
|---|---|
| `e2e_requests` | `num_of_requests` |
| `e2e_wall_time_ms` | `request_wall_time_ms` |
| `e2e_total_tokens` | `total_tokens` |
| `e2e_avg_time_per_request_ms` | `avg_request_wall_time_ms` |
| `e2e_avg_tokens_per_s` | `avg_tokens_per_s` |
| `e2e_stage_{i}_wall_time_ms` | `stage_{i}_wall_time_ms` |
| `e2e_total_ms` (per-request) | `request_wall_time_ms` |
| `e2e_total_tokens` (per-request) | `total_tokens` |
Anyone reading these programmatically from logs or the returned dict will break. The build_and_log_summary() return dict structure has also changed substantially. Please call this out in the PR description.
4. _estimate_stage_wait_and_execution_times() assumes single-server queue
The method sorts by stage_end_ts and computes queue wait as max(0, submit - prev_finish). This assumes requests to a given stage are processed sequentially (single-server queue). If a stage has multiple workers or processes requests concurrently, this model will overestimate queue wait. Consider documenting this assumption.
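The single-server model can be sketched roughly as below (illustrative shape only; the actual method's field names and formula may differ). Each event is assumed to wait until the previous event on the same stage has finished:

```python
def estimate_stage_waits(events):
    # events: list of {"submit_ts": float, "stage_end_ts": float} in seconds.
    # Single-server assumption: service for each event cannot start
    # before the previous event on this stage finished, so anything
    # between submit and the previous finish is counted as queue wait.
    waits = []
    prev_finish = float("-inf")
    for evt in sorted(events, key=lambda e: e["stage_end_ts"]):
        waits.append(max(0.0, prev_finish - evt["submit_ts"]))
        prev_finish = evt["stage_end_ts"]
    return waits
```

With two workers actually running in parallel, the second request would start immediately, yet this model still charges it the full overlap as queue wait, which is exactly the overestimation described above.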
5. Double logging in _log_summary_and_cleanup
The new code in omni_base.py:
```python
summary = req_state.metrics.build_and_log_summary()
if summary:
    logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))
```

`build_and_log_summary()` itself calls `logger.info(...)` with the formatted output. So every request with `--log-stats` will emit both:

- An info-level multiline formatted summary (from inside `build_and_log_summary`)
- A debug-level pformat'd dict (from `_log_summary_and_cleanup`)
Is the pformat debug logging intended, or is it leftover from development? If the info-level formatted string is sufficient, the debug line should be removed.
Non-blocking
6. num_of_requests → num_requests
The field name "num_of_requests" is slightly awkward English. Consider num_requests or request_count instead. Non-blocking.
7. avg_breakdown_delta_time_ms
This is useful for debugging but could confuse users. Consider including it only when non-zero, similar to how avg_fields are already filtered for single-request batches.
Verdict
The PR is valuable — the richer timing breakdown is a real improvement for debugging multi-stage pipelines, and the new output format is more readable. The main issues are the sparse PR description (no checklist items checked, no before/after comparison), the breaking field renames (call them out), and the potential double-logging in _log_summary_and_cleanup. Please also investigate the BLOCKED merge state.
Signed-off-by: bjf-frz <frz123db@gmail.com>

# Conflicts:
#   vllm_omni/entrypoints/omni_base.py
Signed-off-by: bjf-frz <frz123db@gmail.com>
I think we need to split the output into different levels
```python
    submit_ts=submit_ts,
    replica_id=replica_id,
)
stage_end_ts = _time.time()
```
Timing robustness
Could we make the timing measurements robust against wall-clock changes and distributed clock skew?
Several new durations are computed from time.time() deltas, including dispatch wait, stage submit/end latency, handoff time, and request finalization. These measurements are mostly captured on the orchestrator side, so they avoid many cross-machine clock comparisons, but they are still sensitive to NTP/wall-clock jumps. A clock adjustment can clamp values to zero or inflate latency unexpectedly.
Suggestion:
- Use `time.perf_counter()` or `time.monotonic()` for same-process durations.
- For distributed paths, pass measured local durations instead of subtracting absolute timestamps from different machines.
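A sketch of the monotonic-clock suggestion for same-process durations (the helper name is hypothetical):

```python
import time


def timed_ms(fn, *args, **kwargs):
    # perf_counter() is monotonic: NTP or wall-clock adjustments
    # cannot clamp the measured duration to zero or inflate it,
    # unlike deltas of time.time().
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```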
```python
return
summary = req_state.metrics.build_and_log_summary()
if summary:
    logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))
```
Completion-path stats overhead
Could we reduce the amount of stats work done around request completion?
--log-stats-request-breakdown-limit limits the printed request breakdowns, but the summary still materializes the full stage_table, trans_table, and e2e_table for all requests. In addition, pformat(summary, sort_dicts=False) is evaluated before logger.debug, so the formatting cost is paid even when debug logging is disabled.
For large offline batches this can become noticeable in the completion path.
Suggested follow-ups:
- Guard the `pformat` call with `logger.isEnabledFor(logging.DEBUG)`.
- Consider separating the lightweight logged summary from full detailed table generation/export.
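The guard from the first follow-up, sketched (the logger name is hypothetical). The `pformat` argument is evaluated eagerly when passed to `logger.debug`, so the explicit level check is what actually avoids the cost:

```python
import logging
from pprint import pformat

logger = logging.getLogger("vllm_omni.summary")


def log_summary_debug(summary):
    # Only pay the pformat() cost when DEBUG output will actually be
    # emitted; lazy %-formatting alone does not help here, because
    # pformat(summary) is computed before debug() is even called.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))
```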
```python
    final_stage_id=final_stage_id_for_e2e,
)
submit_ts = time.time()
input_preprocess_time_ms[request_id] = (submit_ts - request_prep_start_ts) * 1000.0
```
Streaming timing undercount
I think the streaming path may undercount part of the request timing.
For streaming input, input_preprocess_time_ms is recorded immediately after creating the background input-stream task. At that point the first chunk may not have been consumed or submitted to the engine yet, so the time spent pulling/preparing the initial streamed input can be missed from the request breakdown.
Could we record the preprocess/dispatch timing from the point where the first streaming chunk is actually submitted, or propagate that timing from _add_streaming_input_request back into the metrics? That would make streaming and non-streaming request breakdowns comparable.
```python
handoff_edge_ar2diffusion[edge] += ar2d_ms
breakdown_delta_ms = sum(self._request_final_orchestration_time_ms(evt) for evt in self.e2e_events)

overall_summary = {
```
Prometheus exposure
Could we expose these new breakdown metrics through the Prometheus metrics path as well?
The PR builds overall_summary, stage_table, trans_table, and e2e_table, but they appear to be returned/logged only. I could not find Prometheus / StatLogger wiring for the new fields, so operators scraping /metrics would not see the request breakdowns added here.
Suggested shape:
- Export stable aggregate fields as Prometheus histograms/counters/gauges.
- Use bounded labels such as `stage_id`, `stage_type`, and `edge`.
- Avoid labels like `request_id`, since that would create unbounded cardinality.
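A stdlib-only sketch of the suggested aggregation shape: bucketed histograms keyed by bounded labels. A real deployment would register equivalent metrics with `prometheus_client` instead; the bucket bounds here are made up:

```python
import bisect
from collections import defaultdict

# Hypothetical upper bounds (ms) for the latency buckets.
BUCKETS = [10, 50, 100, 500, 1000, 5000, 30000]


class StageHistogram:
    def __init__(self):
        # (stage_id, stage_type) -> per-bucket counts, plus a +Inf slot.
        # Labels stay bounded; request_id is deliberately never a key.
        self.counts = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

    def observe(self, stage_id, stage_type, wall_time_ms):
        # First bucket whose upper bound is >= the observed value,
        # matching Prometheus's "value <= le" bucket semantics.
        idx = bisect.bisect_left(BUCKETS, wall_time_ms)
        self.counts[(stage_id, stage_type)][idx] += 1
```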


Purpose
This PR aims to modify --log-stats.
Test Plan
Test Result
If we draw a timeline, it looks like this:
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.