
[Enhancement] Modify --log-stats#3069

Open
bjf-frz wants to merge 14 commits into vllm-project:main from bjf-frz:modify-log-stats

Conversation

@bjf-frz
Contributor

@bjf-frz bjf-frz commented Apr 23, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

This PR aims to modify --log-stats.

Test Plan

Test Result

============ Omni Metrics Summary ============
Successful requests:                                     5
Total E2E time (ms):                           137,337.674
Input preprocess time (ms):                      3,128.475
Engine pipeline time (ms):                     134,209.199
Sum check (ms):                                137,337.674

------------ Overall Time Breakdown ------------
Input preprocess time (ms):                      3,128.475
Stage 0 total latency time (ms):               335,510.910
Stage 0 queue wait time (ms):                  208,619.139
Stage 0 execution time (ms):                   126,785.114
Stage 0 output processor time (ms):                106.658
Stage 0 -> Stage 1 handoff time (ms):                7.778
Stage 1 total latency time (ms):                40,248.438
Stage 1 queue wait time (ms):                        0.000
Stage 1 execution time (ms):                    40,248.438
Stage 1 output processor time (ms):                  0.000
Final output time (ms):                              0.064

------------ Average Time Breakdown ------------
Average input preprocess time (ms):              3,012.061
Average Stage 0 latency time (ms):              67,102.182
Average Stage 0 queue wait time (ms):           41,723.828
Average Stage 0 execution time (ms):            25,357.023
Average Stage 0 output processor time (ms):          21.332
Average Stage 0 handoff time (ms):                   1.556
Average Stage 1 latency time (ms):               8,049.688
Average Stage 1 queue wait time (ms):                0.000
Average Stage 1 execution time (ms):             8,049.688
Average Stage 1 output processor time (ms):           0.000
Average final output time (ms):                      0.013

------------ Request 0_ad8714ff-6f80-4394-9218-ac39cddf8846 Breakdown ------------
Input preprocess time (ms):                      3,128.386
Input preprocess sum check (ms):                 3,128.386
Request dispatch wait time (ms):                     1.046

------------ Stage 0 Breakdown ------------
Stage latency time (ms):                        26,487.533
Queue wait time (ms):                                0.000
Execution time (ms):                            26,456.955
Output processor time (ms):                         30.578

Stage id:                                                0
Stage name:                                             ar
Stage type:                                            llm
Final output type:                                        
Batch id:                                                1
Batch size:                                              1

Input tokens:                                           19
Output tokens:                                        1281
Output token throughput (tok/s):                    48.418

------------ Stage 0 -> Stage 1 Handoff ------------
Handoff total time (ms):                             2.038
AR to diffusion time (ms):                           1.439
Other handoff processing time (ms):                  0.599

------------ Stage 1 Breakdown ------------
Stage latency time (ms):                        10,940.283
Queue wait time (ms):                                0.000
Execution time (ms):                            10,940.283
Output processor time (ms):                          0.000

Stage id:                                                1
Stage name:                                      diffusion
Stage type:                                      diffusion
Final output type:                                        
Batch id:                                                1
Batch size:                                              1

------------ Final Output Breakdown ------------
Final output wrapping time (ms):                     0.014
Final output total time (ms):                        0.014
Final output sum check (ms):                         0.014
Remaining orchestration overhead time (ms):           0.341

------------ Request 1_5c53d8c3-2a22-421d-97ad-6b5ff7452e0e Breakdown ------------
Input preprocess time (ms):                      3,028.149
Input preprocess sum check (ms):                 3,028.149
Request dispatch wait time (ms):                 8,933.591

------------ Stage 0 Breakdown ------------
Stage latency time (ms):                        39,628.275
Queue wait time (ms):                           14,526.835
Execution time (ms):                            25,079.286
Output processor time (ms):                         22.154

Stage id:                                                0
Stage name:                                             ar
Stage type:                                            llm
Final output type:                                        
Batch id:                                                2
Batch size:                                              1

Input tokens:                                           19
Output tokens:                                        1281
Output token throughput (tok/s):                    32.343

------------ Stage 0 -> Stage 1 Handoff ------------
Handoff total time (ms):                             1.725
AR to diffusion time (ms):                           1.201
Other handoff processing time (ms):                  0.524

------------ Stage 1 Breakdown ------------
Stage latency time (ms):                         7,327.434
Queue wait time (ms):                                0.000
Execution time (ms):                             7,327.434
Output processor time (ms):                          0.000

Stage id:                                                1
Stage name:                                      diffusion
Stage type:                                      diffusion
Final output type:                                        
Batch id:                                                2
Batch size:                                              1

------------ Final Output Breakdown ------------
Final output wrapping time (ms):                     0.011
Final output total time (ms):                        0.011
Final output sum check (ms):                         0.011
Remaining orchestration overhead time (ms):           0.176

... (breakdowns for requests 2 to 4 omitted) ...

If we draw a timeline, it looks like:

Req0 main   | input preprocess 0.000-3.128 | dispatch 3.128-3.129 | final 40.559
  Req0 stage0 |                         exec 3.129-29.586 | out 29.586-29.617
  Req0 handoff|                                                 29.617-29.619
  Req0 stage1 |                                                   exec 29.619-40.559

  Req1 main   | input preprocess 0.000-3.028 | dispatch wait 3.028-11.962 | final 58.919
  Req1 stage0 |                                      queue 11.962-26.489 | exec 26.489-51.568 | out 51.568-51.590
  Req1 handoff|                                                                                 51.590-51.592
  Req1 stage1 |                                                                                   exec 51.592-58.919

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@bjf-frz bjf-frz requested a review from hsliuustc0106 as a code owner April 23, 2026 12:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f47f63b9a8


overall_summary = {
"e2e_requests": int(self.e2e_count),
"e2e_wall_time_ms": float(wall_time_ms),
"request_wall_time_ms": float(wall_time_ms),


P1 Badge Derive request_wall_time_ms from accumulated request latency

build_and_log_summary sets request_wall_time_ms to the global run span (wall_time_ms) while input_preprocess_time_ms and engine_pipeline_time_ms are accumulated across finalized requests. In offline/batch runs with multiple overlapping requests, this makes the new timing decomposition inconsistent (request_wall_time_ms can be smaller than the sum of its components) and underestimates avg_request_wall_time_ms, which can mislead latency analysis and experiment comparisons. Compute request_wall_time_ms from per-request totals (or from input_preprocess_total_ms + engine_pipeline_total_ms) instead of run-span wall time.
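The consistency fix Codex suggests can be sketched as follows. This is a minimal illustration, not the PR's actual code; the parameter names mirror the accumulators described above, and the real fields live in `vllm_omni/metrics/stats.py`:

```python
# Hedged sketch: derive the request wall time from accumulated per-request
# components rather than the global run span, so the decomposition stays
# internally consistent when requests overlap. Names are illustrative.
def request_wall_time_ms(input_preprocess_total_ms: float,
                         engine_pipeline_total_ms: float) -> float:
    return float(input_preprocess_total_ms) + float(engine_pipeline_total_ms)

def avg_request_wall_time_ms(request_wall_ms: float, num_requests: int) -> float:
    # Average over completed requests; 0.0 when nothing has finished yet.
    return request_wall_ms / num_requests if num_requests > 0 else 0.0
```

With this shape, `request_wall_time_ms` can never be smaller than the sum of its components, since it is defined as that sum.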


Comment thread vllm_omni/engine/orchestrator.py Outdated
Comment on lines +717 to +719
stage_metrics.handoff_to_stage_id = next_stage_id
stage_metrics.stage_handoff_time_ms = max(0.0, (next_submit_ts - handoff_start_ts) * 1000.0)
if stage_metrics.ar2diffusion_time_ms == 0.0:

P2 Badge Set handoff metrics before publishing stage metrics

The new handoff fields are assigned in _forward_to_next_stage after the stage output has already been enqueued for final-output stages, so consumers can observe stage_metrics before stage_handoff_time_ms/ar2diffusion_time_ms are populated. This is observable in pipelines where a stage is both user-visible and forwards to diffusion (e.g., bagel/hunyuan-image configs), producing nondeterministic zeros in --log-stats output depending on thread scheduling. Populate these fields before enqueueing, or enqueue an immutable copy after mutation.
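The "enqueue an immutable copy" option could look like the sketch below, using a stand-in dataclass rather than the real `stage_metrics` type:

```python
import copy
from dataclasses import dataclass

@dataclass
class StageMetricsStub:  # illustrative stand-in for the real stage metrics
    stage_handoff_time_ms: float = 0.0
    ar2diffusion_time_ms: float = 0.0

def forward_with_snapshot(metrics: StageMetricsStub, out_queue: list,
                          handoff_ms: float, ar2d_ms: float) -> None:
    # Assign the handoff fields BEFORE publishing, then enqueue a deep
    # copy so consumers can never observe a half-initialized object.
    metrics.stage_handoff_time_ms = handoff_ms
    metrics.ar2diffusion_time_ms = ar2d_ms
    out_queue.append(copy.deepcopy(metrics))
```

Later mutation of `metrics` then has no effect on what consumers already dequeued, which removes the thread-scheduling race described above.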


@bjf-frz
Contributor Author

bjf-frz commented Apr 23, 2026

@hsliuustc0106 PTAL, thx.

@hsliuustc0106
Collaborator

provide the full test command and test results, try glm-image, qwen3-omni, WAN....

Comment thread docs/contributing/metrics.md Outdated

| Field | Value |
|-----------------------------|--------------|
| e2e_requests | 1 |
Collaborator


Suggested change
| e2e_requests | 1 |
| num_of_requests | 1 |

Comment thread docs/contributing/metrics.md Outdated
| request_wall_time_ms | 41,299.190 |
| input_preprocess_time_ms | 57.000 |
| engine_pipeline_time_ms | 41,299.133 |
| e2e_total_tokens | 5,202 |
Collaborator


Suggested change
| e2e_total_tokens | 5,202 |
| total_tokens | 5,202 |

Comment thread docs/contributing/metrics.md Outdated
| e2e_total_tokens | 5,202 |
| e2e_avg_time_per_request_ms | 41,299.190 |
| avg_request_wall_time_ms | 41,299.190 |
| e2e_avg_tokens_per_s | 125.959 |
Collaborator


Suggested change
| e2e_avg_tokens_per_s | 125.959 |
| avg_tokens_per_s | 125.959 |

Comment thread docs/contributing/metrics.md Outdated
| e2e_avg_time_per_request_ms | 41,299.190 |
| avg_request_wall_time_ms | 41,299.190 |
| e2e_avg_tokens_per_s | 125.959 |
| e2e_stage_0_wall_time_ms | 10,192.289 |
Collaborator


Suggested change
| e2e_stage_0_wall_time_ms | 10,192.289 |
| stage_0_wall_time_ms | 10,192.289 |

Comment thread docs/contributing/metrics.md Outdated
| avg_request_wall_time_ms | 41,299.190 |
| e2e_avg_tokens_per_s | 125.959 |
| e2e_stage_0_wall_time_ms | 10,192.289 |
| e2e_stage_1_wall_time_ms | 30,541.409 |
Collaborator


Suggested change
| e2e_stage_1_wall_time_ms | 30,541.409 |
| stage_1_wall_time_ms | 30,541.409 |

change of all the rest accordingly

@hsliuustc0106
Collaborator

could this be used for high concurrency cases? cc @amy-why-3459

@hsliuustc0106
Collaborator

@JaredforReal PTAL and have a try

@amy-why-3459
Contributor

could this be used for high concurrency cases? cc @amy-why-3459
This is perfect! This Breakdown feature is exactly what I wanted to add. Could you also add time statistics for the output_processor?

@gcanlin gcanlin added the ready label to trigger buildkite CI label Apr 24, 2026
@bjf-frz bjf-frz force-pushed the modify-log-stats branch 3 times, most recently from c1c33dd to 010e472 Compare April 24, 2026 10:15
@bjf-frz
Contributor Author

bjf-frz commented Apr 24, 2026

could this be used for high concurrency cases? cc @amy-why-3459
This is perfect! This Breakdown feature is exactly what I wanted to add. Could you also add time statistics for the output_processor?

Added, please check.

@bjf-frz
Contributor Author

bjf-frz commented Apr 24, 2026

@hsliuustc0106 Updated per the review comments: the output is now displayed in order with no redundant information, and a sum check was added to avoid manual calculations.

@hsliuustc0106
Collaborator

we can remove [request_id=chatcmpl-b33010b6d2e785cb]

Comment thread vllm_omni/engine/orchestrator.py Outdated
submit_ts = req_state.stage_submit_ts.get(stage_id, now)
stage_gen_time_ms = (now - submit_ts) * 1000.0
output_processor_time_ms = float(req_state.output_processor_time_ms.get(stage_id, 0.0))
stage_wall_time_ms = (now - submit_ts) * 1000.0
Collaborator


@bjf-frz I think we may need to standardize the stage info into a config class or dataclass in the following PRs

@hsliuustc0106
Collaborator

fix CI please

@amy-why-3459
Contributor

Could you show the Metrics Summary when 100 requests are successfully completed?

@bjf-frz bjf-frz force-pushed the modify-log-stats branch 5 times, most recently from 6968280 to 7c13dce Compare April 25, 2026 08:14
bjf-frz added 4 commits April 25, 2026 16:54
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
@bjf-frz
Contributor Author

bjf-frz commented Apr 25, 2026

Could you show the Metrics Summary when 100 requests are successfully completed?

Updated in the PR introduction, please check.

@bjf-frz
Contributor Author

bjf-frz commented Apr 25, 2026

fix CI please

done

@hsliuustc0106
Collaborator

fix ci

# Conflicts:
#	tests/entrypoints/test_async_omni_abort.py
#	vllm_omni/engine/orchestrator.py
#	vllm_omni/metrics/stats.py
| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request. |
| `avg_output_processor_time_ms` | Average output processor time per completed request. |
| `avg_stage_handoff_total_time_ms` | Average summed inter-stage handoff time per completed request. |
| `avg_ar2diffusion_time_ms` | Average AR-to-diffusion conversion time per completed request. |
Contributor


Will metrics still be emitted (e.g., as 0 ms) if they aren't applicable for a model, or will they just not be emitted?

Contributor Author


They are omitted from the printed summary when the value is zero or not applicable; for example, for a pure diffusion model like Wan2.2, these fields are not printed.

"avg_stage_handoff_total_time_ms",
"avg_ar2diffusion_time_ms",
"avg_final_output_time_ms",
"avg_breakdown_delta_time_ms",
Contributor


Can you define these and the keys in overall_summary in a common place or more well-formed data structure? Otherwise it's easy to change the string in one place and accidentally miss it in another

Contributor Author


Will do in the following PR to centralize these keys into a dataclass.
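That follow-up could take a shape like the sketch below, assuming a frozen dataclass as the single source of truth. The class and constant names are hypothetical; the key strings are taken from this PR's summary output:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class SummaryKeys:
    # Single source of truth for summary dict keys, so a rename in one
    # place cannot silently diverge from the formatter or docs.
    NUM_OF_REQUESTS: str = "num_of_requests"
    AVG_STAGE_HANDOFF: str = "avg_stage_handoff_total_time_ms"
    AVG_AR2DIFFUSION: str = "avg_ar2diffusion_time_ms"
    AVG_FINAL_OUTPUT: str = "avg_final_output_time_ms"
    AVG_BREAKDOWN_DELTA: str = "avg_breakdown_delta_time_ms"

KEYS = SummaryKeys()
ALL_SUMMARY_KEYS = [getattr(KEYS, f.name) for f in fields(SummaryKeys)]
```

Both the builder and the printer would then reference `KEYS.*` instead of repeating string literals.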

| `avg_input_preprocess_time_ms` | Average pre-submit request preparation time per completed request. |
| `avg_engine_pipeline_time_ms` | Average engine pipeline time per completed request. |
| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request. |
| `avg_output_processor_time_ms` | Average output processor time per completed request. |
Contributor


Can you clarify that this is currently approximated by dividing across the requests in the batch, as opposed to individually timed per request and then averaged?

Contributor Author


Yes, I'll clarify this in the docs. For batch/offline, some average fields are currently computed from aggregate batch totals divided by the number of completed requests.

float(overall_summary.get("engine_pipeline_time_ms", 0.0)),
),
self._summary_line(
"Sum check (ms):",
Contributor


Are the sum check lines intentional or from debugging?

Contributor Author


They are intentional. The sum check lines are meant to make the timing decomposition auditable in the log output, especially when comparing E2E time against the measured components and spotting missing overhead.
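The audit a sum check performs can be sketched with an illustrative helper (not the PR's actual code). Using the summary at the top of this PR, 3,128.475 + 134,209.199 matches the 137,337.674 ms E2E total:

```python
def sum_check_ms(component_times_ms, e2e_time_ms, tol_ms=0.001):
    # Compare the E2E time against the sum of its measured components;
    # a non-trivial residual exposes unaccounted orchestration overhead.
    total = sum(component_times_ms)
    residual = e2e_time_ms - total
    return total, residual, abs(residual) <= tol_ms
```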

Comment thread vllm_omni/metrics/stats.py Outdated
"avg_final_output_time_ms": float(
self.final_output_total_ms / self.e2e_count if self.e2e_count > 0 else 0.0
),
"avg_breakdown_delta_time_ms": float(breakdown_delta_ms / self.e2e_count if self.e2e_count > 0 else 0.0),
Contributor


It would be nice if this could be cleaned up a bit or simplified. This function is pretty long, and a lot of these ternary conditions are the same

Contributor Author


It would be nice if this could be cleaned up a bit or simplified. This function is pretty long, and a lot of these ternary conditions are the same

Agreed. I’ll clean this up by factoring the repeated average / optional-field handling into helpers so build_and_log_summary is easier to read and less error-prone.

queue_wait_ms = max(0.0, (service_start_ts - submit_ts) * 1000.0)
service_time_ms = max(0.0, (end_ts - service_start_ts) * 1000.0)
execution_ms = max(0.0, service_time_ms - float(evt.output_processor_time_ms or 0.0))
evt.stage_latency_time_ms = latency_ms
Contributor


Can you add a docstring saying that the metrics are set on the event by this method?

Also, I think it may be better to default to None for the values instead of 0 for the object to be more clear in case something tries to access these values before this is called

Contributor Author


Can you add a docstring saying that the metrics are set on the event by this method?

Also, I think it may be better to default to None for the values instead of 0 for the object to be more clear in case something tries to access these values before this is called

Good suggestion. I’ll add a docstring to make it explicit that this method mutates the stage event with derived queue/execution metrics. I’ll also switch the derived timing fields to default to None where appropriate so it is clear whether they have been computed yet, instead of relying on 0.0 for both “not set” and “measured zero”.
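A sketch of what that could look like, with a stand-in event class (field names are illustrative, not the real `StageEvent`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageEventStub:
    # None means "not yet computed", distinct from a measured 0.0.
    queue_wait_ms: Optional[float] = None
    execution_ms: Optional[float] = None
    output_processor_time_ms: float = 0.0

def estimate_wait_and_execution(evt: StageEventStub, submit_ts: float,
                                service_start_ts: float, end_ts: float) -> None:
    """Mutates `evt` in place with derived queue-wait/execution metrics."""
    evt.queue_wait_ms = max(0.0, (service_start_ts - submit_ts) * 1000.0)
    service_ms = max(0.0, (end_ts - service_start_ts) * 1000.0)
    evt.execution_ms = max(0.0, service_ms - evt.output_processor_time_ms)
```

Consumers can then check `evt.queue_wait_ms is None` to detect access before derivation rather than misreading an uninitialized 0.0.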

bjf-frz added 2 commits April 27, 2026 15:10
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>
Signed-off-by: bjf-frz <frz123db@gmail.com>

# Conflicts:
#	tests/e2e/online_serving/test_qwen3_omni.py
#	vllm_omni/engine/orchestrator.py
#	vllm_omni/engine/stage_init_utils.py
@bjf-frz bjf-frz force-pushed the modify-log-stats branch from 9fa5668 to 04e1915 Compare May 8, 2026 08:26
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review: [Enhancement] Modify --log-stats (#3069)

Summary

This PR significantly enhances the --log-stats output with richer timing breakdowns (queue wait vs execution, handoff, final output wrapping, AR-to-diffusion conversion), stage metadata (name/type), a new concise [OmniTiming] per-request log line, and a rewritten summary output with component sum checking. 13 files, ~1213 additions, ~258 deletions.

Gate Issue

mergeStateStatus: BLOCKED despite all visible checks passing (build, pre-commit, DCO all SUCCESS). There may be a required check that hasn't run yet — please investigate.

PR Size

13 files, >1000 LOC changed. Could you run the L3 tests locally and paste the results?


PR Description Issues

1. Description is sparse

The description says "This PR aims to modify --log-stats" but doesn't explain what was changed or why. The checklist items at the bottom are all unchecked. Please fill in:

  • Summary of what changed
  • Why the output format was redesigned (e.g., better debugging, new breakdowns)
  • Test plan (commands run)

2. "Before vs after" test results

The test results show only the new output. A before/after comparison would help reviewers understand what changed and verify correctness.


Good Parts

  • _estimate_stage_wait_and_execution_times() is a solid addition — breaking stage latency into queue wait + execution provides actionable insight for performance debugging.
  • [OmniTiming] log line is concise and useful (especially ar2diffusion=... for multi-stage pipelines).
  • Stage metadata (name, type) flowing through to logs makes output much more readable.
  • Component sum checking (request_wall_time_ms = input_preprocess + engine_pipeline + final_output) adds integrity verification.
  • Documentation (docs/contributing/metrics.md) is comprehensively updated with new field names and example output.
  • Removal of _format_table code paths is a net simplification — the old tables were hard to read compared to the new formatted sections.

Concerns

3. Breaking change for programmatic consumers of build_and_log_summary()

Field renames are extensive:

| Old key | New key |
|----------------------------------|----------------------------|
| `e2e_requests` | `num_of_requests` |
| `e2e_wall_time_ms` | `request_wall_time_ms` |
| `e2e_total_tokens` | `total_tokens` |
| `e2e_avg_time_per_request_ms` | `avg_request_wall_time_ms` |
| `e2e_avg_tokens_per_s` | `avg_tokens_per_s` |
| `e2e_stage_{i}_wall_time_ms` | `stage_{i}_wall_time_ms` |
| `e2e_total_ms` (per-request) | `request_wall_time_ms` |
| `e2e_total_tokens` (per-request) | `total_tokens` |

Anyone reading these programmatically from logs or the returned dict will break. The build_and_log_summary() return dict structure has also changed substantially. Please call this out in the PR description.

4. _estimate_stage_wait_and_execution_times() assumes single-server queue

The method sorts by stage_end_ts and computes queue wait as max(0, submit - prev_finish). This assumes requests to a given stage are processed sequentially (single-server queue). If a stage has multiple workers or processes requests concurrently, this model will overestimate queue wait. Consider documenting this assumption.
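The single-server model being described can be sketched as follows. This is an illustration of the assumption, not the PR's exact code: each request's queue wait is taken as the gap between its submission and the previous request on that stage finishing.

```python
def estimate_queue_waits_ms(events):
    # events: iterable of (submit_ts, end_ts) pairs for one stage.
    # Single-server assumption: requests are served one at a time in
    # end-timestamp order, so wait = prev_finish - submit, clamped at 0.
    # With multiple concurrent workers this OVERESTIMATES queue wait.
    waits, prev_finish = [], float("-inf")
    for submit_ts, end_ts in sorted(events, key=lambda e: e[1]):
        waits.append(max(0.0, (prev_finish - submit_ts) * 1000.0))
        prev_finish = end_ts
    return waits
```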

5. Double logging in _log_summary_and_cleanup

The new code in omni_base.py:

summary = req_state.metrics.build_and_log_summary()
if summary:
    logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))

build_and_log_summary() itself calls logger.info(...) with the formatted output. So every request with --log-stats will emit both:

  • An info-level multiline formatted summary (from inside build_and_log_summary)
  • A debug-level pformat'd dict (from _log_summary_and_cleanup)

Is the pformat debug logging intended, or is it leftover from development? If the info-level formatted string is sufficient, the debug line should be removed.


Non-blocking

6. `num_of_requests` → `num_requests`

The field name "num_of_requests" is slightly awkward English. Consider num_requests or request_count instead. Non-blocking.

7. avg_breakdown_delta_time_ms

This is useful for debugging but could confuse users. Consider including it only when non-zero, similar to how avg_fields are already filtered for single-request batches.


Verdict

The PR is valuable — the richer timing breakdown is a real improvement for debugging multi-stage pipelines, and the new output format is more readable. The main issues are the sparse PR description (no checklist items checked, no before/after comparison), the breaking field renames (call them out), and the potential double-logging in _log_summary_and_cleanup. Please also investigate the BLOCKED merge state.

@amy-why-3459
Contributor

[image] Is this the breakdown that is printed for every request?

Signed-off-by: bjf-frz <frz123db@gmail.com>

# Conflicts:
#	vllm_omni/entrypoints/omni_base.py
@bjf-frz
Contributor Author

bjf-frz commented May 12, 2026

[image] Is this the breakdown that is printed for every request?

Yes, the breakdown of each request will be printed.

@amy-why-3459
Contributor

[image] Is this the breakdown that is printed for every request?

Yes, the breakdown of each request will be printed.

This may result in too much log output.

Signed-off-by: bjf-frz <frz123db@gmail.com>
@hsliuustc0106
Collaborator

[image] Is this the breakdown that is printed for every request?

Yes, the breakdown of each request will be printed.

This may result in too much log output.

we should restrict the amount of log output

@hsliuustc0106
Collaborator

I think we need to split the output into different levels

submit_ts=submit_ts,
replica_id=replica_id,
)
stage_end_ts = _time.time()
Contributor

@wuhang2014 wuhang2014 May 12, 2026


Timing robustness

Could we make the timing measurements robust against wall-clock changes and distributed clock skew?

Several new durations are computed from time.time() deltas, including dispatch wait, stage submit/end latency, handoff time, and request finalization. These measurements are mostly captured on the orchestrator side, so they avoid many cross-machine clock comparisons, but they are still sensitive to NTP/wall-clock jumps. A clock adjustment can clamp values to zero or inflate latency unexpectedly.

Suggestion:

  • Use time.perf_counter() or time.monotonic() for same-process durations.
  • For distributed paths, pass measured local durations instead of subtracting absolute timestamps from different machines.
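A minimal sketch of the monotonic-clock suggestion (the helper name is illustrative):

```python
import time

def measure_ms(fn, *args, **kwargs):
    # time.perf_counter() is monotonic and immune to NTP/wall-clock jumps,
    # unlike time.time(); prefer it for same-process duration measurements.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0
```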

return
summary = req_state.metrics.build_and_log_summary()
if summary:
logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))
Contributor

@wuhang2014 wuhang2014 May 12, 2026


Completion-path stats overhead

Could we reduce the amount of stats work done around request completion?

--log-stats-request-breakdown-limit limits the printed request breakdowns, but the summary still materializes the full stage_table, trans_table, and e2e_table for all requests. In addition, pformat(summary, sort_dicts=False) is evaluated before logger.debug, so the formatting cost is paid even when debug logging is disabled.

For large offline batches this can become noticeable in the completion path.

Suggested follow-ups:

  • Guard the pformat call with logger.isEnabledFor(logging.DEBUG).
  • Consider separating the lightweight logged summary from full detailed table generation/export.
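The guard in the first bullet could look like this sketch, assuming the `[Summary]` debug line quoted above:

```python
import logging
from pprint import pformat

logger = logging.getLogger("vllm_omni.summary")  # illustrative logger name

def log_summary_debug(summary: dict) -> None:
    # Only pay pformat's formatting cost when DEBUG is actually enabled
    # for this logger; otherwise the call is a cheap no-op.
    if summary and logger.isEnabledFor(logging.DEBUG):
        logger.debug("[Summary] %s", pformat(summary, sort_dicts=False))
```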

final_stage_id=final_stage_id_for_e2e,
)
submit_ts = time.time()
input_preprocess_time_ms[request_id] = (submit_ts - request_prep_start_ts) * 1000.0
Contributor

@wuhang2014 wuhang2014 May 12, 2026


Streaming timing undercount

I think the streaming path may undercount part of the request timing.

For streaming input, input_preprocess_time_ms is recorded immediately after creating the background input-stream task. At that point the first chunk may not have been consumed or submitted to the engine yet, so the time spent pulling/preparing the initial streamed input can be missed from the request breakdown.

Could we record the preprocess/dispatch timing from the point where the first streaming chunk is actually submitted, or propagate that timing from _add_streaming_input_request back into the metrics? That would make streaming and non-streaming request breakdowns comparable.
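One way to realize that, sketched with hypothetical names (`submit_fn` stands in for the engine submission path; this is not the PR's code):

```python
import time

def submit_streaming_input(chunks, submit_fn, prep_start_ts):
    # Record input-preprocess time at the moment the FIRST chunk is
    # actually submitted, so streaming and non-streaming request
    # breakdowns measure comparable spans.
    first_submit_preprocess_ms = None
    for chunk in chunks:
        if first_submit_preprocess_ms is None:
            first_submit_preprocess_ms = (time.time() - prep_start_ts) * 1000.0
        submit_fn(chunk)
    return first_submit_preprocess_ms
```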

handoff_edge_ar2diffusion[edge] += ar2d_ms
breakdown_delta_ms = sum(self._request_final_orchestration_time_ms(evt) for evt in self.e2e_events)

overall_summary = {
Contributor

@wuhang2014 wuhang2014 May 12, 2026


Prometheus exposure

Could we expose these new breakdown metrics through the Prometheus metrics path as well?

The PR builds overall_summary, stage_table, trans_table, and e2e_table, but they appear to be returned/logged only. I could not find Prometheus / StatLogger wiring for the new fields, so operators scraping /metrics would not see the request breakdowns added here.

Suggested shape:

  • Export stable aggregate fields as Prometheus histograms/counters/gauges.
  • Use bounded labels such as stage_id, stage_type, and edge.
  • Avoid labels like request_id, since that would create unbounded cardinality.
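A dependency-free sketch of that shape (a real implementation would use `prometheus_client` histograms; the class name and bucket boundaries here are illustrative):

```python
from collections import defaultdict

class StageLatencyHistogram:
    # Bounded labels only (stage_id, stage_type); never request_id,
    # which would create unbounded label cardinality.
    def __init__(self, buckets_ms=(10.0, 100.0, 1_000.0, 10_000.0)):
        self.buckets_ms = buckets_ms
        # counts[label][i] is the i-th bucket; the last slot is +Inf.
        self.counts = defaultdict(lambda: [0] * (len(buckets_ms) + 1))

    def observe(self, stage_id, stage_type, latency_ms):
        key = (str(stage_id), str(stage_type))
        for i, upper in enumerate(self.buckets_ms):
            if latency_ms <= upper:
                self.counts[key][i] += 1
                return
        self.counts[key][-1] += 1  # falls in the +Inf bucket
```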
