vllm-project · bjf-frz · Apr 23, 2026 · Apr 23, 2026 · Apr 23, 2026 · Apr 23, 2026
@@ -3,7 +3,7 @@
 
 You can use these metrics in production to monitor the health and performance of the vLLM-omni system. Typical scenarios include:
 
-- **Performance Monitoring**: Track throughput (e.g., `e2e_avg_tokens_per_s`), latency (e.g., `e2e_total_ms`), and resource utilization to verify that the system meets expected standards.
+- **Performance Monitoring**: Track throughput (e.g., `avg_tokens_per_s`), latency (e.g., `request_wall_time_ms` / `engine_pipeline_time_ms`), and resource utilization to verify that the system meets expected standards.
 
 - **Debugging and Troubleshooting**: Use detailed per-request metrics to diagnose issues, such as high transfer times or unexpected token counts.
 
@@ -25,69 +25,75 @@ python openai_chat_completion_client_for_multimodal_generation.py --query-type u
 
 With `--log-stats` enabled, the server will output detailed metrics logs after each request. Example output:
 
+For multi-stage pipelines, vLLM-Omni also emits a concise per-request timing line:
 
-#### Overall Summary
-
-| Field                       | Value        |
-|-----------------------------|--------------|
-| e2e_requests                | 1            |
-| e2e_wall_time_ms            | 41,299.190   |
-| e2e_total_tokens            | 5,202        |
-| e2e_avg_time_per_request_ms | 41,299.190   |
-| e2e_avg_tokens_per_s        | 125.959      |
-| e2e_stage_0_wall_time_ms    | 10,192.289   |
-| e2e_stage_1_wall_time_ms    | 30,541.409   |
-| e2e_stage_2_wall_time_ms    |    207.496   |
-
-#### RequestE2EStats
-
-| Field                   | Value      |
-|-------------------------|------------|
-| e2e_total_ms            | 41,299.133 |
-| e2e_total_tokens        | 5,202      |
-| transfers_total_time_ms | 245.895    |
-| transfers_total_kbytes  | 138,089.939|
-
-#### StageRequestStats
-
-| Field                  | 0      | 1      | 2      |
-|------------------------|--------|--------|--------|
-| audio_generated_frames | 0      | 0      | 525,525|
-| batch_id               | 38     | 274    | 0      |
-| batch_size             | 1      | 1      | 1      |
-| num_tokens_in          | 4,860  | 4,826  | 4,384  |
-| num_tokens_out         | 67     | 275    | 0      |
-| postprocess_time_ms    | 256.158| 0.491  | 0.000  |
-| stage_gen_time_ms      | 9,910.007|30,379.198|160.745|
-
-#### TransferEdgeStats
-
-| Field               | 0->1        | 1->2       |
-|---------------------|-------------|------------|
-| size_kbytes         | 109,277.349 | 28,812.591 |
-| tx_time_ms          | 78.701      | 18.790     |
-| rx_decode_time_ms   | 111.865     | 31.706     |
-| in_flight_time_ms   | 2.015       | 2.819      |
-
-
-These logs include:
-
-- **Overall summary**: total requests, wall time, average tokens/sec, etc.
+```text
+[OmniTiming] req=chatcmpl-a0edd05 total=32.99s preprocess=2.80s engine=30.19s stages=[0:ar=16.23s,1:diffusion=13.96s] transfers=[0->1=0.780ms]
+```
 
-- **E2E table**: per-request latency and token counts.
 
-- **Stage table**: per-stage batch and timing details.
+#### Omni Metrics Summary
+
+```text
+============ Omni Metrics Summary ============
+Successful requests:                                  1
+Total E2E time (ms):                         41,356.133
+Input preprocess time (ms):                      57.000
+Engine pipeline time (ms):                   41,299.133
+Sum check (ms):                              41,356.133
+
+------------ Overall Time Breakdown ------------
+Input preprocess time (ms):                      57.000
+Stage 0 generation time (ms):                 9,910.007
+Stage 0 output processor time (ms):               7.000
+Stage 0 -> Stage 1 handoff time (ms):           245.895
+Stage 1 generation time (ms):                30,379.198
+Stage 1 output processor time (ms):              10.000
+Final output time (ms):                         747.033
+Component sum (ms):                          41,356.133
+E2E - component sum (ms):                         0.000
+
+------------ Stage 0 Breakdown ------------
+Stage generation time (ms):                  9,910.007
+Output processor time (ms):                      7.000
+Stage sum check (ms):                        9,917.007
+
+Stage id:                                            0
+Stage name:                                         ar
+Stage type:                                        llm
+Final output type:                                text
+Batch id:                                           38
+Batch size:                                          1
+
+Input tokens:                                    4,860
+Output tokens:                                      67
+Output token throughput (tok/s):                 6.761
+Postprocess time (ms):                         256.158
+
+------------ Stage 0 -> Stage 1 Handoff ------------
+Handoff total time (ms):                       245.895
+AR to diffusion time (ms):                      53.314
+Other handoff processing time (ms):            192.581
+
+------------ Final Output Breakdown ------------
+Final output wrapping time (ms):                160.745
+Final orchestration overhead time (ms):         586.288
+Final output total time (ms):                   747.033
+Final output sum check (ms):                    747.033
+```
 
-- **Transfer table**: data transfer and timing for each edge.
+These logs include a high-level end-to-end summary, a non-overlapping stage/time breakdown, and per-stage details from `StageRequestStats`. The final output layer includes output wrapping plus the remaining orchestration overhead so the top-level component sum can be checked directly against E2E time.
 
 You can use these logs to monitor system health, debug performance, and analyze request-level metrics as described above.
 
 
 ## Metrics Scope: Offline vs Online Inference
 
-For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `e2e_requests` can be greater than 1, reflecting multiple completed requests in a batch.
+For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `num_of_requests` can be greater than 1, reflecting multiple completed requests in a batch.
+
+For **online inference** (serving mode), the summary is always per-request. `num_of_requests` is always 1, and only request-level metrics are reported for each completion.
 
-For **online inference** (serving mode), the summary is always per-request. `e2e_requests` is always 1, and only request-level metrics are reported for each completion.
+When `num_of_requests` is 1, average fields are omitted from the returned overall summary because they are identical to the total/request-level values. They remain meaningful for offline batches with multiple completed requests.
 
 ---
 
@@ -97,23 +103,62 @@ For **online inference** (serving mode), the summary is always per-request. `e2e
 
 | Field                     | Meaning                                                                                       |
 |---------------------------|----------------------------------------------------------------------------------------------|
-| `e2e_requests`            | Number of completed requests.                                                                |
-| `e2e_wall_time_ms`        | Wall-clock time span from run start to last completion, in ms.                               |
-| `e2e_total_tokens`        | Total tokens counted across all completed requests (stage0 input + all stage outputs).       |
-| `e2e_avg_time_per_request_ms` | Average wall time per request: `e2e_wall_time_ms / e2e_requests`.                        |
-| `e2e_avg_tokens_per_s`    | Average token throughput over wall time: `e2e_total_tokens * 1000 / e2e_wall_time_ms`.      |
-| `e2e_stage_{i}_wall_time_ms` | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `e2e_stage_0_wall_time_ms`, `e2e_stage_1_wall_time_ms`, etc. |
+| `num_of_requests`         | Number of completed requests.                                                                |
+| `request_wall_time_ms`        | Wall-clock time span from request preparation start to final completion, in ms.          |
+| `input_preprocess_time_ms` | Time spent preparing and submitting requests before the engine pipeline starts.             |
+| `engine_pipeline_time_ms` | Time from engine request submission to final completion.                                     |
+| `stage_gen_total_time_ms` | Sum of all stage `stage_gen_time_ms` values.                                                 |
+| `output_processor_total_time_ms` | Sum of time spent in per-stage output processors after raw engine outputs are returned. |
+| `stage_handoff_total_time_ms` | Sum of inter-stage handoff time, measured after an upstream stage finishes and before the downstream stage is submitted. |
+| `ar2diffusion_total_time_ms` | Subset of `stage_handoff_total_time_ms` spent converting AR outputs into diffusion inputs. |
+| `final_output_total_time_ms` | Measured time spent wrapping final stage outputs into `OmniRequestOutput` objects before yielding them. |
+| `breakdown_delta_time_ms` | Difference between E2E wall time and the printed component sum; useful for detecting missing timers or overlap. |
+| `stage_{i}_to_{j}_handoff_time_ms` | Handoff time for a specific stage edge, e.g. `stage_0_to_1_handoff_time_ms`.       |
+| `stage_{i}_to_{j}_ar2diffusion_time_ms` | AR-to-diffusion conversion time included in that edge's handoff time.           |
+| `total_tokens`            | Total tokens counted across all completed requests (stage0 input + all stage outputs).       |
+| `avg_request_wall_time_ms` | Average wall time per request: `request_wall_time_ms / num_of_requests`.                  |
+| `avg_input_preprocess_time_ms` | Average pre-submit request preparation time per completed request.                    |
+| `avg_engine_pipeline_time_ms` | Average engine pipeline time per completed request.                                      |
+| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request.                              |
+| `avg_output_processor_time_ms` | Average output processor time per completed request.                                   |
+| `avg_stage_handoff_total_time_ms` | Average summed inter-stage handoff time per completed request.                    |
+| `avg_ar2diffusion_time_ms` | Average AR-to-diffusion conversion time per completed request.                             |
+| `avg_final_output_time_ms` | Average final output wrapping time per completed request.                                  |
+| `avg_breakdown_delta_time_ms` | Average difference between E2E wall time and the printed component sum per completed request. |
+| `avg_tokens_per_s`        | Average token throughput over wall time: `total_tokens * 1000 / request_wall_time_ms`.      |
+| `stage_{i}_wall_time_ms`  | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `stage_0_wall_time_ms`, `stage_1_wall_time_ms`, etc. |
+
+### Timing Breakdown
+
+The printed summary includes sum checks that show the containment relationship:
+
+`request_wall_time_ms = input_preprocess_time_ms + engine_pipeline_time_ms`
+
+`request_wall_time_ms ~= input_preprocess_time_ms + sum(stage_gen_time_ms) + sum(output_processor_time_ms) + sum(stage_handoff_time_ms) + final_output_total_time_ms`
+
+Any remaining difference is reported as `breakdown_delta_time_ms`.
+
+`ar2diffusion_total_time_ms` is included in `stage_handoff_total_time_ms`. For a concrete AR-to-diffusion edge, `stage_0_to_1_ar2diffusion_time_ms` is included in `stage_0_to_1_handoff_time_ms`.
+
+For offline batches, average component fields such as `avg_stage_gen_total_time_ms`,
+`avg_output_processor_time_ms`, `avg_stage_handoff_total_time_ms`, and
+`avg_ar2diffusion_time_ms` are computed by dividing the batch aggregate by
+`num_of_requests`. They are not individually timed per request and then averaged.
+Per-request timers remain available in the E2E table where they are measured directly.
 
 ---
 
 ### E2E Table (per request)
 
 | Field                     | Meaning                                                               |
 |---------------------------|-----------------------------------------------------------------------|
-| `e2e_total_ms`            | End-to-end latency in ms.                                             |
-| `e2e_total_tokens`        | Total tokens for the request (stage0 input + all stage outputs).      |
+| `request_wall_time_ms`    | End-to-end latency in ms, including input preprocessing and engine pipeline time.            |
+| `input_preprocess_time_ms` | Time spent preparing and submitting the request before `engine_pipeline_time_ms` starts.    |
+| `engine_pipeline_time_ms` | Time from engine request submission to final completion.                                     |
+| `total_tokens`            | Total tokens for the request (stage0 input + all stage outputs).      |
 | `transfers_total_time_ms` | Sum of transfer edge `total_time_ms` for this request.                |
 | `transfers_total_kbytes`  | Sum of transfer kbytes for this request.                              |
+| `final_output_time_ms`    | Time spent wrapping the final engine output into an `OmniRequestOutput` before yielding it. |
 
 
 ---
@@ -126,7 +171,8 @@ For **online inference** (serving mode), the summary is always per-request. `e2e
 | `batch_size`              | Batch size.                                                                                     |
 | `num_tokens_in`           | Input tokens to the stage.                                                                      |
 | `num_tokens_out`          | Output tokens from the stage.                                                                   |
-| `stage_gen_time_ms`       | Stage compute time in ms, excluding postprocessing time (reported separately as `postprocess_time_ms`). |
+| `stage_gen_time_ms`       | Stage compute time in ms, excluding output processor time and postprocessing time.             |
+| `output_processor_time_ms` | Time spent in the stage output processor after raw engine outputs are returned.              |
 | `image_num`               | Number of images generated (for diffusion/image stages).                                        |
 | `resolution`              | Image resolution (for diffusion/image stages).                                                                  |
 | `postprocess_time_ms` | Diffusion/image: post-processing time in ms.                                                    |
@@ -147,20 +193,20 @@ For **online inference** (serving mode), the summary is always per-request. `e2e
 
 **Formulas:**
 
-- `e2e_total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`
+- `total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`
 
 - `transfers_total_time_ms = sum(tx_time_ms + rx_decode_time_ms + in_flight_time_ms)` for every edge
 
 **Using the example above:**
 
-**e2e_total_tokens**
+**total_tokens**
 
 - Stage0's `num_tokens_in`: **4,860**
 - Stage0's `num_tokens_out`: **67**
 - Stage1's `num_tokens_out`: **275**
 - Stage2's `num_tokens_out`: **0**
 
-so `e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value `e2e_total_tokens`.
+so `total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value `total_tokens`.
 
 **transfers_total_time_ms**
 

@@ -28,8 +28,15 @@ async def fake_abort_async(request_ids):
     return fake_abort_async
 
 
-async def fake_process_results(request_id, metrics, final_stage_id_for_e2e, req_start_ts, wall_start_ts):
-    del metrics, final_stage_id_for_e2e, req_start_ts, wall_start_ts
+async def fake_process_results(
+    request_id,
+    metrics,
+    final_stage_id_for_e2e,
+    req_start_ts,
+    input_preprocess_time_ms,
+    wall_start_ts,
+):
+    del metrics, final_stage_id_for_e2e, req_start_ts, input_preprocess_time_ms, wall_start_ts
     if request_id.startswith("cancel-"):
         await asyncio.Future()
         return

@@ -698,6 +698,84 @@ def test_omni_generate_returns_list_when_not_using_generator(monkeypatch: pytest
     assert [output.stage_id for output in outputs] == [0, 2, 0, 2]
 
 
+def test_omni_generate_batches_offline_metrics_summary(monkeypatch: pytest.MonkeyPatch):
+    class FakeMetrics:
+        instances: list[FakeMetrics] = []
+
+        def __init__(
+            self,
+            num_stages: int,
+            log_stats: bool,
+            wall_start_ts: float,
+            final_stage_id_for_e2e: dict[str, int],
+            **_: Any,
+        ) -> None:
+            self.num_stages = num_stages
+            self.log_stats = log_stats
+            self.wall_start_ts = wall_start_ts
+            self.final_stage_id_for_e2e = final_stage_id_for_e2e
+            self.stage_first_ts = [None] * num_stages
+            self.stage_last_ts = [None] * num_stages
+            self.e2e_done: set[str] = set()
+            self.final_output_time_ms: dict[str, float] = {}
+            self.build_count = 0
+            FakeMetrics.instances.append(self)
+
+        def on_stage_metrics(self, *_: Any, **__: Any) -> None:
+            pass
+
+        def record_final_output_time(
+            self,
+            req_id: str,
+            final_output_time_ms: float,
+            start_ts: float | None = None,
+            end_ts: float | None = None,
+        ) -> None:
+            del start_ts, end_ts
+            self.final_output_time_ms[req_id] = final_output_time_ms
+
+        def on_finalize_request(
+            self,
+            stage_id: int,
+            req_id: str,
+            req_start_ts: float,
+            input_preprocess_time_ms: float = 0.0,
+        ) -> None:
+            del stage_id, req_start_ts, input_preprocess_time_ms
+            self.e2e_done.add(req_id)
+
+        def build_and_log_summary(self) -> dict[str, Any]:
+            self.build_count += 1
+            return {"num_requests": len(self.e2e_done)}
+
+    sampling_params = [SamplingParams(max_tokens=8) for _ in range(2)]
+    engine = FakeAsyncOmniEngine(
+        stage_metadata=LLM_DIFFUSION_META,
+        default_sampling_params_list=sampling_params,
+        on_add_request=_enqueue_omni_llm_diffusion_outputs,
+    )
+    _patch_engine(monkeypatch, engine)
+    monkeypatch.setattr("vllm_omni.entrypoints.omni.OrchestratorMetrics", FakeMetrics)
+
+    app = Omni("dummy-model", log_stats=True)
+    try:
+        outputs = list(app.generate(["p1", "p2"], py_generator=True, use_tqdm=False))
+    finally:
+        if not engine.shutdown_called:
+            app.shutdown()
+
+    assert len(outputs) == 4
+    assert len(FakeMetrics.instances) == 1
+    metrics = FakeMetrics.instances[0]
+    assert metrics.final_stage_id_for_e2e == {
+        engine.submitted[0]["request_id"]: 1,
+        engine.submitted[1]["request_id"]: 1,
+    }
+    assert metrics.e2e_done == {engine.submitted[0]["request_id"], engine.submitted[1]["request_id"]}
+    assert metrics.build_count == 1
+    assert app.request_states == {}
+
+
 def test_omni_generate_diffusion_only_yields_single_image_per_request(monkeypatch: pytest.MonkeyPatch):
     sampling_params = [SamplingParams(max_tokens=8)]
     engine = FakeAsyncOmniEngine(