Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
176 changes: 111 additions & 65 deletions docs/contributing/metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@

You can use these metrics in production to monitor the health and performance of the vLLM-omni system. Typical scenarios include:

- **Performance Monitoring**: Track throughput (e.g., `e2e_avg_tokens_per_s`), latency (e.g., `e2e_total_ms`), and resource utilization to verify that the system meets expected standards.
- **Performance Monitoring**: Track throughput (e.g., `avg_tokens_per_s`), latency (e.g., `request_wall_time_ms` / `engine_pipeline_time_ms`), and resource utilization to verify that the system meets expected standards.

- **Debugging and Troubleshooting**: Use detailed per-request metrics to diagnose issues, such as high transfer times or unexpected token counts.

Expand All @@ -25,69 +25,75 @@ python openai_chat_completion_client_for_multimodal_generation.py --query-type u

With `--log-stats` enabled, the server will output detailed metrics logs after each request. Example output:

For multi-stage pipelines, vLLM-Omni also emits a concise per-request timing line:

#### Overall Summary

| Field | Value |
|-----------------------------|--------------|
| e2e_requests | 1 |
| e2e_wall_time_ms | 41,299.190 |
| e2e_total_tokens | 5,202 |
| e2e_avg_time_per_request_ms | 41,299.190 |
| e2e_avg_tokens_per_s | 125.959 |
| e2e_stage_0_wall_time_ms | 10,192.289 |
| e2e_stage_1_wall_time_ms | 30,541.409 |
| e2e_stage_2_wall_time_ms | 207.496 |

#### RequestE2EStats

| Field | Value |
|-------------------------|------------|
| e2e_total_ms | 41,299.133 |
| e2e_total_tokens | 5,202 |
| transfers_total_time_ms | 245.895 |
| transfers_total_kbytes | 138,089.939|

#### StageRequestStats

| Field | 0 | 1 | 2 |
|------------------------|--------|--------|--------|
| audio_generated_frames | 0 | 0 | 525,525|
| batch_id | 38 | 274 | 0 |
| batch_size | 1 | 1 | 1 |
| num_tokens_in | 4,860 | 4,826 | 4,384 |
| num_tokens_out | 67 | 275 | 0 |
| postprocess_time_ms | 256.158| 0.491 | 0.000 |
| stage_gen_time_ms | 9,910.007|30,379.198|160.745|

#### TransferEdgeStats

| Field | 0->1 | 1->2 |
|---------------------|-------------|------------|
| size_kbytes | 109,277.349 | 28,812.591 |
| tx_time_ms | 78.701 | 18.790 |
| rx_decode_time_ms | 111.865 | 31.706 |
| in_flight_time_ms | 2.015 | 2.819 |


These logs include:

- **Overall summary**: total requests, wall time, average tokens/sec, etc.
```text
[OmniTiming] req=chatcmpl-a0edd05 total=32.99s preprocess=2.80s engine=30.19s stages=[0:ar=16.23s,1:diffusion=13.96s] transfers=[0->1=0.780ms]
```

- **E2E table**: per-request latency and token counts.

- **Stage table**: per-stage batch and timing details.
#### Omni Metrics Summary

```text
============ Omni Metrics Summary ============
Successful requests: 1
Total E2E time (ms): 41,356.133
Input preprocess time (ms): 57.000
Engine pipeline time (ms): 41,299.133
Sum check (ms): 41,356.133

------------ Overall Time Breakdown ------------
Input preprocess time (ms): 57.000
Stage 0 generation time (ms): 9,910.007
Stage 0 output processor time (ms): 7.000
Stage 0 -> Stage 1 handoff time (ms): 245.895
Stage 1 generation time (ms): 30,379.198
Stage 1 output processor time (ms): 10.000
Final output time (ms): 747.033
Component sum (ms): 41,356.133
E2E - component sum (ms): 0.000

------------ Stage 0 Breakdown ------------
Stage generation time (ms): 9,910.007
Output processor time (ms): 7.000
Stage sum check (ms): 9,917.007

Stage id: 0
Stage name: ar
Stage type: llm
Final output type: text
Batch id: 38
Batch size: 1

Input tokens: 4,860
Output tokens: 67
Output token throughput (tok/s): 6.761
Postprocess time (ms): 256.158

------------ Stage 0 -> Stage 1 Handoff ------------
Handoff total time (ms): 245.895
AR to diffusion time (ms): 53.314
Other handoff processing time (ms): 192.581

------------ Final Output Breakdown ------------
Final output wrapping time (ms): 160.745
Final orchestration overhead time (ms): 586.288
Final output total time (ms): 747.033
Final output sum check (ms): 747.033
```

- **Transfer table**: data transfer and timing for each edge.
These logs include a high-level end-to-end summary, a non-overlapping stage/time breakdown, and per-stage details from `StageRequestStats`. The final output layer includes output wrapping plus the remaining orchestration overhead so the top-level component sum can be checked directly against E2E time.

You can use these logs to monitor system health, debug performance, and analyze request-level metrics as described above.


## Metrics Scope: Offline vs Online Inference

For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `e2e_requests` can be greater than 1, reflecting multiple completed requests in a batch.
For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `num_of_requests` can be greater than 1, reflecting multiple completed requests in a batch.

For **online inference** (serving mode), the summary is always per-request. `num_of_requests` is always 1, and only request-level metrics are reported for each completion.

For **online inference** (serving mode), the summary is always per-request. `e2e_requests` is always 1, and only request-level metrics are reported for each completion.
When `num_of_requests` is 1, average fields are omitted from the returned overall summary because they are identical to the total/request-level values. They remain meaningful for offline batches with multiple completed requests.

---

Expand All @@ -97,23 +103,62 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

| Field | Meaning |
|---------------------------|----------------------------------------------------------------------------------------------|
| `e2e_requests` | Number of completed requests. |
| `e2e_wall_time_ms` | Wall-clock time span from run start to last completion, in ms. |
| `e2e_total_tokens` | Total tokens counted across all completed requests (stage0 input + all stage outputs). |
| `e2e_avg_time_per_request_ms` | Average wall time per request: `e2e_wall_time_ms / e2e_requests`. |
| `e2e_avg_tokens_per_s` | Average token throughput over wall time: `e2e_total_tokens * 1000 / e2e_wall_time_ms`. |
| `e2e_stage_{i}_wall_time_ms` | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `e2e_stage_0_wall_time_ms`, `e2e_stage_1_wall_time_ms`, etc. |
| `num_of_requests` | Number of completed requests. |
| `request_wall_time_ms` | Wall-clock time span from request preparation start to final completion, in ms. |
| `input_preprocess_time_ms` | Time spent preparing and submitting requests before the engine pipeline starts. |
| `engine_pipeline_time_ms` | Time from engine request submission to final completion. |
| `stage_gen_total_time_ms` | Sum of all stage `stage_gen_time_ms` values. |
| `output_processor_total_time_ms` | Sum of time spent in per-stage output processors after raw engine outputs are returned. |
| `stage_handoff_total_time_ms` | Sum of inter-stage handoff time, measured after an upstream stage finishes and before the downstream stage is submitted. |
| `ar2diffusion_total_time_ms` | Subset of `stage_handoff_total_time_ms` spent converting AR outputs into diffusion inputs. |
| `final_output_total_time_ms` | Measured time spent wrapping final stage outputs into `OmniRequestOutput` objects before yielding them. |
| `breakdown_delta_time_ms` | Difference between E2E wall time and the printed component sum; useful for detecting missing timers or overlap. |
| `stage_{i}_to_{j}_handoff_time_ms` | Handoff time for a specific stage edge, e.g. `stage_0_to_1_handoff_time_ms`. |
| `stage_{i}_to_{j}_ar2diffusion_time_ms` | AR-to-diffusion conversion time included in that edge's handoff time. |
| `total_tokens` | Total tokens counted across all completed requests (stage0 input + all stage outputs). |
| `avg_request_wall_time_ms` | Average wall time per request: `request_wall_time_ms / num_of_requests`. |
| `avg_input_preprocess_time_ms` | Average pre-submit request preparation time per completed request. |
| `avg_engine_pipeline_time_ms` | Average engine pipeline time per completed request. |
| `avg_stage_gen_total_time_ms` | Average summed stage generation time per completed request. |
| `avg_output_processor_time_ms` | Average output processor time per completed request. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify that this is currently approximated by dividing across the requests in the batch, as opposed to individually timed per request and then averaged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll clarify this in the docs. For batch/offline, some average fields are currently computed from aggregate batch totals divided by the number of completed requests.

| `avg_stage_handoff_total_time_ms` | Average summed inter-stage handoff time per completed request. |
| `avg_ar2diffusion_time_ms` | Average AR-to-diffusion conversion time per completed request. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will metrics still be emitted (e.g., as 0 ms) if they aren't applicable for a model, or will they just not be emitted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are omitted from the printed summary when the value is zero or not applicable, like in the diffusion model like Wan2.2, this field will not be printed.

| `avg_final_output_time_ms` | Average final output wrapping time per completed request. |
| `avg_breakdown_delta_time_ms` | Average difference between E2E wall time and the printed component sum per completed request. |
| `avg_tokens_per_s` | Average token throughput over wall time: `total_tokens * 1000 / request_wall_time_ms`. |
| `stage_{i}_wall_time_ms` | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `stage_0_wall_time_ms`, `stage_1_wall_time_ms`, etc. |

### Timing Breakdown

The printed summary includes sum checks that show the containment relationship:

`request_wall_time_ms = input_preprocess_time_ms + engine_pipeline_time_ms`

`request_wall_time_ms ~= input_preprocess_time_ms + sum(stage_gen_time_ms) + sum(output_processor_time_ms) + sum(stage_handoff_time_ms) + final_output_total_time_ms`

Any remaining difference is reported as `breakdown_delta_time_ms`.

`ar2diffusion_total_time_ms` is included in `stage_handoff_total_time_ms`. For a concrete AR-to-diffusion edge, `stage_0_to_1_ar2diffusion_time_ms` is included in `stage_0_to_1_handoff_time_ms`.

For offline batches, average component fields such as `avg_stage_gen_total_time_ms`,
`avg_output_processor_time_ms`, `avg_stage_handoff_total_time_ms`, and
`avg_ar2diffusion_time_ms` are computed by dividing the batch aggregate by
`num_of_requests`. They are not individually timed per request and then averaged.
Per-request timers remain available in the E2E table where they are measured directly.

---

### E2E Table (per request)

| Field | Meaning |
|---------------------------|-----------------------------------------------------------------------|
| `e2e_total_ms` | End-to-end latency in ms. |
| `e2e_total_tokens` | Total tokens for the request (stage0 input + all stage outputs). |
| `request_wall_time_ms` | End-to-end latency in ms, including input preprocessing and engine pipeline time. |
| `input_preprocess_time_ms` | Time spent preparing and submitting the request before `engine_pipeline_time_ms` starts. |
| `engine_pipeline_time_ms` | Time from engine request submission to final completion. |
| `total_tokens` | Total tokens for the request (stage0 input + all stage outputs). |
| `transfers_total_time_ms` | Sum of transfer edge `total_time_ms` for this request. |
| `transfers_total_kbytes` | Sum of transfer kbytes for this request. |
| `final_output_time_ms` | Time spent wrapping the final engine output into an `OmniRequestOutput` before yielding it. |


---
Expand All @@ -126,7 +171,8 @@ For **online inference** (serving mode), the summary is always per-request. `e2e
| `batch_size` | Batch size. |
| `num_tokens_in` | Input tokens to the stage. |
| `num_tokens_out` | Output tokens from the stage. |
| `stage_gen_time_ms` | Stage compute time in ms, excluding postprocessing time (reported separately as `postprocess_time_ms`). |
| `stage_gen_time_ms` | Stage compute time in ms, excluding output processor time and postprocessing time. |
| `output_processor_time_ms` | Time spent in the stage output processor after raw engine outputs are returned. |
| `image_num` | Number of images generated (for diffusion/image stages). |
| `resolution` | Image resolution (for diffusion/image stages). |
| `postprocess_time_ms` | Diffusion/image: post-processing time in ms. |
Expand All @@ -147,20 +193,20 @@ For **online inference** (serving mode), the summary is always per-request. `e2e

**Formulas:**

- `e2e_total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`
- `total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`

- `transfers_total_time_ms = sum(tx_time_ms + rx_decode_time_ms + in_flight_time_ms)` for every edge

**Using the example above:**

**e2e_total_tokens**
**total_tokens**

- Stage0's `num_tokens_in`: **4,860**
- Stage0's `num_tokens_out`: **67**
- Stage1's `num_tokens_out`: **275**
- Stage2's `num_tokens_out`: **0**

so `e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value `e2e_total_tokens`.
so `total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value `total_tokens`.

**transfers_total_time_ms**

Expand Down
11 changes: 9 additions & 2 deletions tests/entrypoints/test_async_omni.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,15 @@ async def fake_abort_async(request_ids):
return fake_abort_async


async def fake_process_results(request_id, metrics, final_stage_id_for_e2e, req_start_ts, wall_start_ts):
del metrics, final_stage_id_for_e2e, req_start_ts, wall_start_ts
async def fake_process_results(
request_id,
metrics,
final_stage_id_for_e2e,
req_start_ts,
input_preprocess_time_ms,
wall_start_ts,
):
del metrics, final_stage_id_for_e2e, req_start_ts, input_preprocess_time_ms, wall_start_ts
if request_id.startswith("cancel-"):
await asyncio.Future()
return
Expand Down
78 changes: 78 additions & 0 deletions tests/entrypoints/test_omni_entrypoints.py
Original file line number Diff line number Diff line change
Expand Up @@ -698,6 +698,84 @@ def test_omni_generate_returns_list_when_not_using_generator(monkeypatch: pytest
assert [output.stage_id for output in outputs] == [0, 2, 0, 2]


def test_omni_generate_batches_offline_metrics_summary(monkeypatch: pytest.MonkeyPatch):
class FakeMetrics:
instances: list[FakeMetrics] = []

def __init__(
self,
num_stages: int,
log_stats: bool,
wall_start_ts: float,
final_stage_id_for_e2e: dict[str, int],
**_: Any,
) -> None:
self.num_stages = num_stages
self.log_stats = log_stats
self.wall_start_ts = wall_start_ts
self.final_stage_id_for_e2e = final_stage_id_for_e2e
self.stage_first_ts = [None] * num_stages
self.stage_last_ts = [None] * num_stages
self.e2e_done: set[str] = set()
self.final_output_time_ms: dict[str, float] = {}
self.build_count = 0
FakeMetrics.instances.append(self)

def on_stage_metrics(self, *_: Any, **__: Any) -> None:
pass

def record_final_output_time(
self,
req_id: str,
final_output_time_ms: float,
start_ts: float | None = None,
end_ts: float | None = None,
) -> None:
del start_ts, end_ts
self.final_output_time_ms[req_id] = final_output_time_ms

def on_finalize_request(
self,
stage_id: int,
req_id: str,
req_start_ts: float,
input_preprocess_time_ms: float = 0.0,
) -> None:
del stage_id, req_start_ts, input_preprocess_time_ms
self.e2e_done.add(req_id)

def build_and_log_summary(self) -> dict[str, Any]:
self.build_count += 1
return {"num_requests": len(self.e2e_done)}

sampling_params = [SamplingParams(max_tokens=8) for _ in range(2)]
engine = FakeAsyncOmniEngine(
stage_metadata=LLM_DIFFUSION_META,
default_sampling_params_list=sampling_params,
on_add_request=_enqueue_omni_llm_diffusion_outputs,
)
_patch_engine(monkeypatch, engine)
monkeypatch.setattr("vllm_omni.entrypoints.omni.OrchestratorMetrics", FakeMetrics)

app = Omni("dummy-model", log_stats=True)
try:
outputs = list(app.generate(["p1", "p2"], py_generator=True, use_tqdm=False))
finally:
if not engine.shutdown_called:
app.shutdown()

assert len(outputs) == 4
assert len(FakeMetrics.instances) == 1
metrics = FakeMetrics.instances[0]
assert metrics.final_stage_id_for_e2e == {
engine.submitted[0]["request_id"]: 1,
engine.submitted[1]["request_id"]: 1,
}
assert metrics.e2e_done == {engine.submitted[0]["request_id"], engine.submitted[1]["request_id"]}
assert metrics.build_count == 1
assert app.request_states == {}


def test_omni_generate_diffusion_only_yields_single_image_per_request(monkeypatch: pytest.MonkeyPatch):
sampling_params = [SamplingParams(max_tokens=8)]
engine = FakeAsyncOmniEngine(
Expand Down
Loading
Loading