4 changes: 4 additions & 0 deletions docs/design/index.md
@@ -13,6 +13,10 @@ This section contains design documents and architecture specifications for vLLM-
- [Adding Step Execution Support for Diffusion Pipelines](feature/diffusion_step_execution.md)
- [Continuous Batching for Step-Wise Diffusion](feature/diffusion_continuous_batching.md)

## Infrastructure Design Documents

- [Prometheus Metrics](metrics.md)

## Module Design Documents

- [AR Module](module/ar_module.md)
198 changes: 198 additions & 0 deletions docs/design/metrics.md
@@ -0,0 +1,198 @@
# Prometheus Metrics Design

This document describes how vLLM-Omni exposes Prometheus metrics for
multi-stage pipelines, the constraints that shaped the design, and how
the pipeline-level metrics coexist with upstream vLLM per-engine
metrics.

## Objectives

- Expose pipeline-level request and latency metrics that span the full
multi-stage execution (orchestrator scope).
- Preserve all upstream vLLM per-engine metrics (`vllm:*`) for stages
backed by an AR LLM engine.
- Expose per-stage diffusion timing breakdowns for pipelines that
include a diffusion engine.
- Keep the metrics collection overhead low enough that it does not
regress TTFA or throughput.

## Background

### Upstream vLLM Metrics

Upstream vLLM defines 44 Prometheus metrics under the `vllm:` prefix.
These are registered by `PrometheusStatLogger` and cover engine-level
state: KV cache usage, running/waiting request counts, token
throughput, TTFT, inter-token latency, e2e latency, and so on. They
are served via the `/metrics` HTTP endpoint provided by
`prometheus_fastapi_instrumentator` and the default
`prometheus_client` ASGI handler.

vLLM's `unregister_vllm_metrics()` function strips every
`prometheus_client` collector whose `_name` attribute contains the
substring `"vllm"`. This runs during engine initialization to clean up
stale collectors from prior instantiations within the same process.
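
The breadth of that substring match drives the rest of this design. Here is a minimal model of the filter, assuming only what is stated above (a match on `"vllm"` anywhere in the collector name). The `would_be_unregistered` helper and the sample names are illustrative and are not vLLM code:

```python
def would_be_unregistered(collector_name: str) -> bool:
    """Model of the filter described above: any name containing "vllm" matches."""
    return "vllm" in collector_name


# Illustrative collector names, one per namespace discussed in this document.
names = [
    "vllm:num_requests_running",       # upstream per-engine metric
    "vllm_omni:num_requests_running",  # pipeline-level metric
    "http_requests_total",             # HTTP-layer metric
]

matches = {name: would_be_unregistered(name) for name in names}
for name, matched in matches.items():
    print(f"{name}: {matched}")
```

Both the upstream name and the pipeline-level name match, so choosing a different prefix that still contains `vllm` does not by itself protect pipeline collectors.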

### The Problem

vLLM-Omni runs multiple engine instances (stages) within a single
process, coordinated by an Orchestrator. The pipeline needs its own
metrics — aggregate request counts, end-to-end latency across all
stages, and diffusion timing breakdowns — that do not exist in upstream
vLLM. All pipeline-level metrics use the `vllm_omni:` prefix to
distinguish them from upstream per-engine metrics. The
`unregister_vllm_metrics()` function is monkey-patched to a no-op at
import time (see `vllm_omni/patch.py`) so that these metrics are not
destroyed during engine initialization (this is a temporary fix until
vLLM patches this behavior).

Upstream per-engine metrics retain the `vllm:` prefix and are
registered by a `PrometheusStatLogger` instance that the Orchestrator
creates and feeds directly.
Comment on lines +38 to +51
Contributor

@wuhang2014 wuhang2014 May 6, 2026


I don’t think vllm_omni: is the right long-term design here. The orchestrator already uses the correct pattern for upstream metrics: one PrometheusStatLogger with per-engine labels, not one logger per stage. The problem is the separate OmniPrometheusMetrics path in OmniBase, which registers fresh collectors per instance.

That separate registration path is what forces the custom prefix, because upstream unregister_vllm_metrics() removes any collector whose name contains "vllm". So a simple rename to vllm:* would still be fragile.

I’d suggest folding these omni metrics into the single orchestrator-owned/global Prometheus logger instead of registering a second collector set in OmniBase. Then they can live in the same namespace with names like vllm:omni_*, while stage/model separation stays in labels. That would avoid both duplicate registration and the prefix workaround.

Contributor Author


It's not exactly one logger per stage. It's one logger for vLLM stages (PrometheusStatLogger) and one logger for diffuser stages/the pipeline (OmniPrometheusMetrics)

The separate OmniPrometheusMetrics class inside OmniBase is necessary because some vLLM metrics require per-iteration updates, while e2e pipeline statistics require the OmniBase/entrypoint context. For example:

  • num_requests_waiting: Requires reading the length of OmniBase.request_states
  • num_requests_fail: Requires tracking in OmniBase since malformed HTTP requests can fail before they reach the orchestrator

Meanwhile, any metrics that depend upon vLLM's IterationStats logically should be collected at the orchestrator level, because that's where the per-iteration control occurs.

Additionally, I chose to keep the vLLM-Omni metrics separate from the vLLM metrics because PrometheusStatLogger is imported directly from vLLM. The reason vLLM metrics work seamlessly in this PR is that vLLM metrics infrastructure is (mostly) preserved, and trying to merge it with the vLLM-Omni-specific logic is likely to break that support or at least make it more brittle.

Folding OmniPrometheusMetrics into this class (i.e. subclassing PrometheusStatLogger in OmniPrometheusMetrics) would still require at least the following:

  • Collecting all generated vLLM IterationStats into a list within the orchestrator and passing this up to OmniBase
  • Moving PrometheusStatLogger.record() to OmniBase, where you would have to call it once per collected IterationStats
  • Creating a new method to record pipeline/diffusion statistics since PrometheusStatLogger.record() takes the vLLM-specific dataclasses IterationStats and SchedulerStats

Contributor


Thanks, this makes sense.

I still don’t think vllm_omni: is the right design here. It feels like a workaround for unregister_vllm_metrics() being too broad, rather than the right metric namespace.

I’d rather keep the metric names aligned with the vLLM namespace and add a small hack/fix in unregister_vllm_metrics() so it avoids unnecessary unregister, instead of encoding that workaround into a new long-term prefix.

Contributor Author


OK, I've submitted a vLLM PR to fix this: vllm-project/vllm#42331. It won't be merged until at least the next minor version, so I've added a patch that temporarily changes unregister_vllm_metrics to a no-op. Either way, vLLM-Omni metrics now use the vllm:omni_ prefix.


## Architecture

### Component Overview

```
                  +-----------------------+
                  | API Server (FastAPI)  |
                  |     GET /metrics      |
                  +-----------+-----------+
                              |
              prometheus_client default handler
                              |
              +---------------+---------------+
              |                               |
   vllm_omni:* collectors             vllm:* collectors
              |                               |
+----------------------------+  +--------------------------+
|  OmniPrometheusStatLogger  |  | VllmPrometheusStatLogger |
+----------------------------+  +--------------------------+
              |                               |
          OmniBase                      Orchestrator
     (request lifecycle,            (feeds SchedulerStats
      diffusion timing)              + IterationStats
                                     per engine step)
```

### Data Flow

There are two independent paths for metric collection.

**Path 1: Pipeline-level metrics (`vllm_omni:*`)**

`OmniPrometheusStatLogger` registers Gauge, Counter, and Histogram
collectors at init time. It is instantiated once per entrypoint,
labeled with the model name. The entrypoint calls its methods as
requests progress:

- `set_running(n)` / `set_waiting(n)` — updated after each request
completes. The running count comes from `OmniRequestCounter`, a
simple counter incremented/decremented by the Orchestrator as it
tracks requests. Waiting is derived as `total - running`.

- `request_succeeded(e2e_seconds, queue_seconds)` — recorded when a
request finishes at the final stage.

- `request_failed()` — recorded when a request errors.

- `observe_diffusion_metrics(stage_id, metrics)` — recorded when a
diffusion stage finishes. The metrics dict contains timing
breakdowns (preprocess, exec, postprocess, total step time)
accumulated from engine output.
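
The call sequence above can be sketched with an in-memory stand-in. `RecordingStatLogger` and `update_gauges` are hypothetical names used only for illustration; the method names come from the list above, while the argument types, the `exec_time_ms` key, and the clamp at zero are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingStatLogger:
    """Hypothetical in-memory stand-in exposing the four methods listed above."""

    running: int = 0
    waiting: int = 0
    failed: int = 0
    succeeded: list = field(default_factory=list)
    diffusion: list = field(default_factory=list)

    def set_running(self, n: int) -> None:
        self.running = n

    def set_waiting(self, n: int) -> None:
        self.waiting = n

    def request_succeeded(self, e2e_seconds: float, queue_seconds: float) -> None:
        self.succeeded.append((e2e_seconds, queue_seconds))

    def request_failed(self) -> None:
        self.failed += 1

    def observe_diffusion_metrics(self, stage_id: int, metrics: dict) -> None:
        self.diffusion.append((stage_id, metrics))


def update_gauges(stats: RecordingStatLogger, total: int, running: int) -> None:
    """Derive waiting as total - running; the clamp at zero is an assumption."""
    stats.set_running(running)
    stats.set_waiting(max(total - running, 0))


stats = RecordingStatLogger()
stats.observe_diffusion_metrics(1, {"exec_time_ms": 120.0})  # illustrative key
stats.request_succeeded(e2e_seconds=2.5, queue_seconds=0.3)
update_gauges(stats, total=5, running=3)
print(stats.running, stats.waiting)
```

The clamp guards against a running count that momentarily exceeds the total because the two values are read at slightly different times.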

**Path 2: Per-engine metrics (`vllm:*`)**

The Orchestrator instantiates upstream vLLM's `PrometheusStatLogger`
and feeds it scheduler stats and iteration stats after processing
each batch of engine outputs. This populates the standard vLLM
metrics (TTFT, token throughput, cache usage, etc.) using the same
code path as standalone vLLM. For diffusion-only pipelines that have
no AR engine, `SchedulerStats` is never produced and `vllm:*` metrics
are absent.

### Shared State Between Threads

The Orchestrator runs in a background thread. The API server
(OmniBase) runs in the asyncio event loop thread.
`OmniRequestCounter` bridges them — a plain Python object with an
`int` field. The Orchestrator increments/decrements it; the
entrypoint reads it for gauge updates. No lock is needed because the
counter is advisory (a stale read by one Prometheus scrape interval
is acceptable). It is created by `AsyncOmniEngine.__init__()` and
passed to the Orchestrator at construction time.
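
A minimal sketch of the single-writer pattern described above, assuming one writer thread and one reader. `RequestCounter` and `orchestrator_loop` are hypothetical stand-ins, not the actual `OmniRequestCounter` or Orchestrator code.

```python
import threading


class RequestCounter:
    """Hypothetical stand-in for OmniRequestCounter: one plain int field."""

    def __init__(self) -> None:
        self.running = 0


def orchestrator_loop(counter: RequestCounter, started: int, finished: int) -> None:
    """Single writer: only this thread mutates the counter."""
    for _ in range(started):
        counter.running += 1
    for _ in range(finished):
        counter.running -= 1


counter = RequestCounter()
worker = threading.Thread(target=orchestrator_loop, args=(counter, 8, 5))
worker.start()
snapshot = counter.running  # reader side: possibly stale, never locked
worker.join()
print(counter.running)
```

The lock-free read is reasonable here because there is exactly one writer. With several writer threads, `+=` on an attribute is a read-modify-write sequence and could lose updates, so that case would need a lock.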

### Metric Registration and Lifecycle

All `vllm_omni:*` collectors are registered once when
`OmniPrometheusStatLogger.__init__()` runs. Per-stage labels
(`model_name`, `engine`) are bound lazily on first observation to
avoid registering labels for stages that never produce data (e.g., a
diffusion pipeline has no AR stage stats).

The `prometheus_client` default registry holds all collectors.
FastAPI's `/metrics` endpoint serves the default registry, so both
`vllm_omni:*` and `vllm:*` metrics appear in the same scrape
response alongside `http_*` and `process_*` metrics from the
instrumentator and the Python client runtime.

## Throttling: `make_stats()` Override

Upstream vLLM's `Scheduler.make_stats()` runs on every AR generation step and
returns a `SchedulerStats` object for the orchestrator. Under vLLM's
architecture this is fine. In vLLM-Omni, however, the object must be serialized
and transferred over ZMQ, so emitting a `SchedulerStats` object on every step
adds serialization and transfer overhead to each step.

`OmniSchedulerMixin.make_stats()` (in
`vllm_omni/core/sched/omni_scheduler_mixin.py`) throttles stats
emission to at most once per second. Between intervals it returns
`None`, which the engine core skips serializing. This keeps gauges
fresh enough for Prometheus scrapes (typically 15-30s intervals) while
eliminating the per-step overhead.
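
The throttle can be sketched as follows. This is not the `OmniSchedulerMixin` implementation: `ThrottledStats`, its `clock` parameter, and the `build` callback are hypothetical, and the fake clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class ThrottledStats:
    """Emit a stats payload at most once per interval, otherwise None."""

    def __init__(self, interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval_s = interval_s
        self._clock = clock
        self._last_emit: Optional[float] = None

    def make_stats(self, build: Callable[[], dict]) -> Optional[dict]:
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._interval_s:
            return None  # skipped: nothing to serialize for this step
        self._last_emit = now
        return build()


# Drive the throttle with a fake clock so the behavior is deterministic.
fake_now = [0.0]
throttle = ThrottledStats(interval_s=1.0, clock=lambda: fake_now[0])

results = []
for t in (0.0, 0.2, 0.9, 1.0, 1.5, 2.1):
    fake_now[0] = t
    emitted = throttle.make_stats(lambda: {"t": fake_now[0]})
    results.append(emitted is not None)
print(results)  # [True, False, False, True, False, True]
```

Building the payload only when the interval has elapsed means skipped steps avoid both the construction cost and the serialization cost.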

## Metric Definitions

### Pipeline-Level

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `vllm_omni:num_requests_running` | Gauge | `model_name` | Requests currently executing across all stages |
| `vllm_omni:num_requests_waiting` | Gauge | `model_name` | Requests queued but not yet scheduled |
| `vllm_omni:num_requests_success` | Counter | `model_name` | Requests completed without error |
| `vllm_omni:num_requests_fail` | Counter | `model_name` | Requests that returned an error |
| `vllm_omni:e2e_request_latency_seconds` | Histogram | `model_name` | End-to-end request latency across all stages |
| `vllm_omni:request_queue_time_seconds` | Histogram | `model_name` | Time spent waiting in the request queue |

### Diffusion Stage-Level

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `vllm_omni:diffusion_preprocess_time_ms` | Histogram | `model_name`, `engine` | Diffusion input preprocessing time |
| `vllm_omni:diffusion_exec_time_ms` | Histogram | `model_name`, `engine` | Diffusion model forward pass time |
| `vllm_omni:diffusion_postprocess_time_ms` | Histogram | `model_name`, `engine` | Diffusion output postprocessing time |
| `vllm_omni:diffusion_step_time_ms` | Histogram | `model_name`, `engine` | Total diffusion step time |

### LLM Stage-Level

See the [vLLM docs](https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md).

Metrics that depend on features vLLM-Omni does not support (e.g. speculative decoding, LoRA) are not available.

## Logging vs. Prometheus

`OrchestratorAggregator` (in `vllm_omni/metrics/stats.py`) is the
logging-oriented metrics path. It collects detailed per-request,
per-stage, and per-transfer statistics and prints formatted tables to
the `INFO` log. This is designed for development and debugging —
individual request traces, transfer bandwidth, inter-stage timing.

`OmniPrometheusStatLogger` is the Prometheus-oriented path. It records
aggregate counters, gauges, and histograms suitable for time-series
monitoring and alerting. The two paths are independent; both can run
simultaneously.

The separation follows upstream vLLM's pattern of `LoggingStatLogger`
vs. `PrometheusStatLogger` — same underlying data, different
consumption models.
79 changes: 79 additions & 0 deletions docs/usage/metrics.md
@@ -0,0 +1,79 @@
# Production Metrics

vLLM-Omni exposes Prometheus metrics via the `/metrics` endpoint on the
OpenAI-compatible API server. The metrics fall into three categories depending
on the pipeline type.

```bash
vllm-omni serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000
curl http://localhost:8000/metrics
```

## Metric Namespaces

| Prefix | Source | Present when |
|--------|--------|--------------|
| `vllm_omni:` | vLLM-Omni orchestrator / diffusion stages | Always (request and latency metrics); diffusion timing only when the pipeline includes a diffusion stage |
| `vllm:` | Upstream vLLM engine | Pipeline includes an LLM (AR) stage |
| `http_` / `process_` | `prometheus_fastapi_instrumentator` / `prometheus_client` runtime | Always |
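
To check which namespaces a deployment actually exposes, the scrape body can be grouped by prefix. The sketch below assumes the standard Prometheus text exposition format; `namespace_of` and `count_by_namespace` are hypothetical helpers, and the sample body is made up for illustration rather than captured from a server.

```python
from collections import Counter


def namespace_of(sample_line: str) -> str:
    """Map one sample line to a namespace prefix from the table above."""
    name = sample_line.split("{", 1)[0].split(" ", 1)[0]
    for prefix in ("vllm_omni:", "vllm:", "http_", "process_"):
        if name.startswith(prefix):
            return prefix
    return "other"


def count_by_namespace(body: str) -> Counter:
    """Count sample lines per namespace, skipping # HELP / # TYPE comments."""
    counts: Counter = Counter()
    for line in body.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            counts[namespace_of(line)] += 1
    return counts


# Made-up sample body for illustration; not captured from a real server.
sample = """\
# TYPE vllm_omni:num_requests_running gauge
vllm_omni:num_requests_running{model_name="demo"} 2.0
vllm:num_requests_running{model_name="demo"} 1.0
process_cpu_seconds_total 12.5
"""
counts = count_by_namespace(sample)
print(dict(counts))
```

In practice the body would come from the `/metrics` endpoint shown earlier, for example by saving the `curl` output to a file and passing its contents to `count_by_namespace`.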

## Pipeline-Level Metrics (`vllm_omni:`)

These metrics are defined in `vllm_omni/metrics/prometheus.py` and track
request lifecycle across the full multi-stage pipeline.

### Request Tracking

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `vllm_omni:num_requests_running` | Gauge | `model_name` | Requests currently running across all pipeline stages |
| `vllm_omni:num_requests_waiting` | Gauge | `model_name` | Requests waiting to be scheduled |
| `vllm_omni:num_requests_success` | Counter | `model_name` | Requests that completed without error |
| `vllm_omni:num_requests_fail` | Counter | `model_name` | Requests that returned an error |

### Latency

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `vllm_omni:e2e_request_latency_seconds` | Histogram | `model_name` | End-to-end request latency in seconds |
| `vllm_omni:request_queue_time_seconds` | Histogram | `model_name` | Time spent waiting in the request queue |

## Diffusion Engine Metrics (`vllm_omni:`)

These histograms are populated only when the pipeline includes a diffusion
stage (e.g. image or video generation models).

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `vllm_omni:diffusion_preprocess_time_ms` | Histogram | `model_name`, `engine` | Input preprocessing time per request |
| `vllm_omni:diffusion_exec_time_ms` | Histogram | `model_name`, `engine` | DiT forward pass execution time per request |
| `vllm_omni:diffusion_postprocess_time_ms` | Histogram | `model_name`, `engine` | Output postprocessing time (VAE decode) per request |
| `vllm_omni:diffusion_step_time_ms` | Histogram | `model_name`, `engine` | Total diffusion step time per request |

## vLLM Engine Metrics (`vllm:`)

When the pipeline includes an LLM stage, the upstream vLLM engine exposes its
full set of metrics under the `vllm:` prefix. These are registered by
`vllm.v1.metrics.loggers.PrometheusStatLogger` and cover scheduler state,
token throughput, cache utilization, and request latencies.

For a full overview of vLLM metrics, consult [the vLLM docs](https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md).

## Metric Availability by Pipeline Type

| Metric group | Multi-stage LLM (Qwen3-Omni) | Diffusion-only (Z-Image-Turbo) |
|---|---|---|
| `vllm_omni:` request tracking | Yes | Yes |
| `vllm_omni:` latency | Yes | Yes |
| `vllm:` KV cache | Yes | No |
| `vllm_omni:` diffusion timing | Only if pipeline has a diffusion stage | Yes |
| `vllm:` engine metrics | Yes | No |
| `vllm:` MFU metrics | With `--enable-mfu-metrics` | No |

## Naming Convention

vLLM-Omni pipeline metrics use the `vllm_omni:` prefix to distinguish
them from upstream per-engine `vllm:` metrics. The upstream
`unregister_vllm_metrics()` function is monkey-patched to a no-op (see
`vllm_omni/patch.py`) so that these metrics are not destroyed during
engine initialization.