Add Prometheus /metrics Support #3362
Open: vraiti wants to merge 9 commits into vllm-project:main from vraiti:feat_prometheus
Commits:
- d5a9286 Add Prometheus metrics for multi-stage pipelines (vraiti)
- 00f4d8e Add E2E test for Prometheus metrics under multi-replica config (vraiti)
- 9a1ef63 Fix dict-style .get() on OutputMessage dataclass attributes (vraiti)
- 1dd335b Fix per-engine stats dropping finished requests due to shared timer (vraiti)
- 8cb1cd8 Downgrade per-request metric logs to DEBUG (vraiti)
- 5d13fbc Delegate make_stats() to upstream Scheduler via super() (vraiti)
- 280c127 Rename OmniPrometheusMetrics to OmniPrometheusStatLogger, alias Prome… (vraiti)
- 3d1beed Exclude diffusion stages from engine_indexes to avoid stale zero-valu… (vraiti)
- 40ec127 Fix typo (vraiti)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,198 @@ | ||
| # Prometheus Metrics Design | ||
|
|
||
| This document describes how vLLM-Omni exposes Prometheus metrics for | ||
| multi-stage pipelines, the constraints that shaped the design, and how | ||
| the pipeline-level metrics coexist with upstream vLLM per-engine | ||
| metrics. | ||
|
|
||
| ## Objectives | ||
|
|
||
| - Expose pipeline-level request and latency metrics that span the full | ||
| multi-stage execution (orchestrator scope). | ||
| - Preserve all upstream vLLM per-engine metrics (`vllm:*`) for stages | ||
| backed by an AR LLM engine. | ||
| - Expose per-stage diffusion timing breakdowns for pipelines that | ||
| include a diffusion engine. | ||
| - Keep the metrics collection overhead low enough that it does not | ||
| regress TTFA or throughput. | ||
|
|
||
| ## Background | ||
|
|
||
| ### Upstream vLLM Metrics | ||
|
|
||
| Upstream vLLM defines 44 Prometheus metrics under the `vllm:` prefix. | ||
| These are registered by `PrometheusStatLogger` and cover engine-level | ||
| state: KV cache usage, running/waiting request counts, token | ||
| throughput, TTFT, inter-token latency, e2e latency, and so on. They | ||
| are served via the `/metrics` HTTP endpoint provided by | ||
| `prometheus_fastapi_instrumentator` and the default | ||
| `prometheus_client` WSGI handler. | ||
|
|
||
| vLLM's `unregister_vllm_metrics()` function strips every | ||
| `prometheus_client` collector whose `_name` attribute contains the | ||
| substring `"vllm"`. This runs during engine initialization to clean up | ||
| stale collectors from prior instantiations within the same process. | ||
|
|
||
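To make the effect of the substring match concrete, here is a minimal sketch that models a collector registry as a plain dictionary. `ToyRegistry` and `unregister_by_substring` are hypothetical stand-ins, not `prometheus_client` or vLLM APIs; only the matching rule (the collector name contains `"vllm"`) comes from the description above.

```python
class ToyRegistry:
    """Hypothetical stand-in for a collector registry (name -> collector)."""

    def __init__(self) -> None:
        self.collectors: dict[str, object] = {}

    def register(self, name: str) -> None:
        self.collectors[name] = object()


def unregister_by_substring(registry: ToyRegistry, needle: str = "vllm") -> list[str]:
    """Remove every collector whose name contains `needle`; return the removed names."""
    removed = [name for name in registry.collectors if needle in name]
    for name in removed:
        del registry.collectors[name]
    return removed


registry = ToyRegistry()
for name in (
    "vllm:num_requests_running",
    "vllm_omni:num_requests_running",
    "http_requests_total",
):
    registry.register(name)

# Both the upstream-style and the omni-style names contain "vllm",
# so a substring filter removes both of them.
removed = unregister_by_substring(registry)
remaining = sorted(registry.collectors)
```

Under these assumptions, a distinct prefix alone does not protect a collector from the cleanup, because the prefix still contains the matched substring.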
| ### The Problem | ||
|
|
||
| vLLM-Omni runs multiple engine instances (stages) within a single | ||
| process, coordinated by an Orchestrator. The pipeline needs its own | ||
| metrics — aggregate request counts, end-to-end latency across all | ||
| stages, and diffusion timing breakdowns — that do not exist in upstream | ||
| vLLM. All pipeline-level metrics use the `vllm_omni:` prefix to | ||
| distinguish them from upstream per-engine metrics. The | ||
| `unregister_vllm_metrics()` function is monkey-patched to a no-op at | ||
| import time (see `vllm_omni/patch.py`) so that these metrics are not | ||
| destroyed during engine initialization (this is a temporary fix until | ||
| vLLM patches this behavior). | ||
|
|
||
| Upstream per-engine metrics retain the `vllm:` prefix and are | ||
| registered by a `PrometheusStatLogger` instance that the Orchestrator | ||
| creates and feeds directly. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Component Overview | ||
|
|
||
| ``` | ||
| +-----------------------+ | ||
| | API Server (FastAPI)| | ||
| | GET /metrics | | ||
| +----------+------------+ | ||
| | | ||
| prometheus_client default handler | ||
| | | ||
| +-------------+-------------+ | ||
| | | | ||
| vllm_omni:* collectors vllm:* collectors | ||
| | | | ||
| +----------------------------+ +--------------------------+ | ||
| | OmniPrometheusStatLogger | | VllmPrometheusStatLogger | | ||
| +----------------------------+ +--------------------------+ | ||
| | | | ||
| OmniBase Orchestrator | ||
| (request lifecycle, (feeds SchedulerStats | ||
| diffusion timing) + IterationStats | ||
| per engine step) | ||
| ``` | ||
|
|
||
| ### Data Flow | ||
|
|
||
| There are two independent paths for metric collection. | ||
|
|
||
| **Path 1: Pipeline-level metrics (`vllm_omni:*`)** | ||
|
|
||
| `OmniPrometheusStatLogger` registers Gauge, Counter, and Histogram | ||
| collectors at init time. It is instantiated once per entrypoint, | ||
| labeled with the model name. The entrypoint calls its methods as | ||
| requests progress: | ||
|
|
||
| - `set_running(n)` / `set_waiting(n)` — updated after each request | ||
| completes. The running count comes from `OmniRequestCounter`, a | ||
| simple counter incremented/decremented by the Orchestrator as it | ||
| tracks requests. Waiting is derived as `total - running`. | ||
|
|
||
| - `request_succeeded(e2e_seconds, queue_seconds)` — recorded when a | ||
| request finishes at the final stage. | ||
|
|
||
| - `request_failed()` — recorded when a request errors. | ||
|
|
||
| - `observe_diffusion_metrics(stage_id, metrics)` — recorded when a | ||
| diffusion stage finishes. The metrics dict contains timing | ||
| breakdowns (preprocess, exec, postprocess, total step time) | ||
| accumulated from engine output. | ||
|
|
||
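The call sequence listed above can be sketched with an in-memory recorder. `RecordingStatLogger` is a hypothetical stand-in that only mirrors the method names from the list; the real `OmniPrometheusStatLogger` updates `prometheus_client` collectors, and its internals are not shown here. The metrics dict key and all numeric values are illustrative assumptions.

```python
from collections import defaultdict


class RecordingStatLogger:
    """Hypothetical in-memory stand-in mirroring the listed method names."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        self.running = 0
        self.waiting = 0
        self.succeeded = 0
        self.failed = 0
        self.e2e_seconds: list[float] = []
        self.queue_seconds: list[float] = []
        # stage_id -> metric key -> observed values
        self.diffusion: dict[int, dict[str, list[float]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def set_running(self, n: int) -> None:
        self.running = n

    def set_waiting(self, n: int) -> None:
        self.waiting = n

    def request_succeeded(self, e2e_seconds: float, queue_seconds: float) -> None:
        self.succeeded += 1
        self.e2e_seconds.append(e2e_seconds)
        self.queue_seconds.append(queue_seconds)

    def request_failed(self) -> None:
        self.failed += 1

    def observe_diffusion_metrics(self, stage_id: int, metrics: dict[str, float]) -> None:
        for key, value in metrics.items():
            self.diffusion[stage_id][key].append(value)


stats = RecordingStatLogger(model_name="example-model")
total, running = 5, 3  # illustrative request counts
stats.set_running(running)
stats.set_waiting(total - running)  # waiting is derived as total - running
stats.request_succeeded(e2e_seconds=1.25, queue_seconds=0.05)
stats.request_failed()
stats.observe_diffusion_metrics(stage_id=1, metrics={"exec_time_ms": 840.0})
```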
| **Path 2: Per-engine metrics (`vllm:*`)** | ||
|
|
||
| The Orchestrator instantiates upstream vLLM's `PrometheusStatLogger` | ||
| and feeds it scheduler stats and iteration stats after processing | ||
| each batch of engine outputs. This populates the standard vLLM | ||
| metrics (TTFT, token throughput, cache usage, etc.) using the same | ||
| code path as standalone vLLM. For diffusion-only pipelines that have | ||
| no AR engine, `SchedulerStats` is never produced and `vllm:*` metrics | ||
| are absent. | ||
|
|
||
| ### Shared State Between Threads | ||
|
|
||
| The Orchestrator runs in a background thread. The API server | ||
| (OmniBase) runs in the asyncio event loop thread. | ||
| `OmniRequestCounter` bridges them — a plain Python object with an | ||
| `int` field. The Orchestrator increments/decrements it; the | ||
| entrypoint reads it for gauge updates. No lock is needed because the | ||
| counter is advisory: a read that is stale by one Prometheus scrape | ||
| interval is acceptable. It is created by | ||
| `AsyncOmniEngine.__init__()` and passed to the Orchestrator at | ||
| construction time. | ||
|
|
||
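A minimal sketch of the single-writer, lock-free counter described above, using only the standard library. `RequestCounter` and `orchestrator_loop` are hypothetical names standing in for `OmniRequestCounter` and the Orchestrator thread; the request counts are illustrative.

```python
import threading


class RequestCounter:
    """Hypothetical stand-in for OmniRequestCounter: a plain object with an int field."""

    def __init__(self) -> None:
        self.running = 0


def orchestrator_loop(counter: RequestCounter, started: int, finished: int) -> None:
    """Single writer: the only thread that mutates the counter."""
    for _ in range(started):
        counter.running += 1
    for _ in range(finished):
        counter.running -= 1


counter = RequestCounter()
writer = threading.Thread(target=orchestrator_loop, args=(counter, 100, 40))
writer.start()

# Reader side (the entrypoint): an unlocked read may lag behind the writer,
# which is acceptable for an advisory gauge value.
snapshot = counter.running

writer.join()
final_value = counter.running
```

The sketch relies on there being exactly one writer thread. `counter.running += 1` is a read-modify-write, so a design with several writer threads would likely need a lock to avoid lost updates.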
| ### Metric Registration and Lifecycle | ||
|
|
||
| All `vllm_omni:*` collectors are registered once when | ||
| `OmniPrometheusStatLogger.__init__()` runs. Per-stage labels | ||
| (`model_name`, `engine`) are bound lazily on first observation to | ||
| avoid registering labels for stages that never produce data (e.g., a | ||
| diffusion pipeline has no AR stage stats). | ||
|
|
||
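Lazy label binding can be illustrated with a small stand-in that creates a labeled series only on first observation. `LazyLabeledHistogram` is a hypothetical class written for this sketch, not the `prometheus_client` API, and the label values are illustrative.

```python
class LazyLabeledHistogram:
    """Hypothetical stand-in: a labeled series is created on first observation."""

    def __init__(self, name: str, label_names: tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self._children: dict[tuple[str, ...], list[float]] = {}

    def observe(self, value: float, **labels: str) -> None:
        # Build the label key in declaration order; create the series if missing.
        key = tuple(labels[name] for name in self.label_names)
        self._children.setdefault(key, []).append(value)

    def series(self) -> list[tuple[str, ...]]:
        """Label combinations that would appear in a scrape."""
        return sorted(self._children)


hist = LazyLabeledHistogram(
    "vllm_omni:diffusion_exec_time_ms", ("model_name", "engine")
)
before = hist.series()  # nothing is exported before the first observation
hist.observe(12.5, model_name="example-model", engine="1")
after = hist.series()
```

With this approach a stage that never reports data contributes no series to the scrape output, rather than a series that stays at zero.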
| The `prometheus_client` default registry holds all collectors. | ||
| FastAPI's `/metrics` endpoint serves the default registry, so both | ||
| `vllm_omni:*` and `vllm:*` metrics appear in the same scrape | ||
| response alongside `http_*` and `process_*` metrics from the | ||
| instrumentator and the Python client runtime. | ||
|
|
||
| ## Throttling: `make_stats()` Override | ||
|
|
||
| Upstream vLLM's `Scheduler.make_stats()` runs on every AR generation step | ||
| and returns a `SchedulerStats` object for the orchestrator. In upstream | ||
| vLLM's architecture this is fine. In vLLM-Omni, however, the object must be | ||
| serialized and transferred over ZMQ, so emitting a `SchedulerStats` object on | ||
| every step can add unacceptable overhead. | ||
|
|
||
| `OmniSchedulerMixin.make_stats()` (in | ||
| `vllm_omni/core/sched/omni_scheduler_mixin.py`) throttles stats | ||
| emission to at most once per second. Between intervals it returns | ||
| `None`, which the engine core skips serializing. This keeps gauges | ||
| fresh enough for Prometheus scrapes (typically 15-30s intervals) while | ||
| eliminating the per-step overhead. | ||
|
|
||
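The throttling behaviour described above can be sketched as a mixin that delegates to the parent `make_stats()` at most once per interval and returns `None` otherwise. `BaseScheduler`, `ThrottledStatsMixin`, the `clock` hook, and the stats payload are hypothetical and exist only for this sketch; the real `OmniSchedulerMixin` may be structured differently.

```python
import time
from typing import Callable, Optional


class BaseScheduler:
    """Hypothetical stand-in for the upstream scheduler."""

    def make_stats(self) -> dict:
        return {"num_running_reqs": 1}  # illustrative payload


class ThrottledStatsMixin:
    """Sketch of interval-based throttling layered over make_stats()."""

    stats_interval_s: float = 1.0
    clock: Callable[[], float] = staticmethod(time.monotonic)
    _last_emit: Optional[float] = None  # set per instance on first emission

    def make_stats(self) -> Optional[dict]:
        now = self.clock()
        if self._last_emit is not None and now - self._last_emit < self.stats_interval_s:
            return None  # the caller skips serialization for this step
        self._last_emit = now
        return super().make_stats()


class ThrottledScheduler(ThrottledStatsMixin, BaseScheduler):
    pass


fake_now = [100.0]
sched = ThrottledScheduler()
sched.clock = lambda: fake_now[0]  # injected clock keeps the demo deterministic

first = sched.make_stats()   # emitted: no previous emission
fake_now[0] += 0.2
second = sched.make_stats()  # throttled: only 0.2 s since the last emission
fake_now[0] += 1.0
third = sched.make_stats()   # emitted again: 1.2 s since the last emission
```

Storing the last-emission timestamp on the instance keeps each scheduler's interval independent of the others.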
| ## Metric Definitions | ||
|
|
||
| ### Pipeline-Level | ||
|
|
||
| | Metric | Type | Labels | Description | | ||
| |--------|------|--------|-------------| | ||
| | `vllm_omni:num_requests_running` | Gauge | `model_name` | Requests currently executing across all stages | | ||
| | `vllm_omni:num_requests_waiting` | Gauge | `model_name` | Requests queued but not yet scheduled | | ||
| | `vllm_omni:num_requests_success` | Counter | `model_name` | Requests completed without error | | ||
| | `vllm_omni:num_requests_fail` | Counter | `model_name` | Requests that returned an error | | ||
| | `vllm_omni:e2e_request_latency_seconds` | Histogram | `model_name` | End-to-end request latency across all stages | | ||
| | `vllm_omni:request_queue_time_seconds` | Histogram | `model_name` | Time spent waiting in the request queue | | ||
|
|
||
| ### Diffusion Stage-Level | ||
|
|
||
| | Metric | Type | Labels | Description | | ||
| |--------|------|--------|-------------| | ||
| | `vllm_omni:diffusion_preprocess_time_ms` | Histogram | `model_name`, `engine` | Diffusion input preprocessing time | | ||
| | `vllm_omni:diffusion_exec_time_ms` | Histogram | `model_name`, `engine` | Diffusion model forward pass time | | ||
| | `vllm_omni:diffusion_postprocess_time_ms` | Histogram | `model_name`, `engine` | Diffusion output postprocessing time | | ||
| | `vllm_omni:diffusion_step_time_ms` | Histogram | `model_name`, `engine` | Total diffusion step time | | ||
|
|
||
| ### LLM Stage-Level | ||
|
|
||
| See the [vLLM docs](https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md). | ||
|
|
||
| Note that metrics that depend on features not supported in vLLM-Omni (e.g. speculative decoding, LoRA) are also unavailable. | ||
|
|
||
| ## Logging vs. Prometheus | ||
|
|
||
| `OrchestratorAggregator` (in `vllm_omni/metrics/stats.py`) is the | ||
| logging-oriented metrics path. It collects detailed per-request, | ||
| per-stage, and per-transfer statistics and prints formatted tables to | ||
| the `INFO` log. This is designed for development and debugging — | ||
| individual request traces, transfer bandwidth, inter-stage timing. | ||
|
|
||
| `OmniPrometheusStatLogger` is the Prometheus-oriented path. It records | ||
| aggregate counters, gauges, and histograms suitable for time-series | ||
| monitoring and alerting. The two paths are independent; both can run | ||
| simultaneously. | ||
|
|
||
| The separation follows upstream vLLM's pattern of `LoggingStatLogger` | ||
| vs. `PrometheusStatLogger` — same underlying data, different | ||
| consumption models. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| # Production Metrics | ||
|
|
||
| vLLM-Omni exposes Prometheus metrics via the `/metrics` endpoint on the | ||
| OpenAI-compatible API server. The metrics fall into three categories depending | ||
| on the pipeline type. | ||
|
|
||
| ```bash | ||
| vllm-omni serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000 | ||
| curl http://localhost:8000/metrics | ||
| ``` | ||
|
|
||
| ## Metric Namespaces | ||
|
|
||
| | Prefix | Source | Present when | | ||
| |--------|--------|--------------| | ||
| | `vllm_omni:` | vLLM-Omni orchestrator / diffusion stages | Always (request tracking and latency); diffusion timing only when the pipeline includes a diffusion stage | | ||
| | `vllm:` | Upstream vLLM engine | Pipeline includes an LLM (AR) stage | | ||
| | `http_` / `process_` | Uvicorn / Python runtime | Always | | ||
|
|
||
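As a rough way to check which namespaces a deployment actually exposes, the scrape output can be bucketed by prefix. The sketch below parses a small hand-written sample in the Prometheus text exposition format; the sample lines and values are made up for illustration and are not output from a real server, and `group_by_namespace` is a hypothetical helper.

```python
# Made-up sample in the Prometheus text exposition format (not real server output).
SAMPLE_SCRAPE = """\
# HELP vllm_omni:num_requests_running illustrative help text
# TYPE vllm_omni:num_requests_running gauge
vllm_omni:num_requests_running{model_name="example-model"} 2.0
vllm:num_requests_running{model_name="example-model"} 1.0
process_cpu_seconds_total 3.5
"""

PREFIXES = ("vllm_omni:", "vllm:", "http_", "process_")


def group_by_namespace(text: str) -> dict[str, list[str]]:
    """Bucket metric names from a scrape by the prefixes in the table above."""
    groups: dict[str, list[str]] = {prefix: [] for prefix in PREFIXES}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
    return groups


groups = group_by_namespace(SAMPLE_SCRAPE)
```

In practice the text passed to `group_by_namespace` would be the body returned by `GET /metrics`.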
| ## Pipeline-Level Metrics (`vllm_omni:`) | ||
|
|
||
| These metrics are defined in `vllm_omni/metrics/prometheus.py` and track | ||
| request lifecycle across the full multi-stage pipeline. | ||
|
|
||
| ### Request Tracking | ||
|
|
||
| | Metric | Type | Labels | Description | | ||
| |--------|------|--------|-------------| | ||
| | `vllm_omni:num_requests_running` | Gauge | `model_name` | Requests currently running across all pipeline stages | | ||
| | `vllm_omni:num_requests_waiting` | Gauge | `model_name` | Requests waiting to be scheduled | | ||
| | `vllm_omni:num_requests_success` | Counter | `model_name` | Requests that completed without error | | ||
| | `vllm_omni:num_requests_fail` | Counter | `model_name` | Requests that returned an error | | ||
|
|
||
| ### Latency | ||
|
|
||
| | Metric | Type | Labels | Description | | ||
| |--------|------|--------|-------------| | ||
| | `vllm_omni:e2e_request_latency_seconds` | Histogram | `model_name` | End-to-end request latency in seconds | | ||
| | `vllm_omni:request_queue_time_seconds` | Histogram | `model_name` | Time spent waiting in the request queue | | ||
|
|
||
| ## Diffusion Engine Metrics (`vllm_omni:`) | ||
|
|
||
| These histograms are populated only when the pipeline includes a diffusion | ||
| stage (e.g. image or video generation models). | ||
|
|
||
| | Metric | Type | Labels | Description | | ||
| |--------|------|--------|-------------| | ||
| | `vllm_omni:diffusion_preprocess_time_ms` | Histogram | `model_name`, `engine` | Input preprocessing time per request | | ||
| | `vllm_omni:diffusion_exec_time_ms` | Histogram | `model_name`, `engine` | DiT forward pass execution time per request | | ||
| | `vllm_omni:diffusion_postprocess_time_ms` | Histogram | `model_name`, `engine` | Output postprocessing time (VAE decode) per request | | ||
| | `vllm_omni:diffusion_step_time_ms` | Histogram | `model_name`, `engine` | Total diffusion step time per request | | ||
|
|
||
| ## vLLM Engine Metrics (`vllm:`) | ||
|
|
||
| When the pipeline includes an LLM stage, the upstream vLLM engine exposes its | ||
| full set of metrics under the `vllm:` prefix. These are registered by | ||
| `vllm.v1.metrics.loggers.PrometheusStatLogger` and cover scheduler state, | ||
| token throughput, cache utilization, and request latencies. | ||
|
|
||
| For a full overview of vLLM metrics, consult [the vLLM docs](https://github.com/vllm-project/vllm/blob/main/docs/usage/metrics.md). | ||
|
|
||
| ## Metric Availability by Pipeline Type | ||
|
|
||
| | Metric group | Multi-stage LLM (Qwen3-Omni) | Diffusion-only (Z-Image-Turbo) | | ||
| |---|---|---| | ||
| | `vllm_omni:` request tracking | Yes | Yes | | ||
| | `vllm_omni:` latency | Yes | Yes | | ||
| | `vllm:` KV cache | Yes | No | | ||
| | `vllm_omni:` diffusion timing | Only if pipeline has a diffusion stage | Yes | | ||
| | `vllm:` engine metrics | Yes | No | | ||
| | `vllm:` MFU metrics | With `--enable-mfu-metrics` | No | | ||
|
|
||
| ## Naming Convention | ||
|
|
||
| vLLM-Omni pipeline metrics use the `vllm_omni:` prefix to distinguish | ||
| them from upstream per-engine `vllm:` metrics. The upstream | ||
| `unregister_vllm_metrics()` function is monkey-patched to a no-op (see | ||
| `vllm_omni/patch.py`) so that these metrics are not destroyed during | ||
| engine initialization. |
I don’t think vllm_omni: is the right long-term design here. The orchestrator already uses the correct pattern for upstream metrics: one PrometheusStatLogger with per-engine labels, not one logger per stage. The problem is the separate OmniPrometheusMetrics path in OmniBase, which registers fresh collectors per instance.
That separate registration path is what forces the custom prefix, because upstream unregister_vllm_metrics() removes any collector whose name contains "vllm". So a simple rename to vllm:* would still be fragile.
I’d suggest folding these omni metrics into the single orchestrator-owned/global Prometheus logger instead of registering a second collector set in OmniBase. Then they can live in the same namespace with names like vllm:omni_*, while stage/model separation stays in labels. That would avoid both duplicate registration and the prefix workaround.
It's not exactly one logger per stage. It's one logger for vLLM stages (PrometheusStatLogger) and one logger for diffuser stages/the pipeline (OmniPrometheusMetrics)
The separate OmniPrometheusMetrics class inside OmniBase is necessary because some vLLM metrics require per-iteration updates, while e2e pipeline statistics require the OmniBase/entrypoint context. For example:
- num_requests_waiting: Requires reading the length of the OmniBase.request_states
- num_requests_fail: Requires tracking in OmniBase since malformed HTTP requests can fail before they reach the orchestrator

Meanwhile, any metrics that depend upon vLLM's IterationStats logically should be collected at the orchestrator level, because that's where the per-iteration control occurs.
Additionally, I chose to keep the vLLM-Omni metrics separate from the vLLM metrics because PrometheusStatLogger is imported directly from vLLM. The reason vLLM metrics work seamlessly in this PR is that vLLM metrics infrastructure is (mostly) preserved, and trying to merge it with the vLLM-Omni-specific logic is likely to break that support or at least make it more brittle.
Folding OmniPrometheusMetrics into this class (i.e. subclassing PrometheusStatLogger in OmniPrometheusMetrics) would still require at least the following:
Thanks, this makes sense.
I still don’t think `vllm_omni:` is the right design here. It feels like a workaround for `unregister_vllm_metrics()` being too broad, rather than the right metric namespace.

I’d rather keep the metric names aligned with the vLLM namespace and add a small hack/fix in `unregister_vllm_metrics()` so it avoids unnecessary unregistration, instead of encoding that workaround into a new long-term prefix.
OK, I've submitted a vLLM PR to fix this: vllm-project/vllm#42331. This won't get merged until at least the next minor version, so I've added a patch that changes `unregister_vllm_metrics` to a no-op temporarily. Either way, vLLM-Omni metrics now use the prefix `vllm:omni_`.