
Add Prometheus /metrics Support #3362

Open
vraiti wants to merge 9 commits into vllm-project:main from vraiti:feat_prometheus

Conversation

@vraiti
Contributor

@vraiti vraiti commented May 5, 2026

Purpose

Addresses #3228

Add Prometheus metrics for Omni multi-stage pipelines. Pipeline-level metrics (vllm_omni:*) track request counts, e2e latency, queue time, and per-stage diffusion timing breakdowns. Upstream vLLM per-engine metrics (vllm:*) are preserved for AR stages via PrometheusStatLogger.

To distinguish metrics generated by distinct engines, each metric has four labels:

  • model_name: The name of the model the engine is running
  • engine: A unique integer identifying that engine
  • stage_id: The index of the pipeline stage this engine is serving (can be non-unique within a replicated stage)
  • replica_id: A base-0 index unique to each replica within a stage
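For illustration, a single scraped sample carrying all four labels might look like this (label values hypothetical):

vllm_omni:num_requests_running{model_name="Tongyi-MAI/Z-Image-Turbo",engine="0",stage_id="0",replica_id="0"} 1.0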

Workarounds

stage_id and replica_id can't currently be supported for LLM stages, since the upstream PrometheusStatLogger does not accept custom metric labels. This is addressed in vllm-project/vllm#42743. Once that PR is merged, only the following will be required to provide LLM metrics with stage_id and replica_id: vraiti@fa7d3a9

Upstream unregister_vllm_metrics() (called inside PrometheusStatLogger.__init__) strips any collector whose name contains "vllm". A vLLM PR is open to make this more precise (vllm-project/vllm#42331); until it lands, this PR adds a patch that turns unregister_vllm_metrics into a no-op. The function exists only for CI purposes, so the patch does not affect production use.

This PR also includes a make_stats() throttle (implemented in OmniSchedulerMixin) that limits SchedulerStats collection to once per second, eliminating a performance regression in LLM stages caused by serialization overhead.
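A minimal sketch of the throttling idea (illustrative only: the base class, attribute, and constant names below are invented, and the real mixin wraps vLLM's scheduler):

import time

class _SchedulerBase:
    def make_stats(self):
        return {"num_running": 0}  # stand-in for vLLM's SchedulerStats dataclass

class OmniSchedulerMixin(_SchedulerBase):
    _STATS_MIN_INTERVAL_S = 1.0  # collect SchedulerStats at most once per second

    def make_stats(self):
        now = time.monotonic()
        if now - getattr(self, "_last_stats_ts", 0.0) < self._STATS_MIN_INTERVAL_S:
            return None  # skip this cycle; avoids per-iteration serialization cost
        self._last_stats_ts = now
        return super().make_stats()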

Metrics added

Metric                                      Type
vllm_omni:num_requests_running              Gauge
vllm_omni:num_requests_waiting              Gauge
vllm_omni:num_requests_success              Counter
vllm_omni:num_requests_fail                 Counter
vllm_omni:e2e_request_latency_seconds       Histogram
vllm_omni:request_queue_time_seconds        Histogram
vllm_omni:diffusion_preprocess_time_ms      Histogram
vllm_omni:diffusion_exec_time_ms            Histogram
vllm_omni:diffusion_postprocess_time_ms     Histogram
vllm_omni:diffusion_step_time_ms            Histogram
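As a usage sketch, the e2e latency histogram could drive a per-stage p95 Grafana panel with a PromQL query along these lines:

histogram_quantile(0.95, sum by (le, stage_id) (rate(vllm_omni:e2e_request_latency_seconds_bucket[5m])))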

Test Plan

1. vLLM regression (Qwen3-Omni, 4x A100-40GB)

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --omni --deploy-config vllm_omni/deploy/qwen3_omni_tp2.yaml --port 8000

Run text-only chat completions: 1 warmup request, 3 benchmark requests. Verify vllm:* metrics are present on /metrics.
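A quick way to spot-check this step (sketch, using the port from the command above):

curl -s localhost:8000/metrics | grep '^vllm:'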

2. Diffusion regression (Z-Image-Turbo, 1x A100-40GB)

vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000

Run 512x512 image generation at 8 diffusion steps: 1 warmup request, 5 benchmark requests. Verify vllm_omni:* metrics are present and diffusion timing histograms are populated.

3. vLLM metrics (Qwen3-Omni)

Scrape /metrics from the Qwen3-Omni deployment and verify all expected vllm:* metric families are present (scheduler state, cache usage, token throughput, latencies).

4. Diffusion metrics (Z-Image-Turbo)

Scrape /metrics from the Z-Image-Turbo deployment and verify vllm_omni:* pipeline metrics are present, diffusion timing histograms have observations, and no vllm:* metrics appear.

5. engine_id and stage_id metadata

Deploy diffusion and LLM pipelines with multiple replicas and observe correct stage_id and replica_id label values on the exposed metrics.

Test Result

1. vLLM regression (Qwen3-Omni)

                Baseline    feat_prometheus
Mean latency    3.08s       3.07s
Text tok/s      83.4        83.1

No regression observed.

2. Diffusion regression (Z-Image-Turbo)

                Baseline    feat_prometheus
Mean latency    1.31s       1.31s

No regression observed.

3. vLLM metrics (Qwen3-Omni)

37 vllm:* metric families present. Full output: metrics_prometheus_qwen3.txt

4. Diffusion metrics (Z-Image-Turbo)

10 vllm_omni:* metric families present. Diffusion timing histograms populated. No vllm:* metrics (expected: no AR engine). Full output: metrics_prometheus_z-image.txt

5. engine_id and stage_id metadata

Diffusion (Z-Image-Turbo):
metrics_with_replicas_diffusion.txt

LLM (Qwen3-Omni):



@vraiti vraiti requested a review from hsliuustc0106 as a code owner May 5, 2026 18:37

@vraiti vraiti force-pushed the feat_prometheus branch 4 times, most recently from 8eaeb34 to ca092e5 on May 5, 2026 18:49
@hsliuustc0106
Collaborator

Good addition of Prometheus /metrics support for observability. The new metrics for diffusion (preprocess, exec, postprocess, total time) will help with performance monitoring. Suggestions: 1) Consider adding metrics for request queue length/wait time. 2) Document which endpoints expose /metrics and how to configure Prometheus scraping. 3) Add tests to verify metrics are recorded correctly.

@hsliuustc0106 hsliuustc0106 requested a review from gcanlin May 6, 2026 12:29
@hsliuustc0106
Collaborator

cc @bjf-frz @wuhang2014 PTAL

@vraiti
Contributor Author

vraiti commented May 6, 2026

Good addition of Prometheus /metrics support for observability. The new metrics for diffusion (preprocess, exec, postprocess, total time) will help with performance monitoring. Suggestions: 1) Consider adding metrics for request queue length/wait time. 2) Document which endpoints expose /metrics and how to configure Prometheus scraping.

  1. v_llm_omni:num_requests_waiting is a queue depth gauge and v_llm_omni:request_queue_time_seconds is a queue time histogram.

  2. /metrics itself is exposed unconditionally when serving online; docs/usage/metrics.md documents when each metric is emitted (this depends only on whether the pipeline includes the relevant stage engine type).

Is there something not covered here you'd like to see?

  3. Add tests to verify metrics are recorded correctly.

I will add unit tests.

@vraiti vraiti force-pushed the feat_prometheus branch 3 times, most recently from 6fc5de6 to c0e86e9 on May 6, 2026 20:37
Inline comment threads: vllm_omni/metrics/prometheus.py, vllm_omni/entrypoints/omni_base.py (outdated), docs/design/index.md
Contributor

@wuhang2014 wuhang2014 left a comment


Thanks for the metrics work here — the overall direction looks good, but I found two blocking issues and one doc regression:

  1. OmniPrometheusMetrics registers collectors on every instance creation, which is likely to break when multiple Omni/OmniBase instances are created in the same Python process due to duplicate Prometheus timeseries registration (see the sketch below).
  2. request_queue_time_seconds is defined, documented, and unit-tested synthetically, but the production path never passes queue_seconds into request_succeeded(), so the metric will never accumulate real observations.
  3. docs/design/index.md appears to have dropped links to two existing feature design docs.

I left inline comments with details and suggested fixes.
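For reference, issue 1 is reproducible with plain prometheus_client (minimal sketch; metric name from this PR, label set abbreviated):

from prometheus_client import Gauge

# First instantiation registers the family in the process-global REGISTRY.
Gauge("vllm_omni:num_requests_running", "demo gauge", ["stage_id"])

# A second instance in the same process fails at construction time with:
# ValueError: Duplicated timeseries in CollectorRegistry: ...
Gauge("vllm_omni:num_requests_running", "demo gauge", ["stage_id"])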

Comment thread on docs/design/metrics.md, lines +38 to +47:
vLLM-Omni runs multiple engine instances (stages) within a single
process, coordinated by an Orchestrator. The pipeline needs its own
metrics — aggregate request counts, end-to-end latency across all
stages, and diffusion timing breakdowns — that do not exist in upstream
vLLM. These metrics must survive the `unregister_vllm_metrics()`
cleanup pass. So, all pipeline-level metrics use the prefix `v_llm_omni:`.

Upstream per-engine metrics retain the `vllm:` prefix and are
registered by a `PrometheusStatLogger` instance that the Orchestrator
creates and feeds directly.
Contributor

@wuhang2014 wuhang2014 May 6, 2026


I don’t think v_llm_omni: is the right long-term design here. The orchestrator already uses the correct pattern for upstream metrics: one PrometheusStatLogger with per-engine labels, not one logger per stage. The problem is the separate OmniPrometheusMetrics path in OmniBase, which registers fresh collectors per instance.

That separate registration path is what forces the custom prefix, because upstream unregister_vllm_metrics() removes any collector whose name contains "vllm". So a simple rename to vllm:* would still be fragile.

I’d suggest folding these omni metrics into the single orchestrator-owned/global Prometheus logger instead of registering a second collector set in OmniBase. Then they can live in the same namespace with names like vllm:omni_*, while stage/model separation stays in labels. That would avoid both duplicate registration and the prefix workaround.
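A minimal sketch of that suggestion, assuming one process-wide collector per family with separation purely in labels (names illustrative):

from prometheus_client import Counter

# One collector for all stages; stage/model separation lives in labels.
requests_success = Counter(
    "vllm:omni_num_requests_success",
    "Successful pipeline requests",
    ["model_name", "engine", "stage_id", "replica_id"],
)

# Each stage replica increments its own labelled child.
requests_success.labels(
    model_name="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    engine="0", stage_id="1", replica_id="0",
).inc()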

Contributor Author


It's not exactly one logger per stage. It's one logger for vLLM stages (PrometheusStatLogger) and one logger for diffuser stages/the pipeline (OmniPrometheusMetrics).

The separate OmniPrometheusMetrics class inside OmniBase is necessary because some vLLM metrics require per-iteration updates, while e2e pipeline statistics require the OmniBase/entrypoint context. For example:

  • num_requests_waiting: Requires reading the length of the OmniBase.request_states
  • num_requests_fail: Requires tracking in OmniBase since malformed HTTP requests can fail before they reach the orchestrator
    Meanwhile, any metrics that depend upon vLLM's IterationStats logically should be collected at the orchestrator level, because that's where the per-iteration control occurs.

Additionally, I chose to keep the vLLM-Omni metrics separate from the vLLM metrics because PrometheusStatLogger is imported directly from vLLM. The reason vLLM metrics work seamlessly in this PR is that vLLM metrics infrastructure is (mostly) preserved, and trying to merge it with the vLLM-Omni-specific logic is likely to break that support or at least make it more brittle.

Folding OmniPrometheusMetrics into this class (i.e. subclassing PrometheusStatLogger in OmniPrometheusMetrics) would still require at least the following:

  • Collecting all generated vLLM IterationStats into a list within the orchestrator and passing this up to OmniBase
  • Moving PrometheusStatLogger.record() to OmniBase, where you would have to call it once per collected IterationStats
  • Creating a new method to record pipeline/diffusion statistics since PrometheusStatLogger.record() takes the vLLM-specific dataclasses IterationStats and SchedulerStats

Contributor


Thanks, this makes sense.

I still don’t think v_llm_omni: is the right design here. It feels like a workaround for unregister_vllm_metrics() being too broad, rather than the right metric namespace.

I’d rather keep the metric names aligned with the vLLM namespace and add a small hack/fix in unregister_vllm_metrics() so it avoids unnecessary unregister, instead of encoding that workaround into a new long-term prefix.

Contributor Author


Ok, I've submitted a vLLM PR to fix this: vllm-project/vllm#42331. It won't be merged until at least the next minor version, so I've added a patch that temporarily changes unregister_vllm_metrics to a no-op. Either way, vLLM-Omni metrics now use the prefix vllm:omni_.

@amy-why-3459
Contributor

Thank you very much for your contribution. Would you be interested in supporting performance data statistics for different stages in the benchmark? #1361

LHXuuu added a commit to LHXuuu/vllm-omni that referenced this pull request May 13, 2026
PR vllm-project#3362 introduced docs/usage/metrics.md and docs/design/metrics.md
covering its own pipeline-level + diffusion families. The follow-up
work in this branch (G1 audio, G2 image/video, G3 cross-stage transfer,
G6 success/fail merge into requests_success_total{finished_reason},
G7 OmniPrometheusStatLogger per-replica wrap) wasn't reflected. Refresh
both docs to the final state.

usage/metrics.md (+117 / -38):

- Request Tracking table swaps num_requests_success / num_requests_fail
  for the merged requests_success_total{finished_reason} family.
- New sections for Modality (audio / image / video) and Cross-Stage
  Transfer with the per-modality / per-edge metric tables.
- vLLM Engine Metrics section gains a before/after example showing how
  the G7 wrap reshapes engine -> stage + replica, with a note that the
  ~37 upstream families gain per-replica visibility automatically.
- Diffusion Engine Metrics section clarifies that omni-side diffusion
  families bypass the wrap (engine label = stage_id, not relabelled).
- Pipeline Type availability matrix gains modality / transfer /
  per-replica rows.
- Naming Convention section explains the RFC "co-position, different
  name" three-modality TTFP convention.

design/metrics.md (+260 / -74):

- Objectives section calls out the per-replica + modality + transfer
  goals introduced after PR vllm-project#3362.
- Architecture diagram replaces the original two-component picture with
  the four-component one (OmniPrometheusMetrics + OmniModalityMetrics +
  OmniTransferMetrics + OmniPrometheusStatLogger).
- Data Flow grows from two paths to four: Pipeline-level, Modality
  (finalize hook + audio_ttfp streaming hook), Cross-stage transfer
  (sticky-routing replica_resolver), and the wrapped per-engine path.
  The transfer subsection notes the TX-side hook is wired but currently
  inactive pending try_send_via_connector being called from the main
  code path.
- New "OmniPrometheusStatLogger Wrap (G7)" section explains the four
  upstream impedance points and the three coordinated mechanisms
  (class-level metric class slots, per_engine_labelvalues property
  descriptor, _RelabelMixin.labels() override) used to handle them.
  Also documents the stage_replica_map construction and the deferred
  dynamic add/remove decision.
- Metric Definitions table refreshed end-to-end: pipeline-level row
  now lists requests_success_total{finished_reason}; new modality and
  transfer subtables added; transfer subtable carries a note that
  model_name was added on top of the RFC-listed four
  stage/replica labels for cross-omni consistency.
- Logging vs. Prometheus section expanded to mention OmniModality /
  OmniTransfer alongside OmniPrometheusMetrics, and notes that
  OrchestratorAggregator.record_transfer_tx/rx fans out to both the
  log accumulator and the Prometheus emit hook in one method body.
vraiti and others added 7 commits May 13, 2026 14:47
@vraiti vraiti force-pushed the feat_prometheus branch from afdb8c5 to 3a930b8 on May 13, 2026 18:48
@vraiti
Contributor Author

vraiti commented May 13, 2026

@wuhang2014 changed metrics prefix to vllm_omni: and vllm#42331 should be merged soon enough. Anything else?

@vraiti
Contributor Author

vraiti commented May 13, 2026

Thank you very much for your contribution. Would you be interested in supporting performance data statistics for different stages in the benchmark? #1361

Will work on #1361 once this PR is merged, since this PR already implements the mechanisms for per-stage statistics.

@amy-why-3459
Contributor

Will metrics be printed in the logs?

@wuhang2014
Contributor

@wuhang2014 changed metrics prefix to vllm_omni: and vllm#42331 should be merged soon enough. Anything else?

Looks good overall. One remaining issue: since PR #2396, a stage can have multiple replicas. Without a replica_id label, metrics from replicas of the same stage will be merged. Please add replica_id so they can be split per replica.

@vraiti
Contributor Author

vraiti commented May 14, 2026

Will metrics be printed in the logs?

No, that's the whole point of Prometheus metrics: instead of writing performance statistics in the logs (which are generally reserved for system diagnostics), you make them accessible via a separate channel (the /metrics endpoint).

You then spin up a Prometheus server that runs alongside the main server and periodically scrapes /metrics for aggregation. When you want to observe these metrics, you connect Grafana to the Prometheus server; Grafana lets you use the Prometheus Query Language (PromQL) to build graphs from the aggregated metrics.

You can imagine how this provides a much cleaner developer experience than trying to parse server logs and perform your own aggregations and visualizations whenever you want to evaluate server performance.
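As an illustrative example (not from this PR), a minimal Prometheus scrape config pointing at a local deployment could look like:

scrape_configs:
  - job_name: vllm-omni
    static_configs:
      - targets: ["localhost:8000"]  # vLLM-Omni server exposing /metrics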

@vraiti
Contributor Author

vraiti commented May 15, 2026

Looks good overall. One remaining issue: since PR #2396, a stage can have multiple replicas. Without a replica_id label, metrics from replicas of the same stage will be merged. Please add replica_id so they can be split per replica.

@wuhang2014 done: 6aab25e
