[WIP][Metrics] Extend Prometheus with multi-modal SLOs, cross-stage transfer, and per-replica labels#3576
LHXuuu wants to merge 13 commits into
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: vraiti <vraiti@redhat.com>
Introduce vllm_omni/metrics/definitions.py as the single source of truth for metric family names, label sets, histogram buckets, and RTF formulas. Server-side prometheus.py and bench-side MultiModalsBenchmarkMetrics now consume the same constants, eliminating the dual-track naming drift between /metrics output and bench CLI reports. Pre-work for the upcoming RFC extensions: per-modality audio/image/video families (G1/G2), the cross-stage transfer family (G3), and the OmniPrometheusStatLogger wrap that relabels engine into stage+replica (G7). All refactors are name-only; PR vllm-project#3362's existing 10 prometheus tests pass unchanged.
Phase 2.1 of multi-replica observability (RFC §3.2.7): introduce three wrapper metric classes (_RelabelGauge / _RelabelCounter / _RelabelHistogram) that intercept labelnames at family creation and translate .labels(engine=idx, ...) kwargs into stage/replica via a process-level engine→(stage, replica) map. These are the ingredients for OmniPrometheusStatLogger (Phase 2.2), which will populate the engine map and slot the wrappers into upstream's _gauge_cls / _counter_cls / _histogram_cls hooks. Shipped standalone with 18 unit tests so the relabel logic can be vetted before the StatLogger subclass lands.
…l mixin
Phase 2.1 only intercepted the .labels(engine=int_idx, ...) kwarg form. Upstream PrometheusStatLogger has four .labels() call sites for engine-bearing families:
- loggers.py:510, kwarg form, int engine (gauge_engine_sleep_state)
- loggers.py:646, positional, str engine (counter_prompt_tokens_by_source)
- loggers.py:679, positional, str engine (counter_request_success_base)
- loggers.py:1056, kwarg form, str engine (info_gauge via metrics_info)
Track the original engine label index at family creation so positional .labels() can splice (stage, replica) at the right offset, and accept both int and str engine values in either form. Replaces the narrow _rewrite_label_kwargs helper with _engine_to_stage_replica, which is shared by both code paths. Adds tests for positional engine at a middle index (str + int values), str-form kwarg engine, and child-metric non-recursion.
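The interception described above can be sketched as follows. This is a minimal illustration, not the real implementation: _ENGINE_INDEX_MAP and the wrapper name follow the commit text, but the stub class stands in for prometheus_client.Counter, and the exact rewrite mechanics are assumptions.

```python
_ENGINE_INDEX_MAP = {}  # engine_idx -> (stage, replica); populated by the StatLogger


class _StubCounter:
    """Minimal stand-in for a prometheus_client Counter family."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = list(labelnames)
        self.children = {}

    def labels(self, *args, **kwargs):
        # Key children by label values in labelnames order.
        values = list(args) + [kwargs[k] for k in self.labelnames if k in kwargs]
        return self.children.setdefault(tuple(values), object())


class _RelabelCounter(_StubCounter):
    """Rewrite `engine` -> (`stage`, `replica`) at family creation and on
    every .labels() call, for both kwarg and positional forms."""

    def __init__(self, name, labelnames):
        labelnames = list(labelnames)
        # Remember where `engine` sat so positional calls can splice later.
        self._engine_idx = (
            labelnames.index("engine") if "engine" in labelnames else None)
        if self._engine_idx is not None:
            labelnames[self._engine_idx:self._engine_idx + 1] = ["stage", "replica"]
        super().__init__(name, labelnames)

    def labels(self, *args, **kwargs):
        if "engine" in kwargs:
            # Kwarg form: .labels(engine=idx, ...) with int or str idx.
            stage, replica = _ENGINE_INDEX_MAP[int(kwargs.pop("engine"))]
            kwargs["stage"], kwargs["replica"] = stage, replica
        elif self._engine_idx is not None and args:
            # Positional form: splice (stage, replica) at the original offset.
            args = list(args)
            stage, replica = _ENGINE_INDEX_MAP[int(args[self._engine_idx])]
            args[self._engine_idx:self._engine_idx + 1] = [stage, replica]
        return super().labels(*args, **kwargs)


_ENGINE_INDEX_MAP.update({0: ("thinker", "0"), 1: ("talker", "0")})
c = _RelabelCounter("vllm:num_requests", ["model_name", "engine"])
c.labels(model_name="m", engine=1)  # kwarg form, int engine
c.labels("m", "1")                  # positional form, str engine, same child
```

Both call forms land on the same (model_name, stage, replica) child, which is the property the four upstream call sites rely on.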
Phase 2.2b of multi-replica observability (RFC §3.2.7). Subclass upstream
PrometheusStatLogger to expose ~37 vllm:* metric families with
{stage, replica} labels instead of the single {engine} label.
Three pieces of glue:
1. Class-level slot overrides (_gauge_cls / _counter_cls / _histogram_cls)
plug the relabel mixin wrappers into upstream's family-construction path
so labelnames are rewritten as families are created.
2. A property descriptor on per_engine_labelvalues intercepts upstream's
loggers.py:433 self-assignment and rewrites each [model_name, str(idx)]
tuple to [model_name, stage, replica] so downstream create_metric_per_engine
feeds 3-element labelvalues into the 3-label families.
3. __init__ accepts a stage_replica_map: dict[int, tuple[str, str]] and
populates the process-level _ENGINE_INDEX_MAP that the wrappers consult
on every .labels() call. Map is cleared first so re-init in the same
process (tests, orchestrator restart) starts clean.
Dynamic add/remove of replicas at runtime is intentionally out of scope.
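Glue piece 2 can be sketched as a plain property descriptor. This is a simplified stand-in (the real class subclasses upstream PrometheusStatLogger; the map contents and class name here are illustrative):

```python
class OmniLoggerSketch:
    """Intercept upstream's self-assignment of per_engine_labelvalues and
    rewrite each [model_name, str(idx)] entry to [model_name, stage, replica]."""

    # Illustrative engine_idx -> (stage, replica) map.
    _ENGINE_INDEX_MAP = {0: ("thinker", "0"), 1: ("talker", "0")}

    @property
    def per_engine_labelvalues(self):
        return self._per_engine_labelvalues

    @per_engine_labelvalues.setter
    def per_engine_labelvalues(self, values):
        # Upstream assigns {idx: [model_name, str(idx)]}; a property is a
        # data descriptor, so the plain `self.x = ...` assignment in the
        # base-class __init__ routes through this setter.
        rewritten = {}
        for idx, (model_name, _engine) in values.items():
            stage, replica = self._ENGINE_INDEX_MAP[idx]
            rewritten[idx] = [model_name, stage, replica]
        self._per_engine_labelvalues = rewritten


log = OmniLoggerSketch()
log.per_engine_labelvalues = {0: ["qwen3-omni", "0"], 1: ["qwen3-omni", "1"]}
```

Downstream code reading the property then sees 3-element labelvalues matching the 3-label families.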
Phase 2.3 of multi-replica observability (RFC §3.2.7) — completes the
end-to-end path so the ~37 vllm:* metric families now expose
per-(stage, replica) series.
Three targeted edits to orchestrator.py:
1. Build a flat stage_replica_map at __init__ by walking stage_pools and
assigning each (stage_id, replica_id) tuple a monotonically increasing
engine_idx via prefix-sum across pools. The reverse map
_stage_replica_to_engine_idx is consulted at record() time to translate
the orchestrator's (stage_id, replica_id) loop variables back into the
flat engine_idx upstream expects.
2. Swap PrometheusStatLogger for OmniPrometheusStatLogger and pass
stage_replica_map instead of the old range(num_stages) engine_indexes.
3. record() now passes the per-replica flat engine_idx instead of just
stage_id, so multi-replica deployments emit distinct
{stage, replica} series per replica rather than aggregating them.
The stage label currently uses str(stage_id); migrating to semantic names
("thinker", "talker", "diffusion") is a separate config schema change
deferred to a later iteration.
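The prefix-sum construction in edit 1 can be sketched like this, where replicas_per_stage is an assumed simplification of walking stage_pools:

```python
def build_stage_replica_map(replicas_per_stage):
    """Assign each (stage_id, replica_id) a monotonically increasing flat
    engine_idx, and keep the reverse map for record()-time lookups."""
    forward, reverse, engine_idx = {}, {}, 0
    for stage_id, num_replicas in enumerate(replicas_per_stage):
        for replica_id in range(num_replicas):
            forward[engine_idx] = (str(stage_id), str(replica_id))
            reverse[(stage_id, replica_id)] = engine_idx
            engine_idx += 1
    return forward, reverse


# Stage 0 has 1 replica, stage 1 has 2: three flat engine indexes in total.
fwd, rev = build_stage_replica_map([1, 2])
```

With this shape, record() translates its (stage_id, replica_id) loop variables through the reverse map into the flat engine_idx upstream expects.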
Phase 3.1 of multi-modal observability (RFC G1/G2). Per-modality
business-semantic Prometheus families with {model_name, stage, replica}
labels:
audio: ttfp_seconds, duration_seconds, rtf, frames (Counter)
image: ttfp_seconds, num (Counter), generation_time_seconds
video: generation_time_seconds
video_duration_seconds and video_rtf are intentionally deferred —
diffusion video pipelines (i2v / t2v / cogvideo / hunyuan / wan)
expose num_frames + fps in heterogeneous shapes and a clean abstraction
is out of scope for this iteration.
Text-path metrics (TTFT/ITL/TPOT) are NOT here — they come from the
upstream vllm:*{stage="thinker", ...} families exposed by the Phase 2
OmniPrometheusStatLogger wrap. RFC §3.2.6 single-source naming via
definitions.py constants; RTF families use RTF_BUCKETS, time families
use SECONDS_BUCKETS. observe APIs accept stage/replica at call time
since one OmniModalityMetrics instance per pipeline serves all
(stage, replica) combinations.
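A sketch of the observe-API shape described above, with a plain list standing in for the real Prometheus families. The RTF formula shown (generation time divided by audio content duration) is an assumption about the definitions.py formula, and all names are illustrative:

```python
class OmniModalityMetricsSketch:
    """One instance per pipeline: model_name bound at __init__,
    stage/replica supplied at observe time."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.observations = []  # (family, labels, value) in place of real families

    def observe_audio(self, frames, sample_rate, generation_time_s, stage, replica):
        labels = (self.model_name, stage, replica)
        duration_s = frames / sample_rate        # audio content duration
        rtf = generation_time_s / duration_s     # real-time factor (assumed direction)
        self.observations.append(("omni_audio_frames", labels, frames))
        self.observations.append(("omni_audio_duration_seconds", labels, duration_s))
        self.observations.append(("omni_audio_rtf", labels, rtf))


m = OmniModalityMetricsSketch("qwen3-omni")
# 48000 frames at 24 kHz = 2 s of audio, generated in 1 s -> RTF 0.5.
m.observe_audio(frames=48000, sample_rate=24000, generation_time_s=1.0,
                stage="talker", replica="0")
```

Because stage/replica arrive per call, a single instance can serve every (stage, replica) combination in the pipeline.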
Phase 3.2 of multi-modal observability (RFC G1/G2). Emit per-modality Prometheus observations at request finalize for 7 of the 8 modality families:
audio: frames (Counter), duration_seconds, rtf
image: num (Counter), generation_time_seconds, ttfp_seconds
video: generation_time_seconds
audio_ttfp_seconds is intentionally NOT here — it requires a first-packet timestamp from the streaming path and lands separately in Phase 3.3.
Five connected edits:
1. definitions.py — add resolve_audio_sample_rate(multimodal_output) helper
   that mirrors the (audio_sample_rate / sample_rate / sampling_rate / sr)
   fallback chain already used by serving_chat.py for the OpenAI audio
   response, plus DEFAULT_AUDIO_SAMPLE_RATE=24000.
2. modality.py — add observe_modality_at_finalize() module-level routing
   function. Extracted from omni_base so unit tests can exercise the routing
   logic without importing the heavy AsyncOmniBase stack.
3. orchestrator.py — add replica_id to the _route_output() signature and to
   the "type=output" / "type=stage_metrics" queue dicts so the downstream
   finalize hook in omni_base can emit per-replica labels. The error path at
   line 867 intentionally omits replica_id; the modality routing function
   defensively skips when it is None.
4. omni_base.py — instantiate self.mod_metrics (OmniModalityMetrics)
   alongside self.prom_metrics; call observe_modality_at_finalize() inside
   the existing e2e_done finalize guard so it fires once per request and
   inherits the try/except isolation.
5. tests/metrics/test_modality.py — 9 routing tests covering all four
   output_type branches, sample-rate resolution, ttfp clamping on clock skew,
   and defensive skips for None replica_id / stage_metrics. Uses a stub
   mod_metrics that records call signatures so we don't need a live
   PrometheusRegistry for routing assertions.
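The fallback chain from edit 1 might look like this. This is a sketch: the attribute probe order comes from the commit text, but getattr-based probing and the falsy-value check are assumptions about the real helper.

```python
from types import SimpleNamespace

DEFAULT_AUDIO_SAMPLE_RATE = 24000


def resolve_audio_sample_rate(multimodal_output):
    """Probe the output object for a sample rate under the known attribute
    names, falling back to the 24 kHz default when none is set."""
    for attr in ("audio_sample_rate", "sample_rate", "sampling_rate", "sr"):
        value = getattr(multimodal_output, attr, None)
        if value:  # skip missing, None, and zero
            return int(value)
    return DEFAULT_AUDIO_SAMPLE_RATE


# Usage: whichever attribute is present first in the chain wins.
rate = resolve_audio_sample_rate(SimpleNamespace(sampling_rate=16000))
```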
…ervation
Phase 3.3 of multi-modal observability (RFC G1) — completes the 8th modality family. Unlike the other 7 (finalize-path), audio_ttfp must be measured at the moment the first audio packet leaves the engine, not at request finalize.
Five edits:
1. ClientRequestState — add request_arrival_ts (wall-clock anchor for the
   TTFP delta) and first_audio_ts (once-per-request guard).
2. async_omni.generate() — populate req_state.request_arrival_ts =
   wall_start_ts right after creating the ClientRequestState, before the
   orchestrator accepts the request.
3. modality.observe_audio_first_packet() — module-level helper that computes
   max(now_ts - arrival_ts, 0.0) and emits audio_ttfp_seconds. Defensively
   skips when replica_id is None or arrival_ts == 0 (uninitialized). The
   once-per-request guard lives in the caller.
4. serving_chat.chat_completion_stream_generator() — the HTTP SSE audio
   branch checks req_state.first_audio_ts and observes on the first hit.
   Looks up the replica via stage_pools[stage_id].get_bound_replica_id() so
   the metric carries the right per-replica label.
5. serving_video_stream._process_query_engine() — the WebSocket path inserts
   the same guard + observe right where t_first_audio is already set, so the
   existing first-packet detection logic is reused.
Caller-side guards (req_state.first_audio_ts is None) prevent re-observing on subsequent audio chunks for the same request_id. Both hook sites defensively skip if request_states or stage_pools is missing. 4 new helper unit tests (valid inputs / replica=None skip / arrival_ts=0 skip / clock-skew clamp to 0). Total 26 modality tests pass.
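The clamp-and-skip logic from edit 3 reduces to a few lines. This sketch returns the delta for illustration; the real helper emits to the audio_ttfp_seconds family instead of returning a value.

```python
def audio_ttfp(now_ts, arrival_ts, replica_id):
    """Compute time-to-first-packet for the audio stream.

    Returns None (defensive skip, no observation emitted) when the replica
    is unknown or the arrival timestamp was never stamped; clamps negative
    deltas from clock skew to 0.0.
    """
    if replica_id is None or arrival_ts == 0:
        return None
    return max(now_ts - arrival_ts, 0.0)
```

The once-per-request guard (first_audio_ts) stays in the caller, so this helper is pure and trivially unit-testable.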
Phase 4 of multi-modal observability (RFC G3). Four Histogram families
with {model_name, from_stage, from_replica, to_stage, to_replica} labels,
emitted from the existing TransferEdgeStats accumulators in
OrchestratorAggregator:
transfer_size_bytes -- BYTES_BUCKETS, observed at TX hook
transfer_tx_time_ms -- MS_BUCKETS, observed at TX hook
transfer_rx_decode_time_ms -- MS_BUCKETS, observed at RX hook
transfer_in_flight_time_ms -- MS_BUCKETS, observed at RX hook
Each .observe_*() corresponds to one physical transfer event (one chunk
hop) so the histogram tracks per-transfer distribution rather than
request-aggregated sums.
Two non-trivial design choices, both deviating from RFC §3.2.6:
1. Add model_name to the label set (RFC lists only the four stage/replica
labels). Rationale: aligns transfer with the rest of the omni_*
families (audio_*, image_*, video_*, num_requests_*) so PromQL joins
on model_name work uniformly. Cardinality cost is zero for typical
single-model deployments. To be amended back into the RFC as D11.
2. Resolve from_replica / to_replica by consulting the sticky-routing
binding the orchestrator already maintains, rather than plumbing
replica ids through TransferEdgeStats / StageRequestStats / connector
adapters. Phase 2 (PR vllm-project#2396) gave us
stage_pool.get_bound_replica_id(request_id), which returns the replica
each request is bound to within a stage; we look it up at emit time.
Five files would need touching otherwise (TransferEdgeStats,
StageRequestStats, stage_pool.build_stage_metrics, adapter.on_forward
call site, request bookkeeping). To be amended back into the RFC as D12.
Five edits:
1. definitions.py — TRANSFER_LABELS prepends model_name.
2. transfer.py — new OmniTransferMetrics(model_name) class with the four
Histogram families and one observe method per family. Same wrapper
pattern as OmniPrometheusMetrics (PR vllm-project#3362) and OmniModalityMetrics
(Phase 3): bind model_name at __init__, accept stage/replica at
observe time.
3. stats.py — OrchestratorAggregator gains two optional kwargs at
__init__ (transfer_emitter, replica_resolver). record_transfer_tx
and record_transfer_rx call thin _emit_transfer_tx / _emit_transfer_rx
helpers after the existing accumulation, which fail-safe (skip emit)
when either dep is missing or the resolver returns None for either
side. The TransferEdgeStats accumulation path is unchanged so
existing log/aggregation consumers stay intact.
4. omni_base.py — instantiate self.transfer_metrics alongside
self.prom_metrics / self.mod_metrics.
5. async_omni.py — wire transfer_emitter and replica_resolver into the
per-request OrchestratorMetrics construction. New
_resolve_transfer_replica helper looks up
self.engine.stage_pools[s].get_bound_replica_id(rid).
15 unit tests cover the 4-family registration, observe APIs, bucket
selection, multi-edge cardinality, plus the OrchestratorAggregator emit
hooks (tx + rx full emit, defensive skips for emitter=None /
resolver=None / one-side-resolves-to-None, rx skips stage 0). Stub
emitter records call signatures so the routing logic is asserted
without standing up a Prometheus registry.
Note: TX-side hook (record_transfer_tx) is emit-ready but
adapter.try_send_via_connector — its only caller — currently has no
upstream invocation in the main code path (likely plugin-driven or
about-to-land). RX side (record_transfer_rx) is the active path today;
TX side will start emitting as soon as the connector pipeline activates.
…al{finished_reason}
Phase 5 of multi-modal observability (RFC G6). Collapse the two
PR-vllm-project#3362-era pipeline-level counters
vllm:omni_num_requests_success
vllm:omni_num_requests_fail
into a single per-reason Counter:
vllm:omni_requests_success_total{model_name, finished_reason}
with finished_reason ∈ {stop, length, abort, ...} mirroring upstream
vllm:request_success_total. The previous "fail" path now maps to
finished_reason="abort" so the Pipeline-level success-rate dashboard
becomes one query (sum by (finished_reason) ...) instead of needing
two metrics joined.
Breaking change: dashboards / alerts that reference the old
num_requests_success / num_requests_fail counter names must migrate
to the new requests_success_total{finished_reason} family.
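The call-time label binding and the abort mapping can be sketched as below, with a plain dict standing in for the real Prometheus Counter family:

```python
SUCCESS_LABELS = ("model_name", "finished_reason")


class CompletionCounterSketch:
    """Single per-reason counter: finished_reason is bound per call, and
    the legacy fail path maps to finished_reason="abort"."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.counts = {}  # (model_name, finished_reason) -> count

    def request_succeeded(self, finished_reason="stop"):
        key = (self.model_name, finished_reason)
        self.counts[key] = self.counts.get(key, 0) + 1

    def request_failed(self):
        # Back-compat no-arg signature for the cleanup call site.
        self.request_succeeded(finished_reason="abort")


c = CompletionCounterSketch("qwen3-omni")
c.request_succeeded()                          # finished_reason="stop"
c.request_succeeded(finished_reason="length")
c.request_failed()                             # maps to "abort"
```

With one family, the success-rate dashboard collapses to a single sum by (finished_reason) query instead of joining two metrics.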
Four edits:
1. definitions.py — drop NUM_REQUESTS_SUCCESS, NUM_REQUESTS_FAIL.
Add REQUESTS_SUCCESS (passed without _total since Counter
auto-suffixes at exposition).
2. prometheus.py — replace _success_family + _fail_family with a
single _completion_family using SUCCESS_LABELS = ("model_name",
"finished_reason"). The label is bound per-call rather than at
__init__ time.
- OmniPrometheusMetrics.request_succeeded() gains a
finished_reason="stop" default kwarg.
- OmniPrometheusMetrics.request_failed() keeps its no-arg signature
for back-compat with the cleanup call site, internally mapping
to finished_reason="abort".
3. omni_base.py — at the request-finalize hook, extract
finish_reason from engine_outputs.outputs[0].finish_reason
(vLLM CompletionOutput convention) and pass it through to
request_succeeded(). Falls back to "stop" when no completion
output is present (defensive — e.g. diffusion stages).
4. tests/metrics/test_prometheus.py — refresh _PIPELINE_METRICS,
exercise three reason buckets (stop x 2 + length x 1 + abort x 1)
in the scrape fixture, replace the old success/fail-count assertions
with per-reason ones, and adjust the histogram-count assertions
(e2e/queue go from 2 to 3 since the abort path now only inc's the
counter and doesn't observe latency).
PR vllm-project#3362 introduced docs/usage/metrics.md and docs/design/metrics.md covering its own pipeline-level + diffusion families. The follow-up work in this branch (G1 audio, G2 image/video, G3 cross-stage transfer, G6 success/fail merge into requests_success_total{finished_reason}, G7 OmniPrometheusStatLogger per-replica wrap) wasn't reflected. Refresh both docs to the final state.
usage/metrics.md (+117 / -38):
- Request Tracking table swaps num_requests_success / num_requests_fail for the merged requests_success_total{finished_reason} family.
- New sections for Modality (audio / image / video) and Cross-Stage Transfer with the per-modality / per-edge metric tables.
- vLLM Engine Metrics section gains a before/after example showing how the G7 wrap reshapes engine -> stage + replica, with a note that the ~37 upstream families gain per-replica visibility automatically.
- Diffusion Engine Metrics section clarifies that omni-side diffusion families bypass the wrap (engine label = stage_id, not relabelled).
- Pipeline Type availability matrix gains modality / transfer / per-replica rows.
- Naming Convention section explains the RFC "co-position, different name" three-modality TTFP convention.
design/metrics.md (+260 / -74):
- Objectives section calls out the per-replica + modality + transfer goals introduced after PR vllm-project#3362.
- Architecture diagram replaces the original two-component picture with the four-component one (OmniPrometheusMetrics + OmniModalityMetrics + OmniTransferMetrics + OmniPrometheusStatLogger).
- Data Flow grows from two paths to four: Pipeline-level, Modality (finalize hook + audio_ttfp streaming hook), Cross-stage transfer (sticky-routing replica_resolver), and the wrapped per-engine path. The transfer subsection notes the TX-side hook is wired but currently inactive pending try_send_via_connector being called from the main code path.
- New "OmniPrometheusStatLogger Wrap (G7)" section explains the four upstream impedance points and the three coordinated mechanisms (class-level metric class slots, per_engine_labelvalues property descriptor, _RelabelMixin.labels() override) used to handle them. Also documents the stage_replica_map construction and the deferred dynamic add/remove decision.
- Metric Definitions table refreshed end-to-end: the pipeline-level row now lists requests_success_total{finished_reason}; new modality and transfer subtables added; the transfer subtable carries a note that model_name was added on top of the RFC-listed four stage/replica labels for cross-omni consistency.
- Logging vs. Prometheus section expanded to mention OmniModality / OmniTransfer alongside OmniPrometheusMetrics, and notes that OrchestratorAggregator.record_transfer_tx/rx fans out to both the log accumulator and the Prometheus emit hook in one method body.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56afe670b8
total = len(self.request_states)
self.prom_metrics.set_running(running)
self.prom_metrics.set_waiting(max(0, total - running))
Recompute waiting gauge after request-state cleanup
total = len(self.request_states) is sampled before the completed request is removed from self.request_states, while the orchestrator may already have decremented _running_counter; for a single completed request this yields running=0, total=1 and publishes omni_num_requests_waiting=1 even though the system is idle. Because _log_summary_and_cleanup() pops the request after this point without refreshing gauges, the stale waiting value can persist until another request arrives.
pass
...
_vllm_prometheus.unregister_vllm_metrics = _noop_unregister_vllm_metrics
Keep collector cleanup to avoid duplicate metric registration
Replacing unregister_vllm_metrics() with a no-op means previously registered vllm:* collectors are never removed, so creating another engine in the same process (e.g., tests that instantiate multiple engines, hot-reload, or multi-tenant hosting) can fail at startup with duplicated Prometheus timeseries registration errors. This patch should preserve upstream cleanup for vllm:* families and only protect vllm:omni_* collectors from removal.
…mismatch
Phase 2.4 — two bugs in the G7 wrap surfaced when bringing the server
up against a real PrometheusStatLogger / qwen3-omni deployment, both
caused by paths the unit tests didn't reach against the stub upstream:
1. Helper-class label-count mismatch (orchestrator init crash with
`ValueError: Incorrect label count`).
Upstream PrometheusStatLogger.__init__ instantiates three sub-helper
collectors at loggers.py:438-446:
self.spec_decoding_prom = self._spec_decoding_cls(
vllm_config.speculative_config, labelnames, per_engine_labelvalues)
self.kv_connector_prom = self._kv_connector_cls(
vllm_config, labelnames, per_engine_labelvalues)
self.perf_metrics_prom = self._perf_metrics_cls(
vllm_config, labelnames, per_engine_labelvalues)
Each helper takes raw `labelnames` as a constructor argument and
builds its internal Counter/Gauge/Histogram families via its own
class-level `_counter_cls` / `_gauge_cls` / `_histogram_cls` slots.
The slot overrides on OmniPrometheusStatLogger only reach families
created via *its* slots, so the helpers still build 2-label
families and then crash when create_metric_per_engine fans the
rewritten 3-element per_engine_labelvalues into them.
Fix: subclass each helper (`_OmniPerfMetricsProm`,
`_OmniSpecDecodingProm`, `_OmniKVConnectorProm`) and override
their cls slots to the relabel-mixin wrappers, then assign these
subclasses to `_perf_metrics_cls` / `_spec_decoding_cls` /
`_kv_connector_cls` on OmniPrometheusStatLogger so the helpers
construct families through the same labelname-rewrite path.
2. Double-rewrite in mixin's positional-args path (also surfaces as
`Incorrect label count` after the helpers were patched).
Phase 2.2b's per_engine_labelvalues property setter already rewrites
each entry from [model_name, str(idx)] to [model_name, stage,
replica]. create_metric_per_engine then fans those into
`metric.labels(*values)` — i.e. the wrapper's positional .labels()
path receives 3 already-rewritten values. The old splice logic
would then re-interpret args[engine_label_index=1] (now "stage")
as an engine_idx and splice (stage, replica) again, blowing the
label count to 4.
Fix: detect `len(args) == len(self._labelnames)` at the start of
the positional path and short-circuit to passthrough — the caller
has already shaped the values to the rewritten family. The legacy
path (e.g. counter_request_success.labels(model_name, str(idx),
reason) at loggers.py:679) keeps working because its arg count is
short by one relative to the rewritten labelnames, so the splice
still triggers as before.
Tests:
- New `TestHelperClassWraps` asserts the four cls-slot overrides on
OmniPrometheusStatLogger and the Counter/Gauge/Histogram slots on
each helper subclass.
- New `TestDoubleRewriteGuard` covers both branches: a 2-label original
family receiving 3 already-rewritten values (passthrough) and a
3-label original (engine in middle) receiving 3 ORIGINAL-shape
values (splice path still fires).
- Updated the docstring on `test_labels_positional_passthrough` to
call out that it specifically exercises the new guard.
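The guard from fix 2 reduces to an arg-count check ahead of the splice. This sketch extracts it as a free function for illustration; in the real mixin the check sits at the top of the positional .labels() path.

```python
def splice_or_passthrough(args, labelnames, engine_label_index, engine_map):
    """Rewrite positional label values for a relabelled family.

    If the caller already supplied one value per rewritten labelname (the
    per_engine_labelvalues setter has shaped them), pass through untouched.
    Legacy original-shape calls are one value short, so the splice of
    (stage, replica) at the recorded engine offset still fires.
    """
    args = list(args)
    if len(args) == len(labelnames):
        return args  # already rewritten upstream; do not re-interpret
    stage, replica = engine_map[int(args[engine_label_index])]
    args[engine_label_index:engine_label_index + 1] = [stage, replica]
    return args
```

The passthrough branch is exactly what stops the old logic from re-reading "stage" as an engine_idx and blowing the label count to 4.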
Purpose
Implements RFC #3545 on top of PR #3362's /metrics baseline, layering seven incremental goals:
- audio_ttfp_seconds / audio_duration_seconds / audio_rtf / audio_frames with labels {stage, replica, model_name}. audio_ttfp is "time to first audio packet/frame," semantically distinct from upstream vllm:time_to_first_token_seconds{stage="talker"} ("time to first audio token (code)") — different units, exposed independently.
- image_ttfp_seconds / image_num / image_generation_time_seconds / video_duration_seconds / video_rtf / video_generation_time_seconds. Introduces "video content duration vs. generation time" and recognizes that the image path has no RTF concept, so its stage time is exposed directly. PR #3362 (Add Prometheus /metrics Support)'s four ms-level diffusion timings are reused, not duplicated.
- transfer_size_bytes / transfer_tx_time_ms / transfer_rx_decode_time_ms / transfer_in_flight_time_ms with labels {model_name, from_stage, from_replica, to_stage, to_replica}. The _ms suffix matches the source fields on TransferEdgeStats.*_ms and follows the same style as PR #3362's diffusion_*_time_ms.
- MultiModalsBenchmarkMetrics: a shared definition module is imported by both server and bench.
- vllm:time_to_first_token_seconds (text — reuse upstream) / vllm:omni_audio_ttfp_seconds / vllm:omni_image_ttfp_seconds. No single metric with a modality label.
- vllm:omni_requests_success_total{finished_reason=...}, pipeline-global, with finished_reason mirroring upstream completion values (stop / length / abort / ...).
- OmniPrometheusStatLogger wraps the upstream PrometheusStatLogger. It uses engine_indexes = range(num_stages × num_replicas) for flat numbering and reshapes the single engine label into a stage + replica pair. All ~37 vllm:* families automatically gain per-(stage, replica) dimensions — including text-path TTFT / ITL / TPOT / e2e, which therefore do not need to be re-implemented on the omni side.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md