[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared labeled child by marwan116 · Pull Request #40369 · vllm-project/vllm

marwan116 · 2026-04-20T14:28:52Z

Purpose

When vLLM runs with the Ray Prometheus path (Ray Serve, ray.data.llm, etc.), vllm:request_success{finished_reason=...} only ever increments the repetition bucket regardless of the request's actual finish reason; stop, length, abort, and error stay at zero.

Root cause. RayPrometheusMetric.labels() mutated the wrapped Ray metric's default tags in place (via set_default_tags) and returned self, so every .labels(...) call on a given wrapper returned the same object. In PrometheusStatLogger, the initialization loop partitions counter_request_success over FinishReason; all five entries end up pointing at the same wrapper, whose default tags get frozen at the last-iterated member (REPETITION). Subsequent .inc() calls record under that tag, no matter the request's actual finish_reason.

The same flaw affects every other .labels(...)-partitioned counter, gauge, and histogram on the Ray path — per-engine splits via create_metric_per_engine, plus spec-decoding, KV-connector, perf, and NIXL metrics. Any alerting, SLO, or capacity-planning built on multi-bucket vLLM-on-Ray metrics is silently wrong.

Fix. labels() now returns an independent _LabeledRayMetric that carries its own tag dict and forwards .inc() / .set() / .observe() to the underlying Ray metric with tags=self._tags on every call. Per Ray's metric API, per-call tags take precedence over default tags, so concurrent labeled children cannot clobber each other. This matches the prometheus_client.Metric.labels() contract that callsites rely on — no callsite changes needed.

Per AGENTS.md §1: searched open/closed PRs and issues for request_success finished_reason, RayPrometheusMetric labels, set_default_tags ray, ray metrics labels independent — no duplicate work.

Test Plan

New unit tests in tests/v1/metrics/test_ray_metrics.py (no Ray cluster required — the underlying Ray metric is swapped for a MagicMock):

test_ray_counter_labels_returns_independent_children — two .labels(...) calls return distinct objects with independent tag dicts.
test_ray_counter_inc_forwards_per_child_tags — each child's tags reach the underlying Ray counter; .inc(0) stays a no-op.
test_ray_gauge_labels_returns_independent_children_and_forwards_tags — same for RayGaugeWrapper.set.
test_ray_histogram_labels_returns_independent_children_and_forwards_tags — same for RayHistogramWrapper.observe.
test_ray_counter_labels_accepts_non_string_label_values — covers the str(idx) coercion path used by the per-engine split.
test_ray_counter_labels_arity_validation — arity check still fires.

Local commands run:

pre-commit run ruff-check  --files vllm/v1/metrics/ray_wrappers.py tests/v1/metrics/test_ray_metrics.py
pre-commit run ruff-format --files vllm/v1/metrics/ray_wrappers.py tests/v1/metrics/test_ray_metrics.py
pre-commit run typos       --files vllm/v1/metrics/ray_wrappers.py tests/v1/metrics/test_ray_metrics.py
pre-commit run mypy-local  --files vllm/v1/metrics/ray_wrappers.py tests/v1/metrics/test_ray_metrics.py

The existing test_engine_log_metrics_ray smoke test requires a built vLLM + GPU environment and is deferred to CI.

Test Result

All four pre-commit hooks pass locally:

ruff check...............................................................Passed
ruff format..............................................................Passed
typos....................................................................Passed
Run mypy locally for lowest supported Python version.......................Passed

Production-workload evidence motivating the fix (ground truth via an in-worker check_stop probe, compared against the Prometheus scrape on the same run):

`request.status` (ground truth)	Count
`FINISHED_STOPPED`	87
`FINISHED_LENGTH_CAPPED`	0
`FINISHED_REPETITION`	0

Pre-fix, Prometheus reported 100% of those increments under vllm:request_success{finished_reason="repetition"}. With the fix, increments land on the matching FinishReason string (stop in this run).

AI-assisted (per AGENTS.md §1.3). Every changed line was reviewed by the submitter.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

github-actions · 2026-04-20T14:29:26Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

gemini-code-assist

Code Review

This pull request refactors the Ray metrics wrappers to ensure that calling .labels() returns independent labeled children rather than mutating the base metric's default tags. This change aligns the behavior with the prometheus_client contract, preventing label clobbering where increments were incorrectly attributed to the last used label set. The implementation introduces _LabeledRayMetric subclasses for counters, gauges, and histograms to handle per-call tag forwarding. Comprehensive regression tests have been added to verify independent tag sets, tag forwarding, and label arity validation. I have no feedback to provide as the review comments were evaluative or confirmatory in nature.

…hild RayPrometheusMetric.labels() mutated the wrapped Ray metric's default tags in place and returned self, so every .labels(...) call on a given wrapper instance returned the same object. The initialization loop in PrometheusStatLogger iterates over FinishReason and uses counter.labels(model, idx, str(reason)) to create a "child" per reason; under the Ray wrapper, all five children pointed at the same underlying Ray counter whose default tags were set by the last iteration. Every .inc() call landed on finished_reason="repetition", regardless of the request's actual finish reason. The same flaw affected every other .labels(...)-partitioned counter, gauge, and histogram in the Ray metrics path (per-engine splits via create_metric_per_engine, spec decoding / KV connector / perf / NIXL metrics, etc.), silently falsifying any multi-bucket dashboard or alert derived from vLLM metrics on Ray. Fix: .labels() now returns an independent _LabeledRayMetric that carries its own tag dict and forwards .inc()/.set()/.observe() to the underlying Ray metric with tags=self._tags on every call. Per Ray's metric API, per-call tags take precedence over any default tags, so concurrent labeled children no longer clobber each other. This matches the prometheus_client.Metric.labels() contract callsites rely on. Adds regression tests covering Counter, Gauge, and Histogram wrappers: labels() returns distinct children, per-child tags forward to the underlying metric, non-string label values are coerced, and arity validation still fires. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>

eicherseiji

Thanks @marwan116! LGTM.

markmc

This sounds like it has been a bug for a long time? Before #35451 it would have been finish_reason[error] being incremented?

In any case, see inline comments - I'd like to avoid an additional hierarchy of wrappers

markmc · 2026-04-21T08:23:04Z

+    def inc(self, value: int | float = 1.0):
+        if value == 0:
+            return
+        return self._wrapper.metric.inc(value, tags=self._tags)


This duplicates the inc() in RayCounterWrapper (same for the other classes too)

Can we not have labels() return a new instance of the RayPrometheusMetric subclass rather than adding a new class hierarchy?

markmc · 2026-04-21T08:23:57Z

+        # Ray metric's default tags in place and returned self, so every
+        # labeled "child" shared the last-set label values -- e.g. every
+        # vllm:request_success increment was attributed to the last
+        # FinishReason iterated (REPETITION).


I'm not a fan of "earlier versions [did this other thing]" that coding models tend to spit out - I think it just adds clutter

markmc · 2026-04-21T08:24:45Z

+    """Regression test: RayCounterWrapper.labels() must return distinct
+    labeled children that each carry their own tag set.
+
+    Prior to the fix, labels() mutated the wrapped Ray counter's default


Ditto on this "prior to the fix" comment

eicherseiji · 2026-04-24T21:54:35Z

Discussed with @marwan116 offline, I will pick this up in #40840.

We can close this PR.

markmc

lgtm, thanks

markmc

Wrong PR

markmc · 2026-04-27T08:39:00Z

Closing in favor of #40840

marwan116 requested a review from markmc as a code owner April 20, 2026 14:28

claude Bot reviewed Apr 20, 2026

View reviewed changes

mergify Bot added v1 bug Something isn't working labels Apr 20, 2026

gemini-code-assist Bot reviewed Apr 20, 2026

View reviewed changes

marwan116 force-pushed the marwan/fix-vllm-ray-metrics-finish-reason branch from 2b5df5b to 6dbe47b Compare April 20, 2026 14:39

eicherseiji approved these changes Apr 20, 2026

View reviewed changes

markmc requested changes Apr 21, 2026

View reviewed changes

markmc added this to Metrics & Tracing Apr 21, 2026

markmc moved this from Backlog to In Review in Metrics & Tracing Apr 21, 2026

github-project-automation Bot moved this to Backlog in Metrics & Tracing Apr 21, 2026

eicherseiji mentioned this pull request Apr 24, 2026

[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared labeled child #40840

Merged

markmc approved these changes Apr 27, 2026

View reviewed changes

markmc requested changes Apr 27, 2026

View reviewed changes

markmc closed this Apr 27, 2026

markmc moved this from In Review to Stale in Metrics & Tracing Apr 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared labeled child#40369

[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared labeled child#40369
marwan116 wants to merge 1 commit intovllm-project:mainfrom
marwan116:marwan/fix-vllm-ray-metrics-finish-reason

marwan116 commented Apr 20, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

eicherseiji left a comment

Uh oh!

markmc left a comment

Uh oh!

markmc Apr 21, 2026

Uh oh!

markmc Apr 21, 2026

Uh oh!

markmc Apr 21, 2026

Uh oh!

eicherseiji commented Apr 24, 2026

Uh oh!

markmc left a comment

Uh oh!

markmc left a comment

Uh oh!

markmc commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

marwan116 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

eicherseiji left a comment

Choose a reason for hiding this comment

Uh oh!

markmc left a comment

Choose a reason for hiding this comment

Uh oh!

markmc Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

markmc Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

markmc Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

eicherseiji commented Apr 24, 2026

Uh oh!

markmc left a comment

Choose a reason for hiding this comment

Uh oh!

markmc left a comment

Choose a reason for hiding this comment

Uh oh!

markmc commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marwan116 commented Apr 20, 2026 •

edited

Loading