[Misc][DP] Fix AsyncLLM metrics for multi-API server deployments#6
njhill merged 14 commits into njhill:all-to-all from …
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
njhill
left a comment
Thanks for this @kouroshHakha.
I have a few general comments:
When there is a single api-server, we log metrics that come back from each engine separately, with the engine index as a label. In the multi-api-server PR I changed the logic on the engine side to only send its SchedulerStats back to one of the client api-servers. So hopefully, for the metrics corresponding to these, not much more should be needed apart from making sure the gauges use "mostrecent" mode.
However, other metrics are computed during the loop in async_llm.py based on the requests that were processed in that iteration (in IterationStats). With multiple API servers, the processing of these requests for a given engine will in general be distributed amongst the api-servers since the outputs are sent back to each based on which originally sent the request.
These we may have to look at more closely and on a case-by-case basis since for example some are histograms where we assume the count corresponds to the number of iterations that have run on the corresponding engine, and we'll now be recording multiple of these (could be between 1 and num api servers).
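For intuition on the modes being discussed: prometheus_client gauges take a `multiprocess_mode` argument that determines how per-process samples are combined at scrape time. Here is a toy, stdlib-only simulation of those aggregation semantics (illustrative only; the real library aggregates mmap'd per-process files):

```python
# Toy simulation of prometheus_client gauge aggregation across processes.
# Each sample is (pid, write_timestamp, value).
def aggregate_gauge(samples, mode):
    if mode == "sum":  # add the values from all processes
        return sum(v for _, _, v in samples)
    if mode == "mostrecent":  # the most recently written value wins
        return max(samples, key=lambda s: s[1])[2]
    if mode == "all":  # keep one sample per process, labeled by pid
        return {pid: v for pid, _, v in samples}
    raise ValueError(f"unknown mode: {mode}")

samples = [(101, 1.0, 3), (102, 2.0, 5)]
print(aggregate_gauge(samples, "sum"))         # 8
print(aggregate_gauge(samples, "mostrecent"))  # 5
print(aggregate_gauge(samples, "all"))         # {101: 3, 102: 5}
```

The live* variants (livesum, livemostrecent, liveall) behave the same but only consider processes still marked alive, which is why mark_process_dead comes up later in this thread.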
From your PR description:
- Set "sum" as default mode for gauges with special handling for lora_info metric
I don't see this anywhere in the changes?
vllm/v1/metrics/loggers.py
Outdated
    def _create_counter(self, name: str, documentation: Optional[str],
                        labelnames: list[str]):
        return prometheus_client.Counter(name=name,
                                         documentation=documentation,
                                         labelnames=labelnames)
What's the purpose of adding this indirection if all of the args are just passed to the corresponding constructors?
Mostly modularity. We can extend these; for example, in PR vllm-project#17925 we want to wrap these primitives with their Ray equivalents.
It might make more sense to make this particular change in that other PR then, since it's not directly related to this one. cc @markmc
vllm/v1/engine/async_llm.py
Outdated
    # Upon shutdown of this process, we should mark the process as dead
    # See https://prometheus.github.io/client_python/multiprocess/
    try:
        import os

        from prometheus_client import multiprocess

        multiprocess.mark_process_dead(os.getpid())
        logger.debug("Marked Prometheus metrics for process %d as dead",
                     os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
Does it matter if we run this even when prometheus logging is disabled?
So this part of the shutdown logic runs through and is a no-op when log_stats=False. If we want to be pedantic about something like prometheus_client not being installed on the host, we could gate the logic on log_stats, but I think it's better to keep it like this. The only edge case I can think of is if the prometheus_client package is not installed, in which case the try-except block will just emit a logger.error rather than raising. So it's fine by default.
I added more comments to clarify the choice.
vllm/entrypoints/cli/serve.py
Outdated
    assert num_api_servers > 1
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        # Make TemporaryDirectory for prometheus multiprocessing
        # Note: global TemporaryDirectory will be automatically
        # cleaned up upon exit.
        global prometheus_multiproc_dir
        prometheus_multiproc_dir = tempfile.TemporaryDirectory()
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = prometheus_multiproc_dir.name
    else:
        logger.warning("Found PROMETHEUS_MULTIPROC_DIR was set by user. "
                       "This directory must be wiped between vLLM runs or "
                       "you will find inaccurate metrics. Unset the variable "
                       "and vLLM will properly handle cleanup.")
We should probably only do this if prometheus logging is enabled .. at least we probably shouldn't if log_stats is False.
Setting the env var even if log_stats is False is fine? It doesn't hurt? Why do you think we should gate it on log_stats, or, to be more precise, on whether the Prometheus logger is used?
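A minimal stdlib sketch of the lifecycle under discussion: create a TemporaryDirectory and export it only when the user hasn't already set the env var (the function name here is hypothetical, not the actual vLLM code):

```python
import os
import tempfile

# Keep a module-level reference so the TemporaryDirectory isn't garbage
# collected (and wiped) while the server is still running; it is cleaned
# up automatically at interpreter exit.
_prometheus_multiproc_dir = None

def setup_multiprocess_prometheus():
    """Point PROMETHEUS_MULTIPROC_DIR at a fresh temp dir unless the user set it."""
    global _prometheus_multiproc_dir
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        _prometheus_multiproc_dir = tempfile.TemporaryDirectory()
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = _prometheus_multiproc_dir.name
    return os.environ["PROMETHEUS_MULTIPROC_DIR"]

print(setup_multiprocess_prometheus())
```

Because the helper only touches the environment and the filesystem, running it unconditionally is cheap either way; the debate above is purely about whether gating on log_stats is worth the extra branch.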
vllm/v1/metrics/loggers.py
Outdated
    labelnames=labelnames,
    multiprocess_mode="all").labels(*labelvalues)

Suggested change:

-    labelnames=labelnames,
-    multiprocess_mode="all").labels(*labelvalues)
+    labelnames=labelnames).labels(*labelvalues)
@kouroshHakha in particular, I also don't think we should be using the "all" mode, which just labels by pid.
We also need to make sure that the PROMETHEUS_MULTIPROC_DIR env var is set. The docs recommend setting it externally but we obviously don't want to have to require that.
I think I changed it to livemostrecent (forgot to update the description).
@njhill I have thought about this for almost the entire day, looking at metrics and how they show up on Grafana, etc. My conclusion is that we have only one metric that is impacted by hitting multiple IterationStats records from different api_servers: the histogram of iteration tokens. You can also observe the diff on […]
Nice work thinking this through @kouroshHakha. Let's work through a simple example. With a single API server and a single engine […]. With 2 API servers and 2 engines […]. So the view you get is that the same number of tokens is being generated with more, smaller iterations? Is that going to be a problem? Surely people are watching trends, or comparing across like-for-like instances, etc., rather than relying on the actual values? e.g. sure, you'd see a drop if you rolled out a multi-api-server change ... but that might even be reassuring, and make a ton of sense? wdyt?
On the code ... I absolutely detest all this "make sure the env var is set before importing prometheus_client" stuff. Firstly, I'm skeptical that the env var needs to be set before importing prometheus_client? Yes, we need to set it before creating the first metric, but why before importing? Especially since we're not using […]. Maybe I'm missing something there, but I'd like to be really sure we have to lazy import ... that's always going to be super brittle. Secondly, if we can put all the prometheus multiproc nonsense in one place - e.g. its own module - […]. Does that make sense?
Thanks @kouroshHakha @markmc for that careful analysis!
I'm not sure we want to complicate the scheduler or introduce more work there... we should aim to avoid that if possible.
I'm not sure; it's possible the rate of the "count" of this histogram could be used to track iteration frequency and in various other derived "per iteration" metrics. Unfortunately the count also won't be a constant multiple of the "original" count, since the number of times it's recorded per engine per iteration will vary depending on how the requests are distributed between api servers / engines. If this is the only problematic one though, it seems reasonable to not block the PR and just document this along with the multi-api-server option.
I'm not sure exactly, and agree that's horrible; it's just a (possibly incorrect) recollection from when we were wrangling with this some time back in V0. Hopefully it's not really the case.
Good point and I very much agree with this! If we do end up having an import ordering issue it should make that easier to manage too.
Hey @markmc,
Yep, exactly. I liked your toy example - it's really illuminating. Let's consider a scenario where an engine is shared between two API servers; I think the conclusion would be the same though. Whether this is a problem or not goes back to what we want from the system. I think keeping it as is is a reasonable design choice, i.e. as you scale the API servers, the total sum remains the same, but it's broken down into smaller steps, therefore changing the histogram. If we want to go this route, I think the metric name is a bit inconsistent with what it represents: iteration_token_total suggests that, as long as the number of engines remains the same, for each engine I should see a similar distribution of tokens processed per step. Though what @njhill suggested could also be a problem - it depends on how much the end user relies on these metrics :) But I do agree with you that this is certainly not a big problem. We should certainly not block this PR for this @njhill.
@njhill @markmc I also think this was actually not necessary, because once I added the check of […] I will double-check this, and if the lazy import doesn't end up being a requirement I'll remove it as suggested. This will allow us to self-contain the Prometheus parts in one place for easier maintenance as well.
njhill
left a comment
Thanks @kouroshHakha
As well as the inline comments, it would be good to make the change suggested by @markmc to move all the prometheus-touching logic into the prometheus package, calling utility methods from there as needed.
vllm/v1/metrics/loggers.py
Outdated
    metrics_info["engine"] = self.engine_index

    name, documentation = None, None
    multiprocess_mode = "mostrecent"
What's the reason for this variable?
Waiting for a green light from you to refactor this function entirely. It's a bit weird that it's doing some conditioning, but the condition is always true when I search across the project globally.
I would say just leave it, at least for this PR? Could always look into it some more and open a separate PR to refactor...
vllm/v1/metrics/loggers.py
Outdated
    def build_buckets(mantissa_lst: list[int],
                      max_value: int) -> list[Union[int, float]]:
What's the reason for these changes? I'm not sure what's wrong with list[int], and regardless I think it's unrelated to the PR purpose?
Reverting - a linter artifact from when I had the indirection.
vllm/v1/metrics/loggers.py
Outdated
        self.labelname_running_lora_adapters,
-    ])
+    ],
+    multiprocess_mode="livemostrecent"
Is this lora_info gauge another potentially problematic one? Since it's updated via IterationStats and not SchedulerStats ... So I'm not sure "livemostrecent" is the right thing to use here.
I haven't looked closely at what this metric is or how it is computed; maybe "sum" would fit better. I have a feeling though even that might not be correct, because it may be counting e.g. the number of unique lora adapters across the running requests, and so it's not really possible to just combine the separate counts when the requests from a given engine are partitioned.
If there's not an easy answer we could include this in the list that we document as not being correct when multi-api-servers are in play.
In the v0 world, this was livemostrecent. I assumed it was intentional, so I kept it.
https://github.com/vllm-project/vllm/blob/main/vllm/engine/metrics.py#L80
It is potentially one of those problematic ones, but supporting lora metrics falls even lower in priority than the other one, so we may as well just record it in the docs. I'd keep it as is since it's coming from historical context anyway.
In V0 I don't think we were ever logging the same metrics in multiple places, so the aggregation mode was probably irrelevant. I do think "sum" is probably slightly less bad here. But probably we should just disable this metric when there are multiple API servers, since I think the values will just be wrong (and document that, of course).
I made it sum. We can simply document for now.
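A toy illustration (with made-up adapter names) of why "sum" can still overcount this gauge: each API server counts the unique adapters among the requests it happens to own, and summing per-server counts double-counts any adapter whose requests span servers:

```python
# Adapters among the running requests owned by each API server (hypothetical).
server_a = {"lora-sql", "lora-chat"}
server_b = {"lora-chat", "lora-code"}

summed = len(server_a) + len(server_b)  # what prometheus "sum" mode would report
actual = len(server_a | server_b)       # the true number of unique running adapters

print(summed, actual)  # 4 3: the shared adapter is counted twice
```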
… module Signed-off-by: kouroshhakha <kourosh@anyscale.com>
OK @njhill @markmc, I separated out the prometheus nonsense into its own python module under vllm/v1/metrics. We can follow up with moving […]
vllm/v1/metrics/prometheus.py
Outdated
    return REGISTRY


def mount_metrics(app):
Suggested change:

-def mount_metrics(app):
+def mount_metrics(app: FastAPI):

I think it might be better to leave this method in api_server.py, and just call get_prometheus_registry() from here.
So this method is very prometheus-heavy (meaning it's not just the registry but also other things like make_asgi_app, etc. that are coming from prometheus). This really belongs in the prometheus module of vLLM. WDYT?
vllm/v1/metrics/prometheus.py
Outdated
    app.routes.append(metrics_route)


def mark_process_dead(pid):
Suggested change:

-def mark_process_dead(pid):
+def mark_process_dead(pid: int):
vllm/v1/metrics/prometheus.py
Outdated
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return registry
    else:
vllm/v1/engine/async_llm.py
Outdated
    try:
        mark_process_dead(os.getpid())
        logger.debug("Marked Prometheus metrics for process %d as dead",
                     os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
Put the try/except inside the method too?
I think the caller should decide what they want to do. SG?
Agree with Nick. There's nothing the caller can do, and wrapping anything in such a broad try/except implies some knowledge of what the function is doing. I'd move the os.getpid() into the prometheus module too and call it shutdown_prometheus() or similar.
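A sketch of what that could look like (shutdown_prometheus is the name floated above; this is illustrative, not the merged code):

```python
import logging
import os

logger = logging.getLogger(__name__)

def shutdown_prometheus():
    """Mark this process dead for prometheus multiprocess mode; never raises."""
    try:
        # Import lazily so this is a harmless, logged no-op when
        # prometheus_client isn't installed or multiproc mode isn't in use.
        from prometheus_client import multiprocess
        multiprocess.mark_process_dead(os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
```

Callers then invoke shutdown_prometheus() unconditionally during teardown, with all pid lookup and exception handling self-contained in the prometheus module.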
    logger.debug("vLLM to use %s as PROMETHEUS_MULTIPROC_DIR",
                 prometheus_multiproc_dir_path)
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
Everything above should be in the prometheus module, returning a registry
    # Workaround for 307 Redirect for /metrics
    metrics_route.path_regex = re.compile("^/metrics(?P<path>.*)$")
    app.routes.append(metrics_route)
And all of the rest of it should remain in the API server module, using the registry returned from the prometheus module.
After doing this, it looks nicer, I have to admit :D
vllm/v1/metrics/prometheus.py
Outdated
    # Unregister any existing vLLM collectors
    for collector in list(REGISTRY._collector_to_names):
        if hasattr(collector, "_name") and "vllm" in collector._name:
            REGISTRY.unregister(collector)
One nice thing about having all this nonsense together in one module, you wonder ...
If we're using REGISTRY here, we're assuming multiprocess mode is not enabled? Maybe assert that the env var is not set?
Good catch .. the logic here should differ in the multiproc case, I think?
    from fastapi.middleware.cors import CORSMiddleware
    from fastapi.responses import JSONResponse, Response, StreamingResponse
    from prometheus_client import make_asgi_app
    from prometheus_fastapi_instrumentator import Instrumentator
Can't remove - mount_metrics is added back here, so it's needed. That's why I wanted to keep mount_metrics entirely in prometheus.
Doh, I misread, sorry. That's fine, this stuff is in the "api server" category, not the "disgusting prometheus multi proc hackery" category 😃
vllm/v1/metrics/prometheus.py
Outdated
    """Mark a process as dead in prometheus multiprocessing.

    Args:
        pid: Process ID to mark as dead
    self.gauge_scheduler_running = prometheus_client.Gauge(
        name="vllm:num_requests_running",
        documentation="Number of requests in model execution batches.",
        labelnames=labelnames).labels(*labelvalues)
If you're re-spinning again, then a very minor stylistic request ...
In the original version this line is all "label stuff". In the new version, it becomes "label stuff, new line, multiproc stuff, label stuff".
Sorry, I did not quite understand the desired style?
Oh, you mean the order? Put multiprocess_mode before the label stuff?
vllm/v1/metrics/loggers.py
Outdated
    documentation="Number of requests in model execution batches.",
-    labelnames=labelnames).labels(*labelvalues)
+    labelnames=labelnames,
+    multiprocess_mode="mostrecent").labels(*labelvalues)
Suggest (for all these multiprocess_mode changes):

    multiprocess_mode="mostrecent",
    labelnames=labelnames).labels(*labelvalues)
vllm/v1/metrics/loggers.py
Outdated
        self.labelname_running_lora_adapters,
-    ])
+    ],
+    multiprocess_mode="sum"
Could you add a comment here (maybe above, as part of the "LoRA metrics" comment) explaining that this metric will not be correct when using api-server scaleout, which uses prometheus mp mode.
Thanks for all of your help and patience with this @kouroshHakha!


No description provided.