[Misc][DP] Fix AsyncLLM metrics for multi-API server deployments#6
njhill merged 14 commits into njhill:all-to-all from …
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
njhill
left a comment
Thanks for this @kouroshHakha.
I have a few general comments:
When there is a single api-server, we log metrics that come back from each engine separately, with the engine index as a label. In the multi-api-server PR I changed the logic on the engine side to only send its SchedulerStats back to one of the client api-servers. So hopefully, for the metrics corresponding to these, not much more should be needed apart from making sure the gauges use "mostrecent" mode.
However, other metrics are computed during the loop in async_llm.py based on the requests that were processed in that iteration (in IterationStats). With multiple API servers, the processing of these requests for a given engine will in general be distributed amongst the api-servers since the outputs are sent back to each based on which originally sent the request.
These we may have to look at more closely and on a case-by-case basis since for example some are histograms where we assume the count corresponds to the number of iterations that have run on the corresponding engine, and we'll now be recording multiple of these (could be between 1 and num api servers).
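For intuition on the modes being discussed: prometheus_client gauges take a `multiprocess_mode` argument that determines how per-process samples are combined at scrape time. Here is a toy, stdlib-only simulation of those aggregation semantics (illustrative only; the real library aggregates mmap'd per-process files):

```python
# Toy simulation of prometheus_client gauge aggregation across processes.
# Each sample is (pid, write_timestamp, value).
def aggregate_gauge(samples, mode):
    if mode == "sum":  # add the values from all processes
        return sum(v for _, _, v in samples)
    if mode == "mostrecent":  # the most recently written value wins
        return max(samples, key=lambda s: s[1])[2]
    if mode == "all":  # keep one sample per process, labeled by pid
        return {pid: v for pid, _, v in samples}
    raise ValueError(f"unknown mode: {mode}")

samples = [(101, 1.0, 3), (102, 2.0, 5)]
print(aggregate_gauge(samples, "sum"))         # 8
print(aggregate_gauge(samples, "mostrecent"))  # 5
print(aggregate_gauge(samples, "all"))         # {101: 3, 102: 5}
```

The live* variants (livesum, livemostrecent, liveall) behave the same but only consider processes still marked alive, which is why mark_process_dead comes up later in this thread.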
From your PR description:
- Set "sum" as default mode for gauges with special handling for lora_info metric
I don't see this anywhere in the changes?
vllm/v1/metrics/loggers.py
Outdated
    def _create_counter(self, name: str, documentation: Optional[str],
                        labelnames: list[str]):
        return prometheus_client.Counter(name=name,
                                         documentation=documentation,
                                         labelnames=labelnames)
What's the purpose of adding this indirection if all of the args are just passed to the corresponding constructors?
Mostly modularity. We can extend these; for example, in PR vllm-project#17925 we want to wrap these primitives with their Ray equivalents.
It might make more sense to make this particular change in that other PR then, since it's not directly related to this one. cc @markmc
vllm/v1/engine/async_llm.py
Outdated
    # Upon shutdown of this process, we should mark the process as dead
    # See https://prometheus.github.io/client_python/multiprocess/
    try:
        import os

        from prometheus_client import multiprocess

        multiprocess.mark_process_dead(os.getpid())
        logger.debug("Marked Prometheus metrics for process %d as dead",
                     os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
Does it matter if we run this even when prometheus logging is disabled?
So this part of the shutdown logic runs through and is a no-op when log_stats=False. If we want to be pedantic about something like prometheus_client not being installed on the host, we could gate the logic on log_stats, but I think it's better to keep it like this. The only edge case I can think of is if the prometheus_client package is not installed, in which case the try-except block will just emit a logger.error rather than raising. So it's fine by default.
I added more comments to clarify the choice.
vllm/entrypoints/cli/serve.py
Outdated
    assert num_api_servers > 1
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        # Make TemporaryDirectory for prometheus multiprocessing
        # Note: global TemporaryDirectory will be automatically
        # cleaned up upon exit.
        global prometheus_multiproc_dir
        prometheus_multiproc_dir = tempfile.TemporaryDirectory()
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = prometheus_multiproc_dir.name
    else:
        logger.warning("Found PROMETHEUS_MULTIPROC_DIR was set by user. "
                       "This directory must be wiped between vLLM runs or "
                       "you will find inaccurate metrics. Unset the variable "
                       "and vLLM will properly handle cleanup.")
We should probably only do this if prometheus logging is enabled .. at least we probably shouldn't if log_stats is False.
Setting the env var even if log_stats is False is fine? It doesn't hurt? Why do you think we should gate it on log_stats, or, to be more precise, on whether the Prometheus logger is used?
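A minimal stdlib sketch of the lifecycle under discussion: create a TemporaryDirectory and export it only when the user hasn't already set the env var (the function name here is hypothetical, not the actual vLLM code):

```python
import os
import tempfile

# Keep a module-level reference so the TemporaryDirectory isn't garbage
# collected (and wiped) while the server is still running; it is cleaned
# up automatically at interpreter exit.
_prometheus_multiproc_dir = None

def setup_multiprocess_prometheus():
    """Point PROMETHEUS_MULTIPROC_DIR at a fresh temp dir unless the user set it."""
    global _prometheus_multiproc_dir
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        _prometheus_multiproc_dir = tempfile.TemporaryDirectory()
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = _prometheus_multiproc_dir.name
    return os.environ["PROMETHEUS_MULTIPROC_DIR"]

print(setup_multiprocess_prometheus())
```

Because the helper only touches the environment and the filesystem, running it unconditionally is cheap either way; the debate above is purely about whether gating on log_stats is worth the extra branch.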
vllm/v1/metrics/loggers.py
Outdated
    labelnames=labelnames,
    multiprocess_mode="all").labels(*labelvalues)

Suggested change:

-    labelnames=labelnames,
-    multiprocess_mode="all").labels(*labelvalues)
+    labelnames=labelnames).labels(*labelvalues)
@kouroshHakha in particular, I also don't think we should be using the "all" mode, which just labels by pid.
We also need to make sure that the PROMETHEUS_MULTIPROC_DIR env var is set. The docs recommend setting it externally but we obviously don't want to have to require that.
I think I changed it to livemostrecent (forgot to update the description).
@njhill I have thought about this for almost the entire day, looking at metrics and how they show up on Grafana, etc. My conclusion is that we have only one metric that is impacted by hitting multiple IterationStats records from different api_servers: the histogram of iteration tokens. You can also observe the diff on […]
Nice work thinking this through @kouroshHakha. Let's work through a simple example. With a single API server and a single engine […]. With 2 API servers and 2 engines […]. So the view you get is that the same number of tokens is being generated with more, smaller iterations? Is that going to be a problem? Surely people are watching trends, or comparing across like-for-like instances, etc., rather than relying on the actual values? e.g. sure, you'd see a drop if you rolled out a multi-api-server change ... but that might even be reassuring, and make a ton of sense? wdyt?
On the code ... I absolutely detest all this "make sure the env var is set before importing prometheus_client" stuff. Firstly, I'm skeptical that the env var needs to be set before importing prometheus_client? Yes, we need to set it before creating the first metric, but why before importing? Especially since we're not using […]. Maybe I'm missing something there, but I'd like to be really sure we have to lazy import ... that's always going to be super brittle. Secondly, if we can put all the prometheus multiproc nonsense in one place - e.g. its own module - […]. Does that make sense?
Thanks @kouroshHakha @markmc for that careful analysis!
I'm not sure we want to complicate the scheduler or introduce more work there... we should aim to avoid that if possible.
I'm not sure; it's possible the rate of the "count" of this histogram could be used to track iteration frequency and in various other derived "per iteration" metrics. Unfortunately the count also won't be a constant multiple of the "original" count, since the number of times it's recorded per engine per iteration will vary depending on how the requests are distributed between api servers / engines. If this is the only problematic one though, it seems reasonable to not block the PR and just document this along with the multi-api-server option.
I'm not sure exactly, and agree that's horrible; it's just a (possibly incorrect) recollection from when we were wrangling with this some time back in V0. Hopefully it's not really the case.
Good point and I very much agree with this! If we do end up having an import ordering issue it should make that easier to manage too.
Hey @markmc,
Yep, exactly. I liked your toy example - it's really illuminating. Let's consider a scenario where an engine is shared between two API servers; I think the conclusion would be the same though. Whether this is a problem or not goes back to what we want from the system. I think keeping it as is is a reasonable design choice, i.e. as you scale the API servers, the total sum remains the same, but it's broken down into smaller steps, therefore changing the histogram. If we want to go this route, I think the metric name is a bit inconsistent with what it represents: iteration_token_total suggests that, as long as the number of engines remains the same, for each engine I should see a similar distribution of tokens processed per step. Though what @njhill suggested could also be a problem - it depends on how much the end user relies on these metrics :) But I do agree with you that this is certainly not a big problem. We should certainly not block this PR for this @njhill.
@njhill @markmc I also think this was actually not necessary, because once I added the check of […] I will double-check this, and if the lazy import doesn't end up being a requirement I'll remove it as suggested. This will allow us to self-contain the Prometheus parts in one place for easier maintenance as well.
njhill
left a comment
Thanks @kouroshHakha
As well as the inline comments, it would be good to make the change suggested by @markmc to move all the prometheus-touching logic into the prometheus package, calling utility methods from there as needed.
vllm/v1/metrics/loggers.py
Outdated
    metrics_info["engine"] = self.engine_index

    name, documentation = None, None
    multiprocess_mode = "mostrecent"
What's the reason for this variable?
Waiting for a green light from you to refactor this function entirely. It's a bit weird that it's doing some conditioning, but the condition is always true when I search across the project globally.
I would say just leave it, at least for this PR? Could always look into it some more and open a separate PR to refactor...
vllm/v1/metrics/loggers.py
Outdated
    def build_buckets(mantissa_lst: list[int],
                      max_value: int) -> list[Union[int, float]]:
What's the reason for these changes? I'm not sure what's wrong with list[int], and regardless I think it's unrelated to the PR purpose?
Reverting - a linter artifact from when I had the indirection.
vllm/v1/metrics/loggers.py
Outdated
        self.labelname_running_lora_adapters,
-    ])
+    ],
+    multiprocess_mode="livemostrecent"
Is this lora_info gauge another potentially problematic one? Since it's updated via IterationStats and not SchedulerStats ... So I'm not sure "livemostrecent" is the right thing to use here.
I haven't looked closely at what this metric is or how it is computed; maybe "sum" would fit better. I have a feeling though even that might not be correct, because it may be counting e.g. the number of unique lora adapters across the running requests, and so it's not really possible to just combine the separate counts when the requests from a given engine are partitioned.
If there's not an easy answer we could include this in the list that we document as not being correct when multi-api-servers are in play.
In the v0 world, this was livemostrecent. I assumed it was intentional, so I kept it.
https://github.com/vllm-project/vllm/blob/main/vllm/engine/metrics.py#L80
It is potentially one of those problematic ones, but supporting lora metrics falls even lower in priority than the other one, so we may as well just record it in the docs. I'd keep it as is since it's coming from historical context anyway.
In V0 I don't think we were ever logging the same metrics in multiple places, so the aggregation mode was probably irrelevant. I do think "sum" is probably slightly less bad here. But probably we should just disable this metric when there are multiple API servers, since I think the values will just be wrong (and document that, of course).
I made it sum. We can simply document for now.
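A toy illustration (with made-up adapter names) of why "sum" can still overcount this gauge: each API server counts the unique adapters among the requests it happens to own, and summing per-server counts double-counts any adapter whose requests span servers:

```python
# Adapters among the running requests owned by each API server (hypothetical).
server_a = {"lora-sql", "lora-chat"}
server_b = {"lora-chat", "lora-code"}

summed = len(server_a) + len(server_b)  # what prometheus "sum" mode would report
actual = len(server_a | server_b)       # the true number of unique running adapters

print(summed, actual)  # 4 3: the shared adapter is counted twice
```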
… module Signed-off-by: kouroshhakha <kourosh@anyscale.com>
OK @njhill @markmc, I separated out the prometheus nonsense into its own python module under vllm/v1/metrics. We can follow up with moving […]
vllm/v1/metrics/prometheus.py
Outdated
    return REGISTRY


def mount_metrics(app):
Suggested change:

-def mount_metrics(app):
+def mount_metrics(app: FastAPI):

I think it might be better to leave this method in api_server.py, and just call get_prometheus_registry() from here.
So this method is very prometheus-heavy (meaning it's not just the registry but also other things like make_asgi_app, etc. that are coming from prometheus). This really belongs in the prometheus module of vLLM. WDYT?
vllm/v1/metrics/prometheus.py
Outdated
    app.routes.append(metrics_route)


def mark_process_dead(pid):
Suggested change:

-def mark_process_dead(pid):
+def mark_process_dead(pid: int):
vllm/v1/metrics/prometheus.py
Outdated
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return registry
    else:
vllm/v1/engine/async_llm.py
Outdated
    try:
        mark_process_dead(os.getpid())
        logger.debug("Marked Prometheus metrics for process %d as dead",
                     os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
Put the try/except inside the method too?
I think the caller should decide what they want to do. SG?
Agree with Nick. There's nothing the caller can do, and wrapping anything in such a broad try/except implies some knowledge of what the function is doing. I'd move the os.getpid() into the prometheus module too and call it shutdown_prometheus() or similar.
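A sketch of what that could look like (shutdown_prometheus is the name floated above; this is illustrative, not the merged code):

```python
import logging
import os

logger = logging.getLogger(__name__)

def shutdown_prometheus():
    """Mark this process dead for prometheus multiprocess mode; never raises."""
    try:
        # Import lazily so this is a harmless, logged no-op when
        # prometheus_client isn't installed or multiproc mode isn't in use.
        from prometheus_client import multiprocess
        multiprocess.mark_process_dead(os.getpid())
    except Exception as e:
        logger.error("Error during metrics cleanup: %s", str(e))
```

Callers then invoke shutdown_prometheus() unconditionally during teardown, with all pid lookup and exception handling self-contained in the prometheus module.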
    logger.debug("vLLM to use %s as PROMETHEUS_MULTIPROC_DIR",
                 prometheus_multiproc_dir_path)
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
Everything above should be in the prometheus module, returning a registry
    # Workaround for 307 Redirect for /metrics
    metrics_route.path_regex = re.compile("^/metrics(?P<path>.*)$")
    app.routes.append(metrics_route)
And all of the rest of it should remain in the API server module, using the registry returned from the prometheus module.
After doing this, it looks nicer, I have to admit :D
vllm/v1/metrics/prometheus.py
Outdated
    # Unregister any existing vLLM collectors
    for collector in list(REGISTRY._collector_to_names):
        if hasattr(collector, "_name") and "vllm" in collector._name:
            REGISTRY.unregister(collector)
One nice thing about having all this nonsense together in one module, you wonder ...
If we're using REGISTRY here, we're assuming multiprocess mode is not enabled? Maybe assert that the env var is not set?
Good catch .. the logic here should differ in the multiproc case, I think?
    from fastapi.middleware.cors import CORSMiddleware
    from fastapi.responses import JSONResponse, Response, StreamingResponse
    from prometheus_client import make_asgi_app
    from prometheus_fastapi_instrumentator import Instrumentator
Can't remove - mount_metrics is added back here, so it's needed. That's why I wanted to keep mount_metrics entirely in prometheus.
Doh, I misread, sorry. That's fine, this stuff is in the "api server" category, not the "disgusting prometheus multi proc hackery" category 😃
vllm/v1/metrics/prometheus.py
Outdated
    """Mark a process as dead in prometheus multiprocessing.

    Args:
        pid: Process ID to mark as dead
    self.gauge_scheduler_running = prometheus_client.Gauge(
        name="vllm:num_requests_running",
        documentation="Number of requests in model execution batches.",
        labelnames=labelnames).labels(*labelvalues)
If you're re-spinning again, then a very minor stylistic request ...
In the original version this line is all "label stuff". In the new version, it becomes "label stuff, new line, multiproc stuff, label stuff".
Sorry, I did not quite understand the desired style?
Oh, you mean the order? Put multiprocess_mode before the label stuff?
vllm/v1/metrics/loggers.py
Outdated
    documentation="Number of requests in model execution batches.",
-    labelnames=labelnames).labels(*labelvalues)
+    labelnames=labelnames,
+    multiprocess_mode="mostrecent").labels(*labelvalues)
Suggest (for all these multiprocess_mode changes):

    multiprocess_mode="mostrecent",
    labelnames=labelnames).labels(*labelvalues)
vllm/v1/metrics/loggers.py
Outdated
        self.labelname_running_lora_adapters,
-    ])
+    ],
+    multiprocess_mode="sum"
Could you add a comment here (maybe above, as part of the "LoRA metrics" comment) explaining that this metric will not be correct when using api-server scaleout, which uses prometheus mp mode.
Thanks for all of your help and patience with this @kouroshHakha!


No description provided.