[Misc][DP] Fix AsyncLLM metrics for multi-API server deployments by kouroshHakha · Pull Request #18053 · vllm-project/vllm

kouroshHakha · 2025-05-13T06:57:08Z

Problem

This PR addresses one of the problems with #17546: metrics inconsistency when the number of API servers is greater than 1. When running multiple API server instances with the V1 implementation, metrics were inconsistently collected and aggregated. This happened because:

In V1, the AsyncLLM wasn't setting up the PROMETHEUS_MULTIPROC_DIR environment variable needed for proper multi-process metrics collection.
The multiprocess modes for different metric types (gauges, counters, histograms) were incorrectly configured, causing metrics to be double-counted or improperly aggregated.
Process cleanup code was missing, preventing metrics from being marked as dead when processes exit.

Solution

This PR ensures consistent metrics handling across multiple API servers in V1 by:
Setting up the PROMETHEUS_MULTIPROC_DIR environment variable during AsyncLLM initialization, borrowing some of the existing tricks from V0.

Updating the multiprocess_mode settings in the metric loggers to match the V0 implementation for some gauges, such as lora_request_info.
Setting "mostrecent" as the default mode for gauges.
Adding proper process cleanup code so that metrics are handled correctly when processes exit.
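
A minimal sketch of the first and last points, assuming the usual prometheus_client multiprocess setup. The helper names `setup_multiprocess_prometheus` and `mark_current_process_dead` are hypothetical and are not necessarily what this PR adds; the only library call used is `prometheus_client.multiprocess.mark_process_dead`.

```python
import os
import tempfile

_ENV_VAR = "PROMETHEUS_MULTIPROC_DIR"
# Keep a reference so the directory is not removed when the
# TemporaryDirectory object is garbage-collected.
_multiproc_dir = None


def setup_multiprocess_prometheus() -> str:
    """Hypothetical helper: ensure PROMETHEUS_MULTIPROC_DIR is set.

    Respects a directory the user already configured; otherwise creates
    a temporary one. Intended to run before any metrics are created.
    """
    global _multiproc_dir
    if _ENV_VAR not in os.environ:
        _multiproc_dir = tempfile.TemporaryDirectory()
        os.environ[_ENV_VAR] = _multiproc_dir.name
    return os.environ[_ENV_VAR]


def mark_current_process_dead() -> None:
    """Hypothetical helper: call on shutdown of an API server process.

    Requires the third-party prometheus_client package. It tells the
    multiprocess collector that this PID is gone, so "live*" gauge
    modes stop reporting it.
    """
    from prometheus_client import multiprocess

    multiprocess.mark_process_dead(os.getpid())
```

Usage would be roughly: call `setup_multiprocess_prometheus()` once at startup in the parent, so every API server process inherits the same directory, and call `mark_current_process_dead()` from each process's shutdown path.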

Result

Comparison of -asc=1 and -asc=16 for a counter metric such as num_prompt_tokens over time, on a fixed workload with ~2M input tokens.

Known issues that still need to be addressed (later)

The histogram of `vllm:iteration_tokens_total` will not be accurate when asc > 1

The current fundamental assumption is that, by analyzing all the requests that come back from the engines, we can construct an IterationStats that includes num_generation_tokens, num_preempted_tokens, num_prompt_tokens, etc.

This assumption no longer holds with multiple api_servers. With multiple api_server processes, each front-end gets only a sub-batch of the requests that came from the same engine step. Therefore the IterationStats constructed from these requests has a partial view. For example, the num_generation_tokens seen by one front-end is not the total for that iteration; it is only that front-end's share of it.

Most of the metrics in IterationStats are fine, because they fall into two categories:

They are invariant to the notion of an actual engine iteration. For example, things like ttft are not affected.
They are counters. If num_generation_tokens is logged in Prometheus and is set up as a counter, it will be summed anyway.
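
To make the second category concrete, here is a minimal sketch (not vLLM code) of one engine step whose outputs are split across two API servers. The `IterationStats` below is a simplified stand-in with only two fields, and the way requests are split between servers is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IterationStats:
    """Simplified stand-in for the real IterationStats."""
    num_generation_tokens: int = 0
    num_prompt_tokens: int = 0


def build_stats(outputs):
    """Aggregate (generation_tokens, prompt_tokens) pairs, one per request."""
    stats = IterationStats()
    for gen_tokens, prompt_tokens in outputs:
        stats.num_generation_tokens += gen_tokens
        stats.num_prompt_tokens += prompt_tokens
    return stats


# One engine step with four requests: (generation tokens, prompt tokens).
step_outputs = [(1, 0), (1, 0), (1, 128), (1, 0)]

# What a single API server would see: the whole step.
engine_view = build_stats(step_outputs)

# With two API servers, each one only sees its own sub-batch.
server_views = [build_stats(step_outputs[:2]), build_stats(step_outputs[2:])]

# Counters are summed across processes, so the total is unchanged.
counter_total = sum(v.num_generation_tokens for v in server_views)
```

Each entry in `server_views` is a partial view of the step, but because counters are summed across processes, `counter_total` still matches `engine_view.num_generation_tokens`.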

The histogram of vllm:iteration_tokens_total does not fall into either of these categories. Proof:

You can also observe the diff in vllm:iteration_tokens_total on the same workload. At first glance, solving this is not straightforward: I think it would require making the scheduler logic more complex just to keep track of some of these iteration-level metrics. Since this metric is not that important, it is not urgent to solve this issue right now. The caveat is that any new histogram we add later over other per-iteration metrics, such as num_generation_tokens, will have the same problem in the asc > 1 case.
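
As a rough illustration of why a per-iteration histogram is distorted, the sketch below records per-step token totals into buckets, once with a single front-end and once with the same steps split across two front-ends. The bucket boundaries, the token counts, and the round-robin split are all made up for the example and are not taken from vLLM.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical histogram bucket upper bounds.
BUCKETS = [1, 8, 16, 32, 64, 128]


def bucket_of(value):
    """Return the smallest bucket upper bound that is >= value."""
    i = bisect_left(BUCKETS, value)
    return BUCKETS[i] if i < len(BUCKETS) else float("inf")


def observe(steps, num_servers):
    """One observation per (engine step, API server) pair.

    Each server sums the tokens of the requests it received, assuming
    requests are assigned round-robin.
    """
    observations = []
    for step in steps:
        for server in range(num_servers):
            sub_batch = step[server::num_servers]
            if sub_batch:
                observations.append(sum(sub_batch))
    return observations


# Tokens per request for three engine steps.
steps = [[10, 10, 10, 10], [5, 5, 5, 5], [30, 30, 30, 30]]

single = observe(steps, 1)
multi = observe(steps, 2)

single_hist = Counter(bucket_of(v) for v in single)
multi_hist = Counter(bucket_of(v) for v in multi)
```

In this toy setup the sum of all observations is the same in both cases, which is why counters survive, but the number of observations doubles and the values land in smaller buckets, so the histogram no longer describes tokens per engine iteration.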

NOTE to Reviewer

This PR is built on top of #17546, so that PR has to be merged first.