[DP] Fix Prometheus Logging #21257

robertgshaw2-redhat · 2025-07-20T17:22:59Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Previously, we created a PrometheusStatLogger for each EngineCore. This looks fine on the surface, but only the final EngineCore was able to log stats, because each constructor reset the Prometheus state by calling unregister_vllm_metrics.
Simply removing the unregister_vllm_metrics call does not work, because the metrics can only be *created* once. What we want is multiple labels for the same metric, not multiple metrics.
This PR adds a class called StatLoggerManager to deal with this:

StatLoggerManager:
        Logging happens at the level of the EngineCore (per scheduler).
         * DP: >1 EngineCore per AsyncLLM - loggers for each EngineCore.
         * With Local Logger, just make N copies for N EngineCores.
         * With Prometheus, we need a single logger with N "labels"
        This class abstracts away this implementation detail from
        the AsyncLLM, allowing the AsyncLLM to just call .record()
        and .log() to a simple interface.
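
The split described in the docstring above can be sketched in a few lines. This is only an illustration of the idea, not the code added in this PR: LocalLoggerSketch, LabeledLoggerSketch, and StatLoggerManagerSketch are hypothetical names, and the stats payload is a plain dict chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LocalLoggerSketch:
    """Hypothetical per-engine logger: one instance is created per engine."""
    engine_index: int
    records: list = field(default_factory=list)

    def record(self, stats: dict) -> None:
        self.records.append(stats)


class LabeledLoggerSketch:
    """Hypothetical Prometheus-style logger: one instance, N engine labels."""

    def __init__(self, engine_indexes: list) -> None:
        # One bucket per engine label, all owned by the same logger object.
        self.by_engine = {idx: [] for idx in engine_indexes}

    def record(self, stats: dict, engine_index: int) -> None:
        self.by_engine[engine_index].append(stats)


class StatLoggerManagerSketch:
    """Hides the 'N copies vs. one labeled logger' detail from the caller."""

    def __init__(self, engine_indexes: list) -> None:
        # Local-style loggers: N copies for N engines.
        self.local = {idx: LocalLoggerSketch(idx) for idx in engine_indexes}
        # Prometheus-style logger: a single object shared by all engines.
        self.shared = LabeledLoggerSketch(list(engine_indexes))

    def record(self, stats: dict, engine_index: int) -> None:
        self.local[engine_index].record(stats)
        self.shared.record(stats, engine_index)


manager = StatLoggerManagerSketch([0, 1])
manager.record({"running": 4}, engine_index=0)
manager.record({"running": 7}, engine_index=1)
```

The caller only passes an engine index to record(); whether that index selects a separate logger object or a label on a shared one stays inside the manager.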

This PR refactors PrometheusStatLogger to enable logging from multiple engine cores.
This PR ensures that the AsyncLLM logs metrics only for the EngineCores that it directly manages.
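
The "create once, label N times" constraint mentioned above can be reproduced with the prometheus_client package directly. The following is a minimal sketch, assuming prometheus_client is installed; it uses a private CollectorRegistry, a placeholder help string, and a placeholder model name ("demo-model"), none of which come from this PR.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

# Create the metric exactly once, declaring the label names up front.
running = Gauge(
    "vllm:num_requests_running",
    "Placeholder help text for this sketch.",
    labelnames=["model_name", "engine"],
    registry=registry,
)

# One labeled child per engine index; each child is its own time series.
per_engine = {
    idx: running.labels(model_name="demo-model", engine=str(idx))
    for idx in range(2)
}
per_engine[0].set(4)
per_engine[1].set(7)

# Creating the same metric again in the same registry is rejected,
# which is why each engine cannot construct its own copy.
try:
    Gauge(
        "vllm:num_requests_running",
        "Duplicate definition.",
        labelnames=["model_name", "engine"],
        registry=registry,
    )
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# Text exposition, as served on a /metrics endpoint.
exposition = generate_latest(registry).decode()
print(exposition)
```

Both engines end up as separate samples of the same metric, distinguished only by the engine label, which matches the shape of the /metrics output in the test results.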

Follow up:

Make it work with LoRA
Make it work with SpecDecoding
Make it work with elastic EP

Test Plan

existing CI
justfile

MODEL := "Qwen/Qwen3-30B-A3B-FP8"

tp PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --disable-log-requests

dp_a_internal_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_internal_lb:
  vllm serve {{MODEL}} \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_a_external_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_external_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 1 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests


eval PORT CONCURRENT LIMIT:
  lm_eval --model local-completions --tasks gsm8k \
    --model_args model={{MODEL}},base_url=http://127.0.0.1:{{PORT}}/v1/completions,num_concurrent={{CONCURRENT}},num_retries=0,tokenized_requests=False \
    --limit {{LIMIT}}

metrics PORT:
  curl http://localhost:{{PORT}}/metrics

Test Result

Sample:

tp:

just tp 8100
just eval 8100
just metrics 8100

# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 4.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:05:20 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

dp (internal lb): the head node gives logs for all ranks

just dp_a_internal_lb 8100
just dp_b_internal_lb
just eval 8100
just metrics 8100

vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.6821875921956284e-05
vllm:kv_cache_usage_perc{engine="1",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="2",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="3",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.663896214605277e-05

INFO 07-20 18:10:58 [loggers.py:122] Engine 000: Avg prompt throughput: 3064.0 tokens/s, Avg generation throughput: 359.0 tokens/s, Running: 26 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 001: Avg prompt throughput: 2508.4 tokens/s, Avg generation throughput: 353.7 tokens/s, Running: 25 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 002: Avg prompt throughput: 1962.9 tokens/s, Avg generation throughput: 353.5 tokens/s, Running: 24 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 003: Avg prompt throughput: 2619.2 tokens/s, Avg generation throughput: 354.6 tokens/s, Running: 25 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.6%

dp (external lb): each node gives logs for its own rank

just dp_a_external_lb 8100
just dp_b_external_lb 8200
just eval 8100
just eval 8200
just metrics 8100
just metrics 8200

rank 0:

vllm:request_success_total{engine="0",finished_reason="stop",model_name="Qwen/Qwen3-30B-A3B-FP8"} 88.0
vllm:request_success_total{engine="0",finished_reason="length",model_name="Qwen/Qwen3-30B-A3B-FP8"} 12.0
vllm:request_success_total{engine="0",finished_reason="abort",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:15:37 [loggers.py:122] Engine 000: Avg prompt throughput: 10130.6 tokens/s, Avg generation throughput: 506.1 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 20.0%, Prefix cache hit rate: 0.0%

rank 1:

vllm:request_success_total{engine="1",finished_reason="stop",model_name="Qwen/Qwen3-30B-A3B-FP8"} 88.0
vllm:request_success_total{engine="1",finished_reason="length",model_name="Qwen/Qwen3-30B-A3B-FP8"} 12.0
vllm:request_success_total{engine="1",finished_reason="abort",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:15:47 [loggers.py:122] Engine 001: Avg prompt throughput: 10129.2 tokens/s, Avg generation throughput: 894.6 tokens/s, Running: 69 reqs, Waiting: 0 reqs, GPU KV cache usage: 14.4%, Prefix cache hit rate: 0.0%
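
To check outputs like the ones above without reading them by eye, the /metrics text can be grouped by the engine label. This is a small stdlib-only sketch; the helper name engines_per_metric is made up for this example, and the sample text is copied from the internal-lb result above.

```python
import re
from collections import defaultdict

# Sample lines copied from the dp (internal lb) /metrics output.
SAMPLE = """\
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.6821875921956284e-05
vllm:kv_cache_usage_perc{engine="1",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="2",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="3",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.663896214605277e-05
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def engines_per_metric(text: str) -> dict:
    """Map each metric name to the sorted list of engine labels seen for it."""
    seen = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if "engine" in labels:
            seen[match.group("name")].add(labels["engine"])
    return {name: sorted(engines) for name, engines in seen.items()}


result = engines_per_metric(SAMPLE)
print(result)
```

With internal load balancing, the head node should list every engine index for each metric; with external load balancing, each node should list only its own rank.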

(Optional) Documentation Update

Signed-off-by: Robert Shaw <[email protected]>

mergify · 2025-07-20T18:47:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

DarkLight1337 · 2025-07-21T16:13:07Z

Merging to unblock release

njhill · 2025-07-21T18:09:29Z

I will do a retroactive review :)

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Signed-off-by: Diego-Castan <[email protected]>

Robert Shaw added 30 commits July 19, 2025 16:27

added debug logging

14f13ed

Signed-off-by: Robert Shaw <[email protected]>

updated

b90d331

Signed-off-by: Robert Shaw <[email protected]>

updated

aefeeed

Signed-off-by: Robert Shaw <[email protected]>

updated

59a9583

Signed-off-by: Robert Shaw <[email protected]>

updated

48cf09b

Signed-off-by: Robert Shaw <[email protected]>

updated

2fd0587

Signed-off-by: Robert Shaw <[email protected]>

updated

14cf3c4

Signed-off-by: Robert Shaw <[email protected]>

updated

4f5d3ea

Signed-off-by: Robert Shaw <[email protected]>

updated

14db660

Signed-off-by: Robert Shaw <[email protected]>

updated

2aa4975

Signed-off-by: Robert Shaw <[email protected]>

cleanup

b142571

Signed-off-by: Robert Shaw <[email protected]>

updated

e1843b7

Signed-off-by: Robert Shaw <[email protected]>

updated

d2d54e9

Signed-off-by: Robert Shaw <[email protected]>

fix lb issues

4438796

Signed-off-by: Robert Shaw <[email protected]>

updated

2a68433

Signed-off-by: Robert Shaw <[email protected]>

updatedd

1ced153

Signed-off-by: Robert Shaw <[email protected]>

nits

b9c0f65

Signed-off-by: Robert Shaw <[email protected]>

nits

dbc51d6

Signed-off-by: Robert Shaw <[email protected]>

updated

471fa4a

Signed-off-by: Robert Shaw <[email protected]>

stash

6569fac

Signed-off-by: Robert Shaw <[email protected]>

stash

1e5303a

Signed-off-by: Robert Shaw <[email protected]>

convert to use only one prometheus stat logger per async llm

a69edca

Signed-off-by: Robert Shaw <[email protected]>

convert to use only one prometheus stat logger per async llm

de91a3c

Signed-off-by: Robert Shaw <[email protected]>

cleanup prometheus logging

e08e1e9

Signed-off-by: Robert Shaw <[email protected]>

updated

d39cf93

Signed-off-by: Robert Shaw <[email protected]>

updated

9a2e26d

Signed-off-by: Robert Shaw <[email protected]>

updated

3956d8c

Signed-off-by: Robert Shaw <[email protected]>

updated

cad9670

Signed-off-by: Robert Shaw <[email protected]>

updated

fd0650f

Signed-off-by: Robert Shaw <[email protected]>

updated

896b0a2

Signed-off-by: Robert Shaw <[email protected]>

Robert Shaw added 2 commits July 20, 2025 17:52

cleanup

4b50833

Signed-off-by: Robert Shaw <[email protected]>

stash

eb5b84e

Signed-off-by: Robert Shaw <[email protected]>

mergify bot added the needs-rebase label Jul 20, 2025

merged

efdeb01

Signed-off-by: Robert Shaw <[email protected]>

mergify bot removed the needs-rebase label Jul 20, 2025

robertgshaw2-redhat mentioned this pull request Jul 20, 2025

[Bug]: Prometheus DP Metrics #21260

Closed

1 task

simon-mo added this to the v0.10.0 milestone Jul 20, 2025

robertgshaw2-redhat requested a review from simon-mo July 20, 2025 19:00

robertgshaw2-redhat added the ready label Jul 20, 2025

robertgshaw2-redhat mentioned this pull request Jul 21, 2025

[DP] Internal Load Balancing Per Node [one-pod-per-node] #21238

Merged

4 tasks

Robert Shaw added 3 commits July 21, 2025 00:53

fixing tests

c54c17e

Signed-off-by: Robert Shaw <[email protected]>

passing

4be985d

Signed-off-by: Robert Shaw <[email protected]>

get other failing tst to pass

20e7f17

Signed-off-by: Robert Shaw <[email protected]>

vllm-bot merged commit 29d1ffc into vllm-project:main Jul 21, 2025
65 of 67 checks passed

ruisearch42 mentioned this pull request Jul 21, 2025

[Misc] Have AsyncLLM custom_stat_loggers extend default logger list #20952

Merged

4 tasks

eicherseiji added a commit to eicherseiji/vllm that referenced this pull request Jul 22, 2025

Adapt to vllm-project#21257

82a3a09

Signed-off-by: Seiji Eicher <[email protected]>

x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

83b9362

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Signed-off-by: x22x22 <[email protected]>

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

cc92410

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]>

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

179dec6

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]>

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

3b0f682

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

a9f0d50

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Signed-off-by: Paul Pak <[email protected]>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025

[DP] Fix Prometheus Logging (vllm-project#21257)

ad81fe8

Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]>
