[Misc][DP] support customized aggregated logger for dp #24354

Merged
luccafong merged 6 commits into vllm-project:main from luccafong:support_global_dp_logging
Oct 14, 2025

Conversation


@luccafong luccafong commented Sep 6, 2025

Purpose

  • Support customized global logger for data parallel. This allows users to provide custom global loggers that can aggregate metrics across multiple DP instances.
  • Update the multi-host DP instance example with customized loggers.

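As a rough illustration of what such a custom aggregated logger does, here is a minimal self-contained sketch. The class and field names below (`SchedulerStats`, `AggregatedLogger`, `record`) are hypothetical stand-ins, not vLLM's actual `StatLoggerBase` API, which has a richer interface and is constructed via a factory.

```python
from dataclasses import dataclass


# Stand-in for vLLM's per-engine scheduler stats; the real class
# carries many more fields (KV cache usage, prefix cache hit rate, ...).
@dataclass
class SchedulerStats:
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0


class AggregatedLogger:
    """Collects per-engine stats and emits one combined line,
    mirroring the idea behind this PR's aggregated DP logger."""

    def __init__(self, engine_idxs):
        self.engine_idxs = engine_idxs
        # Latest stats snapshot per DP engine index.
        self.per_engine = {idx: SchedulerStats() for idx in engine_idxs}

    def record(self, engine_idx, stats: SchedulerStats):
        self.per_engine[engine_idx] = stats

    def log(self) -> str:
        # Aggregate across all DP engines instead of one line per engine.
        running = sum(s.num_running_reqs for s in self.per_engine.values())
        waiting = sum(s.num_waiting_reqs for s in self.per_engine.values())
        return (f"{len(self.engine_idxs)} Engines Aggregated: "
                f"Running: {running} reqs, Waiting: {waiting} reqs")


logger = AggregatedLogger(engine_idxs=[0, 1])
logger.record(0, SchedulerStats(num_running_reqs=1))
logger.record(1, SchedulerStats(num_waiting_reqs=2))
print(logger.log())  # → 2 Engines Aggregated: Running: 1 reqs, Waiting: 2 reqs
```

This matches the shape of the "N Engines Aggregated" lines in the test logs below, where per-engine counters are summed before a single line is emitted.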
Test Plan

  • Unit tests
  • Verify in the online e2e example that both per-engine and aggregated loggers work correctly
  • Verify with e2e online serving and benchmarking that --aggregate-engine-logging works
  • Verify with e2e online serving and benchmarking that default behavior is unchanged

Test Result

  • Unit tests pass
python -m pytest tests/v1/engine/test_async_llm.py::test_customize_aggregated_loggers -v
============================= 1 passed, 3 warnings in 28.54s ==============================

python -m pytest tests/v1/engine/test_async_llm.py::test_customize_loggers -v
============================= 1 passed, 3 warnings in 24.11s ==============================

  • e2e test with online examples:
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=WARN python3 /data/users/fanglu/gitrepos/vllm/examples/online_serving/multi_instance_data_parallel.py
CUDA_VISIBLE_DEVICES=1  vllm serve ibm-research/PowerMoE-3b -dp 2 -dpr 1         --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 62300 --max-model-len 2048 --max-num-batched-tokens 2048          --data-parallel-size-local 1 --enforce-eager --headless

logs:

DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:190] 2 Engines Aggregated: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

  • e2e test with serving

Add option --aggregate-engine-logging to enable aggregated log stats.

serve

vllm serve "/data/local/models/DeepSeek-V3.1" --max_model_len=4096 --gpu_memory_utilization=0.9 --tensor_parallel_size 1 --data_parallel_size 8 --enable_expert_parallel  --max_num_seqs=256  --aggregate-engine-logging --enforce-eager > /tmp/test_serving.log 2>&1 &

benchmark

vllm bench serve  --model "/data/local/models/DeepSeek-V3.1" --dataset-name random  --random-input-len 2048 --random-output-len 1024  --num-prompts 400 --host 127.0.0.1 --port 8000 --ignore-eos --random-range-ratio 0 --max-concurrency 200 --ready-check-timeout-sec 0

Logs:

(APIServer pid=735328) INFO 10-13 11:43:03 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 13106.5 tokens/s, Avg generation throughput: 28.8 tokens/s, Running: 168 reqs, Waiting: 32 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:13 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 8184.7 tokens/s, Avg generation throughput: 1318.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.8%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:23 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1339.9 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.4%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:33 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1340.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.1%, Prefix cache hit rate: 0.0%

  • e2e test with serving with default settings
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 000: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 001: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 002: Avg prompt throughput: 1843.1 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 1.9%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 003: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 004: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 005: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 006: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 007: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@mergify mergify bot added the documentation (Improvements or additions to documentation) and v1 labels Sep 6, 2025
@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for a customized global logger for data parallel (DP) training, which is a valuable addition for monitoring distributed training. The changes primarily affect the V1 engine, with corresponding updates to examples and tests. My review has identified a few issues: a bug in the updated example code where a request ID is not unique within a loop, a leftover debug print statement in the new logger logic, and an unused parameter in the V0 engine's API that could cause confusion. Addressing these points will improve the quality and correctness of the implementation.

Comment thread examples/online_serving/multi_instance_data_parallel.py Outdated
Comment thread vllm/engine/async_llm_engine.py Outdated
Comment thread vllm/v1/metrics/loggers.py Outdated

@luccafong luccafong force-pushed the support_global_dp_logging branch from cb2d450 to 893ea0f Compare September 6, 2025 01:08

@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from cb2d450 to 7bfddba Compare September 6, 2025 01:17


mergify bot commented Sep 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7bfddba to 8156746 Compare September 6, 2025 01:21
@mergify mergify bot removed the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from 8633cf4 to 562f552 Compare September 8, 2025 23:58
Comment thread vllm/v1/metrics/loggers.py Outdated
# Each EngineCore's metrics are expressed as a unique label.
self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs)
self.global_logger: Optional[StatLoggerBase] = None
Collaborator


let's name this as aggregated_logger?

Member


Perhaps type here better as GlobalStatLoggerBase

Comment thread vllm/v1/metrics/loggers.py Outdated
@@ -145,6 +156,63 @@ def log_engine_initialized(self):
self.vllm_config.cache_config.num_gpu_blocks)


class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase):
Collaborator


AggregatedLogger is a better name.

Collaborator


Add some comments to explain the difference?

Comment thread vllm/v1/engine/async_llm.py Outdated
@@ -222,6 +230,7 @@ def from_engine_args(
start_engine_loop: bool = True,
usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
stat_loggers: Optional[list[StatLoggerFactory]] = None,
stat_logger_global: Optional[GlobalStatLoggerFactory] = None,
Collaborator


I think we should rename this per_engine_loggers and aggregated_logger.

Member

@njhill njhill left a comment


Thanks @luccafong!

I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).

For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.

Then we also wouldn't need a special case/field for the PrometheusLogger.

A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.

(btw my comments below might not be applicable in the same way if we refactor per above suggestion).


Comment thread vllm/v1/metrics/loggers.py Outdated
Comment on lines +172 to +188
now = time.monotonic()
prompt_throughput = self._get_throughput(self.num_prompt_tokens, now)
generation_throughput = self._get_throughput(
self.num_generation_tokens, now)

self._reset(now)

scheduler_stats = self.last_scheduler_stats

log_fn = logger.info
if not any(
(prompt_throughput, generation_throughput,
self.last_prompt_throughput, self.last_generation_throughput)):
# Avoid log noise on an idle production system
log_fn = logger.debug
self.last_generation_throughput = generation_throughput
self.last_prompt_throughput = prompt_throughput
Member


Could we restructure so that this isn't duplicated with the superclass?
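One possible shape for the restructuring the reviewer asks for, sketched with simplified stand-in classes (vLLM's real `LoggingStatLogger` has a different constructor, logs via the `logging` module, and tracks more counters): hoist the duplicated throughput computation and idle-detection into a base-class helper that both `log()` implementations call.

```python
import time


class LoggingStatLogger:
    """Simplified stand-in for the per-engine logger; names illustrative."""

    def __init__(self):
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0
        self.last_prompt_throughput = 0.0
        self.last_generation_throughput = 0.0
        self.last_log_time = time.monotonic()

    def _get_throughput(self, num_tokens: int, now: float) -> float:
        elapsed = now - self.last_log_time
        return num_tokens / elapsed if elapsed > 0 else 0.0

    def _snapshot_throughputs(self):
        """Shared helper: compute throughputs, detect idle, reset counters.
        Subclasses call this instead of duplicating the block in log()."""
        now = time.monotonic()
        prompt_tp = self._get_throughput(self.num_prompt_tokens, now)
        gen_tp = self._get_throughput(self.num_generation_tokens, now)
        # Idle if nothing moved this interval or the previous one:
        # callers should downgrade to debug to avoid log noise.
        idle = not any((prompt_tp, gen_tp,
                        self.last_prompt_throughput,
                        self.last_generation_throughput))
        self.last_prompt_throughput = prompt_tp
        self.last_generation_throughput = gen_tp
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0
        self.last_log_time = now
        return prompt_tp, gen_tp, idle


class GlobalStatLogger(LoggingStatLogger):
    def log(self) -> str:
        prompt_tp, gen_tp, idle = self._snapshot_throughputs()
        level = "DEBUG" if idle else "INFO"
        return f"{level}: prompt {prompt_tp:.1f} tok/s, gen {gen_tp:.1f} tok/s"
```

With this split, the aggregated subclass only adds its cross-engine summation on top of the shared snapshot logic rather than copying the whole block.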

@luccafong
Collaborator Author

luccafong commented Sep 19, 2025

> Thanks @luccafong!
>
> I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).
>
> For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.
>
> Then we also wouldn't need a special case/field for the PrometheusLogger.
>
> A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.
>
> (btw my comments below might not be applicable in the same way if we refactor per above suggestion).

thanks @njhill for the review and suggestions.

Good point about providing an aggregation option. I am wondering: if this is enabled, is it weird that we have multiple loggers, or should we skip logging for non-zero DP engines in that mode? I am also not sure how the refactoring would impact current use cases that rely on PrometheusLogger.

@njhill
Member

njhill commented Sep 19, 2025

> Good point to provide an aggregation option, I am wondering if this enabled, is it weird we have multiple logger? or we skip logging for non zero DP engine in that mode ? I am also not sure how refactoring impact current use cases who rely on PrometheusLogger.

Re PrometheusLogger... this is already a "global" one right? so it would just mean we can include that in the list as one of the default ones and not have a separate field.

Thinking more, maybe we can keep the global stats logger abstract subclass after all (but I agree that it would be better to name it AggregateStatsLogger). I don't think we need a new arg for this, we can just check the type of passed-in statsloggers.

And then have a simple adapter - if a provided stats logger is not an AggregateStatsLogger, wrap it in this adapter which converts non-aggregate into aggregate (just contains dict with n instances). Then our internal field can just be a list of AggregateStatsLoggers and we invoke them all in the same way. PrometheusLogger would just be updated to extend AggregateStatsLogger.

And your concrete impl of the aggregate logging one we could name e.g. AggregateLoggingStatsLogger.

I hope that makes sense!
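The adapter idea above can be sketched roughly as follows. All class and method names here (`StatLogger`, `AggregateStatsLogger`, `PerEngineAdapter`, `record`) are illustrative stand-ins for the discussion, not vLLM's actual API:

```python
class StatLogger:
    """Per-engine logger: knows about exactly one engine index."""

    def __init__(self, engine_idx: int):
        self.engine_idx = engine_idx
        self.records = []

    def record(self, stats):
        self.records.append(stats)


class AggregateStatsLogger:
    """Interface the frontend invokes: one record() call per engine."""

    def record(self, engine_idx: int, stats):
        raise NotImplementedError


class PerEngineAdapter(AggregateStatsLogger):
    """Wraps a per-engine logger factory so the frontend can treat every
    provided logger as aggregate-capable: it holds one wrapped instance
    per engine index and routes each record() to the right one."""

    def __init__(self, factory, engine_idxs):
        self.wrapped = {idx: factory(idx) for idx in engine_idxs}

    def record(self, engine_idx: int, stats):
        self.wrapped[engine_idx].record(stats)


adapter = PerEngineAdapter(StatLogger, engine_idxs=[0, 1])
adapter.record(0, {"running": 3})
adapter.record(1, {"running": 5})
print(adapter.wrapped[1].records)  # → [{'running': 5}]
```

With this adapter, the internal field can be a flat list of aggregate-capable loggers invoked uniformly, which is the simplification the comment proposes.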

@njhill
Member

njhill commented Oct 13, 2025

@luccafong it looks like there's a test that needs updating: https://buildkite.com/vllm/ci/builds/34468#0199d15b-035d-435e-a030-a453e91e436e

and will need another rebase now that all those formatting changes have been made :(


mergify bot commented Oct 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 13, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 5484535 to ce5c02f Compare October 13, 2025 18:56
@mergify mergify bot removed the needs-rebase label Oct 13, 2025
Signed-off-by: Lu Fang <fanglu@fb.com>

fix the test

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

add aggregator interface and abstract common logic

Signed-off-by: Lu Fang <fanglu@fb.com>

add corrupted request aggregation

Signed-off-by: Lu Fang <fanglu@fb.com>

more refactor

Signed-off-by: Lu Fang <fanglu@fb.com>

fix ut

Signed-off-by: Lu Fang <fanglu@fb.com>

fix kv_connector_logging

Signed-off-by: Lu Fang <fanglu@fb.com>

fix merge conflicts

Signed-off-by: Lu Fang <fanglu@fb.com>

fix lint

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comment

Signed-off-by: Lu Fang <fanglu@fb.com>
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7ab416c to 554625d Compare October 13, 2025 20:44
@luccafong luccafong merged commit 8317f72 into vllm-project:main Oct 14, 2025
51 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 1994 <1994@users.noreply.github.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
@github-project-automation github-project-automation bot moved this from Backlog to Done in Metrics & Tracing Feb 4, 2026
@markmc markmc moved this from Done to Done - 0.12 in Metrics & Tracing Feb 4, 2026
@markmc markmc moved this from Done - 0.12 to Done - 0.11 in Metrics & Tracing Feb 4, 2026

Labels

documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done - 0.11

Development

Successfully merging this pull request may close these issues.

5 participants