[Misc][DP] support customized aggregated logger for dp #24354

Merged
luccafong merged 6 commits into vllm-project:main from luccafong:support_global_dp_logging
Oct 14, 2025

Conversation


@luccafong luccafong commented Sep 6, 2025

Purpose

  • Support customized global logger for data parallel. This allows users to provide custom global loggers that can aggregate metrics across multiple DP instances.
  • Update the multi-host DP instance example with customized loggers.

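As a rough illustration of what such a custom aggregated logger does, here is a minimal self-contained sketch. The class and field names below (`SchedulerStats`, `AggregatedLogger`, `record`) are hypothetical stand-ins, not vLLM's actual `StatLoggerBase` API, which has a richer interface and is constructed via a factory.

```python
from dataclasses import dataclass


# Stand-in for vLLM's per-engine scheduler stats; the real class
# carries many more fields (KV cache usage, prefix cache hit rate, ...).
@dataclass
class SchedulerStats:
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0


class AggregatedLogger:
    """Collects per-engine stats and emits one combined line,
    mirroring the idea behind this PR's aggregated DP logger."""

    def __init__(self, engine_idxs):
        self.engine_idxs = engine_idxs
        # Latest stats snapshot per DP engine index.
        self.per_engine = {idx: SchedulerStats() for idx in engine_idxs}

    def record(self, engine_idx, stats: SchedulerStats):
        self.per_engine[engine_idx] = stats

    def log(self) -> str:
        # Aggregate across all DP engines instead of one line per engine.
        running = sum(s.num_running_reqs for s in self.per_engine.values())
        waiting = sum(s.num_waiting_reqs for s in self.per_engine.values())
        return (f"{len(self.engine_idxs)} Engines Aggregated: "
                f"Running: {running} reqs, Waiting: {waiting} reqs")


logger = AggregatedLogger(engine_idxs=[0, 1])
logger.record(0, SchedulerStats(num_running_reqs=1))
logger.record(1, SchedulerStats(num_waiting_reqs=2))
print(logger.log())  # → 2 Engines Aggregated: Running: 1 reqs, Waiting: 2 reqs
```

This matches the shape of the "N Engines Aggregated" lines in the test logs below, where per-engine counters are summed before a single line is emitted.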
Test Plan

  • Unit tests
  • Verify in the online e2e example that both per-engine and aggregated loggers work correctly
  • Verify with e2e online serving and benchmarking that --aggregate-engine-logging works
  • Verify with e2e online serving and benchmarking that default behavior is unchanged

Test Result

  • Unit tests pass
python -m pytest tests/v1/engine/test_async_llm.py::test_customize_aggregated_loggers -v
============================= 1 passed, 3 warnings in 28.54s ==============================

python -m pytest tests/v1/engine/test_async_llm.py::test_customize_loggers -v
============================= 1 passed, 3 warnings in 24.11s ==============================

  • e2e test with online examples:
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=WARN python3 /data/users/fanglu/gitrepos/vllm/examples/online_serving/multi_instance_data_parallel.py
CUDA_VISIBLE_DEVICES=1  vllm serve ibm-research/PowerMoE-3b -dp 2 -dpr 1         --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 62300 --max-model-len 2048 --max-num-batched-tokens 2048          --data-parallel-size-local 1 --enforce-eager --headless

logs:

DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:190] 2 Engines Aggregated: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

  • e2e test with serving

Add option --aggregate-engine-logging to enable aggregated log stats.

serve

vllm serve "/data/local/models/DeepSeek-V3.1" --max_model_len=4096 --gpu_memory_utilization=0.9 --tensor_parallel_size 1 --data_parallel_size 8 --enable_expert_parallel  --max_num_seqs=256  --aggregate-engine-logging --enforce-eager > /tmp/test_serving.log 2>&1 &

benchmark

vllm bench serve  --model "/data/local/models/DeepSeek-V3.1" --dataset-name random  --random-input-len 2048 --random-output-len 1024  --num-prompts 400 --host 127.0.0.1 --port 8000 --ignore-eos --random-range-ratio 0 --max-concurrency 200 --ready-check-timeout-sec 0

Logs:

(APIServer pid=735328) INFO 10-13 11:43:03 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 13106.5 tokens/s, Avg generation throughput: 28.8 tokens/s, Running: 168 reqs, Waiting: 32 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:13 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 8184.7 tokens/s, Avg generation throughput: 1318.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.8%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:23 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1339.9 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.4%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:33 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1340.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.1%, Prefix cache hit rate: 0.0%

  • e2e test with serving with default settings
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 000: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 001: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 002: Avg prompt throughput: 1843.1 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 1.9%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 003: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 004: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 005: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 006: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 007: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@mergify mergify bot added the documentation (Improvements or additions to documentation) and v1 labels Sep 6, 2025
@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for a customized global logger for data parallel (DP) training, which is a valuable addition for monitoring distributed training. The changes primarily affect the V1 engine, with corresponding updates to examples and tests. My review has identified a few issues: a bug in the updated example code where a request ID is not unique within a loop, a leftover debug print statement in the new logger logic, and an unused parameter in the V0 engine's API that could cause confusion. Addressing these points will improve the quality and correctness of the implementation.

Comment thread examples/online_serving/multi_instance_data_parallel.py Outdated
Comment thread vllm/engine/async_llm_engine.py Outdated
Comment thread vllm/v1/metrics/loggers.py Outdated

@luccafong luccafong force-pushed the support_global_dp_logging branch from cb2d450 to 893ea0f Compare September 6, 2025 01:08

@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from cb2d450 to 7bfddba Compare September 6, 2025 01:17


mergify bot commented Sep 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7bfddba to 8156746 Compare September 6, 2025 01:21
@mergify mergify bot removed the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from 8633cf4 to 562f552 Compare September 8, 2025 23:58
Comment thread vllm/v1/metrics/loggers.py Outdated
# Each EngineCore's metrics are expressed as a unique label.
self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs)
self.global_logger: Optional[StatLoggerBase] = None
Collaborator


let's name this as aggregated_logger?

Member


Perhaps type here better as GlobalStatLoggerBase

Comment thread vllm/v1/metrics/loggers.py Outdated
@@ -145,6 +156,63 @@ def log_engine_initialized(self):
self.vllm_config.cache_config.num_gpu_blocks)


class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase):
Collaborator


AggregatedLogger is a better name.

Collaborator


Add some comments to explain the difference?

Comment thread vllm/v1/engine/async_llm.py Outdated
@@ -222,6 +230,7 @@ def from_engine_args(
start_engine_loop: bool = True,
usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
stat_loggers: Optional[list[StatLoggerFactory]] = None,
stat_logger_global: Optional[GlobalStatLoggerFactory] = None,
Collaborator


I think we should rename this per_engine_loggers and aggregated_logger.

Member

@njhill njhill left a comment


Thanks @luccafong!

I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).

For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.

Then we also wouldn't need a special case/field for the PrometheusLogger.

A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.

(btw my comments below might not be applicable in the same way if we refactor per above suggestion).


Comment thread vllm/v1/metrics/loggers.py Outdated
Comment on lines +172 to +188
now = time.monotonic()
prompt_throughput = self._get_throughput(self.num_prompt_tokens, now)
generation_throughput = self._get_throughput(
self.num_generation_tokens, now)

self._reset(now)

scheduler_stats = self.last_scheduler_stats

log_fn = logger.info
if not any(
(prompt_throughput, generation_throughput,
self.last_prompt_throughput, self.last_generation_throughput)):
# Avoid log noise on an idle production system
log_fn = logger.debug
self.last_generation_throughput = generation_throughput
self.last_prompt_throughput = prompt_throughput
Member


Could we restructure so that this isn't duplicated with the superclass?
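One possible shape for the restructuring the reviewer asks for, sketched with simplified stand-in classes (vLLM's real `LoggingStatLogger` has a different constructor, logs via the `logging` module, and tracks more counters): hoist the duplicated throughput computation and idle-detection into a base-class helper that both `log()` implementations call.

```python
import time


class LoggingStatLogger:
    """Simplified stand-in for the per-engine logger; names illustrative."""

    def __init__(self):
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0
        self.last_prompt_throughput = 0.0
        self.last_generation_throughput = 0.0
        self.last_log_time = time.monotonic()

    def _get_throughput(self, num_tokens: int, now: float) -> float:
        elapsed = now - self.last_log_time
        return num_tokens / elapsed if elapsed > 0 else 0.0

    def _snapshot_throughputs(self):
        """Shared helper: compute throughputs, detect idle, reset counters.
        Subclasses call this instead of duplicating the block in log()."""
        now = time.monotonic()
        prompt_tp = self._get_throughput(self.num_prompt_tokens, now)
        gen_tp = self._get_throughput(self.num_generation_tokens, now)
        # Idle if nothing moved this interval or the previous one:
        # callers should downgrade to debug to avoid log noise.
        idle = not any((prompt_tp, gen_tp,
                        self.last_prompt_throughput,
                        self.last_generation_throughput))
        self.last_prompt_throughput = prompt_tp
        self.last_generation_throughput = gen_tp
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0
        self.last_log_time = now
        return prompt_tp, gen_tp, idle


class GlobalStatLogger(LoggingStatLogger):
    def log(self) -> str:
        prompt_tp, gen_tp, idle = self._snapshot_throughputs()
        level = "DEBUG" if idle else "INFO"
        return f"{level}: prompt {prompt_tp:.1f} tok/s, gen {gen_tp:.1f} tok/s"
```

With this split, the aggregated subclass only adds its cross-engine summation on top of the shared snapshot logic rather than copying the whole block.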

@luccafong
Collaborator Author

luccafong commented Sep 19, 2025

> Thanks @luccafong!
>
> I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).
>
> For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.
>
> Then we also wouldn't need a special case/field for the PrometheusLogger.
>
> A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.
>
> (btw my comments below might not be applicable in the same way if we refactor per above suggestion).

thanks @njhill for the review and suggestions.

Good point about providing an aggregation option. I am wondering: if this is enabled, is it weird that we have multiple loggers, or should we skip logging for non-zero DP engines in that mode? I am also not sure how the refactoring would impact current use cases that rely on PrometheusLogger.

@njhill
Member

njhill commented Sep 19, 2025

> Good point to provide an aggregation option, I am wondering if this enabled, is it weird we have multiple logger? or we skip logging for non zero DP engine in that mode ? I am also not sure how refactoring impact current use cases who rely on PrometheusLogger.

Re PrometheusLogger... this is already a "global" one right? so it would just mean we can include that in the list as one of the default ones and not have a separate field.

Thinking more, maybe we can keep the global stats logger abstract subclass after all (but I agree that it would be better to name it AggregateStatsLogger). I don't think we need a new arg for this, we can just check the type of passed-in statsloggers.

And then have a simple adapter - if a provided stats logger is not an AggregateStatsLogger, wrap it in this adapter which converts non-aggregate into aggregate (just contains dict with n instances). Then our internal field can just be a list of AggregateStatsLoggers and we invoke them all in the same way. PrometheusLogger would just be updated to extend AggregateStatsLogger.

And your concrete impl of the aggregate logging one we could name e.g. AggregateLoggingStatsLogger.

I hope that makes sense!
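The adapter idea above can be sketched roughly as follows. All class and method names here (`StatLogger`, `AggregateStatsLogger`, `PerEngineAdapter`, `record`) are illustrative stand-ins for the discussion, not vLLM's actual API:

```python
class StatLogger:
    """Per-engine logger: knows about exactly one engine index."""

    def __init__(self, engine_idx: int):
        self.engine_idx = engine_idx
        self.records = []

    def record(self, stats):
        self.records.append(stats)


class AggregateStatsLogger:
    """Interface the frontend invokes: one record() call per engine."""

    def record(self, engine_idx: int, stats):
        raise NotImplementedError


class PerEngineAdapter(AggregateStatsLogger):
    """Wraps a per-engine logger factory so the frontend can treat every
    provided logger as aggregate-capable: it holds one wrapped instance
    per engine index and routes each record() to the right one."""

    def __init__(self, factory, engine_idxs):
        self.wrapped = {idx: factory(idx) for idx in engine_idxs}

    def record(self, engine_idx: int, stats):
        self.wrapped[engine_idx].record(stats)


adapter = PerEngineAdapter(StatLogger, engine_idxs=[0, 1])
adapter.record(0, {"running": 3})
adapter.record(1, {"running": 5})
print(adapter.wrapped[1].records)  # → [{'running': 5}]
```

With this adapter, the internal field can be a flat list of aggregate-capable loggers invoked uniformly, which is the simplification the comment proposes.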

@njhill
Member

njhill commented Oct 13, 2025

@luccafong it looks like there's a test that needs updating: https://buildkite.com/vllm/ci/builds/34468#0199d15b-035d-435e-a030-a453e91e436e

and will need another rebase now that all those formatting changes have been made :(


mergify bot commented Oct 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 13, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 5484535 to ce5c02f Compare October 13, 2025 18:56
@mergify mergify bot removed the needs-rebase label Oct 13, 2025
Signed-off-by: Lu Fang <fanglu@fb.com>

fix the test

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

add aggregator interface and abstract common logic

Signed-off-by: Lu Fang <fanglu@fb.com>

add corrupted request aggregation

Signed-off-by: Lu Fang <fanglu@fb.com>

more refactor

Signed-off-by: Lu Fang <fanglu@fb.com>

fix ut

Signed-off-by: Lu Fang <fanglu@fb.com>

fix kv_connector_logging

Signed-off-by: Lu Fang <fanglu@fb.com>

fix merge conflicts

Signed-off-by: Lu Fang <fanglu@fb.com>

fix lint

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comments

Signed-off-by: Lu Fang <fanglu@fb.com>

address comment

Signed-off-by: Lu Fang <fanglu@fb.com>
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7ab416c to 554625d Compare October 13, 2025 20:44
@luccafong luccafong merged commit 8317f72 into vllm-project:main Oct 14, 2025
51 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 1994 <1994@users.noreply.github.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…24354)

Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
@github-project-automation github-project-automation bot moved this from Backlog to Done in Metrics & Tracing Feb 4, 2026
@markmc markmc moved this from Done to Done - 0.12 in Metrics & Tracing Feb 4, 2026
@markmc markmc moved this from Done - 0.12 to Done - 0.11 in Metrics & Tracing Feb 4, 2026

Labels

documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done - 0.11

Development

Successfully merging this pull request may close these issues.

5 participants