
[Redo][Log] Wire stat logging into AsyncOmniEngine matching AsyncLLM #2918

Open
gcanlin wants to merge 3 commits into vllm-project:main from gcanlin:redo-log

Conversation

@gcanlin
Collaborator

@gcanlin gcanlin commented Apr 19, 2026


Purpose

Redo of #2551, which previously caused a Qwen3-Omni performance regression. This PR reintroduces the change while fixing the performance issue.

  • Derive log_stats from stage0 engine_args.disable_log_stats instead of hardcoding False; thread it through spawn_stage_core and MultimodalOutputProcessor.
  • Build a single StatLoggerManager in _bootstrap_orchestrator with one engine_idx per stage; pass it to Orchestrator.
  • Orchestrator: accept logger_manager, derive log_stats from it; create IterationStats and call manager.record() in _process_stage_outputs.
  • AsyncOmniEngine.do_log_stats: fire-and-forget via loop.call_soon_threadsafe(manager.log) to avoid blocking the API server thread while keeping all StatLoggerManager access on the orchestrator loop (no data race); see the sketch after this list.
  • AsyncOmni.do_log_stats: delegate to self.engine.do_log_stats().
  • Add guard tests for do_log_stats no-op branches.
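
A minimal sketch of the fire-and-forget hand-off described in the fourth bullet above. Attribute names such as logger_manager and orchestrator_loop are illustrative stand-ins, not necessarily the exact names in this PR:

import asyncio

class AsyncOmniEngine:
    """Sketch: defer stat logging to the orchestrator loop without blocking."""

    def __init__(self, logger_manager, orchestrator_loop: asyncio.AbstractEventLoop):
        # Hypothetical attributes standing in for the PR's actual wiring.
        self.logger_manager = logger_manager
        self.orchestrator_loop = orchestrator_loop

    async def do_log_stats(self) -> None:
        if self.logger_manager is None:
            # Logging disabled (disable_log_stats=True): guarded no-op branch.
            return
        # Fire-and-forget: schedule manager.log on the orchestrator loop so the
        # API server thread never waits for loggers to flush, and all
        # StatLoggerManager access stays on a single loop (no data race).
        self.orchestrator_loop.call_soon_threadsafe(self.logger_manager.log)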

Test Plan

Run the Omni benchmark:

vllm bench serve \
  --omni \
  --dataset-name random \
  --port 8000 \
  --max-concurrency 10 \
  --model Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --num-prompts 100 \
  --random-input-len 100 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --random-output-len 100 \
  --extra_body '{"modalities": ["text", "audio"]}'

Test Result

https://buildkite.com/vllm/vllm-omni/builds/7245/steps/canvas

(APIServer pid=367) INFO 04-19 14:07:00 [stage_engine_core_client.py:172] [StageEngineCoreClient] Stage-0 adding request: chatcmpl-bench-75e3fce4-13
(APIServer pid=367) INFO 04-19 14:07:00 [stage_engine_core_client.py:172] [StageEngineCoreClient] Stage-1 adding request: chatcmpl-bench-75e3fce4-13
(APIServer pid=367) INFO 04-19 14:07:00 [stage_engine_core_client.py:172] [StageEngineCoreClient] Stage-2 adding request: chatcmpl-bench-75e3fce4-13
(APIServer pid=367) INFO 04-19 14:07:09 [loggers.py:259] Engine 000: Avg prompt throughput: 250.8 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:09 [loggers.py:259] Engine 001: Avg prompt throughput: 251.4 tokens/s, Avg generation throughput: 66.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:09 [loggers.py:259] Engine 002: Avg prompt throughput: 251.4 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:19 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:19 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 74.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:19 [loggers.py:259] Engine 002: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:29 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 74.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:39 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 46.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:07:49 [loggers.py:259] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:08:09 [loggers.py:259] Engine 002: Avg prompt throughput: 251.4 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=367) INFO 04-19 14:08:19 [loggers.py:259] Engine 002: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin requested a review from hsliuustc0106 as a code owner April 19, 2026 13:32
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@gcanlin gcanlin added the nightly-test and omni-test labels and removed the nightly-test label on Apr 19, 2026
@hsliuustc0106
Collaborator

@amy-why-3459 PTAL

@amy-why-3459
Contributor

Just wondering, what was the question from last time?

@gcanlin
Collaborator Author

gcanlin commented Apr 19, 2026

Just wondering, what was the question from last time?

Previously, logging was synchronous and slowed the stage-to-stage data flow, so we observed that stage-2 batches could not be fully utilized. This PR makes logging asynchronous, consistent with vLLM's AsyncLLM.
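
In rough terms (an illustrative sketch, not the actual code; forward_to_next_stage and the argument shapes are invented here for contrast):

def forward_to_next_stage(outputs):
    pass  # hypothetical stand-in for handing outputs to the next stage

# Old behavior (sync): logging sat on the hot path between stages.
def process_stage_outputs_old(outputs, manager):
    forward_to_next_stage(outputs)
    manager.log()  # blocks the stage loop until every logger has flushed

# New behavior (async): record cheaply now, defer the expensive log call.
def process_stage_outputs_new(outputs, manager, iteration_stats, loop):
    forward_to_next_stage(outputs)
    manager.record(iteration_stats)  # lightweight per-iteration bookkeeping
    loop.call_soon_threadsafe(manager.log)  # runs later on the orchestrator loop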

@gcanlin
Collaborator Author

gcanlin commented Apr 19, 2026

@gcanlin gcanlin added the ready label and removed the omni-test label on Apr 19, 2026
@amy-why-3459
Contributor

Based on the test results, this PR still seems to have a significant impact on performance.

gcanlin added 2 commits April 20, 2026 01:14
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin added the omni-test label and removed the ready label on Apr 20, 2026
@Gaohan123 Gaohan123 removed the omni-test label on May 5, 2026