
[CI] Add nightly-ci for multi-stage deployment#3610

Open
ZhengWG wants to merge 5 commits into vllm-project:main from ZhengWG:py/add-mutli-replicas-ci

Conversation

@ZhengWG (Contributor) commented May 14, 2026

Purpose

PR #2396 ([FEAT] support multi-stage deployment) shipped the multi-replica stage-pool runtime, but the only nightly CI gate for it today is the functional smoke step "Omni · Multi-Replica Startup Test" on 4× H100. There is no perf gate quantifying how much throughput / latency benefit multi-replica routing actually delivers.

This PR adds a paired perf step, nightly-omni-performance-multi-replicas, that runs a 2-GPU single-replica base against a 3-GPU multi-replica setup (stages 1 and 2 each scaled to 2 replicas) under the same workload posted on PR #2396: vllm bench serve --omni --backend openai-chat-omni --dataset-name random --num-prompts 128 --random-input-len 1024 --output-len 512 --ignore-eos, swept over concurrency 8 / 16 / 24 / 32.
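The same command reflowed, one flag per line (the per-step concurrency is presumably passed via vllm bench serve's --max-concurrency flag, matching the sweep in the test config; that flag is not spelled out in the PR text):

vllm bench serve --omni \
  --backend openai-chat-omni \
  --dataset-name random \
  --num-prompts 128 \
  --random-input-len 1024 \
  --output-len 512 \
  --ignore-eos \
  --max-concurrency 8   # swept over 8 / 16 / 24 / 32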

cc @hsliuustc0106 @yinpeiqi @fake0fan @amy-why-3459

Test Plan

Full sweep — 2gpu_base vs 3gpu_replica2 (the JSON committed in this PR):

export VLLM_WORKER_MULTIPROC_METHOD=spawn   # use spawn (not fork) for worker processes
export BENCHMARK_DIR=tests/dfx/perf/results # where the benchmark result JSON lands
mkdir -p $BENCHMARK_DIR
set +e                                      # keep the shell alive so every sweep step runs
pytest -s -v tests/dfx/perf/scripts/run_benchmark.py \
  --test-config-file tests/dfx/perf/tests/test_qwen3_omni_multi_replicas.json

Test Result

Both runs completed 128/128 requests at every concurrency step (tested on 8× H20-96G):

| Concurrency | Config | req/s | output tput (tok/s) | total tput (tok/s) | TTFT mean (ms) | TTFT p99 (ms) | TPOT (ms) | audio_ttfp (ms) | audio_rtf | e2el mean (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 2gpu_base | 0.1565 | 56.98 | 219.5 | 238 | 1829 | 20.8 | 1565 | 0.40 | 49.9 |
| 8 | 3gpu_replica2 | 0.2226 | 81.41 | 312.5 | 156 | 721 | 19.1 | 1029 | 0.28 | 35.0 |
| 16 | 2gpu_base | 0.1979 | 74.69 | 280.1 | 308 | 1504 | 19.4 | 2145 | 0.59 | 78.3 |
| 16 | 3gpu_replica2 | 0.3078 | 112.01 | 431.5 | 249 | 1160 | 24.3 | 1470 | 0.38 | 50.1 |
| 24 | 2gpu_base | 0.2155 | 81.67 | 305.4 | 505 | 2105 | 23.1 | 3099 | 0.86 | 106.6 |
| 24 | 3gpu_replica2 | 0.3750 | 134.83 | 524.1 | 426 | 1808 | 26.2 | 2002 | 0.48 | 58.7 |
| 32 | 2gpu_base | 0.2363 | 86.80 | 332.1 | 839 | 2970 | 26.0 | 3838 | 0.96 | 123.7 |
| 32 | 3gpu_replica2 | 0.4074 | 142.61 | 565.5 | 652 | 2328 | 31.0 | 2432 | 0.58 | 69.2 |
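
(A note on the audio columns, inferred from the metric names rather than stated in the PR: audio_rtf is presumably the real-time factor, generation time divided by audio duration, lower is better; audio_ttfp is presumably the time to the first audio packet.)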

Speedup of 3gpu_replica2 over 2gpu_base:

| Metric | c=8 | c=16 | c=24 | c=32 | Trend |
|---|---|---|---|---|---|
| output throughput | 1.43× | 1.50× | 1.65× | 1.64× | gain grows with concurrency |
| total token tput | 1.42× | 1.54× | 1.72× | 1.70× | same |
| TTFT mean | −35% | −19% | −16% | −22% | faster across the board |
| audio_ttfp | −34% | −31% | −35% | −37% | stable ~1/3 reduction |
| audio_rtf | 0.68× | 0.64× | 0.55× | 0.60× | stronger at high concurrency |
| e2el mean | 0.70× | 0.64× | 0.55× | 0.56× | nearly halved at c≥24 |
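
For example, the c=8 column follows directly from the first table: output throughput is 81.41 / 56.98 ≈ 1.43×, and TTFT mean is (156 − 238) / 238 ≈ −34.5%, which the table rounds to −35%.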


Signed-off-by: ZhengWG <zwg0606@gmail.com>
@ZhengWG ZhengWG requested review from congw729 and yenuo26 as code owners May 14, 2026 12:23
@amy-why-3459 (Contributor) commented:

Please resolve the conflict.

@@ -0,0 +1,56 @@
[
{
"test_name": "test_qwen3_omni_2gpu_base",
Contributor review comment:
I don't think we need to test this scenario; we've already tested 2gpu_base in other scenarios.

"endpoint": "/v1/chat/completions",
"num_prompts": [128, 128, 128, 128],
"max_concurrency": [8, 16, 24, 32],
"random_input_len": 1024,
@amy-why-3459 (Contributor) commented May 14, 2026:
Is it possible to align with our existing scenario? For example, using an input length of 2500 and an output length of 900?
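
For reference, a hypothetical sketch of the shape of one config entry, reconstructed only from the fields visible in this thread; random_output_len is an assumed field name mirroring --output-len 512 from the bench command, and any fields beyond those shown here are not confirmed by the PR:

[
  {
    "test_name": "test_qwen3_omni_2gpu_base",
    "endpoint": "/v1/chat/completions",
    "num_prompts": [128, 128, 128, 128],
    "max_concurrency": [8, 16, 24, 32],
    "random_input_len": 1024,
    "random_output_len": 512
  }
]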

ZhengWG added 3 commits May 14, 2026 20:35
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Signed-off-by: ZhengWG <zwg0606@gmail.com>
@hsliuustc0106 added the omni-test label (to trigger buildkite omni model test in nightly CI) May 14, 2026
@hsliuustc0106 (Collaborator) commented:

Why can multiple replicas of talker & code2wav benefit TTFT?

@@ -0,0 +1,31 @@
[
{
"test_name": "test_qwen3_omni_3gpu_replica2",
Collaborator review comment:
Does this test require 3 GPUs or 4?

Collaborator reply:
3

@amy-why-3459 (Contributor) commented:

CI failed due to a shape mismatch; PR #3147 will fix it.

@ZhengWG (Contributor, Author) commented May 14, 2026

> Why can multiple replicas of talker & code2wav benefit TTFT?

I think it's not really stage-0 compute; it's queue back-pressure. The connector between stages has bounded capacity, so when a single-replica talker/code2wav can't drain fast enough, stage 0 stops admitting new requests and they pile up on the API side.
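
A toy sketch of that mechanism in Python (not vllm-omni's actual connector API; the queue size, timings, and names are made up): a bounded queue between two stages makes a slow consumer stall the producer, which is exactly what delays admission of new requests and inflates TTFT. Doubling the consumer replicas drains the queue roughly twice as fast.

import asyncio

QUEUE_CAP = 4  # bounded inter-stage connector (illustrative size)

async def stage0_thinker(q: asyncio.Queue, n_requests: int) -> None:
    for i in range(n_requests):
        # put() blocks once the queue is full: a slow downstream stage
        # stalls admission here, so new requests queue up and TTFT grows.
        await q.put(f"req-{i}")

async def stage1_talker(q: asyncio.Queue, replicas: int) -> None:
    async def worker() -> None:
        while True:
            await q.get()
            await asyncio.sleep(0.5)  # pretend synthesis takes 0.5 s
            q.task_done()
    # More replicas drain the queue faster, releasing back-pressure.
    await asyncio.gather(*(worker() for _ in range(replicas)))

async def main(replicas: int) -> None:
    q: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_CAP)
    consumer = asyncio.create_task(stage1_talker(q, replicas))
    await stage0_thinker(q, n_requests=12)  # stalls at the queue cap when replicas=1
    await q.join()
    consumer.cancel()

asyncio.run(main(replicas=2))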

@amy-why-3459 (Contributor) commented:

H100:

| Config | max_concurrency | RTF | E2E (ms) | req/s |
|---|---|---|---|---|
| 2gpu_base | 8 | 0.3 | 50987.29 | 0.14 |
| 2gpu_base | 16 | 0.42 | 73726.81 | 0.2 |
| 2gpu_base | 32 | 0.68 | 109713.05 | 0.26 |
| 3gpu_replica2 | 8 | 0.29 | 35423.29 | 0.21 |
| 3gpu_replica2 | 16 | 0.38 | 43161.99 | 0.34 |
| 3gpu_replica2 | 24 | 0.45 | 52433.34 | 0.41 |
| 3gpu_replica2 | 32 | 0.52 | 55672.33 | 0.49 |

@hsliuustc0106 (Collaborator) replied, quoting the H100 table above:

This looks like it improves more in QPS. @gcanlin, can we test it on NPU as well?

@hsliuustc0106 removed the omni-test label (to trigger buildkite omni model test in nightly CI) May 16, 2026