[CI] Add nightly-ci for multi-stage deployment #3610
Conversation
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Please resolve the conflict.
@@ -0,0 +1,56 @@
[
  {
    "test_name": "test_qwen3_omni_2gpu_base",
I don't think we need to test this scenario; we've already tested 2gpu_base in other scenarios.
    "endpoint": "/v1/chat/completions",
    "num_prompts": [128, 128, 128, 128],
    "max_concurrency": [8, 16, 24, 32],
    "random_input_len": 1024,
Is it possible to align with our existing scenario? For example, using an input length of 2500 and an output length of 900?
Why would multiple replicas of talker & code2wav benefit TTFT?
@@ -0,0 +1,31 @@
[
  {
    "test_name": "test_qwen3_omni_3gpu_replica2",
Does this test require 3 GPUs or 4?
CI failed due to a shape mismatch; this PR will fix it: #3147.
I think it's not really stage-0 compute; it's queue back-pressure. The connector between stages has bounded capacity, so when a single-replica talker/code2wav can't drain fast enough, stage 0 stops admitting new requests and they pile up at the API side.
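To make that concrete, here is a small self-contained Python sketch of the effect (illustrative timings and stage names only, not the vllm-omni runtime): a bounded queue stands in for the stage connector, and a slow single-replica consumer makes the producer block on `put()`, which is exactly the admission stall that shows up as higher TTFT; a second consumer replica keeps the queue drained.

```python
import queue
import threading
import time

# Illustrative timings only: stage 0 emits an item every 10 ms; the downstream
# talker/code2wav stage needs 18 ms per item, so one replica cannot keep up.
STAGE0_INTERVAL_S = 0.010
DOWNSTREAM_COST_S = 0.018
NUM_REQUESTS = 40

def run(num_replicas: int) -> float:
    # Bounded "connector" between stages; put() blocks once it is full.
    connector: queue.Queue = queue.Queue(maxsize=4)
    stop = threading.Event()

    def downstream_worker() -> None:
        while not stop.is_set():
            try:
                connector.get(timeout=0.05)
            except queue.Empty:
                continue
            time.sleep(DOWNSTREAM_COST_S)  # simulated talker/code2wav work
            connector.task_done()

    workers = [threading.Thread(target=downstream_worker, daemon=True)
               for _ in range(num_replicas)]
    for w in workers:
        w.start()

    # Stage 0: with a single slow replica the connector fills up, put() blocks,
    # and new requests are admitted late -- the latency the API side observes.
    start = time.monotonic()
    for _ in range(NUM_REQUESTS):
        time.sleep(STAGE0_INTERVAL_S)
        connector.put(object())
    admit_time = time.monotonic() - start

    connector.join()
    stop.set()
    return admit_time

if __name__ == "__main__":
    for replicas in (1, 2):
        print(f"replicas={replicas}: stage-0 admission took {run(replicas):.2f}s")
```

With the assumed 10 ms production interval and 18 ms per-item downstream cost, one replica cannot keep pace and admission stretches out, while two replicas drain faster than stage 0 produces and admission stays near the ideal.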
H100:
This looks like a bigger improvement in QPS. @gcanlin, can we test it on NPU as well?
Purpose
PR #2396 ([FEAT] support multi-stage deployment) shipped the multi-replica stage-pool runtime, but the only nightly CI gate for it today is the functional smoke step `Omni · Multi-Replica Startup Test with 4x H100`. There is no perf gate to quantify how much throughput / latency benefit the multi-replica routing actually delivers.

This PR adds a paired perf step `nightly-omni-performance-multi-replicas` that runs 2-GPU single-replica base vs 3-GPU multi-replica (stage 1 / 2 each scaled to 2 replicas) under the same workload posted on PR #2396 (`vllm bench serve --omni --backend openai-chat-omni --dataset-name random --num-prompts 128 --random-input-len 1024 --output-len 512 --ignore-eos`, concurrency sweep 8 / 16 / 24 / 32).

cc @hsliuustc0106 @yinpeiqi @fake0fan @amy-why-3459
Test Plan
Full sweep — 2gpu_base vs 3gpu_replica2 (the JSON committed in this PR):
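As a rough illustration of how such a sweep could be reproduced locally, the sketch below walks one of the committed JSON specs and shells out to `vllm bench serve` once per concurrency step. The spec file name, the `--endpoint`/`--max-concurrency` flag spellings, and the fixed output length are assumptions for illustration; only the field names visible in the diff and the flags quoted in the description above come from the PR.

```python
import json
import subprocess

# Illustrative file name; the committed spec in this PR may live elsewhere.
SPEC_PATH = "test_qwen3_omni_2gpu_base.json"

with open(SPEC_PATH) as f:
    specs = json.load(f)

for spec in specs:
    # One bench run per concurrency step: 128 prompts at 8/16/24/32 concurrency.
    for num_prompts, concurrency in zip(spec["num_prompts"], spec["max_concurrency"]):
        cmd = [
            "vllm", "bench", "serve",
            "--omni",
            "--backend", "openai-chat-omni",
            "--dataset-name", "random",
            "--endpoint", spec["endpoint"],
            "--num-prompts", str(num_prompts),
            "--random-input-len", str(spec["random_input_len"]),
            "--output-len", "512",   # fixed output length from the described workload
            "--ignore-eos",
            "--max-concurrency", str(concurrency),
        ]
        print(f"[{spec['test_name']}] max_concurrency={concurrency}")
        subprocess.run(cmd, check=True)
```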
Test Result
Both runs completed 128/128 requests at every concurrency step, tested on 8×H20-96G:
Speedup of 3gpu_replica2 over 2gpu_base:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)