[CI] Add nightly-ci for multi-stage deployment #3610
Conversation
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Please resolve the conflict.
@@ -0,0 +1,56 @@
[
  {
    "test_name": "test_qwen3_omni_2gpu_base",
I don't think we need to test this scenario; we've already tested 2gpu_base in other scenarios.
    "endpoint": "/v1/chat/completions",
    "num_prompts": [128, 128, 128, 128],
    "max_concurrency": [8, 16, 24, 32],
    "random_input_len": 1024,
Is it possible to align with our existing scenario? For example, using an input length of 2500 and an output length of 900?
Why would multiple replicas of talker & code2wav benefit TTFT?
@@ -0,0 +1,31 @@
[
  {
    "test_name": "test_qwen3_omni_3gpu_replica2",
Does this test require 3 GPUs or 4?
CI failed due to a shape mismatch; this PR will fix it: #3147.
I think it's not really stage-0 compute; it's queue back-pressure. The connector between stages has bounded capacity, so when a single-replica talker/code2wav can't drain fast enough, stage 0 stops admitting new requests and they pile up at the API side.
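To make that concrete, here is a small self-contained Python sketch of the effect (illustrative timings and stage names only, not the vllm-omni runtime): a bounded queue stands in for the stage connector, and a slow single-replica consumer makes the producer block on `put()`, which is exactly the admission stall that shows up as higher TTFT; a second consumer replica keeps the queue drained.

```python
import queue
import threading
import time

# Illustrative timings only: stage 0 emits an item every 10 ms; the downstream
# talker/code2wav stage needs 18 ms per item, so one replica cannot keep up.
STAGE0_INTERVAL_S = 0.010
DOWNSTREAM_COST_S = 0.018
NUM_REQUESTS = 40

def run(num_replicas: int) -> float:
    # Bounded "connector" between stages; put() blocks once it is full.
    connector: queue.Queue = queue.Queue(maxsize=4)
    stop = threading.Event()

    def downstream_worker() -> None:
        while not stop.is_set():
            try:
                connector.get(timeout=0.05)
            except queue.Empty:
                continue
            time.sleep(DOWNSTREAM_COST_S)  # simulated talker/code2wav work
            connector.task_done()

    workers = [threading.Thread(target=downstream_worker, daemon=True)
               for _ in range(num_replicas)]
    for w in workers:
        w.start()

    # Stage 0: with a single slow replica the connector fills up, put() blocks,
    # and new requests are admitted late -- the latency the API side observes.
    start = time.monotonic()
    for _ in range(NUM_REQUESTS):
        time.sleep(STAGE0_INTERVAL_S)
        connector.put(object())
    admit_time = time.monotonic() - start

    connector.join()
    stop.set()
    return admit_time

if __name__ == "__main__":
    for replicas in (1, 2):
        print(f"replicas={replicas}: stage-0 admission took {run(replicas):.2f}s")
```

With the assumed 10 ms production interval and 18 ms per-item downstream cost, one replica cannot keep pace and admission stretches out, while two replicas drain faster than stage 0 produces and admission stays near the ideal.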
H100:
This looks like a bigger improvement in QPS. @gcanlin, can we test it on NPU as well?
Purpose
PR #2396 ([FEAT] support multi-stage deployment) shipped the multi-replica stage-pool runtime, but the only nightly CI gate for it today is the functional smoke step `Omni · Multi-Replica Startup Test with 4x H100`. There is no perf gate to quantify how much throughput / latency benefit the multi-replica routing actually delivers.

This PR adds a paired perf step `nightly-omni-performance-multi-replicas` that runs 2-GPU single-replica base vs 3-GPU multi-replica (stage 1 / 2 each scaled to 2 replicas) under the same workload posted on PR #2396 (`vllm bench serve --omni --backend openai-chat-omni --dataset-name random --num-prompts 128 --random-input-len 1024 --output-len 512 --ignore-eos`, concurrency sweep 8 / 16 / 24 / 32).

cc @hsliuustc0106 @yinpeiqi @fake0fan @amy-why-3459
Test Plan
Full sweep — 2gpu_base vs 3gpu_replica2 (the JSON committed in this PR):
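As a rough illustration of how such a sweep could be reproduced locally, the sketch below walks one of the committed JSON specs and shells out to `vllm bench serve` once per concurrency step. The spec file name, the `--endpoint`/`--max-concurrency` flag spellings, and the fixed output length are assumptions for illustration; only the field names visible in the diff and the flags quoted in the description above come from the PR.

```python
import json
import subprocess

# Illustrative file name; the committed spec in this PR may live elsewhere.
SPEC_PATH = "test_qwen3_omni_2gpu_base.json"

with open(SPEC_PATH) as f:
    specs = json.load(f)

for spec in specs:
    # One bench run per concurrency step: 128 prompts at 8/16/24/32 concurrency.
    for num_prompts, concurrency in zip(spec["num_prompts"], spec["max_concurrency"]):
        cmd = [
            "vllm", "bench", "serve",
            "--omni",
            "--backend", "openai-chat-omni",
            "--dataset-name", "random",
            "--endpoint", spec["endpoint"],
            "--num-prompts", str(num_prompts),
            "--random-input-len", str(spec["random_input_len"]),
            "--output-len", "512",   # fixed output length from the described workload
            "--ignore-eos",
            "--max-concurrency", str(concurrency),
        ]
        print(f"[{spec['test_name']}] max_concurrency={concurrency}")
        subprocess.run(cmd, check=True)
```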
Test Result
Both runs completed 128/128 requests at every concurrency step, tested on 8×H20-96G:
Speedup of 3gpu_replica2 over 2gpu_base:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)