[CI][Perf] Add high-load stress phase for Qwen3-TTS daily perf by linyueqian · Pull Request #3238 · vllm-project/vllm-omni

linyueqian · 2026-04-29T05:21:42Z

Summary

Daily DFX TTS perf currently caps at max_concurrency=8 in the throughput regime, so high-load TTFA tail regressions (e.g. the cross-request Code2Wav batching gap that #3163 proposes to fix and that #3221's Triton stack already demonstrates a fix for) are invisible to nightly CI.

This PR adds a stress phase to test_qwen3_tts_customvoice for both default_voice and voice_design, mirroring the open-loop pattern already used by test_qwen_omni.json:

num_prompts: [100], request_rate: [2.0] (open-loop, ~2 req/s offered for ~50 s of wall time)
Same prompt source as the existing throughput phase, so no new dataset deps
Baselines intentionally loose (median TTFA 3.0–3.5 s, median RTF 0.25–0.30, audio_throughput floor 4.0 audio-s/wall-s) so it alarms only on real regressions and can be tightened once we have a few nightly runs
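The loose baselines above amount to two kinds of bounds: ceilings for latency-style metrics (median TTFA, median RTF) and a floor for audio_throughput. A minimal sketch of that check follows, assuming hypothetical metric names and a hypothetical check_baselines helper; the real comparison lives in the perf runner and is not shown in this PR. The 3.5 s and 0.30 ceilings take the looser end of the quoted ranges, and the measured values are made up for illustration.

```python
# Hypothetical helper and metric names; only the numeric bounds come from the PR text.
def check_baselines(measured, ceilings, floors):
    """Return a list of human-readable violations (an empty list means pass)."""
    violations = []
    # Latency-style metrics must stay at or below their ceiling.
    for name, limit in ceilings.items():
        if measured[name] > limit:
            violations.append(f"{name}={measured[name]} exceeds ceiling {limit}")
    # Throughput-style metrics must stay at or above their floor.
    for name, limit in floors.items():
        if measured[name] < limit:
            violations.append(f"{name}={measured[name]} below floor {limit}")
    return violations


STRESS_CEILINGS = {"median_ttfa_s": 3.5, "median_rtf": 0.30}
STRESS_FLOORS = {"audio_throughput": 4.0}

# Illustrative measurements, not real benchmark results.
healthy = {"median_ttfa_s": 1.2, "median_rtf": 0.21, "audio_throughput": 6.5}
regressed = {"median_ttfa_s": 4.4, "median_rtf": 0.21, "audio_throughput": 3.1}

print(check_baselines(healthy, STRESS_CEILINGS, STRESS_FLOORS))
print(check_baselines(regressed, STRESS_CEILINGS, STRESS_FLOORS))
```

With bounds this loose, a healthy run passes with wide margin and only a large shift in TTFA or throughput trips the alarm, which matches the stated intent to tighten later.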

eval_phase: "stress" is metadata only: run_benchmark.py already lists eval_phase in exclude_keys, so no script change is needed.
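To illustrate why a metadata-only key needs no script change, here is a rough sketch of how an exclude list can keep such keys out of the benchmark command line. The to_cli_args helper and the flag-naming convention are assumptions for illustration, not the actual run_benchmark.py code; only the eval_phase key, the exclude_keys name, and the parameter values come from the PR.

```python
# Assumption: only "eval_phase" is confirmed to be excluded; the real set may hold more keys.
EXCLUDE_KEYS = {"eval_phase"}


def to_cli_args(params, exclude_keys=EXCLUDE_KEYS):
    """Turn a flat parameter dict into CLI-style arguments, skipping metadata keys."""
    args = []
    for key, value in params.items():
        if key in exclude_keys:
            continue  # metadata only, never forwarded to the benchmark
        flag = "--" + key.replace("_", "-")  # hypothetical flag convention
        args.extend([flag, str(value)])
    return args


# One expanded point of the stress sweep (num_prompts: [100], request_rate: [2.0]).
stress_params = {"num_prompts": 100, "request_rate": 2.0, "eval_phase": "stress"}
print(to_cli_args(stress_params))
```

Because the key is dropped before arguments are built, tagging an entry with a new phase name changes how results are labelled but not what is executed.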

Why this matters

On H20 (single H20-3e), the gap between current main and a stack with cross-request codec batching shows up exactly in this load region:

Concurrency	main req/s	main TTFA p95	PR #3221 req/s	PR #3221 TTFA p95
8	2.29	386 ms	4.41	397 ms
16	2.44	4437 ms	4.52	258 ms
32	2.47	8626 ms	6.73	463 ms

main saturates at ~2.47 req/s, with throughput already nearly flat from c=8 (2.29 req/s) onward, so any offered load above that queues hard. The new entry sits at an offered rate of 2.0 req/s (~80% of main's sustainable rate, ~30% of #3221's best measured rate), which is the regime where regressions in scheduler / codec batching are loudest.
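The arithmetic behind these percentages, and the reason an open-loop load just under saturation is sensitive to regressions, can be sketched with a toy model. The simulation below is an assumption-heavy simplification (a single FIFO server with deterministic arrivals and service times), not a model of the actual batched serving stack; the 1.6 req/s "regressed" capacity is a made-up value for illustration.

```python
def simulate_open_loop(num_requests, offered_rate, service_rate):
    """Queueing delay per request for a single FIFO server.

    Open-loop: request i is sent at i / offered_rate regardless of whether
    earlier requests have finished.
    """
    interarrival = 1.0 / offered_rate
    service_time = 1.0 / service_rate
    server_free_at = 0.0
    waits = []
    for i in range(num_requests):
        arrival = i * interarrival
        start = max(arrival, server_free_at)  # wait if the server is still busy
        waits.append(start - arrival)
        server_free_at = start + service_time
    return waits


# Numbers quoted in the PR text and table.
send_window_s = 100 / 2.0   # 100 prompts at 2.0 req/s
load_vs_main = 2.0 / 2.47   # fraction of main's saturated throughput
load_vs_3221 = 2.0 / 6.73   # fraction of PR #3221's throughput at c=32

# Capacity above the offered rate: no queue builds up.
below = simulate_open_loop(100, offered_rate=2.0, service_rate=2.47)
# Hypothetical regression that drops capacity below the offered rate.
above = simulate_open_loop(100, offered_rate=2.0, service_rate=1.6)

print(round(send_window_s, 1), round(load_vs_main, 2), round(load_vs_3221, 2))
print(max(below), round(above[-1], 3))
```

In this toy model, once capacity falls below the offered 2.0 req/s the wait grows linearly with each request, which is the mechanism by which a modest throughput regression becomes a large TTFA shift under open-loop load.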

Test plan

python3 -c "import json; json.load(open('tests/dfx/perf/tests/test_tts.json'))" parses cleanly
tests/dfx/perf/tests/test_runner_metadata.py (already excludes eval_phase) still passes
First nightly run produces baseline metrics for the new stress entries; tighten the floors in a follow-up PR
After #3163 ([RFC]: Cross-request batching for Qwen3-TTS Code2Wav stage to fix TTFB scaling under concurrency) lands, re-check the median TTFA / RTF on the stress entries to confirm the win is captured by the daily run
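The parse check in the test plan could be extended slightly to confirm that both stress entries are present. The sketch below walks the parsed JSON recursively so it does not depend on the file's exact layout; the sample document and its field names (other than eval_phase) are hypothetical stand-ins for tests/dfx/perf/tests/test_tts.json, whose real structure is not shown here.

```python
import json


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


def count_phase(config_text, phase):
    """Parse JSON text and count entries tagged with the given eval_phase."""
    config = json.loads(config_text)
    return collect_values(config, "eval_phase").count(phase)


# Hypothetical layout used only to exercise the helper.
sample = json.dumps([
    {"task": "default_voice", "params": [{"eval_phase": "throughput"},
                                         {"eval_phase": "stress"}]},
    {"task": "voice_design", "params": [{"eval_phase": "stress"}]},
])
print(count_phase(sample, "stress"))
```

Against the real file, the same helper could be fed the contents of test_tts.json and would be expected to report one stress entry each for default_voice and voice_design.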

cc @ischencheng (re #3163), @vklimkov-nvidia (re #3221), @hsliuustc0106

Daily TTS perf CI currently caps at max_concurrency=8 in the throughput regime, so high-load TTFA tail regressions (e.g. the Code2Wav cross-request batching gap discussed in vllm-project#3163 / shown by vllm-project#3221) are invisible to nightly. This adds a stress phase mirroring the open-loop pattern already used by test_qwen_omni.json: 100 requests at request_rate=2.0 for both default_voice and voice_design. Baselines are intentionally loose (median TTFA 3.0-3.5 s, median RTF 0.25-0.30, audio_throughput floor 4.0 audio-s/wall-s) so the entry alarms only on real regressions and can be tightened in a follow-up once we have a few nightly runs. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

hsliuustc0106

Review summary

Validated:

All gates pass (DCO, pre-commit, build 3.11/3.12, docs)
eval_phase is in exclude_keys at run_benchmark.py:336 — no script change needed
New entries mirror the existing throughput/latency structure exactly (same percentile-metrics, dataset_path, backend, consistent per-task config)
Baselines are intentionally loose, documented as such, with a plan to tighten after nightly data
Benchmark evidence in the PR body is thorough: concrete concurrency/throughput/TTFA table showing exactly the gap this stress phase will catch

No blocking issues. The change is well-scoped (config-only, +36 lines), well-justified, and well-evidenced. LGTM.

hsliuustc0106 approved these changes Apr 29, 2026

linyueqian enabled auto-merge (squash) May 1, 2026 18:14

linyueqian added the ready (label to trigger buildkite CI) and merge-test (label to trigger buildkite merge test CI) labels May 2, 2026

linyueqian merged commit 5e82d7f into vllm-project:main May 2, 2026
7 of 8 checks passed

This was referenced May 3, 2026

[Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec #3221

Open

[Bug]: /v1/audio/speech with stream=true + response_format=pcm concatenates multiple TTS generations into one stream (~2× audio_duration vs WAV) #3326

Closed

ischencheng mentioned this pull request May 5, 2026

[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0) #3322

Open

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[CI][Perf] Add high-load stress phase for Qwen3-TTS daily perf (vllm-…

39a6f0e

…project#3238) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian merged 1 commit into vllm-project:main from linyueqian:feat/bench-dfx-tts-c10-n100