[CI][Perf] Add high-load stress phase for Qwen3-TTS daily perf #3238
Merged
linyueqian merged 1 commit on May 2, 2026
Daily TTS perf CI currently caps at max_concurrency=8 in the throughput regime, so high-load TTFA tail regressions (e.g. the Code2Wav cross-request batching gap discussed in vllm-project#3163 / shown by vllm-project#3221) are invisible to nightly. This adds a stress phase mirroring the open-loop pattern already used by test_qwen_omni.json: 100 requests at request_rate=2.0 for both default_voice and voice_design. Baselines are intentionally loose (median TTFA 3.0-3.5 s, median RTF 0.25-0.30, audio_throughput floor 4.0 audio-s/wall-s) so the entry alarms only on real regressions and can be tightened in a follow-up once we have a few nightly runs. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
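The open-loop pattern in the commit message can be illustrated with a minimal sketch. `StressPhase` and `send_offsets` are hypothetical names introduced here, not part of the benchmark harness, and evenly spaced arrivals are an assumption (a real load generator may randomize the gaps between requests). The point is only that send times depend on the offered rate, not on how fast earlier requests finish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressPhase:
    """Hypothetical container for the two knobs named in the commit message."""

    num_prompts: int = 100
    request_rate: float = 2.0  # requests per second offered to the server


def send_offsets(phase: StressPhase) -> list[float]:
    """Return the time (seconds from start) at which each request is sent.

    Open-loop: the offsets depend only on the offered rate, never on how
    quickly earlier requests complete. Evenly spaced arrivals are an
    assumption made for this sketch.
    """
    interval = 1.0 / phase.request_rate
    return [i * interval for i in range(phase.num_prompts)]


offsets = send_offsets(StressPhase())
# Number of requests, gap between the first two, and time of the last send.
print(len(offsets), offsets[1], offsets[-1])  # 100 0.5 49.5
```

With 100 requests at 2.0 req/s the last request goes out at 49.5 s, which matches the roughly 50 s of offered load that the phase is sized for.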
hsliuustc0106 (Collaborator) approved these changes on Apr 29, 2026 and left a comment:
Review summary
Validated:
- All gates pass (DCO, pre-commit, build 3.11/3.12, docs)
- eval_phase is in exclude_keys at run_benchmark.py:336; no script change needed
- New entries mirror the existing throughput/latency structure exactly (same percentile-metrics, dataset_path, backend, consistent per-task config)
- Baselines are intentionally loose, documented as such, with a plan to tighten after nightly data
- Benchmark evidence in the PR body is thorough: concrete concurrency/throughput/TTFA table showing exactly the gap this stress phase will catch
No blocking issues. The change is well-scoped (config-only, +36 lines), well-justified, and well-evidenced. LGTM.
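To make the exclude_keys point concrete, here is a rough sketch of how a metadata-only key can be dropped before a config entry is turned into benchmark arguments. This is not the code in run_benchmark.py: `EXCLUDE_KEYS`, `to_cli_args`, and the flag-naming rule are all hypothetical and only illustrate the idea.

```python
# Hypothetical stand-in for the exclusion set; the real one lives in
# run_benchmark.py and may contain other keys.
EXCLUDE_KEYS = {"eval_phase"}


def to_cli_args(params: dict) -> list[str]:
    """Turn a config entry into CLI-style arguments, skipping metadata keys.

    The "--key-name value" convention is an assumption for this sketch.
    """
    args: list[str] = []
    for key, value in params.items():
        if key in EXCLUDE_KEYS:
            continue  # metadata only: never forwarded to the benchmark
        args += [f"--{key.replace('_', '-')}", str(value)]
    return args


entry = {"num_prompts": 100, "request_rate": 2.0, "eval_phase": "stress"}
print(to_cli_args(entry))
# ['--num-prompts', '100', '--request-rate', '2.0']
```

Because the key is filtered out before any arguments are built, tagging an entry with a new phase name changes labelling only, which is why a config-only change is enough.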
sphinxkkkbc pushed a commit to sphinxkkkbc/vllm-omni that referenced this pull request on May 4, 2026
…project#3238) Signed-off-by: Yueqian Lin <linyueqian@outlook.com> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request on May 12, 2026
…project#3238) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Summary
Daily DFX TTS perf currently caps at max_concurrency=8 in the throughput regime, so high-load TTFA tail regressions (e.g. the cross-request Code2Wav batching gap that #3163 proposes to fix and that #3221's Triton stack already demonstrates a fix for) are invisible to nightly CI.

This PR adds a stress phase to test_qwen3_tts_customvoice for both default_voice and voice_design, mirroring the open-loop pattern already used by test_qwen_omni.json:
- num_prompts: [100], request_rate: [2.0] (open-loop, ~2 req/s offered for ~50 s of wall time)
- Baselines are intentionally loose (median TTFA 3.0-3.5 s, median RTF 0.25-0.30, audio_throughput floor 4.0 audio-s/wall-s) so it alarms only on real regressions and can be tightened once we have a few nightly runs
- eval_phase: "stress" is metadata only; run_benchmark.py already lists eval_phase in exclude_keys, so no script change is needed.

Why this matters
On H20 (single H20-3e), the gap between current main and a stack with cross-request codec batching shows up exactly in this load region:

main saturates at ~2.47 req/s starting at c=8, so any offered load above that queues hard. The new entry sits at offered rate=2.0 (~80 % of main's sustainable rate, ~30 % of #3221's); that's the regime where regressions in scheduler / codec batching are loudest.

Test plan
- python3 -c "import json; json.load(open('tests/dfx/perf/tests/test_tts.json'))" parses cleanly
- tests/dfx/perf/tests/test_runner_metadata.py (already excludes eval_phase) still passes

cc @ischencheng (re #3163), @vklimkov-nvidia (re #3221), @hsliuustc0106
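The load figures quoted under "Why this matters" can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the PR (2.47 req/s saturation, 2.0 req/s offered, 100 prompts). The `backlog_after` helper and the 3.0 req/s comparison rate are hypothetical, and the fluid queue model is a deliberate simplification that ignores burstiness and batching.

```python
# Figures quoted in the PR text.
MAIN_SATURATION_RPS = 2.47  # where main stops scaling (reached at c=8)
OFFERED_RPS = 2.0           # request_rate of the new stress entry
NUM_PROMPTS = 100


def backlog_after(offered_rps: float, capacity_rps: float, seconds: float) -> float:
    """Fluid approximation of queued requests after `seconds` of steady load.

    Assumption for this sketch: arrivals and service are perfectly smooth, so
    a queue only builds when the offered rate exceeds capacity.
    """
    return max(0.0, offered_rps - capacity_rps) * seconds


utilisation = OFFERED_RPS / MAIN_SATURATION_RPS
window_s = NUM_PROMPTS / OFFERED_RPS

print(f"offered load is {utilisation:.0%} of main's saturation rate")
print(f"requests are issued over {window_s:.0f} s")
# 3.0 req/s is a hypothetical rate above saturation, for contrast only.
for rate in (OFFERED_RPS, 3.0):
    queued = backlog_after(rate, MAIN_SATURATION_RPS, window_s)
    print(f"backlog at {rate} req/s after {window_s:.0f} s: {queued:.1f} requests")
```

In this idealized model the stress entry sits just under main's saturation point (about 81 %, consistent with the "~80 %" in the text) and builds no steady backlog, whereas a rate above 2.47 req/s would queue continuously for the whole window.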