[Perf][TTS] Restore Qwen3-TTS default_voice c=64 TTFP to v021 baseline by linyueqian · Pull Request #3839 · vllm-project/vllm-omni

linyueqian · 2026-05-24T15:28:05Z

Summary

On qwen3_tts_high_concurrency.yaml from #3662, default_voice short c=64 median TTFP regressed ~2.18× vs v0.21.0rc1 (736 → 1604 ms on 2× H20). The #3662 author (@Sy0307) measured only a ~6 % delta at the merge commit c99df1eb, so the catastrophic part of the regression is in one or more of the 60 commits between c99df1eb and current main HEAD; which specific commit is still under bisect.

This PR restores v021-or-better TTFP on the affected cell by defaulting the code_predictor_prefix_graphs knob off in the bundled yaml and tightening a few adjacent code paths. Result: 710 ms (3.5 % better than v021).

Voice_clone deployments that relied on the captured prefix graphs can re-enable them via a one-line yaml override; the keys remain in the bundled file with false so the override is documented in-place.
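A minimal sketch of how such a downstream override could layer on top of the bundled profile, written in Python so the merge semantics are explicit. The nesting under `connector_extra` and the `deep_merge` helper are assumptions for illustration only; this PR does not show the real layout of qwen3_tts_high_concurrency.yaml or the loader's actual merge rules.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical shape of the bundled profile after this PR (flag kept, set to False).
bundled = {"connector_extra": {"code_predictor_prefix_graphs": False}}

# Hypothetical one-line downstream override for a voice_clone deployment.
voice_clone_override = {"connector_extra": {"code_predictor_prefix_graphs": True}}

effective = deep_merge(bundled, voice_clone_override)
```

Because the key stays in the bundled file, the override only has to flip a value that is already visible there. The commit message adds that the buckets / seq_lens must be supplied alongside the flag when re-enabling.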

Measured (default_voice short c=64, p=512, w=8, 2× H20)

| state | median TTFP (ms) | gate ≤ 773 ms |
|---|---|---|
| v021 (0.21.0rc1) | 736 | n/a |
| #3662 merge `c99df1eb` (per @Sy0307) | ~757 | ✓ |
| vanilla current main + bundled yaml | 1604 | ✗ (2.18×) |
| this PR | 710 | ✓ (0.97×) |
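The arithmetic behind the table can be reproduced with a short sketch. The 773 ms gate is consistent with 1.05 × the 736 ms baseline (772.8 ms); that derivation is an inference from the numbers, not something the PR states, so `GATE_FACTOR` below is an assumption.

```python
BASELINE_MS = 736.0  # v021 median TTFP from the table
GATE_FACTOR = 1.05   # assumption: the 773 ms gate is 1.05 x baseline

measurements_ms = {
    "#3662 merge c99df1eb": 757.0,
    "current main + bundled yaml": 1604.0,
    "this PR": 710.0,
}


def summarize(median_ms, baseline_ms=BASELINE_MS, gate_factor=GATE_FACTOR):
    """Return (percent change vs baseline, whether the gate passes)."""
    delta_pct = round((median_ms - baseline_ms) / baseline_ms * 100.0, 1)
    passes_gate = median_ms <= gate_factor * baseline_ms
    return delta_pct, passes_gate


summary = {name: summarize(ms) for name, ms in measurements_ms.items()}
```

Under these assumptions the three rows come out as +2.9 % (pass), +117.9 % (fail) and -3.5 % (pass), matching the ✓ / ✗ column.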

Changes

- Site 1 (qwen3_tts_talker.py): scalar decode-preprocess fast-path when the decode batch is small (≤ scalar_decode_preprocess_threshold, default 8) or contains no task_type=Base request. Both knobs are exposed as connector_extra fields for yaml tuning.
- Site 2 (same file): raise the trailing-text compaction floor from 64 → 256 frames so short prompts no longer pay a mid-stream slice/copy. The original 64 remains the legacy default for callers that do not set the connector_extra knob.
- Site 3 (qwen3_tts_high_concurrency.yaml): default code_predictor_prefix_graphs: false. This single knob alone is the dominant fix. Voice_clone deployments can re-enable it via a one-line override.
- Site 4 (same yaml): widen decode_cudagraph_capture_sizes to [25, 49, 73, 97, 145, 169, 325] so default_voice's 49 / 145-frame chunks no longer pay re-compile cost.
- Tests: 12-case parametrized parity test between the scalar fast-path and the batched path, plus 4 routing-predicate unit tests. Runs in <1 s without GPU.
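The Site 1 routing rule can be sketched as a small predicate. This is a reconstruction from the description above, not the code in qwen3_tts_talker.py; `DecodeRequest` and the function name are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeRequest:
    """Hypothetical stand-in for one entry in the decode batch."""
    task_type: str  # "Base" or "CustomVoice" (default_voice runs as CustomVoice)


def should_use_scalar_decode_preprocess(requests, threshold=8):
    """Pick the scalar path for small batches or batches with no Base request."""
    if len(requests) <= threshold:
        return True
    return not any(r.task_type == "Base" for r in requests)


# A c=64 default_voice batch has no Base request, so it takes the scalar path.
default_voice_batch = [DecodeRequest("CustomVoice")] * 64
mixed_batch = default_voice_batch[:-1] + [DecodeRequest("Base")]
```

With the default threshold of 8, only batches larger than 8 that contain at least one Base request stay on the batched path.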

Scope honestly stated

Only the worst-regression cell (default_voice short c=64) has been measured on this branch. Other concurrencies and voice_clone cells are not yet verified. Two known risks:

1. voice_clone c=64: #3662's headline win. This PR defaults prefix-graphs off, so a vanilla voice_clone user loses that improvement unless they re-enable the flag in a downstream yaml. A reviewer-side check at voice_clone short c=64 with the override yaml closes this gap.
2. The unexplained 700+ ms cost of prefix graphs on current main: the dispatch is a single dict membership (`in`) check per AR step (~100 ns), so the empirical 700+ ms cost presumably comes from a CUDA-graph replay / memory-layout interaction inside the captured graphs. A bisect across the 60 post-#3662 commits is needed to pin down the actual breaking change.
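A back-of-the-envelope sketch of why the dispatch check itself cannot explain the cost. The 100 ns and 700 ms figures are taken from the text above; `measure_dict_in_ns` is a rough local timing helper whose result depends on the machine, and the 64-key table is an arbitrary assumption.

```python
import timeit

CLAIMED_CHECK_NS = 100.0  # per-step dict membership cost quoted above
OBSERVED_COST_MS = 700.0  # approximate TTFP cost attributed to prefix graphs


def checks_needed(cost_ms, check_ns):
    """Number of membership checks that would add up to cost_ms."""
    return cost_ms * 1e6 / check_ns


def measure_dict_in_ns(n_keys=64, repeats=200_000):
    """Rough timing of `key in dict` in nanoseconds; machine-dependent."""
    table = {i: None for i in range(n_keys)}
    seconds = timeit.timeit(lambda: 49 in table, number=repeats)
    return seconds / repeats * 1e9


required = checks_needed(OBSERVED_COST_MS, CLAIMED_CHECK_NS)
```

At 100 ns per check it would take seven million checks on the critical path before the first packet to account for 700 ms, which is why the cost is attributed to the graphs themselves.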

@Sy0307 — would appreciate your eyes on this. Two specific questions:

1. Does the voice_clone-override-yaml story match your original intent for #3662?
2. Have you observed any perf change between c99df1eb and current main HEAD on your end? Your ~757 ms baseline plus my 1604 ms on the same yaml is what flagged the post-merge regression.

Test plan

- pytest tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py: 12 cases pass
- sanity bench: default_voice short c=64 p=512 on 2× H20: 710 ms
- full 60-cell sweep (default_voice × voice_clone × 6 concurrencies × 2 text lens)
- voice_clone short c=64 with prefix-graph override yaml: confirm ≤ 1.05× current main
- post-#3662 bisect to localize the prefix-graph-cost amplifier
- quality eval (WER / SIM / UTMOS) at voice_clone short c=8 within ±2 % of v021

@Sy0307

Qwen3-TTS default_voice short c=64 median TTFP on the bundled qwen3_tts_high_concurrency.yaml profile regressed ~2.18x against v0.21.0rc1 sometime after vllm-project#3662 (c99df1e). Measured on 2x H20:

- v021 (0.21.0rc1) baseline: 736 ms
- PR vllm-project#3662 merge c99df1e (per @Sy0307): ~757 ms (+ ~3%)
- current main HEAD + bundled yaml: 1604 ms (2.18x)
- this commit: 710 ms (0.97x)

vllm-project#3662 itself did not regress this cell; the author measured only a ~6% delta at the merge commit. One or more of the 60 commits that landed between c99df1e and current main amplified the cost of the new `code_predictor_prefix_graphs` code path; which specific commit is still under bisect. Until that is identified and fixed at root, this commit restores v021-equivalent TTFP at the affected cell by defaulting the prefix-graph knob off in the bundled yaml and tightening a few adjacent code paths.

Changes
-------

Site 1 -- vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py

Add a scalar decode-preprocess fast-path that loops to the existing single-request preprocess() when the decode batch is small (<= scalar_decode_preprocess_threshold, default 8) or contains no task_type=Base request. The batched path's per-step Python coordination costs more than the single embed_input_ids call it amortizes for those batches. Both knobs are exposed as connector_extra fields so the routing is yaml-tunable.

Site 2 -- same file

Raise the trailing-text compaction floor from 64 to 256 frames so short prompts no longer pay a mid-stream slice / copy. The original 64 is preserved as the legacy default for callers that do not set the connector_extra knob.

Site 3 -- vllm_omni/deploy/qwen3_tts_high_concurrency.yaml

Default `code_predictor_prefix_graphs: false`. Disabling this knob alone is the dominant fix on the worst cell. Voice_clone deployments that previously relied on the captured prefix graphs can re-enable them by overriding `code_predictor_prefix_graphs: true` (and supplying the buckets / seq_lens) in a downstream yaml; the keys stay in the bundled file with `false` so the override is documented in-place.

Site 4 -- same yaml

Widen `decode_cudagraph_capture_sizes` to [25, 49, 73, 97, 145, 169, 325] so default_voice's 49 / 145-frame chunks no longer fall outside the captured set and pay re-compile cost per cell.

Tests
-----

tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py adds a 12-case parametrized parity test covering batch_size in {1, 2, 4, 8} crossed with task_type in {Base, CustomVoice}. Each case runs both the scalar fast-path and the batched path against the same synthetic inputs and asserts that (input_ids, inputs_embeds, past_hidden, text_step, updates) are byte-equivalent. Plus four unit tests on the routing predicate. Runs in <1s without GPU.

Scope honestly stated
---------------------

Only the worst-regression cell (default_voice short c=64) has been measured on this branch. Other concurrencies and voice_clone cells are unverified. The voice_clone-c=64 prefix-graph win is preserved only for deployments that explicitly re-enable the flag in a downstream yaml. A reviewer-side measurement at voice_clone short c=64 with the override yaml closes the remaining gap.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
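The parity-test idea in the commit message can be illustrated with a toy sketch. The two preprocess functions below are made-up stand-ins, not the talker's real code paths, and the grid only crosses the batch sizes and task types named above; that grid alone yields 8 combinations, fewer than the 12 cases the PR reports, so the real test presumably parametrizes something further that is not described here.

```python
from itertools import product


def preprocess_one(token_id, task_type):
    """Toy single-request preprocess: returns (input_id, embedding)."""
    offset = 0 if task_type == "Base" else 1000
    return token_id, (token_id + offset) * 0.5


def preprocess_scalar(batch, task_type):
    """Scalar fast-path: loop the single-request function over the batch."""
    return [preprocess_one(t, task_type) for t in batch]


def preprocess_batched(batch, task_type):
    """Batched path: compute all embeddings first, then pair them with ids."""
    offset = 0 if task_type == "Base" else 1000
    embeds = [(t + offset) * 0.5 for t in batch]
    return list(zip(batch, embeds))


def run_parity_cases():
    """Assert both paths agree on every grid cell; return the case count."""
    cases = list(product([1, 2, 4, 8], ["Base", "CustomVoice"]))
    for batch_size, task_type in cases:
        batch = list(range(batch_size))
        scalar_out = preprocess_scalar(batch, task_type)
        batched_out = preprocess_batched(batch, task_type)
        assert scalar_out == batched_out, (batch_size, task_type)
    return len(cases)
```

Feeding both paths the same synthetic inputs and comparing every output field is what lets such a test run without a GPU.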

linyueqian · 2026-05-24T16:00:47Z

Benchmark: this PR vs current `main`

Hardware: 2× H20. Stage 0 talker on GPU 0 (S0=64), Stage 1 Code2Wav on GPU 1 (S1=10). Deploy config: vllm_omni/deploy/qwen3_tts_high_concurrency.yaml.

Workload: default_voice (task_type=CustomVoice), seed-tts-text EN short bucket, --num-prompts 512 --num-warmups 8 --max-concurrency 64 --request-rate inf. Bench cmd is vllm bench serve --omni ….

| build | median TTFP (ms) | P99 TTFP (ms) | successful / failed |
|---|---|---|---|
| v0.21.0rc1 (baseline) | 736 | 1353 | 512 / 0 |
| current `main` HEAD `748470c1` + bundled yaml | 1604 | 1936 | 512 / 0 |
| this PR (`ca6b175e`) | 710 | 2092 | 512 / 0 |
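For readers comparing the median and P99 columns, here is a minimal sketch of how the two statistics can be computed from per-request TTFP samples. The nearest-rank percentile definition is an assumption (the bench tool may interpolate differently), and the sample data is synthetic, not taken from these runs.

```python
import math
import statistics


def p99_nearest_rank(samples):
    """99th percentile using the nearest-rank definition."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def ttfp_summary(samples_ms):
    """Summarize per-request TTFP samples (milliseconds)."""
    return {
        "median_ms": statistics.median(samples_ms),
        "p99_ms": p99_nearest_rank(samples_ms),
    }


# 512 made-up samples: most requests are fast, a handful are slow.
synthetic = [700.0] * 506 + [2000.0] * 6
result = ttfp_summary(synthetic)
```

The synthetic run shows how a low median and a high P99 can coexist: six slow requests out of 512 leave the median untouched but set the tail. The same pattern is visible in the row for this PR, whose P99 (2092 ms) is higher than that of both other builds.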

Headline: this PR's median TTFP is 0.97× v021 and 0.44× current main at the cell where #3662 most visibly regressed.

The driver of the 1604 → 710 ms improvement is yaml-level: defaulting code_predictor_prefix_graphs: false. Sites 1, 2 and 4 are smaller contributors (about 200 ms in aggregate vs the 700+ ms prefix-graph drop). The code paths for prefix graphs themselves are untouched, so a downstream yaml that re-enables code_predictor_prefix_graphs: true falls back to the pre-PR behavior for that connector.

Other concurrencies (c=1, 8, 32, 128, 256) and long-text cells are still under measurement; I'll follow up with a wider sweep, and with the post-#3662 bisect to locate which commit amplified the prefix-graph cost.

hsliuustc0106 · 2026-05-24T17:16:45Z

do we have a test case to avoid perf regression?

hsliuustc0106

Approved. The fix is correct and performance claims are well-evidenced.

What I validated:

- Root cause: code_predictor_prefix_graphs: true on the high-concurrency YAML is the dominant regression source. The knob is consumed by Qwen3CodePredictor._stage_connector_extra_config at qwen3_code_predictor.py:474-477; turning it off is safe and the YAML comment documents the voice_clone override path.
- Scalar decode fast-path: The routing logic (_should_use_scalar_decode_preprocess) is correct: scalar path for small batches (≤8) and batches with no task_type=Base requests, batched path otherwise. The 12 parity cases confirm byte-identical output, and 4 further tests cover the routing predicate (16 tests in total).
- Config parsing: _stage_connector_extra_config mirrors the existing implementation in qwen3_code_predictor.py:566-573. _parse_non_negative_int safely handles None, invalid types, and negatives.
- CUDA graph capture: Adding 49 and 145 to decode_cudagraph_capture_sizes stops default_voice chunks from paying re-compile cost.
- Benchmarks: 710 ms vs 1604 ms (current main) vs 736 ms (v021 baseline) measured on 2× H20. The table is clear and credible.
- Gates: All pass.
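A minimal sketch of a parser with the behaviour described for `_parse_non_negative_int`. The signature, the choice to reject booleans, and the truncation of floats are assumptions; the helper in the PR may differ.

```python
def parse_non_negative_int(value, default):
    """Return `value` as a non-negative int, or `default` if it is unusable."""
    # None means "knob not set"; bool is rejected so that True is not read as 1.
    if value is None or isinstance(value, bool):
        return default
    try:
        parsed = int(value)  # accepts ints, numeric strings, and truncates floats
    except (TypeError, ValueError, OverflowError):
        return default
    return parsed if parsed >= 0 else default
```

Falling back to the default instead of raising keeps a malformed yaml value from taking the serving path down, at the cost of hiding the typo; logging the rejected value would be a reasonable addition.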

Follow-ups (non-blocking):

- Only one cell (default_voice short c=64) has been measured. The full sweep is important before declaring victory, especially voice_clone c=64 with prefix graphs re-enabled to confirm ≤ 1.05× current main, as the PR itself notes.
- The trailing_text_compact_min_frames knob change (64→256) affects how aggressively short prompts compact their tail; no specific test covers it. A follow-up quality check (WER / SIM) at the boundaries would be reassuring but is not gating.
- The _stage_connector_extra_config fallback (connector_cfg when "extra" is missing) would be worth extracting to a shared utility in a follow-up cleanup.
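The shared utility suggested in the last follow-up might look like the sketch below. The function name and the handling of non-mapping input are assumptions; only the fallback rule (use `connector_cfg` itself when "extra" is missing) comes from the review comment.

```python
from collections.abc import Mapping


def stage_connector_extra(connector_cfg):
    """Return the connector's extra-config mapping.

    Prefers connector_cfg["extra"]; falls back to connector_cfg itself when
    the "extra" key is absent or is not a mapping.
    """
    if not isinstance(connector_cfg, Mapping):
        return {}
    extra = connector_cfg.get("extra")
    if isinstance(extra, Mapping):
        return extra
    return connector_cfg
```

Keeping one copy of this rule would stop the talker and the code predictor from drifting apart on how they read the same yaml.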

Sy0307 · 2026-05-25T08:59:32Z

Thanks for the fix. A few notes from my side.

1. For the voice_clone override story: when #3662 was merged, code_predictor_prefix_graphs was true by default in the bundled qwen3_tts_high_concurrency.yaml. So changing it to false in #3839 does change the default behavior introduced by #3662. I understand it as a mitigation: if the prefix-graph path currently hurts default/custom voice c64 TTFP, disabling it by default is reasonable, while voice_clone deployments can explicitly re-enable it in a downstream yaml.
2. For perf between c99df1eb and current main: on my single-H20 CustomVoice C=64,N=500 run, I did not reproduce a 1604ms TTFP regression. The post-#3662 numbers were around 756/1103-1118 ms median/p90; my main snapshot retest was 733/1075 ms with ~29.3x audio throughput. This is still not the same workload as your 2x H20 default_voice C=64,N=512 bundled-yaml run.
3. I agree with the scalar fast-path for decode preprocess. For small batches or batches without task_type == "Base", the batched preprocess path may not amortize its extra Python coordination cost. Routing CustomVoice/default_voice through the scalar path makes sense, and the parity test covers Base/CustomVoice consistency.

One concern: the new Code2Wav decode_cudagraph_capture_sizes look inactive under the bundled yaml, because Stage1 still has enforce_eager: true, and qwen3_tts_code2wav.py returns before enabling the inner decoder CUDA Graph. That part should either be documented as inactive/override-only, or guarded by a separate explicit knob that decouples Stage1 engine eager mode from inner Code2Wav decoder graph.
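The gating concern can be made concrete with a small sketch. It models the behaviour as described in the comment above (eager mode returning before the inner decoder graph is enabled) plus the proposed decoupling knob; `force_inner_graph` is a hypothetical name, and none of this is the actual logic in qwen3_tts_code2wav.py.

```python
def inner_decoder_graph_enabled(enforce_eager, capture_sizes, force_inner_graph=False):
    """Decide whether the inner Code2Wav decoder CUDA graph would be captured.

    force_inner_graph is a hypothetical knob that decouples the inner decoder
    graph from the Stage1 engine's eager mode.
    """
    if not capture_sizes:
        return False  # nothing to capture
    if force_inner_graph:
        return True   # explicit opt-in wins over engine eager mode
    return not enforce_eager


bundled_sizes = [25, 49, 73, 97, 145, 169, 325]
```

Under this model the widened capture list has no effect with the bundled `enforce_eager: true` unless an explicit opt-in exists, which is the gap the comment asks to document or close.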

…commits

Adds .claude/skills/perf-bisect/, a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant.
- references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics).
- references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download).
- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
- scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks.
- scripts/cells/: four cells covering the three families: tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out), plus a README documenting the <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化" ("high-concurrency TTFP degradation").

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
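The core of attributing a regression to one commit is a binary search over the commit range. The sketch below is a generic illustration, not the logic of scripts/run_bisect.sh; the commit names, the fake measurement function, and the 1.05× tolerance are synthetic assumptions, and it presumes the regression is monotonic (once a commit is bad, all later ones are bad).

```python
def first_bad_commit(commits, is_bad):
    """Binary-search the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. Only interior commits are measured.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def make_threshold_predicate(measure_ms, baseline_ms, tolerance=1.05):
    """Build an is_bad() that flags medians above tolerance x baseline."""
    return lambda commit: measure_ms(commit) > tolerance * baseline_ms


# Synthetic demo: 60 interior commits between a good and a bad endpoint,
# with the regression made to appear at "c37".
calls = []


def fake_measure_ms(commit):
    calls.append(commit)
    return 757.0 if int(commit[1:]) < 37 else 1604.0


commits = [f"c{i}" for i in range(62)]
is_bad = make_threshold_predicate(fake_measure_ms, baseline_ms=736.0)
culprit = first_bad_commit(commits, is_bad)
```

For a range of about 60 commits this needs roughly six benchmark runs, provided each run measures the same full cell as the original regression report.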

linyueqian · 2026-05-25T16:59:43Z

Closing this for now — re-tested against latest main (8f45e68b) on a second hardware class (2× L20X). On L20X this PR is essentially a wash vs vanilla main (1637 → 1584 ms median TTFP at default_voice c=64 p=512); the dramatic improvement is H20-specific. That makes the PR a workaround for an H20-only regression whose root cause we haven't diagnosed, and not the right shape of fix.

Will open a separate issue with the H20 measurements and ping @Sy0307 for root-cause investigation. H20 users who need an immediate mitigation can set code_predictor_prefix_graphs: false in their downstream yaml.

linyueqian requested review from ZeldaHuang, lishunyang12, princepride, tzhouam, yenuo26 and yuanheng-zhao as code owners May 24, 2026 15:28

linyueqian force-pushed the perf/qwen3-tts-restore-v021-default-voice branch 2 times, most recently from 98fd42f to 07b7d0b Compare May 24, 2026 15:39

linyueqian force-pushed the perf/qwen3-tts-restore-v021-default-voice branch from 07b7d0b to ca6b175 Compare May 24, 2026 15:41

hsliuustc0106 added the ready label to trigger buildkite CI label May 24, 2026

hsliuustc0106 approved these changes May 25, 2026

linyueqian mentioned this pull request May 25, 2026

[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits #3861

Closed

linyueqian closed this May 25, 2026

linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:perf/qwen3-tts-restore-v021-default-voice
