[Perf][TTS] Restore Qwen3-TTS default_voice c=64 TTFP to v021 baseline#3839
linyueqian wants to merge 1 commit into
Force-pushed from 98fd42f to 07b7d0b
Qwen3-TTS default_voice short c=64 median TTFP on the bundled qwen3_tts_high_concurrency.yaml profile regressed ~2.18x against v0.21.0rc1 sometime after vllm-project#3662 (c99df1e). Measured on 2x H20:

- v021 (0.21.0rc1) baseline: 736 ms
- PR vllm-project#3662 merge c99df1e (per @Sy0307): ~757 ms (+ ~3%)
- current main HEAD + bundled yaml: 1604 ms (2.18x)
- this commit: 710 ms (0.97x)

vllm-project#3662 itself did not regress this cell; the author measured only a ~6% delta at the merge commit. One or more of the 60 commits that landed between c99df1e and current main amplified the cost of the new `code_predictor_prefix_graphs` code path; which specific commit is still under bisect. Until that is identified and fixed at root, this commit restores v021-equivalent TTFP at the affected cell by defaulting the prefix-graph knob off in the bundled yaml and tightening a few adjacent code paths.

Changes
-------

Site 1 -- vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py

Add a scalar decode-preprocess fast-path that loops to the existing single-request preprocess() when the decode batch is small (<= scalar_decode_preprocess_threshold, default 8) or contains no task_type=Base request. The batched path's per-step Python coordination costs more than the single embed_input_ids call it amortizes for those batches. Both knobs are exposed as connector_extra fields so the routing is yaml-tunable.

Site 2 -- same file

Raise the trailing-text compaction floor from 64 to 256 frames so short prompts no longer pay a mid-stream slice / copy. The original 64 is preserved as the legacy default for callers that do not set the connector_extra knob.

Site 3 -- vllm_omni/deploy/qwen3_tts_high_concurrency.yaml

Default `code_predictor_prefix_graphs: false`. Disabling this knob alone is the dominant fix on the worst cell. Voice_clone deployments that previously relied on the captured prefix graphs can re-enable them by overriding `code_predictor_prefix_graphs: true` (and supplying the buckets / seq_lens) in a downstream yaml; the keys stay in the bundled file with `false` so the override is documented in-place.

Site 4 -- same yaml

Widen `decode_cudagraph_capture_sizes` to [25, 49, 73, 97, 145, 169, 325] so default_voice's 49 / 145-frame chunks no longer fall outside the captured set and pay re-compile cost per cell.

Tests
-----

tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py adds a 12-case parametrized parity test covering batch_size in {1, 2, 4, 8} crossed with task_type in {Base, CustomVoice}. Each case runs both the scalar fast-path and the batched path against the same synthetic inputs and asserts that (input_ids, inputs_embeds, past_hidden, text_step, updates) are byte-equivalent. Plus four unit tests on the routing predicate. Runs in <1s without GPU.

Scope honestly stated
---------------------

Only the worst-regression cell (default_voice short c=64) has been measured on this branch. Other concurrencies and voice_clone cells are unverified. The voice_clone-c=64 prefix-graph win is preserved only for deployments that explicitly re-enable the flag in a downstream yaml. A reviewer-side measurement at voice_clone short c=64 with the override yaml closes the remaining gap.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
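The Site 1 routing rule in the commit message above can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the code in `qwen3_tts_talker.py`: the function name, its signature, and the representation of a decode batch as a list of task-type strings are all hypothetical, and only the threshold knob is modelled because the second knob is not named in the message.

```python
from typing import Sequence


def should_use_scalar_decode_preprocess(
    task_types: Sequence[str],
    threshold: int = 8,
) -> bool:
    """Hypothetical stand-in for the routing predicate described in Site 1.

    task_types holds one task_type string per request in the decode batch.
    threshold mirrors scalar_decode_preprocess_threshold (default 8).
    """
    # Small decode batches: per-step batching coordination is assumed to cost
    # more than it saves, so route to the single-request path.
    if len(task_types) <= threshold:
        return True
    # Larger batches take the scalar path only when no request is task_type=Base.
    return not any(task_type == "Base" for task_type in task_types)


if __name__ == "__main__":
    print(should_use_scalar_decode_preprocess(["Base"] * 4))
    print(should_use_scalar_decode_preprocess(["Base"] * 9))
    print(should_use_scalar_decode_preprocess(["CustomVoice"] * 64))
```

Under this sketch a batch of four Base requests and a batch of 64 CustomVoice requests both take the scalar path, while nine Base requests take the batched path.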
Force-pushed from 07b7d0b to ca6b175
Benchmark: this PR vs current
| build | median TTFP (ms) | P99 TTFP (ms) | successful / failed |
|---|---|---|---|
| v0.21.0rc1 (baseline) | 736 | 1353 | 512 / 0 |
| current main HEAD 748470c1 + bundled yaml | 1604 | 1936 | 512 / 0 |
| this PR (ca6b175e) | 710 | 2092 | 512 / 0 |
Headline: this PR's median TTFP is 0.97× v021 and 0.44× current main at the cell where #3662 most visibly regressed.
The driver of the 1604 → 710 ms improvement is yaml-level: defaulting `code_predictor_prefix_graphs: false`. Sites 1, 2 and 4 are smaller contributors (about 200 ms in aggregate vs the 700+ ms prefix-graph drop). The code paths for prefix graphs themselves are untouched, so a downstream yaml that re-enables `code_predictor_prefix_graphs: true` falls back to the pre-PR behavior for that connector.
Other concurrencies (c=1, 8, 32, 128, 256) and long-text cells are still under measurement; I'll follow up with a wider sweep, and with the post-#3662 bisect to locate which commit amplified the prefix-graph cost.
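The headline ratios can be re-derived from the rounded medians in the benchmark table. The snippet below is only a sanity check of that arithmetic; the variable names are illustrative, and because the inputs are already rounded to whole milliseconds the last digit of a ratio may differ slightly from one computed on the raw measurements.

```python
# Median TTFP values (ms) copied from the benchmark table.
baseline_ms = 736.0  # v0.21.0rc1
main_ms = 1604.0     # current main HEAD + bundled yaml
pr_ms = 710.0        # this PR

regression_vs_baseline = main_ms / baseline_ms  # how much main regressed
pr_vs_baseline = pr_ms / baseline_ms            # this PR relative to v021
pr_vs_main = pr_ms / main_ms                    # this PR relative to main
improvement_ms = main_ms - pr_ms                # absolute drop on this cell

print(f"main vs baseline: {regression_vs_baseline:.3f}x")
print(f"PR vs baseline:   {pr_vs_baseline:.3f}x")
print(f"PR vs main:       {pr_vs_main:.3f}x")
print(f"absolute drop:    {improvement_ms:.0f} ms")
```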
Do we have a test case to guard against perf regressions?
hsliuustc0106 left a comment
Approved. The fix is correct and performance claims are well-evidenced.
What I validated:
- Root cause: `code_predictor_prefix_graphs: true` on the high-concurrency YAML is the dominant regression source. The knob is consumed by `Qwen3CodePredictor._stage_connector_extra_config` at `qwen3_code_predictor.py:474-477`; turning it off is safe and the YAML comment documents the voice_clone override path.
- Scalar decode fast-path: The routing logic (`_should_use_scalar_decode_preprocess`) is correct: scalar path for small batches (≤8) and batches with no `task_type=Base` requests, batched path otherwise. 16 parity tests confirm byte-identical output.
- Config parsing: `_stage_connector_extra_config` mirrors the existing implementation in `qwen3_code_predictor.py:566-573`. `_parse_non_negative_int` safely handles None, invalid types, and negatives.
- CUDA graph capture: Adding 49 and 145 to `decode_cudagraph_capture_sizes` stops default_voice chunks from paying re-compile cost.
- Benchmarks: 710ms vs 1604ms (current main) vs 736ms (v021 baseline) measured on 2× H20. The table is clear and credible.
- Gates: All pass.
Follow-ups (non-blocking):
- Only one cell (default_voice short c=64) has been measured. The full sweep is important before declaring victory, especially voice_clone c=64 with prefix graphs re-enabled to confirm ≤ 1.05× current main, as the PR itself notes.
- The `trailing_text_compact_min_frames` knob change (64→256) affects how aggressively short prompts compact their tail; no specific test covers it. A follow-up quality check (WER / SIM) at the boundaries would be reassuring but is not gating.
- The `_stage_connector_extra_config` fallback (`connector_cfg` when `"extra"` is missing) would be worth extracting to a shared utility in a follow-up cleanup.
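To make the config-parsing points in this review concrete, here is a minimal sketch of what a shared utility might look like. It is not the repository's implementation: both function names, their signatures, and the choice to reject booleans are assumptions made for illustration; only the behaviours named in the review (fall back when `"extra"` is missing, tolerate None, invalid types, and negatives) are taken from the thread.

```python
from typing import Any, Mapping


def stage_connector_extra(connector_cfg: Mapping[str, Any]) -> Mapping[str, Any]:
    """Hypothetical helper: return the nested "extra" mapping, or the
    connector config itself when "extra" is missing or not a mapping."""
    extra = connector_cfg.get("extra")
    return extra if isinstance(extra, Mapping) else connector_cfg


def parse_non_negative_int(value: Any, default: int) -> int:
    """Hypothetical helper: coerce value to an int >= 0, else return default."""
    # Assumption: booleans are rejected even though bool subclasses int.
    if value is None or isinstance(value, bool):
        return default
    try:
        parsed = int(value)
    except (TypeError, ValueError):
        # Covers unparsable strings and unsupported types such as lists.
        return default
    return parsed if parsed >= 0 else default


if __name__ == "__main__":
    cfg = {"extra": {"scalar_decode_preprocess_threshold": "16"}}
    extra = stage_connector_extra(cfg)
    print(parse_non_negative_int(extra.get("scalar_decode_preprocess_threshold"), 8))
```

Keeping both helpers in one module would let the talker and the code predictor share a single parsing path, which is the cleanup the last follow-up item suggests.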
Thanks for the fix. A few notes from my side.
One concern: the new Code2Wav |
…commits

Adds .claude/skills/perf-bisect/, a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families.

Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant.
- references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics).
- references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download).
- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
- scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks.
- scripts/cells/: four cells covering the three families: tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out), plus a README documenting the <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化" ("high-concurrency TTFP degradation").

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Closing this for now — re-tested against latest Will open a separate issue with the H20 measurements and ping @Sy0307 for root-cause investigation. H20 users who need an immediate mitigation can set |
Summary

On `qwen3_tts_high_concurrency.yaml` from #3662, default_voice short c=64 median TTFP regressed ~2.18× vs v0.21.0rc1 (736 → 1604 ms on 2× H20). The #3662 author (@Sy0307) measured only a ~6 % delta at the merge commit `c99df1eb`, so the catastrophic part of the regression is in one or more of the 60 commits between `c99df1eb` and current main HEAD; which specific commit is still under bisect.

This PR restores v021-or-better TTFP on the affected cell by defaulting the `code_predictor_prefix_graphs` knob off in the bundled yaml and tightening a few adjacent code paths. Result: 710 ms (3.5 % better than v021).

Voice_clone deployments that relied on the captured prefix graphs can re-enable them via a one-line yaml override; the keys remain in the bundled file with `false` so the override is documented in-place.

Measured (default_voice short c=64, p=512, w=8, 2× H20)

`c99df1eb` (per @Sy0307)

Changes

- `qwen3_tts_talker.py`: scalar decode-preprocess fast-path when the decode batch is small (≤ `scalar_decode_preprocess_threshold`, default 8) or contains no `task_type=Base` request. Both knobs exposed as connector_extra fields for yaml tuning.
- `qwen3_tts_high_concurrency.yaml`: default `code_predictor_prefix_graphs: false`. This single knob alone is the dominant fix. Voice_clone deployments can re-enable via a one-line override.
- Widen `decode_cudagraph_capture_sizes` to `[25, 49, 73, 97, 145, 169, 325]` so default_voice's 49 / 145-frame chunks no longer pay re-compile cost.

Scope honestly stated

Only the worst-regression cell (default_voice short c=64) has been measured on this branch. Other concurrencies and `voice_clone` cells are not yet verified. Two known risks:

- … `dict.in` check per AR step (~100 ns), so the empirical 700+ ms cost must come from a CUDA-graph replay / memory-layout interaction inside the captured graphs. Needs a bisect across the 60 post-#3662 commits to pin down the actual breaking change.

@Sy0307, would appreciate your eyes on this. Two specific questions:

- … `c99df1eb` and current main HEAD on your end? Your ~757 ms baseline plus my 1604 ms on the same yaml is what flagged the post-merge regression.

Test plan

- `pytest tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py`: 12 cases pass