[Perf][TTS] Restore Qwen3-TTS default_voice c=64 TTFP to v021 baseline#3839

Closed
linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:perf/qwen3-tts-restore-v021-default-voice

@linyueqian
Collaborator

@linyueqian linyueqian commented May 24, 2026

Summary

On qwen3_tts_high_concurrency.yaml from #3662, default_voice short c=64 median TTFP regressed ~2.18× vs v0.21.0rc1 (736 → 1604 ms on 2× H20). The #3662 author (@Sy0307) measured only a ~3 % delta at the merge commit c99df1eb (~757 ms), so the bulk of the regression comes from one or more of the 60 commits between c99df1eb and current main HEAD; the specific commit is still under bisect.

This PR restores v021-or-better TTFP on the affected cell by defaulting the code_predictor_prefix_graphs knob off in the bundled yaml and tightening a few adjacent code paths. Result: 710 ms (3.5 % better than v021).

Voice_clone deployments that relied on the captured prefix graphs can re-enable them via a one-line yaml override; the keys remain in the bundled file with false so the override is documented in-place.

Measured (default_voice short c=64, p=512, w=8, 2× H20)

| state | median TTFP (ms) | gate ≤ 773 ms |
|---|---|---|
| v021 (0.21.0rc1) | 736 | |
| #3662 merge c99df1eb (per @Sy0307) | ~757 | |
| vanilla current main + bundled yaml | 1604 | ✗ (2.18×) |
| this PR | 710 | ✓ (0.97×) |
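The percentages quoted in this description follow directly from the medians in the table. A minimal sketch of the arithmetic, assuming the 773 ms gate is the v021 median plus 5 % (the thread does not say how the gate was derived; `pct_delta` is a name made up for this example):

```python
# Median TTFP in ms, copied from the table above.
V021_MS = 736
MEDIANS_MS = {
    "#3662 merge c99df1eb": 757,
    "current main + bundled yaml": 1604,
    "this PR": 710,
}

# Assumption: the gate is the v021 median plus 5 %, rounded to whole ms.
GATE_MS = round(V021_MS * 1.05)


def pct_delta(median_ms: float, baseline_ms: float = V021_MS) -> float:
    """Return the change relative to the baseline, in percent (negative = faster)."""
    return (median_ms / baseline_ms - 1.0) * 100.0


for name, median_ms in MEDIANS_MS.items():
    verdict = "pass" if median_ms <= GATE_MS else "fail"
    print(f"{name}: {pct_delta(median_ms):+.1f} % vs v021, gate {verdict}")
```

Under that assumption, both the merge-commit number and this PR's number clear the gate, while current main does not.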

Changes

  • Site 1 (qwen3_tts_talker.py): scalar decode-preprocess fast-path when the decode batch is small (≤ scalar_decode_preprocess_threshold, default 8) or contains no task_type=Base request. Both knobs exposed as connector_extra fields for yaml tuning.
  • Site 2 (same file): raise the trailing-text compaction floor from 64 → 256 frames so short prompts no longer pay a mid-stream slice/copy. The original 64 remains the legacy default when the connector_extra knob isn't set.
  • Site 3 (qwen3_tts_high_concurrency.yaml): default code_predictor_prefix_graphs: false. This single knob alone is the dominant fix. Voice_clone deployments can re-enable via a one-line override.
  • Site 4 (same yaml): widen decode_cudagraph_capture_sizes to [25, 49, 73, 97, 145, 169, 325] so default_voice's 49 / 145-frame chunks no longer pay re-compile cost.
  • Tests: 12-case parametrized parity test between scalar fast-path and batched path, plus 4 routing-predicate unit tests. Runs in <1 s without GPU.
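The Site 1 routing rule can be sketched as a small predicate. This is an illustration of the rule as described above, not the PR's code: the function name and signature are hypothetical, and only the threshold default of 8 comes from the description.

```python
from typing import Sequence

# Default taken from the PR description; the real value is read from connector_extra.
DEFAULT_SCALAR_THRESHOLD = 8


def use_scalar_decode_preprocess(
    task_types: Sequence[str],
    threshold: int = DEFAULT_SCALAR_THRESHOLD,
) -> bool:
    """Hypothetical routing predicate: True selects the scalar fast-path."""
    batch_size = len(task_types)
    if batch_size <= threshold:
        # Small batch: looping over single requests is assumed to be cheaper.
        return True
    # Larger batch: stay on the scalar path only if no request is task_type=Base.
    return all(task != "Base" for task in task_types)
```

With this rule, a c=64 default_voice batch (all CustomVoice) takes the scalar path, while a large batch containing even one Base request falls through to the batched path.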

Scope honestly stated

Only the worst-regression cell (default_voice short c=64) has been measured on this branch. Other concurrencies and voice_clone cells are not yet verified. Two known risks:

  1. voice_clone c=64 is #3662's headline win. This PR defaults prefix-graphs off, so a vanilla voice_clone user loses that improvement unless they re-enable the flag in a downstream yaml. A reviewer-side check at voice_clone short c=64 with the override yaml closes this gap.
  2. The unexplained 700+ ms cost of prefix graphs on current main: the dispatch is a single dict membership (`in`) check per AR step (~100 ns), so the empirical 700+ ms cost must come from a CUDA-graph replay / memory-layout interaction inside the captured graphs. Needs a bisect across the 60 post-#3662 commits to pin down the actual breaking change.

@Sy0307 — would appreciate your eyes on this. Two specific questions:

  • Does the voice_clone-override-yaml story match your original intent for #3662?
  • Have you observed any perf change between c99df1eb and current main HEAD on your end? Your ~757 ms baseline plus my 1604 ms on the same yaml is what flagged the post-merge regression.

Test plan

  • pytest tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py — 12 cases pass
  • sanity bench: default_voice short c=64 p=512 on 2× H20 — 710 ms
  • full 60-cell sweep (default_voice × voice_clone × 6 concurrencies × 2 text lens)
  • voice_clone short c=64 with prefix-graph override yaml: confirm ≤ 1.05× current main
  • post-#3662 bisect to localize the prefix-graph-cost amplifier
  • quality eval (WER / SIM / UTMOS) at voice_clone short c=8 within ±2 % of v021


@linyueqian linyueqian force-pushed the perf/qwen3-tts-restore-v021-default-voice branch 2 times, most recently from 98fd42f to 07b7d0b Compare May 24, 2026 15:39
Qwen3-TTS default_voice short c=64 median TTFP on the bundled
qwen3_tts_high_concurrency.yaml profile regressed ~2.18x against
v0.21.0rc1 sometime after vllm-project#3662 (c99df1e). Measured on 2x H20:

  v021 (0.21.0rc1) baseline:                       736 ms
  PR vllm-project#3662 merge c99df1e (per @Sy0307):          ~757 ms  (+ ~3%)
  current main HEAD + bundled yaml:               1604 ms  (2.18x)
  this commit:                                     710 ms  (0.97x)

vllm-project#3662 itself did not regress this cell; the author measured only a
~3% delta at the merge commit. One or more of the 60 commits that
landed between c99df1e and current main amplified the cost of the
new `code_predictor_prefix_graphs` code path; which specific commit
is still under bisect. Until that is identified and fixed at root,
this commit restores v021-equivalent TTFP at the affected cell by
defaulting the prefix-graph knob off in the bundled yaml and
tightening a few adjacent code paths.

Changes
-------
Site 1 -- vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py
    Add a scalar decode-preprocess fast-path that loops to the
    existing single-request preprocess() when the decode batch is
    small (<= scalar_decode_preprocess_threshold, default 8) or
    contains no task_type=Base request. The batched path's per-step
    Python coordination costs more than the single embed_input_ids
    call it amortizes for those batches. Both knobs are exposed as
    connector_extra fields so the routing is yaml-tunable.

Site 2 -- same file
    Raise the trailing-text compaction floor from 64 to 256 frames
    so short prompts no longer pay a mid-stream slice / copy. The
    original 64 is preserved as the legacy default for callers that
    do not set the connector_extra knob.

Site 3 -- vllm_omni/deploy/qwen3_tts_high_concurrency.yaml
    Default `code_predictor_prefix_graphs: false`. Disabling this
    knob alone is the dominant fix on the worst cell. Voice_clone
    deployments that previously relied on the captured prefix
    graphs can re-enable them by overriding
    `code_predictor_prefix_graphs: true` (and supplying the buckets
    / seq_lens) in a downstream yaml; the keys stay in the bundled
    file with `false` so the override is documented in-place.

Site 4 -- same yaml
    Widen `decode_cudagraph_capture_sizes` to
    [25, 49, 73, 97, 145, 169, 325] so default_voice's 49 / 145-
    frame chunks no longer fall outside the captured set and pay
    re-compile cost per cell.

Tests
-----
tests/model_executor/models/qwen3_tts/test_decode_preprocess_parity.py
adds a 12-case parametrized parity test covering batch_size in
{1, 2, 4, 8} crossed with task_type in {Base, CustomVoice}. Each
case runs both the scalar fast-path and the batched path against
the same synthetic inputs and asserts that
(input_ids, inputs_embeds, past_hidden, text_step, updates) are
byte-equivalent. Plus four unit tests on the routing predicate.
Runs in <1s without GPU.

Scope honestly stated
---------------------
Only the worst-regression cell (default_voice short c=64) has been
measured on this branch. Other concurrencies and voice_clone cells
are unverified. The voice_clone-c=64 prefix-graph win is preserved
only for deployments that explicitly re-enable the flag in a
downstream yaml. A reviewer-side measurement at voice_clone short
c=64 with the override yaml closes the remaining gap.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the perf/qwen3-tts-restore-v021-default-voice branch from 07b7d0b to ca6b175 Compare May 24, 2026 15:41
@linyueqian
Collaborator Author

Benchmark: this PR vs current main

Hardware: 2× H20. Stage 0 talker on GPU 0 (S0=64), Stage 1 Code2Wav on GPU 1 (S1=10). Deploy config: vllm_omni/deploy/qwen3_tts_high_concurrency.yaml.

Workload: default_voice (task_type=CustomVoice), seed-tts-text EN short bucket, --num-prompts 512 --num-warmups 8 --max-concurrency 64 --request-rate inf. Bench cmd is vllm bench serve --omni ….

| build | median TTFP (ms) | P99 TTFP (ms) | successful / failed |
|---|---|---|---|
| v0.21.0rc1 (baseline) | 736 | 1353 | 512 / 0 |
| current main HEAD 748470c1 + bundled yaml | 1604 | 1936 | 512 / 0 |
| this PR (ca6b175e) | 710 | 2092 | 512 / 0 |

Headline: this PR's median TTFP is 0.97× v021 and 0.44× current main at the cell with the most visible post-#3662 regression.

The driver of the 1604 → 710 ms improvement is yaml-level: defaulting code_predictor_prefix_graphs: false. Sites 1, 2 and 4 are smaller contributors (about 200 ms in aggregate vs the 700+ ms prefix-graph drop). The code paths for prefix graphs themselves are untouched, so a downstream yaml that re-enables code_predictor_prefix_graphs: true falls back to the pre-PR behavior for that connector.

Other concurrencies (c=1, 8, 32, 128, 256) and long-text cells are still under measurement; I'll follow up with a wider sweep, and with the post-#3662 bisect to locate which commit amplified the prefix-graph cost.

@hsliuustc0106 hsliuustc0106 added the "ready" label (label to trigger buildkite CI) May 24, 2026
@hsliuustc0106
Collaborator

do we have a test case to avoid perf regression?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Approved. The fix is correct and performance claims are well-evidenced.

What I validated:

  • Root cause: code_predictor_prefix_graphs: true on the high-concurrency YAML is the dominant regression source. The knob is consumed by Qwen3CodePredictor._stage_connector_extra_config at qwen3_code_predictor.py:474-477 — turning it off is safe and the YAML comment documents the voice_clone override path.
  • Scalar decode fast-path: The routing logic (_should_use_scalar_decode_preprocess) is correct: scalar path for small batches (≤8) and batches with no task_type=Base requests, batched path otherwise. The 16 tests (12 parity cases plus 4 routing-predicate tests) pass, and the parity cases confirm byte-identical output.
  • Config parsing: _stage_connector_extra_config mirrors the existing implementation in qwen3_code_predictor.py:566-573. _parse_non_negative_int safely handles None, invalid types, and negatives.
  • CUDA graph capture: Adding 49 and 145 to decode_cudagraph_capture_sizes stops default_voice chunks from paying re-compile cost.
  • Benchmarks: 710ms vs 1604ms (current main) vs 736ms (v021 baseline) measured on 2× H20. The table is clear and credible.
  • Gates: All pass.
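The capture-size point in the list above can be illustrated with a small membership check. This is a sketch under two assumptions that the thread does not confirm: that the pre-PR list is the new list minus the two added sizes, and that a chunk only hits a captured graph on an exact size match. The helper name `uncaptured` is made up for this example.

```python
from typing import List, Sequence

# Capture sizes from Site 4 of this PR.
NEW_CAPTURE_SIZES = [25, 49, 73, 97, 145, 169, 325]
# Assumption: the pre-PR list is the new list minus the two added sizes.
OLD_CAPTURE_SIZES = [s for s in NEW_CAPTURE_SIZES if s not in (49, 145)]


def uncaptured(chunk_frames: Sequence[int], capture_sizes: Sequence[int]) -> List[int]:
    """Return the chunk sizes that have no exact match in the captured set."""
    captured = set(capture_sizes)
    return [n for n in chunk_frames if n not in captured]


# default_voice chunk sizes named in the PR description.
default_voice_chunks = [49, 145]
print("missing before:", uncaptured(default_voice_chunks, OLD_CAPTURE_SIZES))
print("missing after:", uncaptured(default_voice_chunks, NEW_CAPTURE_SIZES))
```

A check like this only describes the configured list; whether the captured graphs are actually used depends on the stage's eager-mode setting.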

Follow-ups (non-blocking):

  1. Only one cell (default_voice short c=64) has been measured. The full sweep is important before declaring victory — especially voice_clone c=64 with prefix graphs re-enabled to confirm ≤ 1.05× current main, as the PR itself notes.
  2. The trailing_text_compact_min_frames knob change (64→256) affects how aggressively short prompts compact their tail — no specific test covers it. A follow-up quality check (WER / SIM) at the boundaries would be reassuring but is not gating.
  3. The _stage_connector_extra_config fallback (connector_cfg when "extra" is missing) would be worth extracting to a shared utility in a follow-up cleanup.

@Sy0307
Collaborator

Sy0307 commented May 25, 2026

Thanks for the fix. A few notes from my side.

  1. For the voice_clone override story: when #3662 was merged, code_predictor_prefix_graphs was true by default in the bundled qwen3_tts_high_concurrency.yaml. So changing it to false in #3839 does change the default behavior introduced by #3662. I understand it as a mitigation: if the prefix-graph path currently hurts default/custom voice c64 TTFP, disabling it by default is reasonable, while voice_clone deployments can explicitly re-enable it in a downstream yaml.

  2. For perf between c99df1eb and current main: on my single-H20 CustomVoice C=64, N=500 run, I did not reproduce a 1604 ms TTFP regression. The post-#3662 numbers were around 756 / 1103-1118 ms median / p90; my main snapshot retest was 733 / 1075 ms with ~29.3x audio throughput. This is still not the same workload as your 2x H20 default_voice C=64, N=512 bundled-yaml run.

  3. I agree with the scalar fast-path for decode preprocess. For small batches or batches without task_type == "Base", the batched preprocess path may not amortize its extra Python coordination cost. Routing CustomVoice/default_voice through the scalar path makes sense, and the parity test covers Base/CustomVoice consistency.

One concern: the new Code2Wav decode_cudagraph_capture_sizes look inactive under the bundled yaml, because Stage1 still has enforce_eager: true, and qwen3_tts_code2wav.py returns before enabling the inner decoder CUDA Graph. That part should either be documented as inactive/override-only, or guarded by a separate explicit knob that decouples Stage1 engine eager mode from inner Code2Wav decoder graph.
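The second option, a separate knob that decouples Stage 1 eager mode from the inner decoder graph, might look roughly like the predicate below. This is a hypothetical sketch: `decoder_graph_override` is an invented knob name, and the default branch only mirrors the behavior described in the concern above (eager mode disables the inner graph).

```python
from typing import Optional, Sequence


def inner_decoder_graph_enabled(
    enforce_eager: bool,
    capture_sizes: Sequence[int],
    decoder_graph_override: Optional[bool] = None,
) -> bool:
    """Decide whether the inner Code2Wav decoder should use CUDA Graphs.

    decoder_graph_override is a hypothetical knob; None keeps the current
    coupling, where engine eager mode also disables the inner graph.
    """
    if not capture_sizes:
        # Nothing to capture, so the graph path cannot be used.
        return False
    if decoder_graph_override is not None:
        # An explicit setting wins over the engine-level eager flag.
        return decoder_graph_override
    return not enforce_eager
```

With the bundled yaml's `enforce_eager: true` and no override, this returns False, which matches the observation that the new capture sizes are inactive by default.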

linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request May 25, 2026
…commits

Adds .claude/skills/perf-bisect/ — a project-local Claude skill that
encodes a repeatable workflow for attributing a vllm-omni perf change to
a specific commit. Covers TTS, diffusion-image, and omni-audio model
families. Generalised from the workflow used during the post-vllm-project#3662
regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel
blast-radius file lists, per-family bench-harness examples, and
ready-to-paste cells for each model class so the same discipline applies
across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga:
extract the full cell (model, task, deploy_yaml, dataset, num_prompts,
max_concurrency, num_warmups + family knobs) from the regression report
BEFORE writing any bench script. Measuring a sibling cell that does not
exercise the regressed code path is the most common path to a false
"no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition
  discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step
  workflow with parallel TTS / diffusion / omni blast-radius file lists
  and per-family bench-harness snippets, the rationalization table of
  excuses-vs-reality, the red-flags list, and a one-paragraph
  cross-platform invariant.

- references/family-knobs.md: full TTS / diffusion / omni knob tables
  (extra_body, stage_overrides, headline metrics).

- references/pitfalls.md: six mechanical failure modes with copy-paste
  remediations (pytest -k zero-match, venv PATH for ninja subprocess,
  stale server PID, multi-tenant GPUs, /v1/models settle, cold download).

- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with
  vllm bench serve, polls /v1/models with a settle window, parses
  median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up
  the server between commits.

- scripts/kanban_trend.py: per-build metric time series from the
  vllm-omni-kanban repo with rolling-delta percent and regression
  markers; works for any cell prefix the kanban tracks.

- scripts/cells/: four cells covering the three families —
  tts_default_voice_high_c (the vllm-project#3839 regression class),
  tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024
  (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni
  audio-in/audio-out) — plus a README documenting the
  <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y",
"verify PR #N actually improves perf", "find which commit slowed default_voice",
"高并发 TTFP 劣化" ("high-concurrency TTFP degradation").

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian
Collaborator Author

Closing this for now — re-tested against latest main (8f45e68b) on a second hardware class (2× L20X). On L20X this PR is essentially a wash vs vanilla main (1637 → 1584 ms median TTFP at default_voice c=64 p=512); the dramatic improvement is H20-specific. That makes the PR a workaround for an H20-only regression whose root cause we haven't diagnosed, and not the right shape of fix.

Will open a separate issue with the H20 measurements and ping @Sy0307 for root-cause investigation. H20 users who need an immediate mitigation can set code_predictor_prefix_graphs: false in their downstream yaml.
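As an illustration of that mitigation, the sketch below merges a downstream override over bundled defaults. The nesting under a `connector_extra` key is an assumption made for the example (the thread does not show the real yaml layout), and `apply_overrides` is a hypothetical helper, not a vllm-omni function.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_overrides(bundled: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of bundled with overrides merged in, recursing into nested dicts."""
    merged = deepcopy(bundled)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical layout; only the knob name comes from this thread.
bundled = {"connector_extra": {"code_predictor_prefix_graphs": True}}
downstream = {"connector_extra": {"code_predictor_prefix_graphs": False}}

effective = apply_overrides(bundled, downstream)
print(effective["connector_extra"]["code_predictor_prefix_graphs"])
```

The bundled mapping is left untouched, so the same defaults can still serve deployments that want the prefix graphs enabled.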

@linyueqian linyueqian closed this May 25, 2026