
[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0)#3322

Open
ischencheng wants to merge 1 commit into vllm-project:main from ischencheng:cheng/rfc-3163-code2wav-batching

Conversation

@ischencheng

@ischencheng ischencheng commented May 3, 2026

Purpose

Resolves #3163.

PR #1617 introduced CUDAGraphDecoderWrapper for the Qwen3-TTS Code2Wav stage, which captured CUDA graphs at bs=1 only. To keep the wrapper hit rate high, it also enforced bs=1 in Qwen3TTSCode2Wav.forward() via a per-request for-loop, undoing the cross-request batching landed in PR #1426.

Combined with the max_num_seqs: 1 setting in qwen3_tts.yaml, this means the Code2Wav stage decodes one request's chunk per scheduler step even when the scheduler has multiple ready streams in flight. On RTX 4090 + 1.7B-Base + voice cloning (the setup reported in #3163), this surfaces as 11.7× TTFB scaling between QPS=1 and QPS=4.

This PR:

  1. Restores the batched forward path: replaces the per-request for-loop with a single padded [B, Q, F_max] chunked_decode call. Includes a B==1 fast path (no padding cost) so the existing single-request behaviour is preserved. (Items 1–2 are sketched in code after this list.)
  2. Extends CUDAGraphDecoderWrapper to capture (batch_size, seq_len) pairs instead of seq_len only. Default capture set keeps all existing bs=1 buckets and adds (bs ∈ {2,4,8}, seq=streaming_hot) so the new bs>1 hot path stays on the graph; uncaptured (bs, seq) falls back to eager.
  3. Raises Stage 1 max_num_seqs: 1 → 10 (matches Stage 0) so the scheduler can actually deliver bs>1 to the codec.
  4. Adds a YAML override code2wav_capture_pairs for operators on memory-rich GPUs to opt into a wider capture set.
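
For reviewers, a minimal sketch of the shape of items 1–2, assuming the decoder is a plain torch callable; `batched_chunked_decode`, the `upsample` trim factor, and the capture bookkeeping are illustrative names, not the exact code in this diff:

```python
import torch
import torch.nn.functional as F

class CUDAGraphDecoderWrapper:
    """Sketch: CUDA graphs keyed on (batch_size, seq_len), not seq_len alone."""

    def __init__(self, decoder, capture_pairs):
        self.decoder = decoder
        self.capture_pairs = set(capture_pairs)  # e.g. {(1, 325), (2, 325), (4, 325), (8, 325)}
        self.graphs = {}                         # (bs, seq) -> (graph, static_in, static_out)

    def __call__(self, codes):
        key = (codes.shape[0], codes.shape[-1])
        if key not in self.capture_pairs:
            return self.decoder(codes)           # uncaptured (bs, seq): eager fallback
        if key not in self.graphs:
            static_in = codes.clone()
            self.decoder(static_in)              # warmup before capture
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = self.decoder(static_in)
            self.graphs[key] = (graph, static_in, static_out)
        graph, static_in, static_out = self.graphs[key]
        static_in.copy_(codes)                   # refill static input, then replay
        graph.replay()
        return static_out.clone()

def batched_chunked_decode(wrapper, per_request_codes, upsample=480):
    """Sketch of the padded [B, Q, F_max] path with a B==1 fast path."""
    if len(per_request_codes) == 1:              # B==1: no padding cost
        return [wrapper(per_request_codes[0].unsqueeze(0)).squeeze(0)]
    f_max = max(c.shape[-1] for c in per_request_codes)
    batch = torch.stack(                         # [B, Q, F_max], right-padded
        [F.pad(c, (0, f_max - c.shape[-1])) for c in per_request_codes])
    wav = wrapper(batch)
    # trim each row back to its own unpadded length so padding never bleeds
    return [wav[i, ..., : c.shape[-1] * upsample]
            for i, c in enumerate(per_request_codes)]
```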

Test Plan

  • New tests/model_executor/models/qwen3_tts/test_code2wav_batching.py (7 cases): bs=1 baseline parity, bs>1 per-request parity, padding-no-bleed (sketched after this list), per-request left_context_size handling, and malformed/mixed batches.
  • Updated tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py to exercise the new (bs, seq) capture API: multi-bs replay, uncaptured-bs fallback, compute_capture_pairs parametrize.
  • New tests/e2e/online_serving/test_qwen3_tts_concurrent_ttfb.py mirrors the bench shape of PR #3221's benchmark_service.py and asserts sub-linear TTFA scaling per the RFC target (scaling_4 ≤ 4.0×).
  • Existing tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model to verify audio output correctness (HNR + format checks).
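
As a flavor of what the padding-no-bleed case asserts (hedged: `decoder` stands in for an eager Code2Wav fixture, and `batched_chunked_decode` is the sketch above, not the test file's real imports):

```python
import torch

def test_padding_no_bleed(decoder):
    # the shorter request must decode identically whether it runs alone
    # (bs=1 fast path) or right-padded up to F_max inside a bs=2 batch
    short = torch.randint(0, 1024, (16, 225), device="cuda")  # [Q, F]
    long = torch.randint(0, 1024, (16, 300), device="cuda")
    solo = batched_chunked_decode(decoder, [short])[0]
    batched = batched_chunked_decode(decoder, [short, long])[0]
    torch.testing.assert_close(solo, batched)
```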

Test Result

Unit tests: test_code2wav_batching.py 7/7 PASS, test_cuda_graph_decoder.py 38/38 PASS, test_qwen3_tts_code2wav.py 2/2 PASS.

E2E test_qwen3_tts_base.py core_model PASSES on H100.

Concurrent bench (the issue #3163 setup is RTX 4090; we measured H100 + Qwen3-TTS-12Hz-0.6B-Base + voice cloning as the closest reproducible scenario on our hardware):

| Concurrency | main TTFA mean (ms) / rps | branch TTFA mean (ms) / rps | speedup |
|---|---|---|---|
| 1 | 255 / 0.94 | 259 / 0.93 | ≈1× |
| 4 | 1731 / 1.26 | 761 / 1.26 | 2.3× |
| 8 | 4949 / 1.22 | 1220 / 1.75 | 4.1× |
| 32 | 8047 / 3.51 | 7286 / 3.24 | 1.1× |

c=4 / c=8 are large wins (TTFA -56% / -75%, rps at c=8 +43%) for the Base-mode + voice-cloning scenario, addressing the RFC #3163 reproducer.

Known scope limitation

On H100 + Qwen3-TTS-12Hz-1.7B-CustomVoice (CustomVoice voice mode, no voice cloning), main is already saturated and this PR is approximately break-even or slightly slower (within ~5-30% TTFA / rps depending on concurrency). The codec-bs=1 enforcement was masking voice-cloning's audio-encoder latency on slower GPUs / Base mode; it is not a bottleneck on H100 + CustomVoice.

We measured the same model on H100:

| Concurrency | main TTFA (ms) / rps | this PR TTFA (ms) / rps |
|---|---|---|
| 1 | 68 / 1.49 | 57 / 1.50 |
| 4 | 135 / 4.06 | 175 / 2.65 |
| 8 | 250 / 6.71 | 346 / 4.19 |
| 32 | 2834 / 7.45 | 4047 / 5.22 |

We have not yet pinned down the cause: per-call codec decode time itself is identical at ~12 ms via profiling; the regression is in inter-forward idle time, which we suspect comes from the wrapper's dict-key change or some IPC interaction we haven't traced. Reviewers familiar with the chunk_transfer_adapter / async pipeline interaction may have better intuition.

If this trade-off is unacceptable, options are:

  1. Land only the YAML knob (code2wav_capture_pairs) without the forward-path change.
  2. Gate the batched path behind a config flag (code2wav_force_bs1).
  3. Hold this PR until we trace the CustomVoice slowdown.

Happy to iterate based on review.

CC List

@linyueqian @Sy0307

@ischencheng ischencheng requested a review from hsliuustc0106 as a code owner May 3, 2026 15:48
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hsliuustc0106
Collaborator

BLOCKING:

  • CI Gates — DCO and pre-commit checks are failing. Please fix these before proceeding.

Note: The PR description and test evidence look comprehensive, but I can't proceed with review until the gates pass.

@linyueqian
Collaborator

H20-3e bench: main vs +3321 vs +3322 vs +3321+3322 (Qwen3-TTS-12Hz-1.7B)

Ran the full 4-way comparison on a single H20-3e to disambiguate which piece of the stack delivers which gain. Same model in both modes, same prompt set (20 short English sentences), warmup 16 reqs, 8/16/24/32/48 measured reqs at c=1/4/8/16/32.

Base (voice-cloning, RFC #3163 reproducer regime), TTFA p95 (ms) / req-s / audio-s per wall-s:

| Conc | main | +3321 | +3322 | +both |
|---|---|---|---|---|
| 4 | 3293 / 0.75 / 2.39 | 3208 / 0.75 / 2.45 | 1063 / 0.73 / 2.34 | 1169 / 0.79 / 2.55 |
| 8 | 8313 / 0.84 / 2.55 | 7911 / 0.87 / 2.79 | 1927 / 0.87 / 2.74 | 2457 / 0.87 / 2.65 |
| 16 | 12638 / 1.15 / 3.49 | 12687 / 1.13 / 3.46 | 6719 / 1.08 / 3.36 | 4616 / 1.39 / 4.28 |
| 32 | 15568 / 1.93 / 6.02 | 17005 / 1.62 / 5.10 | 18620 / 1.54 / 4.78 | 17892 / 1.49 / 4.77 |

CustomVoice (preset voice, no voice-cloning), TTFA p95 (ms) / req-s / audio-s per wall-s:

| Conc | main | +3321 | +3322 | +both |
|---|---|---|---|---|
| 4 | 188 / 3.82 / 15.6 | 168 / 4.00 / 15.5 | 275 / 2.46 / 9.7 | 289 / 2.61 / 10.0 |
| 8 | 815 / 5.56 / 20.6 | 781 / 5.11 / 19.9 | 474 / 3.99 / 15.1 | 1103 / 3.28 / 12.1 |
| 16 | 2004 / 5.63 / 22.7 | 1728 / 6.46 / 23.6 | 2341 / 4.45 / 17.6 | 2202 / 3.83 / 15.7 |
| 32 | 3611 / 6.03 / 23.4 | 3814 / 5.65 / 23.4 | 5637 / 4.28 / 16.6 | 6042 / 4.60 / 18.7 |

Reading:

  • This PR's batched Code2Wav path is a 3 to 4× TTFA p95 win on Base + voice-clone at c=4 and c=8 (3293 → 1063, 8313 → 1927). Audio throughput is roughly flat at low/mid conc on Base; the win is fairness and p95, not raw capacity. Stacking #3321 on top lifts c=16 to 4616 ms p95 / 1.39 rps / 4.28 audio-s per wall-s, the only configuration that meaningfully adds throughput on Base.
  • On CustomVoice, the same path costs roughly 30 to 40 percent rps and audio throughput at every conc, plus around 50 percent TTFA p95 at c=32 (3611 to 6042). Same shape you reported on H100; confirmed not H100-specific. The codec batching path adds dispatch overhead when the codec is not the bottleneck.
  • At c=32 on Base, all variants are within 20 percent of main on TTFA p95 and behind main on audio throughput. The system is past steady-state capacity on a single H20-3e regardless of optimization.

Suggested resolution: keep the batched path but workload-gate it. Two clean options:

  1. Auto-gate on task_type (Base / voice-clone uses batched, CustomVoice and VoiceDesign keep bs=1). The model already knows the task type per request.
  2. Ship a code2wav_force_bs1 config flag (your option 2 from the PR description) and flip it only in the Base / voice-clone deploy yaml.

Either avoids the CustomVoice regression while keeping the Base + voice-clone win; a rough sketch of option 1 follows.
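
A minimal sketch of what option 1 could look like (the `task_type` field name and its string values are assumptions for illustration, not the model's actual API):

```python
# Long stage-1 lifetimes (~3 s/req): cross-request batching pays off.
BATCHED_TASKS = {"base", "voice_clone"}

def code2wav_admission_limit(requests, batched_max=10):
    # CustomVoice / VoiceDesign keep main's bs=1 behavior; Base and
    # voice-clone requests may batch cross-request in the codec stage.
    if requests and all(r.task_type in BATCHED_TASKS for r in requests):
        return batched_max
    return 1
```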

Repro: scripts and full CSV at bench-3321-3322/, 32 rows. Hardware: H20-3e (141GB), single-GPU per server, default deploy yaml from each PR's branch.

cc @ischencheng @vklimkov-nvidia


@ischencheng
Author

ischencheng commented May 4, 2026

Profiling result: which part of the PR is doing the work

Tested main vs +3322 vs a fix variant (just max_num_seqs: 1→10 from the yaml, no batched forward) on H100, 1.7B, c=4/8/16.

TTFA (ms, mean / p95)

|       | CV c=4 | CV c=8 | CV c=16 | VC c=4 | VC c=8 | VC c=16 |
|---|---|---|---|---|---|---|
| main  | 138 / 188 | 312 / 740 | 1043 / 1648 | 1672 / 2540 | 12862 / 61551 | 55256 / 118213 |
| +3322 | 171 / 207 | 365 / 464 | 1588 / 2565 | 869 / 1129 | 1456 / 2311 | 6401 / 19117 |
| fix   | 150 / 183 | 282 / 366 | 1538 / 2424 | 813 / 1110 | 1263 / 2148 | 6966 / 14989 |

Decomposition

All the voice-clone win comes from the max_num_seqs: 1→10 change; fix matches +3322 across c (and beats main by 8–29× on VC p95). The batched chunked_decode([B, Q, F_max]) and CG (bs, seq) extension contribute essentially nothing — at c=4/8 only 4–24% of forwards even hit bs>1; at c=32 the PR's own table and @linyueqian's table both show no improvement (1.1× / -20%).

A separate decoder microbench (1× H100, no engine) corroborates this (a minimal sketch of the stream experiment follows the list):

  • bs=1 already saturates SMs: 8 concurrent bs=1 graphs on 8 CUDA streams take exactly 8× a single bs=1 wall (no parallelism gained), because the upsample chain (×8×5×4×3) blows seq out by 480× and fills 132 SMs many times over per layer.
  • The (bs, seq) CG extension is actually slightly harmful: chunked_decode(F=800) end-to-end takes 110/203/395/775 ms for bs=1/2/4/8 with bs=1 graph + eager fallback, vs 118/228/448/884 ms with the new (bs∈{1,2,4,8}, seq=325) graphs. Reason: the internal loop pads seq=300 and seq=225 chunks up to 325, wasting ~7–30% per call — outpacing the ~14% kernel saving from batching.
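
Roughly what the 8-stream experiment in the first bullet looks like, as a hedged sketch (assumes the decoder is graph-capturable; helper name and timing harness are illustrative):

```python
import time
import torch

def concurrent_bs1_wall_ms(decoder, example_input, n=8):
    """Capture n independent bs=1 graphs and replay them concurrently on n
    CUDA streams. If one bs=1 replay already fills the SMs, the concurrent
    wall time scales ~n x a single replay instead of overlapping."""
    graphs = []
    streams = [torch.cuda.Stream() for _ in range(n)]
    for _ in range(n):
        inp = example_input.clone()
        decoder(inp)                    # warmup (real code does this on a side stream)
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            decoder(inp)
        graphs.append(g)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for g, s in zip(graphs, streams):
        with torch.cuda.stream(s):      # each replay launches on its own stream
            g.replay()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3
```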

Why VC and CV behave so differently — prof side-by-side at c=4 (1× H100):

| metric | main+CV | main+VC | +3322+CV | +3322+VC |
|---|---|---|---|---|
| stage 0 chunk-emit interval | 22.9 ms | 263.2 ms | 35.2 ms | 294.3 ms |
| stage 1 GPU busy ratio | 79% | 9% | 80% | 53% |
| stage 1 fwd bs distribution | {1: 68} | {1: 213} | {1: 42, 2: 12, 3: 1} | {1: 342, 3: 13} |

Two things stand out: VC's stage 0 produces chunks ~11× slower (audio-encoder + ref_code prefill is heavy), and VC's stage 1 is mostly idle in main (9% busy). 4 concurrent VC requests at main come back with per-request TTFAs of 677 / 1252 / 2760 / 3333 ms — a staircase, each waiting ~one audio worth longer than the previous. That's request-level serialization at stage 1 admission: each request's sequence holds the slot for its entire ~3s audio decode lifetime, blocking the next 3 from even starting. max_num_seqs=1 is the lock, 1→10 releases it. CV doesn't suffer because its stage 1 sequence lifetime is ~14 chunks × 22.9 ms ≈ 320 ms, so even strict serialization only adds ~1s to a c=4 workload.

CV regression in +3322 has two contributing effects, both visible in stage 0's chunk-emit rate (main 22.9 ms → +3322 35.2 ms → fix 35.3 ms):

  • A small kernel-level cost from bs>1 forwards that surfaces as GPU contention. The padded chunked_decode([B, Q, F_max]) raises kernel-only decode time from 0.31 ms (main, bs=1) to 1.70 ms (PR average, mixed bs). Across the bench that's ~73 ms of extra GPU time for stage 1 (~3% of wall) — small in absolute terms, but stage 0 (the talker) on the same device 0 is sensitive to even small contention spikes during its tight AR loop. fix (bs=1 loop) recovers decode kernel time to 0.38 ms and pulls c=4 CV TTFA back from 156 → 136 ms — about half the regression gone.
  • Residual scheduler-layer overhead from max_num_seqs=10 itself, the bigger half. fix's stage 0 chunk-emit interval is 35.3 ms — identical to +3322's 35.2 ms — even though fix's stage 1 forwards are bs=1 like main. Per-forward Python+dispatch time is also similar (fix 32.3 ms vs +3322 34.9 ms vs main 18.6 ms). So something at the multi-stage scheduler / shm connector path scales with max_num_seqs and inflates per-step overhead independent of model-layer work. I didn't trace it deeper; the residual ~10–20% CV TTFA cost vs main lives here.

Suggestion

Drop the batched forward + CG (bs, seq) extension. Land just max_num_seqs: 1→10 + @linyueqian's task-type auto-gate (CustomVoice keeps =1, Base/voice-clone gets =10). The VC vs CV profile asymmetry above (long vs short stage-1 sequence lifetime) is the direct justification for gating on task_type.

Methodology: perf_counter() JSON-line logger at Qwen3TTSCode2Wav.forward enter/exit, Qwen3TTSTalkerForConditionalGeneration.forward enter/exit, talker2code2wav_async_chunk emit. Bench: asyncio httpx, single warmup, measured run reported. Patches available on request.
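
For reproducibility, the logger is essentially this (a hedged sketch; the tag names match the patch points above, the decorator itself is illustrative):

```python
import functools
import json
import time

def jsonl_trace(tag, path="/tmp/qwen3_tts_trace.jsonl"):
    """perf_counter() enter/exit logger, one JSON line per call."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # append-per-call is slow but fine for coarse stage timing
                with open(path, "a") as f:
                    f.write(json.dumps({"tag": tag, "enter": t0,
                                        "exit": time.perf_counter()}) + "\n")
        return wrapped
    return deco

# e.g. Qwen3TTSCode2Wav.forward = jsonl_trace("code2wav.fwd")(Qwen3TTSCode2Wav.forward)
```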

ischencheng added a commit to ischencheng/vllm-omni that referenced this pull request May 5, 2026
Profiling on 1× H100, 1.7B, c=4/8/16 (see PR vllm-project#3322 review comment for full
matrix) shows the voice-clone TTFA win in PR vllm-project#3322 comes entirely from the
yaml change `max_num_seqs: 1→10`, not from the batched chunked_decode forward
or the CG (bs, seq) wrapper extension:

- Reverting just the batched forward + CG extension while keeping
  `max_num_seqs=10` matches PR vllm-project#3322's voice-clone gain (8–29× p95
  improvement vs main at c=4/8/16) AND removes about half of PR vllm-project#3322's
  CustomVoice TTFA regression at c=4 (156 → 136 ms vs main's 115 ms).
- Stage 1 forward bs distribution under PR vllm-project#3322 is dominated by bs=1
  (76% at c=4 CV, 96% at c=4 VC); the new (bs ∈ {2,4,8}, seq=97) graphs
  are rarely hit. A separate decoder microbench shows bs=1 already
  saturates SMs (8 concurrent bs=1 streams take exactly 8× a single
  bs=1 wall) and the (bs, seq) graphs slightly hurt due to padding
  waste on shorter chunks.
- The PR's own H100-0.6B-Base table shows c=32 main 8047 → +3322 7286
  (1.1× speedup, with rps dropping); H20-3e 1.7B-Base at c=32 shows
  +20% TTFA regression. So the batched path doesn't pay off at any
  measured concurrency.

Reverts the batched forward, the CG (bs, seq) extension, and the e2e
concurrent TTFA test added by PR vllm-project#3322. Keeps only the yaml
`max_num_seqs: 1→10` change; the follow-up commit moves that knob from
the yaml default to a `--stage-overrides` opt-in for Base deployments
so CustomVoice / VoiceDesign keep main's TTFA behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch from e02f47f to aa5f738 Compare May 5, 2026 03:47
@ischencheng
Author

ischencheng commented May 5, 2026

Validation after restructure (re-benched on current main, post-#3306)

Per @linyueqian's heads-up about #3306 (comment), re-benched on e49fbd8a (current main, includes the OmniARAsyncScheduler plumbing). Setup unchanged: vllm bench serve --omni, seed-tts EN dataset (1088 rows, diverse ref_audio per request), Qwen3-TTS-12Hz-1.7B-Base, 1× H100, warmup 16 reqs, --metric-percentiles 95.

Two configs against current main:

  • A. baseline (vllm_omni/deploy/qwen3_tts.yaml as-is on main, stage 1 max_num_seqs: 1)
  • B. stage 1 max_num_seqs: 10 (yaml-edit; see note below — --stage-overrides does not work on current main)

| c | num | TTFA p95 (ms) A → B | RPS A → B | audio throughput A → B |
|---|---|---|---|---|
| 1 | 8 | 312 → 317 (~) | 1.05 → 1.06 (~) | 3.64 → 3.66 (~) |
| 4 | 32 | 1707 → 764 (2.2×) | 1.94 → 1.78 (-8%) | 7.68 → 7.05 (-8%) |
| 8 | 80 | 3080 → 1503 (2.0×) | 2.77 → 2.44 (-12%) | 11.53 → 10.08 (-13%) |
| 16 | 64 | 5033 → 4201 (1.2×) | 3.35 → 3.10 (-7%) | 13.60 → 12.94 (-5%) |
| 32 | 96 | 8841 → 8401 (~) | 3.19 → 3.24 (~) | 13.14 → 13.34 (~) |

All runs completed with 0 failures. Audio duration median ~3-4 s on both.

The TTFA p95 win at c=4 / c=8 (2×) survives #3306; c=16 marginal; c=32 neutral. Throughput cost is 8-12% in the win region.

For comparison, the original PR's full stack (batched forward + CG (bs, seq) extension + max_num_seqs=10) measured at 89865e27 on the same setup gave TTFA p95 734/1634/4526/10232 ms and RPS 1.54/2.14/2.65/2.71 at c=4/8/16/32 — strictly worse than just max_num_seqs: 10 on current main at every concurrency. So on H100 + 1.7B, the batched forward and the CG (bs, seq) extension are no longer carrying weight against the post-#3306 baseline; the max_num_seqs: 10 knob alone captures the available win.

This collapses #3322 to a config / docs change. Plan for the next push:

  1. Rebase the current 89865e27 + 833bd474 + f5e467a2 stack onto current main as a single commit: the yaml comment, the launcher's Base branch setting --stage-overrides '{"1": {"max_num_seqs": 10}}', and doc updates in docs/serving/speech_api.md / the Qwen3-TTS user guide. Drop the "Restore Code2Wav cross-request batching" and "Drop batched Code2Wav forward" commits — both became no-ops vs. current main.
  2. Side issue / blocker for the launcher: on current main, vllm-omni serve ... --stage-overrides '{"1": {"max_num_seqs": 10}}' hangs at Initializing stage 1 (stage-1 init never proceeds past config resolution; the orchestrator times out at 30 min). The same yaml setting baked in directly works fine — the failure is specific to the --stage-overrides interaction with #3306. Will dig and either fix here or split into a separate bugfix PR before this lands. (A sketch of the override semantics assumed here follows this comment.)

CustomVoice / VoiceDesign continue to use the default yaml (max_num_seqs: 1) and are unchanged from main.
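
For clarity on what the opt-in amounts to, a hedged sketch of the assumed --stage-overrides semantics (vllm-omni's actual resolver is more involved; this shallow per-stage merge is illustrative only):

```python
import json

def apply_stage_overrides(stage_configs, overrides_json):
    """Patch per-stage config dicts with a JSON override string."""
    for stage_id, patch in json.loads(overrides_json).items():
        stage_configs.setdefault(int(stage_id), {}).update(patch)
    return stage_configs

# apply_stage_overrides(cfgs, '{"1": {"max_num_seqs": 10}}') leaves stage 0
# untouched and raises stage 1's max_num_seqs to 10
```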

ischencheng added a commit to ischencheng/vllm-omni that referenced this pull request May 5, 2026
PR vllm-project#3322's `max_num_seqs: 1→10` yaml change gives Base voice-clone a
large TTFA win at concurrency (8–29× p95 on c=4/8/32, validated 1× H100,
1.7B) but regresses CustomVoice / VoiceDesign at low concurrency due to
multi-stage scheduler overhead. CV/VoiceDesign requests have ~50–200 ms
stage-1 lifetimes — there is no batching benefit, only step overhead.

Move the knob:
- Restore the yaml default to `max_num_seqs: 1` (CV / VoiceDesign safe;
  matches main behavior, no regression).
- Base deployments opt in via `--stage-overrides '{"1": {"max_num_seqs":
  10}}'`. The example launcher's Base branch sets this automatically.
- Update the Base launch examples in the user guide, the example
  README, and the speech_api throughput section to show the override.

Validation against main on 1× H100, 1.7B (voice-clone, ref clone_2.wav):

  c   |  main p95 (ms) | pr-3322 p95 (ms) | win
  ----|----------------|------------------|-----
   1  |          367   |          361     |  ~
   2  |          737   |          731     |  ~
   4  |        3,238   |        1,247     | 2.6×
   8  |      101,508   |        2,434     |  42×
  32  |      176,162   |       25,202     |   7×

CustomVoice (default yaml) is unchanged from main.

Also drops a pre-existing latent bug in the launcher: a top-level
`--gpu-memory-utilization 0.9` was overriding the per-stage 0.3 in the
deploy yaml, OOMing stage 1 on multi-stage deployments. Latent since
the April Pipeline + Deploy Config Schema refactor (vllm-project#2383); never hit
because the launcher wasn't actually used end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch 2 times, most recently from f5e467a to 398f442 Compare May 6, 2026 04:16
The bundled qwen3_tts.yaml now ships stage 1 (Code2Wav) at
max_num_seqs: 10 (set in vllm-project#2556), tuned for Base voice cloning's long
stage-1 lifetimes (~3 s/req): admitting up to 10 concurrent codec
sequences gives ~2x TTFA p95 at c=4 / c=8 on 1x H100 + 1.7B-Base +
seed-tts at an 8-12% audio-throughput cost.

CustomVoice / VoiceDesign have ~50-200 ms stage-1 lifetimes and remain
TTFA-optimal at max_num_seqs: 1. Document the trade-off and the
override invocation, and add a yaml comment so the choice is visible
at the config site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch from 398f442 to d926f4f Compare May 8, 2026 22:30
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI
@hsliuustc0106
Collaborator

Hi @ischencheng, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐

Could you please provide an update?

  • If you're still working on it, that's great — just let us know.
  • If you're blocked on something, feel free to ask for help.
  • If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏


Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[RFC]: Cross-request batching for Qwen3-TTS Code2Wav stage to fix TTFB scaling under concurrency

3 participants