[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0) #3322

ischencheng wants to merge 1 commit into
Conversation
BLOCKING:
Note: The PR description and test evidence look comprehensive, but I can't proceed with review until the gates pass.
H20-3e bench: main vs +3321 vs +3322 vs +3321+3322 (Qwen3-TTS-12Hz-1.7B)

Ran the full 4-way comparison on a single H20-3e to disambiguate which piece of the stack delivers which gain. Same model in both modes, same prompt set (20 short English sentences), warmup 16 reqs, 8/16/24/32/48 measured reqs at c=1/4/8/16/32.

Base (voice cloning, RFC #3163 reproducer regime), TTFA p95 (ms) / req/s / audio-s per wall-s:

CustomVoice (preset voice, no voice cloning), TTFA p95 (ms) / req/s / audio-s per wall-s:

Reading:

Suggested resolution: keep the batched path but workload-gate it. Two clean options (one possible shape of such a gate is sketched below). Either option avoids the CustomVoice regression while keeping the Base + voice-clone win.

Repro: scripts and full CSV at
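The two options themselves didn't survive this scrape; as a stand-in, here is one plausible shape of such a workload gate, with `voice_mode`, the mode labels, and `max_codec_batch` as illustrative names rather than the PR's actual API:

```python
# Illustrative workload gate (names are hypothetical): only let the
# Code2Wav stage batch across requests for Base / voice-clone deployments,
# whose stage-1 lifetimes are seconds long. CustomVoice / VoiceDesign
# requests live ~50-200 ms in stage 1 and would only pay step overhead.
FORCE_BS1_MODES = {"custom_voice", "voice_design"}  # assumed mode labels

def max_codec_batch(voice_mode: str, ready: int) -> int:
    if voice_mode in FORCE_BS1_MODES:
        return 1           # keep main's one-chunk-per-step behaviour
    return min(ready, 10)  # Base voice clone: batch up to max_num_seqs=10
```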
Profiling result: which part of the PR is doing the work

Tested TTFA (ms, mean / p95):

Decomposition: all the voice-clone win comes from the `max_num_seqs: 1→10` yaml change, not from the batched forward. A separate decoder microbench (1× H100, no engine) corroborates this:

Why VC and CV behave so differently — prof side-by-side at c=4 (1× H100):

Two things stand out: VC's stage 0 produces chunks ~11× slower (audio-encoder + ref_code prefill is heavy), and VC's stage 1 is mostly idle; the CV regression sits in inter-forward overhead rather than in the decode itself.

Suggestion: drop the batched forward + CG (bs, seq) extension. Land just the `max_num_seqs: 1→10` change.

Methodology:
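The methodology block and the microbench numbers didn't survive this scrape. As a stand-in, a rough sketch of what such a bs=1-saturation microbench looks like: the conv stack is a placeholder for the real Code2Wav decoder, and the bs=1, seq=97 hot-chunk shape is taken from the revert commit below:

```python
# If a single bs=1 decode already saturates the SMs, N concurrent bs=1
# decodes on separate CUDA streams should take ~N x one decode's wall time.
import time
import torch

dec = torch.nn.Sequential(
    *[torch.nn.Conv1d(512, 512, 7, padding=3) for _ in range(8)]
).cuda().half().eval()
x = torch.randn(1, 512, 97, device="cuda", dtype=torch.half)  # bs=1, seq=97

def timed(fn, iters=20):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

with torch.inference_mode():
    timed(lambda: dec(x))  # warmup
    single = timed(lambda: dec(x))
    streams = [torch.cuda.Stream() for _ in range(8)]

    def eight_concurrent():
        for s in streams:
            with torch.cuda.stream(s):
                dec(x)

    eight = timed(eight_concurrent)

print(f"bs=1: {single * 1e3:.2f} ms; 8 concurrent bs=1: {eight * 1e3:.2f} ms "
      f"(a ratio of ~8 would mean bs=1 already saturates the SMs)")
```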
Profiling on 1× H100, 1.7B, c=4/8/16 (see PR vllm-project#3322 review comment for full matrix) shows the voice-clone TTFA win in PR vllm-project#3322 comes entirely from the yaml change `max_num_seqs: 1→10`, not from the batched chunked_decode forward or the CG (bs, seq) wrapper extension:

- Reverting just the batched forward + CG extension while keeping `max_num_seqs=10` matches PR vllm-project#3322's voice-clone gain (8–29× p95 improvement vs main at c=4/8/16) AND removes about half of PR vllm-project#3322's CustomVoice TTFA regression at c=4 (156 → 136 ms vs main's 115 ms).
- Stage 1 forward bs distribution under PR vllm-project#3322 is dominated by bs=1 (76% at c=4 CV, 96% at c=4 VC); the new (bs ∈ {2,4,8}, seq=97) graphs are rarely hit. A separate decoder microbench shows bs=1 already saturates SMs (8 concurrent bs=1 streams take exactly 8× a single bs=1 wall) and the (bs, seq) graphs slightly hurt due to padding waste on shorter chunks.
- The PR's own H100-0.6B-Base table shows c=32 main 8047 → +3322 7286 (1.1× speedup, with rps dropping); H20-3e 1.7B-Base at c=32 shows a +20% TTFA regression. So the batched path doesn't pay off at any measured concurrency.

Reverts the batched forward, the CG (bs, seq) extension, and the e2e concurrent TTFA test added by PR vllm-project#3322. Keeps only the yaml `max_num_seqs: 1→10` change; the follow-up commit moves that knob from the yaml default to a `--stage-overrides` opt-in for Base deployments so CustomVoice / VoiceDesign keep main's TTFA behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
PR vllm-project#3322's `max_num_seqs: 1→10` yaml change gives Base voice-clone a large TTFA win at concurrency (8–29× p95 on c=4/8/32, validated 1× H100, 1.7B) but regresses CustomVoice / VoiceDesign at low concurrency due to multi-stage scheduler overhead. CV / VoiceDesign requests have ~50–200 ms stage-1 lifetimes — there is no batching benefit, only step overhead.

Move the knob:

- Restore the yaml default to `max_num_seqs: 1` (CV / VoiceDesign safe; matches main behavior, no regression).
- Base deployments opt in via `--stage-overrides '{"1": {"max_num_seqs": 10}}'`. The example launcher's Base branch sets this automatically.

Validation against main on 1× H100, 1.7B (voice-clone, ref clone_2.wav):

c   | main p95 (ms) | pr-3322 p95 (ms) | win
----|---------------|------------------|-----
1   | 367           | 361              | ~
2   | 737           | 731              | ~
4   | 3,238         | 1,247            | 2.6×
8   | 101,508       | 2,434            | 42×
32  | 176,162       | 25,202           | 7×

CustomVoice (default yaml) is unchanged from main.

Also fixes a pre-existing latent bug in the launcher: a top-level `--gpu-memory-utilization 0.9` was overriding the per-stage 0.3 in the deploy yaml, OOMing stage 1 on multi-stage deployments. Latent since the April Pipeline + Deploy Config Schema refactor (vllm-project#2383); never hit because the launcher wasn't actually used end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
ischencheng force-pushed from e02f47f to aa5f738
Validation after restructure (re-benched on current main, post-#3306)

Per @linyueqian's heads-up about #3306 (comment), re-benched. Two configs against current main:

All 0 fails. Audio duration median ~3-4 s on both. The TTFA p95 win at c=4 / c=8 (2×) survives #3306; c=16 is marginal; c=32 neutral. Throughput cost is 8-12% in the win region. For comparison, the original PR's full stack (batched forward + CG extension):

This collapses #3322 to a config / docs change. Plan for the next push:

- CustomVoice / VoiceDesign continue to use the default yaml (`max_num_seqs: 1`).
PR vllm-project#3322's `max_num_seqs: 1→10` yaml change gives Base voice-clone a large TTFA win at concurrency (8–29× p95 on c=4/8/32, validated 1× H100, 1.7B) but regresses CustomVoice / VoiceDesign at low concurrency due to multi-stage scheduler overhead. CV / VoiceDesign requests have ~50–200 ms stage-1 lifetimes — there is no batching benefit, only step overhead.

Move the knob:

- Restore the yaml default to `max_num_seqs: 1` (CV / VoiceDesign safe; matches main behavior, no regression).
- Base deployments opt in via `--stage-overrides '{"1": {"max_num_seqs": 10}}'`. The example launcher's Base branch sets this automatically.
- Update the Base launch examples in the user guide, the example README, and the speech_api throughput section to show the override.

Validation against main on 1× H100, 1.7B (voice-clone, ref clone_2.wav):

c   | main p95 (ms) | pr-3322 p95 (ms) | win
----|---------------|------------------|-----
1   | 367           | 361              | ~
2   | 737           | 731              | ~
4   | 3,238         | 1,247            | 2.6×
8   | 101,508       | 2,434            | 42×
32  | 176,162       | 25,202           | 7×

CustomVoice (default yaml) is unchanged from main.

Also fixes a pre-existing latent bug in the launcher: a top-level `--gpu-memory-utilization 0.9` was overriding the per-stage 0.3 in the deploy yaml, OOMing stage 1 on multi-stage deployments. Latent since the April Pipeline + Deploy Config Schema refactor (vllm-project#2383); never hit because the launcher wasn't actually used end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
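For concreteness, the opt-in shape described above. The `--stage-overrides` flag and its JSON payload are quoted from the commit message; the launcher script path is a placeholder, not the repo's actual entrypoint:

```bash
# Base voice cloning: opt stage 1 (Code2Wav) into cross-request batching.
# Flag and JSON shape per the commit message; script path is illustrative.
python examples/qwen3_tts/launch.py \
  --model Qwen3-TTS-12Hz-1.7B-Base \
  --stage-overrides '{"1": {"max_num_seqs": 10}}'

# CustomVoice / VoiceDesign: no override. The default yaml keeps stage-1
# max_num_seqs: 1, matching main's TTFA behavior.
python examples/qwen3_tts/launch.py --model Qwen3-TTS-12Hz-1.7B-CustomVoice
```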
ischencheng force-pushed from f5e467a to 398f442
The bundled qwen3_tts.yaml now ships stage 1 (Code2Wav) at `max_num_seqs: 10` (set in vllm-project#2556), tuned for Base voice cloning's long stage-1 lifetimes (~3 s/req): admitting up to 10 concurrent codec sequences gives ~2× TTFA p95 at c=4 / c=8 on 1× H100 + 1.7B-Base + seed-tts at an 8-12% audio-throughput cost. CustomVoice / VoiceDesign have ~50-200 ms stage-1 lifetimes and remain TTFA-optimal at `max_num_seqs: 1`. Document the trade-off and the override invocation, and add a yaml comment so the choice is visible at the config site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
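A sketch of what that config-site comment could look like. Only `max_num_seqs: 10` and the rationale come from the commit message; the surrounding stage layout and field names in qwen3_tts.yaml are assumptions:

```yaml
# Hypothetical excerpt of qwen3_tts.yaml (stage layout assumed):
stages:
  - id: 1  # Code2Wav codec stage
    # Tuned for Base voice cloning (~3 s stage-1 lifetimes): admitting up
    # to 10 concurrent codec sequences gives ~2x TTFA p95 at c=4 / c=8 for
    # an 8-12% audio-throughput cost. CustomVoice / VoiceDesign requests
    # live ~50-200 ms in stage 1 and are TTFA-optimal at max_num_seqs: 1.
    max_num_seqs: 10
```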
ischencheng force-pushed from 398f442 to d926f4f
Hi @ischencheng, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐 Could you please provide an update?
Thanks for your contribution! 🙏
Purpose
Resolves #3163.
PR #1617 introduced `CUDAGraphDecoderWrapper` for the Qwen3-TTS Code2Wav stage, which captured CUDA graphs at `bs=1` only. To keep the wrapper hit rate high it also enforced `bs=1` in `Qwen3TTSCode2Wav.forward()` via a per-request for-loop, undoing the cross-request batching landed in PR #1426.

Combined with the `max_num_seqs: 1` setting in `qwen3_tts.yaml`, this means the Code2Wav stage decodes one request's chunk per scheduler step even when the scheduler has multiple ready streams in flight. On RTX 4090 + 1.7B-Base + voice cloning (the setup reported in #3163), this surfaces as 11.7× TTFB scaling between QPS=1 and QPS=4.

This PR:
- Batches all ready requests' chunks into a single `[B, Q, F_max]` `chunked_decode` call (sketched below). Includes a `B==1` fast path (no padding cost) so the existing single-request behaviour is preserved.
- Extends `CUDAGraphDecoderWrapper` to capture `(batch_size, seq_len)` pairs instead of `seq_len` only. The default capture set keeps all existing `bs=1` buckets and adds `(bs ∈ {2,4,8}, seq=streaming_hot)` so the new bs>1 hot path stays on the graph; uncaptured `(bs, seq)` falls back to eager.
- Raises stage-1 `max_num_seqs: 1 → 10` (matches Stage 0) so the scheduler can actually deliver bs>1 to the codec.
- Adds `code2wav_capture_pairs` so operators on memory-rich GPUs can opt into a wider capture set.
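To make the first bullet concrete, a minimal sketch of a padded batched dispatch, assuming the `[B, Q, F_max]` layout named above. `Chunk`, `decode_fn`, and `samples_per_frame` are illustrative stand-ins, not the PR's actual code:

```python
# Sketch: batch all ready requests' chunks into one decode call, with a
# B == 1 fast path and per-request trimming so padding never bleeds.
from dataclasses import dataclass
import torch

@dataclass
class Chunk:
    codes: torch.Tensor      # [Q, F] codec codes for one request's chunk
    left_context_size: int   # per-request decode context

def batched_chunked_decode(chunks: list[Chunk], decode_fn, samples_per_frame: int):
    if len(chunks) == 1:
        # B == 1 fast path: no padding cost, behaviour identical to main.
        c = chunks[0]
        return [decode_fn(c.codes.unsqueeze(0), [c.left_context_size])[0]]
    # Right-pad every chunk to the longest frame count in the batch.
    f_max = max(c.codes.shape[-1] for c in chunks)
    q = chunks[0].codes.shape[0]
    batch = chunks[0].codes.new_zeros((len(chunks), q, f_max))
    for i, c in enumerate(chunks):
        batch[i, :, : c.codes.shape[-1]] = c.codes
    wav = decode_fn(batch, [c.left_context_size for c in chunks])  # [B, T_max]
    # Trim per request so padded frames don't bleed into shorter outputs.
    return [wav[i, : c.codes.shape[-1] * samples_per_frame]
            for i, c in enumerate(chunks)]
```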
Test Plan

- `tests/model_executor/models/qwen3_tts/test_code2wav_batching.py` (7 cases): bs=1 baseline parity, bs>1 per-request parity, padding-no-bleed, per-request `left_context_size` honoring, malformed-mixed batches.
- `tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py` extended to exercise the new `(bs, seq)` capture API: multi-bs replay, uncaptured-bs fallback, `compute_capture_pairs` parametrize.
- `tests/e2e/online_serving/test_qwen3_tts_concurrent_ttfb.py` mirrors the bench shape of PR #3221's `benchmark_service.py` ([Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec) and asserts sub-linear TTFA scaling per the RFC target (scaling_4 ≤ 4.0×); a sketch of that assertion follows this list.
- `tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model` to verify audio output correctness (HNR + format checks).
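For readers unfamiliar with the scaling_4 metric, the assertion has roughly the following shape. `stream_request` is an assumed coroutine yielding audio chunks and transport details are omitted; the real test lives in `test_qwen3_tts_concurrent_ttfb.py`:

```python
# Sketch of the sub-linear TTFA scaling check (scaling_4 <= 4.0).
import asyncio
import time

async def ttfa_ms(stream_request) -> float:
    t0 = time.perf_counter()
    async for _first_chunk in stream_request():
        return (time.perf_counter() - t0) * 1000.0  # time to first audio
    raise RuntimeError("stream produced no audio")

async def p95_ttfa(stream_request, concurrency: int, total: int = 24) -> float:
    samples: list[float] = []
    for _ in range(total // concurrency):
        samples += await asyncio.gather(
            *[ttfa_ms(stream_request) for _ in range(concurrency)])
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

async def check_sublinear_scaling(stream_request) -> None:
    scaling_4 = (await p95_ttfa(stream_request, 4)) / (await p95_ttfa(stream_request, 1))
    assert scaling_4 <= 4.0, f"TTFA scales super-linearly: {scaling_4:.1f}x"
```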
Test Result

Unit tests: `test_code2wav_batching.py` 7/7 PASS, `test_cuda_graph_decoder.py` 38/38 PASS, `test_qwen3_tts_code2wav.py` 2/2 PASS.

E2E: `test_qwen3_tts_base.py` core_model PASSES on H100.

Concurrent bench (the issue #3163 setup is RTX 4090; we measured H100 + `Qwen3-TTS-12Hz-0.6B-Base` + voice cloning as the closest reproducible scenario on our hardware): c=4 / c=8 are large wins (TTFA -56% / -75%, rps at c=8 +43%) for the Base-mode + voice-cloning scenario, addressing the RFC #3163 reproducer.
Known scope limitation
On H100 + `Qwen3-TTS-12Hz-1.7B-CustomVoice` (CustomVoice mode, no voice cloning), `main` is already saturated and this PR is approximately break-even or slightly slower (within ~5-30% TTFA / rps depending on concurrency). The codec `bs=1` enforcement was masking voice cloning's audio-encoder latency on slower GPUs / Base mode; it is not a bottleneck on H100 + CustomVoice. We measured the same model on H100:
We have not yet pinned down where the regression comes from (per-call codec decode time itself is identical at ~12 ms via profiling; the regression is in inter-forward idle time, which we suspect comes from the wrapper's `dict` key change or some IPC interaction we haven't traced). Reviewers familiar with the chunk_transfer_adapter / async pipeline interaction may have better intuition.

If this trade-off is unacceptable, options are:

- Land only the CUDA-graph capture extension (`code2wav_capture_pairs`) without the forward-path change.
- Gate the batched path behind an escape hatch (`code2wav_force_bs1`), as sketched below.

Happy to iterate based on review.
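If reviewers prefer the escape hatch, it could be as small as the sketch below. Only the flag name `code2wav_force_bs1` appears above; the surrounding plumbing is assumed:

```python
# Hypothetical escape hatch: force the codec stage back to one chunk per
# scheduler step even when several requests have ready chunks.
def select_chunks(ready_chunks, code2wav_force_bs1: bool):
    if code2wav_force_bs1:
        return ready_chunks[:1]   # main's behaviour: bs=1 per step
    return ready_chunks           # batched path (padded downstream)
```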
CC List
@linyueqian @Sy0307