
[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0)#3322

Open
ischencheng wants to merge 1 commit into vllm-project:main from ischencheng:cheng/rfc-3163-code2wav-batching

Conversation

@ischencheng

@ischencheng ischencheng commented May 3, 2026

Purpose

Resolves #3163.

PR #1617 introduced CUDAGraphDecoderWrapper for the Qwen3-TTS Code2Wav stage, which captured CUDA graphs at bs=1 only. To keep the wrapper hit rate high, it also enforced bs=1 in Qwen3TTSCode2Wav.forward() via a per-request for-loop, undoing the cross-request batching landed in PR #1426.

Combined with the max_num_seqs: 1 setting in qwen3_tts.yaml, this means the Code2Wav stage decodes one request's chunk per scheduler step even when the scheduler has multiple ready streams in flight. On RTX 4090 + 1.7B-Base + voice cloning (the setup reported in #3163), this surfaces as 11.7× TTFB scaling between QPS=1 and QPS=4.

This PR:

  1. Restores the batched forward path: replaces the per-request for-loop with a single padded [B, Q, F_max] chunked_decode call. Includes a B==1 fast path (no padding cost) so the existing single-request behaviour is preserved. (Items 1–2 are sketched in code after this list.)
  2. Extends CUDAGraphDecoderWrapper to capture (batch_size, seq_len) pairs instead of seq_len only. Default capture set keeps all existing bs=1 buckets and adds (bs ∈ {2,4,8}, seq=streaming_hot) so the new bs>1 hot path stays on the graph; uncaptured (bs, seq) falls back to eager.
  3. Raises Stage 1 max_num_seqs: 1 → 10 (matches Stage 0) so the scheduler can actually deliver bs>1 to the codec.
  4. Adds a YAML override code2wav_capture_pairs for operators on memory-rich GPUs to opt into a wider capture set.
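
For reviewers, a minimal sketch of the shape of items 1–2, assuming the decoder is a plain torch callable; `batched_chunked_decode`, the `upsample` trim factor, and the capture bookkeeping are illustrative names, not the exact code in this diff:

```python
import torch
import torch.nn.functional as F

class CUDAGraphDecoderWrapper:
    """Sketch: CUDA graphs keyed on (batch_size, seq_len), not seq_len alone."""

    def __init__(self, decoder, capture_pairs):
        self.decoder = decoder
        self.capture_pairs = set(capture_pairs)  # e.g. {(1, 325), (2, 325), (4, 325), (8, 325)}
        self.graphs = {}                         # (bs, seq) -> (graph, static_in, static_out)

    def __call__(self, codes):
        key = (codes.shape[0], codes.shape[-1])
        if key not in self.capture_pairs:
            return self.decoder(codes)           # uncaptured (bs, seq): eager fallback
        if key not in self.graphs:
            static_in = codes.clone()
            self.decoder(static_in)              # warmup before capture
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = self.decoder(static_in)
            self.graphs[key] = (graph, static_in, static_out)
        graph, static_in, static_out = self.graphs[key]
        static_in.copy_(codes)                   # refill static input, then replay
        graph.replay()
        return static_out.clone()

def batched_chunked_decode(wrapper, per_request_codes, upsample=480):
    """Sketch of the padded [B, Q, F_max] path with a B==1 fast path."""
    if len(per_request_codes) == 1:              # B==1: no padding cost
        return [wrapper(per_request_codes[0].unsqueeze(0)).squeeze(0)]
    f_max = max(c.shape[-1] for c in per_request_codes)
    batch = torch.stack(                         # [B, Q, F_max], right-padded
        [F.pad(c, (0, f_max - c.shape[-1])) for c in per_request_codes])
    wav = wrapper(batch)
    # trim each row back to its own unpadded length so padding never bleeds
    return [wav[i, ..., : c.shape[-1] * upsample]
            for i, c in enumerate(per_request_codes)]
```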

Test Plan

  • New tests/model_executor/models/qwen3_tts/test_code2wav_batching.py (7 cases): bs=1 baseline parity, bs>1 per-request parity, padding-no-bleed (sketched after this list), per-request left_context_size handling, and malformed/mixed batches.
  • Updated tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py to exercise the new (bs, seq) capture API: multi-bs replay, uncaptured-bs fallback, compute_capture_pairs parametrize.
  • New tests/e2e/online_serving/test_qwen3_tts_concurrent_ttfb.py mirrors the bench shape of PR #3221's benchmark_service.py and asserts sub-linear TTFA scaling per the RFC target (scaling_4 ≤ 4.0×).
  • Existing tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model to verify audio output correctness (HNR + format checks).
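
As a flavor of what the padding-no-bleed case asserts (hedged: `decoder` stands in for an eager Code2Wav fixture, and `batched_chunked_decode` is the sketch above, not the test file's real imports):

```python
import torch

def test_padding_no_bleed(decoder):
    # the shorter request must decode identically whether it runs alone
    # (bs=1 fast path) or right-padded up to F_max inside a bs=2 batch
    short = torch.randint(0, 1024, (16, 225), device="cuda")  # [Q, F]
    long = torch.randint(0, 1024, (16, 300), device="cuda")
    solo = batched_chunked_decode(decoder, [short])[0]
    batched = batched_chunked_decode(decoder, [short, long])[0]
    torch.testing.assert_close(solo, batched)
```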

Test Result

Unit tests: test_code2wav_batching.py 7/7 PASS, test_cuda_graph_decoder.py 38/38 PASS, test_qwen3_tts_code2wav.py 2/2 PASS.

E2E test_qwen3_tts_base.py core_model PASSES on H100.

Concurrent bench (the issue #3163 setup is RTX 4090; we measured H100 + Qwen3-TTS-12Hz-0.6B-Base + voice cloning as the closest reproducible scenario on our hardware):

| Concurrency | main TTFA mean (ms) / rps | branch TTFA mean (ms) / rps | speedup |
|---|---|---|---|
| 1 | 255 / 0.94 | 259 / 0.93 | ≈1× |
| 4 | 1731 / 1.26 | 761 / 1.26 | 2.3× |
| 8 | 4949 / 1.22 | 1220 / 1.75 | 4.1× |
| 32 | 8047 / 3.51 | 7286 / 3.24 | 1.1× |

c=4 / c=8 are large wins (TTFA -56% / -75%, rps at c=8 +43%) for the Base-mode + voice-cloning scenario, addressing the RFC #3163 reproducer.

Known scope limitation

On H100 + Qwen3-TTS-12Hz-1.7B-CustomVoice (CustomVoice voice mode, no voice cloning), main is already saturated and this PR is approximately break-even or slightly slower (within ~5-30% TTFA / rps depending on concurrency). The codec-bs=1 enforcement was masking voice-cloning's audio-encoder latency on slower GPUs / Base mode; it is not a bottleneck on H100 + CustomVoice.

We measured the same model on H100:

| Concurrency | main TTFA (ms) / rps | this PR TTFA (ms) / rps |
|---|---|---|
| 1 | 68 / 1.49 | 57 / 1.50 |
| 4 | 135 / 4.06 | 175 / 2.65 |
| 8 | 250 / 6.71 | 346 / 4.19 |
| 32 | 2834 / 7.45 | 4047 / 5.22 |

We have not yet pinned down the cause: per-call codec decode time itself is identical at ~12 ms via profiling; the regression is in inter-forward idle time, which we suspect comes from the wrapper's dict-key change or some IPC interaction we haven't traced. Reviewers familiar with the chunk_transfer_adapter / async pipeline interaction may have better intuition.

If this trade-off is unacceptable, options are:

  1. Land only the YAML knob (code2wav_capture_pairs) without the forward-path change.
  2. Gate the batched path behind a config flag (code2wav_force_bs1).
  3. Hold this PR until we trace the CustomVoice slowdown.

Happy to iterate based on review.

CC List

@linyueqian @Sy0307

@ischencheng ischencheng requested a review from hsliuustc0106 as a code owner May 3, 2026 15:48
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hsliuustc0106
Collaborator

BLOCKING:

  • CI Gates — DCO and pre-commit checks are failing. Please fix these before proceeding.

Note: The PR description and test evidence look comprehensive, but I can't proceed with review until the gates pass.

@linyueqian
Collaborator

H20-3e bench: main vs +3321 vs +3322 vs +3321+3322 (Qwen3-TTS-12Hz-1.7B)

Ran the full 4-way comparison on a single H20-3e to disambiguate which piece of the stack delivers which gain. Same model in both modes, same prompt set (20 short English sentences), warmup 16 reqs, 8/16/24/32/48 measured reqs at c=1/4/8/16/32.

Base (voice-cloning, RFC #3163 reproducer regime), TTFA p95 (ms) / req-s / audio-s per wall-s:

| Conc | main | +3321 | +3322 | +both |
|---|---|---|---|---|
| 4 | 3293 / 0.75 / 2.39 | 3208 / 0.75 / 2.45 | 1063 / 0.73 / 2.34 | 1169 / 0.79 / 2.55 |
| 8 | 8313 / 0.84 / 2.55 | 7911 / 0.87 / 2.79 | 1927 / 0.87 / 2.74 | 2457 / 0.87 / 2.65 |
| 16 | 12638 / 1.15 / 3.49 | 12687 / 1.13 / 3.46 | 6719 / 1.08 / 3.36 | 4616 / 1.39 / 4.28 |
| 32 | 15568 / 1.93 / 6.02 | 17005 / 1.62 / 5.10 | 18620 / 1.54 / 4.78 | 17892 / 1.49 / 4.77 |

CustomVoice (preset voice, no voice-cloning), TTFA p95 (ms) / req-s / audio-s per wall-s:

| Conc | main | +3321 | +3322 | +both |
|---|---|---|---|---|
| 4 | 188 / 3.82 / 15.6 | 168 / 4.00 / 15.5 | 275 / 2.46 / 9.7 | 289 / 2.61 / 10.0 |
| 8 | 815 / 5.56 / 20.6 | 781 / 5.11 / 19.9 | 474 / 3.99 / 15.1 | 1103 / 3.28 / 12.1 |
| 16 | 2004 / 5.63 / 22.7 | 1728 / 6.46 / 23.6 | 2341 / 4.45 / 17.6 | 2202 / 3.83 / 15.7 |
| 32 | 3611 / 6.03 / 23.4 | 3814 / 5.65 / 23.4 | 5637 / 4.28 / 16.6 | 6042 / 4.60 / 18.7 |

Reading:

  • This PR's batched Code2Wav path is a 3 to 4× TTFA p95 win on Base + voice-clone at c=4 and c=8 (3293 → 1063, 8313 → 1927). Audio throughput is roughly flat at low/mid conc on Base; the win is fairness and p95, not raw capacity. Stacking #3321 on top lifts c=16 to 4616 ms p95 / 1.39 rps / 4.28 audio-s per wall-s, the only configuration that meaningfully adds throughput on Base.
  • On CustomVoice, the same path costs roughly 30 to 40 percent rps and audio throughput at every conc, plus around 50 percent TTFA p95 at c=32 (3611 to 6042). Same shape you reported on H100; confirmed not H100-specific. The codec batching path adds dispatch overhead when the codec is not the bottleneck.
  • At c=32 on Base, all variants are within 20 percent of main on TTFA p95 and behind main on audio throughput. The system is past steady-state capacity on a single H20-3e regardless of optimization.

Suggested resolution: keep the batched path but workload-gate it. Two clean options:

  1. Auto-gate on task_type (Base / voice-clone uses batched, CustomVoice and VoiceDesign keep bs=1). The model already knows the task type per request.
  2. Ship a code2wav_force_bs1 config flag (your option 2 from the PR description) and flip it only in the Base / voice-clone deploy yaml.

Either avoids the CustomVoice regression while keeping the Base + voice-clone win; a rough sketch of option 1 follows.
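
A minimal sketch of what option 1 could look like (the `task_type` field name and its string values are assumptions for illustration, not the model's actual API):

```python
# Long stage-1 lifetimes (~3 s/req): cross-request batching pays off.
BATCHED_TASKS = {"base", "voice_clone"}

def code2wav_admission_limit(requests, batched_max=10):
    # CustomVoice / VoiceDesign keep main's bs=1 behavior; Base and
    # voice-clone requests may batch cross-request in the codec stage.
    if requests and all(r.task_type in BATCHED_TASKS for r in requests):
        return batched_max
    return 1
```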

Repro: scripts and full CSV at bench-3321-3322/, 32 rows. Hardware: H20-3e (141GB), single-GPU per server, default deploy yaml from each PR's branch.

cc @ischencheng @vklimkov-nvidia


@ischencheng
Author

ischencheng commented May 4, 2026

Profiling result: which part of the PR is doing the work

Tested main vs +3322 vs a fix variant (just max_num_seqs: 1→10 from the yaml, no batched forward) on H100, 1.7B, c=4/8/16.

TTFA (ms, mean / p95)

|       | CV c=4 | CV c=8 | CV c=16 | VC c=4 | VC c=8 | VC c=16 |
|---|---|---|---|---|---|---|
| main  | 138 / 188 | 312 / 740 | 1043 / 1648 | 1672 / 2540 | 12862 / 61551 | 55256 / 118213 |
| +3322 | 171 / 207 | 365 / 464 | 1588 / 2565 | 869 / 1129 | 1456 / 2311 | 6401 / 19117 |
| fix   | 150 / 183 | 282 / 366 | 1538 / 2424 | 813 / 1110 | 1263 / 2148 | 6966 / 14989 |

Decomposition

All the voice-clone win comes from the max_num_seqs: 1→10 change; fix matches +3322 across c (and beats main by 8–29× on VC p95). The batched chunked_decode([B, Q, F_max]) and CG (bs, seq) extension contribute essentially nothing — at c=4/8 only 4–24% of forwards even hit bs>1; at c=32 the PR's own table and @linyueqian's table both show no improvement (1.1× / -20%).

A separate decoder microbench (1× H100, no engine) corroborates this (a minimal sketch of the stream experiment follows the list):

  • bs=1 already saturates SMs: 8 concurrent bs=1 graphs on 8 CUDA streams take exactly 8× a single bs=1 wall (no parallelism gained), because the upsample chain (×8×5×4×3) blows seq out by 480× and fills 132 SMs many times over per layer.
  • The (bs, seq) CG extension is actually slightly harmful: chunked_decode(F=800) end-to-end takes 110/203/395/775 ms for bs=1/2/4/8 with bs=1 graph + eager fallback, vs 118/228/448/884 ms with the new (bs∈{1,2,4,8}, seq=325) graphs. Reason: the internal loop pads seq=300 and seq=225 chunks up to 325, wasting ~7–30% per call — outpacing the ~14% kernel saving from batching.
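
Roughly what the 8-stream experiment in the first bullet looks like, as a hedged sketch (assumes the decoder is graph-capturable; helper name and timing harness are illustrative):

```python
import time
import torch

def concurrent_bs1_wall_ms(decoder, example_input, n=8):
    """Capture n independent bs=1 graphs and replay them concurrently on n
    CUDA streams. If one bs=1 replay already fills the SMs, the concurrent
    wall time scales ~n x a single replay instead of overlapping."""
    graphs = []
    streams = [torch.cuda.Stream() for _ in range(n)]
    for _ in range(n):
        inp = example_input.clone()
        decoder(inp)                    # warmup (real code does this on a side stream)
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            decoder(inp)
        graphs.append(g)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for g, s in zip(graphs, streams):
        with torch.cuda.stream(s):      # each replay launches on its own stream
            g.replay()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3
```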

Why VC and CV behave so differently — prof side-by-side at c=4 (1× H100):

| metric | main+CV | main+VC | +3322+CV | +3322+VC |
|---|---|---|---|---|
| stage 0 chunk-emit interval | 22.9 ms | 263.2 ms | 35.2 ms | 294.3 ms |
| stage 1 GPU busy ratio | 79% | 9% | 80% | 53% |
| stage 1 fwd bs distribution | {1: 68} | {1: 213} | {1: 42, 2: 12, 3: 1} | {1: 342, 3: 13} |

Two things stand out: VC's stage 0 produces chunks ~11× slower (audio-encoder + ref_code prefill is heavy), and VC's stage 1 is mostly idle in main (9% busy). 4 concurrent VC requests at main come back with per-request TTFAs of 677 / 1252 / 2760 / 3333 ms — a staircase, each waiting ~one audio worth longer than the previous. That's request-level serialization at stage 1 admission: each request's sequence holds the slot for its entire ~3s audio decode lifetime, blocking the next 3 from even starting. max_num_seqs=1 is the lock, 1→10 releases it. CV doesn't suffer because its stage 1 sequence lifetime is ~14 chunks × 22.9 ms ≈ 320 ms, so even strict serialization only adds ~1s to a c=4 workload.

CV regression in +3322 has two contributing effects, both visible in stage 0's chunk-emit rate (main 22.9 ms → +3322 35.2 ms → fix 35.3 ms):

  • A small kernel-level cost from bs>1 forwards that surfaces as GPU contention. The padded chunked_decode([B, Q, F_max]) raises kernel-only decode time from 0.31 ms (main, bs=1) to 1.70 ms (PR average, mixed bs). Across the bench that's ~73 ms of extra GPU time for stage 1 (~3% of wall) — small in absolute terms, but stage 0 (the talker) on the same device 0 is sensitive to even small contention spikes during its tight AR loop. fix (bs=1 loop) recovers decode kernel time to 0.38 ms and pulls c=4 CV TTFA back from 156 → 136 ms — about half the regression gone.
  • Residual scheduler-layer overhead from max_num_seqs=10 itself, the bigger half. fix's stage 0 chunk-emit interval is 35.3 ms — identical to +3322's 35.2 ms — even though fix's stage 1 forwards are bs=1 like main. Per-forward Python+dispatch time is also similar (fix 32.3 ms vs +3322 34.9 ms vs main 18.6 ms). So something at the multi-stage scheduler / shm connector path scales with max_num_seqs and inflates per-step overhead independent of model-layer work. I didn't trace it deeper; the residual ~10–20% CV TTFA cost vs main lives here.

Suggestion

Drop the batched forward + CG (bs, seq) extension. Land just max_num_seqs: 1→10 + @linyueqian's task-type auto-gate (CustomVoice keeps =1, Base/voice-clone gets =10). The VC vs CV profile asymmetry above (long vs short stage-1 sequence lifetime) is the direct justification for gating on task_type.

Methodology: perf_counter() JSON-line logger at Qwen3TTSCode2Wav.forward enter/exit, Qwen3TTSTalkerForConditionalGeneration.forward enter/exit, talker2code2wav_async_chunk emit. Bench: asyncio httpx, single warmup, measured run reported. Patches available on request.
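
For reproducibility, the logger is essentially this (a hedged sketch; the tag names match the patch points above, the decorator itself is illustrative):

```python
import functools
import json
import time

def jsonl_trace(tag, path="/tmp/qwen3_tts_trace.jsonl"):
    """perf_counter() enter/exit logger, one JSON line per call."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # append-per-call is slow but fine for coarse stage timing
                with open(path, "a") as f:
                    f.write(json.dumps({"tag": tag, "enter": t0,
                                        "exit": time.perf_counter()}) + "\n")
        return wrapped
    return deco

# e.g. Qwen3TTSCode2Wav.forward = jsonl_trace("code2wav.fwd")(Qwen3TTSCode2Wav.forward)
```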

ischencheng added a commit to ischencheng/vllm-omni that referenced this pull request May 5, 2026
Profiling on 1× H100, 1.7B, c=4/8/16 (see PR vllm-project#3322 review comment for full
matrix) shows the voice-clone TTFA win in PR vllm-project#3322 comes entirely from the
yaml change `max_num_seqs: 1→10`, not from the batched chunked_decode forward
or the CG (bs, seq) wrapper extension:

- Reverting just the batched forward + CG extension while keeping
  `max_num_seqs=10` matches PR vllm-project#3322's voice-clone gain (8–29× p95
  improvement vs main at c=4/8/16) AND removes about half of PR vllm-project#3322's
  CustomVoice TTFA regression at c=4 (156 → 136 ms vs main's 115 ms).
- Stage 1 forward bs distribution under PR vllm-project#3322 is dominated by bs=1
  (76% at c=4 CV, 96% at c=4 VC); the new (bs ∈ {2,4,8}, seq=97) graphs
  are rarely hit. A separate decoder microbench shows bs=1 already
  saturates SMs (8 concurrent bs=1 streams take exactly 8× a single
  bs=1 wall) and the (bs, seq) graphs slightly hurt due to padding
  waste on shorter chunks.
- The PR's own H100-0.6B-Base table shows c=32 main 8047 → +3322 7286
  (1.1× speedup, with rps dropping); H20-3e 1.7B-Base at c=32 shows
  +20% TTFA regression. So the batched path doesn't pay off at any
  measured concurrency.

Reverts the batched forward, the CG (bs, seq) extension, and the e2e
concurrent TTFA test added by PR vllm-project#3322. Keeps only the yaml
`max_num_seqs: 1→10` change; the follow-up commit moves that knob from
the yaml default to a `--stage-overrides` opt-in for Base deployments
so CustomVoice / VoiceDesign keep main's TTFA behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch from e02f47f to aa5f738 Compare May 5, 2026 03:47
@ischencheng
Author

ischencheng commented May 5, 2026

Validation after restructure (re-benched on current main, post-#3306)

Per @linyueqian's heads-up about #3306 (comment), re-benched on e49fbd8a (current main, includes the OmniARAsyncScheduler plumbing). Setup unchanged: vllm bench serve --omni, seed-tts EN dataset (1088 rows, diverse ref_audio per request), Qwen3-TTS-12Hz-1.7B-Base, 1× H100, warmup 16 reqs, --metric-percentiles 95.

Two configs against current main:

  • A. baseline (vllm_omni/deploy/qwen3_tts.yaml as-is on main, stage 1 max_num_seqs: 1)
  • B. stage 1 max_num_seqs: 10 (yaml-edit; see note below — --stage-overrides does not work on current main)

| c | num | TTFA p95 (ms) A → B | RPS A → B | audio throughput A → B |
|---|---|---|---|---|
| 1 | 8 | 312 → 317 (~) | 1.05 → 1.06 (~) | 3.64 → 3.66 (~) |
| 4 | 32 | 1707 → 764 (2.2×) | 1.94 → 1.78 (-8%) | 7.68 → 7.05 (-8%) |
| 8 | 80 | 3080 → 1503 (2.0×) | 2.77 → 2.44 (-12%) | 11.53 → 10.08 (-13%) |
| 16 | 64 | 5033 → 4201 (1.2×) | 3.35 → 3.10 (-7%) | 13.60 → 12.94 (-5%) |
| 32 | 96 | 8841 → 8401 (~) | 3.19 → 3.24 (~) | 13.14 → 13.34 (~) |

All runs completed with 0 failures. Audio duration median ~3-4 s on both.

The TTFA p95 win at c=4 / c=8 (2×) survives #3306; c=16 marginal; c=32 neutral. Throughput cost is 8-12% in the win region.

For comparison, the original PR's full stack (batched forward + CG (bs, seq) extension + max_num_seqs=10) measured at 89865e27 on the same setup gave TTFA p95 734/1634/4526/10232 ms and RPS 1.54/2.14/2.65/2.71 at c=4/8/16/32 — strictly worse than just max_num_seqs: 10 on current main at every concurrency. So on H100 + 1.7B, the batched forward and the CG (bs, seq) extension are no longer carrying weight against the post-#3306 baseline; the max_num_seqs: 10 knob alone captures the available win.

This collapses #3322 to a config / docs change. Plan for the next push:

  1. Rebase the current 89865e27 + 833bd474 + f5e467a2 stack onto current main as a single commit: the yaml comment, the launcher's Base branch setting --stage-overrides '{"1": {"max_num_seqs": 10}}', and doc updates in docs/serving/speech_api.md / the Qwen3-TTS user guide. Drop the "Restore Code2Wav cross-request batching" and "Drop batched Code2Wav forward" commits — both became no-ops vs. current main.
  2. Side issue / blocker for the launcher: on current main, vllm-omni serve ... --stage-overrides '{"1": {"max_num_seqs": 10}}' hangs at Initializing stage 1 (stage-1 init never proceeds past config resolution; the orchestrator times out at 30 min). The same yaml setting baked in directly works fine — the failure is specific to the --stage-overrides interaction with #3306. Will dig and either fix here or split into a separate bugfix PR before this lands. (A sketch of the override semantics assumed here follows this comment.)

CustomVoice / VoiceDesign continue to use the default yaml (max_num_seqs: 1) and are unchanged from main.
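
For clarity on what the opt-in amounts to, a hedged sketch of the assumed --stage-overrides semantics (vllm-omni's actual resolver is more involved; this shallow per-stage merge is illustrative only):

```python
import json

def apply_stage_overrides(stage_configs, overrides_json):
    """Patch per-stage config dicts with a JSON override string."""
    for stage_id, patch in json.loads(overrides_json).items():
        stage_configs.setdefault(int(stage_id), {}).update(patch)
    return stage_configs

# apply_stage_overrides(cfgs, '{"1": {"max_num_seqs": 10}}') leaves stage 0
# untouched and raises stage 1's max_num_seqs to 10
```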

ischencheng added a commit to ischencheng/vllm-omni that referenced this pull request May 5, 2026
PR vllm-project#3322's `max_num_seqs: 1→10` yaml change gives Base voice-clone a
large TTFA win at concurrency (8–29× p95 on c=4/8/32, validated 1× H100,
1.7B) but regresses CustomVoice / VoiceDesign at low concurrency due to
multi-stage scheduler overhead. CV/VoiceDesign requests have ~50–200 ms
stage-1 lifetimes — there is no batching benefit, only step overhead.

Move the knob:
- Restore the yaml default to `max_num_seqs: 1` (CV / VoiceDesign safe;
  matches main behavior, no regression).
- Base deployments opt in via `--stage-overrides '{"1": {"max_num_seqs":
  10}}'`. The example launcher's Base branch sets this automatically.
- Update the Base launch examples in the user guide, the example
  README, and the speech_api throughput section to show the override.

Validation against main on 1× H100, 1.7B (voice-clone, ref clone_2.wav):

  c   |  main p95 (ms) | pr-3322 p95 (ms) | win
  ----|----------------|------------------|-----
   1  |          367   |          361     |  ~
   2  |          737   |          731     |  ~
   4  |        3,238   |        1,247     | 2.6×
   8  |      101,508   |        2,434     |  42×
  32  |      176,162   |       25,202     |   7×

CustomVoice (default yaml) is unchanged from main.

Also drops a pre-existing latent bug in the launcher: a top-level
`--gpu-memory-utilization 0.9` was overriding the per-stage 0.3 in the
deploy yaml, OOMing stage 1 on multi-stage deployments. Latent since
the April Pipeline + Deploy Config Schema refactor (vllm-project#2383); never hit
because the launcher wasn't actually used end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch 2 times, most recently from f5e467a to 398f442 Compare May 6, 2026 04:16
The bundled qwen3_tts.yaml now ships stage 1 (Code2Wav) at
max_num_seqs: 10 (set in vllm-project#2556), tuned for Base voice cloning's long
stage-1 lifetimes (~3 s/req): admitting up to 10 concurrent codec
sequences gives ~2x TTFA p95 at c=4 / c=8 on 1x H100 + 1.7B-Base +
seed-tts at an 8-12% audio-throughput cost.

CustomVoice / VoiceDesign have ~50-200 ms stage-1 lifetimes and remain
TTFA-optimal at max_num_seqs: 1. Document the trade-off and the
override invocation, and add a yaml comment so the choice is visible
at the config site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
@ischencheng ischencheng force-pushed the cheng/rfc-3163-code2wav-batching branch from 398f442 to d926f4f Compare May 8, 2026 22:30
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI
@hsliuustc0106
Collaborator

Hi @ischencheng, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐

Could you please provide an update?

  • If you're still working on it, that's great — just let us know.
  • If you're blocked on something, feel free to ask for help.
  • If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏


Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[RFC]: Cross-request batching for Qwen3-TTS Code2Wav stage to fix TTFB scaling under concurrency

3 participants