[Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec#3221
vklimkov-nvidia wants to merge 14 commits into
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…3tts nv using triton Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…nfig.pbtxt Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…sting multiple codecs Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 491217bbbe
    full = self._build_prompt_embeds(
        text=text, speaker=speaker, language=language
    )
Guard prefill prompt build to first pipeline rank
preprocess always calls _build_prompt_embeds for prefill spans, but this model only creates a real text_embedding on the first PP rank and uses PPMissingLayer elsewhere. Since OmniGPUModelRunner invokes preprocess per request on every rank, any run with pipeline_parallel_size > 1 and prefill traffic can hit this path on non-first ranks and fail before forward. Add a rank guard (or avoid PPMissingLayer here) so only the first rank performs prompt-embedding construction.
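One possible shape for the rank guard suggested above, as a minimal sketch. `ToyTalker`, the `is_first_pp_rank` constructor flag, and the toy embedding table are hypothetical stand-ins, not this PR's code; in vLLM the rank check would come from the pipeline-parallel group (for example `get_pp_group().is_first_rank`) rather than from a constructor argument.

```python
from typing import Optional


class ToyTalker:
    """Hypothetical stand-in for the talker model; not the PR's class."""

    def __init__(self, is_first_pp_rank: bool):
        self.is_first_pp_rank = is_first_pp_rank
        # Only the first pipeline rank owns a real embedding table. Other
        # ranks hold a placeholder (None here, PPMissingLayer in the review).
        self.text_embedding = {"hello": [0.1, 0.2]} if is_first_pp_rank else None

    def _build_prompt_embeds(self, text: str) -> list:
        # Would raise on non-first ranks, where text_embedding is a placeholder.
        return [self.text_embedding[tok] for tok in text.split()]

    def preprocess(self, text: str) -> Optional[list]:
        # Guard: later ranks receive hidden states from the previous rank,
        # so they skip prompt-embedding construction entirely.
        if not self.is_first_pp_rank:
            return None
        return self._build_prompt_embeds(text)
```

With this guard, `ToyTalker(False).preprocess("hello")` returns `None` instead of failing on the missing embedding table.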
dropped PP support for now
    dialect = spk_is_dialect.get(speaker.lower())
    if isinstance(dialect, str) and dialect:
Normalize speaker key before dialect lookup
Dialect resolution in _build_prompt_embeds uses spk_is_dialect.get(speaker.lower()) without trimming whitespace, while other paths (including prompt-length estimation and speaker-id lookup) use stripped speaker keys. If a request sends a valid speaker name with leading/trailing spaces, prefill can miss dialect language conditioning and diverge from the estimated prompt layout, which changes control tokens and can misalign placeholder length assumptions.
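A minimal sketch of the normalization described above, assuming a single shared helper is used by every lookup path. `normalize_speaker`, `resolve_dialect`, and the sample mapping values are hypothetical; only `spk_is_dialect` and the `isinstance` check come from the quoted diff.

```python
from typing import Optional


def normalize_speaker(speaker: str) -> str:
    """Canonical speaker key: trimmed and lower-cased."""
    return speaker.strip().lower()


def resolve_dialect(speaker: str, spk_is_dialect: dict) -> Optional[str]:
    """Return the dialect name for a speaker, or None if there is none."""
    dialect = spk_is_dialect.get(normalize_speaker(speaker))
    if isinstance(dialect, str) and dialect:
        return dialect
    return None


# Hypothetical mapping, for illustration only.
spk_is_dialect = {"speaker_a": "dialect_x", "speaker_b": False}
```

Routing prompt-length estimation, speaker-id lookup, and dialect lookup through the same helper keeps the estimated and actual prompt layouts in agreement.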
I benchmarked the proposed model on its own (acoustic code prediction only) and end-to-end (text to waveform) using Triton Inference Server.
Model-Only Benchmark (Talker only)
End-to-End Service Benchmark (Talker + Codec)
Fork is this one: https://github.com/vklimkov-nvidia/vllm/tree/vklimkov/qwen3_tts_voices
@hsliuustc0106 Overall I would advocate for merging this change, since it does not touch the core of vllm-omni but
…d back on request Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Nice work! I will take a look later and work on finding out why decoding steps are slow. Thanks for your contribution :)
Reproduced on H20: throughput claim holds, with caveats on bring-up
Spun up the full Triton + TRT-batched codec stack on a single H20 (1x H20-3e, driver 570.133.20, CUDA 12.8 host, Triton container A/B vs current main (
| Concurrency | main req/s | main RTF | main TTFA mean / p95 | PR req/s | PR RTF | PR TTFA mean / p95 | speedup (req/s) |
|---|---|---|---|---|---|---|---|
| 1 | 0.80 | 6.75x | 47 / 52 ms | 1.08 | 7.50x | 38 / 41 ms | +35% |
| 8 | 2.29 | 19.15x | 280 / 386 ms | 4.41 | 29.98x | 173 / 397 ms | +93% |
| 16 | 2.44 | 21.52x | 2191 / 4437 ms | 4.52 | 31.30x | 202 / 258 ms | +85% |
| 32 | 2.47 | 21.47x | 4359 / 8626 ms | 6.73 | 46.11x | 420 / 463 ms | +173% |
Two takeaways worth flagging beyond raw throughput:
- main's throughput plateaus at roughly 2.3 to 2.5 req/s from c=8 onward (2.47 req/s at c=32); additional concurrency just queues. TTFA on main grows from 0.28 s mean / 0.39 s p95 at c=8 to 4.4 s mean / 8.6 s p95 at c=32, which is functionally unusable for streaming TTS.
- The PR's stack keeps TTFA p95 under 470 ms even at c=32. This is the user-visible win; the `req/s` improvement is real, but the TTFA improvement is what makes high-concurrency streaming actually work.
So the claim in the PR description holds. Worth merging the serving recipe.
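The speedup column and the plateau on main can be re-derived from the req/s figures in the table above. This is only a sanity check on the published (rounded) numbers, so the result can differ from the table by a percentage point.

```python
# req/s per concurrency level, copied from the table: (main, PR)
REQ_PER_SEC = {
    1: (0.80, 1.08),
    8: (2.29, 4.41),
    16: (2.44, 4.52),
    32: (2.47, 6.73),
}


def speedup_pct(main_rps: float, pr_rps: float) -> float:
    """Relative throughput gain of the PR over main, in percent."""
    return (pr_rps / main_rps - 1.0) * 100.0


speedups = {c: round(speedup_pct(m, p)) for c, (m, p) in REQ_PER_SEC.items()}

# How much main gains when concurrency goes from 8 to 32 (a 4x increase).
main_gain_8_to_32 = speedup_pct(REQ_PER_SEC[8][0], REQ_PER_SEC[32][0])

for c, pct in speedups.items():
    print(f"c={c}: PR vs main {pct:+d}%")
print(f"main, c=8 -> c=32: {main_gain_8_to_32:+.1f}%")
```

Main gains under 10% throughput for a fourfold increase in concurrency, which is the plateau described in the first takeaway.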
Bring-up friction (suggest folding these into examples/online_serving/qwen3_tts_nv_triton/README.md)
The recipe took ~3 hours of patch-and-rebuild on H20 (non-CUDA-13 driver). Most of that is captured below as concrete fixes; happy to send a follow-up PR with a Dockerfile.cu12 variant + README section if useful.
P1 (correctness, broken as-shipped):
- Dockerfile clones the wrong branch: `--branch qwen3tts_refactor` is the closed PR [Model] Qwen3-TTS: integrate code predictor into model CUDA graph #3071 branch, not this one. Should be `vklimkov/qwen3tts_nv` (or whatever this PR's final branch name resolves to).
- README step 1 says `cd examples/online_serving/qwen3_tts_triton`, actual path is `qwen3_tts_nv_triton`. Copy-paste fails.
- `python3-libnvinfer=10.15.1.29-1+cuda13.1` pin is too tight; on a CUDA-12 base (e.g. `tritonserver:25.12-py3`) the base image already ships TRT 10.x with cuda12 builds. Recommend dropping the explicit version pin and letting apt resolve to whatever the base image carries, or shipping two Dockerfiles.
P2 (robustness, will hit anyone reproducing on a non-author machine):
- Final pip step `pip install ... transformers==4.57.3` fails with `ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE` because vllm 0.19.0 leaves hash-pinned dep constraints from earlier layers. Workaround: `pip install --no-cache-dir --force-reinstall --no-deps transformers==4.57.3` in its own RUN.
- `numpy-1.26.4.dist-info` ends up with `invalid metadata entry 'name'` after the layered installs; transformers crashes at import with `Unable to compare versions for numpy>=1.17: need=1.17 found=None`. Workaround: append `RUN pip install --no-cache-dir --force-reinstall --no-deps "numpy==1.26.4"` at the end of the Dockerfile.
- For systems on driver < 580 (CUDA 13.1 not directly supported), the container's TRT 10.16 needs forward-compat libs. Adding `ENV LD_LIBRARY_PATH=/usr/local/cuda-13.1/compat:/usr/local/cuda/lib64` makes `trtexec` and the TRT codec runtime work via NVIDIA's forward-compat path. Worth either documenting or switching base image to a CUDA-12 tag (e.g. `25.12-py3`).
P3 (config defaults that don't match the perf claim):
- `model_repository/codec_decoder/config.pbtxt` ships `max_queue_delay_microseconds: 100`. At 100 microseconds Triton's dynamic batcher will rarely form batches > 1 in practice; the codec batching that drives the c=32 win comes from concurrent requests aligning by chance. Recommend a default in the 1000 to 5000 us range, or annotating that this knob is the headline perf knob (the README does mention tweaking it, but the default value should be one that already shows the gain).
Repro setup
- Triton: `nvcr.io/nvidia/tritonserver:25.12-py3` (CUDA 12.8 base, runs on driver 570.x without forward compat issues).
- GPU: H20-3e, GPU 1, otherwise idle.
- Codec engine: `--minShapes 1x30x16 --optShapes 8x30x16 --maxShapes 32x30x16 --fp16`, parity vs ONNX max_abs_diff 1.4e-5 PASSED.
- Both stacks: same 20 English prompts, 30 requests per concurrency level, with warmup.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
@Sy0307 I looked further into the gap between forward passes. Turned out
Attaching an updated profile picture: the gap between forward passes is reduced from 2.9 ms to 0.8 ms. I updated the benchmark numbers in the PR description and in the README. TTFA and TTFT grew by roughly one decoder step; I think the decoder now does one extra step that goes into the measurement. I will look into fixing it, but overall this might be an important finding for other models too: the gap between forward passes is currently somewhat inflated and can be reduced.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Thanks for the big effort of checking it, and sorry the recipe was raw.
Yes! cu12 would definitely be appreciated! Please push a commit with the Dockerfile you end up with.
My bad! Fixed.
Fixed.
Dropped it; it's not required for the run. The container doesn't have
Added a separate entry at the end of the Dockerfile that installs those without dependencies.
Happy to do that. Please share your Dockerfile; I would be happy to check it in my env.
Actually, even with 0 queue delay I was getting an average batch size of ~4 according to the metrics port (8002). Btw, I analyzed the gap between forward passes and was able to push throughput to 37xrt on my hardware, i.e. another +23%. See the comment to @Sy0307 above.
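For anyone who wants to repeat the batch-size check, here is a rough sketch of how an average batch size can be read off Triton's Prometheus metrics. It relies on the `nv_inference_count` and `nv_inference_exec_count` counters that Triton exposes; the sample text, its values, and the `talker` model name are made up for illustration, and this is not necessarily how the number above was obtained.

```python
import re

# Made-up excerpt in Prometheus text format, for illustration only.
SAMPLE_METRICS = """\
# HELP nv_inference_count Number of inferences performed
nv_inference_count{model="codec_decoder",version="1"} 4000
nv_inference_exec_count{model="codec_decoder",version="1"} 1000
nv_inference_count{model="talker",version="1"} 500
nv_inference_exec_count{model="talker",version="1"} 500
"""


def average_batch_size(metrics_text: str, model: str) -> float:
    """Inferences per batch execution for one model (0.0 if never executed)."""

    def counter(name: str) -> float:
        pattern = (
            rf'^{name}\{{[^}}]*model="{re.escape(model)}"[^}}]*\}}'
            r"\s+([0-9.eE+-]+)$"
        )
        match = re.search(pattern, metrics_text, flags=re.MULTILINE)
        return float(match.group(1)) if match else 0.0

    executions = counter("nv_inference_exec_count")
    return counter("nv_inference_count") / executions if executions else 0.0


print(average_batch_size(SAMPLE_METRICS, "codec_decoder"))
```

Against a live server the text would come from the `/metrics` endpoint on the metrics port (8002 by default). The counters are cumulative, so taking the difference between two scrapes gives the average over the benchmark window only.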
I think this PR is great, but perhaps we could split the scheduler work into a separate PR. Could you create a dedicated PR for it and write some tests? Or I could also take on some of that work. In fact, we are also researching what the best scheduler design would be for TTS. A potentially relevant reference: #2568
… gap

Borrows AsyncOmniARScheduler from PR vllm-project#3221 and wires the LLM_AR scheduler selection so any stage with async_scheduling=true automatically picks the async-bookkeeping variant.

Background: When async_scheduling=true, vLLM's EngineCoreProc drives step_with_batch_queue, which speculatively schedules the next batch while the current one is still on the GPU. For the queue to stay full, the scheduler must increment request.num_output_placeholders after each scheduled step (so the next schedule() call knows to launch one more decode token before the previous step's output has merged) and decrement it again when the output arrives. Base OmniARScheduler skips this bookkeeping, so schedule() returns 0 tokens on every other step, the engine sleeps 1 ms, and the alternating empty-step pattern adds a ~2-3 ms gap between every talker forward - visible in nsys profiles and confirmed by PR vllm-project#3221's reviewer.

AsyncOmniARScheduler injects vllm.v1.core.sched.AsyncScheduler into the OmniARScheduler MRO so the placeholder bookkeeping takes effect while preserving every Omni-specific behaviour (OmniNewRequestData wrapping, KV-transfer metadata, chunk-transfer adapter, streaming-session hooks).

Wiring:
* New _resolve_scheduler_cls(execution_type, async_scheduling) helper in stage_config.py picks AsyncOmniARScheduler for LLM_AR stages whenever async_scheduling=true; sync stages continue to use OmniARScheduler.
* Re-exported from vllm_omni.core.sched for downstream callers.

Measured impact (single H100 80 GB, Qwen3-TTS-12Hz-0.6B-Base, default qwen3_tts.yaml = both stages max_num_seqs=10, 30/60/80/128 reqs at c=1/4/8/32 with 96-req warmup):

| Concurrency | TTFA mean (default) | TTFA mean (+Async) | rps default | rps +Async |
| ----------: | ------------------: | -----------------: | ----------: | ---------: |
| 1 | 259 ms | 260 ms | 0.93 | 0.94 |
| 4 | 761 ms | 728 ms | 1.26 | 1.39 |
| 8 | 1220 ms | 1129 ms | 1.75 | 1.55 |
| 32 | 7286 ms | 5775 ms | 3.24 | 3.91 |

c=32 sees TTFA mean -21% and rps +20% vs the base RFC vllm-project#3163 P0 fix; rps also exceeds main (3.51) on the same workload. c=1 is unchanged.

Co-Authored-By: Viacheslav Klimkov (PR vllm-project#3221)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
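The placeholder bookkeeping described in the Background paragraph can be illustrated with a toy model. This is a deliberately simplified sketch, not vLLM's scheduler: `ToyRequest`, `schedule`, `merge_output`, and `simulate` are hypothetical, there is a single request, and the engine is assumed to keep at most two batches in flight.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToyRequest:
    num_tokens: int                   # prompt tokens + merged output tokens
    num_computed_tokens: int = 0      # tokens already handed to the GPU
    num_output_placeholders: int = 0  # sampled but not yet merged tokens


def schedule(req: ToyRequest, bookkeeping: bool) -> int:
    """Return the number of tokens to run this step (0 = empty step)."""
    n = req.num_tokens + req.num_output_placeholders - req.num_computed_tokens
    req.num_computed_tokens += n
    if bookkeeping and n > 0:
        # Reserve a slot for the token this step will sample, so the next
        # schedule() call can launch before the output has been merged.
        req.num_output_placeholders += 1
    return n


def merge_output(req: ToyRequest, bookkeeping: bool) -> None:
    """A finished step delivers one sampled token."""
    req.num_tokens += 1
    if bookkeeping:
        req.num_output_placeholders -= 1


def simulate(steps: int, bookkeeping: bool, prompt_len: int = 4) -> list:
    req = ToyRequest(num_tokens=prompt_len)
    in_flight = deque()
    trace = []
    for _ in range(steps):
        n = schedule(req, bookkeeping)
        trace.append(n)
        if n > 0:
            in_flight.append(n)
        # The oldest batch completes once the queue is full (depth 2) or
        # nothing new could be scheduled.
        if in_flight and (len(in_flight) == 2 or n == 0):
            in_flight.popleft()
            merge_output(req, bookkeeping)
    return trace


print(simulate(6, bookkeeping=False))
print(simulate(6, bookkeeping=True))
```

Without bookkeeping the trace alternates between a real step and an empty step after prefill; with bookkeeping every step schedules one decode token, which matches the gap reduction the commit message describes.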
Tested this PR on a single H100 80GB (RunPod, Triton 25.12 base image, vLLM 0.19.0 per the Dockerfile) using the SeedTTS English testset, 100 prompts each, `voice=Vivian` on both sides, closed-loop concurrency sweep:
For reference, the H20-3e numbers from #3238's description on the same workload:
On H100 main does not saturate at c=32 the way it does on H20-3e (5.68 vs 2.47 req/s), so the PR win that's clearly visible on H20 doesn't appear here: req/s comes out behind main and TTFA tails are roughly comparable.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Thanks @ischencheng for having a look!
On the baseline comparison:
On H100 Performance:
These numbers align with the expectation that the H100 should provide a performance lift. Since TRT models usually perform on par with or better than CUDA graphs, and Talker changes specifically target speed-ups, the PR should ideally be faster than
Regarding Mainline: Let me know if I should be benchmarking against a specific newer version or a different
…ient to docker for benchmark Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Implements RFC vllm-project#3163 P0:

* Fix the per-request for-loop in Qwen3TTSCode2Wav.forward(): pad scheduler-delivered sequences to a single [B, Q, F_max] and run one chunked_decode call. The previous loop forced bs=1 even when the scheduler had already grouped concurrent requests, regressing per-step throughput introduced by PR vllm-project#1426.
* Extend CUDAGraphDecoderWrapper to capture (batch_size, seq_len) pairs. Default set keeps the existing bs=1 seq buckets and adds (bs in {2,4,8}, seq=streaming_hot) so the new bs>1 hot path stays on the graph. Bs/seq misses fall back to eager.
* Plumb capture_pairs / max_batch_size through enable_cudagraph(). YAML adds an optional code2wav_capture_pairs override.
* Raise Stage 1 max_num_seqs from 1 to 10 (matches Stage 0). Update the inline comment about engine-level CUDA Graph compatibility.

Tests:

* New tests/model_executor/models/qwen3_tts/test_code2wav_batching.py covers bs=1 parity, bs>1 per-request parity, padding-no-bleed, per-request left_context, and malformed-mixed batches.
* tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py switches the fixture and tests to the new (bs, seq) API and adds multi-bs capture/replay, uncaptured-bs fallback, and compute_capture_pairs cases.
* New tests/e2e/online_serving/test_qwen3_tts_concurrent_ttfb.py follows the bench shape of PR vllm-project#3221's benchmark_service.py (Throughput, RTF, TTFA mean / p95) and asserts sub-linear TTFA scaling per the RFC target (TTFA c=4 / TTFA c=1 <= 4.0x).

Resolves vllm-project#3163.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
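The padding step from the first bullet can be sketched in plain Python. This is an illustration only: `pad_batch` and `unpad` are hypothetical helpers, nested lists stand in for tensors, and it assumes that Q is the number of codebooks, F the number of frames, and that padding goes on the right. The actual change may pad differently.

```python
def pad_batch(requests, pad_value=0):
    """Pad per-request codes shaped [Q][F_i] to one batch shaped [B][Q][F_max].

    Returns the padded batch and each request's original frame count, so the
    decoder output can be trimmed back per request afterwards.
    """
    lengths = [len(req[0]) for req in requests]
    f_max = max(lengths)
    batch = [
        [row + [pad_value] * (f_max - len(row)) for row in req]
        for req in requests
    ]
    return batch, lengths


def unpad(outputs, lengths):
    """Trim each request's rows back to its own length (no padding bleed)."""
    return [[row[:n] for row in req] for req, n in zip(outputs, lengths)]


# Two requests with Q=2 codebooks and 3 and 1 frames respectively.
reqs = [
    [[1, 2, 3], [4, 5, 6]],
    [[7], [8]],
]
batch, lengths = pad_batch(reqs)
print(batch)
print(unpad(batch, lengths) == reqs)
```

One decode call over the padded batch replaces B separate bs=1 calls; the recorded lengths are what keep padded frames out of each request's output.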
Hi @vklimkov-nvidia, friendly reminder: this PR hasn't had any activity (commits or reviews) in the past 11 days. 🕐 Could you please provide an update?
Thanks for your contribution! 🙏


Purpose
This PR introduces a new model definition for Qwen3-TTS that follows a unified architecture, alongside a comprehensive end-to-end Triton Inference Server serving example.
Instead of modifying the existing Qwen3-TTS model implementation, this PR provides a parallel model definition built on the principle of keeping the code predictor as an internal component of the model.
Key Architectural Choices:
Triton Serving Example:
Added a production-ready recipe in `examples/online_serving/qwen3_tts_nv_triton/` to serve the two distinct stages of Qwen3-TTS efficiently.
- … `tensorrt_plan` backend to efficiently batch independent frame chunks.
Test Plan
- `tests/model_executor/models/qwen3_tts_nv/test_qwen3_tts_talker_nv.py`, covering the talker forward path, code predictor integration, and CUDA graph capture/replay behavior.
- `examples/online_serving/qwen3_tts_nv_triton/` that wires the new talker into a Triton Inference Server deployment (vLLM-Omni Python backend for the talker + TensorRT backend for the codec, orchestrated via BLS over decoupled gRPC streaming).
- `examples/online_serving/qwen3_tts_nv_triton/benchmark_service.py` to benchmark the full Triton service end-to-end (throughput, RTF, TTFA) and to optionally dump the synthesized waveforms for offline quality inspection.
Test Result
- `tests/model_executor/models/qwen3_tts_nv/test_qwen3_tts_talker_nv.py` passes.
- `benchmark_service.py` at concurrency 32 was transcribed via ASR: no intelligibility regressions and no synthesis-stability issues (no truncation, repetition, or collapse) were observed.
- … `max_num_seqs` / engine config; latencies reported as mean / p95 in ms):
End-to-end service (`benchmark_service.py`, talker + codec via Triton):
Talker only (`benchmark_model.py`, codec tokens only, no waveform):