[Model] Qwen3-TTS: integrate code predictor into model CUDA graph by vklimkov-nvidia · Pull Request #3071 · vllm-project/vllm-omni

vklimkov-nvidia · 2026-04-23T14:08:44Z

Purpose

Per the Slack discussion about Qwen3-TTS, this PR keeps the code predictor as part of the model instead of introducing a multi-token-predictor concept in the model runner:

During decode-only batches, the talker + code predictor are captured as a single full CUDA graph.
During prefill or mixed batches, they are captured as piecewise CUDA graphs.
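The dispatch rule described above can be sketched in a few lines. This is only an illustration of the decision, not the PR's code: `GraphMode`, `BatchShape`, and `select_graph_mode` are hypothetical names, and the actual selection is assumed to happen inside the existing CUDA-graph machinery.

```python
from dataclasses import dataclass
from enum import Enum


class GraphMode(Enum):
    FULL = "full"            # talker + code predictor replayed as one graph
    PIECEWISE = "piecewise"  # model captured as several smaller graphs


@dataclass(frozen=True)
class BatchShape:
    """Hypothetical summary of a scheduled batch."""
    num_prefill_tokens: int
    num_decode_tokens: int


def select_graph_mode(batch: BatchShape) -> GraphMode:
    """Pick the capture mode described in the PR text.

    Decode-only batches use the full graph; any batch that contains
    prefill tokens (pure prefill or mixed) uses piecewise graphs.
    """
    if batch.num_prefill_tokens == 0 and batch.num_decode_tokens > 0:
        return GraphMode.FULL
    return GraphMode.PIECEWISE
```

For example, `select_graph_mode(BatchShape(0, 8))` selects the full graph, while a batch with any prefill tokens falls back to piecewise capture.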

Benefits:

The GPU model runner stays clean of model-architectural details (no MTP-specific branches).
Faster end-to-end: fewer graph launches and a single replay on decode.
Localizes Qwen3-TTS quirks inside the model module, matching the design of other models.

Scope of changes:

vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py — own the code-predictor invocation and CUDA-graph capture flow.
vllm_omni/model_executor/models/common/qwen3_code_predictor.py — simplified/refactored to be graph-capturable as part of the model.
vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code_predictor_vllm.py — adapter adjustments.
vllm_omni/model_executor/models/qwen3_omni/* — small call-site updates for consistency.
vllm_omni/worker/gpu_model_runner.py — drop runner-side multi-token-predictor handling.
tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py — cleanup.

Test Plan

Unit: pytest tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py
E2E (offline): run the Qwen3-TTS talker demo notebook / example script and verify audio token generation is unchanged.
Serving: benchmarks/benchmark_qwen3_tts_serve.py and benchmarks/benchmark_qwen3_tts_talker.py before/after to confirm parity and speedup.
CUDA-graph modes exercised: decode-only (full graph) and prefill/mixed (piecewise).
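One way to check that audio token generation is unchanged is a token-by-token comparison of the baseline and post-refactor outputs. The helper below is a hypothetical sketch (the name `first_mismatch` and the use of plain integer token lists are assumptions); it also reports a length mismatch, since outputs of different length are not comparable.

```python
from typing import Optional, Sequence


def first_mismatch(baseline: Sequence[int],
                   candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # Common prefix matches; a length difference counts as a mismatch
    # at the first index where only one sequence has a token.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None
```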

Test Result

Unit tests: pass.
E2E outputs match the pre-refactor baseline (same audio tokens given the same input).
Decode throughput improves due to a single full CUDA graph replay; prefill/mixed performance is on par with the previous implementation.

(Please replace with concrete numbers from benchmark_qwen3_tts_* before merging.)

Essential Elements of an Effective PR Description Checklist

The purpose of the PR.
The test plan.
The test results (fill in concrete numbers).
(Optional) Documentation update — N/A (no user-facing API change).
(Optional) Release notes update — N/A.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>

linyueqian · 2026-04-23T14:10:57Z

@Sy0307 ptal

amy-why-3459 · 2026-04-24T02:09:20Z

@gcanlin PTAL

gcanlin · 2026-04-24T02:11:19Z

Do we plan to land it before 0.20.0? I'd prefer after, since we recently merged the NPU graph for the code predictor and need more time to test this PR (the full graph).

amy-why-3459 · 2026-04-24T02:13:40Z

Please resolve the conflict.

linyueqian · 2026-04-24T02:14:00Z

I think after 0.20.0. @vklimkov-nvidia said he will add something to this PR as well.

zhumingjue138 · 2026-04-24T04:01:13Z

Please add the corresponding unit tests.

hsliuustc0106 · 2026-04-28T02:28:16Z

can you add the profiling comparison here?

hsliuustc0106 · 2026-04-28T02:53:41Z

Can you try the latest benchmark here: https://github.com/vllm-project/vllm-omni/tree/main/benchmarks/tts

Sy0307 · 2026-04-28T08:30:16Z

I tested this PR (rebased on the latest origin/main) and found a performance gain of approximately 5% under low concurrency. Under high concurrency, however, its performance is almost on par with the current code, and there may even be some regression (within 5%). I'm not sure whether this is just run-to-run variance, but there doesn't seem to be much of a gain under high concurrency.

Could you please confirm whether these results are correct? Also, could you test whether there is a performance regression under high concurrency? If needed, I can share more details about the testing, but in general, I followed the TTS benchmark for my tests.

cc @vklimkov-nvidia @hsliuustc0106 @linyueqian

hsliuustc0106 · 2026-04-28T09:52:20Z

Can you compare the profiling with and without this PR?

Sy0307 · 2026-04-28T10:27:43Z

I ran an aligned A/B benchmark for this PR against latest origin/main (7a8b428) on Qwen3-TTS, following the TTS benchmark path under benchmarks/tts and using the directly-controllable vllm-omni bench serve --omni flow.

Test setup:

Model: Qwen3-TTS-12Hz-1.7B-Base
Dataset: Seed-TTS / en, 50 prompts, 2 warmups
GPU: single H20, CUDA_VISIBLE_DEVICES=2
Concurrency: 1, 4, 10
Same deploy config from latest main
The PR branch was aligned to latest main relay schema; without this alignment, the PR produced invalid 0s audio results.

Results:

concurrency	origin req/s	PR req/s	req/s delta	origin audio/s	PR audio/s	audio/s delta	origin mean audio duration	PR mean audio duration
1	0.871	0.921	+5.7%	3.542	3.707	+4.7%	4.07s	4.03s
4	1.335	1.355	+1.5%	5.492	5.521	+0.5%	4.11s	4.07s
10	1.904	1.850	-2.8%	7.670	7.355	-4.1%	4.03s	3.98s

Conclusion:

After aligning generation semantics and output duration, this PR shows only a small improvement at low concurrency and no positive gain at higher concurrency. At c=10, both request throughput and audio throughput are slightly lower than latest main.

So the earlier large audio/s improvement does not look like a real performance gain from code predictor graphing. It was mainly caused by non-comparable generation behavior / output length differences. The apples-to-apples result is much closer to neutral.

Note that this is a short benchmark run, so the numbers may have some run-to-run variance. The results should be interpreted cautiously unless confirmed by repeated runs.
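The delta columns in the results table follow from the usual relative-change formula, (PR - origin) / origin. The short script below reproduces them from the throughput values in the table; the function and variable names are illustrative only.

```python
# Values copied from the results table: (concurrency, origin, PR).
REQ_PER_S = [(1, 0.871, 0.921), (4, 1.335, 1.355), (10, 1.904, 1.850)]
AUDIO_PER_S = [(1, 3.542, 3.707), (4, 5.492, 5.521), (10, 7.670, 7.355)]


def relative_delta_pct(origin: float, pr: float) -> float:
    """Percentage change of the PR value relative to origin/main."""
    return (pr - origin) / origin * 100.0


def delta_column(rows):
    """Map each concurrency level to its delta, rounded to one decimal."""
    return {c: round(relative_delta_pct(o, p), 1) for c, o, p in rows}


if __name__ == "__main__":
    print(delta_column(REQ_PER_S))    # {1: 5.7, 4: 1.5, 10: -2.8}
    print(delta_column(AUDIO_PER_S))  # {1: 4.7, 4: 0.5, 10: -4.1}
```

All six deltas are within about 6% in either direction, which is why repeated runs would be needed to separate a real effect from noise.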

vklimkov-nvidia · 2026-04-28T14:10:56Z

thanks @Sy0307 for having a look.
I meant to follow this submission up with the serving part, which is what actually provides the speedups. The PR itself matters from one perspective: perhaps there is no need for an explicit multi-token predictor that the runner executes. It can be part of the model definition and fit the regular GPU runner.
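As a toy illustration of that design point, the sketch below puts the predictor call inside the model's own forward pass so that the runner needs no predictor-specific branch. Everything here is hypothetical: the class and function names are invented, integers stand in for tensors, and the assumption that the code predictor emits a few extra code tokens after the talker's first token is not stated in this thread.

```python
from typing import Callable, List


class TalkerWithCodePredictor:
    """Hypothetical wrapper: one forward call yields all code tokens."""

    def __init__(self,
                 talker: Callable[[int], int],
                 code_predictor: Callable[[int, int], int],
                 num_extra_codes: int) -> None:
        self.talker = talker
        self.code_predictor = code_predictor
        self.num_extra_codes = num_extra_codes

    def forward(self, prev_token: int) -> List[int]:
        # The talker produces the first code token for this step.
        codes = [self.talker(prev_token)]
        # The code predictor runs inside the model, not in the runner.
        for level in range(self.num_extra_codes):
            codes.append(self.code_predictor(codes[-1], level))
        return codes


def runner_step(model: TalkerWithCodePredictor, prev_token: int) -> List[int]:
    """Generic runner step: it only knows about model.forward."""
    return model.forward(prev_token)
```

Because the runner sees a single `forward` call, whatever graph capture it applies to that call would cover the predictor as well.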

I realized that it would perhaps be easier to make a separate model definition that can serve as an example of how to have the code_predictor inside the model definition's CUDA graph. I created a separate PR and am closing this one: #3221

The new one contains code showing how to serve Qwen3-TTS using Triton Inference Server. In my benchmark, that provides substantial throughput gains. Let's move the performance discussion there, if that's okay.

qwen3tts: make code predictor part of model cuda graph

8dcb8ac

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>

vklimkov-nvidia requested a review from hsliuustc0106 as a code owner April 23, 2026 14:08

hsliuustc0106 added the "high priority" label (high priority issue, needs to be done asap) Apr 24, 2026

hsliuustc0106 mentioned this pull request Apr 25, 2026

[HOW TO Optimize]: the delay of the first frame increases too quickly for Qwen3-TTS with Concurrently #3136

Open

1 task

ischencheng mentioned this pull request Apr 26, 2026

[RFC]: Cross-request batching for Qwen3-TTS Code2Wav stage to fix TTFB scaling under concurrency #3163

Open

1 task

vklimkov-nvidia closed this Apr 28, 2026

linyueqian mentioned this pull request Apr 29, 2026

[Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec #3221

Open

vklimkov-nvidia deleted the qwen3tts_refactor branch April 30, 2026 22:50

vklimkov-nvidia wants to merge 1 commit into vllm-project:main from vklimkov-nvidia:qwen3tts_refactor
