[Model] Qwen3-TTS: integrate code predictor into model CUDA graph #3071

Closed
vklimkov-nvidia wants to merge 1 commit into vllm-project:main from vklimkov-nvidia:qwen3tts_refactor

Conversation

@vklimkov-nvidia

Purpose

Per the Slack discussion about Qwen3-TTS, this PR keeps the code predictor as part of the model instead of introducing a multi-token-predictor concept in the model runner:

  • During decode-only batches, the talker + code predictor are captured as a single full CUDA graph.
  • During prefill or mixed batches, they are captured as piecewise CUDA graphs.
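
The split described in the two bullets above can be sketched as a small dispatch function. This is only an illustration of the rule as stated in this description, not the code in the PR: `GraphMode`, `BatchInfo`, and `select_graph_mode` are hypothetical names, and the actual selection presumably happens inside vLLM's CUDA-graph machinery.

```python
from dataclasses import dataclass
from enum import Enum


class GraphMode(Enum):
    """Hypothetical labels for the two capture modes named in the description."""
    FULL = "full"            # talker + code predictor replayed as one graph
    PIECEWISE = "piecewise"  # model captured as several smaller graphs


@dataclass
class BatchInfo:
    """Hypothetical batch summary; not a vLLM type."""
    num_decode_reqs: int
    num_prefill_reqs: int


def select_graph_mode(batch: BatchInfo) -> GraphMode:
    """Apply the rule from the description: decode-only -> full, otherwise piecewise."""
    is_decode_only = batch.num_prefill_reqs == 0 and batch.num_decode_reqs > 0
    return GraphMode.FULL if is_decode_only else GraphMode.PIECEWISE
```

Under this sketch a batch containing even one prefill request falls back to piecewise capture, which matches the "prefill or mixed" wording above.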

Benefits:

  • The GPU model runner stays clean of model-architectural details (no MTP-specific branches).
  • Faster end-to-end: fewer graph launches and a single replay on decode.
  • Localizes Qwen3-TTS quirks inside the model module, matching the design of other models.

Scope of changes:

  • vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py — own the code-predictor invocation and CUDA-graph capture flow.
  • vllm_omni/model_executor/models/common/qwen3_code_predictor.py — simplified/refactored to be graph-capturable as part of the model.
  • vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code_predictor_vllm.py — adapter adjustments.
  • vllm_omni/model_executor/models/qwen3_omni/* — small call-site updates for consistency.
  • vllm_omni/worker/gpu_model_runner.py — drop runner-side multi-token-predictor handling.
  • tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py — cleanup.

Test Plan

  • Unit: pytest tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py
  • E2E (offline): run the Qwen3-TTS talker demo notebook / example script and verify audio token generation is unchanged.
  • Serving: benchmarks/benchmark_qwen3_tts_serve.py and benchmarks/benchmark_qwen3_tts_talker.py before/after to confirm parity and speedup.
  • CUDA-graph modes exercised: decode-only (full graph) and prefill/mixed (piecewise).

Test Result

  • Unit tests: pass.
  • E2E outputs match the pre-refactor baseline (same audio tokens given the same input).
  • Decode throughput improves due to a single full CUDA graph replay; prefill/mixed performance is on par with the previous implementation.
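
The token-parity claim above (same audio tokens given the same input) could be checked with a helper along these lines. It is a minimal sketch: `first_mismatch` is a hypothetical name, and it assumes the audio tokens from each run have already been collected as a sequence of per-frame integer code lists, which is an assumption about the output layout rather than something stated in this PR.

```python
from typing import Optional, Sequence


def first_mismatch(
    baseline: Sequence[Sequence[int]],
    candidate: Sequence[Sequence[int]],
) -> Optional[int]:
    """Return the index of the first differing frame, or None if both runs match."""
    for index, (base_frame, cand_frame) in enumerate(zip(baseline, candidate)):
        if list(base_frame) != list(cand_frame):
            return index
    # All shared frames match; a length difference is still a mismatch.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None
```

Reporting the first differing frame rather than a plain boolean makes it easier to tell an early divergence from a difference in output length.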

(Please replace with concrete numbers from benchmark_qwen3_tts_* before merging.)


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR.
  • The test plan.
  • The test results (fill in concrete numbers).
  • (Optional) Documentation update — N/A (no user-facing API change).
  • (Optional) Release notes update — N/A.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
@linyueqian
Collaborator

@Sy0307 ptal

@amy-why-3459
Contributor

@gcanlin PTAL

@gcanlin
Collaborator

gcanlin commented Apr 24, 2026

Do we plan to land this before 0.20.0? I would prefer to land it afterwards: we recently merged the NPU graph for the code predictor and need more time to test this PR (the full graph).

@amy-why-3459
Contributor

Please resolve the conflict.

@linyueqian
Collaborator

I think after 0.20.0. @vklimkov-nvidia said he will add something to this PR as well.

@zhumingjue138
Contributor

Please add the corresponding unit tests.

@hsliuustc0106
Collaborator

can you add the profiling comparison here?

@hsliuustc0106
Collaborator

Can you try the latest benchmark here: https://github.com/vllm-project/vllm-omni/tree/main/benchmarks/tts

@Sy0307
Contributor

Sy0307 commented Apr 28, 2026

I tested this PR (rebased on latest origin/main) and found a performance gain of approximately 5% under low concurrency. Under high concurrency, however, performance is almost on par with the current code, and there may even be a small regression (within 5%). I'm not sure whether this is just run-to-run variance, but there does not seem to be much of a gain under high concurrency.

Could you please confirm whether these results are correct? Also, could you test whether there is a performance regression under high concurrency? If needed, I can share more details about the testing, but in general, I followed the TTS benchmark for my tests.

cc @vklimkov-nvidia @hsliuustc0106 @linyueqian

@hsliuustc0106
Collaborator

Can you compare the profiling with and without this PR?

@Sy0307
Contributor

Sy0307 commented Apr 28, 2026

I ran an aligned A/B benchmark for this PR against latest origin/main (7a8b428) on Qwen3-TTS, following the TTS benchmark path under benchmarks/tts and using the directly-controllable vllm-omni bench serve --omni flow.

Test setup:

  • Model: Qwen3-TTS-12Hz-1.7B-Base
  • Dataset: Seed-TTS / en, 50 prompts, 2 warmups
  • GPU: single H20, CUDA_VISIBLE_DEVICES=2
  • Concurrency: 1, 4, 10
  • Same deploy config from latest main
  • The PR branch was aligned to latest main relay schema; without this alignment, the PR produced invalid 0s audio results.

Results:

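| concurrency | origin req/s | PR req/s | req/s delta | origin audio/s | PR audio/s | audio/s delta | origin mean audio | PR mean audio |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.871 | 0.921 | +5.7% | 3.542 | 3.707 | +4.7% | 4.07s | 4.03s |
| 4 | 1.335 | 1.355 | +1.5% | 5.492 | 5.521 | +0.5% | 4.11s | 4.07s |
| 10 | 1.904 | 1.850 | -2.8% | 7.670 | 7.355 | -4.1% | 4.03s | 3.98s |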
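
For reference, the delta columns are consistent with a plain relative change against origin/main, (PR - origin) / origin. A short sketch that recomputes them from the throughput figures in the results above; `pct_delta` and `ROWS` are illustrative names, not part of the benchmark tooling:

```python
# (concurrency, origin req/s, PR req/s, origin audio/s, PR audio/s),
# copied from the results reported in this comment.
ROWS = [
    (1, 0.871, 0.921, 3.542, 3.707),
    (4, 1.335, 1.355, 5.492, 5.521),
    (10, 1.904, 1.850, 7.670, 7.355),
]


def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


for concurrency, origin_req, pr_req, origin_audio, pr_audio in ROWS:
    print(
        f"c={concurrency}: "
        f"req/s {pct_delta(origin_req, pr_req):+.1f}%, "
        f"audio/s {pct_delta(origin_audio, pr_audio):+.1f}%"
    )
# c=1: req/s +5.7%, audio/s +4.7%
# c=4: req/s +1.5%, audio/s +0.5%
# c=10: req/s -2.8%, audio/s -4.1%
```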

Conclusion:

After aligning generation semantics and output duration, this PR shows only a small improvement at low concurrency and no positive gain at higher concurrency. At c=10, both request throughput and audio throughput are slightly lower than latest main.

So the earlier large audio/s improvement does not look like a real performance gain from code predictor graphing. It was mainly caused by non-comparable generation behavior / output length differences. The apples-to-apples result is much closer to neutral.

Note that this is a short benchmark run, so the numbers may have some run-to-run variance. The test results should be interpreted cautiously unless confirmed by repeated runs.
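
One rough way to act on this caveat, once repeated runs exist, is to compare the gap between mean throughputs with the spread across runs. The sketch below is a crude heuristic, not a statistical test; `summarize`, `delta_exceeds_noise`, and the threshold `k` are hypothetical, and no repeated-run data has been collected for this PR.

```python
import statistics
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated benchmark runs."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(samples), statistics.stdev(samples)


def delta_exceeds_noise(
    baseline: Sequence[float],
    candidate: Sequence[float],
    k: float = 2.0,  # assumed threshold: gap must exceed k standard deviations
) -> bool:
    """Crude check: is the gap between means larger than k times the larger stdev?"""
    base_mean, base_std = summarize(baseline)
    cand_mean, cand_std = summarize(candidate)
    return abs(cand_mean - base_mean) > k * max(base_std, cand_std)
```

Each argument would be the req/s (or audio/s) values from several runs of the same configuration, for example the c=10 case on origin/main and on the PR branch.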

@vklimkov-nvidia
Author

Thanks @Sy0307 for having a look.
I had meant to top off this submission with the serving part; that is what actually provides the speedups. The PR itself matters from a design perspective: perhaps there is no need for an explicit multi-token predictor that the runner executes. It can be part of the model definition and fit the regular GPU runner.

I realized that it would perhaps be easier to make a separate model definition that can serve as an example of how to keep the code_predictor inside the model definition's CUDA graph. I created a separate PR and am closing this one: #3221

The new PR contains code showing how to serve Qwen3-TTS using Triton Inference Server. In my benchmark, that provides substantial throughput gains. Let's move the performance discussion there, if that's okay.

Labels

high priority: high priority issue, needs to be done asap
