[Model] Qwen3-TTS: integrate code predictor into model CUDA graph#3071
vklimkov-nvidia wants to merge 1 commit into
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
|
@Sy0307 ptal |
|
@gcanlin PTAL |
|
Do we plan to land it before 0.20.0? I'd prefer after, since we recently merged the NPU graph for the code predictor and need more time to test this PR (the full graph). |
|
Please resolve the conflict. |
|
I think after 0.20.0. @vklimkov-nvidia said he will add something to this PR as well.
|
Please add the corresponding unit tests. |
|
can you add the profiling comparison here? |
|
Can you try the latest benchmark here: https://github.com/vllm-project/vllm-omni/tree/main/benchmarks/tts
|
I tested this PR (rebased on the latest origin/main) and found approximately a 5% performance gain under low concurrency. Under high concurrency, however, performance is almost on par with the current code, and there may even be some regression (within 5%). I'm not sure whether this is just run-to-run variance, but there doesn't seem to be much of a gain under high concurrency. Could you please confirm whether these results are correct? Also, could you test whether there is a performance regression under high concurrency? If needed, I can share more details about the testing, but in general I followed the TTS benchmark for my tests. |
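One way to judge whether a gap of this size is real or just fluctuation is to repeat each configuration several times and compare the difference in means against the run-to-run spread. The following is a minimal sketch, assuming throughput samples are collected by hand from repeated benchmark runs; the function name `compare_runs` and the two-sigma threshold are illustrative choices, not part of the TTS benchmark.

```python
from statistics import mean, stdev


def compare_runs(baseline, candidate, sigmas=2.0):
    """Return (relative_change, is_significant) for two lists of throughput samples.

    Hypothetical helper: the inputs are per-run throughput numbers
    (for example requests/s) measured with identical settings.
    """
    if len(baseline) < 2 or len(candidate) < 2:
        raise ValueError("need at least two runs per configuration")

    base_mean = mean(baseline)
    cand_mean = mean(candidate)
    rel_change = (cand_mean - base_mean) / base_mean

    # Use the larger of the two sample standard deviations as the noise
    # estimate; this is a deliberately conservative choice.
    noise = max(stdev(baseline), stdev(candidate))

    # Treat the gap as significant only if it exceeds `sigmas` times the noise.
    return rel_change, abs(cand_mean - base_mean) > sigmas * noise
```

With this check, a 1% difference between runs that themselves vary by a couple of percent would be reported as not significant, which matches the caution expressed above.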
can you compare the profiling w/o this PR? |
|
I ran an aligned A/B benchmark for this PR against latest Test setup:
Results:
Conclusion: After aligning generation semantics and output duration, this PR shows only a small improvement at low concurrency and no positive gain at higher concurrency. At So the earlier large audio/s improvement does not look like a real performance gain from code predictor graphing. It was mainly caused by non-comparable generation behavior / output length differences. The apples-to-apples result is much closer to neutral. Note that this is a short benchmark run, so the numbers may have some run-to-run variance. Test results should be interpreted cautiously unless confirmed by repeated runs. |
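The point about non-comparable output lengths can be illustrated with a small sketch. It assumes each benchmark run is summarized by wall-clock time, total generated audio, and request count; the `RunStats` class, `is_comparable` helper, and 5% tolerance are hypothetical names and values for illustration, and the numbers in the usage note are made-up placeholders, not measurements from this PR.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Summary of one benchmark run (hypothetical structure)."""
    wall_seconds: float    # total wall-clock time of the run
    audio_seconds: float   # total duration of generated audio
    num_requests: int      # number of completed requests

    @property
    def audio_per_second(self) -> float:
        # Seconds of audio produced per wall-clock second.
        return self.audio_seconds / self.wall_seconds

    @property
    def requests_per_second(self) -> float:
        return self.num_requests / self.wall_seconds

    @property
    def mean_audio_seconds(self) -> float:
        # Average output duration per request.
        return self.audio_seconds / self.num_requests


def is_comparable(a: RunStats, b: RunStats, tolerance: float = 0.05) -> bool:
    """True when mean output duration per request differs by at most `tolerance`."""
    ref = max(a.mean_audio_seconds, b.mean_audio_seconds)
    return abs(a.mean_audio_seconds - b.mean_audio_seconds) <= tolerance * ref
```

For example, with placeholder values `RunStats(10.0, 40.0, 10)` and `RunStats(10.0, 60.0, 10)`, audio/s differs by 50% while requests/s is identical, and `is_comparable` returns `False`: the apparent gain comes entirely from longer outputs, so the two runs should not be compared on audio/s.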
|
Thanks @Sy0307 for having a look. I realized that it would perhaps be easier to make a separate model definition that can serve as an example of how to have code_predictor inside the model definition's CUDA graph. I created a separate PR and am closing this one: #3221. The new one contains code showing how to serve Qwen3-TTS using Triton Inference Server. In my benchmark that provides substantial throughput gains. Let's move the performance discussion there, if that's okay. |
Purpose
Per the Slack discussion about Qwen3-TTS, this PR keeps the code predictor as part of the model instead of introducing a multi-token-predictor concept in the model runner:
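To make the idea concrete, here is a toy sketch of a model whose forward pass includes the code predictor, so that a single CUDA graph covers both. It is not the code in this PR and does not use vLLM's capture machinery: `ToyTalker`, `ToyCodePredictor`, and `make_runner` are hypothetical names, the layer sizes are arbitrary, and the capture follows the generic warm-up, capture, and replay pattern of `torch.cuda.CUDAGraph`. On machines without CUDA it falls back to eager execution.

```python
import torch
from torch import nn


class ToyCodePredictor(nn.Module):
    """Predicts a fixed number of code groups from one hidden state."""

    def __init__(self, hidden=32, vocab=16, num_groups=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.heads = nn.ModuleList([nn.Linear(hidden, vocab) for _ in range(num_groups)])

    def forward(self, hidden):
        codes, state = [], hidden
        # Fixed trip count and static shapes: no host-side branching on
        # tensor values, which is what keeps the loop graph-capturable.
        for head in self.heads:
            code = head(state).argmax(dim=-1)
            codes.append(code)
            state = state + self.embed(code)
        return torch.stack(codes, dim=-1)


class ToyTalker(nn.Module):
    """Backbone plus code predictor in a single forward pass."""

    def __init__(self, hidden=32, vocab=16, num_groups=4):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.code_predictor = ToyCodePredictor(hidden, vocab, num_groups)

    def forward(self, x):
        return self.code_predictor(torch.tanh(self.backbone(x)))


def make_runner(model, example):
    """Return a callable; uses one CUDA graph when the input lives on a GPU."""
    if not example.is_cuda:
        def run_eager(x):
            with torch.no_grad():
                return model(x)
        return run_eager

    static_in = example.clone()
    # Warm up on a side stream before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_out = model(static_in)  # backbone and predictor captured together

    def run_graph(x):
        static_in.copy_(x)   # write new data into the captured input buffer
        graph.replay()
        return static_out.clone()
    return run_graph
```

The design point is that the runner only ever sees one callable per step; the multi-step code prediction stays an internal detail of the model's forward pass.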
Benefits:
Scope of changes:
vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py: own the code-predictor invocation and CUDA-graph capture flow.
vllm_omni/model_executor/models/common/qwen3_code_predictor.py: simplified/refactored to be graph-capturable as part of the model.
vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code_predictor_vllm.py: adapter adjustments.
vllm_omni/model_executor/models/qwen3_omni/*: small call-site updates for consistency.
vllm_omni/worker/gpu_model_runner.py: drop runner-side multi-token-predictor handling.
tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py: cleanup.
Test Plan
pytest tests/model_executor/models/qwen3_tts/test_code_predictor_dtype.py
benchmarks/benchmark_qwen3_tts_serve.py and benchmarks/benchmark_qwen3_tts_talker.py before/after to confirm parity and speedup.
Test Result
(Please replace with concrete numbers from benchmark_qwen3_tts_* before merging.)
Essential Elements of an Effective PR Description Checklist
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md