Talker CUDA graph for Qwen-3 TTS #1925
Force-pushed: a892ab4 to d2e5d79
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 229bee61af
    if input_ids.shape[0] != 1:
        raise ValueError(f"TalkerMTPCudaGraphWrapper only supports bs=1, got input_ids.shape={input_ids.shape}")
Fall back when CUDA-graph talker receives batched decode
This hard-fails any decode batch larger than 1, but load_model now enables the TTS CUDA-graph path whenever full cudagraph mode is on, and _talker_mtp_forward invokes talker_mtp on the whole decode batch. With two concurrent decode requests, this raises ValueError and aborts the step instead of degrading to eager execution, which is a production-facing regression from the previous batched eager path.
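One way to read this suggestion, as a minimal sketch rather than the PR's actual code: dispatch on the leading dimension of `input_ids` and call the eager path for any batch size that was not captured. `GraphOrEagerDispatcher`, `graph_fn`, `eager_fn`, and `captured_batch_sizes` are hypothetical names introduced for illustration; they are not part of vllm-omni, and the real wrapper may need to handle more than the batch dimension.

```python
from typing import Any, Callable, Iterable


class GraphOrEagerDispatcher:
    """Route a decode call to a graph-replay path or an eager path by batch size.

    Hypothetical helper for illustration only; not part of vllm-omni.
    """

    def __init__(
        self,
        graph_fn: Callable[..., Any],
        eager_fn: Callable[..., Any],
        captured_batch_sizes: Iterable[int] = (1,),
    ) -> None:
        self.graph_fn = graph_fn      # assumed: replays a captured CUDA graph
        self.eager_fn = eager_fn      # assumed: the original batched eager forward
        self.captured_batch_sizes = frozenset(captured_batch_sizes)
        self.eager_fallbacks = 0      # counter so fallbacks are observable

    def __call__(self, input_ids: Any, *args: Any, **kwargs: Any) -> Any:
        batch_size = input_ids.shape[0]
        if batch_size in self.captured_batch_sizes:
            return self.graph_fn(input_ids, *args, **kwargs)
        # Degrade to eager execution instead of raising ValueError.
        self.eager_fallbacks += 1
        return self.eager_fn(input_ids, *args, **kwargs)
```

Counting fallbacks rather than raising keeps concurrent decode requests working while still making it visible how often the graph path is bypassed.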
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
Force-pushed: ab5bf08 to 1766278
Why are the audio results the same?
------------------Audio Duration------------------
@hsliuustc0106 Because I'm in the middle of updating the benchmarks after the rebase and the batching support update :) I'll mark this as a draft for now, but the code is here and the numbers will arrive shortly.
Closing this as incompatible with #1913
Purpose
This PR enables CUDA graph for Qwen3 TTS with batching support.
Test Plan
[Done] Performance benchmark
[Done] Numerical correctness tests
[Done] Batching support
[In progress] Accuracy issue with increased concurrencies
[TODO] Benchmark update for batching
Test Result
Performance (L40S)
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --port 8000 --trust-remote-code
vllm-omni bench serve --backend openai-audio-speech --endpoint /v1/audio/speech --model Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --dataset-name hf --dataset-path philschmid/mt-bench --skip-chat-template --num-prompts 10 --max_concurrency $MAX_CONCURRENCY --percentile-metrics e2el,audio_rtf,audio_duration --extra-body '{"voice": "Vivian", "instructions": "Speak with great enthusiasm", "max_new_tokens": 1024}'
Main
This PR