[Feat] Add benchmarks for Qwen3-TTS Base/VoiceDesign Model#2411
linyueqian merged 3 commits into vllm-project:main from
Conversation
Pull request overview
Adds task-type awareness to the Qwen3-TTS benchmarking scripts so payloads match the selected model variant (CustomVoice / Base / VoiceDesign), fixing incorrect requests for non-CustomVoice models.
Changes:
- Added a `--task-type` CLI flag to both serving and HF benchmark clients and propagated it through `run_benchmark.sh`.
- Implemented task-type-specific request construction for `/v1/audio/speech` (serving) and task-type-specific generation method selection (HF).
- Updated benchmark README/run script examples for benchmarking Base (voice cloning) models.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| benchmarks/qwen3-tts/vllm_omni/bench_tts_serve.py | Adds --task-type and builds task-specific serving payloads (Base/VoiceDesign/CustomVoice). |
| benchmarks/qwen3-tts/transformers/bench_tts_hf.py | Adds --task-type and routes to the correct HF generation method per task type. |
| benchmarks/qwen3-tts/run_benchmark.sh | Wires TASK_TYPE through to both benchmark entrypoints and documents it. |
| benchmarks/qwen3-tts/README.md | Adds an example command for benchmarking the Base model task type. |
```python
parser.add_argument(  # noqa: E501
    "--max-concurrency", type=int, nargs="+", default=[1, 4, 10], help="Concurrency levels to test"
)
parser.add_argument("--num-warmups", type=int, default=3)
parser.add_argument("--task-type", type=str, default="CustomVoice", choices=["CustomVoice", "VoiceDesign", "Base"])
parser.add_argument("--voice", type=str, default="vivian")
```
The new --task-type option is not captured anywhere in the saved JSON results (BenchmarkResult / per-request entries) or the output filename, so runs for different task types can’t be distinguished when comparing results. Consider recording task_type in the result payload (and optionally include it in the result filename).
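One way to do this, sketched under the assumption of a dataclass-style `BenchmarkResult` and an illustrative filename scheme (the actual field and file names in the script may differ):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkResult:
    # ...existing latency/throughput fields would live here...
    task_type: str = "CustomVoice"  # new field: record the benchmarked task type


result = BenchmarkResult(task_type="Base")  # e.g. populated from args.task_type
# Embedding the task type in the filename keeps runs distinguishable on disk.
out_path = f"bench_tts_{result.task_type}.json"
with open(out_path, "w") as f:
    json.dump(asdict(result), f, indent=2)
```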
| parser.add_argument("--num-warmups", type=int, default=3) | ||
| parser.add_argument("--gpu-device", type=int, default=0) | ||
| parser.add_argument("--voice", type=str, default="Vivian") | ||
| parser.add_argument("--language", type=str, default="English") | ||
| parser.add_argument("--task-type", type=str, default="CustomVoice", choices=["CustomVoice", "VoiceDesign", "Base"]) | ||
| parser.add_argument( |
The new --task-type option is not included in the saved benchmark JSON (BenchmarkResult) or filename, which makes it hard to tell which task type a result corresponds to when collecting multiple runs. Consider adding task_type to the serialized result (and optionally the filename).
| REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav" | ||
| REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." | ||
| INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice." | ||
|
|
||
|
|
REF_AUDIO/REF_TEXT/INSTRUCT are duplicated here and in the HF benchmark script; if either sample input needs to change later, the two benchmarks can drift. Consider centralizing these shared constants (or allowing them to be provided via CLI) to keep the benchmarking paths consistent.
| REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav" | |
| REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." | |
| INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice." | |
| @dataclass(frozen=True) | |
| class ReferenceSample: | |
| """Shared reference sample used for TTS benchmarking.""" | |
| audio_url: str | |
| text: str | |
| instruct: str | |
| # Centralized reference sample; other benchmark scripts should import this | |
| # instead of duplicating the literals. | |
| REFERENCE_SAMPLE = ReferenceSample( | |
| audio_url="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav", | |
| text=( | |
| "Okay. Yeah. I resent you. I love you. I respect you. " | |
| "But you know what? You blew it! And thanks to you." | |
| ), | |
| instruct=( | |
| "Speak in an incredulous tone, but with a hint of panic " | |
| "beginning to creep into your voice." | |
| ), | |
| ) | |
| # Backwards-compatible aliases used throughout this module. | |
| REF_AUDIO = REFERENCE_SAMPLE.audio_url | |
| REF_TEXT = REFERENCE_SAMPLE.text | |
| INSTRUCT = REFERENCE_SAMPLE.instruct |
| REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav" | ||
| REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." | ||
| INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice." | ||
|
|
REF_AUDIO/REF_TEXT/INSTRUCT are duplicated here and in the serving benchmark script; this duplication can lead to drift between the HF and serving benchmark paths. Consider centralizing these shared constants (or allowing them to be provided via CLI) to keep both scripts aligned.
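For the CLI alternative, a minimal sketch with hypothetical flag names (these flags are not part of the PR); the defaults fall back to the shared sample literals so existing invocations keep working:

```python
import argparse

# Hypothetical flags, shown only to illustrate the CLI alternative; the
# string defaults below are abbreviated copies of the shared sample literals.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--ref-audio",
    type=str,
    default="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
)
parser.add_argument("--ref-text", type=str, default="Okay. Yeah. I resent you. ...")
parser.add_argument("--instruct", type=str, default="Speak in an incredulous tone, ...")
args = parser.parse_args()
```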
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
fix dco pls
fixed
…ect#2411) Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
Purpose
This PR fixes #2348
Fix the Qwen3-TTS benchmark scripts so `Base` and `VoiceDesign` models can be benchmarked correctly.

Previously, `bench_tts_serve.py` always sent requests as if the task type were `CustomVoice`, which caused benchmarking errors for `Base` and `VoiceDesign` models. This PR adds a `--task-type` argument to the benchmark scripts and propagates it through the benchmarking pipeline so the request payload matches the actual model type.

This PR includes (see the payload sketch after this list):
- Add `--task-type` to `bench_tts_serve.py`
- Add `--task-type` to `bench_tts_hf.py`
- Wire `TASK_TYPE` through `run_benchmark.sh`
- Supported task types: `CustomVoice`, `Base`, and `VoiceDesign`; the default remains `CustomVoice`, so existing benchmarks for `CustomVoice` models are unchanged
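For illustration, a minimal sketch of task-type-specific payload construction (the field names are assumptions made for this sketch, not the exact `/v1/audio/speech` schema used by the scripts):

```python
REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"
REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice."


def build_payload(task_type: str, text: str) -> dict:
    """Build a task-type-specific request body (field names are illustrative)."""
    payload = {"model": "qwen3-tts", "input": text}
    if task_type == "CustomVoice":
        payload["voice"] = "vivian"  # built-in speaker preset
    elif task_type == "Base":
        # Voice cloning: send a reference clip plus its transcript.
        payload["ref_audio"] = REF_AUDIO
        payload["ref_text"] = REF_TEXT
    elif task_type == "VoiceDesign":
        # Voice design: describe the target voice in natural language.
        payload["instruct"] = INSTRUCT
    else:
        raise ValueError(f"unknown task type: {task_type}")
    return payload
```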
Test Plan

Validated both the vLLM-Omni serve benchmarking and HF benchmarking paths with the `Base` task type.

Commands used:
Test Result
The above benchmarks run as expected.