
[Feat] Add benchmarks for Qwen3-TTS Base/VoiceDesign Model#2411

Merged
linyueqian merged 3 commits into vllm-project:main from JasonJ2021:dev
Apr 2, 2026
Conversation

@JasonJ2021
Contributor

@JasonJ2021 JasonJ2021 commented Apr 1, 2026

Purpose

This PR fixes #2348 by updating the Qwen3-TTS benchmark scripts so the Base and VoiceDesign models can be benchmarked correctly.

Previously, bench_tts_serve.py always sent requests as if the task type were CustomVoice, which caused benchmarking errors for Base and VoiceDesign models. This PR adds a --task-type argument to the benchmark scripts and propagates it through the benchmarking pipeline so the request payload matches the actual model type.

This PR includes:

  • adding --task-type to bench_tts_serve.py
  • adding --task-type to bench_tts_hf.py
  • wiring TASK_TYPE through run_benchmark.sh
  • constructing task-specific payloads for:
    • CustomVoice
    • Base
    • VoiceDesign
  • updating the benchmark README with an example for non-CustomVoice models
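The task-specific payload construction described above can be sketched roughly as follows. This is a minimal illustration only; the field names (`ref_audio`, `ref_text`, `instruct`) are assumptions based on the constants used in the benchmark scripts, not necessarily the exact schema bench_tts_serve.py sends to /v1/audio/speech.

```python
# Hypothetical sketch of task-specific payload construction; the real
# bench_tts_serve.py field names may differ.
def build_payload(task_type: str, text: str, voice: str,
                  ref_audio: str, ref_text: str, instruct: str) -> dict:
    if task_type == "CustomVoice":
        # Preset-voice synthesis: only the text and a named voice.
        return {"input": text, "voice": voice, "task_type": task_type}
    if task_type == "Base":
        # Voice cloning: needs a reference audio clip and its transcript.
        return {"input": text, "ref_audio": ref_audio,
                "ref_text": ref_text, "task_type": task_type}
    if task_type == "VoiceDesign":
        # Voice design: a natural-language instruction describes the voice.
        return {"input": text, "instruct": instruct, "task_type": task_type}
    raise ValueError(f"Unknown task type: {task_type}")
```

The key point of the fix is that the payload shape branches on the task type instead of always assuming CustomVoice.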

Test Plan

Validated the vLLM-Omni serving and HF benchmarking paths with the Base, VoiceDesign, and default (CustomVoice) task types.

Commands used:

MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base TASK_TYPE=Base bash benchmarks/qwen3-tts/run_benchmark.sh --async-only
MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign TASK_TYPE=VoiceDesign bash benchmarks/qwen3-tts/run_benchmark.sh --async-only
bash benchmarks/qwen3-tts/run_benchmark.sh --async-only

Test Result

The above benchmarks run as expected.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Copilot AI review requested due to automatic review settings April 1, 2026 08:47
Contributor

Copilot AI left a comment


Pull request overview

Adds task-type awareness to the Qwen3-TTS benchmarking scripts so payloads match the selected model variant (CustomVoice / Base / VoiceDesign), fixing incorrect requests for non-CustomVoice models.

Changes:

  • Added a --task-type CLI flag to both serving and HF benchmark clients and propagated it through run_benchmark.sh.
  • Implemented task-type-specific request construction for /v1/audio/speech (serving) and task-type-specific generation method selection (HF).
  • Updated benchmark README/run script examples for benchmarking Base (voice cloning) models.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
benchmarks/qwen3-tts/vllm_omni/bench_tts_serve.py Adds --task-type and builds task-specific serving payloads (Base/VoiceDesign/CustomVoice).
benchmarks/qwen3-tts/transformers/bench_tts_hf.py Adds --task-type and routes to the correct HF generation method per task type.
benchmarks/qwen3-tts/run_benchmark.sh Wires TASK_TYPE through to both benchmark entrypoints and documents it.
benchmarks/qwen3-tts/README.md Adds an example command for benchmarking the Base model task type.
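The HF-side routing noted in the table could look something like the sketch below; the generation method names (`generate_custom_voice` and friends) are hypothetical stand-ins, since the actual Qwen3-TTS HF API is not shown in this review.

```python
def run_generation(model, task_type: str, text: str, **kwargs):
    # Hypothetical dispatch table; the actual Qwen3-TTS HF model may expose
    # different generation entry points per task type.
    dispatch = {
        "CustomVoice": lambda: model.generate_custom_voice(
            text, voice=kwargs["voice"]),
        "Base": lambda: model.generate_voice_clone(
            text, ref_audio=kwargs["ref_audio"], ref_text=kwargs["ref_text"]),
        "VoiceDesign": lambda: model.generate_voice_design(
            text, instruct=kwargs["instruct"]),
    }
    if task_type not in dispatch:
        raise ValueError(f"Unsupported task type: {task_type}")
    # Each branch only touches the kwargs it needs, so callers pass only
    # the arguments relevant to their task type.
    return dispatch[task_type]()
```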


Comment on lines 355 to 360
parser.add_argument(  # noqa: E501
    "--max-concurrency", type=int, nargs="+", default=[1, 4, 10], help="Concurrency levels to test"
)
parser.add_argument("--num-warmups", type=int, default=3)
parser.add_argument("--task-type", type=str, default="CustomVoice", choices=["CustomVoice", "VoiceDesign", "Base"])
parser.add_argument("--voice", type=str, default="vivian")

Copilot AI Apr 1, 2026


The new --task-type option is not captured anywhere in the saved JSON results (BenchmarkResult / per-request entries) or the output filename, so runs for different task types can’t be distinguished when comparing results. Consider recording task_type in the result payload (and optionally include it in the result filename).
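A minimal sketch of this suggestion, assuming a dataclass-style `BenchmarkResult` (the PR's real class has more fields, such as latency percentiles and throughput):

```python
from dataclasses import dataclass, asdict
import json
import os

@dataclass
class BenchmarkResult:
    # Hypothetical result container illustrating the reviewer's point:
    # record the benchmarked task type alongside the metrics.
    model: str
    task_type: str
    mean_latency_s: float = 0.0

def save_result(result: BenchmarkResult, out_dir: str = ".") -> str:
    # Include the task type in the filename so Base/VoiceDesign/CustomVoice
    # runs do not overwrite each other.
    name = f"bench_{result.model.replace('/', '_')}_{result.task_type}.json"
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        json.dump(asdict(result), f, indent=2)
    return path
```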

Comment on lines 286 to 291
parser.add_argument("--num-warmups", type=int, default=3)
parser.add_argument("--gpu-device", type=int, default=0)
parser.add_argument("--voice", type=str, default="Vivian")
parser.add_argument("--language", type=str, default="English")
parser.add_argument("--task-type", type=str, default="CustomVoice", choices=["CustomVoice", "VoiceDesign", "Base"])
parser.add_argument(

Copilot AI Apr 1, 2026


The new --task-type option is not included in the saved benchmark JSON (BenchmarkResult) or filename, which makes it hard to tell which task type a result corresponds to when collecting multiple runs. Consider adding task_type to the serialized result (and optionally the filename).

Comment on lines +40 to 44
REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"
REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice."



Copilot AI Apr 1, 2026


REF_AUDIO/REF_TEXT/INSTRUCT are duplicated here and in the HF benchmark script; if either sample input needs to change later, the two benchmarks can drift. Consider centralizing these shared constants (or allowing them to be provided via CLI) to keep the benchmarking paths consistent.

Suggested change
REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"
REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice."
@dataclass(frozen=True)
class ReferenceSample:
    """Shared reference sample used for TTS benchmarking."""

    audio_url: str
    text: str
    instruct: str


# Centralized reference sample; other benchmark scripts should import this
# instead of duplicating the literals.
REFERENCE_SAMPLE = ReferenceSample(
    audio_url="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
    text=(
        "Okay. Yeah. I resent you. I love you. I respect you. "
        "But you know what? You blew it! And thanks to you."
    ),
    instruct=(
        "Speak in an incredulous tone, but with a hint of panic "
        "beginning to creep into your voice."
    ),
)

# Backwards-compatible aliases used throughout this module.
REF_AUDIO = REFERENCE_SAMPLE.audio_url
REF_TEXT = REFERENCE_SAMPLE.text
INSTRUCT = REFERENCE_SAMPLE.instruct

Comment on lines +41 to +44
REF_AUDIO = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"
REF_TEXT = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
INSTRUCT = "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice."


Copilot AI Apr 1, 2026


REF_AUDIO/REF_TEXT/INSTRUCT are duplicated here and in the serving benchmark script; this duplication can lead to drift between the HF and serving benchmark paths. Consider centralizing these shared constants (or allowing them to be provided via CLI) to keep both scripts aligned.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.



Collaborator

@linyueqian linyueqian left a comment


LGTM

@linyueqian
Collaborator

fix dco pls

JasonJ2021 and others added 3 commits April 2, 2026 10:59
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
@JasonJ2021
Contributor Author

fix dco pls

fixed

@linyueqian linyueqian added the ready label (label to trigger buildkite CI) Apr 2, 2026
@linyueqian linyueqian enabled auto-merge (squash) April 2, 2026 03:23
@linyueqian linyueqian merged commit d3daafb into vllm-project:main Apr 2, 2026
7 of 8 checks passed
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026

Labels

ready: label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: The benchmark of Qwen3-TTS-12Hz-0.6B-Base is expected.

3 participants