
[Test] Add Qwen3-TTS nightly performance benchmark #1700

Merged
hsliuustc0106 merged 6 commits into vllm-project:main from linyueqian:worktree-tts-perf-benchmark on Mar 11, 2026

Conversation

@linyueqian
Collaborator

Summary

  • Add Qwen3-TTS performance benchmark to the nightly CI pipeline
  • Make backend/endpoint configurable in run_benchmark.py (was hardcoded to openai-chat-omni)
  • Add TTS stage config and test cases with concurrency [1, 4] matching Qwen3-Omni (sketched below)
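For orientation, a minimal sketch (written as a Python dict) of what such a test-case entry could look like once backend/endpoint sit under benchmark_params, as the review below requests. Only the model name, stage config name, backend value, and concurrency levels are quoted in this thread; the remaining key names and the endpoint path are assumptions.

```python
# Illustrative sketch only -- not the actual tests/perf/tests/test.json entry.
# Keys other than "server_params", "benchmark_params", "model",
# "stage_config_name", "backend", and the [1, 4] concurrency levels
# are assumptions.
tts_test_case = {
    "test_name": "qwen3_tts",  # assumed name
    "server_params": {
        "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        "stage_config_name": "qwen3_tts.yaml",
    },
    "benchmark_params": {
        "backend": "openai-audio-speech",
        "endpoint": "/v1/audio/speech",  # assumed endpoint path
        "max_concurrency": [1, 4],       # assumed key for the concurrency sweep
    },
}
```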

Test Plan

  • Validated locally on H200: vllm bench serve --omni --backend openai-audio-speech with 3 prompts, concurrency 1
  • All 3 requests succeeded with RTF ~0.94 (real-time factor: generation time / audio duration, so just under real time); audio generation worked end-to-end

Collaborator

@hsliuustc0106 left a comment

Review

Rating: 8.5/10 | Verdict: ✅ Approved

Summary

Good addition of Qwen3-TTS performance benchmark to nightly CI pipeline. Follows established patterns from Qwen3-Omni benchmark.

Highlights

  • ✅ Makes backend/endpoint configurable (was hardcoded)
  • ✅ Adds TTS stage config with concurrency [1, 4]
  • ✅ Validated locally on H200
  • ✅ RTF ~0.94 demonstrates good performance

Minor Suggestions

  • Consider adding more concurrency levels (e.g., [1, 4, 8]) for scalability testing
  • Document the expected RTF range for regression detection (a minimal check is sketched below)
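A minimal sketch of what such a documented-RTF regression gate could look like in CI, assuming a hypothetical results JSON with a mean_rtf field; the file layout, field name, and threshold are illustrative, not the benchmark's actual output schema.

```python
import json
import sys

# Hypothetical RTF budget: RTF = generation time / audio duration, so
# values below 1.0 are faster than real time. The 0.6 threshold and the
# "mean_rtf" field name are illustrative assumptions.
MAX_MEAN_RTF = 0.6

def check_rtf(results_path: str) -> None:
    with open(results_path) as f:
        mean_rtf = json.load(f)["mean_rtf"]
    if mean_rtf > MAX_MEAN_RTF:
        sys.exit(f"RTF regression: mean RTF {mean_rtf:.2f} exceeds budget {MAX_MEAN_RTF}")
    print(f"RTF OK: {mean_rtf:.2f} <= {MAX_MEAN_RTF}")

if __name__ == "__main__":
    check_rtf(sys.argv[1])
```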

Recommendation

Ready to merge. Solid CI enhancement.


Reviewed by OpenClaw with vllm-omni-skills 🦐

# Stage 1: Code2Wav (codec codes -> audio waveform)
#
# The following config has been verified on 1x H100-80G GPU.
async_chunk: true
Collaborator

Do we use async_chunk by default? Do we need to change the yaml name?

@hsliuustc0106
Collaborator

@yenuo26 @congw729 PTAL

Comment thread tests/perf/tests/test.json Outdated
"server_params": {
"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
"stage_config_name": "qwen3_tts.yaml",
"backend": "openai-audio-speech",
Collaborator

Why are backend and endpoint added in server_params? I think these belong in benchmark_params.

Comment thread tests/perf/scripts/run_benchmark.py Outdated
mapping[test_name] = {
"test_name": test_name,
"benchmark_params": [],
"backend": config["server_params"].get("backend", "openai-chat-omni"),
Collaborator

If you move backend and endpoint to benchmark_params, I think these need to be removed here at the same time.

Comment thread tests/perf/scripts/run_benchmark.py Outdated
flow,
dataset_name: str,
num_prompt,
backend: str = "openai-chat-omni",
Collaborator

If you move backend and endpoint to benchmark_params, I think these can be included in the args.

@congw729
Collaborator

congw729 commented Mar 6, 2026

Can you provide example benchmark results from your testing, and how long the run takes?

@linyueqian
Collaborator Author

Thanks for the reviews!

For the async_chunk question, async_chunk: true is the only supported mode for Qwen3-TTS. All TTS configs use it, unlike Qwen3-Omni which has both async and non-async variants. So qwen3_tts.yaml is the correct name, no rename needed.

For the backend/endpoint placement, good catch. Moved them into benchmark_params and removed the special-casing in run_benchmark.py. They now flow through the generic params-to-args loop like everything else.
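A minimal sketch of the kind of generic params-to-args loop described here; the dict layout and flag spelling are illustrative assumptions, not the actual run_benchmark.py code.

```python
# Illustrative only: expand a benchmark_params mapping into CLI flags so
# backend/endpoint ride through the same path as every other parameter,
# with no special-casing. The flag spelling is an assumption.
def params_to_args(benchmark_params: dict) -> list[str]:
    args = []
    for key, value in benchmark_params.items():
        args.append(f"--{key.replace('_', '-')}")
        args.append(str(value))
    return args

print(params_to_args({
    "backend": "openai-audio-speech",
    "endpoint": "/v1/audio/speech",
}))
# ['--backend', 'openai-audio-speech', '--endpoint', '/v1/audio/speech']
```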

Will post example benchmark results shortly.

@linyueqian force-pushed the worktree-tts-perf-benchmark branch from 387e5fa to 2ab1833 on March 6, 2026 21:45
@linyueqian
Collaborator Author

Ran benchmarks on H200 with latest main (includes #1583 initial_codec_chunk_frames and #1617 CUDA graph decoder). Config: initial_codec_chunk_frames=2, streaming enabled.

Concurrency 1 (10 prompts, random input_len=100)

| Metric | Value |
| --- | --- |
| Mean TTFP | 131 ms |
| Median TTFP | 126 ms |
| P99 TTFP | 179 ms |
| Mean RTF | 0.34 |
| Audio throughput | 2.95x real-time |

Concurrency 4 (10 prompts, random input_len=100)

| Metric | Value |
| --- | --- |
| Mean TTFP | 3386 ms |
| Median TTFP | 200 ms |
| Mean RTF | 0.49 |
| Audio throughput | 7.38x real-time |

Updated the benchmark client to use streaming (stream=true, response_format=pcm) so TTFP measures time to first audio chunk rather than full response latency. Also set initial_codec_chunk_frames=2 in the TTS yaml for faster first-chunk delivery.
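A minimal sketch of how a streaming TTFP measurement like this can be taken by hand, assuming a locally served /v1/audio/speech endpoint. The stream=true and response_format=pcm fields mirror the request described above; the URL, the input text, and treating the first non-empty body chunk as the first audio chunk are assumptions.

```python
import time
import requests  # third-party: pip install requests

# Illustrative TTFP probe, not the benchmark client's actual code.
url = "http://localhost:8000/v1/audio/speech"  # assumed local server
payload = {
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "input": "Hello from the nightly TTS benchmark.",  # assumed field/text
    "stream": True,            # stream audio chunks as they are generated
    "response_format": "pcm",  # raw PCM, so the first chunk is audio
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:  # first non-empty chunk closes the TTFP window
            print(f"TTFP: {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```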

@linyueqian added the ready label (to trigger buildkite CI) on Mar 6, 2026
@linyueqian
Collaborator Author

@Sy0307 please help check if this config is correct. I think the RTF looks good for now.

"--backend",
"openai-chat-omni",
"--endpoint",
"/v1/chat/completions",
Contributor

Check possible risks for the omni-series tests.

@Sy0307
Contributor

Sy0307 commented Mar 8, 2026

@Sy0307 please help check if this config is correct. I think the RTF looks good for now.

The config looks correct to me. Note that changing the default backend may introduce errors. Please check it.

@JuanPZuluaga
Contributor

Hi @linyueqian, I'm working on a simple dynamic initial-chunk-size computation for Qwen3-TTS based on the load of code2wav, to get low TTFC/TTFA. My question is:

  • initial_codec_chunk_frames is going to be computed dynamically, and only when we make a stream=True request to the server: if we don't make a streaming request, there's no point in lowering the TTFA. It would also be removed from the yaml file.

The PR is: #1714

Should we merge that PR first and adapt the config here? For the current benchmarking, the request might need to explicitly pass stream=true and/or set initial_codec_chunk_frames to some lower value. (I see async_request_openai_audio_speech already sets stream=true.)
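For intuition only, one possible shape of the load-based computation being described; this is an illustrative sketch, not the actual code from #1714, and every name and constant in it is an assumption.

```python
# Purely illustrative: pick a small initial codec chunk when code2wav is
# idle (low TTFA) and a larger one under load (better batching). All
# names and constants here are assumptions, not #1714's actual code.
def initial_codec_chunk_frames(active_code2wav_requests: int,
                               min_frames: int = 2,
                               max_frames: int = 8) -> int:
    # Grow the first chunk by one frame per concurrent request,
    # clamped to [min_frames, max_frames].
    return max(min_frames, min(max_frames, min_frames + active_code2wav_requests))

assert initial_codec_chunk_frames(0) == 2   # idle server -> fast first audio
assert initial_codec_chunk_frames(10) == 8  # busy server -> larger first chunk
```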

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Adapt to PR vllm-project#1714 which computes initial_codec_chunk_frames
dynamically based on code2wav load. The static config entry
is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian force-pushed the worktree-tts-perf-benchmark branch from ddfa93d to bc13271 on March 11, 2026 03:01
Collaborator

@hsliuustc0106 left a comment

lgtm

@hsliuustc0106 merged commit f144f5e into vllm-project:main on Mar 11, 2026
6 of 7 checks passed
meghaagr13 pushed a commit to meghaagr13/vllm-omni that referenced this pull request Mar 12, 2026
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Megha Agarwal <agarwalmegha1308@gmail.com>
meghaagr13 pushed a commit to meghaagr13/vllm-omni that referenced this pull request Mar 12, 2026
Signed-off-by: linyueqian <linyueqian@outlook.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: linyueqian <linyueqian@outlook.com>