[Benchmark] Universal TTS benchmark: Qwen3-TTS + VoxCPM2 with 3 task types (voice-clone/default/design) by linyueqian · Pull Request #2835 · vllm-project/vllm-omni

linyueqian · 2026-04-16T02:06:52Z

Summary

Adds universal TTS benchmark covering 3 task types: voice clone, default voice, voice design
Extends seed_tts_dataset.py with SeedTTSTextDataset (default voice) and SeedTTSDesignDataset (voice design)
Fixes seed_tts_row flag so WER/UTMOS PCM capture runs for all task types, not just voice clone
Adds bench_tts.py — a model-agnostic CLI backed by model_configs.yaml registry; adding a new TTS model requires only a YAML entry, no code changes
Adds plot_results.py — bar-chart visualization (TTFP / E2EL / RTF / throughput) for comparing runs or task types
Adds bench_voxcpm_offline.py — offline VoxCPM benchmark using Omni/AsyncOmni directly (sync + streaming, voice cloning, torch profiling)
Wires both Qwen3-TTS (3 tasks) and VoxCPM2 (voice clone) into DFX nightly perf dashboard via test_tts.json
Includes 20-prompt voice-design dataset (benchmarks/build_dataset/seed_tts_design/en/meta.lst)
Removes old per-model benchmark directories (benchmarks/qwen3-tts/, benchmarks/voxcpm/) — superseded by this framework

Metric coverage per task

Task	RTF	TTFP	Throughput	WER	SIM	UTMOS
voice_clone	✓	✓	✓	✓	✓	✓
default_voice	✓	✓	✓	✓	—	✓
voice_design	✓	✓	✓	✓	—	✓

SIM is skipped for tasks without a reference audio (empty seed_tts_ref_wav_path → sim_skipped_no_ref counter in seed_tts_eval.py).

Models

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice: voice_clone + default_voice + voice_design
Qwen/Qwen3-TTS-12Hz-1.7B-Base: voice_clone only
openbmb/VoxCPM2: voice_clone only

New files

File	Purpose
benchmarks/tts/bench_tts.py	Universal serving benchmark CLI
benchmarks/tts/model_configs.yaml	Model registry (add new TTS model here)
benchmarks/tts/plot_results.py	Visualization: bar charts + markdown table from JSON results
benchmarks/tts/bench_voxcpm_offline.py	Offline VoxCPM benchmark (sync/streaming, profiling)
benchmarks/tts/stage_configs/qwen3_tts.yaml	Qwen3-TTS stage config for CLI
benchmarks/tts/stage_configs/voxcpm2.yaml	VoxCPM2 stage config for CLI
benchmarks/build_dataset/seed_tts_design/en/meta.lst	20 voice-design prompts
tests/dfx/perf/stage_configs/voxcpm2.yaml	VoxCPM2 DFX nightly config
tests/dfx/perf/tests/test_tts.json	Universal TTS DFX benchmark matrix
tests/benchmarks/test_seed_tts_dataset_variants.py	Unit tests for new dataset classes
tests/benchmarks/test_bench_tts_cli.py	Unit tests for bench_tts.py
tests/dfx/perf/tests/test_runner_metadata.py	Tests for DFX metadata key exclusion

Removed files

Directory	Reason
benchmarks/qwen3-tts/	Superseded by benchmarks/tts/ (bench_tts.py + plot_results.py cover all use cases)
benchmarks/voxcpm/	Superseded by benchmarks/tts/ (offline path → bench_voxcpm_offline.py)

Alignment with sglang-omni

Uses the same seed-tts-eval dataset as sglang-omni. Adds TTFP and SIM/UTMOS which sglang-omni does not currently track.

Test plan

pytest tests/benchmarks/test_bench_tts_cli.py -v — 5/5 pass (locally, no torch needed)
pytest tests/benchmarks/test_seed_tts_dataset_variants.py -v — 8/9 pass locally; 9th (test_attach_sets_seed_tts_row_even_without_extra_body) requires real vllm, passes on H20
JSON/YAML schema valid (all configs validated with python -m json.tool and yaml.safe_load)
python benchmarks/tts/bench_tts.py --help — outputs help text, no import errors
Manual smoke test on H20
DFX runner integration test on H20 with both models

chatgpt-codex-connector · 2026-04-16T02:06:58Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

hsliuustc0106 · 2026-04-16T02:09:32Z

please help remove other unnecessary benchmarks

linyueqian · 2026-04-16T03:29:42Z

@Sy0307 During smoke testing of the benchmark framework on H20, VoxCPM2 crashes at concurrency=4 with an orchestrator thread crash. Concurrency 1 and 2 work fine.

Root cause trace:

(Worker pid=...) INFO VoxCPM2: CUDA Graph captured for scaffold (batch_size=3)
(APIServer pid=...) INFO [Orchestrator] Shutting down all stages
(APIServer pid=...) ERROR [AsyncOmniEngine] Orchestrator thread crashed
RuntimeError: {'request_id': '...', 'error': 'Orchestrator thread crashed'}

Looks like the VAE scaffold CUDA graph capture fires lazily during concurrent inference and races with the orchestrator. Worth a look if you have context on the VoxCPM2 runtime path.

lishunyang12

Review: Universal TTS Benchmark

This is a well-structured consolidation of per-model benchmarks into a single, model-agnostic framework. The YAML-registry approach (model_configs.yaml) for adding new models without code changes is a good design choice, and the removal of ~2900 lines of duplicated per-model benchmark code is a clear win.

Strengths

Clean architecture: bench_tts.py delegates to vllm bench serve --omni with model-aware defaults from YAML, keeping the CLI thin and testable. The _TASK_TO_DATASET mapping is simple and correct.
Correct bug fix in patch.py: Moving seed_tts_row = True before the if not ex: return guard is the right fix. Without this, SeedTTSTextSampleRequest (which has seed_tts_speech_extra=None) would skip PCM capture entirely, breaking WER/UTMOS for default_voice and voice_design tasks. The test in test_attach_sets_seed_tts_row_even_without_extra_body verifies this via source inspection, which is pragmatic.
DFX integration: test_tts.json separates perf and quality eval phases, which is thoughtful. The enabled field and eval_phase metadata keys, plus updating exclude_keys in run_benchmark.py, ensure these don't leak into CLI args.
Good test coverage: The conftest.py vllm stubs approach is creative and lets dataset tests run without a full vllm install.

Issues and suggestions

1. SeedTTSDesignDataset uses "instructions" key but SeedTTSDesignSampleRequest docstring says "voice_description"

In seed_tts_dataset.py, the sample() method of SeedTTSDesignDataset builds:

speech_extra: dict[str, Any] = {
    "instructions": row.voice_description,
    "task_type": "VoiceDesign",
    ...
}

But SeedTTSDesignSampleRequest's docstring says the dict carries voice_description. The test also checks for "voice_description" in extra -- but extra actually contains "instructions", not "voice_description". This test line would fail:

assert "voice_description" in extra  # extra has "instructions", not "voice_description"

Wait, looking more carefully the test builds the dataset and calls .sample(), so the extra dict will have "instructions" key. The assertion assert "voice_description" in extra should indeed fail. This looks like a real bug in the test.

Action needed: Either the key in speech_extra should be "voice_description" (matching the docstring and test), or the test assertion should check for "instructions" (matching the actual code). Please verify which key the Qwen3-TTS VoiceDesign endpoint expects and align all three (code, docstring, test).

2. Duplicate VoxCPM2 stage config

benchmarks/tts/stage_configs/voxcpm2.yaml and tests/dfx/perf/stage_configs/voxcpm2.yaml are identical. Consider having the DFX config reference the benchmark one (or vice versa) to avoid drift. If they must be separate (e.g., different GPU memory settings for CI vs benchmarking), add a comment explaining why.

3. bench_voxcpm_offline.py REPO_ROOT change looks correct but is fragile

The file was moved from benchmarks/voxcpm/vllm_omni/ (3 levels deep) to benchmarks/tts/ (2 levels deep), so parents[3] -> parents[2] is correct. But this kind of relative-path depth counting is brittle. Consider using a marker file lookup or deriving from git rev-parse --show-toplevel for robustness.

4. plot_results.py -- mean_e2el metric key may be wrong

The comparison table uses "mean_e2el" as the key for E2E latency:

("E2E (ms)", "mean_e2el", ".1f"),

But bench_tts.py's summary table references "mean_audio_rtf" and "mean_audio_ttfp_ms". Please verify "mean_e2el" is the correct key emitted by vllm bench serve --omni result JSON. If it should be "mean_e2el_ms" (with _ms suffix), the E2E column would silently show nan.

5. Minor: _SeedTTSDesignRow uses __import__("random") inline

In SeedTTSDesignDataset.load_data():

rng = __import__("random").Random(self.random_seed)

This works but is unusual. A normal import random at the top of the file would be cleaner and more readable.

6. voice_design entries in test_tts.json are missing extra_body for task_type

The voice_clone entries don't need extra_body because the dataset itself provides ref_audio/ref_text. The default_voice entries correctly include extra_body: {"voice": "Vivian", ...}. But the voice_design perf entry has no extra_body -- is the task_type: "VoiceDesign" already handled by the dataset class? If yes, this is fine. If the server needs task_type in the request body for routing, this could cause 400 errors at runtime. Worth a note or confirming in the test plan.

Overall

The framework design is solid and the consolidation is clearly needed. The main concern is the "instructions" vs "voice_description" key mismatch (point 1) which appears to be a real test bug. The rest are minor improvements. After addressing point 1, this should be good to go.

linyueqian · 2026-04-19T01:35:57Z

@lishunyang12 @hsliuustc0106 — addressed the review with some extra H20-sourced work. Push tip is e42f3ad9.

Review items

#	Item	Action
P1	`instructions` vs `voice_description` drift	The `fix(bench/tts): use 'instructions'` commit updated the code but left the docstring + test still asserting `voice_description`, so the unit test failed on H20. Aligned docstring + test on `instructions`.
P2	Duplicate `voxcpm2.yaml` / `qwen3_tts.yaml` under `benchmarks/tts/stage_configs/`	Removed the dir; `bench_tts.py` default now points at `tests/dfx/perf/stage_configs/` so DFX nightly and the CLI share a single source of truth.
P3	Fragile `parents[2]` in `bench_voxcpm_offline.py`	Replaced with a `_find_repo_root()` walker that looks for the `pyproject.toml` + `vllm_omni/` marker.
P4	`mean_e2el` silent-NaN in `plot_results.py`	Verified with H20 result JSONs: key is `mean_e2el_ms`. Updated both call sites.
P5	Inline `__import__("random")`	Replaced with the existing module-level `import random`.
P6	`extra_body.task_type` missing on `voice_design` DFX entry	Empirically verified on H20 — `SeedTTSDesignDataset.sample()` injects `task_type=VoiceDesign` into each request's `speech_extra`, so DFX works without `extra_body`. Added `extra_body` defensively anyway to match the `default_voice` style.
hsliuustc0106	"remove other unnecessary benchmarks"	The per-model TTS dirs (`qwen3-tts/`, `voxcpm/`) were already removed. Leaving `benchmarks/qwen3-omni/` and `benchmarks/fish-speech/` alone — they're separate model families, not TTS dupes. Can delete in a follow-up if you confirm which ones you meant.

Extras surfaced by H20 benchmarking

Ran voice_clone / voice_design concurrency sweeps on H20-3e 141GB and found:

Live-traffic bug: Qwen3-TTS-12Hz-*-CustomVoice checkpoints don't ship speaker_encoder weights, so voice_clone requests crash with ValueError: This checkpoint does not provide speaker_encoder weights. The original PR's model_configs.yaml listed voice_clone under CustomVoice and the DFX test_qwen3_tts entry used CustomVoice for voice_clone — both would fail in the nightly.

Fix:

Removed voice_clone from 1.7B-CustomVoice.supported_tasks.
Split the DFX matrix into test_qwen3_tts_base (voice_clone via 1.7B-Base) and test_qwen3_tts_customvoice (default_voice + voice_design via 1.7B-CustomVoice).

Concurrency-cliff regression coverage: the old perf phase only ran max_concurrency [1, 4], which is below the TTFP cliff that vllm-omni's codec-bs=1 exposes. Observed H20 numbers:

Task	Model	c=1	c=4	c=8	c=16	c=32
voice_clone	1.7B-Base	RTF 0.15 / TTFP 165ms	0.28 / 412ms	0.49 / 1701ms	0.72 / 3355ms	0.77 / 3772ms
voice_design	1.7B-CustomVoice	0.08 / 53ms	0.11 / 154ms	0.21 / 872ms	0.33 / 1801ms	0.38 / 1989ms

Both show a clean 4-6× TTFP jump from c=4 to c=8 and throughput saturating at c=4-8, which is the same pattern the NVIDIA-fork comparison doc flags.

Replaced the single perf phase with:

latency (c=1) — tight TTFP + RTF
throughput (c=8) — loose TTFP ceiling + throughput/RTF floor, so a codec-batching regression breaks CI
quality (c=4) — unchanged, WER/SIM/UTMOS

Thresholds padded ~2× from the H20 means. default_voice thresholds currently reuse the voice_design numbers (same checkpoint, same pipeline) — happy to run a dedicated default_voice sweep if you'd prefer.

Additional polish

bench_tts.py now persists _task / _concurrency into the saved result JSON so plot_results.py can build per-concurrency comparison tables (was in-memory only).
Rebased onto upstream/main and resolved the test_tts.json add/add and qwen3-tts/vllm_omni/configs/*.yaml modify/delete conflicts.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Adds SeedTTSTextDataset (CLI name: seed-tts-text) and SeedTTSTextSampleRequest. Loads the same meta.lst as SeedTTSDataset but omits ref_audio/ref_text from the request body; voice is supplied via --extra-body in the benchmark config. Sets seed_tts_ref_wav_path="" so SIM is automatically skipped in seed_tts_eval.py. WER and UTMOS still work normally. Also adds tests/benchmarks/conftest.py with lightweight vllm stubs and the corresponding unit test. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…clean up test stubs Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…s-text/design; update CLI choices Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…es task/enabled/eval_phase metadata Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…tadata tests Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…l registry Replaces per-model benchmark scripts with a single model-agnostic CLI that reads model_configs.yaml to dispatch vllm bench serve --omni with correct flags for any registered TTS model (Qwen3-TTS, VoxCPM2, and future models). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…y_is_skipped Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…(Qwen3-TTS + VoxCPM2, 3 tasks) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

… param Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…n voice_design perf entry Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Remove benchmarks/qwen3-tts/ and benchmarks/voxcpm/ which are superseded by the new universal benchmarks/tts/ framework. Apply ruff format/check fixes to all PR-touched files. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Restore two useful tools from the removed per-model benchmarks: - plot_results.py: updated to read vllm bench serve JSON keys (mean_audio_rtf, mean_audio_ttfp_ms, _task, _concurrency). Generates 4-panel bar charts (TTFP, E2EL, RTF, throughput) per task, with optional multi-run comparison and markdown table output. - bench_voxcpm_offline.py: offline VoxCPM benchmark using Omni/AsyncOmni.generate directly; supports sync and streaming modes, txt/jsonl batch input, voice cloning, and torch profiling. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…nch serve Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The server API (serving_speech.py) validates the 'instructions' field for VoiceDesign requests. The benchmark dataset was incorrectly sending the value under 'voice_description', causing all voice_design benchmark requests to fail with a 400 error. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

- P1: align voice-design code/docstring/test on `instructions` key. The prior fix changed the code but left the docstring + unit test asserting `voice_description`, which made the test fail. - P2: remove duplicate stage configs under benchmarks/tts/stage_configs/; point bench_tts.py default to tests/dfx/perf/stage_configs/ so DFX nightly + the CLI share a single source of truth. - P3: replace fragile parents[2] in bench_voxcpm_offline.py with a marker-based repo-root walker (pyproject.toml + vllm_omni/). - P4: actual result JSON key is mean_e2el_ms, not mean_e2el. Fix both references in plot_results.py so the E2E column no longer silently renders NaN. - P5: drop inline `__import__("random")` in favour of the module-level import that already exists. Also persist `_task`/`_concurrency` metadata into the saved result JSON from bench_tts.py so plot_results.py can build the per-concurrency comparison tables (previously the augmentation happened in-memory only). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…lity regimes Concurrency sweeps on H20 (Qwen3-TTS-1.7B-Base voice_clone and Qwen3-TTS-1.7B-CustomVoice voice_design) show a sharp TTFP cliff at max_concurrency>=8 — TTFP jumps 4-6x from c=4 to c=8 while audio throughput saturates. The prior `perf` entries only exercised c=1 and c=4, sitting below the cliff, so codec-batching regressions are invisible to DFX nightly. Replace the single `perf` phase with two targeted regimes plus the existing `quality` phase: - latency (c=1) — tight TTFP + RTF bounds for single-request SLO - throughput (c=8) — loose TTFP ceiling + throughput/RTF floor that collapses if the codec stays batch_size=1 - quality (c=4) — unchanged, WER/SIM/UTMOS eval Thresholds come from H20 sweeps on 1.7B-Base (voice_clone) and 1.7B-CustomVoice (voice_design), padded ~2x: voice_clone c=1: RTF 0.153 TTFP 165ms c=8: RTF 0.493 TTFP 1701ms voice_design c=1: RTF 0.083 TTFP 53ms c=8: RTF 0.21 TTFP 872ms Also fix a live-traffic bug surfaced during the sweep: the -CustomVoice checkpoints don't ship speaker_encoder weights, so voice_clone requests crashed with `ValueError: This checkpoint does not provide speaker_encoder weights`. Drop voice_clone from CustomVoice's supported_tasks in model_configs.yaml and split the DFX suite so voice_clone runs under -Base and default_voice/voice_design under -CustomVoice. Test plan: H20 voice_clone sweep (Qwen3-TTS-1.7B-Base) and voice_design sweep (Qwen3-TTS-1.7B-CustomVoice) validated the thresholds are reachable with >40% headroom on the latency entry and TTFP/throughput both fall within the throughput bounds at c=8. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…oject#2383 vllm-project#2383 replaced the per-model stage_configs/*.yaml layout with auto-loaded vllm_omni/deploy/<model>.yaml (Pipeline in Python, Deploy in YAML) and switched the DFX runner's config-loading dir from stage_configs/ to deploy/. This PR's test matrix and bench CLI still carried the old references: - test_tts.json: drop `stage_config_name` from the Qwen3-TTS entries; vllm-omni now auto-loads vllm_omni/deploy/qwen3_tts.yaml for both Base and CustomVoice checkpoints. - model_configs.yaml: drop the `stage_config` field — the bench CLI does not reference it and auto-discovery handles pipeline lookup. - bench_tts.py: remove the dead `--stage-configs-dir` flag and the `_DEFAULT_STAGE_CONFIGS_DIR` constant; both were unused and pointed at a directory vllm-project#2383 deleted. - Delete tests/dfx/perf/stage_configs/voxcpm2.yaml — the directory no longer exists post-vllm-project#2383. VoxCPM2 is not yet migrated to the Pipeline + Deploy schema in vllm-project#2383 (only qwen2_5_omni / qwen3_omni / qwen3_tts ship pipeline.py + deploy YAML) and still loads via the legacy `ModelPipeline` path. Drop the test_voxcpm2 entry from test_tts.json to unblock DFX nightly; will re-add as a follow-up once VoxCPM2 gets its deploy YAML. The latency / throughput / quality baselines remain unchanged — they come from H20 sweeps on stable checkpoints and should still hold under the new deploy YAML (stage 0 now sets max_num_seqs=10 and async_scheduling=true, which can only improve throughput numbers). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian · 2026-04-19T03:23:28Z

Rebased on top of #2383 (merged) + migrated to the Pipeline + Deploy schema (2ddf7658).

Post-#2383 adaptations

Change	Reason
Drop `stage_config_name: qwen3_tts.yaml` from `test_qwen3_tts_base` and `test_qwen3_tts_customvoice` in `test_tts.json`	#2383 deleted `tests/dfx/perf/stage_configs/qwen3_tts.yaml`. Server now auto-loads `vllm_omni/deploy/qwen3_tts.yaml` by model.
Drop `stage_config:` field from every entry in `model_configs.yaml`	Unused after #2383 — Pipeline registry handles model → stage mapping.
Remove `--stage-configs-dir` flag and `_DEFAULT_STAGE_CONFIGS_DIR` constant from `bench_tts.py`	Dead flag after #2383 (never referenced in `build_bench_args`).
Delete `tests/dfx/perf/stage_configs/voxcpm2.yaml`	Dir removed by #2383; not restored because VoxCPM2 is not yet migrated to the new schema.
Drop `test_voxcpm2` entry from `test_tts.json`	VoxCPM2 still uses legacy `ModelPipeline` / `_parse_pipeline_yaml`; no `vllm_omni/deploy/voxcpm2.yaml` exists. Will re-add as a follow-up once VoxCPM2 gets its deploy YAML.

H20 smoke after migration

Booted Qwen/Qwen3-TTS-12Hz-1.7B-Base on H20-3e with no --stage-configs-path / --deploy-config flags and fired bench_tts.py --task voice_clone --concurrency 1:

voice_clone  concurrency=1   RTF 0.173   TTFP 144ms   throughput 5.78 audio-s/wall-s

Auto-discovery picked up vllm_omni/deploy/qwen3_tts.yaml as expected.
Numbers are in the same ballpark as the pre-migration H20 sweep (RTF 0.153 / TTFP 165ms at c=1), well within the latency baselines (RTF <= 0.25, TTFP <= 350ms). The new deploy YAML's stage-0 max_num_seqs=10 + async_scheduling=true should only help throughput at higher concurrencies.

Baselines unchanged — full sweep recalibration can land in a follow-up if DFX nightly flags drift.

CI after push: pre-commit / DCO / build(3.11) / build(3.12) all green.

…ries Buildkite `🌕 TTS · Perf Test` failed with `completed == 0` on six benchmark entries because the seed-tts / seed-tts-text datasets they reference are not staged in the CI image and `snapshot_download` has no way to pull the Google-Drive-hosted seed-tts-eval archive. Two-part fix: 1. Bundle `benchmarks/build_dataset/seed_tts_smoke/en/meta.lst` — a 20-row seed-tts-compatible meta file with target_text only (no WAVs). `SeedTTSTextDataset` (used by default_voice) does not touch the wav column, so this is enough to exercise the full server path in CI. All entries are short, varied English sentences suitable for TTS smoke testing. 2. Point the `default_voice` benchmark entries at this bundled path and disable the three `voice_clone` entries with `enabled: false` — voice_clone needs real reference WAVs the bundled smoke set deliberately omits. The `voice_design` entries are unchanged; they were already using a bundled dataset and passing in the failing Buildkite run. Also disable the `default_voice` quality entry: WER evaluation requires real seed-tts-eval text (which we deliberately did not bundle — 20 rows × 4 CV folds would give an unreliable WER signal). Perf/throughput entries still exercise the codec-bs cliff on the bundled smoke set. H20 smoke: `bench serve --backend openai-audio-speech --dataset-name seed-tts-text --dataset-path benchmarks/build_dataset/seed_tts_smoke` returned `Successful requests: 5` with audio throughput 11.18 s/s — no more zero-completion failures. Re-enabling the seed-tts-eval entries will be a follow-up once the dataset is staged in the CI container (or made available via an HF mirror we can snapshot_download). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian · 2026-04-19T18:58:52Z

@yenuo26 @amy-why-3459 @congw729 ptal, thanks!

… regimes build #7249 showed default_voice c=1 latency mean_audio_ttfp_ms=2301 ms vs a 150 ms baseline — but p50 was 62 ms and only one cold-start outlier (first measured request after server warmup, p99 ~36 s) dragged the mean up. Latency and throughput regimes care about typical request behaviour, not cold-start tails, so switch their baselines from `mean_audio_*` to `median_audio_*`. The quality entries (WER-driven) still use mean since they aggregate over 200 prompts where single-request outliers don't matter. Applied to both qwen3_tts_base and qwen3_tts_customvoice: * latency → median_audio_ttfp_ms / median_audio_rtf * throughput → median_audio_ttfp_ms / median_audio_rtf * quality → unchanged (mean_audio_rtf) Baseline values unchanged; only the metric aggregation switched. Expected effect on build #7249 data: default_voice c=1 p50 TTFP 62 ms <= 150 ms ✅ default_voice c=8 p50 TTFP 230 ms <= 1500 ms ✅ Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

hsliuustc0106 · 2026-04-20T01:20:38Z

@@ -0,0 +1,308 @@
+#!/usr/bin/env python3


is there a readme.md under tts/

Added benchmarks/tts/README.md in 798cea7e. One-page reference covering quick-start recipes (smoke / concurrency sweep / --wer-eval), the three task types and which checkpoints support each (incl. the -CustomVoice no-speaker_encoder gotcha), how to register a new TTS model via model_configs.yaml, the bundled-vs-external dataset matrix, DFX nightly wiring (latency / throughput / quality regimes), and an H20 concurrency-cliff reference table. Links to #2558 and #2383 for context.

Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four scripts + a YAML registry with zero docs, leaving users to reverse-engineer the CLI from `--help` output. Add a single-page README covering: - quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS) - plot_results.py usage - the three task types and which checkpoints support each (notes that -CustomVoice lacks speaker_encoder so voice_clone is Base-only) - model_configs.yaml extension recipe for new TTS models - dataset matrix (bundled seed_tts_design / seed_tts_smoke, external seed-tts-eval with link to the download guide) - DFX nightly integration: latency / throughput / quality regimes, median-vs-mean baseline choice, quality-entry gating rationale - observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel) - file layout + cross-references to vllm-project#2558 and vllm-project#2383 Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Sy0307 · 2026-04-20T04:06:21Z

The new throughput regime here does not actually assert throughput. In tests/dfx/perf/tests/test_tts.json, the throughput entries only gate median_audio_ttfp_ms and median_audio_rtf, but there is no baseline for audio_throughput itself. That means this regime is currently acting more like a high-concurrency latency/RTF check than a real throughput guard. If the intent is to catch codec batching regressions or throughput collapse under load, it would be better to add an explicit audio_throughput baseline as well.

H20 benchmark summaries (per review feedback)

All runs on H20-3e 141GB, num_prompts=20, num_warmups=2, dataset seed-tts / seed-tts-design. PR tip 969aead0. The vllm bench serve --omni stdout blocks (TTFT / TTFP / TPOT / RTF / throughput / latency) are pasted below so reviewers do not have to re-run the sweep locally.

voice_clone — Qwen/Qwen3-TTS-12Hz-1.7B-Base, seed-tts

concurrency = 1

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  19.92
Request throughput (req/s):              1.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          995.69
Median E2EL (ms):                        902.93
P99 E2EL (ms):                           1577.83
---------------Time to First Token----------------
Mean TTFT (ms):                          164.80
Median TTFT (ms):                        162.55
P99 TTFT (ms):                           209.90
================== Audio Result ==================
Total audio duration generated(s):       135.04
Audio throughput(audio duration/s):      6.78
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.15
Median AUDIO_RTF:                        0.16
P99 AUDIO_RTF:                           0.18
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    164.80
Median AUDIO_TTFP (ms):                  162.55
P99 AUDIO_TTFP (ms):                     209.90
==================================================

concurrency = 4

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  9.23
Request throughput (req/s):              2.17
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1785.28
Median E2EL (ms):                        1563.50
P99 E2EL (ms):                           2748.23
---------------Time to First Token----------------
Mean TTFT (ms):                          412.33
Median TTFT (ms):                        401.16
P99 TTFT (ms):                           581.77
================== Audio Result ==================
Audio throughput(audio duration/s):      14.37
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.28
Median AUDIO_RTF:                        0.29
P99 AUDIO_RTF:                           0.32
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    412.33
Median AUDIO_TTFP (ms):                  401.16
P99 AUDIO_TTFP (ms):                     581.77
==================================================

concurrency = 8 (cliff)

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  9.00
Request throughput (req/s):              2.22
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3071.54
Median E2EL (ms):                        3471.58
P99 E2EL (ms):                           4366.16
---------------Time to First Token----------------
Mean TTFT (ms):                          1701.14
Median TTFT (ms):                        1805.45
P99 TTFT (ms):                           2911.19
================== Audio Result ==================
Audio throughput(audio duration/s):      14.99
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.49
Median AUDIO_RTF:                        0.44
P99 AUDIO_RTF:                           0.87
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1701.14
Median AUDIO_TTFP (ms):                  1805.45
P99 AUDIO_TTFP (ms):                     2911.19
==================================================

concurrency = 32

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  8.92
Request throughput (req/s):              2.24
----------------End-to-end Latency----------------
Mean E2EL (ms):                          5929.22
Median E2EL (ms):                        5932.68
P99 E2EL (ms):                           8914.57
---------------Time to First Token----------------
Mean TTFT (ms):                          3772.46
Median TTFT (ms):                        3843.58
P99 TTFT (ms):                           7199.71
================== Audio Result ==================
Audio throughput(audio duration/s):      14.58
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.77
Median AUDIO_RTF:                        0.73
P99 AUDIO_RTF:                           1.34
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    3772.46
Median AUDIO_TTFP (ms):                  3843.58
P99 AUDIO_TTFP (ms):                     7199.71
==================================================

voice_design — Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, seed-tts-design

concurrency = 1

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  13.77
Request throughput (req/s):              1.45
----------------End-to-end Latency----------------
Mean E2EL (ms):                          688.20
Median E2EL (ms):                        688.65
P99 E2EL (ms):                           884.63
---------------Time to First Token----------------
Mean TTFT (ms):                          53.27
Median TTFT (ms):                        52.89
P99 TTFT (ms):                           59.12
================== Audio Result ==================
Audio throughput(audio duration/s):      12.28
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.083
Median AUDIO_RTF:                        0.081
P99 AUDIO_RTF:                           0.112
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    53.27
Median AUDIO_TTFP (ms):                  52.89
P99 AUDIO_TTFP (ms):                     59.12
==================================================

concurrency = 4

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  5.13
Request throughput (req/s):              3.90
----------------End-to-end Latency----------------
Mean E2EL (ms):                          943.80
Median E2EL (ms):                        919.07
P99 E2EL (ms):                           1456.01
---------------Time to First Token----------------
Mean TTFT (ms):                          154.31
Median TTFT (ms):                        150.22
P99 TTFT (ms):                           228.11
================== Audio Result ==================
Audio throughput(audio duration/s):      33.04
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.113
Median AUDIO_RTF:                        0.110
P99 AUDIO_RTF:                           0.155
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    154.31
Median AUDIO_TTFP (ms):                  150.22
P99 AUDIO_TTFP (ms):                     228.11
==================================================

concurrency = 8 (cliff)

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  4.80
Request throughput (req/s):              4.17
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1640.43
Median E2EL (ms):                        1797.72
P99 E2EL (ms):                           2333.79
---------------Time to First Token----------------
Mean TTFT (ms):                          872.26
Median TTFT (ms):                        843.18
P99 TTFT (ms):                           1205.98
================== Audio Result ==================
Audio throughput(audio duration/s):      33.99
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.210
Median AUDIO_RTF:                        0.188
P99 AUDIO_RTF:                           0.283
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    872.26
Median AUDIO_TTFP (ms):                  843.18
P99 AUDIO_TTFP (ms):                     1205.98
==================================================

concurrency = 32

============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  4.66
Request throughput (req/s):              4.29
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2732.84
Median E2EL (ms):                        2754.59
P99 E2EL (ms):                           4639.83
---------------Time to First Token----------------
Mean TTFT (ms):                          1989.31
Median TTFT (ms):                        2009.91
P99 TTFT (ms):                           4053.75
================== Audio Result ==================
Audio throughput(audio duration/s):      33.55
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.378
Median AUDIO_RTF:                        0.330
P99 AUDIO_RTF:                           0.809
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1989.31
Median AUDIO_TTFP (ms):                  2009.91
P99 AUDIO_TTFP (ms):                     1989.31
==================================================

Notes

TPOT / ITL are reported as 0.0 for both tasks because Qwen3-TTS is non-streaming at the token level (codec streaming is chunk-based); the serving bench only computes TPOT/ITL when token-by-token streaming is enabled.
The codec-bs=1 cliff is clean from c=4 to c=8 in both tasks: voice_clone TTFP jumps 412ms → 1701ms (4.1×), voice_design TTFP jumps 154ms → 872ms (5.7×). Audio throughput saturates around c=4-8 in both cases.

linyueqian · 2026-04-21T21:21:32Z

@hsliuustc0106 is this OK to merge?

amy-why-3459 · 2026-04-22T01:13:07Z

Could you add a nightly-test for the benchmark?

yenuo26 · 2026-04-22T01:22:55Z

+import json
+
+
+def test_task_excluded_from_cli_args():


please add mark

Addressed in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] so the file is picked up by the nightly core_model and cpu selector.

yenuo26 · 2026-04-22T01:23:41Z

+import json
+import sys
+from pathlib import Path
+from unittest.mock import patch


please don't use unittest.mock, pytest mock instead

Done in 8f06be05. Dropped from unittest.mock import patch; test_unsupported_task_exits now takes mocker as a fixture and calls mocker.patch.object(sys, "argv", [...]) instead of the with patch.object(...) context manager.

yenuo26 · 2026-04-22T01:24:16Z

+    return p
+
+
+def test_load_model_configs(model_configs_path: Path) -> None:


please add mark, like pytest.mark.core_model, pytest.mark.cpu

Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_bench_tts_cli.py.

yenuo26 · 2026-04-22T01:24:27Z

+# ---------------------------------------------------------------------------
+
+
+def test_seed_tts_text_dataset_omits_ref_audio(seed_tts_root):


please add mark, like pytest.mark.core_model, pytest.mark.cpu

Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_seed_tts_dataset_variants.py.

yenuo26 · 2026-04-22T01:24:54Z

+import importlib.util
+import sys
+from pathlib import Path
+from unittest.mock import MagicMock


please don't use unittest.mock, pytest mock instead

Done in 8f06be05. Dropped from unittest.mock import MagicMock; the three dataset-class tests (test_seed_tts_text_dataset_omits_ref_audio, test_seed_tts_design_dataset_has_instructions, test_seed_tts_design_dataset_rejects_missing_description) now take mocker as a fixture and use mocker.MagicMock().

yenuo26 · 2026-04-22T01:26:38Z

Could you add a nightly-test for the benchmark?

I think tts-test and omni-test are enough.

amy-why-3459 · 2026-04-22T01:38:01Z

Could you add an L4 test case?

@yenuo26

Reviewer feedback from @yenuo26 on PR vllm-project#2835: 1. tests/dfx/perf/tests/test_runner_metadata.py — needs pytest marks 2. tests/benchmarks/test_bench_tts_cli.py — swap unittest.mock.patch for the pytest-mock `mocker` fixture; also missing marks 3. tests/benchmarks/test_seed_tts_dataset_variants.py — swap unittest.mock.MagicMock for `mocker.MagicMock`; also missing marks Applied module-level `pytestmark = [pytest.mark.core_model, pytest.mark.cpu]` to all three files so they run under the nightly `core_model and cpu` pytest selector (matches the existing repo convention in tests/dfx/perf/tests/test_qwen_omni.json's `run_benchmark.py` fixture path). Converted: - `from unittest.mock import patch` → `mocker.patch.object(...)` (test_bench_tts_cli.py::test_unsupported_task_exits) - `from unittest.mock import MagicMock` + `tokenizer = MagicMock()` → `tokenizer = mocker.MagicMock()` with `mocker` injected via fixture (test_seed_tts_dataset_variants.py, three tests) H20 smoke: `pytest tests/benchmarks/test_bench_tts_cli.py tests/benchmarks/test_seed_tts_dataset_variants.py tests/dfx/perf/tests/test_runner_metadata.py -m "core_model and cpu"` → 10/11 pass. The 1 remaining failure (`test_attach_sets_seed_tts_row_even_without_extra_body`) is a pre-existing `ModuleNotFoundError: No module named 'vllm.benchmarks.lib'` from a stale vllm import path unrelated to this refactor. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

lishunyang12 · 2026-04-23T06:34:10Z

@hsliuustc0106 PTAL

hsliuustc0106

lgtm, update the qwen3-tts recipe later, thank you

linyueqian requested a review from hsliuustc0106 as a code owner April 16, 2026 02:06

linyueqian force-pushed the feat/universal-tts-benchmark branch from ba1d784 to 5d8f536 Compare April 16, 2026 02:15

linyueqian added the nightly-test label to trigger buildkite nightly test CI label Apr 16, 2026

lishunyang12 reviewed Apr 16, 2026

View reviewed changes

linyueqian force-pushed the feat/universal-tts-benchmark branch 2 times, most recently from a0d0dda to e42f3ad Compare April 19, 2026 01:34

linyueqian added 18 commits April 18, 2026 23:15

chore: start feat/universal-tts-benchmark off upstream/main

7807e8d

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

feat(bench): add SeedTTSDesignDataset for voice-design benchmarking; …

878a69b

…clean up test stubs Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(bench): always set seed_tts_row for PCM capture; register seed-tt…

30134f0

…s-text/design; update CLI choices Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

feat(bench): add voice-design dataset (20 prompts); DFX runner exclud…

74d6fe3

…es task/enabled/eval_phase metadata Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

test(bench/dfx): add test_enabled_false_entry_is_skipped to runner me…

2dab511

…tadata tests Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(test): correct benchmark_params access in test_enabled_false_entr…

5520229

…y_is_skipped Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

feat(bench/dfx): add VoxCPM2 stage config + universal TTS DFX config …

c91a180

…(Qwen3-TTS + VoxCPM2, 3 tasks) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(bench): use math.isnan, datetime import, remove dead request_rate…

3dec24d

… param Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(bench/dfx): align num_prompts array length with max_concurrency i…

8092996

…n voice_design perf entry Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

fix(bench/tts): use vllm-omni binary instead of python -m vllm for be…

d415746

…nch serve Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian force-pushed the feat/universal-tts-benchmark branch from e42f3ad to 2ddf765 Compare April 19, 2026 03:18

hsliuustc0106 reviewed Apr 20, 2026

View reviewed changes

amy-why-3459 reviewed Apr 20, 2026

View reviewed changes

hsliuustc0106 reviewed Apr 20, 2026

View reviewed changes

linyueqian mentioned this pull request Apr 20, 2026

[Config Refactor] Migrate 5 TTS models (VoxCPM2 / CosyVoice3 / MiMo Audio / Voxtral TTS / Fish Speech S2 Pro) to Pipeline + Deploy schema #2958

Merged

5 tasks

yenuo26 reviewed Apr 22, 2026

View reviewed changes

yenuo26 added omni-test label to trigger buildkite omni model test in nightly CI tts-test label to trigger buildkite tts models test in nightly CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 22, 2026

linyueqian added this to the v0.20.0 milestone Apr 22, 2026

hsliuustc0106 reviewed Apr 24, 2026

View reviewed changes

hsliuustc0106 merged commit 39e3fd3 into vllm-project:main Apr 24, 2026
6 checks passed

gnomefin mentioned this pull request Apr 24, 2026

[Doc][Frontend][Model][VoxCPM2] Support instructions and per-request cfg_value #3118

Open

yenuo26 mentioned this pull request Apr 25, 2026

[CI Failure]: Simple Unit Test, tests/benchmarks/test_seed_tts_dataset_variants.py, ValueError: max() iterable argument is empty #3122

Open

1 task

+              ```bash
+              # Smallest smoke — 5 prompts, concurrency=1
+              python benchmarks/tts/bench_tts.py \

+              Outputs TTFP / RTF / throughput curves (and a markdown table) for every
+              `(task, concurrency)` combination in the result set.
+              ## Raw `vllm bench serve` commands

		return p


		def test_load_model_configs(model_configs_path: Path) -> None:

		# ---------------------------------------------------------------------------


		def test_seed_tts_text_dataset_omits_ref_audio(seed_tts_root):

Conversation

linyueqian commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Metric coverage per task

Models

New files

Removed files

Alignment with sglang-omni

Test plan

Uh oh!

chatgpt-codex-connector Bot commented Apr 16, 2026

Uh oh!

hsliuustc0106 commented Apr 16, 2026

Uh oh!

linyueqian commented Apr 16, 2026

Uh oh!

lishunyang12 left a comment

Choose a reason for hiding this comment

Review: Universal TTS Benchmark

Strengths

Issues and suggestions

Overall

Uh oh!

linyueqian commented Apr 19, 2026

Review items

Extras surfaced by H20 benchmarking

Additional polish

Uh oh!

linyueqian commented Apr 19, 2026

Post-#2383 adaptations

H20 smoke after migration

Uh oh!

linyueqian commented Apr 19, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sy0307 commented Apr 20, 2026

Uh oh!

hsliuustc0106 commented Apr 20, 2026

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linyueqian commented Apr 20, 2026

H20 benchmark summaries (per review feedback)

voice_clone — Qwen/Qwen3-TTS-12Hz-1.7B-Base, seed-tts

voice_design — Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, seed-tts-design

Notes

Uh oh!

linyueqian commented Apr 21, 2026

Uh oh!

amy-why-3459 commented Apr 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

linyueqian commented Apr 16, 2026 •

edited

Loading