
[Benchmark] Universal TTS benchmark: Qwen3-TTS + VoxCPM2 with 3 task types (voice-clone/default/design)#2835

Merged

hsliuustc0106 merged 26 commits into vllm-project:main from linyueqian:feat/universal-tts-benchmark on Apr 24, 2026

Conversation

@linyueqian (Collaborator) commented Apr 16, 2026

Summary

  • Adds universal TTS benchmark covering 3 task types: voice clone, default voice, voice design
  • Extends seed_tts_dataset.py with SeedTTSTextDataset (default voice) and SeedTTSDesignDataset (voice design)
  • Fixes seed_tts_row flag so WER/UTMOS PCM capture runs for all task types, not just voice clone
  • Adds bench_tts.py — a model-agnostic CLI backed by a model_configs.yaml registry; adding a new TTS model requires only a YAML entry, no code changes (a dispatch sketch follows this list)
  • Adds plot_results.py — bar-chart visualization (TTFP / E2EL / RTF / throughput) for comparing runs or task types
  • Adds bench_voxcpm_offline.py — offline VoxCPM benchmark using Omni/AsyncOmni directly (sync + streaming, voice cloning, torch profiling)
  • Wires both Qwen3-TTS (3 tasks) and VoxCPM2 (voice clone) into DFX nightly perf dashboard via test_tts.json
  • Includes 20-prompt voice-design dataset (benchmarks/build_dataset/seed_tts_design/en/meta.lst)
  • Removes old per-model benchmark directories (benchmarks/qwen3-tts/, benchmarks/voxcpm/) — superseded by this framework
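
To make the registry-dispatch idea concrete, here is a minimal sketch of how a CLI in the spirit of bench_tts.py can turn a YAML entry into a vllm bench serve --omni invocation. The registry keys (supported_tasks, extra_flags) are illustrative assumptions, not the exact model_configs.yaml schema; the task-to-dataset mapping mirrors the one described in this PR.

```python
# Illustrative sketch only — the real model_configs.yaml schema may differ.
import shlex
import yaml

_TASK_TO_DATASET = {  # mirrors the task -> dataset mapping used by the CLI
    "voice_clone": "seed-tts",
    "default_voice": "seed-tts-text",
    "voice_design": "seed-tts-design",
}

def build_args(registry_path: str, model: str, task: str) -> list[str]:
    """Compose a `vllm bench serve --omni` command from a YAML registry entry."""
    with open(registry_path) as f:
        registry = yaml.safe_load(f)
    cfg = registry[model]  # hypothetical layout: one mapping per model id
    if task not in cfg.get("supported_tasks", []):
        raise SystemExit(f"{model} does not support task {task!r}")
    args = ["vllm", "bench", "serve", "--omni",
            "--model", model,
            "--dataset-name", _TASK_TO_DATASET[task]]
    for flag, value in cfg.get("extra_flags", {}).items():  # assumed key
        args += [flag, str(value)]
    return args

print(shlex.join(build_args("model_configs.yaml", "openbmb/VoxCPM2", "voice_clone")))
```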

Metric coverage per task

| Task | RTF | TTFP | Throughput | WER | SIM | UTMOS |
|---|---|---|---|---|---|---|
| voice_clone | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| default_voice | ✓ | ✓ | ✓ | ✓ | — | ✓ |
| voice_design | ✓ | ✓ | ✓ | ✓ | — | ✓ |

SIM is skipped for tasks without a reference audio (empty seed_tts_ref_wav_path → sim_skipped_no_ref counter in seed_tts_eval.py).
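
A minimal sketch of that skip logic, assuming the surrounding seed_tts_eval.py structure (only the field and counter names are from the PR; compute_speaker_similarity is a hypothetical helper):

```python
# Sketch of the SIM-skip guard: rows without reference audio increment the
# sim_skipped_no_ref counter instead of producing a meaningless SIM score.
def maybe_eval_sim(row, counters: dict[str, int]) -> float | None:
    ref_wav = getattr(row, "seed_tts_ref_wav_path", "")
    if not ref_wav:  # default_voice / voice_design rows carry no reference audio
        counters["sim_skipped_no_ref"] = counters.get("sim_skipped_no_ref", 0) + 1
        return None
    return compute_speaker_similarity(ref_wav, row.generated_wav_path)  # hypothetical
```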

Models

  • Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice: voice_clone + default_voice + voice_design
  • Qwen/Qwen3-TTS-12Hz-1.7B-Base: voice_clone only
  • openbmb/VoxCPM2: voice_clone only

New files

| File | Purpose |
|---|---|
| benchmarks/tts/bench_tts.py | Universal serving benchmark CLI |
| benchmarks/tts/model_configs.yaml | Model registry (add new TTS model here) |
| benchmarks/tts/plot_results.py | Visualization: bar charts + markdown table from JSON results |
| benchmarks/tts/bench_voxcpm_offline.py | Offline VoxCPM benchmark (sync/streaming, profiling) |
| benchmarks/tts/stage_configs/qwen3_tts.yaml | Qwen3-TTS stage config for CLI |
| benchmarks/tts/stage_configs/voxcpm2.yaml | VoxCPM2 stage config for CLI |
| benchmarks/build_dataset/seed_tts_design/en/meta.lst | 20 voice-design prompts |
| tests/dfx/perf/stage_configs/voxcpm2.yaml | VoxCPM2 DFX nightly config |
| tests/dfx/perf/tests/test_tts.json | Universal TTS DFX benchmark matrix |
| tests/benchmarks/test_seed_tts_dataset_variants.py | Unit tests for new dataset classes |
| tests/benchmarks/test_bench_tts_cli.py | Unit tests for bench_tts.py |
| tests/dfx/perf/tests/test_runner_metadata.py | Tests for DFX metadata key exclusion |

Removed files

| Directory | Reason |
|---|---|
| benchmarks/qwen3-tts/ | Superseded by benchmarks/tts/ (bench_tts.py + plot_results.py cover all use cases) |
| benchmarks/voxcpm/ | Superseded by benchmarks/tts/ (offline path → bench_voxcpm_offline.py) |

Alignment with sglang-omni

Uses the same seed-tts-eval dataset as sglang-omni, and adds TTFP and SIM/UTMOS metrics that sglang-omni does not currently track.

Test plan

  • pytest tests/benchmarks/test_bench_tts_cli.py -v — 5/5 pass (locally, no torch needed)
  • pytest tests/benchmarks/test_seed_tts_dataset_variants.py -v — 8/9 pass locally; 9th (test_attach_sets_seed_tts_row_even_without_extra_body) requires real vllm, passes on H20
  • JSON/YAML schema valid (all configs validated with python -m json.tool and yaml.safe_load)
  • python benchmarks/tts/bench_tts.py --help — outputs help text, no import errors
  • Manual smoke test on H20
  • DFX runner integration test on H20 with both models

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hsliuustc0106 (Collaborator) commented:

please help remove other unnecessary benchmarks

@linyueqian force-pushed the feat/universal-tts-benchmark branch from ba1d784 to 5d8f536 on April 16, 2026 02:15
@linyueqian added the nightly-test label (trigger buildkite nightly test CI) on Apr 16, 2026
@linyueqian (Collaborator, Author) commented:

@Sy0307 During smoke testing of the benchmark framework on H20, VoxCPM2 fails at concurrency=4 with an orchestrator thread crash; concurrency 1 and 2 work fine.

Root cause trace:

```
(Worker pid=...) INFO VoxCPM2: CUDA Graph captured for scaffold (batch_size=3)
(APIServer pid=...) INFO [Orchestrator] Shutting down all stages
(APIServer pid=...) ERROR [AsyncOmniEngine] Orchestrator thread crashed
RuntimeError: {'request_id': '...', 'error': 'Orchestrator thread crashed'}
```

Looks like the VAE scaffold CUDA graph capture fires lazily during concurrent inference and races with the orchestrator. Worth a look if you have context on the VoxCPM2 runtime path.

@lishunyang12 (Collaborator) left a comment:

Review: Universal TTS Benchmark

This is a well-structured consolidation of per-model benchmarks into a single, model-agnostic framework. The YAML-registry approach (model_configs.yaml) for adding new models without code changes is a good design choice, and the removal of ~2900 lines of duplicated per-model benchmark code is a clear win.

Strengths

  1. Clean architecture: bench_tts.py delegates to vllm bench serve --omni with model-aware defaults from YAML, keeping the CLI thin and testable. The _TASK_TO_DATASET mapping is simple and correct.

  2. Correct bug fix in patch.py: Moving seed_tts_row = True before the if not ex: return guard is the right fix. Without this, SeedTTSTextSampleRequest (which has seed_tts_speech_extra=None) would skip PCM capture entirely, breaking WER/UTMOS for default_voice and voice_design tasks. The test in test_attach_sets_seed_tts_row_even_without_extra_body verifies this via source inspection, which is pragmatic. (A before/after sketch of this ordering fix follows the list.)

  3. DFX integration: test_tts.json separates perf and quality eval phases, which is thoughtful. The enabled field and eval_phase metadata keys, plus updating exclude_keys in run_benchmark.py, ensure these don't leak into CLI args.

  4. Good test coverage: The conftest.py vllm stubs approach is creative and lets dataset tests run without a full vllm install.
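
To make point 2 concrete, a before/after sketch of the ordering fix (function and field names beyond seed_tts_row and seed_tts_speech_extra are assumptions, not the actual patch.py code):

```python
# Before (buggy): rows without speech_extra bail out before the flag is set,
# so PCM capture — and with it WER/UTMOS — never runs for those tasks.
def attach_buggy(request, ex):
    if not ex:  # SeedTTSTextSampleRequest has seed_tts_speech_extra=None
        return
    request.seed_tts_row = True
    request.speech_extra = ex

# After (fixed): mark the row first, then handle the optional extra body.
def attach_fixed(request, ex):
    request.seed_tts_row = True  # always set, so PCM capture runs for every task type
    if not ex:
        return
    request.speech_extra = ex
```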

Issues and suggestions

1. SeedTTSDesignDataset uses "instructions" key but SeedTTSDesignSampleRequest docstring says "voice_description"

In seed_tts_dataset.py, the sample() method of SeedTTSDesignDataset builds:

```python
speech_extra: dict[str, Any] = {
    "instructions": row.voice_description,
    "task_type": "VoiceDesign",
    ...
}
```

But SeedTTSDesignSampleRequest's docstring says the dict carries voice_description, and the test asserts the same key:

```python
assert "voice_description" in extra  # extra actually has "instructions"
```

Since the test builds the dataset and calls .sample(), the extra dict will carry the "instructions" key, so this assertion should fail. This looks like a real bug in the test.

Action needed: Either the key in speech_extra should be "voice_description" (matching the docstring and test), or the test assertion should check for "instructions" (matching the actual code). Please verify which key the Qwen3-TTS VoiceDesign endpoint expects and align all three (code, docstring, test).

2. Duplicate VoxCPM2 stage config

benchmarks/tts/stage_configs/voxcpm2.yaml and tests/dfx/perf/stage_configs/voxcpm2.yaml are identical. Consider having the DFX config reference the benchmark one (or vice versa) to avoid drift. If they must be separate (e.g., different GPU memory settings for CI vs benchmarking), add a comment explaining why.

3. bench_voxcpm_offline.py REPO_ROOT change looks correct but is fragile

The file was moved from benchmarks/voxcpm/vllm_omni/ (3 levels deep) to benchmarks/tts/ (2 levels deep), so parents[3] -> parents[2] is correct. But this kind of relative-path depth counting is brittle. Consider using a marker file lookup or deriving from git rev-parse --show-toplevel for robustness.
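
A sketch of the marker-based lookup suggested here, using the pyproject.toml + vllm_omni/ markers that the PR's eventual _find_repo_root() fix adopts (the exact implementation is assumed):

```python
from pathlib import Path

def find_repo_root(start: Path) -> Path:
    """Walk upward until a directory contains both repo markers."""
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").is_file() and (candidate / "vllm_omni").is_dir():
            return candidate
    raise RuntimeError(f"repo root not found above {start}")

REPO_ROOT = find_repo_root(Path(__file__).resolve().parent)
```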

4. plot_results.py -- mean_e2el metric key may be wrong

The comparison table uses "mean_e2el" as the key for E2E latency:

("E2E (ms)", "mean_e2el", ".1f"),

But bench_tts.py's summary table references "mean_audio_rtf" and "mean_audio_ttfp_ms". Please verify "mean_e2el" is the correct key emitted by vllm bench serve --omni result JSON. If it should be "mean_e2el_ms" (with _ms suffix), the E2E column would silently show nan.
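
If the producer's key name is uncertain, a tolerant lookup is a cheap guard against silent NaN columns (a sketch; the PR ultimately just corrects the key to mean_e2el_ms):

```python
# Fall back across candidate key names instead of silently plotting NaN.
def metric(row: dict, *candidates: str) -> float:
    for key in candidates:
        if key in row:
            return row[key]
    raise KeyError(f"none of {candidates} present in result JSON")

e2el_ms = metric(result, "mean_e2el_ms", "mean_e2el")  # `result` = loaded JSON dict
```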

5. Minor: _SeedTTSDesignRow uses __import__("random") inline

In SeedTTSDesignDataset.load_data():

```python
rng = __import__("random").Random(self.random_seed)
```

This works but is unusual. A normal import random at the top of the file would be cleaner and more readable.

6. voice_design entries in test_tts.json are missing extra_body for task_type

The voice_clone entries don't need extra_body because the dataset itself provides ref_audio/ref_text. The default_voice entries correctly include extra_body: {"voice": "Vivian", ...}. But the voice_design perf entry has no extra_body -- is the task_type: "VoiceDesign" already handled by the dataset class? If yes, this is fine. If the server needs task_type in the request body for routing, this could cause 400 errors at runtime. Worth a note or confirming in the test plan.

Overall

The framework design is solid and the consolidation is clearly needed. The main concern is the "instructions" vs "voice_description" key mismatch (point 1) which appears to be a real test bug. The rest are minor improvements. After addressing point 1, this should be good to go.

@linyueqian force-pushed the feat/universal-tts-benchmark branch 2 times, most recently from a0d0dda to e42f3ad, on April 19, 2026 01:34
@linyueqian (Collaborator, Author) commented:

@lishunyang12 @hsliuustc0106 — addressed the review with some extra H20-sourced work. Push tip is e42f3ad9.

Review items

| # | Item | Action |
|---|---|---|
| P1 | instructions vs voice_description drift | The fix(bench/tts): use 'instructions' commit updated the code but left the docstring + test still asserting voice_description, so the unit test failed on H20. Aligned docstring + test on instructions. |
| P2 | Duplicate voxcpm2.yaml / qwen3_tts.yaml under benchmarks/tts/stage_configs/ | Removed the dir; bench_tts.py default now points at tests/dfx/perf/stage_configs/ so DFX nightly and the CLI share a single source of truth. |
| P3 | Fragile parents[2] in bench_voxcpm_offline.py | Replaced with a _find_repo_root() walker that looks for the pyproject.toml + vllm_omni/ marker. |
| P4 | mean_e2el silent-NaN in plot_results.py | Verified with H20 result JSONs: key is mean_e2el_ms. Updated both call sites. |
| P5 | Inline __import__("random") | Replaced with the existing module-level import random. |
| P6 | extra_body.task_type missing on voice_design DFX entry | Empirically verified on H20 — SeedTTSDesignDataset.sample() injects task_type=VoiceDesign into each request's speech_extra, so DFX works without extra_body. Added extra_body defensively anyway to match the default_voice style. |
| — | hsliuustc0106: "remove other unnecessary benchmarks" | The per-model TTS dirs (qwen3-tts/, voxcpm/) were already removed. Leaving benchmarks/qwen3-omni/ and benchmarks/fish-speech/ alone — they're separate model families, not TTS dupes. Can delete in a follow-up if you confirm which ones you meant. |

Extras surfaced by H20 benchmarking

Ran voice_clone / voice_design concurrency sweeps on H20-3e 141GB and found:

Live-traffic bug: Qwen3-TTS-12Hz-*-CustomVoice checkpoints don't ship speaker_encoder weights, so voice_clone requests crash with ValueError: This checkpoint does not provide speaker_encoder weights. The original PR's model_configs.yaml listed voice_clone under CustomVoice and the DFX test_qwen3_tts entry used CustomVoice for voice_clone — both would fail in the nightly.

Fix:

  • Removed voice_clone from 1.7B-CustomVoice.supported_tasks.
  • Split the DFX matrix into test_qwen3_tts_base (voice_clone via 1.7B-Base) and test_qwen3_tts_customvoice (default_voice + voice_design via 1.7B-CustomVoice).

Concurrency-cliff regression coverage: the old perf phase only ran max_concurrency [1, 4], which is below the TTFP cliff that vllm-omni's codec-bs=1 exposes. Observed H20 numbers:

| Task | Model | c=1 | c=4 | c=8 | c=16 | c=32 |
|---|---|---|---|---|---|---|
| voice_clone | 1.7B-Base | RTF 0.15 / TTFP 165ms | 0.28 / 412ms | 0.49 / 1701ms | 0.72 / 3355ms | 0.77 / 3772ms |
| voice_design | 1.7B-CustomVoice | 0.08 / 53ms | 0.11 / 154ms | 0.21 / 872ms | 0.33 / 1801ms | 0.38 / 1989ms |

Both show a clean 4-6× TTFP jump from c=4 to c=8 and throughput saturating at c=4-8, which is the same pattern the NVIDIA-fork comparison doc flags.

Replaced the single perf phase with:

  • latency (c=1) — tight TTFP + RTF
  • throughput (c=8) — loose TTFP ceiling + throughput/RTF floor, so a codec-batching regression breaks CI
  • quality (c=4) — unchanged, WER/SIM/UTMOS

Thresholds padded ~2× from the H20 means. default_voice thresholds currently reuse the voice_design numbers (same checkpoint, same pipeline) — happy to run a dedicated default_voice sweep if you'd prefer.
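
For concreteness, a hypothetical shape for one such entry — enabled, eval_phase, and max_concurrency appear elsewhere in this thread, while the surrounding structure (e.g. a baselines mapping) is an illustrative assumption about test_tts.json:

```python
import json

# Hypothetical latency-regime entry; ceilings taken from the thread's stated
# voice_clone baselines (RTF <= 0.25, TTFP <= 350 ms), ~2x the H20 c=1 means.
latency_entry = {
    "enabled": True,
    "eval_phase": "latency",
    "max_concurrency": 1,
    "baselines": {
        "mean_audio_ttfp_ms": 350.0,
        "mean_audio_rtf": 0.25,
    },
}
print(json.dumps(latency_entry, indent=2))
```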

Additional polish

  • bench_tts.py now persists _task / _concurrency into the saved result JSON so plot_results.py can build per-concurrency comparison tables (was in-memory only).
  • Rebased onto upstream/main and resolved the test_tts.json add/add and qwen3-tts/vllm_omni/configs/*.yaml modify/delete conflicts.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Adds SeedTTSTextDataset (CLI name: seed-tts-text) and
SeedTTSTextSampleRequest. Loads the same meta.lst as SeedTTSDataset but
omits ref_audio/ref_text from the request body; voice is supplied via
--extra-body in the benchmark config. Sets seed_tts_ref_wav_path="" so
SIM is automatically skipped in seed_tts_eval.py. WER and UTMOS still
work normally.

Also adds tests/benchmarks/conftest.py with lightweight vllm stubs and
the corresponding unit test.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
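
A sketch of the request shape this commit describes — reference fields omitted and the SIM sentinel set (field names other than seed_tts_ref_wav_path are assumptions):

```python
# Sketch only: approximates the default-voice request construction described
# above; the voice itself arrives separately via --extra-body.
def build_text_request(row):
    return {
        "input": row.target_text,     # text to synthesize; no ref_audio/ref_text
        "seed_tts_ref_wav_path": "",  # empty -> seed_tts_eval.py skips SIM
    }
```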
…clean up test stubs

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…s-text/design; update CLI choices

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…es task/enabled/eval_phase metadata

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…tadata tests

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…l registry

Replaces per-model benchmark scripts with a single model-agnostic CLI that
reads model_configs.yaml to dispatch vllm bench serve --omni with correct
flags for any registered TTS model (Qwen3-TTS, VoxCPM2, and future models).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…y_is_skipped

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…(Qwen3-TTS + VoxCPM2, 3 tasks)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… param

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…n voice_design perf entry

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Remove benchmarks/qwen3-tts/ and benchmarks/voxcpm/ which are
superseded by the new universal benchmarks/tts/ framework.
Apply ruff format/check fixes to all PR-touched files.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Restore two useful tools from the removed per-model benchmarks:

- plot_results.py: updated to read vllm bench serve JSON keys
  (mean_audio_rtf, mean_audio_ttfp_ms, _task, _concurrency).
  Generates 4-panel bar charts (TTFP, E2EL, RTF, throughput) per
  task, with optional multi-run comparison and markdown table output.

- bench_voxcpm_offline.py: offline VoxCPM benchmark using
  Omni/AsyncOmni.generate directly; supports sync and streaming
  modes, txt/jsonl batch input, voice cloning, and torch profiling.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…nch serve

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The server API (serving_speech.py) validates the 'instructions' field
for VoiceDesign requests. The benchmark dataset was incorrectly sending
the value under 'voice_description', causing all voice_design benchmark
requests to fail with a 400 error.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- P1: align voice-design code/docstring/test on `instructions` key. The
  prior fix changed the code but left the docstring + unit test asserting
  `voice_description`, which made the test fail.
- P2: remove duplicate stage configs under benchmarks/tts/stage_configs/;
  point bench_tts.py default to tests/dfx/perf/stage_configs/ so DFX
  nightly + the CLI share a single source of truth.
- P3: replace fragile parents[2] in bench_voxcpm_offline.py with a
  marker-based repo-root walker (pyproject.toml + vllm_omni/).
- P4: actual result JSON key is mean_e2el_ms, not mean_e2el.  Fix both
  references in plot_results.py so the E2E column no longer silently
  renders NaN.
- P5: drop inline `__import__("random")` in favour of the module-level
  import that already exists.

Also persist `_task`/`_concurrency` metadata into the saved result JSON
from bench_tts.py so plot_results.py can build the per-concurrency
comparison tables (previously the augmentation happened in-memory only).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…lity regimes

Concurrency sweeps on H20 (Qwen3-TTS-1.7B-Base voice_clone and
Qwen3-TTS-1.7B-CustomVoice voice_design) show a sharp TTFP cliff at
max_concurrency>=8 — TTFP jumps 4-6x from c=4 to c=8 while audio
throughput saturates.  The prior `perf` entries only exercised c=1 and
c=4, sitting below the cliff, so codec-batching regressions are invisible
to DFX nightly.

Replace the single `perf` phase with two targeted regimes plus the
existing `quality` phase:

  - latency    (c=1)  — tight TTFP + RTF bounds for single-request SLO
  - throughput (c=8)  — loose TTFP ceiling + throughput/RTF floor that
                        collapses if the codec stays batch_size=1
  - quality    (c=4)  — unchanged, WER/SIM/UTMOS eval

Thresholds come from H20 sweeps on 1.7B-Base (voice_clone) and
1.7B-CustomVoice (voice_design), padded ~2x:

  voice_clone         c=1: RTF 0.153 TTFP 165ms    c=8: RTF 0.493 TTFP 1701ms
  voice_design        c=1: RTF 0.083 TTFP 53ms     c=8: RTF 0.21  TTFP 872ms

Also fix a live-traffic bug surfaced during the sweep: the
-CustomVoice checkpoints don't ship speaker_encoder weights, so
voice_clone requests crashed with
`ValueError: This checkpoint does not provide speaker_encoder weights`.
Drop voice_clone from CustomVoice's supported_tasks in
model_configs.yaml and split the DFX suite so voice_clone runs under
-Base and default_voice/voice_design under -CustomVoice.

Test plan: H20 voice_clone sweep (Qwen3-TTS-1.7B-Base) and voice_design
sweep (Qwen3-TTS-1.7B-CustomVoice) validated the thresholds are reachable
with >40% headroom on the latency entry and TTFP/throughput both fall
within the throughput bounds at c=8.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…oject#2383

vllm-project#2383 replaced the per-model stage_configs/*.yaml layout with auto-loaded
vllm_omni/deploy/<model>.yaml (Pipeline in Python, Deploy in YAML) and
switched the DFX runner's config-loading dir from stage_configs/ to
deploy/.  This PR's test matrix and bench CLI still carried the old
references:

- test_tts.json: drop `stage_config_name` from the Qwen3-TTS entries;
  vllm-omni now auto-loads vllm_omni/deploy/qwen3_tts.yaml for both
  Base and CustomVoice checkpoints.
- model_configs.yaml: drop the `stage_config` field — the bench CLI
  does not reference it and auto-discovery handles pipeline lookup.
- bench_tts.py: remove the dead `--stage-configs-dir` flag and the
  `_DEFAULT_STAGE_CONFIGS_DIR` constant; both were unused and pointed
  at a directory vllm-project#2383 deleted.
- Delete tests/dfx/perf/stage_configs/voxcpm2.yaml — the directory no
  longer exists post-vllm-project#2383.

VoxCPM2 is not yet migrated to the Pipeline + Deploy schema in vllm-project#2383
(only qwen2_5_omni / qwen3_omni / qwen3_tts ship pipeline.py + deploy
YAML) and still loads via the legacy `ModelPipeline` path.  Drop the
test_voxcpm2 entry from test_tts.json to unblock DFX nightly; will
re-add as a follow-up once VoxCPM2 gets its deploy YAML.

The latency / throughput / quality baselines remain unchanged — they
come from H20 sweeps on stable checkpoints and should still hold under
the new deploy YAML (stage 0 now sets max_num_seqs=10 and
async_scheduling=true, which can only improve throughput numbers).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian force-pushed the feat/universal-tts-benchmark branch from e42f3ad to 2ddf765 on April 19, 2026 03:18
@linyueqian (Collaborator, Author) commented:

Rebased on top of #2383 (merged) + migrated to the Pipeline + Deploy schema (2ddf7658).

Post-#2383 adaptations

| Change | Reason |
|---|---|
| Drop stage_config_name: qwen3_tts.yaml from test_qwen3_tts_base and test_qwen3_tts_customvoice in test_tts.json | #2383 deleted tests/dfx/perf/stage_configs/qwen3_tts.yaml. Server now auto-loads vllm_omni/deploy/qwen3_tts.yaml by model. |
| Drop stage_config: field from every entry in model_configs.yaml | Unused after #2383 — Pipeline registry handles model → stage mapping. |
| Remove --stage-configs-dir flag and _DEFAULT_STAGE_CONFIGS_DIR constant from bench_tts.py | Dead flag after #2383 (never referenced in build_bench_args). |
| Delete tests/dfx/perf/stage_configs/voxcpm2.yaml | Dir removed by #2383; not restored because VoxCPM2 is not yet migrated to the new schema. |
| Drop test_voxcpm2 entry from test_tts.json | VoxCPM2 still uses legacy ModelPipeline / _parse_pipeline_yaml; no vllm_omni/deploy/voxcpm2.yaml exists. Will re-add as a follow-up once VoxCPM2 gets its deploy YAML. |

H20 smoke after migration

Booted Qwen/Qwen3-TTS-12Hz-1.7B-Base on H20-3e with no --stage-configs-path / --deploy-config flags and fired bench_tts.py --task voice_clone --concurrency 1:

voice_clone  concurrency=1   RTF 0.173   TTFP 144ms   throughput 5.78 audio-s/wall-s
  • Auto-discovery picked up vllm_omni/deploy/qwen3_tts.yaml as expected.
  • Numbers are in the same ballpark as the pre-migration H20 sweep (RTF 0.153 / TTFP 165ms at c=1), well within the latency baselines (RTF <= 0.25, TTFP <= 350ms). The new deploy YAML's stage-0 max_num_seqs=10 + async_scheduling=true should only help throughput at higher concurrencies.

Baselines unchanged — full sweep recalibration can land in a follow-up if DFX nightly flags drift.

CI after push: pre-commit / DCO / build(3.11) / build(3.12) all green.

…ries

Buildkite `🌕 TTS · Perf Test` failed with `completed == 0` on six
benchmark entries because the seed-tts / seed-tts-text datasets they
reference are not staged in the CI image and `snapshot_download` has
no way to pull the Google-Drive-hosted seed-tts-eval archive.

Two-part fix:

1. Bundle `benchmarks/build_dataset/seed_tts_smoke/en/meta.lst` — a
   20-row seed-tts-compatible meta file with target_text only (no
   WAVs).  `SeedTTSTextDataset` (used by default_voice) does not
   touch the wav column, so this is enough to exercise the full
   server path in CI.  All entries are short, varied English
   sentences suitable for TTS smoke testing.

2. Point the `default_voice` benchmark entries at this bundled path
   and disable the three `voice_clone` entries with `enabled: false`
   — voice_clone needs real reference WAVs the bundled smoke set
   deliberately omits.  The `voice_design` entries are unchanged;
   they were already using a bundled dataset and passing in the
   failing Buildkite run.

Also disable the `default_voice` quality entry: WER evaluation
requires real seed-tts-eval text (which we deliberately did not
bundle — 20 rows × 4 CV folds would give an unreliable WER signal).
Perf/throughput entries still exercise the codec-bs cliff on the
bundled smoke set.

H20 smoke: `bench serve --backend openai-audio-speech
--dataset-name seed-tts-text --dataset-path benchmarks/build_dataset/seed_tts_smoke`
returned `Successful requests: 5` with audio throughput 11.18 s/s — no
more zero-completion failures.  Re-enabling the seed-tts-eval
entries will be a follow-up once the dataset is staged in the CI
container (or made available via an HF mirror we can snapshot_download).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian (Collaborator, Author) commented:

@yenuo26 @amy-why-3459 @congw729 ptal, thanks!

… regimes

build #7249 showed default_voice c=1 latency mean_audio_ttfp_ms=2301 ms
vs a 150 ms baseline — but p50 was 62 ms and only one cold-start
outlier (first measured request after server warmup, p99 ~36 s) dragged
the mean up.

Latency and throughput regimes care about typical request behaviour,
not cold-start tails, so switch their baselines from `mean_audio_*` to
`median_audio_*`.  The quality entries (WER-driven) still use mean
since they aggregate over 200 prompts where single-request outliers
don't matter.

Applied to both qwen3_tts_base and qwen3_tts_customvoice:

  * latency   → median_audio_ttfp_ms / median_audio_rtf
  * throughput → median_audio_ttfp_ms / median_audio_rtf
  * quality    → unchanged (mean_audio_rtf)

Baseline values unchanged; only the metric aggregation switched.
Expected effect on build #7249 data:

  default_voice c=1  p50 TTFP  62 ms  <=  150 ms  ✅
  default_voice c=8  p50 TTFP 230 ms  <= 1500 ms  ✅

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
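
The mean-vs-median effect this commit describes is easy to reproduce (a toy illustration with made-up numbers, not the build #7249 data):

```python
import statistics

# One cold-start outlier drags the mean far above typical latency,
# while the median stays representative of per-request behaviour.
ttfp_ms = [62, 58, 65, 61, 60, 59, 63, 64, 57, 36_000]  # last = cold start
print(statistics.mean(ttfp_ms))    # 3654.9 — would fail a 150 ms baseline
print(statistics.median(ttfp_ms))  # 61.5   — matches typical behaviour
```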
```
@@ -0,0 +1,308 @@
#!/usr/bin/env python3
```
Collaborator:

is there a readme.md under tts/?

Collaborator (Author):

Added benchmarks/tts/README.md in 798cea7e. One-page reference covering quick-start recipes (smoke / concurrency sweep / --wer-eval), the three task types and which checkpoints support each (incl. the -CustomVoice no-speaker_encoder gotcha), how to register a new TTS model via model_configs.yaml, the bundled-vs-external dataset matrix, DFX nightly wiring (latency / throughput / quality regimes), and an H20 concurrency-cliff reference table. Links to #2558 and #2383 for context.

Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four
scripts + a YAML registry with zero docs, leaving users to reverse-engineer
the CLI from `--help` output.  Add a single-page README covering:

- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that
  -CustomVoice lacks speaker_encoder so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external
  seed-tts-eval with link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes,
  median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@Sy0307 (Contributor) commented Apr 20, 2026

The new throughput regime here does not actually assert throughput. In tests/dfx/perf/tests/test_tts.json, the throughput entries only gate median_audio_ttfp_ms and median_audio_rtf, but there is no baseline for audio_throughput itself. That means this regime is currently acting more like a high-concurrency latency/RTF check than a real throughput guard. If the intent is to catch codec batching regressions or throughput collapse under load, it would be better to add an explicit audio_throughput baseline as well.

Other contents LGTM.

Reviewer feedback on PR vllm-project#2835: the throughput-phase entries only gated
median_audio_ttfp_ms and median_audio_rtf, so the regime was really a
high-concurrency latency check — a real throughput collapse (e.g. codec
batch size regressing back to 1, or scheduler starvation) would leave
TTFP/RTF within bounds while audio_throughput cratered, and CI would miss
it.

Add an `audio_throughput` baseline to every throughput entry.  The runner
inverts the comparison for any metric name containing "throughput"
(run_benchmark.py:287-292), so these values act as FLOORS: if the observed
audio-seconds-per-wall-second drops below the baseline, the runner prints
the soft-warning ERROR.

Floors (audio-s per wall-s, >= baseline required):

  voice_clone  c=8 (1.7B-Base)       10.0    (measured ~15 on H20)
  default_voice c=8 (1.7B-CustomVoice) 30.0    (measured ~47 on H100, ~35 H20)
  voice_design c=8 (1.7B-CustomVoice) 25.0    (measured ~43 on H100, ~34 H20)

Values set ~30% below the lower of the two observed environments so CI
flags real regressions, not noise.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
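
A sketch of the floor-vs-ceiling inversion this commit relies on (the real logic lives around run_benchmark.py:287-292; this shape is an assumption):

```python
# Metrics whose name contains "throughput" are floors (observed must stay
# >= baseline); everything else is treated as a ceiling (observed <= baseline).
def baseline_ok(name: str, observed: float, baseline: float) -> bool:
    if "throughput" in name:
        return observed >= baseline  # e.g. the new audio_throughput floors
    return observed <= baseline      # e.g. median_audio_ttfp_ms ceilings

assert baseline_ok("audio_throughput", 14.99, 10.0)          # H20 voice_clone c=8
assert baseline_ok("median_audio_ttfp_ms", 1805.45, 3400.0)  # illustrative ceiling
```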
@hsliuustc0106 (Collaborator) commented:

using this benchmark locally on one L20 48GB device:

[benchmark output screenshot]

@hsliuustc0106 (Collaborator) left a comment:

lgtm

Comment thread: benchmarks/tts/README.md

```bash
# Smallest smoke — 5 prompts, concurrency=1
python benchmarks/tts/bench_tts.py \
```
Contributor:

Can we provide the vllm bench command in the README file? This would make it easier for users to modify the vllm bench parameters.

Collaborator (Author):

Added in eb30f559. New section "Raw vllm bench serve commands" with full copy-paste invocations for each of the three task types (voice_clone, default_voice, voice_design) showing the full flag set — --host, --port, --model, --dataset-name, --dataset-path, --seed-tts-locale, --num-prompts, --num-warmups, --extra-body, --max-concurrency, --request-rate, --percentile-metrics, --save-result. Plus a short note on appending --seed-tts-wer-eval for WER/SIM/UTMOS. Users can now tweak bench flags without reading through bench_tts.py.

Reviewer feedback on PR vllm-project#2835: the README's quick-start only showed the
`bench_tts.py` wrapper, which hides the underlying `vllm bench serve --omni`
invocation.  Users wanting to tweak individual bench flags (sampling params,
endpoint, `--extra-body`, warmups, etc.) had to read bench_tts.py source to
find out what the wrapper emits.

Add a "Raw `vllm bench serve` commands" section with the full copy-paste
invocation for each of the three task types — voice_clone (Qwen3-TTS-Base,
seed-tts), default_voice (Qwen3-TTS-CustomVoice, bundled smoke), and
voice_design (Qwen3-TTS-CustomVoice, bundled design) — plus a short note
on enabling `--seed-tts-wer-eval` for WER/SIM/UTMOS.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@hsliuustc0106 (Collaborator) left a comment:

please also paste the benchmark summaries ttfp/ttft/tpot/rtf standard outputs in test results

Comment thread: benchmarks/tts/README.md (Outdated)
```
Outputs TTFP / RTF / throughput curves (and a markdown table) for every
`(task, concurrency)` combination in the result set.

## Raw `vllm bench serve` commands
```
Collaborator:

I think we can move this section to the top as the first option for users

Reviewer feedback on PR vllm-project#2835: users should see the raw `vllm bench serve
--omni` invocation as the first option, not as an afterthought buried
below the `bench_tts.py` wrapper.

Restructure the README so the Quick Start flow is:

  1. Start the server
  2. Run the benchmark via `vllm bench serve --omni` (3 task examples + WER)
  3. Convenience wrapper via `bench_tts.py`
  4. Plot the sweep

The wrapper section now explains that it is exactly the raw command with
model-aware defaults plugged in, and documents which flags come from
`model_configs.yaml` vs. fixed defaults — so users who outgrow the wrapper
know exactly what to swap.

Also remove the now-duplicate "Raw vllm bench serve commands" section that
was appended in an earlier commit.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian (Collaborator, Author) commented:

H20 benchmark summaries (per review feedback)

All runs on H20-3e 141GB, num_prompts=20, num_warmups=2, dataset seed-tts / seed-tts-design. PR tip 969aead0. The vllm bench serve --omni stdout blocks (TTFT / TTFP / TPOT / RTF / throughput / latency) are pasted below so reviewers do not have to re-run the sweep locally.

voice_clone — Qwen/Qwen3-TTS-12Hz-1.7B-Base, seed-tts

concurrency = 1
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  19.92
Request throughput (req/s):              1.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          995.69
Median E2EL (ms):                        902.93
P99 E2EL (ms):                           1577.83
---------------Time to First Token----------------
Mean TTFT (ms):                          164.80
Median TTFT (ms):                        162.55
P99 TTFT (ms):                           209.90
================== Audio Result ==================
Total audio duration generated(s):       135.04
Audio throughput(audio duration/s):      6.78
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.15
Median AUDIO_RTF:                        0.16
P99 AUDIO_RTF:                           0.18
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    164.80
Median AUDIO_TTFP (ms):                  162.55
P99 AUDIO_TTFP (ms):                     209.90
==================================================
```
concurrency = 4
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  9.23
Request throughput (req/s):              2.17
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1785.28
Median E2EL (ms):                        1563.50
P99 E2EL (ms):                           2748.23
---------------Time to First Token----------------
Mean TTFT (ms):                          412.33
Median TTFT (ms):                        401.16
P99 TTFT (ms):                           581.77
================== Audio Result ==================
Audio throughput(audio duration/s):      14.37
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.28
Median AUDIO_RTF:                        0.29
P99 AUDIO_RTF:                           0.32
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    412.33
Median AUDIO_TTFP (ms):                  401.16
P99 AUDIO_TTFP (ms):                     581.77
==================================================
```
concurrency = 8 (cliff)
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  9.00
Request throughput (req/s):              2.22
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3071.54
Median E2EL (ms):                        3471.58
P99 E2EL (ms):                           4366.16
---------------Time to First Token----------------
Mean TTFT (ms):                          1701.14
Median TTFT (ms):                        1805.45
P99 TTFT (ms):                           2911.19
================== Audio Result ==================
Audio throughput(audio duration/s):      14.99
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.49
Median AUDIO_RTF:                        0.44
P99 AUDIO_RTF:                           0.87
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1701.14
Median AUDIO_TTFP (ms):                  1805.45
P99 AUDIO_TTFP (ms):                     2911.19
==================================================
```
concurrency = 32
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  8.92
Request throughput (req/s):              2.24
----------------End-to-end Latency----------------
Mean E2EL (ms):                          5929.22
Median E2EL (ms):                        5932.68
P99 E2EL (ms):                           8914.57
---------------Time to First Token----------------
Mean TTFT (ms):                          3772.46
Median TTFT (ms):                        3843.58
P99 TTFT (ms):                           7199.71
================== Audio Result ==================
Audio throughput(audio duration/s):      14.58
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.77
Median AUDIO_RTF:                        0.73
P99 AUDIO_RTF:                           1.34
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    3772.46
Median AUDIO_TTFP (ms):                  3843.58
P99 AUDIO_TTFP (ms):                     7199.71
==================================================
```

voice_design — Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, seed-tts-design

concurrency = 1
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  13.77
Request throughput (req/s):              1.45
----------------End-to-end Latency----------------
Mean E2EL (ms):                          688.20
Median E2EL (ms):                        688.65
P99 E2EL (ms):                           884.63
---------------Time to First Token----------------
Mean TTFT (ms):                          53.27
Median TTFT (ms):                        52.89
P99 TTFT (ms):                           59.12
================== Audio Result ==================
Audio throughput(audio duration/s):      12.28
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.083
Median AUDIO_RTF:                        0.081
P99 AUDIO_RTF:                           0.112
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    53.27
Median AUDIO_TTFP (ms):                  52.89
P99 AUDIO_TTFP (ms):                     59.12
==================================================
```
concurrency = 4
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  5.13
Request throughput (req/s):              3.90
----------------End-to-end Latency----------------
Mean E2EL (ms):                          943.80
Median E2EL (ms):                        919.07
P99 E2EL (ms):                           1456.01
---------------Time to First Token----------------
Mean TTFT (ms):                          154.31
Median TTFT (ms):                        150.22
P99 TTFT (ms):                           228.11
================== Audio Result ==================
Audio throughput(audio duration/s):      33.04
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.113
Median AUDIO_RTF:                        0.110
P99 AUDIO_RTF:                           0.155
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    154.31
Median AUDIO_TTFP (ms):                  150.22
P99 AUDIO_TTFP (ms):                     228.11
==================================================
```
concurrency = 8 (cliff)
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  4.80
Request throughput (req/s):              4.17
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1640.43
Median E2EL (ms):                        1797.72
P99 E2EL (ms):                           2333.79
---------------Time to First Token----------------
Mean TTFT (ms):                          872.26
Median TTFT (ms):                        843.18
P99 TTFT (ms):                           1205.98
================== Audio Result ==================
Audio throughput(audio duration/s):      33.99
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.210
Median AUDIO_RTF:                        0.188
P99 AUDIO_RTF:                           0.283
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    872.26
Median AUDIO_TTFP (ms):                  843.18
P99 AUDIO_TTFP (ms):                     1205.98
==================================================
```
concurrency = 32
```
============ Serving Benchmark Result ============
Successful requests:                     20
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  4.66
Request throughput (req/s):              4.29
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2732.84
Median E2EL (ms):                        2754.59
P99 E2EL (ms):                           4639.83
---------------Time to First Token----------------
Mean TTFT (ms):                          1989.31
Median TTFT (ms):                        2009.91
P99 TTFT (ms):                           4053.75
================== Audio Result ==================
Audio throughput(audio duration/s):      33.55
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.378
Median AUDIO_RTF:                        0.330
P99 AUDIO_RTF:                           0.809
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1989.31
Median AUDIO_TTFP (ms):                  2009.91
P99 AUDIO_TTFP (ms):                     1989.31
==================================================
```

Notes

  • TPOT / ITL are reported as 0.0 for both tasks because Qwen3-TTS is non-streaming at the token level (codec streaming is chunk-based); the serving bench only computes TPOT/ITL when token-by-token streaming is enabled.
  • The codec-bs=1 cliff is clean from c=4 to c=8 in both tasks: voice_clone TTFP jumps 412ms → 1701ms (4.1×), voice_design TTFP jumps 154ms → 872ms (5.7×). Audio throughput saturates around c=4-8 in both cases. (A quick numeric cross-check of the reported metrics follows below.)
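
For readers new to these fields: RTF is generation wall time divided by generated audio duration (RTF < 1 means faster than real time), and audio throughput is generated audio seconds per wall-clock second — definitions inferred from the field names above. A quick cross-check against the voice_clone c=1 block:

```python
# Numbers copied from the c=1 voice_clone block above.
total_audio_s = 135.04   # "Total audio duration generated(s)"
duration_s = 19.92       # "Benchmark duration (s)"
mean_e2el_s = 0.99569    # "Mean E2EL (ms)" / 1000
n_requests = 20

print(total_audio_s / duration_s)                  # 6.78 — matches the reported throughput
print(mean_e2el_s / (total_audio_s / n_requests))  # ~0.15 — matches mean AUDIO_RTF
```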

@linyueqian (Collaborator, Author) commented:

@hsliuustc0106 is this OK to merge?

@amy-why-3459 (Contributor) commented:

Could you add a nightly-test for the benchmark?

```python
import json


def test_task_excluded_from_cli_args():
```
Collaborator:

please add mark

Collaborator (Author):

Addressed in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] so the file is picked up by the nightly core_model and cpu selector.

Comment thread: tests/benchmarks/test_bench_tts_cli.py (Outdated)
```python
import json
import sys
from pathlib import Path
from unittest.mock import patch
```
Collaborator:

please don't use unittest.mock, pytest mock instead

Collaborator (Author):

Done in 8f06be05. Dropped from unittest.mock import patch; test_unsupported_task_exits now takes mocker as a fixture and calls mocker.patch.object(sys, "argv", [...]) instead of the with patch.object(...) context manager.
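
For reference, the converted pattern looks roughly like this (argv contents and the entry point are illustrative, not copied from test_bench_tts_cli.py):

```python
import sys
import pytest

pytestmark = [pytest.mark.core_model, pytest.mark.cpu]

def test_unsupported_task_exits(mocker):
    # pytest-mock's mocker fixture undoes the patch automatically after the
    # test, so no `with` context manager is needed.
    mocker.patch.object(sys, "argv", ["bench_tts.py", "--task", "not_a_task"])
    with pytest.raises(SystemExit):
        bench_tts.main()  # hypothetical entry point; import omitted in this sketch
```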

```python
return p


def test_load_model_configs(model_configs_path: Path) -> None:
```
Collaborator:

please add mark, like pytest.mark.core_model, pytest.mark.cpu

Collaborator (Author):

Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_bench_tts_cli.py.

```python
# ---------------------------------------------------------------------------


def test_seed_tts_text_dataset_omits_ref_audio(seed_tts_root):
```
Collaborator:

please add mark, like pytest.mark.core_model, pytest.mark.cpu

Collaborator (Author):

Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_seed_tts_dataset_variants.py.

```python
import importlib.util
import sys
from pathlib import Path
from unittest.mock import MagicMock
```
Collaborator:

please don't use unittest.mock, pytest mock instead

Collaborator (Author):

Done in 8f06be05. Dropped from unittest.mock import MagicMock; the three dataset-class tests (test_seed_tts_text_dataset_omits_ref_audio, test_seed_tts_design_dataset_has_instructions, test_seed_tts_design_dataset_rejects_missing_description) now take mocker as a fixture and use mocker.MagicMock().

@yenuo26 (Collaborator) commented Apr 22, 2026

> Could you add a nightly-test for the benchmark?

I think tts-test and omni-test are enough.

@yenuo26 added the omni-test label (trigger buildkite omni model test in nightly CI) and the tts-test label (trigger buildkite tts models test in nightly CI), and removed the nightly-test label (trigger buildkite nightly test CI) on Apr 22, 2026
@amy-why-3459 (Contributor) commented:

Could you add an L4 test case?

Reviewer feedback from @yenuo26 on PR vllm-project#2835:

  1. tests/dfx/perf/tests/test_runner_metadata.py — needs pytest marks
  2. tests/benchmarks/test_bench_tts_cli.py — swap unittest.mock.patch for
     the pytest-mock `mocker` fixture; also missing marks
  3. tests/benchmarks/test_seed_tts_dataset_variants.py — swap
     unittest.mock.MagicMock for `mocker.MagicMock`; also missing marks

Applied module-level `pytestmark = [pytest.mark.core_model, pytest.mark.cpu]`
to all three files so they run under the nightly `core_model and cpu`
pytest selector (matches the existing repo convention in
tests/dfx/perf/tests/test_qwen_omni.json's `run_benchmark.py` fixture
path).

Converted:
  - `from unittest.mock import patch` → `mocker.patch.object(...)`
    (test_bench_tts_cli.py::test_unsupported_task_exits)
  - `from unittest.mock import MagicMock` + `tokenizer = MagicMock()` →
    `tokenizer = mocker.MagicMock()` with `mocker` injected via fixture
    (test_seed_tts_dataset_variants.py, three tests)

H20 smoke: `pytest tests/benchmarks/test_bench_tts_cli.py
tests/benchmarks/test_seed_tts_dataset_variants.py
tests/dfx/perf/tests/test_runner_metadata.py -m "core_model and cpu"` →
10/11 pass.  The 1 remaining failure
(`test_attach_sets_seed_tts_row_even_without_extra_body`) is a
pre-existing `ModuleNotFoundError: No module named 'vllm.benchmarks.lib'`
from a stale vllm import path unrelated to this refactor.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian added this to the v0.20.0 milestone on Apr 22, 2026
@lishunyang12 (Collaborator) commented:

@hsliuustc0106 PTAL

@hsliuustc0106 (Collaborator) left a comment:

lgtm, update the qwen3-tts recipe later, thank you
