[Benchmark] Universal TTS benchmark: Qwen3-TTS + VoxCPM2 with 3 task types (voice-clone/default/design)#2835
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
please help remove other unnecessary benchmarks |
ba1d784 to
5d8f536
Compare
|
@Sy0307 During smoke testing of the benchmark framework on H20, VoxCPM2 crashes at concurrency=4 with an orchestrator thread crash. Concurrency 1 and 2 work fine. Root cause trace: Looks like the VAE scaffold CUDA graph capture fires lazily during concurrent inference and races with the orchestrator. Worth a look if you have context on the VoxCPM2 runtime path. |
lishunyang12
left a comment
There was a problem hiding this comment.
Review: Universal TTS Benchmark
This is a well-structured consolidation of per-model benchmarks into a single, model-agnostic framework. The YAML-registry approach (model_configs.yaml) for adding new models without code changes is a good design choice, and the removal of ~2900 lines of duplicated per-model benchmark code is a clear win.
Strengths
-
Clean architecture:
bench_tts.pydelegates tovllm bench serve --omniwith model-aware defaults from YAML, keeping the CLI thin and testable. The_TASK_TO_DATASETmapping is simple and correct. -
Correct bug fix in
patch.py: Movingseed_tts_row = Truebefore theif not ex: returnguard is the right fix. Without this,SeedTTSTextSampleRequest(which hasseed_tts_speech_extra=None) would skip PCM capture entirely, breaking WER/UTMOS for default_voice and voice_design tasks. The test intest_attach_sets_seed_tts_row_even_without_extra_bodyverifies this via source inspection, which is pragmatic. -
DFX integration:
test_tts.jsonseparates perf and quality eval phases, which is thoughtful. Theenabledfield andeval_phasemetadata keys, plus updatingexclude_keysinrun_benchmark.py, ensure these don't leak into CLI args. -
Good test coverage: The conftest.py vllm stubs approach is creative and lets dataset tests run without a full vllm install.
Issues and suggestions
1. SeedTTSDesignDataset uses "instructions" key but SeedTTSDesignSampleRequest docstring says "voice_description"
In seed_tts_dataset.py, the sample() method of SeedTTSDesignDataset builds:
speech_extra: dict[str, Any] = {
"instructions": row.voice_description,
"task_type": "VoiceDesign",
...
}But SeedTTSDesignSampleRequest's docstring says the dict carries voice_description. The test also checks for "voice_description" in extra -- but extra actually contains "instructions", not "voice_description". This test line would fail:
assert "voice_description" in extra # extra has "instructions", not "voice_description"Wait, looking more carefully the test builds the dataset and calls .sample(), so the extra dict will have "instructions" key. The assertion assert "voice_description" in extra should indeed fail. This looks like a real bug in the test.
Action needed: Either the key in speech_extra should be "voice_description" (matching the docstring and test), or the test assertion should check for "instructions" (matching the actual code). Please verify which key the Qwen3-TTS VoiceDesign endpoint expects and align all three (code, docstring, test).
2. Duplicate VoxCPM2 stage config
benchmarks/tts/stage_configs/voxcpm2.yaml and tests/dfx/perf/stage_configs/voxcpm2.yaml are identical. Consider having the DFX config reference the benchmark one (or vice versa) to avoid drift. If they must be separate (e.g., different GPU memory settings for CI vs benchmarking), add a comment explaining why.
3. bench_voxcpm_offline.py REPO_ROOT change looks correct but is fragile
The file was moved from benchmarks/voxcpm/vllm_omni/ (3 levels deep) to benchmarks/tts/ (2 levels deep), so parents[3] -> parents[2] is correct. But this kind of relative-path depth counting is brittle. Consider using a marker file lookup or deriving from git rev-parse --show-toplevel for robustness.
4. plot_results.py -- mean_e2el metric key may be wrong
The comparison table uses "mean_e2el" as the key for E2E latency:
("E2E (ms)", "mean_e2el", ".1f"),But bench_tts.py's summary table references "mean_audio_rtf" and "mean_audio_ttfp_ms". Please verify "mean_e2el" is the correct key emitted by vllm bench serve --omni result JSON. If it should be "mean_e2el_ms" (with _ms suffix), the E2E column would silently show nan.
5. Minor: _SeedTTSDesignRow uses __import__("random") inline
In SeedTTSDesignDataset.load_data():
rng = __import__("random").Random(self.random_seed)This works but is unusual. A normal import random at the top of the file would be cleaner and more readable.
6. voice_design entries in test_tts.json are missing extra_body for task_type
The voice_clone entries don't need extra_body because the dataset itself provides ref_audio/ref_text. The default_voice entries correctly include extra_body: {"voice": "Vivian", ...}. But the voice_design perf entry has no extra_body -- is the task_type: "VoiceDesign" already handled by the dataset class? If yes, this is fine. If the server needs task_type in the request body for routing, this could cause 400 errors at runtime. Worth a note or confirming in the test plan.
Overall
The framework design is solid and the consolidation is clearly needed. The main concern is the "instructions" vs "voice_description" key mismatch (point 1) which appears to be a real test bug. The rest are minor improvements. After addressing point 1, this should be good to go.
a0d0dda to
e42f3ad
Compare
|
@lishunyang12 @hsliuustc0106 — addressed the review with some extra H20-sourced work. Push tip is Review items
Extras surfaced by H20 benchmarkingRan voice_clone / voice_design concurrency sweeps on H20-3e 141GB and found: Live-traffic bug: Fix:
Concurrency-cliff regression coverage: the old
Both show a clean 4-6× TTFP jump from c=4 to c=8 and throughput saturating at c=4-8, which is the same pattern the NVIDIA-fork comparison doc flags. Replaced the single
Thresholds padded ~2× from the H20 means. Additional polish
|
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Adds SeedTTSTextDataset (CLI name: seed-tts-text) and SeedTTSTextSampleRequest. Loads the same meta.lst as SeedTTSDataset but omits ref_audio/ref_text from the request body; voice is supplied via --extra-body in the benchmark config. Sets seed_tts_ref_wav_path="" so SIM is automatically skipped in seed_tts_eval.py. WER and UTMOS still work normally. Also adds tests/benchmarks/conftest.py with lightweight vllm stubs and the corresponding unit test. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…clean up test stubs Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…s-text/design; update CLI choices Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…es task/enabled/eval_phase metadata Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…tadata tests Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…l registry Replaces per-model benchmark scripts with a single model-agnostic CLI that reads model_configs.yaml to dispatch vllm bench serve --omni with correct flags for any registered TTS model (Qwen3-TTS, VoxCPM2, and future models). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…y_is_skipped Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…(Qwen3-TTS + VoxCPM2, 3 tasks) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… param Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…n voice_design perf entry Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Remove benchmarks/qwen3-tts/ and benchmarks/voxcpm/ which are superseded by the new universal benchmarks/tts/ framework. Apply ruff format/check fixes to all PR-touched files. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Restore two useful tools from the removed per-model benchmarks: - plot_results.py: updated to read vllm bench serve JSON keys (mean_audio_rtf, mean_audio_ttfp_ms, _task, _concurrency). Generates 4-panel bar charts (TTFP, E2EL, RTF, throughput) per task, with optional multi-run comparison and markdown table output. - bench_voxcpm_offline.py: offline VoxCPM benchmark using Omni/AsyncOmni.generate directly; supports sync and streaming modes, txt/jsonl batch input, voice cloning, and torch profiling. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…nch serve Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The server API (serving_speech.py) validates the 'instructions' field for VoiceDesign requests. The benchmark dataset was incorrectly sending the value under 'voice_description', causing all voice_design benchmark requests to fail with a 400 error. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- P1: align voice-design code/docstring/test on `instructions` key. The
prior fix changed the code but left the docstring + unit test asserting
`voice_description`, which made the test fail.
- P2: remove duplicate stage configs under benchmarks/tts/stage_configs/;
point bench_tts.py default to tests/dfx/perf/stage_configs/ so DFX
nightly + the CLI share a single source of truth.
- P3: replace fragile parents[2] in bench_voxcpm_offline.py with a
marker-based repo-root walker (pyproject.toml + vllm_omni/).
- P4: actual result JSON key is mean_e2el_ms, not mean_e2el. Fix both
references in plot_results.py so the E2E column no longer silently
renders NaN.
- P5: drop inline `__import__("random")` in favour of the module-level
import that already exists.
Also persist `_task`/`_concurrency` metadata into the saved result JSON
from bench_tts.py so plot_results.py can build the per-concurrency
comparison tables (previously the augmentation happened in-memory only).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…lity regimes
Concurrency sweeps on H20 (Qwen3-TTS-1.7B-Base voice_clone and
Qwen3-TTS-1.7B-CustomVoice voice_design) show a sharp TTFP cliff at
max_concurrency>=8 — TTFP jumps 4-6x from c=4 to c=8 while audio
throughput saturates. The prior `perf` entries only exercised c=1 and
c=4, sitting below the cliff, so codec-batching regressions are invisible
to DFX nightly.
Replace the single `perf` phase with two targeted regimes plus the
existing `quality` phase:
- latency (c=1) — tight TTFP + RTF bounds for single-request SLO
- throughput (c=8) — loose TTFP ceiling + throughput/RTF floor that
collapses if the codec stays batch_size=1
- quality (c=4) — unchanged, WER/SIM/UTMOS eval
Thresholds come from H20 sweeps on 1.7B-Base (voice_clone) and
1.7B-CustomVoice (voice_design), padded ~2x:
voice_clone c=1: RTF 0.153 TTFP 165ms c=8: RTF 0.493 TTFP 1701ms
voice_design c=1: RTF 0.083 TTFP 53ms c=8: RTF 0.21 TTFP 872ms
Also fix a live-traffic bug surfaced during the sweep: the
-CustomVoice checkpoints don't ship speaker_encoder weights, so
voice_clone requests crashed with
`ValueError: This checkpoint does not provide speaker_encoder weights`.
Drop voice_clone from CustomVoice's supported_tasks in
model_configs.yaml and split the DFX suite so voice_clone runs under
-Base and default_voice/voice_design under -CustomVoice.
Test plan: H20 voice_clone sweep (Qwen3-TTS-1.7B-Base) and voice_design
sweep (Qwen3-TTS-1.7B-CustomVoice) validated the thresholds are reachable
with >40% headroom on the latency entry and TTFP/throughput both fall
within the throughput bounds at c=8.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…oject#2383 vllm-project#2383 replaced the per-model stage_configs/*.yaml layout with auto-loaded vllm_omni/deploy/<model>.yaml (Pipeline in Python, Deploy in YAML) and switched the DFX runner's config-loading dir from stage_configs/ to deploy/. This PR's test matrix and bench CLI still carried the old references: - test_tts.json: drop `stage_config_name` from the Qwen3-TTS entries; vllm-omni now auto-loads vllm_omni/deploy/qwen3_tts.yaml for both Base and CustomVoice checkpoints. - model_configs.yaml: drop the `stage_config` field — the bench CLI does not reference it and auto-discovery handles pipeline lookup. - bench_tts.py: remove the dead `--stage-configs-dir` flag and the `_DEFAULT_STAGE_CONFIGS_DIR` constant; both were unused and pointed at a directory vllm-project#2383 deleted. - Delete tests/dfx/perf/stage_configs/voxcpm2.yaml — the directory no longer exists post-vllm-project#2383. VoxCPM2 is not yet migrated to the Pipeline + Deploy schema in vllm-project#2383 (only qwen2_5_omni / qwen3_omni / qwen3_tts ship pipeline.py + deploy YAML) and still loads via the legacy `ModelPipeline` path. Drop the test_voxcpm2 entry from test_tts.json to unblock DFX nightly; will re-add as a follow-up once VoxCPM2 gets its deploy YAML. The latency / throughput / quality baselines remain unchanged — they come from H20 sweeps on stable checkpoints and should still hold under the new deploy YAML (stage 0 now sets max_num_seqs=10 and async_scheduling=true, which can only improve throughput numbers). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
e42f3ad to
2ddf765
Compare
|
Rebased on top of #2383 (merged) + migrated to the Pipeline + Deploy schema ( Post-#2383 adaptations
H20 smoke after migrationBooted
Baselines unchanged — full sweep recalibration can land in a follow-up if DFX nightly flags drift. CI after push: pre-commit / DCO / build(3.11) / build(3.12) all green. |
…ries Buildkite `🌕 TTS · Perf Test` failed with `completed == 0` on six benchmark entries because the seed-tts / seed-tts-text datasets they reference are not staged in the CI image and `snapshot_download` has no way to pull the Google-Drive-hosted seed-tts-eval archive. Two-part fix: 1. Bundle `benchmarks/build_dataset/seed_tts_smoke/en/meta.lst` — a 20-row seed-tts-compatible meta file with target_text only (no WAVs). `SeedTTSTextDataset` (used by default_voice) does not touch the wav column, so this is enough to exercise the full server path in CI. All entries are short, varied English sentences suitable for TTS smoke testing. 2. Point the `default_voice` benchmark entries at this bundled path and disable the three `voice_clone` entries with `enabled: false` — voice_clone needs real reference WAVs the bundled smoke set deliberately omits. The `voice_design` entries are unchanged; they were already using a bundled dataset and passing in the failing Buildkite run. Also disable the `default_voice` quality entry: WER evaluation requires real seed-tts-eval text (which we deliberately did not bundle — 20 rows × 4 CV folds would give an unreliable WER signal). Perf/throughput entries still exercise the codec-bs cliff on the bundled smoke set. H20 smoke: `bench serve --backend openai-audio-speech --dataset-name seed-tts-text --dataset-path benchmarks/build_dataset/seed_tts_smoke` returned `Successful requests: 5` with audio throughput 11.18 s/s — no more zero-completion failures. Re-enabling the seed-tts-eval entries will be a follow-up once the dataset is staged in the CI container (or made available via an HF mirror we can snapshot_download). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
|
@yenuo26 @amy-why-3459 @congw729 ptal, thanks! |
… regimes build #7249 showed default_voice c=1 latency mean_audio_ttfp_ms=2301 ms vs a 150 ms baseline — but p50 was 62 ms and only one cold-start outlier (first measured request after server warmup, p99 ~36 s) dragged the mean up. Latency and throughput regimes care about typical request behaviour, not cold-start tails, so switch their baselines from `mean_audio_*` to `median_audio_*`. The quality entries (WER-driven) still use mean since they aggregate over 200 prompts where single-request outliers don't matter. Applied to both qwen3_tts_base and qwen3_tts_customvoice: * latency → median_audio_ttfp_ms / median_audio_rtf * throughput → median_audio_ttfp_ms / median_audio_rtf * quality → unchanged (mean_audio_rtf) Baseline values unchanged; only the metric aggregation switched. Expected effect on build #7249 data: default_voice c=1 p50 TTFP 62 ms <= 150 ms ✅ default_voice c=8 p50 TTFP 230 ms <= 1500 ms ✅ Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
| @@ -0,0 +1,308 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
is there a readme.md under tts/
There was a problem hiding this comment.
Added benchmarks/tts/README.md in 798cea7e. One-page reference covering quick-start recipes (smoke / concurrency sweep / --wer-eval), the three task types and which checkpoints support each (incl. the -CustomVoice no-speaker_encoder gotcha), how to register a new TTS model via model_configs.yaml, the bundled-vs-external dataset matrix, DFX nightly wiring (latency / throughput / quality regimes), and an H20 concurrency-cliff reference table. Links to #2558 and #2383 for context.
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four scripts + a YAML registry with zero docs, leaving users to reverse-engineer the CLI from `--help` output. Add a single-page README covering: - quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS) - plot_results.py usage - the three task types and which checkpoints support each (notes that -CustomVoice lacks speaker_encoder so voice_clone is Base-only) - model_configs.yaml extension recipe for new TTS models - dataset matrix (bundled seed_tts_design / seed_tts_smoke, external seed-tts-eval with link to the download guide) - DFX nightly integration: latency / throughput / quality regimes, median-vs-mean baseline choice, quality-entry gating rationale - observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel) - file layout + cross-references to vllm-project#2558 and vllm-project#2383 Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
|
The new Other contents LGTM. |
Reviewer feedback on PR vllm-project#2835: the throughput-phase entries only gated median_audio_ttfp_ms and median_audio_rtf, so the regime was really a high-concurrency latency check — a real throughput collapse (e.g. codec batch size regressing back to 1, or scheduler starvation) would leave TTFP/RTF within bounds while audio_throughput cratered, and CI would miss it. Add an `audio_throughput` baseline to every throughput entry. The runner inverts the comparison for any metric name containing "throughput" (run_benchmark.py:287-292), so these values act as FLOORS: if the observed audio-seconds-per-wall-second drops below the baseline, the runner prints the soft-warning ERROR. Floors (audio-s per wall-s, >= baseline required): voice_clone c=8 (1.7B-Base) 10.0 (measured ~15 on H20) default_voice c=8 (1.7B-CustomVoice) 30.0 (measured ~47 on H100, ~35 H20) voice_design c=8 (1.7B-CustomVoice) 25.0 (measured ~43 on H100, ~34 H20) Values set ~30% below the lower of the two observed environments so CI flags real regressions, not noise. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
|
|
||
| ```bash | ||
| # Smallest smoke — 5 prompts, concurrency=1 | ||
| python benchmarks/tts/bench_tts.py \ |
There was a problem hiding this comment.
Can we provide the vllm bench command in the README file? This would make it easier for users to modify the vllm bench parameters.
There was a problem hiding this comment.
Added in eb30f559. New section "Raw vllm bench serve commands" with full copy-paste invocations for each of the three task types (voice_clone, default_voice, voice_design) showing the full flag set — --host, --port, --model, --dataset-name, --dataset-path, --seed-tts-locale, --num-prompts, --num-warmups, --extra-body, --max-concurrency, --request-rate, --percentile-metrics, --save-result. Plus a short note on appending --seed-tts-wer-eval for WER/SIM/UTMOS. Users can now tweak bench flags without reading through bench_tts.py.
Reviewer feedback on PR vllm-project#2835: the README's quick-start only showed the `bench_tts.py` wrapper, which hides the underlying `vllm bench serve --omni` invocation. Users wanting to tweak individual bench flags (sampling params, endpoint, `--extra-body`, warmups, etc.) had to read bench_tts.py source to find out what the wrapper emits. Add a "Raw `vllm bench serve` commands" section with the full copy-paste invocation for each of the three task types — voice_clone (Qwen3-TTS-Base, seed-tts), default_voice (Qwen3-TTS-CustomVoice, bundled smoke), and voice_design (Qwen3-TTS-CustomVoice, bundled design) — plus a short note on enabling `--seed-tts-wer-eval` for WER/SIM/UTMOS. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
please also paste the benchmark summaries ttfp/ttft/tpot/rtf standard outputs in test results
| Outputs TTFP / RTF / throughput curves (and a markdown table) for every | ||
| `(task, concurrency)` combination in the result set. | ||
|
|
||
| ## Raw `vllm bench serve` commands |
There was a problem hiding this comment.
I think we can move this section to the top as the first option for users
Reviewer feedback on PR vllm-project#2835: users should see the raw `vllm bench serve --omni` invocation as the first option, not as an afterthought buried below the `bench_tts.py` wrapper. Restructure the README so the Quick Start flow is: 1. Start the server 2. Run the benchmark via `vllm bench serve --omni` (3 task examples + WER) 3. Convenience wrapper via `bench_tts.py` 4. Plot the sweep The wrapper section now explains that it is exactly the raw command with model-aware defaults plugged in, and documents which flags come from `model_configs.yaml` vs. fixed defaults — so users who outgrow the wrapper know exactly what to swap. Also remove the now-duplicate "Raw vllm bench serve commands" section that was appended in an earlier commit. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
H20 benchmark summaries (per review feedback)All runs on H20-3e 141GB, voice_clone — Qwen/Qwen3-TTS-12Hz-1.7B-Base, seed-ttsconcurrency = 1concurrency = 4concurrency = 8 (cliff)concurrency = 32voice_design — Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, seed-tts-designconcurrency = 1concurrency = 4concurrency = 8 (cliff)concurrency = 32Notes
|
|
@hsliuustc0106 is this OK to merge? |
|
Could you add a nightly-test for the benchmark? |
| import json | ||
|
|
||
|
|
||
| def test_task_excluded_from_cli_args(): |
There was a problem hiding this comment.
Addressed in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] so the file is picked up by the nightly core_model and cpu selector.
| import json | ||
| import sys | ||
| from pathlib import Path | ||
| from unittest.mock import patch |
There was a problem hiding this comment.
please don't use unittest.mock, pytest mock instead
There was a problem hiding this comment.
Done in 8f06be05. Dropped from unittest.mock import patch; test_unsupported_task_exits now takes mocker as a fixture and calls mocker.patch.object(sys, "argv", [...]) instead of the with patch.object(...) context manager.
| return p | ||
|
|
||
|
|
||
| def test_load_model_configs(model_configs_path: Path) -> None: |
There was a problem hiding this comment.
please add mark, like pytest.mark.core_model, pytest.mark.cpu
There was a problem hiding this comment.
Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_bench_tts_cli.py.
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def test_seed_tts_text_dataset_omits_ref_audio(seed_tts_root): |
There was a problem hiding this comment.
please add mark, like pytest.mark.core_model, pytest.mark.cpu
There was a problem hiding this comment.
Done in 8f06be05. Added module-level pytestmark = [pytest.mark.core_model, pytest.mark.cpu] to tests/benchmarks/test_seed_tts_dataset_variants.py.
| import importlib.util | ||
| import sys | ||
| from pathlib import Path | ||
| from unittest.mock import MagicMock |
There was a problem hiding this comment.
please don't use unittest.mock, pytest mock instead
There was a problem hiding this comment.
Done in 8f06be05. Dropped from unittest.mock import MagicMock; the three dataset-class tests (test_seed_tts_text_dataset_omits_ref_audio, test_seed_tts_design_dataset_has_instructions, test_seed_tts_design_dataset_rejects_missing_description) now take mocker as a fixture and use mocker.MagicMock().
I think tts-test and omni-test are enough. |
|
Could you add an L4 test case? |
Reviewer feedback from @yenuo26 on PR vllm-project#2835: 1. tests/dfx/perf/tests/test_runner_metadata.py — needs pytest marks 2. tests/benchmarks/test_bench_tts_cli.py — swap unittest.mock.patch for the pytest-mock `mocker` fixture; also missing marks 3. tests/benchmarks/test_seed_tts_dataset_variants.py — swap unittest.mock.MagicMock for `mocker.MagicMock`; also missing marks Applied module-level `pytestmark = [pytest.mark.core_model, pytest.mark.cpu]` to all three files so they run under the nightly `core_model and cpu` pytest selector (matches the existing repo convention in tests/dfx/perf/tests/test_qwen_omni.json's `run_benchmark.py` fixture path). Converted: - `from unittest.mock import patch` → `mocker.patch.object(...)` (test_bench_tts_cli.py::test_unsupported_task_exits) - `from unittest.mock import MagicMock` + `tokenizer = MagicMock()` → `tokenizer = mocker.MagicMock()` with `mocker` injected via fixture (test_seed_tts_dataset_variants.py, three tests) H20 smoke: `pytest tests/benchmarks/test_bench_tts_cli.py tests/benchmarks/test_seed_tts_dataset_variants.py tests/dfx/perf/tests/test_runner_metadata.py -m "core_model and cpu"` → 10/11 pass. The 1 remaining failure (`test_attach_sets_seed_tts_row_even_without_extra_body`) is a pre-existing `ModuleNotFoundError: No module named 'vllm.benchmarks.lib'` from a stale vllm import path unrelated to this refactor. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
|
@hsliuustc0106 PTAL |
hsliuustc0106
left a comment
There was a problem hiding this comment.
lgtm, update the qwen3-tts recipe later, thank you

Summary
seed_tts_dataset.pywithSeedTTSTextDataset(default voice) andSeedTTSDesignDataset(voice design)seed_tts_rowflag so WER/UTMOS PCM capture runs for all task types, not just voice clonebench_tts.py— a model-agnostic CLI backed bymodel_configs.yamlregistry; adding a new TTS model requires only a YAML entry, no code changesplot_results.py— bar-chart visualization (TTFP / E2EL / RTF / throughput) for comparing runs or task typesbench_voxcpm_offline.py— offline VoxCPM benchmark using Omni/AsyncOmni directly (sync + streaming, voice cloning, torch profiling)Metric coverage per task
SIM is skipped for tasks without a reference audio (empty seed_tts_ref_wav_path → sim_skipped_no_ref counter in seed_tts_eval.py).
Models
New files
Removed files
Alignment with sglang-omni
Uses the same seed-tts-eval dataset as sglang-omni. Adds TTFP and SIM/UTMOS which sglang-omni does not currently track.
Test plan