[Benchmark] Omni-modality model accuracy benchmark (Daily-Omni & seed-tts-eval) #2558
Conversation
Force-pushed 80c1e74 to a77f2fb
@ZeldaHuang @yenuo26 @Sy0307 PTAL
linyueqian left a comment:
Thanks for the work on this. A few comments:
Bug: missing spaces in prompt template
daily_omni_dataset.py -- `_official_daily_omni_user_prompt` uses implicit string concatenation without spaces/newlines between sentences, e.g. the adjacent literals `"...based on the {media_desc}."` and `"Select the single..."` join as `"...media_desc.Select the single..."`. This likely hurts MCQ accuracy; each continuation line needs a leading space or `\n`.
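For illustration, the failure mode and the fix, using placeholder strings rather than the PR's actual template:

```python
# Python implicitly concatenates adjacent string literals with no separator:
prompt = (
    "Answer the question based on the video."  # no trailing space
    "Select the single best option."
)
print(prompt)  # "...based on the video.Select the single best option."

# Fix: end each fragment with a space (or "\n") before the continuation:
prompt = (
    "Answer the question based on the video. "
    "Select the single best option."
)
```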
Seed-TTS: SIM and UTMOS metrics missing
The official seed-tts-eval protocol reports both WER and SIM (speaker similarity via WavLM cosine). This PR only implements WER, which validates intelligibility but not voice clone quality. Since the PCM capture path is already in place, it would be straightforward to add:
- SIM: cosine similarity of WavLM embeddings between reference and synthesized audio (the other half of the official protocol)
- UTMOS: predicted MOS score for naturalness (lightweight, no human eval needed)
Both can reuse the same tts_output_pcm_bytes already captured for WER. Could these be included in this PR, or at least tracked as a follow-up issue?
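For concreteness, a minimal SIM sketch, assuming the public `microsoft/wavlm-base-plus-sv` x-vector checkpoint from Hugging Face; the official seed-tts-eval tooling uses its own WavLM-large verification model, so treat this as an approximation, not the protocol's exact scorer:

```python
# Hedged sketch: cosine similarity of speaker embeddings between the
# reference clip and the synthesized clip. The checkpoint name is an
# assumption; seed-tts-eval ships its own WavLM-large verification model.
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def speaker_sim(ref_wav, syn_wav, sr=16000):
    """ref_wav / syn_wav: 1-D float waveforms at 16 kHz."""
    inputs = extractor([ref_wav, syn_wav], sampling_rate=sr,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model(**inputs).embeddings  # (2, 512) x-vectors
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```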
Minor cleanup
`extend_dataset_choices()` in `patch.py` is a no-op (`pass` body) and should be removed. `_choices_repr_for_official_prompt` calls `str()` on a list, producing a Python repr like `"['A. foo', 'B. bar']"` with brackets. Is this intentional, to match upstream?
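For reference, the plain-Python behavior behind that question:

```python
choices = ["A. foo", "B. bar"]
str(choices)        # "['A. foo', 'B. bar']"  (repr with brackets and quotes)
"\n".join(choices)  # 'A. foo\nB. bar'        (one option per line)
```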
Force-pushed 0357e75 to 81e8227
Comment on the lines:
`from vllm.benchmarks.serve import add_cli_args`
`import vllm_omni.benchmarks.patch.patch  # noqa: F401 — patch get_samples before serve import`
The same code exists in `vllm_omni/entrypoints/cli/__init__.py`; maybe this isn't needed here?
fixed
| "matching the timbre, prosody, and speaking style of the reference audio while reading the new content clearly." | ||
| ) | ||
|
|
||
|
|
Why not add test cases at the same time?
Perhaps we can add test cases in the next PR.
Maybe you can add "==================================================" at the end of the results to unify the format of the results.
Should we print both performance and accuracy results? Perhaps printing only the accuracy results during accuracy testing would be more intuitive?
Okay, fixed.
Force-pushed b264f62 to eedd04b
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
@hsliuustc0106 @ZeldaHuang @linyueqian @Sy0307 @yenuo26 @gcanlin @Gaohan123 I've updated the test data for Qwen3-TTS, and I think this PR is ready. Please take a look.
In the next PR, I will add this benchmark accuracy test to the nightly-test.
Sorry for the late reply. This PR LGTM overall. BTW, note that when testing on the server side, we should specify 'uni' instead of 'mp' to get a more accurate RTF. It appears that the current RTF testing for qwen3-tts is based on 'mp'. I know this config does not belong to this PR; just mentioning it.
Thank you for your review. The config file at #2383 is currently being refactored. To avoid modifying too many YAML files, we will wait for this PR to be merged before modifying the default configuration. However, 'uni' might not be suitable for TP scenarios, so please continue to follow our updates, @gcanlin.
Yes, this 'uni' config setting should only be enabled for TTS-series models. Thanks for the nice work.
…ts-eval) (vllm-project#2558) Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four scripts + a YAML registry with zero docs, leaving users to reverse-engineer the CLI from `--help` output. Add a single-page README covering:
- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that -CustomVoice lacks speaker_encoder, so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external seed-tts-eval with a link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes, median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Purpose
#2284
Summary
This pull request extends vllm bench serve --omni so users can measure end-to-end serving performance together with accuracy-style quality metrics for two public omni benchmarks:
Daily-Omni — video (+ optional audio) understanding with multiple-choice QA; the benchmark reports overall accuracy and breakdowns by QA type, YouTube category, and clip duration, on top of the standard latency / throughput tables from vllm bench serve. (A minimal aggregation sketch follows this summary.)
Seed-TTS / seed-tts-eval — zero-shot speech generation via openai-chat-omni with text+audio modalities; after capturing streamed PCM from the server, the client runs an evaluation stack aligned with Bytedance’s seed-tts-eval protocol, including WER (ASR-based content error), SIM (reference vs. synthesized embedding similarity), and UTMOS (predicted naturalness), plus optional JSON export of per-utterance rows.
Together, this gives a repeatable recipe to regression-test omni models (e.g. Qwen3-Omni) for vision-language accuracy and spoken-output quality under realistic concurrent load.
Tracks: #2284 (and related RFC / benchmark layout discussions linked from the PR thread).
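To make the Daily-Omni breakdown tables concrete, here is a minimal sketch of the per-group MCQ accuracy aggregation they imply; the field names (`qa_type`, `predicted`, `answer`) are illustrative, not the PR's actual schema:

```python
# Hypothetical sketch: group per-request rows by a key (QA type, category,
# or duration bucket) and compute accuracy per group.
from collections import defaultdict

def accuracy_breakdown(rows, key):
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        hits[row[key]] += int(row["predicted"] == row["answer"])
    return {group: hits[group] / totals[group] for group in totals}

rows = [
    {"qa_type": "audio", "predicted": "A", "answer": "A"},
    {"qa_type": "visual", "predicted": "B", "answer": "C"},
]
print(accuracy_breakdown(rows, "qa_type"))  # {'audio': 1.0, 'visual': 0.0}
```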
Motivation
Throughput-only benchmarks do not catch wrong answers on video QA or unintelligible / mismatched TTS. Daily-Omni and seed-tts-eval are widely used external benchmarks; wiring them into the same CLI as latency benchmarking lowers the barrier to “perf + quality” runs in CI or manual release validation.
What changed (high level)
Dataset loaders & request shaping for --dataset-name daily-omni and --dataset-name seed-tts (HF or local layout, inline media options where applicable).
Monkey-patch / hook into the Omni benchmark path so get_samples and request construction stay consistent with openai-chat-omni (and related backends).
PCM capture on the benchmark client for Seed-TTS WER / SIM / UTMOS when SEED_TTS_WER_EVAL=1 (or --seed-tts-wer-eval) is enabled; a minimal WER sketch follows this list.
Optional extra pip install 'vllm-omni[seed-tts-eval]' for ASR / jiwer / FunASR / etc., as documented in the eval module.
Review-driven fixes (see conversation on the PR): e.g. Daily-Omni user prompt string concatenation / spacing, Seed-TTS SIM and UTMOS on top of WER, removal of dead helpers, unified trailing separators in printed summaries.
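A hedged sketch of the WER leg only: the env-var gate matches this PR's SEED_TTS_WER_EVAL flag, but the ASR checkpoint (`openai/whisper-small`) is an assumption; the PR's actual eval stack uses the seed-tts-eval ASR models installed via the optional extra.

```python
# Sketch, not the PR's implementation: transcribe the captured PCM with a
# generic ASR model, then score against the reference text with jiwer.
import os

import jiwer
import numpy as np
from transformers import pipeline

if os.environ.get("SEED_TTS_WER_EVAL") == "1":
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",  # assumption; real stack differs
        device=os.environ.get("SEED_TTS_EVAL_DEVICE", "cpu"),
    )

    def utterance_wer(pcm_bytes: bytes, reference_text: str, sr: int = 16000) -> float:
        """Score one utterance: 16-bit little-endian PCM vs. its reference text."""
        audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        hypothesis = asr({"array": audio, "sampling_rate": sr})["text"]
        return jiwer.wer(reference_text, hypothesis)
```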
How to run
Daily-Omni (example — adjust paths, model, and concurrency):
...
Seed-TTS (enable eval + optional device; then same bench serve flow with text+audio):
export SEED_TTS_WER_EVAL=1
optional: export SEED_TTS_EVAL_DEVICE=cuda:0
...
(Full flags and numbers match the Test Plan section already pasted in the PR.)
Sample results (from the PR description)
Daily-Omni: e.g. ~69.8% overall MCQ accuracy on 1197 successful requests (with per-type / per-category / per-duration tables).
Seed-TTS: e.g. 1088 utterances evaluated; mean WER ~0.247, mean SIM ~0.832, mean UTMOS ~3.37, with zero request, PCM-capture, or ASR failures in the reported run.
These are single-environment snapshots for reviewers; absolute numbers will vary with model, checkpoint, concurrency, and dataset split.
Testing
Manual vllm bench serve --omni runs for both datasets (commands and logs are attached in the PR).
Automated tests for the new loaders are deferred to a follow-up PR (as discussed with reviewers) to keep this change set focused.
Test Plan
Daily-Omni:
seed-tts-eval:
export SEED_TTS_WER_EVAL=1
export SEED_TTS_EVAL_DEVICE=cuda:2
Test Result
Daily-Omni:
seed-tts-eval:
Qwen3-Omni:
Qwen3-TTS: