
[Benchmark] Omni-modality model accuracy benchmark (Daily-Omni & seed-tts-eval) #2558

Merged
ZeldaHuang merged 1 commit into vllm-project:main from amy-why-3459:benchmark
Apr 15, 2026

Conversation

amy-why-3459 (Contributor) commented Apr 7, 2026

Purpose

#2284
Summary
This pull request extends vllm bench serve --omni so users can measure end-to-end serving performance together with accuracy-style quality metrics for two public omni benchmarks:

Daily-Omni — video (+ optional audio) understanding with multiple-choice QA; the benchmark reports overall accuracy and breakdowns by QA type, YouTube category, and clip duration, on top of the standard latency / throughput tables from vllm bench serve.
Seed-TTS / seed-tts-eval — zero-shot speech generation via openai-chat-omni with text+audio modalities; after capturing streamed PCM from the server, the client runs an evaluation stack aligned with Bytedance’s seed-tts-eval protocol, including WER (ASR-based content error), SIM (reference vs. synthesized embedding similarity), and UTMOS (predicted naturalness), plus optional JSON export of per-utterance rows.
Together, this gives a repeatable recipe to regression-test omni models (e.g. Qwen3-Omni) for vision-language accuracy and spoken-output quality under realistic concurrent load.
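
For orientation, here is a minimal Python sketch of the three Seed-TTS quality metrics as computed from captured audio. The checkpoints mirror the eval logs in the test results (openai/whisper-large-v3 for ASR, microsoft/wavlm-base-plus for SIM, balacoon/utmos for MOS), but the exact normalization, batching, and UTMOS invocation in seed_tts_eval.py may differ; treat this as illustrative rather than the PR's code.

import torch
import torch.nn.functional as F
import jiwer
from huggingface_hub import hf_hub_download
from transformers import (
    AutoFeatureExtractor,
    WavLMModel,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

SR = 16_000  # Whisper and WavLM expect 16 kHz mono float32 in [-1, 1]

proc = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
utmos = torch.jit.load(hf_hub_download("balacoon/utmos", "utmos.jit"))
# NOTE: the forward signature of the UTMOS TorchScript model is not shown here;
# consult seed_tts_eval.py for how it is actually invoked.

def wer(ref_text: str, wav: torch.Tensor) -> float:
    """ASR the synthesized audio, then word error rate vs. the target text."""
    feats = proc(wav.numpy(), sampling_rate=SR, return_tensors="pt").input_features
    ids = asr.generate(input_features=feats)
    hyp = proc.batch_decode(ids, skip_special_tokens=True)[0]
    return jiwer.wer(ref_text.lower().strip(), hyp.lower().strip())

@torch.no_grad()
def sim(ref_wav: torch.Tensor, syn_wav: torch.Tensor) -> float:
    """Cosine similarity of mean-pooled WavLM embeddings (reference vs. synthesized)."""
    def embed(wav: torch.Tensor) -> torch.Tensor:
        inputs = fe(wav.numpy(), sampling_rate=SR, return_tensors="pt")
        return wavlm(**inputs).last_hidden_state.mean(dim=1)
    return F.cosine_similarity(embed(ref_wav), embed(syn_wav)).item()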

Tracks: #2284 (and related RFC / benchmark layout discussions linked from the PR thread).

Motivation
Throughput-only benchmarks do not catch wrong answers on video QA or unintelligible / mismatched TTS. Daily-Omni and seed-tts-eval are widely used external benchmarks; wiring them into the same CLI as latency benchmarking lowers the barrier to “perf + quality” runs in CI or manual release validation.

What changed (high level)
Dataset loaders & request shaping for --dataset-name daily-omni and --dataset-name seed-tts (HF or local layout, inline media options where applicable).
Monkey-patch / hook into the Omni benchmark path so get_samples and request construction stay consistent with openai-chat-omni (and related backends); see the sketch after this list.
PCM capture on the benchmark client for Seed-TTS WER / SIM / UTMOS when SEED_TTS_WER_EVAL=1 (or --seed-tts-wer-eval) is enabled.
Optional extra pip install 'vllm-omni[seed-tts-eval]' for ASR / jiwer / FunASR / etc., as documented in the eval module.
Review-driven fixes (see conversation on the PR): e.g. Daily-Omni user prompt string concatenation / spacing, Seed-TTS SIM and UTMOS on top of WER, removal of dead helpers, unified trailing separators in printed summaries.
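
The patching pattern referenced above is sketched below. The module path and the build_omni_samples helper are hypothetical stand-ins, not the PR's actual symbols; the real patch lives in vllm_omni.benchmarks.patch.patch and is imported before serve, as the review thread shows.

# Minimal sketch of the patch-before-import pattern; module path and
# build_omni_samples are hypothetical stand-ins, not the PR's real names.
import vllm.benchmarks.serve as serve_mod


def build_omni_samples(args, tokenizer):
    """Hypothetical omni-aware loader for daily-omni / seed-tts requests."""
    raise NotImplementedError


_original_get_samples = serve_mod.get_samples


def _patched_get_samples(args, tokenizer):
    # Route the new omni datasets through the omni-aware loader;
    # fall back to upstream behavior for everything else.
    if getattr(args, "dataset_name", None) in ("daily-omni", "seed-tts"):
        return build_omni_samples(args, tokenizer)
    return _original_get_samples(args, tokenizer)


serve_mod.get_samples = _patched_get_samples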
How to run
Daily-Omni (example — adjust paths, model, and concurrency):

vllm bench serve --omni \
  --dataset-name daily-omni \
  --daily-omni-inline-local-video \
  --daily-omni-input-mode all \
  --daily-omni-video-dir ./Videos \
  --daily-omni-qa-json ./qa.json \
  --model <Qwen3-Omni-or-compatible> \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --extra_body '{"modalities": ["text"]}' \

...
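
For a rough picture of how the MCQ accuracy gets scored, here is a hedged Python sketch; the answer-extraction regex and denominators are assumptions for illustration, not the PR's exact code (the summary in the test results even tracks a separate "Parsed OK but no A–D found" bucket).

import re

def extract_choice(reply: str) -> str | None:
    """Take the first standalone A-D letter in the model's reply as its answer."""
    m = re.search(r"\b([A-D])\b", reply)
    return m.group(1) if m else None

replies = ["The answer is B.", "C", "I think (A) fits best."]
golds = ["B", "C", "D"]
correct = sum(extract_choice(r) == g for r, g in zip(replies, golds))
print(f"Overall Accuracy: {correct}/{len(golds)} = {correct / len(golds):.2%}")  # 2/3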
Seed-TTS (enable eval + optional device; then same bench serve flow with text+audio):

export SEED_TTS_WER_EVAL=1
# optional: export SEED_TTS_EVAL_DEVICE=cuda:0

vllm bench serve --omni \
  --dataset-name seed-tts \
  --dataset-path ./seed-tts-eval \
  --backend openai-chat-omni \
  --endpoint /v1/chat/completions \
  --extra_body '{"modalities": ["text", "audio"]}' \

...
(Full flags and numbers match the Test Plan section below.)
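
Before the eval stack runs, the captured PCM stream has to become a float waveform at the rate the eval models expect. A minimal sketch, assuming 16-bit little-endian mono PCM: the Qwen3-Omni run below implies roughly 24 kHz output (143793240 frames over 5991.7 s), while Whisper and WavLM consume 16 kHz.

import numpy as np
import torch
import torchaudio.functional as AF

def pcm16_to_wav(pcm_bytes: bytes, src_sr: int = 24_000, dst_sr: int = 16_000) -> torch.Tensor:
    # int16 little-endian -> float32 in [-1, 1], then resample for the eval models
    x = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32) / 32768.0
    return AF.resample(torch.from_numpy(x), orig_freq=src_sr, new_freq=dst_sr)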

Sample results (from the PR description)
Daily-Omni: e.g. ~69.8% overall MCQ accuracy on 1197 successful requests (with per-type / per-category / per-duration tables).
Seed-TTS: e.g. 1088 utterances evaluated; mean WER ~0.247, mean SIM ~0.832, mean UTMOS ~3.37, with zero request, PCM-capture, or ASR failures in the reported run.
These are single-environment snapshots for reviewers; absolute numbers will vary with model, checkpoint, concurrency, and dataset split.

Testing
Manual vllm bench serve --omni runs for both datasets (commands and logs are attached in the PR).
Automated tests for the new loaders are deferred to a follow-up PR (as discussed with reviewers) to keep this change set focused.

Test Plan

Daily-Omni:

vllm bench serve \
  --omni \
  --port 28889 \
  --max-concurrency 10 \
  --dataset-name daily-omni \
  --daily-omni-inline-local-video \
  --num-prompts 2000 \
  --no-oversample \
  --daily-omni-input-mode all \
  --daily-omni-video-dir ./Videos \
  --daily-omni-qa-json ./qa.json \
  --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --extra_body '{"modalities": ["text"]}'

seed-tts-eval:

export SEED_TTS_WER_EVAL=1
export SEED_TTS_EVAL_DEVICE=cuda:2

vllm bench serve \
  --omni \
  --port 28889 \
  --max-concurrency 10 \
  --dataset-name seed-tts \
  --dataset-path ./seed-tts-eval \
  --num-prompts 2000 \
  --no-oversample \
  --seed-tts-wer-save-items \
  --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --extra_body '{"modalities": ["text", "audio"]}'
vllm bench serve \
  --omni \
  --port 28889 \
  --max-concurrency 10 \
  --dataset-name seed-tts \
  --dataset-path ./seed-tts-data \
  --num-prompts 2000 \
  --no-oversample \
  --seed-tts-wer-save-items \
  --model /home/models/Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --endpoint /v1/audio/speech \
  --backend openai-audio-speech \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf

Test Result

Daily-Omni:

============ Serving Benchmark Result ============
Successful requests:                     1197
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  1214.67
Request throughput (req/s):              0.99
Peak concurrent requests:                20.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          10119.84
Median E2EL (ms):                        10051.72
P99 E2EL (ms):                           13070.71
================== Text Result ===================
Total input tokens:                      4943640
Total generated tokens:                  1197
Output token throughput (tok/s):         0.99
Peak output token throughput (tok/s):    28.00
Peak concurrent requests:                20.00
Total Token throughput (tok/s):          4070.92
---------------Time to First Token----------------
Mean TTFT (ms):                          6816.26
Median TTFT (ms):                        6808.12
P99 TTFT (ms):                           10710.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           1651.78
Median ITL (ms):                         73.59
P99 ITL (ms):                            4495.12
================== Audio Result ==================
Total audio duration generated(s):       0.00
Total audio frames generated:            0
Audio throughput(audio duration/s):      0.00
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    0.00
Median AUDIO_TTFP (ms):                  0.00
P99 AUDIO_TTFP (ms):                     0.00
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.00
Median AUDIO_RTF:                        0.00
P99 AUDIO_RTF:                           0.00
==================================================
=========== Daily-Omni accuracy (MCQ) ============
Overall Accuracy: 836/1197 = 69.84%
Submitted (gold present):                1197
Successful HTTP (GitHub denom.):         1197
Correct:                                 836
Accuracy (ratio, same as above):         0.6984
Skipped (no gold):                       0
HTTP failed (excl. from GitHub acc.):    0
Parsed OK but no A–D found:              0

--- Accuracy by QA Type ---
AV Event Alignment: 144/238 = 60.50%
Comparative: 108/131 = 82.44%
Context understanding: 123/193 = 63.73%
Event Sequence: 191/306 = 62.42%
Inference: 130/154 = 84.42%
Reasoning: 140/175 = 80.00%

--- Accuracy by Video Category ---
Autos & Vehicles: 17/29 = 58.62%
Comedy: 15/25 = 60.00%
Education: 128/167 = 76.65%
Entertainment: 152/217 = 70.05%
Film & Animation: 29/54 = 53.70%
Gaming: 33/41 = 80.49%
Howto & Style: 73/107 = 68.22%
Music: 28/42 = 66.67%
News & Politics: 52/82 = 63.41%
Nonprofits & Activism: 9/14 = 64.29%
People & Blogs: 109/150 = 72.67%
Pets & Animals: 22/35 = 62.86%
Science & Technology: 69/89 = 77.53%
Sports: 82/122 = 67.21%
Travel & Events: 18/23 = 78.26%

--- Accuracy by Video Duration ---
30s Duration: 466/647 = 72.02%
60s Duration: 370/550 = 67.27%
==================================================

seed-tts-eval:
Qwen3-Omni:

tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     1088
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  549.65
Request throughput (req/s):              1.98
Peak concurrent requests:                16.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4660.31
Median E2EL (ms):                        4369.81
P99 E2EL (ms):                           7774.38
================== Text Result ===================
Total input tokens:                      29618
Total generated tokens:                  19414
Output token throughput (tok/s):         35.32
Peak output token throughput (tok/s):    216.00
Peak concurrent requests:                16.00
Total Token throughput (tok/s):          89.21
---------------Time to First Token----------------
Mean TTFT (ms):                          1333.74
Median TTFT (ms):                        1149.85
P99 TTFT (ms):                           3350.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.83
Median TPOT (ms):                        40.55
P99 TPOT (ms):                           86.95
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.74
Median ITL (ms):                         27.22
P99 ITL (ms):                            172.39
================== Audio Result ==================
Total audio duration generated(s):       5991.70
Total audio frames generated:            143793240
Audio throughput(audio duration/s):      10.90
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    2836.88
Median AUDIO_TTFP (ms):                  2717.88
P99 AUDIO_TTFP (ms):                     4467.19
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.99
Median AUDIO_RTF:                        0.94
P99 AUDIO_RTF:                           1.93
==================================================
WARNING 04-14 05:26:30 [seed_tts_eval.py:275] Loading UTMOS TorchScript from Hugging Face 'balacoon/utmos' file 'utmos.jit' (one-time download/cache)...
WARNING 04-14 05:26:43 [seed_tts_eval.py:358] Loading Seed-TTS eval Whisper HF model 'openai/whisper-large-v3' on cuda:0 (one-time, seed-tts-eval protocol)...
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
WARNING 04-14 05:26:59 [seed_tts_eval.py:191] Loading WavLM 'microsoft/wavlm-base-plus' on cuda:0 for Seed-TTS SIM (embedding cosine; not identical to seed-tts-eval UniSpeech SV checkpoint).
===== Seed-TTS eval (seed-tts-eval protocol) =====
Evaluated (WER, lower is better):        1088
Mean WER:                                0.2471
Median WER:                              0.0000
Request failed:                          0
No PCM captured:                         0
ASR / WER failed:                        0
SIM evaluated (higher ~ closer):         1088
Mean SIM:                                0.8320
Median SIM:                              0.8377
SIM skipped (no ref path):               0
SIM embedding errors:                    0
UTMOS evaluated (JIT MOS, higher better): 1088
Mean UTMOS:                              3.3652
Median UTMOS:                            3.3755
UTMOS errors:                            0
==================================================

Qwen3-TTS:


============ Serving Benchmark Result ============
Successful requests:                     1088
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  617.74
Request throughput (req/s):              1.76
Peak concurrent requests:                15.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          5660.06
Median E2EL (ms):                        5490.80
P99 E2EL (ms):                           10185.65
================== Text Result ===================
Total input tokens:                      143858
Total generated tokens:                  0
Output token throughput (tok/s):         0.00
Peak output token throughput (tok/s):    7.00
Peak concurrent requests:                15.00
Total Token throughput (tok/s):          232.88
---------------Time to First Token----------------
Mean TTFT (ms):                          1635.47
Median TTFT (ms):                        1726.09
P99 TTFT (ms):                           2466.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
================== Audio Result ==================
Total audio duration generated(s):       4502.88
Total audio frames generated:            108069120
Audio throughput(audio duration/s):      7.29
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    1635.47
Median AUDIO_TTFP (ms):                  1726.09
P99 AUDIO_TTFP (ms):                     2466.38
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          1.37
Median AUDIO_RTF:                        1.35
P99 AUDIO_RTF:                           1.71
==================================================
WARNING 04-15 01:35:59 [seed_tts_eval.py:275] Loading UTMOS TorchScript from Hugging Face 'balacoon/utmos' file 'utmos.jit' (one-time download/cache)...
WARNING 04-15 01:36:08 [seed_tts_eval.py:358] Loading Seed-TTS eval Whisper HF model 'openai/whisper-large-v3' on cuda:7 (one-time, seed-tts-eval protocol)...
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
WARNING 04-15 01:36:18 [seed_tts_eval.py:191] Loading WavLM 'microsoft/wavlm-base-plus' on cuda:7 for Seed-TTS SIM (embedding cosine; not identical to seed-tts-eval UniSpeech SV checkpoint).
===== Seed-TTS eval (seed-tts-eval protocol) =====
Evaluated (WER, lower is better):        1088
Mean WER:                                0.0163
Median WER:                              0.0000
Request failed:                          0
No PCM captured:                         0
ASR / WER failed:                        0
SIM evaluated (higher ~ closer):         1088
Mean SIM:                                0.8753
Median SIM:                              0.8800
SIM skipped (no ref path):               0
SIM embedding errors:                    0
UTMOS evaluated (JIT MOS, higher better): 1088
Mean UTMOS:                              3.3592
Median UTMOS:                            3.3714
UTMOS errors:                            0
==================================================


@chatgpt-codex-connector
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

amy-why-3459 force-pushed the benchmark branch 10 times, most recently from 80c1e74 to a77f2fb on April 8, 2026 14:22
amy-why-3459 changed the title from "[WIP][Benchmark][Qwen3-Omni] Benchmark supports the daily-omni dataset." to "[Benchmark][Qwen3-Omni] Benchmark supports the daily-omni dataset." Apr 8, 2026
amy-why-3459 changed the title from "[Benchmark][Qwen3-Omni] Benchmark supports the daily-omni dataset." to "[Benchmark]Benchmark supports the daily-omni dataset." Apr 8, 2026
amy-why-3459 changed the title from "[Benchmark]Benchmark supports the daily-omni dataset." to "[Benchmark]Omni-modality model accuracy benchmark(Daily-Omni & seed-tts-eval)" Apr 8, 2026
amy-why-3459 (Contributor, Author)

@ZeldaHuang @yenuo26 @Sy0307 PTAL

linyueqian (Collaborator) left a comment

Thanks for the work on this. A few comments:

Bug: missing spaces in prompt template

daily_omni_dataset.py -- _official_daily_omni_user_prompt uses implicit string concatenation without spaces/newlines between sentences. E.g. "...based on the {media_desc}.""Select the single..." becomes "...media_desc.Select the single...". This likely hurts MCQ accuracy. Each continuation line needs a leading space or \n.
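
For illustration (strings abbreviated):

# Adjacent string literals concatenate with no separator:
prompt = (
    "Answer based on the video."
    "Select the single best answer."
)
assert prompt == "Answer based on the video.Select the single best answer."

# Fix: end each fragment with an explicit space or newline.
prompt = (
    "Answer based on the video. "
    "Select the single best answer."
)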

Seed-TTS: SIM and UTMOS metrics missing

The official seed-tts-eval protocol reports both WER and SIM (speaker similarity via WavLM cosine). This PR only implements WER, which validates intelligibility but not voice clone quality. Since the PCM capture path is already in place, it would be straightforward to add:

  1. SIM: cosine similarity of WavLM embeddings between reference and synthesized audio (the other half of the official protocol)
  2. UTMOS: predicted MOS score for naturalness (lightweight, no human eval needed)

Both can reuse the same tts_output_pcm_bytes already captured for WER. Could these be included in this PR, or at least tracked as a follow-up issue?

Minor cleanup

  1. extend_dataset_choices() in patch.py is a no-op (pass body), should be removed.
  2. _choices_repr_for_official_prompt calls str() on a list, producing Python repr like "['A. foo', 'B. bar']" with brackets. Is this intentional to match upstream?
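
For reference:

choices = ["A. foo", "B. bar"]
print(str(choices))        # ['A. foo', 'B. bar']  (Python repr, with brackets/quotes)
print("\n".join(choices))  # A. foo / B. bar on separate lines (what a prompt usually wants)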


from vllm.benchmarks.serve import add_cli_args

import vllm_omni.benchmarks.patch.patch # noqa: F401 — patch get_samples before serve import
Collaborator:
The same code exists in vllm_omni/entrypoints/cli/__init__.py; maybe this isn't needed?

amy-why-3459 (Contributor, Author):

The same code exists in vllm_omni/entrypoints/cli/__init__.py; maybe this isn't needed?

fixed

"matching the timbre, prosody, and speaking style of the reference audio while reading the new content clearly."
)


Collaborator:

Why not add test cases at the same time?

amy-why-3459 (Contributor, Author):

Why not add test cases at the same time?

Perhaps we can add test cases in the next PR.

yenuo26 (Collaborator) commented Apr 9, 2026

Maybe you can add "==================================================" at the end of the results to unify the output format.

yenuo26 (Collaborator) commented Apr 9, 2026

Should we print both performance and accuracy results? Perhaps printing only the accuracy results during accuracy testing would be more intuitive?

amy-why-3459 (Contributor, Author)

Thanks for the work on this. A few comments: missing spaces in the prompt template; Seed-TTS SIM and UTMOS metrics missing; minor cleanup.

Okay, fixed.

amy-why-3459 (Contributor, Author)

Maybe you can add "==================================================" at the end of the results to unify the output format.

Okay, fixed.

amy-why-3459 force-pushed the benchmark branch 7 times, most recently from b264f62 to eedd04b on April 14, 2026 13:21
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
amy-why-3459 (Contributor, Author)

@hsliuustc0106 @ZeldaHuang @linyueqian @Sy0307 @yenuo26 @gcanlin @Gaohan123 I've updated the test data for Qwen3-TTS, and I think this PR is ready. Please take a look.

amy-why-3459 (Contributor, Author)

In the next PR, I will add this benchmark accuracy test to the nightly tests.

ZeldaHuang added the ready label (triggers Buildkite CI) Apr 15, 2026
ZeldaHuang enabled auto-merge (squash) April 15, 2026 03:07
ZeldaHuang merged commit 227bab3 into vllm-project:main Apr 15, 2026
8 checks passed
Sy0307 (Contributor) commented Apr 15, 2026

Sorry for the late reply. This PR LGTM. BTW, note that when testing on the server side we should specify 'uni' instead of 'mp' to get a more accurate RTF; it appears the current RTF testing for qwen3-tts is based on 'mp'. I know this config doesn't belong to this PR, just mentioning it.

amy-why-3459 (Contributor, Author)

Thank you for your review. The config file in #2383 is currently being refactored; to avoid modifying too many YAML files, we will wait for that PR to be merged before changing the default configuration. However, uni might not be suitable for TP scenarios, so please keep following our updates @gcanlin.

Sy0307 (Contributor) commented Apr 15, 2026

However, uni might not be suitable for TP scenarios, so please keep following our updates @gcanlin.

Yes, the uni config setting should only be enabled for TTS-series models. Thanks for the nice work.

y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026
…ts-eval) (vllm-project#2558)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 20, 2026
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four
scripts + a YAML registry with zero docs, leaving users to reverse-engineer
the CLI from `--help` output.  Add a single-page README covering:

- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that
  -CustomVoice lacks speaker_encoder so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external
  seed-tts-eval with link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes,
  median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…ts-eval) (vllm-project#2558)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…ts-eval) (vllm-project#2558)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…ts-eval) (vllm-project#2558)

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>