[Benchmark] Omni-modality model accuracy benchmark (Daily-Omni & seed-tts-eval) #2558
Conversation
Force-pushed 80c1e74 to a77f2fb
@ZeldaHuang @yenuo26 @Sy0307 PTAL
linyueqian left a comment:
Thanks for the work on this. A few comments:
Bug: missing spaces in prompt template
daily_omni_dataset.py -- `_official_daily_omni_user_prompt` uses implicit string concatenation without spaces/newlines between sentences, e.g. the adjacent literals `"...based on the {media_desc}."` and `"Select the single..."` join as `"...media_desc.Select the single..."`. This likely hurts MCQ accuracy; each continuation line needs a leading space or `\n`.
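For illustration, the failure mode and the fix, using placeholder strings rather than the PR's actual template:

```python
# Python implicitly concatenates adjacent string literals with no separator:
prompt = (
    "Answer the question based on the video."  # no trailing space
    "Select the single best option."
)
print(prompt)  # "...based on the video.Select the single best option."

# Fix: end each fragment with a space (or "\n") before the continuation:
prompt = (
    "Answer the question based on the video. "
    "Select the single best option."
)
```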
Seed-TTS: SIM and UTMOS metrics missing
The official seed-tts-eval protocol reports both WER and SIM (speaker similarity via WavLM cosine). This PR only implements WER, which validates intelligibility but not voice clone quality. Since the PCM capture path is already in place, it would be straightforward to add:
- SIM: cosine similarity of WavLM embeddings between reference and synthesized audio (the other half of the official protocol)
- UTMOS: predicted MOS score for naturalness (lightweight, no human eval needed)
Both can reuse the same tts_output_pcm_bytes already captured for WER. Could these be included in this PR, or at least tracked as a follow-up issue?
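For concreteness, a minimal SIM sketch, assuming the public `microsoft/wavlm-base-plus-sv` x-vector checkpoint from Hugging Face; the official seed-tts-eval tooling uses its own WavLM-large verification model, so treat this as an approximation, not the protocol's exact scorer:

```python
# Hedged sketch: cosine similarity of speaker embeddings between the
# reference clip and the synthesized clip. The checkpoint name is an
# assumption; seed-tts-eval ships its own WavLM-large verification model.
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def speaker_sim(ref_wav, syn_wav, sr=16000):
    """ref_wav / syn_wav: 1-D float waveforms at 16 kHz."""
    inputs = extractor([ref_wav, syn_wav], sampling_rate=sr,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model(**inputs).embeddings  # (2, 512) x-vectors
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```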
Minor cleanup
`extend_dataset_choices()` in `patch.py` is a no-op (`pass` body) and should be removed. `_choices_repr_for_official_prompt` calls `str()` on a list, producing a Python repr like `"['A. foo', 'B. bar']"` with brackets. Is this intentional, to match upstream?
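For reference, the plain-Python behavior behind that question:

```python
choices = ["A. foo", "B. bar"]
str(choices)        # "['A. foo', 'B. bar']"  (repr with brackets and quotes)
"\n".join(choices)  # 'A. foo\nB. bar'        (one option per line)
```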
Force-pushed 0357e75 to 81e8227
Comment on the lines:
`from vllm.benchmarks.serve import add_cli_args`
`import vllm_omni.benchmarks.patch.patch  # noqa: F401 — patch get_samples before serve import`
The same code exists in `vllm_omni/entrypoints/cli/__init__.py`; maybe this isn't needed here?
fixed
| "matching the timbre, prosody, and speaking style of the reference audio while reading the new content clearly." | ||
| ) | ||
|
|
||
|
|
Why not add test cases at the same time?
Perhaps we can add test cases in the next PR.
Maybe you can add "==================================================" at the end of the results to unify the format of the results.
Should we print both performance and accuracy results? Perhaps printing only the accuracy results during accuracy testing would be more intuitive?
Okay, fixed.
Force-pushed b264f62 to eedd04b
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
@hsliuustc0106 @ZeldaHuang @linyueqian @Sy0307 @yenuo26 @gcanlin @Gaohan123 I've updated the test data for Qwen3-TTS, and I think this PR is ready. Please take a look.
In the next PR, I will add this benchmark accuracy test to the nightly-test.
Sorry for the late reply. This PR LGTM overall. BTW, note that when testing on the server side, we should specify 'uni' instead of 'mp' to get a more accurate RTF. It appears that the current RTF testing for qwen3-tts is based on 'mp'. I know this config does not belong to this PR; just mentioning it.
Thank you for your review. The config file at #2383 is currently being refactored. To avoid modifying too many YAML files, we will wait for this PR to be merged before modifying the default configuration. However, 'uni' might not be suitable for TP scenarios, so please continue to follow our updates, @gcanlin.
Yes, this 'uni' config setting should only be enabled for TTS-series models. Thanks for the nice work.
…ts-eval) (vllm-project#2558) Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Address review comment on PR vllm-project#2835 — `benchmarks/tts/` shipped four scripts + a YAML registry with zero docs, leaving users to reverse-engineer the CLI from `--help` output. Add a single-page README covering:
- quick-start recipes (smoke, concurrency sweep, WER/SIM/UTMOS)
- plot_results.py usage
- the three task types and which checkpoints support each (notes that -CustomVoice lacks speaker_encoder, so voice_clone is Base-only)
- model_configs.yaml extension recipe for new TTS models
- dataset matrix (bundled seed_tts_design / seed_tts_smoke, external seed-tts-eval with a link to the download guide)
- DFX nightly integration: latency / throughput / quality regimes, median-vs-mean baseline choice, quality-entry gating rationale
- observed H20 concurrency-cliff reference table (RFC vllm-project#272 sentinel)
- file layout + cross-references to vllm-project#2558 and vllm-project#2383

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Purpose
#2284
Summary
This pull request extends vllm bench serve --omni so users can measure end-to-end serving performance together with accuracy-style quality metrics for two public omni benchmarks:
Daily-Omni — video (+ optional audio) understanding with multiple-choice QA; the benchmark reports overall accuracy and breakdowns by QA type, YouTube category, and clip duration, on top of the standard latency / throughput tables from vllm bench serve. (A minimal aggregation sketch follows this summary.)
Seed-TTS / seed-tts-eval — zero-shot speech generation via openai-chat-omni with text+audio modalities; after capturing streamed PCM from the server, the client runs an evaluation stack aligned with Bytedance’s seed-tts-eval protocol, including WER (ASR-based content error), SIM (reference vs. synthesized embedding similarity), and UTMOS (predicted naturalness), plus optional JSON export of per-utterance rows.
Together, this gives a repeatable recipe to regression-test omni models (e.g. Qwen3-Omni) for vision-language accuracy and spoken-output quality under realistic concurrent load.
Tracks: #2284 (and related RFC / benchmark layout discussions linked from the PR thread).
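To make the Daily-Omni breakdown tables concrete, here is a minimal sketch of the per-group MCQ accuracy aggregation they imply; the field names (`qa_type`, `predicted`, `answer`) are illustrative, not the PR's actual schema:

```python
# Hypothetical sketch: group per-request rows by a key (QA type, category,
# or duration bucket) and compute accuracy per group.
from collections import defaultdict

def accuracy_breakdown(rows, key):
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        hits[row[key]] += int(row["predicted"] == row["answer"])
    return {group: hits[group] / totals[group] for group in totals}

rows = [
    {"qa_type": "audio", "predicted": "A", "answer": "A"},
    {"qa_type": "visual", "predicted": "B", "answer": "C"},
]
print(accuracy_breakdown(rows, "qa_type"))  # {'audio': 1.0, 'visual': 0.0}
```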
Motivation
Throughput-only benchmarks do not catch wrong answers on video QA or unintelligible / mismatched TTS. Daily-Omni and seed-tts-eval are widely used external benchmarks; wiring them into the same CLI as latency benchmarking lowers the barrier to “perf + quality” runs in CI or manual release validation.
What changed (high level)
Dataset loaders & request shaping for --dataset-name daily-omni and --dataset-name seed-tts (HF or local layout, inline media options where applicable).
Monkey-patch / hook into the Omni benchmark path so get_samples and request construction stay consistent with openai-chat-omni (and related backends).
PCM capture on the benchmark client for Seed-TTS WER / SIM / UTMOS when SEED_TTS_WER_EVAL=1 (or --seed-tts-wer-eval) is enabled; a minimal WER sketch follows this list.
Optional extra pip install 'vllm-omni[seed-tts-eval]' for ASR / jiwer / FunASR / etc., as documented in the eval module.
Review-driven fixes (see conversation on the PR): e.g. Daily-Omni user prompt string concatenation / spacing, Seed-TTS SIM and UTMOS on top of WER, removal of dead helpers, unified trailing separators in printed summaries.
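A hedged sketch of the WER leg only: the env-var gate matches this PR's SEED_TTS_WER_EVAL flag, but the ASR checkpoint (`openai/whisper-small`) is an assumption; the PR's actual eval stack uses the seed-tts-eval ASR models installed via the optional extra.

```python
# Sketch, not the PR's implementation: transcribe the captured PCM with a
# generic ASR model, then score against the reference text with jiwer.
import os

import jiwer
import numpy as np
from transformers import pipeline

if os.environ.get("SEED_TTS_WER_EVAL") == "1":
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",  # assumption; real stack differs
        device=os.environ.get("SEED_TTS_EVAL_DEVICE", "cpu"),
    )

    def utterance_wer(pcm_bytes: bytes, reference_text: str, sr: int = 16000) -> float:
        """Score one utterance: 16-bit little-endian PCM vs. its reference text."""
        audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        hypothesis = asr({"array": audio, "sampling_rate": sr})["text"]
        return jiwer.wer(reference_text, hypothesis)
```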
How to run
Daily-Omni (example — adjust paths, model, and concurrency):
...
Seed-TTS (enable eval + optional device; then same bench serve flow with text+audio):
export SEED_TTS_WER_EVAL=1
optional: export SEED_TTS_EVAL_DEVICE=cuda:0
...
(Full flags and numbers match the Test Plan section already pasted in the PR.)
Sample results (from the PR description)
Daily-Omni: e.g. ~69.8% overall MCQ accuracy on 1197 successful requests (with per-type / per-category / per-duration tables).
Seed-TTS: e.g. 1088 utterances evaluated; mean WER ~0.247, mean SIM ~0.832, mean UTMOS ~3.37, with zero request, PCM-capture, or ASR failures in the reported run.
These are single-environment snapshots for reviewers; absolute numbers will vary with model, checkpoint, concurrency, and dataset split.
Testing
Manual vllm bench serve --omni runs for both datasets (commands and logs are attached in the PR).
Automated tests for the new loaders are deferred to a follow-up PR (as discussed with reviewers) to keep this change set focused.
Test Plan
Daily-Omni:
seed-tts-eval:
export SEED_TTS_WER_EVAL=1
export SEED_TTS_EVAL_DEVICE=cuda:2
Test Result
Daily-Omni:
seed-tts-eval:
Qwen3-Omni:
Qwen3-TTS: