[Tests][Qwen3-Omni] Add performance test cases #2011
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d57d468cd
```python
if param_index >= len(all_params):
    raise ValueError(f"No benchmark parameters found for index {param_index} in test: {test_name}")

if all_params[param_index]["dataset_name"] == "random-mm":
    # TODO: Due to known issues, skip the random-mm dataset.
    pytest.skip("Skipping parameter for random-mm dataset.")

current = param_index + 1
total = len(all_params)
```
Keep random-mm perf cases skipped until batched mix input works
Removing the random-mm skip here makes the perf suite execute the mixed text+audio+image+video benchmark rows from tests/perf/tests/test.json (num_prompts 10/40, max_concurrency 1/4) for both Qwen3-Omni configs. That scenario is still explicitly disabled in tests/e2e/online_serving/test_qwen3_omni_expansion.py:366-395 with a “known issue with shape mismatch error”, while only the single-request mix case remains enabled in tests/e2e/online_serving/test_qwen3_omni.py:96-127. So this change reintroduces a known batched mixed-modality failure into nightly perf runs rather than just adding coverage.
Force-pushed bd78742 to 3f59f03
| "mean_ttft_ms": 100000, | ||
| "mean_audio_ttfp_ms": 100000, | ||
| "mean_audio_rtf": 100000 | ||
| "mean_ttft_ms": [100, 300], |
Please update the doc as well: the test example in docs/contributing/ci/CI_5levels.md, Section 3.4.
| "baseline": { | ||
| "mean_ttft_ms": [100000, 100000], | ||
| "mean_audio_ttfp_ms": [100000, 100000], | ||
| "mean_audio_rtf": [100000, 100000] |
These baselines are all 100000 — effectively no-op assertions that will never fail. The whole point of this PR is replacing the old placeholder baselines with real values; these new test cases should get real thresholds too (or at least a TODO comment explaining why they're deferred).
Currently, there is no baseline value for the long input. We need to wait for this use case to be submitted before determining the correct baseline value.
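One interim option, offered only as a sketch and not what this PR does: mark the deferred entries explicitly (e.g. JSON null) rather than with a huge placeholder such as 100000, so they read as "not yet measured" and can be skipped by the guard discussed further down this thread.

```python
import json

# Illustrative only: a deferred long-input baseline marked as null instead of 100000.
# json.loads turns null into Python None, which a caller-side guard can then skip.
deferred = json.loads('{"mean_ttft_ms": null, "mean_audio_ttfp_ms": null, "mean_audio_rtf": null}')
assert deferred["mean_ttft_ms"] is None  # i.e. no real threshold has been set yet
```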
```python
    For dict lookup, ``max_concurrency`` is preferred when both are set (concurrency sweep).
    """
    if baseline_raw is None:
        return None
```
If baseline_raw is None, this returns None, and then assert_result will hit current_value >= None / current_value <= None → TypeError. Either skip the metric here or guard in the caller:
```python
if baseline_raw is None:
    return None
```
→ in `assert_result`:

```python
if baseline_value is None:
    continue
```
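To make the suggestion concrete, here is a minimal, self-contained sketch of both pieces together; the names `resolve_baseline` and `assert_result`, the signatures, and the upper-bound-only comparison are illustrative and may not match the PR's actual helpers:

```python
from typing import Any, Optional, Union

Number = Union[int, float]


def resolve_baseline(baseline_raw: Any,
                     sweep_index: Optional[int],
                     max_concurrency: Optional[int]) -> Optional[Number]:
    """Illustrative resolver: a baseline entry may be a scalar, a list indexed
    by sweep position, or a dict keyed by concurrency (as a string)."""
    if baseline_raw is None:
        return None                          # metric has no baseline configured
    if isinstance(baseline_raw, dict):
        return baseline_raw.get(str(max_concurrency))
    if isinstance(baseline_raw, (list, tuple)):
        if sweep_index is None:
            return None
        return baseline_raw[sweep_index]
    return baseline_raw                      # plain scalar


def assert_result(metrics: dict,
                  baselines: dict,
                  sweep_index: Optional[int],
                  max_concurrency: Optional[int]) -> None:
    """Illustrative caller-side guard: skip metrics whose baseline is unset
    instead of comparing a number against None (which raises TypeError)."""
    for name, current_value in metrics.items():
        baseline_value = resolve_baseline(baselines.get(name),
                                          sweep_index, max_concurrency)
        if baseline_value is None:
            continue
        assert current_value <= baseline_value, (
            f"{name}={current_value} exceeds baseline {baseline_value}")
```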
```diff
@@ -196,10 +196,6 @@ def benchmark_params(request, omni_server):
     if param_index >= len(all_params):
```
The skip was added with a "TODO: Due to known issues" comment. Is the underlying issue actually resolved, or does this just remove the guard? It would be good to link the fix if there is one.
```python
if isinstance(baseline_raw, (list, tuple)):
    if sweep_index is None:
        raise ValueError("baseline list requires sweep_index when asserting")
    if sweep_index < 0 or sweep_index >= len(baseline_raw):
```
Nit: this bounds check is redundant — Python's list[index] already raises IndexError. If you want a custom message, fine, but then also validate the list length matches the sweep length upfront in assert_result so misconfigured JSON fails fast before any benchmarks run.
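To make the fail-fast idea concrete, a hedged sketch of what such an upfront check might look like (the function name and where it would be called from are illustrative):

```python
def validate_baselines(baselines: dict, sweep_length: int) -> None:
    """Illustrative pre-flight check: fail before any benchmark runs if a
    list-style baseline does not have one entry per sweep point."""
    for name, baseline_raw in baselines.items():
        if isinstance(baseline_raw, (list, tuple)) and len(baseline_raw) != sweep_length:
            raise ValueError(
                f"baseline '{name}' has {len(baseline_raw)} entries, "
                f"expected {sweep_length} (one per sweep point)")
```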
| "num_prompts": [ | ||
| 4, | ||
| 16 | ||
| ], |
Why don't we set num_prompts to be the same as the short input/output case, using 10 and 40? Or change the short input/output case to 4, 16 to maintain consistency.
| "num_prompts": [ | ||
| 4, | ||
| 16 | ||
| ], |
There was a problem hiding this comment.
why don't we set num_prompts to be the same as the short input/output case, using 10 and 40? Or change short input/output case to 4, 16 to maintain consistency.
Long-input requests take too long to test; if the settings were the same as for short inputs, the test cases might time out.
```json
"baseline": {
    "mean_ttft_ms": [100, 300],
    "mean_audio_ttfp_ms": {"1": 500, "4": 1000},
```
Why is the example different from test.json?
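For reference, the two quoted snippets suggest two accepted baseline shapes, a list indexed by sweep position and a dict keyed by concurrency; a purely illustrative comparison of the lookups (values are invented, not real thresholds from test.json or the docs example):

```python
# Illustrative values only, not real thresholds.
list_style = [100, 300]             # one entry per sweep position (requires sweep_index)
dict_style = {"1": 500, "4": 1000}  # keyed by max_concurrency (string keys)

sweep_index, max_concurrency = 1, 4
assert list_style[sweep_index] == 300
assert dict_style[str(max_concurrency)] == 1000
```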
Force-pushed d91ff7e to 12e97d3
Please add a nightly-test label. @congw729 @Gaohan123 @gcanlin
Force-pushed 3d1333d to eb308e3
@yenuo26 @gcanlin @hsliuustc0106 PTAL, this PR is ready and can be merged.
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Purpose
Add performance test cases for Qwen3-Omni.
Test Plan
pytest -sv run_benchmark.py

Test Result
The test case failed when performance data exceeded the baseline.