
[Tests][Qwen3-Omni] Add performance test cases #2011

Merged
gcanlin merged 1 commit into vllm-project:main from amy-why-3459:perf
Mar 28, 2026

Conversation

@amy-why-3459
Contributor

@amy-why-3459 amy-why-3459 commented Mar 19, 2026


Purpose

  1. Add long-input/long-output performance test cases.
  2. Restore the random-mm performance test cases, per [Bug]: Qwen3-omni, input(video+image+audio+text), output(text+audio): after sending requests with concurrency 5, a "shape mismatch" error is reported. #1447
  3. Replace the placeholder baselines with actual measured values.

Test Plan

pytest -sv run_benchmark.py

Test Result

============================================================================== 6 passed, 6 skipped, 17 warnings in 2944.81s (0:49:04) ==============================================================================
Running benchmark 1/3 for test_qwen3_omni_chunk
Running benchmark for model: /home/models/Qwen3-Omni-30B-A3B-Instruct
Benchmark parameters: {'test_name': 'test_qwen3_omni_chunk', 'params': {'dataset_name': 'random', 'backend': 'openai-chat-omni', 'endpoint': '/v1/chat/completions', 'num_prompts': [10, 40], 'max_concurrency': [1,4], 'random_input_len': 100, 'random_output_len': 100, 'ignore_eos': True, 'percentile-metrics': 'ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration', 'baseline': {'mean_ttft_ms': [100, 300], 'mean_audio_ttfp_ms': [500, 1000], 'mean_audio_rtf': [0.2, 0.25]}}}

A test case fails when the measured performance exceeds its baseline, for example:

FAILED run_benchmark.py::test_performance_benchmark[benchmark_params0-omni_server0] - AssertionError: mean_audio_ttfp_ms: 548.9104939973913 > 500
FAILED run_benchmark.py::test_performance_benchmark[benchmark_params1-omni_server0] - AssertionError: mean_ttft_ms: 392.6402267999947 > 100
=============================================================================== 2 failed, 1 passed, 17 warnings in 1215.92s (0:20:15)
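For reference, the `num_prompts`, `max_concurrency`, and baseline lists above are paired positionally: sweep 0 runs (10 prompts, concurrency 1) against the first entry of each baseline list, and sweep 1 runs (40, 4) against the second. A minimal sketch of that pairing (illustrative only, not the PR's actual harness code):

```python
# Hypothetical sketch: sweep lists are paired by position, and each
# baseline list supplies one threshold per sweep position.
num_prompts = [10, 40]
max_concurrency = [1, 4]
baseline = {"mean_ttft_ms": [100, 300], "mean_audio_rtf": [0.2, 0.25]}

sweeps = list(zip(num_prompts, max_concurrency))
for i, (n, c) in enumerate(sweeps):
    thresholds = {name: values[i] for name, values in baseline.items()}
    print(f"sweep {i}: num_prompts={n}, max_concurrency={c}, thresholds={thresholds}")
```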

@amy-why-3459 amy-why-3459 changed the title [Tests] Add performance test cases [Tests][Qwen3-Omni] Add performance test cases Mar 19, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d57d468cd


Comment on lines 196 to 200
if param_index >= len(all_params):
    raise ValueError(f"No benchmark parameters found for index {param_index} in test: {test_name}")

if all_params[param_index]["dataset_name"] == "random-mm":
    # TODO: Due to known issues, skip the random-mm dataset.
    pytest.skip("Skipping parameter for random-mm dataset.")

current = param_index + 1
total = len(all_params)


P1: Keep random-mm perf cases skipped until batched mixed input works

Removing the random-mm skip here makes the perf suite execute the mixed text+audio+image+video benchmark rows from tests/perf/tests/test.json (num_prompts 10/40, max_concurrency 1/4) for both Qwen3-Omni configs. That scenario is still explicitly disabled in tests/e2e/online_serving/test_qwen3_omni_expansion.py:366-395 with a “known issue with shape mismatch error”, while only the single-request mix case remains enabled in tests/e2e/online_serving/test_qwen3_omni.py:96-127. So this change reintroduces a known batched mixed-modality failure into nightly perf runs rather than just adding coverage.


@amy-why-3459 amy-why-3459 force-pushed the perf branch 4 times, most recently from bd78742 to 3f59f03 Compare March 20, 2026 08:17
@amy-why-3459
Contributor Author

@yenuo26 @congw729 PTAL

Comment thread tests/perf/tests/test.json Outdated
- "mean_ttft_ms": 100000,
- "mean_audio_ttfp_ms": 100000,
- "mean_audio_rtf": 100000
+ "mean_ttft_ms": [100, 300],
Collaborator


Please also update the doc: the test example in docs/contributing/ci/CI_5levels.md, Section 3.4.

Contributor Author


done

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Mar 20, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment


a few comments.

Comment thread tests/dfx/perf/tests/test.json Outdated
"baseline": {
    "mean_ttft_ms": [100000, 100000],
    "mean_audio_ttfp_ms": [100000, 100000],
    "mean_audio_rtf": [100000, 100000]
Collaborator


These baselines are all 100000 — effectively no-op assertions that will never fail. The whole point of this PR is replacing the old placeholder baselines with real values; these new test cases should get real thresholds too (or at least a TODO comment explaining why they're deferred).

Contributor Author


Currently there is no baseline value for the long-input case. We need to wait for this case to land and run before we can determine correct baseline values.

Comment thread tests/perf/scripts/run_benchmark.py Outdated
    For dict lookup, ``max_concurrency`` is preferred when both are set (concurrency sweep).
    """
    if baseline_raw is None:
        return None
Collaborator


If baseline_raw is None, this returns None, and assert_result will then raise a TypeError when it evaluates current_value >= None or current_value <= None. Either skip the metric here or guard in the caller:

Suggested change
-    return None
+    if baseline_raw is None:
+        return None

→ in assert_result:

        if baseline_value is None:
            continue
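The guard the reviewer suggests can be sketched end to end; the function names mirror the review comment, but this implementation is illustrative, not the PR's actual code:

```python
def resolve_baseline(baseline_raw, sweep_index=None):
    # A missing baseline resolves to None rather than a fake threshold.
    if baseline_raw is None:
        return None
    if isinstance(baseline_raw, (list, tuple)):
        return baseline_raw[sweep_index]
    return baseline_raw

def assert_result(metrics, baselines, sweep_index):
    for name, current_value in metrics.items():
        baseline_value = resolve_baseline(baselines.get(name), sweep_index)
        if baseline_value is None:
            continue  # no baseline configured: skip instead of comparing to None
        assert current_value <= baseline_value, (
            f"{name}: {current_value} > {baseline_value}"
        )

# A metric without a configured baseline no longer raises a TypeError:
assert_result({"mean_ttft_ms": 95.0, "new_metric": 123.0},
              {"mean_ttft_ms": [100, 300]}, sweep_index=0)
```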

@@ -196,10 +196,6 @@ def benchmark_params(request, omni_server):
    if param_index >= len(all_params):
Collaborator


The skip was added with TODO: Due to known issues. Is the underlying issue actually resolved, or is this just removing the guard? Would be good to link the fix if there is one.

Comment thread tests/perf/scripts/run_benchmark.py Outdated
if isinstance(baseline_raw, (list, tuple)):
    if sweep_index is None:
        raise ValueError("baseline list requires sweep_index when asserting")
    if sweep_index < 0 or sweep_index >= len(baseline_raw):
Collaborator


Nit: this bounds check is redundant — Python's list[index] already raises IndexError. If you want a custom message, fine, but then also validate the list length matches the sweep length upfront in assert_result so misconfigured JSON fails fast before any benchmarks run.
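The fail-fast idea the reviewer raises could look like the following hypothetical check, run once per test entry before any benchmark starts (`validate_baselines` is an invented name, not in the PR):

```python
def validate_baselines(params: dict) -> None:
    """Raise early if any baseline list length differs from the sweep length."""
    sweep_len = len(params["num_prompts"])
    for name, raw in params.get("baseline", {}).items():
        if isinstance(raw, (list, tuple)) and len(raw) != sweep_len:
            raise ValueError(
                f"baseline '{name}' has {len(raw)} entries, "
                f"but the sweep defines {sweep_len} runs"
            )

# Misconfigured JSON fails before any benchmark runs:
good = {"num_prompts": [10, 40], "baseline": {"mean_ttft_ms": [100, 300]}}
validate_baselines(good)  # passes silently
```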

"num_prompts": [
    4,
    16
],
Contributor


why don't we set num_prompts to be the same as the short input/output case, using 10 and 40? Or change short input/output case to 4, 16 to maintain consistency.


Contributor Author


Long-input requests take much longer to run; using the same settings as the short-input case could make the test cases time out.

Comment thread docs/contributing/ci/CI_5levels.md Outdated
```JSON
"baseline": {
    "mean_ttft_ms": [100, 300],
    "mean_audio_ttfp_ms": {"1": 500, "4": 1000},
```
Collaborator


Why is the example different from test.json?

Contributor Author


done
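The dict form in that doc example keys thresholds by max_concurrency (as a string, since JSON object keys are strings), while the list form is indexed by sweep position. A hedged sketch of resolving both shapes, with the dict lookup driven by concurrency as the docstring quoted earlier describes (`resolve` here is illustrative, not the PR's code):

```python
import json

raw = json.loads(
    '{"mean_ttft_ms": [100, 300], "mean_audio_ttfp_ms": {"1": 500, "4": 1000}}'
)

def resolve(entry, sweep_index, max_concurrency):
    # Dict baselines are keyed by concurrency; list baselines by sweep position.
    if isinstance(entry, dict):
        return entry[str(max_concurrency)]
    if isinstance(entry, (list, tuple)):
        return entry[sweep_index]
    return entry  # scalar: one threshold for every sweep

# Second sweep: num_prompts=40, max_concurrency=4
print(resolve(raw["mean_ttft_ms"], 1, 4))        # → 300 (list lookup by sweep index)
print(resolve(raw["mean_audio_ttfp_ms"], 1, 4))  # → 1000 (dict lookup by concurrency)
```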

@amy-why-3459 amy-why-3459 force-pushed the perf branch 5 times, most recently from d91ff7e to 12e97d3 Compare March 27, 2026 06:38
@amy-why-3459
Contributor Author

Please add a nightly-test label. @congw729 @Gaohan123 @gcanlin

@gcanlin gcanlin added the nightly-test label to trigger buildkite nightly test CI label Mar 27, 2026
@amy-why-3459 amy-why-3459 force-pushed the perf branch 3 times, most recently from 3d1333d to eb308e3 Compare March 28, 2026 07:09
@amy-why-3459
Contributor Author

@yenuo26 @gcanlin @hsliuustc0106 PTAL. This PR is ready to merge.

Collaborator

@yenuo26 yenuo26 left a comment


LGTM

Collaborator

@gcanlin gcanlin left a comment


LGTM

@gcanlin gcanlin removed the nightly-test label to trigger buildkite nightly test CI label Mar 28, 2026
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
@gcanlin gcanlin merged commit 93bb988 into vllm-project:main Mar 28, 2026
7 of 8 checks passed
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

Labels

ready label to trigger buildkite CI
