[Tests][Qwen3-Omni] Add performance test cases #2011
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d57d468cd
```python
if param_index >= len(all_params):
    raise ValueError(f"No benchmark parameters found for index {param_index} in test: {test_name}")

if all_params[param_index]["dataset_name"] == "random-mm":
    # TODO: Due to known issues, skip the random-mm dataset.
    pytest.skip("Skipping parameter for random-mm dataset.")

current = param_index + 1
total = len(all_params)
```
Keep random-mm perf cases skipped until batched mix input works
Removing the random-mm skip here makes the perf suite execute the mixed text+audio+image+video benchmark rows from tests/perf/tests/test.json (num_prompts 10/40, max_concurrency 1/4) for both Qwen3-Omni configs. That scenario is still explicitly disabled in tests/e2e/online_serving/test_qwen3_omni_expansion.py:366-395 with a “known issue with shape mismatch error”, while only the single-request mix case remains enabled in tests/e2e/online_serving/test_qwen3_omni.py:96-127. So this change reintroduces a known batched mixed-modality failure into nightly perf runs rather than just adding coverage.
Force-pushed bd78742 to 3f59f03
| "mean_ttft_ms": 100000, | ||
| "mean_audio_ttfp_ms": 100000, | ||
| "mean_audio_rtf": 100000 | ||
| "mean_ttft_ms": [100, 300], |
Please update the doc as well: the test example in docs/contributing/ci/CI_5levels.md, Section 3.4.
| "baseline": { | ||
| "mean_ttft_ms": [100000, 100000], | ||
| "mean_audio_ttfp_ms": [100000, 100000], | ||
| "mean_audio_rtf": [100000, 100000] |
These baselines are all 100000 — effectively no-op assertions that will never fail. The whole point of this PR is replacing the old placeholder baselines with real values; these new test cases should get real thresholds too (or at least a TODO comment explaining why they're deferred).
Currently, there is no baseline value for the long input. We need to wait for this use case to be submitted before determining the correct baseline value.
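One interim option, offered only as a sketch and not what this PR does: mark the deferred entries explicitly (e.g. JSON null) rather than with a huge placeholder such as 100000, so they read as "not yet measured" and can be skipped by the guard discussed further down this thread.

```python
import json

# Illustrative only: a deferred long-input baseline marked as null instead of 100000.
# json.loads turns null into Python None, which a caller-side guard can then skip.
deferred = json.loads('{"mean_ttft_ms": null, "mean_audio_ttfp_ms": null, "mean_audio_rtf": null}')
assert deferred["mean_ttft_ms"] is None  # i.e. no real threshold has been set yet
```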
```python
    For dict lookup, ``max_concurrency`` is preferred when both are set (concurrency sweep).
    """
    if baseline_raw is None:
        return None
```
If baseline_raw is None, this returns None, and then assert_result will hit current_value >= None / current_value <= None → TypeError. Either skip the metric here or guard in the caller:
```python
if baseline_raw is None:
    return None
```
→ in `assert_result`:

```python
if baseline_value is None:
    continue
```
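To make the suggestion concrete, here is a minimal, self-contained sketch of both pieces together; the names `resolve_baseline` and `assert_result`, the signatures, and the upper-bound-only comparison are illustrative and may not match the PR's actual helpers:

```python
from typing import Any, Optional, Union

Number = Union[int, float]


def resolve_baseline(baseline_raw: Any,
                     sweep_index: Optional[int],
                     max_concurrency: Optional[int]) -> Optional[Number]:
    """Illustrative resolver: a baseline entry may be a scalar, a list indexed
    by sweep position, or a dict keyed by concurrency (as a string)."""
    if baseline_raw is None:
        return None                          # metric has no baseline configured
    if isinstance(baseline_raw, dict):
        return baseline_raw.get(str(max_concurrency))
    if isinstance(baseline_raw, (list, tuple)):
        if sweep_index is None:
            return None
        return baseline_raw[sweep_index]
    return baseline_raw                      # plain scalar


def assert_result(metrics: dict,
                  baselines: dict,
                  sweep_index: Optional[int],
                  max_concurrency: Optional[int]) -> None:
    """Illustrative caller-side guard: skip metrics whose baseline is unset
    instead of comparing a number against None (which raises TypeError)."""
    for name, current_value in metrics.items():
        baseline_value = resolve_baseline(baselines.get(name),
                                          sweep_index, max_concurrency)
        if baseline_value is None:
            continue
        assert current_value <= baseline_value, (
            f"{name}={current_value} exceeds baseline {baseline_value}")
```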
```diff
@@ -196,10 +196,6 @@ def benchmark_params(request, omni_server):
     if param_index >= len(all_params):
```
The skip was added with a "TODO: Due to known issues" comment. Is the underlying issue actually resolved, or does this just remove the guard? It would be good to link the fix if there is one.
```python
if isinstance(baseline_raw, (list, tuple)):
    if sweep_index is None:
        raise ValueError("baseline list requires sweep_index when asserting")
    if sweep_index < 0 or sweep_index >= len(baseline_raw):
```
Nit: this bounds check is redundant — Python's list[index] already raises IndexError. If you want a custom message, fine, but then also validate the list length matches the sweep length upfront in assert_result so misconfigured JSON fails fast before any benchmarks run.
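To make the fail-fast idea concrete, a hedged sketch of what such an upfront check might look like (the function name and where it would be called from are illustrative):

```python
def validate_baselines(baselines: dict, sweep_length: int) -> None:
    """Illustrative pre-flight check: fail before any benchmark runs if a
    list-style baseline does not have one entry per sweep point."""
    for name, baseline_raw in baselines.items():
        if isinstance(baseline_raw, (list, tuple)) and len(baseline_raw) != sweep_length:
            raise ValueError(
                f"baseline '{name}' has {len(baseline_raw)} entries, "
                f"expected {sweep_length} (one per sweep point)")
```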
| "num_prompts": [ | ||
| 4, | ||
| 16 | ||
| ], |
Why don't we set num_prompts to be the same as the short input/output case, using 10 and 40? Or change the short input/output case to 4, 16 to maintain consistency.
| "num_prompts": [ | ||
| 4, | ||
| 16 | ||
| ], |
There was a problem hiding this comment.
why don't we set num_prompts to be the same as the short input/output case, using 10 and 40? Or change short input/output case to 4, 16 to maintain consistency.
Long-input requests take too long to test; if the settings were the same as for short inputs, the test cases might time out.
```json
"baseline": {
    "mean_ttft_ms": [100, 300],
    "mean_audio_ttfp_ms": {"1": 500, "4": 1000},
```
Why is the example different from test.json?
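For reference, the two quoted snippets suggest two accepted baseline shapes, a list indexed by sweep position and a dict keyed by concurrency; a purely illustrative comparison of the lookups (values are invented, not real thresholds from test.json or the docs example):

```python
# Illustrative values only, not real thresholds.
list_style = [100, 300]             # one entry per sweep position (requires sweep_index)
dict_style = {"1": 500, "4": 1000}  # keyed by max_concurrency (string keys)

sweep_index, max_concurrency = 1, 4
assert list_style[sweep_index] == 300
assert dict_style[str(max_concurrency)] == 1000
```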
Force-pushed d91ff7e to 12e97d3
Please add a nightly-test label. @congw729 @Gaohan123 @gcanlin
Force-pushed 3d1333d to eb308e3
@yenuo26 @gcanlin @hsliuustc0106 PTAL, this PR is ready and can be merged.
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Purpose
Add performance test cases for Qwen3-Omni.
Test Plan
pytest -sv run_benchmark.py

Test Result
The test case failed when performance data exceeded the baseline.