[Tests]Add accuracy benchmark L4 test cases by amy-why-3459 · Pull Request #2843 · vllm-project/vllm-omni

amy-why-3459 · 2026-04-16T08:13:19Z

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Added L4 test cases to the Qwen3-Omni precision dataset.

Test Plan

pytest -sv test_qwen_omni.py -m "advanced_model" --run-level "advanced_model"

Test Result

========================================================================== 1 passed, 1 skipped, 16 warnings in 2153.83s (0:35:53) ===========================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

chatgpt-codex-connector · 2026-04-16T08:13:26Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

hsliuustc0106 · 2026-04-16T09:30:25Z

Blocking Issues

None (WIP PR).

VERDICT: COMMENT

Ready for full review when draft status removed. Preliminary scan available on request.

lishunyang12

WIP Review: [WIP][Tests] Add accuracy benchmark L4 test cases

Acknowledging this is WIP. Overall the structure is solid — clean separation between core helpers, CLI driver, and pytest wrappers. A few items to address before this leaves WIP:

Issues

1. Dead sys.path manipulation in run_qwen_omni_acc_benchmark.py (lines 17-19)

from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import (
    build_serve_common_argv,
    ...
)

_REPO_ROOT = Path(__file__).resolve().parents[4]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))  # <-- too late, import above already resolved

The tests.e2e.accuracy... import on line 8 executes before the sys.path insert on line 19. Either move the path manipulation above the import or remove it entirely (pytest and the standalone __main__ entry likely already have the repo root on sys.path).

2. tarfile.extractall() without filter — path traversal risk + deprecation warning

In daily_omni_dataset.py, tf.extractall(path=work) uses no filter argument. Since Python 3.12 this emits a DeprecationWarning, and in 3.14+ it will default to 'data' filter. More importantly, a malicious .tar could write outside work via absolute paths or ../ members. Pass filter='data' (or at minimum filter='tar') explicitly:

tf.extractall(path=work, filter='data')

3. Duplicated audio chunk handling in patch.py

The refactored streaming parser now has two nearly identical blocks that decode audio_b64, create an AudioSegment, and concatenate it — once under if modality == "audio" and audio_b64: and again in the elif audio_b64: fallback. Extract a small helper (e.g., _append_audio_segment) to avoid the duplication.

4. os.environ mutation leaks state across tests

_sync_dataset_env_from_ns() writes directly to os.environ (e.g., VLLM_DAILY_OMNI_REPO, VLLM_DAILY_OMNI_QA_JSON). In the pytest path, if a test sets these and then a later test in the same process runs, the env vars persist. Consider using monkeypatch.setenv in the test layer, or saving/restoring the original values. The module-level os.environ assignments in test_qwen_omni.py (VLLM_WORKER_MULTIPROC_METHOD, VLLM_TEST_CLEAN_GPU_MEMORY) have the same concern.

5. Hardware requirement vs. PR title mismatch

The PR title says "L4 test cases" and the docstring in run_qwen_omni_acc_benchmark.py targets NVIDIA L4, but the test decorator requires H100/MI325 with 2 cards:

@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)

Is the intent to run on L4 or H100? If L4, the decorator needs updating. If H100, the title/docstrings are misleading.

6. Default num_prompts inconsistency

build_acc_benchmark_cli_argv in the core module defaults to 100 prompts (via ACC_BENCH_NUM_PROMPTS), but run_qwen_omni_acc_benchmark.py's argparse defaults to 2000. When tests call build_acc_benchmark_cli_argv the value is 100; when the standalone script is used it's 2000. This should be explicitly documented or unified to avoid confusion.

Minor / Nits

The _daily_omni_tar_fingerprint function uses st_mtime_ns which can differ across filesystems (e.g., NFS, Docker mounts with second-granularity timestamps). Consider including a content hash for robustness, or document that the fingerprint is best-effort.
_daily_omni_find_videos_root_in_extract probes only the first non-dotfile child. If the tar has mixed content (e.g., a README alongside video dirs), the probe may wrongly return tmp or fail. Not critical for a known upstream tar layout, but worth a comment.
The CJK-safe regex addition in daily_omni_eval.py looks correct and is a nice improvement. Consider adding a unit test for it (e.g., "选B" -> "B", "答案：C" -> "C").

Summary

Good foundational work for CI accuracy gating. The main blockers before merging would be the tarfile safety fix (#2), the dead sys.path code (#1), and clarifying the hardware target (#5). The rest are quality improvements that can land incrementally.

amy-why-3459 · 2026-04-20T12:13:44Z

@linyueqian @yenuo26 @gcanlin PTAL

yenuo26 · 2026-04-20T12:34:03Z

+        """If ``video_dir`` was not configured, download and extract ``Videos.tar`` once from HF."""
+        if self.video_dir is not None:
+            return
+        if os.environ.get("VLLM_DAILY_OMNI_NO_AUTO_HUB_MEDIA", "").strip().lower() in ("1", "true", "yes"):


Is it necessary to use environment variables?

yenuo26 · 2026-04-20T12:35:53Z

+from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns
+from vllm_omni.benchmarks.data_modules.seed_tts_dataset import resolve_seed_tts_root

 pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


i think maybe pytest.mark.diffusion need to be removed

yenuo26 · 2026-04-20T12:39:05Z

+from pathlib import Path
+from typing import Any
+
+from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import (


If some utility functions need to be called by more test cases later, you can place them in helpers.py or conftest.py.

yenuo26 · 2026-04-20T12:40:47Z

+
+
+@pytest.mark.advanced_model
+@pytest.mark.core_model


i think maybe pytest.mark.core_model need to be removed

yenuo26 · 2026-04-20T12:40:59Z

+
+
+@pytest.mark.advanced_model
+@pytest.mark.core_model


i think maybe pytest.mark.core_model need to be removed

yenuo26 · 2026-04-20T12:41:51Z

+@pytest.mark.omni
+@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)
+@pytest.mark.parametrize("omni_server", test_params, indirect=True)
+@pytest.mark.skip(reason="Dataset download issues.")


To prevent the skip from being forgotten later, I suggest you file an issue to track it and include the corresponding issue number here.

yenuo26 · 2026-04-22T02:01:42Z

+        commands:
+          - |
+            set +e
+            pytest -s -v tests/e2e/accuracy/qwen3_omni/test_qwen_omni.py -m "advanced_model" --run-level advanced_model


To distinguish from the L3 test cases, we plan to change all L4 test cases in #2641 to "full_model". Maybe you can directly change the test cases here to "full_model" as well.

yenuo26 · 2026-04-22T02:02:42Z

    summarize_gebench_results,
 )
+from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import seed_tts_bench_argv
+from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns


maybe _sync_dataset_env_from_ns should be sync_dataset_env_from_ns?

yenuo26 · 2026-04-22T02:04:33Z

+    # so only async_chunk needs flipping. Writing nested `engine_args:` into
+    # the new-schema overlay trips _parse_stage_deploy's legacy branch and
+    # drops flat fields (load_format, max_num_seqs, ...).
+    return modify_stage_config(config_path, updates={"async_chunk": True})


Why not directly use the command line form of --async chunk?

amy-why-3459 · 2026-04-22T13:50:26Z

I think this PR is ready and can be merged.

gcanlin

LGTM.

gcanlin · 2026-04-22T13:56:33Z

Retry can fix it?

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

amy-why-3459 requested a review from hsliuustc0106 as a code owner April 16, 2026 08:13

lishunyang12 reviewed Apr 16, 2026

View reviewed changes

amy-why-3459 force-pushed the benchmark branch 6 times, most recently from 87bc6cd to 635541a Compare April 20, 2026 11:30

amy-why-3459 changed the title ~~[WIP][Tests]Add accuracy benchmark L4 test cases~~ [Tests]Add accuracy benchmark L4 test cases Apr 20, 2026

amy-why-3459 force-pushed the benchmark branch 2 times, most recently from 46301ad to 8ee71a3 Compare April 20, 2026 12:13

amy-why-3459 force-pushed the benchmark branch from 8ee71a3 to 9764c20 Compare April 20, 2026 12:16

gcanlin added the nightly-test label to trigger buildkite nightly test CI label Apr 20, 2026

amy-why-3459 force-pushed the benchmark branch from 9764c20 to b12dc61 Compare April 20, 2026 12:28

yenuo26 reviewed Apr 20, 2026

View reviewed changes

amy-why-3459 force-pushed the benchmark branch from b12dc61 to db17b3c Compare April 20, 2026 13:51

gcanlin added omni-test label to trigger buildkite omni model test in nightly CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 20, 2026

yenuo26 reviewed Apr 22, 2026

View reviewed changes

amy-why-3459 force-pushed the benchmark branch 4 times, most recently from bcfb2a2 to 085e629 Compare April 22, 2026 10:36

gcanlin approved these changes Apr 22, 2026

View reviewed changes

gcanlin added the ready label to trigger buildkite CI label Apr 22, 2026

gcanlin removed the omni-test label to trigger buildkite omni model test in nightly CI label Apr 22, 2026

gcanlin enabled auto-merge (squash) April 22, 2026 13:58

lishunyang12 approved these changes Apr 22, 2026

View reviewed changes

Add accuracy benchmark L4 test cases

f44209b

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>

auto-merge was automatically disabled April 23, 2026 01:12
Head branch was pushed to by a user without write access

amy-why-3459 force-pushed the benchmark branch from 839b22b to f44209b Compare April 23, 2026 01:12

gcanlin merged commit 3aa84c4 into vllm-project:main Apr 23, 2026
8 checks passed



		@pytest.mark.advanced_model
		@pytest.mark.core_model



		@pytest.mark.advanced_model
		@pytest.mark.core_model

Conversation

amy-why-3459 commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector Bot commented Apr 16, 2026

Uh oh!

hsliuustc0106 commented Apr 16, 2026

Blocking Issues

Uh oh!

lishunyang12 left a comment

Choose a reason for hiding this comment

WIP Review: [WIP][Tests] Add accuracy benchmark L4 test cases

Issues

Minor / Nits

Summary

Uh oh!

amy-why-3459 commented Apr 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amy-why-3459 commented Apr 22, 2026

Uh oh!

gcanlin left a comment

Choose a reason for hiding this comment

Uh oh!

gcanlin commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

amy-why-3459 commented Apr 16, 2026 •

edited

Loading