[Tests]Add accuracy benchmark L4 test cases#2843
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Blocking IssuesNone (WIP PR). VERDICT: COMMENT Ready for full review when draft status removed. Preliminary scan available on request. |
lishunyang12
left a comment
There was a problem hiding this comment.
WIP Review: [WIP][Tests] Add accuracy benchmark L4 test cases
Acknowledging this is WIP. Overall the structure is solid — clean separation between core helpers, CLI driver, and pytest wrappers. A few items to address before this leaves WIP:
Issues
1. Dead sys.path manipulation in run_qwen_omni_acc_benchmark.py (lines 17-19)
from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import (
build_serve_common_argv,
...
)
_REPO_ROOT = Path(__file__).resolve().parents[4]
if str(_REPO_ROOT) not in sys.path:
sys.path.insert(0, str(_REPO_ROOT)) # <-- too late, import above already resolvedThe tests.e2e.accuracy... import on line 8 executes before the sys.path insert on line 19. Either move the path manipulation above the import or remove it entirely (pytest and the standalone __main__ entry likely already have the repo root on sys.path).
2. tarfile.extractall() without filter — path traversal risk + deprecation warning
In daily_omni_dataset.py, tf.extractall(path=work) uses no filter argument. Since Python 3.12 this emits a DeprecationWarning, and in 3.14+ it will default to 'data' filter. More importantly, a malicious .tar could write outside work via absolute paths or ../ members. Pass filter='data' (or at minimum filter='tar') explicitly:
tf.extractall(path=work, filter='data')3. Duplicated audio chunk handling in patch.py
The refactored streaming parser now has two nearly identical blocks that decode audio_b64, create an AudioSegment, and concatenate it — once under if modality == "audio" and audio_b64: and again in the elif audio_b64: fallback. Extract a small helper (e.g., _append_audio_segment) to avoid the duplication.
4. os.environ mutation leaks state across tests
_sync_dataset_env_from_ns() writes directly to os.environ (e.g., VLLM_DAILY_OMNI_REPO, VLLM_DAILY_OMNI_QA_JSON). In the pytest path, if a test sets these and then a later test in the same process runs, the env vars persist. Consider using monkeypatch.setenv in the test layer, or saving/restoring the original values. The module-level os.environ assignments in test_qwen_omni.py (VLLM_WORKER_MULTIPROC_METHOD, VLLM_TEST_CLEAN_GPU_MEMORY) have the same concern.
5. Hardware requirement vs. PR title mismatch
The PR title says "L4 test cases" and the docstring in run_qwen_omni_acc_benchmark.py targets NVIDIA L4, but the test decorator requires H100/MI325 with 2 cards:
@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)Is the intent to run on L4 or H100? If L4, the decorator needs updating. If H100, the title/docstrings are misleading.
6. Default num_prompts inconsistency
build_acc_benchmark_cli_argv in the core module defaults to 100 prompts (via ACC_BENCH_NUM_PROMPTS), but run_qwen_omni_acc_benchmark.py's argparse defaults to 2000. When tests call build_acc_benchmark_cli_argv the value is 100; when the standalone script is used it's 2000. This should be explicitly documented or unified to avoid confusion.
Minor / Nits
- The
_daily_omni_tar_fingerprintfunction usesst_mtime_nswhich can differ across filesystems (e.g., NFS, Docker mounts with second-granularity timestamps). Consider including a content hash for robustness, or document that the fingerprint is best-effort. _daily_omni_find_videos_root_in_extractprobes only the first non-dotfile child. If the tar has mixed content (e.g., a README alongside video dirs), the probe may wrongly returntmpor fail. Not critical for a known upstream tar layout, but worth a comment.- The CJK-safe regex addition in
daily_omni_eval.pylooks correct and is a nice improvement. Consider adding a unit test for it (e.g.,"选B"->"B","答案:C"->"C").
Summary
Good foundational work for CI accuracy gating. The main blockers before merging would be the tarfile safety fix (#2), the dead sys.path code (#1), and clarifying the hardware target (#5). The rest are quality improvements that can land incrementally.
87bc6cd to
635541a
Compare
46301ad to
8ee71a3
Compare
|
@linyueqian @yenuo26 @gcanlin PTAL |
| """If ``video_dir`` was not configured, download and extract ``Videos.tar`` once from HF.""" | ||
| if self.video_dir is not None: | ||
| return | ||
| if os.environ.get("VLLM_DAILY_OMNI_NO_AUTO_HUB_MEDIA", "").strip().lower() in ("1", "true", "yes"): |
There was a problem hiding this comment.
Is it necessary to use environment variables?
| from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns | ||
| from vllm_omni.benchmarks.data_modules.seed_tts_dataset import resolve_seed_tts_root | ||
|
|
||
| pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu] |
There was a problem hiding this comment.
i think maybe pytest.mark.diffusion need to be removed
| from pathlib import Path | ||
| from typing import Any | ||
|
|
||
| from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import ( |
There was a problem hiding this comment.
If some utility functions need to be called by more test cases later, you can place them in helpers.py or conftest.py.
|
|
||
|
|
||
| @pytest.mark.advanced_model | ||
| @pytest.mark.core_model |
There was a problem hiding this comment.
i think maybe pytest.mark.core_model need to be removed
|
|
||
|
|
||
| @pytest.mark.advanced_model | ||
| @pytest.mark.core_model |
There was a problem hiding this comment.
i think maybe pytest.mark.core_model need to be removed
| @pytest.mark.omni | ||
| @hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2) | ||
| @pytest.mark.parametrize("omni_server", test_params, indirect=True) | ||
| @pytest.mark.skip(reason="Dataset download issues.") |
There was a problem hiding this comment.
To prevent the skip from being forgotten later, I suggest you file an issue to track it and include the corresponding issue number here.
| commands: | ||
| - | | ||
| set +e | ||
| pytest -s -v tests/e2e/accuracy/qwen3_omni/test_qwen_omni.py -m "advanced_model" --run-level advanced_model |
There was a problem hiding this comment.
To distinguish from the L3 test cases, we plan to change all L4 test cases in #2641 to "full_model". Maybe you can directly change the test cases here to "full_model" as well.
| summarize_gebench_results, | ||
| ) | ||
| from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import seed_tts_bench_argv | ||
| from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns |
There was a problem hiding this comment.
maybe _sync_dataset_env_from_ns should be sync_dataset_env_from_ns?
| # so only async_chunk needs flipping. Writing nested `engine_args:` into | ||
| # the new-schema overlay trips _parse_stage_deploy's legacy branch and | ||
| # drops flat fields (load_format, max_num_seqs, ...). | ||
| return modify_stage_config(config_path, updates={"async_chunk": True}) |
There was a problem hiding this comment.
Why not directly use the command line form of --async chunk?
bcfb2a2 to
085e629
Compare
|
I think this PR is ready and can be merged. |
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Head branch was pushed to by a user without write access

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Added L4 test cases to the Qwen3-Omni precision dataset.
Test Plan
pytest -sv test_qwen_omni.py -m "advanced_model" --run-level "advanced_model"Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)