Skip to content

[Tests]Add accuracy benchmark L4 test cases#2843

Merged
gcanlin merged 1 commit intovllm-project:mainfrom
amy-why-3459:benchmark
Apr 23, 2026
Merged

[Tests]Add accuracy benchmark L4 test cases#2843
gcanlin merged 1 commit intovllm-project:mainfrom
amy-why-3459:benchmark

Conversation

@amy-why-3459
Copy link
Copy Markdown
Contributor

@amy-why-3459 amy-why-3459 commented Apr 16, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Added L4 test cases to the Qwen3-Omni precision dataset.

Test Plan

pytest -sv test_qwen_omni.py -m "advanced_model" --run-level "advanced_model"

Test Result

========================================================================== 1 passed, 1 skipped, 16 warnings in 2153.83s (0:35:53) ===========================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

Blocking Issues

None (WIP PR).


VERDICT: COMMENT

Ready for full review when draft status removed. Preliminary scan available on request.

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP Review: [WIP][Tests] Add accuracy benchmark L4 test cases

Acknowledging this is WIP. Overall the structure is solid — clean separation between core helpers, CLI driver, and pytest wrappers. A few items to address before this leaves WIP:

Issues

1. Dead sys.path manipulation in run_qwen_omni_acc_benchmark.py (lines 17-19)

from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import (
    build_serve_common_argv,
    ...
)

_REPO_ROOT = Path(__file__).resolve().parents[4]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))  # <-- too late, import above already resolved

The tests.e2e.accuracy... import on line 8 executes before the sys.path insert on line 19. Either move the path manipulation above the import or remove it entirely (pytest and the standalone __main__ entry likely already have the repo root on sys.path).

2. tarfile.extractall() without filter — path traversal risk + deprecation warning

In daily_omni_dataset.py, tf.extractall(path=work) uses no filter argument. Since Python 3.12 this emits a DeprecationWarning, and in 3.14+ it will default to 'data' filter. More importantly, a malicious .tar could write outside work via absolute paths or ../ members. Pass filter='data' (or at minimum filter='tar') explicitly:

tf.extractall(path=work, filter='data')

3. Duplicated audio chunk handling in patch.py

The refactored streaming parser now has two nearly identical blocks that decode audio_b64, create an AudioSegment, and concatenate it — once under if modality == "audio" and audio_b64: and again in the elif audio_b64: fallback. Extract a small helper (e.g., _append_audio_segment) to avoid the duplication.

4. os.environ mutation leaks state across tests

_sync_dataset_env_from_ns() writes directly to os.environ (e.g., VLLM_DAILY_OMNI_REPO, VLLM_DAILY_OMNI_QA_JSON). In the pytest path, if a test sets these and then a later test in the same process runs, the env vars persist. Consider using monkeypatch.setenv in the test layer, or saving/restoring the original values. The module-level os.environ assignments in test_qwen_omni.py (VLLM_WORKER_MULTIPROC_METHOD, VLLM_TEST_CLEAN_GPU_MEMORY) have the same concern.

5. Hardware requirement vs. PR title mismatch

The PR title says "L4 test cases" and the docstring in run_qwen_omni_acc_benchmark.py targets NVIDIA L4, but the test decorator requires H100/MI325 with 2 cards:

@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)

Is the intent to run on L4 or H100? If L4, the decorator needs updating. If H100, the title/docstrings are misleading.

6. Default num_prompts inconsistency

build_acc_benchmark_cli_argv in the core module defaults to 100 prompts (via ACC_BENCH_NUM_PROMPTS), but run_qwen_omni_acc_benchmark.py's argparse defaults to 2000. When tests call build_acc_benchmark_cli_argv the value is 100; when the standalone script is used it's 2000. This should be explicitly documented or unified to avoid confusion.

Minor / Nits

  • The _daily_omni_tar_fingerprint function uses st_mtime_ns which can differ across filesystems (e.g., NFS, Docker mounts with second-granularity timestamps). Consider including a content hash for robustness, or document that the fingerprint is best-effort.
  • _daily_omni_find_videos_root_in_extract probes only the first non-dotfile child. If the tar has mixed content (e.g., a README alongside video dirs), the probe may wrongly return tmp or fail. Not critical for a known upstream tar layout, but worth a comment.
  • The CJK-safe regex addition in daily_omni_eval.py looks correct and is a nice improvement. Consider adding a unit test for it (e.g., "选B" -> "B", "答案:C" -> "C").

Summary

Good foundational work for CI accuracy gating. The main blockers before merging would be the tarfile safety fix (#2), the dead sys.path code (#1), and clarifying the hardware target (#5). The rest are quality improvements that can land incrementally.

@amy-why-3459 amy-why-3459 force-pushed the benchmark branch 6 times, most recently from 87bc6cd to 635541a Compare April 20, 2026 11:30
@amy-why-3459 amy-why-3459 changed the title [WIP][Tests]Add accuracy benchmark L4 test cases [Tests]Add accuracy benchmark L4 test cases Apr 20, 2026
@amy-why-3459 amy-why-3459 force-pushed the benchmark branch 2 times, most recently from 46301ad to 8ee71a3 Compare April 20, 2026 12:13
@amy-why-3459
Copy link
Copy Markdown
Contributor Author

@linyueqian @yenuo26 @gcanlin PTAL

@gcanlin gcanlin added the nightly-test label to trigger buildkite nightly test CI label Apr 20, 2026
"""If ``video_dir`` was not configured, download and extract ``Videos.tar`` once from HF."""
if self.video_dir is not None:
return
if os.environ.get("VLLM_DAILY_OMNI_NO_AUTO_HUB_MEDIA", "").strip().lower() in ("1", "true", "yes"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to use environment variables?

from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns
from vllm_omni.benchmarks.data_modules.seed_tts_dataset import resolve_seed_tts_root

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think maybe pytest.mark.diffusion need to be removed

from pathlib import Path
from typing import Any

from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If some utility functions need to be called by more test cases later, you can place them in helpers.py or conftest.py.



@pytest.mark.advanced_model
@pytest.mark.core_model
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think maybe pytest.mark.core_model need to be removed



@pytest.mark.advanced_model
@pytest.mark.core_model
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think maybe pytest.mark.core_model need to be removed

@pytest.mark.omni
@hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
@pytest.mark.skip(reason="Dataset download issues.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent the skip from being forgotten later, I suggest you file an issue to track it and include the corresponding issue number here.

@gcanlin gcanlin added omni-test label to trigger buildkite omni model test in nightly CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 20, 2026
Comment thread .buildkite/test-nightly.yml Outdated
commands:
- |
set +e
pytest -s -v tests/e2e/accuracy/qwen3_omni/test_qwen_omni.py -m "advanced_model" --run-level advanced_model
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To distinguish from the L3 test cases, we plan to change all L4 test cases in #2641 to "full_model". Maybe you can directly change the test cases here to "full_model" as well.

summarize_gebench_results,
)
from tests.e2e.accuracy.qwen3_omni.qwen3_omni_acc_bench_core import seed_tts_bench_argv
from tests.e2e.accuracy.qwen3_omni.run_qwen_omni_acc_benchmark import _sync_dataset_env_from_ns
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe _sync_dataset_env_from_ns should be sync_dataset_env_from_ns?

# so only async_chunk needs flipping. Writing nested `engine_args:` into
# the new-schema overlay trips _parse_stage_deploy's legacy branch and
# drops flat fields (load_format, max_num_seqs, ...).
return modify_stage_config(config_path, updates={"async_chunk": True})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly use the command line form of --async chunk?

@amy-why-3459 amy-why-3459 force-pushed the benchmark branch 4 times, most recently from bcfb2a2 to 085e629 Compare April 22, 2026 10:36
@amy-why-3459
Copy link
Copy Markdown
Contributor Author

I think this PR is ready and can be merged.

Copy link
Copy Markdown
Collaborator

@gcanlin gcanlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gcanlin
Copy link
Copy Markdown
Collaborator

gcanlin commented Apr 22, 2026

Retry can fix it?

image

@gcanlin gcanlin added the ready label to trigger buildkite CI label Apr 22, 2026
@gcanlin gcanlin removed the omni-test label to trigger buildkite omni model test in nightly CI label Apr 22, 2026
@gcanlin gcanlin enabled auto-merge (squash) April 22, 2026 13:58
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
auto-merge was automatically disabled April 23, 2026 01:12

Head branch was pushed to by a user without write access

@gcanlin gcanlin merged commit 3aa84c4 into vllm-project:main Apr 23, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants