12 changes: 10 additions & 2 deletions .buildkite/pipeline.yml
@@ -44,11 +44,19 @@ steps:
agents:
queue: "cpu_queue_premerge"

# L4 Test — main+NIGHTLY=1 (scheduled), or PR with label nightly-test (e.g. add label then Rebuild)
# L4 Test — main+NIGHTLY=1 (scheduled), or PR with specific label (e.g. add label then Rebuild)
- label: "Upload Nightly Pipeline"
depends_on: image-build
key: upload-nightly-pipeline
if: '(build.branch == "main" && build.env("NIGHTLY") == "1") || (build.branch != "main" && build.pull_request.labels includes "nightly-test")'
if: >-
(build.branch == "main" && build.env("NIGHTLY") == "1") ||
(build.branch != "main" && (
build.pull_request.labels includes "nightly-test" ||
build.pull_request.labels includes "omni-test" ||
build.pull_request.labels includes "tts-test" ||
build.pull_request.labels includes "diffusion-x2iat-test" ||
build.pull_request.labels includes "diffusion-x2v-test"
))
commands:
- buildkite-agent pipeline upload .buildkite/test-nightly.yml
agents:
417 changes: 0 additions & 417 deletions .buildkite/test-nightly-diffusion.yml

This file was deleted.

432 changes: 393 additions & 39 deletions .buildkite/test-nightly.yml

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions docs/contributing/ci/CI_5levels.md
@@ -86,7 +86,8 @@ Through five levels (L1-L5) and common (Common) specifications, the system clari
/tests/e2e/online_serving/test_{model_name}_expansion.py<br>
/tests/e2e/offline_inference/test_{model_name}_expansion.py<br>
<strong>Performance:</strong><br>
/tests/dfx/perf/tests/test.json<br>
/tests/dfx/perf/tests/test_qwen_omni.json (Omni), test_tts.json (TTS),<br>
and /tests/dfx/perf/tests/test_{diffusion_model}_vllm_omni.json (Diffusion)<br>
<strong>Doc Test:</strong><br>
tests/example/online_serving/test_{model_name}.py<br>
tests/example/offline_inference/test_{model_name}.py
@@ -530,13 +531,13 @@ L4 level testing is a comprehensive quality audit before a version release. It e
### 3.2 Testing Content and Scope

- ***Full Functionality Testing***: Executes all test cases defined in `test_{model_name}_expansion.py`, covering all implemented features, positive flows, boundary conditions, and exception handling.
- ***Performance Testing***: Uses the `tests/dfx/perf/tests/test.json` configuration file to drive performance testing tools for stress, load, and endurance tests, collecting metrics like throughput, response time, and resource utilization.
- ***Performance Testing***: Uses `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs in the form `tests/dfx/perf/tests/test_*_vllm_omni.json` (passed to `run_benchmark.py` via `--test-config-file`) to drive performance testing tools for stress, load, and endurance tests, collecting metrics like throughput, response time, and resource utilization.
- ***Documentation Testing***: Verifies whether the example code provided to users is runnable and its results match the description.

### 3.3 Test Directory and Execution Files

- ***Functional Testing***: Same directories as L3.
- ***Performance Test Configuration***: `tests/dfx/perf/tests/test.json`
- ***Performance Test Configuration***: `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs `tests/dfx/perf/tests/test_*_vllm_omni.json` (e.g. `test_qwen_image_vllm_omni.json`)
- ***Documentation Example Tests***:
- - `tests/example/online_serving/test_{model_name}.py`
- `tests/example/offline_inference/test_{model_name}.py`
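Since section 3.3 refers to the diffusion configs by a glob-style pattern rather than a fixed filename, here is a small illustrative sketch (not part of the PR) of enumerating the matching configs with Python's `pathlib`:

```python
from pathlib import Path

# List the diffusion perf configs matching the documented pattern.
for cfg in sorted(Path("tests/dfx/perf/tests").glob("test_*_vllm_omni.json")):
    print(cfg)  # e.g. tests/dfx/perf/tests/test_qwen_image_vllm_omni.json
```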
@@ -1,4 +1,4 @@
When you want to add L4-level ***performance test*** cases, you can refer to the following format for case addition in tests/dfx/perf/tests/test.json:
When you want to add L4-level ***performance test*** cases, you can refer to the following format for case addition in `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, or diffusion configs such as `tests/dfx/perf/tests/test_*_vllm_omni.json` (selected via `pytest ... run_benchmark.py --test-config-file <path>`):

```JSON
{
  ...
}
```
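For orientation, here is a minimal sketch of one config entry, written as a Python literal mirroring the JSON schema that `test_tts.json` (added below) follows; the names and values are illustrative only:

```python
# Sketch of one perf-test entry; mirrors the JSON schema of test_tts.json.
# All names and values below are illustrative, not real suites.
entry = {
    "test_name": "test_my_model",              # hypothetical suite name
    "server_params": {"model": "org/model"},   # model served by OmniServer
    "benchmark_params": [
        {
            "dataset_name": "random",
            "num_prompts": [10, 40],      # one value per sweep step
            "max_concurrency": [1, 4],
            # per-metric thresholds; a list supplies one value per sweep step
            "baseline": {"mean_ttft_ms": [500, 800]},
        }
    ],
}
```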
5 changes: 2 additions & 3 deletions docs/contributing/ci/test_guide.md
@@ -45,7 +45,6 @@ Our test scripts use the pytest framework. First, please use `git clone https://
=== "L3 level & L4 level"

```bash
cd tests
pytest -s -v -m "advanced_model" --run-level=advanced_model
```
If you only want to run L3 test cases, you can use:
@@ -60,9 +59,9 @@ Our test scripts use the pytest framework. First, please use `git clone https://
```bash
pytest -s -v -m "core_model and distributed_cuda and L4" --run-level=core_model
```
Note: To run performance tests, use:
Note: To run performance tests (defaults to ``test_qwen_omni.json``; use ``--test-config-file tests/dfx/perf/tests/test_tts.json`` for TTS):
```bash
pytest -s -v perf/scripts/run_benchmark.py
pytest -s -v tests/dfx/perf/scripts/run_benchmark.py
```

The latest L3 test commands for various test suites can be found in the [pipeline](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-merge.yml).
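With the option now registered in `tests/dfx/conftest.py`, the TTS suite can be selected from the repo root via `pytest -s -v tests/dfx/perf/scripts/run_benchmark.py --test-config-file tests/dfx/perf/tests/test_tts.json`; omitting the flag falls back to `test_qwen_omni.json`, as the note above states.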
12 changes: 12 additions & 0 deletions tests/dfx/conftest.py
@@ -2,6 +2,8 @@
from pathlib import Path
from typing import Any

import pytest

from tests.conftest import modify_stage_config


@@ -95,3 +97,13 @@ def create_benchmark_indices(
indices.append((test_name, idx))

return indices


def pytest_addoption(parser: pytest.Parser) -> None:
"""Register shared CLI options for DFX benchmark suites."""
parser.addoption(
"--test-config-file",
action="store",
default=None,
help=("Path to benchmark config JSON. Example: --test-config-file tests/dfx/perf/tests/test_tts.json"),
)
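Inside fixtures and tests the shared flag is then available through pytest's option API; a minimal sketch, assuming a fixture that wants the resolved path and reusing the default that `run_benchmark.py` falls back to:

```python
import pytest


@pytest.fixture
def test_config_file(request: pytest.FixtureRequest) -> str:
    """Resolve the shared benchmark config path (sketch only)."""
    path = request.config.getoption("--test-config-file")
    return path or "tests/dfx/perf/tests/test_qwen_omni.json"
```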
49 changes: 31 additions & 18 deletions tests/dfx/perf/scripts/run_benchmark.py
@@ -21,10 +21,30 @@
os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"


CONFIG_FILE_PATH = str(Path(__file__).parent.parent / "tests" / "test.json")
BENCHMARK_CONFIGS = load_configs(CONFIG_FILE_PATH)
STAGE_INIT_TIMEOUT = 600
def _get_config_file_from_argv() -> str | None:
"""Read ``--test-config-file`` from ``sys.argv`` at import time so parametrization can use it."""
import sys

for i, arg in enumerate(sys.argv):
if arg == "--test-config-file" and i + 1 < len(sys.argv):
return sys.argv[i + 1]
if arg.startswith("--test-config-file="):
return arg.split("=", 1)[1]
return None


_PERF_TESTS_DIR = Path(__file__).resolve().parent.parent / "tests"
_DEFAULT_CONFIG_FILE = str(_PERF_TESTS_DIR / "test_qwen_omni.json")

CONFIG_FILE_PATH = _get_config_file_from_argv()
if CONFIG_FILE_PATH is None:
print(
"No --test-config-file in argv, using default: tests/dfx/perf/tests/test_qwen_omni.json "
"(override with e.g. --test-config-file tests/dfx/perf/tests/test_tts.json)"
)
CONFIG_FILE_PATH = _DEFAULT_CONFIG_FILE

BENCHMARK_CONFIGS = load_configs(CONFIG_FILE_PATH)

STAGE_CONFIGS_DIR = Path(__file__).parent.parent / "stage_configs"
test_params = create_unique_server_params(BENCHMARK_CONFIGS, STAGE_CONFIGS_DIR)
@@ -44,7 +64,7 @@ def omni_server(request):

print(f"Starting OmniServer with test: {test_name}, model: {model}")

server_args = ["--stage-init-timeout", str(STAGE_INIT_TIMEOUT), "--init-timeout", "900"]
server_args = ["--stage-init-timeout", "300", "--init-timeout", "900"]
if stage_config_path:
server_args = ["--stage-configs-path", stage_config_path] + server_args
with OmniServer(model, server_args) as server:
@@ -97,8 +117,6 @@ def run_benchmark(
["vllm", "bench", "serve", "--omni"]
+ args
+ [
"--num-warmups",
"2",
"--save-result",
"--result-dir",
os.environ.get("BENCHMARK_DIR", "tests"),
@@ -141,7 +159,6 @@ def run_benchmark(
result["random_output_len"] = random_output_len
with open(result_path, "w", encoding="utf-8") as f:
json.dump(result, f, ensure_ascii=False, indent=2)

return result


@@ -207,10 +224,6 @@ def _resolve_baseline_value(
f"or request_rate={request_rate!r}; keys={list(baseline_raw.keys())!r}"
)
if isinstance(baseline_raw, (list, tuple)):
if sweep_index is None:
raise ValueError("list baseline requires sweep_index")
if not (0 <= sweep_index < len(baseline_raw)):
raise IndexError(f"baseline list len={len(baseline_raw)} has no index {sweep_index}")
return baseline_raw[sweep_index]
return baseline_raw

@@ -245,14 +258,14 @@ def assert_result(
) -> None:
assert result["completed"] == num_prompt, "Request failures exist"
baseline_data = params.get("baseline", {})
thresholds = _baseline_thresholds_for_step(
baseline_data,
sweep_index=sweep_index,
max_concurrency=max_concurrency,
request_rate=request_rate,
)
for metric_name, baseline_value in thresholds.items():
for metric_name, baseline_raw in baseline_data.items():
current_value = result[metric_name]
baseline_value = _resolve_baseline_value(
baseline_raw,
sweep_index=sweep_index,
max_concurrency=max_concurrency,
request_rate=request_rate,
)
if "throughput" in metric_name:
if current_value <= baseline_value:
print(
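The import-time `sys.argv` scan exists because `BENCHMARK_CONFIGS` feeds collection-time parametrization, which runs before pytest has parsed its own options. A condensed, self-contained sketch of the pattern, using `json.load` as a stand-in for the module's `load_configs` helper:

```python
import json
import sys

import pytest


def _config_from_argv(default: str) -> str:
    # Parametrize is evaluated at collection time, before pytest option
    # parsing, so the flag must be read straight from sys.argv.
    for i, arg in enumerate(sys.argv):
        if arg == "--test-config-file" and i + 1 < len(sys.argv):
            return sys.argv[i + 1]
        if arg.startswith("--test-config-file="):
            return arg.split("=", 1)[1]
    return default


with open(_config_from_argv("tests/dfx/perf/tests/test_qwen_omni.json")) as f:
    CONFIGS = json.load(f)


@pytest.mark.parametrize("cfg", CONFIGS, ids=lambda c: c["test_name"])
def test_benchmark_smoke(cfg):  # parameters are built at import, hence the argv scan
    assert "benchmark_params" in cfg
```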
25 changes: 6 additions & 19 deletions tests/dfx/perf/scripts/run_diffusion_benchmark.py
@@ -5,8 +5,8 @@
- vllm-omni (default): starts DiffusionServer via vllm_omni.entrypoints.cli.main,
benchmarks with diffusion_benchmark_serving.py --backend vllm-omni

A config JSON file is REQUIRED via --config-file:
pytest run_diffusion_benchmark.py --config-file tests/dfx/perf/tests/test_qwen_image_vllm_omni.json
A config JSON file is REQUIRED via --test-config-file:
pytest run_diffusion_benchmark.py --test-config-file tests/dfx/perf/tests/test_qwen_image_vllm_omni.json

JSON config entries use a "server_type" field, and this runner executes
the vllm-omni path.
@@ -55,16 +55,16 @@


def _get_config_file_from_argv() -> str | None:
"""Read --config-file from sys.argv at import time so pytest parametrize can use it.
"""Read --test-config-file from sys.argv at import time so pytest parametrize can use it.

pytest_addoption (below) registers the same flag so pytest does not reject it.
Supports both ``--config-file path`` and ``--config-file=path`` forms.
Supports both ``--test-config-file path`` and ``--test-config-file=path`` forms.
Returns None if the flag is not present; callers must handle the missing case.
"""
for i, arg in enumerate(sys.argv):
if arg == "--config-file" and i + 1 < len(sys.argv):
if arg == "--test-config-file" and i + 1 < len(sys.argv):
return sys.argv[i + 1]
if arg.startswith("--config-file="):
if arg.startswith("--test-config-file="):
return arg.split("=", 1)[1]
return None

@@ -133,19 +133,6 @@ def _append_to_aggregated_file(record: dict[str, Any]) -> None:
json.dump(records, f, indent=2, ensure_ascii=False)


# Register --config-file with pytest so it does not reject the argument.
def pytest_addoption(parser: pytest.Parser) -> None:
parser.addoption(
"--config-file",
action="store",
default=None,
help=(
"Path to the benchmark config JSON file (required). "
"Example: --config-file tests/dfx/perf/tests/test_qwen_image_vllm_omni.json"
),
)


_server_lock = threading.Lock()

# ---------------------------------------------------------------------------
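Worth noting on the relocation: pytest generally honors `pytest_addoption` only when it is defined in a `conftest.py` file or a plugin, not in a test module, so hoisting the registration into the shared `tests/dfx/conftest.py` both makes the flag reliably recognized and lets `run_benchmark.py` and `run_diffusion_benchmark.py` consume a single, consistently named `--test-config-file` option.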
@@ -329,37 +329,5 @@
}
}
]
},
{
"test_name": "test_qwen3_tts",
"server_params": {
"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
},
"benchmark_params": [
{
"dataset_name": "random",
"backend": "openai-audio-speech",
"endpoint": "/v1/audio/speech",
"num_prompts": [
10,
40
],
"max_concurrency": [
1,
4
],
"random_input_len": 100,
"random_output_len": 100,
"extra_body": {
"voice": "Vivian",
"language": "English"
},
"percentile-metrics": "ttft,e2el,audio_rtf,audio_ttfp,audio_duration",
"baseline": {
"mean_audio_ttfp_ms": [6000, 6000],
"mean_audio_rtf": [0.3, 0.3]
}
}
]
}
]
34 changes: 34 additions & 0 deletions tests/dfx/perf/tests/test_tts.json
@@ -0,0 +1,34 @@
[
{
"test_name": "test_qwen3_tts",
"server_params": {
"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
},
"benchmark_params": [
{
"dataset_name": "random",
"backend": "openai-audio-speech",
"endpoint": "/v1/audio/speech",
"num_prompts": [
10,
40
],
"max_concurrency": [
1,
4
],
"random_input_len": 100,
"random_output_len": 100,
"extra_body": {
"voice": "Vivian",
"language": "English"
},
"percentile-metrics": "ttft,e2el,audio_rtf,audio_ttfp,audio_duration",
"baseline": {
"mean_audio_ttfp_ms": [6000, 6000],
"mean_audio_rtf": [0.3, 0.3]
}
}
]
}
]
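As a reading aid for the baseline lists: `_resolve_baseline_value` indexes them by sweep step, so the two-element lists pair up with the two `num_prompts`/`max_concurrency` steps above. A small illustrative sketch (the comparison direction per metric is inferred from `assert_result`, where only throughput-style metrics are treated as lower bounds):

```python
# Sketch: how per-step thresholds line up with the sweep in test_tts.json.
num_prompts = [10, 40]
max_concurrency = [1, 4]
baseline = {
    "mean_audio_ttfp_ms": [6000, 6000],
    "mean_audio_rtf": [0.3, 0.3],
}

for step in range(len(num_prompts)):
    for metric, per_step in baseline.items():
        print(
            f"step {step} (num_prompts={num_prompts[step]}, "
            f"max_concurrency={max_concurrency[step]}): "
            f"{metric} checked against {per_step[step]}"
        )
```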