Merged

Changes from all commits (28 commits):
- b4add5b: [CI] Skip test_bagel[parallel_tp_2] and test_wan22_i2v_online_serving… (yenuo26, Apr 17, 2026)
- 64d368d: [Bugfix] fix CI failure (#2884) (RuixiangMa, Apr 17, 2026)
- f2edb81: [Cleanup] Remove dead runtime.defaults config parameters (#2343) (NickCao, Apr 17, 2026)
- 1637dba: [skip CI][Docs] Add Qwen3-Omni and Qwen3-TTS performance blog and fig… (Shirley125, Apr 17, 2026)
- b5ddff7: Nextstep online e2e (#2107) (Joshna-Medisetty, Apr 17, 2026)
- f346f2f: Add Teacache Support for LongCat Image (#1487) (alex-jw-brooks, Apr 17, 2026)
- 5a68c21: [skip ci][recipe] draft vllm-omni recipes (#2646) (hsliuustc0106, Apr 18, 2026)
- 4f71f73: [Docs] Update WeChat QR code for community support (#2895) (david6666666, Apr 18, 2026)
- d2c23d7: [Refactor] Remove resampy dependency (#2891) (NickCao, Apr 18, 2026)
- 4124a1f: [Feature]Support audio streaming input and output-phase2 (#2581) (Shirley125, Apr 18, 2026)
- 768931e: [BugFix]: Fix multi-stage cfg bug (#2801) (princepride, Apr 18, 2026)
- fe6cec6: [doc][skip ci] remove redundant content in readme (#2901) (Shirley125, Apr 18, 2026)
- 9cf1fe7: [Feat] cache-dit for GLM-Image (#1399) (RuixiangMa, Apr 18, 2026)
- 9313f37: [Agent] Add NPU main2main skill (#2858) (gcanlin, Apr 18, 2026)
- a683b1d: [Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prom… (Sy0307, Apr 18, 2026)
- a390381: [Config Refactor][2/N] Pipeline + Deploy Config Schema (#2383) (lishunyang12, Apr 19, 2026)
- 26edc7f: [Bugfix][VoxCPM2]: Fix vectorized_gather OOB under concurrent prefill… (Sy0307, Apr 19, 2026)
- 1568451: perf(helios): replace strided RoPE with stack+flatten for contiguous … (willamhou, Apr 19, 2026)
- 93beef1: [Bugfix] diffusion end points allow model mismatch (#2805) (xiaohajiayou, Apr 19, 2026)
- 68f28f9: [Feat] Support layerwise CPU offloading for more videogen models (#2018) (yuanheng-zhao, Apr 19, 2026)
- cd384d9: [Config Refactor 2.5/N] Centralize pipeline registry (#2915) (lishunyang12, Apr 19, 2026)
- 78f237e: [Perf] Optimize Wan2.2 device free on image preprocess (#2852) (fan2956, Apr 20, 2026)
- d435fe0: [Docs] update documents (#2921) (R2-Y, Apr 20, 2026)
- 0393c58: [BugFix] Fixed the issue where --no-async-chunk was not working. (#2934) (amy-why-3459, Apr 20, 2026)
- 8a9add1: [CI] Restructure vLLM-Omni Test Layout, Fixture Scope, and Support Mo… (yenuo26, Apr 20, 2026)
- 2d7a64e: Merge origin/main into dev/migrate-MR-v2 with semantic-safe conflict … (Sy0307, Apr 20, 2026)
- fd91ad9: Fix merge review issues on semantic-safe sync branch (Sy0307, Apr 20, 2026)
- 013005a: Fix scheduler finished cleanup on semantic-safe sync branch (Sy0307, Apr 20, 2026)
2 changes: 1 addition & 1 deletion .buildkite/test-nightly.yml
@@ -552,7 +552,7 @@ steps:
   - label: ":full_moon: Diffusion X2V · Accuracy Test"
     timeout_in_minutes: 180
     commands:
-      - pytest -s -v tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py --run-level advanced_model
+      - pytest -s -v tests/e2e/accuracy/wan22_i2v/test_wan22_i2v_video_similarity.py -m advanced_model --run-level advanced_model
     agents:
       queue: "mithril-h100-pool"
     plugins:
2 changes: 1 addition & 1 deletion docs/.nav.yml
@@ -107,7 +107,7 @@ nav:
       - design/feature/hsdp.md
       - design/feature/cache_dit.md
       - design/feature/teacache.md
-      - design/feature/async_chunk_design.md
+      - design/feature/async_chunk.md
       - design/feature/vae_parallel.md
       - design/feature/diffusion_step_execution.md
   - Module Design:
@@ -40,7 +40,7 @@ Currently all the features are available in online serving mode. Hence, only need
- Test marks: always add `advanced_model` and `diffusion`. Add GPU-related marks if needed. Ref: [Markers for Tests](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_markers/).
- To maximize code reuse, you may refer to
- `tests/conftest.py` for `omni_server` (running server in subprocess) and `openai_client` fixtures (sending requests and validating output), `generate_synthetic_image` and `assert_XXX_valid` helper.
-  - `tests/utils.py` for `@hardware_test(...)` and `hardware_marks`.
+  - `tests/helpers/mark.py` for `@hardware_test(...)` and `hardware_marks`.
- [Parametrizing tests (pytest doc)](https://docs.pytest.org/en/stable/example/parametrize.html) to reuse test function implementation for different cases.
- Doc: add a concise docstring for each test function.
- Reference L4 test implementation: [tests/e2e/online_serving/test_qwen_image_edit_expansion.py](https://github.com/vllm-project/vllm-omni/blob/main/tests/e2e/online_serving/test_qwen_image_edit_expansion.py).
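To see how these conventions fit together, here is a minimal sketch of an online-serving test (not part of this PR). The fixture and helper names (`omni_server`, `openai_client`, `generate_synthetic_image`) come from the guidance above; the marker arguments, request payload, and exact signatures are assumptions.

```python
# Hypothetical sketch following the guidance above; fixture signatures and the
# request payload are assumptions, not code from the repository.
import pytest

from tests.helpers.mark import hardware_test


@pytest.mark.advanced_model
@pytest.mark.diffusion
@hardware_test(res={"cuda": "L4"}, num_cards=1)
def test_image_edit_smoke(omni_server, openai_client, generate_synthetic_image):
    """Send one synthetic image-edit request and check that output comes back."""
    image = generate_synthetic_image(width=256, height=256)  # assumed helper signature
    response = openai_client.images.edit(
        model=omni_server.model,  # assumed attribute on the server fixture
        image=image,
        prompt="add a red hat to the person",
    )
    assert response.data, "server returned no image data"
```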
12 changes: 6 additions & 6 deletions docs/contributing/ci/tests_markers.md
@@ -38,7 +38,7 @@ Defined in `pyproject.toml`:
### Example usage for markers

```python
-from tests.utils import hardware_test
+from tests.helpers.mark import hardware_test

@pytest.mark.core_model
@pytest.mark.omni
@@ -53,7 +53,7 @@ def test_video_to_audio()

### Decorator: `@hardware_test`

-This decorator is intended to make hardware-aware, cross-platform test authoring easier and more robust for CI/CD environments. The `hardware_test` decorator in `vllm-omni/tests/utils.py` performs the following actions:
+This decorator is intended to make hardware-aware, cross-platform test authoring easier and more robust for CI/CD environments. The `hardware_test` decorator in `vllm-omni/tests/helpers/mark.py` performs the following actions:

1. **Applies platform and resource markers**
Adds the appropriate pytest markers for each specified hardware platform (e.g., `cuda`, `rocm`, `xpu`, `npu`) and resource type (e.g., `L4`, `H100`, `MI325`, `B60`, `A2`, `A3`).
@@ -105,7 +105,7 @@ This decorator is intended to make hardware-aware, cross-platform test authoring
`hardware_marks` returns a list of pytest mark objects with the same signature as `@hardware_test`. Use it when you need more flexibility, such as attaching hardware marks to individual `pytest.param` entries rather than an entire test function.

```python
-from tests.utils import hardware_marks
+from tests.helpers.mark import hardware_marks

MULTI_CARD_MARKS = hardware_marks(
    res={"cuda": "H100", "rocm": "MI325", "npu": "A2"}, num_cards=2
)
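# Continuation sketch, not from the diff: attaching the marks built above to a
# single pytest.param entry, as the preceding paragraph describes. The test
# name and parameter are purely illustrative.
@pytest.mark.parametrize(
    "tp_size",
    [pytest.param(2, marks=MULTI_CARD_MARKS)],
)
def test_tp_multi_card(tp_size):
    ...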
@@ -133,9 +133,9 @@ If you want to add support for a new platform (e.g., "tpu" for a new accelerator
"distributed_tpu: Tests that require multiple TPU devices",
]
```
-2. **Implement a marker construction function for your platform** in `vllm-omni/tests/utils.py`:
+2. **Implement a marker construction function for your platform** in `vllm-omni/tests/helpers/mark.py`:
```python
-# In vllm-omni/tests/utils.py
+# In vllm-omni/tests/helpers/mark.py

def tpu_marks(*, res: str, num_cards: int):
test_platform = pytest.mark.tpu
@@ -175,4 +175,4 @@ If you want to add support for a new platform (e.g., "tpu" for a new accelerator
- Plug into `hardware_marks`
- You're done: tests using `@hardware_test` or `hardware_marks` with your platform now automatically get the correct markers, distribution, and isolation!

-See code in `vllm-omni/tests/utils.py` for existing examples (`cuda_marks`, `rocm_marks`, `npu_marks`).
+See code in `vllm-omni/tests/helpers/mark.py` for existing examples (`cuda_marks`, `rocm_marks`, `npu_marks`).
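For orientation, here is one possible shape for the `tpu_marks` function sketched in step 2. Only the marker names from the `pyproject.toml` snippet above are taken from this page; the rest is an assumption about how the existing `*_marks` helpers look.

```python
# Hypothetical completion of the tpu_marks sketch above; marker names follow
# the pyproject.toml snippet, everything else is assumed.
import pytest


def tpu_marks(*, res: str, num_cards: int) -> list[pytest.MarkDecorator]:
    """Build platform, resource, and distribution markers for a TPU test."""
    marks = [pytest.mark.tpu, getattr(pytest.mark, res)]  # e.g. res="v5e"
    if num_cards > 1:
        # Multi-device runs get the distributed_tpu marker registered above.
        marks.append(pytest.mark.distributed_tpu)
    return marks
```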
14 changes: 6 additions & 8 deletions docs/contributing/ci/tests_style.md
@@ -221,15 +221,13 @@ from pathlib import Path
import openai
import pytest

-from tests.conftest import (
-    OmniServer,
-    convert_audio_to_text,
+from tests.helpers.media import (
+    convert_audio_bytes_to_text,
     cosine_similarity_text,
-    dummy_messages_from_mix_data,
     generate_synthetic_video,
-    merge_base64_and_convert_to_text,
 )
-from tests.utils import get_deploy_config_path
+from tests.helpers.runtime import OmniServer, dummy_messages_from_mix_data
+from tests.helpers.stage_config import get_deploy_config_path, modify_stage_config
from vllm_omni.platforms import current_omni_platform

# Edit: model name and stage config path
@@ -406,7 +404,7 @@ def test_mix_to_text_audio_001(client: openai.OpenAI, omni_server, request) -> N
# PURPOSE: Verify text and audio outputs convey the same information
# CUSTOMIZATION: Adjust similarity threshold (0.9) based on accuracy requirements
assert audio_data is not None, "No audio output is generated"
-audio_content = merge_base64_and_convert_to_text(audio_data)
+audio_content = convert_audio_bytes_to_text(audio_data)
print(f"text content is: {text_content}")
print(f"audio content is: {audio_content}")
similarity = cosine_similarity_text(audio_content.lower(), text_content.lower())
@@ -429,7 +427,7 @@ from pathlib import Path
import pytest
from vllm.assets.video import VideoAsset

from tests.utils import hardware_test
from tests.helpers.mark import hardware_test
from ..multi_stages.conftest import OmniRunner

# Optional: set process start method for workers
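The `cosine_similarity_text` helper used in `test_mix_to_text_audio_001` above is not shown in this diff. As a rough illustration of the kind of check it performs, here is a minimal bag-of-words sketch; the real implementation in `tests/helpers/media.py` may differ substantially (for example, it may use embeddings).

```python
# Illustrative sketch only; not the actual tests.helpers.media implementation.
import math
from collections import Counter


def cosine_similarity_text(a: str, b: str) -> float:
    """Cosine similarity between two strings over word-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```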
17 changes: 8 additions & 9 deletions docs/contributing/model/adding_omni_model.md
@@ -408,18 +408,17 @@ Understanding the data structures is crucial for implementing stage transitions:

**Input to your function:**
 - `stage_list[source_stage_id].engine_outputs`: List of `EngineCoreOutput` objects
-  - Each contains `outputs`: List of `RequestOutput` objects
-    - Each `RequestOutput` has:
-      - `token_ids`: Generated token IDs
-      - `multimodal_output`: Dict with keys like `"code_predictor_codes"`, etc.
-        - These are the hidden states or intermediate outputs from the model's forward pass
-      - `prompt_token_ids`: Original prompt token IDs
+  - Each contains `outputs`: List of `RequestOutput` objects
+    - Each `RequestOutput` has:
+      - `token_ids`: Generated token IDs
+      - `multimodal_output`: Dict with keys like `"code_predictor_codes"`, etc. These are the hidden states or intermediate outputs from the model's forward pass
+      - `prompt_token_ids`: Original prompt token IDs

**Output from your function:**
 - Must return `list[OmniTokensPrompt]` where each `OmniTokensPrompt` contains:
-  - `prompt_token_ids`: List[int] - Token IDs for the next stage
-  - `additional_information`: Dict[str, Any] - Optional metadata (e.g., embeddings, hidden states)
-  - `multi_modal_data`: Optional multimodal data if needed
+  - `prompt_token_ids`: List[int] - Token IDs for the next stage
+  - `additional_information`: Dict[str, Any] - Optional metadata (e.g., embeddings, hidden states)
+  - `multi_modal_data`: Optional multimodal data if needed
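
To make these shapes concrete, here is a hedged sketch of a stage-transition function assembled from the fields listed above; the import path and the exact constructor of `OmniTokensPrompt` are assumptions, not the repository's actual API.

```python
# Hypothetical sketch; import path and field layout of OmniTokensPrompt are
# assumptions based on the bullet lists above.
from vllm_omni.inputs import OmniTokensPrompt  # assumed import path


def transform_stage_outputs(stage_list, source_stage_id: int) -> list[OmniTokensPrompt]:
    """Convert one stage's engine outputs into prompts for the next stage."""
    prompts: list[OmniTokensPrompt] = []
    for engine_output in stage_list[source_stage_id].engine_outputs:
        for request_output in engine_output.outputs:
            prompts.append(
                OmniTokensPrompt(
                    prompt_token_ids=list(request_output.token_ids),
                    # Carry hidden states / codec codes forward as metadata.
                    additional_information=request_output.multimodal_output,
                )
            )
    return prompts
```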

### How Model Outputs Are Stored

Expand Down
4 changes: 2 additions & 2 deletions docs/contributing/model/adding_tts_model.md
@@ -28,7 +28,7 @@ and can be placed on different devices. Qwen3-TTS has two stages:

Each stage is a separate model class configured independently via YAML. The two stages
are connected by the `async_chunk` framework, which enables inter-stage streaming for
-low first-packet latency (see [Async Chunk Design](../../design/feature/async_chunk_design.md)).
+low first-packet latency (see [Async Chunk Design](../../design/feature/async_chunk.md)).

### Without async_chunk (batch mode)

@@ -591,5 +591,5 @@ Adding a TTS model to vLLM-Omni involves:
For more information, see:

- [Architecture Overview](../../design/architecture_overview.md)
-- [Async Chunk Design](../../design/feature/async_chunk_design.md)
+- [Async Chunk Design](../../design/feature/async_chunk.md)
- [Stage Configuration Guide](../../configuration/stage_configs.md)
@@ -1,4 +1,4 @@
-# Async Chunk Design
+# Async Chunk

## Table of Contents

@@ -88,8 +88,9 @@ The following diagram illustrates the **Async Chunk Architecture** for multi-sta
</p>

 **Diagram Legend:**
+
 | Step | Stage Type | Description |
-|:------:|:-----------:|:------------|
+|------|-----------|------------|
 | `prefill` | Initialization | Context processing, KV cache initialization |
 | `decode` | Autoregressive | Token-by-token generation in AR stages |
 | `codes` | Audio Encoding | RVQ codec codes from Talker stage |
Binary file modified docs/source/architecture/async-chunk-architecture.png