Merged
141 commits
389979e
add async
yinpeiqi Feb 7, 2026
b9bc8e3
init runnable async omni
yinpeiqi Feb 7, 2026
fa1099b
temp
yinpeiqi Feb 10, 2026
2140b90
update async omni
yinpeiqi Feb 11, 2026
65cafdc
refactor init
yinpeiqi Feb 11, 2026
1078d57
add next stage without input processor
yinpeiqi Feb 11, 2026
ad1b313
move input processor to engine
yinpeiqi Feb 11, 2026
378f631
decouple input processor
yinpeiqi Feb 11, 2026
3fd53a7
refactor output processor
yinpeiqi Feb 11, 2026
97bc157
remove omni input processor
yinpeiqi Feb 12, 2026
b25a26f
use orchestrator
yinpeiqi Feb 12, 2026
01319dd
update
yinpeiqi Feb 12, 2026
d76f9b5
add metrics
yinpeiqi Feb 12, 2026
6093ea5
add download
yinpeiqi Feb 24, 2026
fc61104
add support for diffusion model
yinpeiqi Feb 28, 2026
a20f291
add doc
yinpeiqi Feb 28, 2026
0173b6c
update e2e
yinpeiqi Mar 2, 2026
5445e58
add precommit
yinpeiqi Mar 2, 2026
6c689f8
fix main
yinpeiqi Mar 2, 2026
421a46b
[draft] add basic support for bagel
yinpeiqi Mar 2, 2026
f82c940
add async chunk
yinpeiqi Mar 3, 2026
d8d1072
add qwen3 example
yinpeiqi Mar 3, 2026
8290c5d
update test
yinpeiqi Mar 4, 2026
d6e66c6
move init to engine
yinpeiqi Mar 4, 2026
da80711
rename files
yinpeiqi Mar 4, 2026
5f5d6b1
rename output handler
yinpeiqi Mar 5, 2026
9f3e156
add doc
yinpeiqi Mar 5, 2026
63f3b02
cleanup
yinpeiqi Mar 5, 2026
0b65a8b
add test case
yinpeiqi Mar 5, 2026
1659c7b
update doc
yinpeiqi Mar 5, 2026
6992042
add omni base and omni
yinpeiqi Mar 6, 2026
e31cc45
use janus queue
yinpeiqi Mar 6, 2026
67b392b
update doc
yinpeiqi Mar 6, 2026
1fc31b3
update
yinpeiqi Mar 6, 2026
4cd2d0e
update download
yinpeiqi Mar 9, 2026
f1d0ef2
update import
yinpeiqi Mar 9, 2026
d39053d
update test
yinpeiqi Mar 9, 2026
8a39b6c
update test
yinpeiqi Mar 9, 2026
0d349d2
update openai api
yinpeiqi Mar 10, 2026
1cfd827
fix
yinpeiqi Mar 10, 2026
4d84ba2
update e2e
yinpeiqi Mar 10, 2026
803cef2
rebase, update shotdown
yinpeiqi Mar 10, 2026
c5953f9
add pre-commit
yinpeiqi Mar 10, 2026
0be2032
update serve cli
yinpeiqi Mar 10, 2026
9476536
add parallel init
yinpeiqi Mar 10, 2026
e6d69b8
update rebase
yinpeiqi Mar 10, 2026
923c575
update
yinpeiqi Mar 10, 2026
328f033
rebase
yinpeiqi Mar 10, 2026
a68c7ca
rebase
yinpeiqi Mar 10, 2026
20395a3
update
yinpeiqi Mar 10, 2026
03951b3
update setup
yinpeiqi Mar 10, 2026
9e89e52
update and fix
yinpeiqi Mar 10, 2026
6d945b8
refactor
yinpeiqi Mar 10, 2026
0a9beae
rm v1 files
yinpeiqi Mar 10, 2026
4ae402e
update config
yinpeiqi Mar 11, 2026
aa33e8e
update config
yinpeiqi Mar 11, 2026
697187a
remove v0
yinpeiqi Mar 11, 2026
4dcc1f1
rm v1
yinpeiqi Mar 11, 2026
066eb03
delete input processor
yinpeiqi Mar 11, 2026
bae0b83
use weak ref
yinpeiqi Mar 11, 2026
6cbe213
update
yinpeiqi Mar 11, 2026
6d09583
remove deperated
yinpeiqi Mar 11, 2026
0d74355
stage cli (#4)
wuhang2014 Mar 11, 2026
3e92219
update
yinpeiqi Mar 12, 2026
c2ee926
add get supported tasks
yinpeiqi Mar 12, 2026
349344e
fix ci
yinpeiqi Mar 12, 2026
02a57b3
update get config
yinpeiqi Mar 12, 2026
3d678fa
update doc
yinpeiqi Mar 12, 2026
6e5ee42
update tts yaml
yinpeiqi Mar 12, 2026
673a6df
fix pre commit
yinpeiqi Mar 12, 2026
cef860f
update config
yinpeiqi Mar 12, 2026
f515204
fix
yinpeiqi Mar 12, 2026
5e70d3f
fix ci
yinpeiqi Mar 12, 2026
8d01da3
fix
yinpeiqi Mar 12, 2026
cfd1d1d
resolve config (#7)
wuhang2014 Mar 12, 2026
7b5fb26
fix for qwen3 tts
yinpeiqi Mar 13, 2026
c00cb79
fix for diffusion
yinpeiqi Mar 13, 2026
0ced660
fix stage id is none
yinpeiqi Mar 13, 2026
1616281
fix
yinpeiqi Mar 13, 2026
463a96c
fix
yinpeiqi Mar 13, 2026
941205f
rm logs
yinpeiqi Mar 13, 2026
7abead8
update
yinpeiqi Mar 13, 2026
cce6a56
fix
yinpeiqi Mar 13, 2026
ad20d43
rm request output list example
yinpeiqi Mar 13, 2026
20f72c2
fix pre commit
yinpeiqi Mar 13, 2026
8b7d483
change timtout time
yinpeiqi Mar 16, 2026
42c2efe
add factory usage
yinpeiqi Mar 16, 2026
db05cbb
fix
yinpeiqi Mar 16, 2026
eaa254a
update config
yinpeiqi Mar 16, 2026
7f2d1f5
Merge branch 'main' into refactor
fake0fan Mar 16, 2026
efb2e85
fix pre commit
yinpeiqi Mar 16, 2026
8dc3b18
Merge branch 'vllm-project:main' into refactor3
yinpeiqi Mar 16, 2026
8ee93ae
Merge pull request #14 from yinpeiqi/refactor3
yinpeiqi Mar 16, 2026
256cfbc
add comfyui
yinpeiqi Mar 16, 2026
bdb6e30
fix comfyui
yinpeiqi Mar 16, 2026
ca788e0
Merge pull request #15 from yinpeiqi/refactor3
yinpeiqi Mar 16, 2026
e9190c5
update stage
yinpeiqi Mar 16, 2026
58fbf74
fix time sleep
yinpeiqi Mar 16, 2026
36099da
Fix Qwen3-TTS broken on refactor: add pipeline.yaml and fix async_chu…
linyueqian Mar 16, 2026
027a68f
Fix Base voice clone: use actual codec encoder for exact ref_code_len
linyueqian Mar 16, 2026
18f632e
Merge pull request #1 from fake0fan/refactor
yinpeiqi Mar 17, 2026
caab74c
Merge branch 'refactor3' of https://github.com/yinpeiqi/vllm-omni int…
yinpeiqi Mar 17, 2026
6caeea1
add docs for current arch
yinpeiqi Mar 17, 2026
1b06e11
fix description
yinpeiqi Mar 17, 2026
10657a1
Merge pull request #16 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
ca88ec0
rm deparated funcs
yinpeiqi Mar 17, 2026
e03fced
rm deparated class
yinpeiqi Mar 17, 2026
2de3bed
Merge branch 'main' into refactor
yinpeiqi Mar 17, 2026
f5492fe
Merge pull request #17 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
07f8bfa
mv worker cls utils
yinpeiqi Mar 17, 2026
9d7b905
Fix perf config: add is_comprehension to qwen3_tts stage 0
linyueqian Mar 17, 2026
41414d4
Support auto-detection for TTS perf benchmark (optional stage_config_…
linyueqian Mar 17, 2026
52aba8f
Merge pull request #2 from fake0fan/refactor
yinpeiqi Mar 17, 2026
ff94a97
change stage init to stage init utils
yinpeiqi Mar 17, 2026
df17139
Set gpu_memory_utilization to 0.08 for Qwen3-TTS (1.7B model)
linyueqian Mar 17, 2026
23ddbca
Merge pull request #18 from yinpeiqi/refactor3
yinpeiqi Mar 17, 2026
264dead
refactor
yinpeiqi Mar 17, 2026
4d7dc9e
add kv transfer inject and cfg expand
princepride Mar 17, 2026
1b1acf2
rename stage_init.py -> stage_init_utils.py and align comments with r…
princepride Mar 17, 2026
9d63d40
Merge fake0fan/refactor into fix-bagel-bugs
princepride Mar 17, 2026
a432d99
Merge pull request #19 from princepride/fix-bagel-bugs
fake0fan Mar 17, 2026
db33f8d
fix some bug
princepride Mar 17, 2026
cf99223
remove mutli image output
princepride Mar 17, 2026
2000d6f
fix: use legacy config loading path instead of StageConfigFactory
lishunyang12 Mar 17, 2026
45e8381
Merge pull request #20 from lishunyang12/fix/use-legacy-config-path
fake0fan Mar 17, 2026
c5e22f6
fix: increase gpu_memory_utilization for TTS CI on L4 GPUs
lishunyang12 Mar 17, 2026
9a46667
Merge pull request #22 from princepride/fix-bagel-bugs-2
fake0fan Mar 17, 2026
0faad47
Merge pull request #23 from lishunyang12/fix/tts-ci-gpu-memory
fake0fan Mar 17, 2026
febe9c8
fix pre-commit and glm-image
fake0fan Mar 17, 2026
4282c09
Merge branch 'main' into refactor
fake0fan Mar 18, 2026
6e37c1a
Merge branch 'refactor3' into refactor
yinpeiqi Mar 18, 2026
6687859
Merge pull request #3 from fake0fan/refactor
yinpeiqi Mar 18, 2026
4c32e7a
fix precommit, fix error
yinpeiqi Mar 18, 2026
d50a1b7
add utils for helper function
yinpeiqi Mar 18, 2026
e079342
Merge pull request #25 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
5dc6422
fix import
yinpeiqi Mar 18, 2026
fc55262
Merge pull request #26 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
a8ea9da
Merge branch 'main' into refactor
yinpeiqi Mar 18, 2026
0309f54
fix is alive, avoid duplicate check
yinpeiqi Mar 18, 2026
fe24400
Merge pull request #27 from yinpeiqi/refactor3
yinpeiqi Mar 18, 2026
239a3f8
Merge branch 'main' into refactor
yinpeiqi Mar 18, 2026
31 changes: 12 additions & 19 deletions .github/ISSUE_TEMPLATE/400-bug-report.yml
Original file line number Diff line number Diff line change
Expand Up @@ -74,28 +74,21 @@ body:
If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:

```python
from vllm_omni import OmniLLM, create_ar_stage_config, create_dit_stage_config

# Create stage configurations
ar_config = create_ar_stage_config(
stage_id=0,
model_path="Qwen/Qwen3-0.6B",
input_modalities=["text"],
output_modalities=["text"]
)
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm import SamplingParams

dit_config = create_dit_stage_config(
stage_id=1,
model_path="stabilityai/stable-diffusion-2-1",
input_modalities=["text"],
output_modalities=["image"]
omni = Omni(
model="Qwen/Qwen-Image",
stage_configs_path="/path/to/stage_configs.yaml",
)

# Initialize OmniLLM
omni = OmniLLM([ar_config, dit_config])

# Generate
outputs = omni.generate(prompt="A scenic watercolor painting of a lighthouse at sunset")
prompts = [{"prompt": "A scenic watercolor painting of a lighthouse at sunset"}]
sampling_params_list = [
SamplingParams(max_tokens=1),
OmniDiffusionSamplingParams(num_outputs_per_prompt=1),
]
outputs = omni.generate(prompts=prompts, sampling_params_list=sampling_params_list)
```

If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
Expand Down
6 changes: 3 additions & 3 deletions benchmarks/qwen3-omni/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,9 +56,9 @@ What it does:
- Runs `examples/offline_inference/qwen3_omni/end2end.py` with `--log-stats`.
- Uses `benchmarks/build_dataset/top100.txt` and writes to:
- Logs: `benchmarks/qwen3-omni/vllm_omni/logs/`
- `omni_llm_pipeline_text.orchestrator.stats.jsonl` — per-stage latency stats.
- `omni_llm_pipeline_text.overall.stats.jsonl` — end-to-end latency/TPS.
- `omni_llm_pipeline_text.stage{0,1,2}.log` — per-stage detailed logs/errors.
- `omni_pipeline_text.orchestrator.stats.jsonl` — per-stage latency stats.
- `omni_pipeline_text.overall.stats.jsonl` — end-to-end latency/TPS.
- `omni_pipeline_text.stage{0,1,2}.log` — per-stage detailed logs/errors.
- Outputs: `benchmarks/qwen3-omni/vllm_omni/outputs/` — ~100 text and `.wav` files.

Key checks:
Expand Down
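The stats files named above are JSON-lines. As a quick sanity check, a small stdlib-only sketch like the following can summarize one latency field from a `.stats.jsonl` file (the `e2e_latency` key is an assumption for illustration — substitute whatever keys the jsonl actually records):

```python
import json
from pathlib import Path


def summarize_stats(path: str, field: str = "e2e_latency"):
    """Read a .stats.jsonl file and report count/mean/max for one numeric field.

    NOTE: the default field name is a placeholder; inspect the jsonl to find
    the real keys emitted by the orchestrator/overall stats writers.
    """
    values = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if field in record:
            values.append(float(record[field]))
    if not values:
        return None
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }
```

For example, `summarize_stats("benchmarks/qwen3-omni/vllm_omni/logs/omni_pipeline_text.overall.stats.jsonl")` would give a one-line view of end-to-end latency across the run.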
12 changes: 6 additions & 6 deletions benchmarks/qwen3-omni/vllm_omni/eval_qwen3_moe_omni.sh
Original file line number Diff line number Diff line change
Expand Up @@ -26,12 +26,12 @@ else
--log-stats \
--log-dir $log_dir
echo "Logs and outputs are saved in ${log_dir} and ${outputs_dir} respectively:"
echo " - omni_llm_pipeline_text run dir/base name"
echo " - omni_llm_pipeline_text.orchestrator.stats.jsonl orchestrator-stage latency stats"
echo " - omni_llm_pipeline_text.overall.stats.jsonl overall latency/TPS stats"
echo " - omni_llm_pipeline_text.stage0.log per-stage detailed logs"
echo " - omni_llm_pipeline_text.stage1.log"
echo " - omni_llm_pipeline_text.stage2.log"
echo " - omni_pipeline_text run dir/base name"
echo " - omni_pipeline_text.orchestrator.stats.jsonl orchestrator-stage latency stats"
echo " - omni_pipeline_text.overall.stats.jsonl overall latency/TPS stats"
echo " - omni_pipeline_text.stage0.log per-stage detailed logs"
echo " - omni_pipeline_text.stage1.log"
echo " - omni_pipeline_text.stage2.log"
echo "Key checks: overall.stats.jsonl for end-to-end latency/TPS; orchestrator.stats.jsonl for stable per-stage latency; stage*.log for errors or long tails."
echo " - outputs/ Generated txt and wav files, there should be 100 text and wav files generated respectively"
fi
10 changes: 1 addition & 9 deletions docs/api/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,19 +6,13 @@ Main entry points for vLLM-Omni inference and serving.

- [vllm_omni.entrypoints.async_omni.AsyncOmni][]
- [vllm_omni.entrypoints.async_omni_diffusion.AsyncOmniDiffusion][]
- [vllm_omni.entrypoints.async_omni_llm.AsyncOmniLLM][]
- [vllm_omni.entrypoints.cli.benchmark.base.OmniBenchmarkSubcommandBase][]
- [vllm_omni.entrypoints.cli.benchmark.main.OmniBenchmarkSubcommand][]
- [vllm_omni.entrypoints.cli.benchmark.serve.OmniBenchmarkServingSubcommand][]
- [vllm_omni.entrypoints.cli.serve.OmniServeCommand][]
- [vllm_omni.entrypoints.client_request_state.ClientRequestState][]
- [vllm_omni.entrypoints.omni.Omni][]
- [vllm_omni.entrypoints.omni.OmniBase][]
- [vllm_omni.entrypoints.omni_diffusion.OmniDiffusion][]
- [vllm_omni.entrypoints.omni_llm.OmniLLM][]
- [vllm_omni.entrypoints.omni_stage.OmniStage][]
- [vllm_omni.entrypoints.stage_utils.OmniStageTaskType][]
- [vllm_omni.entrypoints.zmq_utils.ZmqQueue][]
- [vllm_omni.entrypoints.omni_base.OmniBase][]

## Inputs

Expand Down Expand Up @@ -48,9 +42,7 @@ Engine classes for offline and online inference.
- [vllm_omni.engine.OmniEngineCoreOutputs][]
- [vllm_omni.engine.OmniEngineCoreRequest][]
- [vllm_omni.engine.PromptEmbedsPayload][]
- [vllm_omni.engine.arg_utils.AsyncOmniEngineArgs][]
- [vllm_omni.engine.arg_utils.OmniEngineArgs][]
- [vllm_omni.engine.input_processor.OmniInputProcessor][]
- [vllm_omni.engine.output_processor.MultimodalOutputProcessor][]
- [vllm_omni.engine.output_processor.OmniRequestState][]

Expand Down
4 changes: 2 additions & 2 deletions docs/configuration/stage_configs.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ If users want to modify some part of it. The custom stage_configs file can be in
For offline (Assume necessary dependencies have been imported):
```python
model_name = "Qwen/Qwen2.5-Omni-7B"
omni_llm = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
```

For online serving:
Expand All @@ -30,7 +30,7 @@ vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to

Below is a specific example of stage_configs.yaml in Qwen2.5-omni.
```python
# stage config for running qwen2.5-omni with architecture of OmniLLM.
# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
- stage_id: 0 # mark the unique id for each stage
runtime: # The disaggregated configuration
Expand Down
2 changes: 1 addition & 1 deletion docs/configuration/stage_configs/qwen2_5_omni.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# stage config for running qwen2.5-omni with architecture of OmniLLM.
# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
- stage_id: 0
runtime:
Expand Down
2 changes: 1 addition & 1 deletion docs/contributing/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` For changes on the vLLM-Omni frontend (e.g., OpenAI API server, `OmniLLM` class, etc.)
- `[Frontend]` For changes on the vLLM-Omni frontend (e.g., OpenAI API server, `Omni`/`AsyncOmni`, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM-Omni logic (e.g., `OmniProcessor`, `OmniARScheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix, such as [Ascend] for Ascend NPUs.
Expand Down
5 changes: 0 additions & 5 deletions docs/contributing/ci/CI_5levels.md
Original file line number Diff line number Diff line change
Expand Up @@ -168,11 +168,6 @@ vllm_omni/ tests/
│ └── arg_utils.py │ └── test_arg_utils.py ⬜
├── entrypoints/ → ├── entrypoints/
│ ├── omni.py │ ├── test_omni.py ⬜ (E2E covered by e2e/offline, e2e/online)
[Collaborator comment] Any new tests for AsyncOmniEngine and Orchestrator?

[Contributor reply] will add later

│ ├── omni_llm.py │ ├── test_omni_llm.py ✅
│ ├── omni_stage.py │ ├── test_omni_stage.py ⬜ (partial in test_omni_stage_diffusion_config.py)
│ ├── omni_diffusion.py │ ├── test_omni_diffusion.py ✅
│ ├── async_omni.py │ ├── test_async_omni.py ✅ actually in e2e/online_serving/test_async_omni.py
│ ├── async_omni_diffusion.py │ ├── test_async_omni_diffusion_config.py ✅
│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
Expand Down
9 changes: 1 addition & 8 deletions docs/contributing/ci/tests_style.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,6 @@ End-to-end tests verify the complete functionality of a system or component. For

- **`tests/e2e/online_serving/`**: Tests for online serving scenarios (e.g., API server tests)

**Example:** The test file for `vllm_omni/entrypoints/omni_llm.py` should be located at `tests/entrypoints/test_omni_llm.py`.

## Test Directory Structure

The ideal directory structure mirrors the source code organization. Legend: `✅` = test exists, `⬜` = suggested to add.
Expand Down Expand Up @@ -75,11 +73,6 @@ vllm_omni/ tests/
│ └── arg_utils.py │ └── test_arg_utils.py ⬜
├── entrypoints/ → ├── entrypoints/
│ ├── omni.py │ ├── test_omni.py ⬜ (E2E covered by e2e/offline, e2e/online)
│ ├── omni_llm.py │ ├── test_omni_llm.py ✅
│ ├── omni_stage.py │ ├── test_omni_stage.py ⬜ (partial in test_omni_stage_diffusion_config.py)
│ ├── omni_diffusion.py │ ├── test_omni_diffusion.py ✅
│ ├── async_omni.py │ ├── test_async_omni.py ✅ actually in e2e/online_serving/test_async_omni.py
│ ├── async_omni_diffusion.py │ ├── test_async_omni_diffusion_config.py ✅
│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
Expand Down Expand Up @@ -170,7 +163,7 @@ vllm_omni/ tests/

### Naming Conventions

- **Unit Tests**: Use `test_<module_name>.py` format. Example: `omni_llm.py` → `test_omni_llm.py`
- **Unit Tests**: Use `test_<module_name>.py` format. Example: `stage_utils.py` → `test_stage_utils.py`

- **E2E Tests**: Place in `tests/e2e/offline_inference/` or `tests/e2e/online_serving/` with descriptive names. Example: `tests/e2e/offline_inference/test_qwen3_omni.py`, `tests/e2e/offline_inference/test_diffusion_model.py`

Expand Down
62 changes: 27 additions & 35 deletions docs/contributing/model/adding_omni_model.md
Original file line number Diff line number Diff line change
Expand Up @@ -330,54 +330,46 @@ Stage transitions are the mechanism by which outputs from one stage are converte

### Where Stage Transitions Are Called

[Collaborator comment] we should also change the corresponding diffusion models docs

Stage transitions happen automatically in the orchestrator (`OmniLLM` class) during the generation loop. Here's the detailed flow:
Stage transitions happen automatically in the runtime orchestrator. Here's the detailed flow:

1. **Location**: `vllm_omni/entrypoints/omni_llm.py` in the `_run_generation()` method
1. **Location**: `vllm_omni/engine/orchestrator.py` in `_forward_to_next_stage()`
2. **Trigger**: When a stage completes processing and produces outputs
3. **Execution Flow**:
```python
# In omni_llm.py, _run_generation() method (around line 345-460)

# Main orchestrator loop polls each stage for completed requests
for stage_id, stage in enumerate(self.stage_list):
result = stage.try_collect() # Get completed request
if result is None:
continue

# Store outputs from this stage
engine_outputs = _load(result, obj_key="engine_outputs", shm_key="engine_outputs_shm")
stage.set_engine_outputs(engine_outputs)

# Check if there's a next stage to forward to
next_stage_id = stage_id + 1
if next_stage_id < len(self.stage_list):
next_stage: OmniStage = self.stage_list[next_stage_id]

# THIS IS WHERE STAGE TRANSITION HAPPENS
next_inputs = next_stage.process_engine_inputs(
self.stage_list,
[request_id_to_prompt[req_id]]
)

# Submit to next stage
task = {
"type": OmniStageTaskType.GENERATE,
"request_id": req_id,
"engine_inputs": next_inputs[0],
"sampling_params": sampling_params_list[next_stage_id],
}
next_stage.submit(task)
# In orchestrator.py
next_stage_id = stage_id + 1
next_client = self.stage_clients[next_stage_id]
params = req_state.sampling_params_list[next_stage_id]

# Save current stage outputs so stage_input_processors can consume them.
self.stage_clients[stage_id].set_engine_outputs([output])

# THIS IS WHERE STAGE TRANSITION HAPPENS
next_inputs = next_client.process_engine_inputs(
stage_list=self.stage_clients,
prompt=req_state.prompt,
)

# Build and submit request(s) to the next stage.
for next_input in next_inputs:
request = build_engine_core_request_from_tokens(
request_id=req_id,
prompt=next_input,
params=params,
model_config=self.stage_vllm_configs[next_stage_id].model_config,
)
await next_client.add_request_async(request)
```

### How Stage Transitions Work

The stage transition process follows these steps:

1. **Stage Completion**: When a stage finishes processing a request, it stores outputs via `stage.set_engine_outputs(engine_outputs)`
1. **Stage Completion**: When a stage finishes processing a request, the orchestrator stores outputs via `stage_client.set_engine_outputs(...)`

2. **Transition Detection**: The orchestrator checks if there's a next stage and calls `process_engine_inputs()` on it

3. **Input Processing**: The `process_engine_inputs()` method in `OmniStage` (`omni_stage.py`) handles the transition:
3. **Input Processing**: The stage input processor configured in stage YAML (under `vllm_omni/model_executor/stage_input_processors/`) handles the transition:
```python
def process_engine_inputs(
self, stage_list: list[Any], prompt: OmniTokensPrompt | TextPrompt = None
Expand Down
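To make the transition contract above concrete, here is a deliberately minimal, hypothetical input processor. The class name, constructor, and the `token_ids` key are illustrative stand-ins, not the real vllm_omni API — real processors live under `vllm_omni/model_executor/stage_input_processors/` and typically do more work (e.g. mapping hidden states or codec tokens between modalities):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TokensPrompt:
    # Minimal stand-in for an OmniTokensPrompt-style next-stage input.
    prompt_token_ids: list[int] = field(default_factory=list)


class PassthroughInputProcessor:
    """Illustrative transition: forward the previous stage's generated
    token ids as the next stage's prompt, one input per engine output."""

    def __init__(self, prev_stage_id: int):
        self.prev_stage_id = prev_stage_id

    def process_engine_inputs(
        self, stage_list: list[Any], prompt: Any = None
    ) -> list[TokensPrompt]:
        # Read the outputs the orchestrator stored on the previous stage
        # via set_engine_outputs(...).
        outputs = stage_list[self.prev_stage_id].get_engine_outputs()
        return [
            TokensPrompt(prompt_token_ids=list(o["token_ids"])) for o in outputs
        ]
```

The key design point is that the orchestrator stays generic: it only calls `process_engine_inputs()`, and all model-specific knowledge about how stage N's outputs become stage N+1's inputs lives in the processor configured in the stage YAML.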
20 changes: 11 additions & 9 deletions docs/contributing/profiling.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,30 +23,32 @@ export VLLM_PROFILER_MAX_ITERS=1
The profiler defaults to running across all stages, but it is highly recommended to profile specific stages by passing the stages list, to avoid producing overly large trace files:
```python
# Profile all stages
omni_llm.start_profile()
omni.start_profile()

# Only profile Stage 1
omni_llm.start_profile(stages=[1])
omni.start_profile(stages=[1])
```

```python
# Stage 0 (Thinker) and Stage 2 (Audio Decoder) for qwen omni
omni_llm.start_profile(stages=[0, 2])
omni.start_profile(stages=[0, 2])
```

**Python Usage**: Wrap your generation logic with `start_profile()` and `stop_profile()`.

```python
from vllm_omni import omni_llm
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# 1. Start profiling if enabled
if profiler_enabled:
omni_llm.start_profile(stages=[0])
omni.start_profile(stages=[0])

# Initialize generator
omni_generator = omni_llm.generate(prompts, sampling_params_list, py_generator=args.py_generator)
omni_generator = omni.generate(prompts, sampling_params_list, py_generator=args.py_generator)

total_requests = len(prompts)
processed_count = 0
Expand All @@ -57,21 +59,21 @@ for stage_outputs in omni_generator:
# ... [Output processing logic for text/audio would go here] ...

# Update count to track when to stop profiling
processed_count += len(stage_outputs.request_output)
processed_count += 1

# 2. Check if all requests are done to stop the profiler safely
if profiler_enabled and processed_count >= total_requests:
print(f"[Info] Processed {processed_count}/{total_requests}. Stopping profiler inside active loop...")

# Stop the profiler while workers are still active
omni_llm.stop_profile()
omni.stop_profile()

# Wait for traces to flush to disk
print("[Info] Waiting 30s for workers to write trace files to disk...")
time.sleep(30)
print("[Info] Trace export wait time finished.")

omni_llm.close()
omni.close()
```


Expand Down
10 changes: 5 additions & 5 deletions docs/design/architecture_overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,12 +67,12 @@ According to analysis for current popular open-source models, most of them have
| Component | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **OmniRouter** | provide an intelligent router for Omni-modality requests dispatch |
| **EntryPoints** | define the APIs for offline/online serving (APIServer, Omni/AsyncOmni) and provide the OmniStage abstraction for different AR/DiT stages |
| **EntryPoints** | define the APIs for offline/online serving (APIServer, Omni/AsyncOmni), while `AsyncOmniEngine` and `Orchestrator` coordinate multi-stage AR/DiT execution |
| **AR** | adapted for omni-modality models while inheriting efficient features from vLLM, such as cache management |
| **Diffusion** | natively implemented and optimized using acceleration components |
| **OmniConnector** | supports fully disaggregation based on E/P/D/G (Encoding/Processing/Decoding/Generation) disaggregation across stages |

Disaggregated stages are managed through configuration, such as in the Qwen3-Omni example, where stages like Thinker, Talker, and Code2wav are defined as separate OmniStage instances with specific resources and input/output type.
Disaggregated stages are managed through stage configuration. In Qwen3-Omni, Thinker/Talker/Code2wav are declared as separate configured stages, and runtime routing is handled by `Orchestrator` over `StageEngineCoreClient` / `StageDiffusionClient`.

## Main features

Expand Down Expand Up @@ -127,10 +127,10 @@ Taking **Qwen3-Omni** as an example:
The **Omni** class provides a Python interface for offline batched inference. Users initialize the Omni class with a Hugging Face model name and use the generate method, passing inputs that include both text prompts and multi-modal data:

```
# Create an omni_lm with HF model name.
# Create an omni runtime with HF model name.
from vllm_omni.entrypoints.omni import Omni

omni_lm = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")

# Example prompts.
om_inputs = {"prompt": prompt,
Expand All @@ -140,7 +140,7 @@ om_inputs = {"prompt": prompt,
}}

# Generate texts and audio from the multi-modality inputs.
outputs = omni_lm.generate(om_inputs, sampling_params_list)
outputs = omni.generate(om_inputs, sampling_params_list)
```

## Online Serving
Expand Down
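The stage routing described in this overview can be pictured with a toy asyncio pipeline — purely illustrative; `run_stage`, `pipeline`, and the three stage names are stand-ins, not vllm_omni classes. An orchestrator-like chain of queues lets each stage consume requests and forward results to the next, which is the essence of how disaggregated stages compose:

```python
import asyncio


async def run_stage(transform, inbox, outbox):
    # A toy stage: consume requests, apply this stage's work, forward results.
    while True:
        item = await inbox.get()
        if item is None:  # shutdown sentinel propagates down the pipeline
            await outbox.put(None)
            break
        req_id, payload = item
        await outbox.put((req_id, transform(payload)))


async def pipeline(requests):
    # Three toy stages standing in for Thinker -> Talker -> Code2wav.
    q0, q1, q2, q3 = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(run_stage(lambda x: x + " [text]", q0, q1)),
        asyncio.create_task(run_stage(lambda x: x + " [codes]", q1, q2)),
        asyncio.create_task(run_stage(lambda x: x + " [wav]", q2, q3)),
    ]
    for req_id, payload in enumerate(requests):
        await q0.put((req_id, payload))
    await q0.put(None)
    results = []
    while (item := await q3.get()) is not None:
        results.append(item)
    await asyncio.gather(*tasks)
    return results
```

Running `asyncio.run(pipeline(["hello"]))` returns `[(0, "hello [text] [codes] [wav]")]` — each request flows through every stage in order, while different requests can be in flight in different stages concurrently, which is the property the real orchestrator exploits.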