-
-
----
-
-*Latest News* 🔥
-- [2026/03] We released [0.18.0](https://github.com/vllm-project/vllm-omni/releases/tag/v0.18.0) - strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.
-- [2026/03] Check out our first public [project deepdive](https://youtu.be/sgwNfsNnR9I) at the vLLM Hong Kong Meetup!
-- [2026/03] **[vllm-omni-skills](https://github.com/hsliuustc0106/vllm-omni-skills)** is a community-driven collection of AI assistant skills that help developers work with vLLM-Omni more effectively. These skills can be used with popular agentic AI coding assistants like **Cursor IDE**, **Claude**, **Codex**, and more.
-- [2026/02] We released [0.16.0](https://github.com/vllm-project/vllm-omni/releases/tag/v0.16.0) - A major alignment + capability release that rebases onto **upstream vLLM v0.16.0** and significantly expands performance, distributed execution, and production readiness across **Qwen3-Omni / Qwen3-TTS**, **Bagel**, **MiMo-Audio**, **GLM-Image** and the **Diffusion (DiT) image/video stack**—while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.
-- [2026/02] We released [0.14.0](https://github.com/vllm-project/vllm-omni/releases/tag/v0.14.0) - This is the first **stable release** of vLLM-Omni that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability. Please check our latest [paper](https://arxiv.org/abs/2602.02204) for architecture design and performance results.
-- [2026/01] We released [0.12.0rc1](https://github.com/vllm-project/vllm-omni/releases/tag/v0.12.0rc1) - a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm).
-- [2025/11] The vLLM community officially released [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) to support omni-modality model serving.
-
----
-
-## About
-
-[vLLM](https://github.com/vllm-project/vllm) was originally designed to support large language models for text-based autoregressive generation tasks. vLLM-Omni is a framework that extends its support for omni-modality model inference and serving:
-
-- **Omni-modality**: Text, image, video, and audio data processing
-- **Non-autoregressive Architectures**: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models
-- **Heterogeneous outputs**: from traditional text generation to multimodal outputs
-
-
-
-
-
-
-
-vLLM-Omni is fast with:
-
-- State-of-the-art AR support by leveraging efficient KV cache management from vLLM
-- Overlapped, pipelined stage execution for high throughput
-- Full disaggregation based on OmniConnector, with dynamic resource allocation across stages
-
-vLLM-Omni is flexible and easy to use with:
-
-- Heterogeneous pipeline abstraction to manage complex model workflows
-- Seamless integration with popular Hugging Face models
-- Tensor, pipeline, data and expert parallelism support for distributed inference
-- Streaming outputs
-- OpenAI-compatible API server
-
-vLLM-Omni seamlessly supports most popular open-source models on HuggingFace, including:
-
-- Omni-modality models (e.g. Qwen-Omni)
-- Multi-modality generation models (e.g. Qwen-Image)
-
-## Getting Started
-
-Visit our [documentation](https://vllm-omni.readthedocs.io/en/latest/) to learn more.
-
-- [Installation](https://vllm-omni.readthedocs.io/en/latest/getting_started/installation/)
-- [Quickstart](https://vllm-omni.readthedocs.io/en/latest/getting_started/quickstart/)
-- [List of Supported Models](https://vllm-omni.readthedocs.io/en/latest/models/supported_models/)
-
-## Contributing
-
-We welcome and value any contributions and collaborations.
-Please check out [Contributing to vLLM-Omni](https://vllm-omni.readthedocs.io/en/latest/contributing/) for how to get involved.
-
-## Citation
-
-If you use vLLM-Omni for your research, please cite our [paper](https://arxiv.org/abs/2602.02204):
-
-```bibtex
-@article{yin2026vllmomni,
-  title={vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models},
-  author={Peiqi Yin and Jiangyun Zhu and Han Gao and Chenguang Zheng and Yongxiang Huang and Taichang Zhou and Ruirui Yang and Weizhi Liu and Weiqing Chen and Canlin Guo and Didan Deng and Zifeng Mo and Cong Wang and James Cheng and Roger Wang and Hongsheng Liu},
-  journal={arXiv preprint arXiv:2602.02204},
-  year={2026}
-}
-```
-
-## Join the Community
-Feel free to ask questions, provide feedback, and discuss with fellow users of vLLM-Omni in the `#sig-omni` Slack channel at [slack.vllm.ai](https://slack.vllm.ai) or on the vLLM user forum at [discuss.vllm.ai](https://discuss.vllm.ai).
-
-## Star History
-
-[](https://www.star-history.com/#vllm-project/vllm-omni&type=date&legend=top-left)
-
-## License
-
-Apache License 2.0, as found in the [LICENSE](./LICENSE) file.
diff --git a/apps/ComfyUI-vLLM-Omni/.gitignore b/apps/ComfyUI-vLLM-Omni/.gitignore
deleted file mode 100644
index 5704ad153cb..00000000000
--- a/apps/ComfyUI-vLLM-Omni/.gitignore
+++ /dev/null
@@ -1,115 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# OSX useful to ignore
-*.DS_Store
-.AppleDouble
-.LSOverride
-
-# Thumbnails
-._*
-
-# Files that might appear in the root of a volume
-.DocumentRevisions-V100
-.fseventsd
-.Spotlight-V100
-.TemporaryItems
-.Trashes
-.VolumeIcon.icns
-.com.apple.timemachine.donotpresent
-
-# Directories potentially created on remote AFP share
-.AppleDB
-.AppleDesktop
-Network Trash Folder
-Temporary Items
-.apdisk
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-env/
-venv/
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-*.egg-info/
-.installed.cfg
-*.egg
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*,cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-
-# Sphinx documentation
-docs/_build/
-
-# IntelliJ Idea
-.idea
-*.iml
-*.ipr
-*.iws
-
-# PyBuilder
-target/
-
-# Cookiecutter
-output/
-python_boilerplate/
-cookiecutter-pypackage-env/
-
-# vscode settings
-.history/
-*.code-workspace
-
-# Frontend extension
-node_modules/
-.env
-.env.local
-.env.development.local
-.env.test.local
-.env.production.local
-npm-debug.log*
-yarn-debug.log*
-yarn-error.log*
-node.zip
-.vscode/
-.claude/
-.codemate/
diff --git a/apps/ComfyUI-vLLM-Omni/LICENSE b/apps/ComfyUI-vLLM-Omni/LICENSE
deleted file mode 100644
index b3c346397d7..00000000000
--- a/apps/ComfyUI-vLLM-Omni/LICENSE
+++ /dev/null
@@ -1,15 +0,0 @@
-Apache Software License 2.0
-
-Copyright (c) 2026, Zeyu Huang
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
diff --git a/apps/ComfyUI-vLLM-Omni/README.md b/apps/ComfyUI-vLLM-Omni/README.md
deleted file mode 100644
index 54f2fdf2e40..00000000000
--- a/apps/ComfyUI-vLLM-Omni/README.md
+++ /dev/null
@@ -1,184 +0,0 @@
-# vLLM-Omni
-
-vLLM-Omni offers a ComfyUI integration on top of its online serving API.
-It can send model inference requests to either a locally running vLLM-Omni service or a remote one.
-
-## Requirement
-
-- Python 3.12 or above
-- [ComfyUI installed](https://docs.comfy.org/installation/system_requirements)
-- [vLLM-Omni installed](https://docs.vllm.ai/projects/vllm-omni/en/latest/getting_started/installation/) on either the same device or another device reachable over the network.
-- No additional packages are needed beyond those already required by ComfyUI.
-
-> [!TIP]
-> If you run both ComfyUI and vLLM-Omni on the same device, you can create separate virtual environments and use different Python versions for them.
-
-
-## Installation
-
-Copy this folder to the `custom_nodes` subfolder of your ComfyUI installation. Your directory should look like `ComfyUI/custom_nodes/ComfyUI-vLLM-Omni`.
-
-If ComfyUI is running while you copy the folder, restart ComfyUI to load this extension.
-
-> [!TIP]
-> You can use utility websites such as https://download-directory.github.io/ to download a subdirectory of a repo. Also check out community discussions (e.g., https://stackoverflow.com/questions/7106012/download-a-single-folder-or-directory-from-a-github-repository) for more info.
-
-On the device and in the virtual environment where you run ComfyUI, launch ComfyUI with
-```bash
-cd ComfyUI
-
-# The regular way
-python main.py
-
-# If you are mainly using this node, launch it faster with
-python main.py --cpu
-```
-
-On the device and in the virtual environment where you run vLLM-Omni, start a model service with
-```bash
-vllm serve The_Model_ID_to_Serve --omni --port 8000
-```
-
-Check **ComfyUI's sidebar -> Node Library**. There should be a new folder named **vLLM-Omni**.
-If not, check the shell running the ComfyUI process. There may be error messages before the line `Import times for custom nodes:` and the line `To see the GUI go to: http://127.0.0.1:8188`.
-
-## Quickstart
-
-This extension offers the following nodes based on the output modalities (at **ComfyUI sidebar -> Node Library**):
-
-- **Generate Image** for text-to-image and image-to-image tasks
-- **Generate Video** for text-to-video and image-to-video tasks
-- **Multimodality Understanding** for multimodality-to-text and multimodality-to-audio tasks
-- **TTS** and **TTS Voice Clone** for TTS tasks
-
-This extension also offers example workflows (at **ComfyUI sidebar -> Templates -> vLLM-Omni**).
-
-> [!NOTE]
-> The node UI and feature designs are intended to match vLLM-Omni online serving interfaces. It cannot offer more than what the interfaces support.
-
-To build a simple workflow yourself,
-
-- Drag a generation node onto the canvas.
-- Depending on your needs, grab built-in multimedia file loader nodes, such as **image->Load Image**, **image->video->Load Video**, **audio->Load Audio**.
-- Depending on your needs, grab built-in multimedia file preview nodes, such as **image->Preview Image**, **image->video->Save Video**, **audio->Preview Audio**, **utils->Preview as Text**.
-- If you want to tune sampling parameters, grab the corresponding nodes from **vLLM-Omni->Sampling Params**.
-  - For multi-stage models, you can connect multiple **AR Sampling Params** and **Diffusion Sampling Params** nodes to a **Multi-Stage Sampling Params List** node, and connect this node to the generation node.
-  - For some multi-stage models like BAGEL, [only one stage's sampling parameters are exposed and tunable via vLLM-Omni's online serving API](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/bagel/). Thus, these models are treated as single-stage ones. Please check the vLLM-Omni documentation on how to correctly set each model's sampling parameters.
-  - For multi-stage models whose stages are all autoregressive or all diffusion, you can also connect a single Sampling Params node; that one set of sampling parameters is then used for every stage.
-
-## Screenshots and Examples
-
-### Multimodal understanding (e.g., Qwen Omni series, BAGEL)
-
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Multimodal Understanding**)
-
-
-
-
-
-
-
-
-> [!TIP]
-> Although this node accepts inputs of every modality, check whether the specific model you serve supports the modalities you connect to the node.
-
-You can configure per-stage sampling parameters for multi-stage models.
-
-
-
-
-
-
-
-
-### Text-to-image and image-to-image generation (e.g., Z-Image-Turbo, Qwen-Image-Edit, BAGEL)
-
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Image Generation**)
-
-
-
-
-
-
-
-
-> [!TIP]
-> The node automatically chooses the text-to-image or image-to-image API endpoint depending on whether you connect an image input.
-
-### Text-to-video and image-to-video generation (e.g., Wan)
-
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Video Generation**)
-
-
-
-
-
-
-
-
-> [!TIP]
-> The node automatically chooses the text-to-video or image-to-video API endpoint depending on whether you connect an image input.
-
-### TTS (e.g., Qwen TTS series)
-
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni TTS**)
-
-
-
-
-
-
-
-
-> [!TIP]
-> There is a dedicated node for VoiceClone tasks with reference audio input. Other simple text-to-speech tasks should use the regular TTS node.
-
-### Chaining multiple model services
-
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Chaining Services**)
-
-
-
-
-
-
-
-
-## Develop
-
-Follow the [development convention and rules of vLLM-Omni](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/).
-
-## Limitation and Non-Goals
-
-- Single server mode only. No automatic load balancing or failover.
-- The feature set is bounded by vLLM-Omni's online serving capability, including
-  - The types of models supported in online mode,
-  - The types of sampling parameters supported in online mode,
-  - The ways to send files (primarily through full-length base64 in the JSON payload),
-  - Detecting errors in the payload (such as fields unsupported by a specific model) when the endpoint does not explicitly return an error,
-  - (The lack of) Authentication
-  - (The lack of) Progress indicator
-
-## Support
-
-If you are new to ComfyUI, please check out [its documentation](https://docs.comfy.org/) for usage instructions.
-
-If you are new to vLLM-Omni, please also check out [its documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/) for usage instructions.
-
-Whenever you find an issue or problem, please
-
-- First find out if this is an upstream limitation of vLLM-Omni's online serving mode, by [checking their documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/examples/).
-- [Open an issue](https://github.com/vllm-project/vllm-omni/issues) that clearly describes this ComfyUI or online service problem.
-
-## Acknowledgements
-
-Features
-
-- https://github.com/dougbtv/comfyui-vllm-omni/ The official reference implementation for ComfyUI integration with vLLM-Omni's DALL-E compatible image generation API.
-- https://github.com/Comfy-Org/ComfyUI/tree/master/comfy_extras ComfyUI's built-in node implementations.
-
-UI/UX design references
-
-- https://github.com/sgl-project/sglang/pull/15271 SGLang Diffusion's official ComfyUI integration for image and video generation.
-- https://github.com/SXQBW/ComfyUI-Qwen-Omni A third party ComfyUI integration for Qwen Omni series.
-- https://github.com/flybirdxx/ComfyUI-Qwen-TTS https://github.com/DarioFT/ComfyUI-Qwen3-TTS Third party ComfyUI integrations for Qwen TTS series.
diff --git a/apps/ComfyUI-vLLM-Omni/__init__.py b/apps/ComfyUI-vLLM-Omni/__init__.py
deleted file mode 100644
index e89b2ca1eb5..00000000000
--- a/apps/ComfyUI-vLLM-Omni/__init__.py
+++ /dev/null
@@ -1,61 +0,0 @@
-"""Top-level package for comfyui_vllm_omni.""" # noqa: N999 # This is not a python library intended to be imported
-
-__all__ = [
-    "NODE_CLASS_MAPPINGS",
-    "NODE_DISPLAY_NAME_MAPPINGS",
-    "WEB_DIRECTORY",
-]
-
-__author__ = """vLLM-Omni Team"""
-__email__ = "vllm-omni@vllm.ai"
-__version__ = "0.0.1"
-
-from .comfyui_vllm_omni.nodes import (
-    VLLMOmniARSampling,
-    VLLMOmniDiffusionSampling,
-    VLLMOmniGenerateImage,
-    VLLMOmniGenerateVideo,
-    VLLMOmniQwenTTSParams,
-    VLLMOmniRemoteLoRA,
-    VLLMOmniSamplingParamsList,
-    VLLMOmniTTS,
-    VLLMOmniUnderstanding,
-    VLLMOmniVoiceClone,
-    VLLMOmniWanParams,
-)
-
-# A dictionary that contains all nodes you want to export with their names
-NODE_CLASS_MAPPINGS = {
-    # === Generation ===
-    "VLLMOmniGenerateImage": VLLMOmniGenerateImage,
-    "VLLMOmniGenerateVideo": VLLMOmniGenerateVideo,
-    "VLLMOmniUnderstanding": VLLMOmniUnderstanding,
-    "VLLMOmniTTS": VLLMOmniTTS,
-    "VLLMOmniVoiceClone": VLLMOmniVoiceClone,
-    # === Params ===
-    "VLLMOmniARSampling": VLLMOmniARSampling,
-    "VLLMOmniDiffusionSampling": VLLMOmniDiffusionSampling,
-    "VLLMOmniSamplingParamsList": VLLMOmniSamplingParamsList,
-    "VLLMOmniRemoteLoRA": VLLMOmniRemoteLoRA,
-    "VLLMOmniQwenTTSParams": VLLMOmniQwenTTSParams,
-    "VLLMOmniWanParams": VLLMOmniWanParams,
-}
-
-# A dictionary that contains the friendly/humanly readable titles for the nodes
-NODE_DISPLAY_NAME_MAPPINGS = {
-    # === Generation ===
-    "VLLMOmniGenerateImage": "Generate Image",
-    "VLLMOmniGenerateVideo": "Generate Video",
-    "VLLMOmniUnderstanding": "Multimodality Understanding",
-    "VLLMOmniTTS": "TTS (Text to Speech)",
-    "VLLMOmniVoiceClone": "TTS Voice Cloning",
-    # === Params ===
-    "VLLMOmniARSampling": "AR Sampling Params",
-    "VLLMOmniDiffusionSampling": "Diffusion Sampling Params",
-    "VLLMOmniSamplingParamsList": "Multi-Stage Sampling Params List",
-    "VLLMOmniRemoteLoRA": "LoRA",
-    "VLLMOmniQwenTTSParams": "Qwen TTS Params",
-    "VLLMOmniWanParams": "Wan Video Params",
-}
-
-WEB_DIRECTORY = "./web"
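Reviewer note on the `__init__.py` removed above: it follows the usual ComfyUI custom-node convention, where the package exports `NODE_CLASS_MAPPINGS` (internal name to class) and `NODE_DISPLAY_NAME_MAPPINGS` (internal name to UI title), and each node class declares `INPUT_TYPES`, `RETURN_TYPES`, and `FUNCTION`. The sketch below shows that shape with a made-up `EchoText` node and a hypothetical `check_mappings` helper; neither exists in this repository or in ComfyUI itself, and the checks cover only what a quick consistency pass would need.

```python
class EchoText:
    """Hypothetical node that returns its input string unchanged."""

    CATEGORY = "examples"
    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("text",)
    FUNCTION = "run"  # name of the method ComfyUI calls to execute the node

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"multiline": True})}}

    def run(self, text: str) -> tuple[str]:
        # Node outputs are returned as a tuple matching RETURN_TYPES.
        return (text,)


NODE_CLASS_MAPPINGS = {"EchoText": EchoText}
NODE_DISPLAY_NAME_MAPPINGS = {"EchoText": "Echo Text"}


def check_mappings(class_map: dict, name_map: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for key in name_map:
        if key not in class_map:
            problems.append(f"display name given for unknown node '{key}'")
    for key, node_cls in class_map.items():
        if not callable(getattr(node_cls, "INPUT_TYPES", None)):
            problems.append(f"node '{key}' has no INPUT_TYPES classmethod")
        entry_point = getattr(node_cls, "FUNCTION", None)
        if not entry_point or not callable(getattr(node_cls, entry_point, None)):
            problems.append(f"node '{key}' has no callable entry point")
    return problems


problems = check_mappings(NODE_CLASS_MAPPINGS, NODE_DISPLAY_NAME_MAPPINGS)
```

A check like this catches a display name left behind after a node class is renamed, which the two hand-maintained dictionaries above would otherwise allow silently.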
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/__init__.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/__init__.py
deleted file mode 100644
index ebc9c5a59ea..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-# noqa: N999 # This is not a python library intended to be imported
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py
deleted file mode 100644
index bfea939982c..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py
+++ /dev/null
@@ -1,736 +0,0 @@
-from typing import Literal
-
-import torch
-from comfy_api.input import AudioInput, VideoInput
-
-from .utils.api_client import VLLMOmniClient
-from .utils.logger import get_logger
-from .utils.models import lookup_model_spec
-from .utils.types import (
-    AudioFormat,
-    AutoregressionSamplingParams,
-    DiffusionSamplingParams,
-    QwenTTSModelSpecificParams,
-    WanModelSpecificParams,
-)
-from .utils.validators import (
-    add_sampling_parameters_to_stage,
-    validate_model_and_sampling_params_types,
-)
-
-logger = get_logger(__name__)
-
-
-class _VLLMOmniGenerateBase:
-    """Base class for vLLM-Omni generation nodes with shared functionality."""
-
-    CATEGORY = "vLLM-Omni"
-
-    @classmethod
-    def VALIDATE_INPUTS(cls, url, model) -> str | Literal[True]:
-        """
-        Can only validate this model's own input. Cannot check inputs from other nodes.
-        See: https://docs.comfy.org/custom-nodes/backend/server_overview#validate_inputs
-        """
-        if not url:
-            return "URL must not be empty"
-        if not model:
-            return "Model must not be empty"
-        return True
-
-
-class VLLMOmniGenerateImage(_VLLMOmniGenerateBase):
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "url": ("STRING", {"default": "http://localhost:8000/v1"}),
-                "model": ("STRING", {"default": "Tongyi-MAI/Z-Image-Turbo"}),
-                "prompt": ("STRING", {"multiline": True}),
-                "negative_prompt": ("STRING", {"multiline": True, "default": ""}),
-                "width": ("INT", {"default": 512, "min": 64, "max": 2048}),
-                "height": ("INT", {"default": 512, "min": 64, "max": 2048}),
-            },
-            "optional": {
-                "image": ("IMAGE",),
-                "mask": ("MASK",),
-                # "video": ("VIDEO",),
-                # "audio": ("AUDIO",),
-                "sampling_params": ("SAMPLING_PARAMS",),
-                "lora": ("REMOTE_LORA",),
-            },
-        }
-
-    RETURN_TYPES = ("IMAGE",)
-    RETURN_NAMES = ("image",)
-    FUNCTION = "generate"
-
-    async def generate(
-        self,
-        url: str,
-        model: str,
-        prompt: str,
-        width: int,
-        height: int,
-        negative_prompt: str | None = None,
-        image: torch.Tensor | None = None,
-        mask: torch.Tensor | None = None,
-        audio: AudioInput | None = None,  # Hidden & unused
-        video: VideoInput | None = None,  # Hidden & unused
-        sampling_params: dict | list[dict] | None = None,
-        lora: dict | None = None,
-        **kwargs,
-    ):
-        if kwargs:
-            logger.info("Uncaught kwargs: %s", kwargs)
-        logger.debug("Got sampling params: %s", sampling_params)
-        validate_model_and_sampling_params_types(model, sampling_params)
-        if image is None and mask is not None:
-            raise ValueError("Mask input provided without an image input.")
-
-        client = VLLMOmniClient(url)
-
-        spec, pattern = lookup_model_spec(model)
-        is_bagel = pattern is not None and "bagel" in pattern.lower()
-
-        # Prefer DALL-E compatible API for simple (one-stage) diffusion models
-        if (spec is None or spec["stages"] == ["diffusion"]) and not is_bagel:
-            # The number of sampling parameter groups should have been validated.
-            # Now, simply convert single-item list to dict.
-            if isinstance(sampling_params, list):
-                sampling_params = sampling_params[0]
-            if audio is None and image is None and video is None:
-                # No multimodal input --- use DALL-E image generation
-                logger.info("Using DALL-E image generation endpoint")
-                output = await client.generate_image(
-                    model=model,
-                    prompt=prompt,
-                    width=width,
-                    height=height,
-                    negative_prompt=negative_prompt,
-                    sampling_params=sampling_params,
-                    lora=lora,
-                )
-                return (output,)
-            elif image is not None and audio is None and video is None:
-                # Image and text input --- use DALL-E image edit
-                logger.info("Using DALL-E image edit endpoint")
-                output = await client.edit_image(
-                    model=model,
-                    prompt=prompt,
-                    image=image,
-                    width=width,
-                    height=height,
-                    negative_prompt=negative_prompt,
-                    mask=mask,
-                    sampling_params=sampling_params,
-                    lora=lora,
-                )
-                return (output,)
-
-        logger.info("Using chat completion endpoint")
-        sampling_params = add_sampling_parameters_to_stage(
-            model, sampling_params, "diffusion", width=width, height=height
-        )
-        logger.debug("Edited sampling params: %s", sampling_params)
-
-        output = await client.generate_image_chat_completion(
-            model=model,
-            prompt=prompt,
-            negative_prompt=negative_prompt,
-            image=image,
-            audio=audio,
-            video=video,
-            sampling_params=sampling_params,
-            lora=lora,
-        )
-
-        return (output,)
-
-
-class VLLMOmniGenerateVideo(_VLLMOmniGenerateBase):
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "url": ("STRING", {"default": "http://localhost:8000/v1"}),
-                "model": ("STRING", {"default": "Wan-AI/Wan2.2-T2V-A14B-Diffusers"}),
-                "prompt": ("STRING", {"multiline": True}),
-                "negative_prompt": ("STRING", {"multiline": True, "default": ""}),
-                "width": ("INT", {"default": 832, "min": 1}),
-                "height": ("INT", {"default": 480, "min": 1}),
-                "fps": ("INT", {"default": 16, "min": 1}),
-                "num_frames": ("INT", {"default": 41, "min": 1}),
-            },
-            "optional": {
-                "image": ("IMAGE",),
-                "sampling_params": ("SAMPLING_PARAMS",),
-                "lora": ("REMOTE_LORA",),
-                "model_params": ("VIDEO_PARAMS",),
-            },
-        }
-
-    RETURN_TYPES = ("VIDEO",)
-    RETURN_NAMES = ("video",)
-    FUNCTION = "generate"
-
-    async def generate(
-        self,
-        url: str,
-        model: str,
-        prompt: str,
-        width: int,
-        height: int,
-        fps: int,
-        num_frames: int,
-        negative_prompt: str | None = None,
-        image: torch.Tensor | None = None,
-        sampling_params: dict | list[dict] | None = None,
-        model_params: dict | None = None,
-        lora: dict | None = None,
-        **kwargs,
-    ):
-        if kwargs:
-            logger.info("Uncaught kwargs: %s", kwargs)
-        logger.debug("Got sampling params: %s", sampling_params)
-        logger.debug("Got model params: %s", model_params)
-        validate_model_and_sampling_params_types(model, sampling_params)
-
-        # Currently, all video generation models are single-stage diffusion models
-        if isinstance(sampling_params, list):
-            if len(sampling_params) != 1:
-                raise ValueError(
-                    "Video generation expects a single sampling params group. "
-                    "Please provide one Diffusion sampling node."
-                )
-            sampling_params = sampling_params[0]
-
-        if sampling_params is not None:
-            sampling_params.pop("type", None)  # internal fields
-        if model_params is not None:
-            model_params.pop("type", None)  # internal fields
-
-        client = VLLMOmniClient(url)
-        output = await client.generate_video(
-            model=model,
-            prompt=prompt,
-            image=image,  # image present => i2v, absent => t2v
-            width=width,
-            height=height,
-            num_frames=num_frames,
-            fps=fps,
-            negative_prompt=negative_prompt,
-            sampling_params=sampling_params,
-            lora=lora,
-            model_params=model_params,
-        )
-        return (output,)
-
-
-class VLLMOmniUnderstanding(_VLLMOmniGenerateBase):
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "url": ("STRING", {"default": "http://localhost:8000/v1"}),
-                "model": ("STRING", {"default": "Qwen/Qwen2.5-Omni-7B"}),
-                "prompt": ("STRING", {"multiline": True}),
-                "output_text": ("BOOLEAN", {"default": True}),
-                "output_audio": ("BOOLEAN", {"default": True}),
-                "use_audio_in_video": ("BOOLEAN", {"default": True}),
-            },
-            "optional": {
-                "image": ("IMAGE",),
-                "video": ("VIDEO",),
-                "audio": ("AUDIO",),
-                "sampling_params": ("SAMPLING_PARAMS",),
-            },
-        }
-
-    RETURN_TYPES = ("STRING", "AUDIO")
-    RETURN_NAMES = ("text_response", "audio_response")
-    FUNCTION = "generate"
-
-    @classmethod
-    def VALIDATE_INPUTS(cls, url, model, output_text, output_audio) -> str | Literal[True]:  # type: ignore[reportIncompatibleMethodOverride]
-        super_validation = super().VALIDATE_INPUTS(url, model)
-        if isinstance(super_validation, str):
-            return super_validation
-        if not output_text and not output_audio:
-            return "At least one of output_text or output_audio must be True."
-        return True
-
-    async def generate(
-        self,
-        url: str,
-        model: str,
-        prompt: str,
-        image: torch.Tensor | None = None,
-        audio: AudioInput | None = None,
-        video: VideoInput | None = None,
-        sampling_params: dict | list[dict] | None = None,
-        output_text: bool = True,
-        output_audio: bool = True,
-        use_audio_in_video: bool = True,
-        **kwargs,
-    ) -> tuple[str, AudioInput]:
-        if kwargs:
-            logger.info("Uncaught kwargs: %s", kwargs)
-        logger.debug("Got sampling params: %s", sampling_params)
-        validate_model_and_sampling_params_types(model, sampling_params)
-
-        client = VLLMOmniClient(url)
-        spec, pattern = lookup_model_spec(model)
-        is_bagel = pattern is not None and "bagel" in pattern.lower()
-
-        if is_bagel:
-            # A lot of special handlings here...
-            if output_audio:
-                raise ValueError("BAGEL models do not support audio output.")
-            if audio is not None or video is not None:
-                raise ValueError("BAGEL models do not support audio or video input.")
-            (
-                text_response,
-                _,
-            ) = await client.generate_understanding_chat_completion(
-                model=model,
-                prompt=prompt,
-                image=image,
-                audio=None,
-                video=None,
-                sampling_params=sampling_params,
-                modalities=["text"],
-            )
-        else:
-            modalities = []
-            if output_text:
-                modalities.append("text")
-            if output_audio:
-                modalities.append("audio")
-
-            if use_audio_in_video and video is not None:
-                use_audio_in_video = True
-            else:
-                use_audio_in_video = False
-
-            (
-                text_response,
-                audio,
-            ) = await client.generate_understanding_chat_completion(
-                model=model,
-                prompt=prompt,
-                image=image,
-                audio=audio,
-                video=video,
-                sampling_params=sampling_params,
-                modalities=modalities,
-                # == extra kwargs ==
-                mm_processor_kwargs={"use_audio_in_video": use_audio_in_video},
-            )
-
-        if text_response is None:
-            text_response = ""
-        if audio is None:
-            channels = 1
-            duration = 1
-            sample_rate = 44100
-            num_samples = int(round(duration * sample_rate))
-            waveform = torch.zeros((1, channels, num_samples), dtype=torch.float32)
-            audio = {"waveform": waveform, "sample_rate": sample_rate}
-
-        return (text_response, audio)
-
-
-class VLLMOmniTTS(_VLLMOmniGenerateBase):
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "url": ("STRING", {"default": "http://localhost:8000/v1"}),
-                "model": (
-                    "STRING",
-                    {"default": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"},
-                ),
-                "input": ("STRING", {"multiline": True}),
-                "voice": ("STRING", {"default": "Vivian"}),
-                "response_format": (["mp3", "opus", "aac", "flac", "wav", "pcm"],),
-                "speed": (
-                    "FLOAT",
-                    {"default": 1.0, "min": 0.25, "max": 4.0, "step": 0.01},
-                ),
-            },
-            "optional": {
-                "model_specific_params": ("TTS_PARAMS",),
-            },
-        }
-
-    RETURN_TYPES = ("AUDIO",)
-    RETURN_NAMES = ("audio",)
-    FUNCTION = "generate"
-
-    async def generate(
-        self,
-        url: str,
-        model: str,
-        input: str,
-        voice: str,
-        response_format: AudioFormat,
-        speed: float,
-        model_specific_params: dict | None = None,  # optional input: absent when not connected
-        **kwargs,
-    ) -> tuple[AudioInput]:
-        logger.info("Got extra kwargs in TTS: %s", kwargs)
-
-        is_qwen_tts = "qwen3-tts" in model.lower()
-        if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
-            raise ValueError(
-                "You have provided Qwen-specific TTS params. "
-                "However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
-            )
-
-        combined_params = {**kwargs, **(model_specific_params or {})}
-
-        client = VLLMOmniClient(url)
-
-        audio = await client.generate_speech(
-            model=model,
-            input=input,
-            voice=voice,
-            response_format=response_format,
-            speed=speed,
-            **combined_params,
-        )
-        return (audio,)
-
-
-class VLLMOmniVoiceClone(_VLLMOmniGenerateBase):
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "url": ("STRING", {"default": "http://localhost:8000/v1"}),
-                "model": ("STRING", {"default": "Qwen/Qwen3-TTS-12Hz-1.7B-Base"}),
-                "input": ("STRING", {"multiline": True}),
-                "voice": ("STRING", {"default": "Vivian"}),
-                "response_format": (["mp3", "opus", "aac", "flac", "wav", "pcm"],),
-                "speed": (
-                    "FLOAT",
-                    {"default": 1.0, "min": 0.25, "max": 4.0, "step": 0.01},
-                ),
-                "ref_audio": ("AUDIO",),
-                "ref_text": ("STRING", {"multiline": True}),
-                "x_vector_only_mode": ("BOOLEAN", {"default": False}),
-            },
-            "optional": {
-                "model_specific_params": ("TTS_PARAMS",),
-            },
-        }
-
-    RETURN_TYPES = ("AUDIO",)
-    RETURN_NAMES = ("audio",)
-    FUNCTION = "generate"
-
-    async def generate(
-        self,
-        url: str,
-        model: str,
-        input: str,
-        voice: str,
-        response_format: AudioFormat,
-        speed: float,
-        ref_audio: AudioInput,
-        ref_text: str,
-        x_vector_only_mode: bool,
-        model_specific_params: dict | None = None,  # optional input: absent when not connected
-        **kwargs,
-    ):
-        is_qwen_tts = "qwen3-tts" in model.lower()
-        if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
-            raise ValueError(
-                "You have provided Qwen-specific TTS params. "
-                "However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
-            )
-
-        combined_params = {
-            "ref_audio": ref_audio,
-            "ref_text": ref_text,
-            "x_vector_only_mode": x_vector_only_mode,
-            **kwargs,
-            **(model_specific_params or {}),
-        }
-
-        client = VLLMOmniClient(url)
-
-        audio = await client.generate_speech(
-            model=model,
-            input=input,
-            voice=voice,
-            response_format=response_format,
-            speed=speed,
-            **combined_params,
-        )
-        return (audio,)
-
-
-class VLLMOmniARSampling:
-    @classmethod
-    def INPUT_TYPES(cls):
-        return {
-            "required": {
-                "max_tokens": ("INT", {"default": 100, "min": 1, "max": 10000}),
-                "temperature": (
-                    "FLOAT",
-                    {"default": 1.0, "min": 0.0, "max": 2.0, "step": 0.01},
-                ),
-                "top_p": (
-                    "FLOAT",
-                    {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.01},
-                ),
-                "repetition_penalty": (
-                    "FLOAT",
-                    {"default": 1.0, "min": 0.0, "max": 5.0, "step": 0.01},
-                ),
-                # === Put seed at last. ===
-                # Whenever a field named "seed" is present, ComfyUI adds another field called "control after generate"
-                "seed": (
-                    "INT",
-                    {
-                        "default": -1,
-                        "min": -1,
-                        "step": 1,
-                        "tooltip": "-1 means to not provide a seed.",
-                    },
-                ),
-            }
-        }
-
-    RETURN_TYPES = ("SAMPLING_PARAMS",)
-    RETURN_NAMES = ("AR sampling params",)
-    FUNCTION = "get_params"
-    CATEGORY = "vLLM-Omni/Sampling Params"
-
-    def get_params(self, seed, **kwargs):
-        params = AutoregressionSamplingParams(kwargs)
-        if seed >= 0:
-            params["seed"] = seed
-        return (params,)
-
-
-class VLLMOmniDiffusionSampling:
- @classmethod
- def INPUT_TYPES(cls):
- return {
- "required": {
- "n": (
- "INT",
- {
- "default": 1,
- "min": 0,
- "max": 10,
- "step": 1,
- "tooltip": "Number of images to generate",
- },
- ),
- "num_inference_steps": (
- "INT",
- {
- "default": 50,
- "min": 1,
- "max": 1000,
- "tooltip": "Number of denoising steps (higher = better quality, slower).",
- },
- ),
- "guidance_scale": (
- "FLOAT",
- {
- "default": 7.5,
- "min": 0.0,
- "max": 20.0,
- "step": 0.1,
- "tooltip": "Classifier-free guidance scale (higher = more prompt adherence).",
- },
- ),
- "true_cfg_scale": (
- "FLOAT",
- {
- "default": 1.0,
- "min": 0.0,
- "max": 20.0,
- "step": 0.5,
- "tooltip": "True CFG scale for advanced control (model-specific).",
- },
- ),
- "vae_use_slicing": (
- "BOOLEAN",
- {
- "default": False,
- "tooltip": "Enable VAE slicing for reduced memory usage (slight quality trade-off)",
- },
- ),
- "vae_use_tiling": (
- "BOOLEAN",
- {
- "default": False,
- "tooltip": "Enable VAE tiling for reduced memory usage (slight quality trade-off)",
- },
- ),
- # === Keep seed last. ===
- # Whenever a field named "seed" is present, ComfyUI adds another field called "control after generate"
- "seed": (
- "INT",
- {
- "default": -1,
- "min": -1,
- "step": 1,
- "tooltip": "-1 means to not provide a seed.",
- },
- ),
- }
- }
-
- RETURN_TYPES = ("SAMPLING_PARAMS",)
- RETURN_NAMES = ("diffusion sampling params",)
- FUNCTION = "get_params"
- CATEGORY = "vLLM-Omni/Sampling Params"
-
- def get_params(self, seed, **kwargs):
- params = DiffusionSamplingParams(kwargs)
- if seed >= 0:
- params["seed"] = seed
- return (params,)
-
-
-class VLLMOmniSamplingParamsList:
- @classmethod
- def INPUT_TYPES(cls):
- return {
- "required": {
- "param1": ("SAMPLING_PARAMS",),
- },
- "optional": {
- "param2": ("SAMPLING_PARAMS",),
- "param3": ("SAMPLING_PARAMS",),
- },
- }
-
- RETURN_TYPES = ("SAMPLING_PARAMS",)
- RETURN_NAMES = ("param list",)
- FUNCTION = "aggregate"
- CATEGORY = "vLLM-Omni/Sampling Params"
-
- def aggregate(self, param1: dict, param2: dict | None = None, param3: dict | None = None):
- for i, p in enumerate((param1, param2, param3)):
- if isinstance(p, list):
- raise ValueError(
- f"Input param{i + 1} is a Multi-Stage Sampling Params List. "
- "Expected a single sampling parameters node (either AR or Diffusion)."
- )
-
- params = [param1]
- if param2 is not None:
- params.append(param2)
- if param3 is not None:
- params.append(param3)
- return (params,)
-
-
-class VLLMOmniRemoteLoRA:
- @classmethod
- def INPUT_TYPES(cls):
- return {
- "required": {
- "local_path": ("STRING", {"default": ""}),
- "name": ("STRING", {"default": ""}),
- "scale": (
- "FLOAT",
- {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.1},
- ),
- "int_id": (
- "INT",
- {
- "default": 0,
- "min": 0,
- "step": 1,
- "tooltip": "0 means it is not set and the server can derive it.",
- },
- ),
- }
- }
-
- RETURN_TYPES = ("REMOTE_LORA",)
- RETURN_NAMES = ("lora",)
- FUNCTION = "get_lora"
- CATEGORY = "vLLM-Omni"
-
- @classmethod
- def VALIDATE_INPUTS(cls, local_path, name) -> str | Literal[True]:
- if not local_path.strip() or not name.strip():
- return "Both local_path and name must be provided."
- return True
-
- def get_lora(self, local_path: str, name: str, scale: float, int_id: int):
- local_path = local_path.strip()
- name = name.strip()
- lora = {
- "local_path": local_path or None,
- "name": name or None,
- "scale": float(scale),
- "int_id": int(int_id) if int_id > 0 else None,
- }
- return (lora,)
-
-
-class VLLMOmniQwenTTSParams:
- @classmethod
- def INPUT_TYPES(cls):
- return {
- "required": {
- "task_type": (
- ["CustomVoice", "VoiceDesign", "Base"],
- {"default": "CustomVoice"},
- ),
- "language": (
- ["Auto", "Chinese", "English", "Japanese", "Korean"],
- {"default": "Auto"},
- ),
- "instructions": ("STRING", {"multiline": True}),
- "max_new_tokens": ("INT", {"default": 2048, "min": 1}),
- }
- }
-
- RETURN_TYPES = ("TTS_PARAMS",)
- RETURN_NAMES = ("Qwen TTS params",)
- FUNCTION = "get_params"
- CATEGORY = "vLLM-Omni/TTS Params"
-
- def get_params(self, **kwargs):
- return (QwenTTSModelSpecificParams(kwargs),)
-
-
-class VLLMOmniWanParams:
- @classmethod
- def INPUT_TYPES(cls):
- return {
- "required": {
- "guidance_scale_2": (
- "FLOAT",
- {"default": 4.0, "min": 0.0, "max": 20.0, "step": 0.1},
- ),
- "boundary_ratio": (
- "FLOAT",
- {"default": 0.875, "min": 0.0, "max": 1.0, "step": 0.001},
- ),
- "flow_shift": (
- "FLOAT",
- {"default": 5.0, "min": 0.0, "max": 100.0, "step": 0.1},
- ),
- }
- }
-
- RETURN_TYPES = ("VIDEO_PARAMS",)
- RETURN_NAMES = ("Wan video params",)
- FUNCTION = "get_params"
- CATEGORY = "vLLM-Omni/Video Params"
-
- def get_params(self, **kwargs):
- return (WanModelSpecificParams(kwargs),)
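Note on the sampling-parameter nodes removed above: the following is a minimal standalone sketch of their behavior, not the original implementation. It assumes `AutoregressionSamplingParams` and `DiffusionSamplingParams` are dict-like, since the diff builds them from `kwargs` and indexes them by key. The helper names `build_params` and `aggregate_stages` are hypothetical, and plain dicts stand in for the real classes. Unlike the original `aggregate`, this version accepts any number of stages.

```python
from typing import Any, Optional


def build_params(seed: int, **kwargs: Any) -> dict[str, Any]:
    """Copy the widget values and add a seed only when it is non-negative.

    A seed of -1 is the sentinel for "do not send a seed", as in the tooltip.
    """
    params = dict(kwargs)
    if seed >= 0:
        params["seed"] = seed
    return params


def aggregate_stages(*stages: Optional[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect per-stage params into one list, skipping unconnected inputs.

    An input that is already a list is rejected, because nesting one
    params list inside another is not a valid multi-stage configuration.
    """
    for position, stage in enumerate(stages, start=1):
        if isinstance(stage, list):
            raise ValueError(f"Input param{position} is already a params list")
    return [stage for stage in stages if stage is not None]
```

With these stand-ins, `build_params(-1, temperature=0.7)` yields a dict without a `seed` key, while `build_params(0, temperature=0.7)` includes `seed` set to 0.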
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/api_client.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/api_client.py
deleted file mode 100644
index 8600fe39355..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/api_client.py
+++ /dev/null
@@ -1,585 +0,0 @@
-"""
-A high-level API client adapter that forwards ComfyUI inputs to vLLM-Omni's REST API,
-and transforms the API responses back to ComfyUI formats.
-
-The image generation part is derived from dougbtv/comfyui-vllm-omni by Doug (@dougbtv).
-Original source at https://github.com/dougbtv/comfyui-vllm-omni, distributed under the MIT License.
-"""
-
-import asyncio
-import json
-from typing import Any
-
-import aiohttp
-import av.error
-import torch
-from comfy_api.input import AudioInput, VideoInput
-
-from .format import (
- audio_to_base64,
- base64_to_audio,
- base64_to_image_tensor,
- bytes_to_audio,
- bytes_to_video,
- image_tensor_to_base64,
- image_tensor_to_png_bytes,
- video_to_base64,
-)
-from .logger import get_logger, pretty_printer
-from .models import lookup_model_spec
-from .types import AudioFormat
-
-logger = get_logger(__name__)
-
-
-async def url_json(session: aiohttp.ClientSession, url: str, verb: str = "get", **kwargs) -> dict[str, Any]:
- try:
- async with getattr(session, verb)(url, **kwargs) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
- try:
- return await response.json()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response from vLLM-Omni: {e}")
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {url}: {e}")
-
-
-async def url_bytes(session: aiohttp.ClientSession, url: str, verb: str = "get", **kwargs) -> bytes:
- try:
- async with getattr(session, verb)(url, **kwargs) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
- return await response.read()
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {url}: {e}")
-
-
-class VLLMOmniClient:
- def __init__(
- self, base_url: str, timeout: float | None = None, poll_interval: float = 5.0, max_poll_duration: float = 60 * 5
- ):
- self.base_url = base_url
- self.timeout = aiohttp.ClientTimeout(total=timeout)
- self.poll_interval = poll_interval
- self.max_poll_duration = max_poll_duration
-
- async def generate_image(
- self,
- *,
- model: str,
- prompt: str,
- width: int,
- height: int,
- negative_prompt: str | None = None,
- sampling_params: dict | None = None,
- lora: dict | None = None,
- ) -> torch.Tensor:
- """Run text-to-image generation via the OpenAI DALL-E-style images API"""
- await self._check_model_exist(model)
-
- size = f"{width}x{height}"
- payload: dict[str, Any] = {
- "model": model,
- "prompt": prompt,
- "size": size,
- "response_format": "b64_json",
- }
- if negative_prompt:
- payload["negative_prompt"] = negative_prompt
- if sampling_params is not None:
- payload.update(sampling_params)
- if lora is not None:
- payload["lora"] = lora
- logger.debug("img gen payload: %s", payload)
-
- url = self.base_url + "/images/generations"
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- try:
- async with session.post(
- url,
- json=payload,
- headers={"Content-Type": "application/json"},
- ) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
-
- try:
- data = await response.json()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response from vLLM-Omni: {e}")
- if "data" not in data:
- raise RuntimeError("API response missing 'data' field - expected OpenAI DALL-E format")
- if not data["data"]:
- raise RuntimeError("API returned empty data array")
-
- image_tensors = []
- for idx, img in enumerate(data["data"]):
- if "b64_json" not in img:
- raise RuntimeError(f"API returned image #{idx} without 'b64_json' field")
- base64_str = img["b64_json"]
- tensor = base64_to_image_tensor(base64_str)
- image_tensors.append(tensor)
- logger.debug("Image #%d has shape %s", idx, tensor.shape)
-
- batch_tensor = torch.stack(image_tensors, dim=0)
- logger.debug("batch_tensor output has shape: %s", batch_tensor.shape)
- return batch_tensor
-
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {url}: {e}")
-
- async def edit_image(
- self,
- *,
- model: str,
- prompt: str,
- image: torch.Tensor,
- width: int,
- height: int,
- negative_prompt: str | None = None,
- mask: torch.Tensor | None = None,
- sampling_params: dict | None = None,
- lora: dict | None = None,
- ) -> torch.Tensor:
- """Run image editing via the OpenAI DALL-E-style images API"""
- await self._check_model_exist(model)
-
- size = f"{width}x{height}"
- image_filename = "image.png" # Required for multipart form
- form = aiohttp.FormData()
- form.add_field("model", model)
- form.add_field(
- "image",
- image_tensor_to_png_bytes(image, image_filename),
- filename=image_filename,
- content_type="image/png",
- )
- form.add_field("prompt", prompt)
- form.add_field("size", size)
- if negative_prompt:
- form.add_field("negative_prompt", negative_prompt)
- if sampling_params is not None:
- for k, v in sampling_params.items():
- form.add_field(k, str(v))
- if lora is not None:
- form.add_field("lora", json.dumps(lora, ensure_ascii=False))
- if mask is not None:
- mask_filename = "mask.png"
- form.add_field(
- "mask",
- image_tensor_to_png_bytes(mask, mask_filename),
- filename=mask_filename,
- content_type="image/png",
- )
-
- url = self.base_url + "/images/edits"
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- try:
- async with session.post(url, data=form) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
-
- try:
- data = await response.json()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response from vLLM-Omni: {e}")
-
- if "data" not in data:
- raise RuntimeError("API response missing 'data' field - expected OpenAI DALL-E format")
- if not data["data"]:
- raise RuntimeError("API returned empty data array")
-
- image_tensors = []
- for idx, img in enumerate(data["data"]):
- if "b64_json" not in img:
- raise RuntimeError(f"API returned image #{idx} without 'b64_json' field")
- base64_str = img["b64_json"]
- tensor = base64_to_image_tensor(base64_str)
- image_tensors.append(tensor)
-
- return torch.stack(image_tensors, dim=0)
-
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {url}: {e}")
-
- async def generate_image_chat_completion(
- self,
- *,
- model: str,
- prompt: str,
- negative_prompt: str | None = None,
- image: torch.Tensor | None = None,
- audio: AudioInput | None = None,
- video: VideoInput | None = None,
- sampling_params: dict | list[dict] | None = None,
- lora: dict | None = None,
- ) -> torch.Tensor:
- payload = VLLMOmniClient._prepare_chat_completion_messages(
- model=model,
- prompt=prompt,
- negative_prompt=negative_prompt,
- image=image,
- audio=audio,
- video=video,
- sampling_params=sampling_params,
- modalities=["image"],
- # === below are additional `extra_body` fields, handled by **kwargs ===
- lora=lora,
- )
- choices = await self._generate_base_chat_completion(model, payload)
-
- image_tensors = []
- for idx, img_content in enumerate(choices[0]["message"]["content"]):
- base64_str = img_content.get("image_url", {}).get("url", "")
- if not base64_str:
- raise RuntimeError(f"API returned image #{idx} without image url")
- tensor = base64_to_image_tensor(base64_str)
- image_tensors.append(tensor)
-
- return torch.stack(image_tensors, dim=0)
-
- async def generate_video(
- self,
- *,
- model: str,
- prompt: str,
- width: int,
- height: int,
- num_frames: int,
- fps: int,
- negative_prompt: str | None = None,
- image: torch.Tensor | None = None,
- sampling_params: dict | None = None,
- model_params: dict | None = None,
- lora: dict | None = None,
- **extra_body,
- ) -> VideoInput:
- form = aiohttp.FormData()
- form.add_field("model", model)
- form.add_field("prompt", prompt)
- form.add_field("width", str(width))
- form.add_field("height", str(height))
- form.add_field("num_frames", str(num_frames))
- form.add_field("fps", str(fps))
- if negative_prompt:
- form.add_field("negative_prompt", negative_prompt)
- if sampling_params is not None:
- for k, v in sampling_params.items():
- form.add_field(k, str(v))
- if model_params is not None:
- for k, v in model_params.items():
- form.add_field(k, str(v))
- if lora is not None:
- form.add_field("lora", json.dumps(lora, ensure_ascii=False))
- if extra_body:
- form.add_field("extra_body", json.dumps(extra_body, ensure_ascii=False))
-
- if image is not None:
- image_filename = "image.png" # Required for multipart form
- form.add_field(
- "input_reference",
- image_tensor_to_png_bytes(image, image_filename),
- filename=image_filename,
- content_type="image/png",
- )
-
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- # Start the video generation job
- url = f"{self.base_url}/videos"
- data = await url_json(session, url, "post", data=form)
- if (job_id := data.get("id", None)) is None:
- raise RuntimeError("API response missing job 'id' field - expected OpenAI compliant format")
- if (job_status := data.get("status", None)) is None:
- raise RuntimeError("API response missing job 'status' field - expected OpenAI compliant format")
-
- # Poll for video generation job completion
- deadline = asyncio.get_running_loop().time() + self.max_poll_duration
- url = f"{self.base_url}/videos/{job_id}"
- while job_status not in {"completed", "failed"}:
- await asyncio.sleep(self.poll_interval)
-
- data = await url_json(session, url)
- if (job_status := data.get("status", None)) is None:
- raise RuntimeError("API response missing job 'status' field - expected OpenAI compliant format")
- if asyncio.get_running_loop().time() >= deadline:
- raise RuntimeError(f"Timed out waiting for video job {job_id} to complete")
-
- if job_status == "failed":
- raise RuntimeError(f"Video job failed: {data}")
-
- # Retrieve completed content
- video_bytes = await url_bytes(session, f"{url}/content")
-
- # Decode video and make a best effort at cleaning up server resources
- try:
- return bytes_to_video(video_bytes)
- finally:
- try:
- await url_json(session, url, "delete")
- except Exception as exc:
- logger.warning("Failed to clean up video job %s: %s", job_id, exc)
-
- async def generate_understanding_chat_completion(
- self,
- *,
- model: str,
- prompt: str,
- image: torch.Tensor | None = None,
- audio: AudioInput | None = None,
- video: VideoInput | None = None,
- sampling_params: dict | list[dict] | None = None,
- modalities: list[str] = ["text", "audio"],
- **extra_body,
- ) -> tuple[str | None, AudioInput | None]:
- # Response may contain two choices: one with text, one with audio
- payload = VLLMOmniClient._prepare_chat_completion_messages(
- model=model,
- prompt=prompt,
- negative_prompt=None,
- image=image,
- audio=audio,
- video=video,
- sampling_params=sampling_params,
- modalities=modalities,
- **extra_body,
- )
-
- choices = await self._generate_base_chat_completion(model, payload)
- text_response = None
- audio_base64 = None
- for choice in choices:
- try:
- text_response = choice["message"]["content"]
- except (KeyError, TypeError):
- # Either this case (text response) or the audio response case will be hit. Both being None is checked later.
- pass
- try:
- audio_base64 = choice["message"]["audio"]["data"]
- except (KeyError, TypeError):
- # Either this case (audio response) or the text response case will be hit. Both being None is checked later.
- pass
- if audio_base64 is None and text_response is None:
- raise RuntimeError(
- "API response missing both 'message.audio' and 'message.content' fields. "
- f"The choices object is {choices}"
- )
- if audio_base64 is not None:
- audio = base64_to_audio(audio_base64)
- logger.debug(
- "audio sample rate %d, audio shape %s, duration in seconds %f",
- audio["sample_rate"],
- audio["waveform"].shape,
- audio["waveform"].shape[2] / audio["sample_rate"],
- )
- else:
- audio = None
- return text_response, audio
-
- async def generate_speech(
- self,
- *,
- model: str,
- input: str,
- voice: str,
- response_format: AudioFormat,
- speed: float,
- **extra_params,
- ) -> AudioInput:
- await self._check_model_exist(model)
-
- ref_audio: AudioInput | None = extra_params.pop("ref_audio", None)
-
- payload = {
- "model": model,
- "input": input,
- "voice": voice,
- "response_format": response_format,
- "speed": speed,
- **extra_params,
- }
-
- if ref_audio is not None:
- audio_base64 = audio_to_base64(ref_audio)
- payload["ref_audio"] = audio_base64
-
- logger.debug("Omni TTS payload: %s", pretty_printer.pformat(payload))
-
- url = self.base_url + "/audio/speech"
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- try:
- async with session.post(
- url,
- json=payload,
- headers={"Content-Type": "application/json"},
- ) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
-
- try:
- audio_bytes = await response.read()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response from vLLM-Omni: {e}")
-
- try:
- audio = bytes_to_audio(audio_bytes)
- except av.error.InvalidDataError as e:
- raise ValueError(
- f"Invalid audio data received from vLLM-Omni: {e}. "
- "Check whether you passed unsupported arguments (such as 'voice')."
- )
- return audio
-
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {url}: {e}")
-
- async def _generate_base_chat_completion(self, model: str, payload: dict[str, Any]) -> list[dict[str, Any]]:
- logger.debug("Omni payload: %s", pretty_printer.pformat(payload))
- await self._check_model_exist(model)
-
- url = self.base_url + "/chat/completions"
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- try:
- async with session.post(
- url,
- json=payload,
- headers={"Content-Type": "application/json"},
- ) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status}: {error_text}"
- )
-
- try:
- data = await response.json()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response from vLLM-Omni: {e}")
-
- logger.debug(
- "chat completion response: %s",
- pretty_printer.pformat(data),
- )
-
- try:
- return data["choices"]
- except (KeyError, TypeError):
- raise RuntimeError("Invalid JSON response from vLLM-Omni: missing 'choices' field")
-
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {self.base_url}: {e}")
-
- async def _check_model_exist(self, model: str):
- url = self.base_url + "/models"
- async with aiohttp.ClientSession(timeout=self.timeout) as session:
- try:
- async with session.get(
- url,
- headers={"Content-Type": "application/json"},
- ) as response:
- if not response.ok:
- error_text = await response.text()
- raise (ValueError if response.status < 500 else RuntimeError)(
- f"vLLM-Omni API returned status {response.status} "
- f"when getting hosted model list: {error_text}"
- )
-
- try:
- data = await response.json()
- except aiohttp.ContentTypeError as e:
- raise RuntimeError(f"Invalid JSON response when getting hosted model list from vLLM-Omni: {e}")
-
- except aiohttp.ClientError as e:
- raise RuntimeError(f"Network error connecting to vLLM-Omni at {self.base_url}: {e}")
- try:
- model_list = data["data"]
- model_found = next((True for m in model_list if m["id"] == model), False)
- except (KeyError, TypeError):
- raise RuntimeError(f"Invalid JSON response of the hosted model list: {data}")
-
- if not model_found:
- raise ValueError(f"Model {model} not served at {self.base_url}.")
-
- @staticmethod
- def _prepare_chat_completion_messages(
- *,
- model: str,
- prompt: str,
- negative_prompt: str | None,
- image: torch.Tensor | None = None,
- audio: AudioInput | None = None,
- video: VideoInput | None = None,
- sampling_params: dict | list[dict] | None = None,
- modalities: list[str] | None = None,  # diffusion models don't have this field
- **extra_body,
- ):
- message_content: list[dict] = [{"type": "text", "text": prompt}]
- if image is not None:
- message_content.append(
- {
- "type": "image_url",
- "image_url": {"url": image_tensor_to_base64(image)},
- }
- )
- if audio is not None:
- message_content.append({"type": "audio_url", "audio_url": {"url": audio_to_base64(audio)}})
- if video is not None:
- message_content.append({"type": "video_url", "video_url": {"url": video_to_base64(video)}})
- messages = [{"role": "user", "content": message_content}]
-
- payload: dict[str, Any] = {"messages": messages, "model": model}
- if modalities:
- payload["modalities"] = modalities
-
- combined_extra_body: dict[str, Any] = {}
- if sampling_params is not None:
- spec, _ = lookup_model_spec(model)
- is_single_sampling_param = isinstance(sampling_params, dict) or len(sampling_params) == 1
-
- if (spec is None and is_single_sampling_param) or (spec is not None and spec["stages"] == ["diffusion"]):
- # Diffusion format: extra_body directly contains sampling params.
- # Validation has already taken care of matching sampling params' types and length. Safe to take [0].
- # * Use this mode if the model is a simple one-stage diffusion model.
- # * Fallback to this mode if model is not registered and a single sampling param is provided.
- sampling_params = sampling_params if isinstance(sampling_params, dict) else sampling_params[0]
- combined_extra_body: dict[str, Any] = sampling_params.copy()
- if "n" in combined_extra_body:
- combined_extra_body["num_outputs_per_prompt"] = combined_extra_body.pop("n")
- else:
- # AR format: the payload has a sampling_params_list field, containing a list.
- sampling_params_list = sampling_params if isinstance(sampling_params, list) else [sampling_params]
- payload["sampling_params_list"] = sampling_params_list
-
- if negative_prompt:
- combined_extra_body["negative_prompt"] = negative_prompt
-
- if extra_body:
- combined_extra_body.update(extra_body)
-
- # Add extra_body only if it has any content.
- if combined_extra_body:
- payload["extra_body"] = combined_extra_body
-
- # Place to inject any model-specific payload adjustment
- spec, _ = lookup_model_spec(model)
- if spec:
- preprocessor = spec.get("payload_preprocessor", None)
- if preprocessor is not None:
- payload = preprocessor(payload)
-
- return payload
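Note on `_prepare_chat_completion_messages` above: the branch that decides where sampling parameters go can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the removed code. The function name `route_sampling_params` is hypothetical, the `stages` argument stands in for `spec["stages"]` as returned by `lookup_model_spec` (with `None` meaning the model is not registered), and the stage name `"ar"` used in the usage note is assumed.

```python
from typing import Any, Optional, Union

SamplingParams = Union[dict[str, Any], list[dict[str, Any]]]


def route_sampling_params(
    sampling_params: SamplingParams,
    stages: Optional[list[str]],
) -> dict[str, Any]:
    """Return the payload fragment that carries the sampling parameters."""
    payload: dict[str, Any] = {}
    is_single = isinstance(sampling_params, dict) or len(sampling_params) == 1

    if (stages is None and is_single) or stages == ["diffusion"]:
        # Diffusion format: the parameters go directly into extra_body,
        # and "n" is renamed to "num_outputs_per_prompt".
        single = sampling_params if isinstance(sampling_params, dict) else sampling_params[0]
        extra_body = dict(single)
        if "n" in extra_body:
            extra_body["num_outputs_per_prompt"] = extra_body.pop("n")
        if extra_body:
            payload["extra_body"] = extra_body
    else:
        # AR / multi-stage format: one entry per stage in sampling_params_list.
        as_list = sampling_params if isinstance(sampling_params, list) else [sampling_params]
        payload["sampling_params_list"] = as_list
    return payload
```

Under these assumptions, a single dict for an unregistered model lands in `extra_body`, while the same dict for a model whose stages are `["ar"]` is wrapped into a one-element `sampling_params_list`.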
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/format.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/format.py
deleted file mode 100644
index 42d396f0694..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/format.py
+++ /dev/null
@@ -1,304 +0,0 @@
-"""Image/tensor format helpers.
-
-The image generation part is derived from dougbtv/comfyui-vllm-omni by Doug (@dougbtv).
-Original source at https://github.com/dougbtv/comfyui-vllm-omni, distributed under the MIT License.
-"""
-
-import base64
-import mimetypes
-from fractions import Fraction
-from io import BytesIO
-
-import av
-import numpy as np
-import torch
-from av.audio.frame import AudioFrame
-from av.audio.resampler import AudioResampler
-from av.video.frame import VideoFrame
-from comfy_api.input import AudioInput, VideoInput
-from comfy_api.latest import InputImpl, Types
-from comfy_extras import nodes_audio
-from PIL import Image
-
-from .logger import get_logger
-
-logger = get_logger(__name__)
-
-
-def base64_to_image_tensor(base64_str: str, mode: str = "RGB") -> torch.Tensor:
- """
- Convert base64-encoded image to ComfyUI image tensor.
-
- Args:
- base64_str: Base64-encoded image string
- mode: PIL image mode (default "RGB", which drops any alpha channel)
-
- Returns:
- torch.Tensor with shape (H, W, C) in float32 [0, 1] range
-
- Raises:
- ValueError: If the base64 string is invalid (RuntimeError if the image cannot be decoded)
- """
- if base64_str.startswith("data:image"):
- _, base64_str = base64_str.split(",", 1)
-
- try:
- # Decode base64 to bytes
- image_bytes = base64.b64decode(base64_str)
- except Exception as e:
- raise ValueError(f"Invalid base64 string: {e}")
-
- # Create BytesIO object for PIL
- image_bytesio = BytesIO(image_bytes)
-
- # Open with PIL and convert to desired mode
- try:
- pil_image = Image.open(image_bytesio)
- pil_image = pil_image.convert(mode)
- except Exception as e:
- raise RuntimeError(f"Failed to open image: {e}")
-
- image_array = np.asarray(pil_image).astype(np.float32) / 255.0
- image_tensor = torch.from_numpy(image_array)
- return image_tensor
-
-
-def image_tensor_to_png_bytes(tensor: torch.Tensor, filename: str = "image.png") -> BytesIO:
- """
- Convert ComfyUI image tensor to PNG BytesIO for multipart upload.
-
- This function converts a ComfyUI IMAGE tensor to a PNG-encoded BytesIO object
- suitable for multipart/form-data upload. The BytesIO object has its .name
- attribute set, which is required by aiohttp for file uploads.
-
- Args:
- tensor: ComfyUI IMAGE tensor with shape (B, H, W, C), dtype float32, range [0, 1]
- filename: Name attribute to set on BytesIO (default: "image.png")
-
- Returns:
- BytesIO object containing PNG-encoded image with .name attribute set
-
- Raises:
- ValueError: If the tensor is not 4D
- """
- if tensor.ndim != 4:
- raise ValueError(f"Expected 4D tensor with shape (B, H, W, C), got {tensor.ndim}D tensor")
-
- image_tensor = tensor[0] # Shape: (H, W, C)
- image_np = (image_tensor.cpu().numpy() * 255.0).astype(np.uint8)
- pil_image = Image.fromarray(image_np)
-
- # Save to BytesIO as image file
- img_bytes = BytesIO()
- # Set name attribute (required for multipart upload and mimetype detection)
- img_bytes.name = filename
- try:
- pil_image.save(img_bytes)
- except Exception as e:
- raise RuntimeError(f"Failed to save image as file: {e}")
-
- # Reset position to beginning
- img_bytes.seek(0)
-
- return img_bytes
-
-
-def image_tensor_to_base64(tensor: torch.Tensor, filename: str = "image.png") -> str:
- """
- Convert ComfyUI image tensor to base64-encoded image string.
-
- Args:
- tensor: ComfyUI IMAGE tensor with shape (B, H, W, C), dtype float32, range [0, 1]
- filename: Name attribute to set on BytesIO (default: "image.png"); its extension
- selects the image file format and the MIME type of the returned data URL
-
- Returns:
- Data URL ("data:<mime>;base64,...") containing the base64-encoded image
-
- Raises:
- ValueError: If the tensor is not 4D
- """
- img_bytes = image_tensor_to_png_bytes(tensor, filename)
- img_bytes.seek(0)
- byte_data = img_bytes.read()
- base64_str = base64.b64encode(byte_data).decode("utf-8")
- mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
- return f"data:{mime_type};base64,{base64_str}"
-
-
-def video_to_bytes(video: VideoInput, filename: str = "video.mp4") -> BytesIO:
- output_buffer = BytesIO()
- output_buffer.name = filename
- video.save_to(output_buffer)
- output_buffer.seek(0)
- return output_buffer
-
-
-def video_to_base64(video: VideoInput, filename: str = "video.mp4") -> str:
- video_buffer = video_to_bytes(video, filename)
- video_buffer.seek(0)
- byte_data = video_buffer.read()
- base64_str = base64.b64encode(byte_data).decode("utf-8")
- mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
- return f"data:{mime_type};base64,{base64_str}"
-
-
-def bytes_to_video(video_bytes: bytes) -> VideoInput:
- video_buffer = BytesIO(video_bytes)
-
- try:
- with av.open(video_buffer, mode="r") as container:
- video_stream = next((s for s in container.streams if s.type == "video"), None)
- if video_stream is None:
- raise ValueError("No video stream found in decoded payload.")
-
- frames = []
- for frame in container.decode(video_stream):
- if not isinstance(frame, VideoFrame):
- continue
- image = frame.to_ndarray(format="rgb24")
- frames.append(torch.from_numpy(image).float() / 255.0)
-
- if len(frames) == 0:
- raise ValueError("No video frames found in decoded payload.")
-
- images = torch.stack(frames, dim=0)
- frame_rate = Fraction(video_stream.average_rate) if video_stream.average_rate else Fraction(1)
-
- audio: AudioInput | None = None
- if len(container.streams.audio):
- audio_stream = container.streams.audio[-1]
- audio_frames = []
- resampler = AudioResampler(format="fltp")
- for frame in container.decode(audio_stream):
- if not isinstance(frame, AudioFrame):
- continue
- resampled = resampler.resample(frame)
- if not isinstance(resampled, list):
- resampled = [resampled]
- for audio_frame in resampled:
- if audio_frame is not None:
- audio_frames.append(audio_frame.to_ndarray())
-
- if len(audio_frames) > 0:
- audio_data = np.concatenate(audio_frames, axis=1)
- sample_rate = int(audio_stream.sample_rate) if audio_stream.sample_rate else 1
- audio = {
- "waveform": torch.from_numpy(audio_data).unsqueeze(0),
- "sample_rate": sample_rate,
- }
-
- components = Types.VideoComponents(
- images=images,
- frame_rate=frame_rate,
- audio=audio,
- metadata=container.metadata if container.metadata else None,
- )
- except Exception as e:
- raise RuntimeError(f"Failed to decode video: {e}")
-
- return InputImpl.VideoFromComponents(components)
-
-
-def base64_to_video(base64_str: str) -> VideoInput:
- if base64_str.startswith("data:video"):
- _, base64_str = base64_str.split(",", 1)
-
- try:
- video_bytes = base64.b64decode(base64_str)
- except Exception as e:
- raise ValueError(f"Invalid base64 string: {e}")
-
- return bytes_to_video(video_bytes)
-
-
-def audio_to_bytes(audio: AudioInput, filename: str = "audio.mp3", quality: str = "128k") -> BytesIO:
- waveform = audio["waveform"][0] # Shape: (C, T)
- sample_rate = audio["sample_rate"]
- format = filename.rsplit(".", maxsplit=1)[1]
- layout = "mono" if waveform.shape[0] == 1 else "stereo"
-
- output_buffer = BytesIO()
- output_buffer.name = filename
- output_container = av.open(output_buffer, mode="w", format=format)
- if format == "opus":
- out_stream = output_container.add_stream("libopus", rate=sample_rate, layout=layout)
- if quality == "64k":
- out_stream.bit_rate = 64000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "96k":
- out_stream.bit_rate = 96000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "128k":
- out_stream.bit_rate = 128000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "192k":
- out_stream.bit_rate = 192000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "320k":
- out_stream.bit_rate = 320000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif format == "mp3":
- out_stream = output_container.add_stream("libmp3lame", rate=sample_rate, layout=layout)
- if quality == "V0":
- out_stream.codec_context.qscale = 1 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "128k":
- out_stream.bit_rate = 128000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- elif quality == "320k":
- out_stream.bit_rate = 320000 # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- else: # format == "flac":
- out_stream = output_container.add_stream("flac", rate=sample_rate, layout=layout)
-
- frame = av.AudioFrame.from_ndarray(
- waveform.movedim(0, 1).reshape(1, -1).float().numpy(),
- format="flt",
- layout=layout,
- )
- frame.sample_rate = sample_rate
- frame.pts = 0
- output_container.mux(out_stream.encode(frame)) # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- # Flush encoder
- output_container.mux(out_stream.encode(None)) # type: ignore # copy from ComfyUI comfy_api/latest/_ui.py
- output_container.close()
- output_buffer.seek(0)
-
- return output_buffer
-
-
-def audio_to_base64(audio: AudioInput, filename: str = "audio.mp3", quality: str = "128k") -> str:
- audio_buffer = audio_to_bytes(audio, filename, quality)
- audio_buffer.seek(0)
- byte_data = audio_buffer.read()
- base64_str = base64.b64encode(byte_data).decode("utf-8")
- mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
- return f"data:{mime_type};base64,{base64_str}"
-
-
-def bytes_to_audio(audio_bytes: bytes) -> AudioInput:
- """
- Convert audio bytes to ComfyUI audio tensor.
-
- Args:
- audio_bytes: Audio file bytes
- Returns:
- Dict with "waveform" (torch.Tensor with shape (B, C, T) in float32 range [-1, 1]) and "sample_rate"
- """
- audio_buffer = BytesIO(audio_bytes)
- waveform, sample_rate = nodes_audio.load(audio_buffer) # type: ignore # Although expect string argument, it calls av.open underneath, which supports BytesIO (file-like)
- return {"waveform": waveform.unsqueeze(0), "sample_rate": sample_rate}
-
-
-def base64_to_audio(base64_str: str) -> AudioInput:
- """
- Convert base64-encoded audio to ComfyUI audio tensor.
-
- Args:
- base64_str: Base64-encoded audio string
- Returns:
- Dict with "waveform" (torch.Tensor with shape (B, C, T) in float32 range [-1, 1]) and "sample_rate"
- """
- if base64_str.startswith("data:audio"):
- _, base64_str = base64_str.split(",", 1)
-
- try:
- # Decode base64 to bytes
- audio_bytes = base64.b64decode(base64_str)
- except Exception as e:
- raise ValueError(f"Invalid base64 string: {e}") from e
-
- return bytes_to_audio(audio_bytes)
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/logger.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/logger.py
deleted file mode 100644
index cad640f4d7e..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/logger.py
+++ /dev/null
@@ -1,125 +0,0 @@
-"""Centralized logger configuration for vLLM-Omni ComfyUI."""
-
-import logging
-import pprint
-import sys
-from typing import Any
-
-
-def get_logger(name: str) -> logging.Logger:
- """
- Get or create a logger with proper formatting.
-
- Args:
- name: Logger name (typically __name__ of the calling module)
-
- Returns:
- Configured logger instance
- """
- logger = logging.getLogger(name)
-
- # Only configure if not already configured
- if not logger.handlers:
- logger.setLevel(logging.DEBUG)
-
- # Create console handler
- handler = logging.StreamHandler(sys.stdout)
- handler.setLevel(logging.INFO)
-
- # Create formatter
- formatter = logging.Formatter(
- fmt="(ComfyUI-vLLM-Omni) [%(levelname)s] %(asctime)s [%(filename)s:%(lineno)s] %(message)s",
- datefmt="%Y-%m-%d %H:%M:%S",
- )
- handler.setFormatter(formatter)
-
- # Add handler to logger
- logger.addHandler(handler)
-
- # Prevent propagation to root logger
- logger.propagate = False
-
- return logger
-
-
-class OmitBase64PrettyPrinter(pprint.PrettyPrinter):
- """
- A PrettyPrinter that redacts specific field names with '...'
- wherever they appear in nested structures.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
-
- def _format(self, obj: Any, stream, indent: int, allowance: int, context, level: int) -> None:
- # Check if this is a dict with redacted keys
- if isinstance(obj, dict):
- # Create a copy with redacted values
- display_obj = {}
- for key, value in obj.items():
- if key in ("data", "url") and isinstance(value, str):
- if value.startswith("data:"):
- base64_header = value.split(",", 1)[0]
- display_obj[key] = f"{base64_header},***"
- elif value.startswith("http://") or value.startswith("https://"):
- display_obj[key] = value
- elif len(value) > 10:
- display_obj[key] = f"{value[:10]}***"
- else:
- display_obj[key] = value
- else:
- display_obj[key] = value
- obj = display_obj
-
- # Handle list/tuple/set containing dicts that might have redacted keys
- # (pprint will recursively call _format on nested items, so this
- # handles arbitrary nesting automatically)
-
- super()._format(obj, stream, indent, allowance, context, level)
-
-
-pretty_printer = OmitBase64PrettyPrinter()
-
-
-# ========== PPrint EXAMPLES ==========
-
-if __name__ == "__main__":
- data = {
- "messages": [
- {
- "role": "system",
- "content": (
- "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
- "capable of perceiving auditory and visual inputs, as well as generating text and speech."
- ),
- },
- {
- "role": "user",
- "content": [
- {
- "type": "text",
- "text": "What sound is it, and what is the drawing about?",
- },
- {
- "type": "text",
- "text": "What sound is it, and what is the drawing about?",
- },
- {
- "type": "image_url",
- "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAwAAAA"},
- },
- {
- "type": "audio_url",
- "audio_url": {"url": "data:audio/mpeg;base64,SUQzBAAAAAAAIlRTU0UAAAAOAAADT="},
- },
- ],
- },
- ],
- "extra_body": {"mm_processor_kwargs": {"use_audio_in_video": False}},
- "modalities": ["text"],
- }
-
- # Pretty-print the payload with base64 data URLs truncated
-
- print("\nOmitBase64PrettyPrinter (truncates base64 payloads):")
- pretty_printer.pprint(data)
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/models.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/models.py
deleted file mode 100644
index bfeddd82b87..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/models.py
+++ /dev/null
@@ -1,118 +0,0 @@
-import re
-
-from .types import Modality, ModelMode, Spec
-
-
-def _bagel_payload_preprocessor(payload: dict) -> dict:
- try:
- for message in payload["messages"]:
- for content in message["content"]:
- if content["type"] == "text":
- content["text"] = "<|im_start|>" + content["text"] + "<|im_end|>"
- except (KeyError, TypeError):
- raise RuntimeError("Internal Error: malformed BAGEL payload")
- return payload
-
-
-def _qwen25_payload_preprocessor(payload: dict) -> dict:
- if payload["messages"][0]["role"] != "system":
- payload["messages"] = [
- {
- "role": "system",
- "content": (
- "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
- "capable of perceiving auditory and visual inputs, as well as generating text and speech."
- ),
- },
- *payload["messages"],
- ]
- return payload
-
-
-_MODEL_PIPELINE_SPECS: dict[str, Spec] = {
- r"BAGEL-7B-MoT": {
- "stages": [
- "diffusion" # The vLLM-Omni interface treats it as a single-stage diffusion model
- ],
- "modes": [
- {
- "mode": ModelMode.UNDERSTANDING,
- "input_modalities": [Modality.TEXT, Modality.IMAGE],
- }
- ],
- "payload_preprocessor": _bagel_payload_preprocessor,
- },
- r"Qwen2\.5-Omni": {
- "stages": ["autoregression", "autoregression", "autoregression"],
- "payload_preprocessor": _qwen25_payload_preprocessor,
- "modes": [
- {
- "mode": ModelMode.UNDERSTANDING,
- "input_modalities": [
- Modality.TEXT,
- Modality.IMAGE,
- Modality.VIDEO,
- Modality.AUDIO,
- ],
- }
- ],
- },
- r"Qwen3-Omni": {
- "stages": ["autoregression", "autoregression", "autoregression"],
- "modes": [
- {
- "mode": ModelMode.UNDERSTANDING,
- "input_modalities": [
- Modality.TEXT,
- Modality.IMAGE,
- Modality.VIDEO,
- Modality.AUDIO,
- ],
- }
- ],
- },
-}
-# Convert dict keys to regex patterns
-MODEL_PIPELINE_SPECS: dict[re.Pattern, Spec] = {}
-for k, v in _MODEL_PIPELINE_SPECS.items():
- MODEL_PIPELINE_SPECS[re.compile(k)] = v
-del _MODEL_PIPELINE_SPECS
-
-
-def lookup_model_spec(model: str) -> tuple[Spec | None, str | None]:
- try:
- last_component = model.rstrip("/").rsplit("/", 1)[-1]
- except IndexError:
- last_component = model
- for pattern, spec in MODEL_PIPELINE_SPECS.items():
- if pattern.search(last_component):
- return spec, pattern.pattern
- return None, None
-
-
-# ============== DEMONSTRATION ==============
-
-if __name__ == "__main__":
- test_paths = [
- "Qwen/Qwen2.5-Omni-7B",
- "MyModels/Qwen2.5-Omni-3B",
- "/root/home/Qwen2.5-Omni-7B",
- "Qwen/Qwen3-Omni",
- "Qwen/Qwen3-Omni-30B-A3B-Instruct",
- "Custom/Path/UnknownModel-Instruct",
- "Not/Matching/Anything",
- ]
-
- test_payload = {"messages": [{"role": "user", "content": "prompt"}]}
-
- print("Testing registry lookups:\n")
- for path in test_paths:
- spec, _ = lookup_model_spec(path)
- if spec:
- if preprocessor := spec.get("payload_preprocessor"):
- result = preprocessor(test_payload)
- print(f"✓ {path:<40} → {result}")
- else:
- print(f"✓ {path:<40} → No preprocessor")
- else:
- print(f"✗ {path:<40} → No match")
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/types.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/types.py
deleted file mode 100644
index c7d254eb9ea..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/types.py
+++ /dev/null
@@ -1,55 +0,0 @@
-from collections.abc import Callable
-from enum import Enum, auto
-from typing import (
- Any,
- Literal,
- NotRequired,
- TypeAlias,
- TypedDict,
-)
-
-AudioFormat: TypeAlias = Literal["mp3", "opus", "aac", "flac", "wav", "pcm"]
-
-
-class AutoregressionSamplingParams(dict):
- pass
-
-
-class DiffusionSamplingParams(dict):
- pass
-
-
-class QwenTTSModelSpecificParams(dict):
- pass
-
-
-class WanModelSpecificParams(dict):
- pass
-
-
-class ModelMode(Enum):
- IMAGE_GENERATION = auto()
- VIDEO_GENERATION = auto()
- AUDIO_GENERATION = auto()
- UNDERSTANDING = auto()
-
-
-class Modality(Enum):
- TEXT = auto() # maybe not useful. Prompt is always required
- IMAGE = auto()
- VIDEO = auto()
- AUDIO = auto()
-
-
-class ModelModeSpec(TypedDict):
- mode: ModelMode
- input_modalities: list[Modality]
-
-
-PayloadPreprocessor: TypeAlias = Callable[[dict[str, Any]], dict[str, Any]]
-
-
-class Spec(TypedDict):
- stages: list[Literal["diffusion", "autoregression"]]
- modes: list[ModelModeSpec]
- payload_preprocessor: NotRequired[PayloadPreprocessor]
diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/validators.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/validators.py
deleted file mode 100644
index f607f2cb81a..00000000000
--- a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/utils/validators.py
+++ /dev/null
@@ -1,93 +0,0 @@
-from .logger import get_logger
-from .models import lookup_model_spec
-from .types import AutoregressionSamplingParams, DiffusionSamplingParams
-
-logger = get_logger(__name__)
-
-
-def validate_model_and_sampling_params_types(
- model_name: str,
- sampling_param_list: dict | list[dict] | None = None,
-):
- # Check if model name exists
- if not model_name:
- raise ValueError("Model name must not be empty.")
-
- # Skip if no spec or no sampling params
- pipeline_spec, _ = lookup_model_spec(model_name)
- if pipeline_spec is None:
- logger.info(f"skipping sampling params check because spec for {model_name} is not found")
- return
- if sampling_param_list is None:
- return
-
- # Check the number of stages and their data types
- stages = pipeline_spec["stages"]
- if isinstance(sampling_param_list, list):
- # Check that the lengths match
- if len(stages) != len(sampling_param_list):
- raise ValueError(
- f"Sampling parameter list length {len(sampling_param_list)} does not match "
- f"number of stages {len(stages)} for model {model_name}."
- )
- # Check that each stage's type matches
- for i, sp in enumerate(sampling_param_list):
- if not _check_sampling_param_matches_stage(sp, stages[i]):
- raise ValueError(
- f"Sampling parameter type ({sp.__class__.__name__}) does not match "
- f"stage type ({stages[i]}) at index {i} for model {model_name}."
- )
- elif isinstance(sampling_param_list, dict):
- # Check that the provided single sampling param matches all stages
- for i, stage in enumerate(stages):
- if not _check_sampling_param_matches_stage(sampling_param_list, stage):
- raise ValueError(
- f"Provided single sampling parameter type ({sampling_param_list.__class__.__name__}) must match "
- f"the types of all stages of the model. "
- f"However, stage {i} of model {model_name} is of type {stage}."
- )
-
-
-def add_sampling_parameters_to_stage(
- model_name: str,
- sampling_param_list: dict | list[dict] | None,
- stage_type: str,
- /,
- **params_to_add,
-) -> dict | list[dict]:
- """
- Given a model's name and the sampling parameter list to query this model,
- add arbitrary additional parameters to the sampling parameters of all stages of the given type.
- """
- pipeline_spec, _ = lookup_model_spec(model_name)
- if not pipeline_spec:
- logger.warning(
- f"The model {model_name} is not in our list, so we cannot ensure that "
- f"the fields ({tuple(params_to_add.keys())}) are added to the correct stage's sampling params. "
- f"Falling back to a heuristic."
- )
- pipeline_spec = {"stages": ["diffusion"]}
-
- stages = pipeline_spec["stages"]
- if isinstance(sampling_param_list, dict):
- sampling_param_list = sampling_param_list.__class__(sampling_param_list)
- sampling_param_list.update(params_to_add)
- elif sampling_param_list is None:
- sampling_param_list = params_to_add.copy()
- else:
- for i, stage in enumerate(stages):
- if stage == stage_type:
- stage_param = sampling_param_list[i]
- stage_param = stage_param.__class__(stage_param)
- stage_param.update(params_to_add)
- sampling_param_list[i] = stage_param
-
- return sampling_param_list
-
-
-def _check_sampling_param_matches_stage(sampling_param: dict, stage_type: str) -> bool:
- if stage_type == "autoregression":
- return isinstance(sampling_param, AutoregressionSamplingParams)
- if stage_type == "diffusion":
- return isinstance(sampling_param, DiffusionSamplingParams)
- raise RuntimeError(f"Internal error: unknown stage type {stage_type}.")
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg
deleted file mode 100644
index 20d9d077938..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg
deleted file mode 100644
index 5aec26e8cb6..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg
deleted file mode 100644
index 54eb91b9e03..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg
deleted file mode 100644
index f446de2a004..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg
deleted file mode 100644
index f2f6c8de175..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-video-generation.jpg b/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-video-generation.jpg
deleted file mode 100644
index a449a4eee70..00000000000
Binary files a/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-video-generation.jpg and /dev/null differ
diff --git a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Chaining Services.json b/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Chaining Services.json
deleted file mode 100644
index 3031f83444e..00000000000
--- a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Chaining Services.json
+++ /dev/null
@@ -1,552 +0,0 @@
-{
- "id": "6643e5fd-fa2a-4f25-935a-483173a8097c",
- "revision": 0,
- "last_node_id": 8,
- "last_link_id": 6,
- "nodes": [
- {
- "id": 1,
- "type": "PreviewImage",
- "pos": [
- 1446.2005205859512,
- -316.7359686049902
- ],
- "size": [
- 305.6628685610174,
- 307.0336884172169
- ],
- "flags": {},
- "order": 5,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "images",
- "name": "images",
- "type": "IMAGE",
- "link": 1
- }
- ],
- "outputs": [],
- "properties": {
- "Node name for S&R": "PreviewImage"
- },
- "widgets_values": []
- },
- {
- "id": 8,
- "type": "PreviewImage",
- "pos": [
- 1010.1848219273943,
- -311.54216706114363
- ],
- "size": [
- 305.6628685610174,
- 307.0336884172169
- ],
- "flags": {},
- "order": 4,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "images",
- "name": "images",
- "type": "IMAGE",
- "link": 6
- }
- ],
- "outputs": [],
- "properties": {
- "Node name for S&R": "PreviewImage"
- },
- "widgets_values": []
- },
- {
- "id": 7,
- "type": "VLLMOmniDiffusionSampling",
- "pos": [
- 234.61958293819296,
- 9.405681270043619
- ],
- "size": [
- 284.205078125,
- 226
- ],
- "flags": {},
- "order": 0,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "n",
- "name": "n",
- "type": "INT",
- "widget": {
- "name": "n"
- },
- "link": null
- },
- {
- "localized_name": "num_inference_steps",
- "name": "num_inference_steps",
- "type": "INT",
- "widget": {
- "name": "num_inference_steps"
- },
- "link": null
- },
- {
- "localized_name": "guidance_scale",
- "name": "guidance_scale",
- "type": "FLOAT",
- "widget": {
- "name": "guidance_scale"
- },
- "link": null
- },
- {
- "localized_name": "true_cfg_scale",
- "name": "true_cfg_scale",
- "type": "FLOAT",
- "widget": {
- "name": "true_cfg_scale"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_slicing",
- "name": "vae_use_slicing",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_slicing"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_tiling",
- "name": "vae_use_tiling",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_tiling"
- },
- "link": null
- },
- {
- "localized_name": "seed",
- "name": "seed",
- "type": "INT",
- "widget": {
- "name": "seed"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "diffusion sampling params",
- "name": "diffusion sampling params",
- "type": "SAMPLING_PARAMS",
- "links": [
- 5
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniDiffusionSampling"
- },
- "widgets_values": [
- 1,
- 50,
- 1,
- 1,
- false,
- false,
- 1525,
- "randomize"
- ]
- },
- {
- "id": 4,
- "type": "VLLMOmniDiffusionSampling",
- "pos": [
- 666.8380026154548,
- 268.86271068330126
- ],
- "size": [
- 284.205078125,
- 226
- ],
- "flags": {},
- "order": 1,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "n",
- "name": "n",
- "type": "INT",
- "widget": {
- "name": "n"
- },
- "link": null
- },
- {
- "localized_name": "num_inference_steps",
- "name": "num_inference_steps",
- "type": "INT",
- "widget": {
- "name": "num_inference_steps"
- },
- "link": null
- },
- {
- "localized_name": "guidance_scale",
- "name": "guidance_scale",
- "type": "FLOAT",
- "widget": {
- "name": "guidance_scale"
- },
- "link": null
- },
- {
- "localized_name": "true_cfg_scale",
- "name": "true_cfg_scale",
- "type": "FLOAT",
- "widget": {
- "name": "true_cfg_scale"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_slicing",
- "name": "vae_use_slicing",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_slicing"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_tiling",
- "name": "vae_use_tiling",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_tiling"
- },
- "link": null
- },
- {
- "localized_name": "seed",
- "name": "seed",
- "type": "INT",
- "widget": {
- "name": "seed"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "diffusion sampling params",
- "name": "diffusion sampling params",
- "type": "SAMPLING_PARAMS",
- "links": [
- 2
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniDiffusionSampling"
- },
- "widgets_values": [
- 4,
- 50,
- 7,
- 1,
- false,
- false,
- 42,
- "fixed"
- ]
- },
- {
- "id": 5,
- "type": "VLLMOmniGenerateImage",
- "pos": [
- 984.723613585788,
- 63.376900027553276
- ],
- "size": [
- 416.56628685610167,
- 372.1662621294205
- ],
- "flags": {},
- "order": 3,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "shape": 7,
- "type": "IMAGE",
- "link": 4
- },
- {
- "localized_name": "mask",
- "name": "mask",
- "shape": 7,
- "type": "MASK",
- "link": null
- },
- {
- "localized_name": "sampling_params",
- "name": "sampling_params",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 2
- },
- {
- "localized_name": "url",
- "name": "url",
- "type": "STRING",
- "widget": {
- "name": "url"
- },
- "link": null
- },
- {
- "localized_name": "model",
- "name": "model",
- "type": "STRING",
- "widget": {
- "name": "model"
- },
- "link": null
- },
- {
- "localized_name": "prompt",
- "name": "prompt",
- "type": "STRING",
- "widget": {
- "name": "prompt"
- },
- "link": null
- },
- {
- "localized_name": "negative_prompt",
- "name": "negative_prompt",
- "type": "STRING",
- "widget": {
- "name": "negative_prompt"
- },
- "link": null
- },
- {
- "localized_name": "width",
- "name": "width",
- "type": "INT",
- "widget": {
- "name": "width"
- },
- "link": null
- },
- {
- "localized_name": "height",
- "name": "height",
- "type": "INT",
- "widget": {
- "name": "height"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "image",
- "name": "image",
- "type": "IMAGE",
- "links": [
- 1
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniGenerateImage"
- },
- "widgets_values": [
- "http://localhost:8001/v1",
- "/home/models/Qwen/Qwen-Image-Edit",
- "A high-quality, high contrast, stylized portrait of the object in the uploaded reference image. Pop art and doodle style, with abundant scribbling patterns, such as a teal crown, orange lightning bolts, colorful handwritten scripts.",
- "Realistic",
- 800,
- 800
- ]
- },
- {
- "id": 6,
- "type": "VLLMOmniGenerateImage",
- "pos": [
- 541.2702717429818,
- -160.70392744708548
- ],
- "size": [
- 416.56628685610167,
- 372.1662621294205
- ],
- "flags": {},
- "order": 2,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "shape": 7,
- "type": "IMAGE",
- "link": null
- },
- {
- "localized_name": "mask",
- "name": "mask",
- "shape": 7,
- "type": "MASK",
- "link": null
- },
- {
- "localized_name": "sampling_params",
- "name": "sampling_params",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 5
- },
- {
- "localized_name": "url",
- "name": "url",
- "type": "STRING",
- "widget": {
- "name": "url"
- },
- "link": null
- },
- {
- "localized_name": "model",
- "name": "model",
- "type": "STRING",
- "widget": {
- "name": "model"
- },
- "link": null
- },
- {
- "localized_name": "prompt",
- "name": "prompt",
- "type": "STRING",
- "widget": {
- "name": "prompt"
- },
- "link": null
- },
- {
- "localized_name": "negative_prompt",
- "name": "negative_prompt",
- "type": "STRING",
- "widget": {
- "name": "negative_prompt"
- },
- "link": null
- },
- {
- "localized_name": "width",
- "name": "width",
- "type": "INT",
- "widget": {
- "name": "width"
- },
- "link": null
- },
- {
- "localized_name": "height",
- "name": "height",
- "type": "INT",
- "widget": {
- "name": "height"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "image",
- "name": "image",
- "type": "IMAGE",
- "links": [
- 4,
- 6
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniGenerateImage"
- },
- "widgets_values": [
- "http://localhost:8000/v1",
- "/home/models/Tongyi-MAI/Z-Image-Turbo",
- "A headshot of a cute Siamese kitty. Blurred background due to wide aperture. Close-up look. Realistic.",
- "Cartoonish.",
- 800,
- 800
- ]
- }
- ],
- "links": [
- [
- 1,
- 5,
- 0,
- 1,
- 0,
- "IMAGE"
- ],
- [
- 2,
- 4,
- 0,
- 5,
- 2,
- "SAMPLING_PARAMS"
- ],
- [
- 4,
- 6,
- 0,
- 5,
- 0,
- "IMAGE"
- ],
- [
- 5,
- 7,
- 0,
- 6,
- 2,
- "SAMPLING_PARAMS"
- ],
- [
- 6,
- 6,
- 0,
- 8,
- 0,
- "IMAGE"
- ]
- ],
- "groups": [],
- "config": {},
- "extra": {
- "workflowRendererVersion": "LG",
- "ds": {
- "scale": 1.0426432563169903,
- "offset": [
- -90.4071374799346,
- 575.2948920069742
- ]
- }
- },
- "version": 0.4
-}
diff --git a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Image Generation.json b/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Image Generation.json
deleted file mode 100644
index 86194ee70d5..00000000000
--- a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Image Generation.json
+++ /dev/null
@@ -1 +0,0 @@
-{"id":"91f75acc-8040-40f6-865a-2e8a7cfd6672","revision":0,"last_node_id":11,"last_link_id":23,"nodes":[{"id":3,"type":"PreviewImage","pos":[1281.8455167767304,-69.02638461454333],"size":[305.6628685610174,307.0336884172169],"flags":{},"order":4,"mode":0,"inputs":[{"localized_name":"图像","name":"images","type":"IMAGE","link":22}],"outputs":[],"properties":{"Node name for S&R":"PreviewImage"},"widgets_values":[]},{"id":11,"type":"VLLMOmniGenerateImage","pos":[816.0962515873642,-137.51112584854445],"size":[400,278],"flags":{},"order":3,"mode":0,"inputs":[{"localized_name":"image","name":"image","shape":7,"type":"IMAGE","link":21},{"localized_name":"mask","name":"mask","shape":7,"type":"MASK","link":null},{"localized_name":"sampling_params","name":"sampling_params","shape":7,"type":"SAMPLING_PARAMS","link":23},{"localized_name":"url","name":"url","type":"STRING","widget":{"name":"url"},"link":null},{"localized_name":"model","name":"model","type":"STRING","widget":{"name":"model"},"link":null},{"localized_name":"prompt","name":"prompt","type":"STRING","widget":{"name":"prompt"},"link":null},{"localized_name":"negative_prompt","name":"negative_prompt","type":"STRING","widget":{"name":"negative_prompt"},"link":null},{"localized_name":"width","name":"width","type":"INT","widget":{"name":"width"},"link":null},{"localized_name":"height","name":"height","type":"INT","widget":{"name":"height"},"link":null}],"outputs":[{"localized_name":"image","name":"image","type":"IMAGE","links":[22]}],"properties":{"Node name for S&R":"VLLMOmniGenerateImage"},"widgets_values":["http://localhost:8000/v1","Qwen/Qwen-Image-Edit","Put this figure in a realistic mountain view","",512,512]},{"id":10,"type":"MarkdownNote","pos":[227.99306819575278,-231.24143306069843],"size":[240.3326478922474,136.3505820791881],"flags":{},"order":1,"mode":0,"inputs":[],"outputs":[],"title":"Note: Task and Input","properties":{},"widgets_values":["vLLM-Omni nodes are categorized based on the output modality. 
The \"Generate Image\" node supports both text-to-image generation or image-to-image generation (a.k.a. image editing). The node will route to the correct endpoint depending on whether an input image is present or not."],"color":"#432","bgcolor":"#000"},{"id":4,"type":"LoadImage","pos":[496.31859627609606,-229.71277089860084],"size":[270,314],"flags":{},"order":2,"mode":0,"inputs":[{"localized_name":"图像","name":"image","type":"COMBO","widget":{"name":"image"},"link":null},{"localized_name":"选择文件上传","name":"upload","type":"IMAGEUPLOAD","widget":{"name":"upload"},"link":null}],"outputs":[{"localized_name":"图像","name":"IMAGE","type":"IMAGE","links":[21]},{"localized_name":"遮罩","name":"MASK","type":"MASK","links":null}],"properties":{"Node name for S&R":"LoadImage"},"widgets_values":["example.png","image"]},{"id":8,"type":"VLLMOmniDiffusionSampling","pos":[478.59266934006774,183.67711984955648],"size":[284.205078125,226],"flags":{},"order":0,"mode":0,"inputs":[{"localized_name":"n","name":"n","type":"INT","widget":{"name":"n"},"link":null},{"localized_name":"num_inference_steps","name":"num_inference_steps","type":"INT","widget":{"name":"num_inference_steps"},"link":null},{"localized_name":"guidance_scale","name":"guidance_scale","type":"FLOAT","widget":{"name":"guidance_scale"},"link":null},{"localized_name":"true_cfg_scale","name":"true_cfg_scale","type":"FLOAT","widget":{"name":"true_cfg_scale"},"link":null},{"localized_name":"vae_use_slicing","name":"vae_use_slicing","type":"BOOLEAN","widget":{"name":"vae_use_slicing"},"link":null},{"localized_name":"vae_use_tiling","name":"vae_use_tiling","type":"BOOLEAN","widget":{"name":"vae_use_tiling"},"link":null},{"localized_name":"seed","name":"seed","type":"INT","widget":{"name":"seed"},"link":null}],"outputs":[{"localized_name":"diffusion sampling params","name":"diffusion sampling params","type":"SAMPLING_PARAMS","links":[23]}],"properties":{"Node name for 
S&R":"VLLMOmniDiffusionSampling"},"widgets_values":[4,50,1,1,false,false,42,"randomize"]}],"links":[[21,4,0,11,0,"IMAGE"],[22,11,0,3,0,"IMAGE"],[23,8,0,11,2,"SAMPLING_PARAMS"]],"groups":[{"id":1,"title":"Input","bounding":[213.1706010087147,-313.20095750667554,560.8246471437027,407.4246532472182],"color":"#3f789e","font_size":24,"flags":{}}],"config":{},"extra":{"workflowRendererVersion":"LG","ds":{"scale":1.1469075819486894,"offset":[-55.18047513099182,220.2553505195962]}},"version":0.4}
diff --git a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Multimodal Understanding.json b/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Multimodal Understanding.json
deleted file mode 100644
index 4d32d5368c2..00000000000
--- a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Multimodal Understanding.json
+++ /dev/null
@@ -1,761 +0,0 @@
-{
- "id": "1c99f525-0a37-45ba-a28a-7df7c3af66b4",
- "revision": 0,
- "last_node_id": 12,
- "last_link_id": 14,
- "nodes": [
- {
- "id": 1,
- "type": "VLLMOmniUnderstanding",
- "pos": [
- 1191.2177053682556,
- 144.66829928181377
- ],
- "size": [
- 400,
- 268
- ],
- "flags": {},
- "order": 8,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "shape": 7,
- "type": "IMAGE",
- "link": 1
- },
- {
- "localized_name": "video",
- "name": "video",
- "shape": 7,
- "type": "VIDEO",
- "link": 2
- },
- {
- "localized_name": "audio",
- "name": "audio",
- "shape": 7,
- "type": "AUDIO",
- "link": 3
- },
- {
- "localized_name": "sampling_params",
- "name": "sampling_params",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 14
- },
- {
- "localized_name": "url",
- "name": "url",
- "type": "STRING",
- "widget": {
- "name": "url"
- },
- "link": null
- },
- {
- "localized_name": "model",
- "name": "model",
- "type": "STRING",
- "widget": {
- "name": "model"
- },
- "link": null
- },
- {
- "localized_name": "prompt",
- "name": "prompt",
- "type": "STRING",
- "widget": {
- "name": "prompt"
- },
- "link": null
- },
- {
- "localized_name": "output_text",
- "name": "output_text",
- "type": "BOOLEAN",
- "widget": {
- "name": "output_text"
- },
- "link": null
- },
- {
- "localized_name": "output_audio",
- "name": "output_audio",
- "type": "BOOLEAN",
- "widget": {
- "name": "output_audio"
- },
- "link": null
- },
- {
- "localized_name": "use_audio_in_video",
- "name": "use_audio_in_video",
- "type": "BOOLEAN",
- "widget": {
- "name": "use_audio_in_video"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "text_response",
- "name": "text_response",
- "type": "STRING",
- "links": [
- 8
- ]
- },
- {
- "localized_name": "audio_response",
- "name": "audio_response",
- "type": "AUDIO",
- "links": [
- 9
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniUnderstanding"
- },
- "widgets_values": [
- "http://localhost:8000/v1",
- "Qwen/Qwen2.5-Omni-7B",
- "",
- true,
- true,
- true
- ]
- },
- {
- "id": 3,
- "type": "LoadVideo",
- "pos": [
- 729.5984141255855,
- -198.631920454299
- ],
- "size": [
- 282.798828125,
- 233.0743408203125
- ],
- "flags": {},
- "order": 0,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "file",
- "name": "file",
- "type": "COMBO",
- "widget": {
- "name": "file"
- },
- "link": null
- },
- {
- "localized_name": "choose file to upload",
- "name": "upload",
- "type": "IMAGEUPLOAD",
- "widget": {
- "name": "upload"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "VIDEO",
- "name": "VIDEO",
- "type": "VIDEO",
- "links": [
- 2
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "LoadVideo"
- },
- "widgets_values": [
- "draw.mp4",
- "image"
- ]
- },
- {
- "id": 4,
- "type": "LoadAudio",
- "pos": [
- 729.8037086965753,
- 99.86963519703949
- ],
- "size": [
- 282.798828125,
- 136
- ],
- "flags": {},
- "order": 1,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "audio",
- "name": "audio",
- "type": "COMBO",
- "widget": {
- "name": "audio"
- },
- "link": null
- },
- {
- "localized_name": "audioUI",
- "name": "audioUI",
- "type": "AUDIO_UI",
- "widget": {
- "name": "audioUI"
- },
- "link": null
- },
- {
- "localized_name": "choose file to upload",
- "name": "upload",
- "type": "AUDIOUPLOAD",
- "widget": {
- "name": "upload"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "AUDIO",
- "name": "AUDIO",
- "type": "AUDIO",
- "links": [
- 3
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "LoadAudio"
- },
- "widgets_values": [
- "Megan-Fox.mp3",
- null,
- null
- ]
- },
- {
- "id": 5,
- "type": "VLLMOmniARSampling",
- "pos": [
- 510.3517536642828,
- 658.073751009259
- ],
- "size": [
- 270,
- 178
- ],
- "flags": {},
- "order": 2,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "max_tokens",
- "name": "max_tokens",
- "type": "INT",
- "widget": {
- "name": "max_tokens"
- },
- "link": null
- },
- {
- "localized_name": "temperature",
- "name": "temperature",
- "type": "FLOAT",
- "widget": {
- "name": "temperature"
- },
- "link": null
- },
- {
- "localized_name": "top_p",
- "name": "top_p",
- "type": "FLOAT",
- "widget": {
- "name": "top_p"
- },
- "link": null
- },
- {
- "localized_name": "repetition_penalty",
- "name": "repetition_penalty",
- "type": "FLOAT",
- "widget": {
- "name": "repetition_penalty"
- },
- "link": null
- },
- {
- "localized_name": "seed",
- "name": "seed",
- "type": "INT",
- "widget": {
- "name": "seed"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "AR sampling params",
- "name": "AR sampling params",
- "type": "SAMPLING_PARAMS",
- "links": [
- 12
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniARSampling"
- },
- "widgets_values": [
- 100,
- 1,
- 1,
- 1,
- -1,
- "randomize"
- ]
- },
- {
- "id": 7,
- "type": "VLLMOmniARSampling",
- "pos": [
- 503.33235181647115,
- 419.34158016181806
- ],
- "size": [
- 270,
- 178
- ],
- "flags": {},
- "order": 3,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "max_tokens",
- "name": "max_tokens",
- "type": "INT",
- "widget": {
- "name": "max_tokens"
- },
- "link": null
- },
- {
- "localized_name": "temperature",
- "name": "temperature",
- "type": "FLOAT",
- "widget": {
- "name": "temperature"
- },
- "link": null
- },
- {
- "localized_name": "top_p",
- "name": "top_p",
- "type": "FLOAT",
- "widget": {
- "name": "top_p"
- },
- "link": null
- },
- {
- "localized_name": "repetition_penalty",
- "name": "repetition_penalty",
- "type": "FLOAT",
- "widget": {
- "name": "repetition_penalty"
- },
- "link": null
- },
- {
- "localized_name": "seed",
- "name": "seed",
- "type": "INT",
- "widget": {
- "name": "seed"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "AR sampling params",
- "name": "AR sampling params",
- "type": "SAMPLING_PARAMS",
- "links": [
- 5,
- 13
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniARSampling"
- },
- "widgets_values": [
- 100,
- 1,
- 1,
- 1,
- -1,
- "randomize"
- ]
- },
- {
- "id": 8,
- "type": "VLLMOmniSamplingParamsList",
- "pos": [
- 820.6056617389042,
- 426.38372037182273
- ],
- "size": [
- 263.066015625,
- 66
- ],
- "flags": {},
- "order": 7,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "param1",
- "name": "param1",
- "type": "SAMPLING_PARAMS",
- "link": 5
- },
- {
- "localized_name": "param2",
- "name": "param2",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 12
- },
- {
- "localized_name": "param3",
- "name": "param3",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 13
- }
- ],
- "outputs": [
- {
- "localized_name": "param list",
- "name": "param list",
- "type": "SAMPLING_PARAMS",
- "links": [
- 14
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniSamplingParamsList"
- },
- "widgets_values": []
- },
- {
- "id": 11,
- "type": "MarkdownNote",
- "pos": [
- 826.2328280438272,
- 569.1890318701705
- ],
- "size": [
- 333.8220435590464,
- 261.63596728060656
- ],
- "flags": {},
- "order": 4,
- "mode": 0,
- "inputs": [],
- "outputs": [],
- "title": "Note: Sampling Parameters",
- "properties": {},
- "widgets_values": [
- "## Sampling Parameter Types\n\nThere are two types of sampling parameters: one for autoregression and one for diffusion.\nYou should ensure that you have chosen the correct type of sampling parameters for the model you request.\n\n## Stages & Shorthand\n\nFor multi-stage models such as Qwen Omni, you can either\n- connect one sampling parameter node, which is applied to all stages, or\n- connect exactly as many sampling parameter nodes as the model has stages to a \"Multi-Stage Sampling Parameter List\" node, then connect this node to the primary request node.\n\nNote that this shorthand is intended to stay consistent with the [online serving API](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/qwen2_5_omni/)"
- ],
- "color": "#432",
- "bgcolor": "#000"
- },
- {
- "id": 12,
- "type": "MarkdownNote",
- "pos": [
- 378.9866207003777,
- 152.59550252215752
- ],
- "size": [
- 319.7287574247016,
- 107.15904081906785
- ],
- "flags": {},
- "order": 6,
- "mode": 0,
- "inputs": [],
- "outputs": [],
- "title": "Note: Input",
- "properties": {},
- "widgets_values": [
- "Note that not all models support every modality as input. For example, `ByteDance-Seed/BAGEL-7B-MoT` in Multimodality Understanding mode only supports text and image input.\n\nYou should ensure that the inputs are supported by the model. You can check the corresponding [online serving documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/bagel/) for confirmation."
- ],
- "color": "#432",
- "bgcolor": "#000"
- },
- {
- "id": 2,
- "type": "LoadImage",
- "pos": [
- 394.4674804308822,
- -207.6987397548834
- ],
- "size": [
- 282.798828125,
- 314
- ],
- "flags": {},
- "order": 5,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "type": "COMBO",
- "widget": {
- "name": "image"
- },
- "link": null
- },
- {
- "localized_name": "choose file to upload",
- "name": "upload",
- "type": "IMAGEUPLOAD",
- "widget": {
- "name": "upload"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "IMAGE",
- "name": "IMAGE",
- "type": "IMAGE",
- "links": [
- 1
- ]
- },
- {
- "localized_name": "MASK",
- "name": "MASK",
- "type": "MASK",
- "links": null
- }
- ],
- "properties": {
- "Node name for S&R": "LoadImage"
- },
- "widgets_values": [
- "example.png",
- "image"
- ]
- },
- {
- "id": 10,
- "type": "PreviewAudio",
- "pos": [
- 1664.548345556043,
- 297.5921292054054
- ],
- "size": [
- 270,
- 88
- ],
- "flags": {},
- "order": 10,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "audio",
- "name": "audio",
- "type": "AUDIO",
- "link": 9
- },
- {
- "localized_name": "audioUI",
- "name": "audioUI",
- "type": "AUDIO_UI",
- "widget": {
- "name": "audioUI"
- },
- "link": null
- }
- ],
- "outputs": [],
- "properties": {
- "Node name for S&R": "PreviewAudio"
- },
- "widgets_values": []
- },
- {
- "id": 9,
- "type": "ShowText|pysssss",
- "pos": [
- 1649.2506875091847,
- 66.22823888292349
- ],
- "size": [
- 318.7188464232943,
- 173.38502269972975
- ],
- "flags": {},
- "order": 9,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "text",
- "name": "text",
- "type": "STRING",
- "link": 8
- }
- ],
- "outputs": [
- {
- "localized_name": "STRING",
- "name": "STRING",
- "shape": 6,
- "type": "STRING",
- "links": null
- }
- ],
- "properties": {
- "Node name for S&R": "ShowText|pysssss"
- },
- "widgets_values": []
- }
- ],
- "links": [
- [
- 1,
- 2,
- 0,
- 1,
- 0,
- "IMAGE"
- ],
- [
- 2,
- 3,
- 0,
- 1,
- 1,
- "VIDEO"
- ],
- [
- 3,
- 4,
- 0,
- 1,
- 2,
- "AUDIO"
- ],
- [
- 5,
- 7,
- 0,
- 8,
- 0,
- "SAMPLING_PARAMS"
- ],
- [
- 8,
- 1,
- 0,
- 9,
- 0,
- "STRING"
- ],
- [
- 9,
- 1,
- 1,
- 10,
- 0,
- "AUDIO"
- ],
- [
- 12,
- 5,
- 0,
- 8,
- 1,
- "SAMPLING_PARAMS"
- ],
- [
- 13,
- 7,
- 0,
- 8,
- 2,
- "SAMPLING_PARAMS"
- ],
- [
- 14,
- 8,
- 0,
- 1,
- 3,
- "SAMPLING_PARAMS"
- ]
- ],
- "groups": [
- {
- "id": 1,
- "title": "Sampling Parameters",
- "bounding": [
- 480.1649301181556,
- 341.08402513937995,
- 692.4113568972277,
- 510.48648853403665
- ],
- "color": "#3f789e",
- "font_size": 24,
- "flags": {}
- },
- {
- "id": 2,
- "title": "Input",
- "bounding": [
- 344.2364817813528,
- -287.5850484183313,
- 704.8832238768634,
- 559.5124009832894
- ],
- "color": "#3f789e",
- "font_size": 24,
- "flags": {}
- }
- ],
- "config": {},
- "extra": {
- "workflowRendererVersion": "LG",
- "ds": {
- "scale": 0.9478575057427204,
- "offset": [
- 33.81037136029557,
- 307.1974296197726
- ]
- }
- },
- "version": 0.4
-}
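Review note: the `links` array in the deleted workflow above is easier to read once each entry is decoded. The sketch below assumes the usual ComfyUI/LiteGraph serialization, where each link is `[link_id, origin_node_id, origin_slot, target_node_id, target_slot, type]`; that layout is an assumption here, not something this diff states. `describe_links` is a hypothetical helper, and `sample` is a trimmed excerpt of the workflow, not the full file.

```python
# Assumed link layout (ComfyUI/LiteGraph convention, not confirmed by this diff):
# [link_id, origin_node_id, origin_slot, target_node_id, target_slot, type]
def describe_links(workflow: dict) -> list[str]:
    """Render each link as 'id: SourceType[slot] -> TargetType[slot] (TYPE)'."""
    node_types = {node["id"]: node["type"] for node in workflow["nodes"]}
    edges = []
    for link_id, src, src_slot, dst, dst_slot, link_type in workflow["links"]:
        edges.append(
            f"{link_id}: {node_types[src]}[{src_slot}] -> "
            f"{node_types[dst]}[{dst_slot}] ({link_type})"
        )
    return edges


# Trimmed excerpt of the deleted workflow: three nodes and two of its links.
sample = {
    "nodes": [
        {"id": 1, "type": "VLLMOmniUnderstanding"},
        {"id": 2, "type": "LoadImage"},
        {"id": 8, "type": "VLLMOmniSamplingParamsList"},
    ],
    "links": [
        [1, 2, 0, 1, 0, "IMAGE"],
        [14, 8, 0, 1, 3, "SAMPLING_PARAMS"],
    ],
}

for edge in describe_links(sample):
    print(edge)
```

Read this way, link 14 carries the output of the parameter-list node into input slot 3 of the understanding node, which is the `sampling_params` input in the node definition above.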
diff --git a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni TTS.json b/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni TTS.json
deleted file mode 100644
index 761e8d147de..00000000000
--- a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni TTS.json
+++ /dev/null
@@ -1 +0,0 @@
-{"id":"c0a11241-f0e6-4f5c-8488-b3d344a9df3d","revision":0,"last_node_id":7,"last_link_id":3,"nodes":[{"id":4,"type":"PreviewAudio","pos":[690.1759063763212,226.76444957981545],"size":[270,88],"flags":{},"order":5,"mode":0,"inputs":[{"localized_name":"audio","name":"audio","type":"AUDIO","link":2},{"localized_name":"audioUI","name":"audioUI","type":"AUDIO_UI","widget":{"name":"audioUI"},"link":null}],"outputs":[],"properties":{"Node name for S&R":"PreviewAudio"},"widgets_values":[]},{"id":5,"type":"LoadAudio","pos":[-83.61093919682321,103.80510898465587],"size":[282.798828125,136],"flags":{},"order":0,"mode":0,"inputs":[{"localized_name":"audio","name":"audio","type":"COMBO","widget":{"name":"audio"},"link":null},{"localized_name":"audioUI","name":"audioUI","type":"AUDIO_UI","widget":{"name":"audioUI"},"link":null},{"localized_name":"choose file to upload","name":"upload","type":"AUDIOUPLOAD","widget":{"name":"upload"},"link":null}],"outputs":[{"localized_name":"AUDIO","name":"AUDIO","type":"AUDIO","links":[3]}],"properties":{"Node name for S&R":"LoadAudio"},"widgets_values":["Megan-Fox.mp3",null,null]},{"id":3,"type":"VLLMOmniVoiceClone","pos":[248.63466280064262,168.13112893706375],"size":[400,306],"flags":{},"order":4,"mode":0,"inputs":[{"localized_name":"ref_audio","name":"ref_audio","type":"AUDIO","link":3},{"localized_name":"model_specific_params","name":"model_specific_params","shape":7,"type":"TTS_PARAMS","link":1},{"localized_name":"url","name":"url","type":"STRING","widget":{"name":"url"},"link":null},{"localized_name":"model","name":"model","type":"STRING","widget":{"name":"model"},"link":null},{"localized_name":"input","name":"input","type":"STRING","widget":{"name":"input"},"link":null},{"localized_name":"voice","name":"voice","type":"STRING","widget":{"name":"voice"},"link":null},{"localized_name":"response_format","name":"response_format","type":"COMBO","widget":{"name":"response_format"},"link":null},{"localized_name":"speed","name":"speed","type":"FLOAT","widget":{"name":"speed"},"link":null},{"localized_name":"ref_text","name":"ref_text","type":"STRING","widget":{"name":"ref_text"},"link":null},{"localized_name":"x_vector_only_mode","name":"x_vector_only_mode","type":"BOOLEAN","widget":{"name":"x_vector_only_mode"},"link":null}],"outputs":[{"localized_name":"audio","name":"audio","type":"AUDIO","links":[2]}],"properties":{"Node name for S&R":"VLLMOmniVoiceClone"},"widgets_values":["http://localhost:8000/v1","Qwen/Qwen3-TTS-12Hz-1.7B-Base","Someone just spilled a cup of coffee on my jacket this morning in the subway. Now I have to wear this stained jacket for a whole day!","","mp3",1,"",true]},{"id":6,"type":"MarkdownNote","pos":[255.52667408769662,27.26842775454861],"size":[381.58204108637847,88],"flags":{},"order":2,"mode":0,"inputs":[],"outputs":[],"title":"Note: TTS Nodes","properties":{},"widgets_values":["Apart from the Voice Cloning node, there is also another TTS node for simple text-to-speech tasks (without extra inputs for reference audio)."],"color":"#432","bgcolor":"#000"},{"id":7,"type":"MarkdownNote","pos":[-196.99453003423977,598.3693693727904],"size":[382.4539509209102,88],"flags":{},"order":3,"mode":0,"inputs":[],"outputs":[],"title":"Note: Model-Specific Parameters","properties":{},"widgets_values":["TTS models often require some tailor-made parameters. If you need to customize these parameters, grab one that matches the model you are requesting from \"TTS Params\" subfolder."],"color":"#432","bgcolor":"#000"},{"id":2,"type":"VLLMOmniQwenTTSParams","pos":[-207.8925310560497,342.7942867650348],"size":[400,200],"flags":{},"order":1,"mode":0,"inputs":[{"localized_name":"task_type","name":"task_type","type":"COMBO","widget":{"name":"task_type"},"link":null},{"localized_name":"language","name":"language","type":"COMBO","widget":{"name":"language"},"link":null},{"localized_name":"instructions","name":"instructions","type":"STRING","widget":{"name":"instructions"},"link":null},{"localized_name":"max_new_tokens","name":"max_new_tokens","type":"INT","widget":{"name":"max_new_tokens"},"link":null}],"outputs":[{"localized_name":"Qwen TTS params","name":"Qwen TTS params","type":"TTS_PARAMS","links":[1]}],"properties":{"Node name for S&R":"VLLMOmniQwenTTSParams"},"widgets_values":["Base","Auto","Super angry",2048]}],"links":[[1,2,0,3,1,"TTS_PARAMS"],[2,3,0,4,0,"AUDIO"],[3,5,0,3,0,"AUDIO"]],"groups":[{"id":1,"title":"Model-Specific Parameters","bounding":[-228.38328407738007,264.42790274716276,443.9887293164586,439.6853171248897],"color":"#3f789e","font_size":24,"flags":{}}],"config":{},"extra":{"workflowRendererVersion":"LG","ds":{"scale":1.1469075819486911,"offset":[898.8819468322333,258.7179979718389]}},"version":0.4}
diff --git a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Video Generation.json b/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Video Generation.json
deleted file mode 100644
index 636360ab8ca..00000000000
--- a/apps/ComfyUI-vLLM-Omni/example_workflows/vLLM-Omni Video Generation.json
+++ /dev/null
@@ -1,513 +0,0 @@
-{
- "id": "3aea1d08-7de5-4997-ba71-b2f66ad72a21",
- "revision": 0,
- "last_node_id": 14,
- "last_link_id": 28,
- "nodes": [
- {
- "id": 10,
- "type": "MarkdownNote",
- "pos": [
- 227.99306819575278,
- -231.24143306069843
- ],
- "size": [
- 240.3326478922474,
- 136.3505820791881
- ],
- "flags": {},
- "order": 0,
- "mode": 0,
- "inputs": [],
- "outputs": [],
- "title": "Note: Task and Input",
- "properties": {},
- "widgets_values": [
- "vLLM-Omni nodes are categorized based on the output modality. The \"Generate Video\" node supports both text-to-video generation and image-to-video generation. The node will create the corresponding payload depending on whether an input image is present."
- ],
- "color": "#432",
- "bgcolor": "#000"
- },
- {
- "id": 4,
- "type": "LoadImage",
- "pos": [
- 496.31859627609606,
- -229.71277089860084
- ],
- "size": [
- 270,
- 314
- ],
- "flags": {},
- "order": 1,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "type": "COMBO",
- "widget": {
- "name": "image"
- },
- "link": null
- },
- {
- "localized_name": "choose file to upload",
- "name": "upload",
- "type": "IMAGEUPLOAD",
- "widget": {
- "name": "upload"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "IMAGE",
- "name": "IMAGE",
- "type": "IMAGE",
- "links": [
- 25
- ]
- },
- {
- "localized_name": "MASK",
- "name": "MASK",
- "type": "MASK",
- "links": null
- }
- ],
- "properties": {
- "Node name for S&R": "LoadImage"
- },
- "widgets_values": [
- "cute-cat.jpg",
- "image"
- ]
- },
- {
- "id": 8,
- "type": "VLLMOmniDiffusionSampling",
- "pos": [
- 465.5140218220927,
- 148.80072646828987
- ],
- "size": [
- 284.205078125,
- 226
- ],
- "flags": {},
- "order": 2,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "n",
- "name": "n",
- "type": "INT",
- "widget": {
- "name": "n"
- },
- "link": null
- },
- {
- "localized_name": "num_inference_steps",
- "name": "num_inference_steps",
- "type": "INT",
- "widget": {
- "name": "num_inference_steps"
- },
- "link": null
- },
- {
- "localized_name": "guidance_scale",
- "name": "guidance_scale",
- "type": "FLOAT",
- "widget": {
- "name": "guidance_scale"
- },
- "link": null
- },
- {
- "localized_name": "true_cfg_scale",
- "name": "true_cfg_scale",
- "type": "FLOAT",
- "widget": {
- "name": "true_cfg_scale"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_slicing",
- "name": "vae_use_slicing",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_slicing"
- },
- "link": null
- },
- {
- "localized_name": "vae_use_tiling",
- "name": "vae_use_tiling",
- "type": "BOOLEAN",
- "widget": {
- "name": "vae_use_tiling"
- },
- "link": null
- },
- {
- "localized_name": "seed",
- "name": "seed",
- "type": "INT",
- "widget": {
- "name": "seed"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "diffusion sampling params",
- "name": "diffusion sampling params",
- "type": "SAMPLING_PARAMS",
- "links": [
- 27
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniDiffusionSampling"
- },
- "widgets_values": [
- 1,
- 50,
- 1,
- 1,
- false,
- false,
- 1992,
- "randomize"
- ]
- },
- {
- "id": 14,
- "type": "VLLMOmniWanParams",
- "pos": [
- 518.6005521845747,
- 428.4455663719664
- ],
- "size": [
- 270,
- 106
- ],
- "flags": {},
- "order": 3,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "guidance_scale_2",
- "name": "guidance_scale_2",
- "type": "FLOAT",
- "widget": {
- "name": "guidance_scale_2"
- },
- "link": null
- },
- {
- "localized_name": "boundary_ratio",
- "name": "boundary_ratio",
- "type": "FLOAT",
- "widget": {
- "name": "boundary_ratio"
- },
- "link": null
- },
- {
- "localized_name": "flow_shift",
- "name": "flow_shift",
- "type": "FLOAT",
- "widget": {
- "name": "flow_shift"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "Wan video params",
- "name": "Wan video params",
- "type": "VIDEO_PARAMS",
- "links": [
- 28
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniWanParams"
- },
- "widgets_values": [
- 4,
- 0.875,
- 5
- ]
- },
- {
- "id": 12,
- "type": "VLLMOmniGenerateVideo",
- "pos": [
- 827.2566336087867,
- -73.77449831827545
- ],
- "size": [
- 400,
- 346
- ],
- "flags": {},
- "order": 4,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "image",
- "name": "image",
- "shape": 7,
- "type": "IMAGE",
- "link": 25
- },
- {
- "localized_name": "sampling_params",
- "name": "sampling_params",
- "shape": 7,
- "type": "SAMPLING_PARAMS",
- "link": 27
- },
- {
- "localized_name": "lora",
- "name": "lora",
- "shape": 7,
- "type": "REMOTE_LORA",
- "link": null
- },
- {
- "localized_name": "model_params",
- "name": "model_params",
- "shape": 7,
- "type": "VIDEO_PARAMS",
- "link": 28
- },
- {
- "localized_name": "url",
- "name": "url",
- "type": "STRING",
- "widget": {
- "name": "url"
- },
- "link": null
- },
- {
- "localized_name": "model",
- "name": "model",
- "type": "STRING",
- "widget": {
- "name": "model"
- },
- "link": null
- },
- {
- "localized_name": "prompt",
- "name": "prompt",
- "type": "STRING",
- "widget": {
- "name": "prompt"
- },
- "link": null
- },
- {
- "localized_name": "negative_prompt",
- "name": "negative_prompt",
- "type": "STRING",
- "widget": {
- "name": "negative_prompt"
- },
- "link": null
- },
- {
- "localized_name": "width",
- "name": "width",
- "type": "INT",
- "widget": {
- "name": "width"
- },
- "link": null
- },
- {
- "localized_name": "height",
- "name": "height",
- "type": "INT",
- "widget": {
- "name": "height"
- },
- "link": null
- },
- {
- "localized_name": "fps",
- "name": "fps",
- "type": "INT",
- "widget": {
- "name": "fps"
- },
- "link": null
- },
- {
- "localized_name": "num_frames",
- "name": "num_frames",
- "type": "INT",
- "widget": {
- "name": "num_frames"
- },
- "link": null
- }
- ],
- "outputs": [
- {
- "localized_name": "video",
- "name": "video",
- "type": "VIDEO",
- "links": [
- 26
- ]
- }
- ],
- "properties": {
- "Node name for S&R": "VLLMOmniGenerateVideo"
- },
- "widgets_values": [
- "http://localhost:8000/v1",
- "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
- "Make the cat in the attached video yawning. Eye closed and mouth wide open. Keep the original realistic style.",
- "Cartoonish",
- 900,
- 1200,
- 12,
- 41
- ]
- },
- {
- "id": 13,
- "type": "SaveVideo",
- "pos": [
- 1257.5441369501657,
- -44.12956394419863
- ],
- "size": [
- 374,
- 372.34896826553796
- ],
- "flags": {},
- "order": 5,
- "mode": 0,
- "inputs": [
- {
- "localized_name": "video",
- "name": "video",
- "type": "VIDEO",
- "link": 26
- },
- {
- "localized_name": "filename_prefix",
- "name": "filename_prefix",
- "type": "STRING",
- "widget": {
- "name": "filename_prefix"
- },
- "link": null
- },
- {
- "localized_name": "format",
- "name": "format",
- "type": "COMBO",
- "widget": {
- "name": "format"
- },
- "link": null
- },
- {
- "localized_name": "codec",
- "name": "codec",
- "type": "COMBO",
- "widget": {
- "name": "codec"
- },
- "link": null
- }
- ],
- "outputs": [],
- "properties": {},
- "widgets_values": [
- "video/ComfyUI",
- "auto",
- "auto"
- ]
- }
- ],
- "links": [
- [
- 25,
- 4,
- 0,
- 12,
- 0,
- "IMAGE"
- ],
- [
- 26,
- 12,
- 0,
- 13,
- 0,
- "VIDEO"
- ],
- [
- 27,
- 8,
- 0,
- 12,
- 1,
- "SAMPLING_PARAMS"
- ],
- [
- 28,
- 14,
- 0,
- 12,
- 3,
- "VIDEO_PARAMS"
- ]
- ],
- "groups": [
- {
- "id": 1,
- "title": "Input",
- "bounding": [
- 213.1706010087147,
- -313.20095750667554,
- 560.8246471437027,
- 407.4246532472182
- ],
- "color": "#3f789e",
- "font_size": 24,
- "flags": {}
- }
- ],
- "config": {},
- "extra": {
- "workflowRendererVersion": "LG",
- "ds": {
- "scale": 1.1469075819486894,
- "offset": [
- 101.7632950847089,
- 492.29121889347806
- ]
- }
- },
- "version": 0.4
-}
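Review note: in the deleted video workflow above, `widgets_values` is a bare positional list, so it is not obvious which value belongs to which field. The sketch below pairs the values with the node's widget-backed inputs in declaration order. That ordering rule is an assumption about how ComfyUI serializes nodes, and it only holds for nodes without extra control widgets (the sampling nodes in these workflows carry a trailing `"randomize"` entry after `seed` that has no matching input). `widget_settings` is a hypothetical helper, and the prompt string is shortened for the example.

```python
def widget_settings(node: dict) -> dict:
    """Pair widget-backed input names with widgets_values by position.

    Assumes one value per widget input, in declaration order.
    """
    names = [inp["name"] for inp in node["inputs"] if "widget" in inp]
    values = node["widgets_values"]
    if len(names) != len(values):
        raise ValueError("extra control widgets present; positions do not align")
    return dict(zip(names, values))


# Trimmed copy of the VLLMOmniGenerateVideo node (link-only inputs kept to show
# that they are skipped; the prompt text is abbreviated).
generate_video = {
    "inputs": [
        {"name": "image", "link": 25},
        {"name": "sampling_params", "link": 27},
        {"name": "url", "widget": {"name": "url"}},
        {"name": "model", "widget": {"name": "model"}},
        {"name": "prompt", "widget": {"name": "prompt"}},
        {"name": "negative_prompt", "widget": {"name": "negative_prompt"}},
        {"name": "width", "widget": {"name": "width"}},
        {"name": "height", "widget": {"name": "height"}},
        {"name": "fps", "widget": {"name": "fps"}},
        {"name": "num_frames", "widget": {"name": "num_frames"}},
    ],
    "widgets_values": [
        "http://localhost:8000/v1",
        "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
        "Make the cat yawn.",
        "Cartoonish",
        900,
        1200,
        12,
        41,
    ],
}

settings = widget_settings(generate_video)
print(settings["width"], settings["height"], settings["fps"], settings["num_frames"])
```

Under that assumption the example requests a 900x1200 clip of 41 frames at 12 fps from the Wan model.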
diff --git a/apps/ComfyUI-vLLM-Omni/web/main.js b/apps/ComfyUI-vLLM-Omni/web/main.js
deleted file mode 100644
index da99b2539a3..00000000000
--- a/apps/ComfyUI-vLLM-Omni/web/main.js
+++ /dev/null
@@ -1,21 +0,0 @@
-/**
- * @file This file is intended to add dynamic fields to vLLM-Omni nodes
- * based on widget (in-node form fields) and input (connection link) values and changes.
- * However, this functionality is currently disabled/commented out.
- * Because it introduces too much complexity,
- * and it may even conflict with the current backend (Python) validation for unknown reasons (pending ComfyUI upstream fixes).
- */
-
-import { app } from "../../scripts/app.js";
-app.registerExtension({
- name: "vllm.vllm_omni",
- async beforeRegisterNodeDef(nodeType, nodeData, app) {
- if (!nodeData.name.startsWith("VLLMOmni")) {
- return
- }
- // Stub frontend plugin for now
- },
- async setup() {
- console.info("vLLM-Omni Setup complete!")
- },
-})
diff --git a/benchmarks/README.md b/benchmarks/README.md
deleted file mode 100644
index 68ffd40ef5c..00000000000
--- a/benchmarks/README.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Benchmarks Overview and Architecture
-
-This document explains the benchmark architecture across all benchmark assets in this repo. It describes what we measure, and where to find or plug in new scenarios. Per-task details remain in subfolder READMEs (e.g., `benchmarks//README.md`).
-
-## Scope and goals
-- Establish repeatable latency/throughput measurements for multimodal LLM pipelines.
-- Provide both HF Transformers (offline) and vLLM-Omni (multi-stage/pipeline) baselines.
-- Make it easy to plug in new datasets and models with minimal changes to the runner scripts.
-
-## Dataset and inputs
-- Default example: SeedTTS top-100 prompts (`benchmarks/build_dataset/top100.txt`) via `benchmarks/build_dataset/`.
-- Extensible: drop in new prompt files or modality-aligned payloads; keep the expected format for the consuming scripts (e.g., one prompt per line).
-- If you add a new dataset, document it under `benchmarks//README.md` and point scripts to your data path.
-
-## Directory layout
-- `benchmarks/build_dataset/` — dataset prep utilities (e.g., SeedTTS top100).
-- `benchmarks//vllm_omni/` — vLLM-Omni pipeline benchmarks, logs, outputs.
-- `benchmarks/accuracy/` — accuracy benchmark integrations that adapt external
- benchmark suites to vLLM-Omni serving and evaluation flows.
-- Add new tasks under `benchmarks//...` with the same pattern: `transformers/`, `vllm_omni/`, task-specific README, and (optionally) dataset prep notes.
-
-## Reference workflows
-- **HF Transformers (offline, single process)**
- Script (example): `benchmarks//transformers/eval_qwen3_moe_omni_transformers.sh`
- Outputs: `benchmark_results/perf_stats.json`, `benchmark_results/results.json`, `benchmark_results/audio/` (if audio is produced).
-
-- **vLLM-Omni end-to-end pipeline**
- Script (example): `benchmarks//vllm_omni/eval_qwen3_moe_omni.sh`
- Outputs: `vllm_omni/logs/*.stats.jsonl` (per-stage/overall latency & TPS), `vllm_omni/logs/stage*.log`, `vllm_omni/outputs/` (text/audio artifacts).
-
-- **Adding a new task/model**
- 1) Create `benchmarks//transformers/` and/or `benchmarks//vllm_omni/` with scripts referencing your model and dataset.
- 2) Add a task README describing dataset, configs, and expected outputs.
- 3) Keep the output/log structure similar for easy comparison (perf_stats/results/audio or text outputs; stats.jsonl/logs for pipeline).
-
-## Metrics to watch
-- **Throughput**: `overall_tps`, `*_tps_avg` per stage.
-- **Latency distribution**: look for long tails in `*.stats.jsonl`.
-- **Quality/completeness**: missing outputs or errors in stage logs indicate pipeline failures or misconfigurations.
-
-## Troubleshooting
-- Verify GPU/driver/FlashAttention2 requirements for your chosen model/config.
-- Ensure network access for dataset/model downloads (Google Drive, Hugging Face, etc.).
-- If outputs are missing or slow, inspect per-stage logs and `*.stats.jsonl` for errors, stragglers, or contention.
diff --git a/benchmarks/__init__.py b/benchmarks/__init__.py
deleted file mode 100644
index 79917e818cb..00000000000
--- a/benchmarks/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-"""Benchmark helpers and runnable benchmark entrypoints."""
diff --git a/benchmarks/accuracy/README.md b/benchmarks/accuracy/README.md
deleted file mode 100644
index dbe20916a77..00000000000
--- a/benchmarks/accuracy/README.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# Accuracy Benchmarks
-
-This directory hosts accuracy benchmark integrations that run entirely through a
-local `vllm-omni serve` deployment.
-
-Current integrations:
-
-- `text_to_image/`: GEBench generation + local judge scoring flow.
-- `image_to_image/`: GEdit-Bench generation + local VIEScore-style scoring flow.
-
-Design notes:
-
-- Generation is executed through the OpenAI-compatible endpoints exposed by
- `vllm-omni serve`.
-- Evaluation is also executed through a local OpenAI-compatible judge model
- served by `vllm-omni`.
-- Both generation and judge requests accept either `http://host:port` or
- `http://host:port/v1`.
-- Output directory layout intentionally stays close to the upstream repos.
-
-Test guidance:
-
-- Local static/self-checks live in `tests/benchmarks/test_accuracy_bench_utils.py`.
-- End-to-end generation/evaluation should be validated in a remote GPU
- environment. In the current repo marker system there is `L4` but no `L5`
- marker, so benchmark smoke tests should be wired as `full_model +
- benchmark + L4` for nightly when GPU capacity is available.
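Review note: the design note above says both `http://host:port` and `http://host:port/v1` are accepted. A minimal standalone sketch of that normalization follows; it mirrors the `build_openai_url` helper removed from `benchmarks/accuracy/common.py` in this same diff, and the host and port in the examples are placeholders.

```python
def build_openai_url(base_url: str, api_path: str) -> str:
    """Join a server base URL and an API path, inserting /v1 only when missing."""
    base = base_url.rstrip("/")
    normalized_path = api_path if api_path.startswith("/") else f"/{api_path}"
    if base.endswith(normalized_path):
        # The caller already passed the full endpoint URL.
        return base
    if base.endswith("/v1"):
        return f"{base}{normalized_path}"
    return f"{base}/v1{normalized_path}"


# All three spellings resolve to the same endpoint.
candidates = [
    "http://localhost:8000",
    "http://localhost:8000/v1/",
    "http://localhost:8000/v1/images/generations",
]
resolved = {build_openai_url(url, "images/generations") for url in candidates}
print(resolved)
```

Because the check is a plain suffix match, a base URL with some other trailing path segment would still get `/v1` appended, so callers are expected to pass one of the forms listed above.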
diff --git a/benchmarks/accuracy/__init__.py b/benchmarks/accuracy/__init__.py
deleted file mode 100644
index 029ec71da22..00000000000
--- a/benchmarks/accuracy/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-"""Accuracy benchmark integrations for vLLM-Omni."""
diff --git a/benchmarks/accuracy/common.py b/benchmarks/accuracy/common.py
deleted file mode 100644
index bce0b235914..00000000000
--- a/benchmarks/accuracy/common.py
+++ /dev/null
@@ -1,199 +0,0 @@
-from __future__ import annotations
-
-import base64
-import io
-import json
-from pathlib import Path
-from typing import Any
-
-import requests
-from PIL import Image
-
-
-def ensure_dir(path: Path) -> Path:
-    path.mkdir(parents=True, exist_ok=True)
-    return path
-
-
-def load_json(path: Path) -> dict[str, Any]:
-    with path.open("r", encoding="utf-8") as handle:
-        return json.load(handle)
-
-
-def write_json(path: Path, payload: dict[str, Any]) -> None:
-    ensure_dir(path.parent)
-    with path.open("w", encoding="utf-8") as handle:
-        json.dump(payload, handle, indent=2, ensure_ascii=False)
-
-
-def save_image(path: Path, image: Image.Image) -> None:
-    ensure_dir(path.parent)
-    image.save(path)
-
-
-def find_first_image(folder: Path, stem: str | None = None) -> Path | None:
-    patterns = [f"{stem}.*"] if stem else ["*.png", "*.jpg", "*.jpeg", "*.webp"]
-    for pattern in patterns:
-        for candidate in sorted(folder.glob(pattern)):
-            if candidate.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
-                return candidate
-    return None
-
-
-def extract_json_object(raw_text: str) -> dict[str, Any]:
-    raw_text = raw_text.strip()
-    delimiter = "||V^=^V||"
-    if raw_text.count(delimiter) >= 2:
-        start = raw_text.find(delimiter) + len(delimiter)
-        end = raw_text.rfind(delimiter)
-        raw_text = raw_text[start:end].strip()
-
-    start = raw_text.find("{")
-    end = raw_text.rfind("}")
-    if start == -1 or end == -1 or end < start:
-        raise ValueError(f"Could not find JSON object in: {raw_text[:200]}")
-    return json.loads(raw_text[start : end + 1])
-
-
-def build_openai_url(base_url: str, api_path: str) -> str:
-    base = base_url.rstrip("/")
-    normalized_path = api_path if api_path.startswith("/") else f"/{api_path}"
-    if base.endswith(normalized_path):
-        return base
-    if base.endswith("/v1"):
-        return f"{base}{normalized_path}"
-    return f"{base}/v1{normalized_path}"
-
-
-def pil_to_base64(image: Image.Image, image_format: str = "PNG") -> str:
-    buffer = io.BytesIO()
-    image.save(buffer, format=image_format)
-    return base64.b64encode(buffer.getvalue()).decode("utf-8")
-
-
-def pil_to_data_url(image: Image.Image, image_format: str = "PNG") -> str:
-    return f"data:image/{image_format.lower()};base64,{pil_to_base64(image, image_format=image_format)}"
-
-
-def decode_base64_image(encoded: str) -> Image.Image:
-    image = Image.open(io.BytesIO(base64.b64decode(encoded)))
-    image.load()
-    return image.convert("RGB")
-
-
-def pil_to_png_bytes(image: Image.Image) -> bytes:
-    buffer = io.BytesIO()
-    image.save(buffer, format="PNG")
-    return buffer.getvalue()
-
-class VllmOmniImageClient:
-    """Thin OpenAI-compatible image client for vLLM-Omni serving."""
-
-    def __init__(self, base_url: str, api_key: str = "EMPTY", timeout: int = 600):
-        self.base_url = base_url.rstrip("/")
-        self.api_key = api_key
-        self.timeout = timeout
-
-    @property
-    def _headers(self) -> dict[str, str]:
-        return {
-            "Authorization": f"Bearer {self.api_key}",
-            "Content-Type": "application/json",
-        }
-
-    def generate_text_to_image(
-        self,
-        *,
-        model: str,
-        prompt: str,
-        width: int,
-        height: int,
-        num_inference_steps: int = 20,
-        guidance_scale: float | None = None,
-        seed: int | None = None,
-        output_compression: int | None = None,
-    ) -> Image.Image:
-        payload: dict[str, Any] = {
-            "model": model,
-            "prompt": prompt,
-            "n": 1,
-            "size": f"{width}x{height}",
-            "response_format": "b64_json",
-            "num_inference_steps": num_inference_steps,
-        }
-        if guidance_scale is not None:
-            payload["guidance_scale"] = guidance_scale
-        if seed is not None:
-            payload["seed"] = seed
-        if output_compression is not None:
-            payload["output_compression"] = output_compression
-
-        response = requests.post(
-            build_openai_url(self.base_url, "/images/generations"),
-            json=payload,
-            headers=self._headers,
-            timeout=self.timeout,
-        )
-        response.raise_for_status()
-        return decode_base64_image(response.json()["data"][0]["b64_json"])
-
- def generate_image_edit(
- self,
- *,
- model: str,
- prompt: str,
- images: Image.Image | list[Image.Image],
- width: int,
- height: int,
- num_inference_steps: int = 20,
- guidance_scale: float | None = None,
- seed: int | None = None,
- negative_prompt: str | None = None,
- output_compression: int | None = None,
- ) -> Image.Image:
- if not isinstance(images, list):
- images = [images]
- data: dict[str, Any] = {
- "model": model,
- "prompt": prompt,
- "n": 1,
- "size": f"{width}x{height}",
- "response_format": "b64_json",
- "num_inference_steps": str(num_inference_steps),
- }
- if guidance_scale is not None:
- data["guidance_scale"] = str(guidance_scale)
- if seed is not None:
- data["seed"] = str(seed)
- if negative_prompt:
- data["negative_prompt"] = negative_prompt
- if output_compression is not None:
- data["output_compression"] = str(output_compression)
-
- files = [
- (
- "image[]" if len(images) > 1 else "image",
- (f"image_{index}.png", pil_to_png_bytes(image), "image/png"),
- )
- for index, image in enumerate(images)
- ]
-
- edit_paths = ["/images/edits", "/images/edit"]
- last_response: requests.Response | None = None
- for api_path in edit_paths:
- response = requests.post(
- build_openai_url(self.base_url, api_path),
- data=data,
- files=files,
- headers={"Authorization": f"Bearer {self.api_key}"},
- timeout=self.timeout,
- )
- last_response = response
- if response.status_code != 404:
- response.raise_for_status()
- return decode_base64_image(response.json()["data"][0]["b64_json"])
-
- assert last_response is not None
- last_response.raise_for_status()
- raise ValueError("No image payload returned from image edit endpoint")
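Review note on the helpers removed above: `build_openai_url` accepts a base URL with or without a trailing `/v1` and returns the base unchanged when it already ends with the requested path. The block below is a standalone sketch that restates the function from this diff so the normalization can be checked in isolation; the example URLs are illustrative only.

```python
def build_openai_url(base_url: str, api_path: str) -> str:
    # Same logic as the deleted helper, restated so it can run on its own.
    base = base_url.rstrip("/")
    normalized_path = api_path if api_path.startswith("/") else f"/{api_path}"
    if base.endswith(normalized_path):
        return base  # caller already passed the full endpoint URL
    if base.endswith("/v1"):
        return f"{base}{normalized_path}"
    return f"{base}/v1{normalized_path}"


# Illustrative inputs: bare host, host with "/v1/", and a full endpoint URL.
bare = build_openai_url("http://127.0.0.1:8000", "/images/edits")
with_v1 = build_openai_url("http://127.0.0.1:8000/v1/", "images/edits")
full = build_openai_url("http://127.0.0.1:8000/v1/images/edits", "/images/edits")
print(bare, with_v1, full, sep="\n")
```

All three calls resolve to the same endpoint, which is why the README examples can pass `--base-url http://127.0.0.1:8000` without the `/v1` suffix.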
diff --git a/benchmarks/accuracy/image_to_image/README.md b/benchmarks/accuracy/image_to_image/README.md
deleted file mode 100644
index 86e7b0cf328..00000000000
--- a/benchmarks/accuracy/image_to_image/README.md
+++ /dev/null
@@ -1,103 +0,0 @@
-# GEdit-Bench on vLLM-Omni
-
-This integration adapts the upstream `stepfun-ai/Step1X-Edit/GEdit-Bench`
-evaluation flow into `vllm-omni/benchmarks/accuracy/image_to_image`.
-
-Upstream mapping:
-
-- `run_gedit_score.py` -> `run_gedit_bench.py evaluate`
-- `calculate_statistics.py` -> `run_gedit_bench.py summarize`
-- upstream output layout under `results//fullset/...` is preserved
-
-What changed:
-
-- The upstream repo mainly ships evaluation scripts. This integration adds a
-  generation runner that uses the local `vllm-omni` OpenAI-compatible
-  `/v1/images/edits` endpoint to produce benchmark outputs in the expected
-  directory structure.
-- The evaluator keeps the same VIEScore-style decomposition:
-  - `semantics_score`
-  - `quality_score`
-  - `overall_score = sqrt(semantics_score * quality_score)`
-- Judge calls are routed to a local OpenAI-compatible model served by
-  `vllm-omni`, not a remote provider.
-
-Dataset:
-
-- Default `--dataset-ref` is `stepfun-ai/GEdit-Bench`
-- You can also pass a local dataset directory previously saved with
- Hugging Face `datasets`
-
-Example usage:
-
-```bash
-python benchmarks/accuracy/image_to_image/run_gedit_bench.py generate \
-  --output-root benchmarks/accuracy/image_to_image/results \
-  --base-url http://127.0.0.1:8000 \
-  --model Qwen/Qwen-Image-Edit \
-  --model-name qwen_image_edit \
-  --dataset-ref stepfun-ai/GEdit-Bench \
-  --task-type all \
-  --instruction-language en
-```
-
-```bash
-python benchmarks/accuracy/image_to_image/run_gedit_bench.py evaluate \
-  --output-root benchmarks/accuracy/image_to_image/results \
-  --model-name qwen_image_edit \
-  --save-dir benchmarks/accuracy/image_to_image/scores \
-  --dataset-ref stepfun-ai/GEdit-Bench \
-  --judge-base-url http://127.0.0.1:8000 \
-  --judge-model Qwen/Qwen2.5-VL-7B-Instruct \
-  --judge-api-key EMPTY
-```
-
-```bash
-python benchmarks/accuracy/image_to_image/run_gedit_bench.py summarize \
-  --csv-path benchmarks/accuracy/image_to_image/scores/qwen_image_edit_all_all_vie_score.csv \
-  --language all
-```
-
-Example summary output:
-
-```json
-{
-  "language": "all",
-  "languages": {
-    "en": {
-      "overall": {"count": 110, "Q_SC": 6.58, "Q_PQ": 5.89, "Q_O": 5.86},
-      "intersection": {"count": 78, "Q_SC": 6.50, "Q_PQ": 5.66, "Q_O": 5.65}
-    },
-    "cn": {
-      "overall": {"count": 110, "Q_SC": 6.90, "Q_PQ": 5.78, "Q_O": 6.11},
-      "intersection": {"count": 63, "Q_SC": 7.22, "Q_PQ": 5.59, "Q_O": 6.28}
-    }
-  }
-}
-```
-
-Example generated images to inspect:
-
-- `benchmarks/accuracy/image_to_image/results/qwen_image_edit/fullset/background_change/en/.png`
-- `benchmarks/accuracy/image_to_image/results/qwen_image_edit/fullset/text_change/en/.png`
-- `benchmarks/accuracy/image_to_image/results/qwen_image_edit/fullset/subject-replace/cn/.png`
-
-Example score artifacts to inspect together with the images:
-
-- `benchmarks/accuracy/image_to_image/scores/qwen_image_edit_all_all_vie_score.csv`
-- `benchmarks/accuracy/image_to_image/scores/qwen_image_edit_all_all_summary.json`
-
-What to expect:
-
-- `Q_SC` measures instruction following and content preservation.
-- `Q_PQ` measures image naturalness and artifact quality.
-- `Q_O` is the combined overall score; higher is better.
-- `overall.count` is the number of evaluated samples for that language, while `intersection.count` is the subset with `intersection_exist == True`.
-
-Notes:
-
-- This flow requires the optional Hugging Face `datasets` package.
-- `generate` writes `generation_manifest.json` with local output coverage.
-- The current repo marker set exposes `L4` but not `L5`. If you promote an
-  end-to-end smoke test into CI, use the `full_model`, `benchmark`, and `L4`
-  markers for nightly runs (or `advanced_model` for merge runs), or explicitly introduce a new repo-wide marker first.
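Review note on the README removed above: the scoring decomposition it describes can be sketched as follows. This mirrors what `LocalVIEScorer.evaluate` in `gedit_bench.py` (later in this diff) does: each judge call returns a list of 0 to 10 sub-scores, the minimum of each list becomes the dimension score, and the overall score is their geometric mean. The function name `vie_scores` and the sample numbers are hypothetical.

```python
import math


def vie_scores(sc_scores: list[int], pq_scores: list[int]) -> dict[str, float]:
    # Take the weakest sub-score per dimension; an empty list counts as 0.
    semantics = float(min(sc_scores)) if sc_scores else 0.0
    quality = float(min(pq_scores)) if pq_scores else 0.0
    return {
        "semantics_score": semantics,
        "quality_score": quality,
        # Geometric mean: a zero in either dimension zeroes the overall score.
        "overall_score": math.sqrt(semantics * quality),
    }


# Made-up judge outputs: [edit_success, content_preservation] and
# [naturalness, artifact_free].
example = vie_scores([8, 6], [9, 4])
empty_case = vie_scores([], [5])
print(example, empty_case)
```

Because of the `min` and the square root, a sample that follows the instruction well but shows heavy artifacts still scores low overall.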
diff --git a/benchmarks/accuracy/image_to_image/__init__.py b/benchmarks/accuracy/image_to_image/__init__.py
deleted file mode 100644
index 1b9b2d44df6..00000000000
--- a/benchmarks/accuracy/image_to_image/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-"""GEdit-Bench integration for vLLM-Omni."""
diff --git a/benchmarks/accuracy/image_to_image/gedit_bench.py b/benchmarks/accuracy/image_to_image/gedit_bench.py
deleted file mode 100644
index adb072c0a2e..00000000000
--- a/benchmarks/accuracy/image_to_image/gedit_bench.py
+++ /dev/null
@@ -1,718 +0,0 @@
-from __future__ import annotations
-
-import argparse
-import csv
-import json
-import logging
-import math
-import statistics
-from collections import defaultdict
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from datetime import datetime, timezone
-from io import BytesIO
-from pathlib import Path
-from typing import Any
-
-import requests
-from PIL import Image
-from tqdm.auto import tqdm
-
-from benchmarks.accuracy.common import (
- VllmOmniImageClient,
- build_openai_url,
- extract_json_object,
- write_json,
-)
-
-GROUPS = [
- "background_change",
- "color_alter",
- "material_alter",
- "motion_change",
- "ps_human",
- "style_change",
- "subject-add",
- "subject-remove",
- "subject-replace",
- "text_change",
- "tone_transfer",
-]
-DEFAULT_SAMPLES_PER_GROUP = 10
-logger = logging.getLogger(__name__)
-
-
-def infer_model_name(model: str) -> str:
- normalized = model.rstrip("/\\")
- name = Path(normalized).name
- return name or normalized
-
-
-def resolve_model_name(*, model_name: str | None, model: str | None = None, output_root: Path | None = None) -> str:
- if model_name:
- return model_name
- if model:
- return infer_model_name(model)
- if output_root is not None:
- candidates = sorted(path.name for path in output_root.iterdir() if path.is_dir())
- if len(candidates) == 1:
- return candidates[0]
- if not candidates:
- raise ValueError(f"Could not infer model-name from empty output root: {output_root}")
- raise ValueError(
- f"Could not infer model-name from output root {output_root}; multiple candidates found: {candidates}"
- )
- raise ValueError("model-name is required when it cannot be inferred from model or output-root")
-
-
-def parse_score_payload(raw_text: str) -> dict[str, Any]:
-    try:
-        parsed = extract_json_object(raw_text)
-    except Exception:
-        stripped = raw_text.strip()
-        if stripped.startswith("[") and stripped.endswith("]"):
-            parsed = {"score": json.loads(stripped), "reasoning": ""}
-        else:
-            raise
-    score = parsed.get("score", [])
-    if not isinstance(score, list):
-        score = [score]
-    parsed["score"] = [int(value) for value in score]
-    return parsed
-
-
-def _utc_timestamp() -> str:
- return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
-
-
-def summarize_generated_records(records: list[dict[str, Any]]) -> dict[str, Any]:
- by_task: dict[str, list[dict[str, Any]]] = defaultdict(list)
- by_language: dict[str, list[dict[str, Any]]] = defaultdict(list)
- for record in records:
- by_task[record["task_type"]].append(record)
- by_language[record["instruction_language"]].append(record)
-
- return {
- "count": len(records),
- "by_task": {
- task_type: {
- "count": len(rows),
- "samples": sorted(row["key"] for row in rows),
- }
- for task_type, rows in sorted(by_task.items())
- },
- "by_language": {
- language: {
- "count": len(rows),
- "samples": sorted(row["key"] for row in rows),
- }
- for language, rows in sorted(by_language.items())
- },
- }
-
-
-def select_balanced_gedit_rows(
- rows: list[dict[str, Any]],
- *,
- task_type: str = "all",
- instruction_language: str = "all",
- samples_per_group: int | None,
-) -> list[dict[str, Any]]:
- filtered_rows = []
- for row in rows:
- if task_type != "all" and row["task_type"] != task_type:
- continue
- if instruction_language != "all" and row["instruction_language"] != instruction_language:
- continue
- filtered_rows.append(row)
-
- if samples_per_group is None:
- return filtered_rows
-
- def _select_rows_for_group(group_rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
- if instruction_language != "all":
- return group_rows[:samples_per_group]
-
- rows_by_language: dict[str, list[dict[str, Any]]] = defaultdict(list)
- for row in group_rows:
- rows_by_language[str(row.get("instruction_language", "")).strip()].append(row)
-
- ordered_languages = ["en", "cn"]
- per_language_quota = samples_per_group // max(len(ordered_languages), 1)
- remainder = samples_per_group % max(len(ordered_languages), 1)
-
- selected_rows: list[dict[str, Any]] = []
- leftovers: list[dict[str, Any]] = []
- for index, language in enumerate(ordered_languages):
- language_rows = rows_by_language.get(language, [])
- quota = per_language_quota + (1 if index < remainder else 0)
- selected_rows.extend(language_rows[:quota])
- leftovers.extend(language_rows[quota:])
-
- if len(selected_rows) < samples_per_group:
- selected_rows.extend(leftovers[: samples_per_group - len(selected_rows)])
-
- return selected_rows
-
- if task_type != "all":
- return _select_rows_for_group(filtered_rows)
-
- grouped_rows: dict[str, list[dict[str, Any]]] = defaultdict(list)
- for row in filtered_rows:
- grouped_rows[row["task_type"]].append(row)
-
- selected: list[dict[str, Any]] = []
- for group in GROUPS:
- selected.extend(_select_rows_for_group(grouped_rows.get(group, [])))
- return selected
-
-
-def summarize_gedit_rows(rows: list[dict[str, Any]], language: str = "all") -> dict[str, Any]:
- return summarize_gedit_rows_with_backbone(rows, language=language)
-
-
-def _mean_or_none(values: list[float]) -> float | None:
- return statistics.fmean(values) if values else None
-
-
-def _to_q_metrics(section: dict[str, float | None]) -> dict[str, float | None]:
- return {
- "count": section.get("count"),
- "Q_SC": section.get("avg_semantics"),
- "Q_PQ": section.get("avg_quality"),
- "Q_O": section.get("avg_overall"),
- }
-
-
-def _summarize_gedit_rows_single_language(
- rows: list[dict[str, Any]],
- *,
- language: str,
-) -> dict[str, Any]:
- filtered_rows = [row for row in rows if str(row.get("instruction_language", "")).strip() == language]
-
- def _to_bool(value: Any) -> bool:
- if isinstance(value, bool):
- return value
- return str(value).strip().lower() in {"1", "true", "yes"}
-
- per_group: dict[str, dict[str, float | None]] = {}
- intersection_group: dict[str, dict[str, float | None]] = {}
-
- for group in GROUPS:
- group_rows = [row for row in filtered_rows if row.get("task_type") == group]
- semantics = [float(row["semantics_score"]) for row in group_rows]
- quality = [float(row["quality_score"]) for row in group_rows]
- overall = [math.sqrt(float(row["semantics_score"]) * float(row["quality_score"])) for row in group_rows]
-
- per_group[group] = {
- "count": len(group_rows),
- "avg_semantics": _mean_or_none(semantics),
- "avg_quality": _mean_or_none(quality),
- "avg_overall": _mean_or_none(overall),
- }
-
- intersection_rows = [row for row in group_rows if _to_bool(row.get("intersection_exist", False))]
- intersection_semantics = [float(row["semantics_score"]) for row in intersection_rows]
- intersection_quality = [float(row["quality_score"]) for row in intersection_rows]
- intersection_overall = [
- math.sqrt(float(row["semantics_score"]) * float(row["quality_score"])) for row in intersection_rows
- ]
- intersection_group[group] = {
- "count": len(intersection_rows),
- "avg_semantics": _mean_or_none(intersection_semantics),
- "avg_quality": _mean_or_none(intersection_quality),
- "avg_overall": _mean_or_none(intersection_overall),
- }
-
- overall_section = {
- "count": len(filtered_rows),
- "avg_semantics": _mean_or_none(
- [score["avg_semantics"] for score in per_group.values() if score["avg_semantics"] is not None]
- ),
- "avg_quality": _mean_or_none(
- [score["avg_quality"] for score in per_group.values() if score["avg_quality"] is not None]
- ),
- "avg_overall": _mean_or_none(
- [score["avg_overall"] for score in per_group.values() if score["avg_overall"] is not None]
- ),
- }
- intersection_section = {
- "count": sum(score["count"] for score in intersection_group.values()),
- "avg_semantics": _mean_or_none(
- [score["avg_semantics"] for score in intersection_group.values() if score["avg_semantics"] is not None]
- ),
- "avg_quality": _mean_or_none(
- [score["avg_quality"] for score in intersection_group.values() if score["avg_quality"] is not None]
- ),
- "avg_overall": _mean_or_none(
- [score["avg_overall"] for score in intersection_group.values() if score["avg_overall"] is not None]
- ),
- }
-
- return {
- "language": language,
- "by_group": {group: _to_q_metrics(section) for group, section in per_group.items()},
- "overall": _to_q_metrics(overall_section),
- "intersection": _to_q_metrics(intersection_section),
- }
-
-
-def summarize_gedit_rows_with_backbone(
- rows: list[dict[str, Any]],
- *,
- language: str = "all",
-) -> dict[str, Any]:
- if language == "all":
- return {
- "language": "all",
- "languages": {
- single_language: _summarize_gedit_rows_single_language(
- rows,
- language=single_language,
- )
- for single_language in ["en", "cn"]
- },
- }
-
- return _summarize_gedit_rows_single_language(rows, language=language)
-
-
-def _require_datasets():
- try:
- from datasets import load_dataset, load_from_disk
- except ImportError as exc:
- raise ImportError("GEdit-Bench requires the optional `datasets` package.") from exc
- return load_dataset, load_from_disk
-
-
-def _load_gedit_dataset(dataset_ref: str):
- load_dataset, load_from_disk = _require_datasets()
- dataset_path = Path(dataset_ref)
- if dataset_path.exists():
- if (dataset_path / "state.json").exists() and (dataset_path / "dataset_info.json").exists():
- return load_from_disk(str(dataset_path))
- return load_dataset(str(dataset_path))
- return load_dataset(dataset_ref)
-
-
-def _resolve_gedit_split(dataset_obj: Any) -> Any:
- if isinstance(dataset_obj, dict):
- if "train" in dataset_obj:
- return dataset_obj["train"]
- return dataset_obj
- try:
- return dataset_obj["train"]
- except Exception:
- return dataset_obj
-
-
-def _to_pil_image(value: Any) -> Image.Image:
- if isinstance(value, Image.Image):
- return value.convert("RGB")
- if isinstance(value, dict) and "bytes" in value:
- image = Image.open(BytesIO(value["bytes"]))
- image.load()
- return image.convert("RGB")
- raise TypeError(f"Unsupported image payload type: {type(value)!r}")
-
-
-class LocalVIEScorer:
- def __init__(self, *, base_url: str, api_key: str, model: str, timeout: int = 600):
- self.base_url = base_url.rstrip("/")
- self.api_key = api_key
- self.model = model
- self.timeout = timeout
- self.sc_prompt = (
- "You are evaluating image editing quality.\n"
- "Two images are provided: the source image and the edited image.\n"
- 'Return JSON only in the format {"score": [edit_success, content_preservation], '
- '"reasoning": "..."}.\n'
- "Each score must be an integer from 0 to 10.\n"
- "Editing instruction: "
- )
- self.pq_prompt = (
- "You are evaluating image quality.\n"
- 'Return JSON only in the format {"score": [naturalness, artifact_free], "reasoning": "..."}.\n'
- "Each score must be an integer from 0 to 10."
- )
-
- def _request(self, prompt: str, images: list[Image.Image]) -> dict[str, Any]:
- from benchmarks.accuracy.common import pil_to_data_url
-
- content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
- for image in images:
- content.append({"type": "image_url", "image_url": {"url": pil_to_data_url(image)}})
-
- response = requests.post(
- build_openai_url(self.base_url, "/chat/completions"),
- json={
- "model": self.model,
- "messages": [{"role": "user", "content": content}],
- "temperature": 0,
- },
- headers={
- "Authorization": f"Bearer {self.api_key}",
- "Content-Type": "application/json",
- },
- timeout=self.timeout,
- )
- response.raise_for_status()
- message_content = response.json()["choices"][0]["message"]["content"]
- if isinstance(message_content, list):
- text = "\n".join(part.get("text", "") for part in message_content if part.get("type") == "text")
- else:
- text = str(message_content)
- return parse_score_payload(text)
-
-    def evaluate(self, source_image: Image.Image, edited_image: Image.Image, instruction: str) -> dict[str, float]:
-        sc_payload = self._request(f"{self.sc_prompt}{instruction}", [source_image, edited_image])
-        pq_payload = self._request(self.pq_prompt, [edited_image])
-        semantics = float(min(sc_payload["score"])) if sc_payload["score"] else 0.0
-        quality = float(min(pq_payload["score"])) if pq_payload["score"] else 0.0
-        overall = math.sqrt(semantics * quality)
-        return {
-            "semantics_score": semantics,
-            "quality_score": quality,
-            "overall_score": overall,
-        }
-
-
-class GEditBenchRunner:
- def __init__(
- self,
- *,
- dataset_ref: str,
- output_root: Path,
- base_url: str,
- model: str,
- api_key: str = "EMPTY",
- width: int = 512,
- height: int = 512,
- num_inference_steps: int = 20,
- guidance_scale: float | None = None,
- seed: int | None = 42,
- ):
- self.dataset_ref = dataset_ref
- self.output_root = output_root
- self.model = model
- self.width = width
- self.height = height
- self.num_inference_steps = num_inference_steps
- self.guidance_scale = guidance_scale
- self.seed = seed
- self.client = VllmOmniImageClient(base_url=base_url, api_key=api_key)
-
- def generate(
- self,
- *,
- model_name: str,
- task_type: str = "all",
- instruction_language: str = "all",
- workers: int = 1,
- max_samples: int | None = None,
- samples_per_group: int | None = None,
- ) -> list[dict[str, Any]]:
- dataset = _resolve_gedit_split(_load_gedit_dataset(self.dataset_ref))
- rows = select_balanced_gedit_rows(
- list(dataset),
- task_type=task_type,
- instruction_language=instruction_language,
- samples_per_group=samples_per_group,
- )
- if max_samples is not None:
- rows = rows[:max_samples]
-
- outputs: list[dict[str, Any]] = []
- total = len(rows)
- if workers <= 1:
- with tqdm(total=total, desc="GEdit generate", unit="sample") as progress:
- for item in rows:
- result = self._safe_generate_one(model_name, item)
- if result:
- outputs.append(result)
- progress.update(1)
- return outputs
-
- with tqdm(total=total, desc="GEdit generate", unit="sample") as progress:
- with ThreadPoolExecutor(max_workers=workers) as executor:
- futures = [executor.submit(self._safe_generate_one, model_name, item) for item in rows]
- for future in as_completed(futures):
- result = future.result()
- if result:
- outputs.append(result)
- progress.update(1)
- return outputs
-
- def _safe_generate_one(self, model_name: str, item: dict[str, Any]) -> dict[str, Any] | None:
- try:
- return self._generate_one(model_name, item)
- except Exception:
- logger.exception("Failed to generate GEdit-Bench sample %s", item.get("key", ""))
- return None
-
- def _generate_one(self, model_name: str, item: dict[str, Any]) -> dict[str, Any] | None:
- output_path = (
- self.output_root
- / model_name
- / "fullset"
- / item["task_type"]
- / item["instruction_language"]
- / f"{item['key']}.png"
- )
- output_path.parent.mkdir(parents=True, exist_ok=True)
- if output_path.exists():
- return {
- "task_type": item["task_type"],
- "instruction_language": item["instruction_language"],
- "key": item["key"],
- "output_path": str(output_path),
- }
-
- source_image = _to_pil_image(item["input_image_raw"])
- edited_image = self.client.generate_image_edit(
- model=self.model,
- prompt=item["instruction"],
- images=source_image,
- width=self.width,
- height=self.height,
- num_inference_steps=self.num_inference_steps,
- guidance_scale=self.guidance_scale,
- seed=self.seed,
- )
- edited_image.save(output_path)
- return {
- "task_type": item["task_type"],
- "instruction_language": item["instruction_language"],
- "key": item["key"],
- "output_path": str(output_path),
- }
-
-
-class GEditBenchEvaluator:
- def __init__(self, *, dataset_ref: str, output_root: Path, scorer: LocalVIEScorer):
- self.dataset_ref = dataset_ref
- self.output_root = output_root
- self.scorer = scorer
-
- def evaluate(
- self,
- *,
- model_name: str,
- save_dir: Path,
- task_type: str = "all",
- instruction_language: str = "all",
- workers: int = 1,
- max_samples: int | None = None,
- samples_per_group: int | None = None,
- ) -> dict[str, Any]:
- dataset = _resolve_gedit_split(_load_gedit_dataset(self.dataset_ref))
- rows = select_balanced_gedit_rows(
- list(dataset),
- task_type=task_type,
- instruction_language=instruction_language,
- samples_per_group=samples_per_group,
- )
- if max_samples is not None:
- rows = rows[:max_samples]
-
- results: list[dict[str, Any]] = []
- total = len(rows)
- if workers <= 1:
- with tqdm(total=total, desc="GEdit evaluate", unit="sample") as progress:
- for item in rows:
- result = self._safe_evaluate_one(model_name, item)
- if result:
- results.append(result)
- progress.update(1)
- else:
- with tqdm(total=total, desc="GEdit evaluate", unit="sample") as progress:
- with ThreadPoolExecutor(max_workers=workers) as executor:
- futures = [executor.submit(self._safe_evaluate_one, model_name, item) for item in rows]
- for future in as_completed(futures):
- result = future.result()
- if result:
- results.append(result)
- progress.update(1)
-
- save_dir.mkdir(parents=True, exist_ok=True)
- base_name = f"{model_name}_{task_type}_{instruction_language}"
- timestamp = _utc_timestamp()
- csv_path = save_dir / f"{base_name}_vie_score.csv"
- timestamped_csv_path = save_dir / f"{base_name}_vie_score_{timestamp}.csv"
-
- def _write_csv(path: Path) -> None:
- with path.open("w", encoding="utf-8", newline="") as handle:
- writer = csv.DictWriter(
- handle,
- fieldnames=[
- "key",
- "task_type",
- "edited_image",
- "instruction",
- "semantics_score",
- "quality_score",
- "overall_score",
- "intersection_exist",
- "instruction_language",
- ],
- )
- writer.writeheader()
- for row in results:
- writer.writerow(row)
-
- _write_csv(csv_path)
- _write_csv(timestamped_csv_path)
-
- summary = summarize_gedit_rows_with_backbone(
- results,
- language=instruction_language,
- )
- summary_path = save_dir / f"{base_name}_summary.json"
- timestamped_summary_path = save_dir / f"{base_name}_summary_{timestamp}.json"
- write_json(summary_path, summary)
- write_json(timestamped_summary_path, summary)
- return {
- "results": results,
- "summary": summary,
- "csv_path": str(csv_path),
- "summary_path": str(summary_path),
- "timestamped_csv_path": str(timestamped_csv_path),
- "timestamped_summary_path": str(timestamped_summary_path),
- }
-
- def _safe_evaluate_one(self, model_name: str, item: dict[str, Any]) -> dict[str, Any] | None:
- try:
- return self._evaluate_one(model_name, item)
- except Exception:
- logger.exception("Failed to evaluate GEdit-Bench sample %s", item.get("key", ""))
- return None
-
- def _evaluate_one(self, model_name: str, item: dict[str, Any]) -> dict[str, Any] | None:
- edited_image_path = (
- self.output_root
- / model_name
- / "fullset"
- / item["task_type"]
- / item["instruction_language"]
- / f"{item['key']}.png"
- )
- if not edited_image_path.exists():
- return None
-
- source_image = _to_pil_image(item["input_image_raw"])
- edited_image = Image.open(edited_image_path).convert("RGB")
- scores = self.scorer.evaluate(source_image, edited_image, item["instruction"])
- return {
- "key": item["key"],
- "task_type": item["task_type"],
- "edited_image": str(edited_image_path),
- "instruction": item["instruction"],
- "semantics_score": scores["semantics_score"],
- "quality_score": scores["quality_score"],
- "overall_score": scores["overall_score"],
- "intersection_exist": item.get("Intersection_exist", False),
- "instruction_language": item["instruction_language"],
- }
-
-
-def build_parser() -> argparse.ArgumentParser:
- parser = argparse.ArgumentParser(description="Run the GEdit-Bench integration against a local vLLM-Omni server.")
- subparsers = parser.add_subparsers(dest="command", required=True)
-
- generate = subparsers.add_parser("generate")
- generate.add_argument("--dataset-ref", type=str, default="stepfun-ai/GEdit-Bench")
- generate.add_argument("--output-root", type=Path, required=True)
- generate.add_argument("--base-url", type=str, required=True)
- generate.add_argument("--model", type=str, required=True)
- generate.add_argument("--model-name", type=str, default=None)
- generate.add_argument("--api-key", type=str, default="EMPTY")
- generate.add_argument("--task-type", choices=["all", *GROUPS], default="all")
- generate.add_argument("--instruction-language", choices=["all", "en", "cn"], default="all")
- generate.add_argument("--width", type=int, default=512)
- generate.add_argument("--height", type=int, default=512)
- generate.add_argument("--num-inference-steps", type=int, default=20)
- generate.add_argument("--guidance-scale", type=float, default=None)
- generate.add_argument("--seed", type=int, default=42)
- generate.add_argument("--workers", type=int, default=1)
- generate.add_argument("--max-samples", type=int, default=None)
- generate.add_argument("--samples-per-group", type=int, default=None)
-
- evaluate = subparsers.add_parser("evaluate")
- evaluate.add_argument("--dataset-ref", type=str, default="stepfun-ai/GEdit-Bench")
- evaluate.add_argument("--output-root", type=Path, required=True)
- evaluate.add_argument("--model-name", type=str, default=None)
- evaluate.add_argument("--save-dir", type=Path, required=True)
- evaluate.add_argument("--task-type", choices=["all", *GROUPS], default="all")
- evaluate.add_argument("--instruction-language", choices=["all", "en", "cn"], default="all")
- evaluate.add_argument("--judge-base-url", type=str, required=True)
- evaluate.add_argument("--judge-model", type=str, required=True)
- evaluate.add_argument("--judge-api-key", type=str, default="EMPTY")
- evaluate.add_argument("--workers", type=int, default=1)
- evaluate.add_argument("--max-samples", type=int, default=None)
- evaluate.add_argument("--samples-per-group", type=int, default=None)
-
- summarize = subparsers.add_parser("summarize")
- summarize.add_argument("--csv-path", type=Path, required=True)
- summarize.add_argument("--language", choices=["all", "en", "cn"], default="all")
-
- return parser
-
-
-def main(argv: list[str] | None = None) -> int:
- parser = build_parser()
- args = parser.parse_args(argv)
-
- if args.command == "generate":
- model_name = resolve_model_name(model_name=args.model_name, model=args.model)
- runner = GEditBenchRunner(
- dataset_ref=args.dataset_ref,
- output_root=args.output_root,
- base_url=args.base_url,
- model=args.model,
- api_key=args.api_key,
- width=args.width,
- height=args.height,
- num_inference_steps=args.num_inference_steps,
- guidance_scale=args.guidance_scale,
- seed=args.seed,
- )
- records = runner.generate(
- model_name=model_name,
- task_type=args.task_type,
- instruction_language=args.instruction_language,
- workers=args.workers,
- max_samples=args.max_samples,
- samples_per_group=args.samples_per_group,
- )
- payload = {"records": records, "summary": summarize_generated_records(records)}
- write_json(args.output_root / model_name / "generation_manifest.json", payload)
- return 0
-
- if args.command == "evaluate":
- model_name = resolve_model_name(model_name=args.model_name, output_root=args.output_root)
- scorer = LocalVIEScorer(
- base_url=args.judge_base_url,
- api_key=args.judge_api_key,
- model=args.judge_model,
- )
- evaluator = GEditBenchEvaluator(dataset_ref=args.dataset_ref, output_root=args.output_root, scorer=scorer)
- evaluator.evaluate(
- model_name=model_name,
- save_dir=args.save_dir,
- task_type=args.task_type,
- instruction_language=args.instruction_language,
- workers=args.workers,
- max_samples=args.max_samples,
- samples_per_group=args.samples_per_group,
- )
- return 0
-
- if args.command == "summarize":
- with args.csv_path.open("r", encoding="utf-8", newline="") as handle:
- rows = list(csv.DictReader(handle))
- summary = summarize_gedit_rows_with_backbone(rows, language=args.language)
- print(json.dumps(summary, indent=2, ensure_ascii=False))
- return 0
-
- parser.error(f"Unknown command: {args.command}")
- return 1
diff --git a/benchmarks/accuracy/image_to_image/run_gedit_bench.py b/benchmarks/accuracy/image_to_image/run_gedit_bench.py
deleted file mode 100644
index 4ce10c18465..00000000000
--- a/benchmarks/accuracy/image_to_image/run_gedit_bench.py
+++ /dev/null
@@ -1,13 +0,0 @@
-# ruff: noqa: E402, I001
-import sys
-from pathlib import Path
-
-REPO_ROOT = Path(__file__).resolve().parents[3]
-if str(REPO_ROOT) not in sys.path:
-    sys.path.insert(0, str(REPO_ROOT))
-
-from benchmarks.accuracy.image_to_image.gedit_bench import main
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
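Review note on the launcher removed above: `parents[3]` walks up from the script to the repository root so that `benchmarks.accuracy...` imports resolve when the script is run by path. A minimal sketch of that index arithmetic, assuming a hypothetical checkout at `/repo` and using `PurePosixPath` so it runs without touching the filesystem:

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the relative layout comes from the diff.
script = PurePosixPath("/repo/benchmarks/accuracy/image_to_image/run_gedit_bench.py")

# parents[0] is the containing directory; each further index goes one level up.
levels = [str(script.parents[i]) for i in range(4)]
repo_root = script.parents[3]
print(levels)
print(repo_root)
```

If the script were moved to a different depth, the index would need to change with it, which is one reason to prefer running the module with `python -m` from the repository root where that is an option.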
diff --git a/benchmarks/accuracy/text_to_image/README.md b/benchmarks/accuracy/text_to_image/README.md
deleted file mode 100644
index 79641e8fc43..00000000000
--- a/benchmarks/accuracy/text_to_image/README.md
+++ /dev/null
@@ -1,103 +0,0 @@
-# GEBench on vLLM-Omni
-
-This integration adapts the upstream `stepfun-ai/GEBench` scripts into
-`vllm-omni/benchmarks/accuracy/text_to_image`.
-
-Upstream mapping:
-
-- `scripts/generate.py` -> `run_gebench.py generate`
-- `scripts/evaluate.py` -> `run_gebench.py evaluate`
-- upstream prompt / judge logic -> `gbench.py`
-
-What changed:
-
-- Generation calls the local OpenAI-compatible `vllm-omni` endpoints:
-  - `/v1/images/generations` for text-only frame generation
-  - `/v1/images/edits` for image-conditioned GUI transition generation
-- Evaluation still keeps the GEBench scoring dimensions:
-  - `goal`
-  - `logic`
-  - `cons`
-  - `ui`
-  - `qual`
-- Judge calls are also routed to a local OpenAI-compatible model served by
-  `vllm-omni` instead of a remote service.
-
-Expected dataset layout:
-
-- Clone the benchmark dataset from Hugging Face into a local directory:
-
-```bash
-git clone https://huggingface.co/datasets/stepfun-ai/GEBench /path/to/GEBench
-```
-
-Example usage:
-
-```bash
-python benchmarks/accuracy/text_to_image/run_gebench.py generate \
-  --dataset-root /path/to/GEBench \
-  --output-root benchmarks/accuracy/text_to_image/outputs \
-  --base-url http://127.0.0.1:8000 \
-  --model Tongyi-MAI/Z-Image-Turbo \
-  --data-type type3
-```
-
-```bash
-python benchmarks/accuracy/text_to_image/run_gebench.py evaluate \
-  --dataset-root /path/to/GEBench \
-  --output-root benchmarks/accuracy/text_to_image/outputs \
-  --data-type type3 \
-  --judge-base-url http://127.0.0.1:8000 \
-  --judge-model Qwen/Qwen2.5-VL-7B-Instruct \
-  --judge-api-key EMPTY
-```
-
-```bash
-python benchmarks/accuracy/text_to_image/run_gebench.py summarize \
-  --output-root benchmarks/accuracy/text_to_image/outputs
-```
-
-Example summary output:
-
-```json
-{
-  "generation": {
-    "count": 20,
-    "by_type": {
-      "type3": {"count": 10},
-      "type4": {"count": 10}
-    }
-  },
-  "evaluation": {
-    "count": 20,
-    "overall_mean": 0.52,
-    "by_type": {
-      "type3": {"count": 10, "overall_mean": 0.50, "overall_mean_100": 50.0},
-      "type4": {"count": 10, "overall_mean": 0.54, "overall_mean_100": 54.0}
-    }
-  }
-}
-```
-
-Example generated images to inspect:
-
-- `benchmarks/accuracy/text_to_image/outputs/03_trajectory_text_fictionalapp///frame0.png`
-- `benchmarks/accuracy/text_to_image/outputs/03_trajectory_text_fictionalapp///frame5.png`
-- `benchmarks/accuracy/text_to_image/outputs/04_trajectory_text_realapp///frame0.png`
-- `benchmarks/accuracy/text_to_image/outputs/04_trajectory_text_realapp///frame5.png`
-
-What to expect:
-
-- `overall_mean` is normalized to `0.0 ~ 1.0`; higher is better.
-- `frame0.png` is the initial GUI frame; `frame5.png` is the final trajectory frame most often used for quick inspection.
-- For full debugging, inspect the whole `frame0.png` ... `frame5.png` sequence for one sample directory.
-
-Notes:
-
-- GEBench upstream leaves type3/type4 generation unfinished. This integration
-  fills that gap with a trajectory runner that generates `frame0.png` followed
-  by `frame1.png` ... `frame5.png`.
-- Type1/2/5 require an image-edit capable model exposed through
-  `vllm-omni serve`.
-- `summarize` will report both generated coverage and any existing evaluation
-  summary files.
diff --git a/benchmarks/accuracy/text_to_image/__init__.py b/benchmarks/accuracy/text_to_image/__init__.py
deleted file mode 100644
index e1f7b2bb673..00000000000
--- a/benchmarks/accuracy/text_to_image/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-"""GEBench integration for vLLM-Omni."""
diff --git a/benchmarks/accuracy/text_to_image/gbench.py b/benchmarks/accuracy/text_to_image/gbench.py
deleted file mode 100644
index 2ea02130d6b..00000000000
--- a/benchmarks/accuracy/text_to_image/gbench.py
+++ /dev/null
@@ -1,927 +0,0 @@
-from __future__ import annotations
-
-import argparse
-import json
-import statistics
-from collections import defaultdict
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from dataclasses import dataclass
-from datetime import datetime, timezone
-from pathlib import Path
-from typing import Any
-
-import requests
-from PIL import Image
-
-from benchmarks.accuracy.common import (
-    VllmOmniImageClient,
-    build_openai_url,
-    ensure_dir,
-    extract_json_object,
-    find_first_image,
-    load_json,
-    pil_to_data_url,
-    save_image,
-    write_json,
-)
-
-TYPE_TO_FOLDER = {
-    "type1": "01_single_step",
-    "type2": "02_multi_step",
-    "type3": "03_trajectory_text_fictionalapp",
-    "type4": "04_trajectory_text_realapp",
-    "type5": "05_grounding_data",
-}
-SCORE_KEYS = ("goal", "logic", "cons", "ui", "qual")
-DEFAULT_SAMPLES_PER_TYPE = 10
-
-
-def _utc_timestamp() -> str:
-    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
-
-
-def _write_json_with_timestamp(path: Path, payload: dict[str, Any]) -> Path:
-    write_json(path, payload)
-    timestamped_path = path.with_name(f"{path.stem}_{_utc_timestamp()}{path.suffix}")
-    write_json(timestamped_path, payload)
-    return timestamped_path
-
-
-@dataclass(frozen=True)
-class GEBenchSampleSpec:
-    sample_path: Path
-    metadata: dict[str, Any]
-    sample_name: str
-    lang_device: str
-
-
-def summarize_generated_records(records: list[dict[str, Any]]) -> dict[str, Any]:
-    by_type: dict[str, list[dict[str, Any]]] = defaultdict(list)
-    for record in records:
-        by_type[record["data_type"]].append(record)
-
-    return {
-        "count": len(records),
-        "by_type": {
-            data_type: {
-                "count": len(rows),
-            }
-            for data_type, rows in sorted(by_type.items())
-        },
-    }
-
-
-def summarize_gebench_results(results: list[dict[str, Any]]) -> dict[str, Any]:
-    by_type: dict[str, list[dict[str, Any]]] = defaultdict(list)
-    for result in results:
-        by_type[result["data_type"]].append(result)
-
-    summary: dict[str, Any] = {
-        "count": len(results),
-        "overall_mean": statistics.fmean(r["overall"] for r in results) if results else 0.0,
-        "by_type": {},
-    }
-    for data_type, rows in by_type.items():
-        score_means: dict[str, float] = {}
-        all_score_keys = {key for row in rows for key in row.get("scores", {}).keys()}
-        for score_key in all_score_keys:
-            values = [row["scores"][score_key] for row in rows if score_key in row.get("scores", {})]
-            score_means[score_key] = statistics.fmean(values) if values else 0.0
-        overall_mean = statistics.fmean(row["overall"] for row in rows)
-        summary["by_type"][data_type] = {
-            "count": len(rows),
-            "overall_mean": overall_mean,
-            "overall_mean_100": overall_mean * 100.0,
-            "score_means": score_means,
-        }
-    return summary
-
-
-def select_balanced_gebench_samples(
-    sample_paths_by_type: dict[str, list[Any]],
-    *,
-    samples_per_type: int | None,
-) -> dict[str, list[Any]]:
-    if samples_per_type is None:
-        return {data_type: list(paths) for data_type, paths in sample_paths_by_type.items()}
-    return {data_type: list(paths)[:samples_per_type] for data_type, paths in sample_paths_by_type.items()}
-
-
-def collect_gebench_generation_summary(output_root: Path) -> dict[str, Any]:
-    records: list[dict[str, Any]] = []
-    for data_type, folder_name in TYPE_TO_FOLDER.items():
-        type_root = output_root / folder_name
-        if not type_root.exists():
-            continue
-        for lang_dir in sorted(path for path in type_root.iterdir() if path.is_dir()):
-            for sample_dir in sorted(path for path in lang_dir.iterdir() if path.is_dir()):
-                expected = sample_dir / "frame5.png" if data_type in {"type2", "type3", "type4"} else None
-                if expected is None:
-                    expected = find_first_image(sample_dir)
-                elif not expected.exists():
-                    expected = None
-                if expected is None:
-                    continue
-                records.append(
-                    {
-                        "data_type": data_type,
-                        "sample_name": f"{lang_dir.name}/{sample_dir.name}",
-                        "output_path": str(expected),
-                    }
-                )
-    return summarize_generated_records(records)
-
-
-def _normalize_score_key(key: str) -> str:
-    mapping = {
-        "goal": "goal",
-        "logic": "logic",
-        "cons": "cons",
-        "consistency": "cons",
-        "ui": "ui",
-        "qual": "qual",
-        "quality": "qual",
-    }
-    return mapping.get(key.lower(), key.lower())
-
-
-def _normalize_scores(raw_scores: dict[str, Any]) -> dict[str, int]:
-    scores: dict[str, int] = {}
-    for key, value in raw_scores.items():
-        normalized = _normalize_score_key(key)
-        if normalized not in SCORE_KEYS:
-            continue
-        scalar = value.get("s", 0) if isinstance(value, dict) else value
-        try:
-            scores[normalized] = int(scalar)
-        except (TypeError, ValueError):
-            scores[normalized] = 0
-    for key in SCORE_KEYS:
-        scores.setdefault(key, 0)
-    return scores
-
-
-def _compute_overall(scores: dict[str, int]) -> float:
-    return sum(scores.values()) / (len(SCORE_KEYS) * 5.0)
-
-
-def _iter_sample_paths(dataset_root: Path, data_type: str) -> list[Path]:
-    data_dir = dataset_root / TYPE_TO_FOLDER[data_type]
-    if not data_dir.exists():
-        raise FileNotFoundError(f"Dataset path not found: {data_dir}")
-
-    samples: list[Path] = []
-    for lang_dir in sorted(path for path in data_dir.iterdir() if path.is_dir()):
-        for child in sorted(lang_dir.iterdir()):
-            if child.is_dir():
-                samples.append(child)
-            elif child.suffix.lower() == ".json":
-                samples.append(child)
-    return samples
-
-
-def _sample_name_from_metadata(metadata: dict[str, Any], sample_path: Path, item_index: int | None = None) -> str:
-    sample_id = metadata.get("id") or metadata.get("sample_id") or metadata.get("name")
-    if sample_id:
-        return str(sample_id)
-    if item_index is not None:
-        return f"{sample_path.stem}_{item_index:04d}"
-    return sample_path.stem if sample_path.is_file() else sample_path.name
-
-
-def _expand_sample_path(sample_path: Path) -> list[GEBenchSampleSpec]:
-    if sample_path.is_dir():
-        metadata = _load_metadata(sample_path)
-        return [
-            GEBenchSampleSpec(
-                sample_path=sample_path,
-                metadata=metadata,
-                sample_name=_sample_name(sample_path),
-                lang_device=_lang_device(sample_path, metadata),
-            )
-        ]
-
-    payload = load_json(sample_path)
-    if isinstance(payload, dict):
-        return [
-            GEBenchSampleSpec(
-                sample_path=sample_path,
-                metadata=payload,
-                sample_name=_sample_name_from_metadata(payload, sample_path),
-                lang_device=_lang_device(sample_path, payload),
-            )
-        ]
-
-    if isinstance(payload, list):
-        specs: list[GEBenchSampleSpec] = []
-        for item_index, item in enumerate(payload):
-            if not isinstance(item, dict):
-                continue
-            specs.append(
-                GEBenchSampleSpec(
-                    sample_path=sample_path,
-                    metadata=item,
-                    sample_name=_sample_name_from_metadata(item, sample_path, item_index),
-                    lang_device=_lang_device(sample_path, item),
-                )
-            )
-        return specs
-
-    raise TypeError(f"Unsupported metadata payload type for sample {sample_path}: {type(payload)!r}")
-
-
-def _iter_sample_specs(dataset_root: Path, data_type: str) -> list[GEBenchSampleSpec]:
-    sample_specs: list[GEBenchSampleSpec] = []
-    for sample_path in _iter_sample_paths(dataset_root, data_type):
-        sample_specs.extend(_expand_sample_path(sample_path))
-    return sample_specs
-
-
-def _load_metadata(sample_path: Path) -> dict[str, Any]:
-    if sample_path.is_file():
-        return load_json(sample_path)
-    for candidate in ("meta_data.json", "metadata.json"):
-        meta_path = sample_path / candidate
-        if meta_path.exists():
-            return load_json(meta_path)
-    raise FileNotFoundError(f"Metadata not found for sample: {sample_path}")
-
-
-def _sample_name(sample_path: Path) -> str:
-    return sample_path.stem if sample_path.is_file() else sample_path.name
-
-
-def _lang_device(sample_path: Path, metadata: dict[str, Any]) -> str:
-    return str(metadata.get("lang_device") or sample_path.parent.name)
-
-
-def _resolve_referenced_image(
-    *,
-    metadata: dict[str, Any],
-    sample_path: Path,
-    dataset_root: Path,
-    data_type: str,
-) -> Image.Image | None:
-    for key in ("image", "input_image", "initial_image", "reference_image"):
-        image_ref = metadata.get(key)
-        if not image_ref:
-            continue
-        candidate = dataset_root / TYPE_TO_FOLDER[data_type] / str(image_ref)
-        if candidate.exists():
-            image = Image.open(candidate)
-            image.load()
-            return image.convert("RGB")
-    if sample_path.is_dir():
-        local_image = find_first_image(sample_path)
-        if local_image:
-            image = Image.open(local_image)
-            image.load()
-            return image.convert("RGB")
-    return None
-
-
-def _trajectory_steps(metadata: dict[str, Any]) -> list[dict[str, Any]]:
-    for key in ("trajectory", "steps", "frames"):
-        value = metadata.get(key)
-        if isinstance(value, list):
-            return [step for step in value if isinstance(step, dict)]
-    extracted: list[dict[str, Any]] = []
-    for index in range(1, 6):
-        value = metadata.get(f"step{index}") or metadata.get(str(index))
-        if isinstance(value, dict):
-            extracted.append(value)
-    return extracted
-
-
-def _text_or_default(value: Any, default: str = "") -> str:
-    return str(value).strip() if value is not None else default
-
-
-def _type1_prompt(metadata: dict[str, Any]) -> str:
-    caption = _text_or_default(metadata.get("caption") or metadata.get("instruction"), "Transform the reference GUI.")
-    return (
-        "Using the reference GUI screenshot, generate the next GUI state after the requested interaction.\n\n"
-        f"Requested change:\n{caption}\n\n"
-        "Requirements:\n"
-        "- Preserve layout, visual identity, and unrelated regions.\n"
-        "- Only apply the requested state change.\n"
-        "- Keep all text and controls readable.\n"
-    )
-
-
-def _type2_prompt(goal: str, step_num: int) -> str:
-    return (
-        "Generate the next GUI state for a multi-step task.\n\n"
-        f"Overall goal: {goal}\n"
-        f"Current progress step: {step_num}/5\n\n"
-        "Requirements:\n"
-        "- The change should be incremental and plausible.\n"
-        "- Preserve layout and visual identity.\n"
-        "- Make text/buttons readable.\n"
-    )
-
-
-def _type34_initial_prompt(metadata: dict[str, Any], first_step: dict[str, Any]) -> str:
-    app_name = _text_or_default(metadata.get("app_name"), "App")
-    final_goal = _text_or_default(metadata.get("final_goal") or metadata.get("instruction"), "Complete the task.")
-    visual_description = _text_or_default(
-        metadata.get("visual_description") or first_step.get("visual_description") or first_step.get("description"),
-        "A clean product-quality app home screen.",
-    )
-    return (
-        "Generate the first GUI frame for a task trajectory.\n\n"
-        f"App name: {app_name}\n"
-        f"Final goal: {final_goal}\n"
-        f"Visual description:\n{visual_description}\n\n"
-        "Requirements:\n"
-        "- Generate a production-looking UI screenshot only.\n"
-        "- Keep the layout coherent and readable.\n"
-    )
-
-
-def _type34_next_prompt(step_num: int, step_info: dict[str, Any]) -> str:
-    action = _text_or_default(step_info.get("action") or step_info.get("instruction"), "Continue the task.")
-    visual_description = _text_or_default(
-        step_info.get("visual_description") or step_info.get("description"),
-        "Reflect the expected next GUI state.",
-    )
-    return (
-        "Using the previous frame as reference, generate the next GUI frame.\n\n"
-        f"Step {step_num} action: {action}\n"
-        f"Expected visual state:\n{visual_description}\n\n"
-        "Requirements:\n"
-        "- Only change UI regions affected by this action.\n"
-        "- Preserve persistent bars, layout, and style.\n"
-        "- Keep text and icons readable.\n"
-    )
-
-
-def _type5_prompt(metadata: dict[str, Any]) -> str:
-    grounding = metadata.get("grounding") or {}
-    explanation = _text_or_default(
-        metadata.get("grounding_explanation") or grounding.get("effect") or grounding.get("description"),
-        "Predict the immediate GUI reaction to the indicated target.",
-    )
-    return (
-        "Using the reference GUI screenshot, predict the immediate GUI state after the grounded interaction.\n\n"
-        f"Expected effect: {explanation}\n"
-        f"Grounding metadata: {json.dumps(grounding, ensure_ascii=False)}\n\n"
-        "Requirements:\n"
-        "- Apply only the interaction-triggered change.\n"
-        "- Preserve unrelated regions.\n"
-        "- Keep the UI realistic and readable.\n"
-    )
-
-
-def _make_storyboard_image(
-    frames: list[Image.Image],
-    *,
-    columns: int = 3,
-    background_color: tuple[int, int, int] = (255, 255, 255),
-) -> Image.Image:
-    if not frames:
-        raise ValueError("Expected at least one frame to build a storyboard image.")
-
-    normalized = [frame.convert("RGB") for frame in frames]
-    frame_width = max(frame.width for frame in normalized)
-    frame_height = max(frame.height for frame in normalized)
-    rows = (len(normalized) + columns - 1) // columns
-    storyboard = Image.new("RGB", (frame_width * columns, frame_height * rows), color=background_color)
-
-    for index, frame in enumerate(normalized):
-        x_offset = (index % columns) * frame_width
-        y_offset = (index // columns) * frame_height
-        if frame.size != (frame_width, frame_height):
-            frame = frame.resize((frame_width, frame_height))
-        storyboard.paste(frame, (x_offset, y_offset))
-    return storyboard
-
-
-def _trajectory_judge_payload(frames: list[Image.Image]) -> tuple[str, list[Image.Image]]:
-    storyboard = _make_storyboard_image(frames, columns=3)
-    prompt_suffix = (
-        "The attached image is a storyboard containing six frames arranged left-to-right, "
-        "top-to-bottom as frame0, frame1, frame2, frame3, frame4, frame5."
-    )
-    return prompt_suffix, [storyboard]
-
-
-class LocalJudgeClient:
-    def __init__(self, base_url: str, api_key: str, model: str, timeout: int = 600):
-        self.base_url = base_url.rstrip("/")
-        self.api_key = api_key
-        self.model = model
-        self.timeout = timeout
-
-    def _build_scoring_prompt(self, task_prompt: str) -> str:
-        return (
-            "You are an expert evaluator for GUI image editing and GUI trajectory generation.\n"
-            "Evaluate whether the generated image(s) satisfy the task.\n\n"
-            "Score these five dimensions from 0 to 5:\n"
-            "- goal: whether the user goal is completed correctly\n"
-            "- logic: whether the transition/state change is logically correct\n"
-            "- cons: whether unrelated regions remain consistent\n"
-            "- ui: whether the UI layout/components remain realistic and coherent\n"
-            "- qual: whether the images are visually clear and artifact-free\n\n"
-            "Return JSON only. Do not add any prose outside JSON.\n"
-            "Use exactly this schema:\n"
-            "{\n"
-            ' "goal": 0,\n'
-            ' "logic": 0,\n'
-            ' "cons": 0,\n'
-            ' "ui": 0,\n'
-            ' "qual": 0,\n'
-            ' "reasoning": "short explanation"\n'
-            "}\n\n"
-            "Scoring task:\n"
-            f"{task_prompt}"
-        )
-
-    def _request_text(self, prompt: str, images: list[Image.Image]) -> str:
-        content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
-        for image in images:
-            content.append({"type": "image_url", "image_url": {"url": pil_to_data_url(image)}})
-
-        response = requests.post(
-            build_openai_url(self.base_url, "/chat/completions"),
-            json={
-                "model": self.model,
-                "messages": [{"role": "user", "content": content}],
-                "temperature": 0,
-            },
-            headers={
-                "Authorization": f"Bearer {self.api_key}",
-                "Content-Type": "application/json",
-            },
-            timeout=self.timeout,
-        )
-        response.raise_for_status()
-        message_content = response.json()["choices"][0]["message"]["content"]
-        if isinstance(message_content, list):
-            return "\n".join(part.get("text", "") for part in message_content if part.get("type") == "text")
-        return str(message_content)
-
-    def evaluate(self, *, prompt: str, images: list[Image.Image]) -> dict[str, Any]:
-        primary_prompt = self._build_scoring_prompt(prompt)
-        text = self._request_text(primary_prompt, images)
-        try:
-            return extract_json_object(text)
-        except ValueError:
-            retry_prompt = (
-                self._build_scoring_prompt(prompt) + "\n\nYour previous response was not valid JSON. "
-                "Return only the JSON object with integer scores."
-            )
-            retry_text = self._request_text(retry_prompt, images)
-            try:
-                return extract_json_object(retry_text)
-            except ValueError:
-                return {
-                    "goal": 0,
-                    "logic": 0,
-                    "cons": 0,
-                    "ui": 0,
-                    "qual": 0,
-                    "reasoning": retry_text.strip() or text.strip() or "Judge response was not valid JSON.",
-                }
-
-
-class GEBenchRunner:
-    def __init__(
-        self,
-        *,
-        dataset_root: Path,
-        output_root: Path,
-        base_url: str,
-        model: str,
-        api_key: str = "EMPTY",
-        width: int = 768,
-        height: int = 576,
-        num_inference_steps: int = 8,
-        output_compression: int | None = 98,
-        guidance_scale: float | None = None,
-        seed: int | None = 42,
-    ):
-        self.dataset_root = dataset_root
-        self.output_root = output_root
-        self.model = model
-        self.width = width
-        self.height = height
-        self.num_inference_steps = num_inference_steps
-        self.output_compression = output_compression
-        self.guidance_scale = guidance_scale
-        self.seed = seed
-        self.client = VllmOmniImageClient(base_url=base_url, api_key=api_key)
-
-    def generate(
-        self,
-        *,
-        data_type: str,
-        workers: int = 1,
-        max_samples: int | None = None,
-        samples_per_type: int | None = None,
-    ) -> list[dict[str, Any]]:
-        sample_specs = select_balanced_gebench_samples(
-            {data_type: _iter_sample_specs(self.dataset_root, data_type)},
-            samples_per_type=samples_per_type,
-        )[data_type]
-        if max_samples is not None:
-            sample_specs = sample_specs[:max_samples]
-
-        results: list[dict[str, Any]] = []
-        if workers <= 1:
-            for sample_spec in sample_specs:
-                result = self._generate_one(data_type, sample_spec)
-                if result:
-                    results.append(result)
-            return results
-
-        with ThreadPoolExecutor(max_workers=workers) as executor:
-            futures = [executor.submit(self._generate_one, data_type, sample_spec) for sample_spec in sample_specs]
-            for future in as_completed(futures):
-                result = future.result()
-                if result:
-                    results.append(result)
-        return results
-
-    def _generate_one(self, data_type: str, sample_spec: GEBenchSampleSpec) -> dict[str, Any] | None:
-        sample_path = sample_spec.sample_path
-        metadata = sample_spec.metadata
-        lang_device = sample_spec.lang_device
-        sample_name = sample_spec.sample_name
-        output_dir = ensure_dir(self.output_root / TYPE_TO_FOLDER[data_type] / lang_device / sample_name)
-
-        if data_type == "type1":
-            output_path = output_dir / "generated.png"
-            if output_path.exists():
-                return {
-                    "data_type": data_type,
-                    "sample_name": f"{lang_device}/{sample_name}",
-                    "output_path": str(output_path),
-                }
-            source = _resolve_referenced_image(
-                metadata=metadata, sample_path=sample_path, dataset_root=self.dataset_root, data_type=data_type
-            )
-            if source is None:
-                return None
-            generated = self.client.generate_image_edit(
-                model=self.model,
-                prompt=_type1_prompt(metadata),
-                images=source,
-                width=self.width,
-                height=self.height,
-                num_inference_steps=self.num_inference_steps,
-                output_compression=self.output_compression,
-                guidance_scale=self.guidance_scale,
-                seed=self.seed,
-            )
-            save_image(output_path, generated)
-            return {
-                "data_type": data_type,
-                "sample_name": f"{lang_device}/{sample_name}",
-                "output_path": str(output_path),
-            }
-
-        if data_type == "type2":
-            goal = _text_or_default(metadata.get("question") or metadata.get("caption"), "Complete the task.")
-            source = _resolve_referenced_image(
-                metadata=metadata, sample_path=sample_path, dataset_root=self.dataset_root, data_type=data_type
-            )
-            if source is None:
-                return None
-            frame0_path = output_dir / "frame0.png"
-            if not frame0_path.exists():
-                save_image(frame0_path, source)
-            previous = source
-            for step_num in range(1, 6):
-                frame_path = output_dir / f"frame{step_num}.png"
-                if frame_path.exists():
-                    previous = Image.open(frame_path).convert("RGB")
-                    continue
-                generated = self.client.generate_image_edit(
-                    model=self.model,
-                    prompt=_type2_prompt(goal, step_num),
-                    images=previous,
-                    width=self.width,
-                    height=self.height,
-                    num_inference_steps=self.num_inference_steps,
-                    output_compression=self.output_compression,
-                    guidance_scale=self.guidance_scale,
-                    seed=self.seed,
-                )
-                save_image(frame_path, generated)
-                previous = generated
-            output_path = output_dir / "frame5.png"
-            return {
-                "data_type": data_type,
-                "sample_name": f"{lang_device}/{sample_name}",
-                "output_path": str(output_path),
-            }
-
-        if data_type in {"type3", "type4"}:
-            steps = _trajectory_steps(metadata)
-            frame0_path = output_dir / "frame0.png"
-            if frame0_path.exists():
-                previous = Image.open(frame0_path).convert("RGB")
-            else:
-                previous = self.client.generate_text_to_image(
-                    model=self.model,
-                    prompt=_type34_initial_prompt(metadata, steps[0] if steps else {}),
-                    width=self.width,
-                    height=self.height,
-                    num_inference_steps=self.num_inference_steps,
-                    output_compression=self.output_compression,
-                    guidance_scale=self.guidance_scale,
-                    seed=self.seed,
-                )
-                save_image(frame0_path, previous)
-
-            for step_num in range(1, 6):
-                frame_path = output_dir / f"frame{step_num}.png"
-                if frame_path.exists():
-                    previous = Image.open(frame_path).convert("RGB")
-                    continue
-                step_info = steps[step_num - 1] if step_num - 1 < len(steps) else {}
-                generated = self.client.generate_image_edit(
-                    model=self.model,
-                    prompt=_type34_next_prompt(step_num, step_info),
-                    images=previous,
-                    width=self.width,
-                    height=self.height,
-                    num_inference_steps=self.num_inference_steps,
-                    output_compression=self.output_compression,
-                    guidance_scale=self.guidance_scale,
-                    seed=self.seed,
-                )
-                save_image(frame_path, generated)
-                previous = generated
-            output_path = output_dir / "frame5.png"
-            return {
-                "data_type": data_type,
-                "sample_name": f"{lang_device}/{sample_name}",
-                "output_path": str(output_path),
-            }
-
-        if data_type == "type5":
-            output_path = output_dir / "generated.png"
-            if output_path.exists():
-                return {
-                    "data_type": data_type,
-                    "sample_name": f"{lang_device}/{sample_name}",
-                    "output_path": str(output_path),
-                }
-            source = _resolve_referenced_image(
-                metadata=metadata, sample_path=sample_path, dataset_root=self.dataset_root, data_type=data_type
-            )
-            if source is None:
-                return None
-            generated = self.client.generate_image_edit(
-                model=self.model,
-                prompt=_type5_prompt(metadata),
-                images=source,
-                width=self.width,
-                height=self.height,
-                num_inference_steps=self.num_inference_steps,
-                output_compression=self.output_compression,
-                guidance_scale=self.guidance_scale,
-                seed=self.seed,
-            )
-            save_image(output_path, generated)
-            return {
-                "data_type": data_type,
-                "sample_name": f"{lang_device}/{sample_name}",
-                "output_path": str(output_path),
-            }
-
-        raise ValueError(f"Unsupported data type: {data_type}")
-
-
-class GEBenchEvaluator:
-    def __init__(self, *, dataset_root: Path, output_root: Path, judge: LocalJudgeClient):
-        self.dataset_root = dataset_root
-        self.output_root = output_root
-        self.judge = judge
-
-    def evaluate(
-        self,
-        *,
-        data_type: str,
-        workers: int = 1,
-        max_samples: int | None = None,
-        samples_per_type: int | None = None,
-    ) -> dict[str, Any]:
-        output_type_dir = self.output_root / TYPE_TO_FOLDER[data_type]
-        sample_specs_by_name = {
-            (spec.lang_device, spec.sample_name): spec for spec in _iter_sample_specs(self.dataset_root, data_type)
-        }
-        if not output_type_dir.exists():
-            payload = {"data_type": data_type, "results": [], "summary": summarize_gebench_results([])}
-            write_json(self.output_root / "evaluations" / f"{data_type}.json", payload)
-            return payload
-        sample_dirs = [
-            sample_dir
-            for lang_dir in sorted(path for path in output_type_dir.iterdir() if path.is_dir())
-            for sample_dir in sorted(path for path in lang_dir.iterdir() if path.is_dir())
-            if (lang_dir.name, sample_dir.name) in sample_specs_by_name
-        ]
-        sample_dirs = select_balanced_gebench_samples(
-            {data_type: sample_dirs},
-            samples_per_type=samples_per_type,
-        )[data_type]
-        if max_samples is not None:
-            sample_dirs = sample_dirs[:max_samples]
-        results: list[dict[str, Any]] = []
-        if workers <= 1:
-            for sample_dir in sample_dirs:
-                result = self._evaluate_one(
-                    data_type,
-                    sample_dir,
-                    sample_specs_by_name[(sample_dir.parent.name, sample_dir.name)],
-                )
-                if result:
-                    results.append(result)
-        else:
-            with ThreadPoolExecutor(max_workers=workers) as executor:
-                futures = [
-                    executor.submit(
-                        self._evaluate_one,
-                        data_type,
-                        sample_dir,
-                        sample_specs_by_name[(sample_dir.parent.name, sample_dir.name)],
-                    )
-                    for sample_dir in sample_dirs
-                ]
-                for future in as_completed(futures):
-                    result = future.result()
-                    if result:
-                        results.append(result)
-
-        payload = {"data_type": data_type, "results": results, "summary": summarize_gebench_results(results)}
-        write_json(self.output_root / "evaluations" / f"{data_type}.json", payload)
-        return payload
-
-    def _evaluate_one(self, data_type: str, sample_dir: Path, sample_spec: GEBenchSampleSpec) -> dict[str, Any] | None:
-        lang_device = sample_dir.parent.name
-        sample_name = sample_dir.name
-        dataset_sample = sample_spec.sample_path
-        metadata = sample_spec.metadata
-
-        if data_type == "type1":
-            source = _resolve_referenced_image(
-                metadata=metadata, sample_path=dataset_sample, dataset_root=self.dataset_root, data_type=data_type
-            )
-            generated_path = find_first_image(sample_dir)
-            if source is None or generated_path is None:
-                return None
-            generated = Image.open(generated_path).convert("RGB")
-            raw_scores = self.judge.evaluate(prompt=_type1_prompt(metadata), images=[source, generated])
-        elif data_type == "type2":
-            frames = [Image.open(sample_dir / f"frame{i}.png").convert("RGB") for i in range(6)]
-            goal = _text_or_default(metadata.get("question") or metadata.get("caption"), "Complete the task.")
-            prompt_suffix, judge_images = _trajectory_judge_payload(frames)
-            raw_scores = self.judge.evaluate(
-                prompt=f"Evaluate a six-frame GUI trajectory.\nTask: {goal}\n{prompt_suffix}",
-                images=judge_images,
-            )
-        elif data_type in {"type3", "type4"}:
-            frames = [Image.open(sample_dir / f"frame{i}.png").convert("RGB") for i in range(6)]
-            instruction = _text_or_default(metadata.get("instruction") or metadata.get("caption"), "Complete the task.")
-            prompt_suffix, judge_images = _trajectory_judge_payload(frames)
-            raw_scores = self.judge.evaluate(
-                prompt=f"Evaluate a six-frame GUI trajectory.\nInstruction: {instruction}\n{prompt_suffix}",
-                images=judge_images,
-            )
-        elif data_type == "type5":
-            source = _resolve_referenced_image(
-                metadata=metadata, sample_path=dataset_sample, dataset_root=self.dataset_root, data_type=data_type
-            )
-            generated_path = find_first_image(sample_dir)
-            if source is None or generated_path is None:
-                return None
-            generated = Image.open(generated_path).convert("RGB")
-            raw_scores = self.judge.evaluate(prompt=_type5_prompt(metadata), images=[source, generated])
-        else:
-            raise ValueError(f"Unsupported data type: {data_type}")
-
-        scores = _normalize_scores(raw_scores)
-        return {
-            "sample_name": f"{lang_device}/{sample_name}",
-            "data_type": data_type,
-            "scores": scores,
-            "overall": _compute_overall(scores),
-            "raw_scores": raw_scores,
-        }
-
-
-def _data_types_arg(value: str) -> list[str]:
-    return list(TYPE_TO_FOLDER.keys()) if value == "all" else [value]
-
-
-def build_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser(description="Run local GEBench generation and scoring against vLLM-Omni.")
-    subparsers = parser.add_subparsers(dest="command", required=True)
-
-    generate = subparsers.add_parser("generate")
-    generate.add_argument("--dataset-root", type=Path, required=True)
-    generate.add_argument("--output-root", type=Path, required=True)
-    generate.add_argument("--base-url", type=str, required=True)
-    generate.add_argument("--model", type=str, required=True)
-    generate.add_argument("--data-type", choices=["all", *TYPE_TO_FOLDER.keys()], default="all")
-    generate.add_argument("--api-key", type=str, default="EMPTY")
-    generate.add_argument("--width", type=int, default=768)
-    generate.add_argument("--height", type=int, default=576)
-    generate.add_argument("--num-inference-steps", type=int, default=8)
-    generate.add_argument("--output-compression", type=int, default=98)
-    generate.add_argument("--guidance-scale", type=float, default=None)
-    generate.add_argument("--seed", type=int, default=42)
-    generate.add_argument("--workers", type=int, default=1)
-    generate.add_argument("--max-samples", type=int, default=None)
-    generate.add_argument("--samples-per-type", type=int, default=None)
-
-    evaluate = subparsers.add_parser("evaluate")
-    evaluate.add_argument("--dataset-root", type=Path, required=True)
-    evaluate.add_argument("--output-root", type=Path, required=True)
-    evaluate.add_argument("--data-type", choices=["all", *TYPE_TO_FOLDER.keys()], default="all")
-    evaluate.add_argument("--judge-base-url", type=str, required=True)
-    evaluate.add_argument("--judge-model", type=str, required=True)
-    evaluate.add_argument("--judge-api-key", type=str, default="EMPTY")
-    evaluate.add_argument("--workers", type=int, default=1)
-    evaluate.add_argument("--max-samples", type=int, default=None)
-    evaluate.add_argument("--samples-per-type", type=int, default=None)
-
-    summarize = subparsers.add_parser("summarize")
-    summarize.add_argument("--output-root", type=Path, required=True)
-
-    return parser
-
-
-def main(argv: list[str] | None = None) -> int:
-    parser = build_parser()
-    args = parser.parse_args(argv)
-
-    if args.command == "generate":
-        runner = GEBenchRunner(
-            dataset_root=args.dataset_root,
-            output_root=args.output_root,
-            base_url=args.base_url,
-            model=args.model,
-            api_key=args.api_key,
-            width=args.width,
-            height=args.height,
-            num_inference_steps=args.num_inference_steps,
-            output_compression=args.output_compression,
-            guidance_scale=args.guidance_scale,
-            seed=args.seed,
-        )
-        records: list[dict[str, Any]] = []
-        for data_type in _data_types_arg(args.data_type):
-            records.extend(
-                runner.generate(
-                    data_type=data_type,
-                    workers=args.workers,
-                    max_samples=args.max_samples,
-                    samples_per_type=args.samples_per_type,
-                )
-            )
-        payload = {"records": records, "summary": summarize_generated_records(records)}
-        write_json(args.output_root / "generation_manifest.json", payload)
-        return 0
-
- if args.command == "evaluate":
- judge = LocalJudgeClient(
- base_url=args.judge_base_url,
- api_key=args.judge_api_key,
- model=args.judge_model,
- )
- evaluator = GEBenchEvaluator(dataset_root=args.dataset_root, output_root=args.output_root, judge=judge)
- combined_results: list[dict[str, Any]] = []
- for data_type in _data_types_arg(args.data_type):
- payload = evaluator.evaluate(
- data_type=data_type,
- workers=args.workers,
- max_samples=args.max_samples,
- samples_per_type=args.samples_per_type,
- )
- combined_results.extend(payload["results"])
- _write_json_with_timestamp(
- args.output_root / "evaluations" / "summary.json",
- {"summary": summarize_gebench_results(combined_results)},
- )
- return 0
-
- if args.command == "summarize":
- generation_summary = collect_gebench_generation_summary(args.output_root)
- evaluation_dir = args.output_root / "evaluations"
- result_records: list[dict[str, Any]] = []
- if evaluation_dir.exists():
- for file_path in sorted(evaluation_dir.glob("type*.json")):
- payload = load_json(file_path)
- result_records.extend(payload.get("results", []))
- payload: dict[str, Any] = {"generation": generation_summary}
- if result_records:
- payload["evaluation"] = summarize_gebench_results(result_records)
- _write_json_with_timestamp(args.output_root / "summary.json", payload)
- print(json.dumps(payload, indent=2, ensure_ascii=False))
- return 0
-
- parser.error(f"Unknown command: {args.command}")
- return 1
diff --git a/benchmarks/accuracy/text_to_image/run_gebench.py b/benchmarks/accuracy/text_to_image/run_gebench.py
deleted file mode 100644
index 554c968aaa4..00000000000
--- a/benchmarks/accuracy/text_to_image/run_gebench.py
+++ /dev/null
@@ -1,13 +0,0 @@
-# ruff: noqa: E402, I001
-import sys
-from pathlib import Path
-
-REPO_ROOT = Path(__file__).resolve().parents[3]
-if str(REPO_ROOT) not in sys.path:
-    sys.path.insert(0, str(REPO_ROOT))
-
-from benchmarks.accuracy.text_to_image.gbench import main
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
diff --git a/benchmarks/build_dataset/download_process_data_seedtts.md b/benchmarks/build_dataset/download_process_data_seedtts.md
deleted file mode 100644
index faf072303b8..00000000000
--- a/benchmarks/build_dataset/download_process_data_seedtts.md
+++ /dev/null
@@ -1,82 +0,0 @@
-# Benchmark Dataset Preparation Guide
-
-This guide describes how to download and prepare the SeedTTS test dataset for benchmarking Qwen-Omni models.
-
-## Prerequisites
-
-- Python 3.8+
-- `gdown` for downloading from Google Drive
-- Access to the benchmark scripts
-
-## Steps
-
-### 1. Navigate to the Dataset Directory
-
-```bash
-cd benchmarks/build_dataset
-```
-
-### 2. Install Dependencies
-
-```bash
-pip install gdown
-```
-
-### 3. Download the SeedTTS Test Dataset
-
-Download the dataset from Google Drive:
-
-```bash
-gdown 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
-```
-
-### 4. Extract the Dataset
-
-```bash
-tar -xf seedtts_testset.tar
-```
-
-### 5. Prepare the Metadata File
-
-Copy the English metadata file to the working directory:
-
-```bash
-cp seedtts_testset/en/meta.lst meta.lst
-```
-
-### 6. Extract Prompts
-
-Extract the first N prompts from the metadata file:
-
-```bash
-# Extract top 100 prompts (adjust -n for different amounts)
-python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
-```
-
-**Options:**
-- `-i, --input`: Input metadata file (default: `meta.lst`)
-- `-o, --output`: Output prompts file (default: `prompts.txt`)
-- `-n, --num-lines`: Number of prompts to extract (required)
-
-### 7. Clean Up (Optional)
-
-Remove temporary files to save disk space:
-
-```bash
-rm -rf seedtts_testset
-rm seedtts_testset.tar
-rm meta.lst
-```
-
-## Quick Start (All-in-One)
-
-```bash
-# Full dataset setup: download, extract, and build the prompt file
-cd benchmarks/build_dataset
-pip install gdown
-gdown 1GlSjVfSHkW3-leKKBlfrjuuTGqQ_xaLP
-tar -xf seedtts_testset.tar
-cp seedtts_testset/en/meta.lst meta.lst
-python extract_tts_prompts.py -i meta.lst -o top100.txt -n 100
-rm -rf seedtts_testset seedtts_testset.tar meta.lst
-```
diff --git a/benchmarks/build_dataset/extract_tts_prompts.py b/benchmarks/build_dataset/extract_tts_prompts.py
deleted file mode 100644
index bd6ae9bdb1e..00000000000
--- a/benchmarks/build_dataset/extract_tts_prompts.py
+++ /dev/null
@@ -1,73 +0,0 @@
-#!/usr/bin/env python3
-"""
-Extract prompts from meta.lst and save them to a txt file.
-
-Each line in meta.lst has the format:
-ID|prompt_text|audio_path|target_text
-
-This script extracts the prompt_text (second field) from the first N lines.
-"""
-
-import argparse
-from pathlib import Path
-
-
-def extract_prompts(input_file: str, output_file: str, num_lines: int) -> None:
-    """
-    Extract prompts from meta.lst and save to output file.
-
-    Args:
-        input_file: Path to the meta.lst file
-        output_file: Path to the output txt file
-        num_lines: Number of lines to process
-    """
-    prompts = []
-
-    with open(input_file, encoding="utf-8") as f:
-        for i, line in enumerate(f):
-            if i >= num_lines:
-                break
-
-            line = line.strip()
-            if not line:  # Skip empty lines
-                continue
-
-            parts = line.split("|")
-            if len(parts) >= 2:
-                prompt = parts[1]  # The prompt is the second field
-                prompts.append(prompt)
-
-    # Write prompts to output file
-    with open(output_file, "w", encoding="utf-8") as f:
-        for prompt in prompts:
-            f.write(prompt + "\n")
-
-    # Print result stats
-    print(f"Extracted {len(prompts)} prompts from first {num_lines} lines")
-    print(f"Saved to: {output_file}")
-
-
-def main():
-    parser = argparse.ArgumentParser(description="Extract prompts from meta.lst file")
-    parser.add_argument(
-        "-i", "--input", type=str, default="meta.lst", help="Input meta.lst file path (default: meta.lst)"
-    )
-    parser.add_argument(
-        "-o", "--output", type=str, default="prompts.txt", help="Output txt file path (default: prompts.txt)"
-    )
-    parser.add_argument(
-        "-n", "--num-lines", type=int, required=True, help="Number of lines to extract from the beginning"
-    )
-
-    args = parser.parse_args()
-
-    # Check if input file exists
-    if not Path(args.input).exists():
-        print(f"Error: Input file '{args.input}' not found")
-        return
-
-    extract_prompts(args.input, args.output, args.num_lines)
-
-
-if __name__ == "__main__":
-    main()
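The `ID|prompt_text|audio_path|target_text` layout described in the docstring above can also be parsed into named fields rather than indexed by position. The following is a minimal sketch, not part of the deleted script: `MetaEntry` and `parse_meta_line` are hypothetical names, and padding missing trailing fields with empty strings is an assumption about how short lines should be treated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetaEntry:
    """One meta.lst record (hypothetical helper type, not from the repository)."""

    utt_id: str
    prompt_text: str
    audio_path: str
    target_text: str


def parse_meta_line(line: str) -> MetaEntry | None:
    """Parse 'ID|prompt_text|audio_path|target_text'; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    # Split on at most three separators so a '|' inside the last field is preserved.
    parts = line.split("|", 3)
    # Assumption: missing trailing fields are treated as empty strings.
    parts += [""] * (4 - len(parts))
    return MetaEntry(*parts)


if __name__ == "__main__":
    entry = parse_meta_line("utt001|a reference prompt|wavs/utt001.wav|text to synthesize")
    print(entry.prompt_text)
```

Named fields make it harder to confuse `prompt_text` (the second field, which the script extracts) with `target_text` (the fourth).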
diff --git a/benchmarks/build_dataset/seed_tts_design/en/meta.lst b/benchmarks/build_dataset/seed_tts_design/en/meta.lst
deleted file mode 100644
index 7e364c2e517..00000000000
--- a/benchmarks/build_dataset/seed_tts_design/en/meta.lst
+++ /dev/null
@@ -1,20 +0,0 @@
-vd001|||The quick brown fox jumps over the lazy dog.|A warm, friendly female voice with a slight American Midwest accent, speaking at a moderate pace with natural inflection.
-vd002|||Welcome to the future of text-to-speech synthesis.|A deep, authoritative male news anchor voice, clear and professional with a measured cadence.
-vd003|||The sunset painted the sky in brilliant shades of orange and pink.|A gentle elderly female voice, soft and wise, with a slight Southern American accent.
-vd004|||Scientists have discovered a new species of deep-sea creature.|A young male voice with an Australian accent, curious and enthusiastic.
-vd005|||Breaking news: a major climate summit opens today in Geneva.|A crisp female newsreader voice, neutral accent, confident and precise.
-vd006|||In the beginning, there was darkness and silence across the universe.|A rich, dramatic bass male narrator voice, slow and deeply resonant.
-vd007|||Come closer, I have something important to tell you.|A soft, intimate female voice, slightly whispery, warm and gentle.
-vd008|||And they're off! The horses race toward the first turn at incredible speed.|An energetic male sports commentator, fast-paced and excited.
-vd009|||Once upon a time, in a land far away, lived a very clever fox.|A light, playful voice with childlike enthusiasm, bright and clear.
-vd010|||The ancient manuscript reveals secrets hidden for a thousand years.|A wise, measured elderly male voice, slow and deliberate, British English accent.
-vd011|||Good evening, ladies and gentlemen, and welcome to our show.|A sophisticated female voice with a slight French accent speaking English, elegant and refined.
-vd012|||System initialized. Running diagnostics. All systems nominal.|A clear, precise robotic-sounding voice, neutral and monotone with slight synthetic quality.
-vd013|||I hear what you are saying, and it is completely understandable to feel that way.|A warm, empathetic female therapist voice, calm and reassuring, unhurried pace.
-vd014|||Attention all units: proceed to grid reference seven-seven-alpha.|A firm, authoritative military male voice, clipped and commanding.
-vd015|||Oh my goodness, you have to try this amazing new recipe I just found!|An enthusiastic, bubbly female voice, high energy and friendly.
-vd016|||Dude, the waves were totally amazing out there today. Super happy about it!|A relaxed male voice with a California accent, casual and laid-back.
-vd017|||The quarterly results exceed expectations across all major metrics.|A sharp, businesslike female voice, confident and efficient, fast-paced delivery.
-vd018|||Chapter one. The morning sun filtered gently through the forest canopy.|A smooth, rich male audiobook narrator voice, expressive and engaging.
-vd019|||To be or not to be, that is the question.|A theatrical female voice, dramatic and expressive, stage projection quality.
-vd020|||And that is all for tonight. Stay well out there, everyone.|A warm, velvety male late-night radio DJ voice, smooth and intimate.
diff --git a/benchmarks/build_dataset/seed_tts_smoke/en/meta.lst b/benchmarks/build_dataset/seed_tts_smoke/en/meta.lst
deleted file mode 100644
index afe4bc8abcd..00000000000
--- a/benchmarks/build_dataset/seed_tts_smoke/en/meta.lst
+++ /dev/null
@@ -1,20 +0,0 @@
-smoke001|||The quick brown fox jumps over the lazy dog near the riverbank at sunset.
-smoke002|||Welcome to the future of text-to-speech synthesis in production systems.
-smoke003|||Yesterday the team finished rolling out the new authentication flow.
-smoke004|||She walked carefully across the wet cobblestones, careful not to slip.
-smoke005|||The conference call is scheduled for nine in the morning, Pacific time.
-smoke006|||Please remember to save your work before closing the editor.
-smoke007|||Two plus two equals four, but five hundred and forty three digits is long.
-smoke008|||I would like a coffee with oat milk and a chocolate croissant please.
-smoke009|||The library closes at eight on weekdays and six on Saturdays.
-smoke010|||During the Renaissance, art and science flourished in European cities.
-smoke011|||He whispered the secret word so quietly that no one else could hear.
-smoke012|||Our flight departs from gate twenty three at eleven fifteen.
-smoke013|||The storm knocked out power for six hours, but the backup generator kicked in.
-smoke014|||Reading a good book on a rainy afternoon is one of life's great pleasures.
-smoke015|||When the kettle whistled, she poured the hot water over the fresh tea leaves.
-smoke016|||The algorithm runs in linear time, which is a big improvement over the previous approach.
-smoke017|||In the distance, the mountains were shrouded in thick morning fog.
-smoke018|||Our company reported record revenue for the fourth quarter of the fiscal year.
-smoke019|||She explained the new policy in detail during the staff meeting this morning.
-smoke020|||The children laughed and played in the garden until the sun began to set.
diff --git a/benchmarks/diffusion/README.md b/benchmarks/diffusion/README.md
deleted file mode 100644
index 06bf01726e6..00000000000
--- a/benchmarks/diffusion/README.md
+++ /dev/null
@@ -1,138 +0,0 @@
-
-# Diffusion Serving Benchmark (Image/Video)
-
-This folder contains an online-serving benchmark script for diffusion models.
-It sends requests to a vLLM OpenAI-compatible endpoint and reports throughput,
-latency percentiles, and optional SLO attainment.
-
-The main entrypoint is:
-
-- `benchmarks/diffusion/diffusion_benchmark_serving.py`
-
-## 1. Quick Start
-
-1. Start the server:
-
-```bash
-vllm serve Qwen/Qwen-Image --omni --port 8099
-```
-
-2. Run a minimal benchmark:
-
-```bash
-python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
- --base-url http://localhost:8099 \
- --model Qwen/Qwen-Image \
- --task t2i \
- --dataset vbench \
- --num-prompts 5
-```
-
-**Notes**
-
-- The benchmark talks to `http://<host>:<port>/v1/chat/completions`.
-- If you run the server on another host or port, pass `--base-url` accordingly.
-
-## 2. Supported Datasets
-
-The benchmark supports three dataset modes via `--dataset`:
-
-- `vbench`: Built-in prompt/data loader.
-- `trace`: Heterogeneous request traces (each request can have different resolution/frames/steps).
-- `random`: Synthetic prompts for quick smoke tests.
-
-### VBench dataset
-
-`vbench` only provides prompt data (and image paths for i2v/i2i); it does not carry
-per-request generation fields. In this mode, all requests share CLI values:
-`--width --height --num-frames --fps --num-inference-steps`
-(pass `--width` and `--height` together).
-
-Example (`t2v`):
-
-```bash
-python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
- --base-url http://localhost:8099 \
- --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
- --task t2v \
- --dataset vbench \
- --num-prompts 50 \
- --width 640 --height 480 \
- --num-frames 81 --fps 16 \
- --num-inference-steps 40
-```
-
-Note: `vbench` can also be used for other tasks such as `t2i` / `i2v` (and `i2i`). For `t2i`, the loader reuses VBench t2v text prompts; for `i2v` / `i2i`, it loads the VBench i2v dataset (with image paths).
-
-The `i2v` / `i2i` VBench data can be downloaded automatically; the auto-download may require `gdown`:
-
-```bash
-uv pip install gdown
-```
-
-### Trace dataset
-
-Use `--dataset trace` to replay a trace file. The trace can specify per-request fields such as:
-
-- `width`, `height`
-- `num_frames` (video)
-- `num_inference_steps`
-- `seed`, `fps`
-- optional `slo_ms` (per-request SLO target)
-
-By default (when `--dataset-path` is not provided), the script downloads a default trace from
-the HuggingFace dataset repo `asukaqaqzz/Dit_Trace`. The default filename can depend on `--task`
-(e.g., `t2v` uses a video trace).
-
-Current defaults:
-
-- `--task t2i` -> `sd3_trace.txt`
-- `--task t2v` -> `cogvideox_trace.txt`
-
-You can point to your own trace using `--dataset-path`.
-
-## 3. Benchmark Parameters
-
-### Basic flags
-
-- `--base-url`: Server address (the script calls `.../v1/chat/completions`).
-- `--model`: The OpenAI-compatible `model` field.
-- `--task`: Task type (e.g., `t2i`, `t2v`, `i2i`, `i2v`).
-- `--dataset`: Dataset mode (`vbench` / `trace` / `random`).
-- `--num-prompts`: Number of requests to send.
-
-Common optional flags:
-
-- `--output-file`: Write metrics to a JSON file.
-- `--disable-tqdm`: Disable the progress bar.
-
-### Resolution / frames / steps: CLI defaults vs dataset fields
-
-Related flags: `--width`, `--height`, `--num-frames`, `--fps`, `--num-inference-steps`.
-
-- For `vbench` / `random`: these CLI flags act as global defaults for all generated requests.
-- For `trace`: requests can carry their own fields (e.g., `width/height/num_frames/num_inference_steps`), with overrides/fallbacks as below.
-
-Precedence rules for `trace` (i.e., what actually gets sent):
-
-- `width/height`: if either `--width` or `--height` is explicitly set, it overrides per-request values from the trace; otherwise per-request values are used when present.
-- `num_frames`: per-request `num_frames` takes precedence; otherwise fall back to `--num-frames`.
-- `num_inference_steps`: per-request `num_inference_steps` takes precedence; otherwise fall back to `--num-inference-steps`.
-
-### SLO, warmup, and max concurrency
-
-Enable SLO evaluation with `--slo`.
-
-- If a request in the trace already has `slo_ms`, that value is used.
-- Otherwise, the script runs warmup requests to infer a base unit time, estimates `expected_ms` by linearly scaling with area/frames/steps, and then sets `slo_ms = expected_ms * --slo-scale`.
-
-Warmup flags:
-
-- `--warmup-requests`: Number of warmup requests.
-- `--warmup-num-inference-steps`: Steps used during warmup.
-- For `--task t2v`: warmup requests are forced to use `num_frames=1` to make warmup faster and less noisy.
-
-Traffic / concurrency flags:
-
-- `--request-rate`: Target request rate (requests/second). If set to `inf`, the script sends all requests immediately.
-- `--max-concurrency`: Max number of in-flight requests (default: `1`). This can hard-cap the achieved QPS: if it is too small, requests will queue behind the semaphore, and both achieved throughput and observed SLO attainment can be skewed.
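The trace precedence rules and the SLO derivation described in the README above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: `resolve_trace_fields` and `estimate_slo_ms` are hypothetical names, the width/height rule is read as "if either CLI flag is set, both CLI values are used", and the "base unit time" is assumed to be the warmup latency divided by the product of pixel area, frame count, and step count.

```python
from __future__ import annotations


def resolve_trace_fields(
    request: dict,
    cli_width: int | None = None,
    cli_height: int | None = None,
    cli_num_frames: int | None = None,
    cli_num_inference_steps: int | None = None,
) -> dict:
    """Apply the README's precedence rules to one trace request (hypothetical helper)."""
    if cli_width is not None or cli_height is not None:
        # An explicit CLI size overrides the trace (the flags are meant to be passed together).
        width, height = cli_width, cli_height
    else:
        width, height = request.get("width"), request.get("height")
    # For frames and steps the per-request value wins; the CLI value is only a fallback.
    num_frames = request.get("num_frames")
    if num_frames is None:
        num_frames = cli_num_frames
    steps = request.get("num_inference_steps")
    if steps is None:
        steps = cli_num_inference_steps
    return {"width": width, "height": height, "num_frames": num_frames, "num_inference_steps": steps}


def estimate_slo_ms(warmup_latency_ms: float, warmup: dict, request: dict, slo_scale: float) -> float:
    """Derive slo_ms by linear scaling from a warmup measurement (assumed formula)."""
    if request.get("slo_ms") is not None:
        # A per-request SLO carried by the trace is used as-is.
        return float(request["slo_ms"])

    def work(r: dict) -> float:
        # Assumed cost model: area * frames * steps; image requests count as one frame.
        return r["width"] * r["height"] * (r.get("num_frames") or 1) * r["num_inference_steps"]

    unit_ms = warmup_latency_ms / work(warmup)
    expected_ms = unit_ms * work(request)
    return expected_ms * slo_scale
```

Under this cost model, a request with 40 times the warmup's area-frames-steps product gets an `expected_ms` of 40 times the warmup latency before `--slo-scale` is applied.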
diff --git a/benchmarks/diffusion/backends.py b/benchmarks/diffusion/backends.py
deleted file mode 100644
index d33160f1377..00000000000
--- a/benchmarks/diffusion/backends.py
+++ /dev/null
@@ -1,359 +0,0 @@
-import asyncio
-import base64
-import mimetypes
-import os
-import time
-import uuid
-from dataclasses import dataclass, field
-from typing import Any
-
-import aiohttp
-from tqdm import tqdm
-
-
-@dataclass
-class RequestFuncInput:
-    prompt: str
-    api_url: str
-    model: str
-    width: int | None = None
-    height: int | None = None
-    num_frames: int | None = None
-    num_inference_steps: int | None = None
-    seed: int | None = None
-    fps: int | None = None
-    timestamp: float | None = None
-    slo_ms: float | None = None
-    extra_body: dict[str, Any] = field(default_factory=dict)
-    image_paths: list[str] | None = None
-    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
-
-
-@dataclass
-class RequestFuncOutput:
-    success: bool = False
-    latency: float = 0.0
-    error: str = ""
-    start_time: float = 0.0
-    response_body: dict[str, Any] = field(default_factory=dict)
-    stage_durations: dict[str, float] = field(default_factory=dict)
-    peak_memory_mb: float = 0.0
-    slo_achieved: bool | None = None
-
-
-def _guess_mime_type(path: str) -> str:
-    mime, _ = mimetypes.guess_type(path)
-    return mime or "application/octet-stream"
-
-
-def _encode_image_as_data_url(path: str) -> str:
-    with open(path, "rb") as f:
-        encoded = base64.b64encode(f.read()).decode("utf-8")
-    mime = _guess_mime_type(path)
-    return f"data:{mime};base64,{encoded}"
-
-
-async def async_request_chat_completions(
-    input: RequestFuncInput,
-    session: aiohttp.ClientSession,
-    pbar: tqdm | None = None,
-    enable_diffusion_pipeline_profiler: bool = False,
-) -> RequestFuncOutput:
-    output = RequestFuncOutput()
-    output.start_time = time.perf_counter()
-
-    extra_body = dict(input.extra_body)
-    if input.width and input.height:
-        extra_body.setdefault("height", input.height)
-        extra_body.setdefault("width", input.width)
-    if input.num_frames:
-        extra_body.setdefault("num_frames", input.num_frames)
-    if input.num_inference_steps:
-        extra_body.setdefault("num_inference_steps", input.num_inference_steps)
-    if input.seed is not None:
-        extra_body.setdefault("seed", input.seed)
-    if input.fps:
-        extra_body.setdefault("fps", input.fps)
-
-    if input.image_paths and len(input.image_paths) > 0:
-        content = []
-        if input.prompt:
-            content.append({"type": "text", "text": input.prompt})
-        for img_path in input.image_paths:
-            if not os.path.exists(img_path):
-                output.error = f"Image file not found: {img_path}"
-                output.success = False
-                if pbar:
-                    pbar.update(1)
-                return output
-            content.append(
-                {
-                    "type": "image_url",
-                    "image_url": {"url": _encode_image_as_data_url(img_path)},
-                }
-            )
-        messages = [{"role": "user", "content": content}]
-    else:
-        messages = [{"role": "user", "content": input.prompt}]
-
-    payload = {
-        "model": input.model,
-        "messages": messages,
-    }
-    if extra_body:
-        payload["extra_body"] = extra_body
-
-    try:
-        async with session.post(input.api_url, json=payload) as response:
-            if response.status == 200:
-                resp_json = await response.json()
-                output.response_body = resp_json
-                output.success = True
-                try:
-                    choices = resp_json.get("choices", [])
-                    if choices and isinstance(choices, list):
-                        msg = choices[0].get("message", {})
-                        if isinstance(msg, dict):
-                            content = msg.get("content", [])
-                            if content and isinstance(content, list) and len(content) > 0:
-                                first_item = content[0]
-                                if isinstance(first_item, dict):
-                                    output.stage_durations = first_item.get("stage_durations") or {}
-                                    output.peak_memory_mb = first_item.get("peak_memory_mb", 0.0)
-                except (IndexError, TypeError, AttributeError):
-                    pass
-
-                if (not output.stage_durations or output.peak_memory_mb == 0.0) and isinstance(
-                    resp_json.get("metrics"), dict
-                ):
-                    m = resp_json["metrics"]
-                    if not output.stage_durations and isinstance(m.get("stage_durations"), dict):
-                        output.stage_durations = m.get("stage_durations") or {}
-                    if output.peak_memory_mb == 0.0 and m.get("peak_memory_mb") is not None:
-                        try:
-                            output.peak_memory_mb = float(m.get("peak_memory_mb") or 0.0)
-                        except (TypeError, ValueError):
-                            pass
-            else:
-                output.error = f"HTTP {response.status}: {await response.text()}"
-                output.success = False
-    except Exception as e:
-        output.error = str(e)
-        output.success = False
-
-    output.latency = time.perf_counter() - output.start_time
-
-    if output.success and input.slo_ms is not None:
-        output.slo_achieved = (output.latency * 1000.0) <= float(input.slo_ms)
-
-    if pbar:
-        pbar.update(1)
-    return output
-
-
-async def async_request_openai_images(
-    input: RequestFuncInput,
-    session: aiohttp.ClientSession,
-    pbar: tqdm | None = None,
-) -> RequestFuncOutput:
-    """
-    Send request to OpenAI's /v1/images/generations endpoint.
-    """
-    output = RequestFuncOutput()
-    output.start_time = time.perf_counter()
-
-    # Build size string from width/height
-    width = input.width or 1024
-    height = input.height or 1024
-    size = f"{width}x{height}"
-
-    payload: dict[str, Any] = {
-        "model": input.model,
-        "prompt": input.prompt,
-        "n": 1,
-        "size": size,
-        "response_format": "b64_json",
-    }
-
-    # Add optional parameters
-    if input.seed is not None:
-        payload["seed"] = input.seed
-    if input.num_inference_steps is not None:
-        payload["num_inference_steps"] = input.num_inference_steps
-
-    # Add any extra body parameters
-    if input.extra_body:
-        for key, value in input.extra_body.items():
-            if key not in payload:
-                payload[key] = value
-
-    headers = {
-        "Content-Type": "application/json",
-        "Authorization": "Bearer EMPTY",
-    }
-
-    try:
-        async with session.post(input.api_url, json=payload, headers=headers) as response:
-            if response.status == 200:
-                resp_json = await response.json()
-                output.response_body = resp_json
-                output.success = True
-                # Check for usage/memory info if available
-                if "usage" in resp_json and "peak_memory_mb" in resp_json.get("usage", {}):
-                    output.peak_memory_mb = resp_json["usage"]["peak_memory_mb"]
-            else:
-                output.error = f"HTTP {response.status}: {await response.text()}"
-                output.success = False
-    except Exception as e:
-        output.error = str(e)
-        output.success = False
-
-    output.latency = time.perf_counter() - output.start_time
-
-    if output.success and input.slo_ms is not None:
-        output.slo_achieved = (output.latency * 1000.0) <= float(input.slo_ms)
-
-    if pbar:
-        pbar.update(1)
-    return output
-
-
-async def async_request_v1_videos(
-    input: RequestFuncInput,
-    session: aiohttp.ClientSession,
-    pbar: tqdm | None = None,
-) -> RequestFuncOutput:
-    output = RequestFuncOutput()
-    output.start_time = time.perf_counter()
-
-    files = dict(input.extra_body)
-    if input.prompt:
-        files.setdefault("prompt", input.prompt)
-    if input.width and input.height:
-        files.setdefault("height", input.height)
-        files.setdefault("width", input.width)
-    if input.num_frames:
-        files.setdefault("num_frames", input.num_frames)
-    if input.num_inference_steps:
-        files.setdefault("num_inference_steps", input.num_inference_steps)
-    if input.seed is not None:
-        files.setdefault("seed", input.seed)
-    if input.fps:
-        files.setdefault("fps", input.fps)
-
-    form = aiohttp.FormData()
-    for k, v in files.items():
-        form.add_field(k, str(v))
-
-    image_file = None
-    if input.image_paths and len(input.image_paths) > 0:
-        image_path = input.image_paths[0]
-        image_file = open(image_path, "rb")
-        form.add_field(
-            "input_reference",
-            image_file,
-            filename=os.path.basename(image_path),
-            content_type="application/octet-stream",
-        )
-
-    job_id = None
-    job_status = None
-    poll_json = {}
-    resp_json = {}
-
-    try:
-        # Submit the job (POST /v1/videos)
-        async with session.post(input.api_url, data=form) as response:
-            if response.status == 200:
-                resp_json = await response.json()
-                job_id = resp_json.get("id")
-                job_status = resp_json.get("status")
-                if not job_id or not job_status:
-                    output.error = "API response missing job 'id' or 'status' field."
-                    output.success = False
-                    return output
-            else:
-                output.error = f"HTTP {response.status}: {await response.text()}"
-                output.success = False
-                return output
-
-        # Poll the job status (GET /v1/videos/{video_id})
-        poll_interval = 2.0  # seconds
-        timeout_seconds = 600.0
-        deadline = time.perf_counter() + timeout_seconds
-        job_url = f"{input.api_url}/{job_id}"
-
-        while job_status not in {"completed", "failed"}:
-            await asyncio.sleep(poll_interval)
-
-            async with session.get(job_url) as poll_response:
-                if poll_response.status != 200:
-                    output.error = f"Polling failed HTTP {poll_response.status}: {await poll_response.text()}"
-                    output.success = False
-                    return output
-
-                poll_json = await poll_response.json()
-                job_status = poll_json.get("status")
-
-            if time.perf_counter() >= deadline:
-                output.error = f"Timed out waiting for video job {job_id} to complete."
-                output.success = False
-                return output
-
-        if job_status == "failed":
-            output.error = f"Video job failed: {poll_json}"
-            output.success = False
-            return output
-
-        # Fetch the generated video (GET /v1/videos/{video_id}/content)
-        content_url = f"{job_url}/content"
-        async with session.get(content_url) as content_response:
-            if content_response.status != 200:
-                output.error = (
-                    f"Content retrieval failed HTTP {content_response.status}: {await content_response.text()}"
-                )
-                output.success = False
-                return output
-
-            video_bytes = await content_response.read()
-            output.response_body = video_bytes
-            output.success = True
-            if "stage_durations" in poll_json:
-                output.stage_durations = poll_json["stage_durations"] or {}
-            if "peak_memory_mb" in poll_json:
-                output.peak_memory_mb = poll_json["peak_memory_mb"]
-            elif "peak_memory_mb" in resp_json:
-                output.peak_memory_mb = resp_json["peak_memory_mb"]
-    except Exception as e:
-        output.error = str(e)
-        output.success = False
-    finally:
-        if image_file is not None:
-            image_file.close()
-
-        if job_id is not None:
-            try:
-                async with session.delete(f"{input.api_url}/{job_id}") as _:
-                    pass
-            except Exception as e:
-                print(f"Failed to clean up video job {job_id}: {e}")
-
-    output.latency = time.perf_counter() - output.start_time
-
-    if output.success and input.slo_ms is not None:
-        output.slo_achieved = (output.latency * 1000.0) <= float(input.slo_ms)
-
-    if pbar:
-        pbar.update(1)
-    return output
-
-
-backends_function_mapping = {
-    "2i": {
-        "vllm-omni": (async_request_chat_completions, "/v1/chat/completions"),
-        "openai": (async_request_openai_images, "/v1/images/generations"),
-    },
-    "2v": {
-        "v1/videos": (async_request_v1_videos, "/v1/videos"),
-    },
-}
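The request functions above are coroutines, so a driver can bound how many run at once. The README for this benchmark notes that with a small `--max-concurrency`, requests "queue behind the semaphore". The sketch below shows that pattern in isolation using only `asyncio`. It is an illustration, not the benchmark's driver: `run_with_cap` and `fake_request` are hypothetical names, and `asyncio.sleep` stands in for the HTTP round trip.

```python
import asyncio


async def run_with_cap(num_requests: int, max_concurrency: int, service_time_s: float = 0.01) -> int:
    """Run fake requests under a concurrency cap and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        # Requests beyond the cap wait here until a slot frees up.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(service_time_s)  # stand-in for the HTTP round trip
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_with_cap(num_requests=8, max_concurrency=2)))
```

Because waiting on the semaphore happens before the request is sent, time spent queued may or may not count toward measured latency depending on where the driver starts its timer; that is one way a low cap can skew throughput and SLO figures.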
diff --git a/benchmarks/diffusion/diffusion_benchmark_serving.py b/benchmarks/diffusion/diffusion_benchmark_serving.py
deleted file mode 100644
index 77b36b3d9c0..00000000000
--- a/benchmarks/diffusion/diffusion_benchmark_serving.py
+++ /dev/null
@@ -1,1139 +0,0 @@
-# adapted from fastvideo
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-"""
-Benchmark online serving for diffusion models (Image/Video Generation).
-To use the i2v or i2i datasets, run `uv pip install gdown` first.
-
-Supports multiple backends:
-    - vllm-omni: Uses /v1/chat/completions endpoint (default)
-    - openai: Uses /v1/images/generations endpoint
-    - v1/videos: Uses /v1/videos endpoint
-
-Usage:
-    # Video (v1/videos backend)
-    t2v:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend v1/videos --dataset vbench --task t2v --num-prompts 10 \
-        --height 480 --width 640 --fps 16 --num-frames 80
-
-    i2v:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend v1/videos --dataset vbench --task i2v --num-prompts 10
-
-
-    # Image (vllm-omni backend)
-    t2i:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend vllm-omni --dataset vbench --task t2i --num-prompts 10 \
-        --height 1024 --width 1024
-
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend vllm-omni --dataset random --task t2i --num-prompts 1 \
-        --max-concurrency 1 --enable-negative-prompt \
-        --random-request-config '[
-            {"width":512,"height":512,"num_inference_steps":20,"weight":0.15},
-            {"width":768,"height":768,"num_inference_steps":20,"weight":0.25},
-            {"width":1024,"height":1024,"num_inference_steps":25,"weight":0.45},
-            {"width":1536,"height":1536,"num_inference_steps":35,"weight":0.15}
-        ]'
-
-    i2i:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend vllm-omni --dataset vbench --task i2i --num-prompts 10
-
-    # Image (openai backend)
-    t2i:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend openai --dataset vbench --task t2i --num-prompts 10 \
-        --height 1024 --width 1024 --port 3000
-
-    # Video (v1/videos)
-    t2v:
-    python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-        --backend v1/videos --dataset random --task t2v --num-prompts 1 \
-        --max-concurrency 1 --enable-negative-prompt \
-        --random-request-config '[
-            {"width":854,"height":480,"num_inference_steps":18,"num_frames":120,"fps":24,"weight":1}
-        ]'
-
-"""
-
-import argparse
-import ast
-import asyncio
-import glob
-import json
-import logging
-import os
-import random
-import tempfile
-import time
-import uuid
-from abc import ABC, abstractmethod
-from collections.abc import AsyncGenerator
-from dataclasses import replace
-from typing import Any
-
-import aiohttp
-import numpy as np
-import requests
-from backends import RequestFuncInput, RequestFuncOutput, backends_function_mapping
-from PIL import Image
-from tqdm.asyncio import tqdm
-
-logger = logging.getLogger(__name__)
-
-
-class BaseDataset(ABC):
- def __init__(self, args, api_url: str, model: str):
- self.args = args
- self.api_url = api_url
- self.model = model
-
- @abstractmethod
- def __len__(self) -> int:
- pass
-
- @abstractmethod
- def __getitem__(self, idx: int) -> RequestFuncInput:
- pass
-
- @abstractmethod
- def get_requests(self) -> list[RequestFuncInput]:
- pass
-
-
-class VBenchDataset(BaseDataset):
- """
- Dataset loader for VBench prompts.
- Supports t2v, i2v.
- """
-
- T2V_PROMPT_URL = (
- "https://raw.githubusercontent.com/Vchitect/VBench/master/prompts/prompts_per_dimension/subject_consistency.txt"
- )
- I2V_DOWNLOAD_SCRIPT_URL = (
- "https://raw.githubusercontent.com/Vchitect/VBench/master/vbench2_beta_i2v/download_data.sh"
- )
-
- def __init__(self, args, api_url: str, model: str):
- super().__init__(args, api_url, model)
- self.cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "vllm-omni")
- self.items = self._load_data()
-
- def _load_data(self) -> list[dict[str, Any]]:
- if self.args.task == "t2v":
- return self._load_t2v_prompts()
- elif self.args.task in ["i2v", "ti2v", "ti2i", "i2i"]:
- return self._load_i2v_data()
- else:
- return self._load_t2v_prompts()
-
- def _download_file(self, url: str, dest_path: str) -> None:
- """Download a file from URL to destination path."""
- os.makedirs(os.path.dirname(dest_path), exist_ok=True)
- resp = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
- resp.raise_for_status()
- with open(dest_path, "w") as f:
- f.write(resp.text)
-
- def _load_t2v_prompts(self) -> list[dict[str, Any]]:
- path = self.args.dataset_path
-
- if not path:
- path = os.path.join(self.cache_dir, "vbench_subject_consistency.txt")
- if not os.path.exists(path):
- print(f"Downloading VBench T2V prompts to {path}...")
- try:
- self._download_file(self.T2V_PROMPT_URL, path)
- except Exception as e:
- print(f"Failed to download VBench prompts: {e}")
- return [{"prompt": "A cat sitting on a bench"}] * 50
-
- prompts = []
- with open(path) as f:
- for line in f:
- line = line.strip()
- if line:
- prompts.append({"prompt": line})
-
- return self._resize_data(prompts)
-
- def _auto_download_i2v_dataset(self) -> str | None:
- """Auto-download the VBench I2V dataset and return its directory, or None on failure."""
- vbench_i2v_dir = os.path.join(self.cache_dir, "vbench_i2v", "vbench2_beta_i2v")
- info_json_path = os.path.join(vbench_i2v_dir, "data", "i2v-bench-info.json")
-
- if os.path.exists(info_json_path):
- return vbench_i2v_dir
-
- print(f"Downloading VBench I2V dataset to {vbench_i2v_dir}...")
- try:
- cache_root = os.path.join(self.cache_dir, "vbench_i2v")
- script_path = os.path.join(cache_root, "download_data.sh")
-
- self._download_file(self.I2V_DOWNLOAD_SCRIPT_URL, script_path)
- os.chmod(script_path, 0o755)
-
- print("Executing download_data.sh (this may take a while)...")
- import subprocess
-
- result = subprocess.run(
- ["bash", script_path],
- cwd=cache_root,
- capture_output=True,
- text=True,
- )
-
- if result.returncode != 0:
- raise RuntimeError(f"Download script failed: {result.stderr}")
-
- print(f"Successfully downloaded VBench I2V dataset to {vbench_i2v_dir}")
- except Exception as e:
- print(f"Failed to download VBench I2V dataset: {e}")
- print("Please manually download following instructions at:")
- print("https://github.com/Vchitect/VBench/tree/master/vbench2_beta_i2v#22-download")
- return None
-
- return vbench_i2v_dir if os.path.exists(info_json_path) else None
-
- def _load_from_i2v_json(self, json_path: str) -> list[dict[str, Any]]:
- """Load I2V data from i2v-bench-info.json format."""
- with open(json_path) as f:
- items = json.load(f)
-
- base_dir = os.path.dirname(os.path.dirname(json_path)) # Go up to vbench2_beta_i2v
- origin_dir = os.path.join(base_dir, "data", "origin")
-
- data = []
- for item in items:
- img_path = os.path.join(origin_dir, item.get("file_name", ""))
- if os.path.exists(img_path):
- data.append({"prompt": item.get("caption", ""), "image_path": img_path})
- else:
- print(f"Warning: Image not found: {img_path}")
-
- print(f"Loaded {len(data)} I2V samples from VBench I2V dataset")
- return data
-
- def _scan_directory_for_images(self, path: str) -> list[dict[str, Any]]:
- """Scan directory for image files."""
- exts = ["*.jpg", "*.jpeg", "*.png", "*.webp"]
- files = []
-
- for ext in exts:
- files.extend(glob.glob(os.path.join(path, ext)))
- files.extend(glob.glob(os.path.join(path, ext.upper())))
-
- # Also check in data/origin subdirectory
- origin_dir = os.path.join(path, "data", "origin")
- if os.path.exists(origin_dir):
- files.extend(glob.glob(os.path.join(origin_dir, ext)))
- files.extend(glob.glob(os.path.join(origin_dir, ext.upper())))
-
- return [{"prompt": os.path.splitext(os.path.basename(f))[0], "image_path": f} for f in files]
-
- def _create_dummy_data(self) -> list[dict[str, Any]]:
- """Create dummy data with a placeholder image in cache directory."""
- print("No I2V data found. Using dummy placeholders.")
-
- dummy_image = os.path.join(self.cache_dir, "dummy_image.jpg")
- if not os.path.exists(dummy_image):
- try:
- from PIL import Image
-
- os.makedirs(self.cache_dir, exist_ok=True)
- img = Image.new("RGB", (100, 100), color="red")
- img.save(dummy_image)
- print(f"Created dummy image at {dummy_image}")
- except ImportError:
- print("PIL not installed, cannot create dummy image.")
- return []
-
- return [{"prompt": "A moving cat", "image_path": dummy_image}] * 10
-
- def _load_i2v_data(self) -> list[dict[str, Any]]:
- """Load I2V data from VBench I2V dataset or user-provided path."""
- path = self.args.dataset_path
-
- # Auto-download if no path provided
- if not path:
- path = self._auto_download_i2v_dataset()
- if not path:
- return self._resize_data(self._create_dummy_data())
-
- # Try to load from i2v-bench-info.json
- info_json_candidates = [
- os.path.join(path, "data", "i2v-bench-info.json"),
- path if path.endswith(".json") else None,
- ]
-
- for json_path in info_json_candidates:
- if json_path and os.path.exists(json_path):
- try:
- return self._resize_data(self._load_from_i2v_json(json_path))
- except Exception as e:
- print(f"Failed to load {json_path}: {e}")
-
- # Fallback: scan directory for images
- if os.path.isdir(path):
- data = self._scan_directory_for_images(path)
- if data:
- return self._resize_data(data)
-
- # Last resort: dummy data
- return self._resize_data(self._create_dummy_data())
-
- def _resize_data(self, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
- """Resize data to match num_prompts."""
- if not data:
- raise ValueError("No benchmark data available. Install Pillow or provide --dataset-path.")
-
- if not self.args.num_prompts:
- return data
-
- if len(data) < self.args.num_prompts:
- factor = (self.args.num_prompts // len(data)) + 1
- data = data * factor
-
- return data[: self.args.num_prompts]
-
- def __len__(self) -> int:
- return len(self.items)
-
- def __getitem__(self, idx: int) -> RequestFuncInput:
- item = self.items[idx]
- image_paths = [item["image_path"]] if "image_path" in item else None
-
- return RequestFuncInput(
- prompt=item.get("prompt", ""),
- api_url=self.api_url,
- model=self.model,
- width=self.args.width,
- height=self.args.height,
- num_frames=self.args.num_frames,
- num_inference_steps=self.args.num_inference_steps,
- seed=self.args.seed,
- fps=self.args.fps,
- image_paths=image_paths,
- )
-
- def get_requests(self) -> list[RequestFuncInput]:
- return [self[i] for i in range(len(self))]
-
-
-class TraceDataset(BaseDataset):
- """Trace-based dataset loader for heterogeneous diffusion requests."""
-
- DEFAULT_REPO_ID = "asukaqaqzz/Dit_Trace"
- DEFAULT_FILENAME = "sd3_trace.txt"
- DEFAULT_FILENAME_BY_TASK: dict[str, str] = {
- # Text-to-image traces (e.g., SD3)
- "t2i": "sd3_trace.txt",
- # Text-to-video traces (e.g., CogVideoX)
- "t2v": "cogvideox_trace.txt",
- }
-
- def __init__(self, args, api_url: str, model: str):
- super().__init__(args, api_url, model)
- self.cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "vllm-omni", "trace")
- self.default_filename = self.DEFAULT_FILENAME_BY_TASK.get(getattr(args, "task", ""), self.DEFAULT_FILENAME)
- dataset_root = args.dataset_path
- if not dataset_root:
- dataset_root = self._download_default_trace()
- self.items = self._load_items(dataset_root)
-
- @staticmethod
- def _coerce_int(x: Any) -> int | None:
- if x is None:
- return None
- if isinstance(x, bool):
- return None
- if isinstance(x, int):
- return x
- try:
- s = str(x).strip()
- if not s:
- return None
- return int(float(s))
- except Exception:
- return None
-
- @staticmethod
- def _coerce_float(x: Any) -> float | None:
- if x is None:
- return None
- if isinstance(x, float):
- return x
- if isinstance(x, int):
- return float(x)
- try:
- s = str(x).strip()
- if not s:
- return None
- return float(s)
- except Exception:
- return None
-
- def _download_default_trace(self) -> str:
- """Download default trace file from HuggingFace Hub if not provided."""
-
- try:
- from huggingface_hub import hf_hub_download
- except ImportError as exc:
- raise ImportError(
- "huggingface_hub is required to download the default trace dataset. "
- "Install via `pip install huggingface_hub`."
- ) from exc
-
- os.makedirs(self.cache_dir, exist_ok=True)
- return hf_hub_download(
- repo_id=self.DEFAULT_REPO_ID,
- filename=self.default_filename,
- repo_type="dataset",
- local_dir=self.cache_dir,
- local_dir_use_symlinks=False,
- )
-
- def _expand_paths(self, dataset_path: str | None) -> list[str]:
- if not dataset_path:
- return []
-
- parts = [p.strip() for p in str(dataset_path).split(",") if p.strip()]
- paths: list[str] = []
- for p in parts:
- if any(ch in p for ch in ["*", "?", "["]):
- paths.extend(sorted(glob.glob(p)))
- elif os.path.isdir(p):
- paths.extend(sorted(glob.glob(os.path.join(p, "**", "*.txt"), recursive=True)))
- else:
- paths.append(p)
-
- seen = set()
- unique_paths = []
- for p in paths:
- if p not in seen:
- seen.add(p)
- unique_paths.append(p)
- return unique_paths
-
- def _parse_trace_file(self, path: str) -> list[dict[str, Any]]:
- rows: list[dict[str, Any]] = []
-
- def parse_request_repr_line(line: str) -> dict[str, Any] | None:
- text = line.strip()
- if not text:
- return None
- if not (text.startswith("Request(") and text.endswith(")")):
- return None
- inner = text[len("Request(") : -1]
- try:
- expr = ast.parse(f"f({inner})", mode="eval")
- if not isinstance(expr.body, ast.Call):
- return None
- call = expr.body
- out: dict[str, Any] = {}
- for kw in call.keywords:
- if kw.arg is None:
- continue
- out[kw.arg] = ast.literal_eval(kw.value)
- return out
- except Exception:
- return None
-
- # detect first non-empty line to pick parser
- first_non_empty = None
- with open(path, encoding="utf-8") as f:
- for _ in range(50):
- pos = f.tell()
- line = f.readline()
- if not line:
- break
- if line.strip():
- first_non_empty = line.strip()
- f.seek(pos)
- break
-
- if first_non_empty is None:
- return rows
-
- if first_non_empty.startswith("Request("):
- with open(path, encoding="utf-8") as f:
- for line in f:
- parsed = parse_request_repr_line(line)
- if isinstance(parsed, dict):
- rows.append(parsed)
- return rows
-
- # txt fallback: parse Request(...) lines only
- with open(path, encoding="utf-8") as f:
- for line in f:
- parsed = parse_request_repr_line(line)
- if isinstance(parsed, dict):
- rows.append(parsed)
- return rows
-
- def _load_items(self, dataset_root: str) -> list[dict[str, Any]]:
- paths = self._expand_paths(dataset_root)
- if not paths:
- raise ValueError("No trace files found. Provide --dataset-path or rely on default HuggingFace download.")
-
- items: list[dict[str, Any]] = []
- for p in paths:
- if not os.path.exists(p):
- continue
- for row in self._parse_trace_file(p):
- if isinstance(row, dict):
- row = dict(row)
- row.setdefault("_source", p)
- items.append(row)
-
- if not items:
- raise ValueError("Trace dataset is empty after parsing provided paths.")
-
- if self.args.num_prompts is not None:
- items = items[: self.args.num_prompts]
-
- return items
-
- def __len__(self) -> int:
- return len(self.items)
-
- def __getitem__(self, idx: int) -> RequestFuncInput:
- row = self.items[idx]
- prompt = row.get("prompt") or row.get("text") or ""
-
- row_height = self._coerce_int(row.get("height"))
- row_width = self._coerce_int(row.get("width"))
- num_frames = self._coerce_int(row.get("num_frames"))
- num_steps = self._coerce_int(row.get("num_inference_steps"))
- seed = self._coerce_int(row.get("seed"))
- fps = self._coerce_int(row.get("fps"))
- timestamp = self._coerce_float(row.get("timestamp"))
- slo_ms = self._coerce_float(row.get("slo_ms"))
- image_paths = row.get("image_paths")
- if not image_paths:
- single = row.get("image_path")
- image_paths = [single] if single else None
-
- if not image_paths and self.args.task in ["i2v", "i2i", "ti2v", "ti2i"]:
- raise ValueError(
- f"Task {self.args.task} requires image input, but no image_path or image_paths found in trace row."
- )
-
- override_w = self.args.width
- override_h = self.args.height
- if override_w is not None or override_h is not None:
- width = override_w
- height = override_h
- else:
- width = row_width
- height = row_height
-
- return RequestFuncInput(
- prompt=str(prompt),
- api_url=self.api_url,
- model=self.model,
- width=width,
- height=height,
- num_frames=num_frames if num_frames is not None else self.args.num_frames,
- num_inference_steps=num_steps if num_steps is not None else self.args.num_inference_steps,
- seed=seed if seed is not None else self.args.seed,
- fps=fps if fps is not None else self.args.fps,
- timestamp=timestamp,
- slo_ms=slo_ms,
- image_paths=image_paths,
- request_id=str(row.get("request_id")) if row.get("request_id") is not None else str(uuid.uuid4()),
- )
-
- def get_requests(self) -> list[RequestFuncInput]:
- return [self[i] for i in range(len(self))]
-
-
-class RandomDataset(BaseDataset):
- def __init__(self, args, api_url: str, model: str, enable_negative_prompt: bool = False):
- super().__init__(args, api_url, model)
- self.num_prompts = args.num_prompts
- self.enable_negative_prompt = enable_negative_prompt
- self.num_input_images = max(1, args.num_input_images)
- self.random_request_config = getattr(args, "random_request_config", None)
- if self.random_request_config:
- self.random_request_config = json.loads(self.random_request_config)
- self._weights = [p["weight"] for p in self.random_request_config]
-
- self.random_request_config = [
- {k: v for k, v in p.items() if k != "weight"} for p in self.random_request_config
- ]
-
- seed = getattr(args, "random_request_seed", 42)
- self._rng = random.Random(seed)
-
- self._sampled_requests = self._rng.choices(
- self.random_request_config,
- weights=self._weights,
- k=self.num_prompts,
- )
- else:
- self._sampled_requests = None
-
- # Generate synthetic input images for image-conditioned tasks
- if self.args.task in ["i2v", "ti2v", "ti2i", "i2i"]:
- self._random_image_path = self._generate_random_image_paths()
- else:
- self._random_image_path = None
-
- def __len__(self) -> int:
- return self.num_prompts
-
- def __getitem__(self, idx: int) -> RequestFuncInput:
- extra_body = {}
- if self.enable_negative_prompt:
- extra_body["negative_prompt"] = f"Negative prompt {idx} for benchmarking diffusion models"
-
- params = {
- "width": self.args.width,
- "height": self.args.height,
- "num_frames": self.args.num_frames,
- "num_inference_steps": self.args.num_inference_steps,
- "fps": self.args.fps,
- }
- if self._sampled_requests:
- profile = self._sampled_requests[idx]
- params.update(profile)
- return RequestFuncInput(
- prompt=f"Random prompt {idx} for benchmarking diffusion models",
- api_url=self.api_url,
- model=self.model,
- seed=self.args.seed,
- extra_body=extra_body,
- image_paths=self._random_image_path,
- **params,
- )
-
- def get_requests(self) -> list[RequestFuncInput]:
- return [self[i] for i in range(len(self))]
-
- def _generate_random_image_paths(self) -> list[str]:
- image_paths: list[str] = []
- for image_idx in range(self.num_input_images):
- img = Image.new("RGB", (512, 512), (255, 255, 255))
- image_path = os.path.join(
- tempfile.gettempdir(),
- f"diffusion_benchmark_random_image_{image_idx}.png",
- )
- img.save(image_path)
- image_paths.append(image_path)
- return image_paths
-
-
-def _compute_expected_latency_ms_from_base(req: RequestFuncInput, args, base_time_ms: float | None) -> float | None:
- """Compute expected execution time (ms) based on a base per-step-per-frame unit time.
-
- Assumes linear scaling with pixel area, frame count, and num_inference_steps.
- The base unit represents latency for a 16x16 resolution, single frame, single step.
- """
-
- if base_time_ms is None:
- return None
-
- width = req.width if req.width is not None else args.width
- height = req.height if req.height is not None else args.height
- if width is None or height is None:
- return None
-
- frames = req.num_frames if req.num_frames is not None else args.num_frames
- steps = req.num_inference_steps if req.num_inference_steps is not None else args.num_inference_steps
-
- frame_scale = frames if isinstance(frames, int) and frames > 0 else 1
- step_scale = steps if isinstance(steps, int) and steps > 0 else 1
-
- area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
- return float(base_time_ms) * area_units * frame_scale * step_scale
-
-
-def _infer_slo_base_time_ms_from_warmups(
- warmup_pairs: list[tuple[RequestFuncInput, RequestFuncOutput]],
- args,
-) -> float | None:
- """Infer base SLO unit time from warmup requests.
-
- Returns the median base latency (ms) for a 16x16 resolution, single-frame,
- single-step request. Only uses warmups that succeeded and have resolvable
- width/height.
- """
-
- candidates_ms: list[float] = []
- for req, out in warmup_pairs:
- if not out.success or out.latency <= 0:
- continue
-
- width = req.width if req.width is not None else args.width
- height = req.height if req.height is not None else args.height
- if width is None or height is None:
- continue
-
- frames = req.num_frames if req.num_frames is not None else args.num_frames
- steps = req.num_inference_steps if req.num_inference_steps is not None else args.num_inference_steps
-
- frame_scale = int(frames) if isinstance(frames, int) and frames > 0 else 1
- step_scale = int(steps) if isinstance(steps, int) and steps > 0 else 1
-
- area_units = max((float(width) * float(height)) / float(16 * 16), 1.0)
- denom = area_units * float(frame_scale) * float(step_scale)
- if denom <= 0:
- continue
-
- candidates_ms.append((out.latency * 1000.0) / denom)
-
- if not candidates_ms:
- return None
- return float(np.median(candidates_ms))
-
-
-def _populate_slo_ms_from_warmups(
- requests_list: list[RequestFuncInput],
- warmup_pairs: list[tuple[RequestFuncInput, RequestFuncOutput]],
- args,
-) -> list[RequestFuncInput]:
- """Populate missing RequestFuncInput.slo_ms using warmup outputs.
-
- - If a request already has slo_ms (e.g., trace-provided), it is kept as-is.
- - If any request has slo_ms set to None and we can infer base time from warmups,
- we estimate each missing request's expected execution time and set:
- req.slo_ms = expected_latency_ms * args.slo_scale
-
- Returns updated requests_list.
- """
-
- if not any(req.slo_ms is None for req in requests_list):
- return requests_list
-
- base_time_ms = _infer_slo_base_time_ms_from_warmups(warmup_pairs, args)
- if base_time_ms is None:
- return requests_list
-
- slo_scale = float(getattr(args, "slo_scale", 3.0))
- if slo_scale <= 0:
- raise ValueError(f"slo_scale must be positive, got {slo_scale}.")
-
- updated: list[RequestFuncInput] = []
- for req in requests_list:
- if req.slo_ms is not None:
- updated.append(req)
- continue
- expected_ms = _compute_expected_latency_ms_from_base(req, args, base_time_ms)
- updated.append(replace(req, slo_ms=(expected_ms * slo_scale) if expected_ms is not None else None))
-
- return updated
-
-
-async def iter_requests(
- requests_list: list[RequestFuncInput],
- request_rate: float,
-) -> AsyncGenerator[RequestFuncInput, None]:
- """Yield requests using a Poisson process if request_rate is set.
-
- - If request_rate is inf, all requests are yielded immediately (no sleep).
- - Otherwise, inter-arrival times follow an exponential distribution.
- """
-
- if request_rate != float("inf"):
- if request_rate <= 0:
- raise ValueError(f"request_rate must be positive or inf, got {request_rate}.")
-
- for i, req in enumerate(requests_list):
- if request_rate != float("inf") and i > 0:
- interval_s = random.expovariate(request_rate)
- await asyncio.sleep(interval_s)
- yield req
-
-
-def calculate_metrics(
- outputs: list[RequestFuncOutput],
- total_duration: float,
- requests_list: list[RequestFuncInput],
- args,
- slo_enabled: bool,
-):
- success_outputs = [o for o in outputs if o.success]
- error_outputs = [o for o in outputs if not o.success]
-
- num_success = len(success_outputs)
- latencies = [o.latency for o in success_outputs]
- peak_memories = [o.peak_memory_mb for o in success_outputs if o.peak_memory_mb > 0]
-
- # Aggregate per-stage durations across all successful requests that reported them.
- stage_duration_lists: dict[str, list[float]] = {}
- for o in success_outputs:
- for stage, duration in (o.stage_durations or {}).items():
- stage_duration_lists.setdefault(stage, []).append(duration)
- stage_durations_mean = {s: float(np.mean(v)) for s, v in stage_duration_lists.items()}
- stage_durations_p50 = {s: float(np.percentile(v, 50)) for s, v in stage_duration_lists.items()}
- stage_durations_p99 = {s: float(np.percentile(v, 99)) for s, v in stage_duration_lists.items()}
-
- metrics = {
- "duration": total_duration,
- "completed_requests": num_success,
- "failed_requests": len(error_outputs),
- "throughput_qps": num_success / total_duration if total_duration > 0 else 0,
- "latency_mean": np.mean(latencies) if latencies else 0,
- "latency_median": np.median(latencies) if latencies else 0,
- "latency_p99": np.percentile(latencies, 99) if latencies else 0,
- "latency_p95": np.percentile(latencies, 95) if latencies else 0,
- "latency_p50": np.percentile(latencies, 50) if latencies else 0,
- "peak_memory_mb_max": max(peak_memories) if peak_memories else 0,
- "peak_memory_mb_mean": np.mean(peak_memories) if peak_memories else 0,
- "peak_memory_mb_median": np.median(peak_memories) if peak_memories else 0,
- "stage_durations_mean": stage_durations_mean,
- "stage_durations_p50": stage_durations_p50,
- "stage_durations_p99": stage_durations_p99,
- }
-
- if slo_enabled:
- slo_defined_total = 0
- slo_met_success = 0
-
- for req, out in zip(requests_list, outputs):
- if req.slo_ms is None:
- continue
- slo_defined_total += 1
- if out.slo_achieved is None:
- continue
- if out.slo_achieved:
- slo_met_success += 1
-
- slo_attain_all = (slo_met_success / slo_defined_total) if slo_defined_total > 0 else 0.0
-
- metrics.update(
- {
- "slo_attainment_rate": slo_attain_all,
- "slo_met_success": slo_met_success,
- "slo_scale": getattr(args, "slo_scale", 3.0),
- }
- )
-
- return metrics
-
-
-def wait_for_service(base_url: str, timeout: int = 120) -> None:
- print(f"Waiting for service at {base_url}...")
- start_time = time.time()
- while True:
- try:
- # Poll the /health endpoint until it returns 200
- resp = requests.get(f"{base_url}/health", timeout=1)
- if resp.status_code == 200:
- print("Service is ready.")
- break
- except requests.exceptions.RequestException:
- pass
-
- if time.time() - start_time > timeout:
- raise TimeoutError(f"Service at {base_url} did not start within {timeout} seconds.")
-
- time.sleep(1)
-
-
-async def benchmark(args):
- # Construct base_url if not provided
- if args.base_url is None:
- args.base_url = f"http://{args.host}:{args.port}"
-
- VIDEO_TASKS = {"t2v", "i2v", "ti2v"}
- IMAGE_TASKS = {"t2i", "i2i", "ti2i"}
-
- if args.task in VIDEO_TASKS:
- task_type = "2v"
- elif args.task in IMAGE_TASKS:
- task_type = "2i"
- else:
- raise ValueError(
- f"Unsupported task: '{args.task}'. "
- f"Valid video tasks: {sorted(VIDEO_TASKS)}, "
- f"Valid image tasks: {sorted(IMAGE_TASKS)}"
- )
-
- valid_backends = sorted(backends_function_mapping[task_type].keys())
-
- if args.backend not in valid_backends:
- logger.error(
- f"Invalid backend '{args.backend}' for task '{args.task}' (task type: '{task_type}').\n"
- f"Valid backends for this task type: {valid_backends}\n"
- f"Example usage: --task {args.task} --backend {valid_backends[0]}"
- )
- raise ValueError("Backend validation failed. See log above for valid options.")
-
- # Setup API URL and request function based on backend
- request_func, api_url = backends_function_mapping[task_type][args.backend]
- api_url = f"{args.base_url}{api_url}"
-
- if args.dataset == "vbench":
- dataset = VBenchDataset(args, api_url, args.model)
- elif args.dataset == "trace":
- dataset = TraceDataset(args, api_url, args.model)
- elif args.dataset == "random":
- dataset = RandomDataset(args, api_url, args.model, args.enable_negative_prompt)
- else:
- raise ValueError(f"Unknown dataset: {args.dataset}")
-
- print("Loading requests...")
- requests_list = dataset.get_requests()
- print(f"Prepared {len(requests_list)} requests from {args.dataset} dataset.")
-
- # Limit concurrency
- if args.max_concurrency is not None:
- semaphore = asyncio.Semaphore(args.max_concurrency)
- else:
- semaphore = None
-
- async def limited_request_func(req, session, pbar):
- if semaphore:
- async with semaphore:
- return await request_func(req, session, pbar)
- else:
- return await request_func(req, session, pbar)
-
- # Run benchmark
- pbar = tqdm(total=len(requests_list), disable=args.disable_tqdm)
-
- async with aiohttp.ClientSession() as session:
- warmup_pairs: list[tuple[RequestFuncInput, RequestFuncOutput]] = []
- if args.warmup_requests and requests_list:
- print(
- f"Running {args.warmup_requests} warmup request(s) "
- f"with num_inference_steps={args.warmup_num_inference_steps}..."
- )
- for i in range(args.warmup_requests):
- warm_req = requests_list[i % len(requests_list)]
- if args.warmup_num_inference_steps is not None:
- warm_req = replace(
- warm_req,
- num_inference_steps=args.warmup_num_inference_steps,
- )
- if args.task == "t2v":
- warm_req = replace(warm_req, num_frames=1)
- warm_out = await limited_request_func(warm_req, session, None)
- warmup_pairs.append((warm_req, warm_out))
-
- if args.slo:
- # Prefer trace-provided per-request slo_ms. Only populate when missing.
- requests_list = _populate_slo_ms_from_warmups(
- requests_list=requests_list,
- warmup_pairs=warmup_pairs,
- args=args,
- )
-
- start_time = time.perf_counter()
- tasks = []
- async for req in iter_requests(requests_list=requests_list, request_rate=args.request_rate):
- task = asyncio.create_task(limited_request_func(req, session, pbar))
- tasks.append(task)
-
- outputs = await asyncio.gather(*tasks)
- total_duration = time.perf_counter() - start_time
-
- pbar.close()
-
- # Calculate metrics
- metrics = calculate_metrics(outputs, total_duration, requests_list, args, args.slo)
-
- # Add configuration info to metrics for JSON output
- metrics["backend"] = args.backend
- metrics["model"] = args.model
- metrics["dataset"] = args.dataset
- metrics["task"] = args.task
-
- print("\n{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=60, c="="))
-
- # Section 1: Configuration
- print("{:<40} {:<15}".format("Backend:", args.backend))
- print("{:<40} {:<15}".format("Model:", args.model))
- print("{:<40} {:<15}".format("Dataset:", args.dataset))
- print("{:<40} {:<15}".format("Task:", args.task))
-
- # Section 2: Execution & Traffic
- print(f"{'-' * 50}")
- print("{:<40} {:<15.2f}".format("Benchmark duration (s):", metrics["duration"]))
- print("{:<40} {:<15}".format("Request rate:", str(args.request_rate)))
- print(
- "{:<40} {:<15}".format(
- "Max request concurrency:",
- str(args.max_concurrency) if args.max_concurrency else "not set",
- )
- )
- print("{:<40} {}/{:<15}".format("Successful requests:", metrics["completed_requests"], len(requests_list)))
-
- # Section 3: Performance Metrics
- print(f"{'-' * 50}")
-
- print("{:<40} {:<15.2f}".format("Request throughput (req/s):", metrics["throughput_qps"]))
- print("{:<40} {:<15.4f}".format("Latency Mean (s):", metrics["latency_mean"]))
- print("{:<40} {:<15.4f}".format("Latency Median (s):", metrics["latency_median"]))
- print("{:<40} {:<15.4f}".format("Latency P99 (s):", metrics["latency_p99"]))
- print("{:<40} {:<15.4f}".format("Latency P95 (s):", metrics["latency_p95"]))
-
- if args.slo:
- print(f"{'-' * 50}")
- print("{:<40} {:<15.2%}".format("SLO Attainment Rate (all):", metrics.get("slo_attainment_rate", 0.0)))
- print("{:<40} {:<15}".format("SLO Met (success count):", str(metrics.get("slo_met_success", 0))))
- print("{:<40} {:<15}".format("SLO Scale:", str(metrics.get("slo_scale", 3.0))))
-
- if metrics["peak_memory_mb_max"] > 0:
- print(f"{'-' * 50}")
- print("{:<40} {:<15.2f}".format("Peak Memory Max (MB):", metrics["peak_memory_mb_max"]))
- print("{:<40} {:<15.2f}".format("Peak Memory Mean (MB):", metrics["peak_memory_mb_mean"]))
- print("{:<40} {:<15.2f}".format("Peak Memory Median (MB):", metrics["peak_memory_mb_median"]))
-
- if metrics["stage_durations_mean"]:
- print(f"{'-' * 50}")
- print("Stage Durations Mean (s):")
- for stage, val in metrics["stage_durations_mean"].items():
- print("{:<40} {:<15.4f}".format(f" {stage}:", val))
-
- print("\n" + "=" * 60)
-
- if args.output_file:
- with open(args.output_file, "w") as f:
- json.dump(metrics, f, indent=2)
- print(f"Metrics saved to {args.output_file}")
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser(description="Benchmark serving for diffusion models.")
- parser.add_argument(
- "--base-url",
- type=str,
- default=None,
- help="Base URL of the server (e.g., http://localhost:8091). Overrides host/port.",
- )
- parser.add_argument("--host", type=str, default="localhost", help="Server host.")
- parser.add_argument("--port", type=int, default=8091, help="Server port.")
- parser.add_argument("--model", type=str, default="default", help="Model name.")
- parser.add_argument(
- "--backend",
- type=str,
- default="vllm-omni",
- choices=["vllm-omni", "openai", "v1/videos"],
- help="Backend to target the benchmark to.",
- )
- parser.add_argument(
- "--dataset",
- type=str,
- default="vbench",
- choices=["vbench", "trace", "random"],
- help="Dataset to use.",
- )
- parser.add_argument(
- "--task",
- type=str,
- default="t2v",
- choices=["t2v", "i2v", "ti2v", "ti2i", "i2i", "t2i"],
- help="Task type.",
- )
- parser.add_argument(
- "--dataset-path",
- type=str,
- default=None,
- help="Path to local dataset file (optional).",
- )
- parser.add_argument("--num-prompts", type=int, default=10, help="Number of prompts to benchmark.")
- parser.add_argument(
- "--max-concurrency",
- type=int,
- default=1,
- help="Maximum number of concurrent requests; defaults to `1`. This can be used "
- "to help simulate an environment where a higher level component "
- "is enforcing a maximum number of concurrent requests. While the "
- "--request-rate argument controls the rate at which requests are "
- "initiated, this argument will control how many are actually allowed "
- "to execute at a time. This means that when used in combination, the "
- "actual request rate may be lower than specified with --request-rate, "
- "if the server is not processing requests fast enough to keep up.",
- )
- parser.add_argument(
- "--request-rate",
- type=float,
- default=float("inf"),
- help="Number of requests per second. If this is inf, then all the requests are sent at time 0. "
- "Otherwise, a Poisson process is used to synthesize the request arrival times. Default is inf.",
- )
- parser.add_argument(
- "--warmup-requests",
- type=int,
- default=1,
- help="Number of warmup requests to run before measurement.",
- )
- parser.add_argument(
- "--warmup-num-inference-steps",
- type=int,
- default=1,
- help="num_inference_steps used for warmup requests.",
- )
- parser.add_argument("--width", type=int, default=None, help="Image/Video width.")
- parser.add_argument("--height", type=int, default=None, help="Image/Video height.")
- parser.add_argument("--num-frames", type=int, default=None, help="Number of frames (for video).")
- parser.add_argument(
- "--num-inference-steps",
- type=int,
- default=50,
- help="Number of inference steps (for diffusion models).",
- )
- parser.add_argument(
- "--seed",
- type=int,
- default=None,
- help="Random seed (for diffusion models).",
- )
- parser.add_argument("--fps", type=int, default=None, help="FPS (for video).")
- parser.add_argument("--output-file", type=str, default=None, help="Output JSON file for metrics.")
- parser.add_argument(
- "--slo",
- action="store_true",
- help=(
- "Enable SLO calculation and reporting. If trace provides per-request slo_ms, it is used. "
- "Otherwise, warmup request(s) are used to infer expected execution time assuming linear "
- "scaling by resolution, frames, and steps, then slo_ms = expected_time * --slo-scale."
- ),
- )
- parser.add_argument(
- "--slo-scale",
- type=float,
- default=3.0,
- help="SLO target multiplier: slo_ms = estimated_exec_time_ms * slo_scale (default: 3).",
- )
- parser.add_argument("--disable-tqdm", action="store_true", help="Disable progress bar.")
- parser.add_argument(
- "--enable-negative-prompt",
- action="store_true",
- default=False,
- help="Generate negative prompts when using the random dataset.",
- )
- parser.add_argument(
- "--random-request-config",
- type=str,
- default=None,
- help=(
- "JSON string defining random request profiles. "
- "Each profile may contain: width, height, num_inference_steps, etc. "
- "The 'weight' field controls sampling probability (relative weight). "
- "Example: "
- '[{"width":512,"height":512,"num_inference_steps":20,"weight":0.15},'
- '{"width":768,"height":768,"num_inference_steps":20,"weight":0.85}]'
- ),
- )
- parser.add_argument(
- "--num-input-images",
- type=int,
- default=1,
- help=(
- "Number of synthetic input images to attach for image-conditioned tasks "
- "(i2v, ti2v, ti2i, i2i) when using random dataset."
- ),
- )
-
- args = parser.parse_args()
-
- asyncio.run(benchmark(args))
diff --git a/benchmarks/diffusion/performance_dashboard/qwen_image_serving_performance.md b/benchmarks/diffusion/performance_dashboard/qwen_image_serving_performance.md
deleted file mode 100644
index ce022f1a8d9..00000000000
--- a/benchmarks/diffusion/performance_dashboard/qwen_image_serving_performance.md
+++ /dev/null
@@ -1,169 +0,0 @@
-# Qwen-Image Serving Performance Dashboard
-
-This document describes how to deploy and benchmark **Qwen-Image** using vLLM-Omni. It includes service startup configuration, acceleration-related options, benchmark methodology, dataset settings, and performance results.
-
----
-
-# 1. Overview
-
-Qwen-Image is a multimodal text-to-image generation model served through the vLLM-Omni infrastructure.
-
-This document covers:
-
-* Service launch configuration (including acceleration options)
-* Benchmark scripts and usage
-* Dataset and workload settings
-* Performance measurement results
-* Reproducibility guidelines
-
----
-
-# 2. Test Environment
-| Component | Specification |
-|------------|----------------|
-| GPU | NVIDIA A100-SXM4-80GB |
-| Diffusion Attention Backend | FlashAttention |
-
-# 3. Service Launch Configuration
-
-## 3.1 Basic Serving Command
-
-```bash
-vllm serve Qwen/Qwen-Image --omni \
- --port 8091
-```
-
-## 3.2 Key Parameters
-
-| Parameter | Description |
-| --------------------- | ------------------------ |
-| `--cfg-parallel-size` | CFG parallelism degree |
-| `--ulysses-degree` | Ulysses parallel degree |
-| `--vae-patch-parallel-size` | VAE parallel degree |
-| `--tensor-parallel-size` | Tensor parallelism degree |
-
-Record these parameters when reporting performance results.
-
----
-
-# 4. Benchmark Script
-
-## 4.1 Benchmark Entry
-
-```bash
-python benchmarks/diffusion/diffusion_benchmark_serving.py \
- --backend vllm-omni \
- --dataset DATASET \
- --task t2i \
- --num-prompts NUM_PROMPTS \
- --max-concurrency MAX_CONCURRENCY \
- --enable-negative-prompt \
- --random-request-config 'JSON_PROFILES'
-```
-
-## 4.2 Key Benchmark Arguments
-
-| Parameter | Description |
-| ---------------------- | --------------------------------- |
-| `--backend` | Serving backend (use `vllm-omni`) |
-| `--dataset` | Dataset name (`random` or custom) |
-| `--task` | Task type (e.g., `t2i`) |
-| `--num-prompts` | Total number of requests |
-| `--max-concurrency` | Client-side concurrency |
-| `--random-request-config`| JSON string defining random request profiles |
-
----
-
-# 5. Dataset & Workload Settings
-
-## 5.1 Recommended Evaluation Configurations
-
-### Dataset A (512 Resolution)
-
-* Dataset: `random`
-* Task: t2i
-* Concurrency: 1
-* Request config (single resolution):
-```
-[
- {"width":512,"height":512,"num_inference_steps":20,"weight":1}
-]
-```
-
-### Dataset B (1536 Resolution)
-
-* Dataset: `random`
-* Task: t2i
-* Concurrency: 1
-* Request config (single resolution):
-```
-[
- {"width":1536,"height":1536,"num_inference_steps":35,"weight":1}
-]
-```
-
-### Dataset C (Mix Resolution)
-
-* Dataset: `random`
-* Task: t2i
-* Concurrency: 1
-* Mix Resolution
-```
-[
- {"width":512,"height":512,"num_inference_steps":20,"weight":0.15},
- {"width":768,"height":768,"num_inference_steps":20,"weight":0.25},
- {"width":1024,"height":1024,"num_inference_steps":25,"weight":0.45},
- {"width":1536,"height":1536,"num_inference_steps":35,"weight":0.15}
-]
-```
----
-
-## 5.2 Example Benchmark Command
-
-```bash
-python benchmarks/diffusion/diffusion_benchmark_serving.py \
- --backend vllm-omni \
- --dataset random \
- --task t2i \
- --num-prompts 1 \
- --max-concurrency 1 \
- --enable-negative-prompt \
- --random-request-config '[
- {"width":512,"height":512,"num_inference_steps":20,"weight":1}
- ]'
-```
-
----
-
-# 6. Performance Metrics
-
-The following metrics are collected during benchmarking:
-
-| Metric | Description | Unit |
-| ------------------ | ----------------------------- | ------- |
-| Mean Latency | Mean request latency | seconds |
-| P99 Latency | 99th-percentile request latency | seconds |
-
----
-
-# 7. Performance Results
-
-| Dataset Configuration | Max Concurrency | CFG | USP | TP | VAE Parallel | Mean Latency (s) | P99 Latency (s) |
-|-----------------------|-----|-----|-----|----|--------------|------------------|------------------|
-| Dataset A | 1 | 2 | 2 | Off | Off | 2.2087 | 2.2087 |
-| Dataset B | 1 | 2 | 2 | Off | Off | 19.6739 | 19.6739 |
-| Dataset C | 1 | 2 | 2 | Off | Off | 5.67259 | 18.6234 |
----
-
-# 8. Reproducibility Checklist
-
-To ensure consistent and comparable benchmark results:
-
-* Record GPU type
-* Record parallel configuration
-* Record benchmark parameters (resolution, concurrency, number of prompts)
-* Ensure no background workload on GPUs during testing
-
----
-
-This document serves as the official Qwen-Image serving performance reference under vLLM-Omni.
diff --git a/benchmarks/diffusion/performance_dashboard/wan_2_2_serving_performance.md b/benchmarks/diffusion/performance_dashboard/wan_2_2_serving_performance.md
deleted file mode 100644
index 9d6c40ece36..00000000000
--- a/benchmarks/diffusion/performance_dashboard/wan_2_2_serving_performance.md
+++ /dev/null
@@ -1,170 +0,0 @@
-# Wan2.2 Serving Performance Dashboard
-
-This document describes how to deploy and benchmark **Wan-AI/Wan2.2-T2V-A14B-Diffusers** using vLLM-Omni. It includes service startup configuration, acceleration-related options, benchmark methodology, dataset settings, and performance results.
-
----
-
-# 1. Overview
-
-Wan-AI/Wan2.2-T2V-A14B-Diffusers is a multimodal text-to-video generation model served through the vLLM-Omni infrastructure.
-
-This document covers:
-
-* Service launch configuration (including acceleration options)
-* Benchmark scripts and usage
-* Dataset and workload settings
-* Performance measurement results
-* Reproducibility guidelines
-
----
-
-# 2. Test Environment
-| Component | Specification |
-|------------|----------------|
-| GPU | NVIDIA A100-SXM4-80GB |
-| Diffusion Attention Backend | FlashAttention |
-
-# 3. Service Launch Configuration
-
-## 3.1 Basic Serving Command
-
-```bash
-vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni \
- --port 8091
-```
-
-## 3.2 Key Parameters
-
-| Parameter | Description |
-| --------------------- | ------------------------ |
-| `--cfg-parallel-size` | CFG parallelism degree |
-| `--ulysses-degree` | Ulysses parallel degree |
-| `--vae-patch-parallel-size` | VAE parallel degree |
-| `--tensor-parallel-size` | Tensor parallelism degree |
-| `--use-hsdp` | Enable Hybrid Sharded Data Parallel to shard model weights across GPUs |
-
-Record these parameters when reporting performance results.
-
----
-
-# 4. Benchmark Script
-
-## 4.1 Benchmark Entry
-
-```bash
-python benchmarks/diffusion/diffusion_benchmark_serving.py \
- --backend v1/videos \
- --dataset DATASET \
- --task t2v \
- --num-prompts NUM_PROMPTS \
- --max-concurrency MAX_CONCURRENCY \
- --enable-negative-prompt \
- --random-request-config 'JSON_PROFILES'
-```
-
-## 4.2 Key Benchmark Arguments
-
-| Parameter | Description |
-| ---------------------- | --------------------------------- |
-| `--backend` | Serving backend (use `v1/videos`) |
-| `--dataset` | Dataset name (`random` or custom) |
-| `--task` | Task type (e.g., `t2v`) |
-| `--num-prompts` | Total number of requests |
-| `--max-concurrency` | Client-side concurrency |
-| `--random-request-config`| JSON string defining random request profiles |
-
----
-
-# 5. Dataset & Workload Settings
-
-## 5.1 Recommended Evaluation Configurations
-
-### Dataset A (480p)
-
-* Dataset: `random`
-* Task: t2v
-* Concurrency: 1
-* Request config (single resolution):
-```
-[
- {"width":854,"height":480,"num_inference_steps":3,"num_frames":80,"fps":16,"weight":1}
-]
-```
-### Dataset B (720p)
-
-* Dataset: `random`
-* Task: t2v
-* Concurrency: 1
-* Request config (single resolution):
-```
-[
- {"width":1280,"height":720,"num_inference_steps":6,"num_frames":80,"fps":16,"weight":1}
-]
-```
-### Dataset C (Mix Resolution)
-
-* Dataset: `random`
-* Task: t2v
-* Concurrency: 1
-* Mix Resolution
-```
-[
- {"width":854,"height":480,"num_inference_steps":3,"num_frames":80,"fps":16,"weight":0.15},
- {"width":854,"height":480,"num_inference_steps":4,"num_frames":120,"fps":24,"weight":0.25},
- {"width":1280,"height":720,"num_inference_steps":6,"num_frames":80,"fps":16,"weight":0.6}
-]
-```
----
-
-## 5.2 Example Benchmark Command
-
-```bash
-python benchmarks/diffusion/diffusion_benchmark_serving.py \
- --backend v1/videos \
- --dataset random \
- --task t2v \
- --num-prompts 1 \
- --max-concurrency 1 \
- --enable-negative-prompt \
- --random-request-config '[
- {"width":854,"height":480,"num_inference_steps":18,"num_frames":33,"fps":16,"weight":1}
- ]'
-```
-
----
-
-# 6. Performance Metrics
-
-The following metrics are collected during benchmarking:
-
-| Metric | Description | Unit |
-| ------------------ | ----------------------------- | ------- |
-| Mean Latency | Mean request latency | seconds |
-| P99 Latency | 99th-percentile request latency | seconds |
-
----
-
-# 7. Performance Results
-
-| Dataset Configuration | Max Concurrency | CFG | USP | TP | HSDP | VAE Parallel | Mean Latency (s) | P99 Latency (s) |
-|-----------------------|-----|-----|-----|-----|----|--------------|------------------|------------------|
-| Dataset A | 1 | 2 | 2 | 1 | On | 1 | 24.6766 | 24.6766 |
-| Dataset A | 1 | 2 | 2 | 1 | On | 4 | 21.6810 | 21.6810 |
-| Dataset B | 1 | 2 | 2 | 1 | On | 1 | 124.6639 | 124.6639 |
-| Dataset B | 1 | 2 | 2 | 1 | On | 4 | 117.44 | 117.44 |
-| Dataset C | 1 | 2 | 2 | 1 | On | 1 | 79.2175 | 124.2565 |
-| Dataset C | 1 | 2 | 2 | 1 | On | 4 | 74.4977 | 117.710 |
----
-
-# 8. Reproducibility Checklist
-
-To ensure consistent and comparable benchmark results:
-
-* Record GPU type
-* Record parallel configuration
-* Record benchmark parameters (resolution, concurrency, number of prompts)
-* Ensure no background workload on GPUs during testing
-
----
-
-This document serves as the official Wan2.2 serving performance reference under vLLM-Omni.
diff --git a/benchmarks/diffusion/quantization_quality.py b/benchmarks/diffusion/quantization_quality.py
deleted file mode 100644
index 4a916e7ea62..00000000000
--- a/benchmarks/diffusion/quantization_quality.py
+++ /dev/null
@@ -1,460 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-"""
-Benchmark quantization quality loss for diffusion models (image & video).
-
-Generates outputs with BF16 (baseline) and a quantized config using the same
-seed, then computes LPIPS perceptual distance between them. Results are printed
-as a Markdown table ready to paste into a PR description.
-
-Requirements:
- pip install lpips Pillow numpy
-
-Image example (text-to-image):
- python benchmarks/diffusion/quantization_quality.py \
- --model Tongyi-MAI/Z-Image-Turbo \
- --task t2i \
- --quantization fp8 \
- --prompts \
- "an aerial view of a coral reef with crystal clear turquoise water" \
- "a campfire in a dark forest with sparks rising into a starry sky" \
- "a gourmet dessert plate with chocolate mousse and gold leaf" \
- --height 1024 --width 1024 \
- --num-inference-steps 50 --seed 42
-
-Video example (text-to-video):
- python benchmarks/diffusion/quantization_quality.py \
- --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
- --task t2v \
- --quantization fp8 \
- --prompts \
- "A serene lakeside sunrise with mist over the water" \
- "A cat walking across a wooden bridge in autumn" \
- --height 720 --width 1280 \
- --num-frames 81 --num-inference-steps 40 --seed 42
-
-Multiple quantization methods:
- python benchmarks/diffusion/quantization_quality.py \
- --model Tongyi-MAI/Z-Image-Turbo \
- --task t2i \
- --quantization fp8 int8 bitsandbytes \
- --prompts "a cup of coffee on the table" \
- --height 1024 --width 1024 \
- --num-inference-steps 50 --seed 42
-
-Output directory structure (--output-dir, default: ./quant_bench_output):
- quant_bench_output/
- baseline/ # BF16 outputs
- METHOD/ # Quantized outputs, one directory per --quantization method
- results.md # Markdown table
-"""
-
-import argparse
-import gc
-import time
-from pathlib import Path
-
-import numpy as np
-import torch
-
-
-def compute_lpips_images(
- baseline_images: list,
- quantized_images: list,
- net: str = "alex",
-) -> list[float]:
- """Compute LPIPS between paired lists of PIL images."""
- import lpips
- from torchvision import transforms
-
- loss_fn = lpips.LPIPS(net=net).eval()
- if torch.cuda.is_available():
- loss_fn = loss_fn.cuda()
-
- transform = transforms.Compose(
- [
- transforms.Resize((256, 256)),
- transforms.ToTensor(),
- transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
- ]
- )
-
- scores = []
- for img_bl, img_qt in zip(baseline_images, quantized_images):
- t_bl = transform(img_bl.convert("RGB")).unsqueeze(0)
- t_qt = transform(img_qt.convert("RGB")).unsqueeze(0)
- if torch.cuda.is_available():
- t_bl, t_qt = t_bl.cuda(), t_qt.cuda()
- with torch.no_grad():
- score = loss_fn(t_bl, t_qt).item()
- scores.append(score)
- return scores
-
-
-def compute_lpips_video(
- baseline_frames: np.ndarray,
- quantized_frames: np.ndarray,
- net: str = "alex",
-) -> float:
- """Compute mean per-frame LPIPS for a video pair.
-
- Args:
- baseline_frames: (F, H, W, C) float array in [0, 1].
- quantized_frames: same shape.
-
- Returns:
- Mean LPIPS across all frames.
- """
- import lpips
-
- loss_fn = lpips.LPIPS(net=net).eval()
- if torch.cuda.is_available():
- loss_fn = loss_fn.cuda()
-
- num_frames = min(len(baseline_frames), len(quantized_frames))
- scores = []
- for i in range(num_frames):
- # Convert (H, W, C) float [0,1] -> (1, C, H, W) float [-1, 1]
- f_bl = torch.from_numpy(baseline_frames[i]).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
- f_qt = torch.from_numpy(quantized_frames[i]).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
- if torch.cuda.is_available():
- f_bl, f_qt = f_bl.cuda(), f_qt.cuda()
- with torch.no_grad():
- score = loss_fn(f_bl, f_qt).item()
- scores.append(score)
- return float(np.mean(scores))
-
-
-def _build_omni_kwargs(args, quantization=None):
- """Build kwargs dict for Omni() constructor."""
- from vllm_omni.diffusion.data import DiffusionParallelConfig
-
- parallel_config = DiffusionParallelConfig(
- ulysses_degree=args.ulysses_degree,
- ring_degree=args.ring_degree,
- tensor_parallel_size=args.tensor_parallel_size,
- )
- kwargs = {
- "model": args.model,
- "parallel_config": parallel_config,
- "enforce_eager": args.enforce_eager,
- }
- if quantization:
- kwargs["quantization_config"] = quantization
- return kwargs
-
-
-def _generate_image(omni, args, prompt, seed):
- """Generate a single image and return (PIL.Image, time_seconds, memory_gib)."""
- from vllm_omni.inputs.data import OmniDiffusionSamplingParams
- from vllm_omni.platforms import current_omni_platform
-
- generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(seed)
- torch.cuda.reset_peak_memory_stats()
- start = time.perf_counter()
- outputs = omni.generate(
- {"prompt": prompt},
- OmniDiffusionSamplingParams(
- height=args.height,
- width=args.width,
- generator=generator,
- num_inference_steps=args.num_inference_steps,
- ),
- )
- elapsed = time.perf_counter() - start
- peak_mem = torch.cuda.max_memory_allocated() / (1024**3)
-
- first = outputs[0]
- req_out = first.request_output[0] if hasattr(first, "request_output") else first
- img = req_out.images[0]
- return img, elapsed, peak_mem
-
-
-def _generate_video(omni, args, prompt, seed):
- """Generate a video and return (np.ndarray [F,H,W,C], time_seconds, memory_gib)."""
- from vllm_omni.inputs.data import OmniDiffusionSamplingParams
- from vllm_omni.outputs import OmniRequestOutput
- from vllm_omni.platforms import current_omni_platform
-
- generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(seed)
- torch.cuda.reset_peak_memory_stats()
- start = time.perf_counter()
- outputs = omni.generate(
- {"prompt": prompt, "negative_prompt": ""},
- OmniDiffusionSamplingParams(
- height=args.height,
- width=args.width,
- generator=generator,
- guidance_scale=args.guidance_scale,
- num_inference_steps=args.num_inference_steps,
- num_frames=args.num_frames,
- ),
- )
- elapsed = time.perf_counter() - start
- peak_mem = torch.cuda.max_memory_allocated() / (1024**3)
-
- first = outputs[0]
- if hasattr(first, "request_output") and isinstance(first.request_output, list):
- inner = first.request_output[0]
- if isinstance(inner, OmniRequestOutput) and hasattr(inner, "images"):
- frames = inner.images[0] if inner.images else None
- else:
- frames = inner
- elif hasattr(first, "images") and first.images:
- frames = first.images
- else:
- raise ValueError("Could not extract video frames from output.")
-
- if isinstance(frames, torch.Tensor):
- video = frames.detach().cpu()
- if video.dim() == 5:
- video = video[0].permute(1, 2, 3, 0) if video.shape[1] in (3, 4) else video[0]
- elif video.dim() == 4 and video.shape[0] in (3, 4):
- video = video.permute(1, 2, 3, 0)
- if video.is_floating_point():
- video = video.clamp(-1, 1) * 0.5 + 0.5
- frames_array = video.float().numpy()
- else:
- frames_array = np.asarray(frames)
- if frames_array.ndim == 5:
- frames_array = frames_array[0]
-
- return frames_array, elapsed, peak_mem
-
-
-def _unload_omni(omni):
- """Delete Omni instance and free GPU memory."""
- del omni
- gc.collect()
- if torch.cuda.is_available():
- torch.cuda.empty_cache()
- torch.cuda.synchronize()
-
-
-def run_benchmark(args):
- from vllm_omni.entrypoints.omni import Omni
-
- output_dir = Path(args.output_dir)
- output_dir.mkdir(parents=True, exist_ok=True)
-
- is_video = args.task == "t2v"
- prompts = args.prompts
- seed = args.seed
-
- # Determine configs to benchmark
- configs = [] # list of (label, quantization_method)
- for method in args.quantization:
- configs.append((method, method))
-
- # --- Baseline run ---
- print("\n" + "=" * 60)
- print("Running BF16 baseline...")
- print("=" * 60)
- bl_kwargs = _build_omni_kwargs(args, quantization=None)
- omni_bl = Omni(**bl_kwargs)
-
- baseline_outputs = {} # prompt -> (output, time, mem)
- for prompt in prompts:
- print(f" Generating: {prompt[:60]}...")
- if is_video:
- out, t, mem = _generate_video(omni_bl, args, prompt, seed)
- else:
- out, t, mem = _generate_image(omni_bl, args, prompt, seed)
- baseline_outputs[prompt] = (out, t, mem)
-
- bl_avg_time = np.mean([v[1] for v in baseline_outputs.values()])
- bl_mem = baseline_outputs[prompts[0]][2] # use first prompt's memory
- _unload_omni(omni_bl)
-
- # Save baseline outputs
- bl_dir = output_dir / "baseline"
- bl_dir.mkdir(parents=True, exist_ok=True)
- for i, prompt in enumerate(prompts):
- out = baseline_outputs[prompt][0]
- if is_video:
- try:
- from diffusers.utils import export_to_video
-
- frames_list = list(out) if isinstance(out, np.ndarray) and out.ndim == 4 else out
- export_to_video(frames_list, str(bl_dir / f"prompt_{i}.mp4"), fps=args.fps)
- except ImportError:
- np.save(bl_dir / f"prompt_{i}.npy", out)
- else:
- out.save(bl_dir / f"prompt_{i}.png")
-
- # --- Quantized runs ---
- all_results = [] # list of dicts
-
- for config_label, quant_method in configs:
- print(f"\n{'=' * 60}")
- print(f"Running: {config_label}...")
- print("=" * 60)
-
- qt_kwargs = _build_omni_kwargs(args, quantization=quant_method)
- omni_qt = Omni(**qt_kwargs)
-
- qt_outputs = {}
- for prompt in prompts:
- print(f" Generating: {prompt[:60]}...")
- if is_video:
- out, t, mem = _generate_video(omni_qt, args, prompt, seed)
- else:
- out, t, mem = _generate_image(omni_qt, args, prompt, seed)
- qt_outputs[prompt] = (out, t, mem)
-
- qt_avg_time = np.mean([v[1] for v in qt_outputs.values()])
- qt_mem = qt_outputs[prompts[0]][2]
- _unload_omni(omni_qt)
-
- # Save quantized outputs
- qt_dir = output_dir / config_label.replace(" ", "_")
- qt_dir.mkdir(parents=True, exist_ok=True)
-
- # Compute LPIPS per prompt
- per_prompt = []
- for i, prompt in enumerate(prompts):
- bl_out = baseline_outputs[prompt][0]
- qt_out = qt_outputs[prompt][0]
- if is_video:
- lpips_score = compute_lpips_video(bl_out, qt_out, net=args.lpips_net)
- try:
- from diffusers.utils import export_to_video
-
- frames_list = list(qt_out) if isinstance(qt_out, np.ndarray) and qt_out.ndim == 4 else qt_out
- export_to_video(frames_list, str(qt_dir / f"prompt_{i}.mp4"), fps=args.fps)
- except ImportError:
- np.save(qt_dir / f"prompt_{i}.npy", qt_out)
- else:
- lpips_score = compute_lpips_images([bl_out], [qt_out], net=args.lpips_net)[0]
- qt_out.save(qt_dir / f"prompt_{i}.png")
- per_prompt.append({"prompt": prompt, "lpips": lpips_score})
-
- mean_lpips = np.mean([p["lpips"] for p in per_prompt])
- speedup = bl_avg_time / qt_avg_time if qt_avg_time > 0 else float("inf")
- mem_reduction = (bl_mem - qt_mem) / bl_mem * 100 if bl_mem > 0 else 0.0
-
- all_results.append(
- {
- "config": config_label,
- "avg_time": qt_avg_time,
- "speedup": speedup,
- "memory_gib": qt_mem,
- "mem_reduction_pct": mem_reduction,
- "mean_lpips": mean_lpips,
- "per_prompt": per_prompt,
- }
- )
-
- # --- Print results ---
- print("\n\n")
- print("=" * 80)
- print("RESULTS")
- print("=" * 80)
-
- # Summary table
- lines = []
- lines.append(f"## Quantization Quality Benchmark — {args.model.split('/')[-1]}")
- lines.append(
- f"Setup: {args.height}x{args.width}, {args.num_inference_steps} steps, "
- f"seed={args.seed}, LPIPS ({args.lpips_net})"
- )
- if is_video:
- lines.append(f"Video: {args.num_frames} frames")
- lines.append("")
- lines.append("### Summary")
- lines.append("")
- lines.append("| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |")
- lines.append("|--------|----------|---------|--------------|---------------|------------|")
- lines.append(f"| BF16 baseline | {bl_avg_time:.2f}s | 1.00x | {bl_mem:.2f} | — | (ref) |")
- for r in all_results:
- lines.append(
- f"| {r['config']} | {r['avg_time']:.2f}s | {r['speedup']:.2f}x "
- f"| {r['memory_gib']:.2f} | {r['mem_reduction_pct']:.0f}% "
- f"| {r['mean_lpips']:.4f} |"
- )
- lines.append("")
- lines.append("> LPIPS < 0.01 = imperceptible, > 0.1 = clearly noticeable.")
- lines.append("")
-
- # Per-prompt table
- if len(prompts) > 1:
- lines.append("### Per-Prompt LPIPS")
- lines.append("")
- header = "| Prompt |"
- sep = "|--------|"
- for r in all_results:
- header += f" {r['config']} |"
- sep += "--------|"
- lines.append(header)
- lines.append(sep)
- for i, prompt in enumerate(prompts):
- short = prompt[:50] + "..." if len(prompt) > 50 else prompt
- row = f"| {short} |"
- for r in all_results:
- row += f" {r['per_prompt'][i]['lpips']:.4f} |"
- lines.append(row)
- lines.append("")
-
- md = "\n".join(lines)
- print(md)
-
- # Save markdown
- results_path = output_dir / "results.md"
- results_path.write_text(md, encoding="utf-8")
- print(f"\nResults saved to {results_path}")
- print(f"Baseline outputs in {bl_dir}")
- for r in all_results:
- qt_dir = output_dir / r["config"].replace(" ", "_")
- print(f"Quantized outputs in {qt_dir}")
-
-
-def parse_args():
- parser = argparse.ArgumentParser(
- description="Benchmark quantization quality loss for diffusion models.",
- formatter_class=argparse.RawDescriptionHelpFormatter,
- )
- parser.add_argument("--model", required=True, help="Model name or local path.")
- parser.add_argument(
- "--task",
- default="t2i",
- choices=["t2i", "t2v"],
- help="Task type: t2i (text-to-image) or t2v (text-to-video).",
- )
- parser.add_argument(
- "--quantization",
- nargs="+",
- required=True,
- help="One or more quantization methods to benchmark (e.g. fp8 int8 bitsandbytes).",
- )
- parser.add_argument(
- "--prompts",
- nargs="+",
- default=["a cup of coffee on the table"],
- help="One or more prompts to generate.",
- )
- parser.add_argument("--seed", type=int, default=42)
- parser.add_argument("--height", type=int, default=1024)
- parser.add_argument("--width", type=int, default=1024)
- parser.add_argument("--num-inference-steps", type=int, default=50)
- parser.add_argument("--num-frames", type=int, default=81, help="Number of video frames (t2v only).")
- parser.add_argument("--fps", type=int, default=24, help="Video FPS for saving (t2v only).")
- parser.add_argument("--guidance-scale", type=float, default=4.0, help="CFG scale (used for video).")
- parser.add_argument("--output-dir", type=str, default="./quant_bench_output", help="Directory to save outputs.")
- parser.add_argument(
- "--lpips-net",
- type=str,
- default="alex",
- choices=["alex", "vgg", "squeeze"],
- help="LPIPS backbone network.",
- )
- parser.add_argument("--ulysses-degree", type=int, default=1)
- parser.add_argument("--ring-degree", type=int, default=1)
- parser.add_argument("--tensor-parallel-size", type=int, default=1)
- parser.add_argument("--enforce-eager", action="store_true")
- return parser.parse_args()
-
-
-if __name__ == "__main__":
- args = parse_args()
- run_benchmark(args)
diff --git a/benchmarks/distributed/omni_connectors/README.md b/benchmarks/distributed/omni_connectors/README.md
deleted file mode 100644
index ab7346441bb..00000000000
--- a/benchmarks/distributed/omni_connectors/README.md
+++ /dev/null
@@ -1,397 +0,0 @@
-# RDMA Test Configuration Guide
-
-This document explains how to configure the RDMA environment and run tests for `MooncakeTransferEngineConnector`.
-
-## Table of Contents
-
-- [Docker Container Permissions](#docker-container-permissions)
-- [Single-Node Testing](#single-node-testing)
-- [Multi-Node Testing](#multi-node-testing)
-- [Running Tests](#running-tests)
-- [Cross-Node Testing](#cross-node-testing)
-- [Troubleshooting](#troubleshooting)
-
----
-
-## Docker Container Permissions
-
-RDMA tests require access to InfiniBand/RoCE devices and system topology. Add the following permissions when running `docker run`.
-
-### Option 1: Minimal Permissions (Recommended)
-
-```bash
-docker run -it \
- --cap-add=SYS_PTRACE \
- --cap-add=IPC_LOCK \
- --security-opt seccomp=unconfined \
- --network=host \
- --device=/dev/infiniband \
- -v /sys/class/infiniband:/sys/class/infiniband:ro \
- your-image:tag
-```
-
-Parameter explanation:
-- `--cap-add=SYS_PTRACE`: Allow reading system topology information
-- `--cap-add=IPC_LOCK`: Allow memory locking (required for RDMA memory registration)
-- `--security-opt seccomp=unconfined`: Disable seccomp restrictions
-- `--network=host`: Use host network (required for RDMA)
-- `--device=/dev/infiniband`: Mount InfiniBand devices
-- `-v /sys/class/infiniband`: Mount IB device info (read-only)
-
-### Option 2: Full Permissions (Quick but not recommended for production)
-
-```bash
-docker run -it \
- --privileged \
- --network=host \
- your-image:tag
-```
-
-`--privileged` grants full host permissions. It is suitable for quick testing but not recommended for production.
-
----
-
-## Single-Node Testing
-
-When running single-node tests (producer and consumer on the same machine), ensure they use the **same RDMA device**.
-
-### Problem Background
-
-InfiniBand devices use LID (Local Identifier) for routing. Different devices have different LIDs and cannot communicate directly. If no device is specified, Mooncake may assign different devices to connectors, causing handshake failures.
-
-Common error:
-```
-[Handshake] Failed to modify QP to RTR, check mtu, gid, peer lid, peer qp num: Invalid argument [22]
-```
-
-### Solution
-
-**Method 1: Set Environment Variable (Recommended)**
-
-```bash
-# List available RDMA devices
-ibstat
-
-# Select a device (e.g., mlx5_0)
-export RDMA_DEVICE_NAME='mlx5_0'
-
-# Run tests
-pytest test_mooncake_transfer_engine_rdma.py -v -s
-```
-
-**Method 2: Use RoCE Devices**
-
-If the system has RoCE devices (using IPv4 routing), the test code will automatically detect and prefer them. RoCE device GIDs start with `00:00:00:00:00:00:00:00:00:00:ff:ff` (IPv4-mapped).
-
-**Method 3: Ensure MTU Consistency**
-
-Make sure both endpoints use the same MTU:
-
-```bash
-# Check device MTU
-ibstatus mlx5_0
-```
-
----
-
-## Multi-Node Testing
-
-For multi-node tests, the producer and consumer run on different machines connected via an InfiniBand switch.
-
-### Prerequisites
-
-1. Both machines have Mooncake and RDMA drivers installed
-2. Both machines are in the same InfiniBand subnet
-3. The switch is properly configured
-
-### Configuration
-
-**Machine A (Producer):**
-
-```bash
-# Set RDMA host IP (InfiniBand interface IP)
-export RDMA_TEST_HOST='10.0.0.1'
-
-# Optional: Specify device
-export RDMA_DEVICE_NAME='mlx5_0'
-```
-
-**Machine B (Consumer):**
-
-```bash
-# Set RDMA host IP
-export RDMA_TEST_HOST='10.0.0.2'
-
-# Optional: Specify device
-export RDMA_DEVICE_NAME='mlx5_0'
-```
-
-### Verify Connectivity
-
-```bash
-# Ping IB interface
-ping 10.0.0.2
-
-# Test RDMA connectivity with ibping
-# On Machine B (server)
-ibping -S
-
-# On Machine A (client)
-ibping -G PORT_GUID
-```
-
----
-
-## Running Tests
-
-### Run All RDMA Tests (Single-Node, fast suite)
-
-Slow tests (large payloads, stress, concurrency integrity) are marked `@pytest.mark.slow`. Use `-m "not slow"` to skip them in quick CI or local fast iteration.
-
-```bash
-cd tests/distributed/omni_connectors
-
-# Fast suite only (excludes slow/stress tests)
-pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m "not slow"
-```
-
-### Run Including Slow Tests
-
-```bash
-# Run ALL tests including slow/stress tests
-pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s
-
-# Run ONLY the slow/stress tests
-pytest test_mooncake_transfer_engine_rdma.py test_mooncake_transfer_engine_buffer.py -v -s -m slow
-```
-
-### Run Buffer Management Tests
-
-```bash
-# Fast only
-pytest test_mooncake_transfer_engine_buffer.py -v -s -m "not slow"
-
-# Including allocator invariant tests (double-free, overlap, merge)
-pytest test_mooncake_transfer_engine_buffer.py -v -s
-```
-
-### Run Specific Test Classes
-
-```bash
-# Basic connector tests
-pytest test_mooncake_transfer_engine_rdma.py::TestBasicConnector -v -s
-
-# End-to-end RDMA transfer tests
-pytest test_mooncake_transfer_engine_rdma.py::TestEndToEnd -v -s
-
-# Lifecycle & resource management tests
-pytest test_mooncake_transfer_engine_rdma.py::TestLifecycle -v -s
-
-# GPU memory pool tests (requires CUDA)
-pytest test_mooncake_transfer_engine_rdma.py::TestGPUPool -v -s
-
-# Stress / correctness tests (slow)
-pytest test_mooncake_transfer_engine_rdma.py::TestStressCorrectness -v -s
-```
-
-### RDMA Environment Diagnostics
-
-For quick diagnostics (device status, Mooncake availability, env vars, etc.),
-see the [Troubleshooting section](../../../docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md#troubleshooting)
-in the connector documentation.
-
----
-
-## Cross-Node Testing
-
-The `cross_node_mooncake_transfer_engine.py` script enables testing RDMA transfers between two separate physical machines. This script is **not** auto-discovered by `pytest` (it does not start with `test_`) — it must be run manually on each node.
-
-### Prerequisites
-
-1. Both machines have Mooncake installed
-2. Both machines are connected via an InfiniBand/RoCE switch
-3. Firewall allows ZMQ ports (default: 15500, 15501)
-4. Same RDMA device name on both nodes (if multiple devices exist)
-
-### Running Cross-Node Tests
-
-**On Machine A (Producer) — start first:**
-
-```bash
-cd benchmarks/distributed/omni_connectors/
-
-# Optional: specify device if multiple exist
-export RDMA_DEVICE_NAME='mlx5_0'
-
-python cross_node_mooncake_transfer_engine.py \
- --role producer \
- --local-host \
- --remote-host \
- --tensor-size-mb 100 \
- --num-transfers 3
-```
-
-**On Machine B (Consumer) — start after producer:**
-
-```bash
-cd benchmarks/distributed/omni_connectors/
-
-export RDMA_DEVICE_NAME='mlx5_0'
-
-python cross_node_mooncake_transfer_engine.py \
- --role consumer \
- --local-host \
- --remote-host \
- --tensor-size-mb 100 \
- --num-transfers 3
-```
-
-### Transfer Modes
-
-| Mode | Description | Example |
-|------|-------------|---------|
-| `copy` | Normal path — tensor copied to RDMA pool (default) | `--mode copy` |
-| `zerocopy` | Zero-copy path — data created directly in RDMA pool | `--mode zerocopy` |
-| `gpu` | GPU transfer — RDMA pool on GPU, uses GPUDirect | `--mode gpu --gpu-id 0` |
-
-### Benchmark Mode
-
-Skip MD5 verification and measure pure RDMA throughput:
-
-```bash
-# Producer
-python cross_node_mooncake_transfer_engine.py \
- --role producer \
- --local-host \
- --remote-host \
- --tensor-size-mb 1024 \
- --num-transfers 20 \
- --benchmark
-
-# Consumer
-python cross_node_mooncake_transfer_engine.py \
- --role consumer \
- --local-host \
- --remote-host \
- --tensor-size-mb 1024 \
- --num-transfers 20 \
- --benchmark
-```
-
-### Cross-Node Test Options
-
-| Option | Description | Default |
-|--------|-------------|---------|
-| `--role` | `producer` or `consumer` | Required |
-| `--local-host` | Local RDMA IP address | Required |
-| `--remote-host` | Remote RDMA IP address | Required |
-| `--local-port` | Local ZMQ port for RDMA data | 15500 |
-| `--remote-port` | Remote ZMQ port for RDMA data | 15500 |
-| `--ctrl-port` | Control channel port | 15501 |
-| `--tensor-size-mb` | Tensor size in MB | 100 |
-| `--num-transfers` | Number of transfers | 20 |
-| `--mode` | `copy`, `zerocopy`, or `gpu` | `copy` |
-| `--gpu-id` | GPU ID for GPU mode | 0 |
-| `--benchmark` | Skip MD5, pure performance test | off |
-
----
-
-## Troubleshooting
-
-### 1. "Failed to modify QP to RTR" Error
-
-**Cause**: QP handshake failed, usually due to device configuration mismatch.
-
-**Solution**:
-```bash
-# Force using the same device
-export RDMA_DEVICE_NAME='mlx5_0'
-```
-
-### 2. "Mooncake TransferEngine is not available"
-
-**Cause**: Mooncake not installed or import failed.
-
-**Solution**:
-```bash
-# Check Mooncake installation
-python -c "from mooncake.engine import TransferEngine; print('OK')"
-
-# Reinstall if needed
-pip install mooncake-transfer-engine
-# Or using uv
-uv pip install mooncake-transfer-engine
-
-```
-
-### 3. "Permission denied" accessing /dev/infiniband
-
-**Cause**: Container lacks IB device access permissions.
-
-**Solution**:
-```bash
-docker run --device=/dev/infiniband --cap-add=IPC_LOCK ...
-```
-
-### 4. Test Timeout
-
-**Cause**: RDMA connection establishment failed or network latency.
-
-**Solution**:
-```bash
-# Check network status
-ibstat
-ibstatus
-```
-
-### 5. GPU Test Failed "CUDA is not available"
-
-**Cause**: CUDA environment not configured or GPU unavailable.
-
-**Solution**:
-```bash
-# Check CUDA
-python -c "import torch; print(torch.cuda.is_available())"
-
-# Docker needs NVIDIA runtime
-docker run --gpus all ...
-```
-
----
-
-## Environment Variables Reference
-
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `RDMA_DEVICE_NAME` | Specify RDMA device name | `mlx5_0` |
-| `RDMA_TEST_HOST` | Specify test host IP | `10.0.0.1` |
-| `MC_TE_METRIC` | Enable Mooncake metrics | `1` |
-| `MC_IB_PCI_RELAXED_ORDERING` | Enable PCIe relaxed ordering | `1` |
-
----
-
-## Test Files Overview
-
-| File | Description | Auto-discovered by pytest |
-|------|-------------|--------------------------|
-| `test_mooncake_transfer_engine_rdma.py` | Integration tests for MooncakeTransferEngineConnector (basic, E2E, lifecycle, GPU) | Yes |
-| `test_mooncake_transfer_engine_buffer.py` | Memory pool and buffer management unit tests | Yes |
-| `cross_node_mooncake_transfer_engine.py` | Cross-node (multi-machine) testing script — run manually | No (filename does not start with `test_`) |
-
-### test_mooncake_transfer_engine_rdma.py — Test Classes
-
-| Test Class | Memory Pool | Marker | Description |
-|------------|-------------|--------|-------------|
-| `TestBasicConnector` | CPU | — | Initialization, put tensor/bytes/object, cleanup, pool exhaustion |
-| `TestEndToEnd` | CPU | — | E2E RDMA transfer: tensor, bytes, object, zero-copy, large payload (100MB), mixed types, concurrency |
-| `TestLifecycle` | CPU | — | Close, context manager, double-close safety |
-| `TestGPUPool` | GPU | — | GPU pool init, put CPU/GPU tensor, GPU E2E transfer |
-| `TestStressCorrectness` | CPU | `slow` | Concurrent put+get with MD5 integrity, bidirectional concurrency, edge cases (1-element tensor, empty bytes), 500MB payload, rapid alloc/free cycles |
-
-### test_mooncake_transfer_engine_buffer.py — Test Classes
-
-| Test Class | Marker | Description |
-|------------|--------|-------------|
-| `TestBufferAllocator` | — | Basic alloc/free, alignment, exhaustion/recovery, thread safety |
-| `TestAllocatorInvariants` | `slow` | Double-free safety, overlap corruption detection, adjacent-block merging, fragmentation/defrag |
-| `TestManagedBuffer` | — | Tensor views, context manager |
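
Note on the benchmark mode documented in the README above: the throughput it reports is total bytes transferred divided by wall-clock time, expressed in MB/s with 1 MB taken as 1024 * 1024 bytes. The following is a minimal sketch of that arithmetic only, with no RDMA or Mooncake involved. The helper name `throughput_mb_per_s` is hypothetical and is not part of the removed script.

```python
def throughput_mb_per_s(total_bytes: int, elapsed_s: float) -> float:
    """Return throughput in MB/s (1 MB = 1024 * 1024 bytes).

    Hypothetical helper for illustration; returns 0.0 when no time has
    elapsed so the caller never divides by zero.
    """
    if elapsed_s <= 0:
        return 0.0
    return (total_bytes / (1024 * 1024)) / elapsed_s


if __name__ == "__main__":
    # Example: 20 transfers of 1024 MB each, completed in 8 seconds.
    total = 20 * 1024 * 1024 * 1024
    print(f"{throughput_mb_per_s(total, 8.0):.2f} MB/s")
```

Because the timer on each side also covers control-channel round trips, a figure computed this way is a lower bound on raw link throughput rather than an exact measurement.
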
diff --git a/benchmarks/distributed/omni_connectors/cross_node_mooncake_transfer_engine.py b/benchmarks/distributed/omni_connectors/cross_node_mooncake_transfer_engine.py
deleted file mode 100644
index fd01739ccc6..00000000000
--- a/benchmarks/distributed/omni_connectors/cross_node_mooncake_transfer_engine.py
+++ /dev/null
@@ -1,644 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-"""
-Cross-Node RDMA Test Script (Automated Version)
-
-This script enables testing RDMA transfers between two separate machines.
-Supports three transfer modes:
- - copy: Normal path - tensor copied to RDMA pool (default)
- - zerocopy: Zero-copy path - data created directly in RDMA pool
- - gpu: GPU transfer - RDMA pool on GPU, uses GPUDirect
-
-Usage:
- # On Machine A (Producer) - start first:
- python cross_node_mooncake_transfer_engine.py --role producer --local-host hostname_A --remote-host hostname_B
-
- # On Machine B (Consumer) - start after producer:
- python cross_node_mooncake_transfer_engine.py --role consumer --local-host hostname_B --remote-host hostname_A
-
- # Zero-copy mode:
- python cross_node_mooncake_transfer_engine.py --role producer ... --mode zerocopy
-
- # GPU mode (requires GPUDirect RDMA support):
- python cross_node_mooncake_transfer_engine.py --role producer ... --mode gpu --gpu-id 0
-
- # Benchmark mode (skip random data generation and MD5 verification,
- # measures pure RDMA throughput):
- python cross_node_mooncake_transfer_engine.py --role producer ... --benchmark
-
-Environment Variables:
- RDMA_DEVICE_NAME: Specify RDMA device (e.g., mlx5_0)
- MC_IB_PCI_RELAXED_ORDERING: Set to 1 to enable PCIe relaxed ordering
- for higher RDMA throughput
-"""
-
-import argparse
-import hashlib
-import os
-import sys
-import time
-from abc import ABC, abstractmethod
-from dataclasses import dataclass
-from typing import Any
-
-import msgspec
-import torch
-import zmq
-
-# Add parent path for imports
-sys.path.insert(
-    0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
-)
-
-from vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector import (
-    ManagedBuffer,
-    MooncakeTransferEngineConnector,
-    TransferEngine,
-)
-
-
-def compute_md5(tensor: torch.Tensor) -> str:
-    """Compute MD5 checksum of a tensor."""
-    if tensor.is_cuda:
-        tensor = tensor.cpu()
-    data = tensor.contiguous().view(torch.uint8).numpy().tobytes()
-    return hashlib.md5(data).hexdigest()
-
-
-# Control channel message types
-class CtrlMsg(msgspec.Struct):
-    """Control channel message."""
-
-    msg_type: str  # "READY", "TRANSFER", "ACK", "DONE", "ERROR"
-    request_id: str = ""
-    md5: str = ""
-    data_size: int = 0
-    error: str = ""
-
-
-@dataclass
-class TransferConfig:
-    """Configuration for cross-node transfer test."""
-
-    local_host: str
-    remote_host: str
-    local_port: int
-    remote_port: int
-    ctrl_port: int
-    num_transfers: int
-    tensor_size_mb: int
-    mode: str  # "copy", "zerocopy", "gpu"
-    gpu_id: int = 0
-    benchmark: bool = False  # Skip MD5 verification for pure performance test
-
-
-@dataclass
-class TransferStats:
-    """Statistics for transfer operations."""
-
-    success_count: int = 0
-    fail_count: int = 0
-    total_bytes: int = 0
-    elapsed_time: float = 0.0
-
-    @property
-    def throughput_mbps(self) -> float:
-        if self.elapsed_time > 0:
-            return (self.total_bytes / (1024 * 1024)) / self.elapsed_time
-        return 0.0
-
-    def print_summary(self, role: str):
-        print(f"\n{'=' * 60}")
-        print(f" {role.upper()} SUMMARY")
-        print(f" Successful: {self.success_count}/{self.success_count + self.fail_count}")
-        print(f" Failed: {self.fail_count}/{self.success_count + self.fail_count}")
-        print(f" Total: {self.total_bytes / (1024 * 1024):.2f} MB")
-        print(f" Time: {self.elapsed_time:.2f} s")
-        print(f" Throughput: {self.throughput_mbps:.2f} MB/s")
-        print(f"{'=' * 60}")
-
-
-class CrossNodeTester(ABC):
-    """Abstract base class for cross-node RDMA testing."""
-
-    def __init__(self, config: TransferConfig):
-        self.config = config
-        self.connector: MooncakeTransferEngineConnector | None = None
-        self.zmq_ctx: zmq.Context | None = None
-        self.ctrl_socket: zmq.Socket | None = None
-        self.stats = TransferStats()
-
-    def get_connector_config(self) -> dict:
-        """Get connector configuration based on mode."""
-        pool_size = int(self.config.tensor_size_mb * 1.5) * 1024 * 1024
-        pool_size = max(pool_size, 128 * 1024 * 1024)
-
-        conn_config = {
-            "host": self.config.local_host,
-            "zmq_port": self.config.local_port,
-            "protocol": "rdma",
-            "memory_pool_size": pool_size,
-        }
-
-        # Set device based on mode
-        if self.config.mode == "gpu":
-            conn_config["memory_pool_device"] = f"cuda:{self.config.gpu_id}"
-        else:
-            conn_config["memory_pool_device"] = "cpu"
-
-        # RDMA device name from environment
-        device_name = os.environ.get("RDMA_DEVICE_NAME")
-        if device_name:
-            conn_config["device_name"] = device_name
-            print(f"[CONFIG] Using RDMA device: {device_name}")
-
-        return conn_config
-
-    def initialize(self):
-        """Initialize connector and ZMQ context."""
-        print(f"[{self.role}] Initializing connector...")
-        conn_config = self.get_connector_config()
-        self.connector = MooncakeTransferEngineConnector(conn_config)
-        self.zmq_ctx = zmq.Context()
-        print(f"[{self.role}] Ready at {self.config.local_host}:{self.config.local_port}")
-
-    def cleanup(self):
-        """Cleanup resources."""
-        if self.ctrl_socket:
-            self.ctrl_socket.close()
-        if self.zmq_ctx:
-            self.zmq_ctx.term()
-        if self.connector:
-            self.connector.close()
-        print(f"[{self.role}] Closed.")
-
-    @property
-    @abstractmethod
-    def role(self) -> str:
-        pass
-
-    @abstractmethod
-    def run(self):
-        pass
-
-
-class Producer(CrossNodeTester):
-    """Producer node - sends data to consumer."""
-
-    @property
-    def role(self) -> str:
-        return "PRODUCER"
-
-    def print_header(self):
-        print(f"\n{'=' * 60}")
-        print(f" PRODUCER MODE ({self.config.mode.upper()})")
-        print(f" Local: {self.config.local_host}:{self.config.local_port}")
-        print(f" Remote: {self.config.remote_host}:{self.config.remote_port}")
-        print(f" Control Port: {self.config.ctrl_port}")
-        print(f" Transfer Mode: {self.config.mode}")
-        if self.config.mode == "gpu":
-            print(f" GPU ID: {self.config.gpu_id}")
-        print(f"{'=' * 60}\n")
-
-    def setup_control_channel(self):
-        """Setup ZMQ control channel as server (REP socket)."""
-        self.ctrl_socket = self.zmq_ctx.socket(zmq.REP)
-        self.ctrl_socket.bind(f"tcp://*:{self.config.ctrl_port}")
-        print(f"[PRODUCER] Control channel listening on port {self.config.ctrl_port}")
-        print("[PRODUCER] Waiting for consumer to connect...")
-
-    def wait_for_consumer(self) -> bool:
-        """Wait for consumer READY signal."""
-        msg_data = self.ctrl_socket.recv()
-        msg = msgspec.msgpack.decode(msg_data, type=CtrlMsg)
-        if msg.msg_type != "READY":
-            print(f"[PRODUCER] Unexpected message: {msg.msg_type}")
-            return False
-        print("[PRODUCER] Consumer connected!")
-        self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="ACK")))
-        return True
-
-    def create_test_data(self, transfer_idx: int) -> tuple[Any, str, int]:
-        """
-        Create test data based on transfer mode.
-
-        Returns:
-            (data, md5, size) tuple
-        """
-        num_elements = (self.config.tensor_size_mb * 1024 * 1024) // 4
-        data_size = num_elements * 4
-
-        if self.config.mode == "zerocopy":
-            # Zero-Copy Path: Allocate directly from connector's pool
-            offset = self.connector.allocator.alloc(data_size)
-            managed_buf = ManagedBuffer(self.connector.allocator, offset, data_size, self.connector.pool)
-
-            if self.config.benchmark:
-                # In benchmark mode, skip random data generation (use uninitialized memory)
-                return managed_buf, "", data_size
-            else:
-                # Fill buffer with random data using tensor view
-                tensor_view = managed_buf.as_tensor(dtype=torch.float32, shape=(num_elements,))
-                random_data = torch.randn(num_elements, dtype=torch.float32)
-                if tensor_view.is_cuda:
-                    tensor_view.copy_(random_data.to(tensor_view.device))
-                else:
-                    tensor_view.copy_(random_data)
-                md5 = compute_md5(tensor_view)
-                return managed_buf, md5, data_size
-
-        elif self.config.mode == "gpu":
-            # GPU Path: Create tensor on GPU
-            device = f"cuda:{self.config.gpu_id}"
-            if self.config.benchmark:
-                # In benchmark mode, use empty tensor (no random generation)
-                gpu_tensor = torch.empty(num_elements, dtype=torch.float32, device=device)
-                return gpu_tensor, "", data_size
-            else:
-                cpu_tensor = torch.randn(num_elements, dtype=torch.float32)
-                md5 = compute_md5(cpu_tensor)
-                gpu_tensor = cpu_tensor.to(device)
-                return gpu_tensor, md5, data_size
-
-        else:
-            # Copy Path (default): Create regular CPU tensor
-            if self.config.benchmark:
-                # In benchmark mode, use empty tensor (no random generation)
-                tensor = torch.empty(num_elements, dtype=torch.float32)
-                return tensor, "", data_size
-            else:
-                tensor = torch.randn(num_elements, dtype=torch.float32)
-                md5 = compute_md5(tensor)
-                return tensor, md5, data_size
-
-    def do_transfer(self, transfer_idx: int) -> bool:
-        """Perform a single transfer."""
-        req_id = f"cross_node_transfer_{transfer_idx}"
-
-        if not self.config.benchmark:
-            print(f"\n[PRODUCER] Transfer {transfer_idx + 1}/{self.config.num_transfers}")
-
-        # Create test data
-        t0 = time.time()
-        data, md5, data_size = self.create_test_data(transfer_idx)
-        t_create = time.time() - t0
-
-        if not self.config.benchmark:
-            print(f" Mode: {self.config.mode}")
-            print(f" Size: {self.config.tensor_size_mb} MB")
-            if md5:
-                print(f" MD5: {md5[:16]}...")
-            print(f" Create time: {t_create * 1000:.1f} ms")
-
-        # Put data
-        t1 = time.time()
-        success, size, metadata = self.connector.put("producer", "consumer", req_id, data)
-        t_put = time.time() - t1
-
-        if not success:
-            print(" [FAIL] Put failed")
-            return False
-
-        if not self.config.benchmark:
-            print(f" [OK] Put successful, {size} bytes ({t_put * 1000:.1f} ms)")
-
-        # Wait for consumer to request transfer info
-        msg_data = self.ctrl_socket.recv()
-        msg = msgspec.msgpack.decode(msg_data, type=CtrlMsg)
-
-        if msg.msg_type != "READY":
-            print(f" [ERROR] Unexpected message: {msg.msg_type}")
-            return False
-
-        # Send transfer metadata to consumer
-        transfer_msg = CtrlMsg(
-            msg_type="TRANSFER",
-            request_id=req_id,
-            md5=md5,
-            data_size=data_size,
-        )
-        self.ctrl_socket.send(msgspec.msgpack.encode(transfer_msg))
-
-        # Wait for consumer ACK (this includes RDMA transfer time)
-        t2 = time.time()
-        msg_data = self.ctrl_socket.recv()
-        t_rdma = time.time() - t2
-        msg = msgspec.msgpack.decode(msg_data, type=CtrlMsg)
-
-        success = msg.msg_type == "ACK"
-        if success:
-            if not self.config.benchmark:
-                print(f" [OK] RDMA transfer complete ({t_rdma * 1000:.1f} ms)")
-            self.stats.success_count += 1
-            self.stats.total_bytes += size
-        else:
-            print(f" [WARN] Consumer reported error: {msg.error}")
-            self.stats.fail_count += 1
-
-        # Send ACK to allow consumer to continue
-        self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="ACK")))
-
-        # Cleanup buffer
-        self.connector.cleanup(req_id)
-
-        return success
-
-    def run(self):
-        """Run the producer."""
-        self.print_header()
-        self.initialize()
-        self.setup_control_channel()
-
-        try:
-            if not self.wait_for_consumer():
-                return
-
-            if self.config.benchmark:
-                print(
-                    f"[BENCHMARK] Running {self.config.num_transfers} "
-                    f"transfers of {self.config.tensor_size_mb} MB each..."
-                )
-
-            start_time = time.time()
-
-            for i in range(self.config.num_transfers):
-                self.do_transfer(i)
-                if self.config.benchmark and (i + 1) % 10 == 0:
-                    elapsed = time.time() - start_time
-                    current_throughput = (self.stats.total_bytes / (1024 * 1024)) / elapsed
-                    print(f" Progress: {i + 1}/{self.config.num_transfers}, Throughput: {current_throughput:.2f} MB/s")
-
-            self.stats.elapsed_time = time.time() - start_time
-            self.stats.print_summary("PRODUCER")
-
-            # Wait for final consumer message and send DONE
-            self.ctrl_socket.recv()
-            self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="DONE")))
-
-        finally:
-            self.cleanup()
-
-
-class Consumer(CrossNodeTester):
-    """Consumer node - receives data from producer."""
-
-    @property
-    def role(self) -> str:
-        return "CONSUMER"
-
-    def print_header(self):
-        print(f"\n{'=' * 60}")
-        print(f" CONSUMER MODE ({self.config.mode.upper()})")
-        print(f" Local: {self.config.local_host}:{self.config.local_port}")
-        print(f" Remote: {self.config.remote_host}:{self.config.remote_port}")
-        print(f" Control Port: {self.config.ctrl_port}")
-        print(f" Transfer Mode: {self.config.mode}")
-        if self.config.mode == "gpu":
-            print(f" GPU ID: {self.config.gpu_id}")
-        print(f"{'=' * 60}\n")
-
-    def setup_control_channel(self):
-        """Setup ZMQ control channel as client (REQ socket)."""
-        self.ctrl_socket = self.zmq_ctx.socket(zmq.REQ)
-        ctrl_addr = f"tcp://{self.config.remote_host}:{self.config.ctrl_port}"
-        print(f"[CONSUMER] Connecting to producer control channel at {ctrl_addr}...")
-        self.ctrl_socket.connect(ctrl_addr)
-
-    def connect_to_producer(self) -> bool:
-        """Connect to producer and send READY signal."""
-        self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="READY")))
-        msg_data = self.ctrl_socket.recv()
-        msg = msgspec.msgpack.decode(msg_data, type=CtrlMsg)
-        if msg.msg_type != "ACK":
-            print(f"[CONSUMER] Unexpected response: {msg.msg_type}")
-            return False
-        print("[CONSUMER] Connected to producer! Starting transfers...")
-        return True
-
-    def do_transfer(self, transfer_idx: int) -> bool:
-        """Perform a single transfer."""
-        if not self.config.benchmark:
-            print(f"\n[CONSUMER] Transfer {transfer_idx + 1}/{self.config.num_transfers}")
-
-        # Request next transfer info
-        self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="READY")))
-        msg_data = self.ctrl_socket.recv()
-        msg = msgspec.msgpack.decode(msg_data, type=CtrlMsg)
-
-        if msg.msg_type == "DONE":
-            print("[CONSUMER] Producer signaled completion")
-            return False
-
-        if msg.msg_type != "TRANSFER":
-            print(f"[CONSUMER] Unexpected message: {msg.msg_type}")
-            return False
-
-        req_id = msg.request_id
-        expected_md5 = msg.md5
-        data_size = msg.data_size
-        num_elements = data_size // 4
-
-        if not self.config.benchmark:
-            print(f" Request ID: {req_id}")
-            if expected_md5:
-                print(f" Expected MD5: {expected_md5[:16]}...")
-            print(f" Data Size: {data_size / (1024 * 1024):.2f} MB")
-
-        # Build metadata for get
-        metadata = {
-            "request_id": req_id,
-            "source_host": self.config.remote_host,
-            "source_port": self.config.remote_port,
-            "data_size": data_size,
-            "dtype": "float32",
-            "shape": [num_elements],
-            "is_fast_path": True,
-        }
-
-        if not self.config.benchmark:
-            print(f" [INFO] Requesting from {self.config.remote_host}:{self.config.remote_port}")
-
-        # Get data with timing
-        t0 = time.time()
-        result = self.connector.get("producer", "consumer", req_id, metadata)
-        t_get = time.time() - t0
-
-        response_msg = CtrlMsg(msg_type="ERROR", error="Get failed")
-
-        if result is not None:
-            recv_buffer, recv_size = result
-            if not self.config.benchmark:
-                print(f" [OK] Get successful, {recv_size} bytes ({t_get * 1000:.1f} ms)")
-
-            if isinstance(recv_buffer, ManagedBuffer):
-                # In benchmark mode, skip MD5 verification
-                if self.config.benchmark or not expected_md5:
-                    response_msg = CtrlMsg(msg_type="ACK")
-                    self.stats.success_count += 1
-                    self.stats.total_bytes += recv_size
-                else:
-                    # Verify data
-                    t1 = time.time()
-                    reconstructed = recv_buffer.as_tensor(dtype=torch.float32, shape=(num_elements,))
-                    recv_md5 = compute_md5(reconstructed)
-                    t_md5 = time.time() - t1
-                    print(f" MD5: {recv_md5[:16]}... ({t_md5 * 1000:.1f} ms)")
-
-                    if recv_md5 == expected_md5:
-                        print(" [PASS] MD5 checksum verified!")
-                        response_msg = CtrlMsg(msg_type="ACK")
-                        self.stats.success_count += 1
-                        self.stats.total_bytes += recv_size
-                    else:
-                        print(" [FAIL] MD5 mismatch!")
-                        response_msg = CtrlMsg(msg_type="ERROR", error="MD5 mismatch")
-                        self.stats.fail_count += 1
-
-                recv_buffer.release()
-            else:
-                response_msg = CtrlMsg(msg_type="ACK")
-                self.stats.success_count += 1
-                self.stats.total_bytes += recv_size
-        else:
-            print(" [FAIL] Get failed")
-            self.stats.fail_count += 1
-
-        # Send response to producer
-        self.ctrl_socket.send(msgspec.msgpack.encode(response_msg))
-        # Wait for ACK
-        self.ctrl_socket.recv()
-
-        return response_msg.msg_type == "ACK"
-
-    def run(self):
-        """Run the consumer."""
-        self.print_header()
-        self.initialize()
-        self.setup_control_channel()
-
-        try:
-            if not self.connect_to_producer():
-                return
-
-            if self.config.benchmark:
-                print(
-                    f"[BENCHMARK] Running {self.config.num_transfers} "
-                    f"transfers of {self.config.tensor_size_mb} MB each..."
-                )
-
-            start_time = time.time()
-
-            for i in range(self.config.num_transfers):
-                if not self.do_transfer(i):
-                    break
-                if self.config.benchmark and (i + 1) % 10 == 0:
-                    elapsed = time.time() - start_time
-                    current_throughput = (self.stats.total_bytes / (1024 * 1024)) / elapsed
-                    print(f" Progress: {i + 1}/{self.config.num_transfers}, Throughput: {current_throughput:.2f} MB/s")
-
-            self.stats.elapsed_time = time.time() - start_time
-            self.stats.print_summary("CONSUMER")
-
-            # Send final READY and wait for DONE
-            self.ctrl_socket.send(msgspec.msgpack.encode(CtrlMsg(msg_type="READY")))
-            self.ctrl_socket.recv()
-
-        finally:
-            self.cleanup()
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        description="Cross-Node RDMA Test (Automated)",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-Transfer Modes:
-  copy - Normal path: tensor copied to RDMA pool (default)
-  zerocopy - Zero-copy path: data created directly in RDMA pool
-  gpu - GPU transfer: RDMA pool on GPU, uses GPUDirect
-
-Examples:
-  # Copy mode (default):
-  python cross_node_mooncake_transfer_engine.py --role producer \
-    --local-host hostA --remote-host hostB
-
-  # Zero-copy mode:
-  python cross_node_mooncake_transfer_engine.py --role producer \
-    --local-host hostA --remote-host hostB --mode zerocopy
-
-  # GPU mode:
-  python cross_node_mooncake_transfer_engine.py --role producer \
-    --local-host hostA --remote-host hostB --mode gpu --gpu-id 0
-
-  # Benchmark mode (skip MD5, measure pure RDMA performance):
-  python cross_node_mooncake_transfer_engine.py --role producer \
-    --local-host hostA --remote-host hostB --benchmark
-
-  # With specific RDMA device:
-  RDMA_DEVICE_NAME=mlx5_0 python cross_node_mooncake_transfer_engine.py --role producer ...
-        """,
-    )
-
-    parser.add_argument(
-        "--role", required=True, choices=["producer", "consumer"], help="Role: producer (sends) or consumer (receives)"
-    )
-    parser.add_argument("--local-host", required=True, help="Local hostname or IP address")
-    parser.add_argument("--remote-host", required=True, help="Remote hostname or IP address")
-    parser.add_argument("--local-port", type=int, default=15500, help="Local ZMQ port for RDMA data (default: 15500)")
-    parser.add_argument("--remote-port", type=int, default=15500, help="Remote ZMQ port for RDMA data (default: 15500)")
-    parser.add_argument("--ctrl-port", type=int, default=15501, help="Control channel port (default: 15501)")
-    parser.add_argument("--num-transfers", type=int, default=20, help="Number of transfers to perform (default: 20)")
-    parser.add_argument("--tensor-size-mb", type=int, default=100, help="Tensor size in MB (default: 100)")
-    parser.add_argument(
-        "--mode",
-        choices=["copy", "zerocopy", "gpu"],
-        default="copy",
-        help="Transfer mode: copy, zerocopy, or gpu (default: copy)",
-    )
-    parser.add_argument("--gpu-id", type=int, default=0, help="GPU ID for GPU mode (default: 0)")
-    parser.add_argument(
-        "--benchmark", action="store_true", help="Benchmark mode: skip MD5 verification for pure performance test"
-    )
-
-    args = parser.parse_args()
-
-    # Check Mooncake
-    if TransferEngine is None:
-        print("[ERROR] Mooncake TransferEngine is not available.")
-        print("Install with: pip install mooncake-transfer-engine")
-        sys.exit(1)
-
-    # Check CUDA for GPU mode
-    if args.mode == "gpu":
-        if not torch.cuda.is_available():
-            print("[ERROR] CUDA is not available but GPU mode was requested.")
-            sys.exit(1)
-        if args.gpu_id >= torch.cuda.device_count():
-            print(f"[ERROR] GPU {args.gpu_id} not available. Found {torch.cuda.device_count()} GPUs.")
-            sys.exit(1)
-        print(f"[INFO] Using GPU {args.gpu_id}: {torch.cuda.get_device_name(args.gpu_id)}")
-
-    config = TransferConfig(
-        local_host=args.local_host,
-        remote_host=args.remote_host,
-        local_port=args.local_port,
-        remote_port=args.remote_port,
-        ctrl_port=args.ctrl_port,
-        num_transfers=args.num_transfers,
-        tensor_size_mb=args.tensor_size_mb,
-        mode=args.mode,
-        gpu_id=args.gpu_id,
-        benchmark=args.benchmark,
-    )
-
-    if args.role == "producer":
-        tester = Producer(config)
-    else:
-        tester = Consumer(config)
-
-    tester.run()
-
-
-if __name__ == "__main__":
-    main()
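
Note on the integrity check in the script above: outside benchmark mode, the producer hashes the payload, sends the digest over the control channel, and the consumer hashes what it received and compares. An empty digest means the check is skipped. The following is a minimal sketch of that comparison only, assuming plain `bytes` stand in for the tensor data and with no ZMQ or RDMA involved. The names `md5_hex` and `verify_payload` are hypothetical and do not appear in the removed script.

```python
import hashlib
import os


def md5_hex(payload: bytes) -> str:
    """Return the hex MD5 digest of a byte payload."""
    return hashlib.md5(payload).hexdigest()


def verify_payload(received: bytes, expected_md5: str) -> bool:
    """Compare the digest of received data against the sender's digest.

    An empty expected digest mirrors benchmark mode, where the sender
    skips hashing and the receiver accepts the payload unchecked.
    """
    if not expected_md5:
        return True
    return md5_hex(received) == expected_md5


if __name__ == "__main__":
    payload = os.urandom(1024)  # stand-in for the producer's tensor bytes
    digest = md5_hex(payload)   # what the producer would send alongside it

    # Flip every bit of the last byte to simulate corruption in transit.
    corrupted = payload[:-1] + bytes([payload[-1] ^ 0xFF])

    print("intact payload accepted:", verify_payload(payload, digest))
    print("corrupted payload accepted:", verify_payload(corrupted, digest))
```

MD5 is used here only as a cheap corruption check, matching the script; it is not suitable where an adversary could tamper with the data.
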
diff --git a/benchmarks/fish-speech/bench_voice_cache.py b/benchmarks/fish-speech/bench_voice_cache.py
deleted file mode 100644
index 8d465d6489f..00000000000
--- a/benchmarks/fish-speech/bench_voice_cache.py
+++ /dev/null
@@ -1,290 +0,0 @@
-"""Benchmark Fish Speech voice cache: inline ref_audio vs uploaded voice.
-
-Measures TTFP improvement from DAC-code caching when using uploaded voices.
-
-Setup:
- 1. Start vllm-omni with Fish Speech S2 Pro (use our feat branch)
- 2. Provide a reference audio file for voice cloning
-
-Usage:
- python bench_voice_cache.py \
- --ref-audio /path/to/reference.wav \
- --ref-text "Transcript of the reference audio." \
- --num-prompts 20 \
- --port 8091
-
-The script runs two rounds:
- A) Inline ref_audio: every request sends base64 audio (no cache)
- B) Uploaded voice: upload once, then use voice name (cache hits after 1st)
-"""
-
-import argparse
-import asyncio
-import base64
-import json
-import os
-import sys
-import time
-from pathlib import Path
-
-import aiohttp
-
-# Allow imports from benchmarks/fish-speech/
-sys.path.insert(0, str(Path(__file__).resolve().parent))
-
-from fish_bench_utils import (  # noqa: E402
-    BenchmarkResult,
-    RequestResult,
-    compute_stats,
-    print_benchmark_results,
-    send_streaming_request,
-)
-
-SAMPLE_RATE = 44100
-SAMPLE_WIDTH = 2
-
-PROMPTS = [
-    "Hello, welcome to the voice synthesis benchmark test.",
-    "She said she would be here by noon, but nobody showed up.",
-    "The quick brown fox jumps over the lazy dog near the riverbank.",
-    "I can't believe how beautiful the sunset looks from up here.",
-    "Please remember to bring your identification documents tomorrow morning.",
-    "Have you ever wondered what it would be like to travel through time?",
-    "The restaurant on the corner serves the best pasta I have ever tasted.",
-    "After the meeting, we should discuss the quarterly results.",
-    "Learning a new language takes patience and genuine curiosity.",
-    "The train leaves at half past seven, so we need to arrive early.",
-    "Could you please turn down the music, I'm trying to concentrate.",
-    "It was a dark and stormy night when the keeper heard a knock.",
-]
-
-
-def encode_audio_to_base64(audio_path: str) -> str:
-    """Encode a local audio file to base64 data URL."""
-    ext = audio_path.lower().rsplit(".", 1)[-1]
-    mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac"}
-    mime_type = mime_map.get(ext, "audio/wav")
-    with open(audio_path, "rb") as f:
-        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
-    return f"data:{mime_type};base64,{audio_b64}"
-
-
-async def upload_voice(
-    host: str,
-    port: int,
-    audio_path: str,
-    ref_text: str,
-    voice_name: str = "bench_voice",
-) -> dict:
-    """Upload a voice via POST /v1/audio/voices."""
-    url = f"http://{host}:{port}/v1/audio/voices"
-    data = aiohttp.FormData()
-    data.add_field("name", voice_name)
-    data.add_field("consent", "true")
-    if ref_text:
-        data.add_field("ref_text", ref_text)
-    data.add_field(
-        "audio_sample",
-        open(audio_path, "rb"),
-        filename=os.path.basename(audio_path),
-        content_type="audio/wav",
-    )
-
-    async with aiohttp.ClientSession() as session:
-        async with session.post(url, data=data) as resp:
-            result = await resp.json()
-            print(f" Upload response ({resp.status}): {json.dumps(result, indent=2)}")
-            return result
-
-
-async def delete_voice(host: str, port: int, voice_name: str) -> None:
-    """Delete an uploaded voice."""
-    url = f"http://{host}:{port}/v1/audio/voices/{voice_name}"
-    async with aiohttp.ClientSession() as session:
-        async with session.delete(url) as resp:
-            if resp.status == 200:
-                print(f" Deleted voice '{voice_name}'")
-
-
-async def run_round(
-    host: str,
-    port: int,
-    num_prompts: int,
-    create_payload_fn,
-    label: str,
-    num_warmups: int = 2,
-    timeout_s: float = 120.0,
-) -> BenchmarkResult:
-    """Run one benchmark round and return results."""
-    api_url = f"http://{host}:{port}/v1/audio/speech"
-    connector = aiohttp.TCPConnector(limit=1, limit_per_host=1)
-    session = aiohttp.ClientSession(
-        connector=connector,
-        timeout=aiohttp.ClientTimeout(total=timeout_s),
-    )
-
-    try:
-        # Warmup.
-        if num_warmups > 0:
-            print(f" [{label}] Warming up ({num_warmups} requests)...")
-            for i in range(num_warmups):
-                payload = create_payload_fn(PROMPTS[i % len(PROMPTS)])
-                r = await send_streaming_request(
-                    session,
-                    api_url,
-                    payload,
-                    SAMPLE_RATE,
-                    SAMPLE_WIDTH,
-                )
-                status = "OK" if r.success else f"FAIL: {r.error[:80]}"
-                print(f" warmup {i + 1}: ttfp={r.ttfp * 1000:.0f}ms {status}")
-
-        # Benchmark.
-        print(f" [{label}] Running {num_prompts} requests (concurrency=1)...")
-        results: list[RequestResult] = []
-        start = time.perf_counter()
-        for i in range(num_prompts):
-            prompt = PROMPTS[i % len(PROMPTS)]
-            payload = create_payload_fn(prompt)
-            r = await send_streaming_request(
-                session,
-                api_url,
-                payload,
-                SAMPLE_RATE,
-                SAMPLE_WIDTH,
-            )
-            results.append(r)
-            tag = "HIT" if i > 0 and label == "uploaded_voice" else ""
-            print(
-                f" req {i + 1:3d}: ttfp={r.ttfp * 1000:7.1f}ms "
-                f"e2e={r.e2e * 1000:7.1f}ms "
-                f"{'OK' if r.success else 'FAIL'} {tag}"
-            )
-        wall_time = time.perf_counter() - start
-    finally:
-        await session.close()
-
-    bench = compute_stats(results, wall_time)
-    bench.concurrency = 1
-    bench.num_prompts = num_prompts
-    bench.config_name = label
-    return bench
-
-
-async def main():
-    parser = argparse.ArgumentParser(
-        description="Benchmark Fish Speech voice cache (inline vs uploaded)",
-    )
-    parser.add_argument("--host", default="127.0.0.1")
-    parser.add_argument("--port", type=int, default=8091)
-    parser.add_argument("--ref-audio", required=True, help="Path to reference audio file")
-    parser.add_argument("--ref-text", required=True, help="Transcript of reference audio")
-    parser.add_argument("--num-prompts", type=int, default=20)
-    parser.add_argument("--num-warmups", type=int, default=2)
-    parser.add_argument("--voice-name", default="bench_voice")
-    args = parser.parse_args()
-
-    if not os.path.exists(args.ref_audio):
-        print(f"Error: ref_audio not found: {args.ref_audio}")
-        sys.exit(1)
-
-    ref_audio_b64 = encode_audio_to_base64(args.ref_audio)
-    print(f"Reference audio: {args.ref_audio} ({len(ref_audio_b64) // 1024}KB base64)")
-
-    # ---- Round A: Inline ref_audio (no cache) ----
-    print(f"\n{'=' * 60}")
-    print("Round A: INLINE ref_audio (every request sends full audio)")
-    print(f"{'=' * 60}")
-
-    def make_inline_payload(prompt: str) -> dict:
-        return {
-            "input": prompt,
-            "voice": "default",
-            "stream": True,
-            "response_format": "pcm",
-            "ref_audio": ref_audio_b64,
-            "ref_text": args.ref_text,
-            "max_new_tokens": 2048,
-        }
-
-    bench_inline = await run_round(
-        args.host,
-        args.port,
-        args.num_prompts,
-        make_inline_payload,
-        "inline_ref_audio",
-        num_warmups=args.num_warmups,
-    )
-    print_benchmark_results(bench_inline)
-
-    # ---- Upload voice ----
-    print(f"\n{'=' * 60}")
-    print("Uploading voice for cache test...")
-    print(f"{'=' * 60}")
-    await delete_voice(args.host, args.port, args.voice_name)
-    await upload_voice(
-        args.host,
-        args.port,
-        args.ref_audio,
-        args.ref_text,
-        args.voice_name,
-    )
-
-    # ---- Round B: Uploaded voice (cache hits after 1st request) ----
-    print(f"\n{'=' * 60}")
-    print("Round B: UPLOADED VOICE (cache hits after 1st request)")
-    print(f"{'=' * 60}")
-
-    def make_uploaded_payload(prompt: str) -> dict:
-        return {
-            "input": prompt,
-            "voice": args.voice_name,
-            "stream": True,
-            "response_format": "pcm",
-            "ref_text": args.ref_text,
-            "max_new_tokens": 2048,
-        }
-
-    bench_cached = await run_round(
-        args.host,
-        args.port,
- args.num_prompts,
- make_uploaded_payload,
- "uploaded_voice",
- num_warmups=args.num_warmups,
- )
- print_benchmark_results(bench_cached)
-
- # ---- Comparison ----
- print(f"\n{'=' * 60}")
- print("COMPARISON: Inline ref_audio vs Uploaded voice (cached)")
- print(f"{'=' * 60}")
- print(f"{'Metric':<30} {'Inline':>12} {'Cached':>12} {'Speedup':>10}")
- print(f"{'-' * 64}")
-
- def fmt_speedup(inline_val: float, cached_val: float) -> str:
- if cached_val > 0 and inline_val > 0:
- ratio = inline_val / cached_val
- return f"{ratio:.2f}x"
- return "N/A"
-
- rows = [
- ("Mean TTFP (ms)", bench_inline.mean_ttfp_ms, bench_cached.mean_ttfp_ms),
- ("Median TTFP (ms)", bench_inline.median_ttfp_ms, bench_cached.median_ttfp_ms),
- ("P99 TTFP (ms)", bench_inline.p99_ttfp_ms, bench_cached.p99_ttfp_ms),
- ("Mean E2E (ms)", bench_inline.mean_e2e_ms, bench_cached.mean_e2e_ms),
- ("Median E2E (ms)", bench_inline.median_e2e_ms, bench_cached.median_e2e_ms),
- ("Mean RTF", bench_inline.mean_rtf, bench_cached.mean_rtf),
- ]
- for label, a, b in rows:
- print(f"{label:<30} {a:>12.1f} {b:>12.1f} {fmt_speedup(a, b):>10}")
-
- print("\nNote: the first Round B request to reach the server is a cache MISS (cold start);")
- print(" with --num-warmups > 0 that is a warmup request. Later requests are cache HITs (skip DAC encoding).")
-
- # Cleanup.
- await delete_voice(args.host, args.port, args.voice_name)
-
-
-if __name__ == "__main__":
- asyncio.run(main())
diff --git a/benchmarks/fish-speech/fish_bench_utils.py b/benchmarks/fish-speech/fish_bench_utils.py
deleted file mode 100644
index cc84c4037fe..00000000000
--- a/benchmarks/fish-speech/fish_bench_utils.py
+++ /dev/null
@@ -1,501 +0,0 @@
-"""Shared benchmark infrastructure for Fish Speech serving benchmarks.
-
-Provides common dataclasses, metrics computation, streaming HTTP client,
-and result formatting used by model-specific benchmark scripts.
-
-Model-specific scripts supply a ``create_payload_fn(prompt) -> dict``
-callback and audio parameters; everything else is handled here.
-"""
-
-import asyncio
-import base64
-import json
-import time
-from collections.abc import Callable
-from dataclasses import asdict, dataclass, field
-from datetime import datetime
-from pathlib import Path
-
-import aiohttp
-import numpy as np
-from tqdm.asyncio import tqdm
-
-# ---------------------------------------------------------------------------
-# Shared test prompts (varying length for realistic workload)
-# ---------------------------------------------------------------------------
-PROMPTS = [
- "Hello, welcome to the voice synthesis benchmark test.",
- "She said she would be here by noon, but nobody showed up.",
- "The quick brown fox jumps over the lazy dog near the riverbank.",
- "I can't believe how beautiful the sunset looks from up here on the mountain.",
- "Please remember to bring your identification documents to the appointment tomorrow morning.",
- "Have you ever wondered what it would be like to travel through time and visit ancient civilizations?",
- "The restaurant on the corner serves the best pasta I have ever tasted in my entire life.",
- "After the meeting, we should discuss the quarterly results and plan for the next phase.",
- "Learning a new language takes patience, practice, and a genuine curiosity about other cultures.",
- "The train leaves at half past seven, so we need to arrive at the station before then.",
- "Could you please turn down the music a little bit, I'm trying to concentrate on my work.",
- "It was a dark and stormy night when the old lighthouse keeper heard a knock at the door.",
-]
-
-
-# ---------------------------------------------------------------------------
-# Dataclasses
-# ---------------------------------------------------------------------------
-@dataclass
-class RequestResult:
- success: bool = False
- ttfp: float = 0.0 # Time to first audio packet (seconds)
- e2e: float = 0.0 # End-to-end latency (seconds)
- audio_bytes: int = 0 # Total audio bytes received
- audio_duration: float = 0.0 # Audio duration in seconds
- rtf: float = 0.0 # Real-time factor = e2e / audio_duration
- prompt: str = ""
- error: str = ""
-
-
-@dataclass
-class BenchmarkResult:
- config_name: str = ""
- concurrency: int = 0
- num_prompts: int = 0
- completed: int = 0
- failed: int = 0
- duration_s: float = 0.0
- # TTFP stats (ms)
- mean_ttfp_ms: float = 0.0
- median_ttfp_ms: float = 0.0
- std_ttfp_ms: float = 0.0
- p90_ttfp_ms: float = 0.0
- p95_ttfp_ms: float = 0.0
- p99_ttfp_ms: float = 0.0
- # E2E stats (ms)
- mean_e2e_ms: float = 0.0
- median_e2e_ms: float = 0.0
- std_e2e_ms: float = 0.0
- p90_e2e_ms: float = 0.0
- p95_e2e_ms: float = 0.0
- p99_e2e_ms: float = 0.0
- # RTF stats
- mean_rtf: float = 0.0
- median_rtf: float = 0.0
- std_rtf: float = 0.0
- p99_rtf: float = 0.0
- # Audio stats
- mean_audio_duration_s: float = 0.0
- total_audio_duration_s: float = 0.0
- audio_throughput: float = 0.0 # audio_duration / wall_time
- request_throughput: float = 0.0 # requests / second
- # Per-request details
- per_request: list = field(default_factory=list)
-
-
-# ---------------------------------------------------------------------------
-# Audio helpers
-# ---------------------------------------------------------------------------
-def pcm_bytes_to_duration(
- num_bytes: int,
- sample_rate: int = 24000,
- sample_width: int = 2,
-) -> float:
- """Convert raw PCM byte count to duration in seconds."""
- return num_bytes / sample_width / sample_rate
-
-
-def _is_sse_response(response: aiohttp.ClientResponse) -> bool:
- content_type = (response.headers.get("Content-Type") or "").lower()
- return "text/event-stream" in content_type
-
-
-async def _read_raw_audio_stream(
- response: aiohttp.ClientResponse,
- *,
- start_time: float,
-) -> tuple[int, float]:
- first_audio_at = 0.0
- total_bytes = 0
-
- async for chunk in response.content.iter_any():
- if chunk and first_audio_at <= 0:
- first_audio_at = time.perf_counter() - start_time
- total_bytes += len(chunk)
-
- return total_bytes, first_audio_at
-
-
-def _extract_sse_payload(raw_event: bytes) -> bytes | None:
- data_lines: list[bytes] = []
- for raw_line in raw_event.splitlines():
- line = raw_line.rstrip(b"\r")
- if line.startswith(b"data: "):
- data_lines.append(line[6:])
- elif line.startswith(b"data:"):
- data_lines.append(line[5:].lstrip())
-
- if not data_lines:
- return None
- return b"\n".join(data_lines).strip()
-
-
-async def _read_sse_audio_stream(
- response: aiohttp.ClientResponse,
- *,
- start_time: float,
-) -> tuple[int, float]:
- """Decode SSE events and count raw audio bytes from base64 payloads."""
- first_audio_at = 0.0
- total_bytes = 0
- pending = b""
-
- async for chunk in response.content.iter_any():
- if not chunk:
- continue
- pending += chunk
- pending = pending.replace(b"\r\n", b"\n")
-
- while b"\n\n" in pending:
- raw_event, pending = pending.split(b"\n\n", 1)
- payload_bytes = _extract_sse_payload(raw_event)
- if payload_bytes is None:
- continue
- if payload_bytes == b"[DONE]":
- return total_bytes, first_audio_at
-
- try:
- payload = json.loads(payload_bytes)
- except json.JSONDecodeError as exc:
- raise ValueError(f"Invalid SSE JSON payload: {exc}") from exc
-
- audio = payload.get("audio")
- if not isinstance(audio, dict):
- continue
-
- audio_b64 = audio.get("data")
- if not audio_b64:
- continue
-
- try:
- audio_bytes = base64.b64decode(audio_b64)
- except Exception as exc:
- raise ValueError(f"Invalid base64 audio chunk: {exc}") from exc
-
- if audio_bytes and first_audio_at <= 0:
- first_audio_at = time.perf_counter() - start_time
- total_bytes += len(audio_bytes)
-
- return total_bytes, first_audio_at
-
-
-# ---------------------------------------------------------------------------
-# Metrics
-# ---------------------------------------------------------------------------
-def compute_stats(
- results: list[RequestResult],
- wall_time: float,
-) -> BenchmarkResult:
- """Compute aggregate statistics from per-request results."""
- successful = [r for r in results if r.success]
- failed = [r for r in results if not r.success]
-
- bench = BenchmarkResult(
- completed=len(successful),
- failed=len(failed),
- duration_s=wall_time,
- )
-
- if not successful:
- return bench
-
- ttfps = [r.ttfp * 1000 for r in successful]
- e2es = [r.e2e * 1000 for r in successful]
- rtfs = [r.rtf for r in successful]
- audio_durs = [r.audio_duration for r in successful]
-
- bench.mean_ttfp_ms = float(np.mean(ttfps))
- bench.median_ttfp_ms = float(np.median(ttfps))
- bench.std_ttfp_ms = float(np.std(ttfps))
- bench.p90_ttfp_ms = float(np.percentile(ttfps, 90))
- bench.p95_ttfp_ms = float(np.percentile(ttfps, 95))
- bench.p99_ttfp_ms = float(np.percentile(ttfps, 99))
-
- bench.mean_e2e_ms = float(np.mean(e2es))
- bench.median_e2e_ms = float(np.median(e2es))
- bench.std_e2e_ms = float(np.std(e2es))
- bench.p90_e2e_ms = float(np.percentile(e2es, 90))
- bench.p95_e2e_ms = float(np.percentile(e2es, 95))
- bench.p99_e2e_ms = float(np.percentile(e2es, 99))
-
- bench.mean_rtf = float(np.mean(rtfs))
- bench.median_rtf = float(np.median(rtfs))
- bench.std_rtf = float(np.std(rtfs))
- bench.p99_rtf = float(np.percentile(rtfs, 99))
-
- bench.mean_audio_duration_s = float(np.mean(audio_durs))
- bench.total_audio_duration_s = float(np.sum(audio_durs))
- bench.audio_throughput = bench.total_audio_duration_s / wall_time
- bench.request_throughput = len(successful) / wall_time
-
- bench.per_request = [
- {
- "ttfp_ms": r.ttfp * 1000,
- "e2e_ms": r.e2e * 1000,
- "rtf": r.rtf,
- "audio_duration_s": r.audio_duration,
- "prompt": r.prompt,
- }
- for r in successful
- ]
-
- return bench
-
-
-# ---------------------------------------------------------------------------
-# Output formatting
-# ---------------------------------------------------------------------------
-def print_benchmark_results(bench: BenchmarkResult) -> None:
- """Print benchmark results in standardized format."""
- W = 50
- print("")
- print(f"{'=' * W}")
- print(f"{'Serving Benchmark Result':^{W}}")
- print(f"{'=' * W}")
- print(f"{'Successful requests:':<40}{bench.completed:<10}")
- print(f"{'Failed requests:':<40}{bench.failed:<10}")
- print(f"{'Maximum request concurrency:':<40}{bench.concurrency:<10}")
- print(f"{'Benchmark duration (s):':<40}{bench.duration_s:<10.2f}")
- print(f"{'Request throughput (req/s):':<40}{bench.request_throughput:<10.2f}")
- print(f"{'-' * W}")
- print(f"{'End-to-end Latency':^{W}}")
- print(f"{'-' * W}")
- print(f"{'Mean E2EL (ms):':<40}{bench.mean_e2e_ms:<10.2f}")
- print(f"{'Median E2EL (ms):':<40}{bench.median_e2e_ms:<10.2f}")
- print(f"{'P99 E2EL (ms):':<40}{bench.p99_e2e_ms:<10.2f}")
- print(f"{'=' * W}")
- print(f"{'Audio Result':^{W}}")
- print(f"{'=' * W}")
- print(f"{'Total audio duration generated (s):':<40}{bench.total_audio_duration_s:<10.2f}")
- print(f"{'Audio throughput (audio duration/s):':<40}{bench.audio_throughput:<10.2f}")
- print(f"{'-' * W}")
- print(f"{'Time to First Packet':^{W}}")
- print(f"{'-' * W}")
- print(f"{'Mean AUDIO_TTFP (ms):':<40}{bench.mean_ttfp_ms:<10.2f}")
- print(f"{'Median AUDIO_TTFP (ms):':<40}{bench.median_ttfp_ms:<10.2f}")
- print(f"{'P99 AUDIO_TTFP (ms):':<40}{bench.p99_ttfp_ms:<10.2f}")
- print(f"{'-' * W}")
- print(f"{'Real Time Factor':^{W}}")
- print(f"{'-' * W}")
- print(f"{'Mean AUDIO_RTF:':<40}{bench.mean_rtf:<10.3f}")
- print(f"{'Median AUDIO_RTF:':<40}{bench.median_rtf:<10.3f}")
- print(f"{'P99 AUDIO_RTF:':<40}{bench.p99_rtf:<10.3f}")
- print(f"{'=' * W}")
- print("")
-
-
-def save_results(
- all_results: list[dict],
- result_dir: str,
- config_name: str,
-) -> Path:
- """Save benchmark results as JSON and return the file path."""
- out = Path(result_dir)
- out.mkdir(parents=True, exist_ok=True)
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
- result_file = out / f"bench_{config_name}_{timestamp}.json"
-
- with open(result_file, "w") as f:
- json.dump(all_results, f, indent=2)
- print(f"Results saved to {result_file}")
- return result_file
-
-
-# ---------------------------------------------------------------------------
-# Streaming HTTP client
-# ---------------------------------------------------------------------------
-async def send_streaming_request(
- session: aiohttp.ClientSession,
- api_url: str,
- payload: dict,
- sample_rate: int,
- sample_width: int,
- pbar: tqdm | None = None,
-) -> RequestResult:
- """Send a streaming TTS request and measure latency metrics."""
- result = RequestResult(prompt=payload.get("input", ""))
- st = time.perf_counter()
-
- try:
- async with session.post(api_url, json=payload) as response:
- if response.status != 200:
- result.error = f"HTTP {response.status}: {await response.text()}"
- else:
- if _is_sse_response(response):
- total_bytes, result.ttfp = await _read_sse_audio_stream(
- response,
- start_time=st,
- )
- else:
- total_bytes, result.ttfp = await _read_raw_audio_stream(
- response,
- start_time=st,
- )
-
- result.e2e = time.perf_counter() - st
- result.audio_bytes = total_bytes
- result.audio_duration = pcm_bytes_to_duration(total_bytes, sample_rate, sample_width)
-
- if total_bytes <= 0 or result.ttfp <= 0:
- result.error = "HTTP 200 but no audio bytes were received"
- else:
- if result.audio_duration > 0:
- result.rtf = result.e2e / result.audio_duration
- result.success = True
-
- except Exception as e:
- result.error = str(e)
- result.e2e = time.perf_counter() - st
-
- finally:
- if pbar:
- pbar.update(1)
- return result
-
-
-# ---------------------------------------------------------------------------
-# Benchmark runner
-# ---------------------------------------------------------------------------
-async def run_benchmark(
- host: str,
- port: int,
- num_prompts: int,
- max_concurrency: int,
- create_payload_fn: Callable[[str], dict],
- sample_rate: int,
- sample_width: int = 2,
- num_warmups: int = 3,
- request_timeout_s: float = 120.0,
-) -> BenchmarkResult:
- """Run a TTS streaming benchmark at a given concurrency level.
-
- Args:
- create_payload_fn: Model-specific function that takes a prompt string
- and returns the request JSON payload dict.
- sample_rate: PCM sample rate for audio duration calculation.
- sample_width: PCM sample width in bytes (default 2 for 16-bit).
- """
- api_url = f"http://{host}:{port}/v1/audio/speech"
-
- connector = aiohttp.TCPConnector(
- limit=max_concurrency,
- limit_per_host=max_concurrency,
- keepalive_timeout=60,
- )
- session = aiohttp.ClientSession(
- connector=connector,
- timeout=aiohttp.ClientTimeout(
- total=request_timeout_s,
- connect=min(10.0, request_timeout_s),
- sock_connect=min(10.0, request_timeout_s),
- sock_read=request_timeout_s,
- ),
- )
-
- try:
- # Warmup
- if num_warmups > 0:
- print(f" Warming up with {num_warmups} requests...")
- warmup_tasks = [
- send_streaming_request(
- session,
- api_url,
- create_payload_fn(PROMPTS[i % len(PROMPTS)]),
- sample_rate,
- sample_width,
- )
- for i in range(num_warmups)
- ]
- warmup_results = await asyncio.gather(*warmup_tasks)
- warmup_ok = sum(1 for r in warmup_results if r.success)
- if warmup_ok == 0:
- print(" WARNING: All warmup requests failed!")
- for r in warmup_results:
- if r.error:
- print(f" {r.error[:200]}")
- print(f" Warmup done ({warmup_ok}/{num_warmups} succeeded).")
-
- # Build request list
- request_prompts = [PROMPTS[i % len(PROMPTS)] for i in range(num_prompts)]
-
- # Run
- print(f" Running {num_prompts} requests with concurrency={max_concurrency}...")
- semaphore = asyncio.Semaphore(max_concurrency)
- pbar = tqdm(total=num_prompts, desc=f" concurrency={max_concurrency}")
-
- async def limited_request(prompt: str) -> RequestResult:
- async with semaphore:
- return await send_streaming_request(
- session,
- api_url,
- create_payload_fn(prompt),
- sample_rate,
- sample_width,
- pbar,
- )
-
- start_time = time.perf_counter()
- tasks = [asyncio.create_task(limited_request(p)) for p in request_prompts]
- results: list[RequestResult] = await asyncio.gather(*tasks)
- wall_time = time.perf_counter() - start_time
- pbar.close()
-
- finally:
- await session.close()
-
- # Compute stats
- bench = compute_stats(results, wall_time)
- bench.concurrency = max_concurrency
- bench.num_prompts = num_prompts
-
- print_benchmark_results(bench)
-
- # Print sample errors
- failed = [r for r in results if not r.success]
- if failed:
- for r in failed[:3]:
- print(f" [ERROR] {r.error[:200]}")
-
- return bench
-
-
-async def run_benchmark_sweep(
- host: str,
- port: int,
- num_prompts: int,
- concurrency_levels: list[int],
- create_payload_fn: Callable[[str], dict],
- sample_rate: int,
- sample_width: int = 2,
- num_warmups: int = 3,
- request_timeout_s: float = 120.0,
- config_name: str = "benchmark",
- result_dir: str = "results",
-) -> list[dict]:
- """Run benchmarks across multiple concurrency levels and save results."""
- all_results = []
-
- for concurrency in concurrency_levels:
- result = await run_benchmark(
- host=host,
- port=port,
- num_prompts=num_prompts,
- max_concurrency=concurrency,
- create_payload_fn=create_payload_fn,
- sample_rate=sample_rate,
- sample_width=sample_width,
- num_warmups=num_warmups,
- request_timeout_s=request_timeout_s,
- )
- result.config_name = config_name
- all_results.append(asdict(result))
-
- save_results(all_results, result_dir, config_name)
- return all_results
diff --git a/benchmarks/glm_image/README.md b/benchmarks/glm_image/README.md
deleted file mode 100644
index 485e081426f..00000000000
--- a/benchmarks/glm_image/README.md
+++ /dev/null
@@ -1,157 +0,0 @@
-# GLM-Image Benchmarks
-
-Benchmark GLM-Image T2I (text-to-image) and I2I (image-to-image) performance across three backends: HuggingFace baseline, vLLM-Omni offline, and vLLM-Omni online serving.
-
-## Benchmarks
-
-| Benchmark | Script | Description |
-|-----------|--------|-------------|
-| HuggingFace Baseline | `huggingface/inference.py` | Single-GPU transformers + diffusers pipeline |
-| vLLM-Omni Offline | `vllm-omni/inference.py` | Offline inference with continuous batching |
-| vLLM-Omni Online | `benchmark_glm_image.py` | Online serving via `/v1/chat/completions` |
-
-## HuggingFace Baseline
-
-Single-request sequential inference using the reference HuggingFace pipeline.
-
-```bash
-# T2I
-CUDA_VISIBLE_DEVICES=0 python benchmarks/glm_image/huggingface/inference.py \
- --model-path /path/to/GLM-Image --mode t2i --num-prompts 10
-
-# I2I
-CUDA_VISIBLE_DEVICES=0 python benchmarks/glm_image/huggingface/inference.py \
- --model-path /path/to/GLM-Image --mode i2i --num-prompts 10
-```
-
-### Options
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `--model-path` | `zai-org/GLM-Image` | Model path |
-| `--mode` | `t2i` | `t2i` or `i2i` |
-| `--dataset-path` | `prompt/prompt.json` | Path to prompt.json |
-| `--num-prompts` | `10` | Number of images to generate |
-| `--width` / `--height` | `1024` | Output image size |
-| `--num-inference-steps` | `50` | Diffusion denoising steps |
-| `--output-dir` | `benchmarks/glm_image/huggingface/outputs` | Output directory |
-| `--output-file` | - | JSON file for metrics |
-
-## vLLM-Omni Offline
-
-Multi-GPU offline inference with pipeline parallelism and continuous batching.
-
-```bash
-# T2I
-CUDA_VISIBLE_DEVICES=0,1 python benchmarks/glm_image/vllm-omni/inference.py \
- --model-path /path/to/GLM-Image --mode t2i --num-prompts 10
-
-# I2I
-CUDA_VISIBLE_DEVICES=0,1 python benchmarks/glm_image/vllm-omni/inference.py \
- --model-path /path/to/GLM-Image --mode i2i --num-prompts 10
-```
-
-### Options
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `--model-path` | `zai-org/GLM-Image` | Model path |
-| `--deploy-config` | - | Deploy config YAML |
-| `--mode` | `t2i` | `t2i` or `i2i` |
-| `--dataset-path` | `prompt/prompt.json` | Path to prompt.json |
-| `--num-prompts` | `10` | Number of images to generate |
-| `--width` / `--height` | `1024` | Output image size |
-| `--num-inference-steps` | `50` | Diffusion denoising steps |
-| `--output-dir` | `benchmarks/glm_image/vllm-omni/outputs` | Output directory |
-| `--output-file` | - | JSON file for metrics |
-| `--stage-init-timeout` | `600` | Stage initialization timeout (s) |
-
-### Latency Computation
-
-In offline mode, all requests are submitted simultaneously and processed with continuous batching. Per-request latency is the sum of the per-stage times, where `stage_0_gen_ms` is taken as the difference from the previous request's value so that accumulated queue and scheduling wait is excluded.
-
-## vLLM-Omni Online Serving
-
-### Start the server
-
-```bash
-CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/GLM-Image \
- --omni --port 8091 --host 0.0.0.0 \
- --served-model-name glm-image
-```
-
-### Run the benchmark
-
-```bash
-# T2I
-python benchmarks/glm_image/benchmark_glm_image.py \
- --mode t2i --num-prompts 10 --model glm-image
-
-# I2I
-python benchmarks/glm_image/benchmark_glm_image.py \
- --mode i2i --num-prompts 10 --model glm-image
-
-# Custom dataset
-python benchmarks/glm_image/benchmark_glm_image.py \
- --mode i2i --dataset custom \
- --dataset-path prompts.json --num-prompts 5
-```
-
-### Options
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `--mode` | `t2i` | `t2i` or `i2i` |
-| `--dataset` | `prompt` | `prompt`, `random`, or `custom` |
-| `--dataset-path` | - | JSON file path (required for `custom`) |
-| `--num-prompts` | `10` | Number of benchmark requests |
-| `--max-concurrency` | `1` | Max concurrent requests |
-| `--request-rate` | `inf` | Requests per second (Poisson arrival) |
-| `--warmup-requests` | `1` | Warmup requests before measurement |
-| `--width` / `--height` | `1024` | Output image size |
-| `--num-inference-steps` | `50` | Diffusion denoising steps |
-| `--seed` | - | Random seed |
-| `--model` | `default` | Model name (must match `--served-model-name`) |
-| `--host` | `localhost` | Server host |
-| `--port` | `8091` | Server port |
-| `--output-file` | - | JSON output file for metrics |
-| `--num-input-images` | `1` | Number of input images for random I2I |
-
-## Dataset
-
-The default dataset is hosted on [HuggingFace](https://huggingface.co/datasets/JaredforReal/glm-image-bench) (`prompt.json`). It is downloaded automatically and cached at `prompt/prompt.json` on first run, so no manual setup is needed.
-
-Each entry contains:
-
-- `t2i_prompt`: Text prompt for text-to-image generation
-- `i2i_prompt`: Text prompt for image-to-image editing
-- `image_url`: Source image URL for I2I (downloaded and cached on first use)
-
-Custom datasets use the same JSON format and can be provided via `--dataset-path`.
-
-## Pipeline Timings
-
-All three benchmarks report per-stage pipeline timings (in milliseconds):
-
-| Key | Description |
-|-----|-------------|
-| `preprocess_ms` | Input preprocessing (tokenization, multimodal encoding) |
-| `stage_0_gen_ms` | AR (autoregressive) model generation time |
-| `ar2diffusion_ms` | AR output to diffusion input conversion |
-| `stage_1_gen_ms` | Diffusion model denoising time |
-| `queue_wait_ms` | Queue wait time before processing |
-
-The stages are ordered by execution: `preprocess → stage_0 (AR) → ar2diffusion → stage_1 (Diffusion)`.
-
-## Sample Results
-
-Tested on 2x GPU with 10 prompts, 1024x1024, 50 denoising steps:
-
-| Backend | Mode | Latency Mean (s) | Throughput (img/s) |
-|---------|------|-------------------|--------------------|
-| HuggingFace | T2I | 72.6 | 0.014 |
-| HuggingFace | I2I | 70.9 | 0.014 |
-| vLLM-Omni Offline | T2I | 35.0 | 0.044 |
-| vLLM-Omni Offline | I2I | 31.0 | 0.053 |
-| vLLM-Omni Online | T2I | 38.8 | 0.026 |
-| vLLM-Omni Online | I2I | 34.7 | 0.029 |
diff --git a/benchmarks/glm_image/__init__.py b/benchmarks/glm_image/__init__.py
deleted file mode 100644
index e69de29bb2d..00000000000
diff --git a/benchmarks/glm_image/benchmark_glm_image.py b/benchmarks/glm_image/benchmark_glm_image.py
deleted file mode 100644
index 9f8df3f1986..00000000000
--- a/benchmarks/glm_image/benchmark_glm_image.py
+++ /dev/null
@@ -1,464 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Online serving benchmark for GLM-Image (T2I and I2I modes).
-
-Sends requests to the /v1/chat/completions endpoint and reports end-to-end
-latency, throughput, and per-stage durations (when the server is started with
---enable-diffusion-pipeline-profiler and/or --enable-ar-profiler).
-
-Supports three dataset types:
- - prompt: Use prompt.json (default). T2I uses t2i_prompt, I2I uses i2i_prompt
- and sends source images from image_url.
- - random: Generate synthetic prompts (and random images for I2I).
- - custom: Load from a user-specified JSON file.
-
-Usage:
- # T2I with prompt.json (default)
- python benchmarks/glm_image/benchmark_glm_image.py \
- --mode t2i --num-prompts 10
-
- # I2I with prompt.json (downloads source images automatically)
- python benchmarks/glm_image/benchmark_glm_image.py \
- --mode i2i --num-prompts 10
-
- # Random dataset
- python benchmarks/glm_image/benchmark_glm_image.py \
- --mode t2i --dataset random --num-prompts 20
-
- # Custom dataset
- python benchmarks/glm_image/benchmark_glm_image.py \
- --mode i2i --dataset custom \
- --dataset-path my_prompts.json --num-prompts 5
-"""
-
-import argparse
-import asyncio
-import base64
-import json
-import os
-import sys
-import tempfile
-import time
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-import aiohttp
-import numpy as np
-import requests as sync_requests
-from PIL import Image
-from tqdm.asyncio import tqdm
-
-# Import backends from the diffusion benchmark (add parent dirs to path)
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "diffusion"))
-from backends import RequestFuncOutput
-
-BENCHMARK_DIR = Path(__file__).resolve().parent
-DEFAULT_PROMPT_JSON = BENCHMARK_DIR / "prompt" / "prompt.json"
-IMAGE_CACHE_DIR = BENCHMARK_DIR / "prompt" / "images"
-
-DATASET_REPO = "JaredforReal/glm-image-bench"
-DATASET_FILE = "prompt.json"
-
-
-def _ensure_prompt_json(dataset_path: str | None) -> str:
- """Return path to prompt.json, downloading from HuggingFace if needed."""
- if dataset_path:
- return dataset_path
- local = DEFAULT_PROMPT_JSON
- if local.exists():
- return str(local)
- print(f"Downloading {DATASET_FILE} from {DATASET_REPO} ...")
- try:
- from huggingface_hub import hf_hub_download
-
- downloaded = hf_hub_download(
- repo_id=DATASET_REPO,
- filename=DATASET_FILE,
- repo_type="dataset",
- )
- local.parent.mkdir(parents=True, exist_ok=True)
- import shutil
-
- shutil.copy2(downloaded, local)
- print(f"Saved to {local}")
- except ImportError:
- url = f"https://huggingface.co/datasets/{DATASET_REPO}/resolve/main/{DATASET_FILE}"
- import urllib.request
-
- local.parent.mkdir(parents=True, exist_ok=True)
- urllib.request.urlretrieve(url, local)
- print(f"Saved to {local}")
- return str(local)
-
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-
-@dataclass
-class GLMImageRequest:
- prompt: str
- image_path: str | None = None # Only for I2I mode
-
-
-def download_image(url: str) -> str:
- """Download an image to cache and return the local path."""
- IMAGE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
- fname = url.rsplit("/", 1)[-1]
- local_path = IMAGE_CACHE_DIR / fname
- if local_path.exists():
- return str(local_path)
- resp = sync_requests.get(url, timeout=30)
- resp.raise_for_status()
- local_path.write_bytes(resp.content)
- return str(local_path)
-
-
-def encode_image_as_data_url(path: str) -> str:
- """Encode a local image file as a base64 data URL."""
- with open(path, "rb") as f:
- encoded = base64.b64encode(f.read()).decode("utf-8")
- ext = Path(path).suffix.lower()
- mime = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}.get(ext, "image/png")
- return f"data:{mime};base64,{encoded}"
-
-
-# ---------------------------------------------------------------------------
-# Datasets
-# ---------------------------------------------------------------------------
-
-
-class PromptDataset:
- """Load from prompt.json. T2I uses t2i_prompt, I2I uses i2i_prompt + image_url."""
-
- def __init__(self, args: argparse.Namespace):
- path = _ensure_prompt_json(args.dataset_path)
- with open(path, encoding="utf-8") as f:
- raw = json.load(f)
-
- prompt_key = "t2i_prompt" if args.mode == "t2i" else "i2i_prompt"
- self.items: list[GLMImageRequest] = []
-
- for entry in raw:
- prompt = entry.get(prompt_key, "").strip()
- if not prompt:
- continue
- image_path = None
- if args.mode == "i2i":
- url = entry.get("image_url", "")
- if url:
- image_path = download_image(url)
- self.items.append(GLMImageRequest(prompt=prompt, image_path=image_path))
-
- if args.num_prompts and len(self.items) > args.num_prompts:
- self.items = self.items[: args.num_prompts]
-
- def __len__(self) -> int:
- return len(self.items)
-
- def __getitem__(self, idx: int) -> GLMImageRequest:
- return self.items[idx]
-
- def get_requests(self) -> list[GLMImageRequest]:
- return list(self.items)
-
-
-class RandomDataset:
- """Generate synthetic prompts (and optional random images for I2I)."""
-
- def __init__(self, args: argparse.Namespace):
- self.args = args
- self.num_prompts = args.num_prompts
- self._random_image_paths: list[str] | None = None
- if args.mode == "i2i":
- self._random_image_paths = self._generate_random_images()
-
- def _generate_random_images(self) -> list[str]:
- paths: list[str] = []
- for i in range(self.args.num_input_images):
- img = Image.new("RGB", (512, 512), (128 + (i * 30) % 128, 64, 192))
- path = os.path.join(tempfile.gettempdir(), f"glm_image_bench_input_{i}.png")
- img.save(path)
- paths.append(path)
- return paths
-
- def __len__(self) -> int:
- return self.num_prompts
-
- def __getitem__(self, idx: int) -> GLMImageRequest:
- image_path = None
- if self._random_image_paths is not None:
- image_path = self._random_image_paths[idx % len(self._random_image_paths)]
- return GLMImageRequest(
- prompt=f"A beautiful scene with vivid colors and intricate details, prompt {idx}",
- image_path=image_path,
- )
-
- def get_requests(self) -> list[GLMImageRequest]:
- return [self[i] for i in range(len(self))]
-
-
-class CustomDataset:
- """Load from a user-specified JSON file.
-
- Expected format:
- [
- {"prompt": "A cat sitting on a windowsill"},
- {"prompt": "Make it look like winter", "image_path": "/path/to/img.png"}
- ]
- """
-
- def __init__(self, args: argparse.Namespace):
- if not args.dataset_path:
- raise ValueError("--dataset-path is required for custom dataset")
- with open(args.dataset_path, encoding="utf-8") as f:
- raw = json.load(f)
- self.items: list[GLMImageRequest] = []
- for item in raw:
- self.items.append(
- GLMImageRequest(
- prompt=item.get("prompt", ""),
- image_path=item.get("image_path"),
- )
- )
- if args.num_prompts and len(self.items) > args.num_prompts:
- self.items = self.items[: args.num_prompts]
-
- def __len__(self) -> int:
- return len(self.items)
-
- def __getitem__(self, idx: int) -> GLMImageRequest:
- return self.items[idx]
-
- def get_requests(self) -> list[GLMImageRequest]:
- return list(self.items)
-
-
-# ---------------------------------------------------------------------------
-# Async request for GLM-Image (chat completions with image support)
-# ---------------------------------------------------------------------------
-
-
-async def async_glm_image_request(
-    req: GLMImageRequest,
-    api_url: str,
-    model: str,
-    session: aiohttp.ClientSession,
-    pbar: Any,
-    args: argparse.Namespace,
-) -> RequestFuncOutput:
-    """Send a single T2I or I2I request via chat completions endpoint."""
-    output = RequestFuncOutput()
-    output.start_time = time.perf_counter()
-
-    # Build messages
-    if req.image_path and args.mode == "i2i":
-        data_url = encode_image_as_data_url(req.image_path)
-        content = [
-            {"type": "text", "text": req.prompt},
-            {"type": "image_url", "image_url": {"url": data_url}},
-        ]
-    else:
-        content = req.prompt
-
-    messages = [{"role": "user", "content": content}]
-
-    extra_body: dict[str, Any] = {}
-    if args.height:
-        extra_body["height"] = args.height
-    if args.width:
-        extra_body["width"] = args.width
-    if args.num_inference_steps:
-        extra_body["num_inference_steps"] = args.num_inference_steps
-    if args.seed is not None:
-        extra_body["seed"] = args.seed
-
-    payload: dict[str, Any] = {
-        "model": model,
-        "messages": messages,
-    }
-    if extra_body:
-        payload["extra_body"] = extra_body
-
-    try:
-        async with session.post(api_url, json=payload) as response:
-            if response.status == 200:
-                resp_json = await response.json()
-                output.response_body = resp_json
-                output.success = True
-                try:
-                    choices = resp_json.get("choices", [])
-                    if choices and isinstance(choices, list):
-                        msg = choices[0].get("message", {})
-                        if isinstance(msg, dict):
-                            resp_content = msg.get("content", [])
-                            if resp_content and isinstance(resp_content, list) and len(resp_content) > 0:
-                                first_item = resp_content[0]
-                                if isinstance(first_item, dict):
-                                    output.stage_durations = first_item.get("stage_durations") or {}
-                                    output.peak_memory_mb = first_item.get("peak_memory_mb", 0.0)
-                except (IndexError, TypeError, AttributeError):
-                    pass
-            else:
-                output.error = f"HTTP {response.status}: {await response.text()}"
-                output.success = False
-    except Exception as e:
-        output.error = str(e)
-        output.success = False
-
-    output.latency = time.perf_counter() - output.start_time
-    if pbar:
-        pbar.update(1)
-    return output
-
-
-# ---------------------------------------------------------------------------
-# Benchmark
-# ---------------------------------------------------------------------------
-
-
-async def iter_requests(n: int, request_rate: float) -> Any:
-    import random as _random
-
-    for i in range(n):
-        if request_rate != float("inf") and i > 0:
-            await asyncio.sleep(_random.expovariate(request_rate))
-        yield i
-
-
-def calculate_metrics(outputs: list[RequestFuncOutput], total_duration: float) -> dict[str, Any]:
-    success = [o for o in outputs if o.success]
-    errors = [o for o in outputs if not o.success]
-    latencies = [o.latency for o in success]
-    peak_mems = [o.peak_memory_mb for o in success if o.peak_memory_mb > 0]
-
-    stage_duration_lists: dict[str, list[float]] = {}
-    for o in success:
-        for stage, dur in (o.stage_durations or {}).items():
-            stage_duration_lists.setdefault(stage, []).append(dur)
-
-    return {
-        "duration": total_duration,
-        "completed_requests": len(success),
-        "failed_requests": len(errors),
-        "throughput_qps": len(success) / total_duration if total_duration > 0 else 0,
-        "latency_mean": float(np.mean(latencies)) if latencies else 0,
-        "latency_median": float(np.median(latencies)) if latencies else 0,
-        "latency_p99": float(np.percentile(latencies, 99)) if latencies else 0,
-        "latency_p95": float(np.percentile(latencies, 95)) if latencies else 0,
-        "peak_memory_mb_max": max(peak_mems) if peak_mems else 0,
-        "stage_durations_mean": {s: float(np.mean(v)) for s, v in stage_duration_lists.items()},
-        "stage_durations_p50": {s: float(np.percentile(v, 50)) for s, v in stage_duration_lists.items()},
-    }
-
-
-async def benchmark(args: argparse.Namespace) -> None:
- api_url = f"http://{args.host}:{args.port}/v1/chat/completions"
-
- # Load dataset
- if args.dataset == "prompt":
- dataset = PromptDataset(args)
- elif args.dataset == "random":
- dataset = RandomDataset(args)
- elif args.dataset == "custom":
- dataset = CustomDataset(args)
- else:
- raise ValueError(f"Unknown dataset: {args.dataset}")
-
- glm_requests = dataset.get_requests()
- print(f"Prepared {len(glm_requests)} requests (mode={args.mode}, dataset={args.dataset})")
-
- semaphore = asyncio.Semaphore(args.max_concurrency) if args.max_concurrency else None
-
- async def limited_request(idx: int, req: GLMImageRequest, session: aiohttp.ClientSession, pbar: Any):
- if semaphore:
- async with semaphore:
- return await async_glm_image_request(req, api_url, args.model, session, pbar, args)
- return await async_glm_image_request(req, api_url, args.model, session, pbar, args)
-
- async with aiohttp.ClientSession() as session:
- # Warmup
- if args.warmup_requests and glm_requests:
- print(f"Running {args.warmup_requests} warmup request(s)...")
- for i in range(args.warmup_requests):
- await limited_request(i, glm_requests[i % len(glm_requests)], session, None)
-
- # Main benchmark
- pbar = tqdm(total=len(glm_requests), disable=args.disable_tqdm)
- start_time = time.perf_counter()
- tasks = []
- async for idx in iter_requests(len(glm_requests), args.request_rate):
- tasks.append(asyncio.create_task(limited_request(idx, glm_requests[idx], session, pbar)))
- outputs = await asyncio.gather(*tasks)
- total_duration = time.perf_counter() - start_time
- pbar.close()
-
- # Metrics
- metrics = calculate_metrics(outputs, total_duration)
- metrics["mode"] = args.mode
- metrics["model"] = args.model
- metrics["dataset"] = args.dataset
-
- print(f"\n{' GLM-Image Online Benchmark Result ':=^60}")
- print(f"{'Mode:':<40} {args.mode}")
- print(f"{'Model:':<40} {args.model}")
- print(f"{'Dataset:':<40} {args.dataset}")
- print("-" * 50)
- print(f"{'Benchmark duration (s):':<40} {metrics['duration']:.2f}")
- print(f"{'Request rate:':<40} {args.request_rate}")
- print(f"{'Max concurrency:':<40} {args.max_concurrency}")
- print(f"{'Successful requests:':<40} {metrics['completed_requests']}/{len(glm_requests)}")
- print("-" * 50)
- print(f"{'Throughput (req/s):':<40} {metrics['throughput_qps']:.2f}")
- print(f"{'Latency Mean (s):':<40} {metrics['latency_mean']:.4f}")
- print(f"{'Latency Median (s):':<40} {metrics['latency_median']:.4f}")
- print(f"{'Latency P95 (s):':<40} {metrics['latency_p95']:.4f}")
- print(f"{'Latency P99 (s):':<40} {metrics['latency_p99']:.4f}")
-
- if metrics["peak_memory_mb_max"] > 0:
- print("-" * 50)
- print(f"{'Peak Memory Max (MB):':<40} {metrics['peak_memory_mb_max']:.2f}")
-
- if metrics["stage_durations_mean"]:
- print("-" * 50)
- print("Stage Durations Mean:")
- for stage, val in sorted(metrics["stage_durations_mean"].items()):
- unit = "ms" if stage.endswith("_ms") else "s"
- print(f" {stage + ':':<38} {val:.4f} ({unit})")
-
- print("=" * 60)
-
- if args.output_file:
- with open(args.output_file, "w") as f:
- json.dump(metrics, f, indent=2)
- print(f"Metrics saved to {args.output_file}")
-
-
-def main() -> None:
- parser = argparse.ArgumentParser(description="Benchmark GLM-Image T2I/I2I online serving.")
- parser.add_argument("--mode", type=str, default="t2i", choices=["t2i", "i2i"])
- parser.add_argument("--dataset", type=str, default="prompt", choices=["prompt", "random", "custom"])
- parser.add_argument("--dataset-path", type=str, default=None)
- parser.add_argument("--num-prompts", type=int, default=10)
- parser.add_argument("--max-concurrency", type=int, default=1)
- parser.add_argument("--request-rate", type=float, default=float("inf"))
- parser.add_argument("--warmup-requests", type=int, default=1)
- parser.add_argument("--width", type=int, default=1024)
- parser.add_argument("--height", type=int, default=1024)
- parser.add_argument("--num-inference-steps", type=int, default=50)
- parser.add_argument("--seed", type=int, default=None)
- parser.add_argument("--model", type=str, default="default")
- parser.add_argument("--host", type=str, default="localhost")
- parser.add_argument("--port", type=int, default=8091)
- parser.add_argument("--output-file", type=str, default=None)
- parser.add_argument("--disable-tqdm", action="store_true")
- parser.add_argument("--num-input-images", type=int, default=1, help="For random I2I dataset.")
- args = parser.parse_args()
- asyncio.run(benchmark(args))
-
-
-if __name__ == "__main__":
- main()
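
Review note on the deleted online benchmark above: `iter_requests` sleeps for an exponentially distributed gap between request starts, which gives Poisson arrivals at `request_rate` requests per second, and `limited_request` caps in-flight work with a semaphore. The sketch below reproduces that pacing pattern in isolation. `poisson_arrivals`, `fake_request`, and `run` are hypothetical names, and `fake_request` is only a stand-in for the real HTTP call, so treat this as an illustration rather than the removed implementation.

```python
import asyncio
import random


async def poisson_arrivals(n, request_rate, rng):
    """Yield 0..n-1, sleeping an exponential gap before every item but the first."""
    for i in range(n):
        if request_rate != float("inf") and i > 0:
            await asyncio.sleep(rng.expovariate(request_rate))
        yield i


async def fake_request(idx, semaphore, in_flight, peak):
    # Hypothetical stand-in for the HTTP request; tracks peak concurrency.
    async with semaphore:
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0.01)
        in_flight[0] -= 1
        return idx


async def run(n=8, request_rate=200.0, max_concurrency=2, seed=0):
    rng = random.Random(seed)
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight, peak = [0], [0]
    tasks = []
    async for idx in poisson_arrivals(n, request_rate, rng):
        tasks.append(asyncio.create_task(fake_request(idx, semaphore, in_flight, peak)))
    # gather returns results in task-creation order, not completion order.
    results = await asyncio.gather(*tasks)
    return list(results), peak[0]


results, peak = asyncio.run(run())
```

Because tasks are created as arrivals occur and only the request body waits on the semaphore, measured latency in this pattern includes any time a request spends queued behind the concurrency cap.
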
diff --git a/benchmarks/glm_image/huggingface/inference.py b/benchmarks/glm_image/huggingface/inference.py
deleted file mode 100644
index ff826080e8c..00000000000
--- a/benchmarks/glm_image/huggingface/inference.py
+++ /dev/null
@@ -1,291 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-HuggingFace (transformers + diffusers) baseline benchmark for GLM-Image.
-
-Supports T2I and I2I modes with the prompt.json dataset.
-Downloads source images for I2I from image_url on first run and caches locally.
-
-Usage:
- # T2I mode (text-to-image, no source images needed)
- python benchmarks/glm_image/huggingface/inference.py \
- --model-path zai-org/GLM-Image \
- --mode t2i --num-prompts 10
-
- # I2I mode (image-to-image, downloads source images)
- python benchmarks/glm_image/huggingface/inference.py \
- --model-path zai-org/GLM-Image \
- --mode i2i --num-prompts 10
-
- # With custom prompt.json
- python benchmarks/glm_image/huggingface/inference.py \
- --model-path zai-org/GLM-Image \
- --mode i2i --dataset-path prompts.json --num-prompts 5
-"""
-
-import argparse
-import json
-import os
-import time
-from pathlib import Path
-
-import numpy as np
-import requests
-import torch
-from PIL import Image
-
-BENCHMARK_DIR = Path(__file__).resolve().parent.parent
-DEFAULT_PROMPT_JSON = BENCHMARK_DIR / "prompt" / "prompt.json"
-IMAGE_CACHE_DIR = BENCHMARK_DIR / "prompt" / "images"
-
-DATASET_REPO = "JaredforReal/glm-image-bench"
-DATASET_FILE = "prompt.json"
-
-
-def _ensure_prompt_json(dataset_path: str | None) -> str:
- """Return path to prompt.json, downloading from HuggingFace if needed."""
- if dataset_path:
- return dataset_path
- local = DEFAULT_PROMPT_JSON
- if local.exists():
- return str(local)
- print(f"Downloading {DATASET_FILE} from {DATASET_REPO} ...")
- try:
- from huggingface_hub import hf_hub_download
-
- downloaded = hf_hub_download(
- repo_id=DATASET_REPO,
- filename=DATASET_FILE,
- repo_type="dataset",
- )
- local.parent.mkdir(parents=True, exist_ok=True)
- import shutil
-
- shutil.copy2(downloaded, local)
- print(f"Saved to {local}")
- except ImportError:
- url = f"https://huggingface.co/datasets/{DATASET_REPO}/resolve/main/{DATASET_FILE}"
- import urllib.request
-
- local.parent.mkdir(parents=True, exist_ok=True)
- urllib.request.urlretrieve(url, local)
- print(f"Saved to {local}")
- return str(local)
-
-
-HEIGHT = 1024
-WIDTH = 1024
-SEED = 42
-NUM_INFERENCE_STEPS = 50
-GUIDANCE_SCALE = 1.5
-
-
-# ---------------------------------------------------------------------------
-# Dataset
-# ---------------------------------------------------------------------------
-
-
-def load_dataset(
- dataset_path: str | None,
- mode: str,
- num_prompts: int,
-) -> list[dict]:
- """Load prompts from prompt.json and prepare per-request data."""
- path = _ensure_prompt_json(dataset_path)
- with open(path, encoding="utf-8") as f:
- raw = json.load(f)
-
- items = []
- for entry in raw:
- if mode == "t2i":
- prompt_key = "t2i_prompt"
- else:
- prompt_key = "i2i_prompt"
-
- prompt_text = entry.get(prompt_key, "").strip()
- if not prompt_text:
- continue
-
- item = {"prompt": prompt_text}
- if mode == "i2i":
- item["image_url"] = entry.get("image_url", "")
- items.append(item)
-
- if num_prompts and len(items) > num_prompts:
- items = items[:num_prompts]
- return items
-
-
-def download_image(url: str, cache_dir: Path) -> str:
-    """Download an image to cache_dir and return the local path."""
-    cache_dir.mkdir(parents=True, exist_ok=True)
-    fname = url.rsplit("/", 1)[-1]
-    local_path = cache_dir / fname
-    if local_path.exists():
-        return str(local_path)
-    print(f"  Downloading {url} ...")
-    resp = requests.get(url, timeout=30)
-    resp.raise_for_status()
-    local_path.write_bytes(resp.content)
-    return str(local_path)
-
-
-# ---------------------------------------------------------------------------
-# Benchmark
-# ---------------------------------------------------------------------------
-
-
-def benchmark(args: argparse.Namespace) -> None:
- from diffusers.pipelines.glm_image import GlmImagePipeline
-
- print("=" * 60)
- print("GLM-Image HuggingFace Baseline Benchmark")
- print(f"Mode: {args.mode} | Model: {args.model_path}")
- print(f"Size: {args.height}x{args.width} | Steps: {args.num_inference_steps}")
- print("=" * 60)
-
- # Load dataset
- items = load_dataset(args.dataset_path, args.mode, args.num_prompts)
- if not items:
- print("No prompts loaded. Exiting.")
- return
- print(f"Loaded {len(items)} prompts for {args.mode} mode")
-
- # Download I2I source images
- if args.mode == "i2i":
- print("Preparing source images...")
- for item in items:
- url = item.get("image_url", "")
- if url:
- item["image_path"] = download_image(url, IMAGE_CACHE_DIR)
- else:
- item["image_path"] = None
-
- # Load pipeline
- print(f"\nLoading pipeline from {args.model_path} ...")
- t0 = time.perf_counter()
- pipe = GlmImagePipeline.from_pretrained(
- args.model_path,
- torch_dtype=torch.bfloat16,
- device_map="cuda",
- )
- init_time = time.perf_counter() - t0
- print(f"Pipeline loaded in {init_time:.2f}s")
-
- # Create output dir
- os.makedirs(args.output_dir, exist_ok=True)
-
- # Run benchmark
- generator = torch.Generator(device="cuda").manual_seed(args.seed)
- latencies = []
- success = 0
- failed = 0
-
- print(f"\nRunning {len(items)} requests sequentially...")
- print("-" * 60)
-
- for i, item in enumerate(items):
- prompt = item["prompt"]
- gen_kwargs: dict = {
- "prompt": prompt,
- "height": args.height,
- "width": args.width,
- "num_inference_steps": args.num_inference_steps,
- "guidance_scale": args.guidance_scale,
- "generator": generator,
- }
-
- if args.mode == "i2i":
- img_path = item.get("image_path")
- if img_path and os.path.exists(img_path):
- gen_kwargs["image"] = [Image.open(img_path).convert("RGB")]
- else:
- print(f" [{i + 1}] SKIP: no source image")
- failed += 1
- continue
-
- t_start = time.perf_counter()
- try:
- result = pipe(**gen_kwargs)
- image = result.images[0]
- elapsed = time.perf_counter() - t_start
- latencies.append(elapsed)
- success += 1
-
- out_path = os.path.join(args.output_dir, f"{i:04d}.png")
- image.save(out_path)
- print(f" [{i + 1}/{len(items)}] {elapsed:.3f}s -> {out_path}")
- except Exception as e:
- elapsed = time.perf_counter() - t_start
- failed += 1
- print(f" [{i + 1}/{len(items)}] FAILED ({elapsed:.3f}s): {e}")
-
- # Report
- total_gen_time = sum(latencies) if latencies else 0
- print("\n" + "=" * 60)
- print("HuggingFace Baseline Results")
- print("=" * 60)
- print(f"{'Mode:':<40} {args.mode}")
- print(f"{'Model:':<40} {args.model_path}")
- print(f"{'Image size:':<40} {args.height}x{args.width}")
- print(f"{'Num inference steps:':<40} {args.num_inference_steps}")
- print("-" * 50)
- print(f"{'Pipeline init time (s):':<40} {init_time:.2f}")
- print(f"{'Successful:':<40} {success}/{len(items)}")
- print(f"{'Failed:':<40} {failed}")
- print("-" * 50)
- if latencies:
- arr = np.array(latencies)
- print(f"{'Total generation time (s):':<40} {total_gen_time:.2f}")
- print(f"{'Throughput (img/s):':<40} {success / total_gen_time:.4f}")
- print(f"{'Latency Mean (s):':<40} {arr.mean():.4f}")
- print(f"{'Latency Median (s):':<40} {np.median(arr):.4f}")
- print(f"{'Latency P95 (s):':<40} {np.percentile(arr, 95):.4f}")
- print(f"{'Latency P99 (s):':<40} {np.percentile(arr, 99):.4f}")
-
- print(f"\n{'Output dir:':<40} {args.output_dir}")
- print("=" * 60)
-
- # Save metrics JSON
- metrics = {
- "backend": "huggingface",
- "mode": args.mode,
- "model": args.model_path,
- "height": args.height,
- "width": args.width,
- "num_inference_steps": args.num_inference_steps,
- "init_time_s": init_time,
- "completed_requests": success,
- "failed_requests": failed,
- "total_gen_time_s": total_gen_time,
- "throughput_qps": success / total_gen_time if total_gen_time > 0 else 0,
- "latency_mean": float(np.mean(latencies)) if latencies else 0,
- "latency_median": float(np.median(latencies)) if latencies else 0,
- "latency_p95": float(np.percentile(latencies, 95)) if latencies else 0,
- "latency_p99": float(np.percentile(latencies, 99)) if latencies else 0,
- }
- if args.output_file:
- with open(args.output_file, "w") as f:
- json.dump(metrics, f, indent=2)
- print(f"Metrics saved to {args.output_file}")
-
-
-def main() -> None:
- parser = argparse.ArgumentParser(description="GLM-Image HuggingFace baseline benchmark")
- parser.add_argument("--model-path", type=str, default="zai-org/GLM-Image")
- parser.add_argument("--mode", type=str, default="t2i", choices=["t2i", "i2i"])
- parser.add_argument("--dataset-path", type=str, default=None, help="Path to prompt.json")
- parser.add_argument("--num-prompts", type=int, default=10)
- parser.add_argument("--height", type=int, default=HEIGHT)
- parser.add_argument("--width", type=int, default=WIDTH)
- parser.add_argument("--num-inference-steps", type=int, default=NUM_INFERENCE_STEPS)
- parser.add_argument("--guidance-scale", type=float, default=GUIDANCE_SCALE)
- parser.add_argument("--seed", type=int, default=SEED)
- parser.add_argument("--output-dir", type=str, default="benchmarks/glm_image/huggingface/outputs")
- parser.add_argument("--output-file", type=str, default=None, help="JSON file for metrics")
- args = parser.parse_args()
- benchmark(args)
-
-
-if __name__ == "__main__":
- main()
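
Review note on the deleted HuggingFace baseline above: `download_image` names each cached file after the last path segment of the URL, so two URLs that end in the same filename would share one cache entry, and any query string becomes part of the filename. One possible alternative, sketched here under the assumption that a stable but opaque cache name is acceptable, mixes a short hash of the full URL into the name. `cache_name_for_url` is a hypothetical helper and does not appear in the removed file.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cache_name_for_url(url: str) -> str:
    """Build a cache filename from the URL path stem plus a hash of the full URL."""
    # urlsplit drops the query string and fragment from the path component.
    path = PurePosixPath(urlsplit(url).path)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    stem = path.stem or "image"  # fall back when the URL has no filename
    return f"{stem}-{digest}{path.suffix}"


name_a = cache_name_for_url("https://a.example/x/cat.png")
name_b = cache_name_for_url("https://b.example/y/cat.png?v=2")
name_c = cache_name_for_url("https://a.example")
```

The readable stem keeps cached files recognizable on disk, while the hash suffix separates same-named images from different hosts or query variants.
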
diff --git a/benchmarks/glm_image/vllm-omni/inference.py b/benchmarks/glm_image/vllm-omni/inference.py
deleted file mode 100644
index 5729da07174..00000000000
--- a/benchmarks/glm_image/vllm-omni/inference.py
+++ /dev/null
@@ -1,505 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-vLLM-Omni offline benchmark for GLM-Image.
-
-Supports T2I and I2I modes with the prompt.json dataset.
-Downloads source images for I2I from image_url on first run and caches locally.
-
-Usage:
- # T2I mode
- python benchmarks/glm_image/vllm-omni/inference.py \
- --model-path zai-org/GLM-Image \
- --mode t2i --num-prompts 10
-
- # I2I mode (downloads source images)
- python benchmarks/glm_image/vllm-omni/inference.py \
- --model-path zai-org/GLM-Image \
- --mode i2i --num-prompts 10
-"""
-
-import argparse
-import json
-import math
-import os
-import time
-from pathlib import Path
-
-import numpy as np
-import requests
-from PIL import Image
-from vllm import SamplingParams
-
-from vllm_omni.entrypoints.omni import Omni
-from vllm_omni.inputs.data import OmniDiffusionSamplingParams
-
-BENCHMARK_DIR = Path(__file__).resolve().parent.parent
-DEFAULT_PROMPT_JSON = BENCHMARK_DIR / "prompt" / "prompt.json"
-IMAGE_CACHE_DIR = BENCHMARK_DIR / "prompt" / "images"
-DEFAULT_DEPLOY_CONFIG = "vllm_omni/deploy/glm_image.yaml"
-
-DATASET_REPO = "JaredforReal/glm-image-bench"
-DATASET_FILE = "prompt.json"
-
-
-def _ensure_prompt_json(dataset_path: str | None) -> str:
- """Return path to prompt.json, downloading from HuggingFace if needed."""
- if dataset_path:
- return dataset_path
- local = DEFAULT_PROMPT_JSON
- if local.exists():
- return str(local)
- print(f"Downloading {DATASET_FILE} from {DATASET_REPO} ...")
- try:
- from huggingface_hub import hf_hub_download
-
- downloaded = hf_hub_download(
- repo_id=DATASET_REPO,
- filename=DATASET_FILE,
- repo_type="dataset",
- )
- local.parent.mkdir(parents=True, exist_ok=True)
- import shutil
-
- shutil.copy2(downloaded, local)
- print(f"Saved to {local}")
- except ImportError:
- url = f"https://huggingface.co/datasets/{DATASET_REPO}/resolve/main/{DATASET_FILE}"
- import urllib.request
-
- local.parent.mkdir(parents=True, exist_ok=True)
- urllib.request.urlretrieve(url, local)
- print(f"Saved to {local}")
- return str(local)
-
-
-SEED = 42
-HEIGHT = 1024
-WIDTH = 1024
-NUM_INFERENCE_STEPS = 50
-GUIDANCE_SCALE = 1.5
-
-GLM_IMAGE_EOS_TOKEN_ID = 16385
-GLM_IMAGE_VISION_VOCAB_SIZE = 16512
-
-
-# ---------------------------------------------------------------------------
-# Dataset
-# ---------------------------------------------------------------------------
-
-
-def load_dataset(
- dataset_path: str | None,
- mode: str,
- num_prompts: int,
-) -> list[dict]:
- path = _ensure_prompt_json(dataset_path)
- with open(path, encoding="utf-8") as f:
- raw = json.load(f)
-
- items = []
- for entry in raw:
- prompt_key = "t2i_prompt" if mode == "t2i" else "i2i_prompt"
- prompt_text = entry.get(prompt_key, "").strip()
- if not prompt_text:
- continue
-
- item = {"prompt": prompt_text}
- if mode == "i2i":
- item["image_url"] = entry.get("image_url", "")
- items.append(item)
-
- if num_prompts and len(items) > num_prompts:
- items = items[:num_prompts]
- return items
-
-
-def download_image(url: str, cache_dir: Path) -> str:
- cache_dir.mkdir(parents=True, exist_ok=True)
- fname = url.rsplit("/", 1)[-1]
- local_path = cache_dir / fname
- if local_path.exists():
- return str(local_path)
- print(f" Downloading {url} ...")
- resp = requests.get(url, timeout=30)
- resp.raise_for_status()
- local_path.write_bytes(resp.content)
- return str(local_path)
-
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-
-def compute_max_tokens(height: int, width: int, is_i2i: bool = False) -> int:
-    factor = 32
-    token_h = height // factor
-    token_w = width // factor
-    large_tokens = token_h * token_w
-
-    # Small preview tokens (half resolution in each dimension)
-
-    ratio = token_h / token_w if token_w > 0 else 1.0
-    small_token_h = max(1, int(math.sqrt(ratio) * (factor // 2)))
-    small_token_w = max(1, int(math.sqrt(1 / ratio) * (factor // 2)))
-    small_tokens = small_token_h * small_token_w
-
-    # Mode-dependent totals:
-    # - t2i: small + large + EOS
-    # - i2i: large + EOS
-    if is_i2i:
-        return large_tokens + 1
-    return small_tokens + large_tokens + 1
-
-
-def build_prompt_t2i(prompt: str, height: int, width: int, **gen_kw) -> dict:
- return {
- "prompt": prompt,
- "height": height,
- "width": width,
- "mm_processor_kwargs": {"target_h": height, "target_w": width},
- **gen_kw,
- }
-
-
-def build_prompt_i2i(prompt: str, image_path: str, height: int, width: int, **gen_kw) -> dict:
- return {
- "prompt": prompt,
- "height": height,
- "width": width,
- "mm_processor_kwargs": {"target_h": height, "target_w": width},
- "multi_modal_data": {"image": Image.open(image_path).convert("RGB")},
- **gen_kw,
- }
-
-
-def resolve_deploy_config(args: argparse.Namespace) -> str:
-    if args.deploy_config:
-        return args.deploy_config
-    if os.path.exists(DEFAULT_DEPLOY_CONFIG):
-        return DEFAULT_DEPLOY_CONFIG
-    fallback = Path(__file__).resolve().parents[3] / DEFAULT_DEPLOY_CONFIG
-    if fallback.exists():
-        return str(fallback)
-    raise FileNotFoundError("Deploy config not found. Specify --deploy-config.")
-
-
-# ---------------------------------------------------------------------------
-# Benchmark
-# ---------------------------------------------------------------------------
-
-
-def benchmark(args: argparse.Namespace) -> None:
- is_i2i = args.mode == "i2i"
-
- print("=" * 60)
- print("GLM-Image vLLM-Omni Benchmark")
- print(f"Mode: {args.mode} | Model: {args.model_path}")
- print(f"Size: {args.height}x{args.width} | Steps: {args.num_inference_steps}")
- print("=" * 60)
-
- # Load dataset
- items = load_dataset(args.dataset_path, args.mode, args.num_prompts)
- if not items:
- print("No prompts loaded. Exiting.")
- return
- print(f"Loaded {len(items)} prompts for {args.mode} mode")
-
- # Download I2I source images
- if is_i2i:
- print("Preparing source images...")
- for item in items:
- url = item.get("image_url", "")
- if url:
- item["image_path"] = download_image(url, IMAGE_CACHE_DIR)
- else:
- item["image_path"] = None
-
- # Init Omni
- deploy_config = resolve_deploy_config(args)
- print(f"\nInitializing vLLM-Omni (deploy config: {deploy_config}) ...")
- t0 = time.perf_counter()
-
- omni = Omni(
- model=args.model_path,
- deploy_config=deploy_config,
- log_stats=args.log_stats,
- stage_init_timeout=args.stage_init_timeout,
- enable_diffusion_pipeline_profiler=args.enable_diffusion_pipeline_profiler,
- enable_ar_profiler=args.enable_ar_profiler,
- )
-
- init_time = time.perf_counter() - t0
- print(f"Initialized in {init_time:.2f}s")
-
- # Sampling params
- max_tokens = compute_max_tokens(args.height, args.width, is_i2i=is_i2i)
- ar_params = SamplingParams(
- temperature=0.9,
- top_p=0.75,
- top_k=GLM_IMAGE_VISION_VOCAB_SIZE,
- max_tokens=max_tokens,
- stop_token_ids=[GLM_IMAGE_EOS_TOKEN_ID],
- seed=args.seed,
- detokenize=False,
- extra_args={"target_h": args.height, "target_w": args.width},
- )
- diff_params = OmniDiffusionSamplingParams(
- num_inference_steps=args.num_inference_steps,
- guidance_scale=args.guidance_scale,
- height=args.height,
- width=args.width,
- seed=args.seed,
- )
- sampling_params_list = [ar_params, diff_params]
-
- # Build all prompts
- gen_kw = {
- "seed": args.seed,
- "num_inference_steps": args.num_inference_steps,
- "guidance_scale": args.guidance_scale,
- }
- all_prompts = []
- for item in items:
- if is_i2i:
- img_path = item.get("image_path")
- if not img_path or not os.path.exists(img_path):
- continue
- all_prompts.append(build_prompt_i2i(item["prompt"], img_path, args.height, args.width, **gen_kw))
- else:
- all_prompts.append(build_prompt_t2i(item["prompt"], args.height, args.width, **gen_kw))
-
- valid = len(all_prompts)
- print(f"Valid prompts: {valid}")
-
- # Create output dir
- os.makedirs(args.output_dir, exist_ok=True)
-
- # Warmup: run 1 request to prime caches, CUDA graphs, etc.
- if all_prompts:
- print("Running warmup request...")
- try:
- warmup_prompt = [all_prompts[0]]
- omni.generate(warmup_prompt, sampling_params_list, py_generator=False)
- print("Warmup done.\n")
- except Exception as e:
- print(f"Warmup failed (continuing): {e}")
-
- # Run
- print(f"\nRunning {valid} requests...")
- print("-" * 60)
-
- latencies = []
- all_stage_durations: list[dict[str, float]] = []
- success = 0
- failed = 0
- wall_start = time.perf_counter()
-
- try:
- output_idx = 0
- for stage_outputs in omni.generate(all_prompts, sampling_params_list, py_generator=True):
- if stage_outputs.final_output_type == "image":
- request_output = stage_outputs.request_output
- request_id = getattr(request_output, "request_id", "")
-
- images = getattr(request_output, "images", [])
- if not images and hasattr(request_output, "multimodal_output"):
- mm = request_output.multimodal_output
- if isinstance(mm, dict):
- images = mm.get("images", [])
-
- elapsed = time.perf_counter() - wall_start
- if images:
- for img in images:
- if isinstance(img, Image.Image):
- out_path = os.path.join(args.output_dir, f"{output_idx:04d}.png")
- img.save(out_path)
- success += 1
- latencies.append(elapsed)
- stage_durations = getattr(stage_outputs, "stage_durations", {})
- if stage_durations:
- all_stage_durations.append(stage_durations)
- # Show wall-clock elapsed and pipeline breakdown if available
- preprocess_str = ""
- if "preprocess_ms" in stage_durations:
- preprocess_str = f" preprocess={stage_durations['preprocess_ms'] / 1000.0:.2f}s"
- print(f" [{success}/{valid}] id={request_id[:8]} {elapsed:.2f}s{preprocess_str}")
- output_idx += 1
- else:
- failed += 1
- except Exception as e:
- print(f"Error: {e}")
- failed = valid - success
-
- total_gen_time = time.perf_counter() - wall_start
-
- # Diff stage_0_gen_ms with previous request to remove accumulated wait time.
- # stage_0_gen_ms is measured from submit_ts (same for all requests submitted
- # at once), so it accumulates queue/scheduling overhead across requests.
- # Other stages and pipeline timings are per-request already.
- _TIMING_ORDER = [
- "preprocess_ms",
- "stage_0_gen_ms",
- "ar2diffusion_ms",
- "stage_1_gen_ms",
- "queue_wait_ms",
- ]
-
- per_request_actual: list[dict[str, float]] = []
- prev_stage_0_ms = 0.0
- for sd in all_stage_durations:
- actual = dict(sd)
- s0 = sd.get("stage_0_gen_ms", 0.0)
- actual["stage_0_gen_ms"] = s0 - prev_stage_0_ms
- prev_stage_0_ms = s0
- per_request_actual.append(actual)
-
- per_request_e2e_ms: list[float] = []
- for actual in per_request_actual:
- e2e_ms = sum(v for k, v in actual.items() if k in _TIMING_ORDER)
- if e2e_ms > 0:
- per_request_e2e_ms.append(e2e_ms)
-
- # Report
- print("\n" + "=" * 60)
-    print("vLLM-Omni Benchmark Results")
-    print("=" * 60)
-    print(f"{'Mode:':<40} {args.mode}")
-    print(f"{'Model:':<40} {args.model_path}")
-    print(f"{'Image size:':<40} {args.height}x{args.width}")
-    print(f"{'Num inference steps:':<40} {args.num_inference_steps}")
-    print("-" * 50)
-    print(f"{'Init time (s):':<40} {init_time:.2f}")
-    print(f"{'Successful:':<40} {success}/{valid}")
-    print(f"{'Failed:':<40} {failed}")
-    print("-" * 50)
-
-    if per_request_e2e_ms:
-        per_request_s = np.array(per_request_e2e_ms) / 1000.0
-        print(f"{'Total generation time (s):':<40} {total_gen_time:.2f}")
-        print(f"{'Throughput (img/s):':<40} {success / total_gen_time:.4f}")
-        print(f"{'Latency Mean (s):':<40} {per_request_s.mean():.4f}")
-        print(f"{'Latency Median (s):':<40} {np.median(per_request_s):.4f}")
-        print(f"{'Latency P95 (s):':<40} {np.percentile(per_request_s, 95):.4f}")
-        print(f"{'Latency P99 (s):':<40} {np.percentile(per_request_s, 99):.4f}")
-        print(f"{'Latency Min (s):':<40} {per_request_s.min():.4f}")
-        print(f"{'Latency Max (s):':<40} {per_request_s.max():.4f}")
-    elif latencies:
-        per_request = np.diff([0.0] + list(latencies))
-        print(f"{'Total generation time (s):':<40} {total_gen_time:.2f}")
-        print(f"{'Throughput (img/s):':<40} {success / total_gen_time:.4f}")
-        print(f"{'Latency Mean (s) [wall-clock]:':<40} {per_request.mean():.4f}")
-        print(f"{'Latency Median (s) [wall-clock]:':<40} {np.median(per_request):.4f}")
-        print(f"{'Latency P95 (s) [wall-clock]:':<40} {np.percentile(per_request, 95):.4f}")
-        print(f"{'Latency P99 (s) [wall-clock]:':<40} {np.percentile(per_request, 99):.4f}")
-        print(f"{'Latency Min (s) [wall-clock]:':<40} {per_request.min():.4f}")
-        print(f"{'Latency Max (s) [wall-clock]:':<40} {per_request.max():.4f}")
-
-    if per_request_actual:
-        print("-" * 50)
-        print("Pipeline Timings Mean:")
-        for key in _TIMING_ORDER:
-            vals = [d.get(key, 0.0) for d in per_request_actual]
-            if any(v != 0 for v in vals):
-                unit = "ms" if key.endswith("_ms") else "s"
-                print(f" {key + ':':<38} {np.mean(vals):.4f} ({unit})")
-        # Show any extra keys not in the ordered list
-        ordered_set = set(_TIMING_ORDER)
-        extra_keys = sorted(k for k in per_request_actual[0].keys() if k not in ordered_set)
-        for key in extra_keys:
-            vals = [d.get(key, 0.0) for d in per_request_actual]
-            if any(v != 0 for v in vals):
-                unit = "ms" if key.endswith("_ms") else "s"
-                print(f" {key + ':':<38} {np.mean(vals):.4f} ({unit})")
-
-    print(f"\n{'Output dir:':<40} {args.output_dir}")
-    print("=" * 60)
-
-    # Metrics JSON
-    metrics = {
-        "backend": "vllm-omni",
-        "mode": args.mode,
-        "model": args.model_path,
-        "height": args.height,
-        "width": args.width,
-        "num_inference_steps": args.num_inference_steps,
-        "init_time_s": init_time,
-        "completed_requests": success,
-        "failed_requests": failed,
-        "total_gen_time_s": total_gen_time,
-        "throughput_qps": success / total_gen_time if total_gen_time > 0 else 0,
-    }
-    if per_request_e2e_ms:
-        per_request_s = np.array(per_request_e2e_ms) / 1000.0
-        metrics["latency_mean"] = float(per_request_s.mean())
-        metrics["latency_median"] = float(np.median(per_request_s))
-        metrics["latency_p95"] = float(np.percentile(per_request_s, 95))
-        metrics["latency_p99"] = float(np.percentile(per_request_s, 99))
-    elif latencies:
-        per_request = np.diff([0.0] + list(latencies))
-        metrics["latency_mean"] = float(per_request.mean())
-        metrics["latency_median"] = float(np.median(per_request))
-        metrics["latency_p95"] = float(np.percentile(per_request, 95))
-        metrics["latency_p99"] = float(np.percentile(per_request, 99))
-    else:
-        metrics["latency_mean"] = 0
-        metrics["latency_median"] = 0
-        metrics["latency_p95"] = 0
-        metrics["latency_p99"] = 0
-    if per_request_actual:
-        all_keys = list(_TIMING_ORDER) + sorted(k for k in per_request_actual[0].keys() if k not in set(_TIMING_ORDER))
-        stage_metrics = {}
-        for key in all_keys:
-            vals = [d.get(key, 0.0) for d in per_request_actual]
-            stage_metrics[key] = {
-                "mean": float(np.mean(vals)),
-                "median": float(np.median(vals)),
-                "p95": float(np.percentile(vals, 95)),
-            }
-        metrics["stage_durations"] = stage_metrics
-    if args.output_file:
-        with open(args.output_file, "w") as f:
-            json.dump(metrics, f, indent=2)
-        print(f"Metrics saved to {args.output_file}")
-
-    omni.close()
-    print("Done!")
-
-
-def main() -> None:
-    parser = argparse.ArgumentParser(description="GLM-Image vLLM-Omni offline benchmark")
-    parser.add_argument("--model-path", type=str, default="zai-org/GLM-Image")
-    parser.add_argument("--deploy-config", type=str, default=None, help="Deploy config YAML")
-    parser.add_argument("--mode", type=str, default="t2i", choices=["t2i", "i2i"])
-    parser.add_argument("--dataset-path", type=str, default=None, help="Path to prompt.json")
-    parser.add_argument("--num-prompts", type=int, default=10)
-    parser.add_argument("--height", type=int, default=HEIGHT)
-    parser.add_argument("--width", type=int, default=WIDTH)
-    parser.add_argument("--num-inference-steps", type=int, default=NUM_INFERENCE_STEPS)
-    parser.add_argument("--guidance-scale", type=float, default=GUIDANCE_SCALE)
-    parser.add_argument("--seed", type=int, default=SEED)
-    parser.add_argument("--output-dir", type=str, default="benchmarks/glm_image/vllm-omni/outputs")
-    parser.add_argument("--output-file", type=str, default=None, help="JSON file for metrics")
-    parser.add_argument("--stage-init-timeout", type=int, default=600)
-    parser.add_argument(
-        "--enable-diffusion-pipeline-profiler",
-        action="store_true",
-        help="Enable diffusion pipeline profiler for stage-level timing",
-    )
-    parser.add_argument(
-        "--enable-ar-profiler",
-        action="store_true",
-        help="Enable AR stage profiler to include AR timing in stage_durations",
-    )
-    parser.add_argument(
-        "--log-stats",
-        action="store_true",
-        help="Enable detailed per-request pipeline stats logging",
-    )
-    args = parser.parse_args()
-    benchmark(args)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/benchmarks/tts/README.md b/benchmarks/tts/README.md
deleted file mode 100644
index 9e2fd35b1a5..00000000000
--- a/benchmarks/tts/README.md
+++ /dev/null
@@ -1,227 +0,0 @@
-# TTS Universal Benchmark
-
-A model-agnostic serving benchmark for TTS models in vllm-omni. One CLI
-(`bench_tts.py`) + one YAML registry (`model_configs.yaml`) drive perf and
-quality runs for every registered checkpoint: **Qwen3-TTS** (Base / CustomVoice)
-and **VoxCPM2** today, more to come.
-
-The same three task types — `voice_clone`, `default_voice`, `voice_design` —
-are wired into both the manual CLI and the DFX nightly CI matrix
-(`tests/dfx/perf/tests/test_tts.json`).
-
-## Quick start
-
-### 1. Start the server
-
-```bash
-vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --port 8000
-```
-
-The server auto-loads its Deploy YAML from `vllm_omni/deploy/qwen3_tts.yaml`
-(Pipeline + Deploy schema introduced in #2383). No `--stage-configs-path` or
-`--deploy-config` flag is needed for any registered model.
-
-### 2. Run the benchmark (`vllm bench serve --omni`)
-
-The primary, directly-controllable path. Copy-paste one of these and tweak
-any bench flag (sampling params, endpoint, extra body, warmups, etc.):
-
-#### voice_clone (Qwen3-TTS-Base, seed-tts dataset)
-
-```bash
-vllm bench serve --omni \
- --host 127.0.0.1 --port 8000 \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
- --backend openai-audio-speech \
- --endpoint /v1/audio/speech \
- --dataset-name seed-tts \
- --dataset-path /path/to/seed-tts-eval \
- --seed-tts-locale en \
- --num-prompts 20 --num-warmups 2 \
- --extra-body '{"task_type":"Base"}' \
- --max-concurrency 1 --request-rate inf \
- --percentile-metrics ttft,e2el,audio_rtf,audio_ttfp,audio_duration \
- --save-result --result-dir ./results
-```
-
-#### default_voice (Qwen3-TTS-CustomVoice, bundled seed_tts_smoke)
-
-```bash
-vllm bench serve --omni \
- --host 127.0.0.1 --port 8000 \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
- --backend openai-audio-speech \
- --endpoint /v1/audio/speech \
- --dataset-name seed-tts-text \
- --dataset-path benchmarks/build_dataset/seed_tts_smoke \
- --seed-tts-locale en \
- --num-prompts 20 --num-warmups 2 \
- --extra-body '{"voice":"Vivian","language":"English","task_type":"CustomVoice"}' \
- --max-concurrency 1 --request-rate inf \
- --percentile-metrics ttft,e2el,audio_rtf,audio_ttfp,audio_duration \
- --save-result --result-dir ./results
-```
-
-#### voice_design (Qwen3-TTS-CustomVoice, bundled seed_tts_design)
-
-```bash
-vllm bench serve --omni \
- --host 127.0.0.1 --port 8000 \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
- --backend openai-audio-speech \
- --endpoint /v1/audio/speech \
- --dataset-name seed-tts-design \
- --dataset-path benchmarks/build_dataset/seed_tts_design \
- --seed-tts-locale en \
- --num-prompts 20 --num-warmups 2 \
- --extra-body '{"task_type":"VoiceDesign","language":"English"}' \
- --max-concurrency 1 --request-rate inf \
- --percentile-metrics ttft,e2el,audio_rtf,audio_ttfp,audio_duration \
- --save-result --result-dir ./results
-```
-
-#### Add WER / SIM / UTMOS to any of the above
-
-Append `--seed-tts-wer-eval` (and optionally `SEED_TTS_EVAL_DEVICE=cuda:0`
-in the env, per PR #2558). This triggers the seed-tts-eval protocol:
-Whisper-large-v3 ASR → WER, WavLM embeddings → SIM, balacoon/utmos → UTMOS.
-
-### 3. Convenience wrapper (`bench_tts.py`)
-
-If you're running the **canonical** configuration for a registered model,
-`bench_tts.py` loads the right defaults from `model_configs.yaml` and
-emits the exact `vllm bench serve --omni` command above — useful for
-concurrency sweeps and multi-task runs:
-
-```bash
-# Smallest smoke — 5 prompts, concurrency=1
-python benchmarks/tts/bench_tts.py \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
- --task voice_clone \
- --dataset-path /path/to/seed-tts-eval \
- --concurrency 1 --num-prompts 5 \
- --output-dir ./results
-
-# Full concurrency sweep
-python benchmarks/tts/bench_tts.py \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
- --task voice_clone \
- --dataset-path /path/to/seed-tts-eval \
- --concurrency 1 2 4 8 16 32 \
- --num-prompts 20 \
- --output-dir ./results
-
-# With WER / SIM / UTMOS quality eval (adds ASR + embedding compute)
-python benchmarks/tts/bench_tts.py \
- --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
- --task voice_clone \
- --dataset-path /path/to/seed-tts-eval \
- --wer-eval \
- --concurrency 4 --num-prompts 200 \
- --output-dir ./results
-```
-
-### 4. Plot a sweep
-
-```bash
-python benchmarks/tts/plot_results.py \
- --results ./results/*.json \
- --output ./results/curve.png
-```
-
-Outputs TTFP / RTF / throughput curves (and a markdown table) for every
-`(task, concurrency)` combination in the result set.
-
-## Task types
-
-| Task | Dataset | Request body | Checkpoints that support it |
-|-----------------|-------------------|-----------------------------------------------------|------------------------------------------|
-| `voice_clone` | `seed-tts` | `ref_audio` + `ref_text` + `task_type=Base` | `Qwen3-TTS-*-Base`, `VoxCPM2` |
-| `default_voice` | `seed-tts-text` | `voice=Vivian` + `task_type=CustomVoice` | `Qwen3-TTS-*-CustomVoice` |
-| `voice_design` | `seed-tts-design` | `instructions=` + `task_type=VoiceDesign` | `Qwen3-TTS-*-CustomVoice` |
-
-**`-CustomVoice` checkpoints do NOT ship `speaker_encoder` weights**, so
-voice_clone requests raise `ValueError` at model runtime. Use `-Base` for
-voice_clone.
-
-## Adding a new TTS model
-
-Drop an entry into `model_configs.yaml` — no Python changes required:
-
-```yaml
-models:
-  /:
-    supported_tasks: [voice_clone]  # or default_voice / voice_design
-    backend: openai-audio-speech  # vllm bench serve backend
-    endpoint: /v1/audio/speech  # OpenAI-compatible endpoint
-    task_extra_body:  # merged into every request's body
-      voice_clone:
-        task_type: Base
-```
-
-Then add the model's Deploy YAML under `vllm_omni/deploy/.yaml`
-(Pipeline + Deploy schema) and it's immediately benchable.
-
-## Datasets
-
-| Dataset | Bundled? | Format | Source |
-|--------------------|----------|-------------------|----------------------------------------------------------------|
-| `seed-tts-design` | ✅ | 5-field meta.lst | `benchmarks/build_dataset/seed_tts_design/en/meta.lst` (20 prompts) |
-| `seed_tts_smoke` | ✅ | 4-field meta.lst | `benchmarks/build_dataset/seed_tts_smoke/en/meta.lst` (20 text-only) |
-| `seed-tts` | ❌ | 4-field meta.lst + WAVs | Google-Drive: [BytedanceSpeech/seed-tts-eval][seedtts] (~1.2 GB) |
-| `seed-tts-text` | ❌ | 4-field meta.lst | Same archive as `seed-tts` (wav column unused) |
-
-[seedtts]: https://github.com/BytedanceSpeech/seed-tts-eval
-
-For manual voice_clone / default_voice runs against the full corpus, follow
-`benchmarks/build_dataset/download_process_data_seedtts.md` and point
-`--dataset-path` at the extracted `seedtts_testset` directory.
-
-## DFX nightly CI
-
-`tests/dfx/perf/tests/test_tts.json` wires three perf regimes plus quality:
-
-| eval_phase | concurrency | purpose | Baseline metrics |
-|---------------|-------------|---------------------------------------------------------|-----------------------------------------|
-| `latency` | 1 | Single-request TTFP / RTF SLO | `median_audio_ttfp_ms`, `median_audio_rtf` |
-| `throughput` | 8 | Codec-batching cliff sentinel (PDF #272 concurrency≥8) | `median_audio_ttfp_ms`, `median_audio_rtf` |
-| `quality` | 4 | WER / SIM / UTMOS regression (disabled in CI by default)| `mean_audio_rtf` |
-
-Why `median_*` for latency/throughput and `mean_*` for quality: latency
-distributions have cold-start tails that drag the mean; quality aggregates
-over 200 prompts so single-request outliers don't matter.
-
-Quality entries are `enabled: false` in CI because seed-tts-eval is not
-staged in the Buildkite container (matches the precedent in
-PR #2558 — quality runs are manual / release-validation, not nightly).
-
-## Concurrency cliff regression sentinel
-
-Observed on H20-3e, Qwen3-TTS-1.7B (measured pre-merge on this branch):
-
-| Task | Model | c=1 | c=4 | **c=8** | c=16 | c=32 |
-|---------------|---------------|--------|--------|------------|--------|--------|
-| voice_clone | 1.7B-Base | RTF 0.15 / TTFP 165ms | 0.28 / 412ms | **0.49 / 1701ms** | 0.72 / 3355ms | 0.77 / 3772ms |
-| voice_design | 1.7B-CustomVoice | RTF 0.08 / TTFP 53ms | 0.11 / 154ms | **0.21 / 872ms** | 0.33 / 1801ms | 0.38 / 1989ms |
-
-Both models show a **4–6× TTFP jump from c=4 to c=8** while audio throughput
-saturates around c=4–8 — the codec-bs=1 bottleneck documented in
-vllm-project/vllm-omni#272. The `throughput` CI regime at c=8 is the
-sentinel for regressions in this area.
-
-## File layout
-
-```
-benchmarks/tts/
-├── README.md (this file)
-├── bench_tts.py CLI — serve-mode benchmark driver
-├── bench_voxcpm_offline.py CLI — offline VoxCPM benchmark (sync + streaming)
-├── plot_results.py Generate per-task / per-concurrency curves
-└── model_configs.yaml Model registry (supported tasks + extra body)
-```
-
-## Related
-
-- Upstream seed-tts-eval integration: vllm-project/vllm-omni#2558
-- Pipeline + Deploy schema: vllm-project/vllm-omni#2383
-- Concurrency cliff RFC: vllm-project/vllm-omni#272
diff --git a/benchmarks/tts/bench_tts.py b/benchmarks/tts/bench_tts.py
deleted file mode 100644
index ba82b1c9b7b..00000000000
--- a/benchmarks/tts/bench_tts.py
+++ /dev/null
@@ -1,308 +0,0 @@
-#!/usr/bin/env python3
-"""Universal TTS benchmark CLI for vllm-omni.
-
-Runs ``vllm bench serve --omni`` with model-aware defaults loaded from
-``model_configs.yaml``. Supports Qwen3-TTS, VoxCPM2, and any future TTS
-model registered in the config file -- no code changes needed to add models.
-
-Usage::
-
-    python benchmarks/tts/bench_tts.py \\
-        --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \\
-        --task voice_clone \\
-        --locale en \\
-        --concurrency 1 4 \\
-        --num-prompts 20 \\
-        --dataset-path /path/to/seed-tts-eval \\
-        --host localhost --port 8000
-
-See ``--help`` for full option list.
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import math
-import os
-import subprocess
-import sys
-from datetime import datetime
-from pathlib import Path
-from typing import Any
-
-import yaml
-
-
-def _vllm_omni_bin() -> str:
-    """Return the vllm-omni (or vllm) binary co-located with the current Python."""
-    bin_dir = Path(sys.executable).parent
-    for candidate in ("vllm-omni", "vllm"):
-        p = bin_dir / candidate
-        if p.is_file():
-            return str(p)
-    return "vllm-omni"  # fall back and let the shell resolve it
-
-
-_REPO_ROOT = Path(__file__).resolve().parent.parent.parent
-_SCRIPT_DIR = Path(__file__).resolve().parent
-_DEFAULT_MODEL_CONFIGS = _SCRIPT_DIR / "model_configs.yaml"
-
-# Maps task name to the dataset_name used with vllm bench serve
-_TASK_TO_DATASET: dict[str, str] = {
-    "voice_clone": "seed-tts",
-    "default_voice": "seed-tts-text",
-    "voice_design": "seed-tts-design",
-}
-
-# Default design dataset path (bundled with the repo)
-_DEFAULT_DESIGN_DATASET_PATH = str(_REPO_ROOT / "benchmarks" / "build_dataset" / "seed_tts_design")
-
-
-def load_model_configs(path: Path) -> dict[str, Any]:
-    """Load model registry from YAML file."""
-    with open(path, encoding="utf-8") as f:
-        data = yaml.safe_load(f)
-    return data.get("models", {})
-
-
-def build_bench_args(
-    *,
-    host: str,
-    port: int,
-    model: str,
-    task: str,
-    model_cfg: dict[str, Any],
-    locale: str,
-    num_prompts: int,
-    concurrency: int | None,
-    dataset_path: str | None,
-    wer_eval: bool,
-    output_dir: str | None,
-    result_filename: str | None,
-    extra_cli_args: list[str],
-) -> list[str]:
-    """Build the ``vllm bench serve --omni`` command for one (task, concurrency) run."""
-    dataset_name = _TASK_TO_DATASET[task]
-    backend: str = model_cfg["backend"]
-    endpoint: str = model_cfg["endpoint"]
-    task_extra_body: dict[str, Any] = (model_cfg.get("task_extra_body") or {}).get(task) or {}
-
-    # Resolve dataset path
-    if dataset_path:
-        resolved_dataset_path = dataset_path
-    elif task == "voice_design":
-        resolved_dataset_path = _DEFAULT_DESIGN_DATASET_PATH
-    else:
-        resolved_dataset_path = None
-
-    cmd = [
-        _vllm_omni_bin(),
-        "bench",
-        "serve",
-        "--omni",
-        "--host",
-        host,
-        "--port",
-        str(port),
-        "--model",
-        model,
-        "--backend",
-        backend,
-        "--endpoint",
-        endpoint,
-        "--dataset-name",
-        dataset_name,
-        "--num-prompts",
-        str(num_prompts),
-        "--num-warmups",
-        "2",
-        "--percentile-metrics",
-        "ttft,e2el,audio_rtf,audio_ttfp,audio_duration",
-    ]
-
-    if resolved_dataset_path:
-        cmd += ["--dataset-path", resolved_dataset_path]
-
-    if locale:
-        cmd += ["--seed-tts-locale", locale]
-
-    if task_extra_body:
-        cmd += ["--extra-body", json.dumps(task_extra_body, separators=(",", ":"))]
-
-    if concurrency is not None:
-        cmd += ["--max-concurrency", str(concurrency), "--request-rate", "inf"]
-
-    if wer_eval:
-        cmd.append("--seed-tts-wer-eval")
-
-    if output_dir or result_filename:
-        out_dir = output_dir or "."
-        os.makedirs(out_dir, exist_ok=True)
-        cmd += ["--save-result", "--result-dir", out_dir]
-        if result_filename:
-            cmd += ["--result-filename", result_filename]
-
-    cmd += extra_cli_args
-    return cmd
-
-
-def run_one_benchmark(cmd: list[str]) -> dict[str, Any] | None:
-    """Run a single benchmark subprocess and return parsed JSON result if available."""
-    print(f"\n{'=' * 60}")
-    print("Running:", " ".join(cmd))
-    print("=" * 60)
-    result = subprocess.run(cmd, check=False)
-    if result.returncode != 0:
-        print(f"[bench_tts] WARNING: benchmark exited with code {result.returncode}")
-        return None
-    # If --save-result was used, find the result file
-    try:
-        result_dir_idx = cmd.index("--result-dir")
-        result_dir = Path(cmd[result_dir_idx + 1])
-        if "--result-filename" in cmd:
-            fname_idx = cmd.index("--result-filename")
-            result_file = result_dir / cmd[fname_idx + 1]
-        else:
-            # find most recently modified json
-            jsons = sorted(result_dir.glob("result_*.json"), key=lambda p: p.stat().st_mtime)
-            result_file = jsons[-1] if jsons else None
-        if result_file and result_file.is_file():
-            return json.loads(result_file.read_text(encoding="utf-8"))
-    except (ValueError, IndexError, OSError):
-        pass
-    return None
-
-
-def print_summary_table(results: list[dict[str, Any]]) -> None:
-    """Print a unified metrics table across all (task, concurrency) runs."""
-    if not results:
-        return
-    header = (
-        f"{'Task':<16} {'Concurrency':>11} {'RTF mean':>10} "
-        f"{'TTFP (ms)':>10} {'Throughput':>12} {'WER':>7} {'SIM':>7} {'UTMOS':>7}"
-    )
-    print(f"\n{'=' * len(header)}")
-    print("BENCHMARK SUMMARY")
-    print("=" * len(header))
-    print(header)
-    print("-" * len(header))
-    for r in results:
-        task = r.get("_task", "?")
-        conc = r.get("_concurrency", "?")
-        rtf = r.get("mean_audio_rtf", float("nan"))
-        ttfp = r.get("mean_audio_ttfp_ms", float("nan"))
-        throughput = r.get("audio_throughput", float("nan"))
-        wer = r.get("seed_tts_mean_wer", float("nan"))
-        sim = r.get("seed_tts_mean_sim", float("nan"))
-        utmos = r.get("seed_tts_mean_utmos", float("nan"))
-
-        def fmt(v: float, digits: int = 3) -> str:
-            return f"{v:.{digits}f}" if not math.isnan(v) else " n/a"
-
-        print(
-            f"{task:<16} {str(conc):>11} {fmt(rtf):>10} {fmt(ttfp, 0):>10} "
-            f"{fmt(throughput):>12} {fmt(wer):>7} {fmt(sim):>7} {fmt(utmos):>7}"
-        )
-    print("=" * len(header))
-
-
-def main() -> None:
-    """Entry point for the universal TTS benchmark CLI."""
-    parser = argparse.ArgumentParser(
-        description=__doc__,
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-    )
-    parser.add_argument(
-        "--model", required=True, help="HuggingFace model ID (e.g. Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)"
-    )
-    parser.add_argument("--task", default="all", help="Task type: voice_clone | default_voice | voice_design | all")
-    parser.add_argument("--locale", default="en", choices=["en", "zh"])
-    parser.add_argument("--concurrency", type=int, nargs="+", default=[1, 4], metavar="N")
-    parser.add_argument(
-        "--num-prompts",
-        type=int,
-        nargs="+",
-        default=[20],
-        metavar="N",
-        help="Number of prompts per run. If one value, applied to all concurrency levels.",
-    )
-    parser.add_argument(
-        "--dataset-path", default=None, help="Root of seed-tts-eval dataset (required for voice_clone/default_voice)"
-    )
-    parser.add_argument("--wer-eval", action="store_true", help="Enable WER/SIM/UTMOS quality eval")
-    parser.add_argument("--output-dir", default=None, help="Directory to save result JSON files")
-    parser.add_argument("--host", default="localhost")
-    parser.add_argument("--port", type=int, default=8000)
-    parser.add_argument("--model-configs", default=str(_DEFAULT_MODEL_CONFIGS), help="Path to model_configs.yaml")
-    parser.add_argument("extra", nargs=argparse.REMAINDER, help="Extra args passed directly to vllm bench serve")
-    args = parser.parse_args()
-
-    model_configs = load_model_configs(Path(args.model_configs))
-    if args.model not in model_configs:
-        known = "\n ".join(model_configs.keys())
-        print(f"[bench_tts] ERROR: model '{args.model}' not in model_configs.yaml.\nKnown models:\n {known}")
-        sys.exit(1)
-
-    model_cfg = model_configs[args.model]
-    supported_tasks: list[str] = model_cfg.get("supported_tasks", [])
-
-    tasks_to_run: list[str]
-    if args.task == "all":
-        tasks_to_run = supported_tasks
-    elif args.task in supported_tasks:
-        tasks_to_run = [args.task]
-    else:
-        print(
-            f"[bench_tts] ERROR: task '{args.task}' not supported by {args.model}.\nSupported tasks: {supported_tasks}"
-        )
-        sys.exit(1)
-
-    # Align num_prompts list with concurrency list
-    num_prompts_list: list[int] = args.num_prompts
-    if len(num_prompts_list) == 1:
-        num_prompts_list = num_prompts_list * len(args.concurrency)
-    elif len(num_prompts_list) != len(args.concurrency):
-        print(
-            f"[bench_tts] ERROR: --num-prompts ({len(num_prompts_list)} values) must be "
-            f"length 1 or match --concurrency ({len(args.concurrency)} values)."
-        )
-        sys.exit(1)
-
-    all_results: list[dict[str, Any]] = []
-
-    for task in tasks_to_run:
-        for concurrency, num_prompts in zip(args.concurrency, num_prompts_list):
-            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
-            result_filename = f"bench_tts_{args.model.replace('/', '_')}_{task}_c{concurrency}_{ts}.json"
-            cmd = build_bench_args(
-                host=args.host,
-                port=args.port,
-                model=args.model,
-                task=task,
-                model_cfg=model_cfg,
-                locale=args.locale,
-                num_prompts=num_prompts,
-                concurrency=concurrency,
-                dataset_path=args.dataset_path,
-                wer_eval=args.wer_eval,
-                output_dir=args.output_dir,
-                result_filename=result_filename,
-                extra_cli_args=args.extra or [],
-            )
-            result = run_one_benchmark(cmd)
-            if result is not None:
-                result["_task"] = task
-                result["_concurrency"] = concurrency
-                all_results.append(result)
-                # Persist the metadata so plot_results.py can pick it up.
-                if args.output_dir and result_filename:
-                    result_path = Path(args.output_dir) / result_filename
-                    if result_path.is_file():
-                        result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
-
-    print_summary_table(all_results)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/benchmarks/tts/bench_voxcpm_offline.py b/benchmarks/tts/bench_voxcpm_offline.py
deleted file mode 100644
index 672b77f1495..00000000000
--- a/benchmarks/tts/bench_voxcpm_offline.py
+++ /dev/null
@@ -1,922 +0,0 @@
-"""Offline VoxCPM benchmark for vLLM Omni.
-
-Supports:
-- sync one-shot (Omni.generate)
-- streaming (AsyncOmni.generate with async_chunk config)
-- text-only synthesis
-- voice cloning
-- text/clone batch inputs from txt or jsonl
-
-Usage::
-
-    # Sync (default voice)
-    python benchmarks/tts/bench_voxcpm_offline.py \\
-        --model /path/to/VoxCPM \\
-        --text "Hello world" \\
-        --output-dir results/audio/
-
-    # Streaming (async_chunk)
-    python benchmarks/tts/bench_voxcpm_offline.py \\
-        --model /path/to/VoxCPM \\
-        --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm_async_chunk.yaml \\
-        --txt-prompts prompts.txt \\
-        --output-dir results/audio/
-
-    # Voice cloning batch via JSONL
-    python benchmarks/tts/bench_voxcpm_offline.py \\
-        --model /path/to/VoxCPM \\
-        --jsonl-prompts prompts.jsonl \\
-        --output-dir results/audio/
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import logging
-import os
-import tempfile
-import time
-import uuid
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-import torch
-from vllm.utils.argparse_utils import FlexibleArgumentParser
-
-from vllm_omni import AsyncOmni, Omni
-
-
-def _find_repo_root(start: Path) -> Path:
-    """Walk up from ``start`` until a repo marker is found.
-
-    Falls back to ``parents[2]`` for backwards compatibility if no marker hits
-    (which can only happen in unusual checkouts — the tree should always have
-    pyproject.toml + vllm_omni/ at the top level).
-    """
-    for candidate in [start, *start.parents]:
-        if (candidate / "pyproject.toml").is_file() and (candidate / "vllm_omni").is_dir():
-            return candidate
-    return start.parents[2]
-
-
-REPO_ROOT = _find_repo_root(Path(__file__).resolve())
-DEFAULT_STAGE_ASYNC = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "voxcpm_async_chunk.yaml"
-DEFAULT_STAGE_SYNC = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "voxcpm.yaml"
-
-logger = logging.getLogger(__name__)
-
-
-@dataclass(frozen=True, slots=True)
-class PromptSpec:
-    text: str
-    label: str
-    ref_audio: str | None = None
-    ref_text: str | None = None
-
-
-def _require_soundfile():
-    try:
-        import soundfile as sf  # type: ignore
-    except ModuleNotFoundError as exc:
-        raise RuntimeError(
-            "soundfile is required to write VoxCPM benchmark WAV outputs. Install it with: pip install soundfile"
-        ) from exc
-    return sf
-
-
-def _build_prompt(
-    args,
-    *,
-    text: str,
-    ref_audio: str | None = None,
-    ref_text: str | None = None,
-    global_request_id: str | None = None,
-) -> dict[str, Any]:
-    additional_information: dict[str, list[Any]] = {
-        "text": [text],
-        "cfg_value": [args.cfg_value],
-        "inference_timesteps": [args.inference_timesteps],
-        "min_len": [args.min_len],
-        "max_new_tokens": [args.max_new_tokens],
-    }
-    if args.streaming_prefix_len is not None:
-        additional_information["streaming_prefix_len"] = [args.streaming_prefix_len]
-
-    if ref_audio:
-        additional_information["ref_audio"] = [ref_audio]
-    if ref_text:
-        additional_information["ref_text"] = [ref_text]
-    if global_request_id is not None:
-        additional_information["global_request_id"] = [global_request_id]
-
-    return {
-        "prompt_token_ids": [1],
-        "additional_information": additional_information,
-    }
-
-
-def _extract_audio_tensor(mm: dict[str, Any]) -> torch.Tensor:
-    audio = mm.get("audio", mm.get("model_outputs"))
-    if audio is None:
-        raise ValueError("No audio output found in multimodal output.")
-    if isinstance(audio, list):
-        parts = [torch.as_tensor(a).float().cpu().reshape(-1) for a in audio]
-        audio = torch.cat(parts, dim=-1) if parts else torch.zeros(0)
-    if not isinstance(audio, torch.Tensor):
-        audio = torch.as_tensor(audio)
-    return audio.float().cpu().reshape(-1)
-
-
-def _extract_sample_rate(mm: dict[str, Any]) -> int:
-    sr_raw = mm.get("sr", 24000)
-    if isinstance(sr_raw, list) and sr_raw:
-        sr_raw = sr_raw[-1]
-    if hasattr(sr_raw, "item"):
-        return int(sr_raw.item())
-    return int(sr_raw)
-
-
-def _emit_offline_metrics(
-    *,
-    request_id: str,
-    elapsed_s: float,
-    first_audio_elapsed: float | None,
-    audio_duration_s: float,
-) -> None:
-    metrics = {
-        "request_id": request_id,
-        "ttfp_ms": round(first_audio_elapsed * 1000.0, 3) if first_audio_elapsed is not None else None,
-        "audio_duration_s": round(audio_duration_s, 6),
-        "rtf": round(elapsed_s / audio_duration_s, 6) if audio_duration_s > 0 else None,
-    }
-    print(f"[OfflineMetrics] {metrics}")
-
-
-def _write_audio_tensor(output_path: Path, audio_tensor: Any, sample_rate: int) -> None:
-    sf = _require_soundfile()
-    if isinstance(audio_tensor, torch.Tensor):
-        audio_np = audio_tensor.float().cpu().clamp(-1.0, 1.0).numpy()
-    else:
-        audio_np = torch.as_tensor(audio_tensor).float().cpu().clamp(-1.0, 1.0).numpy()
-    sf.write(
-        output_path,
-        audio_np,
-        sample_rate,
-        format="WAV",
-        subtype="PCM_16",
-    )
-
-
-def _save_wav(mm: dict[str, Any], output_dir: Path, request_id: str) -> Path:
-    output_dir.mkdir(parents=True, exist_ok=True)
-    output_path = output_dir / f"output_{request_id}.wav"
-    _write_audio_tensor(output_path, _extract_audio_tensor(mm), _extract_sample_rate(mm))
-    return output_path
-
-
-def _iter_request_multimodal_outputs(request_output: Any):
-    outputs = getattr(request_output, "outputs", None)
-    if outputs:
-        for output in outputs:
-            mm = getattr(output, "multimodal_output", None)
-            if isinstance(mm, dict):
-                yield mm
-
-    mm = getattr(request_output, "multimodal_output", None)
-    if isinstance(mm, dict):
-        yield mm
-
-
-def _read_non_empty_lines(path: str) -> list[str]:
-    with open(path, encoding="utf-8") as f:
-        return [line.strip() for line in f if line.strip()]
-
-
-def _load_prompt_specs(args) -> list[PromptSpec]:
-    specs: list[PromptSpec] = []
-
-    if args.txt_prompts is not None:
-        texts = _read_non_empty_lines(args.txt_prompts)
-        if not texts:
-            raise ValueError(f"No prompts found in {args.txt_prompts}")
-        for idx, text in enumerate(texts, start=1):
-            specs.append(
-                PromptSpec(
-                    text=text,
-                    label=f"item{idx:03d}",
-                    ref_audio=args.ref_audio,
-                    ref_text=args.ref_text,
-                )
-            )
-        return specs
-
-    if args.jsonl_prompts is not None:
-        with open(args.jsonl_prompts, encoding="utf-8") as f:
-            for line_no, raw_line in enumerate(f, start=1):
-                line = raw_line.strip()
-                if not line:
-                    continue
-                try:
-                    item = json.loads(line)
-                except json.JSONDecodeError as exc:
-                    raise ValueError(f"{args.jsonl_prompts}:{line_no} is not valid JSON: {exc}") from exc
-                if not isinstance(item, dict):
-                    raise ValueError(f"{args.jsonl_prompts}:{line_no} must be a JSON object")
-
-                text = item.get("text")
-                if not isinstance(text, str) or not text.strip():
-                    raise ValueError(f"{args.jsonl_prompts}:{line_no} requires non-empty string field 'text'")
-
-                ref_audio = item.get("ref_audio", args.ref_audio)
-                ref_text = item.get("ref_text", args.ref_text)
-                if (ref_audio is None) != (ref_text is None):
-                    raise ValueError(
-                        f"{args.jsonl_prompts}:{line_no} must provide both 'ref_audio' and 'ref_text' together"
-                    )
-
-                specs.append(
-                    PromptSpec(
-                        text=text.strip(),
-                        label=f"item{len(specs) + 1:03d}",
-                        ref_audio=ref_audio,
-                        ref_text=ref_text,
-                    )
-                )
-
-        if not specs:
-            raise ValueError(f"No prompts found in {args.jsonl_prompts}")
-        return specs
-
-    specs.append(
-        PromptSpec(
-            text=args.text,
-            label="item001",
-            ref_audio=args.ref_audio,
-            ref_text=args.ref_text,
-        )
-    )
-    return specs
-
-
-def _build_prompt_for_spec(args, spec: PromptSpec, *, global_request_id: str | None = None) -> dict[str, Any]:
-    return _build_prompt(
-        args,
-        text=spec.text,
-        ref_audio=spec.ref_audio,
-        ref_text=spec.ref_text,
-        global_request_id=global_request_id,
-    )
-
-
-def _count_voice_clone_prompts(prompt_specs: list[PromptSpec]) -> int:
-    return sum(1 for spec in prompt_specs if spec.ref_audio is not None)
-
-
-def _get_warmup_specs(prompt_specs: list[PromptSpec]) -> list[PromptSpec]:
-    return prompt_specs[:1]
-
-
-def _extract_stream_finished(stage_output: Any) -> bool:
-    request_output = getattr(stage_output, "request_output", None)
-    request_finished = getattr(request_output, "finished", None)
-    if request_finished is not None:
-        return bool(request_finished)
-    return bool(getattr(stage_output, "finished", False))
-
-
-def _build_profiled_stage_config(
-    stage_configs_path: str,
-    profiler_dir: str,
-) -> str:
-    stage_config_path = Path(stage_configs_path)
-    yaml_text = stage_config_path.read_text(encoding="utf-8")
-    injected_lines: list[str] = []
-    injected_count = 0
-
-    for line in yaml_text.splitlines():
-        injected_lines.append(line)
-        if line.strip() != "engine_args:":
-            continue
-        indent = line[: len(line) - len(line.lstrip())]
-        child_indent = indent + " "
-        grandchild_indent = child_indent + " "
-        injected_lines.extend(
-            [
-                f"{child_indent}profiler_config:",
-                f'{grandchild_indent}profiler: "torch"',
-                f'{grandchild_indent}torch_profiler_dir: "{profiler_dir}"',
-                f"{grandchild_indent}torch_profiler_with_stack: true",
-            ]
-        )
-        injected_count += 1
-
-    if injected_count == 0:
-        raise ValueError(f"No engine_args block found in stage config: {stage_configs_path}")
-
-    tmp = tempfile.NamedTemporaryFile(
-        mode="w",
-        encoding="utf-8",
-        delete=False,
-        suffix=".yaml",
-        prefix=f"{stage_config_path.stem}_profile_",
-    )
-    tmp.write("\n".join(injected_lines) + "\n")
-    tmp.close()
-    return tmp.name
-
-
-def parse_args():
- parser = FlexibleArgumentParser(
- description="Offline split-stage VoxCPM inference with vLLM Omni (auto sync/streaming by stage config)"
- )
- parser.add_argument(
- "--model",
- type=str,
- default=os.environ.get("VOXCPM_MODEL"),
- help="Local VoxCPM model directory. Defaults to $VOXCPM_MODEL.",
- )
- parser.add_argument(
- "--text",
- type=str,
- default="This is a split-stage VoxCPM synthesis example running on vLLM Omni.",
- help="Text to synthesize. Ignored when --txt-prompts or --jsonl-prompts is used.",
- )
- parser.add_argument(
- "--txt-prompts",
- type=str,
- default=None,
- help="Path to a .txt file with one synthesis text per line.",
- )
- parser.add_argument(
- "--jsonl-prompts",
- type=str,
- default=None,
- help=(
- "Path to a .jsonl file. Each line must contain at least {'text': ...}; "
- "clone rows can also set ref_audio/ref_text, and ref_text must be the "
- "real transcript of ref_audio."
- ),
- )
- parser.add_argument(
- "--ref-audio",
- type=str,
- default=None,
- help=(
- "Optional reference audio path for voice cloning. With --txt-prompts, "
- "the same reference is applied to every line."
- ),
- )
- parser.add_argument(
- "--ref-text",
- type=str,
- default=None,
- help=(
- "Real transcript of the reference audio. Placeholder text or mismatched "
- "text will usually produce noisy/electronic clone audio."
- ),
- )
- parser.add_argument(
- "--stage-configs-path",
- type=str,
- default=str(DEFAULT_STAGE_SYNC),
- help="Stage config YAML path. Routing is selected only from this path.",
- )
- parser.add_argument(
- "--cfg-value",
- type=float,
- default=2.0,
- help="Classifier-free guidance value for VoxCPM.",
- )
- parser.add_argument(
- "--inference-timesteps",
- type=int,
- default=10,
- help="Number of inference timesteps.",
- )
- parser.add_argument(
- "--min-len",
- type=int,
- default=2,
- help="Minimum generated token length.",
- )
- parser.add_argument(
- "--max-new-tokens",
- type=int,
- default=4096,
- help="Maximum generated token length.",
- )
- parser.add_argument(
- "--streaming-prefix-len",
- type=int,
- default=None,
- help="VoxCPM streaming window (optional, streaming mode only).",
- )
- parser.add_argument(
- "--output-dir",
- type=str,
- default=None,
- help="Directory for output WAV files.",
- )
- parser.add_argument(
- "--stage-init-timeout",
- type=int,
- default=600,
- help="Stage initialization timeout in seconds.",
- )
- parser.add_argument(
- "--log-stats",
- dest="log_stats",
- action="store_true",
- help="Enable vLLM Omni stats logging.",
- )
- parser.add_argument(
- "--no-log-stats",
- dest="log_stats",
- action="store_false",
- help="Disable vLLM Omni stats logging.",
- )
- parser.set_defaults(log_stats=True)
- parser.add_argument(
- "--num-runs",
- type=int,
- default=1,
- help="Number of full inference runs; each run processes every loaded prompt. Default 1.",
- )
- parser.add_argument(
- "--warmup-runs",
- type=int,
- default=0,
- help=(
- "Optional number of warmup passes before measured runs. Warmup uses only "
- "the first prompt and does not save outputs."
- ),
- )
- parser.add_argument(
- "--enable-profiler",
- action="store_true",
- help=(
- "Enable torch profiler for the configured stages. A temporary profiled "
- "stage config is generated automatically."
- ),
- )
- parser.add_argument(
- "--profiler-dir",
- type=str,
- default=None,
- help="Directory for profiler traces. Defaults to the profiler/ subdirectory of --output-dir when profiling is enabled.",
- )
- parser.add_argument(
- "--profiler-stages",
- type=int,
- nargs="*",
- default=None,
- help="Optional stage ids to profile. Defaults to all stages that have profiler_config.",
- )
- parser.add_argument(
- "--profiler-wait-seconds",
- type=float,
- default=30.0,
- help="Seconds to wait after stop_profile for trace files to flush.",
- )
- args = parser.parse_args()
-
- if not args.model:
- parser.error("--model is required unless $VOXCPM_MODEL is set")
- if args.txt_prompts is not None and args.jsonl_prompts is not None:
- parser.error("--txt-prompts and --jsonl-prompts are mutually exclusive")
- if (args.ref_audio is None) != (args.ref_text is None):
- parser.error("--ref-audio and --ref-text must be provided together")
- if args.num_runs < 1:
- parser.error("--num-runs must be >= 1")
- if args.warmup_runs < 0:
- parser.error("--warmup-runs must be >= 0")
- if args.output_dir is None:
- args.output_dir = (
- "output_audio_streaming" if _is_streaming_stage_config(args.stage_configs_path) else "output_audio"
- )
- if args.enable_profiler and args.profiler_dir is None:
- args.profiler_dir = str(Path(args.output_dir) / "profiler")
- try:
- args.prompt_specs = _load_prompt_specs(args)
- except ValueError as exc:
- parser.error(str(exc))
-
- return args
-
-
-def _is_streaming_stage_config(stage_configs_path: str) -> bool:
- cfg_name = Path(stage_configs_path).name.lower()
- return "async_chunk" in cfg_name
-
-
-async def _collect_streaming_audio(
- omni: AsyncOmni,
- args: Any,
- spec: PromptSpec,
- request_id: str,
- *,
- phase_label: str,
- prompt_index: int,
- prompt_count: int,
- print_prompt: bool = False,
-) -> tuple[torch.Tensor, int, float, float | None]:
- prompt = _build_prompt_for_spec(args, spec, global_request_id=request_id)
- delta_chunks: list[torch.Tensor] = []
- sample_rate = 24000
- chunk_i = 0
- prev_total_samples = 0
- t_start = time.perf_counter()
- first_audio_elapsed: float | None = None
-
- if print_prompt:
- print(f"---prompt---:{prompt}")
-
- async for stage_output in omni.generate(prompt, request_id=request_id):
- mm = getattr(stage_output, "multimodal_output", None)
- if not isinstance(mm, dict):
- ro = getattr(stage_output, "request_output", None)
- if ro is None:
- continue
- mm = getattr(ro, "multimodal_output", None)
- if not isinstance(mm, dict) and getattr(ro, "outputs", None):
- seq = ro.outputs[0]
- mm = getattr(seq, "multimodal_output", None)
- if not isinstance(mm, dict):
- continue
- sample_rate = _extract_sample_rate(mm)
- try:
- w = _extract_audio_tensor(mm)
- n = int(w.numel())
- if n == 0:
- continue
- finished = _extract_stream_finished(stage_output)
- if n > prev_total_samples:
- delta = w.reshape(-1)[prev_total_samples:]
- prev_total_samples = n
- elif finished and n == prev_total_samples:
- delta = w.reshape(-1)[:0]
- else:
- delta = w.reshape(-1)
- prev_total_samples += int(delta.numel())
- if int(delta.numel()) > 0:
- delta_chunks.append(delta)
- if first_audio_elapsed is None and int(delta.numel()) > 0:
- first_audio_elapsed = time.perf_counter() - t_start
- logger.info(
- "%s prompt=%d/%d chunk=%d delta_samples=%d buf_len=%d finished=%s",
- phase_label,
- prompt_index + 1,
- prompt_count,
- chunk_i,
- int(delta.numel()),
- n,
- finished,
- )
- chunk_i += 1
- except ValueError:
- if not _extract_stream_finished(stage_output):
- logger.debug("skip non-audio partial output chunk=%d", chunk_i)
-
- if not delta_chunks:
- raise RuntimeError("No audio chunks received; check stage config and logs.")
-
- audio_cat = torch.cat([c.reshape(-1) for c in delta_chunks], dim=0)
- elapsed = time.perf_counter() - t_start
- return audio_cat, sample_rate, elapsed, first_audio_elapsed
-
-
-async def _abort_streaming_residual_work(
- omni: AsyncOmni,
- request_id: str,
- *,
- settle_seconds: float = 0.1,
-) -> None:
- """Stop any late stage-0 work once the final audio has been collected."""
- await omni.engine.abort_async([request_id])
- if settle_seconds > 0:
- await asyncio.sleep(settle_seconds)
-
-
-async def _run_streaming_single(
- omni: AsyncOmni,
- args: Any,
- spec: PromptSpec,
- output_dir: Path,
- request_id: str,
- *,
- run_index: int,
- num_runs: int,
- prompt_index: int,
- prompt_count: int,
-) -> Path:
- audio_cat, sample_rate, elapsed, first_audio_elapsed = await _collect_streaming_audio(
- omni,
- args,
- spec,
- request_id,
- phase_label=f"run={run_index + 1}/{num_runs}",
- prompt_index=prompt_index,
- prompt_count=prompt_count,
- print_prompt=(run_index == 0 and prompt_index == 0),
- )
- await _abort_streaming_residual_work(omni, request_id)
- output_path = output_dir / f"output_run{run_index + 1}_{spec.label}.wav"
- _write_audio_tensor(output_path, audio_cat, sample_rate)
- audio_duration_s = float(audio_cat.numel()) / float(sample_rate) if sample_rate > 0 else 0.0
- ttfp_text = f", ttfp={first_audio_elapsed:.2f}s" if first_audio_elapsed is not None else ""
- rtf_text = f", rtf={elapsed / audio_duration_s:.3f}" if audio_duration_s > 0 else ""
- print(
- f"Saved (streaming) run {run_index + 1}/{num_runs}, "
- f"prompt {prompt_index + 1}/{prompt_count}: {output_path} ({elapsed:.2f}s{ttfp_text}{rtf_text})"
- )
- _emit_offline_metrics(
- request_id=request_id,
- elapsed_s=elapsed,
- first_audio_elapsed=first_audio_elapsed,
- audio_duration_s=audio_duration_s,
- )
- return output_path
-
-
-async def _run_streaming_warmup(args, omni: AsyncOmni) -> None:
- if args.warmup_runs == 0:
- return
-
- warmup_specs = _get_warmup_specs(args.prompt_specs)
- print(
- f"Warmup: {args.warmup_runs} run(s) using the first prompt "
- f"({len(warmup_specs)} prompt(s)); outputs will be discarded."
- )
- for warmup_index in range(args.warmup_runs):
- t_warmup = time.perf_counter()
- tasks = []
- request_ids: list[str] = []
- for prompt_index, spec in enumerate(warmup_specs):
- request_id = f"warmup_stream_{warmup_index + 1}_{spec.label}_{uuid.uuid4().hex[:8]}"
- request_ids.append(request_id)
- tasks.append(
- _collect_streaming_audio(
- omni,
- args,
- spec,
- request_id,
- phase_label=f"warmup={warmup_index + 1}/{args.warmup_runs}",
- prompt_index=prompt_index,
- prompt_count=len(warmup_specs),
- )
- )
- results = await asyncio.gather(*tasks)
- for request_id in request_ids:
- await _abort_streaming_residual_work(omni, request_id)
- total_samples = sum(int(audio.numel()) for audio, _, _, _ in results)
- warmup_ttfps = [ttfp for _, _, _, ttfp in results if ttfp is not None]
- ttfp_text = f", ttfp={min(warmup_ttfps):.2f}s" if warmup_ttfps else ""
- print(
- f"Warmup (streaming) {warmup_index + 1}/{args.warmup_runs} finished: "
- f"{len(results)} prompt(s), {total_samples} sample(s) "
- f"({time.perf_counter() - t_warmup:.2f}s{ttfp_text})"
- )
-
-
-async def _run_streaming(args) -> list[Path]:
- output_dir = Path(args.output_dir)
- output_dir.mkdir(parents=True, exist_ok=True)
-
- omni = AsyncOmni(
- model=args.model,
- stage_configs_path=args.stage_configs_path,
- log_stats=args.log_stats,
- stage_init_timeout=args.stage_init_timeout,
- )
-
- await _run_streaming_warmup(args, omni)
- profiler_started = False
- if args.enable_profiler:
- profile_prefix = f"voxcpm_streaming_{int(time.time())}"
- stages_text = args.profiler_stages if args.profiler_stages is not None else "all-configured"
- print(f"Starting profiler (streaming): stages={stages_text}, dir={args.profiler_dir}")
- await omni.start_profile(profile_prefix=profile_prefix, stages=args.profiler_stages)
- profiler_started = True
- t_total = time.perf_counter()
- total_elapsed = 0.0
- paths: list[Path] = []
- prompt_specs: list[PromptSpec] = args.prompt_specs
- try:
- for run in range(args.num_runs):
- for prompt_index, spec in enumerate(prompt_specs):
- request_id = f"stream_{run + 1}_{spec.label}_{uuid.uuid4().hex[:8]}"
- paths.append(
- await _run_streaming_single(
- omni,
- args,
- spec,
- output_dir,
- request_id,
- run_index=run,
- num_runs=args.num_runs,
- prompt_index=prompt_index,
- prompt_count=len(prompt_specs),
- )
- )
- total_elapsed = time.perf_counter() - t_total
- finally:
- if profiler_started:
- print("Stopping profiler (streaming)...")
- await omni.stop_profile(stages=args.profiler_stages)
- if args.profiler_wait_seconds > 0:
- print(f"Waiting {args.profiler_wait_seconds:.1f}s for profiler traces to flush...")
- await asyncio.sleep(args.profiler_wait_seconds)
-
- print(
- f"All streaming runs finished: {args.num_runs} run(s), "
- f"{len(prompt_specs)} prompt(s), {len(paths)} file(s) in {total_elapsed:.2f}s total"
- )
- return paths
-
-
-def _run_sync(args) -> list[Path]:
- output_dir = Path(args.output_dir)
-
- omni = Omni(
- model=args.model,
- stage_configs_path=args.stage_configs_path,
- log_stats=args.log_stats,
- stage_init_timeout=args.stage_init_timeout,
- )
-
- def _run_sync_single(
- spec: PromptSpec,
- *,
- request_prefix: str,
- save_outputs: bool,
- run_index: int | None = None,
- ) -> tuple[list[Path], int, float | None, float, float, str]:
- global_request_id = f"{request_prefix}_{spec.label}"
- prompt = _build_prompt_for_spec(args, spec, global_request_id=global_request_id)
- if save_outputs and run_index == 0 and spec.label == "item001":
- print(f"---prompt---:{prompt}")
-
- saved_paths: list[Path] = []
- output_count = 0
- first_audio_elapsed: float | None = None
- total_audio_duration_s = 0.0
- metrics_request_id = global_request_id
- t_start = time.perf_counter()
- for stage_outputs in omni.generate(prompt):
- request_output = stage_outputs.request_output
- if request_output is None:
- continue
- request_output_id = getattr(request_output, "request_id", None)
- if isinstance(request_output_id, str) and request_output_id:
- metrics_request_id = request_output_id
- for j, mm in enumerate(_iter_request_multimodal_outputs(request_output)):
- output_count += 1
- if first_audio_elapsed is None:
- try:
- audio_tensor = _extract_audio_tensor(mm)
- if int(audio_tensor.numel()) > 0:
- first_audio_elapsed = time.perf_counter() - t_start
- total_audio_duration_s += float(audio_tensor.numel()) / float(_extract_sample_rate(mm))
- except ValueError:
- pass
- else:
- try:
- audio_tensor = _extract_audio_tensor(mm)
- total_audio_duration_s += float(audio_tensor.numel()) / float(_extract_sample_rate(mm))
- except ValueError:
- pass
- if not save_outputs:
- continue
- save_stem = f"run{run_index + 1}_{spec.label}" if j == 0 else f"run{run_index + 1}_{spec.label}_{j}"
- saved_paths.append(_save_wav(mm, output_dir, save_stem))
-
- if output_count == 0:
- raise RuntimeError("No output from Omni.generate")
- elapsed_s = time.perf_counter() - t_start
- return saved_paths, output_count, first_audio_elapsed, elapsed_s, total_audio_duration_s, metrics_request_id
-
- if args.warmup_runs:
- warmup_specs = _get_warmup_specs(args.prompt_specs)
- print(
- f"Warmup: {args.warmup_runs} run(s) using the first prompt "
- f"({len(warmup_specs)} prompt(s)); outputs will be discarded."
- )
- for warmup_index in range(args.warmup_runs):
- t_warmup = time.perf_counter()
- _, output_count, first_audio_elapsed, elapsed_s, audio_duration_s, _ = _run_sync_single(
- warmup_specs[0],
- request_prefix=f"warmup_sync{warmup_index + 1}",
- save_outputs=False,
- )
- ttfp_text = f", ttfp={first_audio_elapsed:.2f}s" if first_audio_elapsed is not None else ""
- rtf_text = f", rtf={elapsed_s / audio_duration_s:.3f}" if audio_duration_s > 0 else ""
- print(
- f"Warmup (sync) {warmup_index + 1}/{args.warmup_runs} finished: "
- f"{output_count} output(s) ({time.perf_counter() - t_warmup:.2f}s{ttfp_text}{rtf_text})"
- )
-
- profiler_started = False
- if args.enable_profiler:
- profile_prefix = f"voxcpm_sync_{int(time.time())}"
- stages_text = args.profiler_stages if args.profiler_stages is not None else "all-configured"
- print(f"Starting profiler (sync): stages={stages_text}, dir={args.profiler_dir}")
- omni.start_profile(profile_prefix=profile_prefix, stages=args.profiler_stages)
- profiler_started = True
-
- t_total = time.perf_counter()
- total_elapsed = 0.0
- saved_paths: list[Path] = []
- prompt_specs: list[PromptSpec] = args.prompt_specs
- try:
- for run in range(args.num_runs):
- t_run = time.perf_counter()
- run_paths: list[Path] = []
- for prompt_index, spec in enumerate(prompt_specs):
- prompt_paths, _, first_audio_elapsed, elapsed_s, audio_duration_s, metrics_request_id = (
- _run_sync_single(
- spec,
- request_prefix=f"sync_run{run + 1}_{prompt_index + 1:03d}",
- save_outputs=True,
- run_index=run,
- )
- )
- run_paths.extend(prompt_paths)
- ttfp_text = f", ttfp={first_audio_elapsed:.2f}s" if first_audio_elapsed is not None else ""
- rtf_text = f", rtf={elapsed_s / audio_duration_s:.3f}" if audio_duration_s > 0 else ""
- print(
- f"Saved (sync) run {run + 1}/{args.num_runs}, "
- f"prompt {prompt_index + 1}/{len(prompt_specs)}: {len(prompt_paths)} file(s){ttfp_text}{rtf_text}"
- )
- _emit_offline_metrics(
- request_id=metrics_request_id,
- elapsed_s=elapsed_s,
- first_audio_elapsed=first_audio_elapsed,
- audio_duration_s=audio_duration_s,
- )
-
- saved_paths.extend(run_paths)
- print(
- f"Run {run + 1}/{args.num_runs} finished: {len(run_paths)} file(s) ({time.perf_counter() - t_run:.2f}s)"
- )
- for path in run_paths:
- print(f" {path}")
-
- total_elapsed = time.perf_counter() - t_total
- finally:
- if profiler_started:
- print("Stopping profiler (sync)...")
- omni.stop_profile(stages=args.profiler_stages)
- if args.profiler_wait_seconds > 0:
- print(f"Waiting {args.profiler_wait_seconds:.1f}s for profiler traces to flush...")
- time.sleep(args.profiler_wait_seconds)
-
- print(
- f"All sync runs finished: {args.num_runs} run(s), "
- f"{len(prompt_specs)} prompt(s), {len(saved_paths)} file(s) in {total_elapsed:.2f}s total"
- )
- return saved_paths
-
-
-def main(args) -> int:
- logging.basicConfig(level=logging.INFO)
- profiled_stage_config_path: str | None = None
- original_stage_config_path = args.stage_configs_path
- if args.enable_profiler:
- Path(args.profiler_dir).mkdir(parents=True, exist_ok=True)
- profiled_stage_config_path = _build_profiled_stage_config(
- args.stage_configs_path,
- str(Path(args.profiler_dir).resolve()),
- )
- args.stage_configs_path = profiled_stage_config_path
-
- is_streaming = _is_streaming_stage_config(args.stage_configs_path)
- voice_clone_count = _count_voice_clone_prompts(args.prompt_specs)
- print(f"Model: {args.model}")
- print(f"Stage config: {original_stage_config_path}")
- print(f"Route: {'streaming' if is_streaming else 'sync'} (from stage-configs-path)")
- print(f"Prompt count: {len(args.prompt_specs)}")
- print("Batch mode: sequential (aligned with native VoxCPM)")
- print(f"Warmup runs: {args.warmup_runs}")
- print(f"Voice cloning prompts: {voice_clone_count}/{len(args.prompt_specs)}")
- if args.enable_profiler:
- print(f"Profiler: enabled (dir={args.profiler_dir}, stages={args.profiler_stages or 'all-configured'})")
- print(f"Profiled stage config: {args.stage_configs_path}")
- if voice_clone_count:
- print("Voice cloning note: --ref-text/ref_text must match the spoken content of the reference audio.")
- print(f"Num runs: {args.num_runs}")
- try:
- if is_streaming:
- asyncio.run(_run_streaming(args))
- else:
- _run_sync(args)
- finally:
- if profiled_stage_config_path is not None and os.path.exists(profiled_stage_config_path):
- os.unlink(profiled_stage_config_path)
- return 0
-
-
-if __name__ == "__main__":
- os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
- raise SystemExit(main(parse_args()))
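Review note: the streaming collector removed above appears to accept chunks that are either a cumulative buffer or a standalone delta. The sketch below restates that branching with plain Python lists instead of tensors. The name `take_delta` is hypothetical and does not appear in the deleted script; the original also skips empty chunks before reaching this logic.

```python
def take_delta(buffer, prev_total, finished):
    """Return (new_samples, updated_prev_total) for one streamed chunk.

    Hypothetical helper mirroring the branch order of the deleted collector.
    """
    n = len(buffer)
    if n > prev_total:
        # The buffer grew: treat it as cumulative and keep only the unseen tail.
        return list(buffer[prev_total:]), n
    if finished and n == prev_total:
        # The final chunk repeats the full buffer: nothing new to append.
        return [], prev_total
    # Otherwise treat the chunk as a standalone delta and append all of it.
    return list(buffer), prev_total + n


# Replay a short stream: two cumulative chunks, then a repeated final buffer.
collected, seen = [], 0
for chunk, done in [([1, 2, 3], False), ([1, 2, 3, 4, 5], False), ([1, 2, 3, 4, 5], True)]:
    delta, seen = take_delta(chunk, seen, done)
    collected.extend(delta)
```

One consequence of this ordering is that a chunk shorter than or equal to the running total (and not the repeated final buffer) is always appended whole, so the heuristic assumes a stage never re-sends a shorter cumulative buffer.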
diff --git a/benchmarks/tts/model_configs.yaml b/benchmarks/tts/model_configs.yaml
deleted file mode 100644
index 83b25370538..00000000000
--- a/benchmarks/tts/model_configs.yaml
+++ /dev/null
@@ -1,39 +0,0 @@
-# Universal TTS benchmark model registry.
-# Maps HuggingFace model ID → supported tasks + per-task extra body fields.
-# To add a new TTS model: add an entry here. No code changes required.
-#
-# The server auto-loads its Deploy YAML from vllm_omni/deploy/.yaml via
-# the Pipeline + Deploy schema introduced in #2383, so no stage_config path
-# is tracked here.
-
-models:
- # -CustomVoice checkpoints lack speaker_encoder weights, so voice_clone is
- # NOT supported (an attempt raises ValueError from _extract_speaker_embedding
- # at model runtime). Use -Base for voice_clone.
- Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice:
- supported_tasks: [default_voice, voice_design]
- backend: openai-audio-speech
- endpoint: /v1/audio/speech
- task_extra_body:
- default_voice:
- voice: Vivian
- language: English
- task_type: CustomVoice
- voice_design:
- task_type: VoiceDesign
- language: English
-
- Qwen/Qwen3-TTS-12Hz-1.7B-Base:
- supported_tasks: [voice_clone]
- backend: openai-audio-speech
- endpoint: /v1/audio/speech
- task_extra_body:
- voice_clone:
- task_type: Base
-
- openbmb/VoxCPM2:
- supported_tasks: [voice_clone]
- backend: openai-audio-speech
- endpoint: /v1/audio/speech
- task_extra_body:
- voice_clone: {}
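Review note: the registry removed above maps each model to its supported tasks and per-task extra body fields. A minimal sketch of how a client might merge those fields into a request payload is shown below. The `REGISTRY` dict copies two entries from the YAML; `build_request_body` and the `model`/`input` payload keys are assumptions for illustration, not taken from `bench_tts.py`.

```python
# Hypothetical in-memory copy of two registry entries from the YAML above.
REGISTRY = {
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base": {
        "supported_tasks": ["voice_clone"],
        "endpoint": "/v1/audio/speech",
        "task_extra_body": {"voice_clone": {"task_type": "Base"}},
    },
    "openbmb/VoxCPM2": {
        "supported_tasks": ["voice_clone"],
        "endpoint": "/v1/audio/speech",
        "task_extra_body": {"voice_clone": {}},
    },
}


def build_request_body(model: str, task: str, text: str) -> dict:
    """Merge per-task extra fields into a base payload (illustrative only)."""
    entry = REGISTRY[model]
    if task not in entry["supported_tasks"]:
        raise ValueError(f"{model} does not support task {task!r}")
    body = {"model": model, "input": text}  # payload key names are assumptions
    body.update(entry["task_extra_body"].get(task, {}))
    return body
```

Rejecting unsupported tasks up front would surface the `-CustomVoice` limitation described in the registry comments before any request is sent, rather than at model runtime.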
diff --git a/benchmarks/tts/plot_results.py b/benchmarks/tts/plot_results.py
deleted file mode 100644
index f19c613209a..00000000000
--- a/benchmarks/tts/plot_results.py
+++ /dev/null
@@ -1,324 +0,0 @@
-"""Plot universal TTS benchmark results.
-
-Reads JSON files saved by ``bench_tts.py`` (via ``vllm bench serve --omni``)
-and generates comparison bar charts grouped by task type.
-
-Metrics plotted:
-- AUDIO_TTFP (mean audio time-to-first-packet, ms)
-- E2EL (mean end-to-end latency, ms)
-- Audio RTF (mean real-time factor)
-- Audio throughput (audio-seconds / wall-second)
-
-Quality metrics (WER / SIM / UTMOS) are printed in a table when present.
-
-Usage::
-
- # Single run — one JSON per task, all in results/
- python benchmarks/tts/plot_results.py \\
- --results results/bench_tts_*.json \\
- --output results/tts_benchmark.png
-
- # Compare two runs (e.g. async_chunk on vs off)
- python benchmarks/tts/plot_results.py \\
- --results run_a/bench_tts_*.json \\
- --results run_b/bench_tts_*.json \\
- --labels "async_chunk_on" "async_chunk_off" \\
- --output results/comparison.png
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import math
-from pathlib import Path
-
-import matplotlib.pyplot as plt
-import numpy as np
-
-# ---------------------------------------------------------------------------
-# JSON loading
-# ---------------------------------------------------------------------------
-
-
-def load_run(paths: list[str]) -> list[dict]:
- """Load and merge all JSON files for one run into a flat list of records.
-
- Each record is expected to have at least ``_concurrency`` (int) and
- ``_task`` (str) keys injected by ``bench_tts.py``. Records that come
- from a file that contains a list are flattened.
- """
- records: list[dict] = []
- for p in paths:
- raw = json.loads(Path(p).read_text(encoding="utf-8"))
- if isinstance(raw, list):
- records.extend(raw)
- elif isinstance(raw, dict):
- records.append(raw)
- return records
-
-
-def _get(record: dict, key: str) -> float:
- v = record.get(key, float("nan"))
- if v is None or (isinstance(v, float) and math.isnan(v)):
- return float("nan")
- try:
- return float(v)
- except (TypeError, ValueError):
- return float("nan")
-
-
-# ---------------------------------------------------------------------------
-# Plotting helpers
-# ---------------------------------------------------------------------------
-
-
-def _bar_group(
- ax: plt.Axes,
- x: np.ndarray,
- data_per_label: dict[str, list[float]],
- width: float,
- colors: list[str],
- ylabel: str,
- title: str,
- concurrency_labels: list[str],
- fmt: str = ".1f",
-) -> None:
- n = len(data_per_label)
- offsets = np.linspace(-(n - 1) * width / 2, (n - 1) * width / 2, n) if n > 1 else [0.0]
-
- for i, (label, values) in enumerate(data_per_label.items()):
- plot_vals = [0.0 if math.isnan(v) else v for v in values]
- bar = ax.bar(x + offsets[i], plot_vals, width, label=label, color=colors[i % len(colors)], alpha=0.85)
- max_val = max((v for v in values if not math.isnan(v)), default=1.0)
- for rect, val in zip(bar, values):
- if not math.isnan(val) and val > 0:
- ax.text(
- rect.get_x() + rect.get_width() / 2,
- rect.get_height() + max_val * 0.02,
- f"{val:{fmt}}",
- ha="center",
- va="bottom",
- fontsize=8,
- fontweight="bold",
- )
-
- ax.set_xlabel("Concurrency", fontsize=11)
- ax.set_ylabel(ylabel, fontsize=11)
- ax.set_title(title, fontsize=12, fontweight="bold")
- ax.set_xticks(x)
- ax.set_xticklabels(concurrency_labels)
- ax.legend(fontsize=9)
- ax.grid(axis="y", alpha=0.3)
- ax.set_axisbelow(True)
-
-
-COLORS = ["#2196F3", "#FF5722", "#4CAF50", "#FFC107", "#9C27B0"]
-
-
-# ---------------------------------------------------------------------------
-# Comparison plot (multiple labels / runs)
-# ---------------------------------------------------------------------------
-
-
-def plot_comparison(
- all_runs: list[list[dict]],
- labels: list[str],
- output_path: str,
- task_filter: str | None = None,
- title_prefix: str = "TTS",
-) -> None:
- """Plot one row of four metric subplots per task found in the data."""
- # Determine tasks to plot
- tasks: list[str] = []
- for run in all_runs:
- for r in run:
- t = r.get("_task", "unknown")
- if t not in tasks:
- tasks.append(t)
- if task_filter:
- tasks = [t for t in tasks if t == task_filter]
-
- n_tasks = len(tasks)
- if n_tasks == 0:
- print("[plot_results] No tasks found in data.")
- return
-
- fig, axes_grid = plt.subplots(n_tasks, 4, figsize=(18, 4.5 * n_tasks))
- fig.suptitle(f"{title_prefix} Benchmark", fontsize=15, fontweight="bold")
-
- # Ensure axes_grid is always 2D
- if n_tasks == 1:
- axes_grid = [axes_grid]
-
- for row_idx, task in enumerate(tasks):
- # Collect concurrencies across all runs for this task
- all_concs: set[int] = set()
- for run in all_runs:
- for r in run:
- if r.get("_task") == task:
- c = r.get("_concurrency")
- if c is not None:
- all_concs.add(int(c))
- concurrencies = sorted(all_concs)
- x = np.arange(len(concurrencies))
- conc_labels = [str(c) for c in concurrencies]
-
- def _series(run: list[dict], metric_key: str) -> list[float]:
- conc_map = {int(r["_concurrency"]): r for r in run if r.get("_task") == task and "_concurrency" in r}
- return [_get(conc_map.get(c, {}), metric_key) for c in concurrencies]
-
- metrics = [
- ("mean_audio_ttfp_ms", "TTFP (ms)", "Time-to-First-Packet", ".0f"),
- ("mean_e2el_ms", "E2E Latency (ms)", "End-to-End Latency", ".0f"),
- ("mean_audio_rtf", "RTF", "Real-Time Factor (RTF)", ".3f"),
- ("audio_throughput", "audio-s / wall-s", "Audio Throughput", ".2f"),
- ]
-
- axes_row = axes_grid[row_idx]
- for col_idx, (key, ylabel, subtitle, fmt) in enumerate(metrics):
- data_per_label = {lbl: _series(run, key) for lbl, run in zip(labels, all_runs)}
- _bar_group(
- axes_row[col_idx],
- x,
- data_per_label,
- width=0.3 if len(labels) > 1 else 0.5,
- colors=COLORS,
- ylabel=ylabel,
- title=f"{task} — {subtitle}",
- concurrency_labels=conc_labels,
- fmt=fmt,
- )
-
- plt.tight_layout()
- Path(output_path).parent.mkdir(parents=True, exist_ok=True)
- plt.savefig(output_path, dpi=150, bbox_inches="tight")
- print(f"Plot saved to {output_path}")
- plt.close()
-
-
-# ---------------------------------------------------------------------------
-# Markdown comparison table
-# ---------------------------------------------------------------------------
-
-
-def print_comparison_table(all_runs: list[list[dict]], labels: list[str]) -> None:
- tasks: list[str] = []
- for run in all_runs:
- for r in run:
- t = r.get("_task", "unknown")
- if t not in tasks:
- tasks.append(t)
-
- perf_metrics = [
- ("TTFP (ms)", "mean_audio_ttfp_ms", ".1f"),
- ("E2E (ms)", "mean_e2el_ms", ".1f"),
- ("RTF", "mean_audio_rtf", ".3f"),
- ("Throughput (a-s/s)", "audio_throughput", ".2f"),
- ]
- quality_metrics = [
- ("WER (%)", "seed_tts_mean_wer", ".1f"),
- ("SIM", "seed_tts_mean_sim", ".3f"),
- ("UTMOS", "seed_tts_mean_utmos", ".2f"),
- ]
-
- for task in tasks:
- all_concs: set[int] = set()
- for run in all_runs:
- for r in run:
- if r.get("_task") == task:
- c = r.get("_concurrency")
- if c is not None:
- all_concs.add(int(c))
- concurrencies = sorted(all_concs)
-
- print(f"\n## {task}\n")
- col_header = "| Metric | Concurrency |" + "".join(f" {lbl} |" for lbl in labels)
- sep = "| --- | --- |" + " --- |" * len(labels)
- print(col_header)
- print(sep)
-
- for metric, key, fmt in perf_metrics + quality_metrics:
- for c in concurrencies:
- row = f"| {metric} | {c} |"
- for run in all_runs:
- conc_map = {
- int(r["_concurrency"]): r for r in run if r.get("_task") == task and "_concurrency" in r
- }
- val = _get(conc_map.get(c, {}), key)
- row += f" {val:{fmt}} |" if not math.isnan(val) else " n/a |"
- print(row)
-
- # Improvement column (2-run comparison only)
- if len(all_runs) == 2:
- print(f"\n### Improvement ({labels[0]} vs {labels[1]})\n")
- print("| Metric | Concurrency | Change |")
- print("| --- | --- | --- |")
- for metric, key, _ in perf_metrics:
- for c in concurrencies:
- conc_map0 = {
- int(r["_concurrency"]): r for r in all_runs[0] if r.get("_task") == task and "_concurrency" in r
- }
- conc_map1 = {
- int(r["_concurrency"]): r for r in all_runs[1] if r.get("_task") == task and "_concurrency" in r
- }
- v0 = _get(conc_map0.get(c, {}), key)
- v1 = _get(conc_map1.get(c, {}), key)
- if not math.isnan(v0) and not math.isnan(v1) and v1 > 0:
- pct = (v1 - v0) / v1 * 100
- print(f"| {metric} | {c} | {pct:+.1f}% |")
-
-
-# ---------------------------------------------------------------------------
-# CLI
-# ---------------------------------------------------------------------------
-
-
-def parse_args() -> argparse.Namespace:
- parser = argparse.ArgumentParser(
- description=__doc__,
- formatter_class=argparse.RawDescriptionHelpFormatter,
- )
- parser.add_argument(
- "--results",
- type=str,
- nargs="+",
- action="append",
- required=True,
- metavar="FILE",
- help="JSON result file(s) for one run. Repeat --results for multiple runs to compare.",
- )
- parser.add_argument(
- "--labels",
- type=str,
- nargs="+",
- default=None,
- help="Label for each --results group (must match the number of --results groups).",
- )
- parser.add_argument("--output", type=str, default="results/tts_benchmark.png", help="Output image path.")
- parser.add_argument("--title", type=str, default="TTS", help="Title prefix for the plot.")
- parser.add_argument("--task", type=str, default=None, help="Filter to a single task (e.g. voice_clone).")
- return parser.parse_args()
-
-
-def main() -> None:
- args = parse_args()
-
- # args.results is a list-of-lists due to action="append"
- all_runs: list[list[dict]] = [load_run(group) for group in args.results]
- n_runs = len(all_runs)
-
- labels: list[str]
- if args.labels:
- if len(args.labels) != n_runs:
- raise SystemExit(f"--labels count ({len(args.labels)}) must match --results groups ({n_runs})")
- labels = args.labels
- else:
- labels = [f"run{i + 1}" for i in range(n_runs)]
-
- print_comparison_table(all_runs, labels)
- plot_comparison(all_runs, labels, args.output, task_filter=args.task, title_prefix=args.title)
-
-
-if __name__ == "__main__":
- main()
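Review note: the two-run "Improvement" table in the removed plotting script computes its change as `(v1 - v0) / v1 * 100`, that is, relative to the second run. The sketch below isolates that arithmetic so the sign convention is easy to check. `percent_change` is a hypothetical name; the NaN and `v1 > 0` guards follow the deleted code.

```python
import math


def percent_change(v0: float, v1: float):
    """Change of run 0 relative to run 1, in percent, or None when undefined."""
    if math.isnan(v0) or math.isnan(v1) or v1 <= 0:
        return None
    return (v1 - v0) / v1 * 100


# A latency that drops from 100 ms (second run) to 80 ms (first run)
# is reported as a positive change under this convention.
latency_change = percent_change(80.0, 100.0)
```

Under this convention a positive value means the first run's metric is lower than the second's. That reads as an improvement for latency-style metrics (TTFP, E2E, RTF) but as a regression for throughput, where higher is better.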
diff --git a/collect_env.py b/collect_env.py
deleted file mode 100644
index 8b09379e1a3..00000000000
--- a/collect_env.py
+++ /dev/null
@@ -1,760 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-# ruff: noqa
-# code borrowed from https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py
-
-import datetime
-import locale
-import os
-import subprocess
-import sys
-
-# Unlike the rest of the PyTorch this file must be python2 compliant.
-# This script outputs relevant system environment info
-# Run it with `python collect_env.py` or `python -m torch.utils.collect_env`
-from collections import namedtuple
-
-import regex as re
-
-from vllm.envs import environment_variables
-
-try:
- import torch
-
- TORCH_AVAILABLE = True
-except (ImportError, NameError, AttributeError, OSError):
- TORCH_AVAILABLE = False
-
-# System Environment Information
-SystemEnv = namedtuple(
- "SystemEnv",
- [
- "torch_version",
- "is_debug_build",
- "cuda_compiled_version",
- "gcc_version",
- "clang_version",
- "cmake_version",
- "os",
- "libc_version",
- "python_version",
- "python_platform",
- "is_cuda_available",
- "cuda_runtime_version",
- "cuda_module_loading",
- "nvidia_driver_version",
- "nvidia_gpu_models",
- "cudnn_version",
- "pip_version", # 'pip' or 'pip3'
- "pip_packages",
- "conda_packages",
- "hip_compiled_version",
- "hip_runtime_version",
- "miopen_runtime_version",
- "caching_allocator_config",
- "is_xnnpack_available",
- "cpu_info",
- "rocm_version", # vllm specific field
- "vllm_version", # vllm specific field
- "vllm_omni_version", # vllm-omni specific field
- "vllm_build_flags", # vllm specific field
- "gpu_topo", # vllm specific field
- "env_vars",
- ],
-)
-
-DEFAULT_CONDA_PATTERNS = {
- "torch",
- "numpy",
- "cudatoolkit",
- "soumith",
- "mkl",
- "magma",
- "triton",
- "optree",
- "nccl",
- "transformers",
- "zmq",
- "nvidia",
- "pynvml",
- "flashinfer-python",
-}
-
-DEFAULT_PIP_PATTERNS = {
- "torch",
- "numpy",
- "mypy",
- "flake8",
- "triton",
- "optree",
- "onnx",
- "nccl",
- "transformers",
- "zmq",
- "nvidia",
- "pynvml",
- "flashinfer-python",
-}
-
-
-def run(command):
- """Return (return-code, stdout, stderr)."""
- shell = isinstance(command, str)
- try:
- p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=shell)
- raw_output, raw_err = p.communicate()
- rc = p.returncode
- if get_platform() == "win32":
- enc = "oem"
- else:
- enc = locale.getpreferredencoding()
- output = raw_output.decode(enc)
- if command == "nvidia-smi topo -m":
- # don't remove the leading whitespace of `nvidia-smi topo -m`
- # because they are meaningful
- output = output.rstrip()
- else:
- output = output.strip()
- err = raw_err.decode(enc)
- return rc, output, err.strip()
-
- except FileNotFoundError:
- cmd_str = command if isinstance(command, str) else command[0]
- return 127, "", f"Command not found: {cmd_str}"
-
-
-def run_and_read_all(run_lambda, command):
- """Run command using run_lambda; reads and returns entire output if rc is 0."""
- rc, out, _ = run_lambda(command)
- if rc != 0:
- return None
- return out
-
-
-def run_and_parse_first_match(run_lambda, command, regex):
- """Run command using run_lambda, returns the first regex match if it exists."""
- rc, out, _ = run_lambda(command)
- if rc != 0:
- return None
- match = re.search(regex, out)
- if match is None:
- return None
- return match.group(1)
-
-
-def run_and_return_first_line(run_lambda, command):
- """Run command using run_lambda and returns first line if output is not empty."""
- rc, out, _ = run_lambda(command)
- if rc != 0:
- return None
- return out.split("\n")[0]
-
-
-def get_conda_packages(run_lambda, patterns=None):
- if patterns is None:
- patterns = DEFAULT_CONDA_PATTERNS
- conda = os.environ.get("CONDA_EXE", "conda")
- out = run_and_read_all(run_lambda, [conda, "list"])
- if out is None:
- return out
-
- return "\n".join(
- line for line in out.splitlines() if not line.startswith("#") and any(name in line for name in patterns)
- )
-
-
-def get_gcc_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "gcc --version", r"gcc (.*)")
-
-
-def get_clang_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "clang --version", r"clang version (.*)")
-
-
-def get_cmake_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "cmake --version", r"cmake (.*)")
-
-
-def get_nvidia_driver_version(run_lambda):
- if get_platform() == "darwin":
- cmd = "kextstat | grep -i cuda"
- return run_and_parse_first_match(run_lambda, cmd, r"com[.]nvidia[.]CUDA [(](.*?)[)]")
- smi = get_nvidia_smi()
- return run_and_parse_first_match(run_lambda, smi, r"Driver Version: (.*?) ")
-
-
-def get_gpu_info(run_lambda):
- if get_platform() == "darwin" or (
- TORCH_AVAILABLE and hasattr(torch.version, "hip") and torch.version.hip is not None
- ):
- if TORCH_AVAILABLE and torch.cuda.is_available():
- if torch.version.hip is not None:
- prop = torch.cuda.get_device_properties(0)
- if hasattr(prop, "gcnArchName"):
- gcnArch = " ({})".format(prop.gcnArchName)
- else:
- gcnArch = "NoGCNArchNameOnOldPyTorch"
- else:
- gcnArch = ""
- return torch.cuda.get_device_name(None) + gcnArch
- return None
- smi = get_nvidia_smi()
- uuid_regex = re.compile(r" \(UUID: .+?\)")
- rc, out, _ = run_lambda(smi + " -L")
- if rc != 0:
- return None
- # Anonymize GPUs by removing their UUID
- return re.sub(uuid_regex, "", out)
-
-
-def get_running_cuda_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "nvcc --version", r"release .+ V(.*)")
-
-
-def get_cudnn_version(run_lambda):
- """Return a list of libcudnn.so; it's hard to tell which one is being used."""
- if get_platform() == "win32":
- system_root = os.environ.get("SYSTEMROOT", "C:\\Windows")
- cuda_path = os.environ.get("CUDA_PATH", "%CUDA_PATH%")
- where_cmd = os.path.join(system_root, "System32", "where")
- cudnn_cmd = '{} /R "{}\\bin" cudnn*.dll'.format(where_cmd, cuda_path)
- elif get_platform() == "darwin":
- # CUDA libraries and drivers can be found in /usr/local/cuda/. See
- # https://docs.nvidia.com/cuda/cuda-installation-guide-mac-os-x/index.html#install
- # https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installmac
- # Use CUDNN_LIBRARY when cudnn library is installed elsewhere.
- cudnn_cmd = "ls /usr/local/cuda/lib/libcudnn*"
- else:
- cudnn_cmd = 'ldconfig -p | grep libcudnn | rev | cut -d" " -f1 | rev'
- rc, out, _ = run_lambda(cudnn_cmd)
- # find will return 1 if there are permission errors or if not found
- if len(out) == 0 or (rc != 1 and rc != 0):
- l = os.environ.get("CUDNN_LIBRARY")
- if l is not None and os.path.isfile(l):
- return os.path.realpath(l)
- return None
- files_set = set()
- for fn in out.split("\n"):
- fn = os.path.realpath(fn) # eliminate symbolic links
- if os.path.isfile(fn):
- files_set.add(fn)
- if not files_set:
- return None
- # Alphabetize the result because the order is non-deterministic otherwise
- files = sorted(files_set)
- if len(files) == 1:
- return files[0]
- result = "\n".join(files)
- return "Probably one of the following:\n{}".format(result)
-
-
-def get_nvidia_smi():
- # Note: nvidia-smi is currently available only on Windows and Linux
- smi = "nvidia-smi"
- if get_platform() == "win32":
- system_root = os.environ.get("SYSTEMROOT", "C:\\Windows")
- program_files_root = os.environ.get("PROGRAMFILES", "C:\\Program Files")
- legacy_path = os.path.join(program_files_root, "NVIDIA Corporation", "NVSMI", smi)
- new_path = os.path.join(system_root, "System32", smi)
- smis = [new_path, legacy_path]
- for candidate_smi in smis:
- if os.path.exists(candidate_smi):
- smi = '"{}"'.format(candidate_smi)
- break
- return smi
-
-
-def get_rocm_version(run_lambda):
- """Return the ROCm version if available, otherwise None."""
- return run_and_parse_first_match(run_lambda, "hipcc --version", r"HIP version: (\S+)")
-
-
-def get_vllm_version():
- from vllm import __version__, __version_tuple__
-
- if __version__ == "dev":
- return "N/A (dev)"
- version_str = __version_tuple__[-1]
- if isinstance(version_str, str) and version_str.startswith("g"):
- # it's a dev build
- if "." in version_str:
- # it's a dev build containing local changes
- git_sha = version_str.split(".")[0][1:]
- date = version_str.split(".")[-1][1:]
- return f"{__version__} (git sha: {git_sha}, date: {date})"
- else:
- # it's a dev build without local changes
- git_sha = version_str[1:] # type: ignore
- return f"{__version__} (git sha: {git_sha})"
- return __version__
-
-
-def get_vllm_omni_version(run_lambda):
- try:
- import vllm_omni
- from vllm_omni import __version__, __version_tuple__
-
- version_str = __version_tuple__[-1]
- if isinstance(version_str, str) and version_str.startswith("g"):
- if "." in version_str:
- git_sha = version_str.split(".")[0][1:]
- date = version_str.split(".")[-1][1:]
- return f"{__version__} (git sha: {git_sha}, date: {date})"
- else:
- git_sha = version_str[1:]
- return f"{__version__} (git sha: {git_sha})"
-
- package_dir = os.path.dirname(os.path.abspath(vllm_omni.__file__))
- git_sha = run_and_read_all(run_lambda, f"git -C {package_dir} rev-parse --short HEAD")
- if git_sha:
- return f"{__version__} (git sha: {git_sha})"
-
- return __version__
- except ImportError:
- return "N/A (vllm_omni not installed)"
-
-
-def summarize_vllm_build_flags():
- # This could be a static method if the flags are constant, or dynamic if you need to check environment variables, etc.
- return "CUDA Archs: {}; ROCm: {}".format(
- os.environ.get("TORCH_CUDA_ARCH_LIST", "Not Set"),
- "Enabled" if os.environ.get("ROCM_HOME") else "Disabled",
- )
-
-
-def get_gpu_topo(run_lambda):
- output = None
-
- if get_platform() == "linux":
- output = run_and_read_all(run_lambda, "nvidia-smi topo -m")
- if output is None:
- output = run_and_read_all(run_lambda, "rocm-smi --showtopo")
-
- return output
-
-
-def get_cpu_info(run_lambda):
- rc, out, err = 0, "", ""
- if get_platform() == "linux":
- rc, out, err = run_lambda("lscpu")
- elif get_platform() == "win32":
- rc, out, err = run_lambda(
- "wmic cpu get Name,Manufacturer,Family,Architecture,ProcessorType,DeviceID, \
- CurrentClockSpeed,MaxClockSpeed,L2CacheSize,L2CacheSpeed,Revision /VALUE"
- )
- elif get_platform() == "darwin":
- rc, out, err = run_lambda("sysctl -n machdep.cpu.brand_string")
- cpu_info = "None"
- if rc == 0:
- cpu_info = out
- else:
- cpu_info = err
- return cpu_info
-
-
-def get_platform():
- if sys.platform.startswith("linux"):
- return "linux"
- elif sys.platform.startswith("win32"):
- return "win32"
- elif sys.platform.startswith("cygwin"):
- return "cygwin"
- elif sys.platform.startswith("darwin"):
- return "darwin"
- else:
- return sys.platform
-
-
-def get_mac_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "sw_vers -productVersion", r"(.*)")
-
-
-def get_windows_version(run_lambda):
- system_root = os.environ.get("SYSTEMROOT", "C:\\Windows")
- wmic_cmd = os.path.join(system_root, "System32", "Wbem", "wmic")
- findstr_cmd = os.path.join(system_root, "System32", "findstr")
- return run_and_read_all(run_lambda, "{} os get Caption | {} /v Caption".format(wmic_cmd, findstr_cmd))
-
-
-def get_lsb_version(run_lambda):
- return run_and_parse_first_match(run_lambda, "lsb_release -a", r"Description:\t(.*)")
-
-
-def check_release_file(run_lambda):
- return run_and_parse_first_match(run_lambda, "cat /etc/*-release", r'PRETTY_NAME="(.*)"')
-
-
-def get_os(run_lambda):
- from platform import machine
-
- platform = get_platform()
-
- if platform == "win32" or platform == "cygwin":
- return get_windows_version(run_lambda)
-
- if platform == "darwin":
- version = get_mac_version(run_lambda)
- if version is None:
- return None
- return "macOS {} ({})".format(version, machine())
-
- if platform == "linux":
- # Ubuntu/Debian based
- desc = get_lsb_version(run_lambda)
- if desc is not None:
- return "{} ({})".format(desc, machine())
-
- # Try reading /etc/*-release
- desc = check_release_file(run_lambda)
- if desc is not None:
- return "{} ({})".format(desc, machine())
-
- return "{} ({})".format(platform, machine())
-
- # Unknown platform
- return platform
-
-
-def get_python_platform():
- import platform
-
- return platform.platform()
-
-
-def get_libc_version():
- import platform
-
- if get_platform() != "linux":
- return "N/A"
- return "-".join(platform.libc_ver())
-
-
-def is_uv_venv():
- if os.environ.get("UV"):
- return True
- pyvenv_cfg_path = os.path.join(sys.prefix, "pyvenv.cfg")
- if os.path.exists(pyvenv_cfg_path):
- with open(pyvenv_cfg_path, "r") as f:
- return any(line.startswith("uv = ") for line in f)
- return False
-
-
-def get_pip_packages(run_lambda, patterns=None):
- """Return `pip list` output. Note: will also find conda-installed pytorch and numpy packages."""
- if patterns is None:
- patterns = DEFAULT_PIP_PATTERNS
-
- def run_with_pip():
- try:
- import importlib.util
-
- pip_spec = importlib.util.find_spec("pip")
- pip_available = pip_spec is not None
- except ImportError:
- pip_available = False
-
- if pip_available:
- cmd = [sys.executable, "-mpip", "list", "--format=freeze"]
- elif is_uv_venv():
- print("uv is set")
- cmd = ["uv", "pip", "list", "--format=freeze"]
- else:
- raise RuntimeError("Could not collect pip list output (pip or uv module not available)")
-
- out = run_and_read_all(run_lambda, cmd)
- return None if out is None else "\n".join(line for line in out.splitlines() if any(name in line for name in patterns))
-
- pip_version = "pip3" if sys.version[0] == "3" else "pip"
- out = run_with_pip()
- return pip_version, out
-
-
-def get_cachingallocator_config():
- ca_config = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
- return ca_config
-
-
-def get_cuda_module_loading_config():
- if TORCH_AVAILABLE and torch.cuda.is_available():
- torch.cuda.init()
- config = os.environ.get("CUDA_MODULE_LOADING", "")
- return config
- else:
- return "N/A"
-
-
-def is_xnnpack_available():
- if TORCH_AVAILABLE:
- import torch.backends.xnnpack
-
- return str(torch.backends.xnnpack.enabled) # type: ignore[attr-defined]
- else:
- return "N/A"
-
-
-def get_env_vars():
- env_vars = ""
- secret_terms = ("secret", "token", "api", "access", "password")
- report_prefix = (
- "TORCH",
- "NCCL",
- "PYTORCH",
- "CUDA",
- "CUBLAS",
- "CUDNN",
- "OMP_",
- "MKL_",
- "NVIDIA",
- )
- for k, v in os.environ.items():
- if any(term in k.lower() for term in secret_terms):
- continue
- if k in environment_variables:
- env_vars = env_vars + "{}={}".format(k, v) + "\n"
- elif k.startswith(report_prefix):
- env_vars = env_vars + "{}={}".format(k, v) + "\n"
-
- return env_vars
-
-
-def get_env_info():
- run_lambda = run
- pip_version, pip_list_output = get_pip_packages(run_lambda)
-
- if TORCH_AVAILABLE:
- version_str = torch.__version__
- debug_mode_str = str(torch.version.debug)
- cuda_available_str = str(torch.cuda.is_available())
- cuda_version_str = torch.version.cuda
- if not hasattr(torch.version, "hip") or torch.version.hip is None: # cuda version
- hip_compiled_version = hip_runtime_version = miopen_runtime_version = "N/A"
- else: # HIP version
-
- def get_version_or_na(cfg, prefix):
- _lst = [s.rsplit(None, 1)[-1] for s in cfg if prefix in s]
- return _lst[0] if _lst else "N/A"
-
- cfg = torch._C._show_config().split("\n")
- hip_runtime_version = get_version_or_na(cfg, "HIP Runtime")
- miopen_runtime_version = get_version_or_na(cfg, "MIOpen")
- cuda_version_str = "N/A"
- hip_compiled_version = torch.version.hip
- else:
- version_str = debug_mode_str = cuda_available_str = cuda_version_str = "N/A"
- hip_compiled_version = hip_runtime_version = miopen_runtime_version = "N/A"
-
- sys_version = sys.version.replace("\n", " ")
-
- conda_packages = get_conda_packages(run_lambda)
-
- rocm_version = get_rocm_version(run_lambda)
- vllm_version = get_vllm_version()
- vllm_omni_version = get_vllm_omni_version(run_lambda)
- vllm_build_flags = summarize_vllm_build_flags()
- gpu_topo = get_gpu_topo(run_lambda)
-
- return SystemEnv(
- torch_version=version_str,
- is_debug_build=debug_mode_str,
- python_version="{} ({}-bit runtime)".format(sys_version, sys.maxsize.bit_length() + 1),
- python_platform=get_python_platform(),
- is_cuda_available=cuda_available_str,
- cuda_compiled_version=cuda_version_str,
- cuda_runtime_version=get_running_cuda_version(run_lambda),
- cuda_module_loading=get_cuda_module_loading_config(),
- nvidia_gpu_models=get_gpu_info(run_lambda),
- nvidia_driver_version=get_nvidia_driver_version(run_lambda),
- cudnn_version=get_cudnn_version(run_lambda),
- hip_compiled_version=hip_compiled_version,
- hip_runtime_version=hip_runtime_version,
- miopen_runtime_version=miopen_runtime_version,
- pip_version=pip_version,
- pip_packages=pip_list_output,
- conda_packages=conda_packages,
- os=get_os(run_lambda),
- libc_version=get_libc_version(),
- gcc_version=get_gcc_version(run_lambda),
- clang_version=get_clang_version(run_lambda),
- cmake_version=get_cmake_version(run_lambda),
- caching_allocator_config=get_cachingallocator_config(),
- is_xnnpack_available=is_xnnpack_available(),
- cpu_info=get_cpu_info(run_lambda),
- rocm_version=rocm_version,
- vllm_version=vllm_version,
- vllm_omni_version=vllm_omni_version,
- vllm_build_flags=vllm_build_flags,
- gpu_topo=gpu_topo,
- env_vars=get_env_vars(),
- )
-
-
-env_info_fmt = """
-==============================
- System Info
-==============================
-OS : {os}
-GCC version : {gcc_version}
-Clang version : {clang_version}
-CMake version : {cmake_version}
-Libc version : {libc_version}
-
-==============================
- PyTorch Info
-==============================
-PyTorch version : {torch_version}
-Is debug build : {is_debug_build}
-CUDA used to build PyTorch : {cuda_compiled_version}
-ROCM used to build PyTorch : {hip_compiled_version}
-
-==============================
- Python Environment
-==============================
-Python version : {python_version}
-Python platform : {python_platform}
-
-==============================
- CUDA / GPU Info
-==============================
-Is CUDA available : {is_cuda_available}
-CUDA runtime version : {cuda_runtime_version}
-CUDA_MODULE_LOADING set to : {cuda_module_loading}
-GPU models and configuration : {nvidia_gpu_models}
-Nvidia driver version : {nvidia_driver_version}
-cuDNN version : {cudnn_version}
-HIP runtime version : {hip_runtime_version}
-MIOpen runtime version : {miopen_runtime_version}
-Is XNNPACK available : {is_xnnpack_available}
-
-==============================
- CPU Info
-==============================
-{cpu_info}
-
-==============================
-Versions of relevant libraries
-==============================
-{pip_packages}
-{conda_packages}
-""".strip()
-
-# both the above code and the following code use `strip()` to
-# remove leading/trailing whitespaces, so we need to add a newline
-# in between to separate the two sections
-env_info_fmt += "\n\n"
-
-env_info_fmt += """
-==============================
- vLLM Info
-==============================
-ROCM Version : {rocm_version}
-vLLM Version : {vllm_version}
-vLLM-Omni Version : {vllm_omni_version}
-vLLM Build Flags:
- {vllm_build_flags}
-GPU Topology:
- {gpu_topo}
-
-==============================
- Environment Variables
-==============================
-{env_vars}
-""".strip()
-
-
-def pretty_str(envinfo):
- def replace_nones(dct, replacement="Could not collect"):
- for key in dct.keys():
- if dct[key] is not None:
- continue
- dct[key] = replacement
- return dct
-
- def replace_bools(dct, true="Yes", false="No"):
- for key in dct.keys():
- if dct[key] is True:
- dct[key] = true
- elif dct[key] is False:
- dct[key] = false
- return dct
-
- def prepend(text, tag="[prepend]"):
- lines = text.split("\n")
- updated_lines = [tag + line for line in lines]
- return "\n".join(updated_lines)
-
- def replace_if_empty(text, replacement="No relevant packages"):
- if text is not None and len(text) == 0:
- return replacement
- return text
-
- def maybe_start_on_next_line(string):
- # If `string` is multiline, prepend a \n to it.
- if string is not None and len(string.split("\n")) > 1:
- return "\n{}\n".format(string)
- return string
-
- mutable_dict = envinfo._asdict()
-
- # If nvidia_gpu_models is multiline, start on the next line
- mutable_dict["nvidia_gpu_models"] = maybe_start_on_next_line(envinfo.nvidia_gpu_models)
-
- # If the machine doesn't have CUDA, report some fields as 'No CUDA'
- dynamic_cuda_fields = [
- "cuda_runtime_version",
- "nvidia_gpu_models",
- "nvidia_driver_version",
- ]
- all_cuda_fields = dynamic_cuda_fields + ["cudnn_version"]
- all_dynamic_cuda_fields_missing = all(mutable_dict[field] is None for field in dynamic_cuda_fields)
- if TORCH_AVAILABLE and not torch.cuda.is_available() and all_dynamic_cuda_fields_missing:
- for field in all_cuda_fields:
- mutable_dict[field] = "No CUDA"
- if envinfo.cuda_compiled_version is None:
- mutable_dict["cuda_compiled_version"] = "None"
-
- # Replace True with Yes, False with No
- mutable_dict = replace_bools(mutable_dict)
-
- # Replace all None objects with 'Could not collect'
- mutable_dict = replace_nones(mutable_dict)
-
- # If either of these is '', replace it with 'No relevant packages'
- mutable_dict["pip_packages"] = replace_if_empty(mutable_dict["pip_packages"])
- mutable_dict["conda_packages"] = replace_if_empty(mutable_dict["conda_packages"])
-
- # Tag conda and pip packages with a prefix
- # If they were previously None, they'll show up as e.g. '[conda] Could not collect'
- if mutable_dict["pip_packages"]:
- mutable_dict["pip_packages"] = prepend(mutable_dict["pip_packages"], "[{}] ".format(envinfo.pip_version))
- if mutable_dict["conda_packages"]:
- mutable_dict["conda_packages"] = prepend(mutable_dict["conda_packages"], "[conda] ")
- mutable_dict["cpu_info"] = envinfo.cpu_info
- return env_info_fmt.format(**mutable_dict)
-
-
-def get_pretty_env_info():
- return pretty_str(get_env_info())
-
-
-def main():
- print("Collecting environment information...")
- output = get_pretty_env_info()
- print(output)
-
- if TORCH_AVAILABLE and hasattr(torch, "utils") and hasattr(torch.utils, "_crash_handler"):
- minidump_dir = torch.utils._crash_handler.DEFAULT_MINIDUMP_DIR
- if sys.platform == "linux" and os.path.exists(minidump_dir):
- dumps = [os.path.join(minidump_dir, dump) for dump in os.listdir(minidump_dir)]
- latest = max(dumps, key=os.path.getctime)
- ctime = os.path.getctime(latest)
- creation_time = datetime.datetime.fromtimestamp(ctime).strftime("%Y-%m-%d %H:%M:%S")
- msg = (
- "\n*** Detected a minidump at {} created on {}, ".format(latest, creation_time)
- + "if this is related to your bug please include it when you file a report ***"
- )
- print(msg, file=sys.stderr)
-
-
-if __name__ == "__main__":
- main()
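Review note on `get_env_vars` in the removed `collect_env.py`: the function skips variables whose names contain a secret-like term, then reports a variable if it is a known vLLM variable or starts with one of the reporting prefixes. The sketch below restates that filter as a standalone function, assuming a hypothetical `collect_env_vars` helper and a `known_names` argument that stands in for `vllm.envs.environment_variables`. It merges the two checks so a name matching both is reported once, and it joins entries with newlines rather than appending a newline to each.

```python
def collect_env_vars(environ, known_names=frozenset()):
    """Return NAME=value lines for reportable, non-secret variables.

    `known_names` is a stand-in (assumption) for vllm.envs.environment_variables.
    """
    secret_terms = ("secret", "token", "api", "access", "password")
    report_prefix = (
        "TORCH", "NCCL", "PYTORCH", "CUDA", "CUBLAS",
        "CUDNN", "OMP_", "MKL_", "NVIDIA",
    )
    lines = []
    for name, value in environ.items():
        # Skip anything whose name hints at a credential.
        if any(term in name.lower() for term in secret_terms):
            continue
        # One combined check, so a variable is never listed twice.
        if name in known_names or name.startswith(report_prefix):
            lines.append("{}={}".format(name, value))
    return "\n".join(lines)


# Illustrative input only; these values are made up for the example.
sample = {
    "CUDA_VISIBLE_DEVICES": "0",
    "OMP_NUM_THREADS": "8",
    "NVIDIA_API_KEY": "xyz",
    "HOME": "/root",
}
report = collect_env_vars(sample, known_names={"CUDA_VISIBLE_DEVICES"})
```

In a real report the first argument would be `os.environ`; passing a plain dict keeps the filter easy to exercise in isolation.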
diff --git a/docker/Dockerfile.ci b/docker/Dockerfile.ci
deleted file mode 100644
index a263c12e2d2..00000000000
--- a/docker/Dockerfile.ci
+++ /dev/null
@@ -1,18 +0,0 @@
-ARG VLLM_BASE_IMAGE=vllm/vllm-openai
-ARG VLLM_BASE_TAG=v0.20.0
-FROM ${VLLM_BASE_IMAGE}:${VLLM_BASE_TAG}
-ARG APP_DIR=/workspace/vllm-omni
-WORKDIR ${APP_DIR}
-COPY . .
-
-# Install system dependencies
-RUN apt-get update && \
- apt-get install -y espeak-ng git jq && \
- apt-get clean && \
- rm -rf /var/lib/apt/lists/*
-
-RUN uv pip install --system ".[dev]"
-
-RUN ln -sf /usr/bin/python3 /usr/bin/python
-
-ENTRYPOINT []
diff --git a/docker/Dockerfile.cuda b/docker/Dockerfile.cuda
deleted file mode 100644
index 78f64f6a5e0..00000000000
--- a/docker/Dockerfile.cuda
+++ /dev/null
@@ -1,22 +0,0 @@
-ARG BASE_IMAGE=vllm/vllm-openai:v0.20.0
-FROM ${BASE_IMAGE}
-
-ARG COMMON_WORKDIR=/app
-
-WORKDIR ${COMMON_WORKDIR}
-
-# Step 1: Setup - Install system dependencies
-RUN apt-get update && \
- apt-get install -y git jq && \
- apt-get clean && \
- rm -rf /var/lib/apt/lists/*
-
-RUN mkdir -p ${COMMON_WORKDIR}/vllm-omni
-
-# Step 2: Copy vllm-omni code and install
-COPY . ${COMMON_WORKDIR}/vllm-omni
-RUN cd ${COMMON_WORKDIR}/vllm-omni && uv pip install --python "$(python3 -c 'import sys; print(sys.executable)')" --no-cache-dir "."
-
-RUN ln -sf /usr/bin/python3 /usr/bin/python
-
-ENTRYPOINT []
diff --git a/docker/Dockerfile.npu b/docker/Dockerfile.npu
deleted file mode 100644
index 2e961b89e65..00000000000
--- a/docker/Dockerfile.npu
+++ /dev/null
@@ -1,31 +0,0 @@
-ARG VLLM_ASCEND_IMAGE=quay.io/ascend/vllm-ascend
-ARG VLLM_ASCEND_TAG=v0.18.0rc1
-FROM ${VLLM_ASCEND_IMAGE}:${VLLM_ASCEND_TAG}
-
-# WORKDIR /vllm-workspace/vllm
-# RUN git fetch origin --tags && git checkout v0.18.0
-
-# WORKDIR /vllm-workspace/vllm-ascend
-# RUN git fetch origin releases/v0.18.0 && git checkout d781902ce9dbda8ab1e11bb0f2f0c1bc508fee7a
-# # Install vllm-ascend
-# # Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
-# RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
-# source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
-# source /usr/local/Ascend/nnal/atb/set_env.sh && \
-# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
-# python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
-# python3 -m pip cache purge
-
-ARG APP_DIR=/vllm-workspace/vllm-omni
-WORKDIR ${APP_DIR}
-COPY . .
-
-RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
- source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
- source /usr/local/Ascend/nnal/atb/set_env.sh && \
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
- python3 -m pip install -v -e /vllm-workspace/vllm-omni/ --no-build-isolation
-
-ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
-
-ENTRYPOINT []
diff --git a/docker/Dockerfile.npu.a3 b/docker/Dockerfile.npu.a3
deleted file mode 100644
index e3781fc18f8..00000000000
--- a/docker/Dockerfile.npu.a3
+++ /dev/null
@@ -1,31 +0,0 @@
-ARG VLLM_ASCEND_IMAGE=quay.io/ascend/vllm-ascend
-ARG VLLM_ASCEND_TAG=v0.18.0rc1-a3
-FROM ${VLLM_ASCEND_IMAGE}:${VLLM_ASCEND_TAG}
-
-# WORKDIR /vllm-workspace/vllm
-# RUN git fetch origin --tags && git checkout v0.18.0
-
-# WORKDIR /vllm-workspace/vllm-ascend
-# RUN git fetch origin releases/v0.18.0 && git checkout d781902ce9dbda8ab1e11bb0f2f0c1bc508fee7a
-# # Install vllm-ascend
-# # Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
-# RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
-# source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
-# source /usr/local/Ascend/nnal/atb/set_env.sh && \
-# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
-# python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
-# python3 -m pip cache purge
-
-ARG APP_DIR=/vllm-workspace/vllm-omni
-WORKDIR ${APP_DIR}
-COPY . .
-
-RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
- source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
- source /usr/local/Ascend/nnal/atb/set_env.sh && \
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
- python3 -m pip install -v -e /vllm-workspace/vllm-omni/ --no-build-isolation
-
-ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
-
-ENTRYPOINT []
diff --git a/docker/Dockerfile.rocm b/docker/Dockerfile.rocm
deleted file mode 100644
index ab95077fc7b..00000000000
--- a/docker/Dockerfile.rocm
+++ /dev/null
@@ -1,73 +0,0 @@
-ARG BASE_IMAGE=vllm/vllm-openai-rocm:v0.20.0
-FROM ${BASE_IMAGE} AS base
-
-# Declare a variable that selects between the nightly build and the stable build.
-# NOTE: Reminder to the vLLM-Omni rebase maintainer:
-# Remember to set `USE_NIGHTLY_BUILD` to 0 when switching back to the stable vLLM docker image.
-# 1. If the vLLM-Omni maintainer is forced to use custom commits
-# during rebasing, they can change the two variables below.
-# 2. Whenever vLLM upstream releases a stable version,
-# we should switch to the stable release ASAP.
-# We should avoid relying on custom commits.
-ARG USE_NIGHTLY_BUILD=0
-ARG VLLM_VERSION_OR_COMMIT_HASH=89138b21cc246ae944c741d5c399c148e2b770ab
-ARG ARG_PYTORCH_ROCM_ARCH
-ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH:-${PYTORCH_ROCM_ARCH}}
-
-ARG COMMON_WORKDIR=/app
-WORKDIR ${COMMON_WORKDIR}
-
-# Step 1: Setup - Install system dependencies
-# Need to include ffmpeg because vllm rocm upstream docker image
-# does not include it.
-RUN apt-get update && \
- apt-get install -y espeak-ng ffmpeg git jq && \
- apt-get clean && \
- rm -rf /var/lib/apt/lists/*
-
-# Step 2: Conditionally reinstall vllm from source for nightly builds
-RUN if [ "${USE_NIGHTLY_BUILD}" = "1" ]; then \
- python3 -m pip uninstall -y vllm && rm -rf vllm && \
- git clone https://github.com/vllm-project/vllm.git && \
- cd vllm && \
- git checkout ${VLLM_VERSION_OR_COMMIT_HASH} && \
- python3 -m pip install -r requirements/rocm.txt && \
- python3 setup.py clean --all && \
- PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} python3 setup.py develop && \
- cd ../ && \
- rm -rf vllm/.git; \
- fi
-
-# Step 3: Copy vllm-omni code and install
-RUN mkdir -p ${COMMON_WORKDIR}/vllm-omni
-COPY . ${COMMON_WORKDIR}/vllm-omni
-
-# This is a workaround to ensure pytest exits with the correct status code in CI tests.
-RUN printf '%s\n' \
- 'import os' \
- '' \
- '_exit_code = 1' \
- '' \
- 'def pytest_sessionfinish(session, exitstatus):' \
- ' global _exit_code' \
- ' _exit_code = int(exitstatus)' \
- '' \
- 'def pytest_unconfigure(config):' \
- ' import sys' \
- ' sys.stdout.flush()' \
- ' sys.stderr.flush()' \
- ' os._exit(_exit_code)' \
- > ${COMMON_WORKDIR}/vllm-omni/conftest.py
-
-RUN cd ${COMMON_WORKDIR}/vllm-omni && uv pip install --python "$(python3 -c 'import sys; print(sys.executable)')" --no-cache-dir ".[dev]" --no-build-isolation
-
-RUN ln -sf /usr/bin/python3 /usr/bin/python
-
-FROM base AS test
-
-CMD ["/bin/bash"]
-ENTRYPOINT []
-
-# Set entrypoint for vllm-openai official images
-FROM base AS vllm-openai
-ENTRYPOINT ["vllm", "serve", "--omni"]
diff --git a/docker/Dockerfile.xpu b/docker/Dockerfile.xpu
deleted file mode 100644
index f015059ed88..00000000000
--- a/docker/Dockerfile.xpu
+++ /dev/null
@@ -1,150 +0,0 @@
-# Argument to configure vllm base image if pre-built
-ARG VLLM_BASE=vllm-base
-
-FROM intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04 AS vllm-base
-
-WORKDIR /workspace/
-
-ARG PYTHON_VERSION=3.12
-ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/xpu"
-
-RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
- echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list
-
-RUN apt clean && apt-get update -y && \
- apt-get install -y --no-install-recommends --fix-missing \
- curl \
- espeak-ng \
- git \
- libsm6 \
- libxext6 \
- libgl1 \
- lsb-release \
- libaio-dev \
- numactl \
- wget \
- vim \
- python3.12 \
- python3.12-dev \
- python3-pip
-
-RUN apt update && apt upgrade -y && \
- apt install -y intel-oneapi-compiler-dpcpp-cpp-2025.3
-
-# Install UMD with fixed version
-RUN mkdir neo && \
- cd neo && \
- wget https://github.com/intel/intel-graphics-compiler/releases/download/v2.24.8/intel-igc-core-2_2.24.8+20344_amd64.deb && \
- wget https://github.com/intel/intel-graphics-compiler/releases/download/v2.24.8/intel-igc-opencl-2_2.24.8+20344_amd64.deb && \
- wget https://github.com/intel/compute-runtime/releases/download/25.48.36300.8/intel-ocloc_25.48.36300.8-0_amd64.deb && \
- wget https://github.com/intel/compute-runtime/releases/download/25.48.36300.8/intel-opencl-icd_25.48.36300.8-0_amd64.deb && \
- wget https://github.com/intel/compute-runtime/releases/download/25.48.36300.8/libigdgmm12_22.8.2_amd64.deb && \
- wget https://github.com/intel/compute-runtime/releases/download/25.48.36300.8/libze-intel-gpu1_25.48.36300.8-0_amd64.deb && \
- wget https://github.com/oneapi-src/level-zero/releases/download/v1.26.0/level-zero_1.26.0+u24.04_amd64.deb && \
- dpkg -i *.deb && \
- cd .. && \
- rm -rf neo
-
-ENV PATH="/root/.local/bin:$PATH"
-ENV VIRTUAL_ENV="/opt/venv"
-ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python
-RUN curl -LsSf https://astral.sh/uv/install.sh | sh
-RUN uv venv --python ${PYTHON_VERSION} --seed ${VIRTUAL_ENV}
-ENV PATH="$VIRTUAL_ENV/bin:$PATH"
-
-# This oneCCL release includes BMG support, which the default version in oneAPI 2025.2 does not.
-ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.8_offline.sh"
-RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \
- bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \
- rm "${ONECCL_INSTALLER}" && \
- echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc && \
- echo "source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force" >> /root/.bashrc
-RUN rm -f /opt/intel/oneapi/ccl/latest && \
- ln -s /opt/intel/oneapi/ccl/2021.15 /opt/intel/oneapi/ccl/latest
-
-SHELL ["bash", "-c"]
-CMD ["bash", "-c", "source /root/.bashrc && exec bash"]
-
-WORKDIR /workspace/
-ENV UV_HTTP_TIMEOUT=500
-
-# Configure package index for XPU
-ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
-ENV UV_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
-ENV UV_INDEX_STRATEGY="unsafe-best-match"
-ENV UV_LINK_MODE="copy"
-
-ARG VLLM_VERSION=v0.20.0
-RUN git clone -b ${VLLM_VERSION} https://github.com/vllm-project/vllm
-WORKDIR /workspace/vllm
-
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip install --upgrade pip && \
- uv pip install -r requirements/xpu.txt
-
- # used for suffix method speculative decoding
- # build deps for proto + nanobind-based extensions to set up the build environment
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip install grpcio-tools protobuf nanobind
- # arctic-inference is built from source which needs torch-xpu properly installed first
-RUN --mount=type=cache,target=/root/.cache/uv \
- source /opt/intel/oneapi/setvars.sh --force && \
- source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force && \
- export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH}" && \
- uv pip install --no-build-isolation arctic-inference==0.1.1
-
-ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
-
-ENV VLLM_TARGET_DEVICE=xpu
-ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
-
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip install --no-build-isolation .
-
-CMD ["/bin/bash"]
-
-FROM vllm-base AS vllm-openai
-
-# install additional dependencies for openai api server
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip install accelerate hf_transfer pytest pytest_asyncio lm_eval[api] modelscope
-
-# install development dependencies (for testing)
-RUN uv pip install -e tests/vllm_test_utils
-
-# install nixl from source code
-ENV NIXL_VERSION=0.7.0
-RUN python /workspace/vllm/tools/install_nixl_from_source_ubuntu.py
-
-# ensure vllm is properly installed
-RUN python -c "import vllm, inspect; print(vllm.__file__)"
-RUN uv pip show vllm
-
-CMD ["/bin/bash"]
-
-ENTRYPOINT []
-
-FROM ${VLLM_BASE} AS vllm-omni
-
-WORKDIR /workspace/vllm-omni
-COPY . .
-
-ENV VLLM_OMNI_TARGET_DEVICE=xpu
-RUN uv pip install --no-cache-dir ".[dev]" --no-build-isolation
-
-# FIX triton
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip uninstall triton triton-xpu && \
- uv pip install triton-xpu==3.6.0
-
-# remove torch bundled oneccl to avoid conflicts
-RUN --mount=type=cache,target=/root/.cache/uv \
- uv pip uninstall oneccl oneccl-devel
-
-FROM vllm-omni AS vllm-omni-openai
-
-RUN ln -sf /usr/bin/python3 /usr/bin/python
-
-ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
-
-ENTRYPOINT ["vllm", "serve", "--omni"]
diff --git a/docs/.nav.yml b/docs/.nav.yml
deleted file mode 100644
index 2604f1e985d..00000000000
--- a/docs/.nav.yml
+++ /dev/null
@@ -1,134 +0,0 @@
-nav:
-- Home: README.md
-- User Guide:
- - Getting Started:
- - getting_started/quickstart.md
- - getting_started/installation/*
- - Serving:
- - OpenAI-Compatible API:
- - Diffusion Chat API: serving/diffusion_chat_api.md
- - Image Generation: serving/image_generation_api.md
- - Image Edit: serving/image_edit_api.md
- - Text to Speech: serving/speech_api.md
- - Audio Generation: serving/audio_generate_api.md
- - Streaming Video Input: serving/video_stream_api.md
- - Examples:
- - examples/README.md
- - Offline Inference:
- - BAGEL-7B-MoT: user_guide/examples/offline_inference/bagel.md
- - CosyVoice3: user_guide/examples/offline_inference/cosyvoice3.md
- - Fish Speech S2 Pro: user_guide/examples/offline_inference/fish_speech.md
- - GLM-Image Multistage End-to-End Inference: user_guide/examples/offline_inference/glm_image.md
- - Helios Video Generation: user_guide/examples/offline_inference/helios.md
- - HunyuanImage-3.0 Image-to-Text Inference: user_guide/examples/offline_inference/hunyuan_image3.md
- - Image-To-Image: user_guide/examples/offline_inference/image_to_image.md
- - Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
- - InternVLA-A1: user_guide/examples/offline_inference/internvla_a1.md
- - MammothModa2-Preview: user_guide/examples/offline_inference/mammothmodal2_preview.md
- - MiMo-Audio Offline Inference: user_guide/examples/offline_inference/mimo_audio.md
- - Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
- - Qwen3-Omni: user_guide/examples/offline_inference/qwen3_omni.md
- - Qwen3-TTS: user_guide/examples/offline_inference/qwen3_tts.md
- - Text-To-Audio: user_guide/examples/offline_inference/text_to_audio.md
- - Text-To-Image: user_guide/examples/offline_inference/text_to_image.md
- - Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
- - Voxtral TTS Offline Inference: user_guide/examples/offline_inference/voxtral_tts.md
- - X-To-Video-Audio: user_guide/examples/offline_inference/x_to_video_audio.md
- - Online Serving:
- - BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
- - vLLM-Omni Helm Chart: user_guide/examples/online_serving/chart-helm.md
- - Diffusers Backend Adapter Example: user_guide/examples/online_serving/diffusers_pipeline_adapter.md
- - Fish Speech S2 Pro: user_guide/examples/online_serving/fish_speech.md
- - GLM-Image Online Serving: user_guide/examples/online_serving/glm_image.md
- - Image-To-Image: user_guide/examples/online_serving/image_to_image.md
- - Image-To-Video: user_guide/examples/online_serving/image_to_video.md
- - Online serving Example of vLLM-Omni for MiMo-Audio: user_guide/examples/online_serving/mimo_audio.md
- - Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md
- - Qwen3-Omni: user_guide/examples/online_serving/qwen3_omni.md
- - Qwen3-TTS: user_guide/examples/online_serving/qwen3_tts.md
- - Text-To-Image: user_guide/examples/online_serving/text_to_image.md
- - Text-To-Video: user_guide/examples/online_serving/text_to_video.md
- - General:
- - usage/*
- - Configuration:
- - configuration/README.md
- - configuration/*
- - Models:
- - models/supported_models.md
- - Features:
- - Sleep Mode: features/sleep_mode.md
- - Diffusion Features:
- - Overview: user_guide/diffusion_features.md
- - Feature Compatibility: user_guide/feature_compatibility.md
- - Cache Acceleration:
- - TeaCache: user_guide/diffusion/cache_acceleration/teacache.md
- - Cache-DiT: user_guide/diffusion/cache_acceleration/cache_dit.md
- - Attention Backends: user_guide/diffusion/attention_backends.md
- - Frame Interpolation: user_guide/diffusion/frame_interpolation.md
- - Parallelism:
- - Overview: user_guide/diffusion/parallelism/overview.md
- - CFG Parallel: user_guide/diffusion/parallelism/cfg_parallel.md
- - Expert Parallel: user_guide/diffusion/parallelism/expert_parallel.md
- - Hybrid Sharded Data Parallel: user_guide/diffusion/parallelism/hsdp.md
- - Sequence Parallel: user_guide/diffusion/parallelism/sequence_parallel.md
- - Tensor Parallel: user_guide/diffusion/parallelism/tensor_parallel.md
- - VAE Patch Parallel: user_guide/diffusion/parallelism/vae_patch_parallel.md
- - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- - LoRA: user_guide/diffusion/lora.md
- - Custom Pipeline: features/custom_pipeline.md
- - Step Execution: user_guide/diffusion/step_execution.md
- - Quantization:
- - Overview: user_guide/quantization/overview.md
- - Online Quantization: user_guide/quantization/online.md
- - FP8 W8A8: user_guide/quantization/fp8.md
- - Int8 W8A8: user_guide/quantization/int8.md
- - GGUF: user_guide/quantization/gguf.md
- - AutoRound: user_guide/quantization/autoround.md
- - msModelSlim: user_guide/quantization/msmodelslim.md
- - ComfyUI: features/comfyui.md
-- Developer Guide:
- - General:
- - contributing/README.md
- - glob: contributing/*
- flatten_single_child_sections: true
- - Model Implementation:
- - contributing/model/README.md
- - contributing/model/adding_omni_model.md
- - contributing/model/adding_tts_model.md
- - contributing/model/adding_diffusion_model.md
- - CI: contributing/ci
- - Design Documents:
- - design/index.md
- - design/architecture_overview.md
- - Feature Design:
- - design/feature/disaggregated_inference.md
- - design/feature/ray_based_execution.md
- - design/feature/omni_connectors/
- - design/feature/prefix_caching.md
- - design/feature/cfg_parallel.md
- - design/feature/expert_parallel.md
- - design/feature/sequence_parallel.md
- - design/feature/tensor_parallel.md
- - design/feature/vae_parallel.md
- - design/feature/hsdp.md
- - design/feature/cache_dit.md
- - design/feature/teacache.md
- - design/feature/async_chunk.md
- - design/feature/vae_parallel.md
- - design/feature/diffusion_step_execution.md
- - Module Design:
- - design/module/ar_module.md
- - design/module/dit_module.md
- - design/module/entrypoint_module.md
- - design/module/async_omni_architecture.md
- - Docs Guide: contributing/DOCS_GUIDE.md
-- API Reference:
- - api/README.md
- - api/vllm_omni
-- CLI Reference: cli
-- Community:
- - Governance: community/governance.md
- - community/*
- - Slack: https://slack.vllm.ai
- - Blog: https://blog.vllm.ai
- - Forum: https://discuss.vllm.ai
diff --git a/docs/README.md b/docs/README.md
deleted file mode 100644
index 66fc8ef4668..00000000000
--- a/docs/README.md
+++ /dev/null
@@ -1,64 +0,0 @@
----
-hide:
- - navigation
- - toc
----
-
-# Welcome to vLLM-Omni
-
-
-
-
-
-
-
-
-Easy, fast, and cheap omni-modality model serving for everyone
-
-
-
-## About
-
-[vLLM](https://github.com/vllm-project/vllm) was originally designed to support large language models for text-based autoregressive generation tasks. vLLM-Omni is a framework that extends its support for omni-modality model inference and serving:
-
-- **Omni-modality**: Text, image, video, and audio data processing
-- **Non-autoregressive Architectures**: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models
-- **Heterogeneous outputs**: from traditional text generation to multimodal outputs
-
-
-
-
-
-
-
-
-vLLM-Omni is fast with:
-
-- State-of-the-art AR support by leveraging efficient KV cache management from vLLM
-- Pipelined stage execution overlapping for high throughput performance
-- Full disaggregation based on OmniConnector and dynamic resource allocation across stages
-
-vLLM-Omni is flexible and easy to use with:
-
-- Heterogeneous pipeline abstraction to manage complex model workflows
-- Seamless integration with popular Hugging Face models
-- Tensor, pipeline, data and expert parallelism support for distributed inference
-- Streaming outputs
-- OpenAI-compatible API server
-
-vLLM-Omni seamlessly supports most popular open-source models on Hugging Face, including:
-
-- Omni-modality models (e.g. Qwen2.5-Omni, Qwen3-Omni)
-- Multi-modality generation models (e.g. Qwen-Image)
-
-For more information, check out the following:
-
-- [vllm-omni architecture design and recent roadmaps](https://docs.google.com/presentation/d/1XJWgv79lORl8rbaVvp2d5Sqs6ZEBgAgj/edit?slide=id.p1#slide=id.p1)
-- [vllm-omni announcement blogpost](https://blog.vllm.ai/2025/11/30/vllm-omni.html)
diff --git a/docs/api/README.md b/docs/api/README.md
deleted file mode 100644
index 0147f19e126..00000000000
--- a/docs/api/README.md
+++ /dev/null
@@ -1,154 +0,0 @@
-# Summary
-
-## Entry Points
-
-Main entry points for vLLM-Omni inference and serving.
-
-- [vllm_omni.entrypoints.async_omni.AsyncOmni][]
-- [vllm_omni.engine.cfg_companion_tracker.CfgCompanionTracker][]
-- [vllm_omni.entrypoints.cli.benchmark.base.OmniBenchmarkSubcommandBase][]
-- [vllm_omni.entrypoints.cli.benchmark.main.OmniBenchmarkSubcommand][]
-- [vllm_omni.entrypoints.cli.benchmark.serve.OmniBenchmarkServingSubcommand][]
-- [vllm_omni.entrypoints.cli.serve.OmniServeCommand][]
-- [vllm_omni.entrypoints.client_request_state.ClientRequestState][]
-- [vllm_omni.entrypoints.omni.Omni][]
-- [vllm_omni.entrypoints.omni_base.OmniBase][]
-- [vllm_omni.entrypoints.pd_utils.PDDisaggregationMixin][]
-
-## Inputs
-
-Input data structures for multi-modal inputs.
-
-- [vllm_omni.inputs.data.OmniCustomPrompt][]
-- [vllm_omni.inputs.data.OmniDiffusionSamplingParams][]
-- [vllm_omni.inputs.data.OmniEmbedsPrompt][]
-- [vllm_omni.inputs.data.OmniTextPrompt][]
-- [vllm_omni.inputs.data.OmniTokenInputs][]
-- [vllm_omni.inputs.data.OmniTokensPrompt][]
-- [vllm_omni.inputs.preprocess.OmniInputPreprocessor][]
-
-## Engine
-
-Engine classes for offline and online inference.
-
-- [vllm_omni.diffusion.diffusion_engine.DiffusionEngine][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.BufferAllocator][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.ManagedBuffer][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.MooncakeAgentMetadata][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.MooncakeTransferEngineConnector][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.QueryRequest][]
-- [vllm_omni.distributed.omni_connectors.connectors.mooncake_transfer_engine_connector.QueryResponse][]
-- [vllm_omni.engine.AdditionalInformationEntry][]
-- [vllm_omni.engine.AdditionalInformationPayload][]
-- [vllm_omni.engine.OmniEngineCoreOutput][]
-- [vllm_omni.engine.OmniEngineCoreOutputs][]
-- [vllm_omni.engine.OmniEngineCoreRequest][]
-- [vllm_omni.engine.PromptEmbedsPayload][]
-- [vllm_omni.engine.arg_utils.OmniEngineArgs][]
-- [vllm_omni.engine.async_omni_engine.AsyncOmniEngine][]
-- [vllm_omni.engine.mm_outputs.MultimodalCompletionOutput][]
-- [vllm_omni.engine.mm_outputs.MultimodalPayload][]
-- [vllm_omni.engine.orchestrator.Orchestrator][]
-- [vllm_omni.engine.orchestrator.OrchestratorRequestState][]
-- [vllm_omni.engine.output_modality.OutputModality][]
-- [vllm_omni.engine.output_modality.TensorAccumulationStrategy][]
-- [vllm_omni.engine.output_processor.MultimodalOutputProcessor][]
-- [vllm_omni.engine.output_processor.OmniRequestState][]
-- [vllm_omni.engine.stage_engine_core_client.StageEngineCoreClient][]
-- [vllm_omni.engine.stage_init_utils.StageMetadata][]
-- [vllm_omni.engine.stage_init_utils.StartedLlmStage][]
-
-## Core
-
-Core scheduling and caching components.
-
-- [vllm_omni.core.sched.omni_ar_scheduler.KVCacheTransferData][]
-- [vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler][]
-- [vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler][]
-- [vllm_omni.core.sched.output.OmniCachedRequestData][]
-- [vllm_omni.core.sched.output.OmniNewRequestData][]
-- [vllm_omni.core.sched.output.OmniSchedulerOutput][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.cfm.BASECFM][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.cfm.CausalConditionalCFM][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.cfm.CausalMaskedDiffWithDiT][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.cfm.ConditionalCFM][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.CausalConv1d][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.CausalConv1dUpsample][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.CausalConvRNNF0Predictor][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.CausalHiFTGenerator][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.HiFTGenerator][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.SineGen][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.SineGen2][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.Snake][]
-- [vllm_omni.model_executor.models.cosyvoice3.code2wav_core.hifigan.SourceModuleHnNSF][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedGroupResidualVectorQuantization][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedResidualVectorQuantization][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.EuclideanCodebook][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.VectorQuantization][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.preprocess][]
-
-## Configuration
-
-Configuration classes.
-
-- [vllm_omni.config.model.OmniModelConfig][]
-- [vllm_omni.config.stage_config.ModelPipeline][]
-- [vllm_omni.config.stage_config.StageConfig][]
-- [vllm_omni.config.stage_config.StageConfigFactory][]
-- [vllm_omni.config.stage_config.StageType][]
-- [vllm_omni.diffusion.cache.teacache.config.TeaCacheConfig][]
-- [vllm_omni.distributed.omni_connectors.utils.config.ConnectorSpec][]
-- [vllm_omni.distributed.omni_connectors.utils.config.OmniTransferConfig][]
-- [vllm_omni.model_executor.models.cosyvoice3.config.CosyVoice3Config][]
-- [vllm_omni.model_executor.models.fish_speech.configuration_fish_speech.FishSpeechConfig][]
-- [vllm_omni.model_executor.models.fish_speech.configuration_fish_speech.FishSpeechFastARConfig][]
-- [vllm_omni.model_executor.models.fish_speech.configuration_fish_speech.FishSpeechSlowARConfig][]
-- [vllm_omni.model_executor.models.mimo_audio.config_mimo_audio.MiMoAudioConfig][]
-- [vllm_omni.model_executor.models.mimo_audio.config_mimo_audio.MiMoAudioTokenizerConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSSpeakerEncoderConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSTalkerCodePredictorConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.configuration_qwen3_tts.Qwen3TTSTalkerConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_12hz.configuration_qwen3_tts_tokenizer_v2.Qwen3TTSTokenizerV2Config][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_12hz.configuration_qwen3_tts_tokenizer_v2.Qwen3TTSTokenizerV2DecoderConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1Config][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderBigVGANConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1DecoderDiTConfig][]
-- [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.configuration_qwen3_tts_tokenizer_v1.Qwen3TTSTokenizerV1EncoderConfig][]
-- [vllm_omni.transformers_utils.configs.mammoth_moda2.Mammothmoda2Config][]
-- [vllm_omni.transformers_utils.configs.mammoth_moda2.Mammothmoda2Qwen2_5_VLConfig][]
-- [vllm_omni.transformers_utils.configs.mammoth_moda2.Mammothmoda2Qwen2_5_VLTextConfig][]
-- [vllm_omni.transformers_utils.configs.mammoth_moda2.Mammothmoda2Qwen2_5_VLVisionConfig][]
-
-## Workers
-
-Worker classes and model runners for distributed inference.
-
-- [vllm_omni.diffusion.worker.diffusion_model_runner.DiffusionModelRunner][]
-- [vllm_omni.diffusion.worker.diffusion_worker.CustomPipelineWorkerExtension][]
-- [vllm_omni.diffusion.worker.diffusion_worker.DiffusionWorker][]
-- [vllm_omni.diffusion.worker.diffusion_worker.WorkerProc][]
-- [vllm_omni.diffusion.worker.diffusion_worker.WorkerWrapperBase][]
-- [vllm_omni.diffusion.worker.utils.DiffusionRequestState][]
-- [vllm_omni.diffusion.worker.utils.RunnerOutput][]
-- [vllm_omni.platforms.npu.worker.npu_ar_model_runner.ExecuteModelState][]
-- [vllm_omni.platforms.npu.worker.npu_ar_model_runner.NPUARModelRunner][]
-- [vllm_omni.platforms.npu.worker.npu_ar_worker.NPUARWorker][]
-- [vllm_omni.platforms.npu.worker.npu_generation_model_runner.NPUGenerationModelRunner][]
-- [vllm_omni.platforms.npu.worker.npu_generation_worker.NPUGenerationWorker][]
-- [vllm_omni.platforms.npu.worker.npu_model_runner.OmniNPUModelRunner][]
-- [vllm_omni.platforms.xpu.worker.xpu_ar_model_runner.XPUARModelRunner][]
-- [vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker][]
-- [vllm_omni.platforms.xpu.worker.xpu_generation_model_runner.XPUGenerationModelRunner][]
-- [vllm_omni.platforms.xpu.worker.xpu_generation_worker.XPUGenerationWorker][]
-- [vllm_omni.worker.base.OmniGPUWorkerBase][]
-- [vllm_omni.worker.gpu_ar_model_runner.ExecuteModelState][]
-- [vllm_omni.worker.gpu_ar_model_runner.GPUARModelRunner][]
-- [vllm_omni.worker.gpu_ar_worker.GPUARWorker][]
-- [vllm_omni.worker.gpu_generation_model_runner.GPUGenerationModelRunner][]
-- [vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker][]
-- [vllm_omni.worker.gpu_memory_utils.parse_cuda_visible_devices][]
-- [vllm_omni.worker.gpu_model_runner.CUDAGraphWrapper][]
-- [vllm_omni.worker.gpu_model_runner.OmniGPUModelRunner][]
-- [vllm_omni.worker.mixins.OmniWorkerMixin][]
diff --git a/docs/assets/WeChat.jpg b/docs/assets/WeChat.jpg
deleted file mode 100644
index 209d0922c6a..00000000000
Binary files a/docs/assets/WeChat.jpg and /dev/null differ
diff --git a/docs/cli/README.md b/docs/cli/README.md
deleted file mode 100644
index 1fcfdb14eac..00000000000
--- a/docs/cli/README.md
+++ /dev/null
@@ -1,42 +0,0 @@
-# vLLM-Omni CLI Guide
-
-The vLLM-Omni CLI extends the vLLM CLI with some additional arguments.
-
-## serve
-
-Starts the vLLM-Omni OpenAI-compatible API server.
-
-Start with a model:
-
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni
-```
-
-Specify the port:
-
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091
-```
-
-If you have a custom stage configs file, launch the server with the command below:
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni --stage-configs-path /path/to/stage_configs_file
-```
-
-
-## bench
-
-Run benchmark tests for online serving throughput.
-Available Commands:
-
-```bash
-vllm bench serve --omni \
- --model Qwen/Qwen2.5-Omni-7B \
- --host server-host \
- --port server-port \
- --random-input-len 32 \
- --random-output-len 4 \
- --num-prompts 5
-```
-
-See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
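As a rough illustration of what a client of the `serve` command above might send, the following Python sketch builds a chat-completions request for a locally started server. It assumes the server listens on `localhost:8091` (the port from the example above) and exposes the OpenAI-style `/v1/chat/completions` endpoint; the helper names `build_chat_request` and `send` are hypothetical and are not part of vLLM-Omni.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen2.5-Omni-7B",
                       host="localhost", port=8091):
    """Build (but do not send) a chat-completions request.

    Assumption: the server was started with
    `vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091`.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON response.

    Requires a running server; urlopen raises URLError otherwise.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_chat_request("Hello"))` would return the decoded response body; the exact shape of that body for omni-modality outputs is not covered by this sketch.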
diff --git a/docs/cli/bench/serve.md b/docs/cli/bench/serve.md
deleted file mode 100644
index f6e43be0945..00000000000
--- a/docs/cli/bench/serve.md
+++ /dev/null
@@ -1,359 +0,0 @@
-# vLLM-Omni Benchmark CLI Guide
-The `vllm bench` command launches the vLLM-Omni benchmark to evaluate the performance of multimodal models.
-
-## Notes
-We currently only support using the "openai-chat-omni" backend.
-
-## Basic Parameter Description
-You can use `vllm bench serve --omni --help=all` to get descriptions of all parameters. The commonly used parameters are described below:
-- `--omni`
- Enable Omni (multimodal) mode, supporting multimodal inputs and outputs such as images, videos, and audio.
-
-- `--backend`
- Specify the backend adapter as openai-chat-omni, using OpenAI Chat compatible API behavior as the protocol. Currently only openai-chat-omni is supported.
-
-- `--model`
- The model identifier to load, filled according to the models supported by vLLM-Omni.
-
-- `--endpoint`
- The API endpoint exposed externally, to which clients send their requests.
-
-- `--dataset-name`
- The name of the dataset used; random-mm indicates generating random multimodal inputs (images, videos, audio).
-
-- `--num-prompts`
- The total number of requests to send, an integer.
-
-- `--max-concurrency`
- Maximum number of concurrent requests. This can be used
- to help simulate an environment where a higher-level component
- is enforcing a maximum number of concurrent requests. While the
- `--request-rate` argument controls the rate at which requests are
- initiated, this argument controls how many are actually allowed
- to execute at a time. This means that when used in combination, the
- actual request rate may be lower than specified with `--request-rate`
- if the server is not processing requests fast enough to keep up.
-
-- `--request-rate`
- Number of requests per second. If this is inf,
- then all the requests are sent at time 0.
- Otherwise, a Poisson process or gamma distribution
- is used to synthesize the request arrival times.
-
-- `--ignore-eos`
- Set the `ignore_eos` flag when sending the benchmark request.
-
-- `--metric-percentiles`
- Comma-separated list of percentiles for selected metrics.
- To report the 25th, 50th, and 75th percentiles, use "25,50,75".
- The default value is "99".
- Use `--percentile-metrics` to select metrics.
-
-- `--percentile-metrics`
- Comma-separated list of metrics for which to report percentiles.
- Allowed metric names are "ttft", "tpot", "itl", "e2el",
- "audio_ttfp", "audio_rtf", and "audio_duration".
-
-- `--save-result`
- Save benchmark results to a JSON file.
-
-- `--save-detailed`
- When saving the results, include per-request
- information such as response, error, ttfs, tpots, etc.
-
-- `--result-dir`
- Directory in which to save benchmark JSON results.
- If not specified, results are saved in the current directory.
-
-- `--result-filename`
- Filename for the benchmark JSON results.
- If not specified, results are saved to
- `{label}-{args.request_rate}qps-{base_model_id}-{current_dt}.json`.
-
-- `--random-prefix-len`
- Number of fixed prefix tokens before the random context in a request.
- The total input length is the sum of random-prefix-len and a random
- context length sampled from [input_len * (1 - range_ratio),
- input_len * (1 + range_ratio)]. Only the random and random-mm modes
- support this parameter.
-
-- `--random-input-len`
- Number of input tokens per request. Only the random and random-mm modes support this parameter.
-
-- `--random-output-len`
- Number of output tokens per request. Only the random and random-mm modes support this parameter.
-
-- `--random-range-ratio`
- Range ratio for sampling input/output length,
- used only for random sampling. Must be in the range [0, 1) to define
- a symmetric sampling range
- [length * (1 - range_ratio), length * (1 + range_ratio)].
- Only the random and random-mm modes support this parameter.
-
-- `--random-mm-base-items-per-request`
- Base number of multimodal items per request for random-mm.
- Actual per-request count is sampled around this base using
- --random-mm-num-mm-items-range-ratio.
- Only the random-mm mode supports this parameter.
-
-- `--random-mm-limit-mm-per-prompt`
- Per-modality hard caps for items attached per request, e.g.
- '{"image": 3, "video": 1, "audio": 1}'. The sampled per-request item
- count is clamped to the sum of these limits. When a modality
- reaches its cap, its buckets are excluded and probabilities are
- renormalized.
- Only the random-mm mode supports this parameter.
-
-- `--random-mm-num-mm-items-range-ratio`
- Range ratio r in [0, 1] for sampling items per request.
- We sample uniformly from the closed integer range
- [floor(n*(1-r)), ceil(n*(1+r))]
- where n is the base items per request.
- r=0 keeps it fixed; r=1 allows 0 items. The maximum is clamped
- to the sum of per-modality limits from
- --random-mm-limit-mm-per-prompt.
- An error is raised if the computed min exceeds the max.
- Only the random-mm mode supports this parameter.
-
-- `--random-mm-bucket-config`
- The bucket config is a dictionary mapping a multimodal item
- sampling configuration to a probability.
- Currently allows for 3 modalities: audio, images and videos.
- A bucket key is a tuple of (height, width, num_frames)
- The value is the probability of sampling that specific item.
- Example:
- --random-mm-bucket-config
- "{(256, 256, 1): 0.5, (720, 1280, 16): 0.4, (0, 1, 5): 0.10}"
- First item: images with resolution 256x256 w.p. 0.5
- Second item: videos with resolution 720x1280 and 16 frames w.p. 0.4
- Third item: audios with 1s duration and 5 channels w.p. 0.1
- Note: if the probabilities do not sum to 1, they are normalized.
- Only the random-mm mode supports this parameter.
-
-## Usage Examples
-
-### Online Benchmark
-
-
-
-First start serving your model:
-
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni
-```
-
-Then run the benchmark with the ShareGPT dataset:
-
-```bash
-# download dataset
-# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-vllm bench serve \
- --omni \
- --port 43845 \
- --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
- --endpoint /v1/chat/completions \
- --backend openai-chat-omni \
- --num-prompts 2 \
- --dataset-name sharegpt \
- --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
- --percentile-metrics ttft,tpot,itl,e2el
-```
-If successful, you will see the following output:
-```text
-============ Serving Benchmark Result ============
-Successful requests: 2
-Failed requests: 0
-Benchmark duration (s): 81.63
-Request throughput (req/s): 0.02
-Peak concurrent requests: 2.00
-----------------End-to-end Latency----------------
-Mean E2EL (ms): 56966.13
-Median E2EL (ms): 56966.13
-P99 E2EL (ms): 81016.80
-================== Text Result ===================
-Total input tokens: 36
-Total generated tokens: 5926
-Output token throughput (tok/s): 72.60
-Peak output token throughput (tok/s): 103.00
-Peak concurrent requests: 2.00
-Total Token throughput (tok/s): 73.04
----------------Time to First Token----------------
-Mean TTFT (ms): 124.76
-Median TTFT (ms): 124.76
-P99 TTFT (ms): 156.10
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 481.30
-Median TPOT (ms): 481.30
-P99 TPOT (ms): 947.55
----------------Inter-token Latency----------------
-Mean ITL (ms): 25.11
-Median ITL (ms): 0.33
-P99 ITL (ms): 25.17
-================== Audio Result ==================
-Total audio duration generated(s): 3.95
-Total audio frames generated: 94890
-Audio throughput(audio duration/s): 0.05
-==================================================
-```
-
-Or run the benchmark with the random dataset:
-
-```bash
-vllm bench serve \
- --omni \
- --port 43845 \
- --endpoint /v1/chat/completions \
- --backend openai-chat-omni \
- --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
- --dataset-name random \
- --num-prompts 2 \
- --random-prefix-len 5 \
- --random-input-len 10 \
- --random-output-len 100 \
- --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
- --ignore-eos
-```
-
-If successful, you will see the following output:
-
-```text
-============ Serving Benchmark Result ============
-Successful requests: 2
-Failed requests: 0
-Benchmark duration (s): 3.89
-Request throughput (req/s): 0.51
-Peak concurrent requests: 2.00
-----------------End-to-end Latency----------------
-Mean E2EL (ms): 3824.76
-Median E2EL (ms): 3824.76
-P99 E2EL (ms): 3888.54
-================== Text Result ===================
-Total input tokens: 30
-Total generated tokens: 10101
-Output token throughput (tok/s): 2595.57
-Peak output token throughput (tok/s): 111.00
-Peak concurrent requests: 2.00
-Total Token throughput (tok/s): 2603.28
----------------Time to First Token----------------
-Mean TTFT (ms): 117.15
-Median TTFT (ms): 117.15
-P99 TTFT (ms): 142.69
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 0.73
-Median TPOT (ms): 0.73
-P99 TPOT (ms): 0.74
----------------Inter-token Latency----------------
-Mean ITL (ms): 16.47
-Median ITL (ms): 16.19
-P99 ITL (ms): 52.55
-================== Audio Result ==================
-Total audio duration generated(s): 15.79
-Total audio frames generated: 379050
-Audio throughput(audio duration/s): 4.06
----------------Time to First Packet---------------
-Mean AUDIO_TTFP (ms): 3701.37
-Median AUDIO_TTFP (ms): 3701.37
-P99 AUDIO_TTFP (ms): 3762.25
------------------Real Time Factor-----------------
-Mean AUDIO_RTF: 0.47
-Median AUDIO_RTF: 0.47
-P99 AUDIO_RTF: 0.48
-==================================================
-```
-Notes:
-RTF (real-time factor) is computed as audio generation time / audio duration.
-
-
-
-### Multi-Modal Benchmark
-
-
-
-
-Benchmark the performance of multi-modal requests in vLLM-Omni.
-
-Generate synthetic image, video, and audio inputs alongside random text prompts to stress-test multi-modal models without external datasets.
-
-Notes:
-
-- Works only with online benchmark via the OpenAI backend (`--backend openai-chat-omni`) and endpoint `/v1/chat/completions`.
-
-Start the server (example):
-
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni
-```
-
-We recommend passing `--ignore-eos` to simulate realistic response lengths. Set the output length with `--random-output-len`.
-
-Then run the benchmarking script:
-```bash
-vllm bench serve \
- --omni \
- --dataset-name random-mm \
- --port 40849 \
- --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct \
- --endpoint /v1/chat/completions \
- --backend openai-chat-omni \
- --request-rate 1 \
- --num-prompts 1 \
- --random-input-len 10 \
- --random-range-ratio 0.0 \
- --random-mm-base-items-per-request 2 \
- --random-mm-num-mm-items-range-ratio 0 \
- --random-mm-limit-mm-per-prompt '{"image":1,"video":1, "audio": 1}' \
- --random-mm-bucket-config '{"(32, 32, 1)": 0.5, "(0, 1, 1)": 0.1, "(32, 32, 2)":0.4}' \
- --ignore-eos \
- --percentile-metrics ttft,tpot,itl \
- --random-output-len 2 \
- --extra_body '{"modalities": ["text"]}'
-```
-
-If successful, you will see the following output:
-
-```text
-============ Serving Benchmark Result ============
-Successful requests: 1
-Failed requests: 0
-Request rate configured (RPS): 1.00
-Benchmark duration (s): 1.21
-Request throughput (req/s): 0.83
-Peak concurrent requests: 1.00
-================== Text Result ===================
-Total input tokens: 10
-Total generated tokens: 3
-Output token throughput (tok/s): 2.49
-Peak output token throughput (tok/s): 3.00
-Peak concurrent requests: 1.00
-Total Token throughput (tok/s): 10.77
----------------Time to First Token----------------
-Mean TTFT (ms): 179.74
-Median TTFT (ms): 179.74
-P99 TTFT (ms): 179.74
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 12.76
-Median TPOT (ms): 12.76
-P99 TPOT (ms): 12.76
----------------Inter-token Latency----------------
-Mean ITL (ms): 12.76
-Median ITL (ms): 12.76
-P99 ITL (ms): 25.24
-================== Audio Result ==================
-Total audio duration generated(s): 0.00
-Total audio frames generated: 0
-Audio throughput(audio duration/s): 0.00
-==================================================
-```
-
-Behavioral notes:
-
-- If the requested base item count cannot be satisfied under the provided per-prompt limits, the tool raises an error rather than silently clamping.
-
-How sampling works:
-
-- Determine per-request item count k by sampling uniformly from the integer range defined by `--random-mm-base-items-per-request` and `--random-mm-num-mm-items-range-ratio`, then clamp k to at most the sum of per-modality limits.
-- For each of the k items, sample a bucket (H, W, T) according to the normalized probabilities in `--random-mm-bucket-config`, while tracking how many items of each modality have been added.
-- If a modality (e.g., image) reaches its limit from `--random-mm-limit-mm-per-prompt`, all buckets of that modality are excluded and the remaining bucket probabilities are renormalized before continuing.
-  This should be seen as an edge case; it can be avoided by setting `--random-mm-limit-mm-per-prompt` to a large number. Note that a large value might cause errors if it exceeds the engine's `--limit-mm-per-prompt` setting.
-- The resulting request contains synthetic image data in `multi_modal_data` (OpenAI Chat format). When `random-mm` is used with the OpenAI Chat backend, prompts remain text and MM content is attached via `multi_modal_data`.
-
diff --git a/docs/cli/serve.md b/docs/cli/serve.md
deleted file mode 100644
index 9d1747e0be2..00000000000
--- a/docs/cli/serve.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# vllm-omni serve
-
-## Stage-based CLI quickstart
-
-The stage-based CLI is for deployments that launch each pipeline stage in its own process
-(for example, on separate GPUs or on different hosts).
-
-- For **migrated models** that use the bundled deployment YAMLs in
-  `vllm_omni/deploy/`, `--deploy-config` is needed only to override the defaults; `vllm serve MODEL --omni ...`
-  loads the bundled deployment configuration automatically.
-- For **legacy models** whose configuration files live in
-  `vllm_omni/model_executor/stage_configs/`, `--stage-configs-path` remains mandatory.
-
-Example: Initializing Stage 0 (Orchestrator and API Server):
-The commands below show a common device mapping where Stage 0 uses GPU 0 and
-worker stages use GPU 1 via `CUDA_VISIBLE_DEVICES`.
-
-```bash
-CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --port 8091 \
- --stage-id 0 \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-Example: Initializing a Headless Worker Stage (Stage 1):
-
-```bash
-CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --stage-id 1 \
- --headless \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-When using a custom deployment YAML based on the new schema, append `--deploy-config /path/to/override.yaml` to each command. For legacy models, use `--stage-configs-path /path/to/stage_configs.yaml` instead.
-
-In a standard single-command launch, `--stage-overrides` applies stage-specific settings from one CLI command.
-With the **stage-based CLI**, each process runs exactly one stage, so pass tuning parameters directly as command-line flags for that stage rather than building a composite `--stage-overrides` JSON string.
-
-For example, instead of the following composite configuration:
-
-```bash
-vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
- --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
-```
-
-the stage-based CLI lets you launch Stage 1 with explicit flags:
-
-```bash
-CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --stage-id 1 \
- --headless \
- --gpu-memory-utilization 0.5 \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-## JSON CLI Arguments
-
---8<-- "docs/cli/json_tip.inc.md"
-
-## Arguments
-
---8<-- "docs/generated/argparse_omni/omni_serve.inc.md"
diff --git a/docs/community/contact_us.md b/docs/community/contact_us.md
deleted file mode 100644
index 09c7815a038..00000000000
--- a/docs/community/contact_us.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Contact Us
-
-- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm-omni/issues)
-- To coordinate contributions and development, or to talk with other users and developers, join the `sig-omni` channel in our [Slack](https://slack.vllm.ai/) or use the [vLLM Forum](https://discuss.vllm.ai/)
-- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm-omni/security/advisories) feature
diff --git a/docs/community/governance.md b/docs/community/governance.md
deleted file mode 100644
index 6af578e2d8f..00000000000
--- a/docs/community/governance.md
+++ /dev/null
@@ -1,54 +0,0 @@
-# Governance
-
-vLLM-Omni's governance is inspired by the [vLLM governance process](https://docs.vllm.ai/en/latest/governance/process/). We share the same commitment to open source community and meritocratic norms.
-
-## Values
-
-vLLM-Omni aims to be the fastest and easiest-to-use omni-modality inference and serving engine. Our values are aligned with [vLLM's values](https://docs.vllm.ai/en/latest/governance/process/#values):
-
-### Design Values
-
-1. **Top performance**: System performance is our top priority. We monitor overheads, optimize kernels, and publish benchmarks. We never leave performance on the table.
-2. **Ease of use**: vLLM-Omni must be simple to install, configure, and operate. We provide clear documentation, fast startup, clean logs, helpful error messages, and monitoring guides. Many users fork our code or study it deeply, so we keep it readable and modular.
-3. **Wide coverage**: vLLM-Omni supports frontier models and high-performance accelerators. We make it easy to add new models and hardware. vLLM-Omni + PyTorch form a simple interface that avoids complexity.
-4. **Production ready**: vLLM-Omni runs 24/7 in production. It must be easy to operate and monitor for health issues.
-5. **Extensibility**: vLLM-Omni serves as fundamental omni-modality infrastructure. Our codebase cannot cover every use case, so we design for easy forking and customization.
-
-### Collaboration Values
-
-1. **Tightly Knit and Fast-Moving**: Our maintainer team is aligned on vision, philosophy, and roadmap. We work closely to unblock each other and move quickly.
-2. **Individual Merit**: No one buys their way into governance. Committer status belongs to individuals, not companies. We reward contribution, maintenance, and project stewardship.
-
-## Project Maintainers
-
-### Lead Maintainers
-
-Lead maintainers are responsible for the overall direction and strategy of the project:
-
-- [@Gaohan123](https://github.com/Gaohan123)
-- [@hsliuustc0106](https://github.com/hsliuustc0106)
-- [@ywang96](https://github.com/ywang96)
-
-### Active Committers
-
-Committers have write access and merge rights. They typically have deep expertise in specific areas of this project and shepherd the community contributions:
-
-- [@david6666666](https://github.com/david6666666): Quantization and Community Relationship
-- [@gcanlin](https://github.com/gcanlin): Hardware plugin and NPU integration
-- [@Isotr0py](https://github.com/Isotr0py): Diffusion and Quantization
-- [@linyueqian](https://github.com/linyueqian): TTS and Omni Support
-- [@lishunyang12](https://github.com/lishunyang12): Quantization and Configuration
-- [@princepride](https://github.com/princepride): Diffusion and Omni Support
-- [@SamitHuang](https://github.com/SamitHuang): RL and Diffusion
-- [@tzhouam](https://github.com/tzhouam): Engine and New Model Support
-- [@wtomin](https://github.com/wtomin): Diffusion and Parallelism
-- [@ZeldaHuang](https://github.com/ZeldaHuang): Omni Support
-- [@ZJY0516](https://github.com/ZJY0516): Diffusion and CustomOp
-
-## Meetings
-
-Committers hold **bi-weekly meetings** to discuss future directions and collaborations of the project.
-
-## Committer Nomination Process
-
-Every month, any active committer can nominate new committer(s) to the project. Up to **two new committers** will be admitted per month based on the quality and impact of their contributions.
diff --git a/docs/community/meetups.md b/docs/community/meetups.md
deleted file mode 100644
index 3374fe711cf..00000000000
--- a/docs/community/meetups.md
+++ /dev/null
@@ -1 +0,0 @@
-# Meetups
diff --git a/docs/community/volunteers.md b/docs/community/volunteers.md
deleted file mode 100644
index 2c25485ea90..00000000000
--- a/docs/community/volunteers.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Volunteers for Bugfix and CI
-
-We encourage you to check the current docs and [issues](https://github.com/vllm-project/vllm-omni/issues) for possible answers to your questions. If none of these resolves it, please open an issue describing the bug or CI problem you encountered during development.
-
-If you urgently need help locating and fixing a bug or CI problem, please contact the community volunteers below.
-
-| Dec 4-Dec 12 | Dec 15-Dec 19 | Dec 22-Dec 26 | Dec 29-Jan 2, 2026 | Jan 5-Jan 9 | Jan 12-Jan 16 |
-|----------|----------|----------|----------|----------|----------|
-| Conw729 | yinpeiqi | tzhouam | SamitHuang | gcanlin | natureofnature |
-| david6666666 | R2-Y | hsliuustc0106 | Gaohan123 | ZJY0516 | qibaoyuan |
-
-We warmly welcome more contributors to fix bugs and contribute new features!
diff --git a/docs/configuration/README.md b/docs/configuration/README.md
deleted file mode 100644
index 390176e9cea..00000000000
--- a/docs/configuration/README.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Configuration Options
-
-This section lists the most common options for running vLLM-Omni.
-
-For options within a vLLM engine, please refer to [vLLM Configuration](https://docs.vllm.ai/en/v0.16.0/configuration/index.html).
-
-Currently, the main options are maintained by stage configs for each model.
-
-For a specific example, see the [Qwen2.5-Omni deploy config](gh-file:vllm_omni/deploy/qwen2_5_omni.yaml). The matching frozen pipeline topology lives at [vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py](gh-file:vllm_omni/model_executor/models/qwen2_5_omni/pipeline.py).
-
-For an introduction, see [Introduction for stage config](./stage_configs.md).
-
-## Memory Configuration
-
-- **[GPU Memory Calculation and Configuration](./gpu_memory_utilization.md)** - Guide on how to calculate memory requirements and set up `gpu_memory_utilization` for optimal performance
-
-## Multi-Stage Recipes
-
-- **[Prefill-Decode Disaggregation](./pd_disaggregation.md)** - How to derive a PD-aware Qwen3-Omni stage config from the default config without introducing another bundled YAML
-
-## Optimization Features
-
-- **[Diffusion Features Overview](../user_guide/diffusion_features.md)** - Complete overview of all diffusion model features and supported models
diff --git a/docs/configuration/gpu_memory_utilization.md b/docs/configuration/gpu_memory_utilization.md
deleted file mode 100644
index 74106603532..00000000000
--- a/docs/configuration/gpu_memory_utilization.md
+++ /dev/null
@@ -1,207 +0,0 @@
-# GPU Memory Calculation and Configuration
-
-This guide explains how to calculate GPU memory requirements and properly configure `gpu_memory_utilization` for vLLM-Omni stages.
-
-## Overview
-
-`gpu_memory_utilization` is a critical parameter that controls how much GPU memory each stage can use. It's specified as a fraction between 0.0 and 1.0, where:
-- `0.8` means 80% of the GPU's total memory
-- `1.0` means 100% of the GPU's total memory (not recommended, leaves no buffer)
-
-## How Memory is Calculated
-
-### Memory Allocation Formula
-
-For each stage, vLLM-Omni calculates the requested memory as:
-
-```
-requested_memory = total_gpu_memory × gpu_memory_utilization
-```
-
-The system checks that:
-```
-free_memory ≥ requested_memory
-```
-
-If this condition is not met, the stage will fail to initialize with an error message showing the memory requirements.
-
-### Memory Components
-
-The total memory used by a stage includes:
-
-1. **Model Weights**: The size of the model parameters loaded on the GPU
-2. **KV Cache**: Memory for storing key-value cache during generation
-3. **Activation Memory**: Temporary memory for intermediate computations
-4. **System Overhead**: Memory used by CUDA, PyTorch, and other system components
-5. **Non-Torch Memory**: Memory allocated outside of PyTorch (e.g., CUDA graphs)
-
-### Example Calculation
-
-For a GPU with 80GB total memory:
-- `gpu_memory_utilization: 0.8` → 64GB available for the stage
-- `gpu_memory_utilization: 0.6` → 48GB available for the stage
-- `gpu_memory_utilization: 0.15` → 12GB available for the stage
-
-## Setting Up `gpu_memory_utilization`
-
-### Step 1: Determine GPU Memory
-
-First, check your GPU's total memory:
-
-```bash
-# Using nvidia-smi
-nvidia-smi --query-gpu=memory.total --format=csv
-
-# Or using Python
-python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
-```
-
-### Step 2: Estimate Model Memory Requirements
-
-#### For Autoregressive (AR) Stages
-
-AR stages typically need more memory due to:
-- Large model weights
-- KV cache for attention
-- Activation buffers
-
-#### For Diffusion/Generation Stages
-
-Diffusion stages (like code2wav) typically need less memory:
-- Smaller model components
-- Different memory access patterns
-
-**Typical values:**
-- `0.1 - 0.3` for most diffusion stages
-
-### Step 3: Consider Multi-Stage Scenarios
-
-When multiple stages share the same GPU, you must ensure the sum of their `gpu_memory_utilization` values doesn't exceed 1.0.
-
-**Example: Two stages on GPU 0**
-```yaml
-stage_args:
-  - stage_id: 0
-    runtime:
-      devices: "0"
-    engine_args:
-      gpu_memory_utilization: 0.6  # Uses 60% of GPU 0
-
-  - stage_id: 1
-    runtime:
-      devices: "0"
-    engine_args:
-      gpu_memory_utilization: 0.3  # Uses 30% of GPU 0
-      # Total: 90% of GPU 0 (safe, leaves 10% buffer)
-```
-
-**Important:** If stages run on different GPUs, each can use up to 1.0 independently.
-
-### Step 4: Account for Tensor Parallelism
-
-When using `tensor_parallel_size > 1`, the model is split across multiple GPUs, so each GPU needs less memory.
-
-**Example: 2-way tensor parallelism**
-```yaml
-stage_args:
-  - stage_id: 0
-    runtime:
-      devices: "0,1"  # Uses both GPUs
-    engine_args:
-      tensor_parallel_size: 2
-      gpu_memory_utilization: 0.6  # 60% per GPU
-      # Model is split, so each GPU holds roughly half of the model weights
-```
-
-## Examples
-
-### Qwen3-Omni-MoE on 2x H100-80GB
-
-```yaml
-stage_args:
-  - stage_id: 0  # Thinker stage with TP=2
-    runtime:
-      devices: "0,1"
-    engine_args:
-      tensor_parallel_size: 2
-      gpu_memory_utilization: 0.6  # 48GB per GPU
-
-  - stage_id: 1  # Talker stage
-    runtime:
-      devices: "1"
-    engine_args:
-      gpu_memory_utilization: 0.3  # 24GB on GPU 1
-
-  - stage_id: 2  # Code2Wav stage
-    runtime:
-      devices: "0"
-    engine_args:
-      gpu_memory_utilization: 0.1  # 8GB on GPU 0
-```
-**Note:** In this configuration, stages 0 and 2 share GPU 0 (0.6 + 0.1 = 0.7) and stages 0 and 1 share GPU 1 (0.6 + 0.3 = 0.9), so the combined `gpu_memory_utilization` on each GPU stays below 1.0.
-
-## Troubleshooting
-
-### Error: "Free memory is less than desired GPU memory utilization"
-
-This means the GPU doesn't have enough free memory when the stage starts.
-
-**Solutions:**
-1. Free up memory by closing other processes
-2. Reduce `gpu_memory_utilization` for this stage
-3. Use a GPU with more memory
-4. Move the stage to a different GPU
-
-### Error: OOM during inference
-
-The stage initialized but ran out of memory during processing.
-
-**Solutions:**
-1. Reduce `max_num_batched_tokens`
-2. Reduce `max_num_seqs` in engine_args
-3. Lower `gpu_memory_utilization` slightly
-4. Enable quantization if supported
-
-### Memory Not Fully Utilized
-
-If you see low memory usage, you can:
-1. Increase `gpu_memory_utilization` to allow larger KV cache
-2. Increase `max_num_batched_tokens` for better batching
-3. Check if other stages are limiting throughput
-
-## Useful Formulas for Memory Calculation
-
-### KV Cache Memory
-
-The KV cache size depends on:
-- Number of sequences in batch
-- Sequence length (prompt + generation)
-- Model hidden size
-- Number of attention heads
-- Number of layers
-
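-Approximate formula:
-```
-kv_cache_memory ≈ batch_size × seq_len × hidden_size × num_layers × 2 × dtype_size
-```
-The factor of 2 accounts for the key and value tensors. For models that use grouped-query attention, replace `hidden_size` with `num_kv_heads × head_dim`.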
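-
-As a rough illustration of the formula above, assume a hypothetical model with 32 layers and a hidden size of 4096, running in FP16 (2 bytes per value) with a batch of 8 sequences of 4096 tokens each. These numbers are illustrative assumptions, not measurements of any specific model:
-
-```latex
-\begin{aligned}
-\text{kv\_cache\_memory}
-  &\approx 8 \times 4096 \times 4096 \times 32 \times 2 \times 2 \ \text{bytes} \\
-  &= 17{,}179{,}869{,}184 \ \text{bytes} \\
-  &\approx 17.2 \ \text{GB} \quad (16 \ \text{GiB})
-\end{aligned}
-```
-
-Under these assumptions each token costs about 0.5 MiB of KV cache, so halving either the batch size or the sequence length roughly halves the requirement.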
-
-### Model Weight Memory
-
-```
-model_memory ≈ num_parameters × dtype_size
-```
-
-For example:
-- 7B parameters in FP16: ~14GB
-- 7B parameters in FP32: ~28GB
-- 7B parameters in INT8: ~7GB
-
-### Activation Memory
-
-Activation memory is typically smaller but varies with:
-- Batch size
-- Sequence length
-- Model architecture
-
-It's usually 10-30% of model weight memory during inference.
diff --git a/docs/configuration/pd_disaggregation.md b/docs/configuration/pd_disaggregation.md
deleted file mode 100644
index 9196bdb0240..00000000000
--- a/docs/configuration/pd_disaggregation.md
+++ /dev/null
@@ -1,171 +0,0 @@
-# Prefill-Decode (PD) Disaggregation
-
-PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode
-stages so prompt processing and token generation can run on different workers.
-
-This is documented as a stage-config recipe instead of a bundled YAML because the
-deployment-specific values usually change per environment:
-
-- GPU placement
-- `tensor_parallel_size`
-- connector backend and connector ports
-- connector IPs or bootstrap addresses
-
-Start from the [default Qwen3-Omni stage config](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml)
-and copy it to your own file, for example `qwen3_omni_pd.yaml`. Then apply the
-changes below.
-
-## Requirements
-
-- 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
-- A KV connector supported by vLLM, such as `MooncakeConnector`
-- Matching `tensor_parallel_size` on the prefill and decode thinker stages
-
-## 1. Split the thinker into prefill and decode stages
-
-Replace the original thinker stage with two stages:
-
-```yaml
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    is_prefill_only: true
-    runtime:
-      devices: "0"
-    engine_args:
-      max_num_seqs: 16
-      model_stage: thinker
-      model_arch: Qwen3OmniMoeForConditionalGeneration
-      worker_type: ar
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      hf_config_name: thinker_config
-      tensor_parallel_size: 1
-      kv_transfer_config:
-        kv_connector: "MooncakeConnector"
-        kv_role: "kv_producer"
-        kv_rank: 0
-        kv_parallel_size: 2
-        kv_connector_extra_config:
-          mooncake_bootstrap_port: 25201
-    final_output: false
-    is_comprehension: true
-    default_sampling_params:
-      temperature: 0.4
-      top_p: 0.9
-      top_k: 1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.05
-
-  - stage_id: 1
-    stage_type: llm
-    is_decode_only: true
-    runtime:
-      devices: "1"
-    engine_args:
-      max_num_seqs: 64
-      model_stage: thinker
-      model_arch: Qwen3OmniMoeForConditionalGeneration
-      worker_type: ar
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      hf_config_name: thinker_config
-      tensor_parallel_size: 1
-      kv_transfer_config:
-        kv_connector: "MooncakeConnector"
-        kv_role: "kv_consumer"
-        kv_rank: 1
-        kv_parallel_size: 2
-        kv_connector_extra_config:
-          mooncake_bootstrap_port: 25202
-    engine_input_source: [0]
-    final_output: true
-    final_output_type: text
-    is_comprehension: true
-    default_sampling_params:
-      temperature: 0.4
-      top_p: 0.9
-      top_k: 1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.05
-```
-
-Notes:
-
-- `is_prefill_only: true` marks the thinker stage that only saves KV.
-- `is_decode_only: true` marks the thinker stage that resumes from remote KV.
-- `kv_transfer_config` is required on both stages.
-- The orchestrator forces the prefill stage to run with `max_tokens=1`, so the
- prefill side only processes the prompt and exports KV.
-
-## 2. Shift the downstream stages by one index
-
-After inserting the extra thinker stage, renumber the remaining stages:
-
-```yaml
-  - stage_id: 2
-    runtime:
-      devices: "2"
-    engine_input_source: [1]
-    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker
-
-  - stage_id: 3
-    runtime:
-      devices: "2"
-    engine_args:
-      max_num_seqs: 1
-    engine_input_source: [2]
-    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav
-```
-
-Compared with the default Qwen3-Omni config:
-
-- the talker becomes stage `2` instead of stage `1`
-- the code2wav stage becomes stage `3` instead of stage `2`
-- the talker now reads from decode stage `1`
-
-## 3. Add runtime edges for the four-stage pipeline
-
-```yaml
-runtime:
-  enabled: true
-  edges:
-    - from: 0
-      to: 1
-    - from: 1
-      to: 2
-    - from: 2
-      to: 3
-```
-
-## 4. Launch with your custom config
-
-```bash
-vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
- --stage-configs-path /path/to/qwen3_omni_pd.yaml
-```
-
-## Operational Notes
-
-- `MooncakeConnector` does not support heterogeneous TP sizes across the PD
- pair. Keep prefill and decode at the same `tensor_parallel_size`.
-- If the thinker requires TP=2, both thinker stages must use TP=2 and be given
- separate GPU sets, for example `"0,1"` for prefill and `"2,3"` for decode.
-- Choose connector ports and addresses that match your deployment. The values
- shown above are examples only.
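-
-The TP=2 layout described in the notes above could be sketched as follows. Only the fields that differ from the earlier example are shown; the device IDs are illustrative assumptions, and the downstream talker and code2wav stages would then need a GPU outside these two sets:
-
-```yaml
-stage_args:
-  - stage_id: 0            # prefill thinker
-    runtime:
-      devices: "0,1"       # assumed GPU set for prefill
-    engine_args:
-      tensor_parallel_size: 2
-
-  - stage_id: 1            # decode thinker
-    runtime:
-      devices: "2,3"       # assumed GPU set for decode, disjoint from prefill
-    engine_args:
-      tensor_parallel_size: 2   # must match the prefill stage
-```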
diff --git a/docs/configuration/stage_configs.md b/docs/configuration/stage_configs.md
deleted file mode 100644
index 3a17f1c4013..00000000000
--- a/docs/configuration/stage_configs.md
+++ /dev/null
@@ -1,509 +0,0 @@
-# Stage configs for vLLM-Omni
-
-In vLLM-Omni, the target model is split into multiple stages, each processed by its own engine (an LLMEngine, a DiffusionEngine, or another engine type). Depending on the stage type, such as an autoregressive (AR) stage or a diffusion transformer (DiT) stage, each stage selects the matching scheduler and model worker, which the engine loads in a plug-in fashion.
-
-!!! note
- Default deploy config YAMLs (for example, `vllm_omni/deploy/qwen2_5_omni.yaml`, `vllm_omni/deploy/qwen3_omni_moe.yaml`, and `vllm_omni/deploy/qwen3_tts.yaml`) are bundled and loaded automatically when neither `--stage-configs-path` nor `--deploy-config` is provided — the model registry resolves the right pipeline + deploy YAML by `model_type`. The bundled defaults have been verified on 1xH100 for Qwen2.5-Omni and 2xH100 for Qwen3-Omni. Models that have not yet migrated to the new schema continue to use the legacy `vllm_omni/model_executor/stage_configs/.yaml` files via `--stage-configs-path`.
-
-## New deploy schema reference
-
-The new deploy schema lives under `vllm_omni/deploy/` and is paired with a frozen `PipelineConfig` registered by the model's `pipeline.py`. Each deploy YAML has these top-level fields:
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `base_config` | str (path) | optional | — | Overlay parent (relative or absolute). `stages:` / `platforms:` deep-merged by stage_id; other scalars overlay-wins. Intended for user-authored overlays; prod yamls stay flat. |
-| `async_chunk` | bool | optional | `true` | Enable chunked streaming between stages. Pin to `false` if the pipeline runs end-to-end. |
-| `connectors` | dict | optional | `null` | Named connector specs (`{name, extra}`). Referenced by each stage's `input_connectors` / `output_connectors`. See [Connector schema](#connector-schema). |
-| `edges` | list | optional | `null` | Explicit edge list for the KV transfer graph. Auto-derived from stage inputs if omitted. |
-| `stages` | list | required | — | Per-stage engine args + wiring (see [Stage fields](#stage-fields)). |
-| `platforms` | dict | optional | `null` | Keyed by `npu` / `rocm` / `xpu`, each contains a `stages:` list with per-platform overrides applied on top of the CUDA defaults. |
-| `pipeline` | str | optional | `null` | Override the auto-detected pipeline registry key (used for structural variants like `qwen2_5_omni_thinker_only`). |
-| `trust_remote_code` | bool | optional | `true` | **Pipeline-wide.** Trust HF remote code on model load; applies to every stage. |
-| `distributed_executor_backend` | str \| null | optional | `null` | **Pipeline-wide.** Distributed executor backend forwarded to vLLM (`"mp"`, `"ray"`, `"external_launcher"`). If omitted, vLLM auto-selects backend from runtime topology. |
-| `dtype` | str \| null | optional | `null` | **Pipeline-wide.** Model dtype for every stage. |
-| `quantization` | str \| null | optional | `null` | **Pipeline-wide.** Quantization method for every stage. |
-| `enable_prefix_caching` | bool | optional | `false` | **Pipeline-wide.** Prefix cache toggle applied to every stage. |
-| `enable_chunked_prefill` | bool \| null | optional | `null` | **Pipeline-wide.** Chunked prefill toggle applied to every stage. |
-| `data_parallel_size` | int | optional | `1` | **Pipeline-wide.** DP degree for every stage. |
-| `pipeline_parallel_size` | int | optional | `1` | **Pipeline-wide.** PP degree for every stage. |
-
-Note: for the diffusion path, `distributed_executor_backend` currently defaults to
-`mp`; `ray` and `external_launcher` are not yet fully supported.
-
-### Stage fields
-
-Each entry under `stages:` accepts any `StageDeployConfig` field directly (no nested `engine_args:`). Only fields whose value legitimately varies across stages live here; pipeline-wide settings (trust_remote_code, distributed_executor_backend, dtype, quantization, prefix/chunked prefill, DP/PP sizes) are declared at the top level and applied to every stage. Unknown keys fall through to `engine_extras:` and are forwarded to the engine.
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `stage_id` | int | required | — | Stage identity; matched against `PipelineConfig.stages[*].stage_id`. |
-| `max_num_seqs` | int | optional | `64` | Max concurrent sequences per stage. |
-| `gpu_memory_utilization` | float | optional | `0.9` | Per-stage memory budget. |
-| `tensor_parallel_size` | int | optional | `1` | TP degree for this stage. |
-| `enforce_eager` | bool | optional | `false` | Disable CUDA graphs. |
-| `max_num_batched_tokens` | int | optional | `32768` | Prefill budget. |
-| `max_model_len` | int \| null | optional | `null` | Per-stage context length (auto-sets `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` when larger than HF default). |
-| `async_scheduling` | bool \| null | optional | `null` | Per-stage async scheduling toggle. |
-| `devices` | str | optional | `"0"` | `CUDA_VISIBLE_DEVICES`-style device list. |
-| `output_connectors` | dict \| null | optional | `null` | Keyed by `to_stage_`; values are names registered under top-level `connectors:`. |
-| `input_connectors` | dict \| null | optional | `null` | Keyed by `from_stage_`; values are names registered under top-level `connectors:`. |
-| `default_sampling_params` | dict \| null | optional | `null` | Baseline sampling params. Deep-merged with pipeline `sampling_constraints` (pipeline wins). |
-| `engine_extras` | dict | optional | `{}` | Catch-all for keys not listed above; deep-merged across overlays. Also carries per-stage overrides of pipeline-wide settings (e.g. stage-specific `dtype`). |
-
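-A minimal sketch of a user-authored deploy YAML that uses only fields from the two tables above. The two-stage layout and every value are illustrative assumptions, not a bundled configuration:
-
-```yaml
-# Pipeline-wide settings (top-level fields)
-async_chunk: true
-enable_prefix_caching: false
-
-# Per-stage settings (StageDeployConfig fields, no nested engine_args)
-stages:
-  - stage_id: 0
-    devices: "0"
-    gpu_memory_utilization: 0.6   # assumed split for two stages on one GPU
-    max_num_seqs: 16
-  - stage_id: 1
-    devices: "0"
-    gpu_memory_utilization: 0.3
-    enforce_eager: true
-```
-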
-### Connector schema
-
-Each entry under top-level `connectors:` follows this shape:
-
-```yaml
-connectors:
- :
- name: # required — class registered in vllm_omni.distributed
- extra: # optional — forwarded to the connector's __init__
- :
- ...
-```
-
-| Connector class | Use case | `extra` keys |
-|-----------------|----------|--------------|
-| `SharedMemoryConnector` | Same-host KV transfer between stages (default for bundled YAMLs). | `shm_threshold_bytes` (int, default `65536`). |
-| `MooncakeStoreConnector` | Cross-host KV transfer over TCP. Required for multi-node deployments. | `host`, `metadata_server`, `master`, `segment` (int bytes), `localbuf` (int bytes), `proto` (`"tcp"` / `"rdma"`). |
-
-A stage references a connector by name in its `input_connectors` / `output_connectors`:
-
-```yaml
-connectors:
-  shm:
-    name: SharedMemoryConnector
-
-stages:
-  - stage_id: 0
-    output_connectors: {to_stage_1: shm}
-  - stage_id: 1
-    input_connectors: {from_stage_0: shm}
-```
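-
-For a cross-host setup, the same wiring could reference a `MooncakeStoreConnector` entry instead. The sketch below uses only the `extra` keys listed in the table; the connector name `mooncake`, the angle-bracket addresses, and the buffer sizes are placeholder assumptions to replace with your deployment's values:
-
-```yaml
-connectors:
-  mooncake:                                 # hypothetical connector name
-    name: MooncakeStoreConnector
-    extra:
-      host: "<this-host-address>"           # placeholder
-      metadata_server: "<metadata-server>"  # placeholder
-      master: "<master-address>"            # placeholder
-      segment: 536870912                    # bytes (assumed 512 MiB)
-      localbuf: 67108864                    # bytes (assumed 64 MiB)
-      proto: "tcp"
-
-stages:
-  - stage_id: 0
-    output_connectors: {to_stage_1: mooncake}
-  - stage_id: 1
-    input_connectors: {from_stage_0: mooncake}
-```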
-
-### CLI flags introduced in this refactor
-
-| Flag | Description |
-|------|-------------|
-| `--deploy-config PATH` | Load a new-schema deploy YAML. Takes precedence over `--stage-configs-path`. **Optional** — when omitted, the bundled `vllm_omni/deploy/.yaml` is auto-loaded by the model registry. |
-| `--stage-overrides JSON` | Per-stage JSON overrides, e.g. `'{"0":{"gpu_memory_utilization":0.5}}'`. Per-stage values always win over global flags. |
-| `--async-chunk` / `--no-async-chunk` | Flip the deploy YAML's `async_chunk:` bool. Unset (default) leaves the YAML value in force. |
-| `--stage-configs-path` | **Deprecated.** Accepts legacy `stage_args` yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to `--deploy-config`. To be removed in a follow-up PR. |
-
-### Stage-Based CLI Paradigm
-
-The stage-based CLI runs each pipeline stage in its own process:
-
-- **Stage 0** typically hosts the orchestrator and the primary API server. Launch it with `--stage-id 0`,
- `--omni-master-address`, `--omni-master-port`, and the usual port flags (e.g., `--port`).
-- **Worker stages** run without an API server (`--headless`), take sequential `--stage-id` values, and must pass the same
- `--omni-master-address` and `--omni-master-port` so they can register with Stage 0.
-
-For models already migrated to the new schema, the bundled deploy YAML is resolved and loaded automatically, so the main path
-does **not** require `--deploy-config`.
-The example below uses `CUDA_VISIBLE_DEVICES=0` for Stage 0 and
-`CUDA_VISIBLE_DEVICES=1` for Stage 1.
-
-```bash
-CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --port 8091 \
- --stage-id 0 \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-
-CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --stage-id 1 \
- --headless \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-To use a custom deploy YAML in the new schema, add `--deploy-config /path/to/override.yaml`
-to every stage's command. For legacy models (e.g., BAGEL) still configured through the deprecated `stage_args:` schema, keep passing the config via `--stage-configs-path /path/to/config.yaml`.
-
-When all stages are launched from a single command, `--stage-overrides` is the recommended way
-to set stage-specific tuning from the CLI:
-
-```bash
-vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
- --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
-```
-
-With the **stage-based CLI**, each process launches exactly one stage, so overrides
-can be passed as ordinary CLI flags on that stage's command, and a composite `--stage-overrides` JSON string is unnecessary:
-
-```bash
-CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --stage-id 1 \
- --headless \
- --gpu-memory-utilization 0.5 \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-### Precedence
-
-From highest to lowest:
-
-1. Per-stage flags (`--stage-overrides` JSON, `--stage--` if registered)
-2. Explicit global CLI flags (`--gpu-memory-utilization 0.85`, etc.)
-3. Platform section (`platforms.npu.stages`, etc.) on top of the base `stages:`
-4. Overlay YAML (via `base_config:`) on top of the base YAML
-5. Parser defaults
-
-### Worked override example
-
-Starting from the bundled `vllm_omni/deploy/qwen3_omni_moe.yaml`:
-
-```yaml
-# vllm_omni/deploy/qwen3_omni_moe.yaml (excerpt)
-async_chunk: true
-stages:
- - stage_id: 0
- gpu_memory_utilization: 0.9
- max_num_seqs: 32
- - stage_id: 1
- gpu_memory_utilization: 0.7
- max_num_seqs: 16
-```
-
-A user-authored overlay that inherits the base and overrides only stage 1:
-
-```yaml
-# my_overrides.yaml
-base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml
-stages:
- - stage_id: 1
- gpu_memory_utilization: 0.5 # smaller GPU
-```
-
-Launched with both an explicit global flag and a per-stage override:
-
-```bash
-vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
- --deploy-config my_overrides.yaml \
- --max-model-len 16384 \
- --stage-overrides '{"0": {"max_num_seqs": 8}}'
-```
-
-With the stage-based CLI, the same settings can be passed directly
-as command-line arguments to the process that launches the stage:
-
-```bash
-CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
- --stage-id 0 \
- --max-num-seqs 8 \
- --omni-master-address 127.0.0.1 \
- --omni-master-port 26000
-```
-
-Effective config per stage after the merge:
-
-| Stage | Field | Final value | Source |
-|-------|-------|-------------|--------|
-| 0 | `gpu_memory_utilization` | `0.9` | base YAML (overlay didn't touch stage 0) |
-| 0 | `max_num_seqs` | `8` | per-stage CLI (`--stage-overrides`) — wins over base `32` |
-| 0 | `max_model_len` | `16384` | global CLI |
-| 1 | `gpu_memory_utilization` | `0.5` | overlay YAML — wins over base `0.7` |
-| 1 | `max_num_seqs` | `16` | base YAML (overlay didn't touch this field) |
-| 1 | `max_model_len` | `16384` | global CLI |
-| 2 | (all defaults) | — | base YAML (no overrides apply) |
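
The merge order behind this table can be sketched in a few lines. This is an illustrative model only, assuming each layer has already been parsed into plain dicts keyed by stage ID; `resolve_stage` is a hypothetical name, the merge is flat rather than deep, and platform sections and parser defaults are left out for brevity.

```python
def resolve_stage(stage_id, base, overlay, global_cli, stage_overrides):
    """Merge config layers for one stage, lowest precedence first."""
    merged = {}
    # Lower layers are applied first so that later layers overwrite them.
    for layer in (
        base.get(stage_id, {}),             # base YAML
        overlay.get(stage_id, {}),          # overlay YAML (via base_config:)
        global_cli,                         # explicit global CLI flags
        stage_overrides.get(stage_id, {}),  # --stage-overrides JSON
    ):
        merged.update(layer)
    return merged


base = {
    0: {"gpu_memory_utilization": 0.9, "max_num_seqs": 32},
    1: {"gpu_memory_utilization": 0.7, "max_num_seqs": 16},
}
overlay = {1: {"gpu_memory_utilization": 0.5}}
global_cli = {"max_model_len": 16384}
stage_overrides = {0: {"max_num_seqs": 8}}

for sid in (0, 1):
    print(sid, resolve_stage(sid, base, overlay, global_cli, stage_overrides))
```

Running it reproduces the stage 0 and stage 1 rows above: stage 0 ends with `max_num_seqs` of `8`, and stage 1 ends with `gpu_memory_utilization` of `0.5`.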
-
-In summary, as a core part of vLLM-Omni, the stage configs for a model define:
-
-- The partition of the model into stages and the class implementing each stage in `model_executor/models`.
-- The disaggregated configuration of each stage and the communication topology among stages.
-- The engine arguments for the engine within each stage.
-- The input and output dependencies of each stage.
-- The default input parameters.
-
-To override specific parameters, pass your custom configuration file
-in both the online and offline flows. Prefer the `--deploy-config` flag
-for new-schema deploy YAMLs, and reserve `--stage-configs-path`
-for legacy `stage_args` YAMLs.
-
-Examples:
-
-For offline inference (assuming the necessary imports are in place):
-```python
-model_name = "Qwen/Qwen2.5-Omni-7B"
-omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
-```
-
-For online serving:
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --deploy-config /path/to/deploy_config.yaml
-```
-
-Legacy online serving:
-
-```bash
-vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
-```
-!!! important
- We are actively iterating on the definition of stage configs, and we welcome feedback from both community users and developers to help shape its development!
-
-Below is a complete example of `stage_configs.yaml` for Qwen2.5-Omni.
-```yaml
-# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
-stage_args:
- - stage_id: 0 # mark the unique id for each stage
- runtime: # The disaggregated configuration
- process: true # Run this stage in a separate process
- devices: "0" # Logical device index for this stage (mapped through CUDA_VISIBLE_DEVICES / ASCEND_RT_VISIBLE_DEVICES if set)
- engine_args: # Engine arguments for a certain engine
- model_stage: thinker
- max_num_seqs: 1
- model_arch: Qwen2_5OmniForConditionalGeneration # The model implementation registered in model_executor/models/registry.py
- worker_type: ar # The specific worker used
- scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler # The specific scheduler used
- gpu_memory_utilization: 0.8 # The fraction of GPU memory allocated to the stage within a single chip
- enforce_eager: true # Only eager mode is currently supported
- trust_remote_code: true # Needed by Hugging Face config parsing
- engine_output_type: latent # The stage outputs latent hidden states in addition to token IDs
- enable_prefix_caching: false # Prefix caching is not yet supported for requests that output hidden states
- is_comprehension: true # Whether the stage is a text or multimodal comprehension module. If so, AsyncOmni uses its tokenizer by default
- final_output: true # Whether the stage's output is part of the final outputs. If false, the stage only plays an intermediate role.
- final_output_type: text # The final output type. Currently text or audio.
- default_sampling_params: # sampling parameters for the stage. Their meaning aligns with vLLM.
- temperature: 0.0
- top_p: 1.0
- top_k: -1
- max_tokens: 2048
- seed: 42
- detokenize: True
- repetition_penalty: 1.1
- - stage_id: 1
- runtime:
- process: true
- devices: "1"
- engine_args:
- model_stage: talker
- max_num_seqs: 3
- model_arch: Qwen2_5OmniForConditionalGeneration
- worker_type: ar
- scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
- gpu_memory_utilization: 0.8
- enforce_eager: true
- trust_remote_code: true
- enable_prefix_caching: false
- engine_output_type: latent
- engine_input_source: [0]
- custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker
- default_sampling_params:
- temperature: 0.9
- top_p: 0.8
- top_k: 40
- max_tokens: 2048
- seed: 42
- detokenize: True
- repetition_penalty: 1.05
- stop_token_ids: [8294]
- - stage_id: 2
- runtime:
- process: true
- devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU
- engine_args:
- model_stage: code2wav
- max_num_seqs: 1
- model_arch: Qwen2_5OmniForConditionalGeneration
- worker_type: generation
- scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
- gpu_memory_utilization: 0.15
- enforce_eager: true
- trust_remote_code: true
- enable_prefix_caching: false
- engine_output_type: audio
- engine_input_source: [1]
- final_output: true
- final_output_type: audio
- default_sampling_params:
- temperature: 0.0
- top_p: 1.0
- top_k: -1
- max_tokens: 2048
- seed: 42
- detokenize: True
- repetition_penalty: 1.1
-
-# Top-level runtime config (concise): default windows and stage edges
-runtime:
- enabled: true
-
- edges:
- - from: 0 # thinker → talker: trigger only after receiving full input (-1)
- to: 1
- - from: 1 # talker → code2wav: trigger only after receiving full input (-1)
- to: 2
-
-```
-
-## Stage Configuration Arguments
-
-Each stage in the `stage_args` list contains the following configuration options:
-
-### `stage_id`
-
-A unique identifier for each stage in the multi-stage pipeline. Stages are numbered sequentially starting from 0, and this ID is used to reference stages in inter-stage dependencies (e.g., `engine_input_source`).
-
-### `prompt_expand_func` (Optional)
-
-A custom Python function hook for the LLM stage (Stage 0) that expands a single incoming prompt object into multiple prompts. This is primarily used for multi-modal Classifier-Free Guidance (CFG), where it generates the necessary companion requests (like a negative text prompt) and tags them with internal roles (e.g., `cfg_text`). This ensures the upstream LLM generates the needed contextual hidden states for both the conditional and unconditional generations simultaneously.
-
-### `cfg_kv_collect_func` (Optional)
-
-A custom Python function hook for downstream diffusion stages (Stage 1+) to collect, map, and process the KV caches transferred from the companion requests fired by `prompt_expand_func`. It aggregates the hidden condition states cleanly (e.g., binding them as `cfg_text_past_key_values` and `cfg_text_kv_metadata`), allowing the diffusion runtime to perform CFG smoothly without redundantly evaluating text paths on the DiT workers.
-
-### `runtime`
-
-Configuration for disaggregated execution of the stage, controlling how the stage is deployed and executed.
-
-#### `runtime.process`
-
-Whether to run this stage in a separate process. When set to `true`, the stage will be executed in an isolated process, enabling better resource isolation and parallel execution across different stages. This is essential for multi-GPU deployments where different stages run on different devices.
-
-Default: `true`
-
-#### `runtime.devices`
-
-Logical device indices for this stage, specified as a string. Values are **logical indices** (`0`, `1`, `2`, ...) — not physical GPU IDs — and are mapped through the platform's visibility env var (`CUDA_VISIBLE_DEVICES` on CUDA, `ASCEND_RT_VISIBLE_DEVICES` on NPU) before being applied via `torch.cuda.set_device()` (or the equivalent).
-
-Example: if `CUDA_VISIBLE_DEVICES=0,2,4` is set in the environment, then `devices: "0"` selects physical GPU 0 (the first visible), `devices: "1"` selects physical GPU 2, and `devices: "0,1"` makes physical GPUs 0 and 2 available to the stage. If no visibility env var is set, logical and physical IDs coincide.
-
-Default: `"0"`
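
The logical-to-physical mapping described above can be illustrated with a small sketch. It assumes integer device IDs only (UUID-style entries are out of scope), and `logical_to_physical` is a hypothetical helper, not a vLLM-Omni function; the `visible` argument stands in for the value of `CUDA_VISIBLE_DEVICES`.

```python
from typing import List, Optional


def logical_to_physical(devices: str, visible: Optional[str]) -> List[int]:
    """Map a stage's logical `devices` string to physical GPU IDs.

    `visible` is the visibility variable's value, or None when it is unset.
    """
    logical = [int(d) for d in devices.split(",")]
    if not visible:
        # No visibility variable: logical and physical IDs coincide.
        return logical
    physical = [int(v) for v in visible.split(",")]
    return [physical[i] for i in logical]


print(logical_to_physical("0", "0,2,4"))    # [0]
print(logical_to_physical("1", "0,2,4"))    # [2]
print(logical_to_physical("0,1", "0,2,4"))  # [0, 2]
print(logical_to_physical("1", None))       # [1]
```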
-
-#### `engine_args.max_num_seqs`
-
-The maximum number of sequences for concurrent processing in this stage. For LLM stages, this controls the vLLM scheduler's maximum concurrent sequences. For all stage types, this also controls how many tasks can be batched together in the task processing loop.
-
-Default: `1`
-
-### `engine_args`
-
-Engine arguments for configuring the LLM engine, diffusion engine, or other engine types used by this stage.
-
-#### `engine_args.model_stage`
-
-The name identifier for this model stage within the multi-stage architecture. This is used internally to distinguish different stages of the same model (e.g., "thinker", "talker", "code2wav" in Qwen2.5-Omni).
-
-#### `engine_args.model_arch`
-
-The model architecture class name that is registered in `model_executor/models/registry.py`. This specifies which model implementation to use for this stage. The class must be registered in the model registry for vLLM-Omni to locate and instantiate it.
-
-#### `engine_args.worker_cls`
-
-The specific worker class to use for this stage. This determines how the model computations are executed. Examples include `vllm_omni.worker.gpu_ar_worker.GPUARWorker` for autoregressive stages and `vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker` for diffusion-based stages.
-
-#### `engine_args.scheduler_cls`
-
-The scheduler class to use for this stage. The scheduler manages request queuing, batching, and execution order. Examples include `vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler` for standard stages and `vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler` for diffusion stages.
-
-#### `engine_args.gpu_memory_utilization`
-
-The fraction of GPU memory to allocate for this stage within a single GPU chip. This is a value between 0.0 and 1.0, where 0.8 means 80% of the GPU memory will be used by this stage. This allows fine-grained control over memory allocation when multiple stages share the same GPU or when reserving memory for other operations.
-
-Default: `0.8`
-
-!!! tip "Memory Configuration Guide"
- For detailed information on how to calculate memory requirements and properly configure `gpu_memory_utilization`, see the [GPU Memory Calculation and Configuration Guide](./gpu_memory_utilization.md).
-
-#### `engine_args.enforce_eager`
-
-Whether to enforce eager execution mode. When set to `true`, the engine will run in eager mode without using CUDA graphs or other compilation optimizations. Currently, vLLM-Omni only supports eager mode.
-
-Default: `true`
-
-#### `engine_args.trust_remote_code`
-
-Whether to trust remote code when loading models from Hugging Face. This is required for models that use custom code in their configuration files. Set to `true` when loading models that require custom model implementations.
-
-Default: `true`
-
-#### `engine_args.engine_output_type`
-
-Specifies the type of output produced by this stage's engine. This determines what kind of data flows to downstream stages. Possible values include `latent` (hidden states), `text` (tokenized text), and `audio` (audio waveforms). When set to `latent`, the stage outputs latent hidden states in addition to token IDs, which are consumed by downstream stages.
-
-Default: `latent`
-
-#### `engine_args.enable_prefix_caching`
-
-Whether to enable prefix caching for this stage. Prefix caching can improve performance by caching KV cache for common prompt prefixes. However, for requests that output hidden states (when `engine_output_type` is `latent`), prefix caching is not currently supported and should be set to `false`.
-
-Default: `false`
-
-### `is_comprehension`
-
-Whether this stage is a text or multimodal comprehension module. When set to `true`, the stage acts as a comprehension module that processes input text or multimodal content. If this is the first comprehension stage, `AsyncOmni` will use its tokenizer as the default tokenizer for the entire pipeline.
-
-Default: `true`
-
-### `final_output`
-
-Whether this stage produces output that is part of the final outputs returned to the user. When set to `false`, the stage only works as an intermediate stage, processing data that flows to downstream stages but not contributing directly to the final response.
-
-Default: `true`
-
-### `final_output_type`
-
-The type of final output produced by this stage. This specifies what format the output will be in when returned to the user. Currently supported values are `text` (for text generation) and `audio` (for audio generation).
-
-Default: `text`
-
-### `default_sampling_params`
-
-Default sampling parameters for this stage. These parameters control the generation behavior and align with vLLM's sampling parameter semantics. These defaults are used when no explicit sampling parameters are provided in the request.
-
-#### `default_sampling_params.temperature`
-
-Sampling temperature for controlling randomness. Lower values (e.g., 0.0) make the output more deterministic and focused, while higher values increase randomness.
-
-Default: `0.0`
-
-#### `default_sampling_params.top_p`
-
-Nucleus sampling parameter. Only tokens with cumulative probability mass up to `top_p` are considered. This helps filter out low-probability tokens while maintaining diversity.
-
-Default: `1.0`
-
-#### `default_sampling_params.top_k`
-
-Top-k sampling parameter. Only the top `k` most likely tokens are considered. Set to `-1` to disable top-k filtering and consider all tokens.
-
-Default: `-1`
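
To make the interaction of `top_k` and `top_p` concrete, here is a minimal pure-Python sketch over a toy probability table. It illustrates the general technique only; vLLM operates on logits tensors, and `filter_candidates` is a hypothetical name introduced for this example.

```python
def filter_candidates(probs: dict, top_k: int = -1, top_p: float = 1.0) -> dict:
    """Apply top-k, then top-p filtering, and renormalize what remains."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        # Keep only the k most likely tokens; -1 disables this step.
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            # Smallest prefix whose probability mass reaches top_p.
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.2}
print(sorted(filter_candidates(probs)))             # ['a', 'b', 'c']
print(sorted(filter_candidates(probs, top_k=2)))    # ['a', 'b']
print(sorted(filter_candidates(probs, top_p=0.7)))  # ['a', 'b']
```

With the defaults (`top_k=-1`, `top_p=1.0`) nothing is filtered, which matches the default values documented here.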
-
-#### `default_sampling_params.max_tokens`
-
-Maximum number of tokens to generate in this stage. This limits the length of the output sequence.
-
-Default: `2048`
-
-#### `default_sampling_params.seed`
-
-Random seed for reproducible generation. When set, the random number generator will be initialized with this seed to ensure consistent outputs across runs.
-
-Default: `42`
-
-#### `default_sampling_params.detokenize`
-
-Whether to detokenize the output tokens into text. When set to `true`, token IDs are converted back to readable text strings.
-
-Default: `True`
-
-#### `default_sampling_params.repetition_penalty`
-
-Penalty applied to tokens that have already appeared in the generated sequence. Values greater than 1.0 discourage repetition, while values less than 1.0 encourage it. A value of 1.0 applies no penalty.
-
-Default: `1.1`
-
-### `tts_args` (TTS stages only)
-
-Configuration for Text-to-Speech specific parameters. This section is only applicable to TTS model stages (e.g., `qwen3_tts`).
-
-#### `tts_args.max_instructions_length`
-
-Maximum character length for voice style/emotion instructions. Instructions exceeding this limit will be rejected with a validation error.
-
-Default: `500`
-
-This value can be overridden at runtime using the `--tts-max-instructions-length` CLI parameter when starting the server.
diff --git a/docs/contributing/DOCS_GUIDE.md b/docs/contributing/DOCS_GUIDE.md
deleted file mode 100644
index 100bac67423..00000000000
--- a/docs/contributing/DOCS_GUIDE.md
+++ /dev/null
@@ -1,139 +0,0 @@
-# Documentation Build Guide
-
-This directory contains the source files for the vLLM-Omni documentation.
-
-## Building Documentation Locally
-
-### Prerequisites
-
-Install documentation dependencies:
-
-```bash
-uv pip install -e ".[docs]"
-```
-
-### Build and Serve Documentation
-
-From the project root:
-
-```bash
-# Serve documentation locally (auto-reload on changes)
-# This starts a local web server at http://127.0.0.1:8000
-mkdocs serve
-
-# Build static site (generates HTML files in site/ directory)
-mkdocs build
-```
-
-`mkdocs serve` makes the documentation available at `http://127.0.0.1:8000` and reloads it automatically when you change the documentation files.
-
-## Auto-generating API Documentation
-
-The documentation automatically extracts docstrings from the code using mkdocstrings. To ensure your code is documented:
-
-1. Add docstrings to all public classes, functions, and methods
-2. Use Google or NumPy style docstrings (both are supported)
-3. Rebuild the documentation to see changes
-
-Example docstring:
-
-```python
-class Omni:
- """Main entry point for vLLM-Omni inference.
-
- This class provides a high-level interface for running multi-modal
- inference with non-autoregressive models.
-
- Args:
- model: Model name or path
- stage_configs: Optional stage configurations
- **kwargs: Additional arguments passed to the engine
-
- Example:
- >>> llm = Omni(model="Qwen/Qwen2.5-Omni")
- >>> outputs = llm.generate(prompts="Hello")
- """
-```
-
-## Documentation Structure
-
-```
-docs/
-├── index.md # Main documentation page
-├── getting_started/ # Getting started guides
-├── architecture/ # Architecture documentation
-├── api/ # API reference (auto-generated from code)
-├── examples/ # Code examples
-└── stylesheets/ # Custom CSS
-```
-
-## Publishing Documentation
-
-### GitHub Pages (Recommended)
-
-The documentation is automatically deployed to GitHub Pages using GitHub Actions.
-
-1. **Enable GitHub Pages**:
- - Go to repository `Settings` → `Pages`
- - Set `Source` to `GitHub Actions`
- - Save settings
-
-2. **Push changes**:
- ```bash
- git push origin main
- ```
-
-3. **Documentation will be available at**:
- - `https://vllm-omni.readthedocs.io`
-
-The GitHub Actions workflow (`.github/workflows/docs.yml`) will automatically:
-- Build the documentation when you push to `main` branch
-- Deploy it to GitHub Pages
-- Update the documentation whenever you make changes
-
-
-### Read the Docs (Alternative)
-
-You can also use Read the Docs for hosting:
-
-1. Sign up at https://readthedocs.org/
-2. Import the `vllm-project/vllm-omni` repository
-3. Read the Docs will automatically build using `.readthedocs.yml`
-4. Documentation will be available at: `https://vllm-omni.readthedocs.io/`
-
-## Configuration
-
-The documentation configuration is in `mkdocs.yml` at the project root.
-
-## Tips
-
-- **API Documentation**: API docs are automatically generated using `mkdocs-api-autonav` and `mkdocstrings`
- - No need to manually create API pages - they're generated automatically
- - Use `[module.name.ClassName][]` syntax for cross-references in Summary pages
-- **Code Snippets**: Use `--8<-- "path/to/file.py"` for including code snippets
-- **Markdown**: Use Markdown for all documentation (no need for RST)
-- **Material Theme**: Use Material theme features like:
- - Admonitions: `!!! note`, `!!! warning`, etc.
- - Code blocks with syntax highlighting
- - Tabs for organizing content
- - Math formulas using `pymdownx.arithmatex`
-
-## Troubleshooting
-
-### Documentation not updating
-
-- Make sure you've saved all files
-- If using `mkdocs serve`, it should auto-reload
-- Check for syntax errors in `mkdocs.yml`
-
-### API links not working
-
-- Ensure class names match exactly (case-sensitive)
-- Check that the module is imported correctly
-- Run `mkdocs build --strict` to check for errors
-
-### Build errors
-
-- Check Python version (requires 3.9+)
-- Ensure all dependencies are installed: `pip install -e ".[docs]"`
-- Check `mkdocs.yml` syntax with `mkdocs build --strict`
diff --git a/docs/contributing/README.md b/docs/contributing/README.md
deleted file mode 100644
index 8a5bcfff0ad..00000000000
--- a/docs/contributing/README.md
+++ /dev/null
@@ -1,151 +0,0 @@
-# Contributing to vLLM-Omni
-
-Thank you for your interest in contributing to vLLM-Omni! This document provides guidelines and instructions for contributing.
-
-!!! note
- We host weekly developer-facing online meetings to discuss milestones and updates **every Tuesday at 19:30 PDT**. The meeting link and past meeting notes can be found [here](https://tinyurl.com/vllm-omni-meeting).
-
-## Getting Started
-
-vLLM-Omni uses `uv` to create and manage Python environments. Please follow the `uv` documentation to install it. After installing `uv`, you can create a new Python environment using the following commands:
-
-```bash
-uv venv --python 3.12 --seed
-source .venv/bin/activate
-```
-
-### Development Environment for vLLM and vLLM-Omni
-
-vLLM-Omni is evolving quickly; please see the [installation guide](../getting_started/installation/README.md) for details. We recommend building from source to get the latest development environment.
-
-!!! tip
- vLLM-Omni is compatible with Python versions 3.10 to 3.12. However, we recommend developing with Python 3.12 to minimize the chance of your local environment clashing with our CI environment.
-
-### Adding a new model to vLLM-Omni
-
-Please check [model implementation](model/README.md) for how to add diffusion and omni-modality models to vLLM-Omni.
-
-### Linting
-
-vLLM-Omni uses `pre-commit` to lint and format the codebase. See [pre-commit documentation](https://pre-commit.com/#usage) if `pre-commit` is new to you. Setting up `pre-commit` is as easy as:
-
-```bash
-uv pip install pre-commit
-pre-commit install
-```
-
-vLLM-Omni's `pre-commit` hooks will now run automatically every time you commit.
-
-!!! tip
- You can manually run the `pre-commit` hooks using:
-
- ```bash
- pre-commit run # runs on staged files
- pre-commit run --show-diff-on-failure --color=always --all-files # runs on all files
- ```
-
-### Documentation
-
-MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, `mkdocs.yml`.
-
-Get started with:
-
-```bash
-uv pip install -e ".[docs]"
-```
-
-MkDocs comes with a built-in dev-server that lets you preview your documentation as you work on it. From the root of the repository, run:
-
-```bash
-mkdocs serve # with API ref (~10 minutes)
-API_AUTONAV_EXCLUDE=vllm_omni mkdocs serve # API ref off (~15 seconds)
-```
-
-Once you see `Serving on http://127.0.0.1:8000/` in the logs, the live preview is ready. Open that URL in your browser to see it.
-
-For additional features and advanced configurations, refer to:
-
-- [MkDocs documentation](https://www.mkdocs.org/)
-- [Material for MkDocs documentation](https://squidfunk.github.io/mkdocs-material/) (the MkDocs theme we use)
-
-### Testing
-
-vLLM-Omni uses `pytest` to test the codebase.
-Please refer to the [test instructions](./ci/test_guide.md) for detailed testing information.
-
-!!! warning
- Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU platform to run unit tests locally, rely on the continuous integration system to run the tests for now.
-
-## Issues
-
-If you encounter a bug or have a feature request, please search existing issues first to see if it has already been reported. If not, please file a new issue, providing as much relevant information as possible.
-
-!!! important
- If you discover a security vulnerability, please report it by creating a GitHub issue with the `security` label.
-
-## Pull Requests & Code Reviews
-
-Thank you for your contribution to vLLM-Omni! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM-Omni maintain the code quality and improve the efficiency of the review process.
-
-### DCO and Signed-off-by
-
-When contributing changes to this project, you must agree to the [DCO](https://developercertificate.org/). Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.
-
-Using `-s` with `git commit` will automatically add this header.
-
-!!! tip
- You can enable automatic sign-off via your IDE:
-
- - **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window. It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`.
- - **VSCode**: Open the Settings editor and enable the `Git: Always Sign Off` (`git.alwaysSignOff`) field.
-
-### PR Title and Classification
-
-Only specific types of PRs will be reviewed. Prefix the PR title appropriately to indicate the type of change. Please use one of the following:
-
-- `[Bugfix]` for bug fixes.
-- `[CI/Build]` for build or continuous integration improvements.
-- `[Doc]` for documentation fixes and improvements.
-- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
-- `[Frontend]` for changes to the vLLM-Omni frontend (e.g., OpenAI API server, `Omni`/`AsyncOmni`, etc.).
-- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
-- `[Core]` for changes in the core vLLM-Omni logic (e.g., `OmniProcessor`, `OmniARScheduler`, etc.).
-- `[Hardware][Vendor]` for hardware-specific changes. The vendor name should appear in the prefix, such as `[Ascend]` for Ascend NPUs.
-- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.
-
-!!! note
- If the PR spans more than one category, please include all relevant prefixes.
-
-### Local Test
-Please run the L1 and L2 test cases locally first and attach the results before contacting us to add the "ready" label. Please refer to the [test instructions](./ci/test_guide.md) for running the test cases.
-
-### Code Quality
-
-The PR needs to meet the following code quality standards:
-
-- Adhere to the Google Python style guide and the Google C++ style guide.
-- Pass all linter checks.
-- Document the code well so that future contributors can easily understand it.
-- Include sufficient tests, both unit and integration, to keep the project correct and robust.
-- Add documentation to `docs/` if the PR modifies user-facing behavior of vLLM-Omni, so that users can understand and use the new features or changes.
-
-### Notes for Large Changes
-
-Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with `rfc-required` and might not review it.
-
-### What to Expect for the Reviews
-
-The goal of the vLLM-Omni team is to be a _transparent reviewing machine_. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM-Omni team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
-
-- After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
-- After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM-Omni team.
-- After the review, the reviewer will put an `action-required` label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
-- Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.
-
-## Additional Resources
-
-- [Design Documents](../design/index.md) - Architecture and design documentation
-
-## Thank You
-
-Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM-Omni. All of your contributions help make vLLM-Omni a great tool and community for everyone!
diff --git a/docs/contributing/ci/.nav.yaml b/docs/contributing/ci/.nav.yaml
deleted file mode 100644
index 0f187f3a15d..00000000000
--- a/docs/contributing/ci/.nav.yaml
+++ /dev/null
@@ -1,6 +0,0 @@
-nav:
- - CI_5levels.md
- - failures.md
- - test_guide.md
- - test_markers.md
- - test_style.md
diff --git a/docs/contributing/ci/CI_5levels.md b/docs/contributing/ci/CI_5levels.md
deleted file mode 100644
index 6c451936233..00000000000
--- a/docs/contributing/ci/CI_5levels.md
+++ /dev/null
@@ -1,725 +0,0 @@
-# Multi-Level Automated Testing System Documentation
-
-## Document Overview
-
-This testing system aims to build a complete, efficient, and well-structured quality assurance framework for the development, integration, and release of model services. It draws on the concept of the test pyramid from modern software engineering, progressively expanding testing activities from basic code logic verification to complex end-to-end (E2E) functionality, performance, accuracy, and even long-term stability validation.
-
-Through five levels (L1-L5) and a set of common specifications, the system clarifies the testing objectives, scope, execution frequency, and required resources for different development stages (e.g., each commit, PR merge, daily build, pre-release). This ensures that models meet high standards for functionality, performance, and reliability across deployment scenarios (online serving and offline inference).
-
-
-Chapter 4 covers L5 (Purpose, Test Content, Directory Location, Example); trigger: Weekly / Days before Release; environment: GPU.
-
----
-
-Folder structure of the test files under the five-level design.
-
-Legend: `✅` = test exists, `⬜` = suggested to add.
-```
-vllm_omni/                                    tests/
-├── config/                                 → ├── config/
-│   ├── model.py                              │   └── test_model.py ⬜
-│   └── lora.py                               │   └── test_lora.py ⬜
-│
-├── core/                                   → ├── core/
-│   └── sched/                                │   └── sched/
-│       ├── omni_ar_scheduler.py              │       ├── test_omni_ar_scheduler.py ⬜
-│       ├── omni_generation_scheduler.py      │       ├── test_omni_generation_scheduler.py ⬜
-│       └── output.py                         │       └── test_output.py ✅ currently in entrypoints/test_omni_new_request_data.py (tests output.OmniNewRequestData)
-│
-├── diffusion/                              → ├── diffusion/
-│   ├── diffusion_engine.py                   │   ├── test_diffusion_engine.py ⬜
-│   ├── attention/                            │   ├── attention/
-│   │   ├── layer.py                          │   │   ├── test_attention_sp.py ✅
-│   │   └── backends/                         │   │   └── test_flash_attn.py ✅
-│   ├── distributed/                          │   ├── distributed/
-│   │   └── ...                               │   │   ├── test_comm.py ✅
-│   │                                         │   │   ├── test_cfg_parallel.py ✅
-│   │                                         │   │   └── test_sp_plan_hooks.py ✅
-│   ├── lora/                                 │   ├── lora/
-│   │   └── ...                               │   │   ├── test_base_linear.py ✅
-│   │                                         │   │   └── test_lora_manager.py ✅
-│   ├── models/                               │   ├── models/
-│   │   ├── qwen_image/                       │   │   ├── qwen_image/ (e2e coverage)
-│   │   ├── z_image/                          │   │   └── z_image/
-│   │   └── ...                               │   │       └── test_zimage_tp_constraints.py ✅
-│   └── worker/                               │   └── worker/
-│       ├── diffusion_worker.py               │       └── test_diffusion_worker.py ✅ file at diffusion/test_diffusion_worker.py
-│       └── diffusion_model_runner.py         │
-│
-├── distributed/                            → ├── distributed/
-│   └── omni_connectors/                      │   └── omni_connectors/
-│       ├── adapter.py                        │       ├── test_adapter_and_flow.py ✅
-│       ├── kv_transfer_manager.py            │       ├── test_basic_connectors.py ✅
-│       ├── connectors/                       │       ├── test_kv_flow.py ✅
-│       └── utils/                            │       └── test_omni_connector_configs.py ✅
-│
-├── engine/                                 → ├── engine/
-│   ├── input_processor.py                    │   ├── test_input_processor.py ⬜ (no processor.py in source)
-│   ├── output_processor.py                   │   └── test_output_processor.py ⬜
-│   └── arg_utils.py                          │   └── test_arg_utils.py ⬜
-│
-├── entrypoints/                            → ├── entrypoints/
-│   ├── stage_utils.py                        │   ├── test_stage_utils.py ✅
-│   ├── cli/                                  │   ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
-│   │   └── ...                               │   │   └── test_*.py ⬜
-│   └── openai/                               │   └── openai_api/ # maps to entrypoints/openai/
-│       ├── api_server.py                     │       ├── test_api_server.py ⬜ (e2e indirect coverage)
-│       ├── serving_chat.py                   │       ├── test_serving_chat_sampling_params.py ✅
-│       ├── serving_speech.py                 │       ├── test_serving_speech.py ✅
-│       └── image_api_utils.py                │       └── test_image_server.py ✅
-│
-├── inputs/                                 → ├── inputs/
-│   ├── data.py                               │   ├── test_data.py ⬜
-│   ├── parse.py                              │   ├── test_parse.py ⬜
-│   └── preprocess.py                         │   └── test_preprocess.py ✅ currently in entrypoints/test_omni_input_preprocessor.py
-│
-├── model_executor/                         → ├── model_executor/
-│   ├── layers/                               │   ├── layers/
-│   │   └── mrope.py                          │   │   └── test_mrope.py ⬜
-│   ├── model_loader/                         │   ├── model_loader/
-│   │   └── weight_utils.py                   │   │   └── test_weight_utils.py ⬜
-│   ├── models/                               │   ├── models/
-│   │   ├── qwen2_5_omni/                     │   │   ├── qwen2_5_omni/
-│   │   │   ├── qwen2_5_omni_thinker.py       │   │   │   ├── test_audio_length.py ✅
-│   │   │   ├── qwen2_5_omni_talker.py        │   │   │   ├── test_qwen2_5_omni_thinker.py ⬜
-│   │   │   └── qwen2_5_omni_token2wav.py     │   │   │   ├── test_qwen2_5_omni_talker.py ⬜
-│   │   └── qwen3_omni/                       │   │   │   └── test_qwen2_5_omni_token2wav.py ⬜
-│   │       └── ...                           │   │   └── qwen3_omni/
-│   ├── stage_configs/                        │   │       └── test_*.py ⬜
-│   │   └── *.yaml                            │   └── stage_configs/ (used by e2e, test_*.py can be added) ⬜
-│   └── stage_input_processors/               │   └── stage_input_processors/
-│       └── ...                               │       └── test_*.py ⬜
-│
-├── sample/                                 → ├── sample/
-│   └── __init__.py                           │   └── test_*.py ⬜
-│
-├── utils/                                  → ├── utils/
-│   └── __init__.py                           │   └── test_*.py ⬜ (no platform_utils.py currently)
-│
-├── worker/                                 → ├── worker/
-│   ├── gpu_ar_model_runner.py                │   ├── test_gpu_ar_model_runner.py ⬜
-│   ├── gpu_ar_worker.py                      │   ├── test_gpu_ar_worker.py ⬜
-│   ├── gpu_generation_model_runner.py        │   ├── test_gpu_generation_model_runner.py ✅
-│   ├── gpu_generation_worker.py              │   ├── test_gpu_generation_worker.py ⬜
-│   ├── gpu_model_runner.py                   │   ├── test_omni_gpu_model_runner.py ✅
-│   └── mixins.py                             │   └── (npu under platforms/npu/worker/) # not worker/npu/
-│
-├── platforms/                              → (no tests/platforms/, e2e and stage_configs provide indirect coverage)
-│   ├── cuda/
-│   ├── npu/worker/  # NPU worker here, not vllm_omni/worker/npu/
-│   ├── rocm/
-│   └── xpu/worker/
-│
-├── outputs.py                              → test_outputs.py ✅ (at tests root)
-├── (logger, patch, request, version)       → (no corresponding unit test)
-│
-└── e2e (tests side only)                   → ├── e2e/
-                                                  ├── online_serving/ ✅ non-empty
-                                                  │   ├── test_qwen2_5_omni.py
-                                                  │   ├── test_async_omni.py
-                                                  │   ├── test_qwen3_omni.py
-                                                  │   ├── test_qwen3_omni_expansion.py
-                                                  │   ├── test_mimo_audio.py
-                                                  │   ├── test_image_gen_edit.py
-                                                  │   └── test_images_generations_lora.py
-                                                  └── offline_inference/ ✅
-                                                      ├── test_qwen2_5_omni.py
-                                                      ├── test_qwen3_omni.py
-                                                      ├── test_bagel_text2img.py
-                                                      ├── test_t2i_model.py
-                                                      ├── test_t2v_model.py
-                                                      ├── test_ovis_image.py
-                                                      ├── test_zimage_tensor_parallel.py
-                                                      ├── test_cache_dit.py
-                                                      ├── test_teacache.py
-                                                      ├── test_stable_audio_expansion.py
-                                                      ├── test_diffusion_cpu_offload.py
-                                                      ├── test_diffusion_layerwise_offload.py
-                                                      ├── test_diffusion_lora.py
-                                                      ├── test_sequence_parallel.py
-                                                      └── stage_configs/          (legacy schema, still
-                                                          ├── bagel_*.yaml        present for unmigrated
-                                                          └── npu/, rocm/, etc.   models)
-
-# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under
-# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md.
-```
-
-
-
-
-
-## Common Specifications
-
-Before entering specific testing levels, the project establishes two common specifications aimed at standardizing the development process and quickly locating issues.
-
-1. ***PR Checklist ([Tests Style](../ci/test_style.md))***: This template defines the self-check items that must be completed before submitting a code review (Pull Request). It ensures that each code change meets basic requirements such as code style, dependency updates, and documentation synchronization before entering the automated testing pipeline, serving as the first manual line of defense for quality assurance.
-2. ***CI Failure Explanation ([CI Failures](../ci/failures.md))***: This document archives and explains common failure patterns in the Continuous Integration (CI) pipeline, error log interpretation, and preliminary troubleshooting steps. It helps developers and testers quickly diagnose the causes of automated test failures, improving problem-solving efficiency.
-
-## Chapter 1: L1 & L2 Level Testing - Unit Testing and Basic End-to-End Verification
-
-### 1.1 Testing Purpose
-
-L1 and L2 level testing form the foundation of the quality assurance system. L1 level testing focuses on verifying the internal logic correctness of code units (e.g., functions, classes), ensuring each independent component behaves as designed.
-
-L2 level testing builds upon L1 by introducing GPU resources and verifying that the end-to-end (E2E) process of the model in basic deployment scenarios is smooth. For example, it uses dummy models to confirm that core interfaces like the inference pipeline, output format, and streaming response work properly. The common goal of these two levels is to provide developers with rapid feedback, discovering and fixing issues early in the development cycle.
-
-
-
-### 1.2 Testing Content and Scope
-
-- ***L1 (Unit & Logic Testing)***:
-    - ***Scope***: Tests internal functions and methods of core components such as `entrypoints`, `models`.
-    - ***Focus***: Branch coverage, exception handling, algorithm logic correctness. Does not involve external dependencies or the complete service stack.
-    - ***Time Cost***: Execution time is kept within ***15 minutes*** to ensure fast feedback.
-- ***L2 (Basic End-to-End Testing)***:
-    - ***Scope***: Covers two basic deployment scenarios: `online` (serving) and `offline` (inference).
-    - ***Focus***: Uses `dummy` models or lightweight real models to verify that the entire chain from request input to result output works normally, including output data structure, streaming (stream) support, etc. Also includes some unit tests that require launching independent service instances.
-    - ***Characteristic***: Requires ***GPU*** resources to perform model computations.
-
-### 1.3 Test Directory and Execution Files
-
-A clear directory structure is key to managing test cases efficiently.
-
-- ***L1 Test Directory***: `/tests/{component_name}/test_xxx.py`
-    - Here, `{component_name}` corresponds to modules in the source code, such as `distributed`, `entrypoints`, etc., and `test_xxx.py` is the specific test file.
-- ***L2 Test Directory***:
-    - Online Serving: `/tests/e2e/online_serving/test_{model_name}.py`
-    - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}.py`
-
-### 1.4 Execution Method and Example
-
-- ***Trigger Timing***: **`PR with ready label`**. That is, when a developer adds a "ready for review" or similar label to a PR on platforms like GitHub, L1 and L2 tests are automatically triggered.
-- ***Execution Environment***: L1 uses ***CPU*** environment; L2 requires ***GPU*** environment.
-- ***Script Example***:
-
-
- L1 Test Examples
-
-Examples from `tests/model_executor/models/qwen2_5_omni/test_audio_length.py`
-```python
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import pytest
-
-pytestmark = [pytest.mark.core_model, pytest.mark.cpu]
-
-def test_resolve_max_mel_frames_default():
-    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames
-
-    assert resolve_max_mel_frames(None, default=30000) == 30000
-    assert resolve_max_mel_frames(None, default=6000) == 6000
-
-
-def test_resolve_max_mel_frames_explicit():
-    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames
-
-    # Explicit argument always wins over default
-    assert resolve_max_mel_frames(123, default=30000) == 123
-    assert resolve_max_mel_frames(6000, default=30000) == 6000
-    assert resolve_max_mel_frames(0, default=30000) == 0
-
-
-@pytest.mark.parametrize("repeats", [2, 4])
-@pytest.mark.parametrize("code_len", [0, 1, 32768])
-@pytest.mark.parametrize("max_mel_frames", [None, -1, 0, 1, 6000, 30000])
-def test_cap_and_align_mel_length_no_mismatch(repeats, code_len, max_mel_frames):
-    """Guard that any max_mel_frames yields a mel length aligned to repeats, and
-    consistent with the truncated code length (prevents concat mismatch).
-    """
-    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import cap_and_align_mel_length
-
-    target_code_len, target_mel_len = cap_and_align_mel_length(
-        code_len=code_len,
-        repeats=repeats,
-        max_mel_frames=max_mel_frames,
-    )
-
-    assert isinstance(target_code_len, int)
-    assert isinstance(target_mel_len, int)
-
-    if code_len == 0:
-        assert target_code_len == 0
-        assert target_mel_len == 0
-        return
-
-    assert target_code_len >= 1
-    assert target_mel_len >= repeats
-    assert target_mel_len % repeats == 0
-    assert target_mel_len == target_code_len * repeats
-    assert target_code_len <= code_len
-
-    if max_mel_frames is not None and int(max_mel_frames) > 0 and int(max_mel_frames) >= repeats:
-        assert target_mel_len <= int(max_mel_frames)
-```
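
The invariants asserted by the parametrized test above are easier to read next to a function that satisfies them. The following is only a minimal sketch under stated assumptions: `cap_and_align_sketch` is a hypothetical stand-in, not the real `cap_and_align_mel_length` from `vllm_omni`, and its rule that `None`, `0`, and negative caps mean "no cap" is an assumption chosen so that the assertions hold.

```python
from typing import Optional, Tuple


def cap_and_align_sketch(
    code_len: int, repeats: int, max_mel_frames: Optional[int]
) -> Tuple[int, int]:
    """Hypothetical stand-in; the real implementation may differ."""
    if code_len <= 0:
        return 0, 0

    target_code_len = code_len
    # Assumption: only a positive cap is honoured; None, 0 and negatives mean "no cap".
    if max_mel_frames is not None and int(max_mel_frames) > 0:
        # Keep at least one code so the mel length never drops below `repeats`.
        max_codes = max(1, int(max_mel_frames) // repeats)
        target_code_len = min(code_len, max_codes)

    # The mel length is always an exact multiple of `repeats`.
    return target_code_len, target_code_len * repeats
```

Because the mel length is derived from the (possibly truncated) code length, the two values cannot drift apart, which is the mismatch the test guards against.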
-
-
-
- L2 Test Examples
-You can refer to Test Examples in Chapter 2 to see example test cases that incorporate both L2 and L3 testing logic.
-
-
-- ***Run Command***:
-    - `pytest -s -v /tests/e2e/online_serving/test_{model_name}.py`
-    - `pytest -s -v -m 'core_model and cpu' --run-level=core_model`
-
-## Chapter 2: L3 Level Testing - Core Integration, Performance, and Accuracy Verification
-
-### 2.1 Testing Purpose
-
-L3 level testing executes after code is merged into the main branch. Its core purpose is to verify the integration effect, key performance indicators, and output accuracy of ***real models*** in ***multiple deployment scenarios***. It acts as the "quality gatekeeper" for the main branch, ensuring that no merge breaks the core capabilities of the model service. Testing needs to provide clear conclusions within a relatively short time (<30min), balancing test depth with feedback speed.
-
-
-
-### 2.2 Testing Content and Scope
-
-- ***Deployment Scenarios***: Covers richer `online` and `offline` deployment configurations, which may include different hardware configurations, batch sizes, concurrency levels, etc.
-- ***Core Verification***:
-    1. ***Inference Functionality***: Ensures real models can perform forward computation normally and return results.
-    2. ***Accuracy Compliance***: Verifies that the model's evaluation metrics (e.g., accuracy) meet the expected baseline, preventing code changes from introducing accuracy issues.
-    3. ***Important Performance***: Verifies whether performance (e.g., P99 latency, throughput) in core scenarios meets preset thresholds.
-
-### 2.3 Test Directory and Execution Files
-
-- ***Functional Testing***:
-    - Online Serving: `/tests/e2e/online_serving/test_{model_name}_expansion.py`
-    - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}_expansion.py`
-    - (Note: `_expansion.py` likely means the file contains more comprehensive scenario cases than the L2 tests.)
-
-### 2.4 Execution Method and Example
-
-- ***Trigger Timing***: **`PR Merged`**. Automatically triggered after code review is approved and merged into the main branch.
-- ***Execution Environment***: ***GPU*** servers.
-- ***Script Example***:
-
-???+ example "Test Examples"
-
- **2.4.1 Mark Declaration Section**
-
- ```python
- @pytest.mark.advanced_model
- @pytest.mark.core_model
- @pytest.mark.parametrize("omni_server", test_params, indirect=True)
- ```
-
- **Explanation**:
-
- `@pytest.mark.advanced_model`: Marks the test as L3 merge level, indicating deep validation with real models.
-
- `@pytest.mark.full_model`: Marks L4 nightly-only suites (e.g. `test_*_expansion.py`, doc examples).
-
- @pytest.mark.core_model: Marks the test as L1 or L2 level, indicating that this test case validates the basic functionality of the core model. It uses mock weights and only checks if the relevant interface functions correctly.
-
- @pytest.mark.parametrize: A parameterization decorator that allows abstracting test data into parameters, enabling reuse of the same test logic across different data configurations. indirect=True indicates that parameters will be passed to the fixture for processing.
-
- **Notes**: If you believe the test case only needs to execute basic run logic at the PR-level CI, you can mark it only with @pytest.mark.core_model. If you believe it only needs to execute deep validation at merge (L3), use @pytest.mark.advanced_model. For L4 nightly-only expansion and doc-example tests, use @pytest.mark.full_model with `--run-level full_model`. If the test case needs both basic run and deep validation, mark with @pytest.mark.core_model and the appropriate L3/L4 marker (`advanced_model` and/or `full_model`).
-
- **2.4.2 Test Function Definition and Documentation**
-
- ```python
- def test_mix_to_text_audio_001(omni_server, openai_client) -> None:
-     """
-     Test multi-modal input processing and text/audio output generation via OpenAI API.
-     Deploy Setting: default yaml
-     Input Modal: text + audio + video + image
-     Output Modal: text + audio
-     Input Setting: stream=True
-     Datasets: single request
-     """
- ```
-
- **Explanation**:
-
- **Function Naming Convention**: Uses the test_ prefix, describes the test scenario mix_to_text_audio, and the number 001 indicates the first test case for this scenario.
-
- **Parameter Explanation**:
-
- omni_server: Omni server instance obtained via fixture, containing model information and configuration.
-
- openai_client: Unified OpenAI client processing instance, encapsulating request sending and response validation logic.
-
- Docstring: Describes the test purpose, deployment settings, input/output modalities, streaming settings, and dataset type in detail, providing clear context for test maintenance.
-
- **2.4.3 Multimodal Data Preparation**
-
- ```python
- video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
- image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
- audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"
- ```
-
- **Explanation**:
-
- **Data Generation Functions**: Use the generate_synthetic_* series of functions to generate synthetic test data, avoiding reliance on external resources and ensuring test reproducibility and stability.
-
- **Parameter Explanation**:
-
- Video: width, height, duration_frames
-
- Image: width, height
-
- Audio: duration_seconds, channels
-
- **2.4.4 Request Configuration and Keyword Validation**
-
- ```python
- request_config = {
-     "model": omni_server.model,
-     "messages": messages,
-     "stream": True,
-     "key_words": {
-         "audio": ["water", "cricket"],
-         "video": ["sphere", "globe", "circle", "round"],
-         "image": ["square", "quadrate"],
-         "text": ["beijing"],
-     },
- }
- ```
-
- **Explanation**:
-
- **Model Specification**: Uses omni_server.model to ensure the test aligns with the model configured on the server.
-
- **Keyword Validation Mechanism**: This is an innovative design of the template to address the specific needs of multimodal testing:
-
- Audio Keywords: Validate whether the generated text's description of audio content contains expected elements (e.g., "water" for water sounds, "cricket" for cricket sounds). If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
-
- **Video Keywords**: Validate whether the generated text's description of video content contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
-
- Image Keywords: Validate whether the generated text's description of image content contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
-
- Text Keywords: Validate whether the generated text contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
-
- **2.4.5 Request Execution**
-
- ```python
- openai_client.send_omni_request(request_config, request_num=1) # for omni-understanding models
- # or
- openai_client.send_diffusion_request(request_config, request_num=1) # for diffusion models
- ```
-
- **Explanation**:
-
- **Unified Client**: Uses the OpenAIClientHandler instance to send requests. This client encapsulates error handling, retry mechanisms, and response validation logic.
-
- **Single Request**: `request_num=1` sends a single request, matching the `Datasets: single request` entry in the docstring. For concurrent testing, it can be extended to multiple requests using `request_num=n`.
-
- **Implicit Validation**: The `send_omni_request` and `send_diffusion_request` methods internally include validation logic that is selected dynamically based on the `--run-level` parameter: `core_model` performs basic validation, while `advanced_model` and `full_model` perform deep validation.
-
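The "at least one keyword per modality" rule described in 2.4.4 can be pictured with a few lines of plain Python. This is a sketch only: `missing_keyword_groups` is a hypothetical helper, not part of the test client, and the case-insensitive substring match is an assumption rather than the client's documented behavior.

```python
from typing import Dict, List


def missing_keyword_groups(generated_text: str, key_words: Dict[str, List[str]]) -> List[str]:
    """Return the modalities for which none of the expected keywords appears."""
    text = generated_text.lower()  # assumption: matching ignores case
    return [
        modality
        for modality, candidates in key_words.items()
        if not any(word.lower() in text for word in candidates)
    ]


key_words = {"audio": ["water", "cricket"], "image": ["square", "quadrate"]}

# One hit per modality is enough, so nothing is reported as missing here.
print(missing_keyword_groups("I hear running water and see a red square.", key_words))

# Neither list matches, so both modalities are reported.
print(missing_keyword_groups("A red circle.", key_words))
```

Returning the list of failing modalities, rather than a single boolean, makes a failed assertion say which modality's description was off.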
-- ***Run Command (L3 merge)***: `pytest -s -v /tests/e2e/online_serving/test_{model_name}.py -m advanced_model --run-level=advanced_model`
-
-- ***Run Command (L4 nightly expansion)***: `pytest -s -v /tests/e2e/online_serving/test_{model_name}_expansion.py -m full_model --run-level=full_model`
-
-## Chapter 3: L4 Level Testing - Full Functionality, Performance, and Documentation Testing
-
-### 3.1 Testing Purpose
-
-L4 level testing is a comprehensive quality audit before a version release. It expands upon L3, executing ***full*** functional scenarios, conducting systematic ***performance stress tests***, and simultaneously verifying the correctness of accompanying ***example documentation***. Its purpose is to perform deep validation of the system during off-peak nighttime hours, providing quality trend reports for daytime development and data support for release decisions.
-
-
-
-### 3.2 Testing Content and Scope
-
-- ***Full Functionality Testing***: Executes all test cases defined in `test_{model_name}_expansion.py`, covering all implemented features, positive flows, boundary conditions, and exception handling.
-- ***Performance Testing***: Uses `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs in the form `tests/dfx/perf/tests/test_*_vllm_omni.json` (passed to `run_benchmark.py` via `--test-config-file`) to drive performance testing tools for stress, load, and endurance tests, collecting metrics like throughput, response time, and resource utilization.
-- ***Documentation Testing***: Verifies whether the example code provided to users is runnable and its results match the description.
-
-### 3.3 Test Directory and Execution Files
-
-- ***Functional Testing***: Same directories as L3.
-- ***Performance Test Configuration***: `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs `tests/dfx/perf/tests/test_*_vllm_omni.json` (e.g. `test_qwen_image_vllm_omni.json`)
-- ***Documentation Example Tests***:
-    - `tests/example/online_serving/test_{model_name}.py`
-    - `tests/example/offline_inference/test_{model_name}.py`
-
-### 3.4 Execution Method and Example
-
-- ***Trigger Timing***: **`Nightly`**, automatically executed every night.
-- ***Execution Environment***: ***GPU*** server clusters to meet the resource demands of performance testing.
-- ***Script Example***:
-
-??? example "Test Examples: Documentation Example Tests"
-
- --8<-- "docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md"
-
-??? example "Test Examples: Performance Tests"
-
- --8<-- "docs/contributing/ci/test_examples/l4_performance_tests.inc.md"
-
-??? example "Test Examples: Functionality Tests"
-
- --8<-- "docs/contributing/ci/test_examples/l4_functionality_tests.inc.md"
-
-- ***Run Command***: (Specific commands would depend on the performance testing tool and configuration defined in `nightly.json`).
-
-## Chapter 4: L5 Level Testing - Stability and Reliability Testing
-
-### 4.1 Testing Purpose
-
-L5 level testing focuses on the performance of model services under ***long-running*** and ***abnormal fault*** scenarios. It aims to uncover deep-seated issues that only manifest under sustained pressure or extreme conditions, such as memory leaks, resource contention, gradual performance degradation, and lack of fault tolerance mechanisms. This is the final, yet crucial, line of defense for ensuring service high availability and production environment robustness.
-
-
-
-### 4.2 Testing Content and Scope
-
-- ***Long-term Stability (Stability) Testing***: Uses JSON under `tests/dfx/stability/tests/` (for example `test_qwen3_omni.json` and `test_wan22.json`) to run the service under moderate load for an extended period (e.g., over 12 hours), monitoring whether metrics like memory/VRAM usage, response time, and throughput degrade over time, and whether the service process remains stable.
-- ***Reliability Testing***: Uses pytest suites under `tests/dfx/reliability/` to inject controlled faults against a **live** `vllm_omni serve` instance (same **`omni_server` / `omni_server_function`** fixture style as E2E). Current suites emphasize **GPU memory pressure** (CUDA sidecar “memory hog”), **worker / runtime process kill** (`SIGKILL` on `VLLM::Worker` for Qwen3-Omni or `multiprocessing.spawn` for Wan2.2 video workers), **large multimodal chat** or **`/v1/videos`** jobs under OOM, **`/health` → 503** and **fast-fail / non-hanging concurrent** requests after kill, and **OpenAI-style 5xx error contracts** (e.g. text vs text+audio under OOM). **Post-fault recovery** checks exist where enabled (some cases may be `skip` while issues are tracked). See the Reliability test suite block in Section 4.4 for file-level responsibilities and CI markers (`slow`, `hardware_test`, POSIX-only kill).
-
-### 4.3 Test Directory and Execution Files
-
-- ***Stability Test Configuration***: `tests/dfx/stability/tests/test_qwen3_omni.json`, `tests/dfx/stability/tests/test_wan22.json` (one JSON per model / runner family)
-- ***Reliability Test Suite*** (`tests/dfx/reliability/`):
-    - `test_reliability_qwen3_omni.py`: Qwen3-Omni chat / multimodal reliability (GPU OOM, process kill, recovery, error contract under `--async-chunk` vs default).
-    - `test_reliability_wan22.py`: Wan2.2 T2V video API reliability (`/v1/videos` under OOM and process kill, recovery).
-    - `helpers.py`: Shared primitives used by current suites: raw HTTP probes for `/v1/chat/completions` and `/health`, OpenAI-style error parsing, GPU OOM sidecar (`inject_gpu_oom` / `stop_gpu_oom_hogs`), and `pgrep`-based process-kill injector construction (`make_process_kill_fault_injector`).
-    - `conftest.py`: `fault_injector` and `omni_server_after_fault` / `omni_server_after_fault_function` fixtures to run a callable **after** the server is ready.
-    - `README.md`: Short local run commands for this directory.
-
-### 4.4 Execution Method and Example
-
-- ***Trigger Timing***: **`Weekly`** or **`Days before Release`** (several days before a major release). Because execution times are long, these tests run less frequently than the other levels.
-- ***Execution Environment***: ***GPU*** servers, requiring a stable and exclusive testing environment.
-- ***Script Example***:
-
- Test Examples
-
-When you want to add L5-level stability test cases, add or extend the appropriate JSON file under `tests/dfx/stability/tests/` (for example `test_qwen3_omni.json` for Omni bench traffic, or `test_wan22.json` for diffusion `/v1/videos` workloads). The following illustrates the Qwen3-Omni shape:
-
-```json
-{
-  "test_name": "test_qwen3_omni_stability",
-  "server_params": {
-    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
-    "stage_config_name": "qwen3_omni.yaml"
-  },
-  "benchmark_params": [
-    {
-      "dataset_name": "random",
-      "backend": "openai-chat-omni",
-      "endpoint": "/v1/chat/completions",
-      "duration_sec": 43200,
-      "request_rate": 0.5,
-      "num_prompts_per_batch": 20,
-      "random_input_len": 2500,
-      "random_output_len": 900,
-      "ignore_eos": true,
-      "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration"
-    }
-  ]
-}
-```
-
-#### Parameter Explanation
-
-***Overview***
-
-| Field | Required | Description |
-| ---------------- | -------- | --------------------------------------------------------------------------- |
-| test_name | Yes | Unique identifier for the stability test case |
-| server_params | Yes | Server-side configuration parameters (model, stage configuration, etc.) |
-| benchmark_params | Yes | Stability benchmark running parameters (supports multiple configurations) |
-
-#### server_params Configuration
-
-##### Basic Parameters
-
-| Parameter | Required | Example | Description |
-| ----------------- | -------- | ---------------------------------- | ----------------------------------- |
-| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path |
-| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name |
-
-##### Dynamic Configuration (update/delete)
-
-Supports incremental modifications based on the basic configuration:
-
-| Operation | Description |
-| --------- | ------------------------------------ |
-| update | Update or add configuration items |
-| delete | Delete specified configuration items |
-
-***Example***:
-You can refer to Test Examples in Chapter 3.4
-
-#### benchmark_params Configuration
-
-For stability testing, the key parameters are:
-
-- **duration_sec**: Total duration (in seconds) during which benchmark traffic is sent. The stability benchmark will keep sending batches until this duration is reached.
-- **request_rate** / **max_concurrency**: Exactly one of them must be specified. They control how the traffic is generated for each batch:
-    - `request_rate`: Number of requests per second. The benchmark will send `num_prompts_per_batch` requests at the given rate.
-    - `max_concurrency`: Maximum number of concurrent requests. When this is used, `request_rate` is set to `inf` internally.
-- **num_prompts_per_batch**: Number of prompts sent in each batch. Multiple batches will be executed sequentially within `duration_sec`.
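
The interplay of these three parameters can be checked with a little arithmetic. The sketch below is not part of the benchmark tooling: `rate_mode` and `max_batches` are hypothetical helpers, and the batch estimate counts only the time needed to issue the requests at `request_rate`, so it is an upper bound under that assumption.

```python
from typing import Any, Dict


def rate_mode(entry: Dict[str, Any]) -> str:
    """Enforce that exactly one of request_rate / max_concurrency is present."""
    has_rate = "request_rate" in entry
    has_concurrency = "max_concurrency" in entry
    if has_rate == has_concurrency:
        raise ValueError("specify exactly one of request_rate or max_concurrency")
    return "request_rate" if has_rate else "max_concurrency"


def max_batches(entry: Dict[str, Any]) -> float:
    """Upper bound on the number of sequential batches within duration_sec."""
    # Time to issue one batch at the configured rate (response time is ignored).
    issue_time_sec = entry["num_prompts_per_batch"] / entry["request_rate"]
    return entry["duration_sec"] / issue_time_sec


# Values taken from the Qwen3-Omni stability example above.
entry = {"duration_sec": 43200, "request_rate": 0.5, "num_prompts_per_batch": 20}
print(rate_mode(entry))    # request_rate
print(max_batches(entry))  # 1080.0
```

With the example values, one batch of 20 prompts takes at least 40 seconds to issue, so a 12-hour run fits at most about 1080 batches; the real count will be lower once response times are included.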
-
-All other optional parameters follow the same rules as the performance test configuration in Chapter 3.4.
-
-
-
-
- Reliability test suite (tests/dfx/reliability)
-
-#### Purpose and relationship to stability
-
-Reliability tests are **short fault-injection** integration runs (L5 **(b)** in `tests/dfx/reliability/README.md`). They complement **stability** JSON-driven long runs: instead of hours of steady traffic, they **perturb** the server (GPU OOM sidecar, fatal signals on selected processes) and check **failure mode** and **latency bounds** (e.g. chat or `/v1/videos` must not hang under concurrent fault-time load).
-
-#### Directory layout
-
-| Path | Responsibility |
-| ---- | -------------- |
-| `helpers.py` | Shared helpers used by current reliability suites: raw `POST`/`GET` probes (`/v1/chat/completions`, `/health`), OpenAI error parsing (`extract_openai_error_contract_from_bytes`), GPU OOM sidecar lifecycle (`inject_gpu_oom`, `stop_gpu_oom_hogs`), and process-kill injector builder (`make_process_kill_fault_injector`). |
-| `conftest.py` | Pytest fixtures: indirect `fault_injector`, `omni_server_after_fault` / `omni_server_after_fault_function` (run injector after server is ready, then yield server). |
-| `test_reliability_qwen3_omni.py` | Qwen3-Omni: OOM vs **text vs text+audio** error contract, large multimodal chat under OOM, concurrent pressure, **SIGKILL** on `VLLM::Worker`, `/health` → 503 + fast-fail + concurrent chat; optional OOM recovery scenario (may be skipped while tracked in issues). |
-| `test_reliability_wan22.py` | Wan2.2 T2V: large `/v1/videos` under OOM, **SIGKILL** on `multiprocessing.spawn` chain, health / fast-fail / concurrent video requests; optional recovery test (may be skipped). |
-| `README.md` | Minimal run / collect examples. |
-
-#### Parametrization and markers
-
-- Each test module defines a **`RELIABILITY_SCENARIOS`** list (`test_name`, `server_params`: model, `stage_config_name` or diffusion `server_args`, etc.). **`create_reliability_omni_server_params()`** in `tests/dfx/conftest.py` resolves stage paths (including XPU substitutions where applicable) and builds **`OmniServerParams`** lists consumed by **`@pytest.mark.parametrize(..., indirect=True)`** on `omni_server` or `omni_server_function`.
-- Cases are tagged **`@pytest.mark.slow`** for weekly / selective CI. GPU-heavy suites use **`@hardware_test(res={"cuda": "H100"}, num_cards=...)`** (Qwen3-Omni paths often require **2** cards; Wan2.2 video paths **1** card).
-- **Process-kill** tests use **`@pytest.mark.skipif(os.name == "nt", ...)`** because injection uses POSIX **`pgrep` / `kill`**.
-
-#### CI trigger
-
-Weekly Buildkite (`.buildkite/test-weekly.yml`) runs, for example:
-
-```bash
-pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m "slow"
-pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m "slow"
-```
-
-#### Local commands
-
-```bash
-pytest --collect-only tests/dfx/reliability
-pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow
-pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow
-```
-
-#### Adding a new model suite
-
-1. Add `test_reliability_.py` under `tests/dfx/reliability/`.
-2. Define **`RELIABILITY_SCENARIOS`** and pass them through **`create_reliability_omni_server_params()`** with the correct deploy or e2e stage-config directory (same pattern as existing files).
-3. Reuse **`helpers`** for OOM / kill / raw HTTP; prefer **`assert_fault_exception()`** and **`resolve_oom_device_spec()`** from `tests/dfx/conftest.py` for consistent device selection vs stage YAML.
-4. Register **`slow`** (and **`hardware_test`** if needed); extend **`.buildkite/test-weekly.yml`** when the suite should run in weekly L5.
-
-
-
-- - ***Stability***: `pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_omni.py` or `pytest -s -v tests/dfx/stability/scripts/test_stability_wan22.py` (or add `test_stability_.py` alongside a matching JSON config)
- - ***Reliability***: `pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow` and/or `pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow` (add `test_reliability_.py` for new models)
-
-## Summary
-
-This multi-level testing system achieves continuous, progressive validation of model service quality by tightly integrating testing activities with the development workflow (commit, review, merge, release). From rapid unit testing to comprehensive end-to-end testing, and further to in-depth performance, stability, and reliability verification, each level has clear objectives, collectively building a robust quality protection net. By following this system, teams can deliver high-quality, highly reliable model services more efficiently.
diff --git a/docs/contributing/ci/failures.md b/docs/contributing/ci/failures.md
deleted file mode 100644
index 68e468be1df..00000000000
--- a/docs/contributing/ci/failures.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# CI Failures
-
-## Overview of CI Checks
-
-When you open a PR against vLLM-Omni, several CI checks run automatically:
-
-| Check | Platform | What it does |
-| ----- | -------- | ------------ |
-| **pre-commit** | GitHub Actions | Runs linting (Ruff), formatting, spell-checking (typos), and YAML validation. |
-| **Build Wheel** | GitHub Actions | Builds Python wheels for Python 3.11 and 3.12 on Ubuntu. Skipped for docs-only or Markdown-only changes (controlled by `paths-ignore` in the workflow). |
-| **DCO** | GitHub | Verifies every commit has a `Signed-off-by` line. |
-| **docs/readthedocs.org:vllm-omni** | Read the Docs | Builds the MkDocs documentation site. |
-| **buildkite/vllm-omni** | Buildkite | Runs GPU-based tests on NVIDIA CUDA hardware (L4, H100). |
-| **buildkite/vllm-omni-amd** | Buildkite | Runs GPU-based tests on AMD ROCm hardware (MI325). |
-| **buildkite/vllm-omni-intel** | Buildkite | Runs GPU-based tests on Intel XPU hardware (Intel Arc BMG). |
-
-## Step 1: Identify the Failing Check
-
-Click the **Details** link next to the failing check on your PR to open the build log. The most common failures fall into these categories:
-
-### pre-commit failures
-
-These are typically formatting or linting issues introduced by your PR. Fix them locally:
-
-```bash
-uv pip install pre-commit
-pre-commit run --all-files
-```
-
-Then commit the fixes and push.
-
-### DCO failures
-
-Every commit must include a `Signed-off-by` line. If you forgot, amend your commits:
-
-```bash
-git commit --amend -s
-git push --force-with-lease
-```
-
-For multiple commits, use an interactive rebase to add the sign-off to each one.
-
-### Read the Docs failures
-
-The documentation build uses MkDocs with `fail_on_warning: true`, so even a minor warning (not just errors) will cause the build to fail. To reproduce locally:
-
-```bash
-uv pip install -e ".[docs]"
-mkdocs build --strict
-```
-
-Common causes include broken cross-references, invalid admonition syntax, or missing files referenced by `--8<--` includes.
-
-### Buildkite failures
-
-Buildkite runs GPU tests in Docker containers. These are the most complex checks and can fail for reasons unrelated to your PR (infrastructure issues, flaky tests, etc.). See the sections below for how to investigate.
-
-## Step 2: Check if the Failure Is a Known Issue
-
-Before spending time debugging, check whether the failure already exists on the `main` branch:
-
-1. **Look at the Buildkite build log** — the test name and error message are usually enough to identify the issue.
-2. **Check recent CI runs on `main`** — if the same test is failing there, the failure is not caused by your PR.
-3. **Search existing issues** — look for open issues in the [vllm-omni issue tracker](https://github.com/vllm-project/vllm-omni/issues) with the test name or error message.
-
-If the failure is already tracked, leave a comment on your PR noting that the failure is pre-existing and link the issue.
-
-## Step 3: Investigate the Failure
-
-If the failure appears to be new, investigate whether your changes caused it.
-
-### Reading Buildkite Logs
-
-1. Click **Details** next to the Buildkite check on your PR.
-2. Find the failing step in the pipeline (e.g., "Diffusion Model Test", "Simple Unit Test").
-3. Expand the step to see the full test output with the traceback.
-
-### Running Tests Locally
-
-For instructions on running tests locally (including specific test files, functions, and markers), see the [Running Tests](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/test_guide/#running-tests) section in the Test Guide.
-
-## Step 4: Raise an Issue or Fix It
-
-### If the failure is pre-existing (not caused by your PR)
-
-1. **Raise a new issue** if one doesn't already exist, using the title format:
- `[CI Failure]: [job-name] - [test-path]`
-2. Include the error message, relevant log excerpts, and the commit hash where the failure occurs (e.g., "Still failing on main as of commit `abc1234`").
-3. Leave a comment on your PR linking to the issue and noting that the failure is unrelated to your changes.
-
-### If the failure is caused by your PR
-
-1. Fix the issue in your branch and push the update.
-2. If the fix is non-trivial, consider adding a test to prevent regression.
-
-## Common Failure Patterns
-
-| Symptom | Likely Cause | Fix |
-| ------- | ------------ | --- |
-| `ruff` or formatting errors | Code style violation | Run `pre-commit run --all-files` |
-| `Signed-off-by` missing | DCO check | Amend commits with `git commit --amend -s` |
-| MkDocs build warning | Broken docs reference | Run `mkdocs build --strict` locally |
-| `OOM` or `CUDA out of memory` | Test exceeds GPU memory | Check if your changes increased memory usage; use `--vae_use_slicing` / `--vae_use_tiling` for diffusion tests |
-| Import errors | Missing or changed dependency | Check `pyproject.toml` and make sure dependencies are correct |
-| Timeout (step exceeded N minutes) | Test is too slow or hangs | Profile the test; check for infinite loops or deadlocks |
-| `Agent lost` in Buildkite | Infrastructure issue (not your fault) | Re-trigger the build; comment on your PR |
diff --git a/docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md b/docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md
deleted file mode 100644
index 13dd032e275..00000000000
--- a/docs/contributing/ci/test_examples/l4_doc_example_tests.inc.md
+++ /dev/null
@@ -1,49 +0,0 @@
-**Preferred Test Strategy**
-
-Use one of the following patterns depending on page type:
-
-- **Dynamic code-block extraction (preferred for offline docs)**
- - Extract Python/Bash code blocks from the markdown using an AST analyzer, then execute them directly in tests.
- - Benefit: test logic stays automatically aligned with docs.
- - Basic idea: use `ReadmeSnippet.extract_readme_snippets` to extract the code blocks into a module-level list in the test file,
- pass this list to `pytest.mark.parametrize`, and pass each snippet item to `example_runner.run` inside the parametrized test.
- Also pass an `output_subfolder` argument for the second-level output folder explained in **Output Directory Structure** below.
- If a test needs an extra environment variable (e.g., the example script reads it), `example_runner.run` also accepts a third `env` parameter.
- - See [tests/examples/offline_inference/test_text_to_image.py](https://github.com/vllm-project/vllm-omni/blob/main/tests/examples/offline_inference/test_text_to_image.py) for reference implementation.
-
-- **Explicit copied scripts (used by online docs for now until further update)**
- - For online serving pages, it is acceptable to copy code from docs into dedicated test functions, because only client-side, request-sending scripts are tested.
- - Rationale: dynamic extraction is overly complex here, because it would need to tell server-launch scripts apart from client-request scripts.
- - Requirement: copied test code must be kept in sync with doc updates.
-
-**Test Case Naming Convention**
-
-- Dynamic code extraction (auto-generated internally):
- - `test_{single_function_name_matching_file_name}[h2_heading_00X]`
- - Example: `test_text_to_image[basic_usage_001]`
-- Explicit copied scripts:
- - `test_{h2_heading_00X}[{dummy_param_id_for_omni_server}]`
- - Example: `test_api_calls_001[omni_server0]`
-
-**Runtime Configuration**
-
-In the example code tests, do **not** reduce `num_inference_steps` just to speed up the tests unless there is a strong CI reliability reason to do so.
-
-**Skipping Rules**
-
-You may skip examples falling in the following categories using `pytest.mark.skip` or `pytest.skip`:
-
-- Gradio UI scripts
-- Scenarios that significantly overlap with existing tests and add little new coverage.
-
-**Output Directory Structure**
-
-Use a three-layer output structure to store output artifacts:
-
-1. Root output directory
- - Auto-detected from `OUTPUT_DIR` env var or auto-generated under `/tmp`.
-2. Doc-page directory
- - Define and use a clear page-level folder name in each `test_*.py` yourself (abbreviations are acceptable, e.g., `example_offline_t2i`).
-3. Test-case directory
- - Must match the case identifier (e.g., `basic_usage_001`).
- - Auto-generated for dynamically extracted tests.
diff --git a/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md b/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md
deleted file mode 100644
index 688933f135d..00000000000
--- a/docs/contributing/ci/test_examples/l4_functionality_tests.inc.md
+++ /dev/null
@@ -1,46 +0,0 @@
-**Scope**
-
-For diffusion models, the L4 functionality test covers all or common *diffusion features* that are supported by this model, including several [parallelism acceleration methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/parallelism_acceleration/), [CPU offloading](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cpu_offload_diffusion/), [TeaCache](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/teacache/) and [Cache-DiT](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/diffusion/cache_dit_acceleration/) cache backends, [quantization methods](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/quantization/overview/).
-
-**Test Case Design**
-
-For a *high priority* model (currently listed in [issue #1832](https://github.com/vllm-project/vllm-omni/issues/1832)), we use several test cases, each with multiple features turned on, so that each supported feature is tested in at least one test case. This is to relieve the GPU workload on the CI machine. The suggested test case combination is as follows:
-
-- If the model can fit into 4 L4 GPUs (20GB RAM each), with quantization and tensor parallelism always on
- - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
- - (4 GPUs) CacheDiT + Ulysses=2 + TP=2 + VAE=2 + FP8
- - (4 GPUs) CacheDiT + Ring=2 + HSDP=2 + VAE=2 + GGUF
- - (4 GPUs) TeaCache + CFG=2 + TP=2 + VAE=2 + FP8
-- Otherwise, consider a 2 H100 GPU environment (80GB RAM each) with the following tests
- - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
- - (2 GPUs) CacheDiT + Ulysses=2 + FP8
- - (2 GPUs) CacheDiT + Ring=2 + GGUF
- - (2 GPUs) TeaCache + CFG=2 + FP8
- - (2 GPUs) CacheDiT + TP=2 + VAE=2 + FP8
- - (2 GPUs) CacheDiT + HSDP=2 + VAE=2 + GGUF
-- If 2 H100 GPUs cannot handle the model either (e.g., HunyuanImage 3.0)
- - Still design tests and feature combinations that best fit real-world scenarios.
- - Do not include it in CI (or exclude it from the CI's filtering criteria). Instead, relevant PR authors are suggested to run these tests locally.
-- Fallback plan
- - If the model does not support layerwise CPU offloading, replace the corresponding test case with module-wise offloading.
- - If the model supports only specific caching features, or none, use the supported one or remove the caching option in all test cases.
- - If the model supports only specific quantization methods, or none, use the supported one or remove the quantization option in all test cases.
- - If the model does not support some other feature, remove that option from the affected test case. If the modified test case's coverage then completely overlaps with others, remove it.
-
-For a *normal priority* model, further reduce the number of test cases.
-
-- Only write one or two test cases for the most common feature combinations.
-- Authors can experiment to find which feature combination best balances output quality and performance. Alternatively, refer to the example code in the PR that added the model, or in a PR that added a feature (if that code involves the model of interest).
-
-Currently all the features are available in online serving mode. Hence, you only need to add `tests/e2e/online_serving/test_{model}_expansion.py`.
-
-**Code Style**
-
-- Validation: test that the multimodal output files of your model have the correct shapes. `OpenAIClientHandler.send_diffusion_request` should have taken care of this.
-- Test marks: always add `full_model` and `diffusion` for L4 nightly `test_*_expansion.py` cases. Add GPU-related marks if needed. Ref: [Markers for Tests](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_markers/).
-- To maximize code reuse, you may refer to
- - `tests/conftest.py` for `omni_server` (running server in subprocess) and `openai_client` fixtures (sending requests and validating output), `generate_synthetic_image` and `assert_XXX_valid` helper.
- - `tests/helpers/mark.py` for `@hardware_test(...)` and `hardware_marks`.
- - [Parametrizing tests (pytest doc)](https://docs.pytest.org/en/stable/example/parametrize.html) to reuse test function implementation for different cases.
-- Doc: add a concise docstring for each test function.
-- Reference L4 test implementation: [tests/e2e/online_serving/test_qwen_image_edit_expansion.py](https://github.com/vllm-project/vllm-omni/blob/main/tests/e2e/online_serving/test_qwen_image_edit_expansion.py).
diff --git a/docs/contributing/ci/test_examples/l4_performance_tests.inc.md b/docs/contributing/ci/test_examples/l4_performance_tests.inc.md
deleted file mode 100644
index f1f3073dc52..00000000000
--- a/docs/contributing/ci/test_examples/l4_performance_tests.inc.md
+++ /dev/null
@@ -1,89 +0,0 @@
-When you want to add L4-level ***performance test*** cases, you can refer to the following format for case addition in `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, or diffusion configs such as `tests/dfx/perf/tests/test_*_vllm_omni.json` (selected via `pytest ... run_benchmark.py --test-config-file `):
-
-```JSON
-{
- "test_name": "test_qwen3_omni",
- "server_params": {
- "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
- "stage_config_name": "qwen3_omni.yaml"
- },
- "benchmark_params": [
- {
- "dataset_name": "random",
- "num_prompts": [10, 20],
- "max_concurrency": [1, 4],
- "random_input_len": 2500,
- "random_output_len": 900,
- "ignore_eos": true,
- "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration",
- "baseline": {
- "mean_ttft_ms": [500, 800],
- "mean_audio_ttfp_ms": [2000, 3500],
- "mean_audio_rtf": [0.25, 0.35]
- }
- }
- ]
-}
-```
-
-**Parameter Explanation**
-
-*Overview*
-
-| Field | Required | Description |
-| ---------------- | -------- | --------------------------------------------------------------- |
-| test_name | Yes | Unique identifier for the test case |
-| server_params | Yes | Server-side configuration parameters |
-| benchmark_params | Yes | Benchmark running parameters (supports multiple configurations) |
-
-**`server_params` Configuration**
-
-*Basic Parameters*
-
-| Parameter | Required | Example | Description |
-| ----------------- | -------- | ---------------------------------- | ----------------------------- |
-| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path |
-| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name |
-
-*Dynamic Configuration (update/delete)*
-
-Supports incremental modifications based on the basic configuration:
-
-| Operation | Description |
-| --------- | ------------------------------------ |
-| update | Update or add configuration items |
-| delete | Delete specified configuration items |
-
-**Example**:
-
-```
-"update": {
- "async_chunk": true, // Enable asynchronous chunk processing
- "stage_args": {
- "0": {
- "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk"
- }
- }
-},
-"delete": {
- "stage_args": {
- "2": ["custom_process_input_func"] // Delete this configuration for stage 2
- }
-}
-```
-
-**`benchmark_params` Configuration**
-
-You can add any benchmark running parameters you need here. For all optional parameters, refer to the [benchmark documentation](https://github.com/vllm-project/vllm-omni/blob/main/docs/cli/bench/serve.md). General modifications are as follows:
-
-1. Change the --xxx-xx-xx running parameters to xxx_xx_xx format and fill them as keys in the JSON file.
-2. For boolean variables in the running parameters, modify them to forms such as ignore_eos: true/false and fill them into the JSON file.
-3. Optionally add a `baseline` object (see **Baseline thresholds** below). If you omit `baseline` or leave it empty, the performance test still runs but does not assert metric thresholds from this field.
-4. The qps and concurrency modes are recommended to be mutually exclusive. For detailed explanations, see the table below:
-
-| Parameter | Type | Required | Example/Values | Description |
-| --------------- | ----------- | -------- | --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| num_prompts | int / array | Yes | 10,[10, 20, 30] | Number of requests. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of qps or max_concurrency, e.g., [10,10,10]. If an array is used, its length must match the number of qps or max_concurrency. |
-| request_rate | float / array | No | 0.5, [0.5, 1, inf] | Queries per second. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of num_prompts, e.g., [1,1,1]. If an array is used, its length must match the number of num_prompts. |
-| max_concurrency | int / array | No | 1, [1, 2, 3] | Maximum concurrent in-flight requests. Same array / expansion rules as `request_rate` (mutually exclusive with QPS mode). |
-| baseline | object | No | see above | Optional per-metric thresholds; keys must match benchmark output fields. Scalar, list (per sweep step), or object (keyed by concurrency or QPS string). |
diff --git a/docs/contributing/ci/test_guide.md b/docs/contributing/ci/test_guide.md
deleted file mode 100644
index 018c47b053f..00000000000
--- a/docs/contributing/ci/test_guide.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# Test Guide
-## Setting Up the Test Environment
-### Creating a Container
-vLLM-Omni provides official Docker images for deployment. These images are built upon vLLM Docker images and are available on [Docker Hub](https://hub.docker.com/r/vllm/vllm-omni/tags). The version of vLLM-Omni indicates which vLLM release it is based on.
-For a local test environment, you can follow the steps below to create a container:
-## Installing Dependencies
-### vLLM & vLLM-Omni
-vLLM-Omni is built on top of vLLM. Follow the [installation guide](../../getting_started/installation/README.md) to set up your local environment.
-
-### Test Case Dependencies
-When running test cases, you may need to install the following dependencies:
-
-```bash
-uv pip install ".[dev]"
-apt-get install -y ffmpeg
-```
-
-## Running Tests
-Our test scripts use the pytest framework. First, please use `git clone https://github.com/vllm-project/vllm-omni.git` to download the vllm-omni source code. Then, in the root directory of vllm-omni, you can run the following commands in your local test environment to execute the corresponding test cases.
-
-=== "L1 level"
-
- ```bash
- cd tests
- pytest -s -v -m "core_model and cpu"
- ```
- The latest test command is available in the "Simple Unit Test" step of this [pipeline](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-ready.yml).
-
-=== "L2 level"
-
- ```bash
- cd tests
- pytest -s -v -m "core_model and not cpu" --run-level=core_model
- ```
- If you only want to run a specific test case, you can use:
- ```bash
- pytest -s -v test_xxxx.py --run-level=core_model
- ```
- If you only want to run specific test cases on a particular platform, you can use:
- ```bash
- pytest -s -v -m "core_model and distributed_cuda and L4" --run-level=core_model
- ```
- The latest test commands for various test suites can be found in the [pipeline](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-ready.yml).
-
-=== "L3 level"
-
- ```bash
- pytest -s -v -m "advanced_model" --run-level=advanced_model
- ```
- If you only want to run a specific test case, you can use:
- ```bash
- pytest -s -v test_xxxx.py --run-level=advanced_model
- ```
- If you only want to run specific test cases on a particular platform, you can use:
- ```bash
- pytest -s -v -m "advanced_model and distributed_cuda and L4" --run-level=advanced_model
- ```
- The latest L3 test commands for various test suites can be found in the [pipeline](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-merge.yml).
-
-
-=== "L4 level"
-
- ```bash
- cd tests
- pytest -s -v -m "full_model" --run-level=full_model
- ```
- If you only want to run a specific test case, you can use:
- ```bash
- pytest -s -v test_xxxx.py --run-level=full_model
- ```
- If you only want to run specific test cases on a particular platform, you can use:
- ```bash
- pytest -s -v -m "full_model and distributed_cuda and L4" --run-level=full_model
- ```
- Note: To run performance tests (defaults to ``test_qwen_omni.json``; use ``--test-config-file tests/dfx/perf/tests/test_tts.json`` for TTS):
- ```bash
- pytest -s -v tests/dfx/perf/scripts/run_benchmark.py
- ```
- The latest L4 (nightly) test commands use the `full_model` marker and `--run-level full_model` (see [test-nightly.yml](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-nightly.yml) and [test-nightly-diffusion.yml](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-nightly-diffusion.yml)). Example:
-
- ```bash
- cd tests
- pytest -s -v -m "full_model and omni and H100" --run-level=full_model
- ```
-
-=== "L5 level"
-
- L5 includes stability and reliability testing. Typical commands:
-
- ```bash
- cd tests
-
- # Stability: Qwen3-Omni
- pytest -s -v dfx/stability/scripts/test_stability_qwen3_omni.py
-
- # Stability: Wan2.2 (v1/videos diffusion benchmark loop)
- pytest -s -v dfx/stability/scripts/test_stability_wan22.py
-
- ```
-
- The latest L5 commands for CI can be found in the [pipeline](https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/test-ready.yml).
-
-You can find more information about markers in the documentation: [marker doc](./tests_markers.md)
-
-## Adding New Test Cases
-Please refer to the [L5 Layering Specification document](./CI_5levels.md).
diff --git a/docs/contributing/ci/tests_markers.md b/docs/contributing/ci/tests_markers.md
deleted file mode 100644
index 6130541a617..00000000000
--- a/docs/contributing/ci/tests_markers.md
+++ /dev/null
@@ -1,179 +0,0 @@
-# Markers for Tests
-
-Adding markers to test functions lets you later select and run groups of tests uniformly by declaring the corresponding marker type.
-
-## Current Markers
-Defined in `pyproject.toml`:
-
-| Marker | Description |
-| ------------------ | --------------------------------------------------------- |
-| `core_model` | L1&L2 tests (run in each PR) |
-| `advanced_model` | L3 tests (run on each merge to main) |
-| `full_model` | L4 tests (run nightly) |
-| `diffusion` | Diffusion model tests |
-| `omni` | Omni model tests |
-| `cache` | Cache backend tests |
-| `parallel` | Parallelism/distributed tests |
-| `cpu` | Tests that run on CPU |
-| `gpu` | Tests that run on GPU * |
-| `cuda` | Tests that run on CUDA * |
-| `rocm` | Tests that run on AMD/ROCm * |
-| `xpu` | Tests that run on Intel XPU * |
-| `npu` | Tests that run on NPU/Ascend * |
-| `H100` | Tests that require H100 GPU * |
-| `L4` | Tests that require L4 GPU * |
-| `MI325` | Tests that require MI325 GPU (AMD/ROCm) * |
-| `A2` | Tests that require A2 NPU * |
-| `A3` | Tests that require A3 NPU * |
-| `distributed_cuda` | Tests that require multi cards on CUDA platform * |
-| `distributed_rocm` | Tests that require multi cards on ROCm platform * |
-| `distributed_npu` | Tests that require multi cards on NPU platform * |
-| `skipif_cuda` | Skip if the num of CUDA cards is less than the required * |
-| `skipif_rocm` | Skip if the num of ROCm cards is less than the required * |
-| `skipif_npu` | Skip if the num of NPU cards is less than the required * |
-| `slow` | Slow tests (may skip in quick CI) |
-| `benchmark` | Benchmark tests |
-
-\* These markers are added automatically by `@hardware_test` (a decorator) or `hardware_marks` (which only returns the list of marks, for flexibility).
-
-### Example usage for markers
-
-```python
-from tests.helpers.mark import hardware_test
-
-@pytest.mark.core_model
-@pytest.mark.omni
-@hardware_test(
- res={"cuda": "L4", "rocm": "MI325", "npu": "A2"},
- num_cards=2,
-)
-@pytest.mark.parametrize("omni_server", test_params, indirect=True)
-def test_video_to_audio(omni_server):
- ...
-```
-
-### Decorator: `@hardware_test`
-
-This decorator is intended to make hardware-aware, cross-platform test authoring easier and more robust for CI/CD environments. The `hardware_test` decorator in `vllm-omni/tests/helpers/mark.py` performs the following actions:
-
-1. **Applies platform and resource markers**
- Adds the appropriate pytest markers for each specified hardware platform (e.g., `cuda`, `rocm`, `xpu`, `npu`) and resource type (e.g., `L4`, `H100`, `MI325`, `B60`, `A2`, `A3`).
- ```
- @pytest.mark.cuda
- @pytest.mark.L4
- ```
-2. **Handles multi-card (distributed) scenarios**
- For tests requiring multiple cards, it automatically adds distributed markers such as `distributed_cuda`, `distributed_rocm`, or `distributed_npu`.
- ```
- @pytest.mark.distributed_cuda(num_cards=num_cards)
- ```
-3. **Supports flexible card requirements**
- Accepts `num_cards` as either a single integer for all platforms or as a dictionary with per-platform values. If not specified, defaults to 1 card per platform.
-
-4. **Integrates resource validation**
- On CUDA, adds a skip marker (`skipif_cuda`) if the system does not have the required number of devices.
- Support for `skipif_rocm` and `skipif_npu` will be implemented later.
-
-
-5. **Works with pytest filtering**
- Allows tests to be filtered and selected at runtime using standard pytest marker expressions (e.g., `-m "distributed_cuda and L4"`).
-
-#### Example usage for decorator
-- Single call for multiple platforms:
- ```python
- @hardware_test(
- res={"cuda": "L4", "rocm": "MI325", "xpu": "B60", "npu": "A2"},
- num_cards={"cuda": 2, "rocm": 2, "xpu": 2, "npu": 2},
- )
- ```
- or
- ```python
- @hardware_test(
- res={"cuda": "L4", "rocm": "MI325", "xpu": "B60", "npu": "A2"},
- num_cards=2,
- )
- ```
-- `res` must be a dict; supported resources: CUDA (L4/H100), ROCm (MI325), XPU (B60), NPU (A2/A3)
-- `num_cards` can be int (all platforms) or dict (per platform); defaults to 1 when missing
-- Distributed markers (`distributed_cuda`, `distributed_rocm`, `distributed_npu`) are auto-added for multi-card cases
-- Filtering examples:
- - CUDA only: `pytest -m "distributed_cuda and L4"`
- - ROCm only: `pytest -m "distributed_rocm and MI325"`
- - NPU only: `pytest -m "distributed_npu"`
-
-### Function: `hardware_marks`
-
-`hardware_marks` returns a list of pytest mark objects with the same signature as `@hardware_test`. Use it when you need more flexibility, such as attaching hardware marks to individual `pytest.param` entries rather than an entire test function.
-
-```python
-from tests.helpers.mark import hardware_marks
-
-MULTI_CARD_MARKS = hardware_marks(
- res={"cuda": "H100", "rocm": "MI325", "npu": "A2"}, num_cards=2
-)
-
-@pytest.mark.parametrize("omni_server", [
- pytest.param(OmniServerParams(...), id="case_001", marks=MULTI_CARD_MARKS),
-], indirect=True)
-def test_feature(omni_server):
- ...
-```
-
-## Add Support for a New Platform
-
-If you want to add support for a new platform (e.g., "tpu" for a new accelerator), follow these steps:
-
-1. **Extend the marker list in your pytest config** so that platform/resource markers are defined:
- ```toml
- # In pyproject.toml or pytest.ini
- [tool.pytest.ini_options]
- markers = [
- # ... existing markers ...
- "tpu: Tests that require TPU device",
- "TPU_V3: Tests that require TPU v3 hardware",
- "distributed_tpu: Tests that require multiple TPU devices",
- ]
- ```
-2. **Implement a marker construction function for your platform** in `vllm-omni/tests/helpers/mark.py`:
-    ```python
-    # In vllm-omni/tests/helpers/mark.py
-
-    def tpu_marks(*, res: str, num_cards: int):
-        test_platform = pytest.mark.tpu
-        if res == "TPU_V3":
-            test_resource = pytest.mark.TPU_V3
-        else:
-            raise ValueError(
-                f"Invalid TPU resource type: {res}. Supported: TPU_V3")
-
-        if num_cards == 1:
-            return [test_platform, test_resource]
-        else:
-            test_distributed = pytest.mark.distributed_tpu(num_cards=num_cards)
-            # Optionally: add skipif_tpu when implemented
-            return [test_platform, test_resource, test_distributed]
- ```
-3. **Update `hardware_marks` to recognize your new platform**:
- In the relevant place (see the `hardware_marks` implementation), add:
-    ```python
-    if platform == "tpu":
-        marks = tpu_marks(res=resource, num_cards=cards)
-    ```
- (`hardware_test` calls `hardware_marks` internally, so both will pick up the change.)
-4. **(Recommended) Add a test using your new markers**:
-    ```python
-    @hardware_test(
-        res={"tpu": "TPU_V3"},
-        num_cards=2,
-    )
-    def test_my_tpu_feature():
-        ...
-    ```
-
-**Summary**:
-- Add pytest markers for your new platform/resources
-- Implement a marker function (`xxx_marks`)
-- Plug into `hardware_marks`
-- You're done: tests using `@hardware_test` or `hardware_marks` with your platform now automatically get the correct markers, distribution, and isolation!
-
-See code in `vllm-omni/tests/helpers/mark.py` for existing examples (`cuda_marks`, `rocm_marks`, `npu_marks`).
diff --git a/docs/contributing/ci/tests_style.md b/docs/contributing/ci/tests_style.md
deleted file mode 100644
index 3a8cb0f127c..00000000000
--- a/docs/contributing/ci/tests_style.md
+++ /dev/null
@@ -1,467 +0,0 @@
-# Test File Structure and Style Guide
-
-To ensure project maintainability and sustainable development, we encourage contributors to submit test code (unit tests, system tests, or end-to-end tests) alongside their code changes. This document outlines the guidelines for organizing and naming test files.
-
-## Checklist before submitting your test files
-
-1. The file is saved in an appropriate place and the file name is clear.
-2. The coding style follows the requirements outlined below.
-3. All test functions have appropriate pytest markers.
-4. For tests that need to run in CI, ensure each one is labeled with `@pytest.mark.core_model` and is configured under the `./buildkite/` folder.
-
-
-## Test Types
-
-For more details, see our [Five Levels Tests design](../ci/CI_5levels.md).
-
-### Unit Tests and System Tests
-For unit tests and system tests, we strongly recommend placing test files in the same directory structure as the source code being tested, using the naming convention `test_*.py`.
-
-### End-to-End (E2E) Tests for Models
-End-to-end tests verify the complete functionality of a system or component. For our project, the E2E tests for different omni models are organized into two subdirectories:
-
-- **`tests/e2e/offline_inference/`**: Tests for offline inference modes (e.g., Qwen3Omni offline inference)
-
-- **`tests/e2e/online_serving/`**: Tests for online serving scenarios (e.g., API server tests)
-
-## Test Directory Structure
-
-The ideal directory structure mirrors the source code organization. Legend: `✅` = test exists, `⬜` = suggested to add.
-
-```
-vllm_omni/ tests/
-├── config/ → ├── config/
-│ ├── model.py │ └── test_model.py ⬜
-│ └── lora.py │ └── test_lora.py ⬜
-│
-├── core/ → ├── core/
-│ └── sched/ │ └── sched/
-│ ├── omni_ar_scheduler.py │ ├── test_omni_ar_scheduler.py ⬜
-│ ├── omni_generation_scheduler.py │ ├── test_omni_generation_scheduler.py ⬜
-│ └── output.py │ └── test_output.py ✅ currently in entrypoints/test_omni_new_request_data.py (tests output.OmniNewRequestData)
-│
-├── diffusion/ → ├── diffusion/
-│ ├── diffusion_engine.py │ ├── test_diffusion_engine.py ⬜
-│ ├── attention/ │ ├── attention/
-│ │ ├── layer.py │ │ ├── test_attention_sp.py ✅
-│ │ └── backends/ │ │ └── test_flash_attn.py ✅
-│ ├── distributed/ │ ├── distributed/
-│ │ └── ... │ │ ├── test_comm.py ✅
-│ │ │ │ ├── test_cfg_parallel.py ✅
-│ │ │ │ └── test_sp_plan_hooks.py ✅
-│ ├── lora/ │ ├── lora/
-│ │ └── ... │ │ ├── test_base_linear.py ✅
-│ │ │ │ └── test_lora_manager.py ✅
-│ ├── models/ │ ├── models/
-│ │ ├── qwen_image/ │ │ ├── qwen_image/ (e2e coverage)
-│ │ ├── z_image/ │ │ └── z_image/
-│ │ └── ... │ │ └── test_zimage_tp_constraints.py ✅
-│ └── worker/ │ └── worker/
-│ ├── diffusion_worker.py │ └── test_diffusion_worker.py ✅ file at diffusion/test_diffusion_worker.py
-│ └── diffusion_model_runner.py │
-│
-├── distributed/ → ├── distributed/
-│ └── omni_connectors/ │ └── omni_connectors/
-│ ├── adapter.py │ ├── test_adapter_and_flow.py ✅
-│ ├── kv_transfer_manager.py │ ├── test_basic_connectors.py ✅
-│ ├── connectors/ │ ├── test_kv_flow.py ✅
-│ └── utils/ │ └── test_omni_connector_configs.py ✅
-│
-├── engine/ → ├── engine/
-│ ├── input_processor.py │ ├── test_input_processor.py ⬜ (no processor.py in source)
-│ ├── output_processor.py │ └── test_output_processor.py ⬜
-│ └── arg_utils.py │ └── test_arg_utils.py ⬜
-│
-├── entrypoints/ → ├── entrypoints/
-│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
-│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
-│ │ └── ... │ │ └── test_*.py ⬜
-│ └── openai/ │ └── openai_api/ # maps to entrypoints/openai/
-│ ├── api_server.py │ ├── test_api_server.py ⬜ (e2e indirect coverage)
-│ ├── serving_chat.py │ ├── test_serving_chat_sampling_params.py ✅
-│ ├── serving_speech.py │ ├── test_serving_speech.py ✅
-│ └── image_api_utils.py │ └── test_image_server.py ✅
-│
-├── inputs/ → ├── inputs/
-│ ├── data.py │ ├── test_data.py ⬜
-│ ├── parse.py │ ├── test_parse.py ⬜
-│ └── preprocess.py │ └── test_preprocess.py ✅ currently in entrypoints/test_omni_input_preprocessor.py
-│
-├── model_executor/ → ├── model_executor/
-│ ├── layers/ │ ├── layers/
-│ │ └── mrope.py │ │ └── test_mrope.py ⬜
-│ ├── model_loader/ │ ├── model_loader/
-│ │ └── weight_utils.py │ │ └── test_weight_utils.py ⬜
-│ ├── models/ │ ├── models/
-│ │ ├── qwen2_5_omni/ │ │ ├── qwen2_5_omni/
-│ │ │ ├── qwen2_5_omni_thinker.py │ │ │ ├── test_audio_length.py ✅
-│ │ │ ├── qwen2_5_omni_talker.py │ │ │ ├── test_qwen2_5_omni_thinker.py ⬜
-│ │ │ └── qwen2_5_omni_token2wav.py │ │ │ ├── test_qwen2_5_omni_talker.py ⬜
-│ │ └── qwen3_omni/ │ │ │ └── test_qwen2_5_omni_token2wav.py ⬜
-│ │ └── ... │ │ └── qwen3_omni/
-│ ├── stage_configs/ │ │ └── test_*.py ⬜
-│ │ └── *.yaml │ └── stage_configs/ (used by e2e, test_*.py can be added) ⬜
-│ └── stage_input_processors/ │ └── stage_input_processors/
-│ └── ... │ └── test_*.py ⬜
-│
-├── sample/ → ├── sample/
-│ └── __init__.py │ └── test_*.py ⬜
-│
-├── utils/ → ├── utils/
-│ └── __init__.py │ └── test_*.py ⬜ (no platform_utils.py currently)
-│
-├── worker/ → ├── worker/
-│ ├── gpu_ar_model_runner.py │ ├── test_gpu_ar_model_runner.py ⬜
-│ ├── gpu_ar_worker.py │ ├── test_gpu_ar_worker.py ⬜
-│ ├── gpu_generation_model_runner.py │ ├── test_gpu_generation_model_runner.py ✅
-│ ├── gpu_generation_worker.py │ ├── test_gpu_generation_worker.py ⬜
-│ ├── gpu_model_runner.py │ ├── test_omni_gpu_model_runner.py ✅
-│ └── mixins.py │ └── (npu under platforms/npu/worker/) # not worker/npu/
-│
-├── platforms/ → (no tests/platforms/, e2e and stage_configs provide indirect coverage)
-│ ├── cuda/
-│ ├── npu/worker/ # NPU worker here, not vllm_omni/worker/npu/
-│ ├── rocm/
-│ └── xpu/worker/
-│
-├── outputs.py → test_outputs.py ✅ (at tests root)
-├── (logger, patch, request, version) → (no corresponding unit test)
-│
-└── e2e (tests side only) → ├── e2e/
- ├── online_serving/ ✅ non-empty
- │ ├── test_qwen2_5_omni.py
- │ ├── test_async_omni.py
- │ ├── test_qwen3_omni.py
- │ ├── test_qwen3_omni_expansion.py
- │ ├── test_mimo_audio.py
- │ ├── test_image_gen_edit.py
- │ └── test_images_generations_lora.py
- └── offline_inference/ ✅
- ├── test_qwen2_5_omni.py
- ├── test_qwen3_omni.py
- ├── test_bagel_text2img.py
- ├── test_t2i_model.py
- ├── test_t2v_model.py
- ├── test_ovis_image.py
- ├── test_zimage_tensor_parallel.py
- ├── test_cache_dit.py
- ├── test_teacache.py
- ├── test_stable_audio_expansion.py
- ├── test_diffusion_cpu_offload.py
- ├── test_diffusion_layerwise_offload.py
- ├── test_diffusion_lora.py
- ├── test_sequence_parallel.py
- ├── test_qwen_image_edit_expansion.py
- └── stage_configs/ (legacy schema, still present
- ├── bagel_*.yaml for unmigrated models)
- └── npu/, rocm/, etc.
-
-# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under
-# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md.
-examples/ tests
-│ └── examples
-├── online_serving/ → ├── online_serving/
-│ └── {doc_page_title}/README.md │ └── test_{doc_page_title}.py ⬜
-└── offline_inference/ → └── offline_inference/
- └── {doc_page_title}/README.md └── test_{doc_page_title}.py ⬜
-```
-
-
-
-### Naming Conventions
-
-- **Unit Tests**: Use the `test_{module_name}.py` format. Example: `stage_utils.py` → `test_stage_utils.py`
-
-- **E2E Tests**: Place in `tests/e2e/offline_inference/` or `tests/e2e/online_serving/` with descriptive names. Example: `tests/e2e/offline_inference/test_qwen3_omni.py`, `tests/e2e/offline_inference/test_diffusion_model.py`
-
-- **Expansion Tests**
-
-### Best Practices
-
-1. **Mirror Source Structure**: Test directories should mirror the source code structure
-2. **Test Type Indicators**: Use comments to indicate test types (UT for unit tests, E2E for end-to-end tests)
-3. **Shared Resources**: Place shared test configurations (e.g., CI configs) in appropriate subdirectories
-4. **Consistent Naming**: Follow the `test_*.py` naming convention consistently across all test files
-
-
-## Test Code Requirements
-
-### Coding style
-
-1. **File header**: Add SPDX license header to all test files
-2. **Imports**: Do not modify `sys.path` manually; use standard imports instead.
-3. **Test type differentiation**:
-
- - Unit tests: Maintain mock style
- - E2E tests for models: Consider using OmniRunner uniformly, avoid decorators
-
-4. **Documentation**: Add docstrings to all test functions
-5. **Environment variables**: Set uniformly in `conftest.py` or at the top of files
-6. **Type annotations**: Add type annotations to all test function parameters
-7. **Pytest Markers**: Add the necessary markers, such as `@pytest.mark.core_model`, and use `@hardware_test` to declare hardware requirements (see [Markers for Tests](../ci/tests_markers.md) for details).
-
-### Template
-#### E2E - Online serving
-
-E2E online tests for the Qwen3-Omni model with mixed input and audio+text output. Based on `tests/e2e/online_serving/test_qwen3_omni.py`.
-
-```python
-"""
-E2E Online tests for Qwen3-Omni model with mix input and audio+text output.
-"""
-
-import os
-
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"
-
-import threading
-import time
-from pathlib import Path
-import openai
-import pytest
-
-from tests.helpers.media import (
- convert_audio_bytes_to_text,
- cosine_similarity_text,
- generate_synthetic_video,
-)
-from tests.helpers.runtime import OmniServer, dummy_messages_from_mix_data
-from tests.helpers.stage_config import get_deploy_config_path, modify_stage_config
-from vllm_omni.platforms import current_omni_platform
-
-# Edit: model name and stage config path
-models = ["Qwen/Qwen3-Omni-30B-A3B-Instruct"]
-
-# To use the default configuration file, return its path directly.
-def get_default_config():
- return get_deploy_config_path("ci/qwen3_omni_moe.yaml")
-
-# To change the configuration, derive a modified copy with modify_stage_config.
-def get_chunk_config():
- path = modify_stage_config(
- get_default_config(),
- updates={
- "async_chunk": True,
- "stage_args": {
- 0: {
- "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk"
- },
- 1: {
- "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk"
- },
- },
- },
- deletes={"stage_args": {2: ["custom_process_input_func"]}},
- )
- return path
-
-stage_configs = [get_default_config(), get_chunk_config()]
-
-test_params = [(model, stage_config) for model in models for stage_config in stage_configs]
-
-
-#Please use this method to launch the online instance.
-_omni_server_lock = threading.Lock()
-
-@pytest.fixture(scope="module")
-def omni_server(request):
- """Start vLLM-Omni server as a subprocess. Use module scope for multi-stage init (10-20+ min)."""
- with _omni_server_lock:
- model, stage_config_path = request.param
- with OmniServer(
- model,
- ["--stage-configs-path", stage_config_path, "--stage-init-timeout", "120"],
- ) as server:
- yield server
-
-
-@pytest.fixture
-def client(omni_server):
- """OpenAI client for the running vLLM-Omni server."""
- return openai.OpenAI(
- base_url=f"http://{omni_server.host}:{omni_server.port}/v1",
- api_key="EMPTY",
- )
-
-#Please use function definitions above the test function to define the prompts and other parameters you need.
-def get_system_prompt():
- return {
- "role": "system",
- "content": [
- {
- "type": "text",
- "text": (
- "You are Qwen, a virtual human developed by the Qwen Team, "
- "Alibaba Group, capable of perceiving auditory and visual inputs, "
- "as well as generating text and speech."
- ),
- }
- ],
- }
-
-...
-
-#Please define test case tags according to the instructions in the marker documentation.
-@pytest.mark.core_model
-@pytest.mark.omni
-@pytest.mark.parametrize("omni_server", test_params, indirect=True)
-def test_mix_to_text_audio_001(client: openai.OpenAI, omni_server, request) -> None:
- # PLEASE FOLLOW THESE TEMPLATE INSTRUCTIONS:
- # ============================================================================
- # TEMPLATE USAGE GUIDE:
- # 1. Copy this entire function as a starting point for multi-modal tests
- # 2. Update the test name to reflect your specific test scenario
- # 3. Modify input/output modalities as needed (see OPTIONS section below)
- # 4. Adjust assertions based on your expected outcomes
- # 5. Add custom validation logic for your specific use case
- # ============================================================================
-
- #Please list the relevant test points.
- """
- Test multi-modal input processing and text/audio output generation via OpenAI API.
- Deploy Setting: default yaml
- Input Modal: text + audio + video + image
- Output Modal: text + audio
- Input Setting: stream=True
- Datasets: single request
- """
- # SECTION 1: TEST SETUP AND INITIALIZATION
- # =========================================
- # INSTRUCTIONS: Initialize test variables and prepare test environment
- # MODIFY: Add any additional test setup required for your scenario
- e2e_list = list()
- # SECTION 2: TEST DATA GENERATION
- # ================================
- # INSTRUCTIONS: Generate or load test data for each input modality
- # MODIFY: Replace synthetic generators with your actual data sources
- # VIDEO DATA - Generate synthetic video for testing
- # FORMAT: data:video/mp4;base64,{base64_encoded_video}
- # PARAMETERS: width, height, duration_frames
- video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
- # IMAGE DATA - Generate synthetic image for testing
- # FORMAT: data:image/jpeg;base64,{base64_encoded_image}
- # PARAMETERS: width, height
- image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
- # AUDIO DATA - Generate synthetic audio for testing
- # FORMAT: data:audio/wav;base64,{base64_encoded_audio}
- # PARAMETERS: duration_seconds, channels
- audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"
-
- # SECTION 3: MESSAGE CONSTRUCTION
- # ================================
- # INSTRUCTIONS: Assemble the complete message payload for API request
- # MODIFY: Add/remove modalities or change prompt structure as needed
-
- # USAGE: Construct a message containing all input modalities
- # IMPORTANT: Ensure the message structure matches OpenAI API expectations
- # CUSTOMIZATION POINTS:
- # - system_prompt: Controls the assistant's behavior
- # - content_text: The user's text prompt/question
- # - *_data_url: URLs for media content (video/image/audio)
- messages = dummy_messages_from_mix_data(
- system_prompt=get_system_prompt(),
- video_data_url=video_data_url,
- image_data_url=image_data_url,
- audio_data_url=audio_data_url,
- content_text=get_prompt("mix"),
- )
-
- # SECTION 4: API REQUEST EXECUTION
- # =================================
- # INSTRUCTIONS: Make the API call and measure performance
- # MODIFY: Add timeout, retry logic, or additional parameters
- start_time = time.perf_counter()
- chat_completion = client.chat.completions.create(model=omni_server.model, messages=messages, stream=True)
-
- #Call using your preferred method and obtain the final audio and text outputs.
- ...
-
- # SECTION 5: OUTPUT VALIDATION
- # =============================
- # INSTRUCTIONS: Verify that outputs meet expected criteria
- # MODIFY: Adjust validation logic for your specific requirements
-
- # ASSERTION 1: E2E Validation
- # PURPOSE: Verify that the E2E latency is less than the baseline.
- current_e2e = time.perf_counter() - start_time
- print(f"the request e2e is: {current_e2e}")
- e2e_list.append(current_e2e)
-
- print(f"the avg e2e is: {sum(e2e_list) / len(e2e_list)}")
-
-
-
- # ASSERTION 2: Text Output Validation
- # PURPOSE: Verify that text output was generated with keyword content
- assert text_content is not None and len(text_content) >= 2, "No text output is generated"
- assert any(
- keyword in text_content.lower() for keyword in ["square", "quadrate", "sphere", "globe", "circle", "round"]
- ), "The output does not contain any of the keywords."
-
-
- # ASSERTION 3: Cross-Modal Consistency
- # PURPOSE: Verify text and audio outputs convey the same information
- # CUSTOMIZATION: Adjust similarity threshold (0.9) based on accuracy requirements
- assert audio_data is not None, "No audio output is generated"
- audio_content = convert_audio_bytes_to_text(audio_data)
- print(f"text content is: {text_content}")
- print(f"audio content is: {audio_content}")
- similarity = cosine_similarity_text(audio_content.lower(), text_content.lower())
- print(f"similarity is: {similarity}")
- assert similarity > 0.9, "The audio content is not same as the text"
-```
-
-
-#### E2E - Offline inference
-```python
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Offline E2E smoke test for an omni model (video → audio).
-"""
-
-import os
-from pathlib import Path
-
-import pytest
-from vllm.assets.video import VideoAsset
-
-from tests.helpers.mark import hardware_test
-from ..multi_stages.conftest import OmniRunner
-
-# Optional: set process start method for workers
-os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
-
-models = ["{your model name}"]  # Edit here to load your model
-stage_configs = [str(Path(__file__).parent / "stage_configs" / "{your model yaml}")]  # Edit here to load your model YAML
-
-# Create parameter combinations for model and stage config
-test_params = [(model, stage_config) for model in models for stage_config in stage_configs]
-
-# function name: test_{input_modality}_to_{output_modality}
-# modality candidate: text, image, audio, video, mixed_modalities
-@pytest.mark.core_model
-@pytest.mark.omni
-@hardware_test(
- res={"cuda": "L4", "rocm": "MI325", "npu": "A2"},
- num_cards=2,
-)
-@pytest.mark.parametrize("test_config", test_params)
-def test_video_to_audio(omni_runner: type[OmniRunner], test_config: tuple[str, str]) -> None:
- """Offline inference: video input, audio output."""
- model, stage_config_path = test_config
- with omni_runner(model, seed=42, stage_configs_path=stage_config_path) as runner:
- # Prepare inputs
- video = VideoAsset(name="sample", num_frames=4).np_ndarrays
-
- outputs = runner.generate_multimodal(
- prompts="Describe this video briefly.",
- videos=video,
- )
-
- # Minimal assertions: got outputs and at least one audio result
- assert outputs
- has_audio = any(o.final_output_type == "audio" for o in outputs)
- assert has_audio
-```
diff --git a/docs/contributing/metrics.md b/docs/contributing/metrics.md
deleted file mode 100644
index 92dcd92ccac..00000000000
--- a/docs/contributing/metrics.md
+++ /dev/null
@@ -1,173 +0,0 @@
-
-# Metrics
-
-You can use these metrics in production to monitor the health and performance of the vLLM-Omni system. Typical scenarios include:
-
-- **Performance Monitoring**: Track throughput (e.g., `e2e_avg_tokens_per_s`), latency (e.g., `e2e_total_ms`), and resource utilization to verify that the system meets expected standards.
-
-- **Debugging and Troubleshooting**: Use detailed per-request metrics to diagnose issues, such as high transfer times or unexpected token counts.
-
-## How to Enable and View Metrics
-
-### Start the Service with Metrics Logging
-
-```bash
-vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 8014 --log-stats
-```
-
-### Send a Request
-
-```bash
-python openai_chat_completion_client_for_multimodal_generation.py --query-type use_image
-```
-
-### What You Will See
-
-With `--log-stats` enabled, the server will output detailed metrics logs after each request. Example output:
-
-
-#### Overall Summary
-
-| Field | Value |
-|-----------------------------|--------------|
-| e2e_requests | 1 |
-| e2e_wall_time_ms | 41,299.190 |
-| e2e_total_tokens | 5,202 |
-| e2e_avg_time_per_request_ms | 41,299.190 |
-| e2e_avg_tokens_per_s | 125.959 |
-| e2e_stage_0_wall_time_ms | 10,192.289 |
-| e2e_stage_1_wall_time_ms | 30,541.409 |
-| e2e_stage_2_wall_time_ms | 207.496 |
-
-#### RequestE2EStats
-
-| Field | Value |
-|-------------------------|------------|
-| e2e_total_ms | 41,299.133 |
-| e2e_total_tokens | 5,202 |
-| transfers_total_time_ms | 245.895 |
-| transfers_total_kbytes | 138,089.939|
-
-#### StageRequestStats
-
-| Field | 0 | 1 | 2 |
-|------------------------|--------|--------|--------|
-| audio_generated_frames | 0 | 0 | 525,525|
-| batch_id | 38 | 274 | 0 |
-| batch_size | 1 | 1 | 1 |
-| num_tokens_in | 4,860 | 4,826 | 4,384 |
-| num_tokens_out | 67 | 275 | 0 |
-| postprocess_time_ms | 256.158| 0.491 | 0.000 |
-| stage_gen_time_ms | 9,910.007|30,379.198|160.745|
-
-#### TransferEdgeStats
-
-| Field | 0->1 | 1->2 |
-|---------------------|-------------|------------|
-| size_kbytes | 109,277.349 | 28,812.591 |
-| tx_time_ms | 78.701 | 18.790 |
-| rx_decode_time_ms | 111.865 | 31.706 |
-| in_flight_time_ms | 2.015 | 2.819 |
-
-
-These logs include:
-
-- **Overall summary**: total requests, wall time, average tokens/sec, etc.
-
-- **E2E table**: per-request latency and token counts.
-
-- **Stage table**: per-stage batch and timing details.
-
-- **Transfer table**: data transfer and timing for each edge.
-
-You can use these logs to monitor system health, debug performance, and analyze request-level metrics as described above.
-
-
-## Metrics Scope: Offline vs Online Inference
-
-For **offline inference** (batch mode), the summary includes both system-level metrics (aggregated across all requests) and per-request metrics. In this case, `e2e_requests` can be greater than 1, reflecting multiple completed requests in a batch.
-
-For **online inference** (serving mode), the summary is always per-request. `e2e_requests` is always 1, and only request-level metrics are reported for each completion.
-
----
-
-## Parameter Details
-
-### Summary Metrics
-
-| Field | Meaning |
-|---------------------------|----------------------------------------------------------------------------------------------|
-| `e2e_requests` | Number of completed requests. |
-| `e2e_wall_time_ms` | Wall-clock time span from run start to last completion, in ms. |
-| `e2e_total_tokens` | Total tokens counted across all completed requests (stage0 input + all stage outputs). |
-| `e2e_avg_time_per_request_ms` | Average wall time per request: `e2e_wall_time_ms / e2e_requests`. |
-| `e2e_avg_tokens_per_s` | Average token throughput over wall time: `e2e_total_tokens * 1000 / e2e_wall_time_ms`. |
-| `e2e_stage_{i}_wall_time_ms` | Wall-clock time span for stage i, in ms. Each stage's wall time is reported as a separate field, e.g., `e2e_stage_0_wall_time_ms`, `e2e_stage_1_wall_time_ms`, etc. |
-
----
-
-### E2E Table (per request)
-
-| Field | Meaning |
-|---------------------------|-----------------------------------------------------------------------|
-| `e2e_total_ms` | End-to-end latency in ms. |
-| `e2e_total_tokens` | Total tokens for the request (stage0 input + all stage outputs). |
-| `transfers_total_time_ms` | Sum of transfer edge `total_time_ms` for this request. |
-| `transfers_total_kbytes` | Sum of transfer kbytes for this request. |
-
-
----
-
-### Stage Table (per stage event / request)
-
-| Field | Meaning |
-|---------------------------|-------------------------------------------------------------------------------------------------|
-| `batch_id` | Batch index. |
-| `batch_size` | Batch size. |
-| `num_tokens_in` | Input tokens to the stage. |
-| `num_tokens_out` | Output tokens from the stage. |
-| `stage_gen_time_ms` | Stage compute time in ms, excluding postprocessing time (reported separately as `postprocess_time_ms`). |
-| `image_num` | Number of images generated (for diffusion/image stages). |
-| `resolution` | Image resolution (for diffusion/image stages). |
-| `postprocess_time_ms` | Diffusion/image: post-processing time in ms. |
-
----
-
-### Transfer Table (per edge / request)
-
-| Field | Meaning |
-|----------------------|---------------------------------------------------------------------------|
-| `size_kbytes` | Total kbytes transferred. |
-| `tx_time_ms` | Sender transfer time in ms. |
-| `rx_decode_time_ms` | Receiver decode time in ms. |
-| `in_flight_time_ms` | In-flight time in ms. |
-
-
-### Expectation of the Numbers (Verification)
-
-**Formulas:**
-
-- `e2e_total_tokens = Stage0's num_tokens_in + sum(all stages' num_tokens_out)`
-
-- `transfers_total_time_ms = sum(tx_time_ms + rx_decode_time_ms + in_flight_time_ms)` for every edge
-
-**Using the example above:**
-
-**e2e_total_tokens**
-
-- Stage0's `num_tokens_in`: **4,860**
-- Stage0's `num_tokens_out`: **67**
-- Stage1's `num_tokens_out`: **275**
-- Stage2's `num_tokens_out`: **0**
-
-so `e2e_total_tokens = 4,860 + 67 + 275 + 0 = 5,202`, which matches the table value `e2e_total_tokens`.
-
-**transfers_total_time_ms**
-
-For each edge:
-
-- 0->1: tx_time_ms (**78.701**) + rx_decode_time_ms (**111.865**) + in_flight_time_ms (**2.015**) = **192.581**
-
-- 1->2: tx_time_ms (**18.790**) + rx_decode_time_ms (**31.706**) + in_flight_time_ms (**2.819**) = **53.315**
-
-192.581 + 53.315 = **245.896**, which matches the reported `transfers_total_time_ms` of **245.895** (the 0.001 difference is due to rounding of the per-edge values).
diff --git a/docs/contributing/model/README.md b/docs/contributing/model/README.md
deleted file mode 100644
index b3e951c8bfe..00000000000
--- a/docs/contributing/model/README.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Adding a New Model
-
-This section provides comprehensive guidance on how to add a new model to vLLM-Omni.
-
-## Documentation
-
-- **[Adding an Omni-Modality Model](adding_omni_model.md)**: Complete step-by-step guide using Qwen3-Omni as an example.
-
-- **[Adding a Diffusion Model](adding_diffusion_model.md)**: Complete step-by-step guide using Qwen/Qwen-Image-Edit as an example.
-
-
-
-## Quick Start
-
-For a quick reference, see the [Adding a New Multi-Stage Model Guide](adding_omni_model.md) and [Adding a New Diffusion Model Guide](adding_diffusion_model.md).
diff --git a/docs/contributing/model/adding_diffusion_model.md b/docs/contributing/model/adding_diffusion_model.md
deleted file mode 100644
index 6d5782a6e3c..00000000000
--- a/docs/contributing/model/adding_diffusion_model.md
+++ /dev/null
@@ -1,1071 +0,0 @@
-# Adding a Diffusion Model
-
-This guide walks you through adding a new diffusion model to vLLM-Omni. We use **Qwen-Image** as the primary example, with references to other models (LongCat, Flux, Wan2.2) to illustrate different patterns.
-
-
----
-
-## Table of Contents
-
-1. [Overview](#overview)
-2. [Directory Structure](#directory-structure)
-3. [Basic Implementation](#basic-implementation)
-4. [Advanced Features](#advanced-features)
-5. [Troubleshooting](#troubleshooting)
-6. [Pull Request Checklist](#pull-request-checklist)
-7. [Reference Implementations](#reference-implementations)
-8. [Summary](#summary)
-
----
-
-## Overview
-
-vLLM-Omni's diffusion inference follows this architecture:
-
-
-
-
-
-
-
-
-**Key Components:**
-
-1. **Request Handling:** User prompts → `OmniDiffusionRequest`
-2. **Diffusion Engine:** Request → Preprocessing (optional) → Pipeline execution → Post-processing
-3. **Pipeline Execution:** Request → Encode prompt → Diffusion steps → VAE decode
-
-
-## Directory Structure
-
-Organize your model files following this structure:
-
-```
-vllm_omni/
-└── diffusion/
- ├── registry.py # ← Register your model here
- ├── request.py # Request data structures
- └── models/
- └── your_model_name/ # ← Create this directory
- ├── __init__.py # Export pipeline and transformer
- ├── pipeline_xxx.py # Pipeline implementation
- └── xxx_transformer.py # Transformer implementation
-```
-
-**Naming Conventions:**
-
-- **Model directory:** `your_model_name` (lowercase, underscores), e.g., `qwen_image`, `flux`, `longcat_image`, `wan2_2`
-- **Pipeline file:** `pipeline_xxx.py` where `xxx` describes the task, e.g., `pipeline_qwen_image.py`, `pipeline_qwen_image_edit.py`
-- **Transformer file:** `xxx_transformer.py` matching transformer class name, e.g., `qwen_image_transformer.py`, `flux_transformer.py`
-
----
-
-## Basic Implementation
-
-This section covers the minimal steps to get a model working in vLLM-Omni with basic features (online/offline serving, batch requests).
-
-### Step 1: Adapt Transformer Model
-
-The transformer is the core denoising network. Start by copying the transformer implementation from Diffusers and making these adaptations.
-
-
-#### 1.1: Remove Diffusers Mixins
-
-Diffusers' `Mixin` classes are not needed in vLLM-Omni. Remove them:
-
-```diff
-# Before (Diffusers)
-- from diffusers.models.modeling_utils import ModelMixin
-- from diffusers.models.attention_processor import AttentionModuleMixin
-
-- class YourModelTransformer2DModel(ModelMixin, AttentionModuleMixin):
-+ class YourModelTransformer2DModel(nn.Module):
- """Your transformer model."""
-```
-
-**Example mixins to remove:**
-
-- `ModelMixin` - Weight loading utilities (vLLM-Omni has its own weight loader)
-- `AttentionModuleMixin` - Attention processors (using vLLM-Omni's Attention layer instead)
-- `ConfigMixin` - Config management (not needed)
-- `PeftAdapterMixin` - Parameter-efficient fine-tuning utilities (not needed)
-
-#### 1.2: Replace Attention Implementation
-
-**The most important adaptation:** Replace Diffusers' attention with vLLM-Omni's optimized `Attention` layer.
-
-**Before (Diffusers):**
-```python
-from diffusers.models.attention_processor import dispatch_attention_fn
-
-class YourAttentionBlock(nn.Module):
- def forward(self, hidden_states, encoder_hidden_states=None, ...):
- ...
- hidden_states = dispatch_attention_fn(
- query, key, value,
- attn_mask=attention_mask,
- dropout_p=0.0,
- is_causal=False,
- backend=self._attention_backend,
- )
-```
-
-**After (vLLM-Omni):**
-```python
-from vllm_omni.diffusion.attention.layer import Attention
-from vllm_omni.diffusion.attention.backends.abstract import AttentionMetadata
-
-class YourAttentionBlock(nn.Module):
- def __init__(self, ...):
- super().__init__()
-
- # Initialize vLLM-Omni's Attention layer
- self.attn = Attention(
- num_heads=self.num_heads,
- head_size=self.head_dim,
- softmax_scale=1.0 / (self.head_dim ** 0.5),
- causal=False, # Diffusion models typically use bidirectional attention
- num_kv_heads=self.num_kv_heads,
- )
-
- def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None, ...):
- ...
- # Create attention metadata
- attn_metadata = AttentionMetadata(attn_mask=attention_mask)
- hidden_states = self.attn(query, key, value, attn_metadata=attn_metadata)
-
-```
-
-**Key Points:**
-
-- **Attention layer initialization:** Done in `__init__`, not per-forward
-- **Tensor shapes:** vLLM-Omni `Attention` expects QKV to have `[B, seq, num_heads, head_dim]` shape
-- **AttentionMetadata:** Wraps attention mask and other metadata
-
-**Attention backends:** vLLM-Omni selects the attention backend automatically based on the environment variable `DIFFUSION_ATTENTION_BACKEND`. The default attention backend for diffusion models is `FLASH_ATTN`.
-
-#### 1.3: Replace Imports and Utilities
-
-**Logger:**
-```diff
-- from diffusers.utils import logging
-- logger = logging.get_logger(__name__)
-
-+ from vllm.logger import init_logger
-+ logger = init_logger(__name__)
-```
-
-**Custom layers from vLLM and vLLM-Omni (if needed):**
-
-```python
-from vllm.model_executor.layers.layernorm import RMSNorm
-from vllm_omni.diffusion.layers.rope import RotaryEmbedding
-from vllm_omni.diffusion.layers.adalayernorm import AdaLayerNorm
-```
-
-#### 1.4: Remove Training-Only Code
-
-Remove code that's only needed for training:
-
-```diff
-# Remove gradient checkpointing
-- if torch.is_grad_enabled() and self.gradient_checkpointing:
-- hidden_states = torch.utils.checkpoint.checkpoint(
-- self._forward_block, hidden_states, ...
-- )
-- else:
-- hidden_states = self._forward_block(hidden_states, ...)
-+ hidden_states = self._forward_block(hidden_states, ...)
-
-# Remove training-specific attributes
-- self.gradient_checkpointing = False
-
-# Remove dropout (set to 0 or remove)
-- self.dropout = nn.Dropout(dropout_prob)
-+ # Removed dropout for inference
-```
-
-#### 1.5: Add Configuration Support
-
-Add support for vLLM-Omni's `OmniDiffusionConfig`:
-
-```python
-from vllm_omni.diffusion.data import OmniDiffusionConfig
-
-class YourModelTransformer2DModel(nn.Module):
-    def __init__(
-        self,
-        *,
-        od_config: OmniDiffusionConfig | None = None,  # ← Add vLLM-Omni config
-        # ... other model-specific parameters
-        num_layers: int = 28,
-        hidden_size: int = 3072,
-        num_heads: int = 24,
-        **kwargs,
-    ):
-        super().__init__()
-
-        # Store config
-        self.od_config = od_config
-        self.parallel_config = od_config.parallel_config if od_config else None
-
-        # Model architecture
-        self.num_layers = num_layers
-        self.hidden_size = hidden_size
-        # ... initialize layers
-```
-
-### Step 2: Adapt Pipeline
-
-The pipeline orchestrates the full generation process (text encoding, denoising loop, VAE decoding). Adapt it from Diffusers format to vLLM-Omni's interface.
-
-#### 2.1: Remove Diffusers Inheritance
-
-**Remove Diffusers base classes:**
-```diff
-- from diffusers import DiffusionPipeline
-- from diffusers.loaders import LoraLoaderMixin
-
-- class YourModelPipeline(DiffusionPipeline, LoraLoaderMixin):
-+ class YourModelPipeline(nn.Module):
- """Your model pipeline for vLLM-Omni."""
-```
-
-#### 2.2: Adapt `__init__` Method
-
-**Before (Diffusers):**
-```python
-class YourModelPipeline(DiffusionPipeline):
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModel,
-        tokenizer: CLIPTokenizer,
-        transformer: YourTransformer,
-        scheduler: FlowMatchEulerDiscreteScheduler,
-    ):
-        super().__init__()
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            tokenizer=tokenizer,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-```
-
-**After (vLLM-Omni):**
-```python
-import os
-
-from diffusers import AutoencoderKL
-from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
-from torch import nn
-from transformers import CLIPTextModel, CLIPTokenizer
-
-from vllm_omni.diffusion.data import OmniDiffusionConfig
-from vllm_omni.diffusion.distributed.utils import get_local_device
-from vllm_omni.diffusion.utils.tf_utils import get_transformer_config_kwargs
-from vllm_omni.diffusion.models.your_model_name.your_model_transformer import (
-    YourModelTransformer2DModel,
-)
-
-
-class YourModelPipeline(nn.Module):
-    def __init__(
-        self,
-        *,
-        od_config: OmniDiffusionConfig,
-        prefix: str = "",
-    ):
-        super().__init__()
-        self.od_config = od_config
-        self.parallel_config = od_config.parallel_config
-        self.device = get_local_device()
-        model = od_config.model
-        # If `model` is a local path, do not try to reach the Hugging Face Hub
-        local_files_only = os.path.exists(model)
-
-        # Load components from checkpoint
-        self.scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
-            model, subfolder="scheduler", local_files_only=local_files_only)
-        self.text_encoder = CLIPTextModel.from_pretrained(
-            model, subfolder="text_encoder", local_files_only=local_files_only).to(self.device)
-        self.tokenizer = CLIPTokenizer.from_pretrained(
-            model, subfolder="tokenizer", local_files_only=local_files_only)
-        self.vae = AutoencoderKL.from_pretrained(
-            model, subfolder="vae", local_files_only=local_files_only).to(self.device)
-
-        # Initialize transformer with vLLM-Omni config
-        transformer_kwargs = get_transformer_config_kwargs(
-            od_config.tf_model_config, YourModelTransformer2DModel)
-        self.transformer = YourModelTransformer2DModel(
-            od_config=od_config, **transformer_kwargs)
-
-        # Store VAE scale factor for latent space conversions
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.default_sample_size = 128  # Default latent size
-```
-
-**Key Changes:**
-
-1. **`od_config` parameter:** All configuration through `OmniDiffusionConfig`
-2. **Manual component loading:** No `register_modules()`, load each component explicitly
-3. **Local files support:** Check `os.path.exists(model)` for local checkpoints
-4. **Transformer with config:** Pass `od_config` to transformer constructor
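-
-The latent-size arithmetic in the last two lines of `__init__` can be checked in isolation. The sketch below is plain Python, and the `block_out_channels` values are illustrative assumptions rather than values read from any particular checkpoint:
-
-```python
-def vae_scale_factor(block_out_channels: list[int]) -> int:
-    """Compute 2 ** (number of VAE blocks - 1), the formula used in the pipeline."""
-    return 2 ** (len(block_out_channels) - 1)
-
-
-# Hypothetical VAE config with four blocks (an assumption for illustration).
-block_out_channels = [128, 256, 512, 512]
-scale = vae_scale_factor(block_out_channels)
-
-default_sample_size = 128  # default latent size, as in the pipeline above
-default_pixels = default_sample_size * scale
-
-print(scale)           # 8
-print(default_pixels)  # 1024
-```
-
-With these assumed values, a request that omits `height` and `width` would fall back to 1024×1024 pixels.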
-
-#### 2.3: Adapt `__call__` → `forward` Method
-
-**Change signature:**
-
-```diff
-- @torch.no_grad()
-- def __call__(
-+ def forward(
- self,
-+ req: OmniDiffusionRequest, # ← Add request parameter here
-- ):
-+ ) -> DiffusionOutput: # ← Add return type
-```
-
-[`OmniDiffusionRequest`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/request/#vllm_omni.diffusion.request.OmniDiffusionRequest) is a dataclass that contains the **prompts** and the **sampling parameters** ([`OmniDiffusionSamplingParams`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/inputs/data/#vllm_omni.inputs.data.OmniDiffusionSamplingParams)) for a diffusion pipeline execution. It also contains a `request_id` that other components use to trace the request and its outputs.
-
-Some of the parameters in `OmniDiffusionSamplingParams` are listed below:
-
-| Parameter | Type | Default | Description |
-|:---:|:---:|:---:|:---:|
-| `num_inference_steps` | `int` | 50 | The number of diffusion steps during inference |
-| `guidance_scale` | `float` | 0.0 | The classifier-free guidance scale |
-| `width` and `height` | `int` | None | The width and height of the generated image |
-
-**Extract parameters from request:**
-
-```python
-from vllm_omni.diffusion.request import OmniDiffusionRequest
-from vllm_omni.diffusion.data import DiffusionOutput
-
-def forward(
-    self,
-    req: OmniDiffusionRequest,
-) -> DiffusionOutput:
-    # Extract prompts from request
-    if req.prompts is not None:
-        prompt = [
-            p if isinstance(p, str) else (p.get("prompt") or "")
-            for p in req.prompts
-        ]
-
-    # Extract sampling parameters
-    sampling_params = req.sampling_params
-    num_inference_steps = sampling_params.num_inference_steps or 50
-    guidance_scale = sampling_params.guidance_scale or 7.5
-    height = sampling_params.height or (self.default_sample_size * self.vae_scale_factor)
-    width = sampling_params.width or (self.default_sample_size * self.vae_scale_factor)
-
-    # For image editing pipelines, extract images from multi_modal_data
-    if hasattr(req, 'multi_modal_data') and req.multi_modal_data:
-        input_images = req.multi_modal_data.get('image', [])
-
-    # ... rest of generation logic
-```
-
-For an image editing model, an example prompt entry in an `OmniDiffusionRequest` looks like this:
-```python
-{
-    "prompt": "turn this cat into a dog",
-    "multi_modal_data": {"image": input_image}
-}
-```
-
-**Wrap output:**
-
-```diff
- # Generate images
- images = self.vae.decode(latents)[0]
-
-- return {"images": images}
-+ return DiffusionOutput(output=images)
-```
-
-#### 2.4: Extract Pre/Post-Processing Functions
-
-vLLM-Omni separates image processing from the main pipeline for better modularity.
-
-**Post-processing function (required):**
-```python
-def get_your_model_post_process_func(
-    od_config: OmniDiffusionConfig,
-):
-    """
-    Create post-processing function for your model.
-
-    Returns a function that converts latents to images.
-    """
-    from diffusers.image_processor import VaeImageProcessor
-    import json
-    import os
-
-    # Load VAE config to get scale factor
-    model_path = od_config.model
-    if not os.path.exists(model_path):
-        from vllm_omni.diffusion.model_loader.utils import download_weights_from_hf_specific
-        model_path = download_weights_from_hf_specific(model_path, None, ["*"])
-
-    vae_config_path = os.path.join(model_path, "vae/config.json")
-    with open(vae_config_path) as f:
-        vae_config = json.load(f)
-        vae_scale_factor = 2 ** (len(vae_config["block_out_channels"]) - 1)
-
-    # Create image processor
-    image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
-
-    def post_process_func(images: torch.Tensor):
-        return image_processor.postprocess(images, output_type="pil")
-
-    return post_process_func
-```
-
-**Pre-processing function (for image editing pipelines):**
-
-```python
-def get_your_model_pre_process_func(
-    od_config: OmniDiffusionConfig,
-):
-    """
-    Create pre-processing function for image editing.
-
-    Returns a function that prepares input images.
-    """
-    from PIL import Image
-    from diffusers.image_processor import VaeImageProcessor
-
-    # Load VAE config
-    # ... (similar to post_process_func)
-
-    image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
-
-    def pre_process_func(
-        request: OmniDiffusionRequest,
-    ):
-        for i, prompt in enumerate(request.prompts):
-            multi_modal_data = prompt.get("multi_modal_data", {}) if not isinstance(prompt, str) else None
-            raw_image = multi_modal_data.get("image", None) if multi_modal_data is not None else None
-            # image pre-processing
-            # after pre-processing, update the request attributes
-            ...
-        return request
-
-    return pre_process_func
-```
-
-#### 2.5: Add Weight Loading Support
-
-Add methods for automatic weight downloading and loading:
-
-```python
-from collections.abc import Iterable
-
-import torch
-from torch import nn
-from vllm.model_executor.models.utils import AutoWeightsLoader
-
-from vllm_omni.diffusion.data import OmniDiffusionConfig
-from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader
-
-class YourModelPipeline(nn.Module):
-    def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
-        super().__init__()
-        # ... initialization code
-
-        # Define weight sources for automatic loading
-        self.weights_sources = [
-            DiffusersPipelineLoader.ComponentSource(
-                model_or_path=od_config.model,
-                subfolder="transformer",
-                revision=None,
-                prefix="transformer.",
-                fall_back_to_pt=True,
-            )
-        ]
-
-    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-        """
-        Customize the weight loading behavior, e.g., by filtering weight names.
-
-        Args:
-            weights: Iterable of (param_name, param_tensor) tuples
-
-        Returns:
-            Set of loaded parameter names
-        """
-        loader = AutoWeightsLoader(self)
-        return loader.load_weights(weights)
-```
-
-### Step 3: Register Model
-
-Register your model in `vllm_omni/diffusion/registry.py` so vLLM-Omni can discover and load it.
-
-#### 3.1: Register Pipeline Class
-
-```python
-# vllm_omni/diffusion/registry.py
-
-_DIFFUSION_MODELS = {
- # Format: "PipelineClassName": (module_folder, module_file, class_name)
-
- # Existing models
- "QwenImagePipeline": ("qwen_image", "pipeline_qwen_image", "QwenImagePipeline"),
- "FluxPipeline": ("flux", "pipeline_flux", "FluxPipeline"),
-
- # Add your model
- "YourModelPipeline": (
- "your_model_name", # Module folder name
- "pipeline_your_model", # Python file name (without .py)
- "YourModelPipeline", # Pipeline class name
- ),
-}
-```
-
-#### 3.2: Register Pre/Post-Processing Functions
-
-```python
-# vllm_omni/diffusion/registry.py
-_DIFFUSION_PRE_PROCESS_FUNCS = {
- # arch: pre_process_func
- # `pre_process_func` function must be placed in {mod_folder}/{mod_relname}.py,
- # where mod_folder and mod_relname are defined and mapped using `_DIFFUSION_MODELS` via the `arch` key
- "GlmImagePipeline": "get_glm_image_pre_process_func",
- "QwenImageEditPipeline": "get_qwen_image_edit_pre_process_func",
-
- # Add your model
- "YourModelPipeline": "get_your_model_pre_process_func", # Optional
-}
-_DIFFUSION_POST_PROCESS_FUNCS = {
- # Format: "PipelineClassName": "function_name"
-
- # Existing models
- "QwenImagePipeline": "get_qwen_image_post_process_func",
- "FluxPipeline": "get_flux_post_process_func",
-
- # Add your model
- "YourModelPipeline": "get_your_model_post_process_func",
-}
-```
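-
-To make the registry format concrete, here is a minimal sketch of how an entry could be turned into an importable location. It is not the actual lookup code in `registry.py`: the `_MODELS_PACKAGE` constant and both `resolve_*` helpers are hypothetical names, and the base package is only inferred from the import paths used elsewhere in this guide.
-
-```python
-# Hypothetical base package, inferred from the import paths in this guide.
-_MODELS_PACKAGE = "vllm_omni.diffusion.models"
-
-_DIFFUSION_MODELS = {
-    "YourModelPipeline": ("your_model_name", "pipeline_your_model", "YourModelPipeline"),
-}
-_DIFFUSION_POST_PROCESS_FUNCS = {
-    "YourModelPipeline": "get_your_model_post_process_func",
-}
-
-
-def resolve_pipeline(arch: str) -> tuple[str, str]:
-    """Return (dotted module path, class name) for a registered architecture."""
-    if arch not in _DIFFUSION_MODELS:
-        raise KeyError(f"{arch!r} is not registered in _DIFFUSION_MODELS")
-    mod_folder, mod_relname, cls_name = _DIFFUSION_MODELS[arch]
-    return f"{_MODELS_PACKAGE}.{mod_folder}.{mod_relname}", cls_name
-
-
-def resolve_post_process(arch: str) -> tuple[str, str]:
-    """Assume the post-process factory lives in the same module as the pipeline."""
-    module_path, _ = resolve_pipeline(arch)
-    return module_path, _DIFFUSION_POST_PROCESS_FUNCS[arch]
-
-
-print(resolve_pipeline("YourModelPipeline"))
-print(resolve_post_process("YourModelPipeline"))
-```
-
-The practical consequence is that the folder name, file name, and class name in the tuple must match your files exactly, and the processing functions must be defined in the pipeline file that the tuple points to.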
-
-
-#### 3.3: Export from Module
-
-Create/update `__init__.py` to export your classes:
-
-```python
-# vllm_omni/diffusion/models/your_model_name/__init__.py
-
-from .pipeline_your_model import (
- YourModelPipeline,
- get_your_model_post_process_func,
-)
-from .your_model_transformer import YourModelTransformer2DModel
-
-__all__ = [
- "YourModelPipeline",
- "YourModelTransformer2DModel",
- "get_your_model_post_process_func",
-]
-```
-
----
-
-### Step 4: Add Example Script
-
-
-If your model is a Text-to-Image, Text-to-Audio, Text-to-Video, Image-to-Image, or Image-to-Video model, you can run it with one of the following offline inference scripts:
-
-| Model Category | Offline Inference Script |
-|---|---|
-| Image-to-Image | `examples/offline_inference/image_to_image/image_edit.py` |
-| Image-to-Video | `examples/offline_inference/image_to_video/image_to_video.py` |
-| Text-to-Image | `examples/offline_inference/text_to_image/text_to_image.py` |
-| Text-to-Audio | `examples/offline_inference/text_to_audio/text_to_audio.py` |
-| Text-to-Video | `examples/offline_inference/text_to_video/text_to_video.py` |
-
-
-If new CLI arguments need to be added, please edit the offline inference script corresponding to your model category from the table above, and update the example inference script in its corresponding document file (e.g., `examples/offline_inference/text_to_video/text_to_video.md`).
-
-For online inference, all the supported tasks are listed in `docs/user_guide/examples/online_serving/`. If your model falls into these categories, please check the corresponding documentation in this folder and the example at `examples/online_serving/TASK_NAME`. Update them accordingly if needed.
-
----
-
-If your model is an Omni (understanding and generation) model, please follow the steps below.
-
-#### 4.1: Create Example File
-
-Take the **BAGEL** model as an example for both offline and online inference:
-
-- Offline: `examples/offline_inference/bagel/`
-- Online: `examples/online_serving/bagel/`
-
-Add **two example folders** for your model:
-
-```bash
-mkdir -p examples/offline_inference/your_model_name
-mkdir -p examples/online_serving/your_model_name
-```
-
-**Offline (recommended minimum):** create `examples/offline_inference/your_model_name/end2end.py` and a README.
-
-- Script: `examples/offline_inference/your_model_name/end2end.py`
- - Parse args like BAGEL (`--model`, `--modality`, optional `--image-path`, `--steps`, etc.)
- - Use `from vllm_omni.entrypoints.omni import Omni` (or `OmniDiffusion` if your model is diffusion-only)
- - Save outputs (images/audio/video/text) with deterministic filenames (e.g., `output_0_0.png`)
-- Doc: `examples/offline_inference/your_model_name/README.md`
- - Include at least one runnable command, e.g.:
-
-```bash
-cd examples/offline_inference/your_model_name
-python end2end.py --model your-org/your-model-name --modality text2img --prompts "A cute cat"
-```
-
-#### 4.2: Add Online Serving Example (OpenAI-Compatible)
-
-Mirror BAGEL’s online serving setup:
-
-- Server launcher: `examples/online_serving/your_model_name/run_server.sh`
- - Wrap `vllm serve ... --omni --port ...` (and `--stage-configs-path ...` if needed)
-- Client: `examples/online_serving/your_model_name/openai_chat_client.py`
- - Send requests to `POST /v1/chat/completions`
- - Support multimodal inputs (e.g., base64 image) if your model needs it
-- Doc: `examples/online_serving/your_model_name/README.md`
- - Include both “launch server” and “send request”:
-
-```bash
-# Terminal 1: launch server
-cd examples/online_serving/your_model_name
-bash run_server.sh
-
-# Terminal 2: send request
-python openai_chat_client.py --prompt "A cute cat" --modality text2img
-```
-
-
-### Step 5: Test Your Implementation
-
-Before submitting, thoroughly test your implementation.
-
-#### 5.1: Performance/Speed Check
-
-Manually compare **latency/throughput** and **output quality** against a Diffusers baseline.
-
-For a fair comparison, keep the same **prompt**, **seed**, **resolution**, **num_inference_steps**, and **guidance settings**, and run multiple trials to reduce randomness. Record the results (and your hardware / driver / CUDA versions) in your PR description.
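-
-The multi-trial measurement described above can be scripted with the standard library. This is a minimal sketch: `benchmark` and `run_once` are hypothetical names, and the lambda is a stand-in workload that you would replace with a closure running one generation request with fixed prompt, seed, and resolution.
-
-```python
-import statistics
-import time
-from typing import Callable
-
-
-def benchmark(run_once: Callable[[], object], warmup: int = 1, trials: int = 5) -> dict[str, float]:
-    """Time `run_once` over several trials and summarize latency in seconds."""
-    for _ in range(warmup):
-        run_once()  # warm-up runs are excluded from the statistics
-    latencies = []
-    for _ in range(trials):
-        start = time.perf_counter()
-        run_once()
-        latencies.append(time.perf_counter() - start)
-    return {
-        "mean_s": statistics.mean(latencies),
-        "stdev_s": statistics.stdev(latencies) if trials > 1 else 0.0,
-        "min_s": min(latencies),
-    }
-
-
-# Stand-in workload; replace with a closure that runs one generation request.
-stats = benchmark(lambda: sum(range(100_000)), warmup=1, trials=5)
-print(stats)
-```
-
-For GPU workloads, make sure `run_once` returns only after the output has actually been produced; otherwise asynchronous kernel launches can make the measured latency look shorter than it is.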
-
-
-#### 5.2: Functionality Check in CI
-
-To ensure project maintainability and sustainable development, please submit test code (unit tests, system tests, or end-to-end tests) alongside your code changes.
-
-For comprehensive testing guidelines and the definition of test levels (L1-L5), please refer to the [Multi-Level Automated Testing System Documentation](../ci/CI_5levels.md). At a minimum, you are required to add an L4 *functionality* test as described in that document.
-
----
-
-## Advanced Features
-
-Once the basic implementation works, add advanced features for better performance.
-
-### torch.compile Support
-
-Enable automatic compilation for repeated blocks:
-
-```python
-# In your_model_transformer.py
-
-class YourModelTransformer2DModel(nn.Module):
-    # Specify which blocks can be compiled
-    _repeated_blocks = ["YourTransformerBlock"]  # List of block class names
-
-    def __init__(self, ...):
-        super().__init__()
-        # ... initialization
-```
-
-vLLM-Omni automatically compiles blocks in `_repeated_blocks` when `torch.compile` is available.
-
-### Tensor Parallelism
-
-See detailed guide: [How to add Tensor Parallel support](../../design/feature/tensor_parallel.md)
-
-**Quick setup:**
-
-1. Replace `Linear` layers with vLLM's parallel linear layers (e.g., `ColumnParallelLinear`)
-2. Check `tp_size` validity: `hidden_dim`, `num_heads`, and `num_kv_heads` must be divisible by `tp_size`
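-
-The divisibility rule in step 2 can be expressed as a small guard. This is a sketch only: `check_tp_size` is a hypothetical helper, not a vLLM-Omni function, and the dimensions in the example call are illustrative.
-
-```python
-def check_tp_size(hidden_dim: int, num_heads: int, num_kv_heads: int, tp_size: int) -> None:
-    """Raise ValueError if any sharded dimension is not divisible by tp_size."""
-    if tp_size < 1:
-        raise ValueError(f"tp_size must be >= 1, got {tp_size}")
-    dims = {"hidden_dim": hidden_dim, "num_heads": num_heads, "num_kv_heads": num_kv_heads}
-    bad = [f"{name}={value}" for name, value in dims.items() if value % tp_size != 0]
-    if bad:
-        raise ValueError(f"Not divisible by tp_size={tp_size}: {', '.join(bad)}")
-
-
-# Illustrative dimensions: every value is divisible by 2, so this passes silently.
-check_tp_size(hidden_dim=3072, num_heads=24, num_kv_heads=24, tp_size=2)
-```
-
-Running a check like this in the transformer's `__init__` surfaces an invalid `tp_size` before any weights are sharded.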
-
-**Usage:** Set `tensor_parallel_size` when initializing:
-```python
-omni = Omni(model="your-model", tensor_parallel_size=2)
-```
-
-### CFG Parallelism
-
-See detailed guide: [How to add CFG-Parallel support](../../design/feature/cfg_parallel.md)
-
-**Quick setup:**
-
-1. Implement `diffuse()` method
-2. Inherit `CFGParallelMixin` in your pipeline class
-
-**Usage:** Set `cfg_parallel_size` when initializing:
-```python
-omni = Omni(model="your-model", cfg_parallel_size=2)
-```
-
-### Sequence Parallelism
-
-See detailed guide: [How to add Sequence Parallel support](../../design/feature/sequence_parallel.md)
-
-**Quick setup:**
-
-1. Add `_sp_plan` class attribute to transformer
-2. Specify where to shard/gather tensors
-
-**Usage:** Set `ulysses_degree` and `ring_degree` when initializing:
-```python
-omni = Omni(model="your-model", ulysses_degree=2, ring_degree=2)
-```
-
-### Step Execution
-
-See detailed design guide: [How to add step execution support](../../design/feature/diffusion_step_execution.md)
-
-Use this only when your pipeline can be split into stable request-scoped and
-step-scoped phases. The reference implementation is
-`QwenImagePipeline`, which maps its request-level `forward()` into:
-
-1. `prepare_encode()` for prompt encoding, latent init, timestep prep, and per-request scheduler setup.
-2. `denoise_step()` for one transformer/noise prediction.
-3. `step_scheduler()` for one scheduler update and `step_index` advance.
-4. `post_decode()` for the final VAE decode.
-
-Do not enable `step_execution=True` until those four methods are implemented
-and validated against the request-level path.
-
-### Cache Acceleration
-
-#### TeaCache
-
-See detailed guide: [How to add TeaCache support](../../design/feature/teacache.md)
-
-**Quick setup:**
-
-1. Write extractor function
-2. Register in `EXTRACTOR_REGISTRY`
-3. Add polynomial coefficients
-
-**Usage:** Set `cache_backend` and `cache_config` when initializing:
-```python
-omni = Omni(
-    model="your-model",
-    cache_backend="tea_cache",
-    cache_config={"rel_l1_thresh": 0.2},
-)
-```
-
-
-#### Cache-DiT
-
-See detailed guide: [How to add Cache-DiT support](../../design/feature/cache_dit.md)
-
-**Quick setup:**
-
-- For standard models: Works automatically
-- For complex architectures: Write custom cache config
-
-**Usage:** Set `cache_backend` and `cache_config` when initializing:
-```python
-omni = Omni(
-    model="your-model",
-    cache_backend="cache_dit",
-    cache_config={
-        "Fn_compute_blocks": 1,
-        "Bn_compute_blocks": 0,
-        "max_warmup_steps": 4,
-    },
-)
-```
-
-### CPU Offload
-
-See detailed guide: [CPU Offloading for Diffusion Models](../../user_guide/diffusion/cpu_offload_diffusion.md)
-
-vLLM-Omni provides two offloading strategies to reduce GPU memory usage:
-
-1. **Model-level offload**: Mutual exclusion between DiT and encoders (only one on GPU at a time)
-2. **Layerwise (Blockwise) offload**: Keeps only a single transformer block on GPU at a time with compute-memory overlap
-
-**Usage:** Enable offload when initializing:
-```python
-# Model-level offload
-omni = Omni(model="your-model", enable_cpu_offload=True)
-
-# Layerwise offload
-omni = Omni(model="your-model", enable_layerwise_offload=True)
-```
-
-**To support layerwise offloading:** Define the blocks attribute name in your transformer:
-
-```python
-class WanTransformer3DModel(nn.Module):
-    _layerwise_offload_blocks_attrs = ["blocks"]  # Attribute name containing transformer blocks
-
-    def __init__(self):
-        super().__init__()  # Required before registering submodules
-        self.blocks = nn.ModuleList([...])  # Transformer blocks
-```
-
-**Note:** Layerwise offloading is primarily recommended for large **video generation models** where the compute cost per block is high enough to effectively overlap with memory prefetch operations.
-
-
----
-
-### Diffusion Pipeline Profiler (Performance Profiling)
-
-When adapting a new diffusion model, it is often useful to analyze the latency of key components such as text encoding, diffusion denoising, and VAE decoding.
-vLLM-Omni provides a timing utility via `DiffusionPipelineProfilerMixin` to help developers quickly identify performance bottlenecks.
-
-!!! info
-    `DiffusionPipelineProfilerMixin` is different from profiling diffusion models with `torch.profiler`, as introduced in this [tutorial](https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/profiling.md). `DiffusionPipelineProfilerMixin` only prints timing information for selected functions (such as `vae.decode`), while `torch.profiler` records detailed GPU/CPU computation time and call/execution steps.
-
-This tool automatically measures the execution time of selected pipeline modules and prints the results in the logs.
-
-**Enabling Diffusion Pipeline Profiler**
-
-Enable timing by passing `--enable-diffusion-pipeline-profiler` when launching the server:
-```bash
-vllm serve Qwen/Qwen-Image --omni --port 8091 --enable-diffusion-pipeline-profiler
-```
-You can optionally specify which modules to profile:
-```python
-class YourModelPipeline(nn.Module, DiffusionPipelineProfilerMixin):
-    def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
-        ...
-        self.setup_diffusion_pipeline_profiler(
-            profiler_targets=["diffuse"],
-            enable_diffusion_pipeline_profiler=od_config.enable_diffusion_pipeline_profiler,
-        )
-```
-If not specified, the default targets are used:
-```python
-["vae.encode", "vae.decode", "diffuse", "text_encoder.forward", "tokenizer.forward"]
-```
-
-**Adding DiffusionPipelineProfilerMixin to a Pipeline**
-
-To enable timing support in your pipeline, inherit from `DiffusionPipelineProfilerMixin`.
-```python
-from vllm_omni.diffusion.profiler import DiffusionPipelineProfilerMixin
-
-class YourModelPipeline(nn.Module, DiffusionPipelineProfilerMixin):
-    # Optional: Specify custom timing targets
-    _PROFILER_TARGETS = ["vae.encode", "vae.decode", "diffuse", "text_encoder.forward", "tokenizer.forward"]
-
-    def __init__(
-        self,
-        *,
-        od_config: OmniDiffusionConfig,
-        prefix: str = "",
-    ):
-        super().__init__()
-        self.od_config = od_config
-        self.parallel_config = od_config.parallel_config
-        # initialize pipeline components
-        ...
-
-        # initialize timing profiler
-        self.setup_diffusion_pipeline_profiler(
-            enable_diffusion_pipeline_profiler=self.od_config.enable_diffusion_pipeline_profiler
-        )
-```
-The mixin dynamically wraps selected methods and records their execution time during inference.
-
-If you need to fetch the execution time of different modules, you will need to pass `self.stage_durations` to `DiffusionOutput`, as shown below:
-
-```diff
-- return DiffusionOutput(output=images)
-+ return DiffusionOutput(
-+     output=images,
-+     stage_durations=self.stage_durations if hasattr(self, "stage_durations") else None,
-+ )
-```
-
-**Pipeline Design for Timing**
-
-The current diffusion timing utility is function-based: it measures the execution time of individual methods.
-
-When implementing a new pipeline, avoid putting all logic inside a single function (e.g., `forward`). Instead, separate key stages, such as the diffusion loop, into independent methods.
-
-For example:
-```python
-def forward(self, req: OmniDiffusionRequest):
-    prompt_embeds = self.encode_prompt(req)
-    latents = self.diffuse(prompt_embeds, req)
-    images = self.vae.decode(latents)
-    return DiffusionOutput(output=images)
-```
-This allows the timing utility to measure each stage (e.g., `encode_prompt`, `diffuse`, `vae.decode`) separately, which makes performance bottlenecks easier to identify.
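-
-To see why separate methods matter for a function-based timer, consider the toy sketch below. It is not the implementation of `DiffusionPipelineProfilerMixin`: `timed` and `ToyPipeline` are hypothetical stand-ins, and only the `stage_durations` attribute name is borrowed from the text above.
-
-```python
-import functools
-import time
-
-
-def timed(name: str, durations: dict[str, float], fn):
-    """Wrap `fn` so each call adds its wall-clock time to durations[name]."""
-    @functools.wraps(fn)
-    def wrapper(*args, **kwargs):
-        start = time.perf_counter()
-        try:
-            return fn(*args, **kwargs)
-        finally:
-            durations[name] = durations.get(name, 0.0) + time.perf_counter() - start
-    return wrapper
-
-
-class ToyPipeline:
-    """Stand-in pipeline with separate stage methods (not a real vLLM-Omni class)."""
-
-    def __init__(self):
-        self.stage_durations: dict[str, float] = {}
-        # Only methods that exist as separate attributes can be wrapped.
-        for name in ("encode_prompt", "diffuse"):
-            setattr(self, name, timed(name, self.stage_durations, getattr(self, name)))
-
-    def encode_prompt(self, prompt: str) -> list[int]:
-        return [ord(c) for c in prompt]
-
-    def diffuse(self, embeds: list[int]) -> int:
-        return sum(embeds)
-
-    def forward(self, prompt: str) -> int:
-        return self.diffuse(self.encode_prompt(prompt))
-
-
-pipe = ToyPipeline()
-result = pipe.forward("cat")
-print(result, sorted(pipe.stage_durations))
-```
-
-If `forward` inlined both stages, the wrapper would have nothing to attach to and could report only a single total.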
-
-
-**Default Profiled Modules**
-
-By default, the following pipeline modules are timed:
-```
-vae.encode
-vae.decode
-diffuse
-text_encoder.forward
-tokenizer.forward
-```
-
-**Example Output**
-
-When enabled, timing logs appear like this:
-```
-[DiffusionPipelineProfiler] text_encoder.forward took 0.018s
-[DiffusionPipelineProfiler] diffuse took 2.412s
-[DiffusionPipelineProfiler] vae.decode took 0.063s
-```
-These measurements help identify bottlenecks during model adaptation and optimization.
-
-
-
-## Troubleshooting
-
-
-**Issue: ImportError when loading model**
-
-**Symptoms:** `ModuleNotFoundError` or `ImportError` when calling `Omni(model="your-model")`
-
-**Causes:**
-
-1. Model not registered in `registry.py`
-2. Wrong class name in registry
-3. Missing `__init__.py` exports
-
-
-**Issue: Shape mismatch in attention**
-
-**Symptoms:** `RuntimeError: shape mismatch` in attention forward
-
-**Cause:** Incorrect tensor reshaping for vLLM-Omni's attention interface
-
-**Solution:** Ensure correct shapes:
-
-```python
-# vLLM-Omni expects: [batch, seq_len, num_heads, head_dim]
-query = query.view(batch_size, seq_len, self.num_heads, self.head_dim)
-key = key.view(batch_size, kv_seq_len, self.num_kv_heads, self.head_dim)
-value = value.view(batch_size, kv_seq_len, self.num_kv_heads, self.head_dim)
-
-hidden_states = self.attn(query, key, value, attn_metadata=attn_metadata)
-
-# Reshape back: [batch, seq_len, num_heads, head_dim] → [batch, seq_len, hidden_size]
-hidden_states = hidden_states.reshape(batch_size, seq_len, -1)
-```
-
-**Issue: Different outputs compared to Diffusers**
-
-**Symptoms:** Generated images look different from Diffusers
-
-**Causes:**
-
-1. Attention backend differences (FlashAttention vs PyTorch SDPA)
-2. Missing normalization or scaling
-
-**Issue: Out of memory (OOM)**
-
-**Symptoms:** CUDA out of memory errors
-
-**Solutions:**
-
-1. **Reduce batch size:**
-    ```python
-    omni.generate(prompts=[...], max_num_seqs=2)
-    ```
-
-2. **Use smaller image size:**
-    ```python
-    sampling_params = OmniDiffusionSamplingParams(height=512, width=512)
-    ```
-
-3. **Enable model offloading:**
-    ```python
-    omni = Omni(model="...", enable_cpu_offload=True)
-    ```
-
-4. **Apply VAE tiling and slicing:**
-    ```python
-    omni = Omni(model="...", vae_use_slicing=True, vae_use_tiling=True)
-    ```
-
----
-
-## Pull Request Checklist
-
-When submitting a PR to add your model, include:
-
-**1. Implementation Files**
-
-- ✅ Transformer model (`xxx_transformer.py`)
-- ✅ Pipeline (`pipeline_xxx.py`)
-- ✅ Registry entries in `registry.py`
-- ✅ `__init__.py` with proper exports
-
-**2. Example and Tests**
-
-- ✅ Example script in `examples/`
-- ✅ Test file in `tests/e2e/`
-- ✅ Documentation (`docs/`) creation or updates
-
-_Note: End-to-end test files in `tests/e2e/` are optional but strongly recommended. README updates are required for all new models._
-
-**3. Documentation Updates**
-
-- ✅ Add model to supported models table in `docs/models/supported_models.md`
-- ✅ If supporting acceleration features (e.g., sequence parallelism, CFG parallel), update acceleration feature tables in:
- - `docs/user_guide/diffusion_acceleration.md`
- - `docs/user_guide/diffusion/parallelism_acceleration.md`
-
----
-
-## Model Recipe
-
-After implementing and testing your model, please add a model recipe to the [vllm-project/recipes](https://github.com/vllm-project/recipes) repository. This helps other users understand how to use your model with vLLM-Omni.
-
-**What to Include**
-
-Your recipe should include:
-
-1. **Model Overview**: Brief description of the model and its capabilities
-2. **Installation Instructions**: Step-by-step setup instructions including:
- - Installing vllm-omni and dependencies
- - Installing any additional required packages (e.g., xformers, diffusers)
- - Any version requirements
-3. **Usage Examples**: Command-line examples demonstrating how to run the model
-4. **Configuration Details**: Important configuration parameters and their meanings
-
-**Example**
-
-For reference, see the [LongCat recipe example](https://github.com/vllm-project/recipes/pull/179) which demonstrates the expected format and structure.
-
-**Recipe Location**
-
-Create your recipe file in the appropriate directory structure:
-
-- For organization-specific models: `OrganizationName/ModelName.md`
-- For general models: `ModelName.md`
-
-The recipe should be a Markdown file that provides clear, reproducible instructions for users to get started with your model.
-
----
-
-## Reference Implementations
-
-Study these complete examples:
-
-| Model | Architecture | Key Features | Files |
-|-------|--------------|--------------|-------|
-| **Qwen-Image** | Dual-stream transformer | CFG-Parallel, SP, TP, Cache | `vllm_omni/diffusion/models/qwen_image/` |
-| **Wan2.2** | Video transformer | Dual transformers, SP, CFG-Parallel | `vllm_omni/diffusion/models/wan2_2/` |
-
----
-
-## Summary
-
-Adding a diffusion model to vLLM-Omni involves:
-
-1. ✅ **Adapt transformer** - Replace attention, remove mixins, add config support
-2. ✅ **Adapt pipeline** - Change interface, add request handling, extract processing
-3. ✅ **Register model** - Add entries to `registry.py`
-4. ✅ **Add examples** - Provide runnable scripts
-5. ✅ **Test thoroughly** - Verify correctness and performance
-6. ✅ **Add advanced features** - Enable parallelism and acceleration (optional)
-7. ✅ **Submit PR** - Include verification results and documentation
-
-**Need help?** Check the reference implementations, or ask on Slack at [slack.vllm.ai](https://slack.vllm.ai) or on the vLLM user forum at [discuss.vllm.ai](https://discuss.vllm.ai).
diff --git a/docs/contributing/model/adding_omni_model.md b/docs/contributing/model/adding_omni_model.md
deleted file mode 100644
index 1eaff10596c..00000000000
--- a/docs/contributing/model/adding_omni_model.md
+++ /dev/null
@@ -1,624 +0,0 @@
-# Adding an Omni-Modality Model
-
-This guide walks through the process of adding a new multi-stage model to vLLM-Omni, using **Qwen3-Omni** as a comprehensive example. Qwen3-Omni is a multi-stage omni-modality model that demonstrates the full capabilities of vLLM-Omni's architecture.
-
-## Table of Contents
-
-1. [Overview](#overview)
-2. [Directory Structure](#directory-structure)
-3. [Step-by-Step Implementation](#step-by-step-implementation)
-4. [Key Components](#key-components)
-5. [Model Registration](#model-registration)
-6. [Stage Configuration](#stage-configuration)
-7. [Stage Input Processors](#stage-input-processors)
-8. [Testing](#testing)
-9. [Adding a Model Recipe](#adding-a-model-recipe)
-10. [Summary](#summary)
-
-## Overview
-
-vLLM-Omni supports multi-stage model architectures where different stages can run on different devices and process different modalities. The Qwen3-Omni model exemplifies this with three stages:
-
-1. **Thinker Stage**: Multimodal understanding (text + audio + video) → text generation
-2. **Talker Stage**: Text embeddings → RVQ codec codes
-3. **Code2Wav Stage**: RVQ codes → audio waveform
-
-Each stage is implemented as a separate model class that can be configured independently.
-
-## Directory Structure
-
-When adding a new model, you'll need to create the following structure:
-
-```
-vllm_omni/model_executor/models/
-└── your_model_name/                          # Model directory (e.g., qwen3_omni)
-    ├── __init__.py                           # Exports main model class
-    ├── your_model.py                         # Main unified model class
-    ├── your_model_stage1_implementation.py   # Stage 1 implementation (e.g., thinker)
-    ├── your_model_stage2_implementation.py   # Stage 2 implementation (e.g., talker)
-    ├── your_model_stage3_implementation.py   # Stage 3 implementation (e.g., code2wav)
-    └── ...                                   # Other stage implementations, if any
-
-vllm_omni/model_executor/stage_input_processors/
-└── your_model_name.py # Stage transition processors
-
-vllm_omni/model_executor/stage_configs/
-└── your_model_name.yaml # Stage configuration file
-```
-
-## Step-by-Step Implementation
-
-### Step 1: Create the Model Directory
-
-Create a new directory for your model under `vllm_omni/model_executor/models/` (e.g., `qwen3_omni/`).
-
-### Step 2: Implement Stage Components
-
-For Qwen3-Omni, we have three stage components:
-
-#### 2.1 Thinker Stage (`qwen3_omni_moe_thinker.py`)
-
-The thinker stage handles multimodal understanding. Key features:
-
-- Inherits from the base Qwen3 MoE model in vLLM, using vLLM's fused ops and PagedAttention for acceleration
-- Implements multimodal processing interfaces
-- Handles audio, video, and image inputs
-- Generates text outputs
-
-```python
-from vllm.model_executor.models.interfaces import SupportsMultiModal, SupportsPP
-from vllm.model_executor.models.qwen3_moe import Qwen3MoeForCausalLM
-
-class Qwen3OmniMoeThinkerForConditionalGeneration(
-    Qwen3MoeForCausalLM,
-    SupportsMultiModal,
-    SupportsPP
-):
-    """Thinker stage: multimodal understanding → text generation."""
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        # Initialize base model
-        # Set up multimodal processors
-        # Configure audio/video/image encoders
-        pass
-```
-
-#### 2.2 Talker Stage (`qwen3_omni_moe_talker.py`)
-
-The talker stage converts text embeddings to codec codes:
-
-```python
-class Qwen3OmniMoeTalkerForConditionalGeneration(
-    Qwen3MoeForCausalLM,
-    SupportsPP
-):
-    """Talker stage: text embeddings → RVQ codec codes."""
-
-    def __init__(self, vllm_config, talker_config, prefix):
-        # Initialize base model
-        # Replace LM head with codec head
-        # Set up text projection from thinker
-        pass
-```
-
-#### 2.3 Code2Wav Stage (`qwen3_omni_code2wav.py`)
-
-The code2wav stage generates audio waveforms:
-
-```python
-class Qwen3OmniMoeCode2Wav(nn.Module):
-    """Code2Wav stage: RVQ codes → audio waveform."""
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        # Initialize audio decoder
-        # Set up codec processing
-        pass
-```
-
-### Step 3: Implement the Unified Model Class
-
-The main model class (`qwen3_omni.py`) orchestrates all stages:
-
-```python
-@MULTIMODAL_REGISTRY.register_processor(
-    Qwen3OmniMoeThinkerMultiModalProcessor,
-    info=Qwen3OmniMoeThinkerProcessingInfo,
-    dummy_inputs=Qwen3OmniMoeThinkerDummyInputsBuilder,
-)
-class Qwen3OmniMoeForConditionalGeneration(
-    nn.Module, SupportsMultiModal, SupportsPP, Qwen3OmniMoeConditionalGenerationMixin
-):
-    """
-    Unified Qwen3 Omni MoE model combining thinker, talker, and code2wav.
-
-    Architecture:
-    - Thinker: Multimodal understanding (text + audio + video) → text generation
-    - Talker: Text embeddings → RVQ codec codes
-    - Code2Wav: RVQ codes → audio waveform
-
-    Usage:
-        Set `model_stage` in vllm_config to one of: "thinker", "talker", "code2wav"
-    """
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        self.have_multimodal_outputs = True
-        config: Qwen3OmniMoeConfig = vllm_config.model_config.hf_config
-
-        # Determine which stage to initialize
-        self.model_stage = vllm_config.model_config.model_stage
-
-        if self.model_stage == "thinker":
-            # Initialize thinker model
-            thinker_vllm_config = vllm_config.with_hf_config(
-                config.thinker_config,
-                architectures=["Qwen3OmniMoeThinkerForConditionalGeneration"]
-            )
-            self.thinker = init_vllm_registered_model(
-                vllm_config=thinker_vllm_config,
-                prefix=maybe_prefix(prefix, "thinker"),
-                hf_config=config.thinker_config,
-                architectures=["Qwen3OmniMoeThinkerForConditionalGeneration"],
-            )
-            self.model = self.thinker
-
-        elif self.model_stage == "talker":
-            # Initialize talker model
-            talker_vllm_config = vllm_config.with_hf_config(
-                config.talker_config,
-                architectures=["Qwen3OmniMoeTalkerForConditionalGeneration"]
-            )
-            self.talker = init_vllm_registered_model(
-                vllm_config=talker_vllm_config,
-                prefix=maybe_prefix(prefix, "talker"),
-                hf_config=config.talker_config,
-                architectures=["Qwen3OmniMoeTalkerForConditionalGeneration"],
-            )
-            self.model = self.talker
-
-        elif self.model_stage == "code2wav":
-            # Initialize code2wav model
-            code2wav_vllm_config = vllm_config.with_hf_config(
-                config.code2wav_config,
-                architectures=["Qwen3OmniMoeCode2Wav"]
-            )
-            self.code2wav = init_vllm_registered_model(
-                vllm_config=code2wav_vllm_config,
-                prefix=maybe_prefix(prefix, "code2wav"),
-                hf_config=config.code2wav_config,
-                architectures=["Qwen3OmniMoeCode2Wav"],
-            )
-            self.model = self.code2wav
-        else:
-            raise ValueError(
-                f"Invalid model_stage: {self.model_stage}. "
-                f"Must be one of: 'thinker', 'talker', 'code2wav'"
-            )
-```
-
-#### Key Methods to Implement
-
-1. **`forward()`**: Handles the forward pass for each stage
-2. **`embed_input_ids()`**: Embeds input token IDs
-3. **`embed_multimodal()`**: Processes multimodal inputs (if applicable)
-4. **`compute_logits()`**: Computes logits from hidden states
-5. **`load_weights()`**: Loads model weights, applying the correct prefix for each stage
-
-### Step 4: Create `__init__.py`
-
-Export the main model class:
-
-```python
-# vllm_omni/model_executor/models/qwen3_omni/__init__.py
-from .qwen3_omni import Qwen3OmniMoeForConditionalGeneration
-
-__all__ = ["Qwen3OmniMoeForConditionalGeneration"]
-```
-
-## Key Components
-
-### 1. Model Interfaces
-
-Your model should implement the appropriate interfaces:
-
-- **`SupportsMultiModal`**: For models that process multimodal inputs
-- **`SupportsPP`**: For models that support pipeline parallelism
-- **`SupportsMRoPE`**: For models using multi-dimensional RoPE (if applicable)
-
-### 2. Multimodal Registration
-
-If your model processes multimodal inputs, register it with the multimodal registry:
-
-```python
-@MULTIMODAL_REGISTRY.register_processor(
-    YourMultiModalProcessor,
-    info=YourProcessingInfo,
-    dummy_inputs=YourDummyInputsBuilder,
-)
-class YourModel(nn.Module, SupportsMultiModal):
-    pass
-```
-
-### 3. Weight Loading
-
-Implement `load_weights()` to handle weight loading with proper prefixing:
-
-```python
-def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-    """Load weights for all components of the omni model."""
-    loaded_weights = set()
-    thinker_weights = []
-    talker_weights = []
-    code2wav_weights = []
-
-    # Separate weights by component
-    for k, v in weights:
-        if k.startswith("thinker."):
-            thinker_weights.append((k, v))
-        elif k.startswith("talker."):
-            talker_weights.append((k, v))
-        elif k.startswith("code2wav."):
-            code2wav_weights.append((k, v))
-
-    # Load each component's weights (only the active stage's submodule is initialized)
-    if getattr(self, "thinker", None) is not None and thinker_weights:
-        thinker_loaded = self.thinker.load_weights(thinker_weights)
-        thinker_loaded = add_prefix_to_loaded_weights(thinker_loaded, "thinker")
-        loaded_weights.update(thinker_loaded)
-
-    # Similar for talker and code2wav...
-
-    return loaded_weights
-```
-
-### 4. Output Format
-
-Use `OmniOutput` for stage outputs:
-
-```python
-from vllm_omni.model_executor.models.output_templates import OmniOutput
-
-# In forward method
-return OmniOutput(
-    text_hidden_states=hidden_states,
-    multimodal_outputs={"additional_data": data},
-    next_token_id=next_token_id,
-)
-```
-
-## Model Registration
-
-Register your model in `vllm_omni/model_executor/models/registry.py`:
-
-```python
-_OMNI_MODELS = {
-    # ... existing models ...
-
-    # Your new model
-    "YourModelForConditionalGeneration": (
-        "your_model_name",  # Module folder name
-        "your_model",  # Module file name (without .py)
-        "YourModelForConditionalGeneration",  # Class name
-    ),
-    "YourModelThinkerForConditionalGeneration": (
-        "your_model_name",
-        "your_model_thinker",
-        "YourModelThinkerForConditionalGeneration",
-    ),
-    # ... other stages ...
-}
-```
-
-The registry uses lazy loading, so the model class is imported only when needed.
-
-## Stage Configuration
-
-Create a YAML configuration file in `vllm_omni/deploy/`. For a complete example, see the [Qwen3-Omni configuration file](gh-file:vllm_omni/deploy/qwen3_omni_moe.yaml).
-
-### Key Configuration Fields
-
-- **`model_stage`**: Which stage to run ("thinker", "talker", "code2wav", etc.)
-- **`model_arch`**: The model architecture name (must match registry)
-- **`engine_input_source`**: List of stage IDs that provide input to this stage
-- **`custom_process_input_func`**: Function to process inputs from previous stages
-- **`final_output`**: Whether this stage produces the final output (True/False)
-- **`final_output_type`**: Type of final output ("text", "audio", "image", etc.)
-
-## Stage Input Processors
-
-Stage transitions are the mechanism by which outputs from one stage are converted into inputs for the next stage. This section explains where and how stage transitions occur.
-
-### Where Stage Transitions Are Called
-
-Stage transitions happen automatically in the runtime orchestrator. Here's the detailed flow:
-
-1. **Location**: `vllm_omni/engine/orchestrator.py` in `_forward_to_next_stage()`
-2. **Trigger**: When a stage completes processing and produces outputs
-3. **Execution Flow**:
-   ```python
-   # In orchestrator.py
-   next_stage_id = stage_id + 1
-   next_client = self.stage_clients[next_stage_id]
-   params = req_state.sampling_params_list[next_stage_id]
-
-   # Save current stage outputs so stage_input_processors can consume them.
-   self.stage_clients[stage_id].set_engine_outputs([output])
-
-   # THIS IS WHERE STAGE TRANSITION HAPPENS
-   next_inputs = next_client.process_engine_inputs(
-       stage_list=self.stage_clients,
-       prompt=req_state.prompt,
-   )
-
-   # Build and submit request(s) to the next stage.
-   for next_input in next_inputs:
-       request = build_engine_core_request_from_tokens(
-           request_id=req_id,
-           prompt=next_input,
-           params=params,
-           model_config=self.stage_vllm_configs[next_stage_id].model_config,
-       )
-       await next_client.add_request_async(request)
-   ```
-
-### How Stage Transitions Work
-
-The stage transition process follows these steps:
-
-1. **Stage Completion**: When a stage finishes processing a request, the orchestrator stores outputs via `stage_client.set_engine_outputs(...)`
-
-2. **Transition Detection**: The orchestrator checks if there's a next stage and calls `process_engine_inputs()` on it
-
-3. **Input Processing**: The stage input processor configured in the stage YAML (implemented under `vllm_omni/model_executor/stage_input_processors/`) handles the transition:
-   ```python
-   def process_engine_inputs(
-       self, stage_list: list[Any], prompt: OmniTokensPrompt | TextPrompt | None = None
-   ) -> list[OmniTokensPrompt | TextPrompt]:
-       """Process engine inputs for this stage from upstream stage outputs."""
-
-       if self.custom_process_input_func is None:
-           # Default behavior: pass token IDs directly
-           # Extract outputs from source stage
-           source_stage_id = self.engine_input_source[0]
-           source_outputs = stage_list[source_stage_id].engine_outputs
-           # ... create OmniTokensPrompt from token_ids ...
-       else:
-           # Custom transition function (YOUR CODE HERE)
-           return self.custom_process_input_func(
-               stage_list,
-               self.engine_input_source,
-               prompt,
-               self.requires_multimodal_data
-           )
-   ```
-   - If `custom_process_input_func` is configured, it calls that function
-   - Otherwise, it uses the default behavior (passing token IDs directly)
-
-4. **Custom Function Execution**: Your custom function receives:
-   - `stage_list`: List of all stage objects (to access upstream stage outputs)
-   - `engine_input_source`: List of source stage IDs (e.g., `[0]` for stage 0)
-   - `prompt`: Original prompt data (for preserving multimodal data)
-   - `requires_multimodal_data`: Whether multimodal data is required
-
-5. **Output Format**: The function must return a list of `OmniTokensPrompt` objects ready for the next stage
-
-### Data Structures in Stage Transitions
-
-Understanding the data structures is crucial for implementing stage transitions:
-
-**Input to your function:**
-- `stage_list[source_stage_id].engine_outputs`: List of `EngineCoreOutput` objects
-  - Each contains `outputs`: List of `RequestOutput` objects
-    - Each `RequestOutput` has:
-      - `token_ids`: Generated token IDs
-      - `multimodal_output`: Dict with keys like `"code_predictor_codes"`, etc. These are the hidden states or intermediate outputs from the model's forward pass
-      - `prompt_token_ids`: Original prompt token IDs
-
-**Output from your function:**
-- Must return `list[OmniTokensPrompt]` where each `OmniTokensPrompt` contains:
-  - `prompt_token_ids`: List[int] - Token IDs for the next stage
-  - `additional_information`: Dict[str, Any] - Optional metadata (e.g., embeddings, hidden states)
-  - `multi_modal_data`: Optional multimodal data if needed
-
-### How Model Outputs Are Stored
-
-The model's `forward()` method returns an `OmniOutput` object that contains:
-- `text_hidden_states`: Final hidden states for text generation
-- `multimodal_outputs`: Dict containing intermediate outputs
-
-These outputs are captured during the forward pass and stored in `multimodal_output` with specific keys:
-
-```python
-# In your model's forward() method (e.g., qwen3_omni.py)
-def forward(self, ...):
-    # ... processing ...
-
-    # For thinker stage: capture embeddings and hidden states
-    multimodal_outputs = {
-        "0": captured_embeddings,  # Layer 0 embeddings
-        "24": captured_hidden_states,  # Layer 24 hidden states
-        "tts_bos_embed": tts_bos_embed,
-        "tts_eos_embed": tts_eos_embed,
-        # ... other intermediate outputs ...
-    }
-
-    return OmniOutput(
-        text_hidden_states=hidden_states,
-        multimodal_outputs=multimodal_outputs,
-    )
-```
-
-These keys are then accessible in your stage transition function:
-```python
-# In stage_input_processors/qwen3_omni.py
-thinker_prefill_embeddings = output.multimodal_output["0"] # Access by key
-thinker_hidden_states = output.multimodal_output["24"]
-```
-
-### Key Points
-
-1. **Accessing Upstream Outputs**: Use `stage_list[source_stage_id].engine_outputs` to get outputs from the source stage
-2. **Extracting Data**: Access `output.multimodal_output[key]` to get specific hidden states or intermediate results
-   - Keys are defined by your model's `forward()` method when it creates `multimodal_outputs`
-3. **Device Management**: Move tensors to appropriate devices (CPU for serialization, GPU for processing)
-4. **Shape Transformations**: Reshape tensors as needed for the next stage (e.g., flattening codec codes)
-5. **Batch Handling**: Process each request in the batch separately and return a list
-
-### Complete Flow Diagram
-
-
-
-
-
-
-
-
-### Implementation Example
-
-Create stage transition processors in `vllm_omni/model_executor/stage_input_processors/your_model_name.py`:
-
-```python
-# qwen3_omni.py
-
-def thinker2talker(
-    stage_list: list[Any],
-    engine_input_source: list[int],
-    prompt: OmniTokensPrompt | TextPrompt | None = None,
-    requires_multimodal_data: bool = False,
-) -> list[OmniTokensPrompt]:
-    """
-    Process thinker outputs to create talker inputs.
-
-    Args:
-        stage_list: List of stage objects
-        engine_input_source: Source stage IDs (typically [0] for thinker)
-        prompt: Original prompt data
-
-    Returns:
-        List of OmniTokensPrompt for talker stage
-    """
-    source_stage_id = engine_input_source[0]
-    thinker_outputs = stage_list[source_stage_id].engine_outputs
-    talker_inputs = []
-
-    for thinker_output in thinker_outputs:
-        output = thinker_output.outputs[0]
-        # Extract thinker embeddings and hidden states
-        thinker_prefill_embeddings = output.multimodal_output["0"].float().clone().detach().cuda()
-        thinker_hidden_states = output.multimodal_output["24"].float().clone().detach().cuda()
-
-        info = {
-            "thinker_prefill_embeddings": thinker_prefill_embeddings,
-            "thinker_hidden_states": thinker_hidden_states,
-            "thinker_sequences": thinker_output.prompt_token_ids + output.token_ids,
-            "thinker_input_ids": thinker_output.prompt_token_ids,
-        }
-
-        talker_inputs.append(
-            OmniTokensPrompt(
-                prompt_token_ids=[0] * computed_length,  # `computed_length` is not defined in this excerpt
-                additional_information=info,
-                multi_modal_data=None,
-            )
-        )
-
-    return talker_inputs
-
-
-def talker2code2wav(
-    stage_list: list[Any],
-    engine_input_source: list[int],
-    prompt: OmniTokensPrompt | TextPrompt | None = None,
-    requires_multimodal_data: bool = False,
-) -> list[OmniTokensPrompt]:
-    """
-    Process talker outputs to create code2wav inputs.
-    """
-    source_stage_id = engine_input_source[0]
-    talker_outputs = stage_list[source_stage_id].engine_outputs
-    code2wav_inputs = []
-
-    for talker_output in talker_outputs:
-        output = talker_output.outputs[0]
-        # Extract codec codes
-        codec_codes = (
-            output.multimodal_output["code_predictor_codes"]
-            .to(torch.long)
-            .transpose(0, 1)
-            .cpu()
-            .to(torch.long)
-            .reshape(-1)
-            .tolist()
-        )
-
-        code2wav_inputs.append(
-            OmniTokensPrompt(
-                prompt_token_ids=codec_codes,
-                multi_modal_data=None,
-            )
-        )
-
-    return code2wav_inputs
-```
-
-## Testing
-
-For comprehensive testing guidelines, please refer to the [Test File Structure and Style Guide](../ci/tests_style.md).
-
-## Adding a Model Recipe
-
-After implementing and testing your model, please add a model recipe to the [vllm-project/recipes](https://github.com/vllm-project/recipes) repository. This helps other users understand how to use your model with vLLM-Omni.
-
-### What to Include
-
-Your recipe should include:
-
-1. **Model Overview**: Brief description of the model and its capabilities
-2. **Installation Instructions**: Step-by-step setup instructions including:
-   - Installing vllm-omni and dependencies
-   - Installing any additional required packages (e.g., xformers, diffusers)
-   - Any version requirements
-3. **Usage Examples**: Command-line examples demonstrating how to run the model
-4. **Configuration Details**: Important configuration parameters and their meanings
-
-### Example
-
-For reference, see the [LongCat recipe example](https://github.com/vllm-project/recipes/pull/179) which demonstrates the expected format and structure.
-
-### Recipe Location
-
-Create your recipe file in the appropriate directory structure:
-- For organization-specific models: `OrganizationName/ModelName.md`
-- For general models: `ModelName.md`
-
-The recipe should be a Markdown file that provides clear, reproducible instructions for users to get started with your model.
-
-## Summary
-
-Adding a new model to vLLM-Omni involves:
-
-1. **Create model directory structure** with stage implementations
-2. **Implement unified model class** that orchestrates stages
-3. **Register model** in `registry.py`
-4. **Create stage configuration** YAML file
-5. **Implement stage input processors** for stage transitions
-6. **Write tests** to verify functionality
-7. **Add model recipe** to the [vllm-project/recipes](https://github.com/vllm-project/recipes) repository (see [Adding a Model Recipe](#adding-a-model-recipe) section)
-
-### Qwen3-Omni Reference Files
-
-For a complete reference implementation, see:
-
-- **Main model**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py`
-- **Thinker**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_thinker.py`
-- **Talker**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py`
-- **Code2Wav**: `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_code2wav.py`
-- **Stage config**: `vllm_omni/deploy/qwen3_omni_moe.yaml`
-- **Input processors**: `vllm_omni/model_executor/stage_input_processors/qwen3_omni.py`
-- **Registry**: `vllm_omni/model_executor/models/registry.py`
-- **Testing**: `vllm_omni/tests/e2e/offline_inference/test_qwen3_omni.py`
-
-For more information, see:
-- [Architecture Overview](../../design/architecture_overview.md)
-- [Supported Models](../../models/supported_models.md)
-- [Stage Configuration Guide](../../configuration/stage_configs.md)
diff --git a/docs/contributing/model/adding_tts_model.md b/docs/contributing/model/adding_tts_model.md
deleted file mode 100644
index 34fd2dbb503..00000000000
--- a/docs/contributing/model/adding_tts_model.md
+++ /dev/null
@@ -1,968 +0,0 @@
-# Adding a TTS Model
-
-This guide walks through adding a new TTS model to vLLM-Omni. Two patterns are
-supported:
-
-- **Two-stage pipeline** (e.g. Qwen3-TTS, Fish Speech): an AR code-predictor stage
- feeds an audio decoder stage via the `async_chunk` framework. This is the standard
- pattern for maximum streaming performance.
-- **Single-stage AR model** (e.g. MOSS-TTS-Nano): the model runs entirely inside one
- AR worker and streams audio chunks directly from its own `inference_stream()` generator.
-
-Qwen3-TTS is used as the reference for the two-stage pattern. For the single-stage
-pattern, refer to MOSS-TTS-Nano.
-
-## Table of Contents
-
-1. [Overview](#overview)
-2. [Cross-Cutting Invariants](#cross-cutting-invariants)
-3. [Directory Structure](#directory-structure)
-4. [Step-by-Step Implementation](#step-by-step-implementation)
-5. [Key Components](#key-components)
-6. [Model Registration](#model-registration)
-7. [Stage Configuration](#stage-configuration)
-8. [Stage Input Processors](#stage-input-processors)
-9. [Online Serving Integration](#online-serving-integration)
-10. [Single-Stage Models](#single-stage-models)
-11. [Testing](#testing)
-12. [Pre-commit and DCO](#pre-commit-and-dco)
-13. [Summary](#summary)
-
-## Cross-Cutting Invariants
-
-These rules apply to every TTS model regardless of architecture (AR vs AR+diffusion,
-single-stage vs two-stage, codec-based vs VAE-based). Each has surfaced as a silent
-bug in a shipped PR — check them at the end of every phase, not just at the start.
-
-**I1. Streaming output contract.** Pick one per-step semantics for `forward()` and
-document it in the docstring:
-
-- *Delta*: yield only new audio samples produced this step. Preferred — linear cost.
-- *Cumulative*: re-decode from step 0 every call. O(N²); only acceptable when the
- codec exposes no streaming decode.
-
-If you choose delta, audit the full chain: `forward()` returns the new chunk →
-`_consolidate_multimodal_tensors()` in `vllm_omni/engine/output_processor.py`
-concatenates the audio key into a single tensor at finish → streaming consumers
-receive per-step chunks, offline consumers receive the concatenated tensor. A
-mismatch (consolidator skips the key with `continue`, or consumers expect a list
-but receive a tensor) is invisible in offline RTF benchmarks — users hear replays
-or truncation only under live playback.
-
-**I2. Multimodal output consumer hygiene.** `outputs[0].outputs[0].multimodal_output[key]`
-can be `Tensor`, `list[Tensor]` (pre-consolidation snapshot), `np.ndarray`, or
-scalar. In every test, example, and benchmark:
-
-- Never write `dict.get("a") or dict.get("b")` on tensor values — Python evaluates
- the tensor's truthiness and raises `Boolean value of Tensor with more than one
- value is ambiguous`. Use explicit `if x is None` chains.
-- Defensively handle the list form:
- `if isinstance(x, list): x = torch.cat([t.reshape(-1) for t in x], dim=0)`.
-- Assert `shape` / `dtype` / `duration` explicitly — do not rely on truthiness for
- presence checks.
-
-**I3. Hot-loop GPU discipline.** Inside any per-step model loop (AR decode,
-diffusion solver, CFM Euler step, per-frame vocoder):
-
-- No `tensor.item()`, `.cpu()`, or `.tolist()` — each triggers a GPU→CPU sync; a
- 10-step × 60-frame × 4-op loop creates 2400 syncs per request.
-- Prefer `dst.copy_(src)` over `dst.fill_(src.item())` for scalar-into-buffer writes.
-- Whole-model `torch.compile(Model.forward, fullgraph=False)` usually outperforms
- per-submodule compile — fewer dispatch boundaries, larger fusion regions. Measure
- before choosing granularity.
-- No Python control flow that depends on tensor values; use `torch.where` or masking.
-
-Profile before optimizing.
-
-**I4. Validation pyramid.** Offline RTF alone is necessary but not sufficient. A
-new TTS model must pass all three levels:
-
-| Layer | Catches | Tool |
-|-------|---------|------|
-| Offline RTF / duration | Throughput regressions, missing audio, wrong sample rate | `end2end.py`, pytest e2e |
-| Browser streaming playback | Delta-vs-cumulative bugs, chunk boundary glitches, TTFP regressions | Gradio demo over `/v1/audio/speech?stream=true` |
-| Concurrent requests | Per-request state leaks, codec window round-robin gaps | `max_num_seqs>1` smoke with 4+ parallel prompts |
-
-**I5. Per-request state belongs to the request.** If the model caches anything
-across `forward()` calls (streaming generators, codec buffers, sliding-window pads,
-CUDA graph state), key it by `info.get("_omni_req_id")` and free the entry on
-request finish. A shared buffer silently corrupts audio across concurrent requests —
-the symptom is crosstalk or truncation under load, nothing in single-request tests.
-
-## Overview
-
-vLLM-Omni supports TTS models as multi-stage pipelines where each stage runs independently
-and can be placed on different devices. Qwen3-TTS has two stages:
-
-| Stage | Name | Input | Output |
-|-------|------|-------|--------|
-| 0 | Code Predictor (AR) | Text tokens | Discrete RVQ codec codes |
-| 1 | Code2Wav (Decoder) | RVQ codec codes | Audio waveform |
-
-Each stage is a separate model class configured independently via YAML. The two stages
-are connected by the `async_chunk` framework, which enables inter-stage streaming for
-low first-packet latency (see [Async Chunk Design](../../design/feature/async_chunk.md)).
-
-### Without async_chunk (batch mode)
-
-Stage 0 runs to completion before Stage 1 starts, resulting in long first-packet latency:
-
-```mermaid
-flowchart TB
- subgraph stage0["Stage 0: AR Code Predictor"]
- direction LR
- P[Prefill] --> D1[Decode 1]
- D1 --> D2[Decode 2]
- D2 --> Dots1["..."]
- Dots1 --> DN[Decode N]
- end
-
- subgraph stage1["Stage 1: Code2Wav"]
- direction LR
- DEC[Decode all codes at once]
- end
-
- stage0 -- "all N codes" --> stage1
- stage1 --> FPL["First Packet Latency = Stage 0 + Stage 1"]
-
- style stage0 fill:#dae8fc,stroke:#6c8ebf
- style stage1 fill:#f8d7c8,stroke:#d4856a
- style FPL fill:#e8f0fe,stroke:#3366CC,stroke-width:2px
-```
-
-### With async_chunk (streaming mode)
-
-Stage 0 sends codec codes to Stage 1 every `chunk_size=25` tokens. Stage 1 begins decoding
-immediately, reducing first-packet latency from the full AR time to just the first chunk:
-
-```mermaid
-flowchart TB
- subgraph stage0["Stage 0: Code Predictor (AR)"]
- direction LR
- P[Prefill] --> D1["Decode 1-25"]
- D1 --> D2["Decode 26-50"]
- D2 --> Dots1["..."]
- Dots1 --> DN["Decode N"]
- end
-
- subgraph stage1["Stage 1: Code2Wav"]
- direction LR
- C1["Chunk 1\n(25 frames)"] --> C2["Chunk 2\n(context + 25)"]
- C2 --> Dots2["..."]
- Dots2 --> CN["Final chunk"]
- end
-
- D1 -. "chunk 1 (25 codes)" .-> C1
- D2 -. "chunk 2 (context + 25)" .-> C2
- DN -. "final" .-> CN
-
- stage0 --> FPL["⏱ First Packet Latency = Prefill + 25 decode steps only"]
-
- style stage0 fill:#dae8fc,stroke:#6c8ebf
- style stage1 fill:#e8d4f8,stroke:#8a6cad
- style FPL fill:#e8f0fe,stroke:#3366CC,stroke-width:2px
-```
-
-Key parameters: `chunk_size=25`, `left_context_size=25` (validated defaults from Qwen3-TTS
-and Qwen3-Omni).
-
-## Directory Structure
-
-When adding a new TTS model, create the following structure:
-
-```
-vllm_omni/model_executor/models/
-  your_model_name/
-    __init__.py
-    your_model.py               # Unified class (stage dispatch)
-    your_model_ar_stage.py      # Stage 0: AR stage
-    your_model_decoder.py       # Stage 1: audio decoder
-
-vllm_omni/model_executor/stage_input_processors/
-  your_model_name.py            # Stage 0 -> Stage 1 transition
-
-vllm_omni/model_executor/stage_configs/
-  your_model_name.yaml              # Batch mode config
-  your_model_name_async_chunk.yaml  # Streaming mode config
-```
-
-**Qwen3-TTS reference files:**
-
-| File | Purpose |
-|------|---------|
-| `models/qwen3_tts/qwen3_tts.py` | Unified model class |
-| `models/qwen3_tts/qwen3_tts_code_predictor_vllm.py` | Stage 0 - optimized AR |
-| `models/qwen3_tts/qwen3_tts_code2wav.py` | Stage 1 - decoder |
-| `deploy/qwen3_tts.yaml` (new schema) | Deploy config (async_chunk enabled), paired with `models/qwen3_tts/pipeline.py` for the frozen topology |
-| `stage_input_processors/qwen3_tts.py` | Stage transition processors |
-
-> **Chunked vs end-to-end modes**: `qwen3_tts` registers a single
-> pipeline whose stage 1 declares alternate processor functions: an
-> `async_chunk_process_next_stage_input_func` (per-chunk streaming, used
-> when `deploy.async_chunk=True`) and a `sync_process_input_func`
-> (batch-end, used when `deploy.async_chunk=False`). The loader selects
-> one at merge time based on the bool, so `--no-async-chunk` alone
-> switches modes; no variant yaml or variant pipeline registration is
-> needed. Pipelines that only make sense in one mode (e.g.
-> `qwen3_omni_moe` is always chunked) can keep using the unconditional
-> `custom_process_*` fields.
-
-## Step-by-Step Implementation
-
-### Step 1: Implement Stage 0 - AR Stage
-
-Stage 0 is the autoregressive stage that generates intermediate audio representations.
-**It must use vLLM's native decoder layers with fused ops and PagedAttention** for the LLM
-backbone - this is the primary source of speedup over HuggingFace inference.
-
-#### 1.1 Use vLLM Decoder Layers Directly
-
-Build your transformer layers from the corresponding vLLM decoder layer class (e.g.
-`Qwen3DecoderLayer` for Qwen3-based backbones, or the equivalent for LLaMA, Qwen2, etc.).
-Do not wrap the HuggingFace model directly - that bypasses PagedAttention and fused kernels.
-
-```python
-# your_model_ar_stage.py
-
-from vllm.model_executor.models.qwen3 import Qwen3DecoderLayer
-
-class YourTTSARStage(nn.Module):
-    def __init__(self, config, vllm_config, prefix):
-        super().__init__()  # Must run before registering submodules on an nn.Module
-        self.layers = nn.ModuleList([
-            Qwen3DecoderLayer(
-                config, vllm_config=vllm_config, prefix=f"{prefix}.layers.{i}"
-            )
-            for i in range(config.num_hidden_layers)
-        ])
-        self.lm_head = ParallelLMHead(config.codec_size, config.hidden_size)
-```
-
-See `qwen3_tts_code_predictor_vllm.py` for the full implementation.
-
-#### 1.2 Forward Pass
-
-Implement `forward()` to return an `OmniOutput` with intermediate data for Stage 1:
-
-```python
-def forward(self, input_ids, positions, intermediate_tensors=None,
-            inputs_embeds=None, **kwargs) -> OmniOutput:
-    hidden_states = self.run_layers(input_ids, positions, intermediate_tensors, inputs_embeds)
-    logits = self.lm_head(hidden_states)
-
-    return OmniOutput(
-        text_hidden_states=hidden_states,
-        multimodal_outputs={
-            "audio_codes": self.extract_codes(logits),
-        },
-    )
-```
-
-The keys in `multimodal_outputs` are what your stage input processor will read to build
-Stage 1 inputs.
-
-#### 1.3 Weight Loading with Fused QKV
-
-When using vLLM's fused `QKVParallelLinear`, pack the HF `q_proj`/`k_proj`/`v_proj` weights
-into `qkv_proj` using `stacked_params_mapping`. See the `load_weights()` method in
-`qwen3_tts_code_predictor_vllm.py` for the standard pattern - it can be reused as-is
-for any Qwen-family backbone.
-
-#### 1.4 Custom Stop Condition (if needed)
-
-Some TTS models use a learned stop head rather than an EOS token. If your model does this,
-implement it inside `sample()`:
-
-```python
-def sample(self, logits, sampling_metadata) -> SamplerOutput | None:
-    output = self.sampler(logits, sampling_metadata)
-    if self._stop_head_fired():
-        output = mark_as_finished(output)
-    return output
-```
-
-### Step 2: Implement Stage 1 - Decoder
-
-Stage 1 decodes Stage 0 output into audio. It runs outside the scheduler (no PagedAttention
-needed). Implement `chunked_decode_streaming()` to support async_chunk streaming:
-
-```python
-# your_model_decoder.py
-
-class YourTTSDecoder(nn.Module):
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        # Initialize your audio decoder (SpeechTokenizer, HiFiGAN, etc.)
-
-    def forward(self, codes: torch.Tensor, **kwargs) -> torch.Tensor:
-        return self.decoder(codes)
-
-    def chunked_decode_streaming(self, codes, chunk_size=25,
-                                 left_context_size=25) -> torch.Tensor:
-        """Decode with a sliding context window for smooth chunk boundaries."""
-        end_index = codes.shape[-1]
-        context_size = 0 if end_index <= chunk_size else left_context_size
-        wav_chunk = self(codes)
-        # Trim left context to avoid duplicate audio
-        return wav_chunk[..., context_size * self.total_upsample:]
-```
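-
-The trimming step can be checked without a real vocoder. The following is a minimal sketch under stated assumptions: `toy_decode`, `decode_chunk`, and `TOTAL_UPSAMPLE` are hypothetical stand-ins (not vLLM-Omni APIs), plain lists replace tensors, and the left-context length is passed in explicitly instead of being inferred from `chunk_size`. It only illustrates that dropping `left_context * total_upsample` samples from each decoded window reproduces the unchunked waveform when the decoder has no cross-frame state.
-
-```python
-TOTAL_UPSAMPLE = 4  # hypothetical: audio samples produced per codec frame
-
-
-def toy_decode(frames):
-    """Stand-in decoder: each frame becomes TOTAL_UPSAMPLE identical samples."""
-    return [frame for frame in frames for _ in range(TOTAL_UPSAMPLE)]
-
-
-def decode_chunk(window, left_context_size):
-    """Decode a window, then drop the samples that belong to the left context."""
-    wav = toy_decode(window)
-    return wav[left_context_size * TOTAL_UPSAMPLE:]
-
-
-frames = list(range(6))
-# First chunk: frames 0-2, nothing to trim.
-first = decode_chunk(frames[0:3], left_context_size=0)
-# Second chunk: frames 1-2 are re-decoded as context for frames 3-5.
-second = decode_chunk(frames[1:6], left_context_size=2)
-streamed = first + second
-```
-
-A real decoder has a receptive field spanning several frames, which is why the context frames are decoded and then discarded rather than skipped.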
-
-### Step 3: Implement the Unified Model Class
-
-The unified class dispatches to the correct stage based on `model_stage` in the config:
-
-```python
-# your_model.py
-
-class YourTTSModelForConditionalGeneration(nn.Module, SupportsPP):
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        self.model_stage = vllm_config.model_config.model_stage
-
-        if self.model_stage == "ar_stage":
-            ar_vllm_config = vllm_config.with_hf_config(
-                vllm_config.model_config.hf_config.ar_config,
-                architectures=["YourTTSARStageForConditionalGeneration"],
-            )
-            self.ar_stage = init_vllm_registered_model(
-                vllm_config=ar_vllm_config,
-                prefix=maybe_prefix(prefix, "ar"),
-                hf_config=ar_vllm_config.model_config.hf_config,
-                architectures=["YourTTSARStageForConditionalGeneration"],
-            )
-            self.model = self.ar_stage
-
-        elif self.model_stage == "decoder":
-            self.decoder = YourTTSDecoder(vllm_config=vllm_config, prefix=prefix)
-            self.model = self.decoder
-```
-
-### Step 4: Create `__init__.py`
-
-```python
-# vllm_omni/model_executor/models/your_model_name/__init__.py
-from .your_model import YourTTSModelForConditionalGeneration
-
-__all__ = ["YourTTSModelForConditionalGeneration"]
-```
-
-## Key Components
-
-### Model Interfaces
-
-Your unified model class should implement the appropriate interfaces:
-
-- **`SupportsPP`**: Required for pipeline parallelism support (all models should implement this)
-- **`SupportsMultiModal`**: Only if your model accepts multimodal inputs (e.g. reference audio for voice cloning)
-
-### Output Format
-
-Use `OmniOutput` so the orchestrator can route intermediate data between stages:
-
-```python
-from vllm_omni.model_executor.models.output_templates import OmniOutput
-
-return OmniOutput(
-    text_hidden_states=hidden_states,
-    multimodal_outputs={
-        "audio_codes": codec_codes,
-    },
-)
-```
-
-### Weight Loading from a Single Checkpoint
-
-If both stages load from one checkpoint, separate them by prefix in the unified class:
-
-```python
-def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-    ar_weights, decoder_weights = [], []
-    for name, tensor in weights:
-        if name.startswith("decoder."):
-            decoder_weights.append((name, tensor))
-        else:
-            ar_weights.append((name, tensor))
-
-    if self.model_stage == "ar_stage":
-        return self.ar_stage.load_weights(ar_weights)
-    elif self.model_stage == "decoder":
-        return self.decoder.load_weights(decoder_weights)
-    # Fail loudly instead of silently returning None for an unknown stage
-    raise ValueError(f"Unknown model_stage: {self.model_stage}")
-```
-
-## Model Registration
-
-Register all stage classes in `vllm_omni/model_executor/models/registry.py`:
-
-```python
-_OMNI_MODELS = {
-    # (package_name, module_name, class_name)
-    "YourTTSModelForConditionalGeneration": (
-        "your_model_name", "your_model",
-        "YourTTSModelForConditionalGeneration",
-    ),
-    "YourTTSARStageForConditionalGeneration": (
-        "your_model_name", "your_model_ar_stage",
-        "YourTTSARStageForConditionalGeneration",
-    ),
-    "YourTTSDecoder": (
-        "your_model_name", "your_model_decoder",
-        "YourTTSDecoder",
-    ),
-}
-```
-
-The registry uses lazy loading - model classes are only imported when needed.
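-
-To make "lazy loading" concrete, here is a small sketch of the general technique, not the actual `registry.py` implementation. `_LAZY_REGISTRY`, `_resolved`, and `resolve` are hypothetical names, and standard-library modules stand in for model modules so the example runs anywhere. The point is that a module is imported only on the first lookup of its entry.
-
-```python
-import importlib
-
-# Hypothetical registry: architecture name -> (module path, attribute name).
-_LAZY_REGISTRY = {
-    "OrderedDict": ("collections", "OrderedDict"),
-    "Fraction": ("fractions", "Fraction"),
-}
-_resolved = {}  # cache of classes that have already been imported
-
-
-def resolve(arch):
-    """Import the owning module on first lookup and cache the resolved class."""
-    if arch not in _resolved:
-        module_name, attr = _LAZY_REGISTRY[arch]
-        _resolved[arch] = getattr(importlib.import_module(module_name), attr)
-    return _resolved[arch]
-
-
-cls = resolve("Fraction")
-```
-
-Entries that are never requested (here `"OrderedDict"`) are never resolved, which keeps start-up cost independent of the number of registered models.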
-
-## Stage Configuration
-
-Each stage has a `worker_type` that determines how it is scheduled:
-
-- `worker_type: ar` - autoregressive stage, uses `OmniARScheduler` with PagedAttention
-- `worker_type: generation` - non-AR stage (e.g. decoder), uses `OmniGenerationScheduler`
-
-Key configuration fields:
-
-| Field | Description |
-|-------|-------------|
-| `model_stage` | Which stage to initialize (`ar_stage`, `decoder`, etc.) |
-| `model_arch` | Architecture name, must match `registry.py` |
-| `engine_input_source` | List of upstream stage IDs that provide input (e.g. `[0]`) |
-| `engine_output_type` | Output type: `latent` for intermediate, `audio` for final |
-| `custom_process_next_stage_input_func` | Async chunk processor function path (streaming only) |
-| `final_output` | Whether this stage produces the final user-facing output |
-| `final_output_type` | Type of final output (`audio`, `text`, etc.) |
-
-### Batch mode
-
-```yaml
-# stage_configs/your_model_name.yaml
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      devices: "0"
-    engine_args:
-      model_stage: ar_stage
-      max_num_seqs: 64
-      model_arch: YourTTSModelForConditionalGeneration
-      worker_type: ar
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      engine_output_type: latent
-    default_sampling_params:
-      temperature: 0.9
-      top_k: 50
-      max_tokens: 2048
-
-  - stage_id: 1
-    stage_type: llm
-    runtime:
-      devices: "0"
-    engine_args:
-      model_stage: decoder
-      model_arch: YourTTSModelForConditionalGeneration
-      worker_type: generation
-      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
-      engine_output_type: audio
-    engine_input_source: [0]
-    final_output: true
-    final_output_type: audio
-```
-
-### Streaming mode (async_chunk)
-
-Add `async_chunk: true` at the top level and specify `custom_process_next_stage_input_func`
-on Stage 0 to define how intermediate outputs are chunked and forwarded:
-
-```yaml
-# stage_configs/your_model_name_async_chunk.yaml
-
-async_chunk: true
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      devices: "0"
-    engine_args:
-      model_stage: ar_stage
-      max_num_seqs: 64
-      model_arch: YourTTSModelForConditionalGeneration
-      worker_type: ar
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      engine_output_type: latent
-      custom_process_next_stage_input_func: >
-        vllm_omni.model_executor.stage_input_processors.your_model_name.ar2decoder_async_chunk
-    default_sampling_params:
-      temperature: 0.9
-      top_k: 50
-      max_tokens: 2048
-
-  - stage_id: 1
-    stage_type: llm
-    runtime:
-      devices: "0"
-    engine_args:
-      model_stage: decoder
-      model_arch: YourTTSModelForConditionalGeneration
-      worker_type: generation
-      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
-      engine_output_type: audio
-    engine_input_source: [0]
-    final_output: true
-    final_output_type: audio
-```
-
-## Stage Input Processors
-
-Stage input processors convert Stage 0 outputs into Stage 1 inputs. Create yours in
-`vllm_omni/model_executor/stage_input_processors/your_model_name.py`.
-
-See `stage_input_processors/qwen3_tts.py` for the full reference implementation.
-
-### Data structures
-
-Understanding what's available in stage outputs:
-
-- `stage_list[source_id].engine_outputs` - list of `EngineCoreOutput` objects
-- Each `EngineCoreOutput` has `outputs` - list of `RequestOutput` objects
-- Each `RequestOutput` has:
-  - `token_ids` - generated token IDs
-  - `multimodal_output` - dict with keys matching your model's `OmniOutput.multimodal_outputs`
-  - `prompt_token_ids` - original prompt token IDs
-
-### Batch mode (non-streaming)
-
-Collects all Stage 0 outputs and forwards them to Stage 1 in one shot:
-
-```python
-def ar2decoder(
-    stage_list: list[Any],
-    engine_input_source: list[int],
-    prompt: OmniTokensPrompt | TextPrompt | None = None,
-    requires_multimodal_data: bool = False,
-) -> list[OmniTokensPrompt]:
-    source_id = engine_input_source[0]
-    decoder_inputs = []
-
-    for output in stage_list[source_id].engine_outputs:
-        result = output.outputs[0]
-        codes = result.multimodal_output["audio_codes"].cpu()
-        decoder_inputs.append(
-            OmniTokensPrompt(prompt_token_ids=codes.reshape(-1).tolist())
-        )
-
-    return decoder_inputs
-```
-
-### Streaming mode (async_chunk)
-
-Buffers Stage 0 outputs and forwards a chunk to Stage 1 once `chunk_size` frames
-have accumulated. The function signature follows the `OmniChunkTransferAdapter` protocol:
-
-```python
-def ar2decoder_async_chunk(
-    transfer_manager: Any,
-    pooling_output: dict[str, Any] | None,
-    request: Any,
-    is_finished: bool = False,
-) -> dict[str, Any] | None:
-    """Forward chunks of AR output to the decoder stage."""
-    request_id = request.external_req_id
-    finished = bool(is_finished or request.is_finished())
-
-    # Extract and buffer the latest frame
-    if isinstance(pooling_output, dict):
-        frame = extract_frame(pooling_output)  # model-specific helper, not shown
-        if frame is not None:
-            transfer_manager.code_prompt_token_ids[request_id].append(
-                frame.cpu().tolist()
-            )
-    elif not finished:
-        return None
-
-    # Chunk config (hard-coded in this sketch; read it from the connector in practice)
-    chunk_size = 25
-    left_context_size = 25
-
-    length = len(transfer_manager.code_prompt_token_ids[request_id])
-    if length <= 0:
-        if finished:
-            return {"codes": [], "finished": torch.tensor(True, dtype=torch.bool)}
-        return None
-
-    # Wait until a full chunk is ready (or request is finished)
-    chunk_length = length % chunk_size
-    if chunk_length != 0 and not finished:
-        return None
-
-    # Build context window: left_context + chunk
-    context_length = chunk_length if chunk_length != 0 else chunk_size
-    end_index = min(length, left_context_size + context_length)
-    window = transfer_manager.code_prompt_token_ids[request_id][-end_index:]
-
-    return {
-        "codes": torch.tensor(window).transpose(0, 1).reshape(-1).tolist(),
-        "left_context_size": max(0, int(end_index - context_length)),
-        "finished": torch.tensor(finished, dtype=torch.bool),
-    }
-```
-
-Key points:
-- `transfer_manager` is the `OmniChunkTransferAdapter` that owns the chunk lifecycle
-- Each call appends one AR decode step's output; a chunk is emitted every `chunk_size` steps
-- The final (possibly partial) chunk is flushed when `is_finished` is true
-- `left_context_size` frames of overlap are included for smooth audio boundaries
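-
-The window arithmetic is easier to follow with concrete numbers. The sketch below mirrors the buffering logic of `ar2decoder_async_chunk` using plain integers; `chunk_window` is a hypothetical helper written for illustration only, and it returns the window length and left-context length instead of the actual codes.
-
-```python
-def chunk_window(length, finished, chunk_size=25, left_context_size=25):
-    """Return (window_length, left_context) for the next emit, or None to keep buffering."""
-    if length <= 0:
-        return (0, 0) if finished else None
-    remainder = length % chunk_size
-    if remainder != 0 and not finished:
-        return None  # chunk not full yet
-    # New frames in this emit: a full chunk, or the partial tail on finish
-    new_frames = remainder if remainder != 0 else chunk_size
-    window = min(length, left_context_size + new_frames)
-    return window, window - new_frames
-
-
-steps = {
-    "24 frames buffered": chunk_window(24, finished=False),
-    "25 frames buffered": chunk_window(25, finished=False),
-    "50 frames buffered": chunk_window(50, finished=False),
-    "60 frames, finished": chunk_window(60, finished=True),
-}
-```
-
-With the defaults, the first emit carries 25 frames and no context, the second carries 50 frames of which 25 are context, and a request that finishes at 60 frames flushes a 35-frame window (25 context frames plus 10 new ones).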
-
-## Testing
-
-For general testing conventions, see [tests_style.md](../ci/tests_style.md).
-
-Recommended test cases for a new TTS model:
-
-1. **Single request** - verify waveform output shape and sample rate
-2. **Batched requests** - verify each request in the batch finishes independently
-3. **async_chunk streaming** - verify audio chunks arrive incrementally and decode correctly
-4. **Speaker conditioning** (if applicable) - verify different speaker inputs produce different outputs
-
-Reference test: `tests/model_executor/stage_input_processors/test_qwen3_tts_async_chunk.py`
-
-### E2E Online Serving Tests (`tests/e2e/online_serving/test_.py`)
-
-The `omni_server` fixture in `tests/conftest.py` is **module-scoped**. Each distinct
-`OmniServerParams` id in the same test file forces the fixture to tear the server
-down and spawn a new one mid-module. A few rules that save real CI debugging time:
-
-- **Prefer a single `OmniServerParams` set per file.** If you need to exercise two
- deploy variants (e.g. `model.yaml` and `model_async_chunk.yaml`), either use one
- variant and exercise streaming via request args, or split into two test files so
- each file does exactly one server lifecycle. Mid-module teardown/restart is the
- fragile path and surfaces startup races first.
-- **Never depend on server-side fetching of external URLs** for reference audio or
- other fixture data. CI runners (and China-hosted dev boxes) routinely fail on
- SSL/DNS for public URLs. Inline the payload as a `data:audio/wav;base64,...`
- ref_audio value — the serving layer accepts both forms.
-- **Don't roll your own readiness probe.** The harness already waits for HTTP 200
- on `/health` before releasing the server to the test. If your model needs extra
- warmup signals, expose them through `/health` rather than adding `time.sleep(...)`
- inside the test. (Bare TCP `connect_ex` probes were insufficient; see
- `tests/conftest.py` `OmniServer.wait_for_ready`.)
-- **Use `core_model` marker + H100 hardware_test** to match the `test-ready.yml`
- pipeline so your test is picked up by the `ready` label, not only nightly.
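-
-For the inline reference-audio rule, a test can build the `data:` URI itself using only the standard library. This is a minimal sketch: `silent_wav_data_uri` is a hypothetical helper, and the silent clip is only a placeholder for whatever short reference recording your model actually needs.
-
-```python
-import base64
-import io
-import wave
-
-
-def silent_wav_data_uri(duration_s=0.1, sample_rate=16000):
-    """Build a mono 16-bit PCM WAV of silence and return it as a data: URI."""
-    buf = io.BytesIO()
-    with wave.open(buf, "wb") as wav:
-        wav.setnchannels(1)
-        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
-        wav.setframerate(sample_rate)
-        wav.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
-    payload = base64.b64encode(buf.getvalue()).decode("ascii")
-    return f"data:audio/wav;base64,{payload}"
-
-
-ref_audio = silent_wav_data_uri()
-```
-
-The resulting string can be passed as the `ref_audio` value in the request body, so the test never depends on outbound network access from the server.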
-
-## Online Serving Integration
-
-To expose your model through the `/v1/audio/speech` OpenAI-compatible endpoint, add
-**all five** of the following integration points to
-`vllm_omni/entrypoints/openai/serving_speech.py` in a **single commit**. Adding them
-piecemeal causes partial-integration failures that are hard to debug.
-
-### 1. Stage constant
-
-Near the top of the file, alongside the other `_*_TTS_MODEL_STAGES` constants:
-
-```python
-_YOUR_MODEL_TTS_MODEL_STAGES = {"your_model_stage_key"}
-```
-
-### 2. Union into `_TTS_MODEL_STAGES`
-
-Add to the `_TTS_MODEL_STAGES` set union:
-
-```python
-_TTS_MODEL_STAGES: set[str] = (
-    ...
-    | _YOUR_MODEL_TTS_MODEL_STAGES
-)
-```
-
-### 3. Model type detection
-
-In `_detect_tts_model_type()`, add before the final `return None`:
-
-```python
-if model_stage in _YOUR_MODEL_TTS_MODEL_STAGES:
-    return "your_model"
-```
-
-### 4. Request validation dispatch
-
-In `_validate_tts_request()`, add before the fallback `return`:
-
-```python
-if self._tts_model_type == "your_model":
-    return self._validate_your_model_request(request)
-```
-
-### 5. Validation and parameter-builder methods
-
-Add two new methods:
-
-```python
-def _validate_your_model_request(
-    self, request: OpenAICreateSpeechRequest
-) -> str | None:
-    """Validate YourModel request. Returns an error string or None."""
-    if not request.input or not request.input.strip():
-        return "Input text cannot be empty"
-    return None
-
-def _build_your_model_params(
-    self, request: OpenAICreateSpeechRequest
-) -> dict[str, Any]:
-    """Build additional_information dict for YourModel."""
-    params: dict[str, Any] = {"text": [request.input]}
-    if request.voice is not None:
-        params["voice"] = [request.voice]
-    # Add any other model-specific fields here
-    return params
-```
-
-Then wire `_build_your_model_params` into the request-dispatch block in
-`_create_tts_request()` (search for the equivalent `_build_*_params` call for an
-existing model to find the right location). If the model supports voice cloning
-(`ref_audio` → `prompt_audio_path`, `ref_text` → `prompt_text`), add those mappings
-here too — follow any existing `_build__params` in `serving_speech.py` (e.g.
-`_build_moss_tts_params` for the voice-cloning variant) for the pattern.
-
-> **Two dispatch patterns coexist:** Fish Speech uses a `self._is_fish_speech` boolean
-> checked *before* `elif self._is_tts`. All newer models use the `_tts_model_type`
-> string pattern shown above. For new models, always use the string pattern — do not
-> add new `_is_*` boolean flags.
-
-> **Note on unused variables:** Only extract parameters in `_build_your_model_params`
-> that you actually pass to the model's generate / `inference_stream` call. Extracting
-> a variable without forwarding it will trigger a `ruff F841` pre-commit failure.
-
-### Merge conflicts
-
-`serving_speech.py` is modified by every new model PR and is the most common source of
-rebase conflicts. When rebasing onto `main` and a conflict appears here, the resolution
-is always to **keep both** the upstream model's additions and your own — never discard
-either side. After resolving:
-
-```bash
-git add vllm_omni/entrypoints/openai/serving_speech.py
-git rebase --continue
-```
-
-## Single-Stage Models
-
-Some TTS models (e.g. MOSS-TTS-Nano) do not use a two-stage pipeline. Instead the
-entire AR LM and audio decoder run inside a single AR worker, streaming audio chunks
-directly from the model's own generator.
-
-### Directory structure
-
-```
-vllm_omni/model_executor/models/your_model_name/
-  __init__.py
-  modeling_your_model_name.py   # unified class: load_weights + forward + streaming
-
-vllm_omni/model_executor/stage_configs/your_model_name.yaml
-```
-
-No stage input processor is needed.
-
-### Stage config
-
-Use a single stage with `worker_type: ar`. The `is_comprehension: true` field and the
-top-level `async_chunk: false` are required — omitting them causes silent
-misclassification in the serving layer. Set `max_num_seqs` to at least 4 for
-concurrent production use.
-
-```yaml
-# stage_configs/your_model_name.yaml
-async_chunk: false
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    is_comprehension: true  # required for serving_speech.py dispatch
-    runtime:
-      devices: "0"
-    engine_args:
-      model_stage: your_model_stage_key
-      model_arch: YourModelForCausalLM
-      worker_type: ar
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      engine_output_type: audio
-      max_num_seqs: 4  # min 4 for concurrent requests; default 1 causes gaps
-    final_output: true
-    final_output_type: audio
-```
-
-### Generator-based streaming pattern
-
-This is the MOSS-TTS-Nano pattern, distinct from VoxCPM2's vLLM-native AR pattern
-(see `plan/voxcpm2_native_ar_design.md` for that variant). Load model weights in
-`load_weights()` (not `__init__`) so vLLM finishes distributed initialisation before
-any CUDA allocations. Stream via a per-request generator stored in an instance dict:
-
-```python
-class YourModelForCausalLM(nn.Module):
-
-    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        self._lm: nn.Module | None = None  # populated in load_weights()
-        self._stream_gens: dict[str, Any] = {}  # request_key → generator
-
-    def load_weights(self, weights):
-        # Load self._lm here, after vLLM distributed init
-        ...
-
-    def forward(
-        self,
-        input_ids,
-        positions,
-        intermediate_tensors=None,
-        inputs_embeds=None,
-        runtime_additional_information: list[dict] | None = None,  # one dict per request
-        **kwargs,
-    ) -> OmniOutput:
-        infos = runtime_additional_information or [{}]
-        # Return empty output during dummy/profiling calls
-        if not runtime_additional_information or all(i.get("_is_dummy") for i in infos):
-            self._ar_emit_stop_token = True
-            return OmniOutput(...)
-
-        outputs, last_flags = [], []
-        for info in infos:
-            request_key = str(info.get("_omni_req_id", "0"))  # set by vLLM, not user code
-            if request_key not in self._stream_gens:
-                self._stream_gens[request_key] = self._create_stream_gen(info)
-            try:
-                chunk, is_last = next(self._stream_gens[request_key])
-            except StopIteration:
-                chunk, is_last = torch.zeros(0), True
-            if is_last:
-                del self._stream_gens[request_key]
-            outputs.append(chunk)
-            last_flags.append(is_last)
-
-        self._ar_emit_stop_token = all(last_flags)
-        return OmniOutput(multimodal_outputs={"model_outputs": outputs, "is_last": last_flags})
-
-    def _create_stream_gen(self, info: dict):
-        """Yield (waveform_tensor, is_last) from the model's inference_stream().
-
-        Handle both incremental ("audio" events) and batch ("result" event) models:
-        some upstream implementations emit one "result" event with the full waveform
-        instead of incremental "audio" events. Both paths must be covered.
-        """
-        for event in self._lm.inference_stream(...):
-            if event["type"] == "audio":
-                yield event["waveform"], False
-            elif event["type"] == "result":
-                # Fallback for models that don't emit incremental audio events
-                yield event.get("waveform", torch.zeros(0)), True
-                return
-        yield torch.zeros(0), True
-
-    def compute_logits(self, hidden_states, sampling_metadata):
-        # Emit EOS only when the last chunk has been yielded so the AR
-        # scheduler ends the request at the right time.
-        ...
-```
-
-For an in-tree reference, look for any single-stage AR model under
-`vllm_omni/model_executor/models/` (for example
-`moss_tts_nano/modeling_moss_tts_nano.py` once its integration has landed).
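-
-The per-request generator bookkeeping can be exercised in isolation. The sketch below is a toy model of the pattern, not vLLM-Omni code: `ToyStreamer` is a hypothetical class, words stand in for waveform chunks, and `step()` plays the role of one `forward()` call over the currently scheduled requests.
-
-```python
-class ToyStreamer:
-    def __init__(self):
-        self._gens = {}  # request id -> live generator
-
-    def _make_gen(self, text):
-        """Yield (chunk, is_last) pairs, one word per step."""
-        words = text.split()
-        for i, word in enumerate(words):
-            yield word, i == len(words) - 1
-        if not words:
-            yield "", True
-
-    def step(self, requests):
-        """Advance every scheduled request by exactly one chunk."""
-        out = {}
-        for req_id, text in requests.items():
-            if req_id not in self._gens:
-                self._gens[req_id] = self._make_gen(text)
-            chunk, is_last = next(self._gens[req_id])
-            if is_last:
-                del self._gens[req_id]  # free per-request state
-            out[req_id] = (chunk, is_last)
-        return out
-
-
-streamer = ToyStreamer()
-first = streamer.step({"a": "hello world", "b": "hi"})
-second = streamer.step({"a": "hello world"})
-```
-
-Deleting the generator as soon as `is_last` is seen keeps state from leaking between requests, which matters once `max_num_seqs` is greater than 1.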
-
-## Pre-commit and DCO
-
-All contributions must pass the pre-commit checks and the Developer Certificate of
-Origin (DCO) sign-off before merging.
-
-### Running pre-commit
-
-Install the hooks once with `pre-commit install`. Then run before committing:
-
-```bash
-pre-commit run --files \
- vllm_omni/model_executor/models/your_model_name/*.py \
- vllm_omni/entrypoints/openai/serving_speech.py \
- vllm_omni/model_executor/models/registry.py \
- tests/e2e/offline_inference/test_your_model_name.py \
- tests/e2e/online_serving/test_your_model_name.py
-```
-
-When pre-commit **modifies files**, it exits with a non-zero code but the reformatting
-is correct. Stage the modified files and commit again — do not revert the changes.
-
-Common failures and fixes:
-
-| Check | Cause | Fix |
-|-------|-------|-----|
-| `ruff F841` | Local variable assigned but never used | Remove the extraction or forward it to the model call |
-| `ruff E402` | Module-level import not at top of file | Move import to the top-level import block |
-| `ruff format` | Line length, spacing, or quote style | Accept the auto-fix, stage, and re-commit |
-
-### DCO sign-off
-
-Every commit must carry a `Signed-off-by` trailer. Use the `-s` flag when committing:
-
-```bash
-git commit -s -m "feat(your-model): add YourModel TTS support"
-```
-
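-Git has no setting that signs off `git commit` automatically (`format.signOff` only affects `git format-patch`), so define an alias such as `git cs` instead:
-
-```bash
-git config --global alias.cs "commit -s"
-```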
-
-To fix a missing sign-off on the most recent commit:
-
-```bash
-git commit --amend -s --no-edit
-git push origin your-branch --force-with-lease
-```
-
-> The DCO check verifies that the commit author email matches the `Signed-off-by` email.
-> Make sure `git config user.email` is set to the address associated with your GitHub
-> account before committing.
-
-## Adding a Model Recipe
-
-After implementing and testing your model, add a model recipe to the
-[vllm-project/recipes](https://github.com/vllm-project/recipes) repository so users can
-get started quickly. See [Adding an Omni-Modality Model](./adding_omni_model.md#adding-a-model-recipe)
-for the expected format.
-
-## Summary
-
-Adding a TTS model to vLLM-Omni involves:
-
-1. **Create model directory** with AR stage, decoder stage, and unified class (two-stage)
- or a single unified class with generator-based streaming (single-stage)
-2. **AR stage** - use vLLM's native decoder layers with fused QKV; do not wrap HF directly
-3. **Decoder stage** - thin wrapper around your audio decoder; implement `chunked_decode_streaming()`
-4. **Unified class** - dispatches on `model_stage`; same structure as `Qwen3TTSModelForGeneration`
-5. **Register** all stage classes in `registry.py`
-6. **YAML configs** - provide both batch and `async_chunk` variants (two-stage), or a single-stage AR config
-7. **Stage input processor** - buffer Stage 0 outputs and forward in chunks of 25 (two-stage only)
-8. **Online serving** - add all 5 integration points to `serving_speech.py` in one commit
-9. **Tests** - cover single request, batching, and streaming
-10. **Pre-commit + DCO** - run `pre-commit` before pushing; sign every commit with `git commit -s`
-11. **Model recipe** - add to [vllm-project/recipes](https://github.com/vllm-project/recipes)
-12. **Invariants** - re-check I1–I5 (streaming contract, consumer hygiene, hot-loop discipline, validation pyramid, per-request state) at the end of every phase
-
-### Qwen3-TTS Reference Files
-
-| File | Purpose |
-|------|---------|
-| `models/qwen3_tts/qwen3_tts.py` | Unified model class |
-| `models/qwen3_tts/qwen3_tts_code_predictor_vllm.py` | AR stage with vLLM fused ops |
-| `models/qwen3_tts/qwen3_tts_code2wav.py` | Decoder stage with `chunked_decode_streaming()` |
-| `models/qwen3_tts/pipeline.py` | Frozen pipeline topology (registered at import time) |
-| `deploy/qwen3_tts.yaml` | Deploy config (user-editable, async_chunk + SharedMemoryConnector) |
-| `stage_input_processors/qwen3_tts.py` | Stage transition processors |
-
-For more information, see:
-
-- [Architecture Overview](../../design/architecture_overview.md)
-- [Async Chunk Design](../../design/feature/async_chunk.md)
-- [Stage Configuration Guide](../../configuration/stage_configs.md)
diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md
deleted file mode 100644
index e1dbc8234b0..00000000000
--- a/docs/contributing/profiling.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# Profiling Diffusion Models
-
-> **Warning:** Profiling is for development and debugging only. It adds significant overhead and should not be enabled in production.
-
-Diffusion profiling supports two backends through `profiler_config`:
-
-- `torch`: detailed CPU/CUDA traces, operator tables, and optional memory snapshots
-- `cuda`: low-overhead CUDA range control for NVIDIA Nsight Systems (`nsys`)
-
-## 1. Configure `profiler_config`
-
-Use `profiler_config` to enable profiling for a diffusion model. For diffusion usage, pass it directly to `Omni(...)` or `vllm serve`.
-
-Minimal torch-profiler config:
-
-```yaml
-profiler_config:
-  profiler: torch
-  torch_profiler_dir: ./perf
-```
-
-Supported fields:
-
-| Field | Description |
-|---|---|
-| `profiler` | Profiler backend. Supported values: `torch`, `cuda`. Use `torch` for `trace.json`, Excel operator tables, and optional memory snapshots. Use `cuda` for Nsight Systems only. |
-| `torch_profiler_dir` | Output directory for torch-profiler artifacts. Required when `profiler: torch`. |
-| `torch_profiler_use_gzip` | Compress `trace_rank*.json` into `trace_rank*.json.gz`. |
-| `torch_profiler_record_shapes` | Record input shapes and add a `by_shape` sheet to `ops_rank*.xlsx`. |
-| `torch_profiler_with_stack` | Record call stacks, add a `by_stack` sheet to `ops_rank*.xlsx`, and export `stacks_cpu_rank*.txt` and `stacks_cuda_rank*.txt`. |
-| `torch_profiler_with_memory` | Enable memory profiling and attempt to dump `memory_snapshot_rank*.pickle`. The pickle is only generated when the current backend supports memory history and snapshot APIs. |
-| `torch_profiler_with_flops` | Enable FLOPs collection in `torch.profiler`. This does not add a separate output file. |
-| `torch_profiler_dump_cuda_time_total` | Export an additional text summary `profiler_out_.txt` sorted by `self_cuda_time_total`. |
-| `delay_iterations` | Number of worker iterations to skip before profiling starts. |
-| `max_iterations` | Maximum number of worker iterations to capture before auto-stop. |
-| `wait_iterations` | Torch-profiler wait iterations before warmup. |
-| `warmup_iterations` | Torch-profiler warmup iterations. |
-| `active_iterations` | Torch-profiler active iterations. |
-
-### Minimal configurations by output
-
-Only collect trace output:
-
-```python
-profiler_config = {
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
-}
-```
-
-Outputs:
-
-- `trace_rank*.json`
-- `ops_rank*.xlsx` with a `summary` sheet
-
-Collect compressed trace output:
-
-```python
-profiler_config = {
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- "torch_profiler_use_gzip": True,
-}
-```
-
-Outputs:
-
-- `trace_rank*.json.gz`
-- `ops_rank*.xlsx` with a `summary` sheet
-
-Collect trace and full operator tables:
-
-```python
-profiler_config = {
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- "torch_profiler_record_shapes": True,
- "torch_profiler_with_stack": True,
-}
-```
-
-Outputs:
-
-- `trace_rank*.json`
-- `ops_rank*.xlsx` with `summary`, `by_shape`, and `by_stack`
-- `stacks_cpu_rank*.txt`
-- `stacks_cuda_rank*.txt`
-
-Collect trace, operator tables, and memory snapshots:
-
-```python
-profiler_config = {
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- "torch_profiler_record_shapes": True,
- "torch_profiler_with_stack": True,
- "torch_profiler_with_memory": True,
-}
-```
-
-Outputs:
-
-- `trace_rank*.json`
-- `ops_rank*.xlsx` with `summary`, `by_shape`, and `by_stack`
-- `stacks_cpu_rank*.txt`
-- `stacks_cuda_rank*.txt`
-- `memory_snapshot_rank*.pickle` when supported by the current backend
-
-### Full torch-profiler configuration
-
-If you want to enable the commonly used torch-profiler options together:
-
-```python
-profiler_config = {
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- "torch_profiler_use_gzip": False,
- "torch_profiler_record_shapes": True,
- "torch_profiler_with_stack": True,
- "torch_profiler_with_memory": True,
- "torch_profiler_with_flops": False,
- "torch_profiler_dump_cuda_time_total": False,
- "delay_iterations": 0,
- "max_iterations": 0,
- "wait_iterations": 0,
- "warmup_iterations": 0,
- "active_iterations": 0,
-}
-```
-
-## 2. Profiling Diffusion with PyTorch Profiler
-
-Single-stage diffusion models use `start_profile()` / `stop_profile()` controls. The profiler only writes artifacts after profiling has been started and then stopped.
-
-```python
-from vllm_omni import Omni
-
-omni = Omni(
- model="Wan-AI/Wan2.2-I2V-A14B-Diffusers",
- profiler_config={
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- },
-)
-
-omni.start_profile()
-...
-omni.stop_profile()
-```
-
-For diffusion offline example scripts under `examples/offline_inference/`, pass `--profiler-config` as a JSON object. The script enables profiling when this argument is set and wraps generation with `start_profile()` / `stop_profile()`.
-
-Example:
-
-```bash
-python examples/offline_inference/image_to_video/image_to_video.py \
- --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
- --image input.jpg \
- --prompt "A cat playing with yarn" \
- --profiler-config '{
- "profiler": "torch",
- "torch_profiler_dir": "./perf",
- "torch_profiler_record_shapes": true,
- "torch_profiler_with_stack": true
- }'
-```
-
-Examples:
-
-1. [Image edit example](https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_image/image_edit.py)
-2. [Image to video example](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video)
-
-## 3. Profiling Diffusion with Nsight Systems (`nsys`)
-
-For Nsight Systems, use `profiler: cuda` and wrap the process with `nsys profile`.
-
-```bash
-nsys profile \
- --trace-fork-before-exec=true \
- --cuda-graph-trace=node \
- --capture-range=cudaProfilerApi \
- --capture-range-end=repeat \
- -o diffusion_trace \
- python image_to_video.py ...
-```
-
-The Python process being profiled must create the diffusion engine with:
-
-```python
-profiler_config = {"profiler": "cuda"}
-```
-
-Then call `start_profile()` before the requests you want to capture and `stop_profile()` after them. The diffusion worker processes open and close the CUDA capture range themselves, so `nsys` sees the actual GPU work instead of only the parent process.
-
-## 4. Profiling Online Serving
-
-When `profiler_config.profiler` is set for a diffusion model, the server exposes:
-
-- `POST /start_profile`
-- `POST /stop_profile`
-
-### Start the server
-
-Single-stage diffusion serving with torch profiler:
-
-```bash
-vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
- --omni \
- --port 8091 \
- --profiler-config '{
- "profiler": "torch",
- "torch_profiler_dir": "/tmp/vllm_profile_wan22_i2v",
- "torch_profiler_with_stack": true,
- "torch_profiler_with_flops": false,
- "torch_profiler_use_gzip": true,
- "torch_profiler_dump_cuda_time_total": false,
- "torch_profiler_record_shapes": true,
- "torch_profiler_with_memory": true,
- "delay_iterations": 0,
- "max_iterations": 0,
- "wait_iterations": 0,
- "warmup_iterations": 0,
- "active_iterations": 0
- }'
-```
-
-Single-stage diffusion serving with Nsight Systems:
-
-```bash
-nsys profile \
- --trace-fork-before-exec=true \
- --cuda-graph-trace=node \
- --capture-range=cudaProfilerApi \
- --capture-range-end=repeat \
- -o serving_trace \
- vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers \
- --omni \
- --port 8091 \
- --profiler-config '{"profiler": "cuda"}'
-```
-
-### Control capture
-
-Example profiling flow for an online Qwen-Image request:
-
-```bash
-# Start profiling.
-curl -X POST http://localhost:8091/start_profile
-
-# Send a Qwen-Image generation request while profiling is active.
-curl http://localhost:8091/v1/images/generations \
- -H "Content-Type: application/json" \
- -d '{
- "model": "Qwen/Qwen-Image",
- "prompt": "A red vintage bicycle parked beside a quiet canal at sunset"
- }'
-
-# Stop profiling and flush profiler artifacts.
-curl -X POST http://localhost:8091/stop_profile
-```
-
-## 5. Diffusion Pipeline Profiler
-
-For lightweight per-stage pipeline timing such as `vae.decode` or `diffuse`, see [Diffusion Pipeline Profiler](model/adding_diffusion_model.md#diffusion-pipeline-profiler-performance-profiling). That utility logs stage durations only and does not generate torch-profiler artifacts such as `trace.json`, Excel tables, or memory snapshots.
-
-## 6. Analyze Results
-
-Torch-profiler output:
-
-- Chrome/Perfetto trace: `trace_rank*.json` or `trace_rank*.json.gz`
-- Excel workbook: `ops_rank*.xlsx` with `summary`, and optional `by_shape` / `by_stack` sheets
-- Stack exports: `stacks_cpu_rank*.txt` and `stacks_cuda_rank*.txt` when stack capture is enabled
-- Memory snapshot: `memory_snapshot_rank*.pickle` when memory capture is enabled and supported by the backend
-- Optional CUDA-time text summary: `profiler_out_.txt` when `torch_profiler_dump_cuda_time_total` is enabled
-
-CUDA profiler / Nsight Systems output:
-
-- `.nsys-rep` report files written by `nsys -o ...`
-
-Recommended viewers:
-
-- [Perfetto](https://ui.perfetto.dev/) for torch traces
-- `nsys stats .nsys-rep` for CLI summaries
-- Nsight Systems GUI for CUDA kernel timelines
-
-For upstream background on the underlying vLLM profiling infrastructure, see the [vLLM profiling guide](https://docs.vllm.ai/en/stable/contributing/profiling/).
diff --git a/docs/design/architecture_overview.md b/docs/design/architecture_overview.md
deleted file mode 100644
index 1c38ba67183..00000000000
--- a/docs/design/architecture_overview.md
+++ /dev/null
@@ -1,202 +0,0 @@
-# Architecture Overview
-
-This document outlines the architectural design for vLLM-Omni.
-
-
-
-
-
-
-
-
-# Goals
-
-The primary goal of the vLLM-Omni project is to build the fastest and easiest-to-use open-source Omni-Modality model inference & serving engine. vLLM-Omni extends the original vLLM, which was created to support large language models for text-based autoregressive (AR) generation tasks. vLLM-Omni is designed to support:
-
-* **Non-textual Output:** Enables the integration, efficient processing and output of various data types, including but not limited to, images, audio, and video, alongside text.
-* **Non-Autoregressive Structure:** Supports model structures beyond autoregressive ones, especially the Diffusion Transformer (DiT), which is widely used in visual and audio generation.
-* **Integration with vLLM Core:** Maintain compatibility and leverage existing vLLM key modules and optimizations where applicable.
-* **Extensibility:** Design a modular and flexible architecture that can easily accommodate new modalities, model architectures, and output formats.
-
-
-# Representative omni-modality models
-
-An analysis of currently popular open-source models shows that most of them combine AR and DiT components. They fall into the three categories below:
-
-**DiT as a main structure, with AR as text encoder (e.g.: Qwen-Image)**
- A powerful image generation foundation model capable of complex text rendering and precise image editing.
-
-
-
-
-
-
-
-
-**AR as a main structure, with DiT as multi-modal generator (e.g. BAGEL)**
- A unified multimodal comprehension and generation model, with chain-of-thought (CoT) text output and visual generation.
-
-
-
-
-
-
-
-
-**AR+DiT (e.g. Qwen-Omni)**
- A natively end-to-end omni-modal LLM for multimodal inputs (text/image/audio/video...) and outputs (text/audio...).
-
-
-
-
-
-
-
-
-# vLLM-Omni main architecture
-
-
-
-
-
-
-
-
-## Key Components
-
-| Component | Description |
-| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
-| **OmniRouter** | provides an intelligent router for dispatching omni-modality requests |
-| **EntryPoints** | define the APIs for offline/online serving (APIServer, Omni/AsyncOmni), while `AsyncOmniEngine` and `Orchestrator` coordinate multi-stage AR/DiT execution |
-| **AR** | adapted for omni-modality models while inheriting efficient features from vLLM, such as cache management |
-| **Diffusion** | natively implemented and optimized using acceleration components |
-| **OmniConnector** | supports full disaggregation across stages based on E/P/D/G (Encoding/Processing/Decoding/Generation) |
-
-Disaggregated stages are managed through stage configuration. In Qwen3-Omni, Thinker/Talker/Code2wav are declared as separate configured stages, and runtime routing is handled by `Orchestrator` over `StageEngineCoreClient` / `StageDiffusionClient`.
-
-## Main features
-
-vLLM-Omni aims to be fast, flexible, and easy to use with the following features:
-
-### Performance and Acceleration
-
-The framework achieves high performance through several optimization techniques:
-
-* **Efficient AR Support:** Leverages efficient KV cache management inherited from vLLM.
-* **Pipelined Execution:** Overlaps the execution of pipelined stages to sustain high throughput.
-* **Full Disaggregation:** Relies on the OmniConnector and dynamic resource allocation across stages.
-* **Diffusion Acceleration:** Includes integrated support for diffusion acceleration. This is managed by the acceleration layer, which handles:
- * **Cache:** Includes DBCache, TeaCache, and third-party integration (e.g., [cache-dit](https://github.com/vipshop/cache-dit)).
- * **Parallelism:** Supports TP, CP, USP, and CFG.
- * **Attention:** Provides an interface for third-party integration (e.g., FA3, SAGE, MindIE-SD).
- * **Quantization:** Supports various quantization implementations including FP8 and AWQ.
- * **FusedOps:** Allows for custom and third-party integration.
-
-### Classifier-Free Guidance (CFG) Companion Flow
-
-vLLM-Omni models Classifier-Free Guidance (CFG) across disaggregated multi-stage setups with a "companion request" paradigm, which avoids redundant computation of the textual/multimodal context across stage boundaries:
-1. **Prompt Expansion:** In the initial autoregressive (AR) stage, a custom `prompt_expand_func` hook intercepts each incoming generation prompt and pairs it on the fly with a negative companion prompt (e.g., a default negative prompt), tagging the companion with an internal role (`cfg_text`).
-2. **Synchronized KV Cache Transfer:** The AR stage processes the primary and companion sequences concurrently. The `OmniConnector` then passes both the positive and negative KV caches across the stage boundary via shared memory or network protocols.
-3. **KV Cache Collection & Injection:** In the downstream Diffusion (DiT) engine, the configured `cfg_kv_collect_func` collects the companion caches (`cfg_text_past_key_values`) and binds them to the primary generation request, so the DiT engine can apply cross-attention CFG guidance over the conditional and unconditional contexts in parallel.
-
-### Flexibility and Usability
-
-vLLM-Omni is designed to be flexible and straightforward for users:
-
-* **Heterogeneous Pipeline Abstraction:** Manages complex model workflows effectively.
-* **Hugging Face Integration:** Offers seamless integration with popular Hugging Face models.
-* **Distributed Inference:** Supports tensor, pipeline, data, and expert parallelism.
-* **Streaming Outputs:** Supports streaming outputs.
-* **Unified API:** Provides a consistent and unified API interface compatible with vLLM.
-* **OpenAI-compatible API Server:** Includes a FastAPI-based server for online serving that is compatible with the OpenAI API.
-
-# Interface design
-
-If you use vLLM, then you know how to use vLLM-Omni from Day 0:
-
-
-
-
-
-
-
-
-Taking **Qwen3-Omni** as an example:
-
-## Offline Inference
-The **Omni** class provides a Python interface for offline batched inference. Users initialize the Omni class with a Hugging Face model name and use the generate method, passing inputs that include both text prompts and multi-modal data:
-
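-```python
-# Create an omni runtime with HF model name.
-from vllm_omni.entrypoints.omni import Omni
-
-omni = Omni(model="Qwen/Qwen3-Omni-30B-A3B-Instruct")
-
-# Example prompts. `prompt`, `video_frames`, `audio_signal`, and
-# `sampling_params_list` are placeholders that the caller must define.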
-om_inputs = {"prompt": prompt,
- "multi_modal_data": {
- "video": video_frames,
- "audio": audio_signal,
- }}
-
-# Generate texts and audio from the multi-modality inputs.
-outputs = omni.generate(om_inputs, sampling_params_list)
-```
-
-## Online Serving
-Similar to vLLM, vLLM-Omni also provides a FastAPI-based server for online serving. Users can launch the server using the `vllm serve` command with the `--omni` flag:
-
-```bash
-vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
-```
-
-Users can send requests to the server using curl:
-
-```bash
-# prepare user content
-user_content='[
- {
- "type": "video_url",
- "video_url": {
- "url": "'"$SAMPLE_VIDEO_URL"'"
- }
- },
- {
- "type": "text",
- "text": "Why is this video funny?"
- }
- ]'
- sampling_params_list='[
- '"$thinker_sampling_params"',
- '"$talker_sampling_params"',
- '"$code2wav_sampling_params"'
- ]'
- mm_processor_kwargs="{}"
-
-# send the request
-curl -sS -X POST http://localhost:8091/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d @- <
-```
-
-## Architecture
-
-### Async Chunk Pipeline Overview
-
-The following diagram illustrates the **Async Chunk Architecture** for multi-stage models (e.g., Qwen3-Omni with Thinker → Talker → Code2Wav), showing how data flows through the 4-stage pipeline with parallel processing and dual-stream output:
-
-
-
-In sequential mode, each stage must wait for the previous stage to complete entirely before starting.
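-
-To make the difference concrete, the sketch below compares sequential execution with an idealized chunk pipeline. It is a back-of-the-envelope model, not a measurement: the per-chunk stage times and the chunk count are hypothetical, and the pipelined formula assumes constant per-chunk cost, unlimited buffering between stages, and zero transfer overhead.
-
-```python
-def sequential_total(stage_times_ms: list[int], num_chunks: int) -> int:
-    """Each stage processes every chunk before the next stage starts."""
-    return sum(t * num_chunks for t in stage_times_ms)
-
-
-def sequential_first_output(stage_times_ms: list[int], num_chunks: int) -> int:
-    """First output appears after all earlier stages finish, plus one chunk of the last stage."""
-    return sum(t * num_chunks for t in stage_times_ms[:-1]) + stage_times_ms[-1]
-
-
-def pipelined_total(stage_times_ms: list[int], num_chunks: int) -> int:
-    """Fill the pipeline once, then the slowest stage paces the remaining chunks."""
-    return sum(stage_times_ms) + (num_chunks - 1) * max(stage_times_ms)
-
-
-def pipelined_first_output(stage_times_ms: list[int]) -> int:
-    """The first chunk only has to traverse each stage once."""
-    return sum(stage_times_ms)
-
-
-# Hypothetical per-chunk costs for three stages (e.g. Thinker, Talker, Code2Wav).
-stages = [30, 20, 10]
-chunks = 8
-print("sequential:", sequential_total(stages, chunks), "ms total,",
-      sequential_first_output(stages, chunks), "ms to first output")
-print("pipelined: ", pipelined_total(stages, chunks), "ms total,",
-      pipelined_first_output(stages), "ms to first output")
-```
-
-Under these assumptions the time to first output drops the most, which is the main benefit for streaming audio.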
-
-### Async Chunk System Architecture
-
-
-
-
-
-
-
-
-### Key Components
-
-1. **OmniConnector**: Inter-stage data transport only
- - Shared memory or other IPC mechanisms
- - **Transport-only API**: `put(from_stage, to_stage, put_key, data)` and `get(from_stage, to_stage, get_key)` (optionally with timeout)
- - **No request-specific state**: Connector does not track put_requests, get_requests, request_payload, finished_requests, or other request-bound metadata; it only performs put/get operations
- - Chunk keys and request/chunk lifecycle are managed by **OmniChunkTransferAdapter**
-
-2. **Transfer Adapter Layer**: Extensible abstraction for managing data transfer via connectors
- - **OmniTransferAdapterBase**: Base class with background **recv_loop** and **save_loop** threads
- - **OmniChunkTransferAdapter**: Chunk-specific implementation that owns the full chunk lifecycle when async_chunk is enabled
- - **Chunk ID and key construction**: Builds keys like `{req_id}_{stage_id}_{chunk_id}` for put/get
- - **Async get**: `load_async(request)` enqueues the request; background **recv_loop** polls the connector (non-blocking); when data is available, updates the request and marks it in `_finished_load_reqs`; scheduler calls `get_finished_requests()` to learn which requests have chunks ready
- - **Async put**: `save_async(pooling_output, request)` invokes `custom_process_next_stage_input_func` in the main thread to build the payload, then enqueues a save task; background **save_loop** performs `connector.put()`; payload processing and chunk accumulation (e.g. code2wav chunk_size) remain in the main thread
-
-3. **Stage Input Processors**: Custom functions that process stage outputs into chunks for different models
- - Receive **transfer_manager** (OmniChunkTransferAdapter)
- - Qwen3-omni reference: `thinker2talker_async_chunk`, `talker2code2wav_async_chunk`
-
-4. **Schedulers**: Modified to handle chunk-based scheduling with async IO-compute overlap
- - `OmniARScheduler`: For autoregressive stages
- - `OmniGenerationScheduler`: For generation stages
- - Both schedulers use **OmniChunkTransferAdapter** and **before/after** hooks around `super().schedule()`:
- - **Before** `super().schedule()`: `process_pending_chunks(waiting, running)` moves requests waiting for chunks to `WAITING_FOR_CHUNK`, enqueues load tasks for background polling
- - **After** `super().schedule()`: `restore_queues(waiting, running)` restores requests with ready chunks back to waiting/running, `postprocess_scheduler_output(scheduler_output)` attaches cached additional_information, clears chunk-ready flags
- - **put_chunk** `save_async(pooler_output, request)`; **get_chunk** / **get_chunk_for_generation** `load_async(request)`
-
-5. **Model Runners**: Handle chunk processing
- - `OmniGPUModelRunner`: Processes chunks in AR stages
- - `GPUGenerationModelRunner`: Processes chunks in generation stages
- - Uses `ubatch_slices` from `get_forward_context()` to track per-request sequence lengths in batched inference
- - Reuses `ubatch_slices_padded` for code2wav batching to properly split batch outputs
- - Handles list-type multimodal outputs: iterates through requests and assigns corresponding tensor to each
- - Improved request state management: removes unscheduled and finished requests from input batch
-
-6. **Model Implementation**: Model-specific chunk handling
- - `Qwen3OmniMoeForConditionalGeneration`: Main model with async_chunk support
- - **Code2Wav stage batching**: Uses `ubatch_slices` to construct batched codec codes tensor `[batch_size, 16, max_seq_len]`
- - **Batch output handling**: `generate_audio()` returns `list[torch.Tensor]`, one audio tensor per request
- - **Multimodal outputs**: Returns list of audio tensors for batch processing instead of single concatenated tensor
- - `Qwen3OmniCode2WavDecoder`: Audio generation model
- - `chunked_decode()` and `chunked_decode_streaming()`: Return `list[torch.Tensor]` (one per request)
- - Uses `ubatch_slices` to split batched waveform output into per-request audio chunks
- - Each request gets correctly sized audio based on its code sequence length: `waveform[:, :, :code_seq_len * total_upsample]`
-
-7. **Request status**: `RequestStatus.WAITING_FOR_CHUNK` is added via patch (e.g. in `vllm_omni/patch.py`) so requests waiting for a chunk are not scheduled by the base vLLM scheduler until the chunk is ready.
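-
-The split between a transport-only connector and an adapter that owns chunk keys can be sketched as follows. This is a toy in-memory illustration, not the vLLM-Omni implementation: `InMemoryConnector` and `chunk_key` are hypothetical names, the `put`/`get` argument order follows the description above, and which stage ID goes into the key is an assumption.
-
-```python
-from typing import Any, Optional
-
-
-class InMemoryConnector:
-    """Toy transport-only connector: it stores payloads and keeps no request state."""
-
-    def __init__(self) -> None:
-        self._store: dict[tuple[int, int, str], Any] = {}
-
-    def put(self, from_stage: int, to_stage: int, put_key: str, data: Any) -> None:
-        self._store[(from_stage, to_stage, put_key)] = data
-
-    def get(self, from_stage: int, to_stage: int, get_key: str) -> Optional[Any]:
-        # Non-blocking: return None when the chunk has not arrived yet.
-        return self._store.pop((from_stage, to_stage, get_key), None)
-
-
-def chunk_key(req_id: str, stage_id: int, chunk_id: int) -> str:
-    """Build a key of the form {req_id}_{stage_id}_{chunk_id}."""
-    return f"{req_id}_{stage_id}_{chunk_id}"
-
-
-connector = InMemoryConnector()
-
-# Producer side (stage 0): the adapter, not the connector, numbers the chunks.
-for chunk_id, payload in enumerate(["chunk-a", "chunk-b"]):
-    connector.put(0, 1, chunk_key("req-42", 0, chunk_id), payload)
-
-# Consumer side (stage 1): poll for the next expected chunk until none is ready.
-received = []
-next_chunk = 0
-while (data := connector.get(0, 1, chunk_key("req-42", 0, next_chunk))) is not None:
-    received.append(data)
-    next_chunk += 1
-
-print(received)
-```
-
-Because the connector never inspects the key, request and chunk lifecycle logic can change in the adapter without touching the transport.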
-
-## Configuration
-
-Enable async_chunk in stage configuration YAML:
-
-```yaml
-async_chunk: true
-stage_args:
-  - stage_id: 0
-    engine_args:
-      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk
-  - stage_id: 1
-    engine_args:
-      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
-```
-
-### Stage Configuration
-
-- `async_chunk: bool`: Enable/disable async chunk mode
-- `custom_process_next_stage_input_func: str`: Path to custom chunk processing function; receives `(transfer_manager, pooling_output, request)`. For qwen3-omni: `thinker2talker_async_chunk`, `talker2code2wav_async_chunk`
-- `stage_connector_config: dict`: Connector configuration
-- `worker_type: str`: Model type, e.g. `"ar"` or `"generation"` (used by OmniChunkTransferAdapter for mode-specific payload handling)
-- `max_num_seqs: int`: Maximum number of sequences for concurrent processing in the stage
-
-
-### Connector Configuration
-
-```yaml
-connectors:
- - from_stage: 0
- to_stage: 1
- spec:
- name: SharedMemoryConnector
- extra:
- stage_id: 0
-```
-
-### Code2Wav Batch Configuration
-
-For optimal performance with async_chunk, the code2wav stage should be configured with batching:
-
-```yaml
-stage_args:
-  - stage_id: 2  # code2wav stage
-    runtime:
-      devices: "1"
-    engine_args:
-      model_stage: code2wav
-      max_num_seqs: 64  # Enables batched audio generation
-```
-
-## Related Files
-
-- `vllm_omni/model_executor/stage_input_processors/qwen3_omni.py`: Chunk processing functions (receive `transfer_manager` as first param)
-- `vllm_omni/distributed/omni_connectors/transfer_adapter/base.py`: OmniTransferAdapterBase (recv_loop, save_loop, load_async, save_async)
-- `vllm_omni/distributed/omni_connectors/transfer_adapter/chunk_transfer_adapter.py`: OmniChunkTransferAdapter (process_pending_chunks, restore_queues, postprocess_scheduler_output)
-- `vllm_omni/distributed/omni_connectors/connectors/shm_connector.py`: SharedMemoryConnector (transport-only put/get)
-- `vllm_omni/core/sched/omni_ar_scheduler.py`: AR scheduler with chunk_transfer_adapter
-- `vllm_omni/core/sched/omni_generation_scheduler.py`: Generation scheduler with same async chunk pattern
-- `vllm_omni/worker/gpu_model_runner.py`: Model runner with chunk handling
-- `vllm_omni/worker/gpu_generation_model_runner.py`: Generation model runner with batch output handling and ubatch_slices support
-- `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py`: Model implementation with code2wav batching
-- `vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_code2wav.py`: Code2wav decoder with batch support
-- `vllm_omni/engine/arg_utils.py`: Configuration definitions (async_chunk, worker_type)
-- `vllm_omni/config/model.py`: Model config with async_chunk field
diff --git a/docs/design/feature/cache_dit.md b/docs/design/feature/cache_dit.md
deleted file mode 100644
index 237a958774d..00000000000
--- a/docs/design/feature/cache_dit.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# Cache-DiT
-
-This section describes how to add cache-dit acceleration to a new diffusion pipeline. We use the Qwen-Image pipeline and LongCat-Image pipeline as reference implementations.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Standard Models: Automatic Support](#standard-models-automatic-support)
-- [Custom Architectures: Writing Custom Implementation](#custom-architectures-writing-custom-implementation)
-- [Testing](#testing)
-- [Troubleshooting](#troubleshooting)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is Cache-DiT?
-
-Cache-DiT is an acceleration library for Diffusion Transformers (DiT) that caches intermediate computation results across denoising steps. The core insight is that adjacent denoising steps often produce similar intermediate features, so we can skip redundant computations by reusing cached results.
-
-The library supports three main caching strategies:
-
-- **DBCache:** Dynamic block-level caching that selectively computes or caches transformer blocks based on residual differences
-- **TaylorSeer:** Calibration-based prediction that estimates block outputs using Taylor expansion
-- **SCM (Step Computation Masking):** Dynamic step skipping based on configurable policies
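-
-The idea behind residual-based caching can be sketched in a few lines. This is an illustrative model only and does not call cache-dit: the relative L1 metric and the helper names are assumptions, while `max_warmup_steps` and `residual_diff_threshold` mirror the configuration keys used later in this document.
-
-```python
-from typing import Optional, Sequence
-
-
-def relative_l1_diff(current: Sequence[float], previous: Sequence[float]) -> float:
-    """Relative L1 distance between two residual vectors (assumed metric)."""
-    numerator = sum(abs(c - p) for c, p in zip(current, previous))
-    denominator = sum(abs(p) for p in previous)
-    return numerator / denominator if denominator else float("inf")
-
-
-def can_reuse_cache(
-    step: int,
-    current_residual: Sequence[float],
-    previous_residual: Optional[Sequence[float]],
-    *,
-    max_warmup_steps: int,
-    residual_diff_threshold: float,
-) -> bool:
-    """Reuse cached block outputs only after warmup and when the residual barely changed."""
-    if step < max_warmup_steps or previous_residual is None:
-        return False  # always compute during warmup
-    return relative_l1_diff(current_residual, previous_residual) < residual_diff_threshold
-
-
-config = {"max_warmup_steps": 4, "residual_diff_threshold": 0.24}
-print(can_reuse_cache(2, [1.1, 0.9], [1.0, 1.0], **config))  # still in warmup
-print(can_reuse_cache(5, [1.1, 0.9], [1.0, 1.0], **config))  # small change
-print(can_reuse_cache(5, [2.0, 1.0], [1.0, 1.0], **config))  # large change
-```
-
-A lower threshold makes the check stricter, which trades speed for fidelity.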
-
-### Architecture
-
-vLLM-Omni integrates cache-dit through the `CacheDiTBackend` class, which provides a unified interface for managing cache-dit acceleration on diffusion models.
-
-| Method/Class | Purpose | Behavior |
-|--------------|---------|----------|
-| [`CacheDiTBackend`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/#vllm_omni.diffusion.cache.CacheBackend) | Unified backend interface | Automatically handles enabler selection and cache refresh |
-| [`enable_cache_for_dit()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/cache_dit_backend/#vllm_omni.diffusion.cache.cache_dit_backend.enable_cache_for_dit) | Apply caching to transformer | Configures DBCache on transformer blocks |
-
-**Key APIs from Cache-DiT:**
-
-[Cache-DiT API Reference](https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/)
-
-| API | Description |
-|-----|-------------|
-| `BlockAdapter` | Core abstraction for applying cache-dit to transformers. Specifies transformer module(s), block list(s), and forward signature pattern(s). |
-| `ForwardPattern` | Defines block forward signature patterns: `Pattern_0`, `Pattern_1`, `Pattern_2` |
-| `ParamsModifier` | Per-transformer or per-block-list cache configuration customization |
-| `DBCacheConfig` | Configuration for DBCache parameters (warmup steps, cached steps, thresholds) |
-| `refresh_context()` | Updates the cache context; called when `num_inference_steps` changes |
-
----
-
-## Standard Models: Automatic Support
-
-Most DiT models follow this pattern:
-- Single transformer with one `ModuleList` of blocks
-- Standard forward signature
-- Compatible with cache-dit's automatic detection
-
-**Examples:** Qwen-Image, Z-Image
-
-For standard single-transformer models, **no code changes are needed**. The `CacheDiTBackend` automatically uses `enable_cache_for_dit()`:
-
-```python
-from vllm_omni import Omni
-
-# Works automatically for standard models
-omni = Omni(
- model="Qwen/Qwen-Image", # Standard single-transformer model
- cache_backend="cache_dit",
- cache_config={
- "Fn_compute_blocks": 1,
- "Bn_compute_blocks": 0,
- "max_warmup_steps": 4,
- }
-)
-```
-
-**What happens automatically:**
-
-```python
-def enable_cache_for_dit(pipeline: Any, cache_config: Any) -> Callable[[int], None]:
-    """Default enabler for standard single-transformer DiT models."""
-
-    # Build cache configuration
-    db_cache_config = DBCacheConfig(
-        num_inference_steps=None,  # Will be set during first inference
-        Fn_compute_blocks=cache_config.Fn_compute_blocks,
-        Bn_compute_blocks=cache_config.Bn_compute_blocks,
-        max_warmup_steps=cache_config.max_warmup_steps,
-        max_cached_steps=cache_config.max_cached_steps,
-        max_continuous_cached_steps=cache_config.max_continuous_cached_steps,
-        residual_diff_threshold=cache_config.residual_diff_threshold,
-    )
-
-    # Enable cache-dit on transformer
-    cache_dit.enable_cache(
-        pipeline.transformer,
-        cache_config=db_cache_config,
-    )
-
-    # Return refresh function for dynamic num_inference_steps updates
-    def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool = True):
-        cache_dit.refresh_context(pipeline.transformer, num_inference_steps=num_inference_steps, verbose=verbose)
-
-    return refresh_cache_context
-```
-
----
-
-## Custom Architectures: Writing Custom Implementation
-
-Some models require custom handling:
-
-- **Single or dual-transformer:** Models that may use one or two transformers (e.g., Wan2.2)
-- **Multi-block-list:** Models with multiple block lists in one transformer (e.g., LongCatImage with `transformer_blocks` + `single_transformer_blocks`)
-- **Special forward patterns:** Models with non-standard block execution patterns
-
-### Example 1: Single or Dual-Transformer Model (Wan2.2)
-
-Wan2.2 can use either a single transformer or two transformers (one for high-noise steps and one for low-noise steps). The implementation automatically detects the mode based on the presence of `transformer_2`.
-
-**Key difference:** Use `BlockAdapter` to wrap multiple transformers with separate configurations.
-
-```python
-# Standard: cache_dit.enable_cache(pipeline.transformer, ...)
-# Custom: Use BlockAdapter to handle multiple transformers
-cache_dit.enable_cache(
- BlockAdapter(
- transformer=[pipeline.transformer, pipeline.transformer_2], # Multiple transformers
- blocks=[pipeline.transformer.blocks, pipeline.transformer_2.blocks],
- forward_pattern=[ForwardPattern.Pattern_2, ForwardPattern.Pattern_2],
- params_modifiers=[
- ParamsModifier(...), # Config for high-noise transformer
- ParamsModifier(...), # Config for low-noise transformer (different params)
- ],
- ),
- cache_config=db_cache_config,
-)
-```
-
-**Key difference:** `refresh_context` must be called on each transformer separately.
-
-```python
-# Standard: cache_dit.refresh_context(pipeline.transformer, num_inference_steps=N)
-# Custom: Refresh each transformer with its own step count
-def refresh_cache_context(pipeline, num_inference_steps, verbose=True):
-    high_steps, low_steps = _split_inference_steps(num_inference_steps)
-    cache_dit.refresh_context(pipeline.transformer, num_inference_steps=high_steps, ...)
-    cache_dit.refresh_context(pipeline.transformer_2, num_inference_steps=low_steps, ...)
-```
-
-### Example 2: Multi-Block-List Model (LongCatImage)
-
-LongCatImage has a single transformer with two block lists: `transformer_blocks` and `single_transformer_blocks`.
-
-**Key difference:** Use `BlockAdapter` to specify multiple block lists within one transformer.
-
-```python
-# Standard: cache_dit.enable_cache(pipeline.transformer, ...)
-# - Automatically detects single block list
-# Custom: Use BlockAdapter to specify multiple block lists
-cache_dit.enable_cache(
- BlockAdapter(
- transformer=pipeline.transformer, # Single transformer
- blocks=[
- pipeline.transformer.transformer_blocks, # Block list 1
- pipeline.transformer.single_transformer_blocks, # Block list 2
- ],
- forward_pattern=[ForwardPattern.Pattern_1, ForwardPattern.Pattern_1],
- params_modifiers=[modifier],
- ),
- cache_config=db_cache_config,
-)
-```
-
-> **Note:** For single transformer with multiple block lists, `refresh_context` works the same as standard models.
-
-### Registering Custom Implementations
-
-After writing your custom enabler, register it in `CUSTOM_DIT_ENABLERS` in `vllm_omni/diffusion/cache/cache_dit_backend.py`:
-
-```python
-CUSTOM_DIT_ENABLERS = {
- "Wan22Pipeline": enable_cache_for_wan22,
- "LongCatImagePipeline": enable_cache_for_longcat_image,
- "YourCustomPipeline": enable_cache_for_your_model, # Add here
-}
-```
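-
-How a registry like this is typically consulted can be sketched with stand-in functions. The lookup below is an assumption about the behavior (keyed on `pipeline.__class__.__name__`, with a fallback to the default enabler); the enablers and pipeline classes here are placeholders, and `select_enabler` is a hypothetical helper rather than a vLLM-Omni API.
-
-```python
-from typing import Any, Callable
-
-Enabler = Callable[[Any, Any], str]
-
-
-def enable_cache_for_dit(pipeline: Any, cache_config: Any) -> str:
-    return "default enabler"  # stand-in for the real default enabler
-
-
-def enable_cache_for_wan22(pipeline: Any, cache_config: Any) -> str:
-    return "wan22 enabler"  # stand-in for a custom enabler
-
-
-CUSTOM_DIT_ENABLERS: dict[str, Enabler] = {
-    "Wan22Pipeline": enable_cache_for_wan22,
-}
-
-
-def select_enabler(pipeline: Any) -> Enabler:
-    """Pick a custom enabler by pipeline class name, else fall back to the default."""
-    return CUSTOM_DIT_ENABLERS.get(type(pipeline).__name__, enable_cache_for_dit)
-
-
-class Wan22Pipeline:  # placeholder pipeline classes for the demo
-    pass
-
-
-class QwenImagePipeline:
-    pass
-
-
-for pipeline in (Wan22Pipeline(), QwenImagePipeline()):
-    enabler = select_enabler(pipeline)
-    print(type(pipeline).__name__, "->", enabler(pipeline, cache_config=None))
-```
-
-With a name-keyed lookup, a typo in the registry key silently selects the default enabler, so the key must match the class name exactly.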
-
----
-
-## Testing
-
-After adding cache-dit support, test with:
-
-```python
-from vllm_omni import Omni
-from vllm_omni.inputs.data import OmniDiffusionSamplingParams
-
-# Test your custom model
-omni = Omni(
- model="your-model-name",
- cache_backend="cache_dit",
- cache_config={
- "Fn_compute_blocks": 1,
- "Bn_compute_blocks": 0,
- "max_warmup_steps": 4,
- "residual_diff_threshold": 0.24,
- }
-)
-
-images = omni.generate(
- "a beautiful landscape",
- OmniDiffusionSamplingParams(num_inference_steps=50),
-)
-```
-
-**Verify:**
-
-1. Cache is applied (check logs for "Cache-dit enabled successfully on xxx")
-2. Performance improvement (should be around 1.5x-2x faster)
-3. Image quality (compare with `cache_backend=None`)
-
----
-
-## Troubleshooting
-
-### Issue: Cache not applied
-
-**Symptoms:** No speedup observed, no cache-related log messages.
-
-**Causes & Solutions:**
-
-- **Enabler not registered:**
-
-**Problem:** Pipeline name not in `CUSTOM_DIT_ENABLERS` registry.
-
-**Solution:** Verify `pipeline.__class__.__name__` matches the registry key and add your enabler to `CUSTOM_DIT_ENABLERS`.
-
-### Issue: Quality degradation
-
-**Symptoms:** Generated images have artifacts or lower quality compared to non-cached inference.
-
-**Causes & Solutions:**
-
-- **Cache parameters too aggressive:**
-
-**Solution:**
-```python
-cache_config={
- "residual_diff_threshold": 0.12, # Lower from 0.24 (try 0.12-0.18)
- "max_warmup_steps": 6, # Increase from 4 (try 6-8)
- "max_continuous_cached_steps": 2, # Reduce if higher
-}
-```
-
-Check the [user guide for cache_dit](../../user_guide/diffusion/cache_acceleration/cache_dit.md) for more adjustable parameters.
-
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **Standard DiT** | [`cache_dit_backend.py::enable_cache_for_dit`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/cache_dit_backend/#vllm_omni.diffusion.cache.cache_dit_backend.enable_cache_for_dit) | Default enabler | Single transformer, automatic |
-| **Wan2.2** | [`cache_dit_backend.py::enable_cache_for_wan22`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/cache_dit_backend/#vllm_omni.diffusion.cache.cache_dit_backend.enable_cache_for_wan22) | Single or dual-transformer | Auto-detects mode based on transformer_2 presence |
-| **LongCat** | [`cache_dit_backend.py::enable_cache_for_longcat_image`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/cache_dit_backend/#vllm_omni.diffusion.cache.cache_dit_backend.enable_cache_for_longcat_image) | Multi-block-list | Two block lists in one transformer |
-| **BAGEL** | [`cache_dit_backend.py::enable_cache_for_bagel`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/cache_dit_backend/#vllm_omni.diffusion.cache.cache_dit_backend.enable_cache_for_bagel) | Omni model | Complex architecture |
-
----
-
-## Summary
-
-Adding cache-dit support:
-
-1. ✅ **Check model type** - Standard models work automatically, custom architectures need enablers
-2. ✅ **Write enabler** (if needed) - Use `BlockAdapter` for complex architectures
-3. ✅ **Register enabler** (if needed) - Add to `CUSTOM_DIT_ENABLERS` dictionary
-4. ✅ **Return refresh function** (if needed) - Handle `num_inference_steps` changes
-5. ✅ **Test** - Verify with `cache_backend="cache_dit"`
-
-For most models, the default enabler is sufficient. Only write custom enablers for complex architectures!
diff --git a/docs/design/feature/cfg_parallel.md b/docs/design/feature/cfg_parallel.md
deleted file mode 100644
index c73a87749f5..00000000000
--- a/docs/design/feature/cfg_parallel.md
+++ /dev/null
@@ -1,350 +0,0 @@
-# CFG-Parallel
-
-This section describes how to add CFG-Parallel (Classifier-Free Guidance Parallel) to a diffusion pipeline. We use the Qwen-Image pipeline as the reference implementation.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation](#step-by-step-implementation)
-- [Customization](#customization)
-- [Testing](#testing)
-- [Troubleshooting](#troubleshooting)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is CFG-Parallel?
-
-In standard Classifier-Free Guidance, each diffusion step requires two forward passes through the transformer:
-
-1. **Positive/Conditional**: Guided by the text prompt
-2. **Negative/Unconditional**: Typically using empty or negative prompt
-
-Some models require 3 or more CFG branches (see [N-Branch CFG](#n-branch-cfg-3-branches)).
-
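-Running these passes back to back on a single GPU means every denoising step costs at least two transformer forward passes. CFG-Parallel removes this bottleneck by distributing the forward passes across different GPU ranks, so they execute simultaneously rather than sequentially.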
-
-### Architecture
-
-vLLM-Omni provides `CFGParallelMixin`, which encapsulates all CFG parallel logic. Pipelines inherit from this mixin and implement a `diffuse()` method that orchestrates the denoising loop.
-
-| Method | Purpose | Automatic Behavior |
-|--------|---------|-------------------|
-| [`predict_noise_maybe_with_cfg()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Predict noise with 2-branch CFG | Detects parallel mode, distributes computation, gathers results |
-| [`predict_noise_with_multi_branch_cfg()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Predict noise with N-branch CFG | Round-robin dispatches N branches across M GPUs |
-| [`scheduler_step_maybe_with_cfg()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Step scheduler | All ranks step locally (no broadcast needed) |
-| [`combine_cfg_noise()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Combine 2-branch predictions | Applies CFG formula with optional normalization |
-| [`combine_multi_branch_cfg_noise()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Combine N-branch predictions | Override for custom multi-branch combine logic |
-| [`predict_noise()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Forward pass wrapper | Override for custom transformer calls |
-| [`cfg_normalize_function()`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/distributed/cfg_parallel/) | Normalize CFG output | Override for custom normalization |
-
-### How It Works
-
-`predict_noise_maybe_with_cfg()` automatically detects and switches between two execution modes:
-
-- **CFG-Parallel mode** (when `cfg_world_size > 1`):
- - Rank 0 computes positive prompt prediction
- - Rank 1 computes negative prompt prediction
- - Results are gathered via `all_gather()`
- - All ranks compute CFG combine locally (deterministic, identical results)
-
-- **Sequential mode** (when `cfg_world_size == 1`):
- - Single rank computes both positive and negative predictions
- - Directly combines them with CFG formula
-
-`scheduler_step_maybe_with_cfg()` ensures consistent latent states across all ranks:
-
-- All ranks compute the scheduler step locally — no broadcast needed because `predict_noise_maybe_with_cfg` already ensures all ranks have identical noise predictions after `all_gather` + local combine.
-
-### N-Branch CFG (3+ branches)
-
-Some models require more than 2 CFG branches. For example, Bagel and OmniGen2 use 3 branches, and DreamID Omni uses 4.
-
-`predict_noise_with_multi_branch_cfg()` handles these by automatically dispatching N branches across M GPUs using round-robin (rule: branch `i` → rank `i % M`):
-
-| Branches (N) | GPUs (M) | Dispatch |
-|:---:|:---:|:---|
-| 3 | 2 | `[[0, 2], [1]]` |
-| 3 | 3 | `[[0], [1], [2]]` |
-| 4 | 2 | `[[0, 2], [1, 3]]` |
-| 4 | 3 | `[[0, 3], [1], [2]]` |
-| 4 | 4 | `[[0], [1], [2], [3]]` |
-
-When a rank handles multiple branches, it runs them sequentially. After `all_gather`, all ranks execute `combine_multi_branch_cfg_noise()` locally, producing identical results.
-
----
-
-## Step-by-Step Implementation
-
-### Step 1: Inherit `CFGParallelMixin`
-
-Make your pipeline inherit from `CFGParallelMixin` and implement the `diffuse()` method for your specific model.
-
-**Example (Qwen-Image):**
-
-```python
-from vllm_omni.diffusion.distributed.cfg_parallel import CFGParallelMixin
-import torch.nn as nn
-class YourModelPipeline(nn.Module, CFGParallelMixin):
-    def diffuse(self, ...) -> torch.Tensor:
-        for i, t in enumerate(timesteps):
-            # Prepare positive_kwargs (conditional) and negative_kwargs (unconditional)
-            positive_kwargs = {...}  # hidden_states, encoder_hidden_states, etc.
-            negative_kwargs = {...} if do_true_cfg else None
-
-            # Key method 1: Predict noise with automatic CFG parallel handling
-            noise_pred = self.predict_noise_maybe_with_cfg(
-                do_true_cfg=do_true_cfg,
-                true_cfg_scale=true_cfg_scale,
-                positive_kwargs=positive_kwargs,
-                negative_kwargs=negative_kwargs,
-            )
-
-            # Key method 2: Step scheduler with automatic CFG synchronization
-            latents = self.scheduler_step_maybe_with_cfg(
-                noise_pred, t, latents, do_true_cfg
-            )
-
-        return latents
-```
-
-**Key Points:**
-
-- `positive_kwargs`: transformer arguments for conditional (text-guided) prediction
-- `negative_kwargs`: transformer arguments for unconditional prediction (set to `None` if CFG disabled)
-- For image editing pipelines, add `output_slice=image_seq_len` to extract the generative image portion
-- For models with 3+ CFG branches, see [Multi-Branch CFG](#multi-branch-cfg-3-branches) in the Customization section
-
-### Step 2: Call `diffuse`
-
-Call `self.diffuse` in your pipeline's forward function:
-
-```python
-import torch.nn as nn
-class YourModelPipeline(nn.Module, CFGParallelMixin):
-    def forward(
-        self,
-        prompt: str,
-        negative_prompt: str | None = None,
-        guidance_scale: float = 3.5,
-        num_inference_steps: int = 50,
-        **kwargs,
-    ):
-        # Encode prompts, initialize latents, get timesteps
-        ...
-        # Run the diffusion loop (the diffuse() method implemented in Step 1)
-        latents = self.diffuse(
-            prompt_embeds=prompt_embeds,
-            prompt_embeds_mask=prompt_embeds_mask,
-            negative_prompt_embeds=negative_embeds,
-            negative_prompt_embeds_mask=negative_mask,
-            latents=latents,
-            timesteps=timesteps,
-            do_true_cfg=do_true_cfg,
-            true_cfg_scale=guidance_scale,
-            ...
-        )
-```
-
----
-
-## Customization
-
-### Override `predict_noise()` for Custom Transformer Calls
-
-If your transformer requires a custom prediction function, override `predict_noise()`. Wan2.2, for example, has two transformer models: the caller passes the one to run as `current_model`, and `self.transformer` is used as the fallback when none is given.
-
-```python
-class Wan22Pipeline(nn.Module, CFGParallelMixin):
-    def predict_noise(self, current_model: nn.Module | None = None, **kwargs: Any) -> torch.Tensor:
-        if current_model is None:
-            current_model = self.transformer
-        return current_model(**kwargs)[0]
-```
-
-
-### Override `cfg_normalize_function()` for Custom Normalization
-
-Some models define their own normalization function. The LongCat Image model is one example:
-
-```python
-class LongCatImagePipeline(nn.Module, CFGParallelMixin):
-    def cfg_normalize_function(self, noise_pred, comb_pred, cfg_renorm_min=0.0):
-        """
-        Normalize the combined noise prediction.
-        """
-        cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
-        noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
-        scale = (cond_norm / (noise_norm + 1e-8)).clamp(min=cfg_renorm_min, max=1.0)
-        noise_pred = comb_pred * scale
-        return noise_pred
-
-    # For comparison, the default cfg_normalize_function in CFGParallelMixin:
-    #     cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
-    #     noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
-    #     noise_pred = comb_pred * (cond_norm / noise_norm)
-    #     return noise_pred
-```
-
-
-### Multi-Branch CFG (3+ branches)
-
-For models with 3 or more CFG branches, use `predict_noise_with_multi_branch_cfg()` instead of `predict_noise_maybe_with_cfg()`, and override `combine_multi_branch_cfg_noise()` for custom combine logic. This interface also works for standard 2-branch CFG — just pass 2 branches in `branches_kwargs`.
-
-**Example (3-branch with dual guidance scale):**
-
-```python
-class YourMultiBranchPipeline(nn.Module, CFGParallelMixin):
-    def combine_multi_branch_cfg_noise(self, predictions, true_cfg_scale, cfg_normalize=False):
-        text_scale = true_cfg_scale["text"]
-        image_scale = true_cfg_scale["image"]
-        pos, ref, uncond = predictions
-        return uncond + image_scale * (ref - uncond) + text_scale * (pos - ref)
-
-    def diffuse(self, ...):
-        for i, t in enumerate(timesteps):
-            positive_kwargs = {...}  # conditional prompt
-            ref_neg_kwargs = {...}  # negative prompt + reference
-            uncond_kwargs = {...}  # unconditional
-
-            noise_pred = self.predict_noise_with_multi_branch_cfg(
-                do_true_cfg=do_true_cfg,
-                true_cfg_scale={"text": text_guidance_scale, "image": image_guidance_scale},
-                branches_kwargs=[positive_kwargs, ref_neg_kwargs, uncond_kwargs],
-            )
-            latents = self.scheduler_step_maybe_with_cfg(noise_pred, t, latents, do_true_cfg)
-
-        return latents
-```
-
-### Override Combine Functions
-
-There are two combine functions for different scenarios:
-
-- **`combine_cfg_noise()`** — Used by `predict_noise_maybe_with_cfg()`. Override when `predict_noise()` returns a tuple (e.g., video + audio) and you need per-element CFG logic.
-- **`combine_multi_branch_cfg_noise()`** — Used by `predict_noise_with_multi_branch_cfg()`. Override to implement custom multi-branch combine formulas (see [Multi-Branch CFG](#multi-branch-cfg-3-branches) above).
-
-### Implement a Composite Scheduler for Multi-Output Models
-
-When each output has its own denoising schedule, implement a composite scheduler that dispatches to per-output schedulers. Assign it to `self.scheduler` so the default `scheduler_step()` works without override.
-
-**Complete example (video + audio with separate schedulers and diffuse loop):**
-
-```python
-class VideoAudioScheduler:
-    """Composite scheduler dispatching to video and audio schedulers."""
-    def __init__(self, video_scheduler, audio_scheduler):
-        self.video_scheduler = video_scheduler
-        self.audio_scheduler = audio_scheduler
-
-    def step(self, noise_pred, t, latents, return_dict=False, generator=None):
-        video_out = self.video_scheduler.step(noise_pred[0], t[0], latents[0], return_dict=False, generator=generator)[0]
-        audio_out = self.audio_scheduler.step(noise_pred[1], t[1], latents[1], return_dict=False, generator=generator)[0]
-        return ((video_out, audio_out),)
-
-class MyVideoAudioPipeline(nn.Module, CFGParallelMixin):
-    def __init__(self, ...):
-        self.scheduler = VideoAudioScheduler(video_sched, audio_sched)
-
-    def predict_noise(self, **kwargs):
-        video_pred, audio_pred = self.transformer(**kwargs)
-        return (video_pred, audio_pred)
-
-    def combine_cfg_noise(self, positive_noise_pred, negative_noise_pred, scale, normalize):
-        ...  # per-element CFG combine for the (video, audio) tuple; body elided
-
-    def diffuse(self, video_latents, audio_latents, timesteps_video, timesteps_audio, ...):
-        for t_v, t_a in zip(timesteps_video, timesteps_audio):
-            positive_kwargs = {...}
-            negative_kwargs = {...} if do_true_cfg else None
-
-            video_pred, audio_pred = self.predict_noise_maybe_with_cfg(
-                do_true_cfg=do_true_cfg, true_cfg_scale=self.guidance_scale,
-                positive_kwargs=positive_kwargs, negative_kwargs=negative_kwargs,
-            )
-            video_latents, audio_latents = self.scheduler_step_maybe_with_cfg(
-                (video_pred, audio_pred), (t_v, t_a),
-                (video_latents, audio_latents), do_true_cfg=do_true_cfg,
-                generator=generator,
-            )
-        return video_latents, audio_latents
-```
-
-> **Note:** If you use a non-deterministic scheduler (e.g., DDPM), pass an explicitly seeded generator, as in `self.scheduler_step_maybe_with_cfg(..., generator=torch.Generator(device).manual_seed(seed))`, so that every rank draws the same random noise in the scheduler step.
-
----
-
-## Testing
-
-After adding CFG-Parallel support, test with:
-
-```bash
-cd examples/offline_inference/text_to_image
-python text_to_image.py \
-    --model Your-org/your-model \
-    --prompt "a cup of coffee on the table" \
-    --negative-prompt "ugly, unclear" \
-    --cfg-scale 4.0 \
-    --num-inference-steps 50 \
-    --output "cfg_enabled.png" \
-    --cfg-parallel-size 2
-```
-
-**Verify:**
-
-1. Check logs for CFG parallel being activated
-2. Record the `e2e_time_ms` in the log and compare with CFG-Parallel disabled
-3. Compare the generated result quality with baseline
-4. Record comparison results in your PR
-
----
-
-## Troubleshooting
-
-### Issue: CFG parallel not activating
-
-**Symptoms:** Generation still slow, logs don't show CFG parallel being used.
-
-**Causes & Solutions:**
-
-- **CFG is not enabled:**
-
-**Problem:** Guidance scale too low or negative prompt not provided.
-
-**Solution:** Ensure `guidance_scale > 1.0` and negative prompt is provided:
-```python
-images = pipeline(
-    prompt="a cat",
-    negative_prompt="",  # Must provide (even if empty)
-    guidance_scale=3.5,  # Must be > 1.0
-)
-```
-
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **Qwen-Image** | `vllm_omni/diffusion/models/qwen_image/cfg_parallel.py` | Mixin | Dual-stream transformer |
-| **Qwen-Image-Edit** | `vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit.py` | Mixin | Image editing with `output_slice` |
-| **Wan2.2** | `vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py` | Mixin | Dual-transformer architecture |
-| **CFGParallelMixin** | `vllm_omni/diffusion/distributed/cfg_parallel.py` | Base implementation | Core mixin class |
-
----
-
-## Summary
-
-Adding CFG-Parallel support:
-
-1. ✅ **Create mixin** - Inherit from `CFGParallelMixin` and implement `diffuse()` method
-2. ✅ **(Optional) Customize** - Override `predict_noise()` or `cfg_normalize_function()` for custom behavior
-3. ✅ **(Optional) Multi-branch** - For 3+ branch models, use `predict_noise_with_multi_branch_cfg()` and override `combine_multi_branch_cfg_noise()`
-4. ✅ **Test** - Verify with `--cfg-parallel-size 2` (or 3/4 for multi-branch) and compare performance
diff --git a/docs/design/feature/diffusion_step_execution.md b/docs/design/feature/diffusion_step_execution.md
deleted file mode 100644
index b8c81f04f69..00000000000
--- a/docs/design/feature/diffusion_step_execution.md
+++ /dev/null
@@ -1,121 +0,0 @@
-# Diffusion Step Execution
-
-This guide documents vLLM-Omni's stepwise diffusion contract for model authors
-and contributors implementing `step_execution=True` support for a diffusion
-pipeline.
-
-For end-user enablement, supported models, and current limitations, see
-[Step Execution](../../user_guide/diffusion/step_execution.md).
-
-## Current Support Scope
-
-`step_execution` is **not** a generic diffusion toggle. It only works for
-pipelines that implement the segmented stateful contract in
-[`vllm_omni/diffusion/models/interface.py`](gh-file:vllm_omni/diffusion/models/interface.py).
-
-Current in-tree support:
-
-| Pipeline | Example models | Step execution |
-|----------|----------------|----------------|
-| `QwenImagePipeline` | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Yes |
-| All other diffusion pipelines | `QwenImageEditPipeline`, `QwenImageEditPlusPipeline`, `QwenImageLayeredPipeline`, GLM-Image, Wan, Flux, etc. | No |
-
-Current engine/runtime limitations:
-
-- `StepScheduler` only schedules `batch_size=1`.
-- `cache_backend` is not supported in step mode.
-- Request-mode extras such as KV transfer are not wired into step mode yet.
-- Unsupported pipelines now fail early during model loading instead of failing on the first request.
-
-## Execution Contract
-
-Step mode is driven by four pipeline methods plus the shared mutable request
-state object:
-
-- `prepare_encode(state)`: one-time request preparation.
-- `denoise_step(state)`: compute the noise prediction for the current step.
-- `step_scheduler(state, noise_pred)`: mutate latents and advance step state.
-- `post_decode(state)`: decode the final output after denoising is complete.
-
-The state lives in
-[`vllm_omni/diffusion/worker/utils.py`](gh-file:vllm_omni/diffusion/worker/utils.py)
-as `DiffusionRequestState`. Store request-scoped tensors there, or use
-`state.extra` for model-specific fields that do not justify extending the core
-dataclass.
-
-The worker-side step loop lives in
-[`vllm_omni/diffusion/worker/diffusion_model_runner.py`](gh-file:vllm_omni/diffusion/worker/diffusion_model_runner.py):
-
-1. `prepare_encode()` runs once for a new request.
-2. `denoise_step()` runs every scheduler tick.
-3. `step_scheduler()` mutates `state.latents` and advances `state.step_index`.
-4. `post_decode()` runs exactly once after `state.denoise_completed` becomes true.
-
-## Recommended Split
-
-When converting an existing request-level `forward()` pipeline, keep the split
-strict and mechanical:
-
-| Request-level phase | Stepwise method | What belongs there |
-|---------------------|-----------------|--------------------|
-| Input validation, prompt encoding, latent init, timestep prep, per-request scheduler creation | `prepare_encode()` | Anything that should happen once per request |
-| Transformer forward / noise prediction | `denoise_step()` | Pure denoise computation for the current timestep |
-| `scheduler.step(...)` and `step_index += 1` | `step_scheduler()` | Only latent/state mutation for one step |
-| VAE decode / postprocess | `post_decode()` | Final decode only |
-
-Keep the stepwise path reusing the same helpers as the request-level path
-whenever possible. Reimplementing the denoise loop from scratch is the easiest
-way to introduce behavioral drift.
-
-## Qwen-Image Reference
-
-[`pipeline_qwen_image.py`](gh-file:vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py)
-is the reference implementation and is split correctly for the current
-contract:
-
-- `prepare_encode()` reuses `_prepare_generation_context()` so prompt encoding,
- latent init, timestep creation, CFG setup, and shape bookkeeping stay aligned
- with `forward()`.
-- `prepare_encode()` deep-copies `self.scheduler` **after**
- `prepare_timesteps()` so request-specific scheduler state is isolated.
-- `denoise_step()` reuses `_build_denoise_kwargs()` plus
- `predict_noise_maybe_with_cfg()`, so sequential CFG, CFG-parallel, and
- non-CFG behavior stay identical to the request-level path.
-- `step_scheduler()` only calls
- `scheduler_step_maybe_with_cfg(..., per_request_scheduler=state.scheduler)`
- and increments `state.step_index`.
-- `post_decode()` reuses `_decode_latents()`, so the final image decode matches
- the normal `forward()` path.
-
-That decomposition is the target pattern for future models.
-
-## Rules For New Pipelines
-
-- Do not keep request-scoped scheduler state on `self.scheduler`. Copy it into
- `state.scheduler` during `prepare_encode()`.
-- Do not mutate `state.step_index` inside `denoise_step()`. Only
- `step_scheduler()` should advance the step.
-- Do not decode partial outputs in `denoise_step()` or `step_scheduler()`.
-- If the request-level pipeline has condition latents, masks, or edit-specific
- tensors, store them in `state` or `state.extra`, not in global pipeline
- attributes.
-- Preserve CFG behavior by sharing the same helper path used by `forward()`.
-- Keep `post_decode()` equivalent to the tail of `forward()`.
-
-## Validation Checklist
-
-Before marking a pipeline as `supports_step_execution = True`, verify:
-
-- Stepwise output matches request-level output for the same seed and sampling params.
-- Per-request scheduler state is isolated across concurrent requests.
-- Abort during denoise does not leak cached state.
-- `step_index` reported by `RunnerOutput` matches the scheduler progress.
-- CFG-parallel and non-CFG paths both work if the request-level pipeline supports them.
-
-## Related Files
-
-- Contract: [`vllm_omni/diffusion/models/interface.py`](gh-file:vllm_omni/diffusion/models/interface.py)
-- State: [`vllm_omni/diffusion/worker/utils.py`](gh-file:vllm_omni/diffusion/worker/utils.py)
-- Runner loop: [`vllm_omni/diffusion/worker/diffusion_model_runner.py`](gh-file:vllm_omni/diffusion/worker/diffusion_model_runner.py)
-- Scheduler transport: [`vllm_omni/diffusion/sched/interface.py`](gh-file:vllm_omni/diffusion/sched/interface.py)
-- Reference pipeline: [`vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py`](gh-file:vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py)
diff --git a/docs/design/feature/disaggregated_inference.md b/docs/design/feature/disaggregated_inference.md
deleted file mode 100644
index 83c35ac0107..00000000000
--- a/docs/design/feature/disaggregated_inference.md
+++ /dev/null
@@ -1,108 +0,0 @@
-# Disaggregated Inference for Omni-Modality Models
-
-This guide explains how to configure and use distributed connectors
-(`vllm_omni/distributed/omni_connectors`) in vllm-omni for multi-stage pipelines.
-
-Backend-specific setup lives in separate docs:
-
-- [SharedMemoryConnector](omni_connectors/shared_memory_connector.md)
-- [MooncakeStoreConnector](omni_connectors/mooncake_store_connector.md)
-- [MooncakeTransferEngineConnector](omni_connectors/mooncake_transfer_engine_connector.md)
-- [YuanrongConnector](omni_connectors/yuanrong_connector.md)
-
-## Overview
-
-Connectors enable data transfer between pipeline stages (e.g., Thinker -> Talker).
-Current connectors operate in D2H2D (device to host to device) mode.
-
-## Connector Choices
-
-| Use Case | Recommended Connector | Notes |
-| :--- | :--- | :--- |
-| Single node | SharedMemoryConnector | Auto-configured if no connector is specified. |
-| Multi node (Mooncake Store) | MooncakeStoreConnector | TCP-based, requires Mooncake Master + metadata server. |
-| Multi node (Mooncake RDMA) | MooncakeTransferEngineConnector | RDMA/TCP direct transfer with managed memory pool. Fastest. |
-| Multi node (Yuanrong) | YuanrongConnector | Requires Yuanrong Datasystem + etcd. |
-
-## Core API
-
-The connector system is built around `OmniConnectorBase`.
-
-```python
-class OmniConnectorBase(ABC):
-    @abstractmethod
-    def put(self, from_stage: str, to_stage: str, put_key: str, data: Any) -> tuple[bool, int, Optional[dict]]:
-        """
-        Store data.
-        Returns: (success, serialized_size, metadata)
-        """
-        pass
-
-    @abstractmethod
-    def get(self, from_stage: str, to_stage: str, get_key: str, metadata: Optional[dict] = None) -> Optional[tuple[Any, int]]:
-        """
-        Retrieve data.
-        Args: metadata - transport-specific handles returned by put() (e.g., SHM name).
-        Returns: (object, serialized_size)
-        """
-        pass
-```
-
-### Metadata Passing
-
-Some connectors (e.g., SharedMemoryConnector) generate transient resources during `put()`.
-This `metadata` must be passed through the control plane so `get()` can locate the data.
-
-## Configuration Model
-
-Define connectors under the `runtime` key:
-
-```yaml
-runtime:
-  connectors:
-    connector_of_shared_memory:
-      name: SharedMemoryConnector
-      extra:
-        shm_threshold_bytes: 65536
-```
-
-Wire stages to connectors:
-
-```yaml
-stage_args:
-  - stage_id: 0
-    output_connectors:
-      to_stage_1: connector_of_shared_memory
-
-  - stage_id: 1
-    input_connectors:
-      from_stage_0: connector_of_shared_memory
-```
-
-If a pipeline edge has no explicit connector, the system auto-creates a
-SharedMemoryConnector for that edge.
-
-## Relationship with vLLM
-
-vLLM provides specialized distributed mechanisms for specific artifacts:
-
-- KV Transfer (`vllm.distributed.kv_transfer`): optimized for KV caches.
-- EC Transfer (`vllm.distributed.ec_transfer`): optimized for encoder embeddings.
-- Device Communicators (`vllm.distributed.device_communicators`): low-level primitives (NCCL, SHM).
-
-vllm-omni complements this with a generalized connector abstraction:
-
-1. Unifies transport via a single `put`/`get` API for any stage artifact.
-2. Enables DAG-style pipelines across processes or nodes with per-edge transports.
-3. Can wrap vLLM-specific transfers for KV paths while keeping a consistent interface.
-
-## Operational Notes
-
-- Fail-fast config validation: missing expected edges cause startup failures.
-- Missing payloads halt stages: verify connector wiring and metadata propagation.
-
-## Future Roadmap: D2D Transport
-
-Current connectors use D2H2D paths. Future versions will introduce direct
-device-to-device connectors (NCCL, UCX, IPC) to reduce latency for large
-tensor payloads.
diff --git a/docs/design/feature/expert_parallel.md b/docs/design/feature/expert_parallel.md
deleted file mode 100644
index e05eec33613..00000000000
--- a/docs/design/feature/expert_parallel.md
+++ /dev/null
@@ -1,221 +0,0 @@
-# Expert Parallel
-
-This section describes how to add Expert Parallel (EP) to a diffusion transformer that uses Mixture-of-Experts (MoE) layers.
-We use **HunyuanImage3.0** as the reference implementation.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation](#step-by-step-implementation)
-- [Testing](#testing)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is Expert Parallel?
-
-**Expert Parallel** is a parallelism strategy in Mixture-of-Experts (MoE) models that distributes different expert networks across distinct computational devices. Each device holds and computes only a subset of experts (local experts), with tokens dispatched to and gathered from remote devices via collective communication operations (e.g., All-to-All, All-Gather).
-
-| Backend | Description |
-|---------|-------------|
-| `allgather_reducescatter` | Default backend based on allgather/reducescatter primitives, suitable for general EP+DP deployments. |
-
-## Configuration
-
-Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
-
-```text
-EP_SIZE = TP_SIZE × SP_SIZE × CFG_SIZE × DP_SIZE
-```
-
-
-Where:
-
-- `TP_SIZE`: Tensor parallel size
-- `SP_SIZE`: Sequence parallel size
-- `CFG_SIZE`: Classifier-free guidance parallel size
-- `DP_SIZE`: Data parallel size
-- `EP_SIZE`: Expert parallel size (computed automatically)
-
-Note:
-- Expert parallelism is only applicable to Mixture-of-Experts (MoE) models.
-- The EP group is created **per pipeline stage**, meaning it includes all ranks that participate in model parallelism except pipeline parallelism.
-- The underlying communication pattern for expert parallelism is **All-to-All** among the ranks in the EP group.
-
-For example, consider a configuration with `TP=2`, `SP=1`, `CFG=2`, and `DP=4` (total 2×1×2×4 = 16 GPUs).
-
-- Expert layers are handled by an EP group of size 16.
-
-- Attention layers use tensor parallelism of size 2 within each of the 8 TP groups (`DP×CFG×SP = 4×2×1 = 8` groups, each containing 2 TP ranks). Within each group, the attention weights are sharded across the 2 GPUs.
-
-
-## Step-by-Step Implementation
-
-### Step 1: Configure Expert Parallelism Settings
-
-Calculate local experts per rank:
-
-```python
-ep_size = 8  # Expert Parallel size (TP × SP × CFG × DP; equals the TP size when the others are 1)
-num_experts = 64
-num_local_experts = num_experts // ep_size # 8 experts per card
-
-# Check divisibility
-assert num_experts % ep_size == 0, "Experts must be divisible by EP size"
-```
-
-### Step 2: Use a Sparse MoE Block to Enable EP Routing
-
-Example:
-```python
-from vllm.model_executor.layers.linear import ReplicatedLinear
-class HunYuanSparseMoeBlock(nn.Module):
-    def __init__(
-        self,
-        config: PretrainedConfig,
-        layer_id: int = -1,
-        prefix: str = "",
-    ):
-        super().__init__()
-        self.tp_size = get_tensor_model_parallel_world_size()
-        self.n_routed_experts = config.num_experts  # 64
-
-        # Validate that the TP size does not exceed the number of routed experts
-        if self.tp_size > self.n_routed_experts:
-            raise ValueError(f"TP size {self.tp_size} > experts {self.n_routed_experts}")
-
-        # Routing gate (replicated on all ranks, computes scores for all tokens to all experts)
-        self.gate = ReplicatedLinear(
-            config.hidden_size,
-            config.num_experts,
-            bias=False,
-            quant_config=None,
-            prefix=f"{prefix}.gate",
-        )
-
-        # EP expert layer (factory loads platform-specific implementation)
-        self.experts = HunyuanFusedMoE(...)
-```
-**Key Points:**
-- `gate` is a **ReplicatedLinear** layer (replicated on all ranks)
-- `experts` is created via the **HunyuanFusedMoE factory**, which automatically handles EP dispatch
-
-### Step 3: Initialize EP Runtime
-
-Initialize the EP communication context before model loading.
-```python
-from vllm.utils.import_utils import resolve_obj_by_qualname
-# Call during __init__ or model loading
-op_name = "hunyuan_fused_moe"
-
-# Prepare EP runtime: establish communication groups, assign local expert indices, init _expert_map
-current_omni_platform.prepare_diffusion_op_runtime(op_name)
-
-# Factory automatically resolves platform implementation (GPU: FusedMoE / NPU: AscendFusedMoE)
-impl = resolve_obj_by_qualname(
-    current_omni_platform.get_diffusion_model_impl_qualname(op_name)
-)
-```
-
-### Step 4: Expert Weight Mapping & Loading
-
-Each rank loads only the expert weights assigned to its local allocation.
-```python
-# Get expert parameter mapping (different per rank)
-expert_mapping = HunyuanFusedMoE.make_expert_params_mapping(
-    model=self,
-    ckpt_gate_proj_name="gate_proj",
-    ckpt_down_proj_name="down_proj",
-    ckpt_up_proj_name="up_proj",
-    num_experts=64,
-    num_redundant_experts=0,
-)
-# Returns: [(param_name, weight_name, expert_id, shard_id), ...]
-# Note: Each rank only contains mappings for its local expert_ids
-
-# Filter non-local experts during loading
-for name, loaded_weight in weights:
-    if "mlp.experts" in name:
-        # Parse expert_id from the weight name (parse_expert_id_from_name is a placeholder you must implement)
-        expert_id = parse_expert_id_from_name(name)
-        local_expert_start = ep_rank * num_local_experts
-        local_expert_end = (ep_rank + 1) * num_local_experts
-
-        if not (local_expert_start <= expert_id < local_expert_end):
-            continue  # Skip non-local expert weights
-```
-### Step 5: Forward Pass with EP
-
-Example (MoE Forward):
-```python
-def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-    orig_shape = hidden_states.shape
-    hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
-
-    # 1. Global routing computation (all tokens, all expert scores)
-    # hidden_states: [num_tokens, hidden_dim] (full tensor)
-    router_logits, _ = self.gate(hidden_states)  # [num_tokens, num_experts]
-
-    # 2. EP dispatch and compute (HunyuanFusedMoE handles all_to_all internally)
-    # - Dispatch: Send tokens to target ranks based on router_logits
-    # - Local Compute: Each rank processes only its num_local_experts
-    # - Combine: Results returned to original token positions
-    final_hidden_states = self.experts(
-        hidden_states=hidden_states,
-        router_logits=router_logits,
-    )
-
-    # 3. Add shared expert output (not EP, computed on all ranks)
-    if self.shared_mlp is not None:
-        shared_out = self.shared_mlp(hidden_states)
-        final_hidden_states = final_hidden_states + shared_out
-
-    # 4. Tensor Parallel All-Reduce (synchronize across TP group)
-    if self.tp_size > 1:
-        final_hidden_states = self.experts.maybe_all_reduce_tensor_model_parallel(
-            final_hidden_states
-        )
-
-    return final_hidden_states.view(orig_shape)
-```
-
-## Testing
-After adding Expert Parallel support, test via command line:
-```bash
-cd examples/offline_inference/text_to_image
-python text_to_image.py \
-    --model Your-org/your-model \
-    --prompt "a cup of coffee on the table" \
-    --output "ep_enabled.png" \
-    --num-inference-steps 50 \
-    --guidance-scale 5.0 \
-    --tensor-parallel-size 8 \
-    --seed 1234 \
-    --enable-expert-parallel
-```
-
-vLLM-Omni currently focuses on accelerating core diffusion model inference, so its Expert Parallel implementation covers only basic multi-GPU expert sharding (enabled via `--enable-expert-parallel`). Advanced features such as communication backend selection (`--all2all-backend`), load balancing (`--enable-eplb` and its configuration), and multi-node deployment are capabilities of the main vLLM project and have not yet been integrated into vLLM-Omni.
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **HunyuanImage3.0** | `vllm_omni/diffusion/models/hunyuan_image3/hunyuan_image3_transformer.py` | Standard EP | Full implementation with validation |
-| **EP Tests** | `vllm-omni/tests/e2e/offline_inference/test_expert_parallel.py` | E2E testing | EP correctness and performance |
-| **Constraint Tests** | `vllm-omni/tests/diffusion/models/hunyuan_image3/test_hunyuan_fused_moe.py` | Unit testing | Validation logic |
-
----
-## Summary
-
-Adding Expert Parallel support to a diffusion model:
-
-1. **Identify MoE layers** - Locate the router and expert networks in each transformer block.
-2. **Validate EP constraints** - Ensure `num_experts` is divisible by the expert parallel size.
-3. **Test** - Run with `--enable-expert-parallel`, then check memory reduction, speedup, and output quality against a single-GPU baseline.
diff --git a/docs/design/feature/hsdp.md b/docs/design/feature/hsdp.md
deleted file mode 100644
index 958ad192ead..00000000000
--- a/docs/design/feature/hsdp.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# HSDP
-
-This section describes how to add HSDP (Hybrid Sharded Data Parallel) support to a diffusion transformer model. We use the Wan2.2 transformer as the reference implementation.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation](#step-by-step-implementation)
-- [Testing](#testing)
-- [Reference Implementations](#reference-implementations)
-
----
-
-## Overview
-
-### What is HSDP?
-
-HSDP (Hybrid Sharded Data Parallel) is a memory optimization technique that **shards model weights** across multiple GPUs using PyTorch's FSDP2. Unlike Tensor Parallelism, which splits computation, HSDP:
-
-- Shards weights across GPUs to reduce per-GPU memory usage
-- Gathers weights on-demand during forward passes
-- Can work standalone or combined with other parallelism (e.g., Sequence Parallel)
-
-This enables inference of large models (e.g., Wan2.2 14B) on GPUs with limited memory.
-
-**Important constraints:**
-- HSDP cannot be used with Tensor Parallelism
-- For standalone HSDP (no other parallelism), `hsdp_shard_size` must be specified explicitly
-
-### Architecture
-
-HSDP implementation relies on:
-
-1. **`_hsdp_shard_conditions`**: Model attribute specifying which modules to shard
-2. **`apply_hsdp_to_model`**: Function that applies FSDP2 sharding
-3. **`HSDPInferenceConfig`**: Runtime configuration for HSDP
-
----
-
-## Step-by-Step Implementation
-
-### Step 1: Identify Modules to Shard
-
-Determine which modules in your transformer should be sharded. Typically, these are:
-
-- Transformer blocks (e.g., `blocks.0`, `blocks.1`, ...)
-- Large submodules with significant weight memory
-
-**Key questions:**
-- Which modules have the largest weights?
-- Which modules are repeated (like transformer blocks)?
-
-### Step 2: Define Shard Conditions
-
-Add `_hsdp_shard_conditions` to your model class. This is a list of functions that return `True` for modules that should be sharded.
-
-**Example (Transformer Blocks):**
-
-```python
-class MyTransformerModel(nn.Module):
-
-    @staticmethod
-    def _is_transformer_block(name: str, module) -> bool:
-        """Match transformer blocks for HSDP sharding.
-
-        Args:
-            name: Module name from named_modules() (e.g., "blocks.0", "blocks.0.attn")
-            module: The module instance
-
-        Returns:
-            True if this module should be sharded
-        """
-        return "blocks" in name and name.split(".")[-1].isdigit()
-
-    _hsdp_shard_conditions = [_is_transformer_block]
-```
-
-**Multiple Conditions Example:**
-
-```python
-class MyModel(nn.Module):
-
-    @staticmethod
-    def _is_transformer_block(name: str, module) -> bool:
-        return "blocks" in name and name.split(".")[-1].isdigit()
-
-    @staticmethod
-    def _is_moe_expert(name: str, module) -> bool:
-        # Also shard MoE expert layers
-        return "experts" in name and name.split(".")[-1].isdigit()
-
-    # Module is sharded if ANY condition returns True
-    _hsdp_shard_conditions = [_is_transformer_block, _is_moe_expert]
-```
-
----
-
-## Testing
-
-After adding HSDP support, test with:
-
-```python
-from vllm_omni import Omni
-from vllm_omni.diffusion.data import DiffusionParallelConfig
-from vllm_omni.inputs.data import OmniDiffusionSamplingParams
-
-parallel_config = DiffusionParallelConfig(
-    use_hsdp=True,
-    hsdp_shard_size=8,  # Shard across 8 GPUs
-)
-omni = Omni(model="your-model-name", parallel_config=parallel_config)
-
-output = omni.generate(
-    "a cup of coffee on the table",
-    OmniDiffusionSamplingParams(num_inference_steps=50),
-)
-```
-
-**Or via command line:**
-
-```bash
-vllm serve Your-org/your-model --omni --port 8091 --use-hsdp
-```
-
-**Verify:**
-
-1. Check logs for "HSDP Inference: replicate_size=..., shard_size=..."
-2. Check logs for "Sharded N modules + root"
-3. Verify memory usage is reduced proportionally
-4. Compare generated output quality with HSDP disabled
-
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Notes |
-|-------|------|-------|
-| **Wan2.2** | `vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py` | Reference implementation |
-| **HSDP Core** | `vllm_omni/diffusion/distributed/hsdp.py` | `apply_hsdp_to_model`, `shard_model` |
-| **HSDP Tests** | `tests/diffusion/distributed/test_hsdp.py` | Unit tests |
-
----
diff --git a/docs/design/feature/omni_connectors/mooncake_store_connector.md b/docs/design/feature/omni_connectors/mooncake_store_connector.md
deleted file mode 100644
index 2ff11685977..00000000000
--- a/docs/design/feature/omni_connectors/mooncake_store_connector.md
+++ /dev/null
@@ -1,312 +0,0 @@
-# MooncakeStoreConnector
-
-## When to Use
-
-Best for multi-node distributed inference using Mooncake.
-
-## Installation
-
-```bash
-# For CUDA-enabled systems (recommended)
-pip install mooncake-transfer-engine
-
-# For non-CUDA systems
-pip install mooncake-transfer-engine-non-cuda
-```
-
-## Start Mooncake Master
-
-```bash
-# If you use Mooncake SSD storage
-mkdir -p ./mc_storage
-
-mooncake_master \
-  --rpc_port=50051 \
-  --enable_http_metadata_server=true \
-  --http_metadata_server_host=0.0.0.0 \
-  --http_metadata_server_port=8080 \
-  --metrics_port=9003 \
-  --root_fs_dir=./mc_storage/ \
-  --cluster_id=mc-local-1 &
-```
-
-## Configuration
-
-Define the connector under `runtime.connectors`:
-
-```yaml
-runtime:
-  connectors:
-    connector_of_mooncake:
-      name: MooncakeStoreConnector
-      extra:
-        host: "127.0.0.1"
-        metadata_server: "http://:8080/metadata"
-        master: ":50051"
-        segment: 512000000
-        localbuf: 64000000
-        proto: "tcp"
-```
-
-Wire stages to the connector:
-
-```yaml
-stage_args:
-  - stage_id: 0
-    output_connectors:
-      to_stage_1: connector_of_mooncake
-
-  - stage_id: 1
-    input_connectors:
-      from_stage_0: connector_of_mooncake
-```
-
-Parameters:
-
-- `host`: local worker IP registered in the metadata server.
-- `metadata_server`: metadata server URL for discovery and setup.
-- `master`: Mooncake Master address.
-- `segment`: global memory segment size in bytes.
-- `localbuf`: local buffer size in bytes.
-- `proto`: transport protocol (`"tcp"` or `"rdma"`).
-
-For more details, refer to the
-[Mooncake repository](https://github.com/kvcache-ai/Mooncake).
-
----
-
-## Design
-
-### 1. Overview
-
-`MooncakeStoreConnector` is the storage-oriented remote connector in `vllm_omni/distributed/omni_connectors`. It is built on top of Mooncake distributed store APIs and provides a uniform `put()` / `get()` abstraction for transferring stage payloads across nodes.
-
-Compared with `SharedMemoryConnector`, this connector is intended for multi-node deployments. Compared with `MooncakeTransferEngineConnector`, it is simpler and store-based: data is written into a distributed store and fetched back by key, rather than transferred through an explicit remote-memory write protocol.
-
-Its primary role is to provide a general-purpose remote connector for stage payloads when:
-
-- the pipeline spans multiple hosts
-- a shared-memory connector is not possible
-- a simpler remote backend is preferred over the RDMA-oriented transfer engine
-
-### 2. Relationship with the OmniConnector System
-
-`MooncakeStoreConnector` implements `OmniConnectorBase`, so it integrates with the same connector lifecycle as the other backends:
-
-- `OmniConnectorFactory` instantiates it from `ConnectorSpec`
-- `load_omni_transfer_config()` maps edge config to the connector spec
-- All callers (batch forwarding, chunk transfer, KV transfer, etc.) interact with it through the same `put()` / `get()` contract
-
-This means the rest of the pipeline does not need store-specific logic. It only relies on the generic connector contract.
-
-### 3. Design Goals
-
-The connector is designed around the following goals:
-
-- **Cross-node object transfer** using a single key-based abstraction
-- **Minimal connector-specific control plane**, since the store itself is the shared medium
-- **Compatibility with arbitrary Python payloads** through the Omni serializer
-- **Operational simplicity** compared with the transfer-engine-based remote connector
-
-The connector is intentionally not optimized for zero-copy tensor movement or direct remote-memory access.
-
-### 4. Core Design
-
-#### 4.1 Object-Oriented Transfer Model
-
-`MooncakeStoreConnector` treats the transport backend as a distributed object store:
-
-1. Serialize the Python object into bytes.
-2. Generate a unique connector key.
-3. Store the bytes in Mooncake.
-4. Fetch the bytes from Mooncake on the receiver side.
-5. Deserialize the bytes back into the original object.
-
-This design makes the connector conceptually close to a distributed key-value transport.
-
-#### 4.2 Key Construction
-
-The connector uses `OmniConnectorBase._make_key()` to derive the internal store key:
-
-```text
-{request_key}@{from_stage}_{to_stage}
-```
-
-This adds stage routing information to the logical request key and avoids collisions between different pipeline edges.
-
-#### 4.3 No Extra Metadata Path
-
-Unlike `SharedMemoryConnector` and `MooncakeTransferEngineConnector`, this connector returns:
-
-```python
-metadata = None
-```
-
-from `put()`.
-
-This is a meaningful design difference:
-
-- the store itself is the shared rendezvous point
-- consumers only need the same key
-- no transport handle, address, or side-channel metadata needs to be propagated
-
-As a result, the control plane is simpler for this connector than for the SHM and RDMA variants.
-
-### 5. Initialization
-
-#### 5.1 Required Mooncake Components
-
-The connector requires the Mooncake Python bindings to expose:
-
-- `MooncakeDistributedStore`
-- `ReplicateConfig`
-
-If these symbols are unavailable, construction fails immediately with `ImportError`. This keeps startup failures explicit and avoids silent fallback behavior.
-
-#### 5.2 Store Setup
-
-During `_init_store()`, the connector:
-
-1. creates `MooncakeDistributedStore()`
-2. calls `store.setup(...)`
-3. validates the return code
-4. creates a `ReplicateConfig`
-5. enables `with_soft_pin = True`
-
-The setup step is completed during connector construction, not lazily at first use.
-
-### 6. Put / Get Flow
-
-#### 6.1 Producer Flow: `put()`
-
-The producer-side flow is:
-
-1. Validate that the store has been initialized.
-2. Serialize the payload into bytes.
-3. Build the internal key with stage routing information.
-4. Call `store.put(key, serialized_data, self.pin)`.
-5. Update metrics and return success.
-
-The returned tuple is:
-
-```python
-(True, len(serialized_data), None)
-```
-
-#### 6.2 Consumer Flow: `get()`
-
-The consumer-side flow is:
-
-1. Validate that the store has been initialized.
-2. Build the same internal key.
-3. Poll `store.get(key)` for up to 20 retries.
-4. Sleep 50 ms between retries.
-5. If data is found, deserialize it and return `(data, payload_size)`.
-6. If all retries are exhausted, record a timeout and return `None`.
-
-This gives the connector a bounded wait (about one second of sleep across 20 retries at 50 ms each) rather than an indefinitely blocking `get()`.
-
-### 7. Integration with Stage Communication
-
-All callers use the connector through the same `put()` / `get()` contract:
-
-- the sender calls `put()` to serialize and store the payload
-- the receiver calls `get()` to retrieve and deserialize it
-- no connector-specific metadata is required, since the store key is the rendezvous point
-
-Because `put()` returns `metadata=None`, the connector is naturally compatible with callers that do not forward metadata (e.g. polling-based flows). The trade-off is that all payloads incur full serialization and deserialization costs, which makes the connector functional but not the highest-performance option for large payloads such as KV cache blocks.
-
-### 8. Data Flow in the Pipeline
-
-The end-to-end transfer model is:
-
-```mermaid
-sequenceDiagram
- participant SenderStage
- participant MooncakeStoreConnector
- participant MooncakeStore
- participant ReceiverStage
-
- SenderStage->>MooncakeStoreConnector: put(from_stage, to_stage, put_key, data)
- MooncakeStoreConnector->>MooncakeStoreConnector: serialize object
- MooncakeStoreConnector->>MooncakeStore: put(internal_key, bytes)
- MooncakeStoreConnector-->>SenderStage: (success, size, None)
-
- ReceiverStage->>MooncakeStoreConnector: get(from_stage, to_stage, get_key)
- MooncakeStoreConnector->>MooncakeStore: get(internal_key)
- MooncakeStoreConnector->>MooncakeStoreConnector: deserialize bytes
- MooncakeStoreConnector-->>ReceiverStage: (data, size)
-```
-
-This is a store-mediated remote transfer model rather than a direct peer-to-peer transport model.
-
-### 9. Strengths and Trade-offs
-
-#### Strengths
-
-- Simple conceptual model: store by key, fetch by key.
-- No connector-specific metadata handoff is required.
-- Works naturally for remote multi-node stage transfer.
-- Easy to integrate into the existing connector abstraction.
-
-#### Trade-offs
-
-- Full serialization and deserialization are always required.
-- Large payloads are more expensive than in direct-memory transports.
-- Runtime behavior depends on external Mooncake services being available.
-- Cleanup semantics are weaker than request-scoped local buffer management.
-
-### 10. Important Implementation Characteristics
-
-#### 10.1 Cleanup Is a No-op
-
-`cleanup()` only logs a debug message and does not actively delete remote data. The current design assumes Mooncake-side lifecycle management rather than explicit per-request removal.
-
-This keeps the connector implementation simple but means request-level reclamation is not modeled inside the connector itself.
-
-#### 10.2 Close Releases the Store Handle
-
-Unlike `SharedMemoryConnector`, where `close()` is a no-op, `MooncakeStoreConnector.close()` performs a meaningful shutdown:
-
-1. Calls `self.store.close()` to release the Mooncake store handle.
-2. Sets `self.store = None` to prevent further operations.
-
-This ensures the connector releases its connection to the external Mooncake service on shutdown. Errors during close are logged but do not propagate.
-
-#### 10.3 Health Output Uses a Shared Schema
-
-`health()` reports:
-
-- `host`
-- `metadata_server`
-- `master`
-- metrics
-
-and also includes placeholder fields such as `protocol`, `pool_device`, and `pool_size` to keep a more uniform shape across connector health outputs. This reflects a system-level consistency choice rather than a store-specific design need.
-
-#### 10.4 Get Is Retry-Based, Not Event-Driven
-
-The receiver side polls the store with a short retry loop. This is simple and robust, but it makes the connector's latency sensitive to:
-
-- store visibility delay
-- network jitter
-- payload size
-
-If the payload is not visible within the retry window, the connector reports a timeout.
-
-### 11. Summary
-
-`MooncakeStoreConnector` is the remote, store-based connector in the OmniConnector stack. Its design is straightforward:
-
-- serialize the payload
-- store it in Mooncake under a stage-qualified key
-- fetch it by the same key on the receiver side
-- deserialize it back into the original object
-
-This connector fills an important gap in the system:
-
-1. It enables cross-node transfer without requiring shared memory.
-2. It keeps the stage communication model uniform.
-3. It provides a simpler operational alternative to the transfer-engine-based remote connector.
-
-Its simplicity is also its main trade-off: it favors a clean object-transfer model over the fast-path and peer-to-peer optimizations implemented by `MooncakeTransferEngineConnector`.
diff --git a/docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md b/docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md
deleted file mode 100644
index 306a0620b4b..00000000000
--- a/docs/design/feature/omni_connectors/mooncake_transfer_engine_connector.md
+++ /dev/null
@@ -1,793 +0,0 @@
-# MooncakeTransferEngineConnector
-
-## When to Use
-
-Best for high-performance multi-node data transfer between stages using Mooncake
-Transfer Engine. Supports both RDMA and TCP protocols with a managed memory pool,
-zero-copy deserialization, and optional GPUDirect RDMA. Applicable to any
-inter-stage data (KV caches, request payloads, etc.), not limited to KV cache transfer.
-
-Compared to `MooncakeStoreConnector` (TCP key-value store), this connector
-provides **~60x faster** data transfer via RDMA direct memory access.
-
-## Installation
-
-```bash
-pip install mooncake-transfer-engine
-```
-
-Ensure RDMA drivers are installed on all nodes (e.g., Mellanox OFED for
-InfiniBand/RoCE NICs).
-
-## Configuration
-
-Define the connector under `runtime.connectors`:
-
-```yaml
-runtime:
-  connectors:
-    rdma_connector:
-      name: MooncakeTransferEngineConnector
-      extra:
-        host: "auto"                  # Auto-detect local RDMA IP
-        zmq_port: 50051               # ZMQ base port (see "Port Offset Scheme" below)
-        protocol: "rdma"              # "rdma" or "tcp"
-        device_name: ""               # RDMA device (e.g., "mlx5_0"), empty for auto-detect
-        memory_pool_size: 4294967296  # 4 GB (CPU); use 2147483648 (2 GB) for GPU
-        memory_pool_device: "cpu"     # "cpu" for pinned memory (recommended), "cuda" for GPUDirect RDMA
-```
-
-Wire stages to the connector:
-
-```yaml
-stage_args:
-  - stage_id: 0
-    output_connectors:
-      to_stage_1: rdma_connector
-
-  - stage_id: 1
-    input_connectors:
-      from_stage_0: rdma_connector
-```
-
-## Parameters
-
-### Required
-
-| Parameter | Description |
-|---|---|
-| `role` | **Internal, do not set manually.** Auto-injected by the orchestration layer (`"sender"` for `output_connectors`, `"receiver"` for `input_connectors`). Defaults to `"sender"` if omitted. |
-| `host` | Local IP address for RDMA. `"auto"` detects from network interfaces. |
-| `protocol` | Transport protocol: `"rdma"` (InfiniBand/RoCE) or `"tcp"`. |
-
-### Memory Pool
-
-| Parameter | Default | Description |
-|---|---|---|
-| `memory_pool_size` | 4 GB (CPU) / 2 GB (GPU) | Total size of the RDMA-registered memory pool in bytes. Recommended 4 GB for CPU pinned memory; 2 GB for GPU VRAM to conserve device memory. |
-| `memory_pool_device` | `"cpu"` | `"cpu"`: pinned host memory (recommended, works on all topologies). `"cuda"`: GPU VRAM for GPUDirect RDMA (requires NIC-GPU direct PCIe connectivity, PIX topology). |
-
-### Networking
-
-| Parameter | Default | Description |
-|---|---|---|
-| `zmq_port` | 50051 | ZMQ **base** port. The orchestration layer computes the actual port as `base + purpose_offset + stage_offset` (see table below). Users only set this base value. |
-| `sender_host` | `None` | **Internal.** Receiver-side only — dynamically resolved via `update_sender_info()`. Not needed in YAML. |
-| `sender_zmq_port` | `None` | **Internal.** Receiver-side only — defaults to the sender's adjusted port. Not needed in YAML. |
-| `device_name` | `""` | RDMA device name (e.g., `"mlx5_0"`). Empty for auto-detect. Can also be set via `RDMA_DEVICE_NAME` env var. |
-
-#### ZMQ Port Offset Scheme
-
-To avoid port conflicts when multiple edges, purposes, DP replicas, or TP ranks share the same node, the actual ZMQ port is computed as:
-
-```
-side_channel_port = zmq_port + purpose_offset + stage_offset + dp_index * tp_size
-sender_listen = side_channel_port + tp_rank
-receiver_connect = remote_side_channel_port + tp_rank
-```
-
-| Component | Value | Description |
-|---|---|---|
-| `zmq_port` | 50051 (default) | Base port from YAML config |
-| `purpose_offset` | `request_forwarding` = 0, `kv_transfer` = 100 | Separates control-plane vs KV-cache connections |
-| `stage_offset` | `int(from_stage)` (0, 1, 2...) | Separates edges from different source stages |
-| `dp_index * tp_size` | e.g., DP1 × TP2 = 2 | Each DP replica reserves a port range of size `tp_size` (following vLLM convention: `VLLM_MOONCAKE_BOOTSTRAP_PORT + dp_index * tp_size`) |
-| `tp_rank` | 0, 1, 2... | Each TP rank within a DP replica uses its own port |
-| orchestrator | +200 | Extra offset when caller is the orchestrator (avoids collision with stage workers on the same node) |
-
-**Example** (base=50051, stage 0→1, DP=2, TP=2, kv_transfer):
-
-| Caller | DP | TP rank | Port |
-|---|---|---|---|
-| Stage worker | DP0 | rank 0 | `50051 + 100 + 0 + 0×2 + 0 = 50151` |
-| Stage worker | DP0 | rank 1 | `50051 + 100 + 0 + 0×2 + 1 = 50152` |
-| Stage worker | DP1 | rank 0 | `50051 + 100 + 0 + 1×2 + 0 = 50153` |
-| Stage worker | DP1 | rank 1 | `50051 + 100 + 0 + 1×2 + 1 = 50154` |
-| Orchestrator | — | — | `50051 + 200 + 0 = 50251` |
-
-## Memory Pool Modes
-
-| Mode | Config | Recommended Pool Size | Data Flow | Best For |
-|---|---|---|---|---|
-| CPU Pinned | `memory_pool_device: "cpu"` | 4 GB | GPU → CPU pool → RDMA → CPU pool → GPU | Most hardware topologies (recommended) |
-| GPUDirect | `memory_pool_device: "cuda"` | 2 GB | GPU → GPU pool → RDMA (NIC reads GPU BAR1) → GPU pool | NIC-GPU direct PCIe (PIX topology) |
-
-> **Note**: GPUDirect RDMA requires the NIC and GPU to share a direct PCIe
-> switch (PIX topology). On systems where they are connected via PXB or NODE,
-> CPU pinned memory is faster due to GPU BAR1 bandwidth limitations.
-
-## Environment Variables
-
-| Variable | Description |
-|---|---|
-| `RDMA_DEVICE_NAME` | Override RDMA device name (e.g., `mlx5_0`). |
-| `MC_IB_PCI_RELAXED_ORDERING` | Set to `1` to enable PCIe relaxed ordering for GPUDirect. |
-
-## Docker / Container Setup
-
-RDMA requires host-level device access:
-
-```bash
-docker run -it \
-  --cap-add=SYS_PTRACE \
-  --cap-add=IPC_LOCK \
-  --security-opt seccomp=unconfined \
-  --network=host \
-  --device=/dev/infiniband \
-  -v /sys/class/infiniband:/sys/class/infiniband:ro \
-  your-image:tag
-```
-
-## Performance
-
-Benchmark results on H800 GPUs with mlx5_0 RDMA NIC (~186 MB KV cache):
-
-| Metric | MooncakeStoreConnector | MooncakeTransferEngineConnector (CPU) |
-|---|---|---|
-| KV transfer wall time | ~810 ms | **~14 ms** |
-| RDMA throughput | N/A (TCP) | ~22 GB/s |
-| Speedup | 1x | **58x** |
-
-## Troubleshooting
-
-### Quick Diagnostics
-
-```bash
-# 1. Check RDMA devices and link status
-ibdev2netdev
-# Expected: "mlx5_X port 1 ==> (Up)"
-# RoCE devices map to Ethernet interfaces (e.g., enp75s0f0)
-# IB devices map to ib0, ib1, etc.
-
-# 2. Check InfiniBand device details
-ibstat
-
-# 3. Verify /dev/infiniband is accessible (required in containers)
-ls /dev/infiniband/
-
-# 4. Check Mooncake installation
-python -c "from mooncake.engine import TransferEngine; print('OK')"
-
-# 5. Check environment variables
-echo "RDMA_DEVICE_NAME=${RDMA_DEVICE_NAME:-}"
-echo "MC_IB_PCI_RELAXED_ORDERING=${MC_IB_PCI_RELAXED_ORDERING:-}"
-```
-
-### Common Issues
-
-| Symptom | Cause | Fix |
-|---|---|---|
-| `Failed to modify QP to RTR` | Cross-NIC QP handshake failure (multi-NIC DGX) | Set `device_name` to a single RoCE NIC (e.g., `mlx5_2`) or set `RDMA_DEVICE_NAME` env var |
-| `transport retry counter exceeded` | RDMA path between incompatible NICs | Same as above — restrict to one NIC |
-| `zmq.error.Again: Resource temporarily unavailable` | ZMQ recv timeout (transfer took too long) | Check NIC selection; larger payloads may need a longer timeout |
-| `Mooncake Engine initialization failed` | Missing RDMA drivers or `/dev/infiniband` | Install Mellanox OFED; in Docker add `--device=/dev/infiniband` |
-| `MemoryError` in allocator | Memory pool too small for payload | Increase `memory_pool_size` |
-| GPU transfer slower than CPU | GPU BAR1 bandwidth limitation (PXB/NODE topology) | Use `memory_pool_device: "cpu"` instead of `"cuda"` |
-
-### Multi-NIC Environments (DGX)
-
-On DGX machines with 12+ RDMA NICs, only RoCE NICs (with a bound network
-interface) reliably support loopback. IB-only NICs may fail cross-NIC QP
-handshakes. To identify RoCE NICs:
-
-```bash
-ibdev2netdev | grep -v "ib[0-9]"
-# RoCE devices show Ethernet interface names like enp75s0f0
-```
-
-Then configure the connector:
-```yaml
-device_name: "mlx5_2" # or set RDMA_DEVICE_NAME=mlx5_2
-```
-
-See the RDMA test README at `tests/distributed/omni_connectors/README.md`
-for test-specific setup instructions.
-
-For more details on the underlying engine, refer to the
-[Mooncake repository](https://github.com/kvcache-ai/Mooncake).
-
----
-
-## Design
-
-### 1. Overview
-
-`MooncakeTransferEngineConnector` is the high-performance remote connector in `vllm_omni/distributed/omni_connectors`. It is built on top of Mooncake `TransferEngine` and combines:
-
-- a **direct data plane** for remote memory writes
-- a **ZMQ side channel** for metadata lookup, handshake, and completion signaling
-- a **managed local memory pool** for both send and receive buffers
-
-Unlike `MooncakeStoreConnector`, which treats the backend as a distributed store, `MooncakeTransferEngineConnector` is designed as a peer-to-peer transport. Its goal is to move large stage payloads efficiently while still fitting the common `put()` / `get()` API defined by `OmniConnectorBase`.
-
-It is the most performance-oriented connector in the current OmniConnector family and is intended for large remote payloads such as:
-
-- KV cache transfer
-- stage hidden-state payloads
-- streaming chunk payloads
-- other binary-heavy inter-stage artifacts
-
-### 2. Relationship with the OmniConnector System
-
-`MooncakeTransferEngineConnector` implements the same connector contract as the other backends:
-
-- `put(from_stage, to_stage, put_key, data)`
-- `get(from_stage, to_stage, get_key, metadata=None)`
-- `cleanup(request_id, ...)`
-- `health()`
-- `close()`
-
-It is integrated into the system through the standard connector plumbing:
-
-- `OmniConnectorFactory` constructs the connector from `ConnectorSpec`
-- `load_omni_transfer_config()` resolves the edge-level connector configuration
-- `get_connectors_config_for_stage()` and `resolve_omni_kv_config_for_stage()` inject the connector role
-- All callers (batch forwarding, chunk transfer, KV transfer, etc.) interact with it through the same `put()` / `get()` contract
-
-The key system-level distinction is that this connector is **role-aware**:
-
-- sender instances expose data and listen for pull requests
-- receiver instances allocate buffers and actively pull data from the sender
-
-### 3. Design Goals
-
-The connector is designed around four primary goals:
-
-1. **High-throughput remote transfer**
- Avoid store-mediated round trips and write directly into the receiver memory region.
-
-2. **Fast path for raw payloads**
- Support `torch.Tensor`, `bytes`, and `ManagedBuffer` without forcing all traffic through full object serialization.
-
-3. **Unified connector abstraction**
- Preserve the same `put()` / `get()` interface used by the rest of the OmniConnector stack.
-
-4. **Safe lifecycle management**
- Manage allocation, reuse, cleanup, and failure recovery for a registered memory pool.
-
-### 4. Architecture Overview
-
-At a high level, the connector is composed of four main subsystems:
-
-```mermaid
-classDiagram
- class OmniConnectorBase {
- <<abstract>>
- +put(from_stage, to_stage, put_key, data)
- +get(from_stage, to_stage, get_key, metadata)
- +cleanup(request_id)
- +health()
- +close()
- }
-
- class MooncakeTransferEngineConnector {
- +supports_raw_data: bool
- -engine: TransferEngine
- -allocator: BufferAllocator
- -pool: torch.Tensor
- -zmq_ctx: zmq.Context
- -_local_buffers: dict
- -_sender_executor: ThreadPoolExecutor
- -_listener_thread: threading.Thread
- +put(...)
- +get(...)
- +update_sender_info(sender_host, sender_zmq_port)
- +get_connection_info()
- +cleanup(request_id, from_stage, to_stage)
- +close()
- }
-
- class BufferAllocator {
- -total_size: int
- -alignment: int
- -free_blocks: list
- +alloc(size) int
- +free(offset, size)
- }
-
- class ManagedBuffer {
- -allocator: BufferAllocator
- -offset: int
- -size: int
- -pool_tensor: torch.Tensor
- +tensor
- +as_tensor(dtype, shape) torch.Tensor
- +to_bytes() bytes
- +release()
- }
-
- class TransferEngine {
- +initialize(host, handshake, protocol, device_name)
- +register_memory(base_ptr, size)
- +batch_transfer_sync_write(remote_session, src_addrs, dst_addrs, lengths)
- +unregister_memory(base_ptr)
- +get_rpc_port() int
- }
-
- class QueryRequest {
- +request_id: str
- }
-
- class QueryResponse {
- +request_id: str
- +data_size: int
- +is_fast_path: bool
- }
-
- class MooncakeAgentMetadata {
- +remote_hostname: str
- +remote_port: int
- +request_id: str
- +dst_addrs: list[int]
- +lengths: list[int]
- }
-
- OmniConnectorBase <|-- MooncakeTransferEngineConnector
- MooncakeTransferEngineConnector *-- BufferAllocator
- MooncakeTransferEngineConnector *-- TransferEngine
- ManagedBuffer --> BufferAllocator : releases to
- ManagedBuffer --> "1" torch.Tensor : views
- MooncakeTransferEngineConnector ..> ManagedBuffer : returns / retains
- MooncakeTransferEngineConnector ..> QueryRequest : decodes
- MooncakeTransferEngineConnector ..> QueryResponse : encodes
- MooncakeTransferEngineConnector ..> MooncakeAgentMetadata : exchanges
-```
-
-#### 4.1 Transfer Engine
-
-Mooncake `TransferEngine` is responsible for the actual data-plane transfer. It registers local memory and performs synchronous remote writes through:
-
-```python
-batch_transfer_sync_write(...)
-```
-
-#### 4.2 Managed Memory Pool
-
-Each connector instance owns a large pre-registered memory pool:
-
-- CPU pinned memory when `memory_pool_device == "cpu"`
-- GPU memory when `memory_pool_device == "cuda"`
-
-This avoids repeated memory registration and allows each transfer to allocate subranges from one long-lived pool.
-
-#### 4.3 Buffer Manager
-
-Two helper classes control local memory ownership:
-
-- `BufferAllocator`
-  Manages aligned subrange allocation and free-list merging.
-
-- `ManagedBuffer`
-  Represents one live slice of the pool and exposes:
-  - `.tensor`
-  - `.as_tensor(dtype, shape)`
-  - `.to_bytes()`
-  - `.release()`
-
-#### 4.4 ZMQ Side Channel
-
-ZMQ is used for transport coordination, not for the data payload itself. It handles:
-
-- metadata query from receiver to sender
-- pull request submission
-- completion or error signaling
-- internal notification from worker threads back to the listener thread
-
-This split makes the control plane lightweight while keeping the bulk payload on the transfer engine data plane.
-
-### 5. Role Model
-
-#### 5.1 Sender Role
-
-A sender connector:
-
-- accepts `put()` calls
-- stores live transfer-ready buffers in `_local_buffers`
-- starts a ZMQ listener thread
-- responds to metadata queries and pull requests from receivers
-
-#### 5.2 Receiver Role
-
-A receiver connector:
-
-- does not bind the sender-side ZMQ listener
-- accepts `get()` calls
-- allocates receive buffers from its own pool
-- requests metadata or transfer service from the sender
-
-The role is not inferred dynamically. It is injected by the stage configuration layer:
-
-- incoming edge for a stage -> `role="receiver"`
-- outgoing edge for a stage -> `role="sender"`
-
-This matters because the role decides what is initialized: only a sender starts the ZMQ listener thread, so a connector with the wrong role cannot serve or pull data correctly.
-
-#### 5.3 Host Auto-Detection
-
-The `host` configuration field supports the special value `"auto"`. When set, the connector auto-detects the local IP address that would be used for external communication (via a UDP socket probe to `8.8.8.8`). If that fails, it falls back to hostname resolution, and ultimately to `127.0.0.1`.
-
-This is useful in environments where the operator does not want to hard-code IP addresses in the connector config.
-
-#### 5.4 RDMA Device Filtering
-
-The `device_name` configuration field allows the operator to specify which RDMA NICs to use (comma-separated, e.g. `"mlx5_0,mlx5_1"`). If not set in config, the connector also checks the `RDMA_DEVICE_NAME` environment variable.
-
-This is important in environments with mixed InfiniBand/RoCE NICs, where not all devices are suitable for the transfer engine.
-
-### 6. Local Memory Management
-
-#### 6.1 Memory Pool Registration
-
-During initialization, the connector:
-
-1. allocates a large pool tensor
-2. records its base pointer
-3. registers that memory with Mooncake
-4. creates a `BufferAllocator` for subrange management
-
-This means every later transfer only allocates offsets inside the pre-registered pool rather than registering memory per request.
-
-#### 6.2 BufferAllocator
-
-`BufferAllocator` maintains a sorted free list of `(offset, size)` blocks and enforces alignment. Its responsibilities include:
-
-- aligned allocation
-- freeing previously allocated blocks
-- adjacent block merging
-- double-free detection
-- overlap detection to catch corruption
-
-This is a critical piece of the connector because both sender and receiver depend on long-lived pool reuse.
-
-#### 6.3 ManagedBuffer
-
-`ManagedBuffer` is the main fast-path data wrapper. It can:
-
-- expose the pool slice as a zero-copy 1D `uint8` tensor
-- reinterpret that slice as a typed tensor
-- copy out the contents as Python `bytes`
-- release the slice back to the allocator
-
-The connector uses `ManagedBuffer` in two different ways:
-
-- as a send-side holder to keep the pool slice alive
-- as a receive-side return type when `is_fast_path=True`
-
-### 7. Put Flow
-
-#### 7.1 High-Level Behavior
-
-`put()` is only valid in sender mode. Its job is to expose a payload for later remote pull by the receiver.
-
-The high-level flow is:
-
-1. validate connector state and role
-2. convert the input into a pool-backed transferable representation
-3. store the transfer metadata in `_local_buffers`
-4. return lightweight metadata describing how the receiver can fetch the data
-
-#### 7.2 Payload Type Handling
-
-`put()` supports three payload classes:
-
-**A. `ManagedBuffer`**
-
-If the buffer belongs to the same pool, the connector can use it directly without copying. This is the most efficient path.
-
-If the buffer comes from a different pool, the connector falls back to a copy path.
-
-**B. `torch.Tensor` or `bytes`**
-
-These payloads are copied into the local pool and marked as fast-path data:
-
-- no Omni object serialization is required
-- receiver can get a `ManagedBuffer` back
-
-**C. Generic Python object**
-
-Any other payload is serialized via `OmniSerializer.serialize(...)` and then copied into the pool.
-
-In this case:
-
-- `is_fast_path=False`
-- the receiver will deserialize back into a Python object
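-
-The three payload classes can be sketched as a small dispatch function. `PoolSlice` is a hypothetical stand-in for `ManagedBuffer`, `pickle` stands in for `OmniSerializer`, and treating a foreign-pool buffer as fast-path data after the copy is an assumption, since the text above only says that it falls back to a copy path.
-
-```python
-import pickle
-from dataclasses import dataclass
-
-
-@dataclass
-class PoolSlice:
-    """Hypothetical stand-in for ManagedBuffer: a slice of an identified pool."""
-
-    pool_id: int
-    data: bytes
-
-
-def classify_payload(payload, local_pool_id: int):
-    """Return (bytes_to_place, is_fast_path, needs_copy) for a payload."""
-    if isinstance(payload, PoolSlice):
-        # Same pool: reuse in place. Different pool: copy into the local pool.
-        return payload.data, True, payload.pool_id != local_pool_id
-    if isinstance(payload, (bytes, bytearray, memoryview)):
-        # Raw bytes (a tensor would take this branch too): copy, no serialization.
-        return bytes(payload), True, True
-    # Anything else is serialized first; pickle stands in for OmniSerializer here.
-    return pickle.dumps(payload), False, True
-```
-
-Only the last branch sets `is_fast_path=False`, which is the signal the receiver later uses to decide whether to deserialize.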
-
-#### 7.3 Sender Metadata
-
-The sender returns:
-
-```python
-{
- "source_host": self.host,
- "source_port": self.zmq_port,
- "data_size": size,
- "is_fast_path": is_fast_path,
-}
-```
-
-This metadata is intentionally lightweight. It tells the receiver:
-
-- where the sender-side control plane lives
-- how large the remote transfer will be
-- whether the payload should be returned as a `ManagedBuffer` or a deserialized object
-
-#### 7.4 Sender Buffer Table
-
-The sender stores each live payload in `_local_buffers` under the stage-qualified key. Each entry contains:
-
-- source addresses
-- lengths
-- the holder object
-- ownership information (`should_release`)
-- `is_fast_path`
-- creation time
-
-This table is the sender-side source of truth for both metadata queries and pull requests.
-
-### 8. Get Flow
-
-#### 8.1 High-Level Behavior
-
-`get()` runs on the receiver side and performs four steps:
-
-1. resolve metadata
-2. allocate a destination buffer in the local pool
-3. request the sender to write into that destination buffer
-4. return either a `ManagedBuffer` or a deserialized object
-
-#### 8.2 Metadata Resolution Paths
-
-The `metadata` parameter in `get()` is optional. The connector supports two resolution modes depending on whether the caller supplies it.
-
-**With metadata**
-
-When the caller passes metadata, the connector uses it directly. The metadata carries:
-
-- `source_host` / `source_port` — sender ZMQ endpoint
-- `data_size` — payload byte count
-- `is_fast_path` — whether the receiver gets a `ManagedBuffer` or a deserialized object
-
-This mode is suitable when the control plane already forwards the sender's `put()` output to the receiver.
-
-**Without metadata**
-
-When `get(metadata=None)` is called, the connector queries the sender over ZMQ to discover the same fields (`data_size`, `is_fast_path`). The caller must first call:
-
-```python
-update_sender_info(sender_host, sender_zmq_port)
-```
-
-so that the connector knows where to send the query.
-
-This mode is suitable for polling-based flows (e.g. KV transfer, async chunk transfer) where the receiver does not have metadata from the control plane.
-
-#### 8.3 Destination Allocation
-
-Once metadata is resolved, the receiver:
-
-1. allocates a subrange from its own local pool
-2. wraps it in a `ManagedBuffer`
-3. builds a `MooncakeAgentMetadata` request containing:
- - receiver hostname
- - receiver RPC port
- - request ID
- - destination addresses
- - transfer lengths
-
-This tells the sender exactly where to write the incoming data.
-
-#### 8.4 Transfer Completion
-
-The receiver then sends the pull request over ZMQ and waits for:
-
-- `TRANS_DONE`
-- or `TRANS_ERROR`
-
-If the transfer succeeds:
-
-- for `is_fast_path=True`, the receiver returns `(ManagedBuffer, size)`
-- for `is_fast_path=False`, the receiver copies to bytes, deserializes, releases the buffer, and returns `(object, size)`
-
-### 9. Sender-Side Listener Design
-
-#### 9.1 Listener Thread
-
-In sender mode, the connector starts `_zmq_listener_loop()` after initialization. The listener:
-
-- binds `tcp://{host}:{zmq_port}`
-- receives incoming requests
-- uses a poller for socket events and internal notifications
-- periodically reclaims stale buffers
-
-If the bind fails, initialization fails immediately. The code does not silently downgrade the connector role.
-
-#### 9.2 Worker Thread Pool
-
-The listener hands work to `_sender_executor` so that the listener thread does not block on transfer work.
-
-There are two request types:
-
-- metadata query -> `_handle_query_request(...)`
-- transfer request -> `_handle_pull_request(...)`
-
-#### 9.3 Query Handling
-
-For metadata queries, the sender looks up the request ID in `_local_buffers` and returns:
-
-- data size
-- fast-path flag
-
-This supports consumers that only know the sender endpoint but not the original sender metadata.
-
-#### 9.4 Pull Handling
-
-For a transfer request, the sender:
-
-1. locates the source addresses in `_local_buffers`
-2. constructs the remote session identifier
-3. calls `batch_transfer_sync_write(...)`
-4. replies `TRANS_DONE` or `TRANS_ERROR`
-
-On success, the sender immediately calls `cleanup(meta.request_id)` and frees the producer-side buffer if it owns it.
-
-This choice is important: it makes the connector effectively a **single-consumer transfer model** for each successful put/get pair.
-
-### 10. Fast Path Semantics
-
-This connector explicitly advertises:
-
-```python
-supports_raw_data = True
-```
-
-That means it can move raw payloads without forcing everything through the Omni object serializer.
-
-#### Fast Path
-
-For `torch.Tensor`, `bytes`, or pool-local `ManagedBuffer`:
-
-- sender returns `is_fast_path=True`
-- receiver returns a `ManagedBuffer`
-- caller is responsible for calling `release()`
-
-This avoids an unnecessary copy on the receiver side.
-
-#### Serialized Object Path
-
-For arbitrary Python objects:
-
-- sender serializes the payload
-- receiver converts the receive buffer to bytes
-- receiver deserializes the object
-- receive buffer is released internally
-
-This preserves a uniform object-oriented API while still allowing optimized raw-data transport when possible.
-
-### 11. Failure Handling and Cleanup
-
-#### 11.1 Timeouts and Socket Recovery
-
-The receiver caches ZMQ REQ sockets per thread, but invalidates them after failures. This avoids reusing sockets that may be stuck in a bad state after timeout or receive errors.
-
-Timeout is scaled based on payload size:
-
-- a base timeout
-- plus additional time for large payloads
-
-This is intended to reduce false timeouts for large remote writes.
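-
-A size-scaled timeout of this shape could look like the sketch below. The function name and both constants (a 30-second base and 10 extra seconds per GiB) are hypothetical values chosen for illustration, not the connector's actual settings.
-
-```python
-def scaled_timeout_s(
-    data_size: int,
-    base_s: float = 30.0,
-    extra_s_per_gib: float = 10.0,
-) -> float:
-    """Base timeout plus extra time proportional to the payload size."""
-    gib = data_size / (1 << 30)
-    return base_s + extra_s_per_gib * gib
-```
-
-Small payloads keep roughly the base timeout, while a multi-gigabyte write gets proportionally more time before the receiver gives up and invalidates its socket.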
-
-#### 11.2 Stale Buffer Reclamation
-
-The sender periodically reclaims old entries from `_local_buffers` using a TTL policy. This protects the memory pool from permanent leaks if a receiver crashes or never consumes a prepared payload.
-
-This is a practical recovery mechanism, although the code notes that TTL cleanup can still race with very long-running in-flight transfers.
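-
-A TTL sweep over a buffer table can be sketched as follows. The entry field names (`created_at`, `should_release`, `holder`) mirror the sender buffer table described earlier, but the exact layout and the function name `reclaim_stale` are assumptions.
-
-```python
-import time
-from typing import Optional
-
-
-def reclaim_stale(buffers: dict, ttl_s: float, now: Optional[float] = None) -> list:
-    """Drop entries older than ttl_s and release the buffers the table owns."""
-    now = time.monotonic() if now is None else now
-    stale = [key for key, entry in buffers.items() if now - entry["created_at"] > ttl_s]
-    for key in stale:
-        entry = buffers.pop(key)
-        if entry["should_release"]:
-            # Only release slices the connector allocated itself.
-            entry["holder"].release()
-    return stale
-```
-
-This sketch does not check whether a transfer for the entry is still in flight, which is the same race the text above mentions for very long-running transfers.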
-
-#### 11.3 Connector Shutdown
-
-`close()` is a full resource teardown routine. It:
-
-- stops the listener thread
-- shuts down the worker executor
-- releases all pending buffers
-- closes cached sockets
-- unregisters memory from the engine when supported
-- terminates the ZMQ context
-- drops the pool reference
-
-This makes `MooncakeTransferEngineConnector` the most lifecycle-aware connector in the current connector family.
-
-### 12. Current Implementation Constraints
-
-The current code documents several important topology constraints.
-
-#### 12.1 One Sender to One Receiver per Successful Transfer
-
-After a successful transfer, the sender-side buffer is cleaned up immediately. This means the same prepared payload is not retained for multiple independent receivers.
-
-#### 12.2 One Receiver to One Active Sender Endpoint
-
-The receiver only stores one `(sender_host, sender_zmq_port)` pair through `update_sender_info(...)`. So the metadata-query mode is currently single-sender at a time.
-
-#### 12.3 Explicit Buffer Ownership Matters
-
-When the connector allocates a pool slice internally, it is responsible for releasing it. When a caller passes an externally owned `ManagedBuffer`, the connector keeps it alive for transfer but does not assume ownership of its eventual release.
-
-These constraints are consistent with the current implementation and should be treated as design assumptions rather than incidental behavior.
-
-### 13. Data Flow in the Pipeline
-
-The end-to-end sender/receiver interaction is:
-
-```mermaid
-sequenceDiagram
- participant SenderStage
- participant SenderConnector
- participant ReceiverConnector
- participant ReceiverStage
-
- SenderStage->>SenderConnector: put(from_stage, to_stage, put_key, data)
- SenderConnector->>SenderConnector: place payload in local memory pool
- SenderConnector-->>SenderStage: metadata(source_host, source_port, data_size, is_fast_path)
-
- ReceiverStage->>ReceiverConnector: get(..., metadata)
- ReceiverConnector->>ReceiverConnector: allocate destination buffer
- ReceiverConnector->>SenderConnector: ZMQ pull request with dst addr
- SenderConnector->>ReceiverConnector: TransferEngine remote write
- SenderConnector-->>ReceiverConnector: TRANS_DONE
- ReceiverConnector-->>ReceiverStage: ManagedBuffer or deserialized object
-```
-
-For metadata-less polling, the flow simply adds a metadata query step before the pull request.
-
-### 14. Strengths and Trade-offs
-
-#### Strengths
-
-- Best suited to large remote payloads among the current connectors, because data moves by direct remote write into a pre-registered pool.
-- Supports raw-data fast path.
-- Keeps stage communication under the same connector abstraction.
-- Includes real lifecycle and memory-pool management.
-- Works for both stage payload transfer and KV transfer scenarios.
-
-#### Trade-offs
-
-- More complex than the store-based connector.
-- Correctness depends on role injection and endpoint coordination.
-- Caller must release fast-path receive buffers.
-- Current implementation is optimized for single-consumer transfer semantics.
-
-### 15. Summary
-
-`MooncakeTransferEngineConnector` is the high-performance peer-to-peer transport in the OmniConnector system. Its design combines:
-
-- a registered memory pool
-- a safe subrange allocator
-- a ZMQ control plane
-- a Mooncake transfer-engine data plane
-
-This allows the connector to support both:
-
-1. a **fast path** for raw tensors and bytes
-2. a **generic object path** for arbitrary Python payloads
-
-Within vLLM-Omni, it is the connector that most directly targets performance-sensitive remote transfer, especially for large payloads and KV cache movement. Its additional complexity is deliberate: it is the connector that turns the generic OmniConnector abstraction into a transport capable of efficient remote memory movement rather than simple object storage.
diff --git a/docs/design/feature/omni_connectors/shared_memory_connector.md b/docs/design/feature/omni_connectors/shared_memory_connector.md
deleted file mode 100644
index 5b35014f233..00000000000
--- a/docs/design/feature/omni_connectors/shared_memory_connector.md
+++ /dev/null
@@ -1,259 +0,0 @@
-# SharedMemoryConnector
-
-## When to Use
-
-Best for single-node deployments where stages run on the same host. It is
-auto-configured when no explicit connector is specified for an edge.
-
-## How It Works
-
-All payloads are serialized and stored in shared memory (`/dev/shm`); the SHM
-segment name is returned in metadata. The configuration exposes a
-`shm_threshold_bytes` field for a future inline-vs-SHM split, but the current
-implementation always uses shared memory regardless of payload size.
-
-## Configuration
-
-```yaml
-runtime:
- connectors:
- connector_of_shared_memory:
- name: SharedMemoryConnector
- extra:
- shm_threshold_bytes: 65536
-```
-
-## Notes
-
-- Auto-mode uses SharedMemoryConnector if no connector is declared for an edge.
-
----
-
-## Design
-
-### 1. Overview
-
-`SharedMemoryConnector` is the default same-node connector in `vllm_omni/distributed/omni_connectors`. It is designed for stage-to-stage transfer when producer and consumer processes run on the same host and can share `/dev/shm`.
-
-The connector provides a unified `put()` / `get()` API for arbitrary Python objects while keeping the control plane lightweight:
-
-- The payload is serialized by the connector.
-- The serialized bytes are placed in shared memory.
-- The queue/control plane only carries a small metadata handle.
-
-This makes `SharedMemoryConnector` the simplest connector in the OmniConnector family and the default fallback when an edge does not explicitly configure another backend.
-
-### 2. Relationship with the OmniConnector System
-
-`SharedMemoryConnector` implements `OmniConnectorBase`, so it follows the same lifecycle and API contract as the other connectors:
-
-- `put(from_stage, to_stage, put_key, data)`
-- `get(from_stage, to_stage, get_key, metadata=None)`
-- `cleanup(request_id)`
-- `health()`
-- `close()`
-
-Within the larger system:
-
-- `load_omni_transfer_config()` automatically fills missing edges with `SharedMemoryConnector`.
-- Callers interact with the connector exclusively through the `put()` / `get()` / `cleanup()` contract — the connector does not require caller-specific logic.
-
-Compared with the remote Mooncake-based connectors, `SharedMemoryConnector` is intentionally minimal and local-only.
-
-### 3. Design Goals
-
-The connector is built around the following goals:
-
-- **Low-friction local transfer** for single-node multi-process pipelines.
-- **Unified object semantics** for arbitrary Python payloads.
-- **Small control-plane overhead** by passing only metadata through queues.
-- **Zero external dependencies** beyond Python shared memory and the existing stage utilities.
-
-It is not intended to provide cross-node transfer, RDMA, or raw tensor zero-copy semantics across processes.
-
-### 4. Core Design
-
-#### 4.1 Serialization Model
-
-`SharedMemoryConnector` always starts from a Python object and serializes it through the shared Omni serializer:
-
-```python
-payload = self.serialize_obj(data)
-```
-
-This keeps the connector behavior consistent with the rest of the connector stack:
-
-- producer code does not need connector-specific serialization logic
-- consumer code always receives the original object after deserialization
-- the connector can reuse the same serializer used by other backends
-
-#### 4.2 Shared Memory as the Data Plane
-
-The actual data plane is a shared-memory segment created by:
-
-- `shm_write_bytes(...)`
-- `shm_read_bytes(...)`
-
-The connector stores a small metadata object such as:
-
-```python
-{
- "shm": {"name": ..., "size": ...},
- "size": ...
-}
-```
-
-This metadata is passed over the control plane and allows the downstream stage to locate the shared-memory segment.
-
-#### 4.3 Locking Model
-
-To avoid races between the producer and consumer, the connector uses a lock file per request:
-
-```text
-/dev/shm/shm_{put_key}_lockfile.lock
-```
-
-Locking is done with `fcntl.flock`:
-
-- producer uses `LOCK_EX`
-- consumer uses `LOCK_EX`
-
-Both sides acquire an exclusive lock. This ensures that the shared-memory segment is not read while it is still being written and makes the handoff safer in a multi-process environment.
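-
-A per-request exclusive lock of this kind can be sketched with `fcntl.flock` and a context manager. The helper name `exclusive_lock` is hypothetical, and the sketch assumes a POSIX host where `fcntl` is available.
-
-```python
-import fcntl
-import os
-from contextlib import contextmanager
-
-
-@contextmanager
-def exclusive_lock(lock_path: str):
-    """Hold an exclusive flock on lock_path for the duration of the block."""
-    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
-    try:
-        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
-        yield
-    finally:
-        fcntl.flock(fd, fcntl.LOCK_UN)
-        os.close(fd)
-```
-
-Because `flock` locks are advisory, the scheme only protects the segment if both the producer and the consumer take the lock before touching it.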
-
-### 5. Put / Get Flow
-
-#### 5.1 Producer Flow: `put()`
-
-The producer-side flow is:
-
-1. Serialize the input object to bytes.
-2. Compute the payload size.
-3. Acquire the per-request lock file.
-4. Write the bytes into shared memory.
-5. Return lightweight metadata to the caller.
-
-The returned tuple is:
-
-```python
-(success, serialized_size, metadata)
-```
-
-where `metadata` contains the shared-memory handle needed by the consumer.
-
-#### 5.2 Consumer Flow: `get(metadata=...)`
-
-The primary consumer path is metadata-driven:
-
-1. Extract the shared-memory handle from `metadata`.
-2. Acquire the exclusive lock.
-3. Read the raw bytes from shared memory.
-4. Deserialize the bytes back into the original Python object.
-5. Remove the lock file if it still exists.
-
-This is the path used by the current stage-to-stage connector flow.
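-
-The metadata-driven round trip can be sketched with the standard library alone. This is not the connector's code: `pickle` stands in for the Omni serializer, `shm_put` and `shm_get` are hypothetical names rather than the real `shm_write_bytes` / `shm_read_bytes` helpers, and the lock file is omitted for brevity.
-
-```python
-import pickle
-from multiprocessing import shared_memory
-
-
-def shm_put(obj) -> dict:
-    """Serialize obj into a new shared-memory segment and return its handle."""
-    payload = pickle.dumps(obj)
-    seg = shared_memory.SharedMemory(create=True, size=len(payload))
-    seg.buf[: len(payload)] = payload
-    meta = {"shm": {"name": seg.name, "size": len(payload)}, "size": len(payload)}
-    seg.close()  # the segment itself stays alive until it is unlinked
-    return meta
-
-
-def shm_get(meta: dict):
-    """Read the segment named in meta, unlink it, and return (object, size)."""
-    seg = shared_memory.SharedMemory(name=meta["shm"]["name"])
-    try:
-        # The OS may round the segment up, so slice by the recorded size.
-        data = bytes(seg.buf[: meta["shm"]["size"]])
-    finally:
-        seg.close()
-        seg.unlink()
-    return pickle.loads(data), len(data)
-```
-
-Unlinking on the read side is what ties reclamation to the consumer: if `shm_get` never runs, the segment stays allocated.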
-
-#### 5.3 Compatibility Flow: `get(metadata=None)`
-
-The connector also keeps a compatibility path for callers that only know the key:
-
-1. Attempt to open the shared-memory segment by name via `SharedMemory(name=get_key)`.
-2. If the segment exists and has non-zero size, acquire the exclusive lock and read the bytes.
-3. Deserialize the bytes and return the object.
-
-If the segment does not exist or any exception occurs, the call returns `None` immediately. There is no retry loop in this path; it is a single-attempt open.
-
-This path is mainly for older code paths and is not the preferred mode for the current connector pipeline.
-
-### 6. Key Implementation Characteristics
-
-#### 6.1 Threshold Exists, but the Current Code Always Uses SHM
-
-The class keeps a `shm_threshold_bytes` field and still exposes metrics for inline writes. However, the current implementation uses:
-
-```python
-if True:
- ...
-```
-
-inside `put()`, which means the current code path always writes to shared memory.
-
-So the design still suggests a future split between:
-
-- small payloads inline
-- large payloads in shared memory
-
-but the current behavior is effectively:
-
-- all payloads go through shared memory
-
-This matters in practice: setting `shm_threshold_bytes` currently has no effect on where a payload is stored.
-
-#### 6.2 Cleanup Is Currently Passive
-
-`cleanup()` is currently a no-op. The intended assumption is:
-
-- the consumer reads the segment
-- the underlying shared-memory helpers unlink it
-
-If the consumer never executes `get()`, the shared-memory segment may remain allocated. This means the connector relies on the normal success path for resource reclamation.
-
-#### 6.3 Close Is Currently Minimal
-
-`close()` is also a no-op. There is no connector-owned background thread, socket, or memory pool to tear down, so the lifecycle is simple. The trade-off is that `close()` does not scan or recover leaked shared-memory resources.
-
-### 7. Data Flow in the Pipeline
-
-The typical flow with `SharedMemoryConnector` is:
-
-```mermaid
-sequenceDiagram
- participant SenderStage
- participant SharedMemoryConnector
- participant QueueOrControlPlane
- participant ReceiverStage
-
- SenderStage->>SharedMemoryConnector: put(from_stage, to_stage, put_key, data)
- SharedMemoryConnector->>SharedMemoryConnector: serialize object
- SharedMemoryConnector->>SharedMemoryConnector: write bytes to /dev/shm
- SharedMemoryConnector-->>SenderStage: metadata {shm, size}
- SenderStage->>QueueOrControlPlane: forward connector metadata
- QueueOrControlPlane->>ReceiverStage: task + connector metadata
- ReceiverStage->>SharedMemoryConnector: get(from_stage, to_stage, get_key, metadata)
- SharedMemoryConnector->>SharedMemoryConnector: read bytes from /dev/shm
- SharedMemoryConnector->>SharedMemoryConnector: deserialize object
- SharedMemoryConnector-->>ReceiverStage: (data, size)
-```
-
-This is a classic split-control-plane / data-plane design, but constrained to a single host.
-
-### 8. Strengths and Trade-offs
-
-#### Strengths
-
-- Very simple deployment model.
-- No external service dependency.
-- Fits naturally into the existing queue-driven orchestration flow.
-- Good default for local multi-process pipelines.
-
-#### Trade-offs
-
-- Same-node only.
-- Full object serialization and deserialization are still required.
-- Resource cleanup depends on the normal consumer path.
-- Shared memory capacity is limited by host configuration.
-
-### 9. Summary
-
-`SharedMemoryConnector` is the baseline local transport for the OmniConnector system. Its design is intentionally straightforward:
-
-- serialize object
-- place bytes in shared memory
-- pass metadata through the control plane
-- deserialize on the receiving side
-
-It plays two important roles in vLLM-Omni:
-
-1. It is the simplest production-ready connector for same-node stage pipelines.
-2. It serves as the automatic fallback connector when no explicit edge transport is configured.
-
-Although the current implementation is deliberately minimal, it provides the foundation for reliable local connector semantics and keeps the stage communication model uniform across the system.
diff --git a/docs/design/feature/omni_connectors/yuanrong_connector.md b/docs/design/feature/omni_connectors/yuanrong_connector.md
deleted file mode 100644
index 12127ba3e64..00000000000
--- a/docs/design/feature/omni_connectors/yuanrong_connector.md
+++ /dev/null
@@ -1,358 +0,0 @@
-# YuanrongConnector
-
-## When to Use
-
-Best for multi-node distributed inference using Yuanrong Datasystem.
-
-## Mechanism
-
-Uses Yuanrong Datasystem's distributed KV store (`datasystem.kv_client`).
-
-- Data Plane: TCP or RDMA for high-bandwidth transfer.
-- Control Plane: Yuanrong Datasystem workers and etcd.
-- Keying: deterministic keys based on `put_key` (often composed as `request_id:fromStage_toStage`).
-
-## Installation
-
-```bash
-pip install openyuanrong-datasystem
-```
-
-## Start etcd
-
-```bash
-# Download and install etcd (v3.5.12 or higher)
-ETCD_VERSION="v3.5.12"
-ETCD_ARCH="linux-arm64"
-wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
-tar -xvf etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
-cd etcd-${ETCD_VERSION}-${ETCD_ARCH}
-sudo cp etcd etcdctl /usr/local/bin/
-
-# Start etcd
-etcd \
- --name etcd-single \
- --data-dir /tmp/etcd-data \
- --listen-client-urls http://0.0.0.0:2379 \
- --advertise-client-urls http://0.0.0.0:2379 \
- --listen-peer-urls http://0.0.0.0:2380 \
- --initial-advertise-peer-urls http://0.0.0.0:2380 \
- --initial-cluster etcd-single=http://0.0.0.0:2380 &
-
-# Verify etcd is running
-etcdctl --endpoints "127.0.0.1:2379" put key "value"
-etcdctl --endpoints "127.0.0.1:2379" get key
-```
-
-For production environments, refer to the
-[official etcd clustering documentation](https://etcd.io/docs/current/op-guide/clustering/).
-
-## Start Datasystem Worker
-
-```bash
-# Replace ${ETCD_IP} with etcd node IP, ${WORKER_IP} with local node IP
-dscli start -w \
- --worker_address "${WORKER_IP}:31501" \
- --etcd_address "${ETCD_IP}:2379" \
- --shared_memory_size_mb 20480
-```
-
-To stop the worker:
-
-```bash
-dscli stop --worker_address "${WORKER_IP}:31501"
-```
-
-## Configuration
-
-Define the connector in runtime:
-
-```yaml
-runtime:
- connectors:
- connector_of_yuanrong:
- name: YuanrongConnector
- extra:
- host: "127.0.0.1"
- port: 31501
- get_sub_timeout_ms: 1000
-```
-
-Wire stages to the connector:
-
-```yaml
-stage_args:
- - stage_id: 0
- output_connectors:
- to_stage_1: connector_of_yuanrong
-
- - stage_id: 1
- input_connectors:
- from_stage_0: connector_of_yuanrong
-```
-
-Parameters:
-
-- host: datasystem worker host.
-- port: datasystem worker port (default: `35001` if omitted; the example above uses `31501` to match the worker startup command).
-- get_sub_timeout_ms: get timeout in milliseconds (0 for no timeout).
-
-For more details, refer to the
-[Yuanrong Datasystem repository](https://atomgit.com/openeuler/yuanrong-datasystem).
-
----
-
-## Design
-
-### 1. Overview
-
-`YuanrongConnector` is the Datasystem-based remote connector in `vllm_omni/distributed/omni_connectors`. It uses Yuanrong Datasystem's distributed key-value client as the transport backend and exposes the same `put()` / `get()` interface as the other OmniConnectors.
-
-Like `MooncakeStoreConnector`, it is a store-oriented remote connector rather than a direct peer-to-peer transport. Its role is to let stage payloads move across nodes through a deterministic key-based storage abstraction while keeping the rest of the pipeline on the common connector API.
-
-It is intended for deployments that already use Yuanrong Datasystem and want a remote connector that integrates with the existing OmniConnector configuration and orchestration model.
-
-### 2. Relationship with the OmniConnector System
-
-`YuanrongConnector` implements `OmniConnectorBase`, so it participates in the same connector lifecycle as the other backends:
-
-- `OmniConnectorFactory` constructs it from a `ConnectorSpec`
-- stage edge configuration is resolved by `load_omni_transfer_config()`
-- all callers (batch forwarding, chunk transfer, KV transfer, etc.) interact with it through the same `put()` / `get()` contract
-
-This means the connector is not exposed directly to stage logic. Stages only interact with the generic connector contract, and the backend choice remains a configuration concern.
-
-### 3. Design Goals
-
-The connector is built around the following goals:
-
-1. **Cross-node payload transfer through Datasystem**
- Reuse Yuanrong Datasystem as the remote exchange medium for stage data.
-
-2. **Uniform object transfer semantics**
- Allow arbitrary Python objects to be transmitted through the shared Omni serializer.
-
-3. **Minimal connector-specific control plane**
- Use deterministic keys so that consumers can retrieve data without an extra transport metadata handoff.
-
-4. **Operational reuse of existing infrastructure**
- Fit into environments that already deploy Yuanrong Datasystem workers and etcd.
-
-The connector is not designed for direct remote-memory writes or tensor-specific fast-path transfer.
-
-### 4. Core Design
-
-#### 4.1 Store-Oriented Transfer Model
-
-`YuanrongConnector` treats the transport backend as a distributed object store:
-
-1. serialize the Python payload
-2. build a deterministic connector key
-3. write the serialized bytes into Datasystem
-4. read the bytes back on the receiver side
-5. deserialize them into the original object
-
-This is the same broad architectural class as `MooncakeStoreConnector`, but implemented on top of Yuanrong Datasystem APIs instead of Mooncake store APIs.
-
-#### 4.2 Deterministic Keying
-
-Unlike the default `_make_key()` in `OmniConnectorBase`, `YuanrongConnector` defines its own key format:
-
-```text
-{request_id}:{from_stage}_{to_stage}
-```
-
-This has two design implications:
-
-- the request identifier remains the primary lookup handle
-- stage routing information is embedded in the key so that the same logical request ID can safely appear on different edges
-
-The explicit override also makes the key format easier to align with Datasystem-side debugging and operational inspection.
-
-#### 4.3 No Extra Metadata Hand-off
-
-`put()` returns:
-
-```python
-(success, serialized_size, None)
-```
-
-and does not generate connector-specific metadata.
-
-This design works because the receiver can reconstruct the exact same key from:
-
-- `get_key`
-- `from_stage`
-- `to_stage`
-
-As a result, the connector does not require a separate side-channel metadata handoff.
-
-### 5. Initialization
-
-#### 5.1 Datasystem Client Dependency
-
-The connector requires the Datasystem Python bindings to expose:
-
-- `KVClient`
-- `SetParam`
-- `WriteMode`
-
-If any of these symbols are unavailable, connector construction fails immediately with `ImportError`. This keeps configuration errors explicit and avoids a partially initialized runtime.
-
-#### 5.2 Client Setup
-
-During `_init_client()`, the connector:
-
-1. reads `host` and `port`
-2. creates `KVClient(host, port)`
-3. calls `client.init()`
-
-At construction time it also creates a `SetParam` and fixes:
-
-```python
-self.set_param.write_mode = WriteMode.NONE_L2_CACHE_EVICT
-```
-
-This means the connector has a stable write policy for all writes and does not currently expose write-mode selection as a higher-level connector option.
-
-### 6. Put / Get Flow
-
-#### 6.1 Producer Flow: `put()`
-
-The producer-side flow is:
-
-1. verify that the Datasystem client has been initialized
-2. serialize the input object with the shared Omni serializer
-3. build the Datasystem key using the connector-specific `_make_key()`
-4. call `client.set(key, serialized_data, self.set_param.write_mode)`
-5. update metrics and return success
-
-The returned metadata is always `None`, because the Datasystem key itself is the lookup contract between producer and consumer.
-
-#### 6.2 Consumer Flow: `get()`
-
-The consumer-side flow is:
-
-1. verify that the Datasystem client has been initialized
-2. rebuild the same key with `from_stage`, `to_stage`, and `get_key`
-3. call:
-
-```python
-client.get([key], False, self.get_sub_timeout_ms)
-```
-
-4. take the first returned element if present
-5. deserialize it and return `(data, payload_size)`
-
-If the returned list is empty or contains no data for the key, `get()` returns `None`.
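-
-The store-oriented contract can be illustrated with an in-memory stand-in. `DictStoreConnector` is a hypothetical class: a plain `dict` replaces the Datasystem client, `pickle` replaces the Omni serializer, and there is no timeout handling. Only the key format and the return shapes follow the description above.
-
-```python
-import pickle
-
-
-class DictStoreConnector:
-    """Toy store-based connector backed by a dict (illustrative only)."""
-
-    def __init__(self) -> None:
-        self._store = {}
-
-    @staticmethod
-    def _make_key(request_id: str, from_stage: str, to_stage: str) -> str:
-        # Stage-qualified key: {request_id}:{from_stage}_{to_stage}
-        return f"{request_id}:{from_stage}_{to_stage}"
-
-    def put(self, from_stage, to_stage, put_key, data):
-        payload = pickle.dumps(data)
-        self._store[self._make_key(put_key, from_stage, to_stage)] = payload
-        return True, len(payload), None  # no connector metadata is produced
-
-    def get(self, from_stage, to_stage, get_key, metadata=None):
-        payload = self._store.get(self._make_key(get_key, from_stage, to_stage))
-        if payload is None:
-            return None
-        return pickle.loads(payload), len(payload)
-```
-
-Because both sides derive the key from the same three values, the receiver can call `get()` without any metadata from the sender, and the same request ID on a different edge maps to a different key.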
-
-### 7. Timeout and Retrieval Semantics
-
-The connector uses `get_sub_timeout_ms` as its read timeout. Unlike `MooncakeStoreConnector`, which performs an explicit retry loop in Python, `YuanrongConnector` delegates waiting behavior more directly to the Datasystem client call.
-
-This leads to a slightly different retrieval model:
-
-- `MooncakeStoreConnector`: retry-oriented polling in connector code
-- `YuanrongConnector`: single client call with backend-managed timeout behavior
-
-From the connector API perspective the result is the same, but operationally the waiting behavior is more dependent on Datasystem client semantics.
-
-### 8. Integration with Stage Communication
-
-All callers use the connector through the same `put()` / `get()` contract:
-
-- the sender calls `put()` to serialize and store the payload
-- the receiver calls `get()` to retrieve and deserialize it
-- no connector-specific metadata is required, since the Datasystem key is the rendezvous point
-
-Because `put()` returns `metadata=None`, the connector is naturally compatible with callers that do not forward metadata (e.g. polling-based flows). The trade-off is that all payloads incur full serialization and deserialization costs, and there is no raw tensor fast path, which makes the connector functional but not optimized for the largest payloads.
-
-### 9. Data Flow in the Pipeline
-
-The end-to-end transfer model is:
-
-```mermaid
-sequenceDiagram
- participant SenderStage
- participant YuanrongConnector
- participant Datasystem
- participant ReceiverStage
-
- SenderStage->>YuanrongConnector: put(from_stage, to_stage, put_key, data)
- YuanrongConnector->>YuanrongConnector: serialize object
- YuanrongConnector->>Datasystem: set(key, bytes)
- YuanrongConnector-->>SenderStage: (success, size, None)
-
- ReceiverStage->>YuanrongConnector: get(from_stage, to_stage, get_key)
- YuanrongConnector->>Datasystem: get([key], timeout)
- YuanrongConnector->>YuanrongConnector: deserialize bytes
- YuanrongConnector-->>ReceiverStage: (data, size)
-```
-
-This is a store-mediated remote connector design with deterministic key lookup and no explicit side-channel metadata exchange.
-
-### 10. Strengths and Trade-offs
-
-#### Strengths
-
-- Reuses existing Yuanrong Datasystem infrastructure.
-- Keeps the connector contract simple and uniform.
-- Requires no connector-specific metadata handoff.
-- Suitable for remote stage transfer in Datasystem-based deployments.
-
-#### Trade-offs
-
-- Always pays full serialization and deserialization cost.
-- Does not support raw bytes / tensor fast-path semantics.
-- Depends on external Datasystem worker availability.
-- Retrieval and timeout behavior are partly delegated to the backend client.
-
-### 11. Important Implementation Characteristics
-
-#### 11.1 Cleanup Is a No-op
-
-`cleanup()` only logs a debug message and does not explicitly remove data from Datasystem. The current design assumes backend-side lifecycle or garbage collection rather than request-scoped delete semantics inside the connector.
-
-This mirrors the same design trade-off seen in other store-based connectors: simplicity over explicit per-request data reclamation.
-
-#### 11.2 Close Only Releases the Local Client Handle
-
-`close()` does not perform a remote shutdown. It simply clears the local `client` reference and marks the connector as closed from the process perspective:
-
-```python
-self.client = None
-```
-
-This is appropriate for a client-based store connector, but it also means that backend resource lifecycle remains outside the connector's control.
-
-#### 11.3 Error and Timeout Metrics Are Coarse-Grained
-
-The connector tracks:
-
-- `puts`
-- `gets`
-- `bytes_transferred`
-- `errors`
-- `timeouts`
-
-In the current code, a failed `put()` increments `errors`, while an exception in `get()` increments `timeouts`. This is operationally useful, but it does not distinguish between:
-
-- backend timeout
-- not-found result
-- transport failure
-- deserialization failure
-
-So the metrics should be read as high-level health indicators, not detailed root-cause diagnostics.
-
-### 12. Summary
-
-`YuanrongConnector` is the Datasystem-backed remote connector in the OmniConnector family. Its design is straightforward:
-
-- serialize payloads
-- store them under a deterministic stage-qualified key
-- retrieve them by the same key
-- deserialize them on the receiving side
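The four steps above, together with the coarse metric accounting from section 11.3, can be sketched with an in-memory dictionary standing in for the Datasystem backend. This is a minimal illustration and not the actual `YuanrongConnector` code: the class name `ToyStoreConnector`, the key format, and the method signatures are all assumptions made for the example.

```python
import pickle


class ToyStoreConnector:
    """Hypothetical store-backed connector; a plain dict plays the remote store."""

    def __init__(self, store):
        self.store = store
        self.metrics = {"puts": 0, "gets": 0, "bytes_transferred": 0,
                        "errors": 0, "timeouts": 0}

    @staticmethod
    def make_key(request_id, from_stage, to_stage):
        # Deterministic key (format is an assumption): both sides derive it
        # independently, so no metadata handoff is needed.
        return f"{request_id}:{from_stage}->{to_stage}"

    def put(self, from_stage, to_stage, request_id, payload):
        try:
            blob = pickle.dumps(payload)  # full serialization on every put
            self.store[self.make_key(request_id, from_stage, to_stage)] = blob
        except Exception:
            self.metrics["errors"] += 1
            return False
        self.metrics["puts"] += 1
        self.metrics["bytes_transferred"] += len(blob)
        return True

    def get(self, from_stage, to_stage, request_id):
        try:
            blob = self.store[self.make_key(request_id, from_stage, to_stage)]
            payload = pickle.loads(blob)
        except Exception:
            # Coarse accounting: every get() failure is counted as a timeout,
            # whether the key is missing or deserialization failed.
            self.metrics["timeouts"] += 1
            return None
        self.metrics["gets"] += 1
        return payload


store = {}
producer = ToyStoreConnector(store)
consumer = ToyStoreConnector(store)
producer.put("0", "1", "req-1", {"tokens": [1, 2, 3]})
result = consumer.get("0", "1", "req-1")
missing = consumer.get("0", "1", "req-404")
```

Here the lookup for `req-404` fails with a `KeyError`, yet it is recorded under `timeouts`, which is why those counters are better read as health indicators than as root-cause diagnostics.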
-
-Within vLLM-Omni, it provides a clean integration point for Yuanrong Datasystem-based deployments while preserving the same connector abstraction used by the rest of the pipeline.
-
-Its design priorities are simplicity, infrastructure reuse, and API consistency rather than direct-memory transport optimization.
diff --git a/docs/design/feature/prefix_caching.md b/docs/design/feature/prefix_caching.md
deleted file mode 100644
index ebad8b69106..00000000000
--- a/docs/design/feature/prefix_caching.md
+++ /dev/null
@@ -1,164 +0,0 @@
-# Automatic Prefix Caching in Omni Models
-
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [High-Level Approach](#high-level-approach)
-- [Example](#example)
-- [What About Multimodal Inputs?](#what-about-multimodal-inputs)
-
----
-
-### Overview
-
-Prefix caching in the context of kv-cache management is a useful optimization for avoiding redundant computations. The main idea is that we store portions of the kv-cache from processed requests, so that we can reuse them if incoming requests have the same prefix as previous requests.
-
-vLLM manages the kv-cache as blocks, which represent a span of tokens of a fixed length. Blocks are hashable by the content that they contain, which typically means the tokens within the span, but also could be influenced by other factors, e.g., LoRA and multimodal data.
-
-vLLM implements automatic prefix caching for managing its kv-cache, which is best understood by reading the design document [here](https://docs.vllm.ai/en/latest/design/prefix_caching/). vLLM-Omni builds on top of the prefix caching mechanism in a noninvasive way to allow caching between stages in Omni pipelines. This typically means for a given stage we aim to support caching for the following:
-
-- The last hidden states produced by the stage
-- Model / stage specific multimodal data
-
-!!! note "Note 1"
-    This document describes vLLM-Omni's mechanism for caching tensor outputs that are passed between stages when requests share a common prefix, similar to how vLLM applies prefix caching to the kv-cache. It works in conjunction with vLLM's multimodal encoder caching but is distinct from it. See the final section for a concrete example of how the two tie together in practice.
-
-### High-Level Approach
-!!! note "Note 2"
-    Before reading this section, it is recommended to read vLLM's documentation for [Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/), which makes some of the concepts clearer.
-
-The main focus of vLLM-Omni's approach to prefix caching stage outputs is to build on vLLM's prefix caching in the least invasive way possible while minimizing impact for cache misses, and consuming a minimal amount of GPU memory. To understand the implementation, there are a few important things to note:
-
-- Between stages, device tensors are generally moved to CPU; this is important since we're just caching the outputs of stages, so it is okay to keep the entire cache on the CPU.
-
-- For a tensor to be considered cacheable, the first dimension (currently) needs to be the same as the token count, as it allows us to reuse block/slot mappings for our externally maintained tensor caches. This allows us to dynamically discover the tensors to be marked as cacheable outputs in each Omni model without having to explicitly specify cacheable output field names in every model.
-
-With this in mind, consider the set of blocks in a 2D layout, where the row represents the index of blocks being considered, and the columns represent the slots corresponding to tokens within each block. Since we know the `num_blocks` and `block_size` from our kv cache config, if we want to cache a tensor with feature size `D`, we can preallocate a CPU tensor of size `(num_blocks, block_size, D)`, and use the same block index and slot mapping to retrieve the corresponding feature vector.
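The layout described above can be sketched in pure Python, with nested lists standing in for the preallocated CPU tensor. This is only an illustration under stated assumptions: `StageOutputCache` and its methods are hypothetical names, and the slot formula `block_id * block_size + offset` is the usual block/offset flattening rather than a quote from the vLLM-Omni code.

```python
class StageOutputCache:
    """Toy cache shaped (num_blocks, block_size, feature_dim), stored flat."""

    def __init__(self, num_blocks, block_size, feature_dim):
        self.block_size = block_size
        self.feature_dim = feature_dim
        # One feature vector per slot; None marks an empty slot.
        self.slots = [None] * (num_blocks * block_size)

    def slot_index(self, block_id, offset):
        # Row = block index, column = token position inside the block.
        return block_id * self.block_size + offset

    def write(self, slot_mapping, features):
        # Store each token's feature vector at the slot assigned to that token.
        for slot, vec in zip(slot_mapping, features):
            assert len(vec) == self.feature_dim
            self.slots[slot] = list(vec)

    def read_blocks(self, block_ids):
        # Return the per-token vectors of the given blocks, in order.
        out = []
        for block_id in block_ids:
            start = self.slot_index(block_id, 0)
            out.extend(self.slots[start:start + self.block_size])
        return out


cache = StageOutputCache(num_blocks=8, block_size=4, feature_dim=2)
# 12 tokens occupying blocks 0..2; token i gets the vector [i, i + 0.5].
slot_mapping = [cache.slot_index(i // 4, i % 4) for i in range(12)]
cache.write(slot_mapping, [[float(i), i + 0.5] for i in range(12)])
second_block = cache.read_blocks([1])
```

Because the cache reuses the block and slot indices chosen by the kv-cache manager, it needs no bookkeeping of its own beyond the storage itself.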
-
-
-### Example
-!!! note "Note 3"
-    Prefix caching in vLLM-Omni is currently supported only on AutoRegressive stages with one kv-cache group. It can be enabled or disabled per stage via the `enable_prefix_caching` parameter in the model's stage config.
-
-The way in which vLLM-Omni ties into vLLM's prefix caching is best understood by example. Say that we have the following:
-
-- `num_blocks=8`
-- `block_size=4`
-- `hidden_size=2`
-- A stage specific multimodal output tensor named `mm_feature` with feature dimension `16`
-
-The prefix cache flow is then outlined below.
-
-1. When the model is initialized, we can determine the `hidden_size` from the `ModelConfig`, and allocate a cache of size `(num_blocks, block_size, hidden_size)`.
-
-2. Say we process the request `The quick brown fox was tired and slept beneath the shady tree`, which is 12 tokens and evenly divides into 3 blocks as shown below.
-
-```
-         [ The quick brown fox ] [ was tired and slept ] [beneath the shady tree ]
-Block 1: |<--- block tokens ---->|
-Block 2: |<------- prefix ------>| |<--- block tokens --->|
-Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
-```
-
-When the request is processed, we inspect the multimodal outputs and identify the `mm_feature` tensor, which has shape `(seq_len, feature_dim)`, i.e., `(12, 16)` in this example. Because its first axis equals `seq_len`, we add a new cache tensor of shape `(num_blocks, block_size, feature_dim)` to our multimodal tensor cache.
-
-
-3. If we lay out the cache as a 2D tensor of shape (`num_blocks`, `block_size`), we'll have something like the following:
-
-```
-0: [ The quick brown fox ]
-1: [ was tired and slept ]
-2: [beneath the shady tree ]
-3: [EMPTY]
-...
-7: [EMPTY]
-```
-
-Or, if we flatten it down to 1D,
-```
-0: The
-1: quick
-2: brown
-3: fox
-...
-11: tree
-12: [EMPTY]
-...
-```
-
-which we can think of as row indices into the hidden states cache if we view it as the 2D shape `(num_blocks * block_size, hidden_size)`. That is, the analogous flattened (3D -> 2D) mapping of the cache for hidden states becomes the following.
-```
-0: [hidden states for The]
-1: [hidden states for quick]
-2: [hidden states for brown]
-3: [hidden states for fox]
-...
-11: [hidden states for tree]
-12: [EMPTY]
-...
-```
-
-Similarly, for the multimodal outputs cache, the flattened coordinates are the same, but the `mm_feature` maps to vectors of length `16` instead of the hidden size of `2`. Note that in practice, we may have multiple multimodal output tensors per forward pass, which may have different names and different feature dimensions.
-
-
-4. Now, say that we receive a new request `The quick brown fox jumped over the dog`.
-
-```
-         [ The quick brown fox ] [ jumped over the dog ]
-Block 1: |<--- block tokens ---->|
-Block 2: |<------- prefix ------>| |<--- block tokens --->|
-```
-
-Here, we will have a cache hit for `Block 1` which will be detected by vLLM based on the hash of the first block when it's handling the prefix caching on the kv-cache. As a result, when we get the output from the scheduler, we will see that `num_computed_tokens=4` (corresponding to the cached first block), and we only need to process the remaining 4 new tokens in the new prefill.
-
-Since we have the block indices / slot mappings from the kv cache manager, we can simply mirror the mappings and leverage the same indices for the cached hidden states and multimodal outputs. This allows us to look up the correct tensors from our externally maintained 3D caches.
-
-```
-0: [ The quick brown fox ] < already in the cache
-1: [ was tired and slept ]
-2: [beneath the shady tree ]
-3: [ jumped over the dog ] < added on the second request
-4: [EMPTY]
-...
-7: [EMPTY]
-...
-```
-
-Finally, to pass the full hidden states and multimodal outputs to the next stage, we simply concatenate the cached contents with the corresponding new tensors computed from the current forward call.
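The cache-hit path for the second request can be sketched as follows. This is a simplified illustration, not the vLLM-Omni implementation: the function name `assemble_stage_output`, the dict-of-lists cache, and the block table `[0, 3]` are assumptions chosen to match the example (block 0 is the prefix hit, block 3 receives the new tokens).

```python
BLOCK_SIZE = 4


def assemble_stage_output(cache, block_table, num_computed_tokens, new_states):
    """Concatenate cached states for the hit prefix with freshly computed states."""
    num_hit_blocks = num_computed_tokens // BLOCK_SIZE
    cached = []
    for block_id in block_table[:num_hit_blocks]:
        cached.extend(cache[block_id])
    return cached + new_states


# cache[block_id] -> list of per-token vectors (hidden_size = 2 here).
cache = {
    0: [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]],  # "The quick brown fox"
}
# Second request: block 0 is a prefix hit, block 3 holds the 4 new tokens.
block_table = [0, 3]
new_states = [[10.0, 10.1], [11.0, 11.1], [12.0, 12.1], [13.0, 13.1]]
full = assemble_stage_output(cache, block_table,
                             num_computed_tokens=4, new_states=new_states)
cache[3] = new_states  # store the newly completed block for later requests
```

The next stage then receives all 8 token vectors, even though only 4 of them were computed in this forward pass.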
-
-
-### What About Multimodal Inputs?
-It is also useful to consider how Omni prefix caching behaves when multimodal inputs do not end cleanly on block boundaries, and how this interacts with multimodal encoder caching in vLLM. For example:
-
-```
-         [ Im0 Im1 Im2 Im3 ] [ Im4 Im5 foo ]
-Block 1: |<--- block tokens ---->|
-Block 2: |<------- prefix ------>| |<--- block tokens --->|
-```
-
-In this case, only `Block 1` will have outputs stored in the prefix tensor cache, because vLLM does not store partial blocks. This may appear to be a problem at first glance, because the multimodal input is fragmented across a new block that wasn't cached.
-
-In reality, this isn't a big problem for correctness, because vLLM also maintains an encoder cache for multimodal inputs. In other words, after the first pass, we'll have the following:
-
-- The Block 1 hash, which is used for prefix caching
-- The hash describing the image data starting at position 0 and with length 6
-- In vLLM's encoder cache, a mapping from the image hash above to the encoder output
-
-
-To understand what happens, say we get the following input as a second request:
-```
-         [ Im0 Im1 Im2 Im3 ] [ Im4 Im5 bar baz ]
-Block 1: |<--- block tokens ---->|
-Block 2: |<------- prefix ------>| |<--- block tokens --->|
-```
-
-First, the scheduler will check for a prefix cache hit, which we will see on `Block 1`. As a result, we will have 4 tokens marked as precomputed, and only see the remaining 4 tokens in the following prefill.
-
-Because we have multimodal data in a scheduled span that isn't fully precomputed, we still need to call the visual encoder. However, since we have the image hash and encoder cache, we will retrieve the encoder outputs for `Im4` and `Im5` as we create the multimodal embeddings.
-
-When we pass the multimodal embeddings to the language model component in the same stage, we expect the same outputs as without caching, because the prefix caching behavior of vLLM-Omni matches that of vLLM. The LLM relies on the prefix caching in vLLM's KV cache manager to reuse the attention state for `Block 1` while computing the outputs for `Block 2`, so `Block 2` is processed with the correct context from `Block 1`.
-
-Finally, we look up the output hidden states/multimodal tensors corresponding to the prefix cache hit `Block 1` and concatenate it with the forward pass result to get the final result, which is expected to be identical to the full hidden states when prefix caching is disabled.
diff --git a/docs/design/feature/ray_based_execution.md b/docs/design/feature/ray_based_execution.md
deleted file mode 100644
index ae10d661fd6..00000000000
--- a/docs/design/feature/ray_based_execution.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# Distributed utils
-
-This directory (vllm_omni/distributed/ray_utils) contains utilities for distributed execution in vllm-omni, supporting both **Ray** and **Multiprocessing** backends.
-## 1. Installation
-```bash
-pip install "ray[default]"
-```
-## 2. Ray Utils
-
-The `ray_utils` module provides helper functions for managing Ray clusters and actors, which is essential for:
-* **Multi-node deployment**: Running pipeline stages across different physical machines.
-* **Resource management**: Efficient GPU/CPU allocation.
-
-### 2.1 Basic Usage
-
-To use the Ray backend, specify `worker_backend="ray"` when initializing the engine.
-
-**Command Line Example:**
-```bash
-vllm serve Qwen/Qwen2.5-Omni-7B \
- --omni \
- --port 8091 \
- --worker-backend ray \
- --ray-address auto
-```
-
-### 2.2 Cluster Setup
-
-**Step 1: Start Head Node**
-Run this on your primary machine:
-```bash
-ray start --head --port=6399
-```
-
-**Step 2: Connect Worker Nodes**
-Run this on each worker machine:
-```bash
-ray start --address=<head-node-ip>:6399
-```
-
-> **Tip**: For a complete cluster setup script, refer to the vLLM example:
-> [run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh)
-
-### 2.3 Distributed Connector Support
-
-When running on Ray, the system automatically adapts its communication strategy:
-
-* **Cross-Node**: Recommended to use `MooncakeTransferEngineConnector` (RDMA, fastest) or `MooncakeStoreConnector` (TCP fallback).
-* **Same-Node**: Can still use `SharedMemoryConnector` for efficiency, or Ray's native object store (plasma).
-* **SHM threshold default differs**: when `worker_backend="ray"`, the SharedMemoryConnector default threshold is set to `sys.maxsize`, which forces payloads to go inline (no SHM). Override `shm_threshold_bytes` in the connector config if you want SHM for Ray runs.
-
-### 2.4 Internal Helpers
-
-* **`initialize_ray_cluster`**: Connects to an existing Ray cluster or starts a local one.
-
-## 3. Troubleshooting
-
-* **Connection Issues**: Ensure the Ray head node is reachable and that its port (6399 in this example) is open.
-* **Version Mismatch**: Ensure all nodes run the same version of Ray and Python.
diff --git a/docs/design/feature/sequence_parallel.md b/docs/design/feature/sequence_parallel.md
deleted file mode 100644
index d0328bcf611..00000000000
--- a/docs/design/feature/sequence_parallel.md
+++ /dev/null
@@ -1,531 +0,0 @@
-# Sequence Parallel
-
-This section describes how to add Sequence Parallel (SP) to a diffusion transformer model. We use the Qwen-Image transformer and Wan2.2 transformer as reference implementations.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [UAA Mode (Experimental)](#uaa-mode-experimental)
-- [Approach 1: Non-Intrusive `_sp_plan` (Recommended)](#approach-1-non-intrusive-_sp_plan-recommended)
-- [Approach 2: Intrusive Modification (For Complex Cases)](#approach-2-intrusive-modification-for-complex-cases)
-- [Testing](#testing)
-- [Troubleshooting](#troubleshooting)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-
-### What is Sequence Parallel?
-
-**Terminology Note:** Our "Sequence Parallelism" (SP) corresponds to "Context Parallelism" (CP) in the [diffusers library](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/_modeling_parallel.py). We use "Sequence Parallelism" to align with vLLM-Omni's terminology.
-
-Diffusion transformers process long sequences of image patches or video frames. For high-resolution generation, these sequences can become very large. Enabling SP allows each GPU to process only a portion of the sequence, with attention mechanisms (Ulysses/Ring) handling cross-GPU communication transparently.
-
-### Architecture
-
-The major APIs for Sequence Parallel:
-
-```python
-from vllm_omni.diffusion.distributed.sp_plan import (
-    SequenceParallelInput,   # For sharding (splitting) tensors
-    SequenceParallelOutput,  # For gathering tensors
-)
-from vllm_omni.diffusion.distributed.sp_sharding import sp_shard, sp_gather
-```
-
-| Method/Class | Purpose | Behavior |
-|--------------|---------|----------|
-| `SequenceParallelInput` | Declare input sharding in `_sp_plan` | Auto-shards tensors at module input |
-| `SequenceParallelOutput` | Declare output gathering in `_sp_plan` | Auto-gathers tensors at module output |
-| `sp_shard()` | Manual tensor sharding | Splits tensor across SP workers |
-| `sp_gather()` | Manual tensor gathering | Gathers sharded tensors from all workers |
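The shard/gather semantics in the table can be illustrated on plain Python lists. This is a toy model only: `toy_shard` and `toy_gather` are hypothetical helpers, not the real `sp_shard()` / `sp_gather()`, which operate on tensors and use collective communication across SP workers.

```python
def toy_shard(seq, rank, world_size):
    """Return the contiguous chunk of `seq` owned by `rank`."""
    assert len(seq) % world_size == 0, "strict sharding needs a divisible length"
    chunk = len(seq) // world_size
    return seq[rank * chunk:(rank + 1) * chunk]


def toy_gather(shards):
    """Concatenate per-rank chunks back into the full sequence, in rank order."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


tokens = list(range(8))
shards = [toy_shard(tokens, rank, world_size=4) for rank in range(4)]
restored = toy_gather(shards)
```

Each rank works on its own two-token chunk, and gathering in rank order reproduces the original sequence.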
-
----
-
-## UAA Mode (Experimental)
-
-`ulysses_mode="advanced_uaa"` enables the experimental UAA ("Ulysses Anything Attention") feature, which lets Ulysses attention handle arbitrary sequence lengths and arbitrary attention head counts. The same idea is also supported by [Cache-DiT](https://cache-dit.readthedocs.io/en/latest/user_guide/CONTEXT_PARALLEL/#uaa-ulysses-anything-attention).
-
-Use it when plain Ulysses-SP would otherwise fail because:
-
-- the local sequence shards are not evenly divisible after split hooks, or
-- the attention head count is not divisible by `ulysses_degree`.
-
-### Design Summary
-
-1. **Strict mode stays unchanged.**
- `ulysses_mode="strict"` keeps the original fast path and still requires divisible sequence/head shapes.
-
-2. **UAA uses variable all-to-all split sizes for sequence shards.**
- Before the Ulysses Q/K/V exchange, each rank all-gathers its local sequence length and uses those lengths as `all_to_all_single(..., output_split_sizes=seq_lens)`. This lets Ulysses gather the full sequence even when each rank started with a different local shard length.
-
-3. **UAA pads heads only inside the Ulysses exchange.**
- If `head_cnt % ulysses_degree != 0`, UAA pads the head dimension up to the next multiple of `ulysses_degree`, performs the forward/reverse all-to-all, then slices the temporary head padding away after the reverse exchange. The same rule is applied to joint attention tensors.
-
-4. **Hybrid Ulysses + Ring is still shape-constrained.**
- Ring attention expects every rank in a ring group to exchange exactly the same post-Ulysses sequence shape. UAA therefore validates those shapes before entering the ring path and raises a clear error if ring peers disagree on `S_global`.
-
-5. **Tiny scalar gathers stay out of TorchDynamo tracing.**
- `_all_gather_int()` is marked with `@torch.compiler.disable` so the scalar `item()` conversions used by UAA metadata collection do not get traced into `torch.compile`.
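The arithmetic behind items 2 and 3 can be sketched in a few lines. Both helpers are hypothetical and exist only for illustration; in particular, the way uneven shard lengths arise here (the first ranks take one extra token) is an assumption, since the text only says that ranks may hold different local lengths.

```python
def padded_head_count(head_cnt, ulysses_degree):
    """Round head_cnt up to the next multiple of ulysses_degree."""
    return -(-head_cnt // ulysses_degree) * ulysses_degree


def uneven_shard_lengths(seq_len, world_size):
    """One possible uneven split: the first ranks receive one extra token."""
    base, extra = divmod(seq_len, world_size)
    return [base + (1 if rank < extra else 0) for rank in range(world_size)]


heads = padded_head_count(head_cnt=10, ulysses_degree=4)
pad = heads - 10  # temporary heads, sliced away after the reverse exchange
seq_lens = uneven_shard_lengths(seq_len=10, world_size=4)
```

With 10 heads and `ulysses_degree=4`, two temporary heads are added so that every rank exchanges 3 heads, and the per-rank lengths in `seq_lens` are the kind of values that would be passed as `output_split_sizes`.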
-
-### UAA vs `auto_pad`
-
-- `auto_pad=True` pads sequence tokens in `_sp_plan` and requires attention backends that support `attention_mask`.
-- `advanced_uaa` does not depend on mask-based token padding inside Ulysses attention. It is therefore a better fit for non-divisible head counts and uneven Ulysses shard sizes.
-- `auto_pad=True` remains incompatible with Ring attention because the ring backend does not consume `attention_mask`.
-- `advanced_uaa` is still experimental and hybrid mode remains limited by Ring's equal-shape requirement.
-
----
-
-## Approach 1: Non-Intrusive `_sp_plan` (Recommended)
-
-The `_sp_plan` mechanism allows SP **without modifying `forward()` logic**. The framework automatically registers hooks to shard inputs and gather outputs at module boundaries.
-
-**When to use:**
-- Standard transformer architectures
-- Tensor operations happen at `nn.Module` boundaries
-- Predictable sharding/gathering patterns
-
-This is the preferred approach for adding sequence parallelism to new models, because it is easier to maintain and keeps the model compatible with other acceleration features.
-
-**How it works:**
-1. Declare `_sp_plan` dict in your transformer class
-2. Framework automatically applies hooks when `sequence_parallel_size > 1`
-3. Hooks shard/gather tensors at specified module boundaries
-4. Attention layers handle cross-GPU communication internally
-
-```python
-class StandardTransformer(nn.Module):
-    _sp_plan = {
-        # Shard hidden_states at first transformer block input
-        "blocks.0": {
-            "hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
-        },
-        # Gather at final output projection
-        "proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
-    }
-```
-
-`StandardTransformer` has a list of transformer blocks, `self.blocks = nn.ModuleList([...])`, and an output projection layer, `self.proj_out`. The `_sp_plan` above specifies that, when SP is enabled, the input tensor is sharded at the first transformer block and the sharded tensor is gathered at the final output projection layer.
-
-**Requirements:**
-- Tensor operations that need sharding/gathering must happen at **`nn.Module` boundaries**
-- Inline Python operations (e.g., `torch.cat`, `pad_sequence`) **cannot be hooked**
-
-**Solution for inline operations:** Extract into a submodule (see Step 2 below).
-
-### Step 1: Understand Module Boundaries
-
-Identify where tensors need to be sharded or gathered in your model's forward pass:
-
-```python
-class MyTransformer(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.patch_embed = PatchEmbed()      # ← Boundary 1
-        self.pos_embed = RoPE()              # ← Boundary 2
-        self.blocks = nn.ModuleList([...])   # ← Boundary 3
-        self.norm_out = LayerNorm()
-        self.proj_out = Linear()             # ← Boundary 4
-
-    def forward(self, x):
-        x = self.patch_embed(x)        # ← Shard before this?
-        pos = self.pos_embed(x)        # ← Shard RoPE outputs?
-        for block in self.blocks:
-            x = block(x, pos)          # ← Blocks process sharded x
-        x = self.norm_out(x)
-        output = self.proj_out(x)      # ← Gather after this?
-        return output
-```
-
-### Step 2: Handle Inline Operations
-
-If your `forward()` contains inline tensor operations, **extract them into submodules**.
-
-**Example: Z-Image concatenates image + text features inline**
-
-```python
-# ❌ BAD: Inline operation - hooks cannot intercept
-class ZImageTransformer(nn.Module):
-    def forward(self, x, cap_feats):
-        # This concatenation happens inline - _sp_plan can't shard it!
-        unified = torch.cat([x, cap_feats], dim=1)
-
-        for layer in self.layers:
-            unified = layer(unified)
-
-        return unified
-
-# ✅ GOOD: Extract into submodule
-class UnifiedPrepare(nn.Module):
-    """Submodule to concatenate image and text features."""
-    def forward(self, x, cap_feats):
-        return torch.cat([x, cap_feats], dim=1)
-
-class ZImageTransformer(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.unified_prepare = UnifiedPrepare()  # Now a module!
-        self.layers = nn.ModuleList([...])
-
-    def forward(self, x, cap_feats):
-        # Now _sp_plan can shard the output of unified_prepare!
-        unified = self.unified_prepare(x, cap_feats)
-
-        for layer in self.layers:
-            unified = layer(unified)
-
-        return unified
-```
-
-**Other common cases:**
-- `pad_sequence()` → `PadSequenceModule`
-- `torch.cat()` → `ConcatModule`
-- `tensor.reshape()` → `ReshapeModule`
-- Complex preprocessing → `PreprocessModule`
-
-### Step 3: Write `_sp_plan` for Your Model
-
-Create a class-level `_sp_plan` dictionary specifying where to shard/gather tensors.
-
-Typically, there are two patterns for diffusion models:
-
-**Pattern 1: Shard at first block, gather at output projection**
-
-Most common pattern for standard transformers:
-
-```python
-from vllm_omni.diffusion.distributed.sp_plan import (
-    SequenceParallelInput,   # For sharding (splitting) tensors
-    SequenceParallelOutput,  # For gathering tensors
-)
-class StandardTransformer(nn.Module):
-    _sp_plan = {
-        # Shard hidden_states at first transformer block input
-        "blocks.0": {
-            "hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
-        },
-        # Gather at final output projection
-        "proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
-    }
-```
-
-**Pattern 2: Shard RoPE embeddings separately**
-
-When RoPE is computed in a separate module:
-
-```python
-from vllm_omni.diffusion.distributed.sp_plan import (
-    SequenceParallelInput,   # For sharding (splitting) tensors
-    SequenceParallelOutput,  # For gathering tensors
-)
-class TransformerWithRoPE(nn.Module):
-    _sp_plan = {
-        # Shard RoPE module OUTPUTS (returns tuple of cos, sin)
-        "rope": {
-            0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),  # cos
-            1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),  # sin
-        },
-        # Shard transformer block INPUT
-        "blocks.0": {
-            "hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
-        },
-        # Gather at output
-        "proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
-    }
-```
-
-**Pattern 3: Shard RoPE for Dual Stream Attention**
-In some cases, different streams in attention may need to handle sequence parallelism differently. For example, we may want to shard the image embeddings, while replicating the text embeddings to correctly configure joint attention.
-
-```python
-class DualStreamTransformer(nn.Module):
-    """
-    Dual-stream model where we need to replicate the text components, but shard
-    the image components to correctly handle sequence parallelism.
-    """
-    _sp_plan = {
-        # In this case, the rope_preparer returns a tuple of len 4, where the
-        # first 2 items correspond to the text, and the second 2 correspond to
-        # visual inputs, so we only shard the second.
-        "rope_preparer": {
-            # Outputs 0, 1 (text) - NOT sharded (replicated)
-            # Outputs 2, 3 (image) - sharded
-            2: SequenceParallelInput(split_dim=0, expected_dims=2, split_output=True),  # img_cos
-            3: SequenceParallelInput(split_dim=0, expected_dims=2, split_output=True),  # img_sin
-        },
-        # Shard transformer block INPUT
-        "transformer_blocks.0": {
-            "hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3),
-        },
-        # Gather at output
-        "proj_out": SequenceParallelOutput(gather_dim=1, expected_dims=3),
-    }
-```
-
-NOTE: be careful to test adequately when refactoring classes that take this style of plan, as changing the order of the return values will break sequence parallelism.
-
-### API Reference
-
-**SequenceParallelInput Parameters:**
-
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| `split_dim` | int | Dimension to split (usually `1` for sequence) |
-| `expected_dims` | int \| None | Expected tensor rank for validation (optional) |
-| `split_output` | bool | `False`: shard **input** params; `True`: shard **output** tensors |
-| `auto_pad` | bool | Auto-pad if sequence not divisible by world_size (default: `False`) |
-
-**SequenceParallelOutput Parameters:**
-
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| `gather_dim` | int | Dimension to gather (usually `1` for sequence) |
-| `expected_dims` | int \| None | Expected tensor rank for validation (optional) |
-
-**Module Naming Conventions:**
-
-| Key | Meaning | Python equivalent |
-|-----|---------|-------------------|
-| `""` | Root model | `model` |
-| `"blocks.0"` | First element of ModuleList | `model.blocks[0]` |
-| `"blocks.*"` | All elements of ModuleList | `for b in model.blocks` |
-| `"rope"` | Named submodule | `model.rope` |
-| `"outputs.main"` | ModuleDict entry | `model.outputs["main"]` |
-
-**Dictionary Value Types:**
-
-| Key type | `split_output` | Description |
-|----------|----------------|-------------|
-| `"param_name"` (str) | `False` | Shard **input parameter** by name |
-| `0`, `1`, ... (int) | `True` | Shard **output tuple** by index |
-
----
-
-## Approach 2: Intrusive Modification (For Complex Cases)
-
-For models with dynamic sharding logic that cannot be expressed via `_sp_plan`, manually insert shard/gather calls.
-
-
-**When to use:**
-- Dynamic/conditional sharding logic
-- Complex tensor manipulations that can't be encapsulated
-- Temporary workaround during development
-
-```python
-from vllm_omni.diffusion.distributed.sp_sharding import sp_shard, sp_gather
-
-def forward(self, hidden_states, ...):
-    if self.parallel_config.sequence_parallel_size > 1:
-        hidden_states = sp_shard(hidden_states, dim=1)
-
-    # ... computation ...
-
-    if self.parallel_config.sequence_parallel_size > 1:
-        output = sp_gather(output, dim=1)
-
-    return output
-```
-
----
-
-## Testing
-
-After implementing Sequence Parallel support, thoroughly test your implementation to ensure correctness and performance across different configurations.
-
-**Test Different `sp_size`:**
-
-Test your model with various sequence parallel world sizes to verify correctness and identify optimal configurations:
-
-```bash
-cd examples/offline_inference/text_to_image
-python text_to_image.py \
- --model Your-org/your-model \
- --prompt "a cup of coffee on the table" \
- --num-inference-steps 50 \
- --ulysses-degree 2 \
- --ring-degree 2 \
- --output sp_test_image_ulysses=2_ring=2.png
-```
-
-**Verify:**
-
-1. **Correctness:** Output should be identical across all `sp_size` values
-2. **Speed:** Throughput should remain stable or improve (especially for large sequences)
-3. **Logs:** Check for any shape mismatch or communication errors
-
-**Test with Tensor Parallel:**
-
-Sequence Parallel can be combined with other parallelism strategies:
-
-```bash
-cd examples/offline_inference/text_to_image
-python text_to_image.py \
- --model Your-org/your-model \
- --prompt "a cup of coffee on the table" \
- --num-inference-steps 50 \
- --ulysses-degree 2 \
- --tensor-parallel-size 2 \
- --output sp_test_image_ulysses=2_tp=2.png
-```
-
----
-
-## Troubleshooting
-
-### Issue: Shape mismatch errors
-
-**Symptoms:** `RuntimeError: shape mismatch` during forward pass.
-
-**Causes & Solutions:**
-
-- **RoPE dimension mismatch:**
-
-**Problem:** RoPE embeddings not sharded, but hidden_states is sharded.
-
-**Solution:** Shard RoPE outputs in `_sp_plan`:
-```python
-_sp_plan = {
-    "rope": {
-        0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),
-        1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),
-    },
-    ...
-}
-```
-
-- **Sequence Length not divisible by sp_size:**
-
-**Problem:** strict Ulysses sequence parallel requires divisible shapes. If the shard length is uneven, or if the model head count is not divisible by `ulysses_degree`, the strict path will raise an error.
-
-**Solutions:**
-
-1. Use `ulysses_mode="advanced_uaa"` for Ulysses-SP when you want the experimental uneven-shape path without relying on attention-mask padding.
-2. If the model already uses `_sp_plan` token padding and the attention backend supports `attention_mask`, set `auto_pad=True` and add attention-mask plumbing.
-
-> **Experimental Feature:** `ulysses_mode="advanced_uaa"` is experimental. It is intended to relax Ulysses divisibility constraints, but hybrid Ulysses + Ring still requires equal post-Ulysses sequence lengths inside each ring group.
-
-> **Experimental Feature:** `auto_pad=True` is an experimental feature and may be changed in the future. We plan to improve this solution to involve minimal changes to model files. More details are [here](https://github.com/vllm-project/vllm-omni/issues/1324).
-
-**Constraints of auto_pad:**
-
-| Constraint | Description |
-|------------|-------------|
-| **Attention Backend Compatibility** | The attention backends must support `attention_mask`. Currently only `TORCH_SDPA` and `FLASH_ATTN` (default for diffusion models) are supported. |
-| **Ring Attention Limitation** | Ring attention does not support `attention_mask`. Therefore, when using `auto_pad=True`, the combination of Ulysses + Ring attention is not feasible. |
-
-1. Enable `auto_pad=True` for all sequence-dimension inputs in `_sp_plan`:
-```python
-_sp_plan = {
-    "rope": {
-        0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True, auto_pad=True),
-        1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True, auto_pad=True),
-    },
-    "blocks.0": {
-        "hidden_states": SequenceParallelInput(split_dim=1, expected_dims=3, auto_pad=True)
-    },
-    ...
-}
-```
-
-2. Create attention mask dynamically when padding is applied:
-```python
-from vllm_omni.diffusion.forward_context import get_forward_context
-from vllm_omni.diffusion.attention.backends.abstract import AttentionMetadata
-
-# In model forward(), before transformer blocks:
-hidden_states_mask = None
-ctx = get_forward_context()
-if ctx.sp_original_seq_len is not None and ctx.sp_padding_size > 0:
-    batch_size = hidden_states.shape[0]
-    padded_seq_len = ctx.sp_original_seq_len + ctx.sp_padding_size
-    hidden_states_mask = torch.ones(batch_size, padded_seq_len, dtype=torch.bool, device=hidden_states.device)
-    hidden_states_mask[:, ctx.sp_original_seq_len:] = False
-
-# Pass mask to attention layers
-attn_metadata = AttentionMetadata(attn_mask=hidden_states_mask) if hidden_states_mask is not None else None
-output = self.attn(query, key, value, attn_metadata)
-```
-
-**Important Quality Considerations:**
-
-While `auto_pad` enables generation for irregular resolutions, be aware of potential quality impacts:
-
-| Aspect | Impact |
-|--------|--------|
-| **Training Distribution** | Models perform best on aspect ratios seen during training (e.g., 1:1, 16:9, 4:3). Unusual ratios like 700x400 (1.75:1) may produce lower quality results. |
-| **Padding Overhead** | Padded positions consume compute even when masked. For best efficiency, prefer resolutions divisible by `sp_size`. |
-
-**Recommendations for users:**
-- Use standard aspect ratios when possible (e.g., 768x432 for 16:9 instead of 700x400)
-- Ensure post-patch dimensions are divisible by `sp_size` for optimal quality
-- Test generation quality when using unusual resolutions
-
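To check ahead of time whether a sequence length will need padding, the arithmetic behind `auto_pad` can be sketched as below. This is a minimal illustration in plain Python, not the library's implementation; the helper names `sp_padding` and `padding_mask` are hypothetical, and the mask follows the same convention as the snippet in step 2 (`True` for real positions, `False` for padded ones).

```python
def sp_padding(seq_len: int, sp_size: int) -> int:
    """Number of positions to append so that seq_len becomes divisible by sp_size."""
    if sp_size <= 0:
        raise ValueError("sp_size must be positive")
    return (-seq_len) % sp_size


def padding_mask(seq_len: int, sp_size: int) -> list[bool]:
    """Boolean mask over the padded sequence: True = real token, False = padding."""
    pad = sp_padding(seq_len, sp_size)
    return [True] * seq_len + [False] * pad


# Hypothetical sequence lengths, chosen only for illustration.
for length in (1024, 1075):
    print(length, "needs", sp_padding(length, 4), "padded position(s) for sp_size=4")
```

A result of `0` means the resolution shards evenly and no attention mask is needed, which is the case the recommendations above steer towards.
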
-### Issue: Inline operations not sharded
-
-**Symptoms:** Some tensors remain full-sized, not sharded.
-
-**Causes & Solutions:**
-
-- **Operations happen inline in `forward()`, not at module boundaries:**
-
-**Problem:**
-```python
-def forward(self, x, cap):
- unified = torch.cat([x, cap], dim=1) # ← Inline operation!
- # _sp_plan can't hook this
-```
-
-**Solution:** Extract into submodule:
-```python
-class ConcatModule(nn.Module):
-    def forward(self, x, cap):
-        return torch.cat([x, cap], dim=1)
-
-class MyModel(nn.Module):
-    _sp_plan = {
-        "concat": {
-            0: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),
-            1: SequenceParallelInput(split_dim=1, expected_dims=4, split_output=True),
-        },
-        ...
-    }
-
-    def __init__(self):
-        super().__init__()  # Required before assigning submodules on an nn.Module
-        self.concat = ConcatModule()  # Now hookable!
-
-    def forward(self, x, cap):
-        unified = self.concat(x, cap)  # ← Can be sharded via _sp_plan
-```
-
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **LongCat** | `vllm_omni/diffusion/models/longcat_image/longcat_image_transformer.py` | Dual-stream | Text components replicated, image components sharded |
-| **Qwen-Image** | `vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py` | Dual-stream + preprocessing | auto_pad, separate RoPE |
-| **Wan2.2** | `vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py` | Dual-Transformer + RoPE | Video transformer |
-| **Z-Image** | `vllm_omni/diffusion/models/z_image/z_image_transformer.py` | Unified sequence | Concatenated input |
-| **SP Plan Types** | `vllm_omni/diffusion/distributed/sp_plan.py` | Type definitions | SequenceParallelInput/Output |
-| **Hook Implementation** | `vllm_omni/diffusion/hooks/sequence_parallel.py` | Hook mechanics | How hooks work |
-| **Tests** | `tests/diffusion/distributed/test_sp_plan_hooks.py` | Test examples | Validation patterns |
-
----
-
-## Summary
-
-Adding Sequence Parallel support to a transformer:
-
-1. ✅ **Choose approach** - Use `_sp_plan` for standard cases, intrusive modification for complex cases
-2. ✅ **Identify sharding boundaries** - Where should tensors be split/gathered? And which module boundaries need to be moved to facilitate this?
-3. ✅ **Extract inline operations** - Move `torch.cat`, `pad_sequence`, etc. to submodules
-4. ✅ **Define `_sp_plan`** - Declare shard/gather points as class attribute
-5. ✅ **Use `auto_pad` for variable lengths** - Support non-uniform sequences
-6. ✅ **Test** - Verify with different `ulysses_degree` and `ring_degree` combinations
diff --git a/docs/design/feature/teacache.md b/docs/design/feature/teacache.md
deleted file mode 100644
index 8577cff1f05..00000000000
--- a/docs/design/feature/teacache.md
+++ /dev/null
@@ -1,491 +0,0 @@
-# TeaCache
-
-This section describes how to add TeaCache to a diffusion transformer model. We use the Qwen-Image transformer as the reference implementation.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation](#step-by-step-implementation)
-- [Customization](#customization)
-- [Testing](#testing)
-- [Troubleshooting](#troubleshooting)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is TeaCache?
-
-TeaCache speeds up diffusion inference by caching transformer block computations when consecutive timesteps are similar. It provides **1.5x-2.0x speedup** with minimal quality loss.
-
-The core insight is that the modulated input (after normalization and timestep conditioning) changes gradually across timesteps. By measuring the L1 distance between consecutive modulated inputs and comparing it to a threshold, TeaCache decides whether to execute the full transformer blocks or reuse the cached residual from the previous step.
-
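The decision rule described above can be sketched in a few lines. This is a simplified, framework-free illustration that assumes the accumulate-and-reset behaviour of the published TeaCache method and a highest-degree-first coefficient order; the class `CacheDecider` and its method names are hypothetical and are not part of vLLM-Omni.

```python
def poly_eval(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) with Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def relative_l1(current, previous):
    """Mean absolute change, relative to the mean magnitude of the previous input."""
    diff = sum(abs(c - p) for c, p in zip(current, previous)) / len(current)
    scale = sum(abs(p) for p in previous) / len(previous)
    return float("inf") if scale == 0 else diff / scale


class CacheDecider:
    """Tracks the rescaled distance and decides whether to run the transformer blocks."""

    def __init__(self, coeffs, rel_l1_thresh):
        self.coeffs = coeffs
        self.rel_l1_thresh = rel_l1_thresh
        self.previous = None
        self.accumulated = 0.0

    def should_compute(self, modulated_input):
        if self.previous is None:
            compute = True  # First step: nothing is cached yet
        else:
            rel = relative_l1(modulated_input, self.previous)
            self.accumulated += poly_eval(self.coeffs, rel)
            compute = self.accumulated >= self.rel_l1_thresh
        if compute:
            self.accumulated = 0.0  # A full pass refreshes the cached residual
        self.previous = list(modulated_input)
        return compute
```

When `should_compute` returns `False`, the hook would add the cached residual to the hidden states instead of running the blocks. A higher `rel_l1_thresh` therefore skips more steps at the cost of fidelity.
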
-vLLM-omni provides a **hook-based** TeaCache system that requires **zero changes to model code**. The hook completely intercepts the transformer's forward pass and implements adaptive caching transparently. This design allows easy integration with any transformer model by simply writing an extractor function.
-
-### Architecture
-
-The TeaCache system consists of three main components:
-
-| Component | Purpose | Location |
-|-----------|---------|----------|
-| [`CacheContext`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/#vllm_omni.diffusion.cache.CacheContext) | Dataclass containing model-specific information for caching | `vllm_omni/diffusion/cache/teacache/context.py` |
-| [`EXTRACTOR_REGISTRY`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/teacache/extractors/#vllm_omni.diffusion.cache.teacache.extractors.EXTRACTOR_REGISTRY) | Maps transformer class names to extractor functions | `vllm_omni/diffusion/cache/teacache/extractors.py` |
-| [`TeaCacheConfig`](https://docs.vllm.ai/projects/vllm-omni/en/latest/api/vllm_omni/diffusion/cache/#vllm_omni.diffusion.cache.TeaCacheConfig) | Configuration including thresholds and polynomial coefficients | `vllm_omni/diffusion/cache/teacache/config.py` |
-
-The hook handles all caching logic automatically, including:
-
-- CFG-aware state management (separate states for positive/negative branches)
-- CFG-parallel compatibility
-- L1 distance computation with polynomial rescaling
-- Residual caching and reuse
-
-
----
-
-## Step-by-Step Implementation
-
-To add TeaCache support for a new model, you need to:
-
-1. Write an **extractor function** that returns a `CacheContext` object
-2. Register the extractor in the `EXTRACTOR_REGISTRY`
-3. Add model-specific polynomial coefficients to `TeaCacheConfig`
-
-### Step 1: Model-Specific Preprocessing
-
-Extract and process model inputs. This typically involves:
-- Embedding image/latent inputs
-- Processing text encoder outputs (if dual-stream)
-- Creating timestep embeddings
-- Computing positional embeddings
-
-**Example (Qwen-Image):**
-
-```python
-def extract_qwen_context(
- module: nn.Module,
- hidden_states: torch.Tensor,
- encoder_hidden_states: torch.Tensor,
- encoder_hidden_states_mask: torch.Tensor,
- timestep: torch.Tensor,
- img_shapes: torch.Tensor,
- txt_seq_lens: torch.Tensor,
- guidance: torch.Tensor | None = None,
- **kwargs: Any,
-) -> CacheContext:
- # Validate model structure
- if not hasattr(module, "transformer_blocks") or len(module.transformer_blocks) == 0:
- raise ValueError("Module must have transformer_blocks")
-
- # Preprocessing: embed inputs
- hidden_states = module.img_in(hidden_states)
- timestep = timestep.to(device=hidden_states.device, dtype=hidden_states.dtype)
- encoder_hidden_states = module.txt_norm(encoder_hidden_states)
- encoder_hidden_states = module.txt_in(encoder_hidden_states)
-
- # Create timestep embedding
- if guidance is not None:
- guidance = guidance.to(hidden_states.dtype) * 1000
- temb = (
- module.time_text_embed(timestep, hidden_states)
- if guidance is None
- else module.time_text_embed(timestep, guidance, hidden_states)
- )
-
- # Compute position embeddings
- image_rotary_emb = module.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)
-```
-
-### Step 2: Extract Modulated Input
-
-The modulated input is used for cache decisions. Extract it from the **first transformer block** after normalization and modulation.
-
-**Example (Qwen-Image):**
-
-```python
- # Extract modulated input from first transformer block
- block = module.transformer_blocks[0]
- img_mod_params = block.img_mod(temb)
- img_mod1, _ = img_mod_params.chunk(2, dim=-1)
- img_modulated, _ = block.img_norm1(hidden_states, img_mod1)
-```
-
-**Key Points:**
-
-- Use the **first block** to extract modulated input early
-- Apply the same normalization and modulation as the actual forward pass
-- The tensor should represent the processed features that will change across timesteps
-
-### Step 3: Define Transformer Execution
-
-Create a callable that executes all transformer blocks. This encapsulates the main computation loop.
-
-**Example (Qwen-Image dual-stream):**
-
-```python
- def run_transformer_blocks():
- """Execute all Qwen transformer blocks."""
- h = hidden_states
- e = encoder_hidden_states
-
- for block in module.transformer_blocks:
- e, h = block(
- hidden_states=h,
- encoder_hidden_states=e,
- encoder_hidden_states_mask=encoder_hidden_states_mask,
- temb=temb,
- image_rotary_emb=image_rotary_emb,
- )
- return (h, e) # Return both image and text hidden states
-```
-
-**Example (Single-stream model like Flux):**
-
-```python
- def run_transformer_blocks():
- """Execute all Flux transformer blocks."""
- h = hidden_states
-
- for block in module.transformer_blocks:
- h = block(h, temb=temb)
- return (h,) # Return only image hidden states
-```
-
-**Key Points:**
-
-- Return format:
-  - For single-stream models: return `(hidden_states,)`
-  - For dual-stream models: return `(hidden_states, encoder_hidden_states)`
-
-### Step 4: Define Postprocessing
-
-Create a callable that applies final transformations to produce the model output.
-
-**Example (Qwen-Image):**
-
-```python
- return_dict = kwargs.get("return_dict", True)
-
- def postprocess(h):
- """Apply Qwen-specific output postprocessing."""
- h = module.norm_out(h, temb)
- output = module.proj_out(h)
- if not return_dict:
- return (output,)
- return Transformer2DModelOutput(sample=output)
-```
-
-### Step 5: Return CacheContext
-
-Package all information into a `CacheContext` object.
-
-```python
- return CacheContext(
- modulated_input=img_modulated,
- hidden_states=hidden_states,
- encoder_hidden_states=encoder_hidden_states, # or None for single-stream
- temb=temb,
- run_transformer_blocks=run_transformer_blocks,
- postprocess=postprocess,
- )
-```
-
-**CacheContext Fields:**
-
-| Field | Type | Purpose |
-|-------|------|---------|
-| `modulated_input` | `torch.Tensor` | Tensor used for cache decision (similarity comparison) |
-| `hidden_states` | `torch.Tensor` | Current hidden states (will be modified by caching) |
-| `encoder_hidden_states` | `torch.Tensor | None` | Encoder states for dual-stream models, `None` for single-stream |
-| `temb` | `torch.Tensor` | Timestep embedding tensor |
-| `run_transformer_blocks` | `Callable[[], tuple]` | Executes transformer blocks, returns `(hidden_states, [encoder_hidden_states])` |
-| `postprocess` | `Callable[[torch.Tensor], Any]` | Applies final transformations to produce model output |
-| `extra_states` | `dict | None` | Optional dict for additional model-specific state |
-
-### Step 6: Register the Extractor
-
-Add your extractor to the `EXTRACTOR_REGISTRY` in `vllm_omni/diffusion/cache/teacache/extractors.py`:
-
-```python
-EXTRACTOR_REGISTRY: dict[str, Callable] = {
- "QwenImageTransformer2DModel": extract_qwen_context,
- "Bagel": extract_bagel_context,
- "ZImageTransformer2DModel": extract_zimage_context,
- "YourModelTransformer2DModel": extract_your_model_context, # Add here
-}
-```
-
-**Key:** Use the transformer class name (`module.__class__.__name__`)
-
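The lookup by class name can be pictured with the following sketch. It uses a stand-in registry and toy classes, and the helper `get_extractor` and its error text are assumptions made for illustration; the real lookup lives inside the TeaCache hook.

```python
def extract_toy_context(module, *args, **kwargs):
    """Placeholder extractor; a real one returns a CacheContext."""
    return {"model": type(module).__name__}


# Stand-in for the registry in extractors.py: class name -> extractor function.
EXTRACTOR_REGISTRY = {"ToyTransformer2DModel": extract_toy_context}


def get_extractor(module):
    """Resolve the extractor for a module by its exact class name."""
    name = module.__class__.__name__
    try:
        return EXTRACTOR_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(EXTRACTOR_REGISTRY))
        raise ValueError(f"Unknown model type: {name!r} (registered: {known})") from None


class ToyTransformer2DModel:
    pass


class UnregisteredModel:
    pass
```

Because the key is compared as a plain string, a subclass or a renamed wrapper class does not inherit its parent's registration and needs its own entry.
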
-### Step 7: Add Model Coefficients
-
-Add polynomial rescaling coefficients to `vllm_omni/diffusion/cache/teacache/config.py`:
-
-```python
-_MODEL_COEFFICIENTS = {
- "QwenImageTransformer2DModel": [
- -4.50000000e02,
- 2.80000000e02,
- -4.50000000e01,
- 3.20000000e00,
- -2.00000000e-02,
- ],
- "YourModelTransformer2DModel": [ # Add your model's coefficients
- # 5 polynomial coefficients (can reuse similar model's coefficients initially)
- ],
-}
-```
-
-
-**Initial approach:** Start with coefficients from a similar model architecture, then tune them empirically as described in the [Customization](#customization) section.
-
----
-
-## Customization
-
-### Coefficient Estimation
-
-While you can start with coefficients from a similar model architecture, estimating custom coefficients for your specific model typically improves TeaCache performance.
-
-**Why Estimate Coefficients?**
-
-The polynomial coefficients rescale L1 distances between consecutive modulated inputs to better predict when cached residuals can be reused. Model-specific coefficients account for:
-
-- Architecture differences (layer count, hidden size, attention patterns)
-- Training data characteristics
-- Noise prediction behavior across timesteps
-
-| Approach | Performance | Effort |
-|----------|-------------|--------|
-| Using defaults from similar model | Within 5-10% of optimal | Low |
-| Estimating custom coefficients | Best performance | Medium |
-
-#### Implement Data Collection Adapter
-
-Add an adapter in `vllm_omni/diffusion/cache/teacache/coefficient_estimator.py`:
-
-```python
-class YourModelAdapter:
- """Adapter for coefficient estimation on your model."""
-
- @staticmethod
- def load_pipeline(model_path: str, device: str, dtype: torch.dtype) -> Any:
- """Load your diffusion pipeline."""
- from your_model_package import YourModelPipeline
-
- pipeline = YourModelPipeline.from_pretrained(
- model_path,
- torch_dtype=dtype,
- )
- pipeline = pipeline.to(device)
- return pipeline
-
- @staticmethod
- def get_transformer(pipeline: Any) -> tuple[Any, str]:
- """Extract transformer from pipeline."""
- return pipeline.transformer, "YourModelTransformer2DModel"
-
- @staticmethod
- def install_hook(transformer: Any, hook: DataCollectionHook) -> None:
- """Install data collection hook on transformer."""
- from vllm_omni.diffusion.hooks import HookRegistry
-
- registry = HookRegistry.get_or_create(transformer)
- registry.register_hook(hook._HOOK_NAME, hook)
-
-
-# Register your adapter
-_MODEL_ADAPTERS["YourModel"] = YourModelAdapter
-```
-
-#### Collect Data and Estimate
-
-```python
-from vllm_omni.diffusion.cache.teacache.coefficient_estimator import (
- TeaCacheCoefficientEstimator,
-)
-from datasets import load_dataset
-from tqdm import tqdm
-
-# Initialize estimator
-estimator = TeaCacheCoefficientEstimator(
- model_path="/path/to/your/model",
- model_type="YourModel",
-)
-
-# Load diverse prompts (paper recommends ~70 prompts)
-dataset = load_dataset("nateraw/parti-prompts", split="train")
-prompts = dataset["Prompt"][:70]
-
-# Collect data
-for prompt in tqdm(prompts, desc="Collecting data"):
- estimator.collect_from_prompt(prompt=prompt, num_inference_steps=50)
-
-# Estimate coefficients
-coeffs = estimator.estimate(poly_order=4)
-print(f"Estimated coefficients: {coeffs}")
-```
-
-Note: some models can only construct their vLLM modules after the vLLM context and config have been initialized. In that case, a workaround like the following may be needed to run coefficient estimation:
-```python
-import os
-
-from vllm_omni.diffusion.forward_context import set_forward_context
-from vllm_omni.diffusion.distributed.parallel_state import (
-    init_distributed_environment,
-    initialize_model_parallel,
-)
-from vllm.config import VllmConfig
-...
-
-if __name__ == "__main__":
-    # Minimal single-process distributed environment
-    os.environ["MASTER_ADDR"] = "localhost"
-    os.environ["MASTER_PORT"] = "8192"
-    os.environ["LOCAL_RANK"] = "0"
-    os.environ["RANK"] = "0"
-    os.environ["WORLD_SIZE"] = "1"
-
-    vllm_config = VllmConfig()
-    init_distributed_environment()
-    initialize_model_parallel()
-
-    # NOTE: you may have to pass an initialized OmniDiffusionConfig as a kwarg
-    # here to make current sp checks happy; if this is the case, just create one
-    # .from_kwargs() with the model name to get around this check for now,
-    # since your estimator subclass should handle the actual model configuration.
-    #
-    # This will be cleaned up in the future
-    with set_forward_context(vllm_config):
-        ...  # Run data collection and coefficient estimation here
-```
-
-
-**Data Statistics Guide:**
-
-| Metric | Good Range | Warning Signs |
-|--------|------------|---------------|
-| **Count** | 2000-5000+ | < 1000: too few prompts |
-| **Input Diffs (x)** | 0.01-0.10 | Very small (<0.001): model may not modulate properly |
-| **Output Diffs (y)** | Should correlate with x | No correlation: check extractor |
-| **Coefficient magnitude** | -1e6 to 1e6 | > 1e8: numerical instability |
-
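The thresholds in the table can be turned into a quick sanity check before trusting a fit. The helper below is a hypothetical sketch: it assumes the "Input Diffs" row refers to the mean of the collected values, and it is not part of `coefficient_estimator.py`.

```python
def check_collected_data(input_diffs, coefficients):
    """Return human-readable warnings based on the thresholds in the table above."""
    warnings = []
    count = len(input_diffs)
    if count < 1000:
        warnings.append(f"only {count} samples collected; use more prompts")
    mean_x = sum(input_diffs) / count if count else 0.0
    if mean_x < 0.001:
        warnings.append(f"mean input diff {mean_x:.2e} is very small; check the modulated input")
    if any(abs(c) > 1e8 for c in coefficients):
        warnings.append("coefficient magnitude above 1e8; the fit may be numerically unstable")
    return warnings


# Illustrative values only: 2500 samples with a typical diff of 0.05.
print(check_collected_data([0.05] * 2500, [1.33e6, -1.69e5, 7.95e3, -1.64e2, 1.26]))
```

An empty list means none of the warning signs were triggered. It does not confirm that input and output differences correlate, so that still needs a separate look.
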
----
-
-## Testing
-
-After adding TeaCache support, test with:
-
-```python
-from vllm_omni import Omni
-from vllm_omni.inputs.data import OmniDiffusionSamplingParams
-
-omni = Omni(
- model="your-model-name",
- cache_backend="tea_cache",
- cache_config={
- "rel_l1_thresh": 0.2,
- "coefficients": [1.33e6, -1.69e5, 7.95e3, -1.64e2, 1.26], # Your coefficients
- }
-)
-
-images = omni.generate(
- "a beautiful landscape",
- OmniDiffusionSamplingParams(num_inference_steps=50),
-)
-```
-
-**Verify:**
-
-1. **Check logs** - Look for TeaCache initialization messages
-2. **Compare performance** - Measure speedup vs baseline (expect 1.5x-2.0x)
-3. **Verify output quality** - Visually compare cached vs uncached outputs (should be nearly identical)
-
-See more detailed examples in [user guide for teacache](../../user_guide/diffusion/cache_acceleration/teacache.md).
-
----
-
-## Troubleshooting
-
-### Issue: "Unknown model type"
-
-**Symptoms:** Error message indicating the model type is not recognized when enabling TeaCache.
-
-**Causes & Solutions:**
-
-- **Extractor not registered:**
-
-**Problem:** The transformer class name doesn't exist in `EXTRACTOR_REGISTRY`.
-
-**Solution:** Check the class name and add to registry:
-```python
-# Check transformer class name
-print(pipeline.transformer.__class__.__name__)
-
-# Add to EXTRACTOR_REGISTRY
-EXTRACTOR_REGISTRY["YourTransformer2DModel"] = extract_your_context
-```
-
-- **Transformer class name mismatch:**
-
-**Solution:** Ensure the registry key matches exactly with `module.__class__.__name__`.
-
-### Issue: "Cannot find coefficients"
-
-**Symptoms:** Error when initializing TeaCache about missing model coefficients.
-
-**Causes & Solutions:**
-
-- **Missing coefficients in config:**
-
-**Solution:** Add coefficients to `_MODEL_COEFFICIENTS` in `config.py`, or pass custom coefficients:
-```python
-omni = Omni(
- model="your-model",
- cache_backend="tea_cache",
- cache_config={"coefficients": [1.0, -0.5, 0.1, -0.01, 0.001]}
-)
-```
-
-### Issue: Quality Degradation
-
-**Symptoms:** Output images look noticeably different or have artifacts compared to baseline.
-
-**Causes & Solutions:**
-
-- **Threshold too high:**
-
-**Problem:** `rel_l1_thresh` is too aggressive, causing cache reuse when outputs differ significantly.
-
-**Solution:** Lower the threshold:
-```python
-cache_config={"rel_l1_thresh": 0.1} # Try 0.1-0.2
-```
-
-- **Coefficients not tuned:**
-
-**Solution:** Estimate model-specific coefficients using the coefficient estimation process described above.
-
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **Qwen-Image** | `vllm_omni/diffusion/cache/teacache/extractors.py` | Dual-stream | `extract_qwen_context` |
-| **Bagel** | `vllm_omni/diffusion/cache/teacache/extractors.py` | Omni model | `extract_bagel_context` |
-| **TeaCache Core** | `vllm_omni/diffusion/cache/teacache/` | Base implementation | Hook and config |
-| **Coefficient Estimator** | `vllm_omni/diffusion/cache/teacache/coefficient_estimator.py` | Estimation tool | Adapter pattern |
-
----
-
-## Summary
-
-Adding TeaCache support:
-
-1. ✅ **Write extractor** - Create function returning `CacheContext` with model-specific preprocessing
-2. ✅ **Register extractor** - Add to `EXTRACTOR_REGISTRY` with transformer class name
-3. ✅ **Add coefficients** - Add polynomial coefficients to `_MODEL_COEFFICIENTS`
-4. ✅ **Test** - Verify with `cache_backend="tea_cache"`
diff --git a/docs/design/feature/tensor_parallel.md b/docs/design/feature/tensor_parallel.md
deleted file mode 100644
index bcafde7e73a..00000000000
--- a/docs/design/feature/tensor_parallel.md
+++ /dev/null
@@ -1,279 +0,0 @@
-# Tensor Parallel
-
-This section describes how to add Tensor Parallel (TP) to a diffusion transformer model. We use the Z-Image transformer as the reference implementation.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation](#step-by-step-implementation)
-- [Testing](#testing)
-- [Troubleshooting](#troubleshooting)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is Tensor Parallel?
-
-Tensor Parallel (TP) is a model parallelism technique that **shards model weights** across multiple GPUs. Each GPU holds only a portion of the model's parameters and computes only part of each layer's output.
-
-Diffusion transformers contain large attention and MLP layers. We can use Tensor Parallel to shard the model dimension across multiple GPUs, allowing larger models to fit in memory while achieving near-linear speedup.
-
-### Architecture
-
-The Tensor Parallel implementation relies on vLLM's parallel layers:
-
-[vLLM Parallel Layers API Reference](https://docs.vllm.ai/en/latest/contributing/model/basic/?h=column#3-optional-implement-tensor-parallelism-and-quantization-support)
-
-**Parallel Layer Types:**
-
-| Layer Type | Purpose | Weight Partitioning |
-|------------|---------|---------------------|
-| `ColumnParallelLinear` | First FFN layer, separate Q/K/V projections | Columns (output dimension) |
-| `RowParallelLinear` | Second FFN layer, attention output | Rows (input dimension) |
-| `QKVParallelLinear` | Multi-head/grouped-query attention QKV | Handles head replication automatically |
-| `ReplicatedLinear` | Layers that shouldn't be sharded | No partitioning (replicated) |
-
----
-
-## Step-by-Step Implementation
-
-
-### Step 1: Identify Linear Layers
-
-Find all `nn.Linear` layers in your transformer that need to be sharded.
-
-**Key questions:**
-- Which layers should be column parallel (weight split by columns)?
-- Which layers should be row parallel (weight split by rows)?
-
-### Step 2: Replace Linear Layers with Parallel Equivalents
-
-Replace `nn.Linear` with parallel layers from `vllm.model_executor.layers.linear`.
-
-**Example (MLP Block - Up-Down Pattern):**
-
-```python
-class FeedForward(nn.Module):
- def __init__(self, dim: int, hidden_dim: int):
- super().__init__()
- # Column parallel: weight split by columns [hidden_dim/N, dim]
- self.w1 = ColumnParallelLinear(
- dim,
- hidden_dim,
- bias=False,
- return_bias=False,
- )
- self.act = nn.GELU()
-
- self.w2 = RowParallelLinear(
- hidden_dim,
- dim,
- bias=False,
- input_is_parallel=True, # Input already sharded from w1
- return_bias=False,
- )
-
- def forward(self, x):
- # x: [batch, seq, dim] (replicated on all GPUs)
- # w1 outputs sharded [batch, seq, hidden_dim/N]
- x = self.w1(x)
- # act operates on sharded tensors (no communication)
- x = self.act(x)
- # w2 outputs full dim [batch, seq, dim] via all-reduce
- x = self.w2(x)
- return x
-```
-
-**Example (Attention - QKV-Out Pattern):**
-
-```python
-from vllm_omni.diffusion.attention.layer import Attention
-class YourModelAttention(nn.Module):
- def __init__(self, dim: int, num_heads: int, num_kv_heads: int):
- super().__init__()
- self.head_dim = dim // num_heads
-
- # Column parallel: QKV weight split by columns
- # Each GPU gets num_heads/N heads
- self.to_qkv = QKVParallelLinear(
- hidden_size=dim,
- head_size=self.head_dim,
- total_num_heads=num_heads,
- total_num_kv_heads=num_kv_heads,
- bias=False,
- return_bias=False,
- )
-
- # Row parallel: output weight split by rows
- self.to_out = RowParallelLinear(
- dim,
- dim,
- bias=False,
- input_is_parallel=True, # Input sharded from attention
- return_bias=False,
- )
-
- self.attn = Attention(
- num_heads=self.to_qkv.num_heads, # Each GPU gets num_heads/N heads
- head_size=self.head_dim,
- softmax_scale=1.0 / (self.head_dim**0.5),
- causal=False,
- num_kv_heads=self.to_qkv.num_kv_heads,
- )
-
- def forward(self, x):
- # x: [batch, seq, dim] (replicated)
- # to_qkv outputs sharded [batch, seq, (num_heads + 2 * num_kv_heads) / N * head_dim]
- qkv = self.to_qkv(x)
- # Split into Q, K, V (each sharded on heads)
- q, k, v = qkv.split([...], dim=-1)
- # Attention computed independently on each GPU
- out = self.attn(q, k, v)
- # to_out all-reduces to full dim
- out = self.to_out(out)
- return out
-```
-
-**Key Points:**
-
-- `ColumnParallelLinear` → `RowParallelLinear` is the standard pairing
-- Set `input_is_parallel=True` on `RowParallelLinear` when input comes from `ColumnParallelLinear`
-- Use `QKVParallelLinear` for attention projections (handles head replication automatically)
-
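The column-then-row pairing works because an elementwise activation can be applied to each hidden-dimension shard independently, so only the final partial outputs need to be summed. The sketch below checks this with plain Python lists. It uses the `x @ W` layout (so `w1` is `[dim, hidden_dim]`, the transpose of how `nn.Linear` stores weights), ReLU in place of GELU, and a simulated all-reduce; all function names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def mlp_full(x, w1, w2):
    """Reference MLP on a single device."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Simulate tp_size ranks: column-parallel w1, row-parallel w2, then all-reduce."""
    hidden = len(w2)
    assert hidden % tp_size == 0, "hidden_dim must be divisible by tp_size"
    shard = hidden // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w1_cols = [row[lo:hi] for row in w1]  # this rank's slice of the hidden dimension
        w2_rows = w2[lo:hi]                   # matching rows of the second weight
        h = relu(matmul(x, w1_cols))          # no communication needed here
        partials.append(matmul(h, w2_rows))   # partial sum of the full output
    out = partials[0]
    for p in partials[1:]:                    # all-reduce (sum) across ranks
        out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, p)]
    return out


x = [[1, -2], [3, 4]]
w1 = [[1, -1, 2, 0], [0, 3, -1, 1]]
w2 = [[1, 0], [2, -1], [0, 1], [-3, 2]]
print(mlp_full(x, w1, w2) == mlp_tensor_parallel(x, w1, w2, tp_size=2))
```

The only communication in this pattern is the single sum at the end, which is why the pairing is preferred over gathering the hidden activations between the two layers.
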
-### Step 3: Validate TP Constraints
-
-For correct TP operation, these dimensions **must be divisible** by `tensor_parallel_size`:
-
-| Dimension | Reason | Example Error |
-|-----------|--------|---------------|
-| `num_heads` | Heads sharded by QKVParallelLinear | `num_heads=30, tp=4` ❌ (30 % 4 ≠ 0) |
-| `num_kv_heads` | KV heads sharded by QKVParallelLinear | `num_kv_heads=30, tp=4` ❌ (30 % 4 ≠ 0) |
-
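A check along these lines can run at model construction time so that an invalid configuration fails early with a clear message. This is a hypothetical helper written for illustration; the validation in the Z-Image reference implementation may be structured differently.

```python
def validate_tp_constraints(num_heads: int, num_kv_heads: int, tensor_parallel_size: int) -> None:
    """Raise ValueError if a head count cannot be split evenly across TP ranks."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    for name, value in (("num_heads", num_heads), ("num_kv_heads", num_kv_heads)):
        if value % tensor_parallel_size != 0:
            raise ValueError(
                f"{name}={value} is not divisible by tensor_parallel_size={tensor_parallel_size}"
            )


validate_tp_constraints(num_heads=32, num_kv_heads=8, tensor_parallel_size=4)  # passes silently
```
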
----
-
-## Testing
-
-After adding Tensor Parallel support, test with:
-
-```python
-from vllm_omni import Omni
-from vllm_omni.diffusion.data import DiffusionParallelConfig
-from vllm_omni.inputs.data import OmniDiffusionSamplingParams
-
-parallel_config = DiffusionParallelConfig(tensor_parallel_size=2)
-omni = Omni(model="your-model-name", parallel_config=parallel_config)
-
-output = omni.generate(
- "a cup of coffee on the table",
- OmniDiffusionSamplingParams(num_inference_steps=50),
-)
-```
-
-**Or via command line:**
-
-```bash
-cd examples/offline_inference/text_to_image
-python text_to_image.py \
- --model Your-org/your-model \
- --prompt "a cup of coffee on the table" \
- --negative-prompt "ugly, unclear" \
- --cfg-scale 4.0 \
- --num-inference-steps 50 \
- --output "tp_enabled.png" \
- --tensor-parallel-size 2
-```
-
-**Verify:**
-
-1. Check the `e2e_time_ms` in the log for speedup
-2. Compare generated image quality with TP disabled
-3. Verify memory usage is reduced proportionally
-4. Record comparison results in your PR
-
----
-
-## Troubleshooting
-
-### Issue: TP not activating
-
-**Symptoms:** Model runs on single GPU, no memory savings or speedup.
-
-**Causes & Solutions:**
-
-- **Still using `nn.Linear`:**
-
-**Problem:** Linear layers not replaced with parallel equivalents.
-
-**Solution:** Replace with parallel layers:
-```python
-# ❌ BAD
-self.proj = nn.Linear(dim, dim)
-
-# ✅ GOOD
-self.proj = RowParallelLinear(dim, dim, input_is_parallel=True)
-```
-
-### Issue: Dimension mismatch errors
-
-**Symptoms:** `RuntimeError: shape mismatch` during forward pass.
-
-**Causes & Solutions:**
-
-- **Missing `input_is_parallel=True`:**
-
-**Problem:** RowParallelLinear expects sharded input but receives full tensor.
-
-**Solution:** Set `input_is_parallel=True` when input comes from ColumnParallelLinear:
-```python
-# ✅ GOOD: Correct pairing
-self.w1 = ColumnParallelLinear(dim, hidden_dim, return_bias=False)
-self.w2 = RowParallelLinear(
- hidden_dim,
- dim,
- input_is_parallel=True, # Input sharded from w1
- return_bias=False,
-)
-```
-
-- **Incorrect split dimensions:**
-
-**Problem:** QKV split sizes don't match sharded dimensions.
-
-**Solution:** Use `self.to_qkv.num_heads` (local heads per GPU):
-```python
-# ❌ BAD: Uses total heads
-q_size = self.total_num_heads * self.head_dim
-
-# ✅ GOOD: Uses local heads
-q_size = self.to_qkv.num_heads * self.head_dim
-```
-
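The per-rank split sizes follow directly from the local head counts. The sketch below assumes the fused projection is laid out as `[q, k, v]` along the last dimension and that both head counts divide evenly by the TP size; the function name is hypothetical.

```python
def local_qkv_split_sizes(total_num_heads: int, total_num_kv_heads: int, head_dim: int, tp_size: int) -> list[int]:
    """Sizes of the Q, K and V slices held by one rank after sharding on heads."""
    local_heads = total_num_heads // tp_size        # query heads on this rank
    local_kv_heads = total_num_kv_heads // tp_size  # key/value heads on this rank
    q_size = local_heads * head_dim
    kv_size = local_kv_heads * head_dim
    return [q_size, kv_size, kv_size]


# 32 query heads, 8 KV heads, head_dim 128, split across 4 GPUs
print(local_qkv_split_sizes(32, 8, 128, 4))
```

Using the total head count here instead would request slices four times larger than the tensor each rank actually holds, which is the shape mismatch described above.
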
----
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Pattern | Notes |
-|-------|------|---------|-------|
-| **Z-Image** | `vllm_omni/diffusion/models/z_image/z_image_transformer.py` | Standard TP | Full implementation with validation |
-| **FLUX** | `vllm_omni/diffusion/models/flux/flux_transformer.py` | Dual-stream | Image + text streams |
-| **Qwen-Image** | `vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py` | Standard TP | With RoPE |
-| **TP Tests** | `tests/e2e/offline_inference/test_zimage_parallelism.py` | E2E testing | TP correctness and performance |
-| **Constraint Tests** | `tests/diffusion/models/z_image/test_zimage_tp_constraints.py` | Unit testing | Validation logic |
-
----
-
-## Summary
-
-Adding Tensor Parallel support to a transformer:
-
-1. ✅ **Identify linear layers** - Which layers should be sharded?
-2. ✅ **Replace with parallel layers** - Use QKVParallelLinear, ColumnParallelLinear, RowParallelLinear
-3. ✅ **Validate TP constraints** - Ensure dimensions divisible by TP size
-4. ✅ **Test** - Verify with `tensor_parallel_size=N`, check memory, speed, and quality
diff --git a/docs/design/feature/vae_parallel.md b/docs/design/feature/vae_parallel.md
deleted file mode 100644
index e330b41a68f..00000000000
--- a/docs/design/feature/vae_parallel.md
+++ /dev/null
@@ -1,459 +0,0 @@
-# VAE Patch Parallelism
-
-This document describes how to add **VAE Patch Parallelism** support to a diffusion model.
-We use **Qwen-Image** as the reference implementation for decode parallel, and **Wan2.2** for encode parallel.
-
----
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Step-by-Step Implementation (Decode)](#step-by-step-implementation-decode)
-- [Encode Parallel Implementation](#encode-parallel-implementation)
-- [Testing](#testing)
-- [Reference Implementations](#reference-implementations)
-- [Summary](#summary)
-
----
-
-## Overview
-
-### What is VAE Patch Parallelism?
-
-**VAE Patch Parallelism** is an acceleration technique for both **encoding** and **decoding**. Instead of processing the entire tensor at once, the tensor is:
-
-+ Split into multiple spatial tiles
-
-+ Distributed across multiple ranks
-
-+ Encoded/Decoded in parallel
-
-+ Merged to reconstruct the final output
-
-This approach:
-
-+ Distributes computation across multiple devices
-
-+ Reduces peak memory usage per device
-
-+ Reduces encoding/decoding latency
-
-### When to Use Encode vs Decode Parallel
-
-| Operation | Use Case | Example |
-|-----------|----------|---------|
-| **Decode Parallel** | Text-to-Image, Text-to-Video | Latent → Image/Video |
-| **Encode Parallel** | Image-to-Video (I2V) | Image → Latent (for conditioning) |
-
-### Architecture
-We introduce **DistributedVaeExecutor** as the core component responsible for distributed VAE encoding/decoding.
-
-The executor is model-agnostic and accepts three function parameters:
-
-+ split – Partition the latent into tiles
-
-+ exec – Decode a single tile
-
-+ merge – Combine decoded tiles into the final output
-
-#### Execution Flow
-
-+ Call split(z) to generate a list of TileTask and a GridSpec
-
-+ Dispatch tasks across ranks using workload-based balancing
-
-+ Each rank executes exec(task) on its assigned tiles
-
-+ Gather decoded tile results to rank 0
-
-+ Rank 0 performs merge(...)
-
-+ (Optional) Broadcast final result to all ranks
-
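The text does not spell out how workload-based balancing is done. One common choice, shown here purely as an assumption, is greedy longest-processing-time scheduling: sort tiles by workload and always hand the next tile to the least-loaded rank. The function `assign_tiles` is hypothetical and is not the executor's API.

```python
import heapq


def assign_tiles(workloads, world_size):
    """Greedy balancing: returns, for each rank, the list of tile ids it should process."""
    heap = [(0, rank) for rank in range(world_size)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(world_size)]
    # Place the heaviest tiles first so the small ones can even out the remainder.
    for tile_id in sorted(range(len(workloads)), key=lambda i: -workloads[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(tile_id)
        heapq.heappush(heap, (load + workloads[tile_id], rank))
    return assignment


# Edge tiles are often smaller than interior tiles, so workloads differ.
print(assign_tiles([16, 16, 16, 8, 8, 4], world_size=2))
```

Balancing by workload rather than by tile count matters because tiles on the right and bottom edges are usually smaller, so a round-robin split can leave one rank with noticeably more pixels to decode.
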
-This design separates:
-
-+ Distributed execution logic
-
-+ Model-specific tiling and merging logic
-
-#### Why are split / exec / merge necessary?
-
-The latent tensor cannot be arbitrarily partitioned.
-
-During decoding:
-
-+ Each output pixel may depend on neighboring pixels
-
-+ The receptive field is model-dependent
-
-Therefore:
-
-+ Tiles must include overlap
-
-+ Merge must perform blending to avoid seams
-
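Blending in the overlap region is typically a linear cross-fade. The one-dimensional sketch below follows the weighting used by tiled VAE decoding in diffusers-style autoencoders; it is a simplified illustration on plain lists, and the name `blend_1d` is made up for this example.

```python
def blend_1d(left, right, blend_extent):
    """Cross-fade the tail of `left` into the head of `right` over `blend_extent` samples."""
    blend_extent = min(blend_extent, len(left), len(right))
    blended = list(right)
    for x in range(blend_extent):
        w = x / blend_extent  # weight of the right tile grows from 0 towards 1
        blended[x] = left[-blend_extent + x] * (1 - w) + right[x] * w
    return blended


# Two tiles that disagree in the overlap: the seam becomes a smooth ramp.
print(blend_1d([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], blend_extent=4))
```

In two dimensions the same weighting is applied once along rows and once along columns, which is why the split stage records blend heights and widths for the merge stage to use.
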
-## Step-by-Step Implementation (Decode)
-
-### Step 1: Implement DistributedAutoencoderKLQwenImage
-`QwenImagePipeline` uses `AutoencoderKLQwenImage` as its VAE, so implement a distributed version of it:
-
-
-```python
-class DistributedAutoencoderKLQwenImage(AutoencoderKLQwenImage, DistributedVaeMixin):
- @classmethod
- def from_pretrained(cls, *args: Any, **kwargs: Any):
- model = super().from_pretrained(*args, **kwargs)
- model.init_distributed()
- return model
-```
-**Key points**:
-+ Inherit from both `AutoencoderKLQwenImage` and `DistributedVaeMixin`
-+ Call `init_distributed()` after loading weights
-
-### Step 2: Implement split/exec/merge
-Reuse the `AutoencoderKLQwenImage.tiled_decode` logic and divide it into three stages. The split stage must return the tiles as a list of `TileTask` objects together with a `GridSpec`:
-```python
-class GridSpec:
- split_dims: tuple[int, ...] # Tensor dimensions being split (e.g., (2, 3) for (B, C, H, W))
- grid_shape: tuple[int, ...] # Tile grid layout (num_rows, num_cols)
- tile_spec: dict = field(default_factory=dict) # Metadata required for merging
- output_dtype: torch.dtype | None = None # Final output dtype
-```
-```
-class TileTask:
- tile_id: int # task id
- grid_coord: tuple[int, ...] # Tile position in grid
- tensor: torch.Tensor | list[torch.Tensor] # The tile tensor
- workload: int | float = 1 # Used for load balancing (e.g., tile area)
-```
-The tile-based split/exec/merge functions are as follows:
-```python
-def tile_split(self, z: torch.Tensor) -> tuple[list[TileTask], GridSpec]:
-    # Mostly copied from AutoencoderKL
-    _, _, num_frames, height, width = z.shape
-    sample_height = height * self.spatial_compression_ratio
-    sample_width = width * self.spatial_compression_ratio
-
-    tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
-    tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
-    tile_latent_stride_height = self.tile_sample_stride_height // self.spatial_compression_ratio
-    tile_latent_stride_width = self.tile_sample_stride_width // self.spatial_compression_ratio
-
-    blend_height = self.tile_sample_min_height - self.tile_sample_stride_height
-    blend_width = self.tile_sample_min_width - self.tile_sample_stride_width
-
-    # Split z into overlapping tiles and decode them separately.
-    # The tiles have an overlap to avoid seams between tiles.
-    tiletask_list = []
-    for i in range(0, height, tile_latent_stride_height):
-        for j in range(0, width, tile_latent_stride_width):
-            time_list = []
-            for k in range(num_frames):
-                self._conv_idx = [0]
-                tile = z[:, :, k : k + 1, i : i + tile_latent_min_height, j : j + tile_latent_min_width]
-                time_list.append(tile)
-            tiletask_list.append(
-                TileTask(
-                    len(tiletask_list),
-                    (i // tile_latent_stride_height, j // tile_latent_stride_width),
-                    time_list,
-                    workload=time_list[0].shape[3] * time_list[0].shape[4],
-                )
-            )
-    tile_spec = {
-        "sample_height": sample_height,
-        "sample_width": sample_width,
-        "blend_height": blend_height,
-        "blend_width": blend_width,
-    }
-    grid_spec = GridSpec(
-        split_dims=(3, 4),
-        grid_shape=(tiletask_list[-1].grid_coord[0] + 1, tiletask_list[-1].grid_coord[1] + 1),
-        tile_spec=tile_spec,
-        output_dtype=self.dtype,
-    )
-    return tiletask_list, grid_spec
-
-def tile_exec(self, task: TileTask) -> torch.Tensor:
-    """Decode a single latent tile into RGB space."""
-    self.clear_cache()
-    time = []
-    for k in range(len(task.tensor)):
-        self._conv_idx = [0]
-        tile = self.post_quant_conv(task.tensor[k])
-        decoded = self.decoder(tile, feat_cache=self._feat_map, feat_idx=self._conv_idx)
-        time.append(decoded)
-    result = torch.cat(time, dim=2)
-    return result
-
-def tile_merge(self, coord_tensor_map: dict[tuple[int, ...], torch.Tensor], grid_spec: GridSpec) -> torch.Tensor:
-    """Merge decoded tiles into a full image."""
-    grid_h, grid_w = grid_spec.grid_shape
-    result_rows = []
-    self.clear_cache()
-
-    # Blend each tile with its upper and left neighbors, then crop it to the stride size.
-    for i in range(grid_h):
-        result_row = []
-        for j in range(grid_w):
-            tile = coord_tensor_map[(i, j)]
-            if i > 0:
-                tile = self.blend_v(coord_tensor_map[(i - 1, j)], tile, grid_spec.tile_spec["blend_height"])
-            if j > 0:
-                tile = self.blend_h(coord_tensor_map[(i, j - 1)], tile, grid_spec.tile_spec["blend_width"])
-            result_row.append(tile[:, :, :, : self.tile_sample_stride_height, : self.tile_sample_stride_width])
-        result_rows.append(torch.cat(result_row, dim=-1))
-    dec = torch.cat(result_rows, dim=3)[
-        :, :, :, : grid_spec.tile_spec["sample_height"], : grid_spec.tile_spec["sample_width"]
-    ]
-    return dec
-```
-
-### Step 3: Override tiled_decode
-Override `tiled_decode`. The main logic is:
-+ Check whether distributed execution is enabled
-+ Select the split/exec/merge functions
-+ Invoke `self.distributed_executor.execute` to decode
-```python
-def tiled_decode(self, z: torch.Tensor, return_dict: bool = True):
-    if not self.is_distributed_enabled():
-        return super().tiled_decode(z, return_dict=return_dict)
-
-    logger.info("Decode run with distributed executor")
-    result = self.distributed_executor.execute(
-        z,
-        DistributedOperator(split=self.tile_split, exec=self.tile_exec, merge=self.tile_merge),
-        broadcast_result=True,
-    )
-    if not return_dict:
-        return (result,)
-
-    return DecoderOutput(sample=result)
-```
-Set `broadcast_result` to `True` or `False` depending on the model; when it is `True`, the merged result is broadcast so that ranks other than 0 can use it as well.
-
-### Step 4: Modify Pipeline
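-Replace the pipeline's VAE class with its distributed counterpart. For Qwen-Image, `AutoencoderKLQwenImage` becomes `DistributedAutoencoderKLQwenImage`; the generic template below uses `AutoencoderKL`: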
-```
-class YourModelPipeline(nn.Module):
-    def __init__(
-        self,
-        *,
-        od_config: OmniDiffusionConfig,
-        prefix: str = "",
-    ):
-        super().__init__()
-        ...
--        self.vae = AutoencoderKL.from_pretrained(
--            model, subfolder="vae", local_files_only=local_files_only).to(self.device)
-+        self.vae = DistributedAutoencoderKL.from_pretrained(
-+            model, subfolder="vae", local_files_only=local_files_only
-+        ).to(self.device)
-```
-
-## Encode Parallel Implementation
-
-For models that require VAE encoding (e.g., Image-to-Video), you can also parallelize the encode operation. We use **Wan2.2** as the reference implementation.
-
-### Step 1: Implement encode_tile_split
-
-Similar to decode, split the input tensor into tiles. Key considerations:
-
-+ **Patchify handling**: If the model uses `patch_size`, scale tile parameters accordingly
-+ **Temporal chunking**: Video VAEs may have temporal compression (e.g., 4x)
-
-```python
-def encode_tile_split(self, x: torch.Tensor) -> tuple[list[TileTask], GridSpec]:
-    _, _, num_frames, height, width = x.shape
-    encode_spatial_compression_ratio = self.spatial_compression_ratio
-
-    # Scale tile parameters for patchified coordinate system
-    tile_sample_min_height = self.tile_sample_min_height
-    tile_sample_min_width = self.tile_sample_min_width
-    tile_sample_stride_height = self.tile_sample_stride_height
-    tile_sample_stride_width = self.tile_sample_stride_width
-
-    if self.config.patch_size is not None:
-        # When input is patchified, scale tile parameters accordingly
-        encode_spatial_compression_ratio = self.spatial_compression_ratio // self.config.patch_size
-        tile_sample_min_height = tile_sample_min_height // self.config.patch_size
-        tile_sample_min_width = tile_sample_min_width // self.config.patch_size
-        tile_sample_stride_height = tile_sample_stride_height // self.config.patch_size
-        tile_sample_stride_width = tile_sample_stride_width // self.config.patch_size
-
-    latent_height = height // encode_spatial_compression_ratio
-    latent_width = width // encode_spatial_compression_ratio
-
-    tile_latent_min_height = tile_sample_min_height // encode_spatial_compression_ratio
-    tile_latent_min_width = tile_sample_min_width // encode_spatial_compression_ratio
-    tile_latent_stride_height = tile_sample_stride_height // encode_spatial_compression_ratio
-    tile_latent_stride_width = tile_sample_stride_width // encode_spatial_compression_ratio
-
-    blend_height = tile_latent_min_height - tile_latent_stride_height
-    blend_width = tile_latent_min_width - tile_latent_stride_width
-
-    tiletask_list = []
-    # Use temporal compression ratio from config instead of hardcoding
-    temporal_compression = self.config.scale_factor_temporal
-
-    for i in range(0, height, tile_sample_stride_height):
-        for j in range(0, width, tile_sample_stride_width):
-            time_list = []
-            frame_range = 1 + (num_frames - 1) // temporal_compression
-            for k in range(frame_range):
-                if k == 0:
-                    tile = x[:, :, :1, i : i + tile_sample_min_height, j : j + tile_sample_min_width]
-                else:
-                    tile = x[
-                        :, :,
-                        1 + temporal_compression * (k - 1) : 1 + temporal_compression * k,
-                        i : i + tile_sample_min_height,
-                        j : j + tile_sample_min_width,
-                    ]
-                time_list.append(tile)
-            tiletask_list.append(
-                TileTask(len(tiletask_list), (i // tile_sample_stride_height, j // tile_sample_stride_width),
-                         time_list, workload=time_list[0].shape[3] * time_list[0].shape[4])
-            )
-
-    grid_spec = GridSpec(
-        split_dims=(3, 4),
-        grid_shape=(tiletask_list[-1].grid_coord[0] + 1, tiletask_list[-1].grid_coord[1] + 1),
-        tile_spec={
-            "latent_height": latent_height, "latent_width": latent_width,
-            "blend_height": blend_height, "blend_width": blend_width,
-            "tile_latent_stride_height": tile_latent_stride_height,
-            "tile_latent_stride_width": tile_latent_stride_width,
-        },
-        output_dtype=self.dtype,
-    )
-    return tiletask_list, grid_spec
-```
-
-### Step 2: Implement encode_tile_exec
-
-```python
-def encode_tile_exec(self, task: TileTask) -> torch.Tensor:
-    """Encode a single sample tile into latent space."""
-    self.clear_cache()
-    time = []
-    for k, tile in enumerate(task.tensor):
-        self._enc_conv_idx = [0]
-        encoded = self.encoder(tile, feat_cache=self._enc_feat_map, feat_idx=self._enc_conv_idx)
-        encoded = self.quant_conv(encoded)
-        time.append(encoded)
-    result = torch.cat(time, dim=2)
-    self.clear_cache()
-    return result
-```
-
-### Step 3: Implement encode_tile_merge
-
-```python
-def encode_tile_merge(
-    self, coord_tensor_map: dict[tuple[int, ...], torch.Tensor], grid_spec: GridSpec
-) -> torch.Tensor:
-    """Merge encoded tiles into a full latent tensor."""
-    grid_h, grid_w = grid_spec.grid_shape
-    result_rows = []
-    for i in range(grid_h):
-        result_row = []
-        for j in range(grid_w):
-            tile = coord_tensor_map[(i, j)]
-            if i > 0:
-                tile = self.blend_v(coord_tensor_map[(i - 1, j)], tile, grid_spec.tile_spec["blend_height"])
-            if j > 0:
-                tile = self.blend_h(coord_tensor_map[(i, j - 1)], tile, grid_spec.tile_spec["blend_width"])
-            result_row.append(tile[:, :, :,
-                                   : grid_spec.tile_spec["tile_latent_stride_height"],
-                                   : grid_spec.tile_spec["tile_latent_stride_width"]])
-        result_rows.append(torch.cat(result_row, dim=-1))
-
-    enc = torch.cat(result_rows, dim=3)[
-        :, :, :, : grid_spec.tile_spec["latent_height"], : grid_spec.tile_spec["latent_width"]
-    ]
-    return enc
-```
-
-### Step 4: Override tiled_encode method
-
-Override `tiled_encode` instead of `encode`. The parent's `_encode()` handles patchify before calling `tiled_encode()`, so input `x` is already patchified.
-
-```python
-def tiled_encode(self, x: torch.Tensor) -> torch.Tensor:
-    """
-    Encode using distributed VAE executor.
-
-    Note: x is already patchified by parent's _encode() before calling this method.
-    """
-    if not self.is_distributed_enabled():
-        return super().tiled_encode(x)
-
-    self.clear_cache()
-    result = self.distributed_executor.execute(
-        x,
-        DistributedOperator(
-            split=self.encode_tile_split,
-            exec=self.encode_tile_exec,
-            merge=self.encode_tile_merge,
-        ),
-        broadcast_result=True,  # Latents needed by all ranks for diffusion
-    )
-    self.clear_cache()
-    return result
-```
-
-**Key differences from decode parallel:**
-
-| Aspect | Decode Parallel | Encode Parallel |
-|--------|-----------------|-----------------|
-| `broadcast_result` | Often `False` (only rank 0 needs output) | `True` (all ranks need latents for diffusion) |
-| Patchify | Applied in merge (unpatchify) | Handled by parent `_encode()` before `tiled_encode()` |
-| Temporal chunking | Frame-by-frame | Chunk-based (e.g., 1 + 4n frames) |
-
-## Testing
-Verify numerical consistency between:
-+ `vae_patch_parallel_size = 1`
-
-+ `vae_patch_parallel_size = N`
-
-Example:
-`torch.allclose(output_1, output_n, atol=1e-5)`
-
-Testing requirements:
-+ Fix random seed
-+ Use identical tiling strategy
-
-```python
-m = Omni(
-    model=model_name,
-    vae_use_tiling=True,
-    parallel_config=DiffusionParallelConfig(
-        tensor_parallel_size=2,
-        vae_patch_parallel_size=1,  # or 2
-    ),
-)
-```
-When `vae_patch_parallel_size` is larger than the DiT world size, it automatically falls back to the DiT world size.
-
-## Reference Implementations
-
-Complete examples in the codebase:
-
-| Model | Path | Decode Parallel | Encode Parallel |
-|-------|------|-----------------|-----------------|
-| **Z-Image** | `vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl.py` | ✅ | ❌ |
-| **Wan2.2** | `vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_wan.py` | ✅ | ✅ |
-| **Qwen-Image** | `vllm_omni/diffusion/distributed/autoencoders/autoencoder_kl_qwenimage.py` | ✅ | ❌ |
-
----
-
-## Summary
-
-To add VAE Patch Parallel support to a diffusion model:
-
-1. **Implement Distributed VAE** - Inherit from base VAE class and `DistributedVaeMixin`
-2. **Decode Parallel** - Refactor `tiled_decode` into `tile_split`/`tile_exec`/`tile_merge`
-3. **Encode Parallel** (optional) - Implement `encode_tile_split`/`encode_tile_exec`/`encode_tile_merge` for I2V models
-4. **Change VAE model in pipeline** - Use the distributed version
-5. **Test** - Verify numerical consistency with `vae_patch_parallel_size=1` vs `N`
diff --git a/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png b/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png
deleted file mode 100644
index 15112d5862a..00000000000
Binary files a/docs/design/figures/omni/E2EL_s_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png
deleted file mode 100644
index 2f0615f77bb..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_RTF_Baseline_vs_Batch.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png
deleted file mode 100644
index 62d8bc79b6b..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_CUDA_Graph_vs_Async_Chunk.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png
deleted file mode 100644
index 5838b45319e..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_RTF_Batch_vs_Batch_CUDA_Graph.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png
deleted file mode 100644
index 24be814b7e9..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Baseline_vs_Batch.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png
deleted file mode 100644
index c8df58ebcdf..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_CUDA_Graph_vs_Async_Chunk.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png
deleted file mode 100644
index 2d1a04e9c2c..00000000000
Binary files a/docs/design/figures/omni/Mean_AUDIO_TTFP_ms_Batch_vs_Batch_CUDA_Graph.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png b/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png
deleted file mode 100644
index e598b543431..00000000000
Binary files a/docs/design/figures/omni/Mean_E2EL_ms_Baseline_vs_Batch.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png b/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png
deleted file mode 100644
index 54452013eb4..00000000000
Binary files a/docs/design/figures/omni/Mean_E2EL_ms_Batch_CUDA_Graph_vs_Async_Chunk.png and /dev/null differ
diff --git a/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png b/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png
deleted file mode 100644
index 04c5ad7396a..00000000000
Binary files a/docs/design/figures/omni/Mean_E2EL_ms_Batch_vs_Batch_CUDA_Graph.png and /dev/null differ
diff --git a/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png b/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png
deleted file mode 100644
index d93ba0b2af5..00000000000
Binary files a/docs/design/figures/omni/RTF_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png b/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png
deleted file mode 100644
index 04087b5910f..00000000000
Binary files a/docs/design/figures/omni/Summary_E2EL_ms_vs_features.png and /dev/null differ
diff --git a/docs/design/figures/omni/Summary_RTF_vs_features.png b/docs/design/figures/omni/Summary_RTF_vs_features.png
deleted file mode 100644
index c2c8ad40834..00000000000
Binary files a/docs/design/figures/omni/Summary_RTF_vs_features.png and /dev/null differ
diff --git a/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png b/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png
deleted file mode 100644
index 3dcc1c55379..00000000000
Binary files a/docs/design/figures/omni/Summary_TTFP_ms_vs_features.png and /dev/null differ
diff --git a/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png b/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png
deleted file mode 100644
index 9a5b6c9bdaf..00000000000
Binary files a/docs/design/figures/omni/TTFP_s_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png
deleted file mode 100644
index 68f0ef17e88..00000000000
Binary files a/docs/design/figures/tts/Mean_AUDIO_RTF_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png
deleted file mode 100644
index 44be96e96da..00000000000
Binary files a/docs/design/figures/tts/Mean_AUDIO_TTFP_(ms)_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png b/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png
deleted file mode 100644
index 2e5d1482bd7..00000000000
Binary files a/docs/design/figures/tts/Mean_E2EL_(ms)_vllm_omni_vs_transformers.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png
deleted file mode 100644
index 04d8f0bac53..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_e2e_ms_baseline_vs_batch.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png
deleted file mode 100644
index eb85ec0dd4f..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_e2e_ms_batch_vs_cuda_graph.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png
deleted file mode 100644
index 6f0e0e2529d..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_e2e_ms_cuda_graph_vs_async_chunk.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png
deleted file mode 100644
index 89ea30a8643..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_rtf_baseline_vs_batch.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png
deleted file mode 100644
index 2b207b88987..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_rtf_batch_vs_cuda_graph.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png
deleted file mode 100644
index f5f7ad72c8f..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_rtf_cuda_graph_vs_async_chunk.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png
deleted file mode 100644
index 6f8c1da4a5b..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_ttfp_ms_baseline_vs_batch.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png
deleted file mode 100644
index b0fe1d02a9d..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_ttfp_ms_batch_vs_cuda_graph.png and /dev/null differ
diff --git a/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png b/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png
deleted file mode 100644
index 008ba9bf78f..00000000000
Binary files a/docs/design/figures/tts/Mean_mean_ttfp_ms_cuda_graph_vs_async_chunk.png and /dev/null differ
diff --git a/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png b/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png
deleted file mode 100644
index 7c65aa11770..00000000000
Binary files a/docs/design/figures/tts/Summary_mean_e2e_ms_vs_features.png and /dev/null differ
diff --git a/docs/design/figures/tts/Summary_mean_rtf_vs_features.png b/docs/design/figures/tts/Summary_mean_rtf_vs_features.png
deleted file mode 100644
index 71bb2c54680..00000000000
Binary files a/docs/design/figures/tts/Summary_mean_rtf_vs_features.png and /dev/null differ
diff --git a/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png b/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png
deleted file mode 100644
index cef2546d6fe..00000000000
Binary files a/docs/design/figures/tts/Summary_mean_ttfp_ms_vs_features.png and /dev/null differ
diff --git a/docs/design/index.md b/docs/design/index.md
deleted file mode 100644
index 31420550fbd..00000000000
--- a/docs/design/index.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Design Documents
-
-This section contains design documents and architecture specifications for vLLM-Omni.
-
-## Architecture Documents
-
-- [Architecture Overview](architecture_overview.md)
-
-## Feature Design Documents
-
-- [Disaggregated Inference](feature/disaggregated_inference.md)
-- [Ray-based Execution](feature/ray_based_execution.md)
-
-## Module Design Documents
-
-- [AR Module](module/ar_module.md)
-- [DIT Module](module/dit_module.md)
-- [Entrypoint Module](module/entrypoint_module.md)
diff --git a/docs/design/module/ar_module.md b/docs/design/module/ar_module.md
deleted file mode 100644
index 5e0aa5b0713..00000000000
--- a/docs/design/module/ar_module.md
+++ /dev/null
@@ -1,387 +0,0 @@
-# AutoRegressive (AR) Module
-
-## 1. Overview
-
-The AutoRegressive (AR) module in vLLM-Omni handles autoregressive generation stages. It is primarily used for the text, chain-of-thought (CoT), and audio latent token generation stages in multi-stage models such as Qwen2.5-Omni, Qwen3-Omni, and BAGEL. Unlike representative non-autoregressive generation stages (e.g., Diffusion), AR stages generate tokens sequentially, one at a time, following the standard transformer decoder pattern.
-
-The AR module of vLLM-Omni extends vLLM's core components to support:
-
-- **Multimodal inputs/outputs**: Processing images, videos, and audio alongside text
-- **Direct embedding transfer**: Passing pre-computed prompt embeddings between pipeline stages via serialized payloads
-- **Additional information flow**: Carrying per-request metadata (tensors, lists) through the pipeline
-- **Hidden state exposure**: Exposing per-request hidden representations for downstream stages
-- **Basic generator support**: Supporting basic heterogeneous architectures such as convolutional and LSTM networks
-
-As shown in the [end2end example](../../user_guide/examples/offline_inference/qwen3_omni.md), the AR module can be applied across multiple stages: it generates text tokens in the thinker (AR), audio latent tokens in the talker (AR), and audio waveforms in code2wav (Convolution).
-
-## 2. Relationship with vLLM
-
-The AR module builds upon the main vLLM framework through inheritance, extending core classes while preserving compatibility with vLLM's scheduling, batching, KV cache management, and execution mechanisms.
-
-### Inheritance Hierarchy
-- Scheduler
-
-```mermaid
-classDiagram
- class VLLMScheduler {
- +schedule() SchedulerOutput
- +update_from_output() EngineCoreOutputs
- }
- class OmniARScheduler {
- +schedule() SchedulerOutput
- }
- class OmniGenerationScheduler {
- +schedule() SchedulerOutput
- +update_from_output() EngineCoreOutputs
- }
- VLLMScheduler <|-- OmniARScheduler
- VLLMScheduler <|-- OmniGenerationScheduler
-```
-- Worker
-
-```mermaid
-classDiagram
- class GPUWorker {
- +init_device()
- +model_runner
- }
- class GPUARWorker {
- +init_device()
- }
- class GPUGenerationWorker {
- +init_device()
- }
- GPUWorker <|-- GPUARWorker
- GPUWorker <|-- GPUGenerationWorker
-```
-- ModelRunner
-
-```mermaid
-classDiagram
- class GPUModelRunner {
- +execute_model()
- +sample_tokens()
- }
- class OmniGPUModelRunner {
- +_update_states()
- +_preprocess()
- +_model_forward()
- }
- class GPUARModelRunner {
- +execute_model()
- +sample_tokens()
- }
- class GPUGenerationModelRunner {
- +execute_model()
- }
- GPUModelRunner <|-- OmniGPUModelRunner
- OmniGPUModelRunner <|-- GPUARModelRunner
- OmniGPUModelRunner <|-- GPUGenerationModelRunner
-```
-- InputProcessor/OutputProcessor
-
-```mermaid
-classDiagram
- class InputProcessor {
- +process_inputs() EngineCoreRequest
- }
-
- class VLLMOutputProcessor {
- +process_outputs() OutputProcessorOutput
- }
- class MultimodalOutputProcessor {
- +process_outputs() OutputProcessorOutput
- +_route_and_normalize()
- }
- VLLMOutputProcessor <|-- MultimodalOutputProcessor
-```
-
-### Key Extensions
-
-- **Scheduler**: `OmniARScheduler` extends `vllm.v1.core.sched.scheduler.Scheduler` to enrich scheduled requests with omni-specific payloads
-- **Worker**: `GPUARWorker` extends `vllm.v1.worker.gpu_worker.Worker` to initialize AR-specific model runners
-- **ModelRunner**: `GPUARModelRunner` extends `OmniGPUModelRunner` → `vllm.v1.worker.gpu_model_runner.GPUModelRunner` to expose hidden states and handle multimodal outputs
-- **InputProcessor**: Stage-0 uses upstream `vllm.v1.engine.input_processor.InputProcessor`; `AsyncOmniEngine` then restores omni-specific payloads (for example `additional_information` and `prompt_embeds`) when building `OmniEngineCoreRequest`
-- **OutputProcessor**: `MultimodalOutputProcessor` extends `vllm.v1.engine.output_processor.OutputProcessor` to route and accumulate multimodal outputs
-
-## 3. Scheduler Design
-
-The AR module provides two scheduler implementations: one for standard autoregressive generation and one for basic heterogeneous architectures.
-
-### Request Flow
-
-The following diagram illustrates the request flow through the AR module components:
-
-```mermaid
-flowchart TD
- A[InputProcessor stage-0 in AsyncOmniEngine] -->|EngineCoreRequest then upgraded to OmniEngineCoreRequest| B[OmniARScheduler]
- B -->|schedule: OmniNewRequestData| C[GPUARWorker]
- C -->|SchedulerOutput| D[GPUARModelRunner]
- D -->|execute_model: None| E[Model Forward Pass]
- E -->|hidden_states, logits| D
- D -->|sample_tokens: OmniModelRunnerOutput| F[OmniARScheduler]
- F -->|update_from_output| G[MultimodalOutputProcessor]
- G -->|RequestOutput| H[Client/Downstream Stage]
-
- style A fill:#e1f5ff
- style B fill:#fff4e1
- style C fill:#e8f5e9
- style D fill:#f3e5f5
- style G fill:#fce4ec
-```
-
-The flow follows vLLM's standard pattern: input processing → scheduling → worker execution → output processing, with omni-specific enrichments at each stage.
-
-### OmniARScheduler
-
-`OmniARScheduler` extends the base vLLM scheduler with minimal modifications, focusing on enriching scheduled requests with omni-specific payloads.
-
-#### Modified API: `schedule()`
-
-The scheduler wraps base `NewRequestData` entries with `OmniNewRequestData` to include prompt embeddings and additional information:
-
-```python
-def schedule(self) -> SchedulerOutput:
-    scheduler_output = super().schedule()
-    # Rewrap base NewRequestData entries with OmniNewRequestData
-    new_list = []
-    for nr in scheduler_output.scheduled_new_reqs:
-        request = self.requests.get(nr.req_id)
-        omni_nr = OmniNewRequestData(
-            req_id=nr.req_id,
-            prompt_token_ids=nr.prompt_token_ids,
-            # ... other base fields ...
-            prompt_embeds=getattr(request, "prompt_embeds", None),
-            additional_information=getattr(request, "additional_information", None),
-        )
-        new_list.append(omni_nr)
-    scheduler_output.scheduled_new_reqs = new_list
-    return scheduler_output
-```
-
-The `update_from_output()` method remains unchanged, inheriting standard request lifecycle management from the base scheduler.
-
-### OmniGenerationScheduler
-
-`OmniGenerationScheduler` implements a fast-path scheduling strategy for basic heterogeneous architectures that process all input tokens in a single step.
-
-#### Modified API: `schedule()`
-
-Allocates all input tokens for a request at once (or 1 placeholder if zero), falling back to default scheduling if budget is insufficient:
-
-```python
-def schedule(self) -> SchedulerOutput:
-    # Fast path: allocate all input tokens at once
-    while self.waiting and token_budget > 0:
-        request = self.waiting.peek_request()
-        required_tokens = max(getattr(request, "num_prompt_tokens", 0), 1)
-        if required_tokens > token_budget:
-            break  # Fall back to default scheduling
-        # Allocate and schedule...
-```
-
-#### Modified API: `update_from_output()`
-
-Marks requests as finished immediately after one step, since generation models complete in a single forward pass:
-
-```python
-def update_from_output(self, ...) -> dict[int, EngineCoreOutputs]:
-    # ...
-    # Diffusion request: completes in one step
-    request.status = RequestStatus.FINISHED_STOPPED
-    kv_transfer_params = self._free_request(request)
-    # ...
-```
-
-## 4. Worker and ModelRunner Design
-
-### GPUARWorker
-
-`GPUARWorker` initializes the AR-specific model runner while maintaining standard device initialization:
-
-```python
-class GPUARWorker(GPUWorker):
-    def init_device(self):
-        # ... standard device initialization ...
-        self.model_runner = GPUARModelRunner(self.vllm_config, self.device)
-```
-
-### GPUARModelRunner
-
-`GPUARModelRunner` follows vLLM's two-phase execute/sample flow while exposing hidden states and multimodal outputs.
-
-#### Two-Phase Execution
-
-**Phase 1: `execute_model()`** - Runs forward pass and stores state:
-- Computes logits from hidden states
-- Stores `ExecuteModelState` with hidden states, logits, and multimodal outputs
-- Returns `None` to defer sampling
-
-**Phase 2: `sample_tokens()`** - Samples tokens and builds output:
-- Retrieves stored state from `execute_model()`
-- Samples tokens using logits
-- Extracts per-request hidden states and multimodal outputs
-- Builds `OmniModelRunnerOutput` with `pooler_output` containing hidden states
-
-```python
-def sample_tokens(self, grammar_output) -> OmniModelRunnerOutput:
-    # Retrieve stored state
-    hidden_states, multimodal_outputs = self.execute_model_state
-
-    # Sample tokens
-    sampler_output = self._sample(logits, spec_decode_metadata)
-
-    # Extract per-request hidden states
-    pooler_output = []
-    for rid in req_ids:
-        hidden_slice = hidden_states_cpu[start:end]
-        payload = {"hidden": hidden_slice}
-        # Add multimodal outputs if present
-        pooler_output.append(payload)
-
-    return OmniModelRunnerOutput(
-        pooler_output=pooler_output,
-        # ... other fields ...
-    )
-```
-
-### GPUGenerationModelRunner
-
-`GPUGenerationModelRunner` implements a simplified single-phase execution for basic heterogeneous architectures:
-
-- No logits computation or token sampling
-- Direct generation from forward pass in model implementation
-- Returns outputs via `pooler_output` immediately after forward pass
-
-### OmniGPUModelRunner
-
-`OmniGPUModelRunner` provides shared functionality for both AR and Generation runners:
-
-#### Prompt Embeddings Overlay
-
-During prefill, overlays custom `prompt_embeds` from request state onto `inputs_embeds`:
-
-```python
-def _collect_additional_information_for_prefill(self, num_scheduled_tokens_np):
-    for req_index, req_id in enumerate(self.input_batch.req_ids):
-        req_state = self.requests[req_id]
-        pe_cpu = getattr(req_state, "prompt_embeds_cpu", None)
-        # Overlay prompt_embeds for prefill portion
-        if pe_cpu is not None:
-            src = pe_cpu[num_computed_tokens:num_computed_tokens + overlay_len]
-            self.inputs_embeds[start_offset:start_offset + overlay_len].copy_(src)
-```
-
-#### Additional Information Processing
-
-Decodes and manages `additional_information` payloads:
-- Decodes serialized payloads → CPU tensors in request state
-- Passes runtime information to model via `runtime_additional_information` kwarg
-- Processes model-provided updates via `postprocess()` hook
-- Merges updates back into request state
-
-#### M-RoPE Position Initialization
-
-For multimodal models using M-RoPE (e.g., Qwen2-VL), computes position encodings from multimodal feature metadata (image grids, video grids, audio features).
-
-## 5. Input/Output Processing
-
-### Processing Pipeline
-
-The input/output processing pipeline handles serialization, routing, and accumulation of multimodal data:
-
-```mermaid
-sequenceDiagram
- participant Client
- participant AsyncOmniEngine
- participant InputProcessor
- participant Scheduler
- participant ModelRunner
- participant MultimodalOutputProcessor
-
- Client->>AsyncOmniEngine: prompt + prompt_embeds + additional_info
- AsyncOmniEngine->>InputProcessor: process_inputs()
- InputProcessor->>Scheduler: EngineCoreRequest
- AsyncOmniEngine->>AsyncOmniEngine: _upgrade_to_omni_request() + serialize_additional_information()
- Scheduler->>ModelRunner: OmniNewRequestData (with payloads)
- ModelRunner->>ModelRunner: Decode payloads → CPU tensors
- ModelRunner->>ModelRunner: Overlay prompt_embeds on inputs_embeds
- ModelRunner->>ModelRunner: Forward pass with runtime_additional_information
- ModelRunner->>ModelRunner: Extract hidden states + multimodal outputs
- ModelRunner->>MultimodalOutputProcessor: OmniModelRunnerOutput (pooler_output)
- MultimodalOutputProcessor->>MultimodalOutputProcessor: Route by output_type
- MultimodalOutputProcessor->>MultimodalOutputProcessor: Accumulate tensors in OmniRequestState
- MultimodalOutputProcessor->>MultimodalOutputProcessor: Consolidate tensor lists
- MultimodalOutputProcessor->>Client: RequestOutput (with multimodal_output)
-```
-
-### Stage-0 Input Processing
-
-Stage-0 now uses upstream `InputProcessor` directly, and `AsyncOmniEngine` upgrades the request to `OmniEngineCoreRequest` while restoring omni-specific payloads.
-
-```python
-request = self.input_processor.process_inputs(
-    request_id=request_id,
-    prompt=prompt,
-    params=params,
-    supported_tasks=self.supported_tasks,
-)
-request = _upgrade_to_omni_request(request, prompt)
-```
-
-### MultimodalOutputProcessor
-
-`MultimodalOutputProcessor` routes outputs by modality type and accumulates multimodal tensors.
-
-#### Output Routing
-
-Routes `EngineCoreOutput` by `output_type` attribute:
-- `"text"`: Standard text generation path
-- `"image"`, `"audio"`, `"latents"`: Extract from `pooling_output` or `multimodal_outputs`
-- Fallback: Heuristic based on presence of `pooling_output`
-
-#### Tensor Accumulation
-
-`OmniRequestState` accumulates multimodal tensors across multiple steps:
-
-```python
-def add_multimodal_tensor(self, payload, mm_type):
-    # Normalize payload to dict
-    incoming = {mm_type or "hidden": payload}
-
-    # Accumulate: convert tensors to lists for deferred concatenation
-    if isinstance(v, torch.Tensor) and isinstance(existing, torch.Tensor):
-        self.mm_accumulated[k] = [existing, v]  # List accumulation
-```
-
-Before final output, consolidates tensor lists via concatenation:
-
-```python
-def _consolidate_multimodal_tensors(self):
-    for k, v in self.mm_accumulated.items():
-        if isinstance(v, list) and isinstance(v[0], torch.Tensor):
-            self.mm_accumulated[k] = torch.cat(v, dim=0)  # Concatenate
-```
-
-The consolidated tensors are attached to `RequestOutput.multimodal_output` for consumption by downstream stages or clients.
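-
-The accumulate-then-consolidate pattern can be sketched without `torch`. In this hypothetical `ToyRequestState`, lists of floats play the role of tensors and `itertools.chain` plays the role of `torch.cat`; the class and method names are assumptions for illustration only.
-
-```python
-from itertools import chain
-
-
-class ToyRequestState:
-    """Hypothetical stand-in; lists of floats play the role of tensors."""
-
-    def __init__(self):
-        self.mm_accumulated = {}
-
-    def add_multimodal_chunk(self, payload, mm_type=None):
-        key = mm_type or "hidden"
-        # Append only; the concatenation is deferred until the end.
-        self.mm_accumulated.setdefault(key, []).append(payload)
-
-    def consolidate(self):
-        # One concatenation per key, performed once before the final output.
-        return {k: list(chain.from_iterable(v)) for k, v in self.mm_accumulated.items()}
-
-
-state = ToyRequestState()
-state.add_multimodal_chunk([0.1, 0.2], "audio")
-state.add_multimodal_chunk([0.3], "audio")
-state.add_multimodal_chunk([1.0])
-result = state.consolidate()
-```
-
-Appending chunks to a list and concatenating once avoids re-copying the growing buffer on every step, which is what repeated concatenation would do.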
-
-## 6. Summary
-
-The AR module of vLLM-Omni extends vLLM through strategic inheritance and minimal API modifications:
-
-### Key Design Patterns
-
-1. **Inheritance over composition**: Extends vLLM classes to preserve compatibility with existing scheduling, batching, and execution mechanisms
-2. **Payload serialization**: Uses serialized `additional_information` payloads together with prompt-embedding handoff for efficient inter-stage transfer
-3. **Two-phase execution**: Maintains vLLM's execute/sample separation for AR models while supporting single-phase execution for generation models
-4. **Multimodal routing**: Routes outputs by `output_type` and accumulates tensors incrementally to support streaming
-
-### Differences from vLLM
-
-- **Payload support**: Serialized additional information and prompt embeddings enable direct transfer between pipeline stages
-- **Multimodal handling**: Extended input/output processors support images, audio, and other modalities alongside text
-- **Hidden state exposure**: AR model runners expose per-request hidden states via `pooler_output` for downstream consumption
-- **Generation scheduler**: Fast-path scheduling for basic heterogeneous architectures that complete in one step
-
-The AR module seamlessly integrates with vLLM's existing infrastructure while adding the necessary extensions for multi-stage, multimodal generation pipelines.
diff --git a/docs/design/module/async_omni_architecture.md b/docs/design/module/async_omni_architecture.md
deleted file mode 100644
index 92b13a3da08..00000000000
--- a/docs/design/module/async_omni_architecture.md
+++ /dev/null
@@ -1,203 +0,0 @@
-# AsyncOmni Architecture (Qwen3-Omni Example)
-
-## 1. System Architecture
-
-```text
- ┌─────────────────────────────────────────────────────────────────────────────────┐
- │ API Layer │
- │ ┌─────────────────────────────────────┐ ┌──────────────────────────────────┐ │
- │ │ AsyncOmni (EngineClient) │ │ Omni │ │
- │ │ • generate() / abort() / shutdown() │ │ • generate() │ │
- │ │ • _final_output_handler() │ │ │ │
- │ └─────────────────────────────────────┘ └──────────────────────────────────┘ │
- ├─────────────────────────────────────────────────────────────────────────────────┤
- │ Engine Layer (Proxy) │
- │ ┌───────────────────────────────────────────────────────────────────────────┐ │
- │ │ AsyncOmniEngine │ │
- │ │ • _bootstrap_orchestrator() & _initialize_stages() │ │
- │ │ • add_request() / add_request_async() -> input_processor.process_inputs() │ │
- │ │ • try_get_output() / try_get_output_async() │ │
- │ └───────────────────┬─────────────────────────────────▲─────────────────────┘ │
- │ request_queue (janus.Queue) output_queue (janus.Queue) │
- ├──────────────────────┼─────────────────────────────────┼────────────────────────┤
- │ ▼ Orchestration Layer │ │
- │ ┌───────────────────────────────────────────────────────────────────────────┐ │
- │ │ Orchestrator [background thread] │ │
- │ │ • _request_handler() │ │
- │ │ - stage_client.add_request_async() & _prewarm_async_chunk_stages() │ │
- │ │ • _orchestration_output_handler() │ │
- │ │ - _process_stage_outputs() -> output_processors[i].process_outputs() │ │
- │ │ - _route_output() & _forward_to_next_stage() │ │
- │ └──────────┬─────────────────────────┬────────────────────────┬─────────────┘ │
- ├─────────────┼─────────────────────────┼────────────────────────┼────────────────┤
- │ │ Communication Layer │ │
- │ ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐ │
- │ │ StageEngineCoreClient │ │ StageEngineCoreClient │ │ StageDiffusionClient │ │
- │ │ • ZMQ ROUTER / PULL │ │ • ZMQ ROUTER / PULL │ │ • ZMQ ROUTER / PULL │ │
- │ │ • Msgpack codec │ │ • Msgpack codec │ │ • Msgpack codec │ │
- │ └──────────┬────────────┘ └──────────┬────────────┘ └──────────┬────────────┘ │
- │ ▼ ZMQ IPC ▼ ZMQ IPC ▼ ZMQ IPC │
- ├─────────────────────────────────────────────────────────────────────────────────┤
- │ Execution Layer │
- │ ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐ │
- │ │ StageCoreProc │ │ StageCoreProc │ │ DiffusionEngine │ │
- │ │ [background process] │ │ [background process] │ │ [background process] │ │
- │ └───────────────────────┘ └───────────────────────┘ └───────────────────────┘ │
- └─────────────────────────────────────────────────────────────────────────────────┘
-```
-
-## 2. Execution Flow (Arrow Steps, one generate request)
-
-```text
-[1] App
-    -> AsyncOmni.generate(prompt, request_id)
-
-[2] AsyncOmni
-    -> _final_output_handler() (started on first request)
-    -> AsyncOmniEngine.add_request(stage_id=0, ...)
-
-[3] AsyncOmniEngine.add_request
-    -> (if stage-0 is llm and input is not EngineCoreRequest)
-         InputProcessor.process_inputs()
-         OutputProcessor[0].add_request()
-    -> request_queue.put(add_request_msg)
-
-[4] Orchestrator._request_handler
-    -> _handle_add_request(msg)
-    -> stage_clients[0].add_request_async(...)
-
-[5] Orchestrator._orchestration_loop (loop)
-    -> poll stage output
-         - llm stage: await get_output_async()
-         - diffusion stage: get_diffusion_output_nowait()
-    -> (llm stage) output_processors[i].process_outputs(...)
-    -> _route_output(...)
-    -> if finished and not final_stage and non-async-chunk:
-         _forward_to_next_stage(...)
-           -> next_stage.add_request_async(...)
-    -> output_queue.put(output)
-
-[6] AsyncOmni._final_output_loop (background coroutine)
-    -> AsyncOmniEngine.try_get_output_async()
-    -> route by request_id to ClientRequestState.queue
-
-[7] AsyncOmni._process_orchestrator_results
-    -> read from ClientRequestState.queue
-    -> _process_single_result(...)
-    -> yield OmniRequestOutput
-
-[8] Exit condition
-    -> receive result["finished"] == True
-    -> generate() ends
-```
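-
-The queue hand-off between the caller and the background orchestrator thread can be sketched with the standard library. This is a simplified illustration under stated assumptions, not the project's code: `queue.Queue` stands in for `janus.Queue` (which additionally exposes an async side), the two "stages" are plain string functions, and the `None` shutdown sentinel is hypothetical.
-
-```python
-import queue
-import threading
-
-request_queue = queue.Queue()
-output_queue = queue.Queue()
-
-
-def orchestrator():
-    # Background thread: pull a request, run it through two toy stages,
-    # and publish the final result on the output queue.
-    while True:
-        msg = request_queue.get()
-        if msg is None:  # hypothetical shutdown sentinel
-            break
-        request_id, prompt = msg
-        stage0_out = prompt.upper()    # stand-in for stage 0
-        stage1_out = stage0_out[::-1]  # stand-in for the next stage
-        output_queue.put(
-            {"request_id": request_id, "output": stage1_out, "finished": True}
-        )
-
-
-thread = threading.Thread(target=orchestrator, daemon=True)
-thread.start()
-
-request_queue.put(("req-0", "abc"))
-result = output_queue.get(timeout=5)
-
-request_queue.put(None)
-thread.join(timeout=5)
-```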
-
-## 3. Runtime Sequence (one generate request)
-
-```mermaid
-sequenceDiagram
- participant APP as App
- participant AO as AsyncOmni
- participant ENG as AsyncOmniEngine
- participant ORCH as Orchestrator
- participant S0 as Stage-0 Client
- participant SN as Next Stage Client
-
- APP->>AO: generate
- AO->>AO: start output_handler once
- AO->>ENG: add_request(stage_id=0, ...)
- ENG->>ENG: input_processor.process_inputs()
- ENG->>ORCH: request_queue.put(add_request)
-
- ORCH->>ORCH: _handle_add_request
- ORCH->>S0: add_request_async
-
- loop poll route forward
- ORCH->>S0: get_output_async / get_diffusion_output_nowait
- ORCH->>ORCH: _route_output
- alt need forward to next stage
- ORCH->>SN: add_request_async
- end
- ORCH-->>ENG: output_queue.put
- end
-
- AO->>ENG: try_get_output_async
- ENG-->>AO: message
- AO-->>APP: yield OmniRequestOutput
-```
-
-## 4. Comparison
-
-Previous topology (reference):
-
-```text
-┌────────────────────────────────────────────────────────────────────────────┐
-│ Main Process │
-│ ┌──────────────────────┐ ┌────────────────────────────────────────────┐ │
-│ │ generate() │ │ final_output_handler() │ │
-│ └──────────────────────┘ └────────────────────────────────────────────┘ │
-└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
- mp.Queue (in_q/out_q) mp.Queue (in_q/out_q) mp.Queue (in_q/out_q)
- ▼▲ ▼▲ ▼▲
-┌───────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐
-│ Worker Proc-0 │ │ Worker Proc-1 │ │ Worker Proc-2 │
-│ (Thinker LLM) │ │ (Talker LLM) │ │ (Vocoder) │
-│ ┌────────────────┐ │ │ ┌────────────────┐ │ │ ┌────────────────┐ │
-│ │_stage_worker │ │ │ │_stage_worker │ │ │ │_stage_worker │ │
-│ │_async() │ │ │ │_async() │ │ │ │_async() │ │
-│ └────────────────┘ │ │ └────────────────┘ │ │ └────────────────┘ │
-│ ┌────────────────┐ │ │ ┌────────────────┐ │ │ ┌────────────────┐ │
-│ │output_handler()│ │ │ │output_handler()│ │ │ │output_handler()│ │
-│ └────────────────┘ │ │ └────────────────┘ │ │ └────────────────┘ │
-└──────────┬────────────┘ └──────────┬────────────┘ └──────────┬───────────┘
- ZMQ ▼ ▲ ZMQ ZMQ ▼ ▲ ZMQ ZMQ ▼ ▲ ZMQ
-┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
-│ EngineCore Proc-0 │ │ EngineCore Proc-1 │ │ EngineCore Proc-2 │
-│ (Thinker) │ │ (Talker) │ │ (Vocoder) │
-└──────────────────────┘ └──────────────────────┘ └──────────────────────┘
-```
-
-Current topology:
-
-```text
-┌────────────────────────────────────────────────────────────────────────────┐
-│ Main Process │
-│ ┌──────────────────────────────────────────────────────────────────────┐ │
-│ │ Main Thread │ │
-│ │ ┌──────────────────────┐ ┌─────────────────────────────────────┐ │ │
-│ │ │ generate() │ │ final_output_handler() │ │ │
-│ │ └──────────────────────┘ └─────────────────────────────────────┘ │ │
-│ └──────────────────────────────────────────────────────────────────────┘ │
-│ janus.Queue (request_queue) ▼ ▲ janus.Queue (output_queue) │
-│ ┌──────────────────────────────────────────────────────────────────────┐ │
-│ │ Orchestrator Thread │ │
-│ │ ┌──────────────────────┐ ┌──────────────────────────────────────┐ │ │
-│ │ │ _request_handler() │ │ _orchestration_output_handler() │ │ │
-│ │ └──────────────────────┘ └──────────────────────────────────────┘ │ │
-│ │ ┌────────────────────────────────────────────────────────────────┐ │ │
-│ │ │ _orchestration_loop(): poll/process/route outputs for all stages│ │ │
-│ │ └────────────────────────────────────────────────────────────────┘ │ │
-│ └───────┬─────────────────────────┬─────────────────────────┬──────────┘ │
-└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
- ZMQ ▼ ▲ ZMQ ZMQ ▼ ▲ ZMQ ZMQ ▼ ▲ ZMQ
- ┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
- │ EngineCore Proc-0 │ │ EngineCore Proc-1 │ │ EngineCore Proc-2 │
- │ (Thinker) │ │ (Talker) │ │ (Vocoder) │
- └──────────────────────┘ └──────────────────────┘ └──────────────────────┘
-```
-
-
-Test scripts:
-```bash
-# enter offline inference folder.
-cd examples/offline_inference/qwen2_5_omni
-python end2end.py --output-dir output_audio --query-type use_mixed_modalities
-
-cd ../qwen3_omni
-python end2end.py --output-dir output_audio --query-type text --async-chunk --enable-stats
-
-cd ../bagel
-python end2end.py --prompts "A cute cat"
-
-cd ../text_to_image
-python text_to_image.py --prompt "a cup of coffee on the table" --output output.png
-```
diff --git a/docs/design/module/dit_module.md b/docs/design/module/dit_module.md
deleted file mode 100644
index b0c7e9fc7fb..00000000000
--- a/docs/design/module/dit_module.md
+++ /dev/null
@@ -1,945 +0,0 @@
----
-toc_depth: 4
----
-
-# Diffusion Module Architecture Design
-
-The vLLM-Omni diffusion module (`vllm_omni/diffusion`) is a high-performance inference engine for diffusion models, designed with a modular architecture that separates concerns across multiple components. It provides efficient execution for non-autoregressive generation tasks such as image and video generation.
-
-This document describes the architecture design of the diffusion module, including the diffusion engine, scheduler, worker, diffusion pipeline, and acceleration components.
-
-
-
-
----
-
-## 1. Diffusion Engine
-
-**Location**: `vllm_omni/diffusion/diffusion_engine.py`
-
-### Responsibilities
-
-The `DiffusionEngine` is the **orchestrator** of the diffusion inference system. It manages the lifecycle of worker processes and coordinates the execution flow.
-
-### Key Components
-
-#### 1.1 Initialization
-
-```python
-class DiffusionEngine:
-    def __init__(self, od_config: OmniDiffusionConfig):
-        self.od_config = od_config
-        self.post_process_func = get_diffusion_post_process_func(od_config)
-        self.pre_process_func = get_diffusion_pre_process_func(od_config)
-        self._processes: list[mp.Process] = []
-        self._make_client()
-```
-
-**Key Features**:
-
-- **Pre/Post Processing**: Registers model-specific pre-processing and post-processing functions via registry pattern
-
-- **Worker Management**: Launches and manages multiple worker processes (one per GPU)
-
-- **Process Isolation**: Uses multiprocessing for true parallelism
-
-#### 1.2 Worker Launch Process
-
-The engine launches workers using a **spawn** method:
-
-```python
-def _launch_workers(self, broadcast_handle):
-    # Creates one process per GPU
-    for i in range(num_gpus):
-        process = mp.Process(
-            target=worker_proc.worker_main,
-            args=(i, od_config, writer, broadcast_handle),
-            name=f"DiffusionWorker-{i}",
-        )
-        process.start()
-```
-
-**Design Decisions**:
-
-- **Spawn Method**: Ensures clean state for each worker (no shared memory issues)
-
-- **Pipe Communication**: Uses `mp.Pipe` for initialization handshake
-
-- **Device Selection**: Each worker is assigned a specific GPU (`cuda:{rank}`)
-
-#### 1.3 Request Processing Flow
-
-```python
-def step(self, requests: list[OmniDiffusionRequest]):
-    # 1. Pre-process requests
-    requests = self.pre_process_func(requests)
-
-    # 2. Send to scheduler and wait for response
-    output = self.add_req_and_wait_for_response(requests)
-
-    # 3. Post-process results
-    result = self.post_process_func(output.output)
-    return result
-```
-
-**Flow**:
-
-1. **Pre-processing**: Applies model-specific transformations
-
-2. **Scheduling**: Delegates to scheduler for distribution
-
-3. **Post-processing**: Converts raw outputs to final format (e.g., PIL images)
-
----
-
-## 2. Scheduler
-
-**Location**: `vllm_omni/diffusion/sched/`
-
-### Architecture
-
-The scheduler is a **request-state scheduler**. It owns request lifecycle management and scheduling decisions, while execution stays in `DiffusionEngine` and the executor.
-
-### Key Components
-
-#### 2.1 Scheduler Interface
-
-```python
-class SchedulerInterface(ABC):
-    def add_request(self, request: OmniDiffusionRequest) -> str: ...
-    def schedule(self) -> DiffusionSchedulerOutput: ...
-    def update_from_output(
-        self,
-        sched_output: DiffusionSchedulerOutput,
-        output: DiffusionOutput,
-    ) -> set[str]: ...
-```
-
-**Responsibilities**:
-
-- **Lifecycle contract**: Defines how the engine adds requests, triggers one scheduling cycle, and feeds executor results back.
-
-- **Stable boundary**: `DiffusionSchedulerOutput` is the only scheduling result consumed by `DiffusionEngine`.
-
-- **Pluggability**: Different scheduler policies can reuse the same engine integration path.
-
-#### 2.2 Request State Model
-
-```python
-class DiffusionRequestStatus(enum.IntEnum):
-    WAITING = ...
-    RUNNING = ...
-    PREEMPTED = ...
-    FINISHED_COMPLETED = ...
-    FINISHED_ABORTED = ...
-    FINISHED_ERROR = ...
-
-@dataclass
-class DiffusionRequestState:
-    sched_req_id: str
-    req: OmniDiffusionRequest
-    status: DiffusionRequestStatus = DiffusionRequestStatus.WAITING
-```
-
-**Design Features**:
-
-- **Scheduler-owned ID**: Each `OmniDiffusionRequest` is tracked by an internal `sched_req_id`, separated from public `request_id` values.
-
-- **Explicit lifecycle**: Requests move through waiting, running, optional preemption, and terminal states.
-
-- **Centralized error handling**: Completion, abort, and error states are all normalized in the scheduler layer.
-
-#### 2.3 Shared Bookkeeping in `_BaseScheduler`
-
-```python
-class _BaseScheduler(SchedulerInterface):
-    def __init__(self) -> None:
-        self._request_states = {}
-        self._request_id_to_sched_req_id = {}
-        self._waiting = deque()
-        self._running = []
-        self._finished_req_ids = set()
-        self.max_num_running_reqs = 1
-```
-
-**Design Features**:
-
-- **Common state storage**: Shared request maps and the waiting/running collections live in the base class.
-
-- **Shared cleanup logic**: Request-id registration, finish handling, and state removal are centralized instead of duplicated in each policy.
-
-- **Current constraint**: `max_num_running_reqs` remains `1` because the current engine path is still synchronous request-mode execution.
-
-#### 2.4 Current `RequestScheduler` Policy
-
-```python
-class RequestScheduler(_BaseScheduler):
-    def schedule(self) -> DiffusionSchedulerOutput:
-        # 1. keep existing RUNNING requests in the scheduling result
-        # 2. pull WAITING requests while capacity remains
-        # 3. move newly admitted requests into RUNNING
-```
-
-**Behavior**:
-
-- **FIFO request scheduling**: Waiting requests are promoted in queue order.
-
-- **Single-request admission**: The current policy only admits one active request at a time.
-
-- **Executor result feedback**: `update_from_output()` converts executor output into `FINISHED_COMPLETED` or `FINISHED_ERROR` and returns finished scheduler ids.
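-
-The FIFO, single-admission behavior can be sketched as a toy scheduler. `ToyScheduler` is hypothetical and much simpler than `RequestScheduler`: it tracks bare request ids, has no preemption or abort handling, and treats every scheduled request as finished after one executor call.
-
-```python
-from collections import deque
-
-
-class ToyScheduler:
-    """Hypothetical FIFO scheduler with a cap on running requests."""
-
-    def __init__(self, max_num_running_reqs=1):
-        self.waiting = deque()
-        self.running = []
-        self.finished = set()
-        self.max_num_running_reqs = max_num_running_reqs
-
-    def add_request(self, req_id):
-        self.waiting.append(req_id)
-
-    def schedule(self):
-        # Keep RUNNING requests, then admit WAITING ones in FIFO order
-        # while capacity remains.
-        while self.waiting and len(self.running) < self.max_num_running_reqs:
-            self.running.append(self.waiting.popleft())
-        return list(self.running)
-
-    def update_from_output(self, scheduled, ok=True):
-        # In this toy model every scheduled request reaches a terminal state.
-        for req_id in scheduled:
-            self.running.remove(req_id)
-            self.finished.add((req_id, "COMPLETED" if ok else "ERROR"))
-        return set(scheduled)
-
-
-sched = ToyScheduler()
-for rid in ("a", "b", "c"):
-    sched.add_request(rid)
-
-order = []
-while sched.waiting or sched.running:
-    batch = sched.schedule()
-    order.extend(batch)
-    sched.update_from_output(batch)
-```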
-
-#### 2.5 Engine-Driven Execution Loop
-
-```python
-sched_req_id = scheduler.add_request(request)
-while True:
-    sched_output = scheduler.schedule()
-    output = executor.add_req(req)
-    finished_req_ids = scheduler.update_from_output(sched_output, output)
-```
-
-**Design Decisions**:
-
-- **Separation of concerns**: Scheduler manages state and policy; executor handles runtime execution.
-
-- **No scheduler-owned IPC**: Scheduler no longer talks to workers directly.
-
-- **Conservative concurrency**: The current request-mode implementation still allows only one active request at a time.
-
----
-
-## 3. Worker
-
-**Location**: `vllm_omni/diffusion/worker/gpu_worker.py`
-
-### Architecture
-
-Workers are **independent processes** that execute the actual model inference. Each worker runs on a dedicated GPU and participates in distributed inference.
-
-### Key Components
-
-#### 3.1 Worker Process Structure
-
-```python
-class WorkerProc:
-    def __init__(self, od_config, gpu_id, broadcast_handle):
-        # Initialize ZMQ context for IPC
-        self.context = zmq.Context(io_threads=2)
-
-        # Connect to broadcast queue (receive requests)
-        self.mq = MessageQueue.create_from_handle(broadcast_handle, gpu_id)
-
-        # Create result queue (only rank 0)
-        if gpu_id == 0:
-            self.result_mq = MessageQueue(n_reader=1, ...)
-
-        # Initialize GPU worker
-        self.worker = GPUWorker(local_rank=gpu_id, rank=gpu_id, od_config=od_config)
-```
-
-**Initialization Steps**:
-
-1. **IPC Setup**: Creates ZMQ context and message queues
-
-2. **Distributed Environment Setup**: Initializes PyTorch distributed communication
-
- - For CUDA GPUs: Uses NCCL (fast GPU communication)
-
- - For NPU: Uses HCCL (Huawei Collective Communications Library)
-
- - For other devices: Uses appropriate backend (GLOO, MCCL, etc.)
-
-3. **Model Loading**: Loads diffusion pipeline on assigned GPU
-
-4. **Cache Setup**: Enables cache backend if configured.
-
-#### 3.2 GPU Worker
-
-```python
-class GPUWorker:
-    def init_device_and_model(self):
-        # Set distributed environment variables
-        os.environ["RANK"] = str(rank)
-        os.environ["WORLD_SIZE"] = str(world_size)
-
-        # Initialize PyTorch distributed
-        init_distributed_environment(world_size, rank)
-        parallel_config = self.od_config.parallel_config
-        initialize_model_parallel(
-            data_parallel_size=parallel_config.data_parallel_size,
-            cfg_parallel_size=parallel_config.cfg_parallel_size,
-            sequence_parallel_size=parallel_config.sequence_parallel_size,
-            tensor_parallel_size=parallel_config.tensor_parallel_size,
-            pipeline_parallel_size=parallel_config.pipeline_parallel_size,
-        )
-
-        # Load model
-        model_loader = DiffusersPipelineLoader(load_config)
-        self.pipeline = model_loader.load_model(od_config, load_device=f"cuda:{rank}")
-
-        # Setup cache backend
-        from vllm_omni.diffusion.cache.selector import get_cache_backend
-        self.cache_backend = get_cache_backend(od_config.cache_backend, od_config.cache_config)
-
-        if self.cache_backend is not None:
-            self.cache_backend.enable(self.pipeline)
-```
-
-**Key Features**:
-
-- **Tensor Parallelism**: Supports multi-GPU tensor parallelism via PyTorch distributed
-
-- **Model Loading**: Uses `DiffusersPipelineLoader` for efficient weight loading
-
-- **Cache Integration**: Enables cache backends (TeaCache, cache-dit, etc.) transparently
-
-#### 3.3 Worker Busy Loop
-
-```python
-def worker_busy_loop(self):
-    while self._running:
-        # 1. Receive unified message (generation request, RPC request, or shutdown)
-        msg = self.recv_message()
-
-        # 2. Route message based on type
-        if isinstance(msg, dict) and msg.get("type") == "rpc":
-            # Handle RPC request
-            result, should_reply = self.execute_rpc(msg)
-            if should_reply:
-                self.return_result(result)
-
-        elif isinstance(msg, dict) and msg.get("type") == "shutdown":
-            # Handle shutdown message
-            self._running = False
-
-        else:
-            # Handle generation request (OmniDiffusionRequest list)
-            output = self.worker.execute_model(msg, self.od_config)
-            self.return_result(output)
-```
-
-**Execution Flow**:
-
-1. **Receive**: Dequeues unified messages from shared memory queue
-
-2. **Route**: Handles different message types (generation, RPC, shutdown)
-
-3. **Execute**: Runs forward pass through pipeline for generation requests
-
-4. **Respond**: Sends results back (rank 0 for generation, specified rank for RPC)
-
-#### 3.4 Model Execution
-
-```python
-@torch.inference_mode()
-def execute_model(self, reqs: list[OmniDiffusionRequest], od_config):
-    req = reqs[0]  # TODO: support batching
-
-    # Refresh cache backend if enabled
-    if self.cache_backend is not None and self.cache_backend.is_enabled():
-        self.cache_backend.refresh(self.pipeline, req.num_inference_steps)
-
-    # Set forward context for parallelism
-    with set_forward_context(
-        vllm_config=self.vllm_config,
-        omni_diffusion_config=self.od_config
-    ):
-        output = self.pipeline.forward(req)
-    return output
-```
-
-The model execution leverages multiple parallelism strategies that are transparently applied during the forward pass. The `set_forward_context()` context manager makes parallel group information available throughout the forward pass:
-
-```python
-# Inside transformer layers, parallel groups are accessed via:
-from vllm_omni.diffusion.distributed.parallel_state import (
-    get_sp_group, get_dp_group, get_cfg_group, get_pp_group
-)
-```
-
-**Optimizations**:
-
-- **Cache Refresh**: Resets cache state before each generation so that no state carries over between requests
-
-- **Context Management**: Forward context ensures parallel groups are available during execution
-
-- **Single Request**: Currently processes one request at a time (batching TODO)
-
----
-
-## 4. Diffusion Pipeline
-
-**Location**: `vllm_omni/diffusion/models/*/pipeline_*.py`
-
-The pipeline is the **model-specific implementation** that orchestrates the diffusion process. Different models (QwenImage, Wan2.2, Z-Image) have their own pipeline implementations.
-
-Most pipeline implementations are adapted from `diffusers`. The multi-step diffusion loop, defined by the `diffuse` function in the pipeline class, is usually the most time-consuming part of inference. An example follows:
-
-```python
-def diffuse(self, ...):
-    for i, t in enumerate(timesteps):
-        # Forward pass for positive prompt
-        transformer_kwargs = {
-            "hidden_states": latents,
-            "timestep": timestep / 1000,
-            "encoder_hidden_states": prompt_embeds,
-        }
-        noise_pred = self.transformer(**transformer_kwargs)[0]
-
-        # Forward pass for negative prompt (CFG)
-        if do_true_cfg:
-            neg_transformer_kwargs = {...}
-            neg_transformer_kwargs["cache_branch"] = "negative"
-            neg_noise_pred = self.transformer(**neg_transformer_kwargs)[0]
-
-            # Combine predictions
-            comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
-            noise_pred = comb_pred * (cond_norm / noise_norm)
-
-        # Scheduler step
-        latents = self.scheduler.step(noise_pred, t, latents)[0]
-
-    return latents
-```
-
-**Key Features**:
-
-- **CFG Support**: Handles classifier-free guidance with separate forward passes
-
-- **Cache Branching**: Uses `cache_branch` parameter for cache-aware execution
-
-- **True CFG**: Implements advanced CFG with norm preservation
-
-To learn more about the diffusion pipeline and how to add a new one, see [Adding Diffusion Model](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/model/adding_diffusion_model).
-
----
-
-## 5. Acceleration Components
-
-### 5.1 Attention Backends
-
-**Location**: `vllm_omni/diffusion/attention/`
-
-#### Architecture
-
-The attention system uses a **backend selector pattern** that automatically chooses the optimal attention implementation based on hardware and model configuration.
-
-#### Backend Selection
-
-**Location**: `vllm_omni/diffusion/attention/selector.py`
-
-```python
-class Attention(nn.Module):
-    def __init__(self, num_heads, head_size, causal, softmax_scale, ...):
-        # Auto-select backend
-        self.attn_backend = get_attn_backend(-1)
-        self.attn_impl_cls = self.attn_backend.get_impl_cls()
-        self.attention = self.attn_impl_cls(...)
-```
-
-**Available Backends**:
-
-- **FlashAttention**: Optimized CUDA kernel (FA2/FA3) - memory efficient via tiling
-
-- **SDPA**: PyTorch's scaled dot-product attention - default, cross-platform
-
-- **SageAttention**: Quantized (low-bit) attention implementation from the SageAttention library
-
-- **AscendAttention**: NPU-optimized attention for Ascend hardware
-
-These backends provide the **kernel implementations** for attention computation. For attention-level sequence parallelism strategies (Ring Attention, Ulysses), see [Parallel Attention](#52-parallel-attention).
-
-#### Backend Selection Mechanism
-
-```python
-def get_attn_backend(head_size: int) -> type[AttentionBackend]:
-    # Check environment variable
-    backend_name = os.environ.get("DIFFUSION_ATTENTION_BACKEND")
-
-    if backend_name:
-        return load_backend(backend_name.upper())
-
-    # Default to SDPA
-    return SDPABackend
-```
-
-**Selection Priority**:
-
-1. **Environment Variable**: `DIFFUSION_ATTENTION_BACKEND` for manual override
-
- - Valid values: `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `ASCEND`
-
- - Example: `export DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN`
-
-2. **Automatic Fallback**: Falls back to SDPA if selected backend unavailable
-
-3. **Hardware Detection**: Can select based on device type (NPU, CUDA, etc.)
-
-**Backend Availability**:
-
-- **SDPA**: Always available (PyTorch built-in)
-
-- **FlashAttention**: Requires `flash-attn` package installed
-
-- **SageAttention**: Requires `sage-attention` package (from THU-ML GitHub)
-
-- **AscendAttention**: Only available on Ascend NPU hardware
-
-#### Attention Backend Registry
-
-**Location**: `vllm_omni/diffusion/attention/selector.py`
-
-The attention system uses a **registry pattern** to manage and dynamically load attention backends. This allows for easy extension and runtime selection of backends.
-
-
-**Registry Structure**:
-
-```python
-# Registry mapping backend names to their module paths and class names
-_BACKEND_CONFIG = {
-    "FLASH_ATTN": {
-        "module": "vllm_omni.diffusion.attention.backends.flash_attn",
-        "class": "FlashAttentionBackend",
-    },
-    "TORCH_SDPA": {
-        "module": "vllm_omni.diffusion.attention.backends.sdpa",
-        "class": "SDPABackend",
-    },
-    "SAGE_ATTN": {
-        "module": "vllm_omni.diffusion.attention.backends.sage_attn",
-        "class": "SageAttentionBackend",
-    },
-    "ASCEND": {
-        "module": "vllm_omni.diffusion.attention.backends.ascend_attn",
-        "class": "AscendAttentionBackend",
-    },
-}
-```
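-
-How such a registry can combine an environment-variable override, lazy import, and fallback to a default is sketched here. This is an assumption-laden illustration rather than the project's selector: `_TOY_BACKENDS`, `load_backend`, and `select_backend` are hypothetical, and standard-library modules stand in for the real backend modules so the snippet runs anywhere.
-
-```python
-import importlib
-import os
-
-# Hypothetical registry: stdlib modules stand in for backend modules.
-_TOY_BACKENDS = {
-    "TORCH_SDPA": {"module": "json", "class": "JSONDecoder"},
-    "FLASH_ATTN": {"module": "module_that_is_not_installed", "class": "X"},
-}
-_DEFAULT = "TORCH_SDPA"
-
-
-def load_backend(name):
-    entry = _TOY_BACKENDS[name]
-    module = importlib.import_module(entry["module"])  # imported only on demand
-    return getattr(module, entry["class"])
-
-
-def select_backend(env=None):
-    env = os.environ if env is None else env
-    name = env.get("DIFFUSION_ATTENTION_BACKEND", _DEFAULT).upper()
-    try:
-        return name, load_backend(name)
-    except (KeyError, ImportError, AttributeError):
-        # Unknown or unavailable backend: fall back to the default.
-        return _DEFAULT, load_backend(_DEFAULT)
-
-
-chosen_default = select_backend(env={})
-chosen_missing = select_backend(env={"DIFFUSION_ATTENTION_BACKEND": "flash_attn"})
-```
-
-Storing module paths as strings keeps optional dependencies out of the import path until a backend is actually requested.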
-
-#### Attention Backend Integration
-
-The `Attention` layer integrates backends through a unified interface. Here's how **FlashAttentionBackend** is integrated as an example:
-
-```python
-# attention/backends/flash_attn.py
-
-class FlashAttentionBackend(AttentionBackend):
-    @staticmethod
-    def get_name() -> str:
-        return "FLASH_ATTN"
-
-    @staticmethod
-    def get_impl_cls() -> type["FlashAttentionImpl"]:
-        return FlashAttentionImpl
-
-    @staticmethod
-    def get_supported_head_sizes() -> list[int]:
-        return [64, 96, 128, 192, 256]  # FlashAttention supports these head sizes
-
-
-class FlashAttentionImpl(AttentionImpl):
-    def __init__(self, num_heads, head_size, softmax_scale, causal, ...):
-        self.num_heads = num_heads
-        self.causal = causal
-        self.softmax_scale = softmax_scale
-
-    def forward(self, query, key, value, attn_metadata=None):
-        # Call FlashAttention kernel
-        out = flash_attn_func(
-            query, key, value,
-            causal=self.causal,
-            softmax_scale=self.softmax_scale,
-        )
-        return out
-```
-
----
-
-### 5.2 Parallel Attention
-
-**Location**: `vllm_omni/diffusion/attention/parallel/`
-
-#### Architecture
-
-Parallel attention strategies implement **Sequence Parallelism (SP) at the attention layer level**. These strategies distribute attention computation across multiple GPUs by splitting the sequence dimension, using different communication patterns. They work **on top of** AttentionBackend implementations (FlashAttention, SDPA, etc.), handling the parallelization/communication while the backends handle the actual attention computation.
-
-**Key Distinction**: Unlike AttentionBackend (which provides kernel implementations), ParallelAttentionStrategy provides communication patterns for multi-GPU attention parallelism. These strategies implement the `ParallelAttentionStrategy` interface and use AttentionBackend implementations internally.
-
-Both Ring Attention and Ulysses are forms of Sequence Parallelism (SP) that:
-
-- Split the sequence dimension across GPUs
-
-- Contribute to `sequence_parallel_size` (via `ring_degree` and `ulysses_degree`)
-
-- Work at the attention layer level (not model/pipeline level)
-
-#### Ulysses Sequence Parallelism (USP)
-
-**Location**: `vllm_omni/diffusion/attention/parallel/ulysses.py`
-
-USP is a sequence-parallel attention strategy for very long sequences that distributes both the sequence dimension and the attention heads across GPUs. It uses **all-to-all** collective operations to redistribute the Q/K/V tensors before the attention computation and to gather the results afterward.
-
-Ulysses splits attention computation in two dimensions:
-
-1. **Sequence Dimension**: Splits the sequence length across GPUs
-
-2. **Head Dimension**: Splits attention heads across GPUs
-
-**Configuration**: `ulysses_degree` contributes to `sequence_parallel_size`
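-
-The two-dimensional split can be made concrete by tracking per-GPU tensor shapes. The sketch below assumes the standard DeepSpeed-Ulysses layout (sequence-sharded inputs, head-sharded attention); `ulysses_shapes` is a hypothetical helper and the batch dimension is omitted for brevity.
-
-```python
-def ulysses_shapes(seq_len: int, num_heads: int, head_dim: int, ulysses_degree: int) -> dict:
-    """Per-GPU (sequence, heads, head_dim) shapes around the two all-to-all steps."""
-    if seq_len % ulysses_degree or num_heads % ulysses_degree:
-        raise ValueError("seq_len and num_heads must be divisible by ulysses_degree")
-    local_seq = seq_len // ulysses_degree
-    local_heads = num_heads // ulysses_degree
-    return {
-        # Each GPU starts with a slice of the sequence and all heads.
-        "before_all_to_all": (local_seq, num_heads, head_dim),
-        # After the first all-to-all: full sequence, a slice of the heads.
-        "during_attention": (seq_len, local_heads, head_dim),
-        # The second all-to-all restores the sequence-sharded layout.
-        "after_all_to_all": (local_seq, num_heads, head_dim),
-    }
-```
-
-The divisibility check reflects a practical constraint of this layout: the number of attention heads has to be a multiple of `ulysses_degree`.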
-
-#### Ring Sequence Parallelism
-
-**Location**: `vllm_omni/diffusion/attention/parallel/ring.py`
-
-Ring Attention is a **parallel attention strategy** that implements sequence parallelism using ring-based point-to-point (P2P) communication. Unlike attention backends that provide the attention kernel implementation, Ring Attention is a **communication pattern** that works on top of attention backends (FlashAttention or SDPA).
-
-Ring Attention splits the sequence dimension across GPUs arranged in a ring topology and is implemented through the `ParallelAttentionStrategy` interface rather than `AttentionBackend`. P2P ring communication circulates Key/Value blocks across GPUs. Internally, it calls `ring_flash_attn_func` or `ring_pytorch_attn_func`, depending on the available backends.
-
-**Architecture**:
-```python
-class RingParallelAttention:
-    """Ring sequence-parallel strategy."""
-
-    def run_attention(self, query, key, value, attn_metadata, ...):
-        # Selects underlying attention kernel (FlashAttention or SDPA)
-        if backend_pref == "sdpa":
-            return ring_pytorch_attn_func(...)  # Uses SDPA kernel
-        else:
-            return ring_flash_attn_func(...)  # Uses FlashAttention kernel
-```
-
-**Integration**:
-
-- Ring Attention is activated when `ring_degree > 1` in parallel config
-
-- It's selected by `build_parallel_attention_strategy()` in the attention layer
-
-- The `Attention` layer routes to `_run_ring_attention()` when Ring is enabled
-
-- Works alongside attention backends: Ring handles communication, backends handle computation
-
-**Configuration**: `ring_degree` contributes to `sequence_parallel_size`
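-
-Ring attention relies on the fact that softmax attention can be accumulated block by block as K/V blocks arrive. The following is a toy, single-process sketch of that accumulation using scalar queries, keys, and values. It is not the `ring_flash_attn_func` implementation; `full_attention` and `ring_attention` are hypothetical names, and the ring order is simulated by indexing instead of P2P communication.
-
-```python
-import math
-
-
-def full_attention(q, keys, values):
-    """Reference result: softmax(q * k_j)-weighted sum of v_j for one scalar query."""
-    scores = [q * k for k in keys]
-    m = max(scores)
-    weights = [math.exp(s - m) for s in scores]
-    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
-
-
-def ring_attention(q, kv_blocks, start_rank=0):
-    """Visit K/V blocks in ring order, merging partial results with a running
-    max (m), normalizer (l), and unnormalized output (acc)."""
-    m, l, acc = -math.inf, 0.0, 0.0
-    n = len(kv_blocks)
-    for step in range(n):
-        # Block that would arrive from the ring neighbour at this step.
-        keys, values = kv_blocks[(start_rank - step) % n]
-        for k, v in zip(keys, values):
-            s = q * k
-            m_new = max(m, s)
-            scale = math.exp(m - m_new)  # rescale the old partial sums
-            p = math.exp(s - m_new)
-            l = l * scale + p
-            acc = acc * scale + p * v
-            m = m_new
-    return acc / l
-```
-
-Whichever block a rank starts from, the accumulated result matches full attention up to floating-point rounding, which is why each GPU can begin with its local block and process the others as they circulate.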
-
-#### Relationship with AttentionBackend
-
-Parallel attention strategies (Ring, Ulysses) work **on top of** AttentionBackend implementations:
-
-- They use AttentionBackend for the actual attention computation (FlashAttention, SDPA, etc.)
-
-- They handle the multi-GPU communication/parallelization layer
-
-- They implement `ParallelAttentionStrategy` interface (not `AttentionBackend`)
-
-For general parallelism strategies (Data Parallelism, Tensor Parallelism, Pipeline Parallelism), see [Parallel Strategies](#54-parallel-strategies).
-
----
-
-### 5.3 Cache Backends
-
-**Location**: `vllm_omni/diffusion/cache/`
-
-#### Architecture
-
-Cache backends provide a **unified interface** for applying different caching strategies to accelerate diffusion inference. The system supports multiple backends (TeaCache, cache-dit) with a consistent API for enabling and refreshing cache state.
-
-#### Cache Backend Interface
-
-```python
-class CacheBackend(ABC):
-    def __init__(self, config: DiffusionCacheConfig):
-        self.config = config
-        self.enabled = False
-
-    @abstractmethod
-    def enable(self, pipeline: Any) -> None:
-        """Enable cache on the pipeline."""
-        raise NotImplementedError
-
-    @abstractmethod
-    def refresh(self, pipeline: Any, num_inference_steps: int, verbose: bool = True) -> None:
-        """Refresh cache state for new generation."""
-        raise NotImplementedError
-
-    def is_enabled(self) -> bool:
-        """Check if cache is enabled."""
-        return self.enabled
-```
-
-**Design Pattern**:
-
-- **Abstract Base Class**: Defines contract for all cache backends
-
-- **Pipeline-based**: Works with pipeline instances (not just transformers)
-
-- **State Management**: Provides refresh mechanism for clean state between generations
-
-#### Available Backends
-
-**1. TeaCache Backend**
-
-**Location**: `vllm_omni/diffusion/cache/teacache/backend.py`
-
-```python
-class TeaCacheBackend(CacheBackend):
-    def enable(self, pipeline: Any):
-        # Extract transformer from pipeline
-        transformer = pipeline.transformer
-        transformer_type = transformer.__class__.__name__
-
-        # Create TeaCacheConfig from DiffusionCacheConfig
-        teacache_config = TeaCacheConfig(
-            transformer_type=transformer_type,
-            rel_l1_thresh=self.config.rel_l1_thresh,
-            coefficients=self.config.coefficients,
-        )
-
-        # Apply hooks to transformer
-        apply_teacache_hook(transformer, teacache_config)
-        self.enabled = True
-
-    def refresh(self, pipeline: Any, num_inference_steps: int, verbose: bool = True):
-        transformer = pipeline.transformer
-        if hasattr(transformer, "_hook_registry"):
-            transformer._hook_registry.reset_hook(TeaCacheHook._HOOK_NAME)
-```
-
-**TeaCache Features**:
-
-- **Timestep-aware**: Caches based on timestep embedding similarity
-
-- **Adaptive**: Dynamically decides when to reuse cached computations
-
-- **CFG-aware**: Handles positive/negative branches separately
-
-- **Custom Hook System**: Uses a custom forward interception mechanism (via `HookRegistry`) that wraps the module's `forward` method, allowing transparent integration without modifying model code
-
-**2. Cache-DiT Backend**
-
-**Location**: `vllm_omni/diffusion/cache/cache_dit_backend.py`
-
-```python
-class CacheDiTBackend(CacheBackend):
-    def enable(self, pipeline: Any):
-        # Uses cache-dit library for acceleration
-        # Supports DBCache, SCM (Step Computation Masking), TaylorSeer
-        # Works with single and dual-transformer architectures
-        ...
-        self.enabled = True
-
-    def refresh(self, pipeline: Any, num_inference_steps: int, verbose: bool = True):
-        # Updates cache context with new num_inference_steps
-        ...
-```
-
-**Cache-DiT Features**:
-
-- **DBCache**: Dynamic block caching with configurable compute blocks
-
-- **SCM**: Step Computation Masking for additional speedup
-
-- **TaylorSeer**: Advanced calibration for cache accuracy
-
-- **Dual-transformer Support**: Handles models like Wan2.2 with two transformers
-
-#### Cache Backend Selector
-
-**Location**: `vllm_omni/diffusion/cache/selector.py`
-
-```python
-def get_cache_backend(
-    cache_backend: str | None,
-    cache_config: dict | DiffusionCacheConfig,
-) -> CacheBackend | None:
-    """Get cache backend instance based on cache_backend string.
-
-    Args:
-        cache_backend: Cache backend name ("cache_dit", "tea_cache", "none", or None)
-        cache_config: Cache configuration (dict or DiffusionCacheConfig)
-
-    Returns:
-        Cache backend instance, or None if cache_backend is None or "none"
-    """
-    if cache_backend is None or cache_backend == "none":
-        return None
-
-    if isinstance(cache_config, dict):
-        cache_config = DiffusionCacheConfig.from_dict(cache_config)
-
-    if cache_backend == "cache_dit":
-        return CacheDiTBackend(cache_config)
-    elif cache_backend == "tea_cache":
-        return TeaCacheBackend(cache_config)
-    else:
-        raise ValueError(f"Unsupported cache backend: {cache_backend}")
-```
-
-**Usage Flow**:
-
-1. **Selection**: `get_cache_backend()` returns appropriate backend instance
-
-2. **Enable**: `backend.enable(pipeline)` called during worker initialization
-
-3. **Refresh**: `backend.refresh(pipeline, num_inference_steps)` called before each generation
-
-4. **Check**: `backend.is_enabled()` verifies cache is active
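-
-The lifecycle above can be exercised end to end with a toy backend. This is a minimal sketch under stated assumptions: the `CacheBackend` class is a trimmed copy of the interface (config handling omitted), while `StepCountingBackend` and `FakePipeline` are hypothetical stand-ins that only record lifecycle calls.
-
-```python
-from abc import ABC, abstractmethod
-from typing import Any
-
-
-class CacheBackend(ABC):
-    """Trimmed copy of the interface (config handling omitted)."""
-
-    def __init__(self) -> None:
-        self.enabled = False
-
-    @abstractmethod
-    def enable(self, pipeline: Any) -> None: ...
-
-    @abstractmethod
-    def refresh(self, pipeline: Any, num_inference_steps: int, verbose: bool = True) -> None: ...
-
-    def is_enabled(self) -> bool:
-        return self.enabled
-
-
-class StepCountingBackend(CacheBackend):
-    """Hypothetical backend that only records lifecycle calls."""
-
-    def enable(self, pipeline: Any) -> None:
-        pipeline.cache_state = {"steps": None, "refreshes": 0}
-        self.enabled = True
-
-    def refresh(self, pipeline: Any, num_inference_steps: int, verbose: bool = True) -> None:
-        pipeline.cache_state["steps"] = num_inference_steps
-        pipeline.cache_state["refreshes"] += 1
-
-
-class FakePipeline:
-    """Stand-in for a diffusion pipeline object."""
-
-
-pipeline = FakePipeline()
-backend = StepCountingBackend()
-backend.enable(pipeline)        # once, at worker initialization
-for steps in (20, 50):          # once before each generation
-    backend.refresh(pipeline, steps)
-```
-
-Keeping `enable` as a one-time setup and `refresh` as a per-request reset is what lets a single worker serve requests with different `num_inference_steps` without re-installing hooks.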
-
-### 5.4 Parallel Strategies
-
-**Location**: `vllm_omni/diffusion/distributed/parallel_state.py`
-
-#### Parallelism Types
-
-The system supports multiple orthogonal parallelism strategies:
-
-**Sequence Parallelism (SP)**
-
-- **Purpose**: Split sequence dimension across GPUs
-
-- **Attention-level SP**: Ring Attention and Ulysses (USP) implement SP at the attention layer level
-
-    - See [Parallel Attention](#52-parallel-attention) for details
-
-    - Configuration: `ulysses_degree` × `ring_degree` = `sequence_parallel_size`
-
-- **Use Case**: Very long sequences (e.g., high-resolution images)
-
-**Data Parallelism (DP)**
-
-- **Purpose**: Replicate model across GPUs, split batch
-
-- **Use Case**: Batch processing, throughput optimization
-
-**Tensor Parallelism (TP)** (Experimental)
-
-- **Purpose**: Split model weights across GPUs
-
-- **Implementation**: Uses vLLM's tensor parallel groups
-
-- **Use Case**: Large models that don't fit on single GPU
-
-**CFG Parallelism** (under development)
-
-- **Purpose**: Parallelize Classifier-Free Guidance (positive/negative prompts)
-
-- **Infrastructure**: CFG parallel groups are initialized and available via `get_cfg_group()`
-
-#### Parallel Group Management
-
-```python
-def initialize_model_parallel(
-    data_parallel_size: int = 1,
-    cfg_parallel_size: int = 1,
-    sequence_parallel_size: int | None = None,
-    ulysses_degree: int = 1,
-    ring_degree: int = 1,
-    tensor_parallel_size: int = 1,
-    pipeline_parallel_size: int = 1,
-    vae_parallel_size: int = 0,
-):
-    # Generate orthogonal parallel groups
-    rank_generator = RankGenerator(
-        tensor_parallel_size,
-        sequence_parallel_size,
-        pipeline_parallel_size,
-        cfg_parallel_size,
-        data_parallel_size,
-        "tp-sp-pp-cfg-dp",
-    )
-
-    # Initialize each parallel group
-    _DP = init_model_parallel_group(rank_generator.get_ranks("dp"), ...)
-    _CFG = init_model_parallel_group(rank_generator.get_ranks("cfg"), ...)
-    _SP = init_model_parallel_group(rank_generator.get_ranks("sp"), ...)
-    _PP = init_model_parallel_group(rank_generator.get_ranks("pp"), ...)
-    _TP = init_model_parallel_group(rank_generator.get_ranks("tp"), ...)
-```
-
-**Rank Order**: `tp-sp-pp-cfg-dp` (tensor → sequence → pipeline → cfg → data)
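-
-To illustrate what a rank order string implies, the sketch below decomposes a global rank into one coordinate per parallel dimension. It assumes that the leftmost dimension in the order string varies fastest, as in Megatron-style rank generators; this convention is an assumption and is not confirmed by this document. `decompose_rank` is a hypothetical helper, not part of `RankGenerator`.
-
-```python
-def decompose_rank(rank: int, sizes: dict, order: str = "tp-sp-pp-cfg-dp") -> dict:
-    """Map a global rank to per-dimension coordinates.
-
-    Assumption: the leftmost dimension in `order` varies fastest.
-    """
-    coords = {}
-    for name in order.split("-"):
-        coords[name] = rank % sizes[name]
-        rank //= sizes[name]
-    return coords
-
-
-def sequence_parallel_size(ulysses_degree: int, ring_degree: int) -> int:
-    """SP size is the product of the two attention-level degrees."""
-    return ulysses_degree * ring_degree
-
-
-# Example layout: 2 (tp) x 4 (sp) x 1 (pp) x 2 (cfg) x 1 (dp) = 16 ranks.
-sizes = {"tp": 2, "sp": sequence_parallel_size(2, 2), "pp": 1, "cfg": 2, "dp": 1}
-coords = decompose_rank(9, sizes)
-```
-
-Under this convention, ranks that differ only in their `tp` coordinate are adjacent, which usually places the most communication-heavy group on the fastest interconnect.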
-
-**Note**: For attention-level Sequence Parallelism implementations (Ring Attention and Ulysses), see [Parallel Attention](#52-parallel-attention). This section covers higher-level parallelism strategies.
-
-
----
-
-## 6. Data Flow
-
-### Complete Request Flow
-
-Figure: End-to-end Data Flow in the vLLM-Omni Diffusion Module
-
-```
-1. User Request
-   └─> OmniDiffusion.generate(prompt)
-       └─> Prepare OmniDiffusionRequest
-           └─> DiffusionEngine.step(requests)
-
-2. Pre-processing
-   └─> pre_process_func(requests)
-       └─> Model-specific transformations
-
-3. Scheduling
-   └─> scheduler.add_request(request)
-       └─> scheduler.schedule()
-           └─> DiffusionEngine submits scheduled request to executor.add_req(req)
-
-4. Worker Execution
-   └─> WorkerProc.worker_busy_loop()
-       └─> GPUWorker.execute_model(reqs)
-           └─> Pipeline.forward(req)
-               ├─> encode_prompt()
-               ├─> prepare_latents()
-               ├─> diffuse() [loop]
-               │   ├─> transformer.forward() [with cache backend hooks]
-               │   └─> scheduler.step()
-               └─> vae.decode()
-
-5. Result Collection
-   └─> Executor returns DiffusionOutput
-       └─> scheduler.update_from_output(...)
-           └─> DiffusionEngine pops finished request state
-
-6. Post-processing
-   └─> post_process_func(output)
-       └─> Convert to PIL images / final format
-```
-
----
diff --git a/docs/design/module/entrypoint_module.md b/docs/design/module/entrypoint_module.md
deleted file mode 100644
index 7a26fbb7f05..00000000000
--- a/docs/design/module/entrypoint_module.md
+++ /dev/null
@@ -1 +0,0 @@
-Architecture design of the entrypoint (update soon)
diff --git a/docs/design/qwen3_omni_tts_performance_optimization.md b/docs/design/qwen3_omni_tts_performance_optimization.md
deleted file mode 100644
index 2f18a1b1bc0..00000000000
--- a/docs/design/qwen3_omni_tts_performance_optimization.md
+++ /dev/null
@@ -1,539 +0,0 @@
-# Speech Generation on vLLM-Omni: Performance Optimizations for Qwen3-Omni and Qwen3-TTS
-
-## Summary
-
-vLLM-Omni supports end-to-end serving for speech-generating models, including both **Qwen3-Omni** (multimodal understanding + speech) and **Qwen3-TTS** (text-to-speech). Despite their different architectures, both models share the same multi-stage pipeline design and benefit from the same set of stacked optimizations:
-
-1. **Batching** improves GPU utilization stage by stage and increases overall throughput.
-2. **CUDA Graph** reduces CPU launch overhead and decode-time jitter on stable shapes.
-3. **Async Chunk and Streaming Output** overlap compute and communication across stages and emit audio incrementally, improving both TTFP and E2E.
-
-### Model architectures
-
-**Qwen3-Omni** is a native multimodal model that understands text, audio, image, and video inputs, and generates both text and speech outputs. Its pipeline has three stages:
-
-- **Thinker**: multimodal understanding and text generation
-- **Talker (+ Talker-MTP / code predictor path)**: converts semantic/text representations into codec tokens
-- **Code2Wav**: decodes codec tokens into waveform audio
-
-**Qwen3-TTS** is a lightweight, high-quality text-to-speech model. Its pipeline has two stages:
-
-- **Talker (AR decoder)**: auto-regressively generates codec tokens from text input
-- **Code2Wav (vocoder)**: decodes codec tokens into waveform audio
-
-The optimizations described in this post apply to both models. We present results for each side by side.
-
-### vLLM-Omni vs HF Transformers
-
-Compared with **HF Transformers** (offline, single request), vLLM-Omni with the full optimization stack delivers dramatically lower latency and higher efficiency for both models.
-
-**Qwen3-Omni** (A100):
-
-
-
-| Metric | vLLM-Omni | HF Transformers | Improvement |
-| --- | --- | --- | --- |
-| E2E latency (ms) | 941 | 15,513 | ~94% reduction |
-| TTFP (ms) | 64 | 15,513 | ~99.6% reduction (242× faster) |
-| RTF | 0.16 | 2.64 | ~94% reduction (~16.5× faster) |
-
-- **E2E latency**: 941 ms vs 15,513 ms - **~94%** reduction
-- **TTFP**: 64 ms vs 15,513 ms - **~99.6%** reduction (242x faster)
-- **RTF**: 0.16 vs 2.64 - **~94%** reduction (~16.5x faster)
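-
-The percentages and speedup factors quoted above follow from two simple ratios. The helpers below are a small sketch for reproducing them; `reduction_pct` and `speedup` are hypothetical names introduced for illustration.
-
-```python
-def reduction_pct(baseline: float, optimized: float) -> float:
-    """Relative reduction of a lower-is-better metric, in percent."""
-    return 100.0 * (1.0 - optimized / baseline)
-
-
-def speedup(baseline: float, optimized: float) -> float:
-    """How many times smaller the optimized value is than the baseline."""
-    return baseline / optimized
-
-
-# Values taken from the comparison above.
-e2e_reduction = reduction_pct(15_513, 941)   # E2E latency, ms
-ttfp_speedup = speedup(15_513, 64)           # TTFP, ms
-rtf_speedup = speedup(2.64, 0.16)            # RTF
-```
-
-A reduction of about 94% and a speedup of about 16.5× describe the same RTF change; the two figures are related by `reduction = 1 - 1 / speedup`.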
-
-### Stacked optimization summary
-
-Each optimization stacks on the previous one. The summary plots below show the cumulative effect at each step, with one line per concurrency level (1, 4, 10).
-
-**Qwen3-Omni** (A100):
-
-
-
-
-
-
-
-- **E2EL reduction**: ~74% at concurrency 10 (410,054 ms -> 104,901 ms); ~90% at concurrency 1 (426,529 ms -> 41,216 ms)
-- **TTFP reduction**: ~96% at concurrency 10 (409,705 ms -> 16,482 ms); ~99.7% at concurrency 1 (426,078 ms -> 1,164 ms)
-- **RTF reduction**: ~74% at concurrency 10 (2.83 -> 0.74); ~90% at concurrency 1 (2.08 -> 0.21)
-
-**Qwen3-TTS** (H200):
-
-
-
-
-
-
-
-- **E2EL reduction**: ~85% at concurrency 10 (12,141 ms -> 1,767 ms); ~29% at concurrency 1 (1,323 ms -> 941 ms)
-- **TTFP reduction**: ~96.5% at concurrency 10 (12,141 ms -> 425 ms); ~95% at concurrency 1 (1,323 ms -> 64 ms)
-- **RTF reduction**: ~86% at concurrency 10 (2.19 -> 0.31); ~30% at concurrency 1 (0.23 -> 0.16)
-
-**Benchmark environment:**
-
-| | Qwen3-Omni | Qwen3-TTS |
-| --- |-----------------------------| --- |
-| **GPU** | A100 | H200 |
-| **Model** | Qwen3-Omni-30B-A3B-Instruct | Qwen3-TTS-12Hz-1.7B-CustomVoice |
-| **vLLM** | v0.17.0 | v0.18.0 |
-| **vllm-omni** | commit 199f7832 | v0.18.0rc2 |
-| **CUDA** | 12.9 | 12.8 |
-
-This post walks through each optimization in the order in which it is typically enabled in practice, then ends with deployment playbooks for both models.
-
----
-
-## Pipeline Batching
-
-### How stage-wise batching works
-
-For both Qwen3-Omni and Qwen3-TTS, batching is a pipeline-level optimization:
-
-- Requests are grouped per stage using `runtime.max_batch_size`
-- Each stage executes batch inference with its own scheduler/worker
-- Stage outputs are routed to downstream stages with per-request mapping preserved
-
-**Batching strategy by stage:** The understanding and decode stages (Thinker for Omni, Talker for both) use **continuous batching**: requests can join and leave the batch over time. Code2Wav uses **static batching**: once a batch is formed, the stage runs the whole batch before starting the next one. This matches the decode pattern of Code2Wav and keeps the implementation simple while still improving throughput.
-
-### Batching results (Baseline vs. Batch)
-
-Batching alone greatly reduces E2EL and RTF across all concurrencies. The biggest gains appear at high concurrency where requests share GPU resources.
-
-**Qwen3-Omni** (A100):
-
-
-
-| Metric | Concurrency | Batch | + CUDA Graph | Improvement |
-| --- | --- | --- | --- | --- |
-| E2EL (ms) | 1 | 1,339 | 733 | 1.8× |
-| E2EL (ms) | 4 | 1,471 | 987 | 1.5× |
-| E2EL (ms) | 10 | 1,705 | 1,197 | 1.4× |
-| RTF | 1 | 0.234 | 0.124 | 1.9× |
-| RTF | 10 | 0.292 | 0.203 | 1.4× |
-| Throughput (audio-s/wall-s) | 10 | 33.53 | 47.15 | 1.4× |
-
-At concurrency 1, CUDA Graph reduces E2EL from 1,339 ms to 733 ms and RTF from 0.234 to 0.124 - nearly a 2x improvement. The benefit is consistent across all concurrency levels.
-
----
-
-## Async Chunk and Streaming Output: Earlier Audio and Cross-Stage Overlap
-
-### Why this step matters for first-packet latency
-
-Two mechanisms work together to improve user-visible latency:
-
-- **Streaming output** emits audio chunks as soon as they are decoded, which lowers **TTFP**. Without streaming, the client waits for larger buffers or for end-of-sequence.
-- **Async chunk** is the main enabler for *earlier* audio: instead of handing off whole-request results between stages, each stage forwards **chunks** so the next stage can start as soon as the first chunk is ready. For Omni: Thinker -> Talker forwards hidden-state chunks; for both: Talker -> Code2Wav forwards codec chunks; Code2Wav decodes and emits packets incrementally. This **overlaps compute and communication** across stages and directly reduces time-to-first-audio-packet (TTFP) and end-to-end latency (E2EL).
-
-So in practice: streaming output defines *how* bytes are sent to the client; async chunk defines *when* the pipeline can produce the first bytes.
-
-**Dependency between the two:** Async chunk and audio streaming output are mutually dependent. Without async chunk, **audio streaming output cannot truly take effect**. Without audio streaming output, async chunk's **TTFP advantage is not fully realized**: the client would still wait for larger buffers or end-of-sequence instead of hearing the first packet as soon as it is ready. We therefore recommend enabling **both** on top of batching + CUDA Graph; the benchmarks in this post use both.
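-
-A toy timing model helps show why chunked hand-off lowers both TTFP and E2EL. This is a sketch under simplifying assumptions, not a model of the actual scheduler: every stage's work is assumed to divide evenly into chunks, transfer cost is ignored, and `first_packet_time` and `end_to_end_time` are hypothetical helpers working in arbitrary time units.
-
-```python
-def first_packet_time(stage_times: list, num_chunks: int = 1) -> float:
-    """Time until the last stage emits its first output.
-
-    stage_times: time each stage needs for a whole request.
-    Assumes evenly divisible work and no transfer overhead.
-    """
-    return sum(t / num_chunks for t in stage_times)
-
-
-def end_to_end_time(stage_times: list, num_chunks: int = 1) -> float:
-    """Pipelined completion time: one chunk fills the pipeline, then the
-    slowest stage paces the remaining chunks."""
-    per_chunk = [t / num_chunks for t in stage_times]
-    return sum(per_chunk) + (num_chunks - 1) * max(per_chunk)
-
-
-stages = [8.0, 4.0, 2.0]                        # e.g. three pipeline stages
-whole_request = first_packet_time(stages)       # no chunking
-chunked_first = first_packet_time(stages, 4)    # four chunks per request
-chunked_total = end_to_end_time(stages, 4)
-```
-
-With `num_chunks=1` both functions return the sum of the stage times, which corresponds to handing off whole-request results. Increasing the chunk count shrinks the first-packet time roughly in proportion, while the end-to-end time approaches the time of the slowest stage.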
-
-### Results: Batch + CUDA Graph vs. Batch + CUDA Graph + Async Chunk + Streaming Output
-
-**Qwen3-Omni** (A100):
-
-