diff --git a/README.md b/README.md
index 66c8725ee5..20abbb6082 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ Traditional vLLM systems are limited to text-based, autoregressive generation. v
 - **Multi-modal Models**: Text, image, video, audio, and sensor data processing
 - **Non-autoregressive Architectures**: Diffusion Transformers (DiT) and other parallel generation models
-- **Heterogeneous Outputs**: Beyond traditional text generation to structured, binary, and streaming outputs
+- **Heterogeneous Outputs**: Beyond traditional text generation to multimodal outputs
 
 ## 🏗️ Architecture
 
@@ -28,119 +28,48 @@ vLLM-omni is built on a modular architecture that extends vLLM's core functional
 - **Text**: Advanced tokenization and embedding generation
 - **Image**: Vision encoder integration (CLIP, etc.)
 - **Audio**: Speech processing and audio embedding
-- **Video**: Frame-by-frame and temporal processing
-- **Sensor**: IoT and sensor data interpretation
-
-### Output Formats
-
-- **Structured Data**: JSON, XML, and custom formats
-- **Binary Outputs**: Images, audio, and video generation
-- **Streaming**: Real-time progressive generation
-- **Multipart**: Combined multi-modal responses
 
 ## 📋 Supported Models
 
 ### AR + Diffusion Transformer (DiT) Models
-- Qwen-Image (Image generation and editing)
 - Qwen-omni (Thinker-Talker-Codec structure)
-- Custom DiT and hiybrid architectures
+- HunyuanImage 3.0 (Ongoing)
+- Qwen-Image (Ongoing)
 
 ## 🛠️ Installation
 
-### Quick Start
-
-#### Option 1: Docker (Recommended for macOS)
-
-```bash
-# Clone the repository
-git clone https://github.com/hsliuustc0106/vllm-omni.git
-cd vllm-omni
-
-# Run the automated Docker setup
-./scripts/docker-setup-macos.sh
-```
-
-#### Option 2: Local Installation
-
-```bash
-# Clone the repository
-git clone https://github.com/hsliuustc0106/vllm-omni.git
-cd vllm-omni
-
-# Run the installation script
-./install.sh
-```
-
-### Prerequisites
-
-- Python 3.11+ (recommended)
-- Conda or Miniconda
-- Git
-- CUDA 11.8+ (for GPU acceleration) or CPU-only installation
-
-### Installation Methods
-
-#### Method 1: Automated Installation (Recommended)
+Set up the basic environment:
 ```bash
-# Using shell script
-./install.sh
-
-# Or using Python script
-python install.py
+uv venv --python 3.12 --seed
+source .venv/bin/activate
 ```
+Install vLLM pinned to commit 808a7b69df479b6b3a16181711cac7ca28a9b941:
 
-#### Method 2: Manual Installation
 ```bash
-# Create conda environment
-conda create -n vllm_omni python=3.11 -y
-conda activate vllm_omni
-
-# Install PyTorch (CPU or GPU)
-pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cpu # CPU
-# pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cu121 # GPU
-
-# Install dependencies
-pip install -r requirements.txt
-pip install "vllm>=0.10.2"
-
-# Install vLLM-omni
-pip install -e .
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
+VLLM_USE_PRECOMPILED=1 uv pip install --editable .
 ```
-### Verify Installation
+## Run examples (Qwen2.5-omni)
+Enter the example folder:
 ```bash
-# Test the installation
-python test_installation.py
-
-# Test basic functionality
-python -c "import vllm_omni; print('Ready!')"
-
-# Test CLI
-vllm --help
+cd vllm_omni
+cd examples/offline_inference/qwen2_5_omni
 ```
-
-For detailed installation instructions, see [INSTALL.md](INSTALL.md).
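+
+Optionally, verify that the pinned vLLM build imports cleanly before running the example (a quick sanity check; it only assumes vLLM is installed in the active environment):
+```bash
+python -c "import vllm; print(vllm.__version__)"
+```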
-
-## 📥 Model Download
-
-Models are automatically downloaded when first used, or you can pre-download them:
-
+Modify PYTHONPATH in run.sh to point to your local vllm-omni checkout, then run:
 ```bash
-# Check downloaded models
-python scripts/download_models.py --check-cache
-
-# Download all default models
-python scripts/download_models.py --all
-
-# Download specific models
-python scripts/download_models.py --ar-models Qwen/Qwen3-0.6B
-python scripts/download_models.py --dit-models stabilityai/stable-diffusion-2-1
+bash run.sh
 ```
+The output audio is saved in `./output_audio`.
 
-**Model Storage Location:**
-- Default: `~/.cache/huggingface/hub/`
-- AR models: 100MB - 1GB each
-- DiT models: 2GB - 7GB each
+## To-do list
+- [x] Offline inference example for Qwen2.5-omni with single request
+- [ ] Adaptation from the current vllm branch to stable vllm v0.11.0
+- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
+- [ ] Online inference support
+- [ ] Support for other models
 
-For detailed model management, see [MODEL_DOWNLOAD_GUIDE.md](docs/MODEL_DOWNLOAD_GUIDE.md).
+For the detailed architecture and design, see [vllm_omni_design.md](docs/architecture/vllm_omni_design.md) and [high_level_arch_design.md](docs/architecture/high_level_arch_design.md).
\ No newline at end of file
diff --git a/examples/offline_inference/qwen_2_5_omni/README.md b/examples/offline_inference/qwen_2_5_omni/README.md
new file mode 100644
index 0000000000..d1dfde2059
--- /dev/null
+++ b/examples/offline_inference/qwen_2_5_omni/README.md
@@ -0,0 +1,37 @@
+# Offline Example of vLLM-omni for Qwen2.5-omni
+
+## Installation
+
+Set up the basic environment:
+```bash
+uv venv --python 3.12 --seed
+source .venv/bin/activate
+```
+Install vLLM pinned to commit 808a7b69df479b6b3a16181711cac7ca28a9b941:
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
+VLLM_USE_PRECOMPILED=1 uv pip install --editable .
+```
+
+## Run examples
+
+Enter the example folder:
+```bash
+cd vllm_omni
+cd examples/offline_inference/qwen2_5_omni
+```
+Modify PYTHONPATH in run.sh to point to your local vllm-omni checkout (see below).
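+The relevant line in run.sh looks like this; replace the placeholder path with your own checkout location:
+```bash
+export PYTHONPATH=/path/to/vllm-omni:$PYTHONPATH
+```
+Then run: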
+```bash +bash run.sh +``` +The output audio is saved in ./output_audio + +## To-do list +- [x] Offline inference example for Qwen2.5-omni with single request +- [ ] Adaptation from current vllm branch to stable vllm v0.11.0 +- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests +- [ ] Online inference support +- [ ] Support for other models \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/end2end.py b/examples/offline_inference/qwen_2_5_omni/end2end.py new file mode 100644 index 0000000000..030db0b717 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/end2end.py @@ -0,0 +1,130 @@ +import argparse +import os +import soundfile as sf +import random +import numpy as np +import torch + +from vllm.sampling_params import SamplingParams + +import os as _os_env_toggle +_os_env_toggle.environ["VLLM_USE_V1"] = "1" + +from vllm_omni.entrypoints.omni_llm import OmniLLM +from utils import make_omni_prompt + + +SEED = 42 +# Set all random seeds +random.seed(SEED) +np.random.seed(SEED) +torch.manual_seed(SEED) +torch.cuda.manual_seed(SEED) +torch.cuda.manual_seed_all(SEED) + +# Make PyTorch deterministic +torch.backends.cudnn.deterministic = True +torch.backends.cudnn.benchmark = False + +# Set environment variables for deterministic behavior +os.environ["PYTHONHASHSEED"] = str(SEED) +os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8" + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument('--model', required=True, help='Path to merged model directory (will be created if downloading).') + parser.add_argument('--thinker-model', type=str, default=None) + parser.add_argument('--talker-model', type=str, default=None) + parser.add_argument('--code2wav-model', type=str, default=None) + parser.add_argument('--hf-hub-id', default='Qwen/Qwen2.5-Omni-7B', help='Hugging Face repo id to download if needed.') + parser.add_argument('--hf-revision', default=None, help='Optional HF revision (branch/tag/commit).') + parser.add_argument('--prompts', required=True, nargs='+', help='Input text prompts.') + parser.add_argument('--voice-type', default='default', help='Voice type, e.g., m02, f030, default.') + parser.add_argument('--code2wav-dir', default=None, help='Path to code2wav folder (contains spk_dict.pt).') + parser.add_argument('--dit-ckpt', default=None, help='Path to DiT checkpoint file (e.g., dit.pt).') + parser.add_argument('--bigvgan-ckpt', default=None, help='Path to BigVGAN checkpoint file.') + parser.add_argument('--dtype', default='bfloat16', choices=['float16', 'bfloat16', 'float32']) + parser.add_argument('--max-model-len', type=int, default=32768) + + parser.add_argument("--thinker-only", action="store_true") + parser.add_argument("--text-only", action="store_true") + parser.add_argument("--do-wave", action="store_true") + parser.add_argument('--prompt_type', + choices=[ + 'text', 'audio', 'audio-long', 'audio-long-chunks', + 'audio-long-expand-chunks', 'image', 'video', + 'video-frames', 'audio-in-video', 'audio-in-video-v2', + "audio-multi-round", "badcase-vl", "badcase-text", + "badcase-image-early-stop", "badcase-two-audios", + "badcase-two-videos", "badcase-multi-round", + "badcase-voice-type", "badcase-voice-type-v2", + "badcase-audio-tower-1", "badcase-audio-only" + ], + default='text') + parser.add_argument('--use-torchvision', action='store_true') + parser.add_argument('--tokenize', action='store_true') + parser.add_argument('--output-wav', default="output.wav", help='Output wav file path.') + 
parser.add_argument('--thinker-hidden-states-dir', default="thinker_hidden_states", help='Path to thinker hidden states directory.') + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + model_name = args.model + omni_llm = OmniLLM(model=model_name) + thinker_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + ) + talker_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + stop_token_ids=[8294] + ) + code2wav_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + ) + + sampling_params_list = [thinker_sampling_params, + talker_sampling_params, + code2wav_sampling_params] + + prompt = [make_omni_prompt(args, prompt) for prompt in args.prompts] + omni_outputs = omni_llm.generate(prompt, sampling_params_list) + + os.makedirs(args.output_wav, exist_ok=True) + for stage_outputs in omni_outputs: + if stage_outputs.final_output_type == "text": + for output in stage_outputs.request_output: + request_id = output.request_id + text_output = output.outputs[0].text + print(f"Request ID: {request_id}, Text Output: {text_output}") + elif stage_outputs.final_output_type == "audio": + for output in stage_outputs.request_output: + request_id = output.request_id + audio_tensor = output.multimodal_output["audio"] + output_wav = os.path.join(args.output_wav, f"output_{output.request_id}.wav") + sf.write(output_wav, audio_tensor.detach().cpu().numpy(), samplerate=24000) + print(f"Request ID: {request_id}, Saved audio to {output_wav}") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav b/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav new file mode 100644 index 0000000000..45caf1d0e2 Binary files /dev/null and b/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav differ diff --git a/examples/offline_inference/qwen_2_5_omni/processing_omni.py b/examples/offline_inference/qwen_2_5_omni/processing_omni.py new file mode 100644 index 0000000000..bd6f4ae957 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/processing_omni.py @@ -0,0 +1,373 @@ +from __future__ import annotations + +import base64 +import logging +import math +import os +import time +import warnings +from functools import lru_cache +from io import BytesIO +import requests +import torch +import torchvision +from packaging import version +from PIL import Image +from torchvision import io, transforms +from torchvision.transforms import InterpolationMode + + +logger = logging.getLogger(__name__) + +IMAGE_FACTOR = 28 +MIN_PIXELS = 4 * 28 * 28 +MAX_PIXELS = 16384 * 28 * 28 +MAX_RATIO = 200 + +VIDEO_MIN_PIXELS = 128 * 28 * 28 +VIDEO_MAX_PIXELS = 768 * 28 * 28 +VIDEO_TOTAL_PIXELS = 24576 * 28 * 28 +FRAME_FACTOR = 2 +FPS = 2.0 +FPS_MIN_FRAMES = 4 +FPS_MAX_FRAMES = 768 + +temporal_patch_size = 2 +spatial_patch_size = 14 +spatial_merge_size = 2 + + +def 
round_by_factor(number: int, factor: int) -> int: + """Returns the closest integer to 'number' that is divisible by 'factor'.""" + return round(number / factor) * factor + + +def ceil_by_factor(number: int, factor: int) -> int: + """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'.""" + return math.ceil(number / factor) * factor + + +def floor_by_factor(number: int, factor: int) -> int: + """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'.""" + return math.floor(number / factor) * factor + + +def smart_resize(height: int, + width: int, + factor: int = IMAGE_FACTOR, + min_pixels: int = MIN_PIXELS, + max_pixels: int = MAX_PIXELS) -> tuple[int, int]: + """ + Rescales the image so that the following conditions are met: + + 1. Both dimensions (height and width) are divisible by 'factor'. + + 2. The total number of pixels is within the range ['min_pixels', 'max_pixels']. + + 3. The aspect ratio of the image is maintained as closely as possible. + """ + if max(height, width) / min(height, width) > MAX_RATIO: + raise ValueError( + f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}" + ) + h_bar = max(factor, round_by_factor(height, factor)) + w_bar = max(factor, round_by_factor(width, factor)) + if h_bar * w_bar > max_pixels: + beta = math.sqrt((height * width) / max_pixels) + h_bar = floor_by_factor(height / beta, factor) + w_bar = floor_by_factor(width / beta, factor) + elif h_bar * w_bar < min_pixels: + beta = math.sqrt(min_pixels / (height * width)) + h_bar = ceil_by_factor(height * beta, factor) + w_bar = ceil_by_factor(width * beta, factor) + return h_bar, w_bar + + +def fetch_image(ele: dict[str, str | Image.Image], + size_factor: int = IMAGE_FACTOR) -> Image.Image: + if "image" in ele: + image = ele["image"] + else: + image = ele["image_url"] + image_obj = None + if isinstance(image, Image.Image): + image_obj = image + elif image.startswith("http://") or image.startswith("https://"): + image_obj = Image.open(requests.get(image, stream=True).raw) + elif image.startswith("file://"): + image_obj = Image.open(image[7:]) + elif image.startswith("data:image"): + if "base64," in image: + _, base64_data = image.split("base64,", 1) + data = base64.b64decode(base64_data) + image_obj = Image.open(BytesIO(data)) + else: + image_obj = Image.open(image) + if image_obj is None: + raise ValueError( + f"Unrecognized image input, support local path, http url, base64 and PIL.Image, got {image}" + ) + image = image_obj.convert("RGB") + ## resize + if "resized_height" in ele and "resized_width" in ele: + resized_height, resized_width = smart_resize( + ele["resized_height"], + ele["resized_width"], + factor=size_factor, + ) + else: + width, height = image.size + min_pixels = ele.get("min_pixels", MIN_PIXELS) + max_pixels = ele.get("max_pixels", MAX_PIXELS) + resized_height, resized_width = smart_resize( + height, + width, + factor=size_factor, + min_pixels=min_pixels, + max_pixels=max_pixels, + ) + image = image.resize((resized_width, resized_height)) + + return image + + +def smart_nframes( + ele: dict, + total_frames: int, + video_fps: int | float, +) -> int: + """calculate the number of frames for video used for model inputs. + + Args: + ele (dict): a dict contains the configuration of video. + support either `fps` or `nframes`: + - nframes: the number of frames to extract for model inputs. + - fps: the fps to extract frames for model inputs. 
+ - min_frames: the minimum number of frames of the video, only used when fps is provided. + - max_frames: the maximum number of frames of the video, only used when fps is provided. + total_frames (int): the original total number of frames of the video. + video_fps (int | float): the original fps of the video. + + Raises: + ValueError: nframes should in interval [FRAME_FACTOR, total_frames]. + + Returns: + int: the number of frames for video used for model inputs. + """ + assert not ("fps" in ele + and "nframes" in ele), "Only accept either `fps` or `nframes`" + if "nframes" in ele: + nframes = round_by_factor(ele["nframes"], FRAME_FACTOR) + else: + fps = ele.get("fps", FPS) + min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), + FRAME_FACTOR) + max_frames = floor_by_factor( + ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), + FRAME_FACTOR) + nframes = total_frames / video_fps * fps + nframes = min(max(nframes, min_frames), max_frames) + nframes = round_by_factor(nframes, FRAME_FACTOR) + if not (FRAME_FACTOR <= nframes and nframes <= total_frames): + raise ValueError( + f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}." + ) + return nframes + + +def _read_video_torchvision(ele: dict, ) -> torch.Tensor: + """read video using torchvision.io.read_video + + Args: + ele (dict): a dict contains the configuration of video. + support keys: + - video: the path of video. support "file://", "http://", "https://" and local path. + - video_start: the start time of video. + - video_end: the end time of video. + Returns: + torch.Tensor: the video tensor with shape (T, C, H, W). + """ + video_path = ele["video"] + if version.parse(torchvision.__version__) < version.parse("0.19.0"): + if "http://" in video_path or "https://" in video_path: + warnings.warn( + "torchvision < 0.19.0 does not support http/https video path, please upgrade to 0.19.0." + ) + if "file://" in video_path: + video_path = video_path[7:] + st = time.time() + video, audio, info = io.read_video( + video_path, + start_pts=ele.get("video_start", 0.0), + end_pts=ele.get("video_end", None), + pts_unit="sec", + output_format="TCHW", + ) + total_frames, video_fps = video.size(0), info["video_fps"] + total_duration = round(total_frames / video_fps, 3) + logger.info( + f"torchvision: {video_path=}, {total_frames=}, {video_fps=}, duration={total_duration}s, time={time.time() - st:.3f}s" + ) + nframes = smart_nframes(ele, + total_frames=total_frames, + video_fps=video_fps) + idx = torch.linspace(0, total_frames - 1, nframes).round().long() + video = video[idx] + return video, total_duration, nframes + + +def is_decord_available() -> bool: + import importlib.util + + return importlib.util.find_spec("decord") is not None + + +def _read_video_decord(ele: dict, ) -> torch.Tensor: + """read video using decord.VideoReader + + Args: + ele (dict): a dict contains the configuration of video. + support keys: + - video: the path of video. support "file://", "http://", "https://" and local path. + - video_start: the start time of video. + - video_end: the end time of video. + Returns: + torch.Tensor: the video tensor with shape (T, C, H, W). 
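+        Note: together with the tensor, this helper also returns the total
+        duration in seconds and the number of sampled frames, i.e.
+        (video, total_duration, nframes).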
+ """ + import decord + video_path = ele["video"] + st = time.time() + vr = decord.VideoReader(video_path) + # TODO: support start_pts and end_pts + if 'video_start' in ele or 'video_end' in ele: + raise NotImplementedError( + "not support start_pts and end_pts in decord for now.") + total_frames, video_fps = len(vr), vr.get_avg_fps() + total_duration = round(total_frames / video_fps, 3) + logger.info( + f"decord: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s" + ) + nframes = smart_nframes(ele, + total_frames=total_frames, + video_fps=video_fps) + idx = torch.linspace(0, total_frames - 1, nframes).round().long().tolist() + video = vr.get_batch(idx).asnumpy() + video = torch.tensor(video).permute(0, 3, 1, 2) # Convert to TCHW format + return video, total_duration, nframes + + +VIDEO_READER_BACKENDS = { + "decord": _read_video_decord, + "torchvision": _read_video_torchvision, +} + +FORCE_QWENVL_VIDEO_READER = os.getenv("FORCE_QWENVL_VIDEO_READER", None) + + +@lru_cache(maxsize=1) +def get_video_reader_backend() -> str: + if FORCE_QWENVL_VIDEO_READER is not None: + video_reader_backend = FORCE_QWENVL_VIDEO_READER + elif is_decord_available(): + video_reader_backend = "decord" + else: + video_reader_backend = "torchvision" + # print(f"qwen-vl-utils using {video_reader_backend} to read video.", file=sys.stderr) + return video_reader_backend + + +def fetch_video( + ele: dict, + image_factor: int = IMAGE_FACTOR) -> torch.Tensor | list[Image.Image]: + if isinstance(ele["video"], str): + video_reader_backend = get_video_reader_backend() + video, total_dur, nframes = VIDEO_READER_BACKENDS[ + video_reader_backend](ele) + frame_timestamps = total_dur * torch.arange(1, nframes + 1) / nframes + grid_timestamps = frame_timestamps[::FRAME_FACTOR] + second_per_grid = grid_timestamps[1] - grid_timestamps[0] + nframes, _, height, width = video.shape + factor = spatial_patch_size * spatial_merge_size + min_pixels = ele.get("min_pixels", VIDEO_MIN_PIXELS) + total_pixels = ele.get("total_pixels", VIDEO_TOTAL_PIXELS) + max_pixels = max( + min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), + int(min_pixels * 1.05)) + max_pixels = ele.get("max_pixels", max_pixels) + # min_pixels = (factor ** 2) * 52 + # max_pixels = (factor ** 2) * min(768, (16384 / nframes * temporal_patch_size)) + if "resized_height" in ele and "resized_width" in ele: + resized_height, resized_width = smart_resize( + ele["resized_height"], + ele["resized_width"], + factor=image_factor, + ) + else: + resized_height, resized_width = smart_resize( + height, + width, + factor=image_factor, + min_pixels=min_pixels, + max_pixels=max_pixels, + ) + video = transforms.functional.resize( + video, + [resized_height, resized_width], + interpolation=InterpolationMode.BICUBIC, + antialias=True, + ).float() + return video, total_dur, nframes, second_per_grid + else: + assert isinstance(ele["video"], (list, tuple)) + process_info = ele.copy() + process_info.pop("type", None) + process_info.pop("video", None) + images = [ + fetch_image({ + "image": video_element, + **process_info + }, + size_factor=image_factor) + for video_element in ele["video"] + ] + nframes = ceil_by_factor(len(images), FRAME_FACTOR) + if len(images) < nframes: + images.extend([images[-1]] * (nframes - len(images))) + return images, None, None, None + + +def extract_vision_info( + conversations: list[dict] | list[list[dict]]) -> list[dict]: + vision_infos = [] + if isinstance(conversations[0], dict): + conversations = [conversations] + for conversation 
in conversations: + for message in conversation: + if isinstance(message["content"], list): + for ele in message["content"]: + if ("image" in ele or "image_url" in ele or "video" in ele + or ele["type"] in ("image", "image_url", "video")): + vision_infos.append(ele) + return vision_infos + + +def process_vision_info( + conversations: list[dict] | list[list[dict]], +) -> tuple[list[Image.Image] | None, list[torch.Tensor | list[Image.Image]] + | None]: + vision_infos = extract_vision_info(conversations) + ## Read images or videos + image_inputs = [] + video_inputs = [] + for vision_info in vision_infos: + if "image" in vision_info or "image_url" in vision_info: + image_inputs.append(fetch_image(vision_info)) + elif "video" in vision_info: + video_inputs.append(fetch_video(vision_info)) + else: + raise ValueError("image, image_url or video should in content.") + if len(image_inputs) == 0: + image_inputs = None + if len(video_inputs) == 0: + video_inputs = None + return image_inputs, video_inputs \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/run.sh b/examples/offline_inference/qwen_2_5_omni/run.sh new file mode 100644 index 0000000000..2b5db45cb0 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/run.sh @@ -0,0 +1,9 @@ +export PYTHONPATH=/path/to/vllm-omni:$PYTHONPATH +export HF_ENDPOINT=https://hf-mirror.com +python end2end.py --model Qwen/Qwen2.5-Omni-7B \ + --prompts "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words." \ + --voice-type "m02" \ + --dit-ckpt none \ + --bigvgan-ckpt none \ + --output-wav output_audio \ + --prompt_type text \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/utils.py b/examples/offline_inference/qwen_2_5_omni/utils.py new file mode 100644 index 0000000000..c8cf5392df --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/utils.py @@ -0,0 +1,305 @@ +import tempfile +from urllib.request import urlopen +import librosa +import soundfile as sf +import resampy +from typing import Dict, Optional +import torch +import requests +import torchvision.io + +from typing import Union, List +from vllm.inputs import TextPrompt +from vllm_omni.inputs.data import OmniTokensPrompt +from processing_omni import fetch_image, fetch_video + + +def get_system_prompt(): + + return { + 'role': + 'system', + 'content': [{ + 'type': + 'text', + 'text': + 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.' 
+ }] + } + + +def resample_wav_to_16khz(input_filepath): + data, original_sample_rate = sf.read(input_filepath) + # Only use the first channel + if len(data.shape) > 1: + data = data[:, 0] + # resample to 16kHz + data_resampled = resampy.resample(data, + sr_orig=original_sample_rate, + sr_new=16000) + return data_resampled + + +def fetch_and_read_video(args, video_url: str, fps=2): + + def read_video_with_torchvision(video_file_name: str): + video, audio, info = torchvision.io.read_video( + video_file_name, + start_pts=0.0, + end_pts=None, + pts_unit="sec", + output_format="TCHW", + ) + + total_frames, video_fps = video.size(0), info["video_fps"] + total_duration = round(total_frames / video_fps, 3) + nframes = int(total_frames / video_fps * fps) + + frame_timestamps = total_duration * torch.arange(1, + nframes + 1) / nframes + grid_timestamps = frame_timestamps[::2] + second_per_grid = grid_timestamps[1] - grid_timestamps[0] + + idx = torch.linspace(0, video.size(0) - 1, nframes).round().long() + video = video[idx] + + if args.legacy_omni_video: + return [video, total_duration, nframes, second_per_grid.item()] + else: + return video + + def read_video_with_transformers(video_file_name: Union[str, List[str]]): + video, total_duration, nframes, second_per_grid = fetch_video( + {'video': video_file_name}) + if total_duration is None and nframes is None: + nframes = len(video) + total_duration = 0.5 * nframes + second_per_grid = 1.0 + if args.legacy_omni_video: + return [video, total_duration, nframes, second_per_grid] + else: + return video + + def read_video(video_file_name: str): + if args.use_torchvision: + return read_video_with_torchvision(video_file_name) + else: + return read_video_with_transformers(video_file_name) + + if isinstance(video_url, str) and video_url.startswith("http"): + with tempfile.NamedTemporaryFile(delete=True) as temp_video_file: + resp = requests.get(video_url) + assert resp.status_code == requests.codes.ok, f"Failed to fetch video from {video_url}, status_code:{resp.status_code}, resp:{resp}" + + temp_video_file.write(urlopen(video_url).read()) + temp_video_file_path = temp_video_file.name + video_file_name = temp_video_file_path + return read_video(video_file_name) + else: + video_file_name = video_url + return read_video(video_file_name) + + +def make_inputs_qwen2_omni( + args, + messages: List[Dict[str, Union[str, List[Dict[str, str]]]]], + use_audio_in_video: Optional[bool] = False, + tokenize: bool = False, +) -> Union[OmniTokensPrompt, TextPrompt]: + + from transformers import AutoConfig, AutoProcessor, AutoTokenizer + processor = AutoProcessor.from_pretrained(args.model) + tokenizer = AutoTokenizer.from_pretrained(args.model) + + try: + config = AutoConfig.from_pretrained(args.model) + if 'Qwen2_5OmniModel' in config.architectures: + args.legacy_omni_video = False + else: + args.legacy_omni_video = True + except: + args.legacy_omni_video = True + + audios, images, videos = [], [], [] + for message in messages: + if not isinstance(message['content'], list): + message['content'] = [{ + 'type': 'text', + 'text': message['content'], + }] + index, num_contents = 0, len(message['content']) + while index < num_contents: + ele = message['content'][index] + if 'type' not in ele: + if 'text' in ele: + ele['type'] = 'text' + elif 'audio' in ele: + ele['type'] = 'audio' + elif 'audio_url' in ele: + ele['type'] = 'audio_url' + elif 'image' in ele: + ele['type'] = 'image' + elif 'image_url' in ele: + ele['type'] = 'image_url' + elif 'video' in ele: + ele['type'] = 
'video' + elif 'video_url' in ele: + ele['type'] = 'video_url' + else: + raise ValueError(f'Unknown ele: {ele}') + + if ele['type'] == 'audio' or ele['type'] == 'audio_url': + if 'audio_url' in ele: + audio_key = 'audio_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_audio_file: + temp_audio_file.write(urlopen(ele[audio_key]).read()) + temp_audio_file_path = temp_audio_file.name + audios.append( + resample_wav_to_16khz(temp_audio_file_path)) + ele['audio'] = temp_audio_file_path + elif 'audio' in ele: + audio_key = 'audio' + audios.append(resample_wav_to_16khz(ele[audio_key])) + else: + raise ValueError(f'Unknown ele {ele}') + elif use_audio_in_video and (ele['type'] == 'video' + or ele['type'] == 'video_url'): + # use video as audio as well + if 'video_url' in ele: + audio_key = 'video_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_video_file: + temp_video_file.write(urlopen(ele[audio_key]).read()) + temp_video_file_path = temp_video_file.name + ele[audio_key] = temp_video_file_path + audios.append( + librosa.load(temp_video_file_path, sr=16000)[0]) + videos.append( + fetch_and_read_video(args, temp_video_file_path)) + ele['video'] = temp_video_file_path + elif 'video' in ele: + audio_key = 'video' + audios.append(librosa.load(ele[audio_key], sr=16000)[0]) + videos.append(fetch_and_read_video(args, audio_key)) + else: + raise ValueError("Unknown ele {}".format(ele)) + # insert a audio after the video + message['content'].insert(index + 1, { + "type": "audio", + "audio": ele[audio_key], + }) + # no need to load the added audio again + index += 1 + elif ele['type'] == 'video' or ele['type'] == 'video_url': + if 'video_url' in ele: + video_key = 'video_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_video_file: + temp_video_file.write(urlopen(ele['video_url']).read()) + temp_video_file_path = temp_video_file.name + videos.append(fetch_and_read_video(args, temp_video_file)) + ele['video'] = temp_video_file_path + else: + video_key = 'video' + videos.append(fetch_and_read_video(args, ele[video_key])) + elif ele['type'] == 'image' or ele['type'] == 'image_url': + images.append(fetch_image(ele)) + + # move to the next content + index += 1 + + prompt = processor.apply_chat_template( + messages, + tokenize=tokenize, + add_generation_prompt=True, + add_vision_id=True, + ) + + audios = audios if len(audios) > 0 else None + images = images if len(images) > 0 else None + videos = videos if len(videos) > 0 else None + + multi_modal_data = {} + if audios: + multi_modal_data["audio"] = audios + if images: + multi_modal_data["image"] = images + if videos: + multi_modal_data["video"] = videos + + if isinstance(prompt, list) and isinstance(prompt[0], (list, str)): + prompt = prompt[0] + + if tokenize: + return OmniTokensPrompt( + prompt_token_ids=prompt, + multi_modal_data=multi_modal_data, + ) + else: + return TextPrompt( + prompt=prompt, + multi_modal_data=multi_modal_data, + ) + + +def make_text_prompt(args, prompt): + messages = [ + get_system_prompt(), + { + "role": "user", + "content": [ + { + "type": "text", + "text": prompt + }, + ] + }, + ] + + prompt = make_inputs_qwen2_omni(args, messages, tokenize=args.tokenize) + return prompt + + +def make_audio_in_video_v2_prompt(args): + messages = [ + { + 'role': + 'system', + 'content': [{ + 'type': + 'text', + 'text': + 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.' 
+ }] + }, + { + "role": + "user", + "content": [ + { + "type": + "video_url", + "video_url": + "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw_small.mp4" + }, + ] + }, + ] + prompt = make_inputs_qwen2_omni( + args, + messages, + use_audio_in_video=True, + tokenize=args.tokenize, + ) + return prompt + + +def make_omni_prompt(args, prompt = None) -> Union[OmniTokensPrompt, List[OmniTokensPrompt]]: + if args.prompt_type == 'text': + prompt = make_text_prompt(args, prompt) + elif args.prompt_type == 'audio-in-video-v2': + prompt = make_audio_in_video_v2_prompt(args) + else: + raise ValueError(f'Unsupported prompt type: {args.prompt_type}') + return prompt \ No newline at end of file