diff --git a/README.md b/README.md
index 66c8725ee5..20abbb6082 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ Traditional vLLM systems are limited to text-based, autoregressive generation. v
 - **Multi-modal Models**: Text, image, video, audio, and sensor data processing
 - **Non-autoregressive Architectures**: Diffusion Transformers (DiT) and other parallel generation models
-- **Heterogeneous Outputs**: Beyond traditional text generation to structured, binary, and streaming outputs
+- **Heterogeneous Outputs**: Beyond traditional text generation to multimodal outputs
 
 ## 🏗️ Architecture
 
@@ -28,119 +28,48 @@ vLLM-omni is built on a modular architecture that extends vLLM's core functional
 - **Text**: Advanced tokenization and embedding generation
 - **Image**: Vision encoder integration (CLIP, etc.)
 - **Audio**: Speech processing and audio embedding
-- **Video**: Frame-by-frame and temporal processing
-- **Sensor**: IoT and sensor data interpretation
-
-### Output Formats
-
-- **Structured Data**: JSON, XML, and custom formats
-- **Binary Outputs**: Images, audio, and video generation
-- **Streaming**: Real-time progressive generation
-- **Multipart**: Combined multi-modal responses
 
 ## 📋 Supported Models
 
 ### AR + Diffusion Transformer (DiT) Models
-- Qwen-Image (Image generation and editing)
 - Qwen-omni (Thinker-Talker-Codec structure)
-- Custom DiT and hiybrid architectures
+- HunyuanImage 3.0 (Ongoing)
+- Qwen-Image (Ongoing)
 
 ## 🛠️ Installation
 
-### Quick Start
-
-#### Option 1: Docker (Recommended for macOS)
-
-```bash
-# Clone the repository
-git clone https://github.com/hsliuustc0106/vllm-omni.git
-cd vllm-omni
-
-# Run the automated Docker setup
-./scripts/docker-setup-macos.sh
-```
-
-#### Option 2: Local Installation
-
-```bash
-# Clone the repository
-git clone https://github.com/hsliuustc0106/vllm-omni.git
-cd vllm-omni
-
-# Run the installation script
-./install.sh
-```
-
-### Prerequisites
-
-- Python 3.11+ (recommended)
-- Conda or Miniconda
-- Git
-- CUDA 11.8+ (for GPU acceleration) or CPU-only installation
-
-### Installation Methods
-
-#### Method 1: Automated Installation (Recommended)
+Set up the basic environment:
 ```bash
-# Using shell script
-./install.sh
-
-# Or using Python script
-python install.py
+uv venv --python 3.12 --seed
+source .venv/bin/activate
 ```
+Install vLLM pinned to commit 808a7b69df479b6b3a16181711cac7ca28a9b941:
 
-#### Method 2: Manual Installation
 ```bash
-# Create conda environment
-conda create -n vllm_omni python=3.11 -y
-conda activate vllm_omni
-
-# Install PyTorch (CPU or GPU)
-pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cpu # CPU
-# pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cu121 # GPU
-
-# Install dependencies
-pip install -r requirements.txt
-pip install "vllm>=0.10.2"
-
-# Install vLLM-omni
-pip install -e .
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
+VLLM_USE_PRECOMPILED=1 uv pip install --editable .
 ```
-### Verify Installation
+## Run examples (Qwen2.5-omni)
+Enter the example folder:
 ```bash
-# Test the installation
-python test_installation.py
-
-# Test basic functionality
-python -c "import vllm_omni; print('Ready!')"
-
-# Test CLI
-vllm --help
+cd vllm_omni
+cd examples/offline_inference/qwen2_5_omni
 ```
-
-For detailed installation instructions, see [INSTALL.md](INSTALL.md).
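+
+Optionally, verify that the pinned vLLM build imports cleanly before running the example (a quick sanity check; it only assumes vLLM is installed in the active environment):
+```bash
+python -c "import vllm; print(vllm.__version__)"
+```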
-
-## 📥 Model Download
-
-Models are automatically downloaded when first used, or you can pre-download them:
-
+Modify PYTHONPATH in run.sh to point to your local vllm-omni checkout, then run:
 ```bash
-# Check downloaded models
-python scripts/download_models.py --check-cache
-
-# Download all default models
-python scripts/download_models.py --all
-
-# Download specific models
-python scripts/download_models.py --ar-models Qwen/Qwen3-0.6B
-python scripts/download_models.py --dit-models stabilityai/stable-diffusion-2-1
+bash run.sh
 ```
+The output audio is saved in `./output_audio`.
 
-**Model Storage Location:**
-- Default: `~/.cache/huggingface/hub/`
-- AR models: 100MB - 1GB each
-- DiT models: 2GB - 7GB each
+## To-do list
+- [x] Offline inference example for Qwen2.5-omni with single request
+- [ ] Adaptation from the current vllm branch to stable vllm v0.11.0
+- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
+- [ ] Online inference support
+- [ ] Support for other models
 
-For detailed model management, see [MODEL_DOWNLOAD_GUIDE.md](docs/MODEL_DOWNLOAD_GUIDE.md).
+For the detailed architecture and design, see [vllm_omni_design.md](docs/architecture/vllm_omni_design.md) and [high_level_arch_design.md](docs/architecture/high_level_arch_design.md).
\ No newline at end of file
diff --git a/examples/offline_inference/qwen_2_5_omni/README.md b/examples/offline_inference/qwen_2_5_omni/README.md
new file mode 100644
index 0000000000..d1dfde2059
--- /dev/null
+++ b/examples/offline_inference/qwen_2_5_omni/README.md
@@ -0,0 +1,37 @@
+# Offline Example of vLLM-omni for Qwen2.5-omni
+
+## Installation
+
+Set up the basic environment:
+```bash
+uv venv --python 3.12 --seed
+source .venv/bin/activate
+```
+Install vLLM pinned to commit 808a7b69df479b6b3a16181711cac7ca28a9b941:
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
+VLLM_USE_PRECOMPILED=1 uv pip install --editable .
+```
+
+## Run examples
+
+Enter the example folder:
+```bash
+cd vllm_omni
+cd examples/offline_inference/qwen2_5_omni
+```
+Modify PYTHONPATH in run.sh to point to your local vllm-omni checkout (see below).
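+The relevant line in run.sh looks like this; replace the placeholder path with your own checkout location:
+```bash
+export PYTHONPATH=/path/to/vllm-omni:$PYTHONPATH
+```
+Then run: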
+```bash +bash run.sh +``` +The output audio is saved in ./output_audio + +## To-do list +- [x] Offline inference example for Qwen2.5-omni with single request +- [ ] Adaptation from current vllm branch to stable vllm v0.11.0 +- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests +- [ ] Online inference support +- [ ] Support for other models \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/end2end.py b/examples/offline_inference/qwen_2_5_omni/end2end.py new file mode 100644 index 0000000000..030db0b717 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/end2end.py @@ -0,0 +1,130 @@ +import argparse +import os +import soundfile as sf +import random +import numpy as np +import torch + +from vllm.sampling_params import SamplingParams + +import os as _os_env_toggle +_os_env_toggle.environ["VLLM_USE_V1"] = "1" + +from vllm_omni.entrypoints.omni_llm import OmniLLM +from utils import make_omni_prompt + + +SEED = 42 +# Set all random seeds +random.seed(SEED) +np.random.seed(SEED) +torch.manual_seed(SEED) +torch.cuda.manual_seed(SEED) +torch.cuda.manual_seed_all(SEED) + +# Make PyTorch deterministic +torch.backends.cudnn.deterministic = True +torch.backends.cudnn.benchmark = False + +# Set environment variables for deterministic behavior +os.environ["PYTHONHASHSEED"] = str(SEED) +os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8" + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument('--model', required=True, help='Path to merged model directory (will be created if downloading).') + parser.add_argument('--thinker-model', type=str, default=None) + parser.add_argument('--talker-model', type=str, default=None) + parser.add_argument('--code2wav-model', type=str, default=None) + parser.add_argument('--hf-hub-id', default='Qwen/Qwen2.5-Omni-7B', help='Hugging Face repo id to download if needed.') + parser.add_argument('--hf-revision', default=None, help='Optional HF revision (branch/tag/commit).') + parser.add_argument('--prompts', required=True, nargs='+', help='Input text prompts.') + parser.add_argument('--voice-type', default='default', help='Voice type, e.g., m02, f030, default.') + parser.add_argument('--code2wav-dir', default=None, help='Path to code2wav folder (contains spk_dict.pt).') + parser.add_argument('--dit-ckpt', default=None, help='Path to DiT checkpoint file (e.g., dit.pt).') + parser.add_argument('--bigvgan-ckpt', default=None, help='Path to BigVGAN checkpoint file.') + parser.add_argument('--dtype', default='bfloat16', choices=['float16', 'bfloat16', 'float32']) + parser.add_argument('--max-model-len', type=int, default=32768) + + parser.add_argument("--thinker-only", action="store_true") + parser.add_argument("--text-only", action="store_true") + parser.add_argument("--do-wave", action="store_true") + parser.add_argument('--prompt_type', + choices=[ + 'text', 'audio', 'audio-long', 'audio-long-chunks', + 'audio-long-expand-chunks', 'image', 'video', + 'video-frames', 'audio-in-video', 'audio-in-video-v2', + "audio-multi-round", "badcase-vl", "badcase-text", + "badcase-image-early-stop", "badcase-two-audios", + "badcase-two-videos", "badcase-multi-round", + "badcase-voice-type", "badcase-voice-type-v2", + "badcase-audio-tower-1", "badcase-audio-only" + ], + default='text') + parser.add_argument('--use-torchvision', action='store_true') + parser.add_argument('--tokenize', action='store_true') + parser.add_argument('--output-wav', default="output.wav", help='Output wav file path.') + 
parser.add_argument('--thinker-hidden-states-dir', default="thinker_hidden_states", help='Path to thinker hidden states directory.') + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + model_name = args.model + omni_llm = OmniLLM(model=model_name) + thinker_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + ) + talker_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + stop_token_ids=[8294] + ) + code2wav_sampling_params = SamplingParams( + temperature=0.0, # Deterministic - no randomness + top_p=1.0, # Disable nucleus sampling + top_k=-1, # Disable top-k sampling + max_tokens=2048, + seed=SEED, # Fixed seed for sampling + detokenize=True, + repetition_penalty=1.1, + ) + + sampling_params_list = [thinker_sampling_params, + talker_sampling_params, + code2wav_sampling_params] + + prompt = [make_omni_prompt(args, prompt) for prompt in args.prompts] + omni_outputs = omni_llm.generate(prompt, sampling_params_list) + + os.makedirs(args.output_wav, exist_ok=True) + for stage_outputs in omni_outputs: + if stage_outputs.final_output_type == "text": + for output in stage_outputs.request_output: + request_id = output.request_id + text_output = output.outputs[0].text + print(f"Request ID: {request_id}, Text Output: {text_output}") + elif stage_outputs.final_output_type == "audio": + for output in stage_outputs.request_output: + request_id = output.request_id + audio_tensor = output.multimodal_output["audio"] + output_wav = os.path.join(args.output_wav, f"output_{output.request_id}.wav") + sf.write(output_wav, audio_tensor.detach().cpu().numpy(), samplerate=24000) + print(f"Request ID: {request_id}, Saved audio to {output_wav}") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav b/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav new file mode 100644 index 0000000000..45caf1d0e2 Binary files /dev/null and b/examples/offline_inference/qwen_2_5_omni/output_audio/output_0.wav differ diff --git a/examples/offline_inference/qwen_2_5_omni/processing_omni.py b/examples/offline_inference/qwen_2_5_omni/processing_omni.py new file mode 100644 index 0000000000..bd6f4ae957 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/processing_omni.py @@ -0,0 +1,373 @@ +from __future__ import annotations + +import base64 +import logging +import math +import os +import time +import warnings +from functools import lru_cache +from io import BytesIO +import requests +import torch +import torchvision +from packaging import version +from PIL import Image +from torchvision import io, transforms +from torchvision.transforms import InterpolationMode + + +logger = logging.getLogger(__name__) + +IMAGE_FACTOR = 28 +MIN_PIXELS = 4 * 28 * 28 +MAX_PIXELS = 16384 * 28 * 28 +MAX_RATIO = 200 + +VIDEO_MIN_PIXELS = 128 * 28 * 28 +VIDEO_MAX_PIXELS = 768 * 28 * 28 +VIDEO_TOTAL_PIXELS = 24576 * 28 * 28 +FRAME_FACTOR = 2 +FPS = 2.0 +FPS_MIN_FRAMES = 4 +FPS_MAX_FRAMES = 768 + +temporal_patch_size = 2 +spatial_patch_size = 14 +spatial_merge_size = 2 + + +def 
round_by_factor(number: int, factor: int) -> int: + """Returns the closest integer to 'number' that is divisible by 'factor'.""" + return round(number / factor) * factor + + +def ceil_by_factor(number: int, factor: int) -> int: + """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'.""" + return math.ceil(number / factor) * factor + + +def floor_by_factor(number: int, factor: int) -> int: + """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'.""" + return math.floor(number / factor) * factor + + +def smart_resize(height: int, + width: int, + factor: int = IMAGE_FACTOR, + min_pixels: int = MIN_PIXELS, + max_pixels: int = MAX_PIXELS) -> tuple[int, int]: + """ + Rescales the image so that the following conditions are met: + + 1. Both dimensions (height and width) are divisible by 'factor'. + + 2. The total number of pixels is within the range ['min_pixels', 'max_pixels']. + + 3. The aspect ratio of the image is maintained as closely as possible. + """ + if max(height, width) / min(height, width) > MAX_RATIO: + raise ValueError( + f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}" + ) + h_bar = max(factor, round_by_factor(height, factor)) + w_bar = max(factor, round_by_factor(width, factor)) + if h_bar * w_bar > max_pixels: + beta = math.sqrt((height * width) / max_pixels) + h_bar = floor_by_factor(height / beta, factor) + w_bar = floor_by_factor(width / beta, factor) + elif h_bar * w_bar < min_pixels: + beta = math.sqrt(min_pixels / (height * width)) + h_bar = ceil_by_factor(height * beta, factor) + w_bar = ceil_by_factor(width * beta, factor) + return h_bar, w_bar + + +def fetch_image(ele: dict[str, str | Image.Image], + size_factor: int = IMAGE_FACTOR) -> Image.Image: + if "image" in ele: + image = ele["image"] + else: + image = ele["image_url"] + image_obj = None + if isinstance(image, Image.Image): + image_obj = image + elif image.startswith("http://") or image.startswith("https://"): + image_obj = Image.open(requests.get(image, stream=True).raw) + elif image.startswith("file://"): + image_obj = Image.open(image[7:]) + elif image.startswith("data:image"): + if "base64," in image: + _, base64_data = image.split("base64,", 1) + data = base64.b64decode(base64_data) + image_obj = Image.open(BytesIO(data)) + else: + image_obj = Image.open(image) + if image_obj is None: + raise ValueError( + f"Unrecognized image input, support local path, http url, base64 and PIL.Image, got {image}" + ) + image = image_obj.convert("RGB") + ## resize + if "resized_height" in ele and "resized_width" in ele: + resized_height, resized_width = smart_resize( + ele["resized_height"], + ele["resized_width"], + factor=size_factor, + ) + else: + width, height = image.size + min_pixels = ele.get("min_pixels", MIN_PIXELS) + max_pixels = ele.get("max_pixels", MAX_PIXELS) + resized_height, resized_width = smart_resize( + height, + width, + factor=size_factor, + min_pixels=min_pixels, + max_pixels=max_pixels, + ) + image = image.resize((resized_width, resized_height)) + + return image + + +def smart_nframes( + ele: dict, + total_frames: int, + video_fps: int | float, +) -> int: + """calculate the number of frames for video used for model inputs. + + Args: + ele (dict): a dict contains the configuration of video. + support either `fps` or `nframes`: + - nframes: the number of frames to extract for model inputs. + - fps: the fps to extract frames for model inputs. 
+ - min_frames: the minimum number of frames of the video, only used when fps is provided. + - max_frames: the maximum number of frames of the video, only used when fps is provided. + total_frames (int): the original total number of frames of the video. + video_fps (int | float): the original fps of the video. + + Raises: + ValueError: nframes should in interval [FRAME_FACTOR, total_frames]. + + Returns: + int: the number of frames for video used for model inputs. + """ + assert not ("fps" in ele + and "nframes" in ele), "Only accept either `fps` or `nframes`" + if "nframes" in ele: + nframes = round_by_factor(ele["nframes"], FRAME_FACTOR) + else: + fps = ele.get("fps", FPS) + min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), + FRAME_FACTOR) + max_frames = floor_by_factor( + ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), + FRAME_FACTOR) + nframes = total_frames / video_fps * fps + nframes = min(max(nframes, min_frames), max_frames) + nframes = round_by_factor(nframes, FRAME_FACTOR) + if not (FRAME_FACTOR <= nframes and nframes <= total_frames): + raise ValueError( + f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}." + ) + return nframes + + +def _read_video_torchvision(ele: dict, ) -> torch.Tensor: + """read video using torchvision.io.read_video + + Args: + ele (dict): a dict contains the configuration of video. + support keys: + - video: the path of video. support "file://", "http://", "https://" and local path. + - video_start: the start time of video. + - video_end: the end time of video. + Returns: + torch.Tensor: the video tensor with shape (T, C, H, W). + """ + video_path = ele["video"] + if version.parse(torchvision.__version__) < version.parse("0.19.0"): + if "http://" in video_path or "https://" in video_path: + warnings.warn( + "torchvision < 0.19.0 does not support http/https video path, please upgrade to 0.19.0." + ) + if "file://" in video_path: + video_path = video_path[7:] + st = time.time() + video, audio, info = io.read_video( + video_path, + start_pts=ele.get("video_start", 0.0), + end_pts=ele.get("video_end", None), + pts_unit="sec", + output_format="TCHW", + ) + total_frames, video_fps = video.size(0), info["video_fps"] + total_duration = round(total_frames / video_fps, 3) + logger.info( + f"torchvision: {video_path=}, {total_frames=}, {video_fps=}, duration={total_duration}s, time={time.time() - st:.3f}s" + ) + nframes = smart_nframes(ele, + total_frames=total_frames, + video_fps=video_fps) + idx = torch.linspace(0, total_frames - 1, nframes).round().long() + video = video[idx] + return video, total_duration, nframes + + +def is_decord_available() -> bool: + import importlib.util + + return importlib.util.find_spec("decord") is not None + + +def _read_video_decord(ele: dict, ) -> torch.Tensor: + """read video using decord.VideoReader + + Args: + ele (dict): a dict contains the configuration of video. + support keys: + - video: the path of video. support "file://", "http://", "https://" and local path. + - video_start: the start time of video. + - video_end: the end time of video. + Returns: + torch.Tensor: the video tensor with shape (T, C, H, W). 
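+        Note: together with the tensor, this helper also returns the total
+        duration in seconds and the number of sampled frames, i.e.
+        (video, total_duration, nframes).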
+ """ + import decord + video_path = ele["video"] + st = time.time() + vr = decord.VideoReader(video_path) + # TODO: support start_pts and end_pts + if 'video_start' in ele or 'video_end' in ele: + raise NotImplementedError( + "not support start_pts and end_pts in decord for now.") + total_frames, video_fps = len(vr), vr.get_avg_fps() + total_duration = round(total_frames / video_fps, 3) + logger.info( + f"decord: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s" + ) + nframes = smart_nframes(ele, + total_frames=total_frames, + video_fps=video_fps) + idx = torch.linspace(0, total_frames - 1, nframes).round().long().tolist() + video = vr.get_batch(idx).asnumpy() + video = torch.tensor(video).permute(0, 3, 1, 2) # Convert to TCHW format + return video, total_duration, nframes + + +VIDEO_READER_BACKENDS = { + "decord": _read_video_decord, + "torchvision": _read_video_torchvision, +} + +FORCE_QWENVL_VIDEO_READER = os.getenv("FORCE_QWENVL_VIDEO_READER", None) + + +@lru_cache(maxsize=1) +def get_video_reader_backend() -> str: + if FORCE_QWENVL_VIDEO_READER is not None: + video_reader_backend = FORCE_QWENVL_VIDEO_READER + elif is_decord_available(): + video_reader_backend = "decord" + else: + video_reader_backend = "torchvision" + # print(f"qwen-vl-utils using {video_reader_backend} to read video.", file=sys.stderr) + return video_reader_backend + + +def fetch_video( + ele: dict, + image_factor: int = IMAGE_FACTOR) -> torch.Tensor | list[Image.Image]: + if isinstance(ele["video"], str): + video_reader_backend = get_video_reader_backend() + video, total_dur, nframes = VIDEO_READER_BACKENDS[ + video_reader_backend](ele) + frame_timestamps = total_dur * torch.arange(1, nframes + 1) / nframes + grid_timestamps = frame_timestamps[::FRAME_FACTOR] + second_per_grid = grid_timestamps[1] - grid_timestamps[0] + nframes, _, height, width = video.shape + factor = spatial_patch_size * spatial_merge_size + min_pixels = ele.get("min_pixels", VIDEO_MIN_PIXELS) + total_pixels = ele.get("total_pixels", VIDEO_TOTAL_PIXELS) + max_pixels = max( + min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), + int(min_pixels * 1.05)) + max_pixels = ele.get("max_pixels", max_pixels) + # min_pixels = (factor ** 2) * 52 + # max_pixels = (factor ** 2) * min(768, (16384 / nframes * temporal_patch_size)) + if "resized_height" in ele and "resized_width" in ele: + resized_height, resized_width = smart_resize( + ele["resized_height"], + ele["resized_width"], + factor=image_factor, + ) + else: + resized_height, resized_width = smart_resize( + height, + width, + factor=image_factor, + min_pixels=min_pixels, + max_pixels=max_pixels, + ) + video = transforms.functional.resize( + video, + [resized_height, resized_width], + interpolation=InterpolationMode.BICUBIC, + antialias=True, + ).float() + return video, total_dur, nframes, second_per_grid + else: + assert isinstance(ele["video"], (list, tuple)) + process_info = ele.copy() + process_info.pop("type", None) + process_info.pop("video", None) + images = [ + fetch_image({ + "image": video_element, + **process_info + }, + size_factor=image_factor) + for video_element in ele["video"] + ] + nframes = ceil_by_factor(len(images), FRAME_FACTOR) + if len(images) < nframes: + images.extend([images[-1]] * (nframes - len(images))) + return images, None, None, None + + +def extract_vision_info( + conversations: list[dict] | list[list[dict]]) -> list[dict]: + vision_infos = [] + if isinstance(conversations[0], dict): + conversations = [conversations] + for conversation 
in conversations: + for message in conversation: + if isinstance(message["content"], list): + for ele in message["content"]: + if ("image" in ele or "image_url" in ele or "video" in ele + or ele["type"] in ("image", "image_url", "video")): + vision_infos.append(ele) + return vision_infos + + +def process_vision_info( + conversations: list[dict] | list[list[dict]], +) -> tuple[list[Image.Image] | None, list[torch.Tensor | list[Image.Image]] + | None]: + vision_infos = extract_vision_info(conversations) + ## Read images or videos + image_inputs = [] + video_inputs = [] + for vision_info in vision_infos: + if "image" in vision_info or "image_url" in vision_info: + image_inputs.append(fetch_image(vision_info)) + elif "video" in vision_info: + video_inputs.append(fetch_video(vision_info)) + else: + raise ValueError("image, image_url or video should in content.") + if len(image_inputs) == 0: + image_inputs = None + if len(video_inputs) == 0: + video_inputs = None + return image_inputs, video_inputs \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/run.sh b/examples/offline_inference/qwen_2_5_omni/run.sh new file mode 100644 index 0000000000..2b5db45cb0 --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/run.sh @@ -0,0 +1,9 @@ +export PYTHONPATH=/path/to/vllm-omni:$PYTHONPATH +export HF_ENDPOINT=https://hf-mirror.com +python end2end.py --model Qwen/Qwen2.5-Omni-7B \ + --prompts "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words." \ + --voice-type "m02" \ + --dit-ckpt none \ + --bigvgan-ckpt none \ + --output-wav output_audio \ + --prompt_type text \ No newline at end of file diff --git a/examples/offline_inference/qwen_2_5_omni/utils.py b/examples/offline_inference/qwen_2_5_omni/utils.py new file mode 100644 index 0000000000..c8cf5392df --- /dev/null +++ b/examples/offline_inference/qwen_2_5_omni/utils.py @@ -0,0 +1,305 @@ +import tempfile +from urllib.request import urlopen +import librosa +import soundfile as sf +import resampy +from typing import Dict, Optional +import torch +import requests +import torchvision.io + +from typing import Union, List +from vllm.inputs import TextPrompt +from vllm_omni.inputs.data import OmniTokensPrompt +from processing_omni import fetch_image, fetch_video + + +def get_system_prompt(): + + return { + 'role': + 'system', + 'content': [{ + 'type': + 'text', + 'text': + 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.' 
+ }] + } + + +def resample_wav_to_16khz(input_filepath): + data, original_sample_rate = sf.read(input_filepath) + # Only use the first channel + if len(data.shape) > 1: + data = data[:, 0] + # resample to 16kHz + data_resampled = resampy.resample(data, + sr_orig=original_sample_rate, + sr_new=16000) + return data_resampled + + +def fetch_and_read_video(args, video_url: str, fps=2): + + def read_video_with_torchvision(video_file_name: str): + video, audio, info = torchvision.io.read_video( + video_file_name, + start_pts=0.0, + end_pts=None, + pts_unit="sec", + output_format="TCHW", + ) + + total_frames, video_fps = video.size(0), info["video_fps"] + total_duration = round(total_frames / video_fps, 3) + nframes = int(total_frames / video_fps * fps) + + frame_timestamps = total_duration * torch.arange(1, + nframes + 1) / nframes + grid_timestamps = frame_timestamps[::2] + second_per_grid = grid_timestamps[1] - grid_timestamps[0] + + idx = torch.linspace(0, video.size(0) - 1, nframes).round().long() + video = video[idx] + + if args.legacy_omni_video: + return [video, total_duration, nframes, second_per_grid.item()] + else: + return video + + def read_video_with_transformers(video_file_name: Union[str, List[str]]): + video, total_duration, nframes, second_per_grid = fetch_video( + {'video': video_file_name}) + if total_duration is None and nframes is None: + nframes = len(video) + total_duration = 0.5 * nframes + second_per_grid = 1.0 + if args.legacy_omni_video: + return [video, total_duration, nframes, second_per_grid] + else: + return video + + def read_video(video_file_name: str): + if args.use_torchvision: + return read_video_with_torchvision(video_file_name) + else: + return read_video_with_transformers(video_file_name) + + if isinstance(video_url, str) and video_url.startswith("http"): + with tempfile.NamedTemporaryFile(delete=True) as temp_video_file: + resp = requests.get(video_url) + assert resp.status_code == requests.codes.ok, f"Failed to fetch video from {video_url}, status_code:{resp.status_code}, resp:{resp}" + + temp_video_file.write(urlopen(video_url).read()) + temp_video_file_path = temp_video_file.name + video_file_name = temp_video_file_path + return read_video(video_file_name) + else: + video_file_name = video_url + return read_video(video_file_name) + + +def make_inputs_qwen2_omni( + args, + messages: List[Dict[str, Union[str, List[Dict[str, str]]]]], + use_audio_in_video: Optional[bool] = False, + tokenize: bool = False, +) -> Union[OmniTokensPrompt, TextPrompt]: + + from transformers import AutoConfig, AutoProcessor, AutoTokenizer + processor = AutoProcessor.from_pretrained(args.model) + tokenizer = AutoTokenizer.from_pretrained(args.model) + + try: + config = AutoConfig.from_pretrained(args.model) + if 'Qwen2_5OmniModel' in config.architectures: + args.legacy_omni_video = False + else: + args.legacy_omni_video = True + except: + args.legacy_omni_video = True + + audios, images, videos = [], [], [] + for message in messages: + if not isinstance(message['content'], list): + message['content'] = [{ + 'type': 'text', + 'text': message['content'], + }] + index, num_contents = 0, len(message['content']) + while index < num_contents: + ele = message['content'][index] + if 'type' not in ele: + if 'text' in ele: + ele['type'] = 'text' + elif 'audio' in ele: + ele['type'] = 'audio' + elif 'audio_url' in ele: + ele['type'] = 'audio_url' + elif 'image' in ele: + ele['type'] = 'image' + elif 'image_url' in ele: + ele['type'] = 'image_url' + elif 'video' in ele: + ele['type'] = 
'video' + elif 'video_url' in ele: + ele['type'] = 'video_url' + else: + raise ValueError(f'Unknown ele: {ele}') + + if ele['type'] == 'audio' or ele['type'] == 'audio_url': + if 'audio_url' in ele: + audio_key = 'audio_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_audio_file: + temp_audio_file.write(urlopen(ele[audio_key]).read()) + temp_audio_file_path = temp_audio_file.name + audios.append( + resample_wav_to_16khz(temp_audio_file_path)) + ele['audio'] = temp_audio_file_path + elif 'audio' in ele: + audio_key = 'audio' + audios.append(resample_wav_to_16khz(ele[audio_key])) + else: + raise ValueError(f'Unknown ele {ele}') + elif use_audio_in_video and (ele['type'] == 'video' + or ele['type'] == 'video_url'): + # use video as audio as well + if 'video_url' in ele: + audio_key = 'video_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_video_file: + temp_video_file.write(urlopen(ele[audio_key]).read()) + temp_video_file_path = temp_video_file.name + ele[audio_key] = temp_video_file_path + audios.append( + librosa.load(temp_video_file_path, sr=16000)[0]) + videos.append( + fetch_and_read_video(args, temp_video_file_path)) + ele['video'] = temp_video_file_path + elif 'video' in ele: + audio_key = 'video' + audios.append(librosa.load(ele[audio_key], sr=16000)[0]) + videos.append(fetch_and_read_video(args, audio_key)) + else: + raise ValueError("Unknown ele {}".format(ele)) + # insert a audio after the video + message['content'].insert(index + 1, { + "type": "audio", + "audio": ele[audio_key], + }) + # no need to load the added audio again + index += 1 + elif ele['type'] == 'video' or ele['type'] == 'video_url': + if 'video_url' in ele: + video_key = 'video_url' + with tempfile.NamedTemporaryFile( + delete=True) as temp_video_file: + temp_video_file.write(urlopen(ele['video_url']).read()) + temp_video_file_path = temp_video_file.name + videos.append(fetch_and_read_video(args, temp_video_file)) + ele['video'] = temp_video_file_path + else: + video_key = 'video' + videos.append(fetch_and_read_video(args, ele[video_key])) + elif ele['type'] == 'image' or ele['type'] == 'image_url': + images.append(fetch_image(ele)) + + # move to the next content + index += 1 + + prompt = processor.apply_chat_template( + messages, + tokenize=tokenize, + add_generation_prompt=True, + add_vision_id=True, + ) + + audios = audios if len(audios) > 0 else None + images = images if len(images) > 0 else None + videos = videos if len(videos) > 0 else None + + multi_modal_data = {} + if audios: + multi_modal_data["audio"] = audios + if images: + multi_modal_data["image"] = images + if videos: + multi_modal_data["video"] = videos + + if isinstance(prompt, list) and isinstance(prompt[0], (list, str)): + prompt = prompt[0] + + if tokenize: + return OmniTokensPrompt( + prompt_token_ids=prompt, + multi_modal_data=multi_modal_data, + ) + else: + return TextPrompt( + prompt=prompt, + multi_modal_data=multi_modal_data, + ) + + +def make_text_prompt(args, prompt): + messages = [ + get_system_prompt(), + { + "role": "user", + "content": [ + { + "type": "text", + "text": prompt + }, + ] + }, + ] + + prompt = make_inputs_qwen2_omni(args, messages, tokenize=args.tokenize) + return prompt + + +def make_audio_in_video_v2_prompt(args): + messages = [ + { + 'role': + 'system', + 'content': [{ + 'type': + 'text', + 'text': + 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.' 
+ }] + }, + { + "role": + "user", + "content": [ + { + "type": + "video_url", + "video_url": + "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw_small.mp4" + }, + ] + }, + ] + prompt = make_inputs_qwen2_omni( + args, + messages, + use_audio_in_video=True, + tokenize=args.tokenize, + ) + return prompt + + +def make_omni_prompt(args, prompt = None) -> Union[OmniTokensPrompt, List[OmniTokensPrompt]]: + if args.prompt_type == 'text': + prompt = make_text_prompt(args, prompt) + elif args.prompt_type == 'audio-in-video-v2': + prompt = make_audio_in_video_v2_prompt(args) + else: + raise ValueError(f'Unsupported prompt type: {args.prompt_type}') + return prompt \ No newline at end of file