Merged
2 changes: 1 addition & 1 deletion docker/Dockerfile.ci
@@ -7,7 +7,7 @@ COPY . .

# Install system dependencies
RUN apt-get update && \
apt-get install -y espeak-ng ffmpeg git sox libsox-fmt-all jq && \
apt-get install -y espeak-ng git sox libsox-fmt-all jq && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

2 changes: 1 addition & 1 deletion docker/Dockerfile.cuda
@@ -7,7 +7,7 @@ WORKDIR ${COMMON_WORKDIR}

# Step 1: Setup - Install system dependencies
RUN apt-get update && \
apt-get install -y ffmpeg git sox libsox-fmt-all jq && \
apt-get install -y git sox libsox-fmt-all jq && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

2 changes: 1 addition & 1 deletion docker/Dockerfile.rocm
@@ -19,7 +19,7 @@ WORKDIR ${COMMON_WORKDIR}

# Step 1: Setup - Install system dependencies
RUN apt-get update && \
apt-get install -y espeak-ng ffmpeg git sox libsox-fmt-all jq && \
apt-get install -y espeak-ng git sox libsox-fmt-all jq && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

2 changes: 0 additions & 2 deletions docker/Dockerfile.xpu
@@ -15,9 +15,7 @@ RUN apt clean && apt-get update -y && \
apt-get install -y --no-install-recommends --fix-missing \
curl \
espeak-ng \
ffmpeg \
Review comment (Collaborator): @xuechendi Could you check whether XPU has the same issue (#2708)?

Reply (Contributor): Thanks, LGTM

git \
libsndfile1 \
libsm6 \
libxext6 \
libgl1 \
8 changes: 0 additions & 8 deletions docs/usage/faq.md
@@ -4,14 +4,6 @@

A: Now, we support natively disaggregated deployment for different model stages within a model. There is a restriction that one chip can only have one AutoRegressive model stage. This is because the unified KV cache management of vLLM. Stages of other types can coexist within a chip. The restriction will be resolved in later version.

> Q: When trying to run examples, I encounter error about backend of librosa or soundfile. How to solve it?

A: If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

> Q: I see GPU OOM or "free memory is less than desired GPU memory utilization" errors. How can I fix it?

A: Refer to [GPU memory calculation and configuration](../configuration/gpu_memory_utilization.md) for guidance on tuning `gpu_memory_utilization` and related settings.
7 changes: 0 additions & 7 deletions docs/user_guide/examples/offline_inference/bagel.md
@@ -250,13 +250,6 @@ For more details on the Mooncake connector and multi-node setup, see the [Moonca

## FAQ

- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.

```bash
sudo apt update
sudo apt install ffmpeg
```

- If you don’t know how much VRAM is needed for the model or encounter the OOM error, you can try to decrease the max_model_len.

| Stage | VRAM |
2 changes: 1 addition & 1 deletion docs/user_guide/examples/offline_inference/cosyvoice3.md
@@ -10,7 +10,7 @@ Install dependencies:
uv pip install -e .
```

> **Note:** This includes required libraries such as `librosa`, `soundfile`,
> **Note:** This includes required libraries such as `soundfile`,
> `onnxruntime`, `x-transformers`, and `einops` via
> `requirements/common.txt` and platform-specific requirements files.

23 changes: 0 additions & 23 deletions docs/user_guide/examples/offline_inference/mimo_audio.md
@@ -189,29 +189,6 @@ Note: This task uses hardcoded message lists in the script.

## Troubleshooting

### Audio dependencies (soundfile, librosa)

This example depends on **soundfile** (read/write WAV) and **librosa** (load audio including MP3). Install the project requirements first:

```bash
pip install -r requirements/common.txt
# or at least: pip install soundfile>=0.13.1 librosa>=0.11.0
```

- **`soundfile` / libsndfile not found**
`soundfile` uses the C library **libsndfile**. On Linux, install the system package before pip:
- Debian/Ubuntu: `sudo apt-get install libsndfile1`
- For development builds: `sudo apt-get install libsndfile1-dev`
- Then: `pip install soundfile`

- **`librosa` fails to load MP3 or reports "No backend available"**
Loading MP3 (e.g. in `spoken_dialogue_sft_multiturn` with `.mp3` files) uses **ffmpeg** as the backend. Install ffmpeg:
- Debian/Ubuntu: `sudo apt-get install ffmpeg`
- macOS: `brew install ffmpeg`

- **`ImportError: No module named 'soundfile'` or `ModuleNotFoundError: ... librosa`**
Ensure you are in the same Python environment where vLLM Omni and the example dependencies are installed, and that `requirements/common.txt` (or the packages above) are installed.

### Tokenizer path

- **`MIMO_AUDIO_TOKENIZER_PATH` not set or model fails to find tokenizer**
8 changes: 0 additions & 8 deletions docs/user_guide/examples/offline_inference/qwen2_5_omni.md
@@ -64,14 +64,6 @@ If media file paths are not provided, the script will use default assets. Suppor
- `use_audio_in_video`: Extract audio from video
- `text`: Text-only query

### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

## Example materials

??? abstract "end2end.py"
8 changes: 0 additions & 8 deletions docs/user_guide/examples/offline_inference/qwen3_omni.md
@@ -112,14 +112,6 @@ python end2end_async_chunk.py \
> async_chunk example when you need the stage-level concurrency semantics
> described in PR #962 / #1151.

### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

## Example materials

??? abstract "end2end.py"
7 changes: 0 additions & 7 deletions docs/user_guide/examples/online_serving/bagel.md
@@ -357,13 +357,6 @@ curl http://localhost:8091/v1/chat/completions \

## FAQ

- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.

```bash
sudo apt update
sudo apt install ffmpeg
```

- If you don’t know how much VRAM is needed for the model or encounter the OOM error, you can try to decrease the max_model_len.

| Stage | VRAM |
8 changes: 0 additions & 8 deletions docs/user_guide/examples/online_serving/qwen2_5_omni.md
@@ -218,14 +218,6 @@ The gradio script supports the following arguments:
- `--port`: Port for Gradio server (default: 7861)
- `--share`: Share the Gradio demo publicly (creates a public link)

### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

## Example materials

??? abstract "gradio_demo.py"
9 changes: 0 additions & 9 deletions docs/user_guide/examples/online_serving/qwen3_omni.md
@@ -64,15 +64,6 @@ python openai_chat_completion_client_for_multimodal_generation.py \
bash run_curl_multimodal_generation.sh use_image
```


### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

## Modality control
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.

8 changes: 0 additions & 8 deletions docs/user_guide/examples/online_serving/qwen3_tts.md
@@ -211,14 +211,6 @@ with open("output.wav", "wb") as f:
f.write(response.content)
```

### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```

## API Reference

### Voices Endpoint
7 changes: 0 additions & 7 deletions examples/offline_inference/bagel/README.md
@@ -247,13 +247,6 @@ For more details on the Mooncake connector and multi-node setup, see the [Moonca

## FAQ

- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.

```bash
sudo apt update
sudo apt install ffmpeg
```

- If you don’t know how much VRAM is needed for the model or encounter the OOM error, you can try to decrease the max_model_len.

| Stage | VRAM |
2 changes: 1 addition & 1 deletion examples/offline_inference/cosyvoice3/README.md
@@ -7,7 +7,7 @@ Install dependencies:
uv pip install -e .
```

> **Note:** This includes required libraries such as `librosa`, `soundfile`,
> **Note:** This includes required libraries such as `soundfile`,
> `onnxruntime`, `x-transformers`, and `einops` via
> `requirements/common.txt` and platform-specific requirements files.

22 changes: 2 additions & 20 deletions examples/offline_inference/cosyvoice3/verify_e2e_cosyvoice.py
@@ -2,36 +2,19 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import os
from pathlib import Path

import librosa
import numpy as np
import soundfile as sf
from vllm import SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.multimodal.media.audio import load_audio

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.model_executor.models.cosyvoice3.config import CosyVoice3Config
from vllm_omni.model_executor.models.cosyvoice3.tokenizer import get_qwen_tokenizer
from vllm_omni.model_executor.models.cosyvoice3.utils import extract_text_token


def _ensure_mel_filters_asset() -> None:
repo_root = Path(__file__).resolve().parents[3]
filters_path = repo_root / "vllm_omni" / "model_executor" / "models" / "cosyvoice3" / "assets" / "mel_filters.npz"
if filters_path.exists():
return

source_url = "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz"
raise FileNotFoundError(
"Missing CosyVoice3 mel filter asset:\n"
f" {filters_path}\n"
"Download it with:\n"
f" mkdir -p {filters_path.parent} && "
f"curl -L {source_url} -o {filters_path}"
)


def run_e2e():
parser = argparse.ArgumentParser()
# ""FunAudioLLM/Fun-CosyVoice3-0.5B-2512
@@ -56,7 +39,6 @@ def run_e2e():
help="Path to tokenizer directory (e.g., <model_path>/CosyVoice-BlankEN).",
)
args = parser.parse_args()
_ensure_mel_filters_asset()
# Ensure tokenizer directory exists
if not os.path.exists(args.tokenizer):
raise FileNotFoundError(f"{args.tokenizer} does not exist!")
@@ -85,7 +67,7 @@ def run_e2e():
if not os.path.exists(args.audio_path):
raise FileNotFoundError(f"Audio file not found: {args.audio_path}")
# Load at native sample rate
audio_signal, sr = librosa.load(args.audio_path, sr=None)
audio_signal, sr = load_audio(args.audio_path, sr=None)

# Validate sample rate before processing (similar to original CosyVoice)
min_sr = 16000
23 changes: 0 additions & 23 deletions examples/offline_inference/mimo_audio/README.md
@@ -190,29 +190,6 @@ Note: This task uses hardcoded message lists in the script.

## Troubleshooting

### Audio dependencies (soundfile, librosa)

This example depends on **soundfile** (read/write WAV) and **librosa** (load audio including MP3). Install the project requirements first:

```bash
pip install -r requirements/common.txt
# or at least: pip install soundfile>=0.13.1 librosa>=0.11.0
```

- **`soundfile` / libsndfile not found**
`soundfile` uses the C library **libsndfile**. On Linux, install the system package before pip:
- Debian/Ubuntu: `sudo apt-get install libsndfile1`
- For development builds: `sudo apt-get install libsndfile1-dev`
- Then: `pip install soundfile`

- **`librosa` fails to load MP3 or reports "No backend available"**
Loading MP3 (e.g. in `spoken_dialogue_sft_multiturn` with `.mp3` files) uses **ffmpeg** as the backend. Install ffmpeg:
- Debian/Ubuntu: `sudo apt-get install ffmpeg`
- macOS: `brew install ffmpeg`

- **`ImportError: No module named 'soundfile'` or `ModuleNotFoundError: ... librosa`**
Ensure you are in the same Python environment where vLLM Omni and the example dependencies are installed, and that `requirements/common.txt` (or the packages above) are installed.

### Tokenizer path

- **`MIMO_AUDIO_TOKENIZER_PATH` not set or model fails to find tokenizer**
4 changes: 2 additions & 2 deletions examples/offline_inference/mimo_audio/message_convert.py
@@ -5,12 +5,12 @@
import re
from collections.abc import Callable

import librosa
import numpy as np
import torch
import torchaudio
from process_speechdata import InputSegment, StreamingInputSegment
from torchaudio.transforms import MelSpectrogram
from vllm.multimodal.media.audio import load_audio

speech_zeroemb_idx = 151667
empty_token = "<|empty|>"
@@ -685,7 +685,7 @@ def get_audio_data(audio_url):
# File path
audio_file = audio_url

audio_signal, sr = librosa.load(audio_file, sr=24000)
audio_signal, sr = load_audio(audio_file, sr=24000)
audio_data = (audio_signal.astype(np.float32), sr)
return audio_data

4 changes: 2 additions & 2 deletions examples/offline_inference/omnivoice/end2end.py
@@ -103,9 +103,9 @@ def run_e2e():
if not os.path.exists(args.ref_audio):
raise FileNotFoundError(f"Reference audio not found: {args.ref_audio}")

import librosa
from vllm.multimodal.media.audio import load_audio

audio_signal, sr = librosa.load(args.ref_audio, sr=None)
audio_signal, sr = load_audio(args.ref_audio, sr=None)
multi_modal_data["audio"] = (audio_signal.astype(np.float32), sr)
mm_processor_kwargs["ref_text"] = args.ref_text or ""
mm_processor_kwargs["sample_rate"] = sr
8 changes: 0 additions & 8 deletions examples/offline_inference/qwen2_5_omni/README.md
@@ -60,11 +60,3 @@ If media file paths are not provided, the script will use default assets. Suppor
- `mixed_modalities`: Audio + image + video
- `use_audio_in_video`: Extract audio from video
- `text`: Text-only query

### FAQ

If you encounter error about backend of librosa, try to install ffmpeg with command below.
```
sudo apt update
sudo apt install ffmpeg
```
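
Note on the Python hunks: the three Python files in this diff make the same substitution. Each `librosa.load(path, sr=...)` call becomes `load_audio(path, sr=...)`, imported from `vllm.multimodal.media.audio`, and every call site still unpacks a `(signal, sample_rate)` pair. Below is a minimal sketch of that call-site pattern, assuming only what the hunks show. It imports neither vLLM nor librosa. `load_reference_audio`, its `loader` parameter, `fake_loader`, and the `ValueError` raised for a low sample rate are hypothetical names and behavior chosen for illustration; only the `min_sr = 16000` value and the `FileNotFoundError` message are taken from `verify_e2e_cosyvoice.py`.

```python
import os
import tempfile

MIN_SR = 16000  # same value as min_sr in verify_e2e_cosyvoice.py


def load_reference_audio(path, loader, sr=None, min_sr=MIN_SR):
    """Load audio through an injected loader and reject low sample rates.

    `loader` is any callable invoked as loader(path, sr=...) that returns
    (samples, sample_rate), which is how both librosa.load and load_audio
    are called in this diff.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(f"Audio file not found: {path}")
    samples, sample_rate = loader(path, sr=sr)
    # Assumption: a too-low sample rate is treated as an error here;
    # the diff does not show what the real script does in this branch.
    if sample_rate < min_sr:
        raise ValueError(
            f"Sample rate {sample_rate} Hz is below the minimum {min_sr} Hz"
        )
    return samples, sample_rate


def fake_loader(path, sr=None):
    """Hypothetical stand-in loader: one second of silence at 24 kHz,
    or at the requested rate when sr is given."""
    rate = 24000 if sr is None else sr
    return [0.0] * rate, rate


if __name__ == "__main__":
    # An empty temporary file is enough, since fake_loader never reads it.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        signal, rate = load_reference_audio(tmp.name, fake_loader, sr=None)
        print(len(signal), rate)
```

With the loader passed in as an argument, switching a call site from one loader to the other would presumably be a one-argument change, provided both return the same `(signal, sample_rate)` shape as the hunks suggest.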