Merged
2 changes: 2 additions & 0 deletions README.md
@@ -28,11 +28,13 @@ ______________________________________________________________________

### 🚀 Release Highlights

- **[2026/05/13]** [vLLM-Omni v0.20.0](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `omni-cuda-v1.1` · SageMaker: `omni-sagemaker-cuda-v1.1` · Adds `/v1/audio/generate` (stable-audio-open) and `/v1/videos/sync` (unblocks video on SageMaker); supports CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B; CUDA 13.0 + PyTorch 2.11.0.
- **[2026/05/11]** [vLLM v0.20.2](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `0.20.2-gpu-py312-ec2` · SageMaker: `0.20.2-gpu-py312` · Bug fixes for DeepSeek V4.
- **[2026/05/06]** [SGLang v0.5.11](https://gallery.ecr.aws/deep-learning-containers/sglang) — EC2: `0.5.11-gpu-py312-ec2` · SageMaker: `0.5.11-gpu-py312` · Model support for Gemma 4, GLM-5.1, Qwen 3.4, and more.
- **[2026/05/05]** [vLLM v0.20.1](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `0.20.1-gpu-py312-ec2` · SageMaker: `0.20.1-gpu-py312` · Bug fixes for DeepSeek V4.
- **[2026/04/30]** [PyTorch v2.11.0](https://gallery.ecr.aws/deep-learning-containers/pytorch) — EC2: `2.11.0-cu130-amzn2023` · SageMaker: `2.11.0-cu130-amzn2023-sagemaker` · Amazon Linux 2023 with EFA, flash-attn, and transformer-engine.
- **[2026/04/28]** [vLLM v0.20.0](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `0.20.0-gpu-py312-ec2` · SageMaker: `0.20.0-gpu-py312` · Introduces support for DeepSeek V4.
- **[2026/04/24]** [vLLM-Omni v0.18.0](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `omni-cuda-v1.0` · SageMaker: `omni-sagemaker-cuda-v1.0` · Initial release. Serves omni-modality models (TTS, image, video, multimodal chat) through OpenAI-compatible APIs; SageMaker routing middleware via `CustomAttributes`.
- **[2026/04/20]** [vLLM v0.19.1](https://gallery.ecr.aws/deep-learning-containers/vllm) — EC2: `0.19-gpu-py312-ec2` · SageMaker: `0.19-gpu-py312` · This upgrades Transformers to 5.5.4, enabling Gemma 4 support.

### 📢 Support Updates
2 changes: 1 addition & 1 deletion docs/src/data/vllm-omni/0.18.0-gpu-ec2.yml
@@ -9,7 +9,7 @@ platform: default
public_registry: true

tags:
- "omni-cuda-v1"
- "omni-cuda-v1.0"

announcements:
- "Initial release of vLLM-Omni containers for EC2, ECS, EKS"
2 changes: 1 addition & 1 deletion docs/src/data/vllm-omni/0.18.0-gpu-sagemaker.yml
@@ -9,7 +9,7 @@ platform: sagemaker
public_registry: true

tags:
- "omni-sagemaker-cuda-v1"
- "omni-sagemaker-cuda-v1.0"

announcements:
- "Initial release of vLLM-Omni containers for SageMaker"
35 changes: 35 additions & 0 deletions docs/src/data/vllm-omni/0.20.0-gpu-ec2.yml
@@ -0,0 +1,35 @@
framework: vLLM-Omni
version: "0.20.0"
ecr_repository: vllm
accelerator: gpu
python: py312
cuda: cu130
os: amzn2023
platform: default
public_registry: true

tags:
# Only the patch-floating tag is listed per release. The minor-floating
# `omni-cuda-v1` tag is documented in docs/vllm-omni/index.md (Pull Commands +
# Versioning and Tags) but is not a per-release identifier: it points at
# whichever release is currently the v1-line target. Because each release
# yaml lists only its own patch-floating tag, nothing here goes stale when
# the v1 floater advances, and no yaml edits are needed.
- "omni-cuda-v1.1" # floats across DLC patches in the v1.1 line (auto-accepts security patches)

announcements:
- "Bumps vLLM-Omni to 0.20.0 and aligns with upstream vLLM v0.20.0"
- "CUDA 12.9 → 13.0 base image; PyTorch 2.10.0 → 2.11.0"
- "New `/v1/audio/generate` endpoint for diffusion-based audio generation (e.g., stable-audio-open)"
- "New `/v1/videos/sync` endpoint — blocking variant of `/v1/videos` that returns the MP4 directly"
- "Adds support for CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B, and Stable-Audio-Open-1.0"

packages:
vllm: "0.20.0"
vllm-omni: "0.20.0"
pytorch: "2.11.0"
torchvision: "0.26.0"
torchaudio: "2.11.0"
cuda: "13.0.2"
flashinfer: "0.6.8.post1"
efa: "1.47.0"
35 changes: 35 additions & 0 deletions docs/src/data/vllm-omni/0.20.0-gpu-sagemaker.yml
@@ -0,0 +1,35 @@
framework: vLLM-Omni
version: "0.20.0"
ecr_repository: vllm
accelerator: gpu
python: py312
cuda: cu130
os: amzn2023
platform: sagemaker
public_registry: true

tags:
# Only the patch-floating tag is listed per release. The minor-floating
# `omni-sagemaker-cuda-v1` tag is documented in docs/vllm-omni/index.md
# (Pull Commands + Versioning and Tags) but is not a per-release identifier:
# it points at whichever release is currently the v1-line target. Because
# each release yaml lists only its own patch-floating tag, nothing here goes
# stale when the v1 floater advances, and no yaml edits are needed.
- "omni-sagemaker-cuda-v1.1" # floats across DLC patches in the v1.1 line (auto-accepts security patches)

announcements:
- "Bumps vLLM-Omni to 0.20.0 and aligns with upstream vLLM v0.20.0"
- "CUDA 12.9 → 13.0 base image; PyTorch 2.10.0 → 2.11.0"
- "Video generation now supported on SageMaker via the new `/v1/videos/sync` endpoint"
- "Adds `/v1/audio/generate` and `/v1/videos/sync` to the routing middleware"
- "Adds support for CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B, and Stable-Audio-Open-1.0"

packages:
vllm: "0.20.0"
vllm-omni: "0.20.0"
pytorch: "2.11.0"
torchvision: "0.26.0"
torchaudio: "2.11.0"
cuda: "13.0.2"
flashinfer: "0.6.8.post1"
efa: "1.47.0"
146 changes: 121 additions & 25 deletions docs/vllm-omni/index.md
@@ -1,41 +1,83 @@
# vLLM-Omni Inference

Pre-built Docker images for serving omni-modality models (text-to-speech, image generation, video generation, and multimodal chat) with
[vLLM-Omni](https://github.com/vllm-project/vllm-omni). Built on Amazon Linux 2023 with CUDA 12.9 and Python 3.12.
Pre-built Docker images for serving omni-modality models (text-to-speech, audio generation, image generation, video generation, and multimodal chat)
with [vLLM-Omni](https://github.com/vllm-project/vllm-omni). Built on Amazon Linux 2023 with CUDA 13.0 and Python 3.12.

## Latest Announcements

**May 12, 2026** — vLLM-Omni 0.20.0 release. Aligns with upstream vLLM v0.20.0; bumps CUDA to 13.0 and PyTorch to 2.11.0. Adds two new endpoints:
`/v1/audio/generate` for diffusion-based audio generation (e.g., stable-audio-open) and `/v1/videos/sync` — a blocking variant of `/v1/videos` that
returns the MP4 directly and unblocks video generation on SageMaker. New supported models: CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B,
Stable-Audio-Open-1.0.

**April 24, 2026** — vLLM-Omni 0.18.0 initial release. Serves TTS, image, video, and omni-chat models through OpenAI-compatible APIs. Includes a
SageMaker routing middleware for dispatching `/invocations` to any omni endpoint via `CustomAttributes`.

## Pull Commands

**EC2:**
Images are published to both the public ECR gallery (no AWS credentials required) and the private DLC ECR repository (requires
`aws ecr get-login-password`, see [Getting Started](../get_started/index.md)).

**Multimodal (TTS, image/video/audio generation, omni chat) on EC2 / EKS:**

```bash
docker pull {{ images.latest_vllm_omni_ec2 }}
# Public ECR (anonymous pull):
docker pull public.ecr.aws/deep-learning-containers/vllm:omni-cuda

# Private ECR (authenticated; substitute your region):
docker pull 763104351884.dkr.ecr.<region>.amazonaws.com/vllm:omni-cuda
```

**SageMaker:**
**Multimodal on Amazon SageMaker AI:**

```bash
# Public ECR (anonymous pull):
docker pull public.ecr.aws/deep-learning-containers/vllm:omni-sagemaker-cuda

# Private ECR (authenticated; substitute your region):
docker pull 763104351884.dkr.ecr.<region>.amazonaws.com/vllm:omni-sagemaker-cuda
```

See [Available Images](../reference/available_images.md) for the full per-region URI table.

## Pin a Version

Append a version suffix to the base tag to control update behavior:

| Suffix | Example | Updates when |
| --- | --- | --- |
| (none) | `omni-cuda` | Any release, including breaking changes |
| `-v<MAJOR>` | `omni-cuda-v1` | New features and fixes, no breaking changes |
| `-v<MAJOR>.<MINOR>` | `omni-cuda-v1.1` | Security patches and bug fixes only |
| `-v<MAJOR>.<MINOR>.<PATCH>` | `omni-cuda-v1.1.0` | Never — immutable snapshot |

The same suffixes apply to the SageMaker base tag (`omni-sagemaker-cuda`).
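
The suffix rules in the table can be sketched as a small helper. This is an illustrative sketch only: `pinned_tag` is a hypothetical name that is not part of any DLC tooling, and the registry path is the public ECR path shown in the pull commands above.

```python
from typing import Optional


def pinned_tag(
    base: str,
    major: Optional[int] = None,
    minor: Optional[int] = None,
    patch: Optional[int] = None,
) -> str:
    """Append a -v<MAJOR>[.<MINOR>[.<PATCH>]] suffix to a base tag.

    Hypothetical helper: it only mirrors the suffix table, it does not
    check that the resulting tag exists in the registry.
    """
    if major is None:
        if minor is not None or patch is not None:
            raise ValueError("minor and patch require major")
        return base  # floating tag: follows every release
    if minor is None:
        if patch is not None:
            raise ValueError("patch requires minor")
        return f"{base}-v{major}"  # follows features and fixes
    if patch is None:
        return f"{base}-v{major}.{minor}"  # follows patches only
    return f"{base}-v{major}.{minor}.{patch}"  # immutable snapshot


REGISTRY = "public.ecr.aws/deep-learning-containers/vllm"
image_uri = f"{REGISTRY}:{pinned_tag('omni-sagemaker-cuda', 1, 1)}"
```

Keeping the pin level as explicit arguments makes the update policy visible in deployment code instead of hiding it inside a hard-coded tag string.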

**Recommended for production:** pin to `-v<MAJOR>.<MINOR>` (e.g., `omni-cuda-v1.1`). The tag auto-accepts security patches and bug fixes within
one minor line (v1.1 tracks the vLLM-Omni 0.20 line) and does not move to new minor releases that could change behavior. For example, customers
pinned at this level on the previous minor line (`omni-cuda-v1.0`) were insulated from the Code2Wav un-batching regression that landed with the
v1.1 minor bump (see [Known Limitations](#known-limitations) below) until they were ready to evaluate it.

For byte-identical reproducibility, pull by digest:

```bash
docker pull {{ images.latest_vllm_omni_sagemaker }}
docker pull public.ecr.aws/deep-learning-containers/vllm@sha256:<digest>
```

See [Available Images](../reference/available_images.md) for all image URIs and [Getting Started](../get_started/index.md) for authentication
instructions.
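To find the digest of an image you have already pulled, run `docker images --digests` or read the `RepoDigests` field in the output of
`docker inspect <image>`. A digest reference always resolves to the same image content.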

## Packages

For package versions included in each release, see the [Release Notes](../releasenotes/vllm-omni/index.md).

## Supported Modalities

| Modality | Route | Example Model |
| Modality | Route | Example Models |
| --- | --- | --- |
| Text-to-Speech | `/v1/audio/speech` | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` |
| Image Generation | `/v1/images/generations` | `black-forest-labs/FLUX.2-klein-4B` |
| Video Generation | `/v1/videos` | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
| Text-to-Speech | `/v1/audio/speech` | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`, `Qwen/Qwen3-TTS-12Hz-1.7B-Base`, `FunAudioLLM/CosyVoice3-0.5B` |
| Audio Generation | `/v1/audio/generate` (new in 0.20.0) | `stabilityai/stable-audio-open-1.0` |
| Image Generation | `/v1/images/generations` | `black-forest-labs/FLUX.2-klein-4B`, `baidu/ERNIE-Image-Turbo` |
| Video Generation (async) | `/v1/videos` | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-VACE-1.3B-Diffusers` |
| Video Generation (sync) | `/v1/videos/sync` (new in 0.20.0) | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-VACE-1.3B-Diffusers` |
| Multimodal Chat | `/v1/chat/completions` | `bytedance-research/BAGEL-7B-MoT`, `Qwen/Qwen2.5-Omni-3B` |

## Model Compatibility
@@ -59,15 +101,35 @@ starts the container, waits for readiness, submits a request, and writes the out
**Model:** [Qwen3-TTS-12Hz-1.7B-CustomVoice](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice) — a 1.7B-parameter Qwen3 text-to-speech
model supporting multiple voices and languages, runs on a single 24 GB GPU (A10G / L4).

For voice cloning, use [Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) or
[CosyVoice3-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice3-0.5B) — both accept a reference audio clip plus its transcript and synthesize new
speech in the reference speaker's voice. CosyVoice3 is zero-shot voice-clone only (no preset voices) and requires `--trust-remote-code`.

```bash
--8<-- "examples/vllm-omni/tts/run.sh"
```

### Audio Generation

**Model:** [Stable-Audio-Open-1.0](https://huggingface.co/stabilityai/stable-audio-open-1.0) — a diffusion model for text-to-audio (sound effects,
ambience, short music clips), distinct from TTS. Generates up to ~47 seconds of audio per request, runs on a single 24 GB GPU.

The `/v1/audio/generate` endpoint (new in 0.20.0) takes a text prompt plus diffusion knobs (`audio_length`, `guidance_scale`, `num_inference_steps`,
`seed`) and returns a single binary WAV blob — no streaming. See the
[upstream API spec](https://github.com/vllm-project/vllm-omni/blob/main/docs/serving/audio_generate_api.md) for the full request shape.

```bash
--8<-- "examples/vllm-omni/audio-generate/run.sh"
```
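
As a rough illustration of the request shape described above, the sketch below builds (but does not send) a `POST /v1/audio/generate` request with the standard library. The knob names `audio_length`, `guidance_scale`, `num_inference_steps`, and `seed` come from the text above; the name of the prompt field (`prompt` here) and the `localhost:8000` address are assumptions, so check the upstream API spec before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_audio_generate_request(
    base_url: str,
    prompt: str,
    *,
    prompt_field: str = "prompt",  # ASSUMPTION: verify against the upstream spec
    audio_length: Optional[float] = None,
    guidance_scale: Optional[float] = None,
    num_inference_steps: Optional[int] = None,
    seed: Optional[int] = None,
) -> urllib.request.Request:
    """Build a JSON POST request for /v1/audio/generate without sending it."""
    knobs = {
        "audio_length": audio_length,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
        "seed": seed,
    }
    payload = {prompt_field: prompt}
    # Only send knobs the caller set, so server-side defaults still apply.
    payload.update({k: v for k, v in knobs.items() if v is not None})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# ASSUMPTION: the container is listening on localhost:8000.
request = build_audio_generate_request(
    "http://localhost:8000", "rain on a tin roof", audio_length=10.0, seed=42
)
```

Because the endpoint returns a single binary WAV blob, a client would typically pass `request` to `urllib.request.urlopen` and write the bytes from `response.read()` straight to a `.wav` file.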

### Image Generation

**Model:** [FLUX.2-klein-4B](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) — a 4B-parameter rectified-flow transformer from Black Forest
Labs, produces high-quality 512×512 images from text prompts, runs on a single 24 GB GPU.

[ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo) is also supported as of 0.20.0 — an 8-step distilled DiT for fast inference with a
matching request shape.

```bash
--8<-- "examples/vllm-omni/image/run.sh"
```
@@ -76,14 +138,24 @@ Labs, produces high-quality 512×512 images from text prompts, runs on a single

**Model:** [Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers) — a 1.3B-parameter text-to-video diffusion model from the Wan
team, generates short clips at up to 480×832 resolution. Needs a 48 GB GPU (L40S) or 2× 24 GB GPUs with `--tensor-parallel-size 2`.
[Wan2.1-VACE-1.3B](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B-Diffusers) (added in 0.20.0) is a unified video creation/editing pipeline that
accepts text plus optional video, mask, or reference image inputs.

The `/v1/videos` endpoint is asynchronous — it returns a job ID immediately and generates the video in the background. The script below submits the
job, polls until it completes, then downloads the MP4.
Two route options:

- **Async** (`POST /v1/videos`): returns a job ID immediately; poll `GET /v1/videos/{id}` until status is `completed`, then download the MP4 from
`GET /v1/videos/{id}/content`. Best for long-running batch jobs and the only option in 0.18.0.
- **Sync** (`POST /v1/videos/sync`, new in 0.20.0): blocks until generation completes and returns the raw MP4 in the response body. Simpler client
code, and the only video path that returns the MP4 through SageMaker, where it should be deployed behind async inference rather than a real-time
endpoint (see [SageMaker Deployment](#sagemaker-deployment)).

```bash
--8<-- "examples/vllm-omni/video/run.sh"
```

```bash
--8<-- "examples/vllm-omni/video-sync/run.sh"
```
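
The polling step of the async route can be sketched as below. This is a minimal sketch, not the logic of the `run.sh` scripts above: `wait_for_video` and `get_job` are hypothetical names, the `status` field and the `completed` value follow the description above, and the `failed` and `cancelled` terminal states are assumptions about the job schema.

```python
import time
from typing import Callable, Dict


def wait_for_video(
    job_id: str,
    get_job: Callable[[str], Dict],
    poll_interval: float = 5.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll a job until it completes, fails, or the timeout expires.

    get_job is expected to wrap GET /v1/videos/{id} and return the parsed
    JSON body; it is injected so the loop can be tested without a server.
    """
    deadline = clock() + timeout
    while True:
        job = get_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status in ("failed", "cancelled"):  # ASSUMPTION: terminal states
            raise RuntimeError(f"video job {job_id} ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"video job {job_id} not completed after {timeout}s")
        sleep(poll_interval)
```

Once the job reports `completed`, the client downloads the MP4 from `GET /v1/videos/{id}/content`. With the sync route none of this loop is needed, which is the main reason the client code is simpler.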

### Multimodal Chat

Use the standard OpenAI chat-completions API. Multimodal inputs (images, audio) are supplied as URL or base64 content parts in the message list.
@@ -128,8 +200,10 @@ header:
| `CustomAttributes` | Dispatched to |
| --- | --- |
| `route=/v1/audio/speech` | TTS |
| `route=/v1/audio/generate` | Audio generation (new in 0.20.0) |
| `route=/v1/images/generations` | Image generation |
| `route=/v1/videos` | Video generation (JSON auto-converted to form-data) — returns job-ID only in 0.18.0, MP4 not retrievable via SageMaker |
| `route=/v1/videos` | Video generation, async (JSON auto-converted to form-data) — returns job-ID only; MP4 not retrievable via SageMaker. Prefer `/v1/videos/sync` below. |
| `route=/v1/videos/sync` | Video generation, sync (new in 0.20.0) — blocks server-side and returns raw MP4 bytes; deploy behind SageMaker async inference (first-request `torch.compile` warmup exceeds the 60s real-time invoke timeout) |
| `route=/v1/chat/completions` | Multimodal chat |
| *(no route)* | vLLM default `/invocations` (chat/completion/embed) |
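
The routing table can be expressed as a small client-side helper. This is a sketch under assumptions: `custom_attributes` and `invoke_kwargs` are hypothetical names, and the returned dictionary simply uses the documented parameter names of the boto3 `sagemaker-runtime` `invoke_endpoint` call; nothing here talks to AWS.

```python
import json
from typing import Any, Dict, Optional

# Routes taken from the table above.
KNOWN_ROUTES = frozenset({
    "/v1/audio/speech",
    "/v1/audio/generate",
    "/v1/images/generations",
    "/v1/videos",
    "/v1/videos/sync",
    "/v1/chat/completions",
})


def custom_attributes(route: Optional[str]) -> Optional[str]:
    """Return the CustomAttributes value for a route, or None for the default."""
    if route is None:
        return None  # falls through to vLLM's default /invocations handler
    if route not in KNOWN_ROUTES:
        raise ValueError(f"unknown route: {route}")
    return f"route={route}"


def invoke_kwargs(endpoint_name: str, payload: Dict[str, Any],
                  route: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for a real-time invoke_endpoint call."""
    kwargs: Dict[str, Any] = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
    }
    attrs = custom_attributes(route)
    if attrs is not None:
        kwargs["CustomAttributes"] = attrs
    return kwargs
```

A caller would then run something like `boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_kwargs(...))`. Validating the route on the client catches typos before they reach the middleware.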

@@ -153,7 +227,7 @@ Any `SM_VLLM_*` env var is converted to a `--<name>` CLI argument (e.g., `SM_VLL
--8<-- "examples/vllm-omni/sagemaker/deploy_tts.py"
```

GPU deploys require `inference_ami_version` — the default SageMaker host AMI has incompatible NVIDIA drivers for CUDA 12.9 images. See
GPU deploys require `inference_ami_version` — the default SageMaker host AMI has incompatible NVIDIA drivers for CUDA 13.0 images. See
[ProductionVariant API reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html) for valid values.

When done, delete the endpoint:
@@ -167,8 +241,6 @@ predictor.delete_endpoint()
SageMaker real-time inference has a 60-second timeout. First requests to TTS models may exceed this due to `torch.compile` warmup (~67s); async
inference avoids the limit, as does retrying after warmup completes.

!!! warning "Video generation is not supported on SageMaker in 0.18.0 — see [Known Limitations](#known-limitations) below. Use EC2 for video."

```python
--8<-- "examples/vllm-omni/sagemaker/deploy_tts_async.py"
```
@@ -177,15 +249,39 @@ For async inference, upload the JSON input payload to S3 first, then call `invok
`CustomAttributes="route=/v1/audio/speech"`. The resulting `.out` object in the configured S3 output path is the raw WAV audio — no polling or
additional retrieval step required.
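
The async invocation step might look like the sketch below. It is not taken from `deploy_tts_async.py`: the helper names are hypothetical, the dictionary keys are the documented parameters of the boto3 `sagemaker-runtime` `invoke_endpoint_async` call, and the 3600-second cap reflects the one-hour async limit.

```python
from typing import Any, Dict, Tuple

MAX_INVOCATION_TIMEOUT_SECONDS = 3600  # async inference allows up to 1 hour


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both bucket and key: {uri}")
    return bucket, key


def async_invoke_kwargs(endpoint_name: str, input_s3_uri: str, route: str,
                        timeout_seconds: int = MAX_INVOCATION_TIMEOUT_SECONDS
                        ) -> Dict[str, Any]:
    """Build keyword arguments for invoke_endpoint_async (no AWS call made)."""
    split_s3_uri(input_s3_uri)  # validate early, before the request is queued
    if not 1 <= timeout_seconds <= MAX_INVOCATION_TIMEOUT_SECONDS:
        raise ValueError("timeout_seconds must be between 1 and 3600")
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # the JSON payload uploaded beforehand
        "ContentType": "application/json",
        "CustomAttributes": f"route={route}",
        "InvocationTimeoutSeconds": timeout_seconds,
    }
```

The response of `invoke_endpoint_async` carries an `OutputLocation` S3 URI; passing it through `split_s3_uri` gives the bucket and key needed to download the `.out` object once it appears.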

### Deploy a Video Endpoint

The `/v1/videos/sync` endpoint (new in 0.20.0) is the supported path for video on SageMaker. Unlike the async `/v1/videos` route — which writes a
job-ID JSON to S3 but never the MP4 — `/v1/videos/sync` blocks server-side until generation completes and writes the raw MP4 bytes to the configured
S3 output path.

Deploy behind **SageMaker async inference** (`AsyncInferenceConfig`), not real-time inference: first-request latency on video models is dominated by
model load + `torch.compile` warmup (3–4 minutes for Wan2.1-VACE-1.3B), which exceeds the 60-second real-time invoke timeout. Async inference allows
up to 1 hour and writes the response body verbatim to S3, so the `.out` object *is* the MP4 — no polling on a job ID.

```python
--8<-- "examples/vllm-omni/sagemaker/deploy_video_sync.py"
```

Validated 2026-05-11 on `ml.g5.2xlarge` (A10G 24 GB VRAM, 32 GB host RAM): the endpoint returned a 45 KB MP4 in ~10s after warmup. If warm
requests run long, reduce `num_inference_steps` and `num_frames` to keep them under the async inference time limit.

## Known Limitations

- **Video generation is not supported on SageMaker in 0.18.0.** The `/v1/videos` endpoint is async by design — it returns a job-ID JSON immediately
and generates the MP4 in the background. Through SageMaker async inference, only that job-ID JSON is written to S3; the MP4 itself never lands in S3
and cannot be retrieved through `invoke_endpoint` or `invoke_endpoint_async`. Use EC2 for video generation — direct container access supports the
full workflow (create job, poll status, download MP4). SageMaker support is expected once `POST /v1/videos/sync` (which blocks and returns raw MP4
bytes) is available in a future vllm-omni release.
- **First-request latency on SageMaker real-time endpoints.** TTS models can exceed the 60s invoke timeout on the first request due to `torch.compile`
warmup. Use async inference or retry after warmup.
- **`/v1/videos` (async) on SageMaker writes only the job-ID JSON to S3, not the MP4.** This is unchanged from 0.18.0 — the async route generates the
MP4 in the background and the bytes never land in S3. Use the new `/v1/videos/sync` route on SageMaker (see
[Deploy a Video Endpoint](#deploy-a-video-endpoint)) or stay on EC2 for the async workflow with status polling.
- **First-request latency on SageMaker real-time endpoints.** TTS, audio-generate, and video models can exceed the 60s invoke timeout on the first
request due to `torch.compile` warmup. Use async inference or retry after warmup.
- **Voice-clone TTS (Qwen3-TTS-Base) is slower in 0.20.0 than 0.18.0 due to an upstream Code2Wav decode-chunk un-batching regression**
([vllm-omni#3203](https://github.com/vllm-project/vllm-omni/pull/3203)). Observed on `g6.xlarge` with `qwen3-tts-12hz-1.7b-base`, concurrency 4, 20
prompts: requests/s **0.4 → 0.281**, audio RTF multiplier **1.6 → 1.109**, p95 E2E **11s → 15.9s**. TTS quality is unchanged. The fix is merged
upstream as [vllm-omni#3485](https://github.com/vllm-project/vllm-omni/pull/3485) post-0.20.0 and will land in the next omni point release.
Preset-voice TTS (Qwen3-TTS-CustomVoice) is unaffected.
- **CosyVoice3 requires `--trust-remote-code` and ~32 GB host RAM during model load.** On a 16 GB host, the process can be SIGKILLed for running
out of memory during HuggingFace cache hydration. Prefer `g6e.xlarge` or larger on both EC2 and SageMaker.
- **Stable-Audio-Open output is capped at ~47 seconds per request** by the model itself. For longer clips, run multiple requests with adjusted
`audio_start` and concatenate client-side.
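
The client-side concatenation mentioned in the last item can be sketched with the standard-library `wave` module. This is a minimal sketch under assumptions: `concat_wav` is a hypothetical helper, it assumes every clip is a PCM WAV with identical channel count, sample width, and sample rate, and it does not cover how `audio_start` should be chosen for each request.

```python
import io
import wave
from typing import Sequence


def concat_wav(clips: Sequence[bytes]) -> bytes:
    """Concatenate WAV clips (raw file bytes) into one WAV file.

    Raises ValueError if no clips are given or if the clips do not share
    the same channel count, sample width, and sample rate.
    """
    if not clips:
        raise ValueError("need at least one clip")
    out = io.BytesIO()
    expected = None
    with wave.open(out, "wb") as writer:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(),
                       reader.getframerate())
                if expected is None:
                    expected = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"format mismatch: {fmt} != {expected}")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

The `wave` module reads PCM data only, so if the endpoint returns another WAV encoding (for example floating-point samples) a different audio library would be needed. A plain join like this also makes no attempt to smooth the seams between clips.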

## Release Notes
