[Docs Update] vLLM-Omni 0.20.0 by Yadan-Wei · Pull Request #6088 · aws/deep-learning-containers

Yadan-Wei · 2026-05-12T22:17:04Z

Summary

Public documentation for the upcoming vLLM-Omni 0.20.0 release. Follows the same pattern as PR #6007 (initial 0.18.0 docs) and the recent vLLM 0.20.x docs PRs (#6070-#6072).

- New version data files under docs/src/data/vllm-omni/0.20.0-gpu-{ec2,sagemaker}.yml, which auto-feed available_images.md and the per-version release-notes pages.
- 0.18.0 yamls re-pinned to immutable omni-cuda-v1.0 / omni-sagemaker-cuda-v1.0 tags; omni-cuda-v1 / omni-sagemaker-cuda-v1 continue to float to the latest in-line release (0.20.0 today). Both v1.0 tags already exist in 763... and public.ecr.aws.
- docs/vllm-omni/index.md refreshed for the 0.20.0 release: new May 12 announcement, CUDA 13.0 / PyTorch 2.11.0 references, expanded Supported Modalities table, new Audio Generation and Sync Video sections, updated SageMaker routing-middleware table, refreshed Known Limitations.
- docs/index.md What's New: vLLM-Omni v0.20.0 (2026-05-13) and v0.18.0 (2026-04-24) Release Highlights entries.
- Three new endpoint examples wired into index.md via --8<-- includes:
  - examples/vllm-omni/audio-generate/run.sh: stable-audio-open on EC2
  - examples/vllm-omni/video-sync/run.sh: sync video on EC2
  - examples/vllm-omni/sagemaker/deploy_video_sync.py: sync video on SageMaker (replaces the prior "video not supported on SageMaker" caveat)

Pinned versions match upstream vllm v0.20.0 requirements/cuda.txt: PyTorch 2.11.0, torchvision 0.26.0, torchaudio 2.11.0, flashinfer 0.6.8.post1. CUDA 13.0.2.

Known Limitations

User-facing limitations only — internal benchmark-tooling concerns (e.g., usage.completion_tokens=0 accounting, which only matters for clients parsing the omni-chat SSE usage block) are intentionally not listed here.

- /v1/videos async on SageMaker: only writes job-ID JSON to S3, not MP4. Use /v1/videos/sync (new in 0.20.0) instead.
- First-request torch.compile warmup can exceed the SageMaker 60s real-time invoke timeout for TTS / audio-generate / video.
- Voice-clone TTS Code2Wav un-batching regression (vllm-omni#3203): qwen3-tts-12hz-1.7b-base rps 0.4 → 0.281 on g6.xlarge; fix merged as vllm-omni#3485 post-0.20.0. This is the same regression that drove the threshold loosening in #6079 (feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b).
- CosyVoice3 host-RAM requirement: the --trust-remote-code load needs ~32 GB host RAM; on 16 GB hosts the process is SIGKILLed during HF cache hydration.
- Stable-Audio-Open ~47s output cap per request (model limitation).

Test plan

- python docs/src/main.py --verbose: generation completes; new pages emitted at docs/releasenotes/vllm-omni/vllm-omni-0.20.0-gpu-{default,sagemaker}.md; docs/reference/available_images.md shows 0.20.0 (omni-cuda-v1) and 0.18.0 (omni-cuda-v1.0) rows.
- mkdocs serve: /deep-learning-containers/vllm-omni/ returns HTTP 200; both 0.18.0 and 0.20.0 release-notes pages render; the three new --8<-- example includes resolve cleanly.
- pre-commit run passes (ruff, ruff-format, flowmark, signoff, etc.).
- Rebased onto latest main, after the merge of #6079 (feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b).
- Manual end-to-end verification of the new examples deferred until the 0.20.0 image is published to public ECR.

…imitations

New version data files (auto-feed available_images.md and release-notes pages)
------------------------------------------------------------------------------
docs/src/data/vllm-omni/0.20.0-gpu-ec2.yml
docs/src/data/vllm-omni/0.20.0-gpu-sagemaker.yml

Pinned package versions match upstream vllm v0.20.0 requirements/cuda.txt: PyTorch 2.11.0, torchvision 0.26.0, torchaudio 2.11.0, flashinfer 0.6.8.post1, CUDA 13.0.2.

Same omni-cuda-v1 / omni-sagemaker-cuda-v1 tags are reused for the new image (both v1 tags now point at 0.20.0).

docs/vllm-omni/index.md
-----------------------
- May 12, 2026 announcement covering the 0.20.0 alignment, CUDA 13.0 bump, new /v1/audio/generate and /v1/videos/sync endpoints, and the four new supported models (CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B, Stable-Audio-Open-1.0).
- Header CUDA reference 12.9 -> 13.0.
- Supported Modalities table grows two rows (Audio Generation, Video sync) and the example-model lists are expanded for TTS / image / video.
- New EC2 sections: Audio Generation (stable-audio-open) and Video sync.
- SageMaker routing-middleware table: adds /v1/audio/generate and /v1/videos/sync rows; the existing async /v1/videos row now points at the sync route as the recommended SageMaker path.
- New SageMaker section: Deploy a Video Endpoint (sync), which replaces the previous "video not supported on SageMaker" warning since that was the exact gap /v1/videos/sync closes.
- Known Limitations refreshed: drops the SageMaker-video-not-supported item, keeps torch.compile warmup, adds usage.completion_tokens=0 caveat for omni-chat, CosyVoice3 host-RAM requirement, and stable-audio-open's ~47s per-request cap.

New endpoint examples
---------------------
examples/vllm-omni/audio-generate/run.sh: stable-audio-open EC2
examples/vllm-omni/video-sync/run.sh: sync video EC2
examples/vllm-omni/sagemaker/deploy_video_sync.py: sync video on SageMaker

All three follow the existing examples' shape (single-shot docker run, health check, single curl/invoke, exit) so the index.md --8<-- includes work without further changes.

Auto-generated release notes (docs/releasenotes/vllm-omni/0.20.0-*.md) and the available_images.md table row are emitted by docs/src/main.py from the YAMLs above; both are gitignored.

Verified locally with `python docs/src/main.py && mkdocs serve`:
- /deep-learning-containers/vllm-omni/ (HTTP 200)
- /deep-learning-containers/releasenotes/vllm-omni/0.20.0-* (rendered)
- /deep-learning-containers/reference/available_images/ (0.20.0 row above 0.18.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…mitations

Adds a Known Limitations entry documenting the upstream Code2Wav decode-chunk un-batching regression in vllm-omni#3203 that ships in 0.20.0 and slows voice-clone TTS (Qwen3-TTS-Base). Observed on g6.xlarge:

  rps        0.4 -> 0.281
  audio rtf  1.6 -> 1.109
  p95 e2e    11s -> 15.9s

Quality is unchanged. Preset-voice TTS (Qwen3-TTS-CustomVoice) is unaffected. The fix is already merged upstream as vllm-omni#3485 (post-0.20.0) and will land in the next omni point release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
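To put the three before/after figures on a common scale, the following is a minimal sketch that computes the relative change for each metric. The numbers are the ones quoted in the commit message; the helper name `relative_change` is illustrative and not part of the PR.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted above for voice-clone TTS (Qwen3-TTS-Base) on g6.xlarge.
observed = {
    "rps": (0.4, 0.281),
    "audio rtf": (1.6, 1.109),
    "p95 e2e (s)": (11.0, 15.9),
}

for metric, (before, after) in observed.items():
    change = relative_change(before, after)
    print(f"{metric}: {before} -> {after} ({change:+.1f}%)")
```

By this measure, throughput falls by roughly 30%, audio rtf by roughly 31%, and p95 end-to-end latency rises by roughly 45%.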

…on 0.20.0

The `omni-cuda-v1` and `omni-sagemaker-cuda-v1` tags are now reused for 0.20.0 (per the image config files in main). Switch the 0.18.0 docs to the immutable `omni-cuda-v1.0` / `omni-sagemaker-cuda-v1.0` tags so users who want to reproduce the 0.18.0 image have a frozen URI; `v1` continues to float to the latest release in the v1 line (0.20.0 today). The v1.0 tags already exist in both 763... and public.ecr.aws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

@sha256

…hat's New entries

Tag versioning (DLC-level)
--------------------------
Document the two-tier tag convention so customers can choose the right trade-off between freshness and stability:
- omni-cuda-v1 / omni-sagemaker-cuda-v1: float across DLC minor + patch (auto-upgrade on docker pull). Best for dev, quick-starts.
- omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1: float across DLC patches only in the v1.1 line (auto-accept security fixes, decline new minor releases). Recommended for production.
- @sha256:<digest>, an escape hatch for byte-identical reproducibility.

The semantic versioning tier is at the DLC level (v1, v1.1, v1.1.x), not the bundled vllm-omni upstream version (which can advance independently of DLC patches). Customers pinned to v1.1 would have been insulated from the Code2Wav un-batching regression that landed with the DLC v1.1 minor bump until they were ready to evaluate it.

Reflected in:
- docs/src/data/vllm-omni/0.20.0-gpu-{ec2,sagemaker}.yml: list both v1 and v1.1 tags with comments explaining the floating semantics
- docs/vllm-omni/index.md: new Versioning and Tags section + expanded Pull Commands showing both tiers + digest pin

Sync-video SageMaker example fix
--------------------------------
The previous example used real-time invoke_endpoint, which has a hard 60-second timeout. First-request latency on Wan2.1-VACE-1.3B includes model load + torch.compile warmup (3-4 min), so the example would always fail on first invoke.

Rewrote to mirror the pattern proven by test_vllm_omni_video_async_endpoint (last green 2026-05-11):
- AsyncInferenceConfig with output_path + max_concurrent_invocations=1
- s3.put_object to upload the request payload
- invoke_endpoint_async with InputLocation + CustomAttributes
- Poll the .out object for raw MP4 bytes
- Form-data values as strings (the middleware converts JSON to multipart/form-data; numeric values must be JSON strings)
- Wan2.1-VACE-1.3B-diffusers + ml.g5.2xlarge (validated combination)

End-to-end validated 2026-05-13 in account 897880167187: endpoint deployed, async invoke succeeded, 45 KB MP4 returned with Content-Type video/mp4 (valid ISO Media MP4 header), endpoint cleaned up after.

docs/vllm-omni/index.md prose updated to recommend async inference as the default for video on SageMaker (it's required, not optional, given the warmup time).

What's New entries
------------------
README.md (which generates docs/index.md): two new vLLM-Omni entries under Release Highlights:
- 2026/05/13 vLLM-Omni v0.20.0
- 2026/04/24 vLLM-Omni v0.18.0 (initial release)

Both reference the floating tag (omni-cuda-v1 / omni-sagemaker-cuda-v1) and v1.0 for the 0.18.0 entry.

Removed
-------
The "usage.completion_tokens=0 for omni-chat models" Known Limitations item: internal benchmark-tooling concern, not user-facing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
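The upload, invoke, and poll steps listed in this commit message can be sketched as follows. This is not the code from deploy_video_sync.py, which is not shown on this page; it is a minimal illustration that assumes `s3` and `runtime` are boto3 clients for "s3" and "sagemaker-runtime". The helper names, the `async-input/` key layout, and the timeout defaults are hypothetical, and the `custom_attributes` value is left to the caller because the routing string the middleware expects is not given here. Endpoint creation with AsyncInferenceConfig is omitted.

```python
import json
import time
import uuid


def stringify_form_fields(fields: dict) -> dict:
    """Convert every value to a string, since the middleware turns the
    JSON body into multipart/form-data and numeric values must be strings."""
    return {key: str(value) for key, value in fields.items()}


def submit_video_request(s3, runtime, *, endpoint_name, bucket, fields,
                         custom_attributes):
    """Upload the request payload to S3 and start an async invocation.

    Returns the S3 URI where SageMaker will write the result object.
    """
    key = f"async-input/{uuid.uuid4()}.json"  # hypothetical key layout
    body = json.dumps(stringify_form_fields(fields)).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  ContentType="application/json")
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/json",
        CustomAttributes=custom_attributes,  # caller-supplied routing string
    )
    return response["OutputLocation"]


def wait_for_output(s3, output_location, *, timeout_s=900.0, interval_s=10.0):
    """Poll the output object until it exists, then return its raw bytes."""
    bucket, _, key = output_location.removeprefix("s3://").partition("/")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            time.sleep(interval_s)  # result not written yet
    raise TimeoutError(f"no output at {output_location} after {timeout_s}s")
```

Passing the clients in as arguments keeps the sketch free of credentials and region handling. A fuller version would also watch the `FailureLocation` returned by `invoke_endpoint_async`, since a failed invocation never produces the output object and this loop would only end at the timeout.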

…h-floating)

Reorder the 0.20.0 yamls so omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1 come first; the docs generator uses tags[0] for the per-row table cell in available_images.md.

Before: 0.18.0 row showed `omni-cuda-v1.0`, 0.20.0 row showed `omni-cuda-v1`, which was inconsistent (one patch-floating, one minor-floating).

After: both rows show their patch-floating tag, which uniquely identifies the release line and won't drift when the minor-floating v1 advances to a future image.

Also bumps the README.md "What's New" entry for v0.20.0 to reference omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1 for the same durability reason.

Release-notes pages still print all four URIs (private + public ECR × both tag tiers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…2 macro

The previous commit (25b7b22) reordered tags to put `omni-cuda-v1.1` first, intending to fix a perceived asymmetry in available_images.md (0.18.0 → v1.0, 0.20.0 → v1). That broke the Pull Commands section: both pull commands ended up pointing at v1.1, defeating the two-tier story.

Root cause: docs/src/macros.py uses `latest.display_tag` (which returns `tags[0]`) to render `{{ images.latest_vllm_omni_ec2 }}`. That macro is the "latest supported" pull command in docs/vllm-omni/index.md.

The original asymmetry was actually the convention working as intended:
- 0.20.0 is the current floating-v1 release, so its yaml lists v1 first
- 0.18.0 is no longer the floating-v1 target, so its yaml only lists v1.0

The maintenance pattern when a new release ships: remove v1 from the *previous* release's yaml. The 0.18.0 yaml already reflects this.

Restore tags[0] = "omni-cuda-v1" on the 0.20.0 yamls and the README What's New entry; add a comment in each yaml documenting the convention so the next maintainer doesn't make the same mistake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
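The root cause described in this commit message comes down to tag order deciding which tag is displayed. The sketch below shows that effect in isolation. `ImageRelease` is a hypothetical stand-in for a per-release yaml entry, not the class used in docs/src/macros.py; only the "`display_tag` returns `tags[0]`" behaviour is taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRelease:
    """Hypothetical stand-in for one per-release yaml entry."""

    version: str
    tags: list[str] = field(default_factory=list)

    @property
    def display_tag(self) -> str:
        # Mirrors the behaviour described above: the first listed tag wins.
        return self.tags[0]


# Original order: the minor-floating tag is listed first.
original = ImageRelease("0.20.0", ["omni-cuda-v1", "omni-cuda-v1.1"])
# Order after the reordering commit: the patch-floating tag is listed first.
reordered = ImageRelease("0.20.0", ["omni-cuda-v1.1", "omni-cuda-v1"])

print(original.display_tag)   # omni-cuda-v1
print(reordered.display_tag)  # omni-cuda-v1.1
```

Because a single field feeds both the table cell and the "latest supported" pull command, changing the order to fix one changes the other as well.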

Stop listing `omni-cuda-v1` / `omni-sagemaker-cuda-v1` in per-release docs/src/data/vllm-omni/0.20.0-gpu-{ec2,sagemaker}.yml. Each release's yaml now lists only its patch-floating tag (`v1.1` for 0.20.0; `v1.0` already this way in 0.18.0).

The minor-floating `v1` tag is still documented prominently in docs/vllm-omni/index.md (Pull Commands "latest supported" + Versioning and Tags section), but it isn't a per-release identifier; it points at whichever release is currently the v1-line target. Hardcoding the v1 pull URL in index.md (instead of using the `{{ images.latest_vllm_omni_* }}` macro that reads `tags[0]`) makes the prose source-of-truth for the floater, decoupled from per-release yaml metadata.

Why this is better:
- available_images.md table is now self-consistent: every row shows the release's patch-floating tag, no asymmetry between current and previous releases.
- Self-correcting: when a future DLC release ships, no edits to the 0.20.0 yaml are required to remove `v1` (since it was never there). Today's convention required "drop v1 from old yaml on next release", easy to forget.
- Decoupled concerns: yamls own per-release metadata, prose owns the floating-tag story.

Verified locally with `python docs/src/main.py && mkdocs serve`:
- reference/available_images table: 0.20.0 → v1.1, 0.18.0 → v1.0
- releasenotes/vllm-omni-0.20.0-*: only v1.1 URIs (no longer v1)
- vllm-omni/index.md Pull Commands: both v1 (latest) and v1.1 (patch-stable) tags shown for EC2 + SageMaker
- vllm-omni/index.md Versioning section table: unchanged

README.md What's New entry for 0.20.0: bumped from v1 to v1.1 to match the 0.18.0 entry's pattern (per-release rows always show the patch-floating, durable identifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Replace the bespoke two-section "Pull Commands + Versioning and Tags" prose with the cleaner four-tier convention used by the public vLLM-server docs. Maps directly to the suffix structure customers already see in vllm-server pull commands.

Pull Commands now show only the bare base tags (omni-cuda / omni-sagemaker-cuda), i.e. "give me whatever ships". The Pin a Version section enumerates the four tiers in one table:

| Suffix                    | Example          | Updates when                                |
|---------------------------|------------------|---------------------------------------------|
| (none)                    | omni-cuda        | Any release, including breaking changes     |
| -v<MAJOR>                 | omni-cuda-v1     | New features and fixes, no breaking changes |
| -v<MAJOR>.<MINOR>         | omni-cuda-v1.1   | Security patches and bug fixes only         |
| -v<MAJOR>.<MINOR>.<PATCH> | omni-cuda-v1.1.0 | Never (immutable snapshot)                  |

Production recommendation (pin to -v<MAJOR>.<MINOR>) calls out the Code2Wav un-batching regression as the concrete example of why patch-stable insulates production from feature-release surprises.

Switches both Pull Commands URIs from the private 763... ECR to public.ecr.aws/deep-learning-containers/vllm to match the vLLM-server docs convention (private ECR is in the per-region table on available_images.md).

Removes the now-obsolete tag-history table; Pin a Version handles the same information through suffix semantics.

Verified locally with `python docs/src/main.py && mkdocs serve`:
- Pull Commands: bare omni-cuda and omni-sagemaker-cuda URIs
- Pin a Version: 4-row suffix table with examples + update semantics
- Section order: Latest Announcements -> Pull Commands -> Pin a Version -> Packages -> Supported Modalities -> ...

Existing example scripts (deploy_tts.py, deploy_tts_async.py, deploy_video_sync.py) keep their -v1 URIs unchanged; examples document behavior validated at v1 and don't need to chase the latest tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
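The four suffix tiers in this commit message can be expressed as a small tag builder. This is an illustrative sketch only; the function name `pinned_tag` and its signature are assumptions and do not come from the docs generator.

```python
from typing import Optional


def pinned_tag(base: str, major: Optional[int] = None,
               minor: Optional[int] = None,
               patch: Optional[int] = None) -> str:
    """Build a tag for one of the four tiers: base, -vM, -vM.m, -vM.m.p."""
    levels = (major, minor, patch)
    given = [level is not None for level in levels]
    # A finer level only makes sense when every coarser level is also set.
    if any(later and not earlier for earlier, later in zip(given, given[1:])):
        raise ValueError("minor requires major, and patch requires minor")
    parts = [str(level) for level in levels if level is not None]
    return base if not parts else f"{base}-v{'.'.join(parts)}"


print(pinned_tag("omni-cuda"))           # omni-cuda
print(pinned_tag("omni-cuda", 1))        # omni-cuda-v1
print(pinned_tag("omni-cuda", 1, 1))     # omni-cuda-v1.1
print(pinned_tag("omni-cuda", 1, 1, 0))  # omni-cuda-v1.1.0
```

Under the production recommendation quoted above, a deployment would use the third form, `pinned_tag("omni-cuda", 1, 1)`.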

Pull Commands section now shows both registry options for each deployment target:
- Public ECR (anonymous pull): public.ecr.aws/deep-learning-containers/vllm
- Private DLC ECR (authenticated): 763104351884.dkr.ecr.<region>.amazonaws.com/vllm

Customers running on AWS infrastructure (EC2/EKS/SageMaker) typically prefer the private ECR for better network locality and IAM-controlled access; public ECR is the right path for local development or workloads outside AWS.

A short prologue paragraph explains the auth difference and links to Getting Started for credentials. Per-region URI table still lives in available_images.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
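As a rough illustration of the two registry options, the sketch below assembles a full image URI for each. The registry host, account ID, and `vllm` repository name are taken from the commit message; the function names and the example region are assumptions made for this sketch.

```python
PUBLIC_REGISTRY = "public.ecr.aws/deep-learning-containers"
PRIVATE_ACCOUNT_ID = "763104351884"
REPOSITORY = "vllm"


def public_image_uri(tag: str) -> str:
    """Image URI on public ECR (anonymous pull)."""
    return f"{PUBLIC_REGISTRY}/{REPOSITORY}:{tag}"


def private_image_uri(tag: str, region: str) -> str:
    """Image URI on the private DLC ECR (requires an authenticated pull)."""
    return f"{PRIVATE_ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{REPOSITORY}:{tag}"


# "us-west-2" is only an example region.
print(public_image_uri("omni-cuda-v1.1"))
print(private_image_uri("omni-sagemaker-cuda-v1.1", "us-west-2"))
```

The tag is the same in both registries; only the host portion changes, and the private host additionally depends on the region.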

… vllm)

All six EC2 example shell scripts hardcoded the legacy repo name `vllm-omni:omni-cuda-v1`, but the actual ECR repo for these images is `vllm` (post-#6007's repo unification, also reflected in the docs generator's `ecr_repository: vllm` field and the prod_image config `vllm:omni-cuda-v1`). Customers running these scripts as-is would have hit a "repo does not exist" error from `docker pull`.

Fix the IMAGE default in each script:
- examples/vllm-omni/audio-generate/run.sh
- examples/vllm-omni/image/run.sh
- examples/vllm-omni/qwen2.5-omni/run.sh
- examples/vllm-omni/tts/run.sh
- examples/vllm-omni/video-sync/run.sh
- examples/vllm-omni/video/run.sh

The three SageMaker python examples (deploy_tts.py, deploy_tts_async.py, deploy_video_sync.py) already used the correct `vllm:omni-sagemaker-cuda-v1` repo path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

aws-deep-learning-containers-ci Bot added the authorized label May 12, 2026

Yadan Wei added 3 commits May 13, 2026 11:00

Yadan-Wei force-pushed the omni-0.20.0-docs branch from 586b788 to 8121da0 Compare May 13, 2026 18:01

Yadan Wei added 7 commits May 13, 2026 13:47

Yadan-Wei enabled auto-merge (squash) May 13, 2026 21:54

junpuf approved these changes May 13, 2026

Yadan-Wei merged commit f2243f2 into main May 13, 2026
8 checks passed

junpuf deleted the omni-0.20.0-docs branch May 13, 2026 22:20

Yadan-Wei merged 10 commits into main from omni-0.20.0-docs
