
Add vllm Omni doc #6007

Merged
Yadan-Wei merged 11 commits into main from omni-doc on Apr 27, 2026

@Yadan-Wei Yadan-Wei commented Apr 27, 2026

Purpose

Add public documentation and runnable examples for the vLLM-Omni DLC (version 0.18.0) on EC2 and SageMaker. Covers TTS, image generation, video generation, and multimodal chat (Qwen2.5-Omni).

Also fixes image URIs in the generator: vLLM-Omni images live in the vllm ECR repo with tags omni-cuda-v1 (EC2) and omni-sagemaker-cuda-v1 (SageMaker). Introduces an optional ecr_repository YAML field so the data-dir key can differ from the actual ECR repo name. The change is backward-compatible: frameworks that omit the field are unchanged.
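
The fallback described above can be sketched as follows. This is a minimal illustration, not the generator's actual code: the function names `resolve_ecr_repository` and `image_reference` are hypothetical, and the config is assumed to be the already-parsed YAML as a plain dict.

```python
def resolve_ecr_repository(framework_key: str, config: dict) -> str:
    """Return the ECR repo name, falling back to the data-dir key.

    Hypothetical helper: frameworks that do not set ecr_repository keep
    using their data-dir key, which is what makes the field backward-compatible.
    """
    return config.get("ecr_repository") or framework_key


def image_reference(framework_key: str, config: dict, tag: str) -> str:
    """Build the repo:tag part of an image URI."""
    return f"{resolve_ecr_repository(framework_key, config)}:{tag}"


# vllm-omni data lives under the 'vllm-omni' key but points at the 'vllm' repo.
print(image_reference("vllm-omni", {"ecr_repository": "vllm"}, "omni-cuda-v1"))
# A framework without the field resolves to its own key.
print(image_reference("pytorch", {}, "example-tag"))
```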

Test Plan

  1. cd docs && python src/main.py — confirm reference/available_images.md has a vLLM-Omni section pointing at gallery.ecr.aws/deep-learning-containers/vllm with tags omni-cuda-v1 and omni-sagemaker-cuda-v1.
  2. Confirm releasenotes/vllm-omni/vllm-omni-0.18.0-gpu-{default,sagemaker}.md show vllm:omni-cuda-v1 / vllm:omni-sagemaker-cuda-v1.
  3. Verify EC2 deployment works end-to-end for TTS, image, video, and Qwen2.5-Omni examples.
  4. Verify SageMaker deployment works end-to-end for TTS; confirm SageMaker async video matches the documented limitation (job-ID JSON only, no MP4).

Test Result

EC2 examples verified end-to-end in a prior session (TTS, image, video, Qwen2.5-Omni).

SageMaker verified in account 897880167187 (us-west-2):

| Scenario | Instance | Result |
|---|---|---|
| Real-time TTS (deploy_tts.py) | ml.g5.xlarge | Deploy 664s; first invoke 71.2s (torch.compile warmup); response 176684 bytes, audio/wav, valid 24kHz 16-bit mono |
| Async video (informational) | ml.g5.12xlarge (TP=2) | Deploy 606s; invoke returns 202; S3 output is 420-byte JSON with status: queued. This confirms the known limitation: the MP4 is not retrievable via SageMaker async in 0.18.0. Use EC2 for video. |
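
One way to reproduce the "valid 24kHz 16-bit mono" check on a TTS response is with Python's standard `wave` module. The sketch below is an assumption about how such a check could be done, not the script used for this PR; `describe_wav` and `is_24khz_16bit_mono` are hypothetical names, and the self-check runs on a synthetic silent clip rather than on real endpoint output.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the WAV header from raw bytes and report the basic format."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_24khz_16bit_mono(data: bytes) -> bool:
    """True when the audio matches the format reported in the table above."""
    info = describe_wav(data)
    return (info["sample_rate_hz"], info["bits_per_sample"], info["channels"]) == (24000, 16, 1)


# Self-check on one second of synthetic silence (not real endpoint output).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav.setframerate(24000)   # 24 kHz
    wav.writeframes(b"\x00\x00" * 24000)
print(is_24khz_16bit_mono(buffer.getvalue()))
```

In practice the bytes passed to `is_24khz_16bit_mono` would be the body returned by the endpoint.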

Endpoints were deleted after verification.

pre-commit run passes (ruff, gitleaks, flowmark, typos, shfmt, signoff).

PR Checklist

  • I ran pre-commit run --all-files locally before creating this PR.

Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
Toggle if you are merging into main Branch

PR Checklist

  • [] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details).

Yadan Wei added 9 commits April 26, 2026 21:46
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
- Align version labeling with Ray convention: YAML 'version' now reflects
  the embedded framework version (0.18.0) instead of a DLC release number.
- Add optional 'ecr_repository' field so the data-dir key can differ from
  the actual ECR repo name. vllm-omni images live under the 'vllm' repo,
  not 'vllm-omni'.
- Fix SageMaker image tag: 'omni-sagemaker-cuda-v1' (verified against
  763104351884.dkr.ecr.us-west-2.amazonaws.com/vllm), not the previous
  'omni-cuda-sagemaker-v1'.
- Rewrite the SageMaker async example to deploy TTS (works end-to-end)
  instead of video. The /v1/videos endpoint in 0.18.0 returns a job-ID
  JSON, which is what SageMaker async writes to S3; the MP4 itself is
  never written to S3 and cannot be retrieved via SageMaker in 0.18.0.
- Clarify Known Limitations: video generation is not supported on
  SageMaker in 0.18.0 (use EC2 for the full video workflow).
- Minor fix to EC2 video example (tensor-parallel-size 2, bumped steps,
  status value 'completed').

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei and others added 2 commits April 27, 2026 15:41
The vllm-omni package on PyPI is named with a hyphen (pip install vllm-omni),
not an underscore. Align the YAML package key with the PyPI project name and
drop the redundant underscore display_names entry in global.yml.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei merged commit 54b9b51 into main Apr 27, 2026
8 checks passed
@Yadan-Wei Yadan-Wei deleted the omni-doc branch April 29, 2026 21:46
@Yadan-Wei Yadan-Wei mentioned this pull request May 12, 2026
5 tasks
Yadan-Wei pushed a commit that referenced this pull request May 13, 2026
… vllm)

All six EC2 example shell scripts hardcoded the legacy repo name
`vllm-omni:omni-cuda-v1`, but the actual ECR repo for these images is
`vllm` (post-#6007's repo unification, also reflected in the docs
generator's `ecr_repository: vllm` field and the prod_image config
`vllm:omni-cuda-v1`).

Customers running these scripts as-is would have hit a "repo does not
exist" error from `docker pull`. Fix the IMAGE default in each script:

  examples/vllm-omni/audio-generate/run.sh
  examples/vllm-omni/image/run.sh
  examples/vllm-omni/qwen2.5-omni/run.sh
  examples/vllm-omni/tts/run.sh
  examples/vllm-omni/video-sync/run.sh
  examples/vllm-omni/video/run.sh

The three SageMaker python examples (deploy_tts.py, deploy_tts_async.py,
deploy_video_sync.py) already used the correct `vllm:omni-sagemaker-cuda-v1`
repo path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan-Wei added a commit that referenced this pull request May 13, 2026
* docs(vllm-omni): add 0.20.0 release notes, new endpoints, and known limitations

New version data files (auto-feed available_images.md and release-notes pages)
------------------------------------------------------------------------------
  docs/src/data/vllm-omni/0.20.0-gpu-ec2.yml
  docs/src/data/vllm-omni/0.20.0-gpu-sagemaker.yml

Pinned package versions match upstream vllm v0.20.0 requirements/cuda.txt:
PyTorch 2.11.0, torchvision 0.26.0, torchaudio 2.11.0, flashinfer 0.6.8.post1,
CUDA 13.0.2. Same omni-cuda-v1 / omni-sagemaker-cuda-v1 tags are reused for
the new image (both v1 tags now point at 0.20.0).

docs/vllm-omni/index.md
-----------------------
- May 12, 2026 announcement covering the 0.20.0 alignment, CUDA 13.0 bump,
  new /v1/audio/generate and /v1/videos/sync endpoints, and the four new
  supported models (CosyVoice3, ERNIE-Image-Turbo, Wan2.1-VACE-1.3B,
  Stable-Audio-Open-1.0).
- Header CUDA reference 12.9 -> 13.0.
- Supported Modalities table grows two rows (Audio Generation, Video sync)
  and the example-model lists are expanded for TTS / image / video.
- New EC2 sections: Audio Generation (stable-audio-open) and Video sync.
- SageMaker routing-middleware table: adds /v1/audio/generate and
  /v1/videos/sync rows; the existing async /v1/videos row now points at
  the sync route as the recommended SageMaker path.
- New SageMaker section: Deploy a Video Endpoint (sync) — replaces the
  previous "video not supported on SageMaker" warning since that was the
  exact gap /v1/videos/sync closes.
- Known Limitations refreshed: drops the SageMaker-video-not-supported
  item, keeps torch.compile warmup, adds usage.completion_tokens=0 caveat
  for omni-chat, CosyVoice3 host-RAM requirement, and stable-audio-open's
  ~47s per-request cap.

New endpoint examples
---------------------
  examples/vllm-omni/audio-generate/run.sh        — stable-audio-open EC2
  examples/vllm-omni/video-sync/run.sh            — sync video EC2
  examples/vllm-omni/sagemaker/deploy_video_sync.py — sync video on SageMaker

All three follow the existing examples' shape (single-shot docker run,
health check, single curl/invoke, exit) so the index.md --8<-- includes
work without further changes.

Auto-generated release notes (docs/releasenotes/vllm-omni/0.20.0-*.md)
and the available_images.md table row are emitted by docs/src/main.py
from the YAMLs above; both are gitignored.

Verified locally with `python docs/src/main.py && mkdocs serve`:
  /deep-learning-containers/vllm-omni/                       (HTTP 200)
  /deep-learning-containers/releasenotes/vllm-omni/0.20.0-*  (rendered)
  /deep-learning-containers/reference/available_images/      (0.20.0 row above 0.18.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): note Code2Wav un-batching TTS regression in known limitations

Adds a Known Limitations entry documenting the upstream Code2Wav decode-chunk
un-batching regression in vllm-omni#3203 that ships in 0.20.0 and slows
voice-clone TTS (Qwen3-TTS-Base). Observed on g6.xlarge:

  rps           0.4   -> 0.281
  audio rtf     1.6   -> 1.109
  p95 e2e       11s   -> 15.9s

Quality is unchanged. Preset-voice TTS (Qwen3-TTS-CustomVoice) is unaffected.
The fix is already merged upstream as vllm-omni#3485 (post-0.20.0) and will
land in the next omni point release.
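
For readers who want the regression as relative changes, the short sketch below does the arithmetic on the three g6.xlarge figures quoted above. It works out to roughly a 30% drop in rps and audio rtf and roughly a 45% rise in p95 latency. The helper name `percent_change` is illustrative only.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the commit message (g6.xlarge, voice-clone TTS).
observed = {
    "rps": (0.4, 0.281),
    "audio rtf": (1.6, 1.109),
    "p95 e2e (s)": (11.0, 15.9),
}

for name, (before, after) in observed.items():
    print(f"{name}: {percent_change(before, after):+.1f}%")
```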

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): pin 0.18.0 to immutable v1.0 tag, leave v1 floating on 0.20.0

The `omni-cuda-v1` and `omni-sagemaker-cuda-v1` tags are now reused for
0.20.0 (per the image config files in main). Switch the 0.18.0 docs to
the immutable `omni-cuda-v1.0` / `omni-sagemaker-cuda-v1.0` tags so users
who want to reproduce the 0.18.0 image have a frozen URI; `v1` continues
to float to the latest release in the v1 line (0.20.0 today).

The v1.0 tags already exist in both 763... and public.ecr.aws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): tag versioning convention, sync-video example fix, What's New entries

Tag versioning (DLC-level)
--------------------------
Document the two-tier tag convention so customers can choose the right
trade-off between freshness and stability:

- omni-cuda-v1 / omni-sagemaker-cuda-v1 — float across DLC minor + patch
  (auto-upgrade on docker pull). Best for dev, quick-starts.
- omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1 — float across DLC patches only
  in the v1.1 line (auto-accept security fixes, decline new minor releases).
  Recommended for production.
- @sha256:<digest> — escape hatch for byte-identical reproducibility.

The semantic versioning tier is at the DLC level (v1, v1.1, v1.1.x), not
the bundled vllm-omni upstream version (which can advance independently
of DLC patches). Customers pinned to v1.1 would have been insulated from
the Code2Wav un-batching regression that landed with the DLC v1.1 minor
bump until they were ready to evaluate it.

Reflected in:
- docs/src/data/vllm-omni/0.20.0-gpu-{ec2,sagemaker}.yml — list both v1 and
  v1.1 tags with comments explaining the floating semantics
- docs/vllm-omni/index.md — new Versioning and Tags section + expanded
  Pull Commands showing both tiers + digest pin

Sync-video SageMaker example fix
--------------------------------
The previous example used real-time invoke_endpoint, which has a hard
60-second timeout. First-request latency on Wan2.1-VACE-1.3B includes
model load + torch.compile warmup (3-4 min), so the example would always
fail on first invoke.

Rewrote to mirror the pattern proven by test_vllm_omni_video_async_endpoint
(last green 2026-05-11):
- AsyncInferenceConfig with output_path + max_concurrent_invocations=1
- s3.put_object to upload the request payload
- invoke_endpoint_async with InputLocation + CustomAttributes
- Poll the .out object for raw MP4 bytes
- Form-data values as strings (the middleware converts JSON to
  multipart/form-data; numeric values must be JSON strings)
- Wan2.1-VACE-1.3B-diffusers + ml.g5.2xlarge (validated combination)
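
The request flow in the list above (upload payload, async invoke, poll the .out object) can be sketched as below. This is a hedged outline, not the contents of deploy_video_sync.py: the helper names are hypothetical, the clients are passed in so the sketch does not assume any particular session setup, the `application/json` content type is an assumption based on the JSON-to-multipart note, and the `CustomAttributes` value is left to the caller because this PR does not show it.

```python
import json
import time
from urllib.parse import urlparse


def stringify_form_fields(fields: dict) -> dict:
    """Render every value as a string, since the middleware maps the JSON
    body onto multipart/form-data fields."""
    return {name: str(value) for name, value in fields.items()}


def submit_async_request(s3, runtime, *, endpoint_name, bucket, input_key,
                         fields, custom_attributes):
    """Upload the request payload to S3 and queue it on the async endpoint."""
    body = json.dumps(stringify_form_fields(fields)).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=input_key, Body=body,
                  ContentType="application/json")
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{input_key}",
        ContentType="application/json",      # assumption: JSON in, converted by the middleware
        CustomAttributes=custom_attributes,  # deployment-specific; value not shown in this PR
    )
    return response["OutputLocation"]


def wait_for_output(s3, output_location, timeout_s=900.0, interval_s=10.0):
    """Poll the .out object until it exists, then return its raw bytes."""
    parsed = urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no output at {output_location} after {timeout_s}s")
            time.sleep(interval_s)
```

With boto3, `s3` would be `boto3.client("s3")` and `runtime` would be `boto3.client("sagemaker-runtime")`. The sketch only waits for the success object; it does not inspect any failure output, so a failed job would surface here as a timeout.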

End-to-end validated 2026-05-13 in account 897880167187:
endpoint deployed, async invoke succeeded, 45 KB MP4 returned with
Content-Type video/mp4 (valid ISO Media MP4 header), endpoint cleaned
up after.

docs/vllm-omni/index.md prose updated to recommend async inference as
the default for video on SageMaker (it's required, not optional, given
the warmup time).

What's New entries
------------------
README.md (which generates docs/index.md): two new vLLM-Omni entries
under Release Highlights:
- 2026/05/13 vLLM-Omni v0.20.0
- 2026/04/24 vLLM-Omni v0.18.0 (initial release)

Both reference the floating tag (omni-cuda-v1 / omni-sagemaker-cuda-v1)
and v1.0 for the 0.18.0 entry.

Removed
-------
The "usage.completion_tokens=0 for omni-chat models" Known Limitations
item — internal benchmark-tooling concern, not user-facing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): show v1.1 in available_images table for 0.20.0 (patch-floating)

Reorder the 0.20.0 yamls so omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1
come first; the docs generator uses tags[0] for the per-row table cell
in available_images.md.

Before: 0.18.0 row showed `omni-cuda-v1.0`, 0.20.0 row showed
`omni-cuda-v1` — inconsistent (one patch-floating, one minor-floating).
After: both rows show their patch-floating tag, which uniquely identifies
the release line and won't drift when the minor-floating v1 advances to
a future image.

Also bumps the README.md "What's New" entry for v0.20.0 to reference
omni-cuda-v1.1 / omni-sagemaker-cuda-v1.1 for the same durability reason.
Release-notes pages still print all four URIs (private + public ECR ×
both tag tiers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): revert tag reorder — `tags[0]` feeds the latest_*_ec2 macro

The previous commit (25b7b22) reordered tags to put `omni-cuda-v1.1`
first, intending to fix a perceived asymmetry in available_images.md
(0.18.0 → v1.0, 0.20.0 → v1). That broke the Pull Commands section:
both pull commands ended up pointing at v1.1, defeating the two-tier
story.

Root cause: docs/src/macros.py uses `latest.display_tag` (which returns
`tags[0]`) to render `{{ images.latest_vllm_omni_ec2 }}`. That macro is
the "latest supported" pull command in docs/vllm-omni/index.md.
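
To make the root cause concrete, here is a toy model of the `tags[0]` dependency. It is a sketch under assumptions: `ImageRelease` is a hypothetical stand-in for whatever object docs/src/macros.py actually uses, and only the `display_tag` behavior described above is reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRelease:
    """Hypothetical stand-in for a per-release YAML entry."""
    version: str
    tags: list = field(default_factory=list)

    @property
    def display_tag(self) -> str:
        # The first tag in the YAML is the one the macro renders.
        return self.tags[0]


original = ImageRelease("0.20.0", ["omni-cuda-v1", "omni-cuda-v1.1"])
reordered = ImageRelease("0.20.0", ["omni-cuda-v1.1", "omni-cuda-v1"])

# Reordering the list silently changes what the "latest" pull command shows.
print(original.display_tag, reordered.display_tag)
```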

The original asymmetry was actually the convention working as intended:
- 0.20.0 is the current floating-v1 release, so its yaml lists v1 first
- 0.18.0 is no longer the floating-v1 target, so its yaml only lists v1.0

The maintenance pattern when a new release ships: remove v1 from the
*previous* release's yaml. The 0.18.0 yaml already reflects this.

Restore tags[0] = "omni-cuda-v1" on the 0.20.0 yamls and the README
What's New entry; add a comment in each yaml documenting the convention
so the next maintainer doesn't make the same mistake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): decouple per-release tags from minor-floating v1 tag

Stop listing `omni-cuda-v1` / `omni-sagemaker-cuda-v1` in per-release
docs/src/data/vllm-omni/0.20.0-gpu-{ec2,sagemaker}.yml. Each release's
yaml now lists only its patch-floating tag (`v1.1` for 0.20.0; `v1.0`
already this way in 0.18.0).

The minor-floating `v1` tag is still documented prominently in
docs/vllm-omni/index.md (Pull Commands "latest supported" + Versioning
and Tags section), but it isn't a per-release identifier — it points at
whichever release is currently the v1-line target. Hardcoding the v1
pull URL in index.md (instead of using the `{{ images.latest_vllm_omni_*
}}` macro that reads `tags[0]`) makes the prose source-of-truth for the
floater, decoupled from per-release yaml metadata.

Why this is better:
- available_images.md table is now self-consistent — every row shows the
  release's patch-floating tag, no asymmetry between current and previous
  releases.
- Self-correcting: when a future DLC release ships, no edits to the
  0.20.0 yaml are required to remove `v1` (since it was never there).
  Today's convention required "drop v1 from old yaml on next release",
  easy to forget.
- Decoupled concerns: yamls own per-release metadata, prose owns the
  floating-tag story.

Verified locally with `python docs/src/main.py && mkdocs serve`:
  - reference/available_images table: 0.20.0 → v1.1, 0.18.0 → v1.0
  - releasenotes/vllm-omni-0.20.0-*: only v1.1 URIs (no longer v1)
  - vllm-omni/index.md Pull Commands: both v1 (latest) and v1.1
    (patch-stable) tags shown for EC2 + SageMaker
  - vllm-omni/index.md Versioning section table: unchanged

README.md What's New entry for 0.20.0: bumped from v1 to v1.1 to match
the 0.18.0 entry's pattern (per-release rows always show the
patch-floating, durable identifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): adopt vLLM-server's four-tier Pin a Version table

Replace the bespoke two-section "Pull Commands + Versioning and Tags"
prose with the cleaner four-tier convention used by the public vLLM-server
docs. Maps directly to the suffix structure customers already see in
vllm-server pull commands.

Pull Commands now show only the bare base tags (omni-cuda /
omni-sagemaker-cuda) — "give me whatever ships". The Pin a Version
section enumerates the four tiers in one table:

  | Suffix                    | Example          | Updates when                                  |
  |---------------------------|------------------|-----------------------------------------------|
  | (none)                    | omni-cuda        | Any release, including breaking changes      |
  | -v<MAJOR>                 | omni-cuda-v1     | New features and fixes, no breaking changes  |
  | -v<MAJOR>.<MINOR>         | omni-cuda-v1.1   | Security patches and bug fixes only          |
  | -v<MAJOR>.<MINOR>.<PATCH> | omni-cuda-v1.1.0 | Never — immutable snapshot                   |
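
The suffix scheme in the table can be expressed as a small helper. This is an illustrative sketch only; `tag_for` is a hypothetical name and is not part of the docs generator or any DLC tooling.

```python
def tag_for(base: str, major=None, minor=None, patch=None) -> str:
    """Build a tag for one of the four tiers in the table above.

    Components must be given left to right: minor requires major,
    and patch requires minor.
    """
    parts = []
    gap_seen = False
    for value in (major, minor, patch):
        if value is None:
            gap_seen = True
        elif gap_seen:
            raise ValueError("version components must be contiguous from major")
        else:
            parts.append(str(value))
    if not parts:
        return base  # bare tag: follows any release
    return f"{base}-v{'.'.join(parts)}"


for args in [(), (1,), (1, 1), (1, 1, 0)]:
    print(tag_for("omni-cuda", *args))
```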

Production recommendation (pin to -v<MAJOR>.<MINOR>) calls out the
Code2Wav un-batching regression as the concrete example of why
patch-stable insulates production from feature-release surprises.

Switches both Pull Commands URIs from the private 763... ECR to
public.ecr.aws/deep-learning-containers/vllm to match the vLLM-server
docs convention (private ECR is in the per-region table on
available_images.md).

Removes the now-obsolete tag-history table — Pin a Version handles the
same information through suffix semantics.

Verified locally with `python docs/src/main.py && mkdocs serve`:
  - Pull Commands: bare omni-cuda and omni-sagemaker-cuda URIs
  - Pin a Version: 4-row suffix table with examples + update semantics
  - Section order: Latest Announcements -> Pull Commands -> Pin a Version
    -> Packages -> Supported Modalities -> ...

Existing example scripts (deploy_tts.py, deploy_tts_async.py,
deploy_video_sync.py) keep their -v1 URIs unchanged — examples document
behavior validated at v1 and don't need to chase the latest tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): add private ECR pull commands alongside public ECR

Pull Commands section now shows both registry options for each
deployment target:

- Public ECR (anonymous pull): public.ecr.aws/deep-learning-containers/vllm
- Private DLC ECR (authenticated): 763104351884.dkr.ecr.<region>.amazonaws.com/vllm

Customers running on AWS infrastructure (EC2/EKS/SageMaker) typically
prefer the private ECR for better network locality and IAM-controlled
access; public ECR is the right path for local development or workloads
outside AWS.
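
As a quick illustration of the two registry options, the sketch below assembles both URI shapes from the patterns listed above. The helper names are hypothetical, and the region and tag in the example are only sample values taken from elsewhere in this PR.

```python
PUBLIC_REGISTRY = "public.ecr.aws/deep-learning-containers"
PRIVATE_ACCOUNT_ID = "763104351884"


def public_image_uri(repo: str, tag: str) -> str:
    """Public ECR URI (anonymous pull)."""
    return f"{PUBLIC_REGISTRY}/{repo}:{tag}"


def private_image_uri(region: str, repo: str, tag: str) -> str:
    """Private DLC ECR URI (requires an authenticated docker login)."""
    return f"{PRIVATE_ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


print(public_image_uri("vllm", "omni-cuda-v1"))
print(private_image_uri("us-west-2", "vllm", "omni-sagemaker-cuda-v1"))
```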

A short prologue paragraph explains the auth difference and links to
Getting Started for credentials. Per-region URI table still lives in
available_images.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* fix(vllm-omni): correct ECR repo in EC2 example scripts (vllm-omni -> vllm)

All six EC2 example shell scripts hardcoded the legacy repo name
`vllm-omni:omni-cuda-v1`, but the actual ECR repo for these images is
`vllm` (post-#6007's repo unification, also reflected in the docs
generator's `ecr_repository: vllm` field and the prod_image config
`vllm:omni-cuda-v1`).

Customers running these scripts as-is would have hit a "repo does not
exist" error from `docker pull`. Fix the IMAGE default in each script:

  examples/vllm-omni/audio-generate/run.sh
  examples/vllm-omni/image/run.sh
  examples/vllm-omni/qwen2.5-omni/run.sh
  examples/vllm-omni/tts/run.sh
  examples/vllm-omni/video-sync/run.sh
  examples/vllm-omni/video/run.sh

The three SageMaker python examples (deploy_tts.py, deploy_tts_async.py,
deploy_video_sync.py) already used the correct `vllm:omni-sagemaker-cuda-v1`
repo path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

---------

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>