feat(vllm-omni): prepare 0.21.0rc1 release branch by Yadan-Wei · Pull Request #6110 · aws/deep-learning-containers

Yadan-Wei · 2026-05-18T23:26:40Z

Summary

Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release), tracking upstream vLLM v0.21.0 (the rebase target — see vllm-project/vllm-omni#3530). DLC tag bumps v1.2 → v1.3.

This is a preparation PR ahead of the official 0.21.0 release. Public docs and release notes are intentionally unchanged; they land in the follow-up PR once 0.21.0 ships final.

Upstream Dockerfile changes ported to the AL2023 fork

Cherry-picked from the upstream vllm v0.20.0 → v0.21.0 docker/Dockerfile diff:

- `libcublas-${CUDA_DASH}` → `libcublas-devel-${CUDA_DASH}` in the runtime stage so cublas headers are present for JIT (fastsafetensors, nccl_allocator).
- `flashinfer download-cubin` moved to a final `RUN` after the vllm wheel, EP kernels, and KV connectors finish installing. Earlier downloads cause ~2.5 GB of layer duplication when later pip installs overwrite flashinfer files.
- `nixl-cu${CUDA_MAJOR} --force-reinstall --no-deps` after the kv_connectors install (replacing the bare `nixl-cu13` install), so the matching `nixl_ep_cpp.so` is shipped.

Upstream changes intentionally skipped

| Upstream change | Reason |
| --- | --- |
| `BUILD_OS=manylinux` apt/dnf branching | We are dnf-only on AL2023; the existing dnf path covers us. |
| `nvidia-cutlass-dsl[cu13]` strip-shim for cu12 | We pin CUDA 13. |
| DeepGEMM multi-Python interpreter matrix | We build a single-Python image. |
| `examples/online_serving` → `examples/deployment` entrypoint move | We ship our own entrypoint scripts in `scripts/vllm/`. |
| New `VLLM_BUILD_*` and OCI labels | Not required for DLC. |

`vllm-omni` Dockerfile.cuda

Byte-identical between v0.20.0 and v0.21.0rc1 (same blob SHA 78f64f6a). The Dockerfile.ci only bumps VLLM_BASE_TAG=v0.21.0; we don't consume that file, so no port needed.

Pre-release handling

uv pip install adds --prerelease=allow because 0.21.0rc1 is a PEP 440 pre-release. Strip the flag when bumping to stable 0.21.0.
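To illustrate when the flag applies, here is a minimal sketch of pre-release detection. The regex is a deliberate simplification of PEP 440: it recognizes only the canonical `aN`, `bN`, and `rcN` suffixes and ignores epochs, post/dev releases, and alternate spellings. The helper names `is_prerelease` and `uv_install_flags` are hypothetical and are not part of uv or this repository.

```python
import re

# Simplified PEP 440 matcher: a release segment plus an optional aN / bN / rcN suffix.
# Assumption: canonical spellings only; epochs, .postN, and .devN are not handled.
_VERSION_RE = re.compile(r"^(?P<release>\d+(?:\.\d+)*)(?P<pre>(?:a|b|rc)\d+)?$")


def is_prerelease(version: str) -> bool:
    """Return True when the version carries an alpha, beta, or rc suffix."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return match.group("pre") is not None


def uv_install_flags(version: str) -> list[str]:
    """Hypothetical helper: extra `uv pip install` flags for a pinned version."""
    return ["--prerelease=allow"] if is_prerelease(version) else []


rc_flags = uv_install_flags("0.21.0rc1")    # pre-release: flag is needed
stable_flags = uv_install_flags("0.21.0")   # stable: flag can be stripped
```

Deriving the flag from the version string, rather than hard-coding it, is one way to make the "strip the flag on stable" step automatic; whether that is worth the extra indirection for a single Dockerfile line is a judgment call.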

Transformers 5.9.0 compatibility (workaround: pin <5.9.0)

vllm-omni 0.21.0rc1 inherits vLLM core's transformers floor >=4.56.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.4.*,!=5.5.0 with no upper bound, so a fresh install resolves to transformers 5.9.0 (released 2026-05-20). 5.9.0 made two breaking changes to transformers.masking_utils.create_causal_mask / create_sliding_window_causal_mask:

- Removed the deprecated `input_embeds` alias (renamed to `inputs_embeds` in 5.5.1, alias dropped in 5.9.0).
- Removed `cache_position` from the signature entirely.

Qwen3TTSTokenizerV2DecoderTransformerModel.forward in vllm-omni 0.21.0rc1 still passes the old kwargs, so every qwen3-tts decode raises TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds' on 5.9.0.

Workaround in this PR: add a transformers>=4.56.0,<5.9.0 pin to Dockerfile.amzn2023. The upstream call site is being fixed in vllm-project/vllm-omni#3786 (signature-filtered helper that works on 4.56–5.9+). We will not cherry-pick the fix into this image — once vllm-omni ships a release containing #3786, drop the pin and let pip resolve transformers freely again.
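The idea of a signature-filtered helper can be sketched as follows. This is not the code from vllm-project/vllm-omni#3786; it is a minimal illustration under stated assumptions. The name `call_with_supported_kwargs` is hypothetical, and `mask_old` / `mask_new` are stand-ins that only mimic the keyword names described above rather than the real `transformers.masking_utils` functions.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(
    fn: Callable[..., Any],
    aliases: dict[str, str],
    **kwargs: Any,
) -> Any:
    """Call fn with only the keyword arguments its signature accepts.

    aliases maps an old keyword name to its replacement; the rename is applied
    only when fn lacks the old name and has the new one.
    """
    params = inspect.signature(fn).parameters
    # A **kwargs catch-all means fn accepts any keyword, so nothing is dropped.
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

    filtered: dict[str, Any] = {}
    for name, value in kwargs.items():
        if name not in params and aliases.get(name) in params:
            name = aliases[name]  # e.g. input_embeds -> inputs_embeds
        if accepts_any or name in params:
            filtered[name] = value
    return fn(**filtered)


# Stand-ins for the two signatures (assumption: simplified, keyword-only).
def mask_old(*, input_embeds, attention_mask, cache_position=None):
    return ("old", input_embeds, attention_mask)


def mask_new(*, inputs_embeds, attention_mask):
    return ("new", inputs_embeds, attention_mask)


ALIASES = {"input_embeds": "inputs_embeds"}
call_kwargs = {"input_embeds": "emb", "attention_mask": "mask", "cache_position": 0}

old_result = call_with_supported_kwargs(mask_old, ALIASES, **call_kwargs)
new_result = call_with_supported_kwargs(mask_new, ALIASES, **call_kwargs)
```

Calling `mask_new(**call_kwargs)` directly raises the same kind of `TypeError` quoted above, while the filtered call renames `input_embeds` and silently drops `cache_position`. Silently dropping arguments is only safe when the callee no longer needs them, which is an assumption a real fix would have to verify per argument.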

New model & feature audit (no test-suite changes)

Walked through the new model + feature delta from v0.20.0 → v0.21.0rc1:

| New model | Decision | Reason |
| --- | --- | --- |
| SenseNova-U1 (`SenseNovaU1Pipeline`, DiT-only image gen) | Skip | `/v1/images/generations` route already covered by `flux2-klein-4b` and `ernie-image-turbo`; no new code path. |
| Tencent Covo-Audio-Chat (`CovoAudioForConditionalGeneration`, omni-chat) | Skip with TODO | Would be the first omni-chat smoke test, but the model is multi-GPU class and the `g6e12xl-runner` CodeBuild fleet is ICE in us-west-2 (same constraint as `qwen2.5-omni-3b`). Revisit when fleet capacity returns. |

Notable feature changes (HunyuanImage-3.0 KV reuse + IT2I editing, HunyuanVideo-1.5 USP, FLUX.2-dev TP, Voxtral TTS FP8, attention_config → diffusion_attention_config rename per #3489) do not alter our existing test request shapes or routes — extra_args: "" defaults shield us from the config rename.

No SageMaker endpoint test additions: 0.21.0rc1 introduces no new route or content-type that the existing endpoint suite (Qwen3-TTS on /v1/audio/speech, Wan2.1-VACE on /v1/videos/sync multipart) doesn't already exercise.

Files changed

- `docker/vllm_omni/versions.env`: `VLLM_VERSION` 0.20.0 → 0.21.0, `VLLM_OMNI_VERSION` 0.20.0 → 0.21.0rc1, `DLC_MINOR_VERSION` 2 → 3.
- `docker/vllm_omni/Dockerfile.amzn2023`: same default-ARG bumps, the three upstream-derived hunks, `--prerelease=allow` on the omni pip install, and the `transformers>=4.56.0,<5.9.0` pin (to be removed once vllm-omni ships a release containing vllm-project/vllm-omni#3786).
- `.github/config/image/vllm-omni-{ec2,sagemaker}-amzn2023.yml`: `framework_version: "0.21.0rc1"`, `vllm_ref: "v0.21.0"`.

Test plan

- Build the image locally (or via the build workflow): `docker build -f docker/vllm_omni/Dockerfile.amzn2023 ...`
- Run the vLLM-Omni smoke-test workflow against the new image (covers `/v1/audio/speech`, `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync`, `/v1/audio/generate`).
- Run the SageMaker endpoint test (`test/vllm-omni/sagemaker/test_sm_omni_endpoint.py`) against the new image.
- Run the security scan workflow; confirm the existing `vllm_omni/framework_allowlist.json` still covers the diffusers/safetensors advisory chain (vllm-omni#3349 raises `diffusers>=0.38.0` but is itself blocked on safetensors 0.8.0 final, so the allowlist entries remain valid).
- Re-evaluate dropping `--prerelease=allow` and the rc1 suffix once stable 0.21.0 ships.
- Drop the `transformers<5.9.0` pin once vllm-omni ships a release containing vllm-project/vllm-omni#3786 ("[Bugfix] Fix qwen3-tts create_causal_mask kwarg for transformers >=5.9.0").

Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release), which rebases onto upstream vLLM v0.21.0.

Cherry-picks three Dockerfile changes from the upstream vLLM v0.20.0 -> v0.21.0 diff that are relevant to our fork:

- libcublas-${CUDA_DASH} -> libcublas-devel-${CUDA_DASH} in the runtime stage so cublas headers are present for JIT (fastsafetensors, nccl_allocator).
- FlashInfer download-cubin moved to a final RUN after vllm wheel, EP kernels, and KV connectors install. Earlier downloads cause ~2.5 GB layer duplication when later pip installs overwrite flashinfer files.
- nixl-cu${CUDA_MAJOR} --force-reinstall --no-deps after the kv_connectors install, replacing the bare nixl-cu13 install, so the matching nixl_ep_cpp.so is shipped.

Skipped upstream changes that don't apply to our AL2023 fork: BUILD_OS=manylinux apt/dnf branching (we are dnf-only), nvidia-cutlass-dsl[cu13] strip-shim (we pin CUDA 13), DeepGEMM multi-Python interpreter matrix (single-Python build), and the sagemaker-entrypoint.sh path move (we ship our own entrypoints).

Also adds --prerelease=allow on the omni install since 0.21.0rc1 is a PEP 440 pre-release; uv would otherwise refuse to resolve it. Strip when bumping to a stable 0.21.0.

DLC_MINOR_VERSION 2 -> 3, tagging this image v1.3.

This is a preparation PR for the official release. No public docs or release notes are updated; those land in the follow-up PR once 0.21.0 ships final.

No test-suite additions: per the new vllm-omni-release skill audit (Step 4b/4c), neither SenseNova-U1 nor Tencent Covo-Audio-Chat clears the gating rules right now (existing image-gen route already covered; g6e12xl-runner is ICE in us-west-2). Endpoint test routes / content-types are unchanged in 0.21.0rc1, so no new endpoint cases.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…baseline vllm-omni 0.20.0 had a regression from vllm-omni#3203 that un-batched Code2Wav decode chunks. Thresholds were loosened to (0.27 / 1.0 / 17000).

The vllm-omni#3485 fix is now picked up in 0.21.0rc1. Observed on this branch: rps=1.302, audio rtf mult=5.033, p95 e2e=3499ms, comfortably clearing the original (0.4 / 1.6 / 11000) baseline. Restore those values, as the comment explicitly directed once the fix landed.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
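The threshold comparison in this commit message can be sketched as a small check. The direction of each bound is an assumption inferred from how the values were loosened (rps and the audio rtf multiple read as lower bounds, p95 end-to-end latency as an upper bound), and the `Thresholds` class and `failing_metrics` function are hypothetical names rather than the test suite's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_rps: float             # assumed lower bound on requests per second
    min_audio_rtf_mult: float  # assumed lower bound on the audio rtf multiple
    max_p95_e2e_ms: float      # assumed upper bound on p95 end-to-end latency (ms)


def failing_metrics(
    limits: Thresholds, *, rps: float, audio_rtf_mult: float, p95_e2e_ms: float
) -> list[str]:
    """Return the names of the metrics that violate the given thresholds."""
    problems = []
    if rps < limits.min_rps:
        problems.append("rps")
    if audio_rtf_mult < limits.min_audio_rtf_mult:
        problems.append("audio_rtf_mult")
    if p95_e2e_ms > limits.max_p95_e2e_ms:
        problems.append("p95_e2e_ms")
    return problems


ORIGINAL = Thresholds(min_rps=0.4, min_audio_rtf_mult=1.6, max_p95_e2e_ms=11000)
LOOSENED = Thresholds(min_rps=0.27, min_audio_rtf_mult=1.0, max_p95_e2e_ms=17000)

# Values reported in the commit message for this branch.
OBSERVED = {"rps": 1.302, "audio_rtf_mult": 5.033, "p95_e2e_ms": 3499}

violations_against_original = failing_metrics(ORIGINAL, **OBSERVED)
```

Under these assumed bound directions, the observed values produce no violations against the original thresholds, which is consistent with restoring them.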

…wheel cache key

The workflow was passing framework_version (= 0.21.0rc1, the omni package version) into fetch_cached_wheels.sh as the vLLM version. That makes the cache key sha256(...,version:0.21.0rc1,...) and the filename glob 'vllm-0.21.0rc1*.whl'; neither matches wheels uploaded for vllm core 0.21.0. Result: every omni build is a forced cache miss, even when a matching vllm core wheel exists in S3.

Source docker/vllm_omni/versions.env first and pass VLLM_VERSION (= 0.21.0) to fetch + upload. Now omni shares the cache with any other workflow building the same vllm core ref/version.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
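The filename half of this cache miss can be reproduced with a short sketch. The glob shape comes from the commit message; the wheel filename and its platform tags are illustrative assumptions, and `wheel_glob` is a hypothetical helper, not a function from `fetch_cached_wheels.sh`. The sha256 key is left out because its full field layout is not shown here.

```python
from fnmatch import fnmatch


def wheel_glob(version: str) -> str:
    """Build a filename pattern in the form quoted in the commit message."""
    return f"vllm-{version}*.whl"


# Hypothetical cached wheel for vLLM core 0.21.0; the tags are illustrative only.
cached_wheel = "vllm-0.21.0-cp312-cp312-linux_x86_64.whl"

omni_pattern = wheel_glob("0.21.0rc1")  # framework_version, passed before the fix
core_pattern = wheel_glob("0.21.0")     # VLLM_VERSION from versions.env, after the fix

matches_before_fix = fnmatch(cached_wheel, omni_pattern)
matches_after_fix = fnmatch(cached_wheel, core_pattern)
```

With the omni package version in the pattern, the literal `rc1` can never match a core wheel's filename, so the lookup misses regardless of what is in the cache; with the core version it matches.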

vllm-omni 0.21.0rc1's qwen3_tts module calls create_causal_mask(..., input_embeds=...) at modeling_qwen3_tts_tokenizer_v2.py:576. transformers renamed the kwarg to `inputs_embeds` in 5.5.1 (kept input_embeds as a deprecated alias via @deprecate_kwarg) and removed the decorator outright in 5.9.0 (released 2026-05-20).

Reference: https://github.com/huggingface/transformers/releases/tag/v5.9.0

vllm core 0.21.0's pin (>=4.56.0, !=5.0..5.4, !=5.5.0) doesn't upper-bound past 5.5, so pip resolves to 5.9.x and breaks qwen3-tts smoke tests with:

TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds'

Cap at <5.9.0 (last working release line is 5.8.x). Drop when vllm-omni updates the call site to use `inputs_embeds`.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
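The resolution behaviour described in this commit message can be sketched with a simplified specifier check. This is a hand-rolled approximation for plain `X.Y.Z` versions only, not pip's or uv's resolver, and the `parse` and `allowed` names are hypothetical; 5.8.0 is used purely as an example of a version on the 5.8.x line.

```python
from typing import Optional


def parse(version: str) -> tuple[int, ...]:
    """Parse a plain X.Y.Z release into an int tuple (no pre/post/dev handling)."""
    return tuple(int(part) for part in version.split("."))


# Constraints taken from the vLLM core specifier quoted above.
FLOOR = (4, 56, 0)                                           # >=4.56.0
EXCLUDED_SERIES = {(5, 0), (5, 1), (5, 2), (5, 3), (5, 4)}   # !=5.0.* ... !=5.4.*
EXCLUDED_EXACT = {(5, 5, 0)}                                 # !=5.5.0


def allowed(version: str, cap: Optional[str] = None) -> bool:
    """Return True when the version satisfies the specifier and the optional upper cap."""
    v = parse(version)
    if v < FLOOR or v[:2] in EXCLUDED_SERIES or v in EXCLUDED_EXACT:
        return False
    return cap is None or v < parse(cap)


uncapped_allows_590 = allowed("5.9.0")               # no upper bound, so 5.9.0 is eligible
capped_allows_590 = allowed("5.9.0", cap="5.9.0")    # the <5.9.0 pin excludes it
capped_allows_580 = allowed("5.8.0", cap="5.9.0")    # the 5.8.x line stays eligible
```

Because a resolver picks the newest eligible version by default, an uncapped specifier drifts to 5.9.x on a fresh install, and the cap keeps the image on the last release line reported as working.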

Mirror the same fix applied to pr-vllm-omni-ec2-amzn2023.yml (f55e3da) for both build-ec2 and build-sagemaker jobs in the scheduled autorelease workflow.

framework-version (= 0.21.0rc1, the omni package version) is not the version stamped on the vllm wheel filename, so passing it to fetch/upload_cached_wheels.sh forces a cache miss every run. Source docker/vllm_omni/versions.env and pass VLLM_VERSION (= 0.21.0) + VLLM_REF (= v0.21.0) instead so the autorelease shares the wheel cache with PR builds on the same vllm core ref.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): document v1.2 and v1.3 image releases

  - Add v1.2.0 changelog entry: SageMaker /v1/videos* now requires client-built multipart/form-data; routing middleware no longer converts JSON. (#6101)
  - Add v1.3.0 changelog entry: vllm-omni 0.21.0rc1 prep, Qwen3-TTS voice-clone throughput restored, transformers <5.9.0 pin. (#6110)
  - Update SageMaker deployment guide with the client-built multipart example for /v1/videos/sync.
  - Drop the v1.1 voice-clone TTS limitation note from configuration.md now that the upstream Code2Wav regression is resolved in 0.21.0rc1.

  Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): link upstream transformers fix PR in v1.3 known issues

  Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): link DLC image PRs in v1.2 and v1.3 changelog entries

  Signed-off-by: Yadan Wei <yadanwei@amazon.com>

---------

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>

aws-deep-learning-containers-ci Bot added the authorized label May 18, 2026

Yadan Wei added 3 commits May 20, 2026 14:21

Yadan-Wei force-pushed the vllm-omni-0.21.0rc1 branch from 4986349 to 1c344a0 Compare May 20, 2026 23:40

Yadan-Wei requested a review from junpuf May 21, 2026 20:02

Yadan-Wei enabled auto-merge (squash) May 21, 2026 20:03

junpuf approved these changes May 21, 2026

Yadan-Wei merged commit c78e78f into main May 21, 2026
48 checks passed

Yadan-Wei mentioned this pull request May 22, 2026

docs(vllm-omni): document v1.2 and v1.3 image releases #6129

Merged

3 tasks

sirutBuasai deleted the vllm-omni-0.21.0rc1 branch May 26, 2026 20:59

Yadan-Wei merged 5 commits into main from vllm-omni-0.21.0rc1
