feat(vllm-omni): prepare 0.21.0rc1 release branch #6110

Merged
Yadan-Wei merged 5 commits into main from vllm-omni-0.21.0rc1
May 21, 2026


@Yadan-Wei Yadan-Wei commented May 18, 2026

Summary

Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release), tracking upstream vLLM v0.21.0 (the rebase target — see vllm-project/vllm-omni#3530). DLC tag bumps v1.2 → v1.3.

This is a preparation PR ahead of the official 0.21.0 release. Public docs and release notes are intentionally unchanged; they land in the follow-up PR once 0.21.0 ships final.

Upstream Dockerfile changes ported to the AL2023 fork

Cherry-picked from the upstream vllm v0.20.0 → v0.21.0 docker/Dockerfile diff:

  • libcublas-${CUDA_DASH} → libcublas-devel-${CUDA_DASH} in the runtime stage so cublas headers are present for JIT (fastsafetensors, nccl_allocator).
  • flashinfer download-cubin moved to a final RUN after the vllm wheel, EP kernels, and KV connectors finish installing. Earlier downloads cause ~2.5 GB of layer duplication when later pip installs overwrite flashinfer files.
  • nixl-cu${CUDA_MAJOR} --force-reinstall --no-deps after the kv_connectors install (replacing the bare nixl-cu13 install), so the matching nixl_ep_cpp.so is shipped.

Upstream changes intentionally skipped

| Upstream change | Reason |
| --- | --- |
| BUILD_OS=manylinux apt/dnf branching | We are dnf-only on AL2023; the existing dnf path covers us. |
| nvidia-cutlass-dsl[cu13] strip-shim for cu12 | We pin CUDA 13. |
| DeepGEMM multi-Python interpreter matrix | We build a single-Python image. |
| examples/online_serving → examples/deployment entrypoint move | We ship our own entrypoint scripts in scripts/vllm/. |
| New VLLM_BUILD_* and OCI labels | Not required for DLC. |

vllm-omni Dockerfile.cuda

Byte-identical between v0.20.0 and v0.21.0rc1 (same blob SHA 78f64f6a). The Dockerfile.ci only bumps VLLM_BASE_TAG=v0.21.0; we don't consume that file, so no port needed.

Pre-release handling

uv pip install adds --prerelease=allow because 0.21.0rc1 is a PEP 440 pre-release. Strip the flag when bumping to stable 0.21.0.

Transformers 5.9.0 compatibility (workaround: pin <5.9.0)

vllm-omni 0.21.0rc1 inherits vLLM core's transformers floor >=4.56.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.4.*,!=5.5.0 with no upper bound, so a fresh install resolves to transformers 5.9.0 (released 2026-05-20). 5.9.0 made two breaking changes to transformers.masking_utils.create_causal_mask / create_sliding_window_causal_mask:

  1. Removed the deprecated input_embeds alias (renamed to inputs_embeds in 5.5.1, alias dropped in 5.9.0).
  2. Removed cache_position from the signature entirely.

Qwen3TTSTokenizerV2DecoderTransformerModel.forward in vllm-omni 0.21.0rc1 still passes the old kwargs, so every qwen3-tts decode raises TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds' on 5.9.0.

Workaround in this PR: add a transformers>=4.56.0,<5.9.0 pin to Dockerfile.amzn2023. The upstream call site is being fixed in vllm-project/vllm-omni#3786 (signature-filtered helper that works on 4.56–5.9+). We will not cherry-pick the fix into this image — once vllm-omni ships a release containing #3786, drop the pin and let pip resolve transformers freely again.
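For readers unfamiliar with the "signature-filtered helper" idea, a rough sketch follows. This is not the code from vllm-project/vllm-omni#3786: `call_with_supported_kwargs`, `build_mask`, and the two stand-in mask functions are hypothetical names with deliberately simplified signatures, used only to show how inspecting a callee's signature lets one call site serve both the old and new keyword sets.

```python
import inspect


def call_with_supported_kwargs(func, /, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all on the callee means every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    supported = {name: value for name, value in kwargs.items() if name in params}
    return func(*args, **supported)


# Stand-ins for the two API shapes described above (simplified, hypothetical).
def create_causal_mask_old(config, input_embeds, attention_mask, cache_position):
    return ("old", input_embeds, cache_position)


def create_causal_mask_new(config, inputs_embeds, attention_mask):
    return ("new", inputs_embeds)


def build_mask(create_fn, config, embeds, attention_mask, cache_position):
    """Pick the embeddings keyword the callee understands, then filter the rest."""
    params = inspect.signature(create_fn).parameters
    embeds_key = "inputs_embeds" if "inputs_embeds" in params else "input_embeds"
    return call_with_supported_kwargs(
        create_fn,
        config=config,
        attention_mask=attention_mask,
        cache_position=cache_position,  # silently dropped if unsupported
        **{embeds_key: embeds},
    )


if __name__ == "__main__":
    print(build_mask(create_causal_mask_old, "cfg", "E", None, 7))
    print(build_mask(create_causal_mask_new, "cfg", "E", None, 7))
```

The trade-off of this pattern is that an unsupported keyword is dropped silently, so it suits compatibility shims better than general-purpose call sites.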

New model & feature audit (no test-suite changes)

Walked through the new model + feature delta from v0.20.0 → v0.21.0rc1:

| New model | Decision | Reason |
| --- | --- | --- |
| SenseNova-U1 (SenseNovaU1Pipeline, DiT-only image gen) | Skip | /v1/images/generations route already covered by flux2-klein-4b and ernie-image-turbo; no new code path. |
| Tencent Covo-Audio-Chat (CovoAudioForConditionalGeneration, omni-chat) | Skip with TODO | Would be the first omni-chat smoke test, but the model is multi-GPU class and the g6e12xl-runner CodeBuild fleet is ICE in us-west-2 (same constraint as qwen2.5-omni-3b). Revisit when fleet capacity returns. |

Notable feature changes (HunyuanImage-3.0 KV reuse + IT2I editing, HunyuanVideo-1.5 USP, FLUX.2-dev TP, Voxtral TTS FP8, attention_config → diffusion_attention_config rename per #3489) do not alter our existing test request shapes or routes; the extra_args: "" defaults shield us from the config rename.

No SageMaker endpoint test additions: 0.21.0rc1 introduces no new route or content-type that the existing endpoint suite (Qwen3-TTS on /v1/audio/speech, Wan2.1-VACE on /v1/videos/sync multipart) doesn't already exercise.

Files changed

  • docker/vllm_omni/versions.env: VLLM_VERSION 0.20.0 → 0.21.0, VLLM_OMNI_VERSION 0.20.0 → 0.21.0rc1, DLC_MINOR_VERSION 2 → 3.
  • docker/vllm_omni/Dockerfile.amzn2023: same default-ARG bumps + the three upstream-derived hunks + --prerelease=allow on the omni pip install + transformers>=4.56.0,<5.9.0 pin (to be removed once vllm-omni ships vllm-project/vllm-omni#3786).
  • .github/config/image/vllm-omni-{ec2,sagemaker}-amzn2023.yml: framework_version: "0.21.0rc1", vllm_ref: "v0.21.0".

Test plan

  • Build the image locally (or via the build workflow): docker build -f docker/vllm_omni/Dockerfile.amzn2023 ...
  • Run the vLLM-Omni smoke-test workflow against the new image (covers /v1/audio/speech, /v1/images/generations, /v1/videos, /v1/videos/sync, /v1/audio/generate).
  • Run the SageMaker endpoint test (test/vllm-omni/sagemaker/test_sm_omni_endpoint.py) against the new image.
  • Run the security scan workflow; confirm the existing vllm_omni/framework_allowlist.json still covers the diffusers/safetensors advisory chain (vllm-omni#3349 raises diffusers>=0.38.0 but is itself blocked on safetensors 0.8.0 final, so the allowlist entries remain valid).
  • Re-evaluate dropping --prerelease=allow and the rc1 suffix once stable 0.21.0 ships.
  • Drop the transformers<5.9.0 pin once vllm-omni ships a release containing vllm-project/vllm-omni#3786 ([Bugfix] Fix qwen3-tts create_causal_mask kwarg for transformers >=5.9.0).

Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release),
which rebases onto upstream vLLM v0.21.0. Cherry-picks three Dockerfile
changes from the upstream vLLM v0.20.0 -> v0.21.0 diff that are relevant
to our fork:

- libcublas-${CUDA_DASH} -> libcublas-devel-${CUDA_DASH} in the runtime
  stage so cublas headers are present for JIT (fastsafetensors,
  nccl_allocator).
- FlashInfer download-cubin moved to a final RUN after vllm wheel,
  EP kernels, and KV connectors install. Earlier downloads cause
  ~2.5 GB layer duplication when later pip installs overwrite
  flashinfer files.
- nixl-cu${CUDA_MAJOR} --force-reinstall --no-deps after the kv_connectors
  install, replacing the bare nixl-cu13 install, so the matching
  nixl_ep_cpp.so is shipped.

Skipped upstream changes that don't apply to our AL2023 fork:
BUILD_OS=manylinux apt/dnf branching (we are dnf-only),
nvidia-cutlass-dsl[cu13] strip-shim (we pin CUDA 13), DeepGEMM
multi-Python interpreter matrix (single-Python build), and the
sagemaker-entrypoint.sh path move (we ship our own entrypoints).

Also adds --prerelease=allow on the omni install since 0.21.0rc1 is a
PEP 440 pre-release; uv would otherwise refuse to resolve it. Strip
when bumping to a stable 0.21.0.

DLC_MINOR_VERSION 2 -> 3, tagging this image v1.3.

This is a preparation PR for the official release. No public docs or
release notes are updated; those land in the follow-up PR once 0.21.0
ships final.

No test-suite additions: per the new vllm-omni-release skill audit
(Step 4b/4c), neither SenseNova-U1 nor Tencent Covo-Audio-Chat clears
the gating rules right now (existing image-gen route already covered;
g6e12xl-runner is ICE in us-west-2). Endpoint test routes /
content-types are unchanged in 0.21.0rc1, so no new endpoint cases.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei added 3 commits May 20, 2026 14:21
…baseline

vllm-omni 0.20.0 had a regression from vllm-omni#3203 that un-batched
Code2Wav decode chunks. Thresholds were loosened to (0.27 / 1.0 / 17000).

vllm-omni#3485 fix is now picked up in 0.21.0rc1. Observed on this
branch: rps=1.302, audio rtf mult=5.033, p95 e2e=3499ms, all comfortably
clearing the original (0.4 / 1.6 / 11000) baseline. Restore those values
as the comment explicitly directed once the fix landed.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
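As an illustration of the threshold restore in the commit above, the sketch below checks a run against both threshold sets. It assumes the triple means (minimum rps / minimum audio RTF multiplier / maximum p95 end-to-end latency in ms), which matches the direction in which the values were loosened but is not spelled out in the commit; the class and function names are hypothetical, and the second sample run uses made-up numbers purely for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_rps: float
    min_audio_rtf_mult: float
    max_p95_e2e_ms: float


ORIGINAL = Thresholds(min_rps=0.4, min_audio_rtf_mult=1.6, max_p95_e2e_ms=11000)
LOOSENED = Thresholds(min_rps=0.27, min_audio_rtf_mult=1.0, max_p95_e2e_ms=17000)


def failures(rps: float, audio_rtf_mult: float, p95_e2e_ms: float,
             limits: Thresholds) -> list[str]:
    """Return a human-readable list of violated thresholds (empty means pass)."""
    problems = []
    if rps < limits.min_rps:
        problems.append(f"rps {rps} < {limits.min_rps}")
    if audio_rtf_mult < limits.min_audio_rtf_mult:
        problems.append(f"audio rtf mult {audio_rtf_mult} < {limits.min_audio_rtf_mult}")
    if p95_e2e_ms > limits.max_p95_e2e_ms:
        problems.append(f"p95 e2e {p95_e2e_ms}ms > {limits.max_p95_e2e_ms}ms")
    return problems


if __name__ == "__main__":
    # Values observed on this branch, per the commit message.
    print(failures(1.302, 5.033, 3499, ORIGINAL))
    # Made-up run that would pass only the loosened thresholds.
    print(failures(0.3, 1.2, 15000, LOOSENED))
    print(failures(0.3, 1.2, 15000, ORIGINAL))
```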
…wheel cache key

The workflow was passing framework_version (= 0.21.0rc1, the omni package
version) into fetch_cached_wheels.sh as the vLLM version. That makes the
cache key sha256(...,version:0.21.0rc1,...) and the filename glob
'vllm-0.21.0rc1*.whl' — neither matches wheels uploaded for vllm core
0.21.0. Result: every omni build is a forced cache miss, even when a
matching vllm core wheel exists in S3.

Source docker/vllm_omni/versions.env first and pass VLLM_VERSION
(= 0.21.0) to fetch + upload. Now omni shares the cache with any other
workflow building the same vllm core ref/version.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
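The cache miss described in the wheel-cache commit above can be reproduced in a few lines. This is a sketch under stated assumptions: the real key layout inside fetch_cached_wheels.sh is elided in the commit message, so the `ref:...,version:...` payload here is a hypothetical stand-in, as are the function names and the example wheel filename.

```python
import hashlib
from fnmatch import fnmatch


def cache_key(vllm_ref: str, version: str) -> str:
    """Hash a (hypothetical) key payload; any change in version changes the key."""
    payload = f"ref:{vllm_ref},version:{version}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def wheel_glob(version: str) -> str:
    """Filename pattern of the form quoted in the commit message."""
    return f"vllm-{version}*.whl"


# Hypothetical filename for a cached vllm core 0.21.0 wheel.
CACHED_WHEEL = "vllm-0.21.0-cp38-abi3-linux_x86_64.whl"

if __name__ == "__main__":
    # Before the fix: the omni package version is used as the vLLM version.
    print(cache_key("v0.21.0", "0.21.0rc1") == cache_key("v0.21.0", "0.21.0"))
    print(fnmatch(CACHED_WHEEL, wheel_glob("0.21.0rc1")))
    # After the fix: VLLM_VERSION from versions.env is used instead.
    print(fnmatch(CACHED_WHEEL, wheel_glob("0.21.0")))
```

Either mismatch alone (the key or the glob) is enough to force a miss, which is why the fix passes the core version to both the fetch and upload steps.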
vllm-omni 0.21.0rc1's qwen3_tts module calls
create_causal_mask(..., input_embeds=...) at modeling_qwen3_tts_tokenizer_v2.py:576.
transformers renamed the kwarg to `inputs_embeds` in 5.5.1 (kept
input_embeds as a deprecated alias via @deprecate_kwarg) and removed the
decorator outright in 5.9.0 (released 2026-05-20).

Reference: https://github.com/huggingface/transformers/releases/tag/v5.9.0

vllm core 0.21.0's pin (>=4.56.0, !=5.0..5.4, !=5.5.0) doesn't upper-bound
past 5.5, so pip resolves to 5.9.x and breaks qwen3-tts smoke tests with:

    TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds'

Cap at <5.9.0 (last working release line is 5.8.x). Drop when vllm-omni
updates the call site to use `inputs_embeds`.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei force-pushed the vllm-omni-0.21.0rc1 branch from 4986349 to 1c344a0 Compare May 20, 2026 23:40
Mirror the same fix applied to pr-vllm-omni-ec2-amzn2023.yml (f55e3da)
for both build-ec2 and build-sagemaker jobs in the scheduled autorelease
workflow. framework-version (= 0.21.0rc1, the omni package version) is
not the version stamped on the vllm wheel filename, so passing it to
fetch/upload_cached_wheels.sh forces a cache miss every run.

Source docker/vllm_omni/versions.env and pass VLLM_VERSION (= 0.21.0)
+ VLLM_REF (= v0.21.0) instead so the autorelease shares the wheel
cache with PR builds on the same vllm core ref.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei requested a review from junpuf May 21, 2026 20:02
@Yadan-Wei Yadan-Wei enabled auto-merge (squash) May 21, 2026 20:03
@Yadan-Wei Yadan-Wei merged commit c78e78f into main May 21, 2026
48 checks passed
Yadan-Wei added a commit that referenced this pull request May 22, 2026
* docs(vllm-omni): document v1.2 and v1.3 image releases

- Add v1.2.0 changelog entry: SageMaker /v1/videos* now requires
  client-built multipart/form-data; routing middleware no longer
  converts JSON. (#6101)
- Add v1.3.0 changelog entry: vllm-omni 0.21.0rc1 prep, Qwen3-TTS
  voice-clone throughput restored, transformers <5.9.0 pin. (#6110)
- Update SageMaker deployment guide with the client-built multipart
  example for /v1/videos/sync.
- Drop the v1.1 voice-clone TTS limitation note from configuration.md
  now that the upstream Code2Wav regression is resolved in 0.21.0rc1.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): link upstream transformers fix PR in v1.3 known issues

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): link DLC image PRs in v1.2 and v1.3 changelog entries

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

---------

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
@sirutBuasai sirutBuasai deleted the vllm-omni-0.21.0rc1 branch May 26, 2026 20:59