feat(vllm-omni): prepare 0.21.0rc1 release branch #6110
Merged
Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release),
which rebases onto upstream vLLM v0.21.0. Cherry-picks three Dockerfile
changes from the upstream vLLM v0.20.0 -> v0.21.0 diff that are relevant
to our fork:
- libcublas-${CUDA_DASH} -> libcublas-devel-${CUDA_DASH} in the runtime
stage so cublas headers are present for JIT (fastsafetensors,
nccl_allocator).
- FlashInfer download-cubin moved to a final RUN after vllm wheel,
EP kernels, and KV connectors install. Earlier downloads cause
~2.5 GB layer duplication when later pip installs overwrite
flashinfer files.
- nixl-cu${CUDA_MAJOR} --force-reinstall --no-deps after the kv_connectors
install, replacing the bare nixl-cu13 install, so the matching
nixl_ep_cpp.so is shipped.
Skipped upstream changes that don't apply to our AL2023 fork:
BUILD_OS=manylinux apt/dnf branching (we are dnf-only),
nvidia-cutlass-dsl[cu13] strip-shim (we pin CUDA 13), DeepGEMM
multi-Python interpreter matrix (single-Python build), and the
sagemaker-entrypoint.sh path move (we ship our own entrypoints).
Also adds --prerelease=allow on the omni install since 0.21.0rc1 is a
PEP 440 pre-release; uv would otherwise refuse to resolve it. Strip
when bumping to a stable 0.21.0.
DLC_MINOR_VERSION 2 -> 3, tagging this image v1.3.
This is a preparation PR for the official release. No public docs or
release notes are updated; those land in the follow-up PR once 0.21.0
ships final.
No test-suite additions: per the new vllm-omni-release skill audit
(Step 4b/4c), neither SenseNova-U1 nor Tencent Covo-Audio-Chat clears
the gating rules right now (existing image-gen route already covered;
g6e12xl-runner is ICE in us-west-2). Endpoint test routes /
content-types are unchanged in 0.21.0rc1, so no new endpoint cases.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
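To make the `--prerelease=allow` note concrete, here is a minimal sketch of how a version string such as `0.21.0rc1` can be told apart from a stable `0.21.0`. The regex is a deliberately simplified stand-in for PEP 440 (it ignores epochs, `.dev`, `.post`, and local versions), and `is_prerelease` is a hypothetical helper, not something this PR adds.

```python
import re

# Simplified matcher: dotted release digits followed by an a/b/rc segment.
# Assumption: this is NOT a full PEP 440 parser (no epochs, .dev, .post,
# or local version labels are handled).
_PRE_RELEASE = re.compile(r"^\d+(?:\.\d+)*(?:a|b|rc)\d+$")


def is_prerelease(version: str) -> bool:
    """Return True if `version` looks like an alpha, beta, or release candidate."""
    return _PRE_RELEASE.match(version) is not None


if __name__ == "__main__":
    for candidate in ("0.21.0rc1", "0.21.0"):
        print(candidate, is_prerelease(candidate))
```

A check like this could gate whether the install step passes the flag, so the flag disappears automatically on the stable bump; whether that is worth the indirection is a judgment call.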
added 3 commits
May 20, 2026 14:21
…baseline vllm-omni 0.20.0 had a regression from vllm-omni#3203 that un-batched Code2Wav decode chunks. Thresholds were loosened to (0.27 / 1.0 / 17000). The vllm-omni#3485 fix is now picked up in 0.21.0rc1.
Observed on this branch: rps=1.302, audio rtf mult=5.033, p95 e2e=3499ms, comfortably clearing the original (0.4 / 1.6 / 11000) baseline. Restore those values, as the comment explicitly directed once the fix landed.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
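A rough sketch of the threshold comparison described in this commit. It assumes the three numbers are, in order, a floor on rps, a floor on the audio RTF multiplier, and a ceiling on p95 end-to-end latency in ms; that reading fits how the values were loosened, but the field names and the `passes` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_rps: float             # assumed floor: requests per second
    min_audio_rtf_mult: float  # assumed floor: audio real-time-factor multiplier
    max_p95_e2e_ms: float      # assumed ceiling: p95 end-to-end latency in ms


ORIGINAL = Thresholds(0.4, 1.6, 11000)
LOOSENED = Thresholds(0.27, 1.0, 17000)


def passes(rps: float, audio_rtf_mult: float, p95_e2e_ms: float, t: Thresholds) -> bool:
    """Return True when all three observed metrics satisfy the thresholds."""
    return (
        rps >= t.min_rps
        and audio_rtf_mult >= t.min_audio_rtf_mult
        and p95_e2e_ms <= t.max_p95_e2e_ms
    )


# Values reported in the commit message for this branch.
observed = dict(rps=1.302, audio_rtf_mult=5.033, p95_e2e_ms=3499)
```

With the observed values, `passes(**observed, t=ORIGINAL)` holds, which is the condition for restoring the original thresholds.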
…wheel cache key
The workflow was passing framework_version (= 0.21.0rc1, the omni package version) into fetch_cached_wheels.sh as the vLLM version. That makes the cache key sha256(...,version:0.21.0rc1,...) and the filename glob 'vllm-0.21.0rc1*.whl'; neither matches wheels uploaded for vllm core 0.21.0. Result: every omni build is a forced cache miss, even when a matching vllm core wheel exists in S3.
Source docker/vllm_omni/versions.env first and pass VLLM_VERSION (= 0.21.0) to fetch + upload. Now omni shares the cache with any other workflow building the same vllm core ref/version.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
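The cache miss is easy to reproduce in isolation. The sketch below is illustrative only: the key layout inside `cache_key` and the wheel filename are assumptions (the real script hashes more fields than shown), but the glob pattern is the one quoted in the commit message.

```python
import fnmatch
import hashlib


def cache_key(ref: str, version: str) -> str:
    # Hypothetical key layout; the actual fields hashed by
    # fetch_cached_wheels.sh are elided in the commit message.
    return hashlib.sha256(f"ref:{ref},version:{version}".encode()).hexdigest()


def wheel_glob(version: str) -> str:
    """Filename glob of the form quoted in the commit message."""
    return f"vllm-{version}*.whl"


# Hypothetical filename for a cached vllm core 0.21.0 wheel.
CACHED_WHEEL = "vllm-0.21.0-cp38-abi3-linux_x86_64.whl"

# Passing the omni package version misses; passing the core version hits.
omni_hit = fnmatch.fnmatchcase(CACHED_WHEEL, wheel_glob("0.21.0rc1"))
core_hit = fnmatch.fnmatchcase(CACHED_WHEEL, wheel_glob("0.21.0"))
```

Here `omni_hit` is false and `core_hit` is true, and the two versions also hash to different keys, so both the key lookup and the filename match fail when the omni version is passed.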
vllm-omni 0.21.0rc1's qwen3_tts module calls create_causal_mask(..., input_embeds=...) at modeling_qwen3_tts_tokenizer_v2.py:576. transformers renamed the kwarg to `inputs_embeds` in 5.5.1 (kept input_embeds as a deprecated alias via @deprecate_kwarg) and removed the decorator outright in 5.9.0 (released 2026-05-20).
Reference: https://github.com/huggingface/transformers/releases/tag/v5.9.0
vllm core 0.21.0's pin (>=4.56.0, !=5.0..5.4, !=5.5.0) doesn't upper-bound past 5.5, so pip resolves to 5.9.x and breaks qwen3-tts smoke tests with:
TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds'
Cap at <5.9.0 (last working release line is 5.8.x). Drop when vllm-omni updates the call site to use `inputs_embeds`.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
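For readers unfamiliar with this failure mode, the following sketch shows one way a call site can tolerate a renamed or removed keyword argument by inspecting the callee's signature. It is not the upstream fix: `call_with_supported_kwargs` is a hypothetical helper, and `create_causal_mask_stub` is a stand-in with an assumed signature, not the real transformers function.

```python
import inspect


def call_with_supported_kwargs(func, /, *args, renames=None, **kwargs):
    """Call `func`, renaming legacy kwargs and dropping ones it does not accept."""
    params = inspect.signature(func).parameters
    # If func takes **kwargs, every name is accepted; pass everything through.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    renames = renames or {}
    accepted = {}
    for name, value in kwargs.items():
        # Prefer the name as given; fall back to its renamed form.
        target = name if name in params else renames.get(name)
        if target in params:
            accepted[target] = value
    return func(*args, **accepted)


# Stand-in mimicking a newer-style signature (assumption, for illustration).
def create_causal_mask_stub(config, inputs_embeds, attention_mask=None):
    return {"inputs_embeds": inputs_embeds, "attention_mask": attention_mask}


result = call_with_supported_kwargs(
    create_causal_mask_stub,
    config=None,
    input_embeds="emb",   # legacy name, mapped to inputs_embeds
    attention_mask=None,
    cache_position=0,     # not in the stub's signature, so it is dropped
    renames={"input_embeds": "inputs_embeds"},
)
```

Silently dropping unknown kwargs can hide genuine mistakes, so a helper like this is best limited to a known, short list of legacy names.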
4986349 to
1c344a0
Mirror the same fix applied to pr-vllm-omni-ec2-amzn2023.yml (f55e3da) for both build-ec2 and build-sagemaker jobs in the scheduled autorelease workflow.
framework-version (= 0.21.0rc1, the omni package version) is not the version stamped on the vllm wheel filename, so passing it to fetch/upload_cached_wheels.sh forces a cache miss every run. Source docker/vllm_omni/versions.env and pass VLLM_VERSION (= 0.21.0) + VLLM_REF (= v0.21.0) instead so the autorelease shares the wheel cache with PR builds on the same vllm core ref.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
junpuf
approved these changes
May 21, 2026
Yadan-Wei
added a commit
that referenced
this pull request
May 22, 2026
* docs(vllm-omni): document v1.2 and v1.3 image releases
- Add v1.2.0 changelog entry: SageMaker /v1/videos* now requires client-built multipart/form-data; routing middleware no longer converts JSON. (#6101)
- Add v1.3.0 changelog entry: vllm-omni 0.21.0rc1 prep, Qwen3-TTS voice-clone throughput restored, transformers <5.9.0 pin. (#6110)
- Update SageMaker deployment guide with the client-built multipart example for /v1/videos/sync.
- Drop the v1.1 voice-clone TTS limitation note from configuration.md now that the upstream Code2Wav regression is resolved in 0.21.0rc1.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* docs(vllm-omni): link upstream transformers fix PR in v1.3 known issues
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
* docs(vllm-omni): link DLC image PRs in v1.2 and v1.3 changelog entries
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
---------
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
Summary
Bumps the vLLM-Omni AL2023 image to vllm-omni 0.21.0rc1 (pre-release), tracking upstream vLLM v0.21.0 (the rebase target — see vllm-project/vllm-omni#3530). DLC tag bumps v1.2 → v1.3.
This is a preparation PR ahead of the official 0.21.0 release. Public docs and release notes are intentionally unchanged; they land in the follow-up PR once 0.21.0 ships final.
Upstream Dockerfile changes ported to the AL2023 fork
Cherry-picked from the upstream `vllm` v0.20.0 → v0.21.0 `docker/Dockerfile` diff:
- `libcublas-${CUDA_DASH}` → `libcublas-devel-${CUDA_DASH}` in the runtime stage so cublas headers are present for JIT (fastsafetensors, nccl_allocator).
- `flashinfer download-cubin` moved to a final `RUN` after the vllm wheel, EP kernels, and KV connectors finish installing. Earlier downloads cause ~2.5 GB of layer duplication when later pip installs overwrite flashinfer files.
- `nixl-cu${CUDA_MAJOR}` `--force-reinstall --no-deps` after the kv_connectors install (replacing the bare `nixl-cu13` install), so the matching `nixl_ep_cpp.so` is shipped.
Upstream changes intentionally skipped
- `BUILD_OS=manylinux` apt/dnf branching (we are dnf-only)
- `nvidia-cutlass-dsl[cu13]` strip-shim for cu12 (we pin CUDA 13)
- `examples/online_serving` → `examples/deployment` entrypoint move (… `scripts/vllm/`.)
- `VLLM_BUILD_*` and OCI labels
vllm-omni `Dockerfile.cuda`
Byte-identical between v0.20.0 and v0.21.0rc1 (same blob SHA `78f64f6a`). The `Dockerfile.ci` only bumps `VLLM_BASE_TAG=v0.21.0`; we don't consume that file, so no port needed.
Pre-release handling
`uv pip install` adds `--prerelease=allow` because `0.21.0rc1` is a PEP 440 pre-release. Strip the flag when bumping to stable `0.21.0`.
Transformers 5.9.0 compatibility (workaround: pin <5.9.0)
vllm-omni 0.21.0rc1 inherits vLLM core's transformers floor `>=4.56.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.4.*,!=5.5.0` with no upper bound, so a fresh install resolves to transformers 5.9.0 (released 2026-05-20). 5.9.0 made two breaking changes to `transformers.masking_utils.create_causal_mask` / `create_sliding_window_causal_mask`:
- Dropped the `input_embeds` alias (renamed to `inputs_embeds` in 5.5.1, alias dropped in 5.9.0).
- Removed `cache_position` from the signature entirely.
`Qwen3TTSTokenizerV2DecoderTransformerModel.forward` in vllm-omni 0.21.0rc1 still passes the old kwargs, so every qwen3-tts decode raises `TypeError: create_causal_mask() got an unexpected keyword argument 'input_embeds'` on 5.9.0.
Workaround in this PR: add a `transformers>=4.56.0,<5.9.0` pin to `Dockerfile.amzn2023`. The upstream call site is being fixed in vllm-project/vllm-omni#3786 (signature-filtered helper that works on 4.56–5.9+). We will not cherry-pick the fix into this image; once vllm-omni ships a release containing #3786, drop the pin and let pip resolve transformers freely again.
New model & feature audit (no test-suite changes)
Walked through the new model + feature delta from v0.20.0 → v0.21.0rc1:
- SenseNova-U1 (`SenseNovaU1Pipeline`, DiT-only image gen): `/v1/images/generations` route already covered by `flux2-klein-4b` and `ernie-image-turbo`; no new code path.
- Tencent Covo-Audio-Chat (`CovoAudioForConditionalGeneration`, omni-chat): `g6e12xl-runner` CodeBuild fleet is ICE in us-west-2 (same constraint as `qwen2.5-omni-3b`). Revisit when fleet capacity returns.
Notable feature changes (HunyuanImage-3.0 KV reuse + IT2I editing, HunyuanVideo-1.5 USP, FLUX.2-dev TP, Voxtral TTS FP8, `attention_config` → `diffusion_attention_config` rename per #3489) do not alter our existing test request shapes or routes; `extra_args: ""` defaults shield us from the config rename.
No SageMaker endpoint test additions: 0.21.0rc1 introduces no new route or content-type that the existing endpoint suite (Qwen3-TTS on `/v1/audio/speech`, Wan2.1-VACE on `/v1/videos/sync` multipart) doesn't already exercise.
Files changed
- `docker/vllm_omni/versions.env`: `VLLM_VERSION` 0.20.0 → 0.21.0, `VLLM_OMNI_VERSION` 0.20.0 → 0.21.0rc1, `DLC_MINOR_VERSION` 2 → 3.
- `docker/vllm_omni/Dockerfile.amzn2023`: same default-ARG bumps + the three upstream-derived hunks + `--prerelease=allow` on the omni pip install + `transformers>=4.56.0,<5.9.0` pin (to be removed once vllm-omni ships #3786).
- `.github/config/image/vllm-omni-{ec2,sagemaker}-amzn2023.yml`: `framework_version: "0.21.0rc1"`, `vllm_ref: "v0.21.0"`.
Test plan
- `docker build -f docker/vllm_omni/Dockerfile.amzn2023 ...`
- … (`/v1/audio/speech`, `/v1/images/generations`, `/v1/videos`, `/v1/videos/sync`, `/v1/audio/generate`).
- … (`test/vllm-omni/sagemaker/test_sm_omni_endpoint.py`) against the new image.
- `vllm_omni/framework_allowlist.json` still covers the `diffusers` / `safetensors` advisory chain (vllm-omni#3349 raises `diffusers>=0.38.0` but is itself blocked on `safetensors 0.8.0` final, so the allowlist entries remain valid).
- Strip `--prerelease=allow` and the rc1 suffix once stable 0.21.0 ships.
- Drop the `transformers<5.9.0` pin once vllm-omni ships a release containing vllm-project/vllm-omni#3786 ([Bugfix] Fix qwen3-tts create_causal_mask kwarg for transformers >=5.9.0).
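As a sanity check on the transformers constraint this PR describes, the sketch below evaluates plain `X.Y.Z` versions against the inherited exclusions plus the added `<5.9.0` cap. It is a hand-rolled approximation for illustration, not a PEP 440 resolver, and `parse` / `allowed` are hypothetical names.

```python
def parse(version: str) -> tuple:
    """Parse a plain X.Y.Z release string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def allowed(version: str, upper_bound="5.9.0") -> bool:
    """Approximate >=4.56.0, !=5.0.* .. !=5.4.*, !=5.5.0, and optionally <upper_bound."""
    v = parse(version)
    if v < (4, 56, 0):
        return False              # below the floor
    if v[0] == 5 and v[1] <= 4:
        return False              # excluded 5.0.* through 5.4.* lines
    if v == (5, 5, 0):
        return False              # excluded single release
    if upper_bound is not None and v >= parse(upper_bound):
        return False              # cap added by this PR
    return True


if __name__ == "__main__":
    for candidate in ("4.56.0", "5.5.0", "5.5.1", "5.8.0", "5.9.0"):
        print(candidate, allowed(candidate), allowed(candidate, upper_bound=None))
```

Passing `upper_bound=None` models the constraint without the cap, under which 5.9.0 is accepted; with the default cap it is rejected while 5.8.0 still passes.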