
feat(vllm-omni): bump to v0.20.0 and expand smoke-test matrix #6028

Merged: Yadan-Wei merged 24 commits into main from omni-0.20.0 on May 12, 2026

Yadan-Wei commented May 2, 2026

Purpose

Upgrades the vLLM-Omni DLC image from 0.18.0 → 0.20.0 (final), aligns build-time pins with upstream vllm/docker/Dockerfile at v0.20.0, and expands the smoke-test matrix to cover three new diffusion-pipeline models added in v0.20.0.

Changes

Version bumps

| Component | Before | After |
|---|---|---|
| CUDA | 12.9.1 | 13.0.2 (matches upstream vllm v0.20.0) |
| vLLM | 0.18.0 | 0.20.0 |
| vLLM-Omni | 0.18.0 | 0.20.0 (final, no --prerelease=allow needed) |
| FlashInfer | 0.6.6 | 0.6.8.post1 |
| runai-model-streamer | >=0.15.3 | >=0.15.7 |
| torch_cuda_arch_list | 7.0 7.5 8.0 8.9 9.0 10.0 12.0 | 7.5 8.0 8.6 8.9 9.0 10.0 11.0 12.0+PTX |

Other Dockerfile changes

  • Add numactl / numactl-libs / numactl-devel in runtime stage (needed by fastsafetensors in CUDA 13 upstream).
  • Add VLLM_ENABLE_CUDA_COMPATIBILITY=0 env (settable to 1 at runtime for hosts with NVIDIA drivers older than what CUDA 13 requires).
  • Drop sox system dep — vllm-omni v0.20.0 removed sox from its deps (vllm-omni#2745).
  • Remove in-tree DeepGEMM build — upstream also removed it from the main Dockerfile.
  • Image-config files: cuda_version cu129 → cu130, prod_image 0.18 → 0.20.
  • Explicit cuda-compat-13-0 upgrade for CVE-2025-33219 (NVIDIA repo CVE not flagged by AL2023's --security filter).

Smoke-test matrix (9 active models)

Added three new active entries (marked new below); fleet-bumped one existing entry. The matrix grew from 6 → 9 active models. Each new entry was validated end-to-end during this PR:

| Model | Route | Fleet | Notes |
|---|---|---|---|
| qwen3-tts-1.7b-customvoice | /v1/audio/speech | x86-g6xl-runner | Existing |
| qwen3-tts-12hz-1.7b-base | /v1/audio/speech | x86-g6xl-runner | Existing; voice-clone via ref_audio_s3 |
| cosyvoice3-0.5b | /v1/audio/speech | x86-g6exl-runner | Fleet bumped from x86-g6xl-runner (16 GB RAM) → x86-g6exl-runner (32 GB RAM): under 0.20.0 final the --trust-remote-code model load was tipping host RAM into OOM-kill on the smaller box |
| flux2-klein-4b | /v1/images/generations | x86-g6xl-runner | Existing |
| ernie-image-turbo | /v1/images/generations | x86-g6exl-runner | New; 8-step distilled DiT image gen (vllm-omni#2861), only landed in 0.20.0 final |
| wan2.1-t2v-1.3b | /v1/videos | x86-g6exl-runner | Existing; async route, returns job ID |
| wan2.1-t2v-1.3b-sync | /v1/videos/sync | x86-g6exl-runner | New; sync video route in v0.20.0 returns video/mp4 directly, SageMaker-compatible |
| wan2.1-vace-1.3b | /v1/videos/sync | x86-g6exl-runner | New; unified video creation+editing pipeline (WanVACEPipeline, vllm-omni#1885), distinct from T2V |
| stable-audio-open-1.0 | /v1/audio/generate | x86-g6xl-runner | Existing; new /v1/audio/generate route added in v0.20.0 |

A fourth new entry (wan2.2-i2v-a14b, 27B-total / 14B-active MoE on x86-g6e12xl-runner) is pre-staged in S3 but commented out — needs the I2V-image-fixture harness extension and g6e.12xl capacity to return.

SageMaker async endpoint test for video

New test_vllm_omni_video_async_endpoint covers the recommended production pattern for video on SageMaker: AsyncInferenceConfig + /v1/videos/sync via the custom-attributes routing middleware (which auto-converts JSON to multipart/form-data for FORM_DATA_ROUTES). Uses Wan-AI/Wan2.1-VACE-1.3B-diffusers on ml.g5.2xlarge — A10G has plenty of VRAM for VACE and 32 GB RAM avoids the host-RAM OOM the 16 GB instances hit during HF model load. The fixture skips on SageMaker capacity errors so other matrix entries still get signal.
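To make the JSON-to-multipart conversion concrete, here is a minimal sketch of how a middleware could re-encode a JSON body as multipart/form-data. It is not the code in omni_sagemaker_serve.py: the contents of FORM_DATA_ROUTES, the helper names `json_to_multipart` and `maybe_convert`, and the choice to JSON-encode non-string values are all assumptions made for illustration.

```python
import json
import uuid
from typing import Optional, Tuple

# Hypothetical route set; the real FORM_DATA_ROUTES lives in the middleware.
FORM_DATA_ROUTES = {"/v1/videos", "/v1/videos/sync"}


def json_to_multipart(json_body: bytes, boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Encode a flat JSON object as multipart/form-data.

    Returns (body, content_type). Non-string values are JSON-encoded,
    which is an assumption of this sketch.
    """
    fields = json.loads(json_body)
    if not isinstance(fields, dict):
        raise ValueError("expected a JSON object at the top level")
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text = value if isinstance(value, str) else json.dumps(value)
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{text}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")  # closing delimiter
    return "".join(parts).encode("utf-8"), f"multipart/form-data; boundary={boundary}"


def maybe_convert(route: str, body: bytes, content_type: str) -> Tuple[bytes, str]:
    """Convert only JSON bodies bound for a form-data route; pass others through."""
    if route in FORM_DATA_ROUTES and content_type.startswith("application/json"):
        return json_to_multipart(body)
    return body, content_type
```

A caller would replace both the request body and the Content-Type header with the returned pair before forwarding the request to the target route.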

Middleware doc

Annotated omni_sagemaker_serve.py with the upstream contract we depend on: vLLM v0.20.0 still wires --middleware through args.middleware → app.add_middleware() in vllm/entrypoints/openai/api_server.py:build_app. vllm-omni's "delegate to upstream entrypoint" rebase (vllm-omni#3082, vllm-omni#3232) preserves it. Documenting this so a future upstream change that drops the loop is caught early — otherwise the loader silently no-ops and SageMaker /invocations returns 404 for non-default routes.

Smoke-test scripts

vllm_omni_ec2_smoke_test.sh and vllm_omni_sagemaker_smoke_test.sh grew a @/path/to/file convention for the test_request arg so large request bodies (TTS voice-clone with base64 ref audio, etc.) bypass the shell argv length limit.

Test workflow

  • --stage-init-timeout bumped 600 → 900 in reusable-vllm-omni-model-tests.yml to match the vllm-omni benchmark default after the Stage CLI refactor (vllm-omni#2020).

Security allowlist

Added 9 entries to test/security/data/ecr_scan_allowlist/vllm_omni/framework_allowlist.json:

  • GHSA-98h9-4798-4q5v (diffusers trust_remote_code bypass): fix is diffusers>=0.38.0, but it transitively requires safetensors>=0.8.0-rc.0 which uv/pip skip by default. Upstream vllm-omni#3349 is open and waiting on safetensors 0.8.0 final. DLC loads only pre-staged S3 models we control; user code paths do not pass trust_remote_code=True.
  • 8 mooncake go/stdlib CVEs (CVE-2026-25679, -32280, -32281, -32283, -33186, -33811, -33814, -39820, -39836, -42499, -61726, -68121): all vendored into mooncake/libetcd_wrapper.so, unpatchable without a mooncake upgrade.

Entrypoint unchanged

scripts/vllm/omni_dockerd_entrypoint.sh and scripts/vllm/omni_sagemaker_entrypoint.sh still run vllm serve --omni. Upstream vllm v0.20.0 + vllm-omni v0.20.0 support this invocation natively via vllm's --omni delegation (vllm-omni#3082, vllm#40744) — no script changes needed.

Test Plan

  • pre-commit run passes on all changed files (YAML/JSON validation, workflow lint, Docker format, ruff, security scans, typos).
  • CI green on all blocking checks — gatekeeper, build-image, sanity-test, security-test (ECR vulnerability scan), telemetry-test, sagemaker-endpoint-test, and the full 9-model smoke-test matrix on both EC2 and SageMaker images.
  • Manual SageMaker async + /v1/videos/sync end-to-end validation on ml.g5.2xlarge in a separate account (PR image vllm-omni-0.20.0-...sagemaker-pr-6028): VACE-1.3B returned 45 KB MP4 in 10s, content-type video/mp4, all resources cleaned up.
  • Smoke-test matrix CI run: all 9 entries PASS on both EC2 and SageMaker — qwen3-tts-1.7b-customvoice, qwen3-tts-12hz-1.7b-base, cosyvoice3-0.5b, flux2-klein-4b, ernie-image-turbo, wan2.1-t2v-1.3b, wan2.1-t2v-1.3b-sync, wan2.1-vace-1.3b, stable-audio-open-1.0. SageMaker job also runs test_vllm_omni_video_async_endpoint.

Notes / Follow-ups

  • Host driver ≥ 580 required at runtime for CUDA 13. For hosts with older drivers, start the container with -e VLLM_ENABLE_CUDA_COMPATIBILITY=1.
  • sccache cache miss on first build — the CUDA 13 / vLLM 0.20 combination has no prior sccache entries, so the first build does a full source compile. Subsequent builds reuse the cache.
  • Wan2.2-I2V-A14B is pre-staged at s3://dlc-cicd-models/omni-models/wan2.2-i2v-a14b.tar.gz (107 GB) but the YAML entry is commented out: enabling needs (a) g6e.12xl capacity in us-west-2 and (b) extending the smoke-test harness with an image_s3 fetch path analogous to the existing ref_audio_s3 pattern.
  • Benchmark thresholds untouched in this PR. Thresholds live on the benchmark branch and will be re-baselined now that 0.20.0 final is in.
  • Docs intentionally untouched (docs/releasenotes/vllm-omni/*, docs/src/data/vllm-omni/*, docs/vllm-omni/index.md) — follow-up PR will add 0.20.0 entries alongside the 0.18.0 ones.
  • Diffusers CVE re-evaluation — once safetensors 0.8.0 final ships and upstream vllm-omni#3349 merges, the allowlist entry for GHSA-98h9-4798-4q5v should be removed and the Dockerfile should pin diffusers>=0.38.0.

- CUDA 12.9.1 → 13.0.2 (matches upstream vllm v0.20.0 default)
- vllm 0.18.0 → 0.20.0
- vllm-omni 0.18.0 → 0.20.0rc1 (release candidate; install with
  --prerelease=allow)
- flashinfer 0.6.6 → 0.6.8.post1
- torch_cuda_arch_list aligned with upstream:
  '7.5 8.0 8.6 8.9 9.0 10.0 11.0 12.0+PTX'
- nvcc_threads 2 → 8
- runai-model-streamer >= 0.15.3 → 0.15.7
- Add numactl / numactl-libs / numactl-devel in runtime stage
- Add VLLM_ENABLE_CUDA_COMPATIBILITY=0 env (settable to 1 at runtime
  for hosts with older NVIDIA drivers)
- Drop sox system dep (vllm-omni v0.20.0rc1 removed sox from its deps)
- Bump --stage-init-timeout in model-tests workflow 600 → 900 to
  match vllm-omni benchmark default after the stage CLI refactor
- Image config files: cuda_version cu129 → cu130, prod_image 0.18 → 0.20

Entrypoint scripts (scripts/vllm/omni_*) unchanged — 'vllm serve --omni'
still works in 0.20.0 via upstream vllm's --omni delegation.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei and others added 22 commits May 5, 2026 10:30
Upstream vllm v0.20.0 moved requirements/build.txt → requirements/build/cuda.txt
(the root build.txt no longer exists, causing 'uv pip install -r
requirements/build.txt' to fail with exit code 2).

Mirror upstream by installing cuda.txt and build/cuda.txt in two separate
RUN steps. The base vllm Dockerfile does the same split across its 'base'
and 'csrc-build' stages.

Fixes the build failure:

  ERROR: failed to solve: process "/bin/sh -c uv pip install -r
  requirements/cuda.txt -r requirements/build.txt ..." did not complete
  successfully: exit code: 2

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…sibling DLC

Port hardening changes from the sibling docker/vllm/Dockerfile.amzn2023
and match upstream vllm v0.20.0 patterns for FlashInfer install:

- Pre-built wheel stage: add 'rm -rf dist' before recreating the dist/
  directory so stale artifacts from a previous build layer can't leak in.
- SETUPTOOLS_SCM_PRETEND_VERSION is now overrideable via --build-arg
  (e.g. 0.20.0rc1+amzn2023.abcdef12 with git SHA), falling back to
  VLLM_VERSION+amzn2023 when not provided.
- vLLM wheel install now picks the most recent vllm-*.whl by mtime and
  echoes which wheel it installs — defends against accidental glob
  matches when multiple wheels end up in the deps/ staging dir.
- FlashInfer install matches upstream vllm v0.20.0 exactly: only
  flashinfer-jit-cache is explicitly installed (flashinfer-python and
  flashinfer-cubin are already pinned in requirements/cuda.txt),
  followed by 'flashinfer show-config && flashinfer download-cubin' to
  pre-download precompiled kernels so the first inference request does
  not pay JIT compile latency.
- Serving extras: drop hf_transfer (superseded by HF_XET_HIGH_PERFORMANCE
  below) and match upstream's package set.
- Switch HF Hub acceleration from HF_HUB_ENABLE_HF_TRANSFER=1 to
  HF_XET_HIGH_PERFORMANCE=1, matching the sibling vllm Dockerfile and
  HF's direction of travel.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
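The "most recent wheel by mtime" selection described in the commit above can be sketched in Python as follows. The Dockerfile itself does this in shell; the function name `newest_wheel` and the directory argument are illustrative assumptions.

```python
import glob
import os


def newest_wheel(deps_dir: str, pattern: str = "vllm-*.whl") -> str:
    """Return the most recently modified wheel matching pattern in deps_dir."""
    candidates = glob.glob(os.path.join(deps_dir, pattern))
    if not candidates:
        raise FileNotFoundError(f"no wheel matching {pattern!r} in {deps_dir}")
    chosen = max(candidates, key=os.path.getmtime)
    # Echo the choice so the build log records which wheel was installed.
    print(f"Installing wheel: {chosen}")
    return chosen
```

Picking a single file explicitly, rather than passing the raw glob to the installer, is what guards against several wheels in the staging directory being installed at once.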
Pair with the Dockerfile change that made SETUPTOOLS_SCM_PRETEND_VERSION
overrideable via --build-arg. Compute the wheel version tag here so it
gets auto-forwarded to docker buildx and encodes the pinned VLLM_REF
(tag slug or commit SHA prefix) alongside VLLM_VERSION for traceability.

Mirrors the same pattern in docker/vllm/versions.env.

Example for the current pin:
  VLLM_REF=v0.20.0
  VLLM_VERSION=0.20.0
  -> SETUPTOOLS_SCM_PRETEND_VERSION=0.20.0+amzn2023.v0_20_0

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
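As a rough illustration of the version-tag computation in the commit above, the sketch below derives the tag from VLLM_VERSION and VLLM_REF. The real logic lives in versions.env as shell; the slug rule (non-alphanumerics become underscores), the 40-character SHA check, and the 8-character prefix length are assumptions chosen to reproduce the examples given in this PR.

```python
import re


def pretend_version(vllm_version: str, vllm_ref: str, sha_prefix_len: int = 8) -> str:
    """Build a SETUPTOOLS_SCM_PRETEND_VERSION-style string.

    A full 40-character commit SHA is shortened to a prefix; any other ref
    (such as a tag) is slugged by replacing non-alphanumeric runs with "_".
    """
    if re.fullmatch(r"[0-9a-f]{40}", vllm_ref):
        slug = vllm_ref[:sha_prefix_len]
    else:
        slug = re.sub(r"[^0-9A-Za-z]+", "_", vllm_ref).strip("_")
    return f"{vllm_version}+amzn2023.{slug}"


print(pretend_version("0.20.0", "v0.20.0"))  # 0.20.0+amzn2023.v0_20_0
```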
The blanket 'dnf upgrade --security' in the final stages only picks up
fixes from AL2023's security-advisory channel. The cuda-compat-13-0 fix
ships in NVIDIA's CUDA repo, which doesn't emit AL2023 advisory metadata,
so --security misses it, leaving the HIGH CVE unpatched in the scanned image.

Mirror the pattern already used in docker/base/v2/Dockerfile and
docker/pytorch/Dockerfile.cuda: add an explicit
'dnf upgrade -y --releasever latest cuda-compat-13-0' step in both
final stages (EC2 + SageMaker).

Patches: CVE-2025-33219 (cuda-compat-13-0 580.95.05 -> 1:580.126.09-1.amzn2023)
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Exercises the new /v1/audio/generate route introduced in upstream
vllm-omni v0.20.0 via PR vllm-project/vllm-omni#1794.

Validated on the 0.20.0 DLC image (CUDA 13, vllm 0.20.0):
  - Model loads in ~29s on 1x L4 (24GB VRAM, 3GB peak post-load)
  - /v1/audio/generate returns HTTP 200 in ~7s with a 5s audio_length
  - Response is valid WAV (PCM 16-bit stereo 44.1kHz, ~860KB for 5s)
  - Payload matches the example in the upstream PR description.

Artifact pre-staged at:
  s3://dlc-cicd-models/omni-models/stable-audio-open-1.0.tar.gz (14.3 GB)

Fleet x86-g6xl-runner (L4, compute 8.9) is used instead of g6exl because
the model only needs ~3GB VRAM at runtime.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
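The WAV figures in the commit above are easy to sanity-check: 5 s × 44,100 samples/s × 2 channels × 2 bytes per sample = 882,000 bytes of PCM data, roughly 861 KiB, which lines up with the reported ~860KB. Below is a minimal sketch of such a check using Python's standard wave module. The helper names are illustrative and this is not the assertion the smoke-test script actually performs.

```python
import io
import wave


def expected_pcm_bytes(seconds: float, sample_rate: int = 44100,
                       channels: int = 2, sample_width: int = 2) -> int:
    """Size of the raw PCM payload, excluding the WAV header."""
    return int(seconds * sample_rate * channels * sample_width)


def describe_wav(data: bytes) -> dict:
    """Parse WAV bytes and report the properties the commit message lists."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_16bit_stereo_44k(data: bytes) -> bool:
    """True when the payload is PCM 16-bit stereo at 44.1 kHz."""
    info = describe_wav(data)
    return (info["channels"], info["sample_width_bytes"], info["sample_rate"]) == (2, 2, 44100)
```

`wave.open` raises `wave.Error` on payloads it cannot parse as WAV, so a caller can treat that exception as an "invalid response" result.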
…TTS models

Adds qwen3-tts-12hz-1.7b-base and cosyvoice3-0.5b to the smoke-test
matrix. Both are voice-clone TTS models that require a base64-encoded
reference audio in the request, which previously could not be expressed
in the YAML test_request field (~450KB of base64 would exceed shell
argument limits).

Changes:
- Smoke-test scripts accept '@/path/to/file' in the request argument so
  the workflow can hand in large, preprocessed payloads without hitting
  the ~128KB bash argv limit.
- Workflow grows a 'Prepare test request' step that detects ref_audio_s3
  in the YAML test_request, fetches the wav from S3 on the runner
  (same IAM path the existing download-model action uses), base64-
  encodes it, substitutes ref_audio, and docker-cps the result into
  the container for the smoke-test script to read.
- CosyVoice3 is zero-shot voice-clone only; its reference fixture is
  mirrored from upstream tests/assets/cosyvoice3/zero_shot_prompt.wav
  to s3://dlc-cicd-models/test-fixtures/audio/cosyvoice3_ref.wav.
- qwen3-tts-12hz-1.7b-base reuses the already-staged tts_ref_vivian.wav
  that the benchmark suite uses; ref_text MUST match the audio transcript
  exactly (upstream issue vllm-project/vllm-omni#3124).

Existing entries with literal test_request strings keep working
unchanged (no @ prefix => script treats argument as literal body).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
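The ref_audio_s3 substitution in the 'Prepare test request' step can be sketched as below. The workflow does this in shell on the runner; here the S3 fetch is left to the caller, and two things are assumptions: that test_request is a JSON object, and that ref_audio carries plain base64 rather than, say, a data URI.

```python
import base64
import json


def inline_ref_audio(test_request: str, audio_bytes: bytes) -> str:
    """Swap a ref_audio_s3 pointer for an inline base64 ref_audio field.

    audio_bytes is the already-downloaded reference wav. Requests without
    ref_audio_s3 are returned untouched so literal entries keep working.
    """
    payload = json.loads(test_request)
    if "ref_audio_s3" not in payload:
        return test_request
    del payload["ref_audio_s3"]
    payload["ref_audio"] = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps(payload)
```

Base64 inflates the payload by about a third, which is why a reference clip of a few hundred kilobytes is enough to exceed the argv limit mentioned above.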
Previous change (00bce6f) solved the argv-limit problem between the
workflow and the smoke-test script but re-introduced it one step later:
the script still ran 'curl -d "${REQUEST}"' with REQUEST holding the
~450KB JSON, which fails with /usr/bin/curl: Argument list too long.

Keep the body in a file throughout:
- If invoked with @file, use that path directly.
- Otherwise write the literal body to a temp file.
- curl now uses '-d @${REQUEST_FILE}' for JSON and reads the urlencoded
  pairs from the file line-by-line for multipart form-data.

Fixes voice-clone TTS smoke tests (qwen3-tts-12hz-1.7b-base and
cosyvoice3-0.5b).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
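The "keep the body in a file throughout" rule from the commit above amounts to the following, shown here as a Python sketch rather than the shell the scripts actually use. The function name `resolve_request_file` and the temp-file naming are illustrative assumptions.

```python
import os
import tempfile


def resolve_request_file(request_arg: str) -> str:
    """Return a file path holding the request body.

    "@/path/to/file" is used as-is; anything else is treated as a literal
    body and written to a temp file, so the body never travels through argv.
    """
    if request_arg.startswith("@"):
        path = request_arg[1:]
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
        return path
    fd, path = tempfile.mkstemp(prefix="smoke_request_", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(request_arg)
    return path
```

Either branch yields a path that can be handed to the HTTP client as a file reference, the same role REQUEST_FILE plays in the curl invocation.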
- versions.env / Dockerfile: vllm-omni 0.20.0rc1 → 0.20.0, drop --prerelease=allow
- model-tests.yml: add wan2.1-t2v-1.3b-sync, wan2.1-vace-1.3b, ernie-image-turbo
  on x86-g6exl-runner; pre-stage commented wan2.2-i2v-a14b for follow-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Unblocks pre-commit on PRs that merge into main: ruff-format reflow plus
E731 (lambda → def) in find_match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The merge of main into omni-0.20.0 left 24 unresolved conflict markers
in scripts/autocurrency/agent-fix.py, breaking pre-commit. Restore our
ruff-formatted copy from da00c60 — same content as our prior fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…te_code bypass)

diffusers>=0.38.0 fixes the CVE but transitively requires safetensors>=0.8.0-rc.0
which uv/pip skip by default (only safetensors 0.8.0rc0 published; 0.7.0 is
latest stable). Upstream vllm-omni#3349 tracks the bump and is waiting on
safetensors 0.8.0 final. DLC loads only pre-staged S3 models we control and
does not pass trust_remote_code=True from user code paths, so exploit surface
is narrow. Re-evaluate once safetensors 0.8.0 final ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Document that we depend on vllm v0.20.0 still wiring args.middleware through
to FastAPI app.add_middleware in build_app. vllm-omni v0.20.0's "delegate to
upstream entrypoint" rebase (#3082, #3232) preserves this; if a future
upstream change drops the loop, this loader silently no-ops and SageMaker
/invocations returns 404 for non-default routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Adds test_vllm_omni_video_async_endpoint covering the recommended production
pattern for video on SageMaker: AsyncInferenceConfig + /v1/videos/sync via the
custom-attributes routing middleware (which auto-converts JSON to
multipart/form-data for FORM_DATA_ROUTES). Uses Wan-AI/Wan2.1-VACE-1.3B-diffusers
on ml.g6e.xlarge (32 GB RAM avoids the host-RAM OOM that 16 GB instances hit
during HF model load).

Skips on SageMaker capacity errors (ResourceLimitExceeded /
InsufficientInstanceCapacity / CapacityError) so the rest of the matrix still
gets signal when newer GPU families are unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
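One way to express the capacity-skip decision from the commit above is a small classifier over the three error names it lists. This is a sketch only: matching on substrings of the exception text is an assumption, and the real fixture may instead inspect structured error codes from the SageMaker client.

```python
# Error names taken from the commit message above.
CAPACITY_ERROR_MARKERS = (
    "ResourceLimitExceeded",
    "InsufficientInstanceCapacity",
    "CapacityError",
)


def is_capacity_error(exc: BaseException) -> bool:
    """True when the exception looks like a SageMaker capacity or quota failure."""
    text = f"{type(exc).__name__}: {exc}"
    return any(marker in text for marker in CAPACITY_ERROR_MARKERS)
```

Inside a pytest fixture, the deploy call would be wrapped in try/except, calling `pytest.skip(...)` when `is_capacity_error(exc)` is true and re-raising otherwise, so genuine failures still fail the test.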
…xes)

origin/main #6065 refactored agent-fix.py to use the GitHub API for
structured failure extraction and tweaked the system prompt format. Pull
the upstream version verbatim so future merges from main don't conflict
with our prior local ruff fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Drops the duplicated _build_async_endpoint helper (added in 2393207) and
puts the SageMaker capacity-skip logic inside the async_endpoint fixture
itself. Both async tests (TTS and video) now share the same fixture body
and both gain the capacity-skip behavior. Net: 144 LOC deleted, 83 added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
# Conflicts:
#	scripts/autocurrency/agent-fix.py

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…per.so

CVE-2026-33811 (cgo DNS resolver double-free on long CNAME) and CVE-2026-39820
(net/mail and time.ParseDate CPU/memory exhaustion). Both are go/stdlib
issues vendored into mooncake/libetcd_wrapper.so; same unpatchable-without-
mooncake-upgrade situation as the 5 existing mooncake entries above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
ml.g6e.xlarge (the previous default) had quota=0 in our SageMaker accounts,
causing the capacity-skip path to permanently swallow this test. Validated
ml.g5.2xlarge end-to-end on 2026-05-11: 45 KB MP4 returned in 10s, peak GPU
memory comfortably below A10G's 24 GB ceiling, and 32 GB host RAM avoids
the OOM-during-HF-load that bit us on 16 GB instances. A10G uses PyTorch
SDPA fallback for diffusion attention (no FA3 dependency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
CVE-2026-42499 (RFC 5322 consumePhrase DoS), CVE-2026-39836 (Windows Dial/
LookupPort NUL panic), CVE-2026-33814 (HTTP/2 SETTINGS_MAX_FRAME_SIZE=0
infinite CONTINUATION loop). All vendored into mooncake/libetcd_wrapper.so;
same unpatchable-without-mooncake-upgrade situation as the existing entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…large

- image configs: prod_image renamed to vllm:omni-cuda-v1 / vllm:omni-sagemaker-cuda-v1
- cosyvoice3-0.5b smoke test: x86-g6xl-runner (16 GB RAM) → x86-g6exl-runner
  (32 GB RAM). Last green run was 2026-05-07 on vllm-omni 0.20.0rc1;
  --trust-remote-code load on g6.xlarge started SIGKILL'ing the host docker
  exec under 0.20.0 final. The bigger box matches our pattern for other
  diffusion-pipeline models (wan2.1-vace, ernie-image-turbo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei changed the title feat(vllm-omni): bump to 0.20.0rc1 and align with upstream vllm v0.20.0 feat(vllm-omni): bump to v0.20.0 and expand smoke-test matrix May 11, 2026
Increments the dlc_minor_version Dockerfile label to mark the 0.20.0 final
release (with new smoke-test entries, async video endpoint test, and CVE
allowlist updates) as a new image revision distinct from the initial 0.20.0
build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei enabled auto-merge (squash) May 12, 2026 17:38
@Yadan-Wei Yadan-Wei merged commit 9120403 into main May 12, 2026
48 checks passed
@junpuf junpuf deleted the omni-0.20.0 branch May 12, 2026 17:45