Fix "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint"#1794

Merged
linyueqian merged 46 commits into
vllm-project:mainfrom
ekagra-ranjan:er-stable-audio-online
Apr 28, 2026

Conversation

@ekagra-ranjan
Contributor

@ekagra-ranjan ekagra-ranjan commented Mar 10, 2026

This re-adds #1255, which was reverted in #1789.

The issue comes from _Unsupported being added to _DiffusionServingModels. It was added because OpenAIServing previously read attributes that the _DiffusionServingModels class does not define, for example self.model_config = self.models.model_config, which would fail even though the value is not used later on. Hence, I added _Unsupported so that such assignments don't fail during init but fail loudly if the value is accessed. The init failure still occurs on vllm v0.16 but is fixed in vllm v0.17.

However, other checks in vllm-omni rely on hasattr() to see whether _DiffusionServingModels defines an attribute, so that undefined attributes are never used. These checks break because hasattr() calls getattr(), which succeeds and returns the _Unsupported placeholder, and _Unsupported is truthy.
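A minimal sketch of that failure mode, assuming a sentinel-style placeholder. The class names below are illustrative stand-ins, not the actual vllm-omni definitions:

```python
class Unsupported:
    """Hypothetical placeholder: present as an attribute, fails loudly when used."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __getattr__(self, item: str):
        # Only reached for attributes that normal lookup does not find.
        raise NotImplementedError(f"{self._name} is not supported (accessed .{item})")


class ModelsWithPlaceholder:
    # The attribute exists, so hasattr() reports True.
    model_config = Unsupported("model_config")


class ModelsPlain:
    # The attribute is simply absent, so hasattr() reports False.
    pass


def can_use_model_config(models: object) -> bool:
    # A hasattr()-style guard followed by a truthiness check.
    return hasattr(models, "model_config") and bool(models.model_config)
```

With the placeholder in place the guard passes even though the value is unusable, whereas leaving the attribute undefined lets the same guard do its job.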

Recent vllm code now assigns these attributes in OpenAIServing from self.engine_client instead of self.models, so the _Unsupported approach can be safely removed as of v0.17. For it to work with v0.16, we still need:

self.model_config = self._NullModelConfig()
self.renderer = None
self.input_processor = None
self.io_processor = None

This is the commit which fixes it: eb765ee

The previously failing test pytest -sv tests/e2e/online_serving/test_images_generations_lora.py::test_images_generations_per_request_lora_switching now passes.

Example commands to run Stable Audio online:

# online
vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni
    
# v1/audio/generate
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a dog barking",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42
    }' --output dog.wav
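
The same request can be sketched in Python using only the standard library. The helper names are hypothetical, and the sketch assumes the server returns the raw audio bytes in the response body, as the curl `--output` usage suggests:

```python
import json
import urllib.request


def build_audio_request(base_url: str, prompt: str, audio_length: float = 15.0,
                        seed: int = 42) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/generate."""
    payload = {
        "input": prompt,
        "audio_length": audio_length,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": seed,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_audio(request: urllib.request.Request, path: str) -> int:
    """Send the request and write the response body to `path`; returns bytes written."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        return out.write(response.read())
```

For example, `save_audio(build_audio_request("http://localhost:8000", "The sound of a dog barking"), "dog.wav")` mirrors the curl call above once the server is running.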

ekagra-ranjan and others added 28 commits February 6, 2026 17:26
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
…_type code across stage config loading. Avoid inplace change in default sampling arg

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan ekagra-ranjan changed the title Er stable audio online Fix "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" Mar 10, 2026
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@hsliuustc0106
Collaborator

Please fix the conflicts. In general, it looks good to me.

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@ekagra-ranjan
Contributor Author

@hsliuustc0106 - the conflicts have been resolved again. Please take a look at your earliest convenience!

@linyueqian linyueqian added this to the v0.20.0 milestone Apr 21, 2026
Match the port convention used by the Qwen3-TTS online serving example
(8091) across the stable_audio curl/python examples, README, and the
new audio_generate_api / text_to_audio docs. Also register
serving/audio_generate_api.md under the OpenAI-Compatible API nav
section so the page shows up in the site navigation (the Examples
nav is auto-generated by generate_examples.py and did not need
manual wiring).

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Collaborator

@linyueqian linyueqian left a comment


Pushed a small follow-up (8091 port to match the Qwen3-TTS example, and registered serving/audio_generate_api.md under the OpenAI-Compatible API nav). The for_diffusion() routing concern from my earlier review is addressed — under is_pure_diffusion the engine_client is the diffusion engine and the OpenAIServing.__init__ attributes it touches are satisfied (same pattern as OmniOpenAIServingVideo.for_diffusion). LGTM.

@linyueqian linyueqian enabled auto-merge (squash) April 21, 2026 21:27
@ekagra-ranjan
Contributor Author

@linyueqian thank you!
The CI failed on one task with this error: "Job never started (agent lost) (because there was a problem provisioning infrastructure to run the job)"

https://buildkite.com/vllm/vllm-omni/builds/7547/steps/canvas?sid=019db1f8-f1d7-4b2f-a377-fa025bfc8a43&tab=output

Looks harmless?

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian merged commit 4c2bea7 into vllm-project:main Apr 28, 2026
6 of 8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 29, 2026
…dio/generate endpoint" (vllm-project#1794)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: Yueqian Lin <linyueqian@outlook.com>
xiaohajiayou pushed a commit to xiaohajiayou/vllm-omni that referenced this pull request Apr 30, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
BeatSeat pushed a commit to BeatSeat/vllm-omni that referenced this pull request May 2, 2026
sphinxkkkbc pushed a commit to sphinxkkkbc/vllm-omni that referenced this pull request May 4, 2026
Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
Yadan-Wei pushed a commit to aws/deep-learning-containers that referenced this pull request May 7, 2026
Exercises the new /v1/audio/generate route introduced in upstream
vllm-omni v0.20.0 via PR vllm-project/vllm-omni#1794.

Validated on the 0.20.0 DLC image (CUDA 13, vllm 0.20.0):
  - Model loads in ~29s on 1x L4 (24GB VRAM, 3GB peak post-load)
  - /v1/audio/generate returns HTTP 200 in ~7s with a 5s audio_length
  - Response is valid WAV (PCM 16-bit stereo 44.1kHz, ~860KB for 5s)
  - Payload matches the example in the upstream PR description.

Artifact pre-staged at:
  s3://dlc-cicd-models/omni-models/stable-audio-open-1.0.tar.gz (14.3 GB)

Fleet x86-g6xl-runner (L4, compute 8.9) is used instead of g6exl because
the model only needs ~3GB VRAM at runtime.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
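
The reported response size can be sanity-checked with a little arithmetic. This is a sketch that assumes a canonical 44-byte RIFF/WAV header; real files may carry extra chunks, so the figure is approximate:

```python
def pcm_wav_bytes(seconds: float, sample_rate: int = 44_100, channels: int = 2,
                  bits_per_sample: int = 16, header_bytes: int = 44) -> int:
    """Expected size of an uncompressed PCM WAV file (header size is an assumption)."""
    bytes_per_frame = channels * bits_per_sample // 8
    return header_bytes + int(seconds * sample_rate) * bytes_per_frame


size = pcm_wav_bytes(5.0)
print(size, round(size / 1024))  # 882044 861
```

That is roughly 861 KiB for 5 seconds of 16-bit stereo audio at 44.1 kHz, in line with the ~860KB noted in the commit message.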
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Yadan-Wei added a commit to aws/deep-learning-containers that referenced this pull request May 12, 2026
* feat(vllm-omni): bump to 0.20.0rc1 and align with upstream vllm v0.20.0

- CUDA 12.9.1 → 13.0.2 (matches upstream vllm v0.20.0 default)
- vllm 0.18.0 → 0.20.0
- vllm-omni 0.18.0 → 0.20.0rc1 (release candidate; install with
  --prerelease=allow)
- flashinfer 0.6.6 → 0.6.8.post1
- torch_cuda_arch_list aligned with upstream:
  '7.5 8.0 8.6 8.9 9.0 10.0 11.0 12.0+PTX'
- nvcc_threads 2 → 8
- runai-model-streamer >= 0.15.3 → 0.15.7
- Add numactl / numactl-libs / numactl-devel in runtime stage
- Add VLLM_ENABLE_CUDA_COMPATIBILITY=0 env (settable to 1 at runtime
  for hosts with older NVIDIA drivers)
- Drop sox system dep (vllm-omni v0.20.0rc1 removed sox from its deps)
- Bump --stage-init-timeout in model-tests workflow 600 → 900 to
  match vllm-omni benchmark default after the stage CLI refactor
- Image config files: cuda_version cu129 → cu130, prod_image 0.18 → 0.20

Entrypoint scripts (scripts/vllm/omni_*) unchanged — 'vllm serve --omni'
still works in 0.20.0 via upstream vllm's --omni delegation.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* fix(vllm-omni): split cuda/build requirements install for vllm v0.20.0

Upstream vllm v0.20.0 moved requirements/build.txt → requirements/build/cuda.txt
(the root build.txt no longer exists, causing 'uv pip install -r
requirements/build.txt' to fail with exit code 2).

Mirror upstream by installing cuda.txt and build/cuda.txt in two separate
RUN steps. The base vllm Dockerfile does the same split across its 'base'
and 'csrc-build' stages.

Fixes the build failure:

  ERROR: failed to solve: process "/bin/sh -c uv pip install -r
  requirements/cuda.txt -r requirements/build.txt ..." did not complete
  successfully: exit code: 2

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* refactor(vllm-omni): align Dockerfile with upstream vllm v0.20.0 and sibling DLC

Port hardening changes from the sibling docker/vllm/Dockerfile.amzn2023
and match upstream vllm v0.20.0 patterns for FlashInfer install:

- Pre-built wheel stage: add 'rm -rf dist' before recreating the dist/
  directory so stale artifacts from a previous build layer can't leak in.
- SETUPTOOLS_SCM_PRETEND_VERSION is now overrideable via --build-arg
  (e.g. 0.20.0rc1+amzn2023.abcdef12 with git SHA), falling back to
  VLLM_VERSION+amzn2023 when not provided.
- vLLM wheel install now picks the most recent vllm-*.whl by mtime and
  echoes which wheel it installs — defends against accidental glob
  matches when multiple wheels end up in the deps/ staging dir.
- FlashInfer install matches upstream vllm v0.20.0 exactly: only
  flashinfer-jit-cache is explicitly installed (flashinfer-python and
  flashinfer-cubin are already pinned in requirements/cuda.txt),
  followed by 'flashinfer show-config && flashinfer download-cubin' to
  pre-download precompiled kernels so the first inference request does
  not pay JIT compile latency.
- Serving extras: drop hf_transfer (superseded by HF_XET_HIGH_PERFORMANCE
  below) and match upstream's package set.
- Switch HF Hub acceleration from HF_HUB_ENABLE_HF_TRANSFER=1 to
  HF_XET_HIGH_PERFORMANCE=1, matching the sibling vllm Dockerfile and
  HF's direction of travel.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
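
The "most recent wheel by mtime" selection described above could look like the following in Python. The Dockerfile itself presumably does this in shell; the function name and default pattern here are illustrative assumptions:

```python
from pathlib import Path


def newest_wheel(directory: str, pattern: str = "vllm-*.whl") -> Path:
    """Return the most recently modified file matching `pattern` in `directory`."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    # Pick by modification time rather than trusting glob order.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Selecting explicitly by mtime, and failing when nothing matches, avoids silently installing whichever wheel the glob happens to list first.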

* chore(vllm-omni): compute SETUPTOOLS_SCM_PRETEND_VERSION in versions.env

Pair with the Dockerfile change that made SETUPTOOLS_SCM_PRETEND_VERSION
overrideable via --build-arg. Compute the wheel version tag here so it
gets auto-forwarded to docker buildx and encodes the pinned VLLM_REF
(tag slug or commit SHA prefix) alongside VLLM_VERSION for traceability.

Mirrors the same pattern in docker/vllm/versions.env.

Example for the current pin:
  VLLM_REF=v0.20.0
  VLLM_VERSION=0.20.0
  -> SETUPTOOLS_SCM_PRETEND_VERSION=0.20.0+amzn2023.v0_20_0

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
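
A sketch of how such a version string could be derived. The slug rule (non-alphanumerics replaced by `_`) and the 8-character SHA prefix are assumptions inferred from the two examples in these commit messages, not the actual versions.env logic:

```python
import re


def pretend_version(vllm_version: str, vllm_ref: str) -> str:
    """Compose '<version>+amzn2023.<ref-slug>' for SETUPTOOLS_SCM_PRETEND_VERSION."""
    if re.fullmatch(r"[0-9a-f]{40}", vllm_ref):
        # Assumption: a full commit SHA is shortened to its first 8 characters.
        slug = vllm_ref[:8]
    else:
        # Assumption: tags are slugged by replacing non-alphanumerics with '_'.
        slug = re.sub(r"[^0-9A-Za-z]", "_", vllm_ref)
    return f"{vllm_version}+amzn2023.{slug}"


print(pretend_version("0.20.0", "v0.20.0"))  # 0.20.0+amzn2023.v0_20_0
```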

* fix(vllm-omni): explicit cuda-compat-13-0 upgrade for CVE-2025-33219

The blanket 'dnf upgrade --security' in the final stages only picks up
fixes from AL2023's security-advisory channel. The cuda-compat-13-0 fix
ships in NVIDIA's CUDA repo, which doesn't emit AL2023 advisory metadata,
so --security misses it, leaving the HIGH CVE unpatched in the scanned image.

Mirror the pattern already used in docker/base/v2/Dockerfile and
docker/pytorch/Dockerfile.cuda: add an explicit
'dnf upgrade -y --releasever latest cuda-compat-13-0' step in both
final stages (EC2 + SageMaker).

Patches: CVE-2025-33219 (cuda-compat-13-0 580.95.05 -> 1:580.126.09-1.amzn2023)
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* feat(vllm-omni): add stable-audio-open-1.0 to smoke-test matrix

Exercises the new /v1/audio/generate route introduced in upstream
vllm-omni v0.20.0 via PR vllm-project/vllm-omni#1794.

Validated on the 0.20.0 DLC image (CUDA 13, vllm 0.20.0):
  - Model loads in ~29s on 1x L4 (24GB VRAM, 3GB peak post-load)
  - /v1/audio/generate returns HTTP 200 in ~7s with a 5s audio_length
  - Response is valid WAV (PCM 16-bit stereo 44.1kHz, ~860KB for 5s)
  - Payload matches the example in the upstream PR description.

Artifact pre-staged at:
  s3://dlc-cicd-models/omni-models/stable-audio-open-1.0.tar.gz (14.3 GB)

Fleet x86-g6xl-runner (L4, compute 8.9) is used instead of g6exl because
the model only needs ~3GB VRAM at runtime.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* feat(vllm-omni): support ref_audio_s3 in smoke-test, add voice-clone TTS models

Adds qwen3-tts-12hz-1.7b-base and cosyvoice3-0.5b to the smoke-test
matrix. Both are voice-clone TTS models that require a base64-encoded
reference audio in the request, which previously could not be expressed
in the YAML test_request field (~450KB of base64 would exceed shell
argument limits).

Changes:

- Smoke-test scripts accept '@/path/to/file' in the request argument so
  the workflow can hand in large, preprocessed payloads without hitting
  the ~128KB bash argv limit.
- Workflow grows a 'Prepare test request' step that detects ref_audio_s3
  in the YAML test_request, fetches the wav from S3 on the runner
  (same IAM path the existing download-model action uses),
  base64-encodes it, substitutes ref_audio, and docker-cps the result
  into the container for the smoke-test script to read.
- CosyVoice3 is zero-shot voice-clone only; its reference fixture is
  mirrored from upstream tests/assets/cosyvoice3/zero_shot_prompt.wav
  to s3://dlc-cicd-models/test-fixtures/audio/cosyvoice3_ref.wav.
- qwen3-tts-12hz-1.7b-base reuses the already-staged tts_ref_vivian.wav
  that the benchmark suite uses; ref_text MUST match the audio transcript
  exactly (upstream issue vllm-project/vllm-omni#3124).

Existing entries with literal test_request strings keep working
unchanged (no @ prefix => script treats argument as literal body).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
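
The substitution step described above can be sketched as follows. The S3 download is left out (the audio bytes are passed in), and the sketch assumes ref_audio carries plain base64 text; the exact encoding the workflow emits is not stated in the commit message:

```python
import base64
import json


def inline_reference_audio(test_request: str, wav_bytes: bytes) -> str:
    """Replace a `ref_audio_s3` pointer with base64 audio under `ref_audio`."""
    body = json.loads(test_request)
    if "ref_audio_s3" not in body:
        # Literal requests pass through unchanged.
        return test_request
    del body["ref_audio_s3"]
    # Assumption: the server accepts plain base64 text in `ref_audio`.
    body["ref_audio"] = base64.b64encode(wav_bytes).decode("ascii")
    return json.dumps(body)
```

Base64 inflates the payload by about a third, which is part of why the encoded reference audio ends up too large to pass as a shell argument.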

* fix(vllm-omni): pass large request body to curl via file, not argv

Previous change (00bce6f) solved the argv-limit problem between the
workflow and the smoke-test script but re-introduced it one step later:
the script still ran 'curl -d "${REQUEST}"' with REQUEST holding the
~450KB JSON, which fails with /usr/bin/curl: Argument list too long.

Keep the body in a file throughout:

- If invoked with @file, use that path directly.
- Otherwise write the literal body to a temp file.
- curl now uses '-d @${REQUEST_FILE}' for JSON and reads the urlencoded
  pairs from the file line-by-line for multipart form-data.

Fixes voice-clone TTS smoke tests (qwen3-tts-12hz-1.7b-base and
cosyvoice3-0.5b).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
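
The "keep the body in a file" rule can be sketched like this. The real smoke-test script is a shell script; this Python version with a hypothetical function name only illustrates the branching:

```python
import os
import tempfile


def resolve_request_file(request_arg: str) -> str:
    """Return a path to a file holding the request body."""
    if request_arg.startswith("@"):
        # '@/path/to/file' means the body is already on disk.
        return request_arg[1:]
    # Otherwise treat the argument as the literal body and spill it to a temp file.
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(request_arg)
    return path
```

The returned path is what would then be handed to curl as `-d @<path>`, so the body never appears on any command line.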

* feat(vllm-omni): bump to v0.20.0 final and expand smoke-test matrix

- versions.env / Dockerfile: vllm-omni 0.20.0rc1 → 0.20.0, drop --prerelease=allow
- model-tests.yml: add wan2.1-t2v-1.3b-sync, wan2.1-vace-1.3b, ernie-image-turbo
  on x86-g6exl-runner; pre-stage commented wan2.2-i2v-a14b for follow-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* chore: fix ruff lint+format on scripts/autocurrency/agent-fix.py

Unblocks pre-commit on PRs that merge into main: ruff-format reflow plus
E731 (lambda → def) in find_match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* fix: resolve agent-fix.py merge-conflict markers from main merge

The merge of main into omni-0.20.0 left 24 unresolved conflict markers
in scripts/autocurrency/agent-fix.py, breaking pre-commit. Restore our
ruff-formatted copy from da00c60 — same content as our prior fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* chore(vllm-omni): allowlist GHSA-98h9-4798-4q5v (diffusers trust_remote_code bypass)

diffusers>=0.38.0 fixes the CVE but transitively requires safetensors>=0.8.0-rc.0
which uv/pip skip by default (only safetensors 0.8.0rc0 published; 0.7.0 is
latest stable). Upstream vllm-omni#3349 tracks the bump and is waiting on
safetensors 0.8.0 final. DLC loads only pre-staged S3 models we control and
does not pass trust_remote_code=True from user code paths, so exploit surface
is narrow. Re-evaluate once safetensors 0.8.0 final ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): note --middleware contract in SageMaker proxy module

Document that we depend on vllm v0.20.0 still wiring args.middleware through
to FastAPI app.add_middleware in build_app. vllm-omni v0.20.0's "delegate to
upstream entrypoint" rebase (#3082, #3232) preserves this; if a future
upstream change drops the loop, this loader silently no-ops and SageMaker
/invocations returns 404 for non-default routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* test(vllm-omni): add async SageMaker endpoint test for video generation

Adds test_vllm_omni_video_async_endpoint covering the recommended production
pattern for video on SageMaker: AsyncInferenceConfig + /v1/videos/sync via the
custom-attributes routing middleware (which auto-converts JSON to
multipart/form-data for FORM_DATA_ROUTES). Uses Wan-AI/Wan2.1-VACE-1.3B-diffusers
on ml.g6e.xlarge (32 GB RAM avoids the host-RAM OOM that 16 GB instances hit
during HF model load).

Skips on SageMaker capacity errors (ResourceLimitExceeded /
InsufficientInstanceCapacity / CapacityError) so the rest of the matrix still
gets signal when newer GPU families are unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
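
One way to express the capacity check, assuming the error code appears in the exception's type name or message. The helper name is hypothetical, and the actual test may inspect a structured botocore error instead:

```python
CAPACITY_ERROR_MARKERS = (
    "ResourceLimitExceeded",
    "InsufficientInstanceCapacity",
    "CapacityError",
)


def is_capacity_error(exc: BaseException) -> bool:
    """True when the exception text names one of the capacity error codes."""
    message = f"{type(exc).__name__}: {exc}"
    return any(marker in message for marker in CAPACITY_ERROR_MARKERS)
```

Inside a pytest fixture, a True result would typically lead to `pytest.skip(...)`, while any other exception is re-raised so that genuine failures stay visible.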

* chore: sync agent-fix.py with main (structured GitHub API + prompt fixes)

origin/main #6065 refactored agent-fix.py to use the GitHub API for
structured failure extraction and tweaked the system prompt format. Pull
the upstream version verbatim so future merges from main don't conflict
with our prior local ruff fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* refactor(vllm-omni): inline capacity-skip into async_endpoint fixture

Drops the duplicated _build_async_endpoint helper (added in 2393207) and
puts the SageMaker capacity-skip logic inside the async_endpoint fixture
itself. Both async tests (TTS and video) now share the same fixture body
and both gain the capacity-skip behavior. Net: 144 LOC deleted, 83 added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* chore(vllm-omni): allowlist 2 go/stdlib CVEs in mooncake/libetcd_wrapper.so

CVE-2026-33811 (cgo DNS resolver double-free on long CNAME) and CVE-2026-39820
(net/mail and time.ParseDate CPU/memory exhaustion). Both are go/stdlib
issues vendored into mooncake/libetcd_wrapper.so; same unpatchable-without-
mooncake-upgrade situation as the 5 existing mooncake entries above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* test(vllm-omni): switch video async endpoint test to ml.g5.2xlarge

ml.g6e.xlarge (the previous default) had quota=0 in our SageMaker accounts,
causing the capacity-skip path to permanently swallow this test. Validated
ml.g5.2xlarge end-to-end on 2026-05-11: 45 KB MP4 returned in 10s, peak GPU
memory comfortably below A10G's 24 GB ceiling, and 32 GB host RAM avoids
the OOM-during-HF-load that bit us on 16 GB instances. A10G uses PyTorch
SDPA fallback for diffusion attention (no FA3 dependency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* chore(vllm-omni): allowlist 3 more mooncake go/stdlib CVEs

CVE-2026-42499 (RFC 5322 consumePhrase DoS), CVE-2026-39836 (Windows Dial/
LookupPort NUL panic), CVE-2026-33814 (HTTP/2 SETTINGS_MAX_FRAME_SIZE=0
infinite CONTINUATION loop). All vendored into mooncake/libetcd_wrapper.so;
same unpatchable-without-mooncake-upgrade situation as the existing entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* ci(vllm-omni): rename prod_image tags + bump cosyvoice fleet to g6e.xlarge

- image configs: prod_image renamed to vllm:omni-cuda-v1 / vllm:omni-sagemaker-cuda-v1
- cosyvoice3-0.5b smoke test: x86-g6xl-runner (16 GB RAM) → x86-g6exl-runner
  (32 GB RAM). Last green run was 2026-05-07 on vllm-omni 0.20.0rc1;
  --trust-remote-code load on g6.xlarge started SIGKILL'ing the host docker
  exec under 0.20.0 final. The bigger box matches our pattern for other
  diffusion-pipeline models (wan2.1-vace, ernie-image-turbo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* chore(vllm-omni): bump DLC_MINOR_VERSION 0 → 1

Increments the dlc_minor_version Dockerfile label to mark the 0.20.0 final
release (with new smoke-test entries, async video endpoint test, and CVE
allowlist updates) as a new image revision distinct from the initial 0.20.0
build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

---------

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
Yadan-Wei pushed a commit to aws/deep-learning-containers that referenced this pull request May 12, 2026
…d benchmark suite

Token-counting fix for chat_omni_benchmark_client.py
----------------------------------------------------
The client now reads `metrics.num_tokens_out` from each SSE chunk —
the vllm-omni engine-side counter — matching upstream
vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
This is version-stable, unlike the previous fallbacks:
  * usage.completion_tokens (OpenAI standard) — omni reports 0
  * len(token_times) (chunk count) — swings ~50× between 0.18.0 and 0.20.0
    due to SSE batching changes (158 -> 3 on identical config)
Both are kept as second/third-preference fallbacks. README and YAML
comments updated to reflect the stable metric.
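
The preference order can be sketched as below, assuming each SSE chunk has already been JSON-decoded into a dict and that the counters are nested as `metrics.num_tokens_out` and `usage.completion_tokens`. The function name is illustrative, not the client's actual code:

```python
def output_token_count(chunk: dict, token_times: list) -> int:
    """Pick the output-token count using the preference order described above."""
    # 1) Engine-side counter carried in the chunk's metrics.
    engine_count = (chunk.get("metrics") or {}).get("num_tokens_out")
    if engine_count:
        return int(engine_count)
    # 2) OpenAI-style usage field (omni reports 0, hence the truthiness test).
    usage_count = (chunk.get("usage") or {}).get("completion_tokens")
    if usage_count:
        return int(usage_count)
    # 3) Last resort: the number of chunks received.
    return len(token_times)
```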

New benchmark entries (reuse existing clients)
----------------------------------------------
  cosyvoice3-0.5b      tts-base    x86-g6exl-runner
  ernie-image-turbo    image       x86-g6exl-runner
  wan2.1-vace-1.3b     video       x86-g6exl-runner

All three have thresholds intentionally unset; baseline on first run
and tighten with the standard ~25% CI margin.

New benchmark client for stable-audio-open-1.0
----------------------------------------------
audio_generate_benchmark_client.py targets /v1/audio/generate (new
endpoint in vllm-omni v0.20.0 per vllm-project/vllm-omni#1794). Uses
the same async machinery, WAV duration parser, and metric set as
tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput),
but the request shape is disjoint (audio_length, guidance_scale,
num_inference_steps, seed, negative_prompt) so a separate client is
cleaner than overloading the TTS one. New `audio-generate`
benchmark_type wired into the dispatcher; threshold validators reuse
the tts/tts-base branch since metric names match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan-Wei added a commit to aws/deep-learning-containers that referenced this pull request May 13, 2026
…ICE workaround for qwen2.5-omni-3b (#6079)

* fix(vllm-omni): adjust benchmark thresholds for upstream 0.20.0 changes

qwen3-tts-12hz-1.7b-base: temporarily loosen rps/audio_rtf_mult/p95_e2e
to absorb the upstream Code2Wav decode-chunk un-batching regression
from vllm-omni#3203. Fix is merged as vllm-omni#3485 post-0.20.0; will
re-tighten when next omni point release is picked up.

qwen2.5-omni-3b: drop min_output_tps. The SSE event stream changed in
0.20.0 (vllm-omni#3082 delegation to upstream vllm OpenAI entrypoint)
so the benchmark client now counts text tokens only (~95/req) instead
of text + codec frames (~5656/req in 0.18.0). The metric is no longer
comparable across versions; rps / ttft p95 / e2e p95 cover the
user-facing SLO without ambiguity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* docs(vllm-omni): correct min_output_tps comment based on devbox capture

Earlier comment cited vllm-omni#3082 as the SSE-format change that
caused the metric to swing across versions. Source-code review of
serving_chat.py at v0.18.0 vs v0.20.0 plus a devbox SSE capture
showed:

- Both versions still place audio in delta.content per yield (no
  documented change to chat-completions SSE shape).
- Server reports usage.completion_tokens=0 in the streamed [DONE]
  block on 0.18.0; the benchmark client therefore falls back to
  len(token_times) (a chunk count).
- Under concurrent load the per-chunk emit pattern shifts between
  releases enough to swing the value by ~50x (158 -> 3) on identical
  config, even though RPS / TTFT / e2e are unchanged.

Replace the #3082 attribution with the verified explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* feat(vllm-omni): align chat-omni token counting with upstream + expand benchmark suite

Token-counting fix for chat_omni_benchmark_client.py
----------------------------------------------------
The client now reads `metrics.num_tokens_out` from each SSE chunk —
the vllm-omni engine-side counter — matching upstream
vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
This is version-stable, unlike the previous fallbacks:
  * usage.completion_tokens (OpenAI standard) — omni reports 0
  * len(token_times) (chunk count) — swings ~50× between 0.18.0 and 0.20.0
    due to SSE batching changes (158 -> 3 on identical config)
Both are kept as second/third-preference fallbacks. README and YAML
comments updated to reflect the stable metric.

New benchmark entries (reuse existing clients)
----------------------------------------------
  cosyvoice3-0.5b      tts-base    x86-g6exl-runner
  ernie-image-turbo    image       x86-g6exl-runner
  wan2.1-vace-1.3b     video       x86-g6exl-runner

All three have thresholds intentionally unset; baseline on first run
and tighten with the standard ~25% CI margin.

New benchmark client for stable-audio-open-1.0
----------------------------------------------
audio_generate_benchmark_client.py targets /v1/audio/generate (new
endpoint in vllm-omni v0.20.0 per vllm-project/vllm-omni#1794). Uses
the same async machinery, WAV duration parser, and metric set as
tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput),
but the request shape is disjoint (audio_length, guidance_scale,
num_inference_steps, seed, negative_prompt) so a separate client is
cleaner than overloading the TTS one. New `audio-generate`
benchmark_type wired into the dispatcher; threshold validators reuse
the tts/tts-base branch since metric names match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* feat(vllm-omni): baseline 4 thresholds, route qwen2.5-omni-3b around g6e12xl ICE

Threshold baselining (2026-05-12, vllm-omni 0.20.0)
---------------------------------------------------
First-run numbers + ~25% CI margin applied to four previously-open entries:

  cosyvoice3-0.5b           rps 0.348  rtf 2.119  p95 e2e 15639ms
                             -> min_rps 0.26 / min_audio_rtf_mult 1.6 / max_p95_e2e_ms 20000
  stable-audio-open-1.0     rps 0.141  rtf 0.706  p95 e2e 7167ms
                             -> min_rps 0.10 / min_audio_rtf_mult 0.5 / max_p95_e2e_ms 9500
  ernie-image-turbo         images/s 0.067  p95 e2e 17573ms
                             -> min_images_per_s 0.05 / max_p95_e2e_ms 22000
  wan2.1-vace-1.3b          videos/s 0.332  p95 e2e 3010ms
                             -> min_videos_per_s 0.25 / max_p95_e2e_ms 4000

All 9 benchmark entries now carry pass/fail thresholds.
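
The margin arithmetic behind these thresholds can be sketched as follows. The committed values are rounded by hand (for example 0.261 to 0.26 and 19549 to 20000), so this shows only the raw ~25% adjustment:

```python
def floor_threshold(measured: float, margin: float = 0.25) -> float:
    """Lower bound for 'higher is better' metrics such as rps."""
    return measured * (1.0 - margin)


def ceiling_threshold(measured: float, margin: float = 0.25) -> float:
    """Upper bound for 'lower is better' metrics such as p95 latency."""
    return measured * (1.0 + margin)


print(round(floor_threshold(0.348), 3))  # cosyvoice3-0.5b rps: 0.261
print(round(ceiling_threshold(15639)))   # cosyvoice3-0.5b p95 e2e in ms: 19549
```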

ICE workaround for qwen2.5-omni-3b
----------------------------------
The x86-g6e12xl-runner CodeBuild fleet (4x L40S 192 GB) has been ICE in
us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark.

Mirror the SGLang/vLLM pattern of supporting both CodeBuild fleets and
k8s-backed runner-scale-sets:

- Add `benchmark.runner-scale-sets:` group in vllm-omni-model-tests.yml
  alongside `benchmark.codebuild-fleet:`. Move qwen2.5-omni-3b there
  with runner_label `gpu-l40s-4gpu-runners` (same 4x L40S hardware).
- Expand the runner-scale-sets comment block to list all 9 available
  k8s scale-set labels and their hardware mappings.
- Extend dispatch-vllm-omni-benchmark.yml's load-benchmarks parser to
  emit both matrices.
- Add a parallel `benchmark-runner-scale` job that uses
  `runs-on: ${{ matrix.runner_label }}` (no `fleet:` selector), pins
  GPU access to the pod's assigned UUIDs so parallel pods don't
  contend, and skips `docker rmi` since the host Docker daemon is
  shared across pods.
- `benchmark-report` now waits for both benchmark jobs.

Same hardware class (4x L40S 192 GB) so qwen2.5-omni-3b's existing
thresholds (min_rps 0.02, max_p95_ttft_ms 1500, max_p95_e2e_ms 120000)
do not need to change.
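
A rough sketch of a parser that emits both matrices, assuming the YAML has already been loaded into a dict. The two group names and the `runner_label` value come from the commit message. The entry layout, the `fleet` value, and the output key names are hypothetical and do not reflect the real schema of vllm-omni-model-tests.yml.

```python
import json


def build_matrices(config: dict) -> dict:
    """Split benchmark entries into one job matrix per runner backend."""
    benchmark = config.get("benchmark", {})
    return {
        "codebuild": {"include": list(benchmark.get("codebuild-fleet", []))},
        "runner_scale": {"include": list(benchmark.get("runner-scale-sets", []))},
    }


# Hypothetical, trimmed-down stand-in for the parsed test config.
config = {
    "benchmark": {
        "codebuild-fleet": [
            {"name": "stable-audio-open-1.0", "fleet": "example-fleet"},
        ],
        "runner-scale-sets": [
            {"name": "qwen2.5-omni-3b", "runner_label": "gpu-l40s-4gpu-runners"},
        ],
    }
}

matrices = build_matrices(config)
print(json.dumps(matrices["runner_scale"]))
```

Emitting an empty `include` list for a missing group lets a workflow skip that job rather than fail while parsing.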

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* fix(vllm-omni): docker cp model into container on runner-scale-sets

The previous benchmark-runner-scale step bind-mounted /dlc-models from
the runner pod into the container via `-v /dlc-models:/models`. On a
k8s-backed scale-set, the host docker daemon resolves bind-mount paths
on the host filesystem, not the pod filesystem — so the container sees
an empty /models, vllm-omni's omni_snapshot_download falls through to
its HuggingFace path, and crashes with:

  huggingface_hub.errors.HFValidationError:
  Repo id must be in the form 'repo_name' or 'namespace/repo_name':
  '/models/qwen2.5-omni-3b'

Mirror the SGLang runner-scale pattern: start the container with
`--entrypoint /bin/bash` so it idles instead of immediately invoking
`vllm serve` with a bad path, `docker cp` the model from the pod
filesystem into the container, then launch the server via
`docker exec -d`. The host-side health check on localhost:8080 still
works because `-p 8080:8080` is unchanged.
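
The sequence described above, as of this commit, can be sketched as an ordered list of docker commands. This is an illustration rather than the workflow's actual step: the image and container names are placeholders, `sleep infinity` is an assumed way to keep the bash entrypoint idle, the `mkdir` step and the `--port` value are assumptions, and the GPU flags are omitted.

```python
def runner_scale_commands(image: str, container: str, model_dir: str) -> list[list[str]]:
    """Return the docker commands, in order, without executing them."""
    model_name = model_dir.rstrip("/").split("/")[-1]
    return [
        # 1. Start an idle container. Overriding the entrypoint keeps the
        #    image from launching the server before the model is in place.
        #    (GPU selection flags are left out of this sketch.)
        ["docker", "run", "-d", "--name", container,
         "-p", "8080:8080", "--entrypoint", "/bin/bash", image,
         "-c", "sleep infinity"],
        # 2. Copy the model from the pod filesystem into the container.
        #    `docker cp` reads from the CLI side, so the pod path resolves.
        ["docker", "exec", container, "mkdir", "-p", "/models"],
        ["docker", "cp", model_dir, f"{container}:/models/{model_name}"],
        # 3. Launch the server detached inside the container.
        ["docker", "exec", "-d", container, "vllm", "serve",
         f"/models/{model_name}", "--omni", "--port", "8080"],
    ]


for cmd in runner_scale_commands("example-image:latest", "omni-bench",
                                 "/dlc-models/qwen2.5-omni-3b"):
    print(" ".join(cmd))
```

Building the commands as data first makes the ordering easy to check before anything is handed to `subprocess.run`.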

CodeBuild fleet jobs are unaffected — they continue to bind-mount the
runner's /dlc-models since the CodeBuild docker daemon is on the same
host as the runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

* fix(vllm-omni): run runner-scale benchmark inside container, not from host

The previous fix (2d61941) docker-cp'd the model into the container but
still ran the dispatcher (vllm_omni_benchmark_test.sh) from the runner pod,
so its `curl http://localhost:8080/health` polled the runner's loopback,
not the container's. On a k8s scale-set the runner pod and the docker host
sit in separate network namespaces, so even with `-p 8080:8080` the runner
can't reach the published port. Result: a 600s health-check timeout and
exit 1, with no server logs to explain it, since the docker-exec'd server
was still healthy inside the container.

Mirror the SGLang scale-set path end-to-end:
- docker cp test/vllm-omni/scripts into /workspace/scripts in the
  container at start time
- Launch `vllm serve --omni ...` via `docker exec -d` and redirect output
  to /workspace/server.log
- Run the dispatcher itself via `docker exec` so all networking
  (curl /health, the per-modality benchmark client) uses the container's
  own loopback. Results are written to /workspace/benchmark_results inside
  the container, then docker-cp'd out for upload-artifact.
- Drop the now-unused `-p 8080:8080` publish (host can't reach it anyway).

CodeBuild fleet jobs are unaffected — they continue to bind-mount and
poll localhost:8080 since the runner and the docker daemon share the
same host network there.
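
A minimal sketch of a health wait like the one the dispatcher performs, written so the probe runs wherever the function runs. Started through `docker exec`, `localhost` then resolves to the container's own loopback. The function names and the 5s interval are assumptions; the 600s timeout and the /health path on port 8080 come from the commit message.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> Callable[[], bool]:
    """Return a probe that reports True when `url` answers with HTTP 200."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # covers URLError, refused connections, timeouts
            return False
    return probe


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

Inside the container this would be called as `wait_until_healthy(http_probe("http://localhost:8080/health"))`. The injectable `sleep` and `clock` exist only so the timeout path can be exercised without waiting.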

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

---------

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>