feat(omni): add Cosmos3 support to vLLM-Omni backend #10132
Changes from 4 commits
ebe6779
b9b9ca3
22812d0
7744835
0034bee
001eacb
22d56b9
2c48064
271214e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,163 @@ | ||
| --- | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| title: Cosmos3 | ||
| --- | ||
|
|
||
| Run NVIDIA's **Cosmos3** omni model through Dynamo's | ||
| [vLLM-Omni backend](vllm-omni.md) for **text-to-image**, **text-to-video**, and | ||
| **image-to-video** generation. | ||
|
|
||
| Cosmos3 is a unified world foundation model (WFM) for Physical AI, built on a | ||
| Mixture-of-Transformers (MoT) architecture. A single `Cosmos3OmniTransformer` | ||
| runs a Qwen-style "understanding" stream alongside a "generation" stream | ||
| joined by a 3D multimodal RoPE, replacing the separate Predict / Reason / | ||
| Transfer models from earlier Cosmos releases. See the | ||
| [Cosmos World Foundation Model Platform paper](https://huggingface.co/papers/2501.03575) | ||
| for the architectural background, and the | ||
| [diffusers Cosmos3 reference](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos3) for the underlying pipeline. | ||
|
|
||
| Cosmos3 support in Dynamo is provided by the native vLLM-Omni pipeline added in | ||
| [vllm-project/vllm-omni#3454](https://github.com/vllm-project/vllm-omni/pull/3454). | ||
|
|
||
| ## Checkpoints | ||
|
|
||
| Both checkpoints share the same `Cosmos3OmniPipeline` class and Dynamo flags; | ||
| swap the model identifier on the worker (`--model …`) and in request payloads. | ||
|
|
||
| | Checkpoint | Description | HF Hub | | ||
| |------------|-------------|--------| | ||
| | `nvidia/Cosmos3-Nano` | Smaller, faster — default in this repo's launch scripts | [link](https://huggingface.co/nvidia/Cosmos3-Nano) | | ||
| | `nvidia/Cosmos3-Super` | Larger, higher quality | [link](https://huggingface.co/nvidia/Cosmos3-Super) | | ||
|
Comment on lines +31 to +32 (Contributor):

Use descriptive link labels for checkpoint URLs. markdownlint-cli2 (0.22.1) reports a warning on each line: [warning] 31-31 and [warning] 32-32: Link text should be descriptive (MD059, descriptive-link-text).

Fix the checkpoint links that currently fail docs link-check CI. The current Hugging Face checkpoint URLs are failing lychee with 401, which blocks docs checks. Please switch these to URLs that pass CI (or update the docs-link-check allowlist for these exact domains/statuses).
||
|
|
||
| ## Supported modalities | ||
|
|
||
| | Task | Endpoint | `--output-modalities` | | ||
| |------|----------|-----------------------| | ||
| | Text-to-Image | `/v1/images/generations` | `image` | | ||
| | Text-to-Video | `/v1/videos` | `video` | | ||
| | Image-to-Video | `/v1/videos` (with `input_reference`) | `video` | | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| This guide builds on the [vLLM-Omni backend guide](vllm-omni.md) — see it for general setup, `etcd`/`nats`, and OpenAI-endpoint details. | ||
|
|
||
| ### Installation | ||
|
|
||
| This branch carries Dynamo code changes (the Cosmos3 worker flags and image | ||
| output handling) on top of a pinned vLLM-Omni, so run Dynamo **from source on | ||
| this branch** — a released `ai-dynamo` wheel will not include the integration. | ||
|
|
||
| 1. Clone and check out the branch: | ||
|
|
||
| ```bash | ||
| git clone https://github.com/ai-dynamo/dynamo.git | ||
| cd dynamo | ||
| git checkout cosmos3-omni-integration | ||
| ``` | ||
|
|
||
| 2. Create a Python 3.12 environment: | ||
|
|
||
| ```bash | ||
| uv venv --python 3.12 --seed | ||
| source .venv/bin/activate | ||
| ``` | ||
|
|
||
| 3. Build and install Dynamo from source (the branch's Cosmos3 code must be | ||
| live, and the Rust core `ai-dynamo-runtime` isn't published for this dev | ||
| version, so it has to be built locally). See | ||
| [Building from source](../../getting-started/building-from-source.md) for | ||
| prerequisites (Rust toolchain, system deps); the key steps from the repo root: | ||
|
|
||
| ```bash | ||
| uv pip install pip maturin | ||
| (cd lib/bindings/python && maturin develop --uv) # builds ai-dynamo-runtime | ||
| uv pip install -e lib/gpu_memory_service | ||
| uv pip install -e ".[vllm]" # also pulls vllm==0.21.0 | ||
| ``` | ||
|
|
||
| 4. Install the Cosmos3-capable vLLM-Omni, pinned to the PR commit (its dynamic | ||
| `setup.py` pulls the matching pipeline deps — `diffusers==0.38`, `torchsde`, | ||
| `x-transformers`): | ||
|
|
||
| ```bash | ||
| uv pip install "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@e826f626afb47c8c3c39ccf892ed247f442f6bd2" | ||
| ``` | ||
|
|
||
| 5. Start etcd and NATS: | ||
|
|
||
| ```bash | ||
| docker compose -f dev/docker-compose.yml up -d | ||
| ``` | ||
|
|
||
| ## Serve | ||
|
|
||
| Quick start — each script launches the frontend on `:8000` plus a | ||
| single-modality worker and prints a sample request: | ||
|
|
||
| ```bash | ||
| examples/backends/vllm/launch/agg_omni_cosmos3_image.sh # text-to-image | ||
| examples/backends/vllm/launch/agg_omni_cosmos3_video.sh # text-to-video | ||
| examples/backends/vllm/launch/agg_omni_cosmos3_i2v.sh # image-to-video | ||
| ``` | ||
|
|
||
| Manual launch: | ||
|
|
||
| ```bash | ||
| python -m dynamo.frontend --http-port 8000 & | ||
|
|
||
| python -m dynamo.vllm.omni \ | ||
| --model nvidia/Cosmos3-Nano \ | ||
| --output-modalities image \ # or: video | ||
| --no-cosmos3-guardrails \ # skip loading the safety guardrail models | ||
| --media-output-fs-url file:///tmp/dynamo_media | ||
|
Comment on lines +112 to +114 (Contributor):

The multiline shell example is not copy/paste-safe. The inline comments after line-continuation backslashes break the command. Move those comments to separate lines (or provide separate command variants) so the snippet executes as documented.
||
| ``` | ||
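Regarding the review comment on the manual-launch snippet: once a `#` comment follows the backslash, the backslash no longer sits at the end of the line, so the continuation is lost and the remaining flags run as a separate command. One possible copy/paste-safe form, sketched below, builds the command as a bash array, where comments between elements are allowed. It uses only the flags shown in the snippet; the `worker_cmd` variable name is illustrative and not part of the PR.

```bash
# Sketch: keep per-flag comments by building the worker command as an array.
# Comments are allowed between array elements, unlike after a trailing "\".
worker_cmd=(
  python -m dynamo.vllm.omni
  --model nvidia/Cosmos3-Nano
  --output-modalities image   # or: video
  --no-cosmos3-guardrails     # skip loading the safety guardrail models
  --media-output-fs-url file:///tmp/dynamo_media
)

# Print the assembled command for inspection before running it.
printf '%s ' "${worker_cmd[@]}"; echo

# Launch it in the background once the frontend is running:
# "${worker_cmd[@]}" &
```

The launch line is left commented out so the block can be pasted safely on a machine without Dynamo installed; uncomment it to start the worker.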
|
|
||
| Cosmos3-specific flags: | ||
|
|
||
| | Flag | Purpose | | ||
| |------|---------| | ||
| | `--no-cosmos3-guardrails` | Disable the Cosmos3 text/video safety guardrails (otherwise loaded at startup). | | ||
| | `--flow-shift <float>` | Scheduler flow-shift (image default `3.0`). Launch-time only — not a per-request image parameter. | | ||
| | `--media-output-fs-url file://<dir>` | Destination for media when `response_format: "url"`. | | ||
|
|
||
| ## Requests | ||
|
|
||
| ### Text-to-image | ||
|
|
||
| Run from the repo root; `cosmos3/t2i.json` is the official Cosmos3 t2i payload | ||
| (prompt verbatim) mapped to the Dynamo request schema: | ||
|
|
||
| ```bash | ||
| curl -s -X POST http://localhost:8000/v1/images/generations \ | ||
| -H 'Content-Type: application/json' \ | ||
| --data-binary @examples/backends/vllm/launch/cosmos3/t2i.json \ | ||
| | jq -r '.data[0].b64_json' | base64 -d > out.png | ||
| ``` | ||
|
|
||
| - `size` must be one of `256x256`, `512x512`, `1024x1024`, `1792x1024`, | ||
| `1024x1792`, `1536x1024`, `1024x1536`, `auto` — the payload uses `1024x1024` | ||
| (the official `960x960` is not an allowed image size). | ||
| - Put `num_inference_steps`, `guidance_scale`, `seed`, and `negative_prompt` | ||
| under `nvext` — top-level values are ignored. | ||
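The two constraints above (allowed `size` values, sampling parameters nested under `nvext`) can be illustrated with a hand-written payload. This is a hypothetical sketch, not the contents of `cosmos3/t2i.json`: the prompt and all numeric values are placeholders, and the `prompt` and `model` field names are assumed to follow the OpenAI images request shape.

```bash
# Hypothetical text-to-image payload; values are placeholders, not the
# official t2i.json.
SIZE="1024x1024"

# Check the size against the allowed set listed above before writing anything.
case "$SIZE" in
  256x256|512x512|1024x1024|1792x1024|1024x1792|1536x1024|1024x1536|auto) ;;
  *) echo "unsupported size: $SIZE" >&2; exit 1 ;;
esac

PAYLOAD="$(mktemp)"
cat > "$PAYLOAD" <<EOF
{
  "model": "nvidia/Cosmos3-Nano",
  "prompt": "a robot arm stacking red blocks on a workbench",
  "size": "$SIZE",
  "nvext": {
    "num_inference_steps": 30,
    "guidance_scale": 5.0,
    "seed": 42,
    "negative_prompt": "blurry, low quality"
  }
}
EOF
echo "wrote $PAYLOAD"

# To send it (needs a running frontend and jq):
# curl -s -X POST http://localhost:8000/v1/images/generations \
#   -H 'Content-Type: application/json' \
#   --data-binary @"$PAYLOAD" | jq -r '.data[0].b64_json' | base64 -d > out.png
```

Keeping the sampling parameters inside `nvext` matters because, per the note above, the same keys at the top level are ignored.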
|
|
||
| ### Text-to-video | ||
|
|
||
| ```bash | ||
| curl -s http://localhost:8000/v1/videos \ | ||
| -H 'Content-Type: application/json' \ | ||
| --data-binary @examples/backends/vllm/launch/cosmos3/t2v.json | jq | ||
| ``` | ||
|
|
||
| The official `t2v.json` payload is `1280x720`, `192` frames @ `24` fps (8s). | ||
|
|
||
| ### Image-to-video | ||
|
|
||
| `i2v.json` adds `input_reference` (the official `vision_path` — an http URL; | ||
| local paths are rejected, use an http(s) URL or a `data:` base64 URI): | ||
|
|
||
| ```bash | ||
| curl -s http://localhost:8000/v1/videos \ | ||
| -H 'Content-Type: application/json' \ | ||
| --data-binary @examples/backends/vllm/launch/cosmos3/i2v.json | jq | ||
| ``` | ||
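Because local paths are rejected for `input_reference`, a local image has to be converted to a `data:` URI first. The helper below is a minimal sketch of that conversion; the function name `to_data_uri`, the `IMG` variable, and the fixed `image/png` MIME type are assumptions for illustration, and the rest of the i2v payload is not shown here.

```bash
# Sketch: encode a local image as a data: URI for use as input_reference.
# IMG and the image/png MIME type are placeholders; adjust for a real file.
IMG="${IMG:-first_frame.png}"

to_data_uri() {
  # base64 may wrap its output; strip newlines so the URI stays on one line.
  printf 'data:image/png;base64,%s' "$(base64 < "$1" | tr -d '\n')"
}

if [ -f "$IMG" ]; then
  uri="$(to_data_uri "$IMG")"
  echo "${uri:0:60}..."   # preview only; the full URI can be very long
fi
```

The resulting string would replace the http URL in the `input_reference` field of the request body.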
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| #!/bin/bash | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Aggregated Cosmos3 image-to-video generation (1 GPU). | ||
| # Same worker as text-to-video (registers the "video" modality); i2v is driven | ||
| # by adding "input_reference" to the /v1/videos request. The image loader | ||
| # rejects local file paths — pass a data: URI (base64) or an http(s) URL. | ||
| # --no-cosmos3-guardrails skips loading the safety guardrail models. | ||
|
|
||
| set -e | ||
| trap 'echo Cleaning up...; kill 0' EXIT | ||
|
|
||
| SCRIPT_DIR="$(dirname "$(readlink -f "$0")")" | ||
| source "$SCRIPT_DIR/../../../common/gpu_utils.sh" | ||
| source "$SCRIPT_DIR/../../../common/launch_utils.sh" | ||
|
|
||
| MODEL="nvidia/Cosmos3-Nano" | ||
|
|
||
| # Parse command line arguments | ||
| EXTRA_ARGS=() | ||
| while [[ $# -gt 0 ]]; do | ||
| case $1 in | ||
| --model) | ||
| MODEL="$2" | ||
| shift 2 | ||
| ;; | ||
| *) | ||
| EXTRA_ARGS+=("$1") | ||
| shift | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| HTTP_PORT="${DYN_HTTP_PORT:-8000}" | ||
| GPU_MEM_ARGS=$(build_vllm_gpu_mem_args) | ||
| print_launch_banner --no-curl "Launching vLLM-Omni Cosmos3 Image-to-Video (1 GPU)" "$MODEL" "$HTTP_PORT" | ||
| print_curl_footer <<CURL | ||
| # Official Cosmos3 image-to-video payload (prompt + vision_path verbatim). | ||
| # input_reference must be an http(s) URL or a data: URI (local paths are rejected). | ||
| curl -s http://localhost:${HTTP_PORT}/v1/videos \\ | ||
| -H 'Content-Type: application/json' \\ | ||
| --data-binary @${SCRIPT_DIR}/cosmos3/i2v.json | jq | ||
| CURL | ||
|
|
||
|
|
||
| python -m dynamo.frontend & | ||
| FRONTEND_PID=$! | ||
|
|
||
| sleep 2 | ||
|
Contributor:

Remove fixed readiness sleep and use shared health-check orchestration.

As per coding guidelines, launch scripts should “Avoid readiness sleeps/polls; rely on the shared framework health-check patterns instead.”
||
|
|
||
| echo "Starting Omni worker..." | ||
| DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \ | ||
| python -m dynamo.vllm.omni \ | ||
| --model "$MODEL" \ | ||
| --output-modalities video \ | ||
| --no-cosmos3-guardrails \ | ||
| --media-output-fs-url file:///tmp/dynamo_media \ | ||
| $GPU_MEM_ARGS \ | ||
| "${EXTRA_ARGS[@]}" & | ||
|
|
||
| # Exit on first worker failure; kill 0 in the EXIT trap tears down the rest | ||
| wait_any_exit | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| #!/bin/bash | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Aggregated Cosmos3 text-to-image generation (1 GPU). | ||
| # Uses the native vLLM-Omni Cosmos3 pipeline; --no-cosmos3-guardrails skips | ||
| # loading the safety guardrail models. A worker serves a single modality, so | ||
| # this script registers the "image" modality (see agg_omni_cosmos3_video.sh | ||
| # for text-to-video). | ||
|
|
||
| set -e | ||
| trap 'echo Cleaning up...; kill 0' EXIT | ||
|
|
||
| SCRIPT_DIR="$(dirname "$(readlink -f "$0")")" | ||
| source "$SCRIPT_DIR/../../../common/launch_utils.sh" | ||
|
|
||
|
Comment on lines +15 to +16 (Contributor):

Align this launcher with shared vLLM GPU-memory utilities. This script skips [...]. As per coding guidelines, launchers should source [...]. Also applies to: 50-57.

Shellcheck (0.11.0) [info] 15-15: Not following: ./../../../common/launch_utils.sh was not specified as input (see shellcheck -x). (SC1091)
||
| MODEL="nvidia/Cosmos3-Nano" | ||
|
|
||
| # Parse command line arguments | ||
| EXTRA_ARGS=() | ||
| while [[ $# -gt 0 ]]; do | ||
| case $1 in | ||
| --model) | ||
| MODEL="$2" | ||
| shift 2 | ||
| ;; | ||
| *) | ||
| EXTRA_ARGS+=("$1") | ||
| shift | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| HTTP_PORT="${DYN_HTTP_PORT:-8000}" | ||
| print_launch_banner --no-curl "Launching vLLM-Omni Cosmos3 Image Generation (1 GPU)" "$MODEL" "$HTTP_PORT" | ||
| print_curl_footer <<CURL | ||
| # Official Cosmos3 text-to-image payload (prompt verbatim) | ||
| curl -s -X POST http://localhost:${HTTP_PORT}/v1/images/generations \\ | ||
| -H 'Content-Type: application/json' \\ | ||
| --data-binary @${SCRIPT_DIR}/cosmos3/t2i.json \\ | ||
| | jq -r '.data[0].b64_json' | base64 -d > t2i.png | ||
| CURL | ||
|
|
||
|
|
||
| python -m dynamo.frontend & | ||
| FRONTEND_PID=$! | ||
|
|
||
| sleep 2 | ||
|
Contributor:

Replace fixed startup sleep with framework readiness handling.

As per coding guidelines, launch scripts should “Avoid readiness sleeps/polls; rely on the shared framework health-check patterns instead.”
||
|
|
||
| echo "Starting Omni worker..." | ||
| DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \ | ||
| python -m dynamo.vllm.omni \ | ||
| --model "$MODEL" \ | ||
| --output-modalities image \ | ||
| --no-cosmos3-guardrails \ | ||
| --media-output-fs-url file:///tmp/dynamo_media \ | ||
| "${EXTRA_ARGS[@]}" & | ||
|
|
||
| # Exit on first worker failure; kill 0 in the EXIT trap tears down the rest | ||
| wait_any_exit | ||
`normalize_image_frames` collapses a `[B, F, H, W, C]` Cosmos3 array by taking `arr[0]`, so image requests with `n > 1` silently drop every generated batch after the first. Fix: preserve and flatten all leading batch/frame dimensions before converting frames to PIL images.

Suggested fix: in `components/src/dynamo/common/utils/video_utils.py`, update `normalize_image_frames` to replace the `while arr.ndim > 4: arr = arr[0]` logic with validation that the last three dimensions are `H, W, C` and `arr = arr.reshape((-1, *arr.shape[-3:]))` so all `[B, F, H, W, C]` outputs are emitted.