vllm-project · lishunyang12 · Jun 3, 2026 · May 28, 2026 · Jun 2, 2026 · Jun 2, 2026
@@ -33,7 +33,7 @@ th {
 | `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
-| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
+| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
 | `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |

@@ -36,7 +36,8 @@ recipes/
 | [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
 | [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
 | [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
-| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation | 1x H200 141GB / B300 |
+| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation, plus text-to-video with sound | 1x H200 141GB / B300 |
+| [`nvidia/Cosmos3-Super.md`](./nvidia/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) | 8x H200/H100/A100 / 2x H200 / B300 |
 | [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
 | [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
 | [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |

@@ -6,7 +6,7 @@
 
 - Vendor: NVIDIA
 - Model: `nvidia/Cosmos3-Nano`
-- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation
+- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound)
 - Mode: Online serving with the OpenAI-compatible image/video APIs, plus offline generation via the `Omni` API
 - Maintainer: Community
 
@@ -20,12 +20,17 @@ the mode is selected per request:
 - **T2V** — `POST /v1/videos/sync` with `num_frames > 1` and no reference image.
 - **I2V** — `POST /v1/videos/sync` with a reference image (`input_reference` file
   upload, or `image_reference` JSON).
+- **T2VS / I2VS** — add `generate_sound=true` (and optional `sound_duration`) to a
+  T2V/I2V `/v1/videos/sync` request to also generate synchronized audio, muxed into
+  the mp4 as AAC 48 kHz stereo. See the official model card's "Video + Audio" examples.
 
 ## References
 
 - Model card (authoritative usage + example assets): <https://huggingface.co/nvidia/Cosmos3-Nano>
 - Example inputs/outputs live in the repo's `assets/` (`example_t2v_prompt.json`,
-  `example_i2v_prompt.json`, `example_i2v_input.jpg`, `negative_prompt.json`).
+  `example_i2v_prompt.json`, `example_i2v_input.jpg`, `negative_prompt.json`;
+  audio examples: `example_t2vs_prompt.json`, `example_t2vs_output.mp4`,
+  `example_i2vs_output.mp4`).
 - Prompt upsampling (recommended for quality): the model expects JSON-upsampled
   structured prompts; see NVIDIA's `cosmos-framework` prompt-upsampling docs.
 - Pipeline: [`vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py`](../../vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py)
@@ -47,6 +52,9 @@ the mode is selected per request:
 
 #### Command
 
+Requires the `vllm-omni` package (or the `vllm/vllm-omni:cosmos3` container),
+which provides the `vllm serve … --omni` entrypoint used below.
+
 Safety guardrails are **on by default** (NVIDIA Open Model License). They load
 the **gated** `nvidia/Cosmos-1.0-Guardrail` model, so to keep them on you must:
 
@@ -116,6 +124,26 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   -F "seed=1111" \
   -F "input_reference=@/path/to/reference.jpg;type=image/jpeg" \
   -o cosmos3_i2v.mp4
+
+
+# Text-to-video with sound
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  -F "model=nvidia/Cosmos3-Nano" \
+  -F "prompt=The video opens with a view of a well-lit indoor fruit display. A robotic arm picks up a pear, an orange, and a carambola one by one, placing each into a plastic bag in a shopping cart with red handles. The video is 7.875 seconds long, 24 FPS, and 1280x720. Audio description: soft servo whirs, gentle fruit thuds, plastic bag rustling, and a faint refrigeration hum." \
+  -F "negative_prompt=blurry, distorted, low quality" \
+  -F "size=1280x720" \
+  -F "num_frames=189" \
+  -F "fps=24" \
+  -F "num_inference_steps=35" \
+  -F "guidance_scale=6.0" \
+  -F "max_sequence_length=4096" \
+  -F "flow_shift=10.0" \
+  -F "seed=0" \
+  -F "generate_sound=true" \
+  -F "sound_duration=7.875" \
+  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
+  -o cosmos3_t2v_with_sound.mp4
 ```
 
 #### Notes
@@ -134,7 +162,9 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   3:4, 9:16. Defaults: T2I 1024², 50 steps, guidance 7.0; T2V/I2V 1280×720,
   189 frames, 35 steps, guidance 6.0, `flow_shift=10.0`.
 - **Key flags / params:** `--no-guardrails` (server) or
-  `extra_params={"guardrails":false}` (per request) toggles safety;
+  `extra_params={"guardrails":false}` (per request) toggles the safety guardrails.
+  The per-request flag only takes effect when the server was launched **with**
+  guardrails enabled; it cannot re-enable them on a `--no-guardrails` server.
   `use_resolution_template` / `use_duration_template` are off by default and only
   needed when not using upsampled prompts that already encode resolution/duration.
 - **Known limitations:**
@@ -143,8 +173,8 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
     the server fails at pipeline build with a gated-repo / safety-checker error.
   - A guardrail-blocked prompt currently returns HTTP 500
     (`"Guardrail blocked prompt"`).
-  - Video + audio, and action (policy / forward- / inverse-dynamics) modalities
-    are not part of this integration yet.
+  - Action (policy / forward- / inverse-dynamics) modalities are not part of
+    this integration yet.
 
 ### 1x GPU (Offline generation)
 
@@ -170,8 +200,8 @@ def main():
         model_class_name="Cosmos3OmniDiffusersPipeline",
         trust_remote_code=True,
         enforce_eager=True,
-        # Keep guardrails on by installing cosmos-guardrail + gated-repo access;
-        # this disables them for a quick local run.
+        # Guardrails are disabled here for a quick local run. To enable them,
+        # install cosmos-guardrail, get gated-repo access, and drop model_config.
         model_config={"guardrails": False},
     )
     gen = torch.Generator(device="cpu").manual_seed(42)
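
The guardrail precedence described in the Nano notes (a server-level switch plus a per-request `extra_params` flag) can be summarized in a few lines. This is a minimal sketch of the documented behavior only; `guardrails_active` is a hypothetical helper for illustration, not a function in vllm-omni.

```python
from typing import Optional


def guardrails_active(server_enabled: bool, request_flag: Optional[bool] = None) -> bool:
    """Return whether guardrails apply to one request (illustrative only).

    server_enabled: False models a server started with --no-guardrails.
    request_flag:   the per-request extra_params "guardrails" value, if any.
    """
    if not server_enabled:
        # Per the recipe, a per-request flag cannot re-enable guardrails
        # on a --no-guardrails server.
        return False
    # Guardrails are on by default; a request may opt out with False.
    return True if request_flag is None else request_flag


print(guardrails_active(True))          # default: on
print(guardrails_active(True, False))   # per-request opt-out
print(guardrails_active(False, True))   # cannot re-enable
```

The server-level setting acts as an upper bound: a request can only narrow it, never widen it.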

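The sound example in the Nano recipe uses `num_frames=189`, `fps=24`, and `sound_duration=7.875`, which is consistent with duration = frames / fps. The sketch below derives the audio fields from the video settings, assuming `sound_duration` is meant to match the clip length; the helper names are hypothetical and only the field names come from the recipe.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds, assuming duration = num_frames / fps."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def sound_fields(num_frames: int = 189, fps: int = 24) -> dict:
    """Multipart form fields that request audio matching the clip length."""
    return {
        "num_frames": str(num_frames),
        "fps": str(fps),
        "generate_sound": "true",
        "sound_duration": str(clip_seconds(num_frames, fps)),
    }


print(sound_fields()["sound_duration"])  # 7.875
```

Each key/value pair maps onto one `-F "key=value"` argument in the curl commands.
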
@@ -0,0 +1,108 @@
+# Cosmos3-Super
+
+> Frontier 64B world model: text-to-image, text-to-video, image-to-video (+ optional audio)
+
+## Summary
+
+- Vendor: NVIDIA
+- Model: `nvidia/Cosmos3-Super` (64B; also `Cosmos3-Super-Text2Image`, `Cosmos3-Super-Image2Video`)
+- Task: T2I, T2V, I2V generation, with optional synchronized audio (video + sound)
+- Mode: Online serving with the OpenAI-compatible image/video APIs
+- Maintainer: Community
+
+## When to use this recipe
+
+Use this recipe to deploy the 64B `nvidia/Cosmos3-Super` for the highest-quality
+Cosmos3 generation. It uses the same `Cosmos3OmniDiffusersPipeline` and request
+formats as [Cosmos3-Nano](./Cosmos3-Nano.md); only the checkpoint size and the
+recommended parallelism differ. The mode is selected per request: T2I uses
+`/v1/images/generations`, T2V/I2V use `/v1/videos/sync`, and adding
+`generate_sound=true` to a video request also generates audio.
+
+## References
+
+- Model card (authoritative usage + example assets): <https://huggingface.co/nvidia/Cosmos3-Super>
+- Nano recipe (same APIs/params): [`Cosmos3-Nano.md`](./Cosmos3-Nano.md)
+- Pipeline: [`vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py`](../../vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py)
+
+## Hardware Support
+
+## GPU
+
+Requires the `vllm-omni` package (or the `vllm/vllm-omni:cosmos3` container),
+which provides the `vllm serve … --omni` entrypoint used below.
+
+### 8x H200/H100/A100 (recommended, per model card)
+
+```bash
+vllm serve nvidia/Cosmos3-Super \
+  --omni \
+  --host 0.0.0.0 --port 8000 \
+  --cfg-parallel-size 2 \
+  --ulysses-degree 4 \
+  --use-hsdp --hsdp-shard-size 8 \
+  --init-timeout 1800
+```
+
+### 2x H200 / B300 (minimum)
+
+```bash
+vllm serve nvidia/Cosmos3-Super \
+  --omni \
+  --host 0.0.0.0 --port 8000 \
+  --cfg-parallel-size 2 \
+  --use-hsdp --hsdp-shard-size 2 \
+  --init-timeout 1800
+```
+
+Guardrails are on by default and load the gated `nvidia/Cosmos-1.0-Guardrail`
+model: run `pip install cosmos-guardrail`, accept the license, and set `HF_TOKEN`.
+Add `--no-guardrails` to disable them. `--enable-layerwise-offload` reduces VRAM usage on smaller GPUs.
+
+#### Verification
+
+Requests match Nano apart from the `model` field (see [`Cosmos3-Nano.md`](./Cosmos3-Nano.md) for the full
+T2I/T2V/I2V/T2VS curl examples). Official params: `size=1280x720, num_frames=189, fps=24,
+num_inference_steps=35, guidance_scale=6.0, flow_shift=10.0, max_sequence_length=4096`.
+
+```bash
+curl http://localhost:8000/v1/models
+# T2V (official prompt assets give best quality)
+curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
+  -F "model=nvidia/Cosmos3-Super" -F "prompt=A robot arm is cleaning a plate in the kitchen" \
+  -F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
+  -F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
+  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
+  -F "seed=17" -o cosmos3_super_t2v.mp4
+
+# I2V — add an uploaded reference image
+curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
+  -F "model=nvidia/Cosmos3-Super" -F "prompt=The scene comes to life with smooth, natural motion." \
+  -F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
+  -F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
+  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
+  -F "seed=1111" -F "input_reference=@/path/to/reference.jpg;type=image/jpeg" \
+  -o cosmos3_super_i2v.mp4
+
+# T2V + sound — add generate_sound/sound_duration (output muxes AAC 48 kHz stereo)
+curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
+  -F "model=nvidia/Cosmos3-Super" -F "prompt=A robot arm is cleaning a plate in the kitchen" \
+  -F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
+  -F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
+  -F "generate_sound=true" -F "sound_duration=7.875" \
+  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
+  -F "seed=17" -o cosmos3_super_t2vs.mp4
+```
+
+#### Notes
+
+- **Measured (2x B300, bf16, guardrails off, official 2-GPU config above):**
+  - T2I 1024², 50 steps → **~6 s**
+  - T2V 1280×720, 189 frames, 35 steps → **~197 s**
+  - I2V 1280×720, 189 frames, 35 steps → **~200 s**
+  - T2V + sound (189 frames, 35 steps) → **~198 s**, output muxes **AAC 48 kHz stereo**
+  - (NVIDIA's reference: 8×H200 @ 50 steps ≈ 55 s/video; 2×H200 @ 35 steps ≈ 3 min/video.)
+- **Memory:** ~61.5 GiB per GPU when sharded across 2 GPUs (HSDP shard 2); repo ~135 GB on disk.
+- Same generation defaults, supported sizes, and `generate_sound`/`sound_duration`
+  semantics as Nano. Action (policy / forward- / inverse-dynamics) modalities are
+  not part of this integration yet.
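
In both Cosmos3-Super launch commands, the product of `--cfg-parallel-size` and `--ulysses-degree` equals the GPU count (2 × 4 = 8, and 2 × 1 = 2 if an omitted `--ulysses-degree` is taken as 1), and `--hsdp-shard-size` equals the GPU count as well. The sketch below checks a planned launch against that pattern. It is an observation about these two commands, not a documented vllm-omni rule, and the default of 1 and the function name are assumptions.

```python
def matches_recipe_pattern(
    num_gpus: int,
    cfg_parallel_size: int = 1,
    ulysses_degree: int = 1,   # assumed to be 1 when the flag is omitted
    hsdp_shard_size: int = 1,
) -> bool:
    """Check flags against the pattern seen in the two recipe commands."""
    covers_all_gpus = cfg_parallel_size * ulysses_degree == num_gpus
    shards_all_gpus = hsdp_shard_size == num_gpus
    return covers_all_gpus and shards_all_gpus


# 8-GPU recipe: --cfg-parallel-size 2 --ulysses-degree 4 --hsdp-shard-size 8
print(matches_recipe_pattern(8, cfg_parallel_size=2, ulysses_degree=4, hsdp_shard_size=8))
# 2-GPU recipe: --cfg-parallel-size 2 --hsdp-shard-size 2
print(matches_recipe_pattern(2, cfg_parallel_size=2, hsdp_shard_size=2))
```

Other combinations may well be valid; a `False` result only means the flags differ from the two configurations shown in this recipe.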