2 changes: 1 addition & 1 deletion docs/models/supported_models.md
@@ -33,7 +33,7 @@ th {
| `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
| `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |
3 changes: 2 additions & 1 deletion recipes/README.md
@@ -36,7 +36,8 @@ recipes/
| [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
| [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
| [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation | 1x H200 141GB / B300 |
| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation, plus text-to-video with sound | 1x H200 141GB / B300 |
| [`nvidia/Cosmos3-Super.md`](./nvidia/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) | 8x H200/H100/A100 / 2x H200 / B300 |
| [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
| [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
| [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |
44 changes: 37 additions & 7 deletions recipes/nvidia/Cosmos3-Nano.md
@@ -6,7 +6,7 @@

- Vendor: NVIDIA
- Model: `nvidia/Cosmos3-Nano`
- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation
- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound)
- Mode: Online serving with the OpenAI-compatible image/video APIs, plus offline generation via the `Omni` API
- Maintainer: Community

@@ -20,12 +20,17 @@ the mode is selected per request:
- **T2V** — `POST /v1/videos/sync` with `num_frames > 1` and no reference image.
- **I2V** — `POST /v1/videos/sync` with a reference image (`input_reference` file
upload, or `image_reference` JSON).
- **T2VS / I2VS** — add `generate_sound=true` (and optional `sound_duration`) to a
T2V/I2V `/v1/videos/sync` request to also generate synchronized audio, muxed into
the mp4 as AAC 48 kHz stereo. See the official model card's "Video + Audio" examples.
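
The per-request mode selection for `/v1/videos/sync` described in the list above can be sketched as a small client-side helper. This is only an illustration: `classify_video_request` and `VIDEO_ENDPOINT` are hypothetical names, not part of vllm-omni, and the server does its own selection from the request fields.

```python
# Hypothetical client-side helper; the server decides the mode itself.
VIDEO_ENDPOINT = "/v1/videos/sync"


def classify_video_request(has_reference_image: bool, generate_sound: bool = False) -> str:
    """Name the mode that a /v1/videos/sync request selects.

    A reference image (input_reference / image_reference) turns T2V into I2V;
    generate_sound=true adds the trailing "S" (T2VS / I2VS).
    """
    base = "I2V" if has_reference_image else "T2V"
    return (base + "S") if generate_sound else base


if __name__ == "__main__":
    for ref in (False, True):
        for sound in (False, True):
            print(ref, sound, classify_video_request(ref, sound), VIDEO_ENDPOINT)
```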

## References

- Model card (authoritative usage + example assets): <https://huggingface.co/nvidia/Cosmos3-Nano>
- Example inputs/outputs live in the repo's `assets/` (`example_t2v_prompt.json`,
`example_i2v_prompt.json`, `example_i2v_input.jpg`, `negative_prompt.json`).
`example_i2v_prompt.json`, `example_i2v_input.jpg`, `negative_prompt.json`;
audio examples: `example_t2vs_prompt.json`, `example_t2vs_output.mp4`,
`example_i2vs_output.mp4`).
- Prompt upsampling (recommended for quality): the model expects JSON-upsampled
structured prompts; see NVIDIA's `cosmos-framework` prompt-upsampling docs.
- Pipeline: [`vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py`](../../vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py)
@@ -47,6 +52,9 @@ the mode is selected per request:

#### Command

Requires the `vllm-omni` package (or the `vllm/vllm-omni:cosmos3` container),
which provides the `vllm serve … --omni` entrypoint used below.

Safety guardrails are **on by default** (NVIDIA Open Model License). They load
the **gated** `nvidia/Cosmos-1.0-Guardrail` model, so to keep them on you must:

@@ -116,6 +124,26 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
-F "seed=1111" \
-F "input_reference=@/path/to/reference.jpg;type=image/jpeg" \
-o cosmos3_i2v.mp4


# Text-to-video-with-sound
curl -sS -X POST http://localhost:8000/v1/videos/sync \
-H "Accept: video/mp4" \
-F "model=nvidia/Cosmos3-Nano" \
-F "prompt=The video opens with a view of a well-lit indoor fruit display. A robotic arm picks up a pear, an orange, and a carambola one by one, placing each into a plastic bag in a shopping cart with red handles. The video is 7.875 seconds long, 24 FPS, and 1280x720. Audio description: soft servo whirs, gentle fruit thuds, plastic bag rustling, and a faint refrigeration hum." \
-F "negative_prompt=blurry, distorted, low quality" \
-F "size=1280x720" \
-F "num_frames=189" \
-F "fps=24" \
-F "num_inference_steps=35" \
-F "guidance_scale=6.0" \
-F "max_sequence_length=4096" \
-F "flow_shift=10.0" \
-F "seed=0" \
-F "generate_sound=true" \
-F "sound_duration=7.875" \
-F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
-o cosmos3_t2v_with_sound.mp4
```
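
In the text-to-video-with-sound example above, `sound_duration=7.875` equals `num_frames / fps` (189 / 24). The sketch below derives the audio fields from the video settings so the two stay in sync. It assumes the audio should span the whole clip, which the example suggests but the recipe does not state as a requirement; `sound_fields` is a hypothetical helper name.

```python
def sound_fields(num_frames: int, fps: int) -> dict:
    """Form fields that enable audio, with sound_duration matched to the clip length.

    Assumption: the audio track should cover the full video (num_frames / fps seconds).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return {
        "generate_sound": "true",
        "sound_duration": str(num_frames / fps),
    }


if __name__ == "__main__":
    # 189 frames at 24 FPS, as in the curl example above.
    print(sound_fields(189, 24))
```

The returned dictionary can be merged into the multipart form fields of any T2V or I2V request.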

#### Notes
@@ -134,7 +162,9 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
3:4, 9:16. Defaults: T2I 1024², 50 steps, guidance 7.0; T2V/I2V 1280×720,
189 frames, 35 steps, guidance 6.0, `flow_shift=10.0`.
- **Key flags / params:** `--no-guardrails` (server) or
`extra_params={"guardrails":false}` (per request) toggles safety;
`extra_params={"guardrails":false}` (per request) toggles safety. The
per-request flag only takes effect when the server was launched **with**
guardrails enabled (it cannot re-enable them on a `--no-guardrails` server).
`use_resolution_template` / `use_duration_template` are off by default and only
needed when not using upsampled prompts that already encode resolution/duration.
- **Known limitations:**
@@ -143,8 +173,8 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
the server fails at pipeline build with a gated-repo / safety-checker error.
- A guardrail-blocked prompt currently returns HTTP 500
(`"Guardrail blocked prompt"`).
- Video + audio, and action (policy / forward- / inverse-dynamics) modalities
are not part of this integration yet.
- Action (policy / forward- / inverse-dynamics) modalities are not part of
this integration yet.
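
The guardrail behaviour described in the notes above can be summarised in a few lines. This is a sketch of the stated semantics, not the server's implementation: both function names are hypothetical, and the assumption that the blocked-prompt message appears verbatim in the response body is taken from the note above rather than verified against the server.

```python
from typing import Optional


def guardrails_active(server_enabled: bool, request_flag: Optional[bool] = None) -> bool:
    """Whether guardrails run for a request.

    A --no-guardrails server never runs them; otherwise the per-request
    extra_params "guardrails" value (default: on) decides.
    """
    if not server_enabled:
        return False
    return True if request_flag is None else request_flag


def is_guardrail_block(status_code: int, body_text: str) -> bool:
    """Tell a guardrail rejection apart from other HTTP 500 responses.

    Assumes the body contains the message quoted in the note above.
    """
    return status_code == 500 and "Guardrail blocked prompt" in body_text


if __name__ == "__main__":
    print(guardrails_active(True, False))   # per-request opt-out
    print(guardrails_active(False, True))   # cannot re-enable on a --no-guardrails server
```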

### 1x GPU (Offline generation)

@@ -170,8 +200,8 @@ def main():
model_class_name="Cosmos3OmniDiffusersPipeline",
trust_remote_code=True,
enforce_eager=True,
# Keep guardrails on by installing cosmos-guardrail + gated-repo access;
# this disables them for a quick local run.
# Guardrails are disabled here for a quick local run; install
# cosmos-guardrail + gated-repo access and drop this to enable them.
model_config={"guardrails": False},
)
gen = torch.Generator(device="cpu").manual_seed(42)
108 changes: 108 additions & 0 deletions recipes/nvidia/Cosmos3-Super.md
@@ -0,0 +1,108 @@
# Cosmos3-Super

> Frontier 64B world model: text-to-image, text-to-video, image-to-video (+ optional audio)

## Summary

- Vendor: NVIDIA
- Model: `nvidia/Cosmos3-Super` (64B; also `Cosmos3-Super-Text2Image`, `Cosmos3-Super-Image2Video`)
- Task: T2I, T2V, I2V generation, with optional synchronized audio (video + sound)
- Mode: Online serving with the OpenAI-compatible image/video APIs
- Maintainer: Community

## When to use this recipe

Use this recipe to deploy the 64B `nvidia/Cosmos3-Super` for the highest-quality
Cosmos3 generation. It shares the same `Cosmos3OmniDiffusersPipeline` and request
formats as [Cosmos3-Nano](./Cosmos3-Nano.md) — only the checkpoint size and the
recommended parallelism differ. Mode is selected per request (T2I →
`/v1/images/generations`; T2V/I2V → `/v1/videos/sync`; add `generate_sound=true`
for audio).

## References

- Model card (authoritative usage + example assets): <https://huggingface.co/nvidia/Cosmos3-Super>
- Nano recipe (same APIs/params): [`Cosmos3-Nano.md`](./Cosmos3-Nano.md)
- Pipeline: [`vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py`](../../vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py)

## Hardware Support

## GPU

Requires the `vllm-omni` package (or the `vllm/vllm-omni:cosmos3` container),
which provides the `vllm serve … --omni` entrypoint used below.

### 8x H200/H100/A100 (recommended, per model card)

```bash
vllm serve nvidia/Cosmos3-Super \
--omni \
--host 0.0.0.0 --port 8000 \
--cfg-parallel-size 2 \
--ulysses-degree 4 \
--use-hsdp --hsdp-shard-size 8 \
--init-timeout 1800
```

### 2x H200 / B300 (minimum)

```bash
vllm serve nvidia/Cosmos3-Super \
--omni \
--host 0.0.0.0 --port 8000 \
--cfg-parallel-size 2 \
--use-hsdp --hsdp-shard-size 2 \
--init-timeout 1800
```
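
In both launch commands above, `--cfg-parallel-size` times `--ulysses-degree` equals the GPU count, and `--hsdp-shard-size` equals that same count (2 × 4 = 8 and 2 × 1 = 2). The sketch below checks a configuration against that pattern. The rule is inferred from these two examples only and is an assumption, not documented vllm-omni behaviour; `check_parallel_config` is a hypothetical helper.

```python
def check_parallel_config(num_gpus: int, cfg_parallel_size: int = 1,
                          ulysses_degree: int = 1, hsdp_shard_size: int = 1) -> bool:
    """Return True if a config follows the pattern of the two commands above.

    Assumed pattern (inferred, not documented):
      cfg_parallel_size * ulysses_degree == num_gpus == hsdp_shard_size
    """
    return (cfg_parallel_size * ulysses_degree == num_gpus
            and hsdp_shard_size == num_gpus)


if __name__ == "__main__":
    # 8-GPU recommended config, then the 2-GPU minimum config.
    print(check_parallel_config(8, cfg_parallel_size=2, ulysses_degree=4, hsdp_shard_size=8))
    print(check_parallel_config(2, cfg_parallel_size=2, hsdp_shard_size=2))
```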

Guardrails are on by default and load the gated `nvidia/Cosmos-1.0-Guardrail`
model. To keep them on, run `pip install cosmos-guardrail`, accept the model
license, and set `HF_TOKEN`; add `--no-guardrails` to disable them.
`--enable-layerwise-offload` reduces VRAM use on smaller GPUs.

#### Verification

Requests are identical to Nano apart from the `model` field (see
[`Cosmos3-Nano.md`](./Cosmos3-Nano.md) for the full T2I/T2V/I2V/T2VS curl examples).
Official params: `size=1280x720, num_frames=189, fps=24,
num_inference_steps=35, guidance_scale=6.0, flow_shift=10.0, max_sequence_length=4096`.

```bash
curl http://localhost:8000/v1/models
# T2V (official prompt assets give best quality)
curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
-F "model=nvidia/Cosmos3-Super" -F "prompt=A robot arm is cleaning a plate in the kitchen" \
-F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
-F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
-F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
-F "seed=17" -o cosmos3_super_t2v.mp4

# I2V — add an uploaded reference image
curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
-F "model=nvidia/Cosmos3-Super" -F "prompt=The scene comes to life with smooth, natural motion." \
-F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
-F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
-F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
-F "seed=1111" -F "input_reference=@/path/to/reference.jpg;type=image/jpeg" \
-o cosmos3_super_i2v.mp4

# T2V + sound — add generate_sound/sound_duration (output muxes AAC 48 kHz stereo)
curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
-F "model=nvidia/Cosmos3-Super" -F "prompt=A robot arm is cleaning a plate in the kitchen" \
-F "size=1280x720" -F "num_frames=189" -F "fps=24" -F "num_inference_steps=35" \
-F "guidance_scale=6.0" -F "max_sequence_length=4096" -F "flow_shift=10.0" \
-F "generate_sound=true" -F "sound_duration=7.875" \
-F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
-F "seed=17" -o cosmos3_super_t2vs.mp4
```
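
To confirm that a sound-enabled output really carries the AAC 48 kHz stereo track mentioned above, one option is to inspect it with `ffprobe` and check the reported audio stream. The sketch below assumes `ffprobe` (from FFmpeg) is installed; `has_aac_48k_stereo` and `probe` are hypothetical helper names.

```python
import json
import subprocess


def has_aac_48k_stereo(ffprobe_json: str) -> bool:
    """Check ffprobe's JSON stream listing for an AAC, 48 kHz, 2-channel audio stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return any(
        s.get("codec_type") == "audio"
        and s.get("codec_name") == "aac"
        and int(s.get("sample_rate", 0)) == 48000  # ffprobe reports this as a string
        and s.get("channels") == 2
        for s in streams
    )


def probe(path: str) -> str:
    """Run ffprobe on a file and return its stream info as a JSON string."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(has_aac_48k_stereo(probe("cosmos3_super_t2vs.mp4")))
```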

#### Notes

- **Measured (2x B300, bf16, guardrails off, official 2-GPU config above):**
- T2I 1024², 50 steps → **~6 s**
- T2V 1280×720, 189 frames, 35 steps → **~197 s**
- I2V 1280×720, 189 frames, 35 steps → **~200 s**
- T2V + sound (189 frames, 35 steps) → **~198 s**, output muxes **AAC 48 kHz stereo**
- (NVIDIA's reference: 8×H200 @ 50 steps ≈ 55 s/video; 2×H200 @ 35 steps ≈ 3 min/video.)
- **Memory:** ~61.5 GiB per GPU when sharded across 2 GPUs (HSDP shard 2); repo ~135 GB on disk.
- Same generation defaults, supported sizes, and `generate_sound`/`sound_duration`
semantics as Nano. Action (policy / forward- / inverse-dynamics) modalities are
not part of this integration yet.
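
As a rough sanity check on the memory figure above, the bf16 weight footprint per GPU can be estimated from the parameter count. The sketch assumes exactly 64B parameters, all stored in bf16 (2 bytes each) and split evenly across shards; it ignores activations and any components not counted in the 64B, so it is a lower bound rather than a prediction.

```python
GIB = 2 ** 30  # bytes per GiB


def bf16_weight_gib_per_gpu(num_params: float, num_shards: int) -> float:
    """Estimated bf16 weight memory per GPU, in GiB, for evenly sharded weights."""
    bytes_per_param = 2  # bf16
    return num_params * bytes_per_param / num_shards / GIB


if __name__ == "__main__":
    # 64B parameters sharded across 2 GPUs (HSDP shard size 2).
    print(round(bf16_weight_gib_per_gpu(64e9, 2), 1))
```

Under these assumptions the estimate is about 59.6 GiB per GPU, slightly below the measured ~61.5 GiB.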