Merged
21 commits
e881e9b
Add Cosmos3 sound generation
MaciejBalaNV May 28, 2026
55c1917
Fix tests; small improvements
MaciejBalaNV Jun 2, 2026
7a4f69a
Added action generation
MaciejBalaNV May 28, 2026
b9bedbf
Remove unused parameter
bastefaniak Jun 2, 2026
a6bf873
Comment about packed modalities into single tensor
bastefaniak Jun 2, 2026
90ef413
Enable sound generation only thorough "generate_sound", "sound_gen" f…
bastefaniak Jun 2, 2026
b4e0379
add video+sound usage to Cosmos3-Nano recipe
lishunyang12 Jun 2, 2026
1eee7a2
Pass sound_dim/sound_latent_fps into transformer from initialized sou…
bastefaniak Jun 2, 2026
141956d
Update recipes
bastefaniak Jun 2, 2026
73313f6
lint
bastefaniak Jun 2, 2026
17c9f6d
Merge branch 'mbala/cosmos3_sound' into bstefaniak/cosmos3_sound
bastefaniak Jun 2, 2026
3d724b7
Merge branch 'bstefaniak/cosmos3_sound' into mbala/cosmos3_action_v2
bastefaniak Jun 2, 2026
733eed0
Merge branch 'mbala/cosmos3_action_v2' into mbala/cosmos3_action_review
bastefaniak Jun 3, 2026
0e944c3
fix no guardrails
bastefaniak Jun 3, 2026
aab40ee
Undo recipe change
bastefaniak Jun 3, 2026
32bee22
Allow action in online serving only as json array
bastefaniak Jun 3, 2026
a6f76c5
Document Cosmos3 action modality in recipes
lishunyang12 Jun 3, 2026
0c79145
Merge branch 'main' into mbala/cosmos3_action_review
lishunyang12 Jun 3, 2026
97c3343
Reword inverse dynamics note in recipes
lishunyang12 Jun 3, 2026
ffe32d2
Move Cosmos3 recipes to recipes/cosmos3/
lishunyang12 Jun 3, 2026
919b694
Update recipes index path to cosmos3/
lishunyang12 Jun 3, 2026
2 changes: 1 addition & 1 deletion docs/models/supported_models.md
@@ -33,7 +33,7 @@ th {
| `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
-| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
+| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
| `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
| `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |
4 changes: 2 additions & 2 deletions recipes/README.md
@@ -36,8 +36,8 @@ recipes/
| [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
| [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
| [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
-| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text to video with sound | 1x H200 141GB / B300 |
-| [`nvidia/Cosmos3-Super.md`](./nvidia/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) | 8x H200/H100/A100 / 2x H200 / B300 |
+| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text to video with sound, action policy | 1x H200 141GB / B300 |
+| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) / Action policy | 8x H200/H100/A100 / 2x H200 / B300 |
| [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
| [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
| [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |
@@ -6,7 +6,7 @@

- Vendor: NVIDIA
- Model: `nvidia/Cosmos3-Nano`
-- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound)
+- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound), action policy
- Mode: Online serving with the OpenAI-compatible image/video APIs, plus offline generation via the `Omni` API
- Maintainer: Community

@@ -23,6 +23,23 @@ the mode is selected per request:
- **T2VS / I2VS** — add `generate_sound=true` (and optional `sound_duration`) to a
T2V/I2V `/v1/videos/sync` request to also generate synchronized audio, muxed into
the mp4 as AAC 48 kHz stereo. See the official model card's "Video + Audio" examples.
+- **Action** — pass `extra_params={"action_mode": ...}` to drive Physical-AI tasks:
+  - `forward_dynamics` — given a first frame **and** an action trajectory, roll out
+    the resulting video. Synchronous: `POST /v1/videos/sync`.
+  - `policy` — given a first frame and a language instruction, **predict** the action
+    trajectory (and a rollout video). Use the async `POST /v1/videos` endpoint and
+    read the predicted action from the top-level `action` field
+    (`{data, shape, dtype, raw_action_dim, domain_id}`).
+
+Action requests also take `domain_name` (e.g. `av`, `bridge_orig_lerobot`,
+`droid_lerobot`, `agibotworld`, …; or a numeric `domain_id`), `raw_action_dim`,
+and `action_chunk_size` (must equal `num_frames` or `num_frames - 1`). For
+`forward_dynamics` also pass the `action` array. The dedicated policy checkpoint
+**`nvidia/Cosmos3-Nano-Policy-DROID`** is served the same way
+(`domain_name=droid_lerobot`).
+
+`inverse_dynamics` (recover the action from a given video) is supported by the
+pipeline; **online inference of inverse dynamics will be added in a follow-up MR.**
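The constraints stated in the added recipe text can be checked on the client before a request is sent. The following is a minimal sketch, not code from this PR: `build_action_extra_params` is a hypothetical helper name, and it encodes only the rules quoted above (`action_chunk_size` must equal `num_frames` or `num_frames - 1`; `forward_dynamics` needs an `action` array shaped `[action_chunk_size, raw_action_dim]`). The server's own validation may be stricter or differ in detail.

```python
import json
from typing import Any, Dict, List, Optional


def build_action_extra_params(
    action_mode: str,
    domain_name: str,
    raw_action_dim: int,
    action_chunk_size: int,
    num_frames: int,
    action: Optional[List[List[float]]] = None,
) -> str:
    """Return a JSON string for the `extra_params` form field (hypothetical helper)."""
    # Only the two modes documented as available online are accepted here.
    if action_mode not in ("forward_dynamics", "policy"):
        raise ValueError(f"unsupported online action_mode: {action_mode!r}")
    # Rule quoted from the recipe: chunk size is num_frames or num_frames - 1.
    if action_chunk_size not in (num_frames, num_frames - 1):
        raise ValueError("action_chunk_size must equal num_frames or num_frames - 1")

    params: Dict[str, Any] = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action_chunk_size": action_chunk_size,
    }
    if action_mode == "forward_dynamics":
        # forward_dynamics additionally needs the action trajectory itself.
        if action is None:
            raise ValueError("forward_dynamics requires an action array")
        rows_ok = len(action) == action_chunk_size
        cols_ok = all(len(row) == raw_action_dim for row in action)
        if not (rows_ok and cols_ok):
            raise ValueError("action must be shaped [action_chunk_size, raw_action_dim]")
        params["action"] = action
    return json.dumps(params)
```

The returned string could be passed as the value of the `extra_params` form field in place of the hand-written JSON used in the curl examples.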

## References

@@ -144,6 +161,35 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
-F "sound_duration=7.875" \
-F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
-o cosmos3_t2v_with_sound.mp4

+# Action — forward dynamics (first frame + action trajectory -> rollout video).
+# Synchronous; `action` is a JSON array shaped [action_chunk_size, raw_action_dim].
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "model=nvidia/Cosmos3-Nano" \
+  --form-string "prompt=You are an autonomous vehicle. This video is captured from a first-person perspective." \
+  -F "input_reference=@first_frame.jpg;type=image/jpeg" \
+  -F "size=640x480" -F "num_frames=61" -F "fps=10" \
+  -F "num_inference_steps=30" -F "guidance_scale=1.0" -F "flow_shift=5.0" \
+  --form-string "extra_params={\"action_mode\":\"forward_dynamics\",\"domain_name\":\"av\",\"raw_action_dim\":9,\"action_chunk_size\":60,\"action\":$(cat action.json)}" \
+  -F "seed=0" \
+  -o cosmos3_forward_dynamics.mp4
+
+# Action — policy (first frame + instruction -> predicted action trajectory + video).
+# Asynchronous: POST returns a job id; poll, then read the predicted action from
+# the top-level `action` field ({data, shape, dtype, raw_action_dim, domain_id}).
+VIDEO_ID=$(curl -sS -X POST http://localhost:8000/v1/videos \
+  -H "Accept: application/json" \
+  --form-string "model=nvidia/Cosmos3-Nano" \
+  --form-string "prompt=Pick up the banana and place it in the bowl." \
+  -F "input_reference=@first_frame.jpg;type=image/jpeg" \
+  -F "size=640x480" -F "num_frames=17" -F "fps=5" \
+  -F "num_inference_steps=30" -F "guidance_scale=1.0" -F "flow_shift=5.0" \
+  --form-string 'extra_params={"action_mode":"policy","domain_name":"bridge_orig_lerobot","raw_action_dim":10,"action_chunk_size":16}' \
+  -F "seed=0" | jq -r '.id')
+# poll until status == completed, then:
+curl -sS "http://localhost:8000/v1/videos/$VIDEO_ID" | jq '.action | {shape, dtype, raw_action_dim, domain_id}'
+curl -sS -L "http://localhost:8000/v1/videos/$VIDEO_ID/content" -o cosmos3_policy.mp4
```
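The policy example leaves "poll until status == completed" as a comment. A minimal Python sketch of that loop follows, under stated assumptions: the job JSON exposes a `status` field that eventually becomes `completed` (taken from the comment in the recipe), a `failed` terminal state is assumed rather than confirmed, and `fetch_job` is a hypothetical injected callable (for example, one wrapping a GET of `/v1/videos/$VIDEO_ID`) so that the loop can be exercised without a running server.

```python
import time
from typing import Any, Callable, Dict


def wait_for_video(
    fetch_job: Callable[[], Dict[str, Any]],
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll `fetch_job` until the job reports completion (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed terminal state, not confirmed by the recipe
            raise RuntimeError(f"video job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video job did not complete before the timeout")
        sleep(poll_interval_s)


def summarize_action(job: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the jq filter `.action | {shape, dtype, raw_action_dim, domain_id}`."""
    action = job["action"]
    keys = ("shape", "dtype", "raw_action_dim", "domain_id")
    return {key: action.get(key) for key in keys}
```

Injecting `fetch_job` and `sleep` keeps the HTTP client and the wall clock out of the loop, which makes the helper easy to test with canned responses.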

#### Notes
@@ -152,6 +198,7 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
- T2I 1024² — 10 / 25 / 50 steps → ~0.4 / 0.7 / **1.3 s**
- T2V 1280×720 @ 35 steps — 25 / 49 / 93 / **189** frames → ~7 / 15 / 33 / **~93 s**
- I2V 1280×720, 189 frames @ 35 steps → ~**99 s**
+- Action 640×480 @ 30 steps — forward-dynamics 61f ~**4 s**, policy 17f ~**1–3 s**.
- Guardrails-on overhead: ~8% on T2I, negligible on video.
- **Memory:** transformer ~17 GiB (bf16); peak ~46 GiB for 720p video on 1 GPU;
full repo (transformer + Wan VAE + Qwen3-VL vision encoder + audio tokenizer)
@@ -173,8 +220,10 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
the server fails at pipeline build with a gated-repo / safety-checker error.
- A guardrail-blocked prompt currently returns HTTP 500
(`"Guardrail blocked prompt"`).
-- Action (policy / forward- / inverse-dynamics) modalities are not part of
-this integration yet.
+- Action `forward_dynamics` (sync `/v1/videos/sync`) and `policy` (async
+`/v1/videos`, returns the predicted action under the top-level `action`
+field) are supported online. **Online inference of inverse dynamics will be
+added in a follow-up MR.**

### 1x GPU (Offline generation)

@@ -104,5 +104,9 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
- (NVIDIA's reference: 8×H200 @ 50 steps ≈ 55 s/video; 2×H200 @ 35 steps ≈ 3 min/video.)
- **Memory:** ~61.5 GiB per GPU when sharded across 2 GPUs (HSDP shard 2); repo ~135 GB on disk.
- Same generation defaults, supported sizes, and `generate_sound`/`sound_duration`
-semantics as Nano. Action (policy / forward- / inverse-dynamics) modalities are
-not part of this integration yet.
+semantics as Nano, including the **action** modality: `forward_dynamics`
+(sync `/v1/videos/sync`) and `policy` (async `/v1/videos`, predicted action under
+the top-level `action` field) — see the Cosmos3-Nano recipe for the request shape.
+Online inference of inverse dynamics will be added in a follow-up MR. Verified on
+the 64B Super under `--cfg-parallel-size 2`: async `policy` returns the predicted
+action (`[16, 10]`) and the rollout video reliably.
76 changes: 67 additions & 9 deletions tests/diffusion/models/cosmos3/test_cosmos3_pipeline.py
@@ -99,12 +99,16 @@ def __init__(
         sound_gen: bool = False,
         sound_dim: int = 3,
         sound_latent_fps: float = 25.0,
+        action_gen: bool = False,
+        action_dim: int = 4,
     ) -> None:
         super().__init__()
         self.latent_channel_size = latent_channel_size
         self.sound_gen = sound_gen
         self.sound_dim = sound_dim
         self.sound_latent_fps = sound_latent_fps
+        self.action_gen = action_gen
+        self.action_dim = action_dim
         self.cached_kv: Any | None = None
         self.cached_freqs_gen: Any | None = None
         self.calls: list[dict[str, Any]] = []
@@ -139,7 +143,10 @@ def forward(
         marker = torch.tensor([token], dtype=torch.float32)
         self.cached_kv = [(marker, marker + 100)]
         self.cached_freqs_gen = (marker + 200, marker + 300)
+        action_latents = kwargs.get("action_latents")
         outputs: list[torch.Tensor] = [torch.full_like(hidden_states, float(token))]
+        if action_latents is not None:
+            outputs.append(torch.full_like(action_latents, float(token + 20)))
         if sound_latents is not None:
             outputs.append(torch.full_like(sound_latents, float(token + 10)))
         return outputs[0] if len(outputs) == 1 else tuple(outputs)
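The stub's return convention in this hunk is a bare tensor when only video is produced, otherwise a tuple with video first, then action, then sound. A consumer has to know which optional modalities were requested in order to interpret the tuple. The sketch below shows one way to unpack it; `unpack_outputs` is a hypothetical helper, not a function from the pipeline, and plain values stand in for tensors so that the example has no `torch` dependency.

```python
from typing import Any, Optional, Tuple


def unpack_outputs(
    result: Any,
    has_action: bool,
    has_sound: bool,
) -> Tuple[Any, Optional[Any], Optional[Any]]:
    """Split a stub-style result into (video, action, sound); hypothetical helper."""
    # A bare value means "video only"; a tuple carries the optional modalities.
    outputs = list(result) if isinstance(result, tuple) else [result]
    expected = 1 + int(has_action) + int(has_sound)
    if len(outputs) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(outputs)}")
    remaining = iter(outputs)
    video = next(remaining)
    # Order follows the stub: action is appended before sound.
    action = next(remaining) if has_action else None
    sound = next(remaining) if has_sound else None
    return video, action, sound
```

Checking the length against the requested modalities catches the case where a two-element tuple would otherwise be misread as video plus sound when it is actually video plus action.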
@@ -344,7 +351,7 @@ def test_pipeline_init_passes_tokenizer_attrs_into_transformer(
     assert pipeline.transformer.audio_proj_out.out_features == 5


-def test_preprocess_i2v_image_input() -> None:
+def test_preprocess_i2v_image_and_action_video_inputs() -> None:
     from vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 import get_cosmos3_pre_process_func

     preprocess = get_cosmos3_pre_process_func(SimpleNamespace())
@@ -357,6 +364,16 @@ def test_preprocess_i2v_image_input() -> None:
     assert (result.sampling_params.height, result.sampling_params.width) == (672, 1344)
     assert tuple(result.prompts[0]["additional_information"]["preprocessed_image"].shape[-2:]) == (672, 1344)

+    frames = [Image.new("RGB", (8, 4), color) for color in ("red", "green", "blue")]
+    action = SimpleNamespace(
+        prompts=[{"prompt": "Move.", "multi_modal_data": {"video": frames}}],
+        sampling_params=SimpleNamespace(height=16, width=32, extra_args={"action_mode": "forward_dynamics"}),
+    )
+
+    additional = preprocess(action).prompts[0]["additional_information"]
+    assert tuple(additional["preprocessed_image"].shape) == (1, 3, 16, 32)
+    assert tuple(additional["preprocessed_video"].shape) == (1, 3, 3, 16, 32)
+

 def test_postprocess_handles_image_video_audio_and_validation() -> None:
     from vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 import get_cosmos3_post_process_func
@@ -434,7 +451,7 @@ def test_prompt_formatting_and_checkpoint_key_remap(make_cosmos3_pipeline) -> No
     assert {key: Cosmos3OmniDiffusersPipeline._remap_ckpt_key(key) for key in remaps} == remaps


-def test_prepare_latents_for_video_image_and_sound(make_cosmos3_pipeline) -> None:
+def test_prepare_latents_for_video_image_sound_and_action(make_cosmos3_pipeline) -> None:
     pipeline = make_cosmos3_pipeline()
     latents = pipeline._prepare_latents(16, 24, 5, torch.Generator(device="cpu").manual_seed(0))
     assert latents.shape == (1, 2, 2, 2, 3)
@@ -463,8 +480,20 @@ def test_prepare_latents_for_video_image_and_sound(make_cosmos3_pipeline) -> Non
     assert (sound_latents.shape, latent_frames) == (torch.Size([1, 3, 6]), 6)
     assert pipeline._decode_sound_latents(torch.zeros(1, 3, 6), target_audio_samples=21).shape == (1, 2, 21)

+    pipeline.transformer = pipeline.transformer.__class__(action_gen=True, action_dim=4)
+    action, action_mask, clean, raw_dim = pipeline._prepare_action_latents(
+        mode="forward_dynamics",
+        action_chunk_size=2,
+        raw_action_dim=None,
+        generator=torch.Generator(device="cpu").manual_seed(0),
+        sp=SimpleNamespace(extra_args={"action": [[1.0, 2.0], [3.0, 4.0]]}),
+    )
+    assert raw_dim == 2
+    assert action_mask.tolist() == [[[0.0], [0.0]]]
+    torch.testing.assert_close(action, clean)
+

-def test_diffuse_covers_cfg_i2v_and_sound_steps(make_cosmos3_pipeline) -> None:
+def test_diffuse_covers_cfg_i2v_and_multimodal_steps(make_cosmos3_pipeline) -> None:
     pipeline = make_cosmos3_pipeline()
     latents = torch.zeros(1, 2, 1, 1, 1)

@@ -496,20 +525,22 @@ def test_diffuse_covers_cfg_i2v_and_sound_steps(make_cosmos3_pipeline) -> None:
     )
     torch.testing.assert_close(i2v[:, :, 0:1], torch.full((1, 2, 1, 1, 1), 7.0))

-    pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, sound_gen=True, sound_dim=3)
-    video_result, sound_result = pipeline.diffuse(
+    pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, action_gen=True, action_dim=4)
+    video_result, action_result = pipeline.diffuse(
         latents=latents,
-        sound_latents=torch.zeros(1, 3, 4),
+        action_latents=torch.zeros(1, 3, 4),
+        action_velocity_mask=torch.ones(1, 3, 1),
+        action_condition_latents=torch.zeros(1, 3, 4),
         timesteps=torch.tensor([7, 3]),
         cond_ids=_ids(2),
         cond_mask=_mask(),
         uncond_ids=_ids(1),
         uncond_mask=_mask(),
         guidance_scale=1.0,
-        shared_kwargs={"video_shape": (1, 1, 1), "fps": 24.0},
+        shared_kwargs={"video_shape": (1, 1, 1), "fps": 24.0, "action_domain_ids": torch.tensor([0])},
     )
     torch.testing.assert_close(video_result, torch.full_like(latents, 4.0))
-    torch.testing.assert_close(sound_result, torch.full((), 24.0).expand_as(sound_result))
+    torch.testing.assert_close(action_result, torch.full((), 44.0).expand_as(action_result))


 def test_diffuse_keeps_paired_cfg_when_cache_dit_active(make_cosmos3_pipeline) -> None:
@@ -568,6 +599,8 @@ def fake_prepare(height, width, num_frames, generator):
         def fake_diffuse(**kwargs):
             captured["diffuse_calls"].append(kwargs)
             outputs = [kwargs["latents"] + len(captured["diffuse_calls"])]
+            if kwargs.get("action_latents") is not None:
+                outputs.append(kwargs["action_latents"] + 3.0)
             if kwargs.get("sound_latents") is not None:
                 outputs.append(kwargs["sound_latents"] + 2.0)
             return outputs[0] if len(outputs) == 1 else tuple(outputs)
@@ -612,7 +645,7 @@ def test_forward_defaults_and_mode_selection(
         assert captured["flow_shifts"] == expected["flow"]
         assert [call[0] for call in pipeline.scheduler.set_timesteps_calls] == expected["steps"]

-    def test_forward_i2v_and_sound_routes(self, make_cosmos3_pipeline) -> None:
+    def test_forward_i2v_sound_and_action_routes(self, make_cosmos3_pipeline) -> None:
         pipeline = make_cosmos3_pipeline()
         captured = self._install_forward_stubs(pipeline)
         image_tensor = torch.zeros(1, 3, 16, 16)
@@ -651,6 +684,31 @@ def test_forward_i2v_and_sound_routes(self, make_cosmos3_pipeline) -> None:
         assert captured["diffuse_calls"][-1]["sound_latents"] is sound_latents
         assert output.output["audio_sample_rate"] == 10

+        pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, action_gen=True, action_dim=4)
+        output = pipeline.forward(
+            SimpleNamespace(
+                prompts=[
+                    {
+                        "prompt": "Pick the block.",
+                        "modalities": ["video"],
+                        "additional_information": {"preprocessed_image": image_tensor},
+                    }
+                ],
+                sampling_params=make_sampling_params(
+                    height=16,
+                    width=16,
+                    extra_args={
+                        "action_mode": "policy",
+                        "action_chunk_size": 2,
+                        "raw_action_dim": 2,
+                        "domain_name": "bridge_orig_lerobot",
+                    },
+                ),
+            )
+        )
+        assert captured["diffuse_calls"][-1]["shared_kwargs"]["action_domain_ids"].tolist() == [7]
+        assert output.custom_output["action"].shape == (1, 2, 2)
+
     @pytest.mark.parametrize(
         ("prompt", "sampling_params", "message"),
         [