vllm-project · lishunyang12 · Jun 3, 2026 · May 28, 2026 · Jun 2, 2026 · May 28, 2026
@@ -33,7 +33,7 @@ th {
 | `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
-| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
+| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
 | `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |

@@ -36,8 +36,8 @@ recipes/
 | [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
 | [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
 | [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
-| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text to video with sound  | 1x H200 141GB / B300 |
-| [`nvidia/Cosmos3-Super.md`](./nvidia/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) | 8x H200/H100/A100 / 2x H200 / B300 |
+| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text-to-video with sound, action policy | 1x H200 141GB / B300 |
+| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) / action policy | 8x H200/H100/A100 / 2x H200 / B300 |
 | [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
 | [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
 | [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |

@@ -6,7 +6,7 @@
 
 - Vendor: NVIDIA
 - Model: `nvidia/Cosmos3-Nano`
-- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound)
+- Task: Text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) generation, with optional synchronized audio (video + sound), plus action modes (forward dynamics and policy)
 - Mode: Online serving with the OpenAI-compatible image/video APIs, plus offline generation via the `Omni` API
 - Maintainer: Community
 
@@ -23,6 +23,23 @@ the mode is selected per request:
 - **T2VS / I2VS** — add `generate_sound=true` (and optional `sound_duration`) to a
   T2V/I2V `/v1/videos/sync` request to also generate synchronized audio, muxed into
   the mp4 as AAC 48 kHz stereo. See the official model card's "Video + Audio" examples.
+- **Action** — pass `extra_params={"action_mode": ...}` to drive Physical-AI tasks:
+  - `forward_dynamics` — given a first frame **and** an action trajectory, roll out
+    the resulting video. Synchronous: `POST /v1/videos/sync`.
+  - `policy` — given a first frame and a language instruction, **predict** the action
+    trajectory (and a rollout video). Use the async `POST /v1/videos` endpoint and
+    read the predicted action from the top-level `action` field
+    (`{data, shape, dtype, raw_action_dim, domain_id}`).
+
+  Action requests also take `domain_name` (e.g. `av`, `bridge_orig_lerobot`,
+  `droid_lerobot`, `agibotworld`, …; or a numeric `domain_id`), `raw_action_dim`,
+  and `action_chunk_size` (must equal `num_frames` or `num_frames - 1`). For
+  `forward_dynamics` also pass the `action` array. The dedicated policy checkpoint
+  **`nvidia/Cosmos3-Nano-Policy-DROID`** is served the same way
+  (`domain_name=droid_lerobot`).
+
+  `inverse_dynamics` (recover the action from a given video) is supported by the
+  pipeline; **online inference of inverse dynamics will be added in a follow-up MR.**
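
The action-request constraints listed above can be checked on the client before sending a request. The following is a minimal sketch, not part of the recipe or of vLLM-Omni: `build_action_extra_params` is a hypothetical helper name, and it only encodes the rules stated in the bullets (the two online `action_mode` values, `action_chunk_size` equal to `num_frames` or `num_frames - 1`, and an `action` array shaped `[action_chunk_size, raw_action_dim]` for `forward_dynamics`). The server may apply further validation that is not modeled here.

```python
import json
from typing import Optional, Sequence


def build_action_extra_params(
    action_mode: str,
    domain_name: str,
    raw_action_dim: int,
    action_chunk_size: int,
    num_frames: int,
    action: Optional[Sequence[Sequence[float]]] = None,
) -> str:
    """Return a JSON string for the `extra_params` form field.

    Hypothetical client-side helper; `action` is only used for forward_dynamics.
    """
    # Only these two modes are documented as available online.
    if action_mode not in ("forward_dynamics", "policy"):
        raise ValueError(f"unsupported online action_mode: {action_mode!r}")

    # Documented rule: chunk size equals num_frames or num_frames - 1.
    if action_chunk_size not in (num_frames, num_frames - 1):
        raise ValueError("action_chunk_size must equal num_frames or num_frames - 1")

    params = {
        "action_mode": action_mode,
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action_chunk_size": action_chunk_size,
    }

    if action_mode == "forward_dynamics":
        if action is None:
            raise ValueError("forward_dynamics requires an action array")
        # Documented shape: [action_chunk_size, raw_action_dim].
        if len(action) != action_chunk_size or any(
            len(row) != raw_action_dim for row in action
        ):
            raise ValueError("action must be shaped [action_chunk_size, raw_action_dim]")
        params["action"] = [[float(v) for v in row] for row in action]

    return json.dumps(params)
```

The returned string can be passed as the value of the `extra_params` form field; for example, `build_action_extra_params("policy", "bridge_orig_lerobot", 10, 16, num_frames=17)` mirrors the policy request used in this recipe.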
 
 ## References
 
@@ -144,6 +161,35 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   -F "sound_duration=7.875" \
   -F 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
   -o cosmos3_t2v_with_sound.mp4
+
+# Action — forward dynamics (first frame + action trajectory -> rollout video).
+# Synchronous; `action` is a JSON array shaped [action_chunk_size, raw_action_dim].
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "model=nvidia/Cosmos3-Nano" \
+  --form-string "prompt=You are an autonomous vehicle. This video is captured from a first-person perspective." \
+  -F "input_reference=@first_frame.jpg;type=image/jpeg" \
+  -F "size=640x480" -F "num_frames=61" -F "fps=10" \
+  -F "num_inference_steps=30" -F "guidance_scale=1.0" -F "flow_shift=5.0" \
+  --form-string "extra_params={\"action_mode\":\"forward_dynamics\",\"domain_name\":\"av\",\"raw_action_dim\":9,\"action_chunk_size\":60,\"action\":$(cat action.json)}" \
+  -F "seed=0" \
+  -o cosmos3_forward_dynamics.mp4
+
+# Action — policy (first frame + instruction -> predicted action trajectory + video).
+# Asynchronous: POST returns a job id; poll, then read the predicted action from
+# the top-level `action` field ({data, shape, dtype, raw_action_dim, domain_id}).
+VIDEO_ID=$(curl -sS -X POST http://localhost:8000/v1/videos \
+  -H "Accept: application/json" \
+  --form-string "model=nvidia/Cosmos3-Nano" \
+  --form-string "prompt=Pick up the banana and place it in the bowl." \
+  -F "input_reference=@first_frame.jpg;type=image/jpeg" \
+  -F "size=640x480" -F "num_frames=17" -F "fps=5" \
+  -F "num_inference_steps=30" -F "guidance_scale=1.0" -F "flow_shift=5.0" \
+  --form-string 'extra_params={"action_mode":"policy","domain_name":"bridge_orig_lerobot","raw_action_dim":10,"action_chunk_size":16}' \
+  -F "seed=0" | jq -r '.id')
+# Poll GET /v1/videos/$VIDEO_ID until its status is "completed", then:
+curl -sS "http://localhost:8000/v1/videos/$VIDEO_ID" | jq '.action | {shape, dtype, raw_action_dim, domain_id}'
+curl -sS -L "http://localhost:8000/v1/videos/$VIDEO_ID/content" -o cosmos3_policy.mp4
 ```
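
The "poll until completed" step in the policy example can be sketched in Python as a small loop with a timeout. This is an illustrative sketch under assumptions: `wait_for_video_job` and `fetch_job_http` are hypothetical helper names, only the `completed` status comes from the recipe, and the failure status names are guesses that should be checked against what the server actually returns.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_job_http(base_url: str, video_id: str) -> Dict[str, Any]:
    """GET /v1/videos/<id> and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/videos/{video_id}") as resp:
        return json.load(resp)


def wait_for_video_job(
    fetch_job: Callable[[], Dict[str, Any]],
    timeout_s: float = 600.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job() until the job reports status == "completed"."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "completed":
            return job
        # ASSUMPTION: these failure status names are not confirmed by the recipe.
        if status in ("failed", "error", "cancelled"):
            raise RuntimeError(f"video job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video job still {status!r} after {timeout_s} s")
        sleep(poll_interval_s)
```

A caller could then run `job = wait_for_video_job(lambda: fetch_job_http("http://localhost:8000", video_id))` and read `job["action"]` for the predicted trajectory. Passing the fetch function in as a parameter keeps the loop testable without a running server.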
 
 #### Notes
@@ -152,6 +198,7 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   - T2I 1024² — 10 / 25 / 50 steps → ~0.4 / 0.7 / **1.3 s**
   - T2V 1280×720 @ 35 steps — 25 / 49 / 93 / **189** frames → ~7 / 15 / 33 / **~93 s**
   - I2V 1280×720, 189 frames @ 35 steps → ~**99 s**
+  - Action 640×480 @ 30 steps — forward dynamics, 61 frames → ~**4 s**; policy, 17 frames → ~**1–3 s**
   - Guardrails-on overhead: ~8% on T2I, negligible on video.
 - **Memory:** transformer ~17 GiB (bf16); peak ~46 GiB for 720p video on 1 GPU;
   full repo (transformer + Wan VAE + Qwen3-VL vision encoder + audio tokenizer)
@@ -173,8 +220,10 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
     the server fails at pipeline build with a gated-repo / safety-checker error.
   - A guardrail-blocked prompt currently returns HTTP 500
     (`"Guardrail blocked prompt"`).
-  - Action (policy / forward- / inverse-dynamics) modalities are not part of
-    this integration yet.
+  - Action `forward_dynamics` (sync `/v1/videos/sync`) and `policy` (async
+    `/v1/videos`, returns the predicted action under the top-level `action`
+    field) are supported online. **Online inference of inverse dynamics will be
+    added in a follow-up MR.**
 
 ### 1x GPU (Offline generation)
 

@@ -104,5 +104,9 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync -H "Accept: video/mp4" \
   - (NVIDIA's reference: 8×H200 @ 50 steps ≈ 55 s/video; 2×H200 @ 35 steps ≈ 3 min/video.)
 - **Memory:** ~61.5 GiB per GPU when sharded across 2 GPUs (HSDP shard 2); repo ~135 GB on disk.
 - Same generation defaults, supported sizes, and `generate_sound`/`sound_duration`
-  semantics as Nano. Action (policy / forward- / inverse-dynamics) modalities are
-  not part of this integration yet.
+  semantics as Nano, including the **action** modality: `forward_dynamics`
+  (sync `/v1/videos/sync`) and `policy` (async `/v1/videos`, predicted action under
+  the top-level `action` field) — see the Cosmos3-Nano recipe for the request shape.
+  Online inference of inverse dynamics will be added in a follow-up MR. Verified on
+  the 64B Super under `--cfg-parallel-size 2`: async `policy` returns the predicted
+  action (shape `[16, 10]`) and the rollout video reliably.
@@ -99,12 +99,16 @@ def __init__(
         sound_gen: bool = False,
         sound_dim: int = 3,
         sound_latent_fps: float = 25.0,
+        action_gen: bool = False,
+        action_dim: int = 4,
     ) -> None:
         super().__init__()
         self.latent_channel_size = latent_channel_size
         self.sound_gen = sound_gen
         self.sound_dim = sound_dim
         self.sound_latent_fps = sound_latent_fps
+        self.action_gen = action_gen
+        self.action_dim = action_dim
         self.cached_kv: Any | None = None
         self.cached_freqs_gen: Any | None = None
         self.calls: list[dict[str, Any]] = []
@@ -139,7 +143,10 @@ def forward(
             marker = torch.tensor([token], dtype=torch.float32)
             self.cached_kv = [(marker, marker + 100)]
             self.cached_freqs_gen = (marker + 200, marker + 300)
+        action_latents = kwargs.get("action_latents")
         outputs: list[torch.Tensor] = [torch.full_like(hidden_states, float(token))]
+        if action_latents is not None:
+            outputs.append(torch.full_like(action_latents, float(token + 20)))
         if sound_latents is not None:
             outputs.append(torch.full_like(sound_latents, float(token + 10)))
         return outputs[0] if len(outputs) == 1 else tuple(outputs)
@@ -344,7 +351,7 @@ def test_pipeline_init_passes_tokenizer_attrs_into_transformer(
     assert pipeline.transformer.audio_proj_out.out_features == 5
 
 
-def test_preprocess_i2v_image_input() -> None:
+def test_preprocess_i2v_image_and_action_video_inputs() -> None:
     from vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 import get_cosmos3_pre_process_func
 
     preprocess = get_cosmos3_pre_process_func(SimpleNamespace())
@@ -357,6 +364,16 @@ def test_preprocess_i2v_image_input() -> None:
     assert (result.sampling_params.height, result.sampling_params.width) == (672, 1344)
     assert tuple(result.prompts[0]["additional_information"]["preprocessed_image"].shape[-2:]) == (672, 1344)
 
+    frames = [Image.new("RGB", (8, 4), color) for color in ("red", "green", "blue")]
+    action = SimpleNamespace(
+        prompts=[{"prompt": "Move.", "multi_modal_data": {"video": frames}}],
+        sampling_params=SimpleNamespace(height=16, width=32, extra_args={"action_mode": "forward_dynamics"}),
+    )
+
+    additional = preprocess(action).prompts[0]["additional_information"]
+    assert tuple(additional["preprocessed_image"].shape) == (1, 3, 16, 32)
+    assert tuple(additional["preprocessed_video"].shape) == (1, 3, 3, 16, 32)
+
 
 def test_postprocess_handles_image_video_audio_and_validation() -> None:
     from vllm_omni.diffusion.models.cosmos3.pipeline_cosmos3 import get_cosmos3_post_process_func
@@ -434,7 +451,7 @@ def test_prompt_formatting_and_checkpoint_key_remap(make_cosmos3_pipeline) -> No
     assert {key: Cosmos3OmniDiffusersPipeline._remap_ckpt_key(key) for key in remaps} == remaps
 
 
-def test_prepare_latents_for_video_image_and_sound(make_cosmos3_pipeline) -> None:
+def test_prepare_latents_for_video_image_sound_and_action(make_cosmos3_pipeline) -> None:
     pipeline = make_cosmos3_pipeline()
     latents = pipeline._prepare_latents(16, 24, 5, torch.Generator(device="cpu").manual_seed(0))
     assert latents.shape == (1, 2, 2, 2, 3)
@@ -463,8 +480,20 @@ def test_prepare_latents_for_video_image_and_sound(make_cosmos3_pipeline) -> Non
     assert (sound_latents.shape, latent_frames) == (torch.Size([1, 3, 6]), 6)
     assert pipeline._decode_sound_latents(torch.zeros(1, 3, 6), target_audio_samples=21).shape == (1, 2, 21)
 
+    pipeline.transformer = pipeline.transformer.__class__(action_gen=True, action_dim=4)
+    action, action_mask, clean, raw_dim = pipeline._prepare_action_latents(
+        mode="forward_dynamics",
+        action_chunk_size=2,
+        raw_action_dim=None,
+        generator=torch.Generator(device="cpu").manual_seed(0),
+        sp=SimpleNamespace(extra_args={"action": [[1.0, 2.0], [3.0, 4.0]]}),
+    )
+    assert raw_dim == 2
+    assert action_mask.tolist() == [[[0.0], [0.0]]]
+    torch.testing.assert_close(action, clean)
+
 
-def test_diffuse_covers_cfg_i2v_and_sound_steps(make_cosmos3_pipeline) -> None:
+def test_diffuse_covers_cfg_i2v_and_multimodal_steps(make_cosmos3_pipeline) -> None:
     pipeline = make_cosmos3_pipeline()
     latents = torch.zeros(1, 2, 1, 1, 1)
 
@@ -496,20 +525,22 @@ def test_diffuse_covers_cfg_i2v_and_sound_steps(make_cosmos3_pipeline) -> None:
     )
     torch.testing.assert_close(i2v[:, :, 0:1], torch.full((1, 2, 1, 1, 1), 7.0))
 
-    pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, sound_gen=True, sound_dim=3)
-    video_result, sound_result = pipeline.diffuse(
+    pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, action_gen=True, action_dim=4)
+    video_result, action_result = pipeline.diffuse(
         latents=latents,
-        sound_latents=torch.zeros(1, 3, 4),
+        action_latents=torch.zeros(1, 3, 4),
+        action_velocity_mask=torch.ones(1, 3, 1),
+        action_condition_latents=torch.zeros(1, 3, 4),
         timesteps=torch.tensor([7, 3]),
         cond_ids=_ids(2),
         cond_mask=_mask(),
         uncond_ids=_ids(1),
         uncond_mask=_mask(),
         guidance_scale=1.0,
-        shared_kwargs={"video_shape": (1, 1, 1), "fps": 24.0},
+        shared_kwargs={"video_shape": (1, 1, 1), "fps": 24.0, "action_domain_ids": torch.tensor([0])},
     )
     torch.testing.assert_close(video_result, torch.full_like(latents, 4.0))
-    torch.testing.assert_close(sound_result, torch.full((), 24.0).expand_as(sound_result))
+    torch.testing.assert_close(action_result, torch.full((), 44.0).expand_as(action_result))
 
 
 def test_diffuse_keeps_paired_cfg_when_cache_dit_active(make_cosmos3_pipeline) -> None:
@@ -568,6 +599,8 @@ def fake_prepare(height, width, num_frames, generator):
         def fake_diffuse(**kwargs):
             captured["diffuse_calls"].append(kwargs)
             outputs = [kwargs["latents"] + len(captured["diffuse_calls"])]
+            if kwargs.get("action_latents") is not None:
+                outputs.append(kwargs["action_latents"] + 3.0)
             if kwargs.get("sound_latents") is not None:
                 outputs.append(kwargs["sound_latents"] + 2.0)
             return outputs[0] if len(outputs) == 1 else tuple(outputs)
@@ -612,7 +645,7 @@ def test_forward_defaults_and_mode_selection(
         assert captured["flow_shifts"] == expected["flow"]
         assert [call[0] for call in pipeline.scheduler.set_timesteps_calls] == expected["steps"]
 
-    def test_forward_i2v_and_sound_routes(self, make_cosmos3_pipeline) -> None:
+    def test_forward_i2v_sound_and_action_routes(self, make_cosmos3_pipeline) -> None:
         pipeline = make_cosmos3_pipeline()
         captured = self._install_forward_stubs(pipeline)
         image_tensor = torch.zeros(1, 3, 16, 16)
@@ -651,6 +684,31 @@ def test_forward_i2v_and_sound_routes(self, make_cosmos3_pipeline) -> None:
         assert captured["diffuse_calls"][-1]["sound_latents"] is sound_latents
         assert output.output["audio_sample_rate"] == 10
 
+        pipeline.transformer = pipeline.transformer.__class__(latent_channel_size=2, action_gen=True, action_dim=4)
+        output = pipeline.forward(
+            SimpleNamespace(
+                prompts=[
+                    {
+                        "prompt": "Pick the block.",
+                        "modalities": ["video"],
+                        "additional_information": {"preprocessed_image": image_tensor},
+                    }
+                ],
+                sampling_params=make_sampling_params(
+                    height=16,
+                    width=16,
+                    extra_args={
+                        "action_mode": "policy",
+                        "action_chunk_size": 2,
+                        "raw_action_dim": 2,
+                        "domain_name": "bridge_orig_lerobot",
+                    },
+                ),
+            )
+        )
+        assert captured["diffuse_calls"][-1]["shared_kwargs"]["action_domain_ids"].tolist() == [7]
+        assert output.custom_output["action"].shape == (1, 2, 2)
+
     @pytest.mark.parametrize(
         ("prompt", "sampling_params", "message"),
         [