Add Cosmos3 model support by MaciejBalaNV · Pull Request #3454 · vllm-project/vllm-omni

MaciejBalaNV · 2026-05-08T13:40:27Z

Purpose

Add support for the Cosmos3 model: https://huggingface.co/nvidia/Cosmos3-Nano (and more variants).

Cosmos3 covers the t2i, t2v, and i2v modalities, combined with sound generation on top of video, as well as three modes for action generation: policy (predicts action and video from a prompt and an image), forward_dynamics (predicts video from an action sequence), and inverse_dynamics (predicts an action sequence from video).

This PR covers only the t2v, i2v, and t2i modalities. The code for the other modalities is ready and will land in follow-up PRs once this one is reviewed and merged.

Notable changes that may affect more than the Cosmos3 model:

Added a modalities field to OmniTextPrompt so that t2i and t2v prompts can be distinguished within the same pipeline.

Test Plan

Unit tests

cd tests; python -m pytest -v -m "core_model and cpu"

Added 91 new test cases for the new model integration and pipeline unit tests.

Serving tests

Start the server with:

vllm serve nvidia/Cosmos3-Nano --omni

Send a request with:

curl -sS -X POST http://localhost:8000/v1/videos/sync \
  -H "Accept: video/mp4" \
  -F "prompt=A low-angle tracking shot follows a man riding a vintage black motorcycle across a lush green grassy yard. Sunlight filters through overhead trees, casting dappled shadows across the vibrating chrome exhaust and the rider's leather jacket. He kicks up small blades of grass as he maneuvers the bike. He gradually decelerates, the front fork compressing slightly as he brakes to a smooth halt beside another individual standing in the shade. The camera settles into a medium two-shot, capturing the rider lifting his visor to speak, his face framed by a matte helmet. The video is 8 seconds long and is of 24 FPS. This video is of 1280x720 resolution. Audio description: The rhythmic, mechanical chugging of a four-stroke motorcycle engine dominates the foreground, characterized by a throaty, guttural timbre. Periodic high-pitched revs punctuate the steady idle as the throttle is twisted. The sound of tires crunching softly over dry grass and twigs provides a textured background layer. As the vehicle slows, the engine note drops to a low-frequency rumble before clicking into neutral. A muffled, mid-range male voice begins speaking, accompanied by the metallic clink of a helmet visor snapping upward and the faint chirping of distant birds in an open-air environment." \
  -F "negative_prompt=blurry, distorted, low quality" \
  -F "size=1280x720" \
  -F "num_frames=193" \
  -F "fps=24" \
  -F "num_inference_steps=35" \
  -F "guidance_scale=4.0" \
  -F "seed=42" \
  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false}' \
  -o cosmos3_t2v.mp4
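
For scripted testing, the same request can be assembled in Python. The sketch below only builds the multipart form fields that the curl command sends; the helper name build_t2v_form is hypothetical and not part of vLLM-Omni, and sending the request is left as a comment that assumes the third-party requests package and a server on localhost:8000.

```python
import json


def build_t2v_form(prompt, width=1280, height=720, num_frames=193, fps=24, seed=42):
    """Build the form fields used by the curl example (hypothetical helper)."""
    extra_params = {"use_resolution_template": False, "use_duration_template": False}
    return {
        "prompt": prompt,
        "negative_prompt": "blurry, distorted, low quality",
        "size": f"{width}x{height}",
        "num_frames": str(num_frames),
        "fps": str(fps),
        "num_inference_steps": "35",
        "guidance_scale": "4.0",
        "seed": str(seed),
        # extra_params travels as a JSON-encoded string inside the form.
        "extra_params": json.dumps(extra_params),
    }


form = build_t2v_form("A man rides a vintage motorcycle across a grassy yard.")

# To send it (assumes `requests` is installed and the server is running):
#   requests.post(
#       "http://localhost:8000/v1/videos/sync",
#       headers={"Accept": "video/mp4"},
#       files={name: (None, value) for name, value in form.items()},
#   )
```

Form values are kept as strings because multipart fields carry text, which matches how curl -F passes them.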

Test Result

Unit tests

The unit tests pass, including all of the new Cosmos unit tests.

======= 2209 passed, 4 skipped, 1130 deselected, 50 warnings in 679.48s (0:11:19) ====

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

hsliuustc0106

Merge conflict needs fixing before review.

TKONIY · 2026-05-11T13:18:55Z

Good to see Cosmos3 support in vLLM-Omni. Here are some personal comments on the image/video entrypoint design.

Context

Cosmos3 is a single diffusion pipeline that can serve both image and video
generation:

/v1/images/generations should run text-to-image and return an image response.
/v1/videos and /v1/videos/sync should run text/image-to-video and return video output.

The requested output type is an endpoint/request contract, not a sampling
parameter. Therefore, adding output_modality to OmniDiffusionSamplingParams
is not the preferred design.

Option 1: Keep Dedicated Image/Video Endpoints

This is the minimal and recommended path for PR 3454.

The endpoint determines the output type and stores it in request-level metadata:

# /v1/images/generations
prompt["modalities"] = ["image"]

# /v1/videos and /v1/videos/sync
prompt["modalities"] = ["video"]

Cosmos3 reads the requested modality from the prompt:

modalities = first_prompt.get("modalities", [])
is_t2i = "image" in modalities
is_video = "video" in modalities
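
A minimal, self-contained sketch of this contract follows. The endpoint-to-modality table comes from the lists above; the helper names tag_prompt and resolve_modality are hypothetical, and rejecting a prompt that carries both or neither modality is an assumption rather than documented behaviour.

```python
# Request-level modality tag per endpoint, as listed in the comment above.
ENDPOINT_MODALITIES = {
    "/v1/images/generations": ["image"],
    "/v1/videos": ["video"],
    "/v1/videos/sync": ["video"],
}


def tag_prompt(prompt, endpoint):
    """Return a copy of the prompt carrying the endpoint's modality tag."""
    tagged = dict(prompt)
    tagged["modalities"] = list(ENDPOINT_MODALITIES[endpoint])
    return tagged


def resolve_modality(prompt):
    """Pick "image" or "video" from the request-level metadata."""
    modalities = prompt.get("modalities", [])
    is_t2i = "image" in modalities
    is_video = "video" in modalities
    if is_t2i == is_video:
        # Assumption: refuse both-set and neither-set instead of guessing.
        raise ValueError("expected exactly one of 'image' or 'video'")
    return "image" if is_t2i else "video"


image_prompt = tag_prompt({"prompt": "a corgi"}, "/v1/images/generations")
video_prompt = tag_prompt({"prompt": "a corgi"}, "/v1/videos/sync")
```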

If post-processing needs the modality after forward(), the pipeline can return
a typed output dictionary:

return DiffusionOutput(output={"image": decoded_video})  # T2I, T=1
return DiffusionOutput(output={"video": decoded_video})  # T2V/I2V

Then get_cosmos3_post_process_func() dispatches by key:

if "image" in output:
    image = output["image"].squeeze(2)
    return video_processor.postprocess(image, output_type="pil")

if "video" in output:
    return {
        "video": video_processor.postprocess_video(output["video"], output_type=output_type)
    }
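
The key-based dispatch can be exercised without the real video processor. In this sketch the two callables stand in for video_processor.postprocess and video_processor.postprocess_video; the factory name make_post_process and the error raised for an unknown key are assumptions, not the PR's code.

```python
def make_post_process(to_image, to_video):
    """Build a post-process function that dispatches on the output dict key."""

    def post_process(output):
        if "image" in output:
            # T2I path: the pipeline returned a single-frame result.
            return to_image(output["image"])
        if "video" in output:
            # T2V/I2V path: keep the dict shape used by video+audio models.
            return {"video": to_video(output["video"])}
        raise KeyError("output must contain an 'image' or 'video' key")

    return post_process


# Stand-in processors that only label their input.
post_process = make_post_process(
    to_image=lambda tensor: ("pil", tensor),
    to_video=lambda tensor: ("frames", tensor),
)

image_result = post_process({"image": "decoded"})
video_result = post_process({"video": "decoded"})
```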

Benefits:

No new field in OmniDiffusionSamplingParams.
Keeps image/video OpenAI-style endpoints clear.
Matches existing video+audio model convention where dict keys carry output payload types, e.g. {"video": ..., "audio": ...}.
Allows chat multimodal paths to reuse the same internal prompt["modalities"] convention later.

Option 2: Extend Chat Multimodal Output To Video

This is a larger Omni-specific API extension.

Request example:

{
  "model": "cosmos3",
  "messages": [
    {"role": "user", "content": "A corgi running in the park at sunset"}
  ],
  "modalities": ["video"],
  "extra_body": {
    "num_frames": 81,
    "fps": 24,
    "size": "720x1080"
  }
}

The chat serving layer would convert this to the same internal prompt contract:

engine_prompt["modalities"] = ["video"]

The response formatter would need a new video branch:

elif omni_outputs.final_output_type == "video":
    choices_data = self._create_video_choice(...)

The response should preferably return a video id or URL instead of inline base64:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "video",
            "video": {
              "id": "video-xxx",
              "url": "/v1/videos/video-xxx/content",
              "mime_type": "video/mp4",
              "fps": 24,
              "num_frames": 81
            }
          }
        ]
      }
    }
  ]
}
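
As a rough illustration of what a video branch in the response formatter might build, the sketch below assembles the choice shown above from a stored video id. The function name create_video_choice is hypothetical (the signature of _create_video_choice is not shown in this thread), and the URL pattern is copied from the example response.

```python
def create_video_choice(video_id, fps, num_frames, mime_type="video/mp4"):
    """Build one chat choice that references a stored video by id and URL."""
    return {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "type": "video",
                    "video": {
                        "id": video_id,
                        # Content URL instead of inline base64, as suggested above.
                        "url": f"/v1/videos/{video_id}/content",
                        "mime_type": mime_type,
                        "fps": fps,
                        "num_frames": num_frames,
                    },
                }
            ],
        }
    }


response = {"choices": [create_video_choice("video-xxx", fps=24, num_frames=81)]}
```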

Benefits:

Aligns with Qwen-Omni style request-level modalities.
Gives users one chat-style multimodal API for text/audio/image/video.

Costs:

Requires a new chat video response contract.
Requires final_output_type="video" handling in chat response formatting.
Requires video storage or content URL plumbing.
It is not necessary for basic Cosmos3 /v1/images and /v1/videos support.

Recommendation

For PR 3454, use Option 1.

Do not add output_modality to OmniDiffusionSamplingParams. Let the endpoint
set prompt["modalities"], and let Cosmos3 branch on that request-level
modality. Consider Option 2 only as a follow-up feature for chat-based video
generation.

MaciejBalaNV · 2026-05-11T14:59:17Z

@TKONIY I've updated the code to follow Option 1 as you proposed. Please let me know what you think now.


MaciejBalaNV · 2026-05-29T16:52:23Z

I added one more framework-level change to enable flow_shift for image generation.


alex-jw-brooks

Some additional thoughts - I think the main things from my end are the name for the guardrails flag and understanding why the diffusion attention backend isn't used for causal attention, since we should ideally use it there too.

Some of the others can be follow-up refactors to try to unblock this PR. Thanks!

alex-jw-brooks · 2026-05-30T05:52:39Z

+    # -- Weight loading --------------------------------------------------------
+
+    @staticmethod
+    def _remap_ckpt_key(key: str) -> str | None:


I see, thanks for the explanation. I do still think this can be better aligned with other models in Omni, because it's a bit strange to have this much complexity in the pipeline load. Usually the pipeline load is simple, and the transformer handles remapping etc, since other components like the vae are not loaded by the weight loader. As an example, you can see Flux2's pipeline loader here is super minimal, and the load weights on the transformer here handles remapping including things like param remapping for to_q/add_q_proj.

Can you please add a TODO to refactor this part into the transformer implementation? I can help with this as a follow-up when I have cycles, to avoid blocking this PR for now.
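
For readers unfamiliar with this kind of remapping, here is a small sketch of a function with the same shape as _remap_ckpt_key (key in, new key or None out). The prefix rules are entirely hypothetical and are not Cosmos3's actual checkpoint layout; they only illustrate the rename-or-skip pattern under discussion.

```python
from typing import Optional

# Hypothetical rules for illustration only; the real mapping lives in the PR.
PREFIX_RENAMES = {"net.": "transformer."}
SKIPPED_PREFIXES = ("ema.",)


def remap_ckpt_key(key: str) -> Optional[str]:
    """Return the model-side parameter name, or None to skip the tensor."""
    if key.startswith(SKIPPED_PREFIXES):
        return None
    for old_prefix, new_prefix in PREFIX_RENAMES.items():
        if key.startswith(old_prefix):
            return new_prefix + key[len(old_prefix):]
    # Keys without a matching rule pass through unchanged.
    return key


checkpoint_keys = ["net.blocks.0.attn.to_q.weight", "ema.blocks.0.attn.to_q.weight"]
remapped = [remap_ckpt_key(key) for key in checkpoint_keys]
```

Moving such a function onto the transformer's load_weights, as suggested above, would leave the pipeline loader with no remapping logic of its own.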

alex-jw-brooks · 2026-05-30T06:13:21Z

+                softmax_scale=1.0 / (self.head_dim**0.5),
+                num_kv_heads=self.num_kv_heads,
+            )
+        return self._sp_attn


This feels a bit weird to me. Given that sequence parallelism is configured at initialization time, I don't think it makes sense to keep the local_attn/_sp_attn as separate attributes. It would be nice to try to unify the SP / non-SP paths, I think, although it can be a follow-up.

MaciejBalaNV · 2026-06-01

I simplified it a little: there is now only one attribute for attention. However, a separate forward-pass path is still needed, as joint-key is not supported on the non-SP path. I left a TODO in the code for when the framework supports it.
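
The two arguments visible in the quoted diff, softmax_scale and num_kv_heads, can be illustrated in isolation. The sketch below computes the same scale expression and shows the usual grouped-query attention convention in which contiguous groups of query heads share one key/value head; whether the Cosmos3 backend uses exactly this head layout is an assumption, and both function names are hypothetical.

```python
def softmax_scale(head_dim):
    """Scale applied to attention logits, as in the quoted diff."""
    return 1.0 / (head_dim ** 0.5)


def kv_head_for_query_head(q_head, num_heads, num_kv_heads):
    """Map a query head to its shared KV head (contiguous-group convention)."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size


scale = softmax_scale(64)
# 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
mapping = [kv_head_for_query_head(h, num_heads=8, num_kv_heads=2) for h in range(8)]
```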

Cosmos3 pipelines are only in the unreleased vllm-omni PR vllm-project/vllm-omni#3454, not in any released wheel. Re-enable the git-install mechanism (reverted in 7744835) so the vllm-runtime container installs vllm-omni from the canonical repo pinned to the current PR head SHA (65b83d87, == refs/pull/3454/head). When vllm_omni_git_url is set, install_vllm_omni.sh installs "vllm-omni @ git+<url>@<ref>"; otherwise it falls back to the released "vllm-omni==<ref>" wheel. Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
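
The install rule described in this commit message reduces to choosing between two requirement strings. The sketch below reproduces that choice in Python for clarity; install_vllm_omni.sh itself is a shell script whose contents are not shown here, so the function name and argument handling are assumptions, and the repository URL is only an example value.

```python
def requirement_spec(ref, git_url=""):
    """Return the pip requirement for vllm-omni given a ref and optional git URL."""
    if git_url:
        # Git install pinned to a commit, branch, or tag.
        return f"vllm-omni @ git+{git_url}@{ref}"
    # Fallback: treat the ref as a released wheel version.
    return f"vllm-omni=={ref}"


from_git = requirement_spec("65b83d87", "https://github.com/vllm-project/vllm-omni.git")
from_wheel = requirement_spec("0.22.0")
```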


MaciejBalaNV · 2026-06-01T11:40:11Z

I made one more change that impacts the entire framework: the images/generations endpoint now accepts an extra_params field in the same way the video endpoints do. For Cosmos3 this is useful for, e.g., disabling guardrails on a per-request basis or setting the resolution or duration template fields.
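
One way an endpoint could normalise such a field is sketched below. This is not the PR's implementation: the helper name parse_extra_params and its error handling are assumptions, and the two keys used in the example are the ones that appear in the curl command from the test plan.

```python
import json


def parse_extra_params(raw):
    """Normalise extra_params (JSON string, dict, or missing) into a dict."""
    if raw is None or raw == "":
        return {}
    parsed = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(parsed, dict):
        raise TypeError("extra_params must be a JSON object")
    # Copy so later per-request changes do not touch the caller's object.
    return dict(parsed)


params = parse_extra_params('{"use_resolution_template": false, "use_duration_template": false}')
empty = parse_extra_params(None)
```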

The model is also public now at https://huggingface.co/nvidia/Cosmos3-Nano (and a few more variants that also work with this PR)


- recipes/nvidia/Cosmos3-Nano.md: T2I/T2V/I2V online + offline usage matching the official model-card recipe (1280x720, 189 frames, flow_shift=10, guardrails), with measured latency; indexed in recipes/README.md.
- test_prompt_formatting_and_checkpoint_key_remap: enable the now off-by-default duration/resolution templates via extra_args so assertions match the implementation.

Signed-off-by: lishunyang12 <lishunyang12@163.com>

lishunyang12 · 2026-06-01T19:19:34Z

Rebased onto 0.22.0, added a recipe, and retested against the Cosmos3 repo.

solved

hsliuustc0106 reviewed May 8, 2026


MaciejBalaNV force-pushed the mbala/cosmos3_model branch 3 times, most recently from c865643 to 521c322, on May 11, 2026 14:39

MaciejBalaNV force-pushed the mbala/cosmos3_model branch 2 times, most recently from b361819 to 5fc0e2a, on May 12, 2026 10:13

Added Cosmos3 model

1cc4059

Signed-off-by: Maciej Bala <mbala@nvidia.com>

MaciejBalaNV force-pushed the mbala/cosmos3_model branch from d556aeb to 1cc4059 on May 14, 2026 11:21

MaciejBalaNV added 6 commits May 14, 2026 13:41

Small qol improvements

ee77c61

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Updated docs for Cosmos3

0d0542f

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Cleared up docs

bd4ecb3

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Fixed sound quality issues

921cc4b

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Linter fixes

75d9888

Signed-off-by: Maciej Bala <mbala@nvidia.com>

extra cleanup

9e9e453

Signed-off-by: Maciej Bala <mbala@nvidia.com>

MaciejBalaNV changed the title from "Added new model support" to "Add Cosmos3 model support" on May 14, 2026

MaciejBalaNV added 3 commits May 15, 2026 09:40

Updated examples to refer to HF repo

3c7cd31

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Improved guardrails

c36b23c

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Linter fixes

095f6b5

Signed-off-by: Maciej Bala <mbala@nvidia.com>

MaciejBalaNV marked this pull request as ready for review May 15, 2026 13:46

MaciejBalaNV requested review from Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride, tzhouam, wtomin and yenuo26 as code owners May 15, 2026 13:46

MaciejBalaNV added 2 commits May 29, 2026 10:51

Deleted cosmos3 conftest.py

2074758

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Small review improvements

5b0ba8f

Signed-off-by: Maciej Bala <mbala@nvidia.com>

MaciejBalaNV mentioned this pull request May 29, 2026

Added GQA support for CUDNNAttention backend #3984

Open

5 tasks

MaciejBalaNV added 3 commits May 29, 2026 13:11

Postprocess Cosmos3 output to be tensor-only; Update ipc.py to handle tuples

491c85a

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Improved dit cache for t2i

21a1322

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Fixed flow_shift for images; Updated max_sequence_length for Cosmos3 and removed unnecessary padding

e826f62

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Pass extra_args to image endpoint

65b83d8

Signed-off-by: Maciej Bala <mbala@nvidia.com>

ayushag-nv mentioned this pull request May 29, 2026

feat(omni): add Cosmos3 support to vLLM-Omni backend ai-dynamo/dynamo#10132

Open

alex-jw-brooks previously requested changes May 30, 2026


MaciejBalaNV added 6 commits June 1, 2026 11:26

Small review cleanup

9efb5fe

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Simplified cosmos3 attention

65e24cf

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Pass extra_params for image endpoint

30c5a98

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Removed self.sdpa_backend from Cosmos3

63770f8

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Added extra type checking

ce45394

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Reintroduce some empty dict checks

0861791

Signed-off-by: Maciej Bala <mbala@nvidia.com>

MaciejBalaNV and others added 2 commits June 1, 2026 13:40

Change resolution and duration templates to off by default

453bd09

Signed-off-by: Maciej Bala <mbala@nvidia.com>

Merge branch 'main' into mbala/cosmos3_model

a985f16

lishunyang12 added the merge-test label (triggers the Buildkite merge-test CI) and removed the ready label (triggers Buildkite CI) on Jun 1, 2026

lishunyang12 approved these changes Jun 1, 2026


lishunyang12 merged commit bc794e6 into vllm-project:main Jun 1, 2026
6 checks passed

linyueqian mentioned this pull request Jun 2, 2026

Add Cosmos3 sound generation MaciejBalaNV/vllm-omni#1

Open

5 tasks

MaciejBalaNV mentioned this pull request Jun 2, 2026

Add Cosmos3 sound generation #4073

Merged

5 tasks

bastefaniak mentioned this pull request Jun 3, 2026

Add Cosmos3 action modality #4102

Open

5 tasks

Add Cosmos3 model support #3454
lishunyang12 merged 65 commits into vllm-project:main from MaciejBalaNV:mbala/cosmos3_model


11 participants
