
Add Cosmos3 model support #3454

Merged
lishunyang12 merged 65 commits into vllm-project:main from MaciejBalaNV:mbala/cosmos3_model
Jun 1, 2026
Conversation

@MaciejBalaNV
Contributor

@MaciejBalaNV MaciejBalaNV commented May 8, 2026

Purpose

Add support for the Cosmos3 model: https://huggingface.co/nvidia/Cosmos3-Nano (and more variants).

The Cosmos3 model covers the t2i, t2v, and i2v modalities, combined with sound generation on top of video, as well as three modes for action generation: policy (predicts action and video from a prompt and an image), forward_dynamics (predicts video from an action sequence), and inverse_dynamics (predicts an action sequence from video).

This PR covers only t2v, i2v and t2i modalities. The code for other modalities is ready and will be part of the follow-up PRs once this one is reviewed and merged.

Notable changes that may affect more than the Cosmos3 model:

  • Added a modalities field to OmniTextPrompt so that t2i and t2v prompts can be told apart within the same pipeline.

Test Plan

Unit tests

cd tests; python -m pytest -v -m "core_model and cpu"

Added 91 new test cases for the new model integration and pipeline unit tests.

Serving tests

Start the server with:

vllm serve nvidia/Cosmos3-Nano --omni

Send a request with:

curl -sS -X POST http://localhost:8000/v1/videos/sync \
  -H "Accept: video/mp4" \
  -F "prompt=A low-angle tracking shot follows a man riding a vintage black motorcycle across a lush green grassy yard. Sunlight filters through overhead trees, casting dappled shadows across the vibrating chrome exhaust and the rider's leather jacket. He kicks up small blades of grass as he maneuvers the bike. He gradually decelerates, the front fork compressing slightly as he brakes to a smooth halt beside another individual standing in the shade. The camera settles into a medium two-shot, capturing the rider lifting his visor to speak, his face framed by a matte helmet. The video is 8 seconds long and is of 24 FPS. This video is of 1280x720 resolution. Audio description: The rhythmic, mechanical chugging of a four-stroke motorcycle engine dominates the foreground, characterized by a throaty, guttural timbre. Periodic high-pitched revs punctuate the steady idle as the throttle is twisted. The sound of tires crunching softly over dry grass and twigs provides a textured background layer. As the vehicle slows, the engine note drops to a low-frequency rumble before clicking into neutral. A muffled, mid-range male voice begins speaking, accompanied by the metallic clink of a helmet visor snapping upward and the faint chirping of distant birds in an open-air environment." \
  -F "negative_prompt=blurry, distorted, low quality" \
  -F "size=1280x720" \
  -F "num_frames=193" \
  -F "fps=24" \
  -F "num_inference_steps=35" \
  -F "guidance_scale=4.0" \
  -F "seed=42" \
  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false}' \
  -o cosmos3_t2v.mp4
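
For scripting the same request, the form fields can be assembled in Python first. This is a minimal sketch that assumes only the field names used in the curl command above; build_video_form is a hypothetical helper, not part of vllm-omni, and the prompt is shortened for readability.

```python
import json


def build_video_form(prompt: str, num_frames: int, fps: int, size: str = "1280x720") -> dict:
    """Assemble form fields mirroring the curl request (hypothetical helper)."""
    return {
        "prompt": prompt,
        "negative_prompt": "blurry, distorted, low quality",
        "size": size,
        "num_frames": str(num_frames),
        "fps": str(fps),
        "num_inference_steps": "35",
        "guidance_scale": "4.0",
        "seed": "42",
        # extra_params travels as a JSON-encoded string inside a single form field.
        "extra_params": json.dumps(
            {"use_resolution_template": False, "use_duration_template": False}
        ),
    }


form = build_video_form(
    "A man rides a vintage motorcycle across a grassy yard.", num_frames=193, fps=24
)
# 193 frames at 24 FPS is roughly the 8 seconds stated in the prompt.
duration_s = int(form["num_frames"]) / int(form["fps"])
print(f"{duration_s:.2f} s", form["extra_params"])
```

The resulting dict could then be sent as multipart form data by whichever HTTP client is at hand, with the Accept: video/mp4 header from the curl command.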

Test Result

Unit tests

The unit tests pass, including all of the new Cosmos unit tests.

======= 2209 passed, 4 skipped, 1130 deselected, 50 warnings in 679.48s (0:11:19) ====


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Merge conflict needs fixing before review.

@TKONIY
Contributor

TKONIY commented May 11, 2026

Good to see cosmos support in vLLM-Omni. Here are some personal comments on the image/video entrypoint design.

Context

Cosmos3 is a single diffusion pipeline that can serve both image and video
generation:

  • /v1/images/generations should run text-to-image and return an image response.
  • /v1/videos and /v1/videos/sync should run text/image-to-video and return video output.

The requested output type is an endpoint/request contract, not a sampling
parameter. Therefore, adding output_modality to OmniDiffusionSamplingParams
is not the preferred design.

Option 1: Keep Dedicated Image/Video Endpoints

This is the minimal and recommended path for PR 3454.

The endpoint determines the output type and stores it in request-level metadata:

# /v1/images/generations
prompt["modalities"] = ["image"]

# /v1/videos and /v1/videos/sync
prompt["modalities"] = ["video"]

Cosmos3 reads the requested modality from the prompt:

modalities = first_prompt.get("modalities", [])
is_t2i = "image" in modalities
is_video = "video" in modalities

If post-processing needs the modality after forward(), the pipeline can return
a typed output dictionary:

return DiffusionOutput(output={"image": decoded_video})  # T2I, T=1
return DiffusionOutput(output={"video": decoded_video})  # T2V/I2V

Then get_cosmos3_post_process_func() dispatches by key:

if "image" in output:
    image = output["image"].squeeze(2)
    return video_processor.postprocess(image, output_type="pil")

if "video" in output:
    return {
        "video": video_processor.postprocess_video(output["video"], output_type=output_type)
    }

Benefits:

  • No new field in OmniDiffusionSamplingParams.
  • Keeps image/video OpenAI-style endpoints clear.
  • Matches existing video+audio model convention where dict keys carry output
    payload types, e.g. {"video": ..., "audio": ...}.
  • Allows chat multimodal paths to reuse the same internal prompt["modalities"]
    convention later.
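
The flow in Option 1 can be sketched end to end without any vLLM-Omni imports. This is only an illustration under stated assumptions: ENDPOINT_MODALITIES, build_prompt, fake_forward, and post_process are hypothetical stand-ins, and strings take the place of decoded tensors.

```python
from typing import Any

# Hypothetical endpoint -> modality table; the paths are the ones listed above.
ENDPOINT_MODALITIES = {
    "/v1/images/generations": ["image"],
    "/v1/videos": ["video"],
    "/v1/videos/sync": ["video"],
}


def build_prompt(endpoint: str, text: str) -> dict:
    """The endpoint, not the sampling params, decides the output modality."""
    return {"prompt": text, "modalities": ENDPOINT_MODALITIES[endpoint]}


def fake_forward(prompt: dict) -> dict:
    """Stand-in for the pipeline forward(): returns a dict keyed by payload type."""
    modalities = prompt.get("modalities", [])
    decoded: Any = f"<decoded frames for {prompt['prompt']!r}>"  # placeholder for a tensor
    if "image" in modalities:
        return {"image": decoded}
    return {"video": decoded}


def post_process(output: dict) -> str:
    """Dispatch on the dict key and report which branch was taken."""
    if "image" in output:
        return "image"
    if "video" in output:
        return "video"
    raise ValueError(f"unexpected output keys: {sorted(output)}")


image_kind = post_process(fake_forward(build_prompt("/v1/images/generations", "a corgi")))
video_kind = post_process(fake_forward(build_prompt("/v1/videos/sync", "a corgi")))
print(image_kind, video_kind)
```

The sampling parameters never appear in this sketch, which is the point of the proposal: the modality rides on the prompt and on the output dict key.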

Option 2: Extend Chat Multimodal Output To Video

This is a larger Omni-specific API extension.

Request example:

{
  "model": "cosmos3",
  "messages": [
    {"role": "user", "content": "A corgi running in the park at sunset"}
  ],
  "modalities": ["video"],
  "extra_body": {
    "num_frames": 81,
    "fps": 24,
    "size": "720x1080"
  }
}

The chat serving layer would convert this to the same internal prompt contract:

engine_prompt["modalities"] = ["video"]

The response formatter would need a new video branch:

elif omni_outputs.final_output_type == "video":
    choices_data = self._create_video_choice(...)

The response should preferably return a video id or URL instead of inline base64:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "video",
            "video": {
              "id": "video-xxx",
              "url": "/v1/videos/video-xxx/content",
              "mime_type": "video/mp4",
              "fps": 24,
              "num_frames": 81
            }
          }
        ]
      }
    }
  ]
}

Benefits:

  • Aligns with Qwen-Omni style request-level modalities.
  • Gives users one chat-style multimodal API for text/audio/image/video.

Costs:

  • Requires a new chat video response contract.
  • Requires final_output_type="video" handling in chat response formatting.
  • Requires video storage or content URL plumbing.
  • It is not necessary for basic Cosmos3 /v1/images and /v1/videos support.
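
To make the proposed response contract concrete, here is a small sketch that builds the JSON shown above from a stored video id. make_video_choice is a hypothetical helper; the real formatter (the _create_video_choice call mentioned earlier) would have to be written as part of Option 2.

```python
import json


def make_video_choice(video_id: str, fps: int, num_frames: int) -> dict:
    """Build one chat choice that references the video by URL (hypothetical helper)."""
    return {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "type": "video",
                    "video": {
                        "id": video_id,
                        # A content URL avoids embedding the encoded video inline as base64.
                        "url": f"/v1/videos/{video_id}/content",
                        "mime_type": "video/mp4",
                        "fps": fps,
                        "num_frames": num_frames,
                    },
                }
            ],
        }
    }


response = {"choices": [make_video_choice("video-xxx", fps=24, num_frames=81)]}
print(json.dumps(response, indent=2))
```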

Recommendation

For PR 3454, use Option 1.

Do not add output_modality to OmniDiffusionSamplingParams. Let the endpoint
set prompt["modalities"], and let Cosmos3 branch on that request-level
modality. Consider Option 2 only as a follow-up feature for chat-based video
generation.

@MaciejBalaNV MaciejBalaNV force-pushed the mbala/cosmos3_model branch 3 times, most recently from c865643 to 521c322 Compare May 11, 2026 14:39
@MaciejBalaNV
Contributor Author

@TKONIY I've updated the code to follow Option 1 as you proposed. Please let me know what you think.

@MaciejBalaNV MaciejBalaNV force-pushed the mbala/cosmos3_model branch 2 times, most recently from b361819 to 5fc0e2a Compare May 12, 2026 10:13
Signed-off-by: Maciej Bala <mbala@nvidia.com>
@MaciejBalaNV MaciejBalaNV force-pushed the mbala/cosmos3_model branch from d556aeb to 1cc4059 Compare May 14, 2026 11:21
Signed-off-by: Maciej Bala <mbala@nvidia.com>
@MaciejBalaNV MaciejBalaNV changed the title Added new model support Add Cosmos3 model support May 14, 2026
Signed-off-by: Maciej Bala <mbala@nvidia.com>
… tuples

Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: Maciej Bala <mbala@nvidia.com>
…and removed unnecessary padding

Signed-off-by: Maciej Bala <mbala@nvidia.com>
@MaciejBalaNV
Contributor Author

I added one more framework-level change to enable flow_shift for image generation.

Signed-off-by: Maciej Bala <mbala@nvidia.com>
Contributor

@alex-jw-brooks alex-jw-brooks left a comment

Some additional thoughts - I think the main things from my end are the name for the guardrails flag and understanding why the diffusion attention backend isn't used for causal attention, since we should ideally use it there too.

Some of the others can be follow-up refactors to try to unblock this PR. Thanks!

Comment thread vllm_omni/diffusion/attention/backends/sdpa.py Outdated
Comment thread vllm_omni/diffusion/models/cosmos3/guardrails.py Outdated
# -- Weight loading --------------------------------------------------------

@staticmethod
def _remap_ckpt_key(key: str) -> str | None:
Contributor

I see, thanks for the explanation. I do still think this can be better aligned with other models in Omni, because it's a bit strange to have this much complexity in the pipeline load. Usually the pipeline load is simple, and the transformer handles remapping etc, since other components like the vae are not loaded by the weight loader. As an example, you can see Flux2's pipeline loader here is super minimal, and the load weights on the transformer here handles remapping including things like param remapping for to_q/add_q_proj.

Can you please add a TODO to refactor this part into the transformer implementation? I can help with this as a follow-up when I have cycles to avoid blocking this PR for now

Comment thread vllm_omni/diffusion/models/cosmos3/transformer_cosmos3.py Outdated
softmax_scale=1.0 / (self.head_dim**0.5),
num_kv_heads=self.num_kv_heads,
)
return self._sp_attn
Contributor

This feels a bit weird to me. Given that sequence parallelism is configured at initialization time, I don't think it makes sense to keep the local_attn/_sp_attn as separate attributes. It would be nice to try to unify the SP / non-SP paths I think, although it can be a follow-up

Contributor Author

I simplified it a bit: there is now only one attribute for attention. However, we still need a separate forward-pass path, because joint-key is not supported on the non-SP path. I left a TODO in the code for when the framework supports it.

Comment thread vllm_omni/entrypoints/cli/serve.py
saturley-hall added a commit to ai-dynamo/dynamo that referenced this pull request May 31, 2026
Cosmos3 pipelines are only in the unreleased vllm-omni PR
vllm-project/vllm-omni#3454, not in any released wheel. Re-enable the
git-install mechanism (reverted in 7744835) so the vllm-runtime
container installs vllm-omni from the canonical repo pinned to the
current PR head SHA (65b83d87, == refs/pull/3454/head).

When vllm_omni_git_url is set, install_vllm_omni.sh installs
"vllm-omni @ git+<url>@<ref>"; otherwise it falls back to the released
"vllm-omni==<ref>" wheel.
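
The selection rule described in this commit message can be sketched as follows. The actual logic lives in the shell script install_vllm_omni.sh; this Python version only illustrates the two requirement strings, and both the function name and the repository URL used in the example are assumptions.

```python
from typing import Optional


def vllm_omni_pip_spec(ref: str, git_url: Optional[str] = None) -> str:
    """Return the pip requirement string for vllm-omni (illustrative helper)."""
    if git_url:
        # Direct reference: install from a git repository pinned to a commit or ref.
        return f"vllm-omni @ git+{git_url}@{ref}"
    # Fallback: treat ref as a released version and install the wheel.
    return f"vllm-omni=={ref}"


# Assumed URL for the canonical repo; 65b83d87 is the PR head SHA quoted above.
git_spec = vllm_omni_pip_spec("65b83d87", "https://github.com/vllm-project/vllm-omni.git")
wheel_spec = vllm_omni_pip_spec("X.Y.Z")  # placeholder version, not a real release
print(git_spec)
print(wheel_spec)
```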

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Maciej Bala <mbala@nvidia.com>
@MaciejBalaNV
Contributor Author

MaciejBalaNV commented Jun 1, 2026

I made one more change that impacts the entire framework: the images/generations endpoint now accepts an extra_params field, the same way the video endpoints do. For Cosmos3 this is useful for, e.g., disabling guardrails on a per-request basis or setting the resolution or duration template fields.
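
As a rough illustration of per-request overrides, the sketch below merges a JSON-encoded extra_params value over defaults. It is not the code from this PR: resolve_extra_params and DEFAULT_EXTRA_PARAMS are hypothetical, and only the two template keys (taken from the curl example in the PR description) are used, since the guardrails flag name is not given in this thread.

```python
import json
from typing import Optional

# Assumed defaults: both templates off unless a request turns them on.
DEFAULT_EXTRA_PARAMS = {"use_resolution_template": False, "use_duration_template": False}


def resolve_extra_params(raw: Optional[str]) -> dict:
    """Merge a JSON-encoded extra_params string over the defaults (hypothetical helper)."""
    if not raw:
        return dict(DEFAULT_EXTRA_PARAMS)
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("extra_params must be a JSON object")
    # Keys sent with the request win over the defaults.
    return {**DEFAULT_EXTRA_PARAMS, **parsed}


no_override = resolve_extra_params(None)
with_override = resolve_extra_params('{"use_duration_template": true}')
print(no_override, with_override)
```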

The model is also public now at https://huggingface.co/nvidia/Cosmos3-Nano (and a few more variants that also work with this PR)

@lishunyang12 lishunyang12 added merge-test label to trigger buildkite merge test CI and removed ready label to trigger buildkite CI labels Jun 1, 2026
- recipes/nvidia/Cosmos3-Nano.md: T2I/T2V/I2V online + offline usage matching
  the official model-card recipe (1280x720, 189 frames, flow_shift=10, guardrails),
  with measured latency; indexed in recipes/README.md.
- test_prompt_formatting_and_checkpoint_key_remap: enable the now off-by-default
  duration/resolution templates via extra_args so assertions match the
  implementation.

Signed-off-by: lishunyang12 <lishunyang12@163.com>
@lishunyang12
Collaborator

lishunyang12 commented Jun 1, 2026

Rebased onto 0.22.0, added a recipe, and retested against the Cosmos3 repo.

@lishunyang12 lishunyang12 merged commit bc794e6 into vllm-project:main Jun 1, 2026
6 checks passed
@MaciejBalaNV MaciejBalaNV mentioned this pull request Jun 2, 2026
5 tasks
@bastefaniak bastefaniak mentioned this pull request Jun 3, 2026
5 tasks

Labels

merge-test label to trigger buildkite merge test CI
