Add Cosmos3 model support #3454
hsliuustc0106
left a comment
Merge conflict needs fixing before review.
Good to see Cosmos support in vLLM-Omni. Here are some personal comments on the

Context

Cosmos3 is a single diffusion pipeline that can serve both image and video

The requested output type is an endpoint/request contract, not a sampling

Option 1: Keep Dedicated Image/Video Endpoints

This is the minimal and recommended path for PR 3454. The endpoint determines the output type and stores it in request-level metadata:

    # /v1/images/generations
    prompt["modalities"] = ["image"]
    # /v1/videos and /v1/videos/sync
    prompt["modalities"] = ["video"]

Cosmos3 reads the requested modality from the prompt:

    modalities = first_prompt.get("modalities", [])
    is_t2i = "image" in modalities
    is_video = "video" in modalities

If post-processing needs the modality after

    return DiffusionOutput(output={"image": decoded_video})  # T2I, T=1
    return DiffusionOutput(output={"video": decoded_video})  # T2V/I2V

Then

    if "image" in output:
        image = output["image"].squeeze(2)
        return video_processor.postprocess(image, output_type="pil")
    if "video" in output:
        return {
            "video": video_processor.postprocess_video(output["video"], output_type=output_type)
        }

Benefits:
Option 2: Extend Chat Multimodal Output To Video

This is a larger Omni-specific API extension. Request example:

    {
      "model": "cosmos3",
      "messages": [
        {"role": "user", "content": "A corgi running in the park at sunset"}
      ],
      "modalities": ["video"],
      "extra_body": {
        "num_frames": 81,
        "fps": 24,
        "size": "720x1080"
      }
    }

The chat serving layer would convert this to the same internal prompt contract:

    engine_prompt["modalities"] = ["video"]

The response formatter would need a new video branch:

    elif omni_outputs.final_output_type == "video":
        choices_data = self._create_video_choice(...)

The response should preferably return a video id or URL instead of inline base64:

    {
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": [
              {
                "type": "video",
                "video": {
                  "id": "video-xxx",
                  "url": "/v1/videos/video-xxx/content",
                  "mime_type": "video/mp4",
                  "fps": 24,
                  "num_frames": 81
                }
              }
            ]
          }
        }
      ]
    }

Benefits:

Costs:
Recommendation

For PR 3454, use Option 1. Do not add
c865643 to 521c322
@TKONIY I've updated the code to follow Option 1 as you proposed. Please let me know what you think now.
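The Option 1 contract discussed above can be sketched in a few lines. This is a minimal illustration that assumes a plain dict prompt; `ENDPOINT_MODALITIES`, `build_prompt`, and `resolve_output_key` are hypothetical names, not vLLM-Omni APIs. Only the endpoint paths and the `modalities` key come from the review comment, and the fallback to video when no modality is set is an assumption.

```python
# Hypothetical endpoint-to-modality table; the paths are the ones named in the review.
ENDPOINT_MODALITIES = {
    "/v1/images/generations": ["image"],
    "/v1/videos": ["video"],
    "/v1/videos/sync": ["video"],
}


def build_prompt(endpoint: str, text: str) -> dict:
    """Attach the requested output type to the prompt as request-level metadata."""
    return {"prompt": text, "modalities": list(ENDPOINT_MODALITIES[endpoint])}


def resolve_output_key(prompt: dict) -> str:
    """Pick the key under which the pipeline would return its decoded output."""
    modalities = prompt.get("modalities", [])
    is_t2i = "image" in modalities
    is_video = "video" in modalities
    if is_t2i and is_video:
        raise ValueError("this sketch expects exactly one output modality")
    if is_t2i:
        return "image"
    # Assumption: fall back to video when the endpoint did not set a modality.
    return "video"


image_prompt = build_prompt("/v1/images/generations", "a corgi")
video_prompt = build_prompt("/v1/videos/sync", "a corgi running")
print(resolve_output_key(image_prompt), resolve_output_key(video_prompt))
```

Because the endpoint sets the modality, sampling parameters never have to encode the output type, which is the point the review makes about keeping it an endpoint/request contract.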
b361819 to 5fc0e2a
Signed-off-by: Maciej Bala <mbala@nvidia.com>
d556aeb to 1cc4059
… tuples Signed-off-by: Maciej Bala <mbala@nvidia.com>
…and removed unnecessary padding Signed-off-by: Maciej Bala <mbala@nvidia.com>
I added one more framework-level change to enable flow_shift for image generation.
alex-jw-brooks
left a comment
Some additional thoughts - I think the main things from my end are the name for the guardrails flag and understanding why the diffusion attention backend isn't used for causal attention, since we should ideally use it there too.
Some of the others can be follow-up refactors to try to unblock this PR. Thanks!
    # -- Weight loading --------------------------------------------------------

    @staticmethod
    def _remap_ckpt_key(key: str) -> str | None:
I see, thanks for the explanation. I do still think this can be better aligned with other models in Omni, because it's a bit strange to have this much complexity in the pipeline load. Usually the pipeline load is simple, and the transformer handles remapping etc, since other components like the vae are not loaded by the weight loader. As an example, you can see Flux2's pipeline loader here is super minimal, and the load weights on the transformer here handles remapping including things like param remapping for to_q/add_q_proj.
Can you please add a TODO to refactor this part into the transformer implementation? I can help with this as a follow-up when I have cycles to avoid blocking this PR for now
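As a rough illustration of the refactor suggested above, the sketch below does the key remapping next to the transformer's weight iteration instead of in the pipeline loader. The prefix table, the skipped prefixes, and the function names are hypothetical and chosen only for illustration; the real Cosmos3 checkpoint layout and the vLLM-Omni loader interfaces are not shown in this thread.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical checkpoint-prefix -> module-prefix table (not the real Cosmos3 layout).
_PREFIX_MAP = {
    "net.blocks.": "transformer_blocks.",
    "net.final_layer.": "proj_out.",
}
# Hypothetical prefixes for components that the weight loader does not handle.
_SKIPPED_PREFIXES = ("vae.", "text_encoder.")


def remap_ckpt_key(key: str) -> Optional[str]:
    """Return the module-side name for a checkpoint key, or None to drop it."""
    if key.startswith(_SKIPPED_PREFIXES):
        return None
    for old, new in _PREFIX_MAP.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key  # keys that need no remapping pass through unchanged


def remapped_weights(
    weights: Iterable[Tuple[str, object]],
) -> Iterator[Tuple[str, object]]:
    """Yield (name, tensor) pairs with remapped names, skipping dropped keys."""
    for name, tensor in weights:
        new_name = remap_ckpt_key(name)
        if new_name is not None:
            yield new_name, tensor


sample = [("net.blocks.0.attn.to_q.weight", 1), ("vae.encoder.w", 2), ("other", 3)]
print(list(remapped_weights(sample)))
```

With this shape, a transformer-side weight loader could iterate over `remapped_weights(weights)` and the pipeline loader would only forward the raw iterator.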
            softmax_scale=1.0 / (self.head_dim**0.5),
            num_kv_heads=self.num_kv_heads,
        )
    return self._sp_attn
This feels a bit weird to me. Given that sequence parallelism is configured at initialization time, I don't think it makes sense to keep the local_attn/_sp_attn as separate attributes. It would be nice to try to unify the SP / non-SP paths I think, although it can be a follow-up
I simplified it a little: we now have only one attribute for attention. However, we still need a separate forward-pass path, because joint-key is not supported on the non-SP path. I left a TODO in the code for when the framework supports it.
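A toy sketch of the shape described above: one attention attribute chosen at initialization, with two forward paths because only one backend accepts joint keys. Every class and argument name here is hypothetical, lists of strings stand in for tensors, and folding the joint tokens into the key/value sequence on the non-SP path is an assumption made for illustration rather than a description of the PR's code.

```python
from typing import List, Optional

Tokens = List[str]  # stand-in for tensors in this toy example


class LocalBackend:
    """Hypothetical non-SP backend: takes q, k, v only (no joint-key argument)."""

    def __call__(self, q: Tokens, k: Tokens, v: Tokens) -> int:
        return len(k)  # pretend "attention" reports how many keys were attended


class SPBackend:
    """Hypothetical SP backend: accepts joint keys/values as separate arguments."""

    def __call__(
        self,
        q: Tokens,
        k: Tokens,
        v: Tokens,
        joint_k: Optional[Tokens] = None,
        joint_v: Optional[Tokens] = None,
    ) -> int:
        return len(k) + len(joint_k or [])


class Attention:
    def __init__(self, sp_size: int) -> None:
        # Sequence parallelism is fixed at init, so a single attribute is enough.
        self.use_sp = sp_size > 1
        self.attn = SPBackend() if self.use_sp else LocalBackend()

    def forward(
        self, q: Tokens, k: Tokens, v: Tokens, joint_k: Tokens, joint_v: Tokens
    ) -> int:
        if self.use_sp:
            return self.attn(q, k, v, joint_k=joint_k, joint_v=joint_v)
        # Assumption: without joint-key support, prepend the joint tokens to k/v.
        return self.attn(q, joint_k + k, joint_v + v)


args = (["q"], ["a", "b"], ["a", "b"], ["j"], ["j"])
print(Attention(sp_size=1).forward(*args), Attention(sp_size=2).forward(*args))
```

Once the non-SP backend accepts joint keys, the branch in `forward` could collapse into a single call.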
Cosmos3 pipelines are only in the unreleased vllm-omni PR vllm-project/vllm-omni#3454, not in any released wheel. Re-enable the git-install mechanism (reverted in 7744835) so the vllm-runtime container installs vllm-omni from the canonical repo pinned to the current PR head SHA (65b83d87, == refs/pull/3454/head).

When vllm_omni_git_url is set, install_vllm_omni.sh installs "vllm-omni @ git+<url>@<ref>"; otherwise it falls back to the released "vllm-omni==<ref>" wheel.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
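The selection rule in the commit message above can be restated as a small function. This is a sketch of the described behavior, not the contents of install_vllm_omni.sh; the function name is hypothetical, and the repository URL in the usage line is an assumption derived from the vllm-project/vllm-omni reference.

```python
from typing import Optional


def vllm_omni_requirement(ref: str, git_url: Optional[str] = None) -> str:
    """Build the pip requirement string for vllm-omni.

    With a git URL, `ref` is a commit SHA or branch and the result is a direct
    reference; without one, `ref` is treated as a released version number.
    """
    if git_url:
        return f"vllm-omni @ git+{git_url}@{ref}"
    return f"vllm-omni=={ref}"


# Assumed URL for the canonical repo; the SHA is the one quoted in the commit message.
print(vllm_omni_requirement("65b83d87", "https://github.com/vllm-project/vllm-omni.git"))
print(vllm_omni_requirement("0.22.0"))
```

Either string can be passed to `pip install` as a single quoted argument.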
I made one more change that impacts the entire framework:

The model is also public now at https://huggingface.co/nvidia/Cosmos3-Nano (and a few more variants that also work with this PR)
- recipes/nvidia/Cosmos3-Nano.md: T2I/T2V/I2V online + offline usage matching the official model-card recipe (1280x720, 189 frames, flow_shift=10, guardrails), with measured latency; indexed in recipes/README.md.
- test_prompt_formatting_and_checkpoint_key_remap: enable the now off-by-default duration/resolution templates via extra_args so assertions match the implementation.

Signed-off-by: lishunyang12 <lishunyang12@163.com>
Rebased to 0.22.0, added a recipe, and retested against the Cosmos3 repo.
Purpose
Add support for a Cosmos3 model: https://huggingface.co/nvidia/Cosmos3-Nano (and more variants).
The Cosmos3 model covers the t2i, t2v, and i2v modalities, combined with sound generation on top of video, as well as three different modes for action generation:
- policy (predicts action and video based on prompt and image),
- forward_dynamics (predicts video based on action sequence) and
- inverse_dynamics (predicts action sequence based on video).

This PR covers only the t2v, i2v and t2i modalities. The code for the other modalities is ready and will be part of follow-up PRs once this one is reviewed and merged.
Notable changes that may affect more than the Cosmos3 model:
- Added a modalities field to OmniTextPrompt to distinguish between t2i and t2v prompts for the same pipeline.

Test Plan
Unit tests
cd tests; python -m pytest -v -m "core_model and cpu"

Added 91 new test cases for the new model integration and pipeline unit tests.
Serving tests
Host server with
vllm serve nvidia/Cosmos3-Nano --omni

Run a request with
curl -sS -X POST http://localhost:8000/v1/videos/sync \
  -H "Accept: video/mp4" \
  -F "prompt=A low-angle tracking shot follows a man riding a vintage black motorcycle across a lush green grassy yard. Sunlight filters through overhead trees, casting dappled shadows across the vibrating chrome exhaust and the rider's leather jacket. He kicks up small blades of grass as he maneuvers the bike. He gradually decelerates, the front fork compressing slightly as he brakes to a smooth halt beside another individual standing in the shade. The camera settles into a medium two-shot, capturing the rider lifting his visor to speak, his face framed by a matte helmet. The video is 8 seconds long and is of 24 FPS. This video is of 1280x720 resolution. Audio description: The rhythmic, mechanical chugging of a four-stroke motorcycle engine dominates the foreground, characterized by a throaty, guttural timbre. Periodic high-pitched revs punctuate the steady idle as the throttle is twisted. The sound of tires crunching softly over dry grass and twigs provides a textured background layer. As the vehicle slows, the engine note drops to a low-frequency rumble before clicking into neutral. A muffled, mid-range male voice begins speaking, accompanied by the metallic clink of a helmet visor snapping upward and the faint chirping of distant birds in an open-air environment." \
  -F "negative_prompt=blurry, distorted, low quality" \
  -F "size=1280x720" \
  -F "num_frames=193" \
  -F "fps=24" \
  -F "num_inference_steps=35" \
  -F "guidance_scale=4.0" \
  -F "seed=42" \
  -F 'extra_params={"use_resolution_template":false,"use_duration_template":false}' \
  -o cosmos3_t2v.mp4

Test Result
Unit tests
The unit tests pass, including all of the new Cosmos unit tests.
======= 2209 passed, 4 skipped, 1130 deselected, 50 warnings in 679.48s (0:11:19) ====
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.