
Add Cosmos3 action modality #4102

Merged: lishunyang12 merged 21 commits into vllm-project:main from MaciejBalaNV:mbala/cosmos3_action_review on Jun 3, 2026

@bastefaniak bastefaniak commented Jun 3, 2026

Purpose

This PR is a follow-up to #3454 and #4073. It adds action-modality generation: policy, forward dynamics, and inverse dynamics.

The Cosmos3 model is available at https://huggingface.co/nvidia/Cosmos3-Nano (other variants are also available).

Test Plan

Unit tests

cd tests && python -m pytest -v -m "core_model and cpu"

Added 4 new tests and updated 9 existing ones to also cover the action modality.

Serving tests

Start the server with: vllm serve nvidia/Cosmos3-Nano --omni

Forward dynamics:

# Fetch assets
curl -sSL "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_vision_25_73d01c91-51f0-46cf-9b76-5682a76fb349.mp4" -o av_vision_25.mp4
curl -sSL "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/av_action_25.json" -o av_action_25.json
ffmpeg -y -loglevel error -i av_vision_25.mp4 -vf "select=eq(n\,0)" -vframes 1 av_vision_25_frame0.jpg

# Forward dynamics: image + action sequence -> predicted video (sync endpoint)
curl -sS -X POST "http://localhost:8000/v1/videos/sync" \
  --form-string "prompt=You are an autonomous vehicle planning system. This video is captured from a first-person perspective looking at the scene." \
  -F "input_reference=@av_vision_25_frame0.jpg" \
  -F "size=640x480" \
  -F "num_frames=61" \
  -F "fps=10" \
  -F "num_inference_steps=30" \
  -F "guidance_scale=1.0" \
  -F "flow_shift=5.0" \
  --form-string "extra_params={\"action_mode\":\"forward_dynamics\",\"domain_name\":\"av\",\"raw_action_dim\":9,\"action_chunk_size\":60,\"action\":$(cat av_action_25.json)}" \
  -F "seed=0" \
  -o cosmos3_forward_dynamics_av.mp4
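
The extra_params value above is assembled with hand-escaped quotes and an inline $(cat ...). A minimal sketch of an alternative, assuming jq is installed (it is already used in the policy example): build the same JSON object with jq so that quoting stays out of the shell string. The helper name build_extra_params is hypothetical; the field names are copied from the request above.

```shell
# Hypothetical helper (not part of vLLM-Omni): prints a compact JSON object
# with the same keys as the extra_params field in the request above.
#   $1 action_mode   $2 domain_name   $3 raw_action_dim
#   $4 action_chunk_size   $5 path to a JSON file holding the action sequence
build_extra_params() {
  jq -c -n \
    --arg mode "$1" --arg domain "$2" \
    --argjson dim "$3" --argjson chunk "$4" \
    --slurpfile action "$5" \
    '{action_mode: $mode, domain_name: $domain,
      raw_action_dim: $dim, action_chunk_size: $chunk,
      action: $action[0]}'
}

# Usage with the forward-dynamics request (left commented out here):
#   extra=$(build_extra_params forward_dynamics av 9 60 av_action_25.json)
#   curl ... --form-string "extra_params=$extra" ...
```

Passing the numbers through --argjson keeps them as JSON numbers rather than strings, which matches the unquoted 9 and 60 in the original request.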

Policy:

# Fetch asset + extract conditioning frame
curl -sSL "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/action/bridge_0.mp4" -o bridge_0.mp4
ffmpeg -y -loglevel error -i bridge_0.mp4 -vf "select=eq(n\,0)" -vframes 1 bridge_0_frame0.jpg

# Policy: image + instruction -> generated video + predicted actions (async endpoint)
VIDEO_ID=$(curl -sS -X POST "http://localhost:8000/v1/videos" \
  -H "Accept: application/json" \
  --form-string "prompt=Put the pot to the left of the purple item. This video is captured from a first-person perspective looking at the scene." \
  -F "input_reference=@bridge_0_frame0.jpg" \
  -F "size=640x480" \
  -F "num_frames=17" \
  -F "fps=5" \
  -F "num_inference_steps=30" \
  -F "guidance_scale=1.0" \
  -F "flow_shift=5.0" \
  --form-string 'extra_params={"action_mode":"policy","domain_name":"bridge_orig_lerobot","raw_action_dim":10,"action_chunk_size":16}' \
  -F "seed=0" | jq -r '.id')
echo "job: $VIDEO_ID"

# Poll until completed
while true; do
  resp=$(curl -sS "http://localhost:8000/v1/videos/$VIDEO_ID")
  status=$(echo "$resp" | jq -r '.status')
  echo "status: $status"
  [ "$status" = "completed" ] && break
  [ "$status" = "failed" ] && { echo "$resp" | jq .; exit 1; }
  sleep 2
done

# Save predicted actions + generated video
echo "$resp" | jq '.data[0].action' > cosmos3_action_policy_action.json
curl -sS -L "http://localhost:8000/v1/videos/$VIDEO_ID/content" -o cosmos3_action_policy.mp4
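
The polling loop above never exits if the job stays in a non-terminal state, and jq -r '.id' prints the literal string null when the response has no id. A minimal sketch of two guards for those cases follows. The helper names, the retry limit, and the exit codes are all hypothetical; only the completed and failed status strings come from the loop above.

```shell
# Hypothetical helpers; names and exit codes are illustrative only.

# require_job_id ID
# Fails when jq printed nothing or the literal string "null".
require_job_id() {
  case "$1" in
    ""|null) echo "no job id in response" >&2; return 1 ;;
  esac
  return 0
}

# poll_status CMD MAX_TRIES DELAY
# Calls CMD (a command or function name that prints a status string) until it
# prints "completed" (return 0) or "failed" (return 1).
# Returns 2 once MAX_TRIES attempts have been used up.
poll_status() {
  poll_cmd="$1"; poll_max="$2"; poll_delay="$3"
  poll_i=0
  while [ "$poll_i" -lt "$poll_max" ]; do
    poll_state=$("$poll_cmd")
    case "$poll_state" in
      completed) return 0 ;;
      failed) return 1 ;;
    esac
    poll_i=$((poll_i + 1))
    sleep "$poll_delay"
  done
  return 2
}

# Usage with the policy request (left commented out here):
#   require_job_id "$VIDEO_ID" || exit 1
#   get_status() { curl -sS "http://localhost:8000/v1/videos/$VIDEO_ID" | jq -r '.status'; }
#   poll_status get_status 150 2 || exit 1
```

With 150 tries and a 2-second delay the loop gives up after roughly five minutes; pick a limit that suits the checkpoint size and hardware.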

Online inference for inverse dynamics will be added in a follow-up PR.
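
Both serving requests above pair num_frames with an action_chunk_size that is one smaller (61 with 60, and 17 with 16). Whether the server enforces this relation is not stated in this PR, so the check below is only a sketch of that observed pattern, with a hypothetical helper name, for catching mismatched values before sending a request.

```shell
# check_frames NUM_FRAMES ACTION_CHUNK_SIZE
# Returns 0 when NUM_FRAMES == ACTION_CHUNK_SIZE + 1, the pattern seen in
# both examples. This is an observed convention, not a documented rule.
check_frames() {
  [ "$1" -eq $(( $2 + 1 )) ]
}

check_frames 61 60 && echo "forward dynamics: ok"
check_frames 17 16 && echo "policy: ok"
```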

Test Result

3185 passed, 15 skipped, 1533 deselected, 55 warnings in 227.68s (0:03:47)

MaciejBalaNV and others added 15 commits June 2, 2026 10:52
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
…lags

Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
…nd tokenizer

Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b410e6601

Comment thread vllm_omni/diffusion/models/cosmos3/action.py Outdated
@bastefaniak bastefaniak force-pushed the mbala/cosmos3_action_review branch from 1b410e6 to aab40ee Compare June 3, 2026 12:44
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
@lishunyang12 lishunyang12 self-assigned this Jun 3, 2026
@lishunyang12 lishunyang12 added and then removed the merge-test label (triggers the Buildkite merge-test CI) Jun 3, 2026
lishunyang12 commented Jun 3, 2026

Let's test it and update the recipes for the other three models in the series.

Signed-off-by: lishunyang12 <lishunyang12@163.com>
david6666666 commented Jun 3, 2026

| Checkpoint | Mode(s) | Result |
| --- | --- | --- |
| Cosmos3-Nano (16B) | T2I, T2V, I2V, video+sound, forward_dynamics, policy | Yes |
| Cosmos3-Super (64B) | T2I, T2V, I2V, video+sound, forward_dynamics, policy | Yes |
| Cosmos3-Nano-Policy-DROID (16B) | policy (domain_name=droid_lerobot) | Yes |
| Cosmos3-Super-Text2Image (64B) | T2I | Yes |
| Cosmos3-Super-Image2Video (64B) | I2V | Yes |

Recipe docs for the action modality are included in this PR (recipes/nvidia/Cosmos3-Nano.md, Cosmos3-Super.md).

@lishunyang12 lishunyang12 enabled auto-merge (squash) June 3, 2026 18:12
Signed-off-by: lishunyang12 <lishunyang12@163.com>
@lishunyang12 lishunyang12 disabled auto-merge June 3, 2026 18:23
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
david6666666 pushed a commit to david6666666/cosmos that referenced this pull request Jun 3, 2026
The recipes are being relocated from recipes/nvidia to recipes/cosmos3
in vllm-project/vllm-omni#4102; update the links to the new path.

Signed-off-by: lishunyang12 <lishunyang12@163.com>
@lishunyang12 lishunyang12 enabled auto-merge (squash) June 3, 2026 18:32
@lishunyang12 lishunyang12 merged commit 706bad2 into vllm-project:main Jun 3, 2026
6 checks passed
foreverlms pushed a commit to NVIDIA/cosmos that referenced this pull request Jun 4, 2026
The "Generator with vLLM-Omni" section has some stale references now
that the upstream PRs have landed.

**Compatibility status paragraph**
- #3454 (text-to-image / text-to-video / image-to-video) is merged on
vllm-omni `main`, not "being upstreamed."
- video-with-sound (vllm-project/vllm-omni#4073) is also merged — it's
listed as a pending follow-up.
- Action is still in review (vllm-project/vllm-omni#4102);
video-to-video is still planned.
- Linked the maintained recipes (Cosmos3-Nano / Cosmos3-Super) for
per-modality usage.

**Install snippet**
- Since #3454/#4073 are merged, install from `main` instead of the
`@refs/pull/3454/head` PR ref.

**Dead link**
- `examples/online_serving/cosmos3` 404s (it doesn't exist on `main`,
and isn't added by #4102). Repointed both references to the maintained
`recipes/nvidia`.

Scoped to the vLLM-Omni section; Docker quick-start and parallelism
options left as-is.

---------

Signed-off-by: lishunyang12 <lishunyang12@163.com>
Co-authored-by: lishunyang12 <lishunyang12@163.com>
Labels

merge-test label to trigger buildkite merge test CI

4 participants