Add Cosmos3 sound generation #1

Closed
MaciejBalaNV wants to merge 22 commits into mbala/cosmos3_model from mbala/cosmos3_sound

@MaciejBalaNV

Owner

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@MaciejBalaNV MaciejBalaNV self-assigned this May 20, 2026
@MaciejBalaNV MaciejBalaNV force-pushed the mbala/cosmos3_sound branch from f0b9d82 to b78f881 Compare May 28, 2026 15:58
MaciejBalaNV and others added 3 commits June 2, 2026 04:04
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: MaciejBalaNV <mbala@nvidia.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Co-authored-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…ect#3987)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>

@linyueqian linyueqian left a comment

Thanks for this, Maciej. I read the full diff and ran the PR end to end on a Cosmos3-Nano deployment (offline Omni(), HTTP /v1/videos/sync, and a cfg_parallel_size=2 configuration). The feature works: stereo 48 kHz audio comes out of both the offline path and the HTTP endpoint, the AAC mux is correct on the server side, and the tuple-output path through CFGParallelMixin.combine_cfg_noise operates cleanly under CFG-parallel.
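
To illustrate what a tuple-output path through a CFG combine step involves, here is a minimal sketch. It assumes the standard classifier-free guidance formula and plain Python lists standing in for tensors; the function names `combine_cfg` and `combine_cfg_tuple` are hypothetical, and this is not the actual signature or body of `CFGParallelMixin.combine_cfg_noise`.

```python
# Hypothetical sketch: apply the same guidance formula to each element of a
# (video, sound) tuple. Lists of floats stand in for tensors.

def combine_cfg(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def combine_cfg_tuple(cond_outputs, uncond_outputs, scale):
    """Combine each modality separately so the output stays a tuple."""
    return tuple(
        combine_cfg(c, u, scale) for c, u in zip(cond_outputs, uncond_outputs)
    )


cond = ([1.0, 2.0], [0.5])    # (video prediction, sound prediction)
uncond = ([0.0, 1.0], [0.5])
video_pred, sound_pred = combine_cfg_tuple(cond, uncond, scale=2.0)
```

Under these assumptions the two modalities never mix inside the combine step, which is why a tuple can pass through unchanged in structure.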

A few things worth addressing before merge. Specific suggestions inline; here is the summary:

🔴 Blocking

  1. Two unit tests in tests/diffusion/models/cosmos3/test_cosmos3_transformer.py fail on this branch (test_forward_returns_video_prediction, test_forward_returns_video_and_sound_predictions). Both look like test-side oversights rather than model bugs.
  2. The PR base is mbala/cosmos3_model, which has since landed as vllm-project#3454. A rebase onto current main produces conflicts in 4 files. Once that is done, CI can fairly evaluate the change.

🟡 Important

  3. _is_sound_request accepts six aliases for what is effectively one user-facing flag. Narrowing this surface reduces the chance of a "set the wrong key, get silent video-only output" UX failure.
  4. The duplicate self.time_embedder initialization at transformer_cosmos3.py:1003 is dead dtype configuration (later overridden by post_load_weights().to(torch.float32)), but it is confusing to read.
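
One way to narrow the surface described in item 3, as a minimal sketch: accept a single canonical key and fail loudly on look-alike spellings instead of silently returning video-only output. The key `generate_sound` is taken from the HTTP test in this review; the function name `is_sound_request`, the rejected spellings, and the dict-shaped request are assumptions for illustration and are not the PR's actual code or alias list.

```python
# Hypothetical sketch: one canonical flag, loud failure on look-alike keys.
CANONICAL_KEY = "generate_sound"
# Illustrative near-miss spellings only; the PR's real aliases are not shown here.
REJECTED_KEYS = frozenset(
    {"sound", "with_sound", "enable_sound", "audio", "generate_audio"}
)


def is_sound_request(extra_args: dict) -> bool:
    """Return True only when the canonical flag is set to a real boolean True."""
    wrong = sorted(extra_args.keys() & REJECTED_KEYS)
    if wrong:
        raise ValueError(f"unsupported sound flag(s) {wrong}; use {CANONICAL_KEY!r}")
    value = extra_args.get(CANONICAL_KEY, False)
    if not isinstance(value, bool):
        raise TypeError(
            f"{CANONICAL_KEY!r} must be a bool, got {type(value).__name__}"
        )
    return value
```

Raising on a near-miss key trades a little strictness for an immediate, visible error, which is the failure mode the comment asks for.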

💡 Suggestion

  5. The joint (video, sound) pack-then-step in diffuse._step relies on the scheduler being linearly separable per element. True for flow matching, but the assumption is worth a one-line comment.
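
The separability assumption in item 5 can be checked on a toy case. The sketch below assumes a plain Euler flow-matching update, x_next = x + (sigma_next - sigma) * v, with flat lists standing in for latents; `euler_step` and `packed_step` are hypothetical names, not the scheduler or `diffuse._step` from the PR.

```python
# Hypothetical sketch: an elementwise Euler update commutes with packing.

def euler_step(sample, velocity, sigma, sigma_next):
    """x_next = x + (sigma_next - sigma) * v, applied independently per element."""
    dt = sigma_next - sigma
    return [x + dt * v for x, v in zip(sample, velocity)]


def packed_step(video, sound, v_video, v_sound, sigma, sigma_next):
    """Concatenate both modalities, take one step, then split at the video length."""
    stepped = euler_step(video + sound, v_video + v_sound, sigma, sigma_next)
    return stepped[: len(video)], stepped[len(video):]


video, sound = [0.5, -1.0, 2.0], [0.25, 0.75]
v_video, v_sound = [1.0, 0.5, -0.5], [-1.0, 2.0]

joint = packed_step(video, sound, v_video, v_sound, sigma=1.0, sigma_next=0.75)
separate = (
    euler_step(video, v_video, 1.0, 0.75),
    euler_step(sound, v_sound, 1.0, 0.75),
)
assert joint == separate  # packing changes nothing for an elementwise update
```

The equality holds because each output element depends only on its own input element and the shared step size. A scheduler that used a statistic over the whole packed tensor (a global norm, for example) would not satisfy it, which is the case the suggested comment would guard against.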

Test evidence

  • Offline T2V+sound, two prompts, 4.00 s stereo 48 kHz: distinct, prompt-responsive audio.
  • HTTP /v1/videos/sync with generate_sound=true and sound_duration=4.0: 1.9 MB mp4 with H264 video (2.04 s) + AAC stereo (4.02 s), muxed server-side.
  • cfg_parallel_size=2 on 2 GPUs: produces audio of the same shape without errors.

Comment thread vllm_omni/diffusion/models/cosmos3/transformer_cosmos3.py Outdated
Comment thread tests/diffusion/models/cosmos3/test_cosmos3_transformer.py Outdated
Comment thread tests/diffusion/models/cosmos3/test_cosmos3_transformer.py Outdated
Comment thread vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py
Comment thread vllm_omni/diffusion/models/cosmos3/pipeline_cosmos3.py
tc-mb and others added 2 commits June 2, 2026 14:33
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Signed-off-by: jian <jianmusings@gmail.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Co-authored-by: Canlin Guo <961750412@qq.com>
bkdoeng and others added 6 commits June 2, 2026 11:04
Signed-off-by: siyuan.lei <siyuanlei37@gmail.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: rein yang <ruiruyang2@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
…orm backend) (vllm-project#4074)

Signed-off-by: NumberWan <wantszkin2003@gmail.com>
…llm-project#3178)

Signed-off-by: liqian <65649795+vasede@users.noreply.github.com>
Co-authored-by: Gao Han <hgaoaf@connect.ust.hk>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
@david6666666 david6666666 force-pushed the mbala/cosmos3_sound branch 2 times, most recently from a4f1e69 to b4e0379 Compare June 2, 2026 19:09
MaciejBalaNV and others added 9 commits June 2, 2026 19:36
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
…lags

Signed-off-by: lishunyang12 <lishunyang12@163.com>
…nd tokenizer

Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
@david6666666 david6666666 force-pushed the mbala/cosmos3_sound branch from da833d0 to 8c340fe Compare June 2, 2026 19:39
Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
@bastefaniak

Collaborator

Closing because this has already been merged into vllm-omni main.

@bastefaniak bastefaniak closed this Jun 3, 2026