Add Cosmos3 sound generation by MaciejBalaNV · Pull Request #1 · MaciejBalaNV/vllm-omni

MaciejBalaNV · 2026-05-20T13:50:00Z


Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Co-authored-by: lishunyang12 <lishunyang12@163.com>


linyueqian

Thanks for this, Maciej. I read the full diff and ran the PR end to end on a Cosmos3-Nano deployment (offline Omni(), HTTP /v1/videos/sync, and a cfg_parallel_size=2 configuration). The feature works: stereo 48 kHz audio comes out of both the offline path and the HTTP endpoint, the AAC mux is correct on the server side, and the tuple-output path through CFGParallelMixin.combine_cfg_noise operates cleanly under CFG-parallel.

A few things worth addressing before merge. Specific suggestions inline; here is the summary:

🔴 Blocking

1. Two unit tests in tests/diffusion/models/cosmos3/test_cosmos3_transformer.py fail on this branch (test_forward_returns_video_prediction, test_forward_returns_video_and_sound_predictions). Both look like test-side oversights rather than model bugs.
2. The PR base is mbala/cosmos3_model, which has since landed as vllm-project#3454. A rebase onto current main produces conflicts in 4 files. Once that is done, CI can fairly evaluate the change.

🟡 Important
3. _is_sound_request accepts six aliases for what is effectively one user-facing flag. Narrowing this surface reduces the chance of a "set the wrong key, get silent video-only output" UX failure.
4. The duplicate self.time_embedder initialization at transformer_cosmos3.py:1003 is dead dtype configuration (later overridden by post_load_weights().to(torch.float32)), but it is confusing to read.
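
To make item 3 concrete, a narrowed check might look like the sketch below. It assumes the request options arrive as a plain dict with string keys. The names `is_sound_request`, `SOUND_FLAGS`, and `KNOWN_SOUND_KEYS`, and the decision to reject look-alike keys, are illustrative and are not taken from the PR's code; the real `_is_sound_request` may have a different signature. The key names `generate_sound`, `sound_gen`, and `sound_duration` are the ones that appear elsewhere in this thread.

```python
from typing import Any, Mapping

# Flags that enable sound generation (hypothetical constant name).
SOUND_FLAGS = ("generate_sound", "sound_gen")
# Sound-related keys that are allowed to appear in a request.
KNOWN_SOUND_KEYS = set(SOUND_FLAGS) | {"sound_duration"}


def is_sound_request(extra_args: Mapping[str, Any]) -> bool:
    """Return True only if one of the accepted flags is set to a truthy value."""
    # Fail loudly on look-alike keys instead of silently producing video only.
    suspicious = [
        key for key in extra_args
        if "sound" in key.lower() and key not in KNOWN_SOUND_KEYS
    ]
    if suspicious:
        raise ValueError(
            f"Unknown sound option(s) {suspicious}; use one of {SOUND_FLAGS}"
        )
    return any(bool(extra_args.get(flag)) for flag in SOUND_FLAGS)
```

Raising on an unrecognized key is one possible design choice; logging a warning would be a softer alternative if strict rejection is too disruptive for existing callers.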

💡 Suggestion
5. The joint (video, sound) pack-then-step in diffuse._step relies on the scheduler being linearly separable per element. True for flow matching, but the assumption is worth a one-line comment.
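
The separability assumption in item 5 can be illustrated with a toy Euler update of the kind used in flow matching, `x_next = x + dt * v`. This is a sketch on plain Python lists, not the PR's `diffuse._step`; `euler_step`, the sizes, and the values are made up for illustration. Because the update touches each element independently, stepping the packed (video, sound) vector gives the same result as stepping each modality on its own.

```python
def euler_step(x, v, dt):
    """One Euler update, applied independently to every element."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


# Toy latents and predicted velocities (hypothetical sizes and values).
video_x, video_v = [0.5, -1.0, 2.0], [1.0, 0.5, -2.0]
sound_x, sound_v = [0.25, 4.0], [-1.0, 0.5]
dt = 0.25

# Pack both modalities into one flat vector, step once, then split again.
packed = euler_step(video_x + sound_x, video_v + sound_v, dt)
video_packed = packed[:len(video_x)]
sound_packed = packed[len(video_x):]

# Step each modality separately for comparison.
video_alone = euler_step(video_x, video_v, dt)
sound_alone = euler_step(sound_x, sound_v, dt)

assert video_packed == video_alone
assert sound_packed == sound_alone
```

A scheduler step that computes statistics across the whole sample (for example, a normalization or thresholding over all elements) would not satisfy this equivalence, which is why stating the assumption in a comment is worthwhile.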

Test evidence

Offline T2V+sound, two prompts, 4.00 s stereo 48 kHz: distinct, prompt-responsive audio.
HTTP /v1/videos/sync with generate_sound=true and sound_duration=4.0: 1.9 MB mp4 with H.264 video (2.04 s) + AAC stereo audio (4.02 s), muxed server-side.
cfg_parallel_size=2 on 2 GPUs: produces audio of the same shape cleanly.
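
As a quick sanity check on the offline figures above, the expected waveform size follows from duration, sample rate, and channel count. The helper name `expected_audio_shape` and the (channels, frames) layout are assumptions for illustration; the layout the PR actually returns may differ.

```python
def expected_audio_shape(duration_s: float, sample_rate: int = 48_000, channels: int = 2):
    """Return (channels, frames) for a clip of the given duration."""
    frames = round(duration_s * sample_rate)
    return channels, frames


# 4.00 s of stereo audio at 48 kHz.
print(expected_audio_shape(4.0))  # (2, 192000)
```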


…llm-project#3178) Signed-off-by: liqian <65649795+vasede@users.noreply.github.com> Co-authored-by: Gao Han <hgaoaf@connect.ust.hk> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>


bastefaniak · 2026-06-03T13:47:04Z

Closing because this has already been merged into vllm-omni main.

MaciejBalaNV self-assigned this May 20, 2026

MaciejBalaNV force-pushed the mbala/cosmos3_sound branch from f0b9d82 to b78f881 Compare May 28, 2026 15:58

MaciejBalaNV and others added 3 commits June 2, 2026 04:04

[XPU][Rebase v0.22] Fix for 0.22 rebase (vllm-project#4059)

2a2033d

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

[Perf][Bagel] Avoid per-step device syncs in Bagel img2img (vllm-proj…

1fb423e

…ect#3987) Signed-off-by: natureofnature <wzliu@connect.hku.hk>

linyueqian reviewed Jun 2, 2026

tc-mb and others added 2 commits June 2, 2026 14:33

add MiniCPM-o 4.5 recipe under recipes/OpenBMB (vllm-project#4067)

da9c45e

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

[TTS][Model] support MOSS-TTS series (vllm-project#3420)

b550709

Signed-off-by: Zhang <jianmusings@gmail.com> Signed-off-by: Zhang Jian <jianmusings@gmail.com> Signed-off-by: jian <jianmusings@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Canlin Guo <961750412@qq.com>

MaciejBalaNV force-pushed the mbala/cosmos3_sound branch from b78f881 to 5f84b67 Compare June 2, 2026 08:08

MaciejBalaNV mentioned this pull request Jun 2, 2026

Add Cosmos3 sound generation vllm-project/vllm-omni#4073

Merged

5 tasks

MaciejBalaNV force-pushed the mbala/cosmos3_sound branch from 5f84b67 to 55c1917 Compare June 2, 2026 08:53

bkdoeng and others added 6 commits June 2, 2026 11:04

[Bugfix] Fix SD3 T5 truncation check device mismatch on long prompts (v…

7c729e1

…llm-project#3949) Signed-off-by: bkannappan <bkannappan@digitalocean.com>

Support VAE parallel for Bagel (vllm-project#3982)

bd37f3c

Signed-off-by: siyuan.lei <siyuanlei37@gmail.com>

[Core] Integrate TrackingArgumentParser (vllm-project#3369)

12e75c6

Signed-off-by: Alex Brooks <albrooks@redhat.com>

[Bugfix] fix qwen3-omni performance regression (vllm-project#3575)

3a7c7f1

Signed-off-by: rein yang <ruiruyang2@gmail.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

[BugFix]Qwen-Image performance regression by using torch RMSNorm(RMSN…

35ee3c7

…orm backend) (vllm-project#4074) Signed-off-by: NumberWan <wantszkin2003@gmail.com>

david6666666 force-pushed the mbala/cosmos3_sound branch 2 times, most recently from a4f1e69 to b4e0379 Compare June 2, 2026 19:09

MaciejBalaNV and others added 9 commits June 2, 2026 19:36

Add Cosmos3 sound generation

a37fd9e

Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com>

Fix tests; small improvements

1b7e40d

Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com>

Remove unused parameter

6638fbc

Signed-off-by: lishunyang12 <lishunyang12@163.com>

Comment about packed modalities into single tensor

9b8b239

Signed-off-by: lishunyang12 <lishunyang12@163.com>

Enable sound generation only thorough "generate_sound", "sound_gen" f…

e82a831

…lags Signed-off-by: lishunyang12 <lishunyang12@163.com>

Pass sound_dim/sound_latent_fps into transformer from initialized sou…

2ee73c6

…nd tokenizer Signed-off-by: lishunyang12 <lishunyang12@163.com>

Update recipes

04ffce4

Signed-off-by: lishunyang12 <lishunyang12@163.com>

lint

a0a9868

Signed-off-by: Bartosz Stefaniak <bstefaniak@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com>

add video+sound usage to Cosmos3-Nano recipe

8c340fe

Signed-off-by: lishunyang12 <lishunyang12@163.com>

david6666666 force-pushed the mbala/cosmos3_sound branch from da833d0 to 8c340fe Compare June 2, 2026 19:39

add Cosmos3-Super recipe

96243ef

Signed-off-by: lishunyang12 <lishunyang12@163.com>

polish Cosmos3 recipes: add model field, install note, Super curls

7765517

Signed-off-by: lishunyang12 <lishunyang12@163.com>

bastefaniak closed this Jun 3, 2026

MaciejBalaNV wants to merge 22 commits into mbala/cosmos3_model from mbala/cosmos3_sound


14 participants
