Add Cosmos3 sound generation #1
f0b9d82 to b78f881
Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Co-authored-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…ect#3987) Signed-off-by: natureofnature <wzliu@connect.hku.hk>
linyueqian left a comment
Thanks for this, Maciej. I read the full diff and ran the PR end to end on a Cosmos3-Nano deployment (offline Omni(), HTTP /v1/videos/sync, and a cfg_parallel_size=2 configuration). The feature works: stereo 48 kHz audio comes out of both the offline path and the HTTP endpoint, the AAC mux is correct on the server side, and the tuple-output path through CFGParallelMixin.combine_cfg_noise operates cleanly under CFG-parallel.
A few things worth addressing before merge. Specific suggestions inline; here is the summary:
🔴 Blocking
1. Two unit tests in tests/diffusion/models/cosmos3/test_cosmos3_transformer.py fail on this branch (test_forward_returns_video_prediction, test_forward_returns_video_and_sound_predictions). Both look like test-side oversights rather than model bugs.
2. The PR base is mbala/cosmos3_model, which has since landed as vllm-project#3454. A rebase onto current main produces conflicts in 4 files. Once that is done, CI can fairly evaluate the change.
🟡 Important
3. _is_sound_request accepts six aliases for what is effectively one user-facing flag. Narrowing this surface reduces the chance of a "set the wrong key, get silent video-only output" UX failure.
4. The duplicate self.time_embedder initialization at transformer_cosmos3.py:1003 is dead dtype configuration (later overridden by post_load_weights().to(torch.float32)), but it is confusing to read.
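For item 3, one way to narrow the surface is to accept a single canonical key and fail loudly on near-miss spellings instead of silently falling back to video-only output. The sketch below is hypothetical: `is_sound_request`, `CANONICAL_KEY`, and the entries in `REJECTED_ALIASES` are illustrative names, not the PR's actual aliases, and `generate_sound` is assumed to be the canonical key only because it is the flag used in the HTTP test further down.

```python
# Hypothetical sketch; not the PR's implementation of _is_sound_request.
CANONICAL_KEY = "generate_sound"  # assumed canonical flag name

# Made-up examples of near-miss spellings a caller might try.
REJECTED_ALIASES = frozenset(
    {"sound", "with_sound", "enable_sound", "audio", "generate_audio"}
)


def is_sound_request(extra_args: dict) -> bool:
    """Return True only when the canonical flag is set to True.

    Raises ValueError when a rejected alias is present, so a mistyped key
    cannot quietly produce video-only output.
    """
    used = REJECTED_ALIASES.intersection(extra_args)
    if used:
        raise ValueError(
            f"Unsupported sound flag(s) {sorted(used)}; use {CANONICAL_KEY!r} instead."
        )
    value = extra_args.get(CANONICAL_KEY, False)
    if not isinstance(value, bool):
        raise TypeError(f"{CANONICAL_KEY!r} must be a bool, got {type(value).__name__}")
    return value
```

Rejecting aliases outright is stricter than ignoring them; if backward compatibility matters, a deprecation warning on the old keys would be a softer alternative.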
💡 Suggestion
5. The joint (video, sound) pack-then-step in diffuse._step relies on the scheduler being linearly separable per element. True for flow matching, but the assumption is worth a one-line comment.
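To illustrate the assumption in item 5, here is a minimal sketch using a toy Euler update of the form x + dt * v, which is one common way a flow-matching step is written. `euler_step` and all values are made up for illustration and do not come from the PR's scheduler. Because the update touches each element independently, stepping the packed (video, sound) vector and then splitting gives the same result as stepping each modality on its own.

```python
def euler_step(x, v, dt):
    """One toy Euler update, applied independently to each element."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


# Toy flattened latents and predicted velocities (made-up values).
video_x, video_v = [0.5, -1.0, 2.0], [1.0, 0.25, -0.5]
sound_x, sound_v = [0.1, 0.3], [-2.0, 4.0]
dt = 0.125

# Joint path: pack both modalities, step once, split at the known boundary.
packed = euler_step(video_x + sound_x, video_v + sound_v, dt)
joint_video = packed[: len(video_x)]
joint_sound = packed[len(video_x):]

# Separate path: step each modality on its own and compare.
assert joint_video == euler_step(video_x, video_v, dt)
assert joint_sound == euler_step(sound_x, sound_v, dt)
```

A scheduler whose update depends on statistics of the whole tensor (for example a global norm or clipping threshold) would break this equivalence, which is why the comment is worth having.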
Test evidence
- Offline T2V+sound, two prompts, 4.00 s stereo 48 kHz: distinct, prompt-responsive audio.
- HTTP /v1/videos/sync with generate_sound=true sound_duration=4.0: 1.9 MB mp4 with H264 video (2.04 s) + AAC stereo (4.02 s), muxed server-side.
- cfg_parallel_size=2 on 2 GPUs: produces audio of the same shape cleanly.
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Signed-off-by: Zhang <jianmusings@gmail.com> Signed-off-by: Zhang Jian <jianmusings@gmail.com> Signed-off-by: jian <jianmusings@gmail.com> Signed-off-by: Canlin Guo <961750412@qq.com> Co-authored-by: Canlin Guo <961750412@qq.com>
b78f881 to 5f84b67
5f84b67 to 55c1917
…llm-project#3949) Signed-off-by: bkannappan <bkannappan@digitalocean.com>
Signed-off-by: siyuan.lei <siyuanlei37@gmail.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: rein yang <ruiruyang2@gmail.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
…orm backend) (vllm-project#4074) Signed-off-by: NumberWan <wantszkin2003@gmail.com>
…llm-project#3178) Signed-off-by: liqian <65649795+vasede@users.noreply.github.com> Co-authored-by: Gao Han <hgaoaf@connect.ust.hk> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
a4f1e69 to b4e0379
Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: lishunyang12 <lishunyang12@163.com>
Signed-off-by: lishunyang12 <lishunyang12@163.com>
…lags Signed-off-by: lishunyang12 <lishunyang12@163.com>
…nd tokenizer Signed-off-by: lishunyang12 <lishunyang12@163.com>
da833d0 to 8c340fe
Closing because this has already been merged into vllm-omni main.