Add Qwen2.5-Omni #36752
Merged
Changes from 43 commits

Commits (57 total)
- b4ff115 Add qwen2.5-omni
- 33e479e Remove einops dependency
- 241abde Add torchdiffeq dependency
- 9b847cb Sort init
- 2157b5a Add torchdiffeq to extras['diffeq']
- e541399 Fix repo consistency
- d937243 use cached_file
- 586f7ff del odeint
- 29db603 Merge branch 'qwen25omni' of http://gitlab.alibaba-inc.com/DamoAGI/tr…
- 7f158f5 renew pytest
- 6027556 format
- 1a8533d Remove torchdiffeq
- bdbb8ea format
- 827404a fixed batch infer bug
- 1d04f0d Change positional_embedding to parameter
- a169211 Change default speaker
- f1d63db Config revision
- 7413de0 Use modular & code clean
- 3a1ead0 code clean
- 5efaed6 decouple padding with model & code cleaning
- 99fe6f8 sort init
- 3d4ebe8 fix
- f742a64 fix
- 63ec845 Second code review
- 04e5260 fix
- 07626dc Merge branch 'main' into qwen25omni
- ae15523 fix (BakerBunker)
- 299eaa8 rename vars to full name + some comments
- 67efc09 update pytest (zucchini-nlp)
- 590d167 Code clean & fix
- debde07 fix
- 08c05fd style
- 01f48b5 more clean up
- 2f32d27 fixup (zucchini-nlp)
- 319d79e smaller vision model in tests (zucchini-nlp)
- 771f8f2 fix processor test (zucchini-nlp)
- 61d21d8 Merge remote-tracking branch 'upstream/main' into qwen25omni
- 26e54de deflake a bit the tests (still flaky though) (zucchini-nlp)
- 0d5c29d de-flake tests finally + add generation mixin (zucchini-nlp)
- 4641c44 final nits i hope (zucchini-nlp)
- 134856d make sure processor tests are complete (zucchini-nlp)
- f74562c replace with Qwen2_5OmniForConditionalGeneration (zucchini-nlp)
- e6776df fix tests after updating ckpt (zucchini-nlp)
- 6929d60 fix typos when cleaning, also we can't change ckpt (zucchini-nlp)
- 2ec1964 fixup (zucchini-nlp)
- 466f4bf images and videos kwargs for processor (zucchini-nlp)
- b317f84 thinker and talker loadable from hub ckpt (zucchini-nlp)
- ee7b7f9 merge main (zucchini-nlp)
- 4da369e address comments and update tests after rebase (zucchini-nlp)
- ecd3133 fixup (zucchini-nlp)
- 7ed74dd merge main (zucchini-nlp)
- 03752e4 skip for now (zucchini-nlp)
- ecc920e fixup (zucchini-nlp)
- d9be9c9 fixup (zucchini-nlp)
- 605da81 remove torch dependency in processors
- 1240a81 Merge branch 'main' into qwen25omni (zucchini-nlp)
- ef73ef7 Merge branch 'main' into qwen25omni (zucchini-nlp)
@@ -0,0 +1,382 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Qwen2.5-Omni

<div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
    <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The [Qwen2.5-Omni](https://qwenlm.github.io/blog/) model is a unified multimodal model proposed in the [Qwen2.5-Omni Technical Report]() from the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*

## Usage example

`Qwen2.5-Omni` can be found on the [Huggingface Hub](https://huggingface.co/Qwen).

### Single Media inference

The model can accept text, images, audio and videos as input. Here is example code for inference.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```

### Text-only generation

To generate only text output and save compute by not loading the audio generation model, we can set `enable_audio_output=False` when loading the model.

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
```

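Because text generation is handled by the standalone Thinker submodule (see `Qwen2_5OmniThinkerForConditionalGeneration` in the API reference below, and the "thinker and talker loadable from hub ckpt" commit in this PR), the Thinker can likely also be loaded on its own for text-only use. The snippet below is a minimal sketch of that idea, under the assumption that the Thinker weights are readable directly from the `Qwen/Qwen2.5-Omni-7B` checkpoint:

```python
# Hedged sketch: load only the Thinker (text-generation) component.
# Assumption: Qwen2_5OmniThinkerForConditionalGeneration can be initialized
# straight from the full "Qwen/Qwen2.5-Omni-7B" Hub checkpoint.
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor

thinker = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Reuse the `conversation` and `processor.apply_chat_template(...)` call from the
# example above to build `inputs`; the Thinker only ever returns text tokens.
text_ids = thinker.generate(**inputs)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
```
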
### Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types, such as text, images, audio and videos, when `return_audio=False` is set. Here is an example.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ]
    }
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "who are you?"}],
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

### Usage Tips

#### Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.

```
{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
```

#### Use audio output or not

The model supports both text and audio outputs. If users do not need audio outputs, they can set `enable_audio_output=False` in the `from_pretrained` function. This option saves about ~2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.

```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)
```

In order to obtain a flexible experience, we recommend that users set `enable_audio_output` to `True` when initializing the model through the `from_pretrained` function, and then decide whether to return audio when the `generate` function is called. When `return_audio` is set to `False`, the model will only return text outputs, so text responses come back faster.

```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
```

#### Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. Users can use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types: `Chelsie` (a female voice) and `Ethan` (a male voice). By default, if `spk` is not specified, `Chelsie` is used.

```python
text_ids, audio = model.generate(**inputs, spk="Chelsie")
```

```python
text_ids, audio = model.generate(**inputs, spk="Ethan")
```

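For example, the two voices can be compared by writing each generation to its own file, reusing the 24 kHz `soundfile` pattern from the single-media example above (this assumes `model` was loaded with audio output enabled and `inputs` was built as shown earlier):

```python
import soundfile as sf

# Assumes `model` has audio output enabled and `inputs` comes from
# the `processor.apply_chat_template(...)` call shown earlier.
for speaker in ("Chelsie", "Ethan"):
    text_ids, audio = model.generate(**inputs, spk=speaker)
    sf.write(
        f"output_{speaker.lower()}.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
```
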
#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

## Qwen2_5OmniConfig

[[autodoc]] Qwen2_5OmniConfig

## Qwen2_5OmniProcessor

[[autodoc]] Qwen2_5OmniProcessor

## Qwen2_5OmniForConditionalGeneration

[[autodoc]] Qwen2_5OmniForConditionalGeneration
    - forward

## Qwen2_5OmniPreTrainedModelForConditionalGeneration

[[autodoc]] Qwen2_5OmniPreTrainedModelForConditionalGeneration

## Qwen2_5OmniThinkerConfig

[[autodoc]] Qwen2_5OmniThinkerConfig

## Qwen2_5OmniThinkerForConditionalGeneration

[[autodoc]] Qwen2_5OmniThinkerForConditionalGeneration

## Qwen2_5OmniThinkerTextModel

[[autodoc]] Qwen2_5OmniThinkerTextModel

## Qwen2_5OmniTalkerConfig

[[autodoc]] Qwen2_5OmniTalkerConfig

## Qwen2_5OmniTalkerForConditionalGeneration

[[autodoc]] Qwen2_5OmniTalkerForConditionalGeneration

## Qwen2_5OmniTalkerModel

[[autodoc]] Qwen2_5OmniTalkerModel

## Qwen2_5OmniToken2WavConfig

[[autodoc]] Qwen2_5OmniToken2WavConfig

## Qwen2_5OmniToken2WavModel

[[autodoc]] Qwen2_5OmniToken2WavModel

## Qwen2_5OmniToken2WavDiTModel

[[autodoc]] Qwen2_5OmniToken2WavDiTModel

## Qwen2_5OmniToken2WavBigVGANModel

[[autodoc]] Qwen2_5OmniToken2WavBigVGANModel