Merged
Changes from 43 commits
Commits
57 commits
b4ff115
Add qwen2.5-omni
Mar 16, 2025
33e479e
Remove einops dependency
Mar 16, 2025
241abde
Add torchdiffeq dependency
Mar 16, 2025
9b847cb
Sort init
Mar 16, 2025
2157b5a
Add torchdiffeq to extras['diffeq']
Mar 16, 2025
e541399
Fix repo consistency
Mar 17, 2025
d937243
use cached_file
Mar 17, 2025
586f7ff
del odeint
Mar 17, 2025
29db603
Merge branch 'qwen25omni' of http://gitlab.alibaba-inc.com/DamoAGI/tr…
Mar 17, 2025
7f158f5
renew pytest
Mar 17, 2025
6027556
format
Mar 17, 2025
1a8533d
Remove torchdiffeq
Mar 17, 2025
bdbb8ea
format
Mar 17, 2025
827404a
fixed batch infer bug
Mar 17, 2025
1d04f0d
Change positional_embedding to parameter
Mar 19, 2025
a169211
Change default speaker
Mar 22, 2025
f1d63db
Config revision
Mar 22, 2025
7413de0
Use modular & code clean
Mar 24, 2025
3a1ead0
code clean
Mar 26, 2025
5efaed6
decouple padding with model & code cleaning
Mar 27, 2025
99fe6f8
sort init
Mar 27, 2025
3d4ebe8
fix
Mar 28, 2025
f742a64
fix
Mar 28, 2025
63ec845
Second code review
Mar 30, 2025
04e5260
fix
Mar 30, 2025
07626dc
Merge branch 'main' into qwen25omni
BakerBunker Mar 30, 2025
ae15523
fix
Mar 30, 2025
299eaa8
rename vars to full name + some comments
zucchini-nlp Apr 2, 2025
67efc09
update pytest
Apr 1, 2025
590d167
Code clean & fix
Apr 2, 2025
debde07
fix
Apr 2, 2025
08c05fd
style
Apr 2, 2025
01f48b5
more clean up
zucchini-nlp Apr 2, 2025
2f32d27
fixup
zucchini-nlp Apr 2, 2025
319d79e
smaller vision model in tests
zucchini-nlp Apr 2, 2025
771f8f2
fix processor test
Apr 2, 2025
61d21d8
Merge remote-tracking branch 'upstream/main' into qwen25omni
zucchini-nlp Apr 2, 2025
26e54de
deflake a bit the tests (still flaky though)
zucchini-nlp Apr 2, 2025
0d5c29d
de-flake tests finally + add generation mixin
zucchini-nlp Apr 2, 2025
4641c44
final nits i hope
zucchini-nlp Apr 2, 2025
134856d
make sure processor tests are complete
zucchini-nlp Apr 2, 2025
f74562c
replace with Qwen2_5OmniForConditionalGeneration
zucchini-nlp Apr 3, 2025
e6776df
fix tests after updating ckpt
zucchini-nlp Apr 3, 2025
6929d60
fix typos when cleaning, also we can't change ckpt
zucchini-nlp Apr 3, 2025
2ec1964
fixup
zucchini-nlp Apr 3, 2025
466f4bf
images and videos kwargs for processor
zucchini-nlp Apr 3, 2025
b317f84
thinker and talker loadable from hub ckpt
zucchini-nlp Apr 7, 2025
ee7b7f9
merge main
zucchini-nlp Apr 10, 2025
4da369e
address comments and update tests after rebase
zucchini-nlp Apr 10, 2025
ecd3133
fixup
zucchini-nlp Apr 10, 2025
7ed74dd
merge main
zucchini-nlp Apr 11, 2025
03752e4
skip for now
zucchini-nlp Apr 11, 2025
ecc920e
fixup
zucchini-nlp Apr 11, 2025
d9be9c9
fixup
Apr 12, 2025
605da81
remove torch dependency in processors
zucchini-nlp Apr 14, 2025
1240a81
Merge branch 'main' into qwen25omni
zucchini-nlp Apr 14, 2025
ef73ef7
Merge branch 'main' into qwen25omni
zucchini-nlp Apr 14, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -987,6 +987,8 @@
      title: Pix2Struct
    - local: model_doc/pixtral
      title: Pixtral
    - local: model_doc/qwen2_5_omni
      title: Qwen2.5-Omni
    - local: model_doc/qwen2_5_vl
      title: Qwen2.5-VL
    - local: model_doc/qwen2_audio
382 changes: 382 additions & 0 deletions docs/source/en/model_doc/qwen2_5_omni.md
@@ -0,0 +1,382 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Qwen2.5-Omni

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The [Qwen2.5-Omni](https://qwenlm.github.io/blog/) model is a unified multimodal model proposed in the [Qwen2.5-Omni Technical Report]() by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*



## Usage example

`Qwen2.5-Omni` can be found on the [Hugging Face Hub](https://huggingface.co/Qwen).
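
The examples below save generated audio with `soundfile` and load the model with `device_map="auto"`, which relies on `accelerate`. As a minimal setup sketch (assuming no additional audio/video decoding backends are needed), the dependencies can be installed with:

```bash
pip install -U transformers accelerate soundfile
```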

### Single Media inference

The model can accept text, images, audio, and videos as input. Here is example code for inference.

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
print(text)
```
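
If the returned `text_ids` include the prompt tokens (as decoder-only `generate` calls in Transformers usually do), you can slice them off before decoding; a minimal sketch, assuming `inputs["input_ids"]` holds the tokenized prompt as in the example above:

```python
# Decode only the tokens generated after the prompt.
prompt_length = inputs["input_ids"].shape[1]
generated_ids = text_ids[:, prompt_length:]
text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
```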

### Text-only generation

To generate only text output and save compute by not loading the audio generation model, we can set `enable_audio_output=False` when loading the model.

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.device)

text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

### Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set. Here is an example.

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Conversation with video only
conversation1 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
        ],
    },
]

# Conversation with audio only
conversation2 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "/path/to/audio.wav"},
        ],
    },
]

# Conversation with pure text
conversation3 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Who are you?"}],
    },
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "audio", "path": "/path/to/audio.wav"},
            {"type": "text", "text": "What elements can you see and hear in these media files?"},
        ],
    },
]

conversations = [conversation1, conversation2, conversation3, conversation4]

inputs = processor.apply_chat_template(
    conversations,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=1,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
).to(model.thinker.device)

text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(text)
```

### Usage Tips

#### Prompt for audio output
If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
```python
{
    "role": "system",
    "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
```

#### Use audio output or not

The model supports both text and audio outputs. If users do not need audio outputs, they can set `enable_audio_output=False` when calling the `from_pretrained` function. This option saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=False,
)
```

To obtain a flexible experience, we recommend setting `enable_audio_output=True` when initializing the model through the `from_pretrained` function, and then deciding whether to return audio when the `generate` function is called. When `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
```

#### Change voice type of output audio
Qwen2.5-Omni supports changing the voice of the output audio. Users can use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types: `Chelsie` (a female voice) and `Ethan` (a male voice). By default, if `spk` is not specified, the voice type is `Chelsie`.

```python
text_ids, audio = model.generate(**inputs, spk="Chelsie")
```

```python
text_ids, audio = model.generate(**inputs, spk="Ethan")
```

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```



## Qwen2_5OmniConfig

[[autodoc]] Qwen2_5OmniConfig

## Qwen2_5OmniProcessor

[[autodoc]] Qwen2_5OmniProcessor

## Qwen2_5OmniForConditionalGeneration

[[autodoc]] Qwen2_5OmniForConditionalGeneration
- forward

## Qwen2_5OmniPreTrainedModelForConditionalGeneration

[[autodoc]] Qwen2_5OmniPreTrainedModelForConditionalGeneration

## Qwen2_5OmniThinkerConfig

[[autodoc]] Qwen2_5OmniThinkerConfig

## Qwen2_5OmniThinkerForConditionalGeneration

[[autodoc]] Qwen2_5OmniThinkerForConditionalGeneration

## Qwen2_5OmniThinkerTextModel

[[autodoc]] Qwen2_5OmniThinkerTextModel

## Qwen2_5OmniTalkerConfig

[[autodoc]] Qwen2_5OmniTalkerConfig

## Qwen2_5OmniTalkerForConditionalGeneration

[[autodoc]] Qwen2_5OmniTalkerForConditionalGeneration

## Qwen2_5OmniTalkerModel

[[autodoc]] Qwen2_5OmniTalkerModel

## Qwen2_5OmniToken2WavConfig

[[autodoc]] Qwen2_5OmniToken2WavConfig

## Qwen2_5OmniToken2WavModel

[[autodoc]] Qwen2_5OmniToken2WavModel

## Qwen2_5OmniToken2WavDiTModel

[[autodoc]] Qwen2_5OmniToken2WavDiTModel

## Qwen2_5OmniToken2WavBigVGANModel

[[autodoc]] Qwen2_5OmniToken2WavBigVGANModel