Add Qwen3-Omni moe thinker #25550
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for the Qwen3-Omni-Moe model. The changes include a new model implementation file, modifications to handle multimodal rotary embeddings, and registration of the new model. While the implementation is comprehensive, I've identified several critical and high-severity issues related to performance and maintainability. Specifically, there are non-vectorized loops and inefficient tensor operations in the position embedding calculation, which will significantly impact performance. Additionally, there are uses of NumPy within core logic that should be replaced with PyTorch operations to avoid CPU-GPU synchronization. I've also found a few potential bugs related to tensor shape calculations that could lead to runtime errors. Addressing these points will be crucial for integrating this model into vLLM effectively.
```python
def _omni3_get_input_positions_tensor(
    cls,
    config,
    input_ids: Optional[torch.LongTensor] = None,
    image_grid_thw: Optional[torch.LongTensor] = None,
    video_grid_thw: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    use_audio_in_video: bool = False,
    audio_seqlens: Optional[torch.LongTensor] = None,
    second_per_grids: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
```
The function _omni3_get_input_positions_tensor is very long and complex, making it difficult to understand and maintain. More importantly, it processes input sequences one by one within a for loop (for i, input_ids in enumerate(total_input_ids):), which is not vectorized and will lead to significant performance degradation, especially with larger batch sizes. The use of .tolist() and list methods like .index() inside the loop further contributes to the inefficiency. This implementation should be refactored to be vectorized over the batch dimension to meet the performance standards of vLLM. Consider using tensor operations to find indices and process modalities in parallel for all sequences in the batch.
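To make the kind of vectorization meant here concrete — this is a standalone sketch, not code from the PR, and `first_token_positions` is a hypothetical helper — a per-row `.tolist()`/`.index()` scan can be replaced by batched tensor comparisons:

```python
import torch

# Hypothetical sketch: find the first occurrence of a special token in each
# row of a batch without Python loops or .tolist()/.index() calls.
def first_token_positions(input_ids: torch.Tensor, token_id: int) -> torch.Tensor:
    # input_ids: (batch, seq_len)
    matches = input_ids == token_id          # boolean mask, same shape
    # argmax over the mask returns the first True index per row;
    # rows with no match fall back to -1 via the any() mask.
    first = matches.float().argmax(dim=1)
    has_match = matches.any(dim=1)
    return torch.where(has_match, first, torch.full_like(first, -1))

ids = torch.tensor([[5, 7, 9, 7], [1, 2, 3, 4]])
print(first_token_positions(ids, 7))  # tensor([ 1, -1])
```

The same pattern extends to locating image/video/audio placeholder tokens for all sequences in the batch at once, instead of scanning each sequence inside a Python loop.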
```python
if name == "feature_attention_mask":
    dim = -1
if isinstance(mm_input, torch.Tensor):
    return torch.concat(list(mm_input), dim=dim)
```
The implementation of _validate_and_reshape_mm_tensor seems to have a bug when handling a torch.Tensor. The line return torch.concat(list(mm_input), dim=dim) is problematic. When mm_input is a tensor, list(mm_input) iterates over its first dimension. torch.concat then joins these tensors along dim. For example, if mm_input has shape (B, C, L) and dim=1, the result will have shape (C, B*L), which is likely incorrect for batch processing where one would expect to flatten the batch dimension. This will likely cause shape mismatches in downstream processing.
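The shape concern can be reproduced in isolation (standalone snippet, not from the PR):

```python
import torch

mm_input = torch.zeros(4, 3, 5)                # (B, C, L)
# list() iterates over the batch dimension, yielding 4 tensors of shape (3, 5);
# concatenating them along dim=1 folds the batch into the wrong axis.
out = torch.concat(list(mm_input), dim=1)
print(out.shape)  # torch.Size([3, 20]) -> (C, B*L)
```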
```python
multimodal_embeddings[index] = embeddings_main
multimodal_embeddings_multiscale.append(embeddings_multiscale)
if len(multimodal_embeddings_multiscale) > 0:
    deepstack_input_embeds = inputs_embeds.new_zeros(inputs_embeds.size(0), multiscale_len * inputs_embeds.size(1))
```
There appears to be a bug in the shape calculation for deepstack_input_embeds. The second dimension is calculated as multiscale_len * inputs_embeds.size(1), which resolves to multiscale_len * text_config.hidden_size. However, this tensor is later populated with multimodal_embeddings_multiscale which have a feature dimension of multi_dim (multiscale_len * visual_dim), and then reshaped using visual_dim. This will raise a runtime error if text_config.hidden_size is not equal to visual_dim (vision_config.out_hidden_size). The correct size for the second dimension should be multi_dim (i.e., multiscale_len * visual_dim), which is computed a few lines above.
```python
deepstack_input_embeds = inputs_embeds.new_zeros(inputs_embeds.size(0), multi_dim)
```

```python
None,
use_audio_in_video,
audio_feature_lengths,
torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
```
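To see why the allocation size matters, here is a standalone sketch with made-up sizes (`hidden_size`, `visual_dim`, and `multiscale_len` are placeholders, not the model's real values): a buffer sized by `hidden_size` cannot later be viewed by `visual_dim` when the two differ.

```python
import torch

# Hypothetical sizes illustrating the mismatch described above.
num_tokens, hidden_size, visual_dim, multiscale_len = 10, 2048, 1024, 3
inputs_embeds = torch.zeros(num_tokens, hidden_size)

# Buggy: allocated with hidden_size, later reshaped by visual_dim.
buggy = inputs_embeds.new_zeros(num_tokens, multiscale_len * inputs_embeds.size(1))
try:
    buggy.view(num_tokens, multiscale_len, visual_dim)  # 3*2048 elems != 3*1024
except RuntimeError as e:
    print("reshape fails:", e)

# Fixed: allocate with multi_dim = multiscale_len * visual_dim.
multi_dim = multiscale_len * visual_dim
fixed = inputs_embeds.new_zeros(num_tokens, multi_dim)
print(fixed.view(num_tokens, multiscale_len, visual_dim).shape)  # torch.Size([10, 3, 1024])
```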
This line creates a tensor in a highly inefficient way. torch.tensor(video_grid_thw) is redundant as video_grid_thw is already a tensor at this point. Creating a list of 1s and then converting it to a tensor is also inefficient. This can be simplified and made more performant.
```diff
- torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
+ torch.ones(video_grid_thw.shape[0], dtype=torch.long, device=video_grid_thw.device))
```
```python
h_idxs = np.linspace(0, num_grid_per_side-1, h)
w_idxs = np.linspace(0, num_grid_per_side-1, w)
```
This function uses numpy for calculations (np.linspace), which can lead to performance bottlenecks due to CPU-GPU synchronization and data transfers. The comment on line 379 already indicates this. These operations should be replaced with their torch equivalents to keep the computation on the GPU and within the computation graph.
```diff
- h_idxs = np.linspace(0, num_grid_per_side-1, h)
- w_idxs = np.linspace(0, num_grid_per_side-1, w)
+ h_idxs = torch.linspace(0, num_grid_per_side-1, h, device=self.pos_embed.weight.device)
+ w_idxs = torch.linspace(0, num_grid_per_side-1, w, device=self.pos_embed.weight.device)
```
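For context, the torch replacement keeps the whole grid construction on-device. A standalone sketch with made-up sizes (`num_grid_per_side`, `h`, `w` are placeholders):

```python
import torch

# Hypothetical sizes: interpolate a 16x16 learned position-embedding grid
# down/up to an h x w patch grid, entirely in torch.
num_grid_per_side, h, w = 16, 6, 9
device = torch.device("cpu")  # in the model this would be the pos_embed device

h_idxs = torch.linspace(0, num_grid_per_side - 1, h, device=device)
w_idxs = torch.linspace(0, num_grid_per_side - 1, w, device=device)
# Cartesian product of row/column indices as an (h*w, 2) coordinate grid.
grid = torch.stack(torch.meshgrid(h_idxs, w_idxs, indexing="ij"), dim=-1).reshape(-1, 2)
print(grid.shape)  # torch.Size([54, 2])
```

Because everything stays a torch tensor, no host round-trip or graph break is introduced.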
Are you able to finish this TODO before you have to go OOO?
```python
audio_token_indices = np.arange(next(iter([audio_len])))
curr_video_grid_thw = next(iter([video_grid_thw]))
height = curr_video_grid_thw[1] // spatial_merge_size
width = curr_video_grid_thw[2] // spatial_merge_size
video_token_indices = np.arange(curr_video_grid_thw[0]).reshape(-1, 1, 1)
video_token_indices = np.broadcast_to(
    video_token_indices, (video_token_indices.shape[0], height, width)
).reshape(-1)
video_token_indices = ((video_token_indices + shift) * next(iter([video_second_per_grid_t])) * position_id_per_seconds)
```
This function uses numpy for array creation and manipulation (np.arange, np.broadcast_to). This forces data transfers between CPU and GPU and can be a performance bottleneck. These should be replaced with torch equivalents to maintain performance.
```python
audio_token_indices = torch.arange(next(iter([audio_len])))
curr_video_grid_thw = next(iter([video_grid_thw]))
height = curr_video_grid_thw[1] // spatial_merge_size
width = curr_video_grid_thw[2] // spatial_merge_size
video_token_indices = torch.arange(curr_video_grid_thw[0]).reshape(-1, 1, 1)
video_token_indices = video_token_indices.expand(video_token_indices.shape[0], height, width).reshape(-1)
video_token_indices = ((video_token_indices + shift) * next(iter([video_second_per_grid_t])) * position_id_per_seconds)
```
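As a sanity check that the substitution is behavior-preserving (standalone snippet with hypothetical sizes), `Tensor.expand` reproduces exactly the frame indices that `np.broadcast_to` produced:

```python
import numpy as np
import torch

# Hypothetical video grid: 4 temporal frames, each merged to a 2x3 patch grid.
t_frames, height, width = 4, 2, 3

np_idx = np.broadcast_to(
    np.arange(t_frames).reshape(-1, 1, 1), (t_frames, height, width)
).reshape(-1)

torch_idx = (
    torch.arange(t_frames).reshape(-1, 1, 1).expand(t_frames, height, width).reshape(-1)
)
print(bool((torch_idx.numpy() == np_idx).all()))  # True
```

`expand` is also allocation-free (stride-0 view) until the final `reshape`, whereas `np.broadcast_to` forces the data through host memory.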
Alright, I'll handle these parts. Currently, I'm still working on adding audio-in-video support in v1. In the meantime, one known issue is that I may not be able to straightforwardly reuse relevant modules from Qwen3-VL, because our model has already been made public and some checkpoint keys and configurations are incompatible with Qwen3-VL. This stems from the fact that our internal iterations were not synchronized, and it may require further careful discussion. I might go on vacation starting tomorrow and probably won't resume modifications until after October 4th :) You can proceed with the review based on the current version.
```python
None,
use_audio_in_video,
audio_feature_lengths,
torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
```
```diff
- torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
+ torch.ones(len(video_grid_thw))
```
Simplify this
Thanks for your work. May I ask, will the talker model be supported in the future? It seems Qwen2.5-Omni still only supports the thinker model now.
LGTM! May I know whether the Talker model will be supported by vLLM?
Supporting Qwen3-Omni end-to-end will not be within the scope of vLLM itself.
So that means the vLLM project will support the thinker model, which behaves just like a normal LLM, and a new multimodal inference project will support the end-to-end Qwen3-Omni model? Can I learn more about this new project?
Yea that's the right understanding! We're still planning for the new project so stay tuned! |
Great! And may I know whether the new project will also handle single-model multimodal models such as Kimi-Audio? Or will those be supported by vLLM?
Really hoping the new project is fast and efficient. I tried Transformers and the audio output was SLOW...
vllm-project#25550 Signed-off-by: Chen, Wenbin <wenbin.chen@intel.com>
Same, even with flash-attn2 it is very slow
I tried this PR and it's like >20x faster than transformers :)
@houseroad has imported this pull request. If you are a Meta employee, you can view this in D83274891.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Isotr0py left a comment
The processor tests should pass now:
```
tests/models/multimodal/processing/test_common.py::test_processing_correctness[1.0-32-0.3-Qwen/Qwen3-Omni-30B-A3B-Instruct] Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
INFO 10-10 23:00:59 [model.py:653] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-10 23:00:59 [model.py:1714] Using max model len 65536
PASSED
tests/models/multimodal/processing/test_common.py::test_processing_correctness[1.0-32-0.5-Qwen/Qwen3-Omni-30B-A3B-Instruct] Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
INFO 10-10 23:01:17 [model.py:653] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
INFO 10-10 23:01:17 [model.py:1714] Using max model len 65536
PASSED
tests/models/multimodal/processing/test_common.py::test_processing_correctness[1.0-32-1.0-Qwen/Qwen3-Omni-30B-A3B-Instruct] Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
INFO 10-10 23:01:26 [model.py:653] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
INFO 10-10 23:01:26 [model.py:1714] Using max model len 65536
PASSED
```
And outputs look reasonable on my side too:
```
Adding requests: 100%|████████████████████████████████████████| 1/1 [00:02<00:00,  2.86s/it]
Processed prompts: 100%|██████████████████████████████████████| 1/1 [00:13<00:00, 13.77s/it, est. speed input: 298.63 toks/s, output: 22.00 toks/s]
```
Based on the audio, video, and image content, here is a breakdown of what is happening:
### Audio Content
The audio features a man reciting a well-known nursery rhyme, "Mary Had a Little Lamb." The rhyme is:
> "Mary had a little lamb, its fleece was white as snow. And everywhere that Mary went, the lamb was sure to go."
The speaker mentions that these were the "first words" spoken into a phonograph, which is a historical reference to the invention of the phonograph by Thomas Edison.
### Image Content
The image shows a young child, likely a toddler, sitting on a bed. The child is wearing glasses and is looking at a book. The child's feet are propped up on the book, and they are turning the pages. The child appears to be enjoying the book and is smiling.
### Why the Video is Funny
The humor in the video comes from the contrast between the child's serious and focused demeanor and the absurdity of the situation. The child is wearing glasses and is sitting on the bed with their feet propped up on the book, which is a common and relatable scene for many people. However, the child's intense focus on the book, combined with the glasses, creates a humorous and endearing image. The child seems to be taking the reading experience very seriously, which is funny because they are just a toddler. The overall effect is a charming and amusing scene that many viewers can relate to and find funny.
Does vLLM serve currently only support Qwen3-Omni-Thinking? I use two servers with eight GPUs, and vLLM serve cannot start Qwen3-Omni-Instruct. The error displayed is "error in inspecting the model architecture Qwen3OmniMoeForConditionalGeneration".
Which version of Transformers are you using? Make sure it's 4.57 or higher.
The Docker image I'm using is vllm/vllm-openai:v0.11.0, and the transformers version is 4.57.0.
This PR was only merged after v0.11, so you need to install vLLM from the main branch or use the per-commit Docker image.
It's not really feasible as the PR depends on some changes that were introduced after v0.11.
Big thanks, I will attempt to download the latest vLLM Docker image or the latest vLLM version.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Xiong Wang <feizi.wx@alibaba-inc.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Xiong Wang <feizi.wx@alibaba-inc.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: bbartels <benjamin@bartels.dev>
were merged into main via vllm-project/vllm#25550
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Xiong Wang <feizi.wx@alibaba-inc.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
With sampling disabled, for the same audio request, the vLLM main nightly build with the Qwen3-Omni-30B-A3B-Captioner model produces output that differs from https://github.com/wangxiongts/vllm.git.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Roger Wang <hey@rogerw.io> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Xiong Wang <feizi.wx@alibaba-inc.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
@DarkLight1337 How is it going now?
It is still being worked on, but I cannot give further details at the moment.

This PR from the Qwen team adds the Qwen3-Omni-MoE thinker part.
Testing has been conducted internally across four configurations (v0/v1 engine, eager/CUDA graph) on several representative benchmarks, with results meeting expectations.
Known issues (we hope to resolve them together with the vLLM team):
We sincerely appreciate the great work and support from the vLLM team, and look forward to your feedback.
CLOSE #25472