[MM][Feat] Add support for audio in video in Qwen2.5-Omni #26156

wwl2755 wants to merge 1 commit into vllm-project:main

Conversation
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
```python
grid_thw_list = video_grid_thw.tolist()
if isinstance(grid_thw_list, list) and len(grid_thw_list) == 1:
    grid_thw_list = grid_thw_list[0]
```
Suggested change:

```python
# Before
grid_thw_list = video_grid_thw.tolist()
if isinstance(grid_thw_list, list) and len(grid_thw_list) == 1:
    grid_thw_list = grid_thw_list[0]

# After
grid_thw_list = video_grid_thw.tolist()
if len(grid_thw_list) == 1:
    grid_thw_list = grid_thw_list[0]
```
The `isinstance` check is redundant, since `tolist()` always returns a Python list here.
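The observation can be verified directly: `torch.Tensor.tolist()` on a tensor with at least one dimension always returns a Python list (only a 0-dim tensor yields a bare scalar). A quick check, using an illustrative `[1, 3]` grid tensor:

```python
import torch

# A [1, 3] grid tensor, shaped like video_grid_thw for a single video
# (the values here are made up for illustration).
video_grid_thw = torch.tensor([[4, 36, 36]])
grid_thw_list = video_grid_thw.tolist()
assert isinstance(grid_thw_list, list)  # always true for >=1-dim tensors
assert grid_thw_list == [[4, 36, 36]]
# Only a 0-dim tensor returns a bare scalar instead of a list:
assert isinstance(torch.tensor(5).tolist(), int)
```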
```python
if isinstance(grid_thw_list, list) and len(grid_thw_list) == 1:
    grid_thw_list = grid_thw_list[0]
grid_t, grid_h, grid_w = grid_thw_list
grid_t, grid_h, grid_w = int(grid_t), int(grid_h), int(grid_w)
```
Maybe we can perform the `int` conversion and the removal of the batch dimension before converting to a list.
Code Review
This pull request adds support for audio in video in the Qwen2.5-Omni model for the V1 engine. The changes are mostly in qwen2_5_omni_thinker.py to handle the new use_audio_in_video flag, including logic for interleaving audio and video embeddings. The implementation is largely correct, but I've identified a critical issue with handling batched requests that could lead to a crash, and a high-severity maintainability concern regarding the manual handling of special token embeddings. My review includes suggestions to address these points.
```python
use_audio_in_video = kwargs.get("use_audio_in_video", False)
use_audio_in_video = bool(use_audio_in_video.item())
```
The current logic for determining `use_audio_in_video` is incorrect and will fail for batched requests.

- `kwargs.get("use_audio_in_video", False)` is problematic. If the key is not present, it returns `False`, and `False.item()` will raise an `AttributeError`.
- If multiple requests are batched, `kwargs.get("use_audio_in_video")` will be a tensor with multiple elements. Calling `.item()` on it will raise a `ValueError`.
The logic should correctly handle the case where the key is missing and where the tensor is batched. Since the current architecture does not support mixed batches of use_audio_in_video, we should enforce that all requests in a batch have the same setting. A more robust implementation would be to check if all values in the tensor are the same before proceeding.
Suggested change:

```python
# Before
use_audio_in_video = kwargs.get("use_audio_in_video", False)
use_audio_in_video = bool(use_audio_in_video.item())

# After
use_audio_in_video_tensor = kwargs.get("use_audio_in_video")
use_audio_in_video = False
if use_audio_in_video_tensor is not None:
    # This assumes all requests in a batch have the same setting.
    # We take the value from the first item.
    val = (use_audio_in_video_tensor[0]
           if use_audio_in_video_tensor.ndim > 0
           else use_audio_in_video_tensor)
    use_audio_in_video = bool(val.item())
```
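The stricter variant the reviewer describes — rejecting mixed batches outright rather than silently taking the first value — can be sketched in plain Python. The helper name and the list of flags standing in for the batched tensor are assumptions for illustration, not code from the PR:

```python
def resolve_use_audio_in_video(flags):
    """Collapse per-request flags (a stand-in for the batched tensor)
    into one bool, rejecting mixed batches, which the current
    architecture does not support."""
    if flags is None or len(flags) == 0:
        # Key absent: default to False instead of calling .item() on it.
        return False
    first = bool(flags[0])
    if any(bool(f) != first for f in flags):
        raise ValueError(
            "Mixed use_audio_in_video values in one batch are unsupported")
    return first

assert resolve_use_audio_in_video(None) is False
assert resolve_use_audio_in_video([True, True]) is True
```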
```python
# TODO: this should be moved to the placeholder mechanism
# Get the text embeddings for audio_bos and audio_eos tokens
thinker_config = self.config
# <|audio_bos|>
audio_start_token_id = thinker_config.audio_start_token_id
# <|audio_eos|>
audio_end_token_id = thinker_config.audio_end_token_id

embed_layer = self.language_model.model.embed_tokens
device = merged_embedding.device
dtype = merged_embedding.dtype

# Get inferred embeddings for special tokens
audio_bos_ids = torch.tensor([audio_start_token_id], device=device)
audio_eos_ids = torch.tensor([audio_end_token_id], device=device)
audio_bos_emb = embed_layer(audio_bos_ids).to(dtype)  # [1, hidden_size]
audio_eos_emb = embed_layer(audio_eos_ids).to(dtype)  # [1, hidden_size]

# Add special token embeddings at beginning and end
# Structure: [audio_bos] + [interleaved] + [audio_eos]
merged_embedding = torch.cat(
    [audio_bos_emb, merged_embedding, audio_eos_emb], dim=0)
```
The manual handling of `audio_bos` and `audio_eos` token embeddings is a maintainability concern. This logic is inconsistent with how special tokens are typically handled through the placeholder mechanism in vLLM, which makes the code more fragile to future changes in tokenization or embedding logic. As noted in the TODO comment, this should be refactored to use the placeholder mechanism, for example by modifying `omni_get_updates_use_audio_in_video` to include the special tokens and letting the standard machinery handle the embedding.
```python
# Check if use_audio_in_video is enabled - if so, we need to
# interleave audio and video embeddings into a single tensor
use_audio_in_video = False
if "video" in mm_input_by_modality and "audio" in mm_input_by_modality:
```
Tbh I think this will never be hit in the V1 engine, because we only call `get_multimodal_embeddings` on one modality at a time.
Yeah, it is a bit confusing: in this design we treat the audio as "residing" in the video, so it would be seen as video but carries `input_audio_features`.
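A minimal illustration of the interleaving idea discussed above. The per-chunk pairing and the function name are assumptions for this sketch, not vLLM's actual implementation:

```python
import torch

def interleave_audio_video(video_chunks, audio_chunks):
    # video_chunks / audio_chunks: lists of [len_i, hidden] tensors,
    # paired per temporal chunk (an assumed layout for illustration).
    assert len(video_chunks) == len(audio_chunks)
    pieces = []
    for v, a in zip(video_chunks, audio_chunks):
        pieces.extend([v, a])
    # One sequence: [v0, a0, v1, a1, ...], which the PR then wraps
    # with the audio_bos / audio_eos embeddings.
    return torch.cat(pieces, dim=0)
```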
Thank you @DarkLight1337 for the early review! After syncing offline with @ywang96, he may have an alternative and cleaner method to support this. Will get back to this once we have finalized our design.
Documentation preview: https://vllm--26156.org.readthedocs.build/en/26156/
This pull request has merge conflicts that must be resolved before it can be merged.
Please merge this into releases/v0.11.1.
Closing as superseded by #27721
Fix some of #23888

Enable audio in video in Qwen2.5-Omni in the V1 engine.

CC: @ywang96 @DarkLight1337 @Isotr0py

Test