[MM][Feat] Add support for audio in video in Qwen2.5-Omni#26334
[MM][Feat] Add support for audio in video in Qwen2.5-Omni#26334wwl2755 wants to merge 14 commits intovllm-project:mainfrom
Conversation
|
This pull request has merge conflicts that must be resolved before it can be |
|
Documentation preview: https://vllm--26334.org.readthedocs.build/en/26334/ |
| use_audio_in_video = all( | ||
| item["use_audio_in_video"].data for item in video_items | ||
| ) |
There was a problem hiding this comment.
This existing code seems to assume all video inputs should have a paired audio to enable use_audio_in_video.
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
| second_per_grid_ts.append(t) | ||
| if (t := mm_input.get("audio_feature_lengths")) is not None: | ||
| audio_feature_lengths.append(t) | ||
| if mm_input.get("use_audio_in_video") is True: | ||
| use_audio_in_video = True | ||
| # Check for use_audio_in_video | ||
| use_audio_in_video_value = mm_input.get("use_audio_in_video") | ||
| if use_audio_in_video_value is not None: | ||
| use_audio_in_video = bool(use_audio_in_video_value.item()) |
There was a problem hiding this comment.
Preserve any
use_audio_in_video flag across batch
The new loop in _init_mrope_positions overwrites use_audio_in_video on every multimodal item (use_audio_in_video = bool(use_audio_in_video_value.item())). When a batch mixes requests that require audio-in-video with ones that do not, the last item processed can reset the flag to False, so get_mrope_input_positions is called without audio-in-video handling even though earlier requests needed it. This yields incorrect rotary positions for those prompts. The flag should be accumulated (e.g., OR’ed) instead of overwritten so that any request enabling audio-in-video keeps the global flag true.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
How to handle use_audio_in_video and non_use_audio_in_video fixed in a request is a problem. This PR's scope is to assume all video items have the same attribute in this field.
|
This should be ready to review. Please free feel to take a look when you are free~ @DarkLight1337 @ywang96 @Isotr0py |
| ( | ||
| prompt_ids, | ||
| mm_placeholders, | ||
| ) = self._apply_prompt_updates( |
There was a problem hiding this comment.
| ( | |
| prompt_ids, | |
| mm_placeholders, | |
| ) = self._apply_prompt_updates( | |
| prompt_ids, mm_placeholders = self._apply_prompt_updates( |
Nit: Avoid unnecessary lines. Same below, and can also do the same for self._validate_mm_placeholders
| if num_audios != num_videos: | ||
| raise ValueError( | ||
| f"use_audio_in_video requires equal number of audio and video items, " | ||
| f"got audio={num_audios}, video={num_videos}" |
There was a problem hiding this comment.
| f"got audio={num_audios}, video={num_videos}" | |
| f"got {num_audios=}, {num_videos=}" |
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Roger Wang <hey@rogerw.io>
|
Closing as superseded by #27721 |
Fix some of #23888
Enable
audio in videoin Qwen2.5-Omni in V1 engine.Same purpose as #26156, but using a different and simpler method from @ywang96 . Basic idea is to create two placeholders for video and audio with the same start_idx, but use "is_embed" to differetiate them.
Basic flow
Known limitation
This PR assumes the number of video and audio would exactly match to enable
use_audio_in_videoas in the example.Test