[Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni #33605
Conversation
…audio budget Signed-off-by: linyueqian <linyueqian@outlook.com>
Code Review
This pull request introduces two important bug fixes for Qwen2.5-Omni when using use_audio_in_video=True. The first fix correctly provides an audio budget by overriding get_mm_max_tokens_per_item, resolving a KeyError. The second fix addresses an embedding misalignment for interleaved audio and video tokens by overriding embed_input_ids to scatter embeddings for each modality separately. While the approach is sound, I've identified a critical issue in the implementation of the embedding separation logic that could lead to incorrect model behavior. A detailed comment with a suggested fix is provided below.
```python
if video_remaining > 0 and n <= video_remaining:
    video_embeds.append(emb)
    video_remaining -= n
elif audio_remaining > 0 and n <= audio_remaining:
    audio_embeds.append(emb)
    audio_remaining -= n
```
The logic for separating multimodal embeddings into video_embeds and audio_embeds appears to be incorrect. It assumes that video embeddings appear before audio embeddings in the multimodal_embeddings list.
However, based on the implementation of embed_multimodal and _parse_and_validate_multimodal_inputs, the order of embeddings is determined by the order of modalities in mm_input_by_modality. This order is derived from the field order in the dictionary returned by create_qwen2_5_omni_thinker_field_factory, where audio-related fields appear before video-related fields. Consequently, audio embeddings will be placed before video embeddings in multimodal_embeddings.
The current greedy matching logic will incorrectly classify audio embeddings as video embeddings, leading to incorrect model behavior.
To fix this, the order of checks should be swapped to match the embedding order (audio then video).
Suggested change:

```diff
-if video_remaining > 0 and n <= video_remaining:
-    video_embeds.append(emb)
-    video_remaining -= n
-elif audio_remaining > 0 and n <= audio_remaining:
-    audio_embeds.append(emb)
-    audio_remaining -= n
+if audio_remaining > 0 and n <= audio_remaining:
+    audio_embeds.append(emb)
+    audio_remaining -= n
+elif video_remaining > 0 and n <= video_remaining:
+    video_embeds.append(emb)
+    video_remaining -= n
```
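As a minimal toy illustration of the concern (the chunk contents and budgets below are made up; the real code operates on embedding tensors), checking the video budget first claims whichever chunk happens to fit it, regardless of that chunk's actual modality:

```python
def split_by_modality(chunks, video_remaining, audio_remaining):
    # Greedy split mirroring the reviewed logic: the video budget is checked first.
    video_embeds, audio_embeds = [], []
    for emb in chunks:
        n = len(emb)
        if video_remaining > 0 and n <= video_remaining:
            video_embeds.append(emb)
            video_remaining -= n
        elif audio_remaining > 0 and n <= audio_remaining:
            audio_embeds.append(emb)
            audio_remaining -= n
    return video_embeds, audio_embeds

# If an audio chunk arrives first and fits the video budget,
# the video-first branch claims it:
chunks = [["a"] * 4, ["v"] * 4]  # audio chunk first, then video chunk
video_embeds, audio_embeds = split_by_modality(chunks, 4, 4)
print(video_embeds)  # [['a', 'a', 'a', 'a']] -- the audio chunk was misclassified
```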
I don't think we need to fix it. The embedding order is actually determined by _gather_mm_embeddings in gpu_model_runner.py, which iterates over req_state.mm_features — where video is registered before audio (as confirmed by the scheduler output's mm_features=[MultiModalFeatureSpec(modality='video', ...), MultiModalFeatureSpec(modality='audio', ...)]). The field factory dict order affects HF processor kwargs, not the feature registration order, which is set by the multimodal processor's placeholder creation (video first, then audio derived via _derive_audio_from_video_placeholders).
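The ordering argument can be sketched with simplified stand-ins (the dict layout below is illustrative, not vLLM's actual `MultiModalFeatureSpec`):

```python
# Simplified stand-in for req_state.mm_features: embeddings are gathered by
# iterating features in registration order (video first, then audio), not by
# the HF processor's field-factory dict order.
mm_features = [
    {"modality": "video", "embeds": ["v1", "v2"]},
    {"modality": "audio", "embeds": ["a1", "a2"]},
]
gathered = [f["embeds"] for f in mm_features]
print(gathered)  # [['v1', 'v2'], ['a1', 'a2']] -- video embeddings come first
```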
…d missing audio budget Signed-off-by: linyueqian <linyueqian@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; they only run a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Has vllm-omni fixed this part, or will it rely on this fix?
The first issue should be fixed in a more general way by #33634; can you focus this PR on the second issue?
OK, I will revert the changes once the PR you mention is merged and test again.
Signed-off-by: linyueqian <linyueqian@outlook.com>
Thanks for the contribution - will take a look today |
ywang96 left a comment
Please take a look at my comment - thanks!
I've updated this PR with some cleanup; we can use the following example as the source of truth:

```python
import os
import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=2,
        limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )
    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )
    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios
    outputs = llm.generate([inputs], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```

Main branch:

This PR:
…-Omni (vllm-project#33605) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
…-Omni (vllm-project#33605) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>
…llm-project#34506) PR vllm-project#33605 changed the non-interleaved path in Qwen2_5OmniThinkerForConditionalGeneration.embed_input_ids to call _merge_multimodal_embeddings directly instead of super().embed_input_ids. This broke mixed modalities (audio + image + video): the embeddings list is ordered by modality type (audio, image, video) but masked_scatter_ fills positions sequentially by token order, so audio embeddings were incorrectly assigned to image/video positions. Restore super().embed_input_ids() for the non-interleaved path to match pre-vllm-project#33605 behaviour. The interleaved use_audio_in_video path is unchanged and still uses merge_interleaved_embeddings. Adds unit tests for the regression and related helpers. Fixes: vllm-project#34506 Signed-off-by: linyueqian <linyueqian@outlook.com>
Purpose

Fix bugs preventing `use_audio_in_video=True` from working correctly with Qwen2.5-Omni and Qwen3-Omni.

Bug 1: `KeyError: 'audio'` in `MultiModalBudget` (fixed by #33634)

Both `Qwen2_5OmniThinkerProcessingInfo` and `Qwen3OmniMoeThinkerProcessingInfo` inherit `get_mm_max_tokens_per_item` from `Qwen2VLProcessingInfo`, which only returns `{"image": ..., "video": ...}` without an `"audio"` key. This causes a `KeyError` when `MultiModalBudget.__init__` tries to look up the audio budget. Fixed by overriding `get_mm_max_tokens_per_item` in both models to include an audio token budget computed from the WhisperFeatureExtractor config.

Bug 2: Embedding merge misalignment with interleaved audio-in-video tokens
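As a rough sketch of where such an audio budget could come from: `chunk_length`, `sampling_rate`, and `hop_length` are real WhisperFeatureExtractor config fields with these defaults, but the stride-2 downsampling factor is an assumption for illustration, not the exact computation in this PR.

```python
def approx_max_audio_tokens(chunk_length=30, sampling_rate=16000, hop_length=160):
    # Mel frames per 30 s chunk under default Whisper settings: 30 * 16000 / 160 = 3000.
    mel_frames = chunk_length * sampling_rate // hop_length
    # Assumed stride-2 downsampling in the audio encoder halves the frame count.
    return mel_frames // 2

print(approx_max_audio_tokens())  # 1500
```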
When `use_audio_in_video=True`, the HF processor interleaves video and audio tokens in chunks: `[video_chunk1, audio_chunk1, video_chunk2, audio_chunk2, ...]`. However, `_gather_mm_embeddings` provides embeddings as separate contiguous tensors `[all_video_embeds, all_audio_embeds]` with an ORed `is_multimodal` mask. The default `masked_scatter_` fills True positions sequentially, placing video embeddings at audio token positions and vice versa.

Fixed by overriding `embed_input_ids` in both models to detect interleaved video+audio tokens using `input_ids`, build per-modality masks (`is_video`, `is_audio`), and scatter each modality's embeddings separately. For Qwen3-Omni, the deepstack vision path also required fixing: its `is_vision` mask was built using sequential position tracking, which similarly breaks under interleaving; it now uses `input_ids`-based masks when interleaved.

This is the same underlying issue identified in #32994, but solved locally in the model without changing the shared multimodal interface.
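The misalignment and the per-modality fix can be illustrated with a pure-Python stand-in for `masked_scatter_`'s sequential-fill semantics (the token and embedding values below are made up):

```python
def masked_scatter(dest, mask, src):
    # Mimic torch.Tensor.masked_scatter_: consume src in order,
    # writing one element into each True position of mask.
    it = iter(src)
    return [next(it) if m else d for d, m in zip(dest, mask)]

# Interleaved token sequence: [text, video, audio, video, audio, text]
is_multimodal = [False, True, True, True, True, False]
video_embeds = ["V1", "V2"]
audio_embeds = ["A1", "A2"]

# Buggy path: one ORed mask plus per-modality contiguous embeddings
# ([all_video, all_audio]) fills positions in token order, so V2 lands
# on an audio position and A1 on a video position.
buggy = masked_scatter(["T"] * 6, is_multimodal, video_embeds + audio_embeds)
print(buggy)  # ['T', 'V1', 'V2', 'A1', 'A2', 'T'] -- misaligned

# Fixed path: scatter each modality with its own input_ids-derived mask.
is_video = [False, True, False, True, False, False]
is_audio = [False, False, True, False, True, False]
fixed = masked_scatter(["T"] * 6, is_video, video_embeds)
fixed = masked_scatter(fixed, is_audio, audio_embeds)
print(fixed)  # ['T', 'V1', 'A1', 'V2', 'A2', 'T'] -- each token gets its own modality
```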
Related PRs: #27721, #32772, #32994
Test Plan
Offline inference with `examples/offline_inference/qwen2_5_omni/only_thinker.py` was also tested.

Test Results
Qwen2.5-Omni
Audio only — correctly transcribes "Mary had a little lamb":
Video + Audio (use_audio_in_video=True) — correctly processes both modalities (3228 prompt tokens, confirming interleaving):
Video + Audio (use_audio_in_video=False) — also works correctly (2827 prompt tokens, no interleaving):
Qwen3-Omni
Before fix (use_audio_in_video=True):
After fix (use_audio_in_video=True) — correctly transcribes:
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.