[Bugfix][Core] Fix use audio in video bug#32994
[Bugfix][Core] Fix use audio in video bug#32994xsank wants to merge 3 commits intovllm-project:mainfrom
Conversation
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
There was a problem hiding this comment.
Code Review
This pull request aims to fix a bug in how audio-in-video features are handled in Qwen3-Omni models by using separate masks for different multimodal inputs instead of a single combined mask. The approach of separating the masks is sound and correctly addresses the issue of discontinuous audio masks. However, the implementation introduces a critical bug that can cause a crash and a significant breaking change to the multimodal interface that affects other models in the codebase. I've provided detailed comments on these two critical issues with suggestions for how to resolve them.
| inputs_embeds_scheduled = self.model.embed_input_ids( | ||
| self.input_ids.gpu[:num_scheduled_tokens], | ||
| multimodal_embeddings=mm_embeds, | ||
| is_multimodal=is_mm_embed, | ||
| is_multimodals=is_mm_embeds, | ||
| ) |
There was a problem hiding this comment.
This change, which passes is_multimodals to embed_input_ids, introduces a breaking change to the SupportsMultiModal interface. Currently, only Qwen3OmniMoeThinkerForConditionalGeneration is updated to handle this new parameter. Other multimodal models in the codebase that expect is_multimodal: torch.Tensor will fail at runtime.
To address this, you could either:
- Update all other multimodal models to accept the
is_multimodalsparameter. - Implement a backward-compatibility mechanism. For instance, you could inspect the signature of
self.model.embed_input_idsand, if it doesn't acceptis_multimodals, compute a single combined mask usingreduce(torch.logical_or, is_mm_embeds)and pass it asis_multimodal.
There was a problem hiding this comment.
As the critical message, this commit should modify the interface _gather_mm_embeddingsand the interface embed_input_ids, a lot of the models code should be modified, i'm not sure should i do it now.
|
Hi @xsank, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
| self.input_ids.gpu[:num_scheduled_tokens], | ||
| multimodal_embeddings=mm_embeds, | ||
| is_multimodal=is_mm_embed, | ||
| is_multimodals=is_mm_embeds, |
There was a problem hiding this comment.
Interface change breaks other multimodal models
High Severity
The parameter name in gpu_model_runner.py changed from is_multimodal to is_multimodals, but only qwen3_omni_moe_thinker.py was updated to accept the new parameter name. Other multimodal models (e.g., clip.py, eagle2_5_vl.py, gemma3_mm.py, ernie45_vl.py, qwen2_5_omni_thinker.py) still expect is_multimodal (singular). When the runner calls embed_input_ids(is_multimodals=...) on these models, a TypeError will be raised for unexpected keyword argument.
Additional Locations (1)
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
|
Hi @xsank, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
|
Very much appreciate the bugfix! We're aware of this bug and will prioritize reviewing this! |
|
This pull request has merge conflicts that must be resolved before it can be |
|
It seems that @Etelis ‘s PR compatible with this case in another way. |
|
Sorry for the late reply - I think we still need this PR to fix the actual issue itself, but I personally prefer #33605 over this since that PR is more model-specific. |
If use the audio in the video, the mask of the audio is not continuous.

The current implemention would mix the mm feature leads to the bug. Use the mask list would be better.
before:
{ "综合描述": "视频中,一个穿着深色服装的人在昏暗的室内环境中,手持一个带有蓝色图案的白色物体,似乎在进行某种仪式或表演。背景是深色的,可能是一个舞台或舞台,有蓝色的灯光。这个人似乎在移动,可能在移动或移动,可能在移动或移动。背景中可以听到模糊的音乐或音乐,以及一些模糊的背景噪音。这个人可能在说话,但语音内容无法辨认。", "简要描述": "一个人在昏暗的舞台上手持一个带有蓝色图案的白色物体,背景有模糊的音乐。" }after fix:
{ "综合描述": "在一个类似剧院的舞台上,一名穿着粉色连衣裙的女性和一名穿着蓝色西装的男性在橙色地毯上跳舞。他们从舞台中央向两侧移动,女性向左,男性向右。与此同时,另一名穿着绿色连衣裙的女性和一名穿着棕色夹克的男性从舞台左侧跑向右侧,经过跳舞的两人。在舞台的右侧,一名穿着灰色西装的男性手持红色文件夹站立不动。背景中,一个金色的王座位于舞台后方,王座上方悬挂着一幅画,画中是一对男女。观众坐在舞台前方的座位上观看表演。音频中可以听到一名男性用戏剧化的语气说:“The dale still has been the father of good news.” 另一名男性回应:“Am I, my lord?” 第三名男性说:“I assure my liege, I hold.”", "简要描述": "舞台上,一男一女在跳舞,另一对男女从旁跑过,一名男子手持红色文件夹站立。背景有王座和画像。音频中一名男性说:“The dale still has been the father of good news.”" }But this change should modify the interface
_gather_mm_embeddingsandthe interfaceembed_input_ids, a lot of the models code should be modified, i'm not sure should i do it now.