Skip to content

[Bugfix][Core] Fix use audio in video bug#32994

Closed
xsank wants to merge 3 commits intovllm-project:mainfrom
xsank:main
Closed

[Bugfix][Core] Fix use audio in video bug#32994
xsank wants to merge 3 commits intovllm-project:mainfrom
xsank:main

Conversation

@xsank
Copy link
Copy Markdown
Contributor

@xsank xsank commented Jan 24, 2026

If use the audio in the video, the mask of the audio is not continuous.
image

The current implemention would mix the mm feature leads to the bug. Use the mask list would be better.

before:

{
    "综合描述": "视频中,一个穿着深色服装的人在昏暗的室内环境中,手持一个带有蓝色图案的白色物体,似乎在进行某种仪式或表演。背景是深色的,可能是一个舞台或舞台,有蓝色的灯光。这个人似乎在移动,可能在移动或移动,可能在移动或移动。背景中可以听到模糊的音乐或音乐,以及一些模糊的背景噪音。这个人可能在说话,但语音内容无法辨认。",
    "简要描述": "一个人在昏暗的舞台上手持一个带有蓝色图案的白色物体,背景有模糊的音乐。"
}

after fix:

{
    "综合描述": "在一个类似剧院的舞台上,一名穿着粉色连衣裙的女性和一名穿着蓝色西装的男性在橙色地毯上跳舞。他们从舞台中央向两侧移动,女性向左,男性向右。与此同时,另一名穿着绿色连衣裙的女性和一名穿着棕色夹克的男性从舞台左侧跑向右侧,经过跳舞的两人。在舞台的右侧,一名穿着灰色西装的男性手持红色文件夹站立不动。背景中,一个金色的王座位于舞台后方,王座上方悬挂着一幅画,画中是一对男女。观众坐在舞台前方的座位上观看表演。音频中可以听到一名男性用戏剧化的语气说:“The dale still has been the father of good news.” 另一名男性回应:“Am I, my lord?” 第三名男性说:“I assure my liege, I hold.”",
    "简要描述": "舞台上,一男一女在跳舞,另一对男女从旁跑过,一名男子手持红色文件夹站立。背景有王座和画像。音频中一名男性说:“The dale still has been the father of good news.”"
}

But this change should modify the interface _gather_mm_embeddingsand the interface embed_input_ids, a lot of the models code should be modified, i'm not sure should i do it now.

Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
@xsank xsank requested a review from sighingnow as a code owner January 24, 2026 04:00
@mergify mergify bot added qwen Related to Qwen models v1 bug Something isn't working labels Jan 24, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a bug in how audio-in-video features are handled in Qwen3-Omni models by using separate masks for different multimodal inputs instead of a single combined mask. The approach of separating the masks is sound and correctly addresses the issue of discontinuous audio masks. However, the implementation introduces a critical bug that can cause a crash and a significant breaking change to the multimodal interface that affects other models in the codebase. I've provided detailed comments on these two critical issues with suggestions for how to resolve them.

Comment on lines 2658 to 2662
inputs_embeds_scheduled = self.model.embed_input_ids(
self.input_ids.gpu[:num_scheduled_tokens],
multimodal_embeddings=mm_embeds,
is_multimodal=is_mm_embed,
is_multimodals=is_mm_embeds,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This change, which passes is_multimodals to embed_input_ids, introduces a breaking change to the SupportsMultiModal interface. Currently, only Qwen3OmniMoeThinkerForConditionalGeneration is updated to handle this new parameter. Other multimodal models in the codebase that expect is_multimodal: torch.Tensor will fail at runtime.

To address this, you could either:

  1. Update all other multimodal models to accept the is_multimodals parameter.
  2. Implement a backward-compatibility mechanism. For instance, you could inspect the signature of self.model.embed_input_ids and, if it doesn't accept is_multimodals, compute a single combined mask using reduce(torch.logical_or, is_mm_embeds) and pass it as is_multimodal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the critical message, this commit should modify the interface _gather_mm_embeddingsand the interface embed_input_ids, a lot of the models code should be modified, i'm not sure should i do it now.

@mergify
Copy link
Copy Markdown

mergify bot commented Jan 24, 2026

Hi @xsank, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

self.input_ids.gpu[:num_scheduled_tokens],
multimodal_embeddings=mm_embeds,
is_multimodal=is_mm_embed,
is_multimodals=is_mm_embeds,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface change breaks other multimodal models

High Severity

The parameter name in gpu_model_runner.py changed from is_multimodal to is_multimodals, but only qwen3_omni_moe_thinker.py was updated to accept the new parameter name. Other multimodal models (e.g., clip.py, eagle2_5_vl.py, gemma3_mm.py, ernie45_vl.py, qwen2_5_omni_thinker.py) still expect is_multimodal (singular). When the runner calls embed_input_ids(is_multimodals=...) on these models, a TypeError will be raised for unexpected keyword argument.

Additional Locations (1)

Fix in Cursor Fix in Web

Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
@mergify
Copy link
Copy Markdown

mergify bot commented Jan 24, 2026

Hi @xsank, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
@ywang96
Copy link
Copy Markdown
Member

ywang96 commented Jan 24, 2026

Very much appreciate the bugfix! We're aware of this bug and will prioritize reviewing this!

@mergify
Copy link
Copy Markdown

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xsank.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 27, 2026
@xsank
Copy link
Copy Markdown
Contributor Author

xsank commented Jan 28, 2026

It seems that @Etelis ‘s PR compatible with this case in another way.

@ywang96
Copy link
Copy Markdown
Member

ywang96 commented Feb 4, 2026

Sorry for the late reply - I think we still need this PR to fix the actual issue itself, but I personally prefer #33605 over this since that PR is more model-specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working needs-rebase qwen Related to Qwen models v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants