[MM][Feat] Add support for audio in video in Qwen2.5-Omni by wwl2755 · Pull Request #26334 · vllm-project/vllm

wwl2755 · 2025-10-07T05:13:00Z

Fix some of #23888

Enable audio in video in Qwen2.5-Omni in V1 engine.

Same purpose as #26156, but using a different and simpler method from @ywang96. The basic idea is to create two placeholders, one for video and one for audio, with the same start_idx, and use "is_embed" to differentiate them.

Basic flow

<|im_start|>user\n<|vision_bos|><|VIDEO|><|vision_eos|>Describe the content of the video<|im_end|>  # no audio placeholder in the prompt
->
      "video": [
          PlaceholderFeaturesInfo(
              start_idx=4,  
              tokens=[151659, 151655, 151655, 151654, 151654, 151660],  
              is_embed=[False, True, True, False, False, False]  
          )
      ]
->
      "audio": [
          PlaceholderFeaturesInfo(
              start_idx=4,  
              tokens=[151659, 151655, 151655, 151654, 151654, 151660],  
              is_embed=[False, False, False, True, True, False]  
          )
      ]
->
<|im_start|>user\n<|vision_bos|><|audio_bos|><|VIDEO|>*2<|AUDIO|>*2<|audio_eos|><|vision_eos|>Describe the content of the video<|im_end|>
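
To make the flow above concrete, here is a minimal sketch of how two placeholders can share one span while claiming different positions inside it. `PlaceholderSketch` and `embed_positions` are hypothetical names made up for illustration; they are not vLLM's actual `PlaceholderFeaturesInfo` API. The token IDs and masks are copied from the example.

```python
from dataclasses import dataclass


@dataclass
class PlaceholderSketch:
    """Hypothetical stand-in for PlaceholderFeaturesInfo (illustration only)."""

    start_idx: int
    tokens: list[int]
    is_embed: list[bool]

    def embed_positions(self) -> list[int]:
        # Absolute prompt positions that this modality's embeddings fill.
        return [self.start_idx + i for i, flag in enumerate(self.is_embed) if flag]


# Token IDs and masks copied from the example above.
tokens = [151659, 151655, 151655, 151654, 151654, 151660]
video = PlaceholderSketch(4, tokens, [False, True, True, False, False, False])
audio = PlaceholderSketch(4, tokens, [False, False, False, True, True, False])

video_pos = video.embed_positions()
audio_pos = audio.embed_positions()

# Both placeholders cover the same span, but their embed positions are disjoint.
assert video.start_idx == audio.start_idx and video.tokens == audio.tokens
assert not set(video_pos) & set(audio_pos)
```

With these inputs the video placeholder claims positions 5 and 6 and the audio placeholder claims positions 7 and 8, so each modality's embeddings land in their own slots even though both placeholders start at index 4.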

Known limitation

This PR assumes that the number of video items exactly matches the number of audio items when use_audio_in_video is enabled, as in the example.

Test

python examples/offline_inference/qwen2_5_omni/only_thinker.py -q use_audio_in_video

INFO 10-09 04:02:38 [llm.py:340] Supported_tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.42s/it]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.14s/it, est. speed input: 2370.76 toks/s, output: 80.69 toks/s]
The video shows a baby sitting on a bed, wearing glasses, and holding a book. The baby seems to be looking at the book and turning the pages. I'm not sure what the baby says, but it could be something like "book" or "read". So, the text of what the baby says is "book" or "read". If you have any other questions about the video or anything else, feel free to let me know.

mergify · 2025-10-07T05:13:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wwl2755.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify · 2025-10-08T12:08:37Z

Documentation preview: https://vllm--26334.org.readthedocs.build/en/26334/

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 · 2025-10-09T03:58:35Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

                use_audio_in_video = all(
                    item["use_audio_in_video"].data for item in video_items
                )


This existing code seems to assume all video inputs should have a paired audio to enable use_audio_in_video.
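
A small sketch of the `all(...)` semantics referenced above, using plain dicts with boolean values instead of the real field objects (so there is no `.data` access here). The helper name `audio_in_video_enabled` is hypothetical.

```python
def audio_in_video_enabled(video_items: list[dict]) -> bool:
    # True only when every video item sets the flag.
    return all(item["use_audio_in_video"] for item in video_items)


all_paired = audio_in_video_enabled(
    [{"use_audio_in_video": True}, {"use_audio_in_video": True}]
)
one_unpaired = audio_in_video_enabled(
    [{"use_audio_in_video": True}, {"use_audio_in_video": False}]
)
# all() over an empty iterable is True in Python, so callers that may
# see zero video items need their own guard.
no_videos = audio_in_video_enabled([])
```

A single video item without the flag turns the whole request's setting off. Whether the real code guards the empty case is not shown in this snippet.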

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-10-09T04:07:51Z

vllm/v1/worker/gpu_model_runner.py

                second_per_grid_ts.append(t)
            if (t := mm_input.get("audio_feature_lengths")) is not None:
                audio_feature_lengths.append(t)
-            if mm_input.get("use_audio_in_video") is True:
-                use_audio_in_video = True
+            # Check for use_audio_in_video
+            use_audio_in_video_value = mm_input.get("use_audio_in_video")
+            if use_audio_in_video_value is not None:
+                use_audio_in_video = bool(use_audio_in_video_value.item())


Preserve any use_audio_in_video flag across batch

The new loop in _init_mrope_positions overwrites use_audio_in_video on every multimodal item (use_audio_in_video = bool(use_audio_in_video_value.item())). When a batch mixes requests that require audio-in-video with ones that do not, the last item processed can reset the flag to False, so get_mrope_input_positions is called without audio-in-video handling even though earlier requests needed it. This yields incorrect rotary positions for those prompts. The flag should be accumulated (e.g., OR’ed) instead of overwritten so that any request enabling audio-in-video keeps the global flag true.
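
The difference between overwriting and accumulating the flag can be sketched as follows. This is a simplified illustration with plain Python booleans rather than tensors (so no `.item()` call), and the function names are hypothetical.

```python
from typing import Iterable, Optional


def overwrite_flag(values: Iterable[Optional[bool]]) -> bool:
    # Mirrors the pattern under review: the last non-None value wins.
    flag = False
    for value in values:
        if value is not None:
            flag = bool(value)
    return flag


def accumulate_flag(values: Iterable[Optional[bool]]) -> bool:
    # Suggested alternative: once any item sets the flag, it stays set.
    flag = False
    for value in values:
        if value is not None:
            flag = flag or bool(value)
    return flag


mixed = [True, None, False]
overwritten = overwrite_flag(mixed)
accumulated = accumulate_flag(mixed)
```

For the mixed input, the overwrite variant ends up False while the accumulate variant ends up True. Whether a single global flag is the right granularity for mixed inputs is a separate question.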


How to handle a request that mixes use_audio_in_video and non-use_audio_in_video items is an open problem. This PR's scope is limited to the case where all video items have the same value for this field.

wwl2755 · 2025-10-09T04:29:39Z

This should be ready for review. Please feel free to take a look when you are free~ @DarkLight1337 @ywang96 @Isotr0py

DarkLight1337 · 2025-10-09T04:43:47Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

+                (
+                    prompt_ids,
+                    mm_placeholders,
+                ) = self._apply_prompt_updates(


Suggested change

-                (
-                    prompt_ids,
-                    mm_placeholders,
-                ) = self._apply_prompt_updates(
+                prompt_ids, mm_placeholders = self._apply_prompt_updates(

Nit: Avoid unnecessary lines. Same below, and can also do the same for self._validate_mm_placeholders

DarkLight1337 · 2025-10-09T04:44:38Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

+        if num_audios != num_videos:
+            raise ValueError(
+                f"use_audio_in_video requires equal number of audio and video items, "
+                f"got audio={num_audios}, video={num_videos}"


Suggested change

-                f"got audio={num_audios}, video={num_videos}"
+                f"got {num_audios=}, {num_videos=}"
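
For reference, a minimal standalone sketch of the count check with the suggested `=` f-string specifier (available since Python 3.8), which prints both the variable name and its value. The function name `check_matching_counts` is hypothetical; only the condition and the message text come from the diff above.

```python
def check_matching_counts(num_audios: int, num_videos: int) -> None:
    # Reject requests whose audio and video counts differ.
    if num_audios != num_videos:
        raise ValueError(
            "use_audio_in_video requires equal number of audio and video items, "
            f"got {num_audios=}, {num_videos=}"
        )


try:
    check_matching_counts(1, 2)
except ValueError as exc:
    message = str(exc)
```

Here `message` ends with `got num_audios=1, num_videos=2`, so the variable names no longer need to be spelled out by hand in the message.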

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify · 2025-10-14T04:31:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wwl2755.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

Signed-off-by: Roger Wang <hey@rogerw.io>

DarkLight1337 · 2025-11-25T04:23:14Z

Closing as superseded by #27721

mergify bot added documentation Improvements or additions to documentation qwen Related to Qwen models v1 labels Oct 7, 2025

mergify bot added the needs-rebase label Oct 7, 2025

wwl2755 force-pushed the mm-omni-2 branch from 243bba6 to acb006d Compare October 7, 2025 05:28

mergify bot removed the needs-rebase label Oct 7, 2025

init

4fdbf8e

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 force-pushed the mm-omni-2 branch from acb006d to 4fdbf8e Compare October 7, 2025 05:46

fix

ab88a46

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

DarkLight1337 mentioned this pull request Oct 7, 2025

[V0 Deprecation] Remove VLLM_USE_V1 from docs and scripts #26336

Merged

5 tasks

wwl2755 added 2 commits October 9, 2025 02:59

cleanup

db41805

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

cleanup

8df7bc3

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 commented Oct 9, 2025


add validation for matched number

1eec6a0

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 marked this pull request as ready for review October 9, 2025 04:04

wwl2755 requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat, sighingnow and ywang96 as code owners October 9, 2025 04:04

chatgpt-codex-connector bot reviewed Oct 9, 2025


DarkLight1337 reviewed Oct 9, 2025


comment

9693739

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify bot added the needs-rebase label Oct 14, 2025

merge from main

d8030fa

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify bot removed the needs-rebase label Oct 16, 2025

DarkLight1337 mentioned this pull request Oct 16, 2025

[RFC]: Multi-modality Support on vLLM #4194

Open

57 tasks

ywang96 added 7 commits October 24, 2025 13:07

Merge branch 'main' into mm-omni-2

9071910

Merge branch 'main' into mm-omni-2

c8c0f2a

Merge branch 'main' into mm-omni-2

5052362

Merge branch 'main' into mm-omni-2

0fa3c7d

Merge branch 'main' into mm-omni-2

4d54c2d

add attn override

1f6e8f4

Signed-off-by: Roger Wang <hey@rogerw.io>

update doc

41e0b5f

Signed-off-by: Roger Wang <hey@rogerw.io>

huachenheli mentioned this pull request Oct 29, 2025

[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. #27721

Merged

5 tasks

wangxiyuan mentioned this pull request Nov 12, 2025

[[V0 deprecation]]Remove VLLM_USE_V1 env #28204

Merged

5 tasks

wwl2755 mentioned this pull request Nov 16, 2025

[Bug]: 新版的vllm已经废弃了v0代码，而对qwen-omni系列的模型支持仅限于v0，似乎是因为这个原因，我们无法使用最新版的vllm推理qwen-omni模型 #28388

Open

1 task

Gaohan123 mentioned this pull request Nov 20, 2025

[Feature] support multimodal inputs with multiple requests vllm-project/vllm-omni#76

Merged

5 tasks

DarkLight1337 closed this Nov 25, 2025

wwl2755 wants to merge 14 commits into vllm-project:main from wwl2755:mm-omni-2