
[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. #27721

Merged
ywang96 merged 20 commits into vllm-project:main from huachenheli:qwen3audioinvideo
Nov 24, 2025
Conversation

@huachenheli
Contributor

@huachenheli huachenheli commented Oct 29, 2025

Purpose

(EngineCore_DP0 pid=2834739) INFO 10-29 13:21:24 [core.py:237] init engine (profile, create kv cache, warmup model) took 18.56 seconds
INFO 10-29 13:21:25 [llm.py:345] Supported tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.90s/it]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.31s/it, est. speed input: 1492.83 toks/s, output: 102.27 toks/s]
The video shows a baby sitting on a bed, wearing glasses and intently focused on a book they are holding. The baby is dressed in a light blue shirt and appears to be deeply engaged in reading, turning the pages of the book. The room around them is somewhat cluttered, with clothes and other items scattered on the bed. In the background, there is a crib and a wall with a picture frame. The baby occasionally looks up from the book, seemingly curious about something off-camera, but quickly returns their attention to the book. The baby's glasses are large in proportion to their face, adding a humorous and endearing touch to the scene.

FIX #22268
FIX #22364
CLOSE #23888
CLOSE #25473
CLOSE #28046

Test Plan

HF_HUB_DISABLE_XET=1 VLLM_ATTENTION_BACKEND=TORCH_SDPA python examples/offline_inference/qwen3_omni/only_thinker.py -q use_audio_in_video

pytest tests/model_executor/test_qwen3_omni.py

Sanity checked with https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>

mergify bot commented Oct 29, 2025

Documentation preview: https://vllm--27721.org.readthedocs.build/en/27721/

@mergify mergify bot added the documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality, #4194), and qwen (Related to Qwen models) labels Oct 29, 2025

mergify bot commented Oct 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @huachenheli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 29, 2025
    return _Backend.TORCH_SDPA, None

elif current_platform.is_cuda():
    return _Backend.TORCH_SDPA, None
Contributor


Two questions:

  1. What is the reason for using TORCH_SDPA?
  2. If we return immediately, should the `if` section be removed?

Contributor Author


This will be removed when I clean up this PR. I was having issues with flash-attn on my devgpu, so this is a local hack to get Qwen3 running.

assert "audio" in mm_item_counts
mm_item_counts["audio"] -= mm_item_counts["video"]
super()._validate_mm_placeholders(mm_placeholders, mm_item_counts)
# def _validate_mm_placeholders(
Contributor


It feels like the intent behind validating the placeholders was good; should we keep a single validation that performs the check we need?

Contributor Author


This is already being removed in #26334
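For context, the count adjustment quoted above can be sketched as a small standalone function (the helper name below is hypothetical; the real override is being removed in #26334). The idea: when use_audio_in_video is enabled, every video item carries its own audio track, so the number of standalone audio placeholders to validate is the total audio count minus the video count.

```python
def adjust_audio_count(mm_item_counts: dict[str, int]) -> dict[str, int]:
    """Subtract the video count from the audio count, mirroring the quoted
    override: each video's audio track is accounted for by the video item
    itself when use_audio_in_video is enabled."""
    assert "audio" in mm_item_counts
    adjusted = dict(mm_item_counts)
    adjusted["audio"] -= mm_item_counts.get("video", 0)
    return adjusted

# Two audio items, one of which belongs to the video stream:
print(adjust_audio_count({"audio": 2, "video": 1}))  # {'audio': 1, 'video': 1}
```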

filtered_updates,
)
# Derive audio placeholders from video placeholders
mm_placeholders = self._derive_audio_from_video_placeholders(
Contributor


nice, that can be very useful
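The helper quoted above is only shown by name in the diff excerpt; a minimal sketch of what deriving audio placeholders from video placeholders could look like follows (all names are simplified stand-ins, not the actual vLLM API). With audio-in-video, the audio tokens live inside the video's placeholder region, so an audio range can be mirrored from each video range instead of being located separately in the prompt.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class PlaceholderRange:
    offset: int   # start index of the placeholder tokens in the prompt
    length: int   # number of placeholder tokens

def derive_audio_from_video_placeholders(
    placeholders: dict[str, list[PlaceholderRange]],
) -> dict[str, list[PlaceholderRange]]:
    """Mirror each video placeholder range as an audio placeholder range."""
    derived = dict(placeholders)
    videos = placeholders.get("video", [])
    derived["audio"] = placeholders.get("audio", []) + [
        PlaceholderRange(v.offset, v.length) for v in videos
    ]
    return derived

# One video whose tokens span [5, 105); its audio shares that region:
ranges = derive_audio_from_video_placeholders(
    {"video": [PlaceholderRange(5, 100)]}
)
print(ranges["audio"])  # [PlaceholderRange(offset=5, length=100)]
```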



Hope you folks can see this: when I try https://github.com/QwenLM/Qwen3-Omni/blob/main/web_demo.py with this fix, using vLLM as the backend, and set the system prompt to "what does this man say?":
python web_demo.py -c ../Qwen3-Omni-30B-A3B-Instruct --server-name localhost
I still cannot get a correct result.

However, if I switch to transformers as the backend:
python web_demo.py -c ../Qwen3-Omni-30B-A3B-Instruct --server-name localhost --use-transformers --flash-attn2
I get reasonable output.

So it seems there is still more work to do.

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
@huachenheli huachenheli changed the title [WIP] make Qwen3 Omni work with audio-in-video inputs. [Multimodal][Qwen3 Omni] Qwen3 Omni work with audio-in-video inputs in V1 engine. Oct 29, 2025
@huachenheli huachenheli marked this pull request as ready for review October 29, 2025 20:25
@huachenheli huachenheli changed the title [Multimodal][Qwen3 Omni] Qwen3 Omni work with audio-in-video inputs in V1 engine. [Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. Oct 29, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

if is_update_applied:
    mm_placeholders = self._find_mm_placeholders(
        prompt_ids,
        mm_prompt_updates,
    )
    self._validate_mm_placeholders(
        mm_placeholders,
        mm_item_counts,
        use_audio_in_video=use_audio_in_video,
    )
else:
    prompt_ids, mm_placeholders = self._apply_prompt_updates(
        prompt_ids,
        mm_prompt_updates,
    )
    self._validate_mm_placeholders(
        mm_placeholders,
        mm_item_counts,
        use_audio_in_video=use_audio_in_video,

P0: Remove unsupported keyword from placeholder validation call

Qwen2_5OmniThinkerMultiModalProcessor._maybe_apply_prompt_updates still invokes _validate_mm_placeholders(..., use_audio_in_video=use_audio_in_video) even though this class’s override of _validate_mm_placeholders was deleted in this change. The inherited implementation only accepts two positional parameters, so this call will now raise TypeError: _validate_mm_placeholders() got an unexpected keyword argument 'use_audio_in_video' whenever multimodal inputs are processed, crashing Qwen2.5 Omni inference before any work is done.
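The failure mode Codex flags can be reproduced in miniature (the classes below are simplified stand-ins, not the actual vLLM code): once the subclass's override is deleted, the inherited two-parameter validator rejects the extra keyword, and the fix is simply to drop it at the call site.

```python
class BaseProcessor:
    def _validate_mm_placeholders(self, mm_placeholders, mm_item_counts):
        # Inherited two-parameter implementation.
        pass

class ThinkerProcessor(BaseProcessor):
    def _maybe_apply_prompt_updates(self, mm_placeholders, mm_item_counts):
        # Before the fix, this call passed use_audio_in_video=..., which
        # the inherited signature rejects with a TypeError:
        #   self._validate_mm_placeholders(
        #       mm_placeholders, mm_item_counts,
        #       use_audio_in_video=True)  # TypeError
        # The fix is to drop the keyword so the call matches the
        # inherited signature:
        self._validate_mm_placeholders(mm_placeholders, mm_item_counts)

proc = ThinkerProcessor()
proc._maybe_apply_prompt_updates({}, {})  # runs cleanly after the fix
```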


Contributor

@jeremyteboul jeremyteboul left a comment


thanks !

@wangruohui
Contributor

Will this PR automatically support online inference?

Signed-off-by: Roger Wang <hey@rogerw.io>
@mergify mergify bot removed the needs-rebase label Oct 30, 2025
@ywang96 ywang96 self-assigned this Oct 30, 2025
@ywang96
Member

ywang96 commented Oct 30, 2025

@huachenheli Thanks for working on this - I'll review this PR and push it to the finishing line.

@jeremyteboul
Contributor

Curious if we could land a version for V1, @ywang96; thanks in advance!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@ywang96 ywang96 enabled auto-merge (squash) November 21, 2025 09:28
@ywang96 ywang96 merged commit 839c6b7 into vllm-project:main Nov 24, 2025
52 checks passed
@huachenheli huachenheli deleted the qwen3audioinvideo branch November 24, 2025 19:34
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…uts in V1 engine. (vllm-project#27721)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…uts in V1 engine. (vllm-project#27721)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
@wwxs123W

wwxs123W commented Dec 3, 2025

Thank you for doing it! But can it support Qwen2.5 Omni?

