
[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. #27721

Merged
ywang96 merged 20 commits into vllm-project:main from huachenheli:qwen3audioinvideo
Nov 24, 2025
Conversation

@huachenheli
Contributor

@huachenheli huachenheli commented Oct 29, 2025

Purpose

(EngineCore_DP0 pid=2834739) INFO 10-29 13:21:24 [core.py:237] init engine (profile, create kv cache, warmup model) took 18.56 seconds
INFO 10-29 13:21:25 [llm.py:345] Supported tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.90s/it]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.31s/it, est. speed input: 1492.83 toks/s, output: 102.27 toks/s]
The video shows a baby sitting on a bed, wearing glasses and intently focused on a book they are holding. The baby is dressed in a light blue shirt and appears to be deeply engaged in reading, turning the pages of the book. The room around them is somewhat cluttered, with clothes and other items scattered on the bed. In the background, there is a crib and a wall with a picture frame. The baby occasionally looks up from the book, seemingly curious about something off-camera, but quickly returns their attention to the book. The baby's glasses are large in proportion to their face, adding a humorous and endearing touch to the scene.

FIX #22268
FIX #22364
CLOSE #23888
CLOSE #25473
CLOSE #28046

Test Plan

HF_HUB_DISABLE_XET=1 VLLM_ATTENTION_BACKEND=TORCH_SDPA python examples/offline_inference/qwen3_omni/only_thinker.py -q use_audio_in_video

pytest tests/model_executor/test_qwen3_omni.py

Sanity checked with https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>

mergify bot commented Oct 29, 2025

Documentation preview: https://vllm--27721.org.readthedocs.build/en/27721/

@mergify mergify bot added the documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality, #4194), and qwen (Related to Qwen models) labels Oct 29, 2025

mergify bot commented Oct 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @huachenheli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 29, 2025
    return _Backend.TORCH_SDPA, None

elif current_platform.is_cuda():
    return _Backend.TORCH_SDPA, None
Contributor


Two questions:

  1. What is the reason for using TORCH_SDPA?
  2. If we return immediately, should the `if` section be removed?

Contributor Author


This will be removed when I clean up this PR. I was having issues with flash-attn on my devgpu, so this is a local hack to get Qwen3 running.

assert "audio" in mm_item_counts
mm_item_counts["audio"] -= mm_item_counts["video"]
super()._validate_mm_placeholders(mm_placeholders, mm_item_counts)
# def _validate_mm_placeholders(
Contributor


It feels like the intent behind validating the placeholders was good; should we keep a single validation that performs the check we need?

Contributor Author


This is already being removed in #26334
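For context, the count adjustment quoted above can be sketched as a small standalone function (the helper name below is hypothetical; the real override is being removed in #26334). The idea: when use_audio_in_video is enabled, every video item carries its own audio track, so the number of standalone audio placeholders to validate is the total audio count minus the video count.

```python
def adjust_audio_count(mm_item_counts: dict[str, int]) -> dict[str, int]:
    """Subtract the video count from the audio count, mirroring the quoted
    override: each video's audio track is accounted for by the video item
    itself when use_audio_in_video is enabled."""
    assert "audio" in mm_item_counts
    adjusted = dict(mm_item_counts)
    adjusted["audio"] -= mm_item_counts.get("video", 0)
    return adjusted

# Two audio items, one of which belongs to the video stream:
print(adjust_audio_count({"audio": 2, "video": 1}))  # {'audio': 1, 'video': 1}
```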

filtered_updates,
)
# Derive audio placeholders from video placeholders
mm_placeholders = self._derive_audio_from_video_placeholders(
Contributor


nice, that can be very useful
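The helper quoted above is only shown by name in the diff excerpt; a minimal sketch of what deriving audio placeholders from video placeholders could look like follows (all names are simplified stand-ins, not the actual vLLM API). With audio-in-video, the audio tokens live inside the video's placeholder region, so an audio range can be mirrored from each video range instead of being located separately in the prompt.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class PlaceholderRange:
    offset: int   # start index of the placeholder tokens in the prompt
    length: int   # number of placeholder tokens

def derive_audio_from_video_placeholders(
    placeholders: dict[str, list[PlaceholderRange]],
) -> dict[str, list[PlaceholderRange]]:
    """Mirror each video placeholder range as an audio placeholder range."""
    derived = dict(placeholders)
    videos = placeholders.get("video", [])
    derived["audio"] = placeholders.get("audio", []) + [
        PlaceholderRange(v.offset, v.length) for v in videos
    ]
    return derived

# One video whose tokens span [5, 105); its audio shares that region:
ranges = derive_audio_from_video_placeholders(
    {"video": [PlaceholderRange(5, 100)]}
)
print(ranges["audio"])  # [PlaceholderRange(offset=5, length=100)]
```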



Hope you folks can see this: when I try https://github.com/QwenLM/Qwen3-Omni/blob/main/web_demo.py with this fix, using vLLM as the backend, and set the system prompt to "what does this man say?":
python web_demo.py -c ../Qwen3-Omni-30B-A3B-Instruct --server-name localhost
I still cannot get a correct result.

However, if I switch to transformers as the backend:
python web_demo.py -c ../Qwen3-Omni-30B-A3B-Instruct --server-name localhost --use-transformers --flash-attn2
I get reasonable output.

So it seems there is still more work to do.

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
@huachenheli huachenheli changed the title [WIP] make Qwen3 Omni work with audio-in-video inputs. [Multimodal][Qwen3 Omni] Qwen3 Omni work with audio-in-video inputs in V1 engine. Oct 29, 2025
@huachenheli huachenheli marked this pull request as ready for review October 29, 2025 20:25
@huachenheli huachenheli changed the title [Multimodal][Qwen3 Omni] Qwen3 Omni work with audio-in-video inputs in V1 engine. [Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. Oct 29, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

if is_update_applied:
    mm_placeholders = self._find_mm_placeholders(
        prompt_ids,
        mm_prompt_updates,
    )
    self._validate_mm_placeholders(
        mm_placeholders,
        mm_item_counts,
        use_audio_in_video=use_audio_in_video,
    )
else:
    prompt_ids, mm_placeholders = self._apply_prompt_updates(
        prompt_ids,
        mm_prompt_updates,
    )
    self._validate_mm_placeholders(
        mm_placeholders,
        mm_item_counts,
        use_audio_in_video=use_audio_in_video,

P0: Remove unsupported keyword from placeholder validation call

Qwen2_5OmniThinkerMultiModalProcessor._maybe_apply_prompt_updates still invokes _validate_mm_placeholders(..., use_audio_in_video=use_audio_in_video) even though this class’s override of _validate_mm_placeholders was deleted in this change. The inherited implementation only accepts two positional parameters, so this call will now raise TypeError: _validate_mm_placeholders() got an unexpected keyword argument 'use_audio_in_video' whenever multimodal inputs are processed, crashing Qwen2.5 Omni inference before any work is done.
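The failure mode Codex flags can be reproduced in miniature (the classes below are simplified stand-ins, not the actual vLLM code): once the subclass's override is deleted, the inherited two-parameter validator rejects the extra keyword, and the fix is simply to drop it at the call site.

```python
class BaseProcessor:
    def _validate_mm_placeholders(self, mm_placeholders, mm_item_counts):
        # Inherited two-parameter implementation.
        pass

class ThinkerProcessor(BaseProcessor):
    def _maybe_apply_prompt_updates(self, mm_placeholders, mm_item_counts):
        # Before the fix, this call passed use_audio_in_video=..., which
        # the inherited signature rejects with a TypeError:
        #   self._validate_mm_placeholders(
        #       mm_placeholders, mm_item_counts,
        #       use_audio_in_video=True)  # TypeError
        # The fix is to drop the keyword so the call matches the
        # inherited signature:
        self._validate_mm_placeholders(mm_placeholders, mm_item_counts)

proc = ThinkerProcessor()
proc._maybe_apply_prompt_updates({}, {})  # runs cleanly after the fix
```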


Contributor

@jeremyteboul jeremyteboul left a comment


thanks !

@wangruohui
Contributor

Will this PR automatically support online inference?

Signed-off-by: Roger Wang <hey@rogerw.io>
@mergify mergify bot removed the needs-rebase label Oct 30, 2025
@ywang96 ywang96 self-assigned this Oct 30, 2025
@ywang96
Member

ywang96 commented Oct 30, 2025

@huachenheli Thanks for working on this - I'll review this PR and push it to the finishing line.

@jeremyteboul
Contributor

Curious if we could land a version for V1, @ywang96; thanks in advance!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@ywang96 ywang96 enabled auto-merge (squash) November 21, 2025 09:28
@ywang96 ywang96 merged commit 839c6b7 into vllm-project:main Nov 24, 2025
52 checks passed
@huachenheli huachenheli deleted the qwen3audioinvideo branch November 24, 2025 19:34
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…uts in V1 engine. (vllm-project#27721)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…uts in V1 engine. (vllm-project#27721)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
@wwxs123W

wwxs123W commented Dec 3, 2025

Thank you for doing it! But can it support Qwen2.5 Omni?

