
[Model] Use mm_position to compute mrope positions for Qwen3-Omni #33010

Merged
Isotr0py merged 6 commits into vllm-project:main from Etelis:etelis/qwen3-omni-mrope-optimization
Jan 26, 2026

Conversation

@Etelis
Contributor

@Etelis Etelis commented Jan 24, 2026

Optimizes M-RoPE position calculation for Qwen3-Omni by using mm_position.offset directly from MultiModalFeatureSpec instead of token-by-token searching through input_tokens.

Part of #32656 (M-RoPE calculation optimization tracker). Follows the pattern from Qwen2.5-Omni optimization.

Preserves Qwen3-Omni specifics: audio token formula, BOS/EOS handling, token-by-token interleaving for use_audio_in_video.
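In sketch form, the optimization swaps an O(n) scan of input_tokens for a direct read of precomputed feature metadata. The names below (find_offset_by_scan, IMAGE_TOKEN_ID, the dict-shaped feature) are illustrative stand-ins, not vLLM's actual API:

```python
# Illustrative sketch of the optimization; names are hypothetical stand-ins.
IMAGE_TOKEN_ID = 151655  # hypothetical placeholder token id

def find_offset_by_scan(input_tokens, placeholder_id):
    """Old approach: O(n) token-by-token search for the placeholder run."""
    for i, tok in enumerate(input_tokens):
        if tok == placeholder_id:
            return i
    raise ValueError("placeholder not found")

def find_offset_from_spec(mm_feature):
    """New approach: O(1) read of the precomputed offset (mm_position.offset)."""
    return mm_feature["offset"]

tokens = [101, 42, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7]
feature = {"offset": 2, "length": 2}  # metadata akin to MultiModalFeatureSpec

assert find_offset_by_scan(tokens, IMAGE_TOKEN_ID) == find_offset_from_spec(feature)
```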

Related

Test Plan

  • Syntax check passes
  • pytest tests/models/multimodal/processing/test_qwen3_omni.py -v
  • pytest tests/model_executor/test_qwen3_omni.py -v
  • Verify single image/video/audio inputs
  • Verify mixed modalities
  • Verify use_audio_in_video=True

Optimizes M-RoPE position calculation by using mm_position.offset directly
from MultiModalFeatureSpec instead of token-by-token searching.

Changes:
- Add helper methods for modular position computation
- Refactor get_mrope_input_positions to use mm_position.offset
- Replace torch operations with numpy for efficiency
- Remove unused get_llm_pos_ids_for_vision import

Part of vllm-project#32656

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
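The torch-to-numpy swap in the change list builds each (3, text_len) text-segment block with broadcasting instead of torch.arange(...).view(1, -1).expand(3, -1). A minimal sketch of the numpy form, with arbitrary example values for text_len and st_idx:

```python
import numpy as np

# Shape check for the torch -> numpy swap: for a plain-text segment, all
# three M-RoPE axes (temporal/height/width) share the same 1-D positions.
text_len, st_idx = 4, 10

block = np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx

assert block.shape == (3, 4)
assert block.tolist() == [[10, 11, 12, 13]] * 3
```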
@Etelis Etelis requested a review from sighingnow as a code owner January 24, 2026 19:39
@mergify mergify bot added the qwen Related to Qwen models label Jan 24, 2026
@Etelis
Contributor Author

Etelis commented Jan 24, 2026

@DarkLight1337 cc

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the M-RoPE position calculation for Qwen3-Omni to be more efficient by using mm_position.offset from MultiModalFeatureSpec instead of token-by-token searching. The changes look good and follow the pattern from a similar optimization for Qwen2.5-Omni. I've found one critical issue that could lead to a crash under specific circumstances. Please see my comment below.

Comment on lines +2103 to 2131

    +                llm_pos_ids_list.append(pos_ids)
    +                st_idx = int(pos_ids.max()) + 1
    +
    +                eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
    +                llm_pos_ids_list.append(eos_pos)
    +                llm_pos_ids_list.append(eos_pos)
    +
    +                video_len = grid_t * grid_h * grid_w
    +                audio_len = self._compute_audio_token_count(
    +                    data["audio_feature_length"]
    +                )
    -                video_len = video_grid_thw[video_idx].prod() // (spatial_merge_size**2)
    -                st_idx = llm_pos_ids_list[-1].max() + 1 if llm_pos_ids_list else 0
    -                eos_len = 1
    -                eos_block = (
    -                    torch.arange(eos_len, dtype=torch.long).view(1, -1).expand(3, -1)
    -                    + st_idx
    -                )
    -                llm_pos_ids_list.append(eos_block)
    -                llm_pos_ids_list.append(eos_block)
    -                st += text_len + bos_len * 2 + audio_len + video_len + eos_len * 2  # noqa: E501
    -                audio_idx += 1
    -                video_idx += 1
    -                remain_videos -= 1
    -                remain_audios -= 1
    +                st = offset + 2 + video_len + audio_len + 2

    -        if st < len(input_tokens):
    -            st_idx = llm_pos_ids_list[-1].max() + 1 if llm_pos_ids_list else 0
    -            text_len = len(input_tokens) - st
    +        if st < seq_len:
    +            st_idx = int(llm_pos_ids_list[-1].max()) + 1 if llm_pos_ids_list else 0
    +            text_len = seq_len - st
                 llm_pos_ids_list.append(
    -                torch.arange(text_len, dtype=torch.long).view(1, -1).expand(3, -1)
    -                + st_idx
    +                np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
                 )

    -        llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
    +        llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
             if llm_positions.shape[1] != seq_len:
                 raise RuntimeError("Position ids length mismatch with input ids length")

    -        mrope_position_delta = llm_positions.max() + 1 - seq_len
    -        return llm_positions, mrope_position_delta
    +        mrope_position_delta = int(llm_positions.max()) + 1 - seq_len
    +        return torch.from_numpy(llm_positions), mrope_position_delta

         def get_mm_mapping(self) -> MultiModelKeys:
             """
Contributor

critical

There's a potential TypeError here. If data["use_audio_in_video"] is True but there is no corresponding audio feature, data["audio_feature_length"] will be None. This will cause _compute_interleaved_positions to be called with None, which then raises a TypeError when torch.tensor([None]) is attempted in _compute_audio_token_count.

To fix this, you should handle this case by falling back to video-only processing. You can adjust the condition to also check if audio_feature_length is available.

                if not data["use_audio_in_video"] or data["audio_feature_length"] is None:
                    grid_indices = np.indices((grid_t, grid_h, grid_w))
                    if t_factor != 1.0:
                        grid_indices[0] = (grid_indices[0] * t_factor).astype(np.int64)
                    llm_pos_ids_list.append(grid_indices.reshape(3, -1) + st_idx)

                    video_len = grid_t * grid_h * grid_w
                    st_idx = int(llm_pos_ids_list[-1].max()) + 1

                    eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
                    llm_pos_ids_list.append(eos_pos)
                    st = offset + 1 + video_len + 1
                else:
                    audio_bos_pos = np.broadcast_to(np.array([st_idx - 1]), (3, 1))
                    llm_pos_ids_list.append(audio_bos_pos)

                    pos_ids, _ = self._compute_interleaved_positions(st_idx, data)
                    llm_pos_ids_list.append(pos_ids)
                    st_idx = int(pos_ids.max()) + 1

                    eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
                    llm_pos_ids_list.append(eos_pos)
                    llm_pos_ids_list.append(eos_pos)

                    video_len = grid_t * grid_h * grid_w
                    audio_len = self._compute_audio_token_count(
                        data["audio_feature_length"]
                    )
                    st = offset + 2 + video_len + audio_len + 2
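In sketch form, the guard suggested above routes the None case to the video-only branch before any audio computation runs. The helper names and the token formula below are illustrative stand-ins, not vLLM's actual code:

```python
# Illustrative sketch of the suggested guard; helper names and the token
# formula are stand-ins, not vLLM's actual code.
def audio_token_count(audio_feature_length):
    if audio_feature_length is None:
        raise TypeError("audio_feature_length must not be None")
    return int(audio_feature_length) // 2  # placeholder formula

def choose_path(data):
    # Fall back to video-only handling before any audio math can see None.
    if not data["use_audio_in_video"] or data["audio_feature_length"] is None:
        return "video-only"
    audio_token_count(data["audio_feature_length"])
    return "interleaved"

assert choose_path({"use_audio_in_video": True,
                    "audio_feature_length": None}) == "video-only"
assert choose_path({"use_audio_in_video": True,
                    "audio_feature_length": 8}) == "interleaved"
```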

@Etelis
Contributor Author

Etelis commented Jan 25, 2026

@DarkLight1337 Give me the GL to run benchmarking :) and I'll post the results.

@DarkLight1337
Member

Yeah feel free to do it

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@mergify

mergify bot commented Jan 25, 2026

Documentation preview: https://vllm--33010.org.readthedocs.build/en/33010/

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 25, 2026
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis
Contributor Author

Etelis commented Jan 25, 2026

Yeah feel free to do it

| Test | Query Type | Result |
| --- | --- | --- |
| multi_images | 2 images | Correctly compared Tokyo Skytree + cherry blossoms vs Chinatown scene |
| mixed_modalities | audio + image + video | Correctly identified "Mary Had a Little Lamb", described cherry blossom image, analyzed baby video |
| multi_audios | 2 audio clips | Correctly distinguished sports broadcast from nursery rhyme |
| use_audio_in_video | video + audio interleaved | Correctly described baby video and transcribed audio |

test_qwen3_multi_audios.log
test_qwen3_multi_images.log
test_qwen3_mixed.log
test_qwen3_audio_in_video.log

Ran on

Sun Jan 25 14:46:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:01:00.0 Off |                    0 |
| N/A   27C    P8             30W /  600W |       0MiB /  97887MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

All ready @DarkLight1337

Member

@DarkLight1337 DarkLight1337 left a comment


@Isotr0py can you review this as well?

Member

@Isotr0py Isotr0py left a comment


LGTM

@Isotr0py Isotr0py enabled auto-merge (squash) January 26, 2026 06:46
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 26, 2026
@Etelis
Contributor Author

Etelis commented Jan 26, 2026

@DarkLight1337 Re-run?

@Isotr0py Isotr0py merged commit 6ca2c91 into vllm-project:main Jan 26, 2026
58 checks passed
@DarkLight1337
Member

Merged

apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
…lm-project#33010)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…lm-project#33010)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>

Labels

documentation Improvements or additions to documentation qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed
