[Model] Use mm_position to compute mrope positions for Qwen3-Omni#33010
Conversation
Optimizes M-RoPE position calculation by using `mm_position.offset` directly from `MultiModalFeatureSpec` instead of token-by-token searching.

Changes:
- Add helper methods for modular position computation
- Refactor `get_mrope_input_positions` to use `mm_position.offset`
- Replace torch operations with numpy for efficiency
- Remove unused `get_llm_pos_ids_for_vision` import

Part of vllm-project#32656

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
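For intuition, here is a hedged sketch of why reading a recorded offset beats scanning the prompt. The names (`IMAGE_TOKEN_ID`, both helper functions) are illustrative assumptions, not vLLM API:

```python
# Illustrative only: IMAGE_TOKEN_ID and both helpers are made-up names,
# not vLLM API. The point is O(n) scan vs. O(1) lookup of a stored offset.
IMAGE_TOKEN_ID = 151655


def find_offset_by_scan(input_tokens: list[int]) -> int:
    """Old approach: walk the prompt token-by-token until the
    multimodal placeholder is found (O(n) per feature)."""
    for i, tok in enumerate(input_tokens):
        if tok == IMAGE_TOKEN_ID:
            return i
    return len(input_tokens)


def find_offset_from_spec(mm_offset: int) -> int:
    """New approach: the processor already recorded where the feature
    was inserted, so the offset is read directly (O(1))."""
    return mm_offset


tokens = [1, 2, 3, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 4]
assert find_offset_by_scan(tokens) == find_offset_from_spec(3) == 3
```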
Code Review
This pull request refactors the M-RoPE position calculation for Qwen3-Omni to be more efficient by using mm_position.offset from MultiModalFeatureSpec instead of token-by-token searching. The changes look good and follow the pattern from a similar optimization for Qwen2.5-Omni. I've found one critical issue that could lead to a crash under specific circumstances. Please see my comment below.
```diff
             llm_pos_ids_list.append(pos_ids)
-            st_idx = llm_pos_ids_list[-1].max() + 1 if llm_pos_ids_list else 0
+            st_idx = int(pos_ids.max()) + 1
-            eos_len = 1
-            eos_block = (
-                torch.arange(eos_len, dtype=torch.long).view(1, -1).expand(3, -1)
-                + st_idx
-            )
-            llm_pos_ids_list.append(eos_block)
-            llm_pos_ids_list.append(eos_block)
+            eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
+            llm_pos_ids_list.append(eos_pos)
+            llm_pos_ids_list.append(eos_pos)
-            video_len = video_grid_thw[video_idx].prod() // (spatial_merge_size**2)
+            video_len = grid_t * grid_h * grid_w
+            audio_len = self._compute_audio_token_count(
+                data["audio_feature_length"]
+            )
-            st += text_len + bos_len * 2 + audio_len + video_len + eos_len * 2  # noqa: E501
-            audio_idx += 1
-            video_idx += 1
-            remain_videos -= 1
-            remain_audios -= 1
+            st = offset + 2 + video_len + audio_len + 2

-        if st < len(input_tokens):
-            st_idx = llm_pos_ids_list[-1].max() + 1 if llm_pos_ids_list else 0
-            text_len = len(input_tokens) - st
+        if st < seq_len:
+            st_idx = int(llm_pos_ids_list[-1].max()) + 1 if llm_pos_ids_list else 0
+            text_len = seq_len - st
             llm_pos_ids_list.append(
-                torch.arange(text_len, dtype=torch.long).view(1, -1).expand(3, -1)
-                + st_idx
+                np.broadcast_to(np.arange(text_len), (3, text_len)) + st_idx
             )

-        llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+        llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
         if llm_positions.shape[1] != seq_len:
             raise RuntimeError("Position ids length mismatch with input ids length")

-        mrope_position_delta = llm_positions.max() + 1 - seq_len
-        return llm_positions, mrope_position_delta
+        mrope_position_delta = int(llm_positions.max()) + 1 - seq_len
+        return torch.from_numpy(llm_positions), mrope_position_delta

     def get_mm_mapping(self) -> MultiModelKeys:
         """
```
There's a potential TypeError here. If `data["use_audio_in_video"]` is True but there is no corresponding audio feature, `data["audio_feature_length"]` will be None. This will cause `_compute_interleaved_positions` to be called with None, which then raises a TypeError when `torch.tensor([None])` is attempted in `_compute_audio_token_count`.

To fix this, you should handle this case by falling back to video-only processing. You can adjust the condition to also check if `audio_feature_length` is available.

```python
if not data["use_audio_in_video"] or data["audio_feature_length"] is None:
    grid_indices = np.indices((grid_t, grid_h, grid_w))
    if t_factor != 1.0:
        grid_indices[0] = (grid_indices[0] * t_factor).astype(np.int64)
    llm_pos_ids_list.append(grid_indices.reshape(3, -1) + st_idx)
    video_len = grid_t * grid_h * grid_w
    st_idx = int(llm_pos_ids_list[-1].max()) + 1
    eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
    llm_pos_ids_list.append(eos_pos)
    st = offset + 1 + video_len + 1
else:
    audio_bos_pos = np.broadcast_to(np.array([st_idx - 1]), (3, 1))
    llm_pos_ids_list.append(audio_bos_pos)
    pos_ids, _ = self._compute_interleaved_positions(st_idx, data)
    llm_pos_ids_list.append(pos_ids)
    st_idx = int(pos_ids.max()) + 1
    eos_pos = np.broadcast_to(np.array([st_idx]), (3, 1))
    llm_pos_ids_list.append(eos_pos)
    llm_pos_ids_list.append(eos_pos)
    video_len = grid_t * grid_h * grid_w
    audio_len = self._compute_audio_token_count(
        data["audio_feature_length"]
    )
    st = offset + 2 + video_len + audio_len + 2
```
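The `np.indices` trick used in the video-only branch can be illustrated on its own. A minimal sketch (grid sizes chosen arbitrarily) of how it enumerates `(t, h, w)` coordinates for every patch and flattens them into a `(3, N)` position block:

```python
import numpy as np

# Arbitrary small grid for illustration: 2 temporal steps, 2x3 spatial patches.
grid_t, grid_h, grid_w = 2, 2, 3
st_idx = 5  # starting position id, arbitrary for the sketch

# np.indices yields an array of shape (3, t, h, w) holding the (t, h, w)
# coordinate of every grid cell; reshape flattens it to (3, t*h*w).
grid_indices = np.indices((grid_t, grid_h, grid_w))
pos_ids = grid_indices.reshape(3, -1) + st_idx

assert pos_ids.shape == (3, grid_t * grid_h * grid_w)
# first column is coordinate (t=0, h=0, w=0) offset by st_idx
assert pos_ids[:, 0].tolist() == [5, 5, 5]
# last column is coordinate (t=1, h=1, w=2) offset by st_idx
assert pos_ids[:, -1].tolist() == [6, 6, 7]
```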
@DarkLight1337 Give me GL to run benchmarking :) and I'll

Yeah feel free to do it
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Documentation preview: https://vllm--33010.org.readthedocs.build/en/33010/
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
test_qwen3_multi_audios.log (ran on Sun Jan 25 14:46:59 2026):
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:01:00.0 Off |                    0 |
| N/A   27C    P8             30W /  600W |       0MiB /  97887MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

@DarkLight1337
DarkLight1337 left a comment:
@Isotr0py can you review this as well?
@DarkLight1337 Re-run?

Merged
…lm-project#33010) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Optimizes M-RoPE position calculation for Qwen3-Omni by using `mm_position.offset` directly from `MultiModalFeatureSpec` instead of token-by-token searching through `input_tokens`. Part of #32656 (M-RoPE calculation optimization tracker). Follows the pattern from the Qwen2.5-Omni optimization.

Preserves Qwen3-Omni specifics: audio token formula, BOS/EOS handling, token-by-token interleaving for `use_audio_in_video`.

Related

- Use `mm_features` for M-RoPE calculation for all models #32656

Test Plan

- `pytest tests/models/multimodal/processing/test_qwen3_omni.py -v`
- `pytest tests/model_executor/test_qwen3_omni.py -v`
- `use_audio_in_video=True`