[Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL#32126
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default; only a small and essential subset of CI tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request refactors the M-RoPE position computation for Qwen2-VL and Qwen2.5-VL models. The changes significantly simplify the logic by introducing a helper method iter_mm_grid_thw to iterate over multimodal features and by using NumPy for vectorized position calculations. This is a great improvement in terms of both readability and performance. I've found one minor issue with a type hint that should be corrected for static analysis correctness.
Force-pushed from 76c6550 to 23cc4d3 (Compare)

Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com>

Force-pushed from 23cc4d3 to 95645a5 (Compare)
Code Review
This pull request introduces a significant optimization for M-RoPE position computation in Qwen2-VL and Qwen2.5-VL models. By leveraging pre-calculated multimodal features and vectorized NumPy operations, the new implementation is cleaner, more readable, and more efficient than the previous token-scanning approach. The introduction of the iter_mm_grid_thw helper function is a good abstraction. I've identified a potential crash when handling empty inputs, which was also present in the previous implementation, and have provided suggestions to address this edge case.
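For context on the abstraction the review praises, here is a minimal sketch of what an offset-ordered iterator like `iter_mm_grid_thw` might look like. The `MMFeature`/`MMPosition` field names below are illustrative assumptions for the sketch, not vLLM's actual classes:

```python
from dataclasses import dataclass
from typing import Iterator

# Illustrative stand-ins for vLLM's multimodal feature records;
# these field names are assumptions, not the real vLLM classes.
@dataclass
class MMPosition:
    offset: int  # index of the feature's first placeholder token

@dataclass
class MMFeature:
    modality: str                   # "image" or "video"
    mm_position: MMPosition
    grid_thw: tuple[int, int, int]  # temporal, height, width grid sizes
    t_factor: float = 1.0           # temporal position scale (videos)

def iter_mm_grid_thw(
    mm_features: list[MMFeature],
) -> Iterator[tuple[int, tuple[int, int, int], float]]:
    """Yield (offset, (t, h, w), t_factor) sorted by placeholder offset."""
    for f in sorted(mm_features, key=lambda f: f.mm_position.offset):
        yield f.mm_position.offset, f.grid_thw, f.t_factor

feats = [
    MMFeature("video", MMPosition(40), (4, 6, 8), t_factor=2.0),
    MMFeature("image", MMPosition(5), (1, 6, 8)),
]
print(list(iter_mm_grid_thw(feats)))
# [(5, (1, 6, 8), 1.0), (40, (4, 6, 8), 2.0)]
```

Iterating in offset order lets the caller fill text positions between features and vision positions inside them in a single pass, without scanning `input_tokens` for placeholder token IDs.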
```diff
 )
-llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
```
np.concatenate will raise a ValueError if llm_pos_ids_list is empty (e.g., for an empty prompt), which will cause a crash. This should be handled to avoid unexpected failures on valid edge cases.
```suggestion
llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1) if llm_pos_ids_list else np.empty((3, 0), dtype=np.int64)
```
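A minimal standalone repro of the edge case flagged above (illustrative values, not the PR's code): `np.concatenate` raises `ValueError` on an empty sequence, and the suggested guard substitutes an empty `(3, 0)` array instead.

```python
import numpy as np

# Illustrative repro: np.concatenate raises ValueError on an empty list.
llm_pos_ids_list = []  # e.g. a prompt that produced no position chunks
try:
    np.concatenate(llm_pos_ids_list, axis=1)
    crashed = False
except ValueError:
    crashed = True  # "need at least one array to concatenate"

# The suggested guard falls back to an empty (3, 0) array instead:
llm_positions = (
    np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
    if llm_pos_ids_list
    else np.empty((3, 0), dtype=np.int64)
)
print(crashed, llm_positions.shape)  # True (3, 0)
```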
```diff
-llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
 mrope_position_delta = (llm_positions.max() + 1 - len(input_tokens)).item()
```
If llm_positions becomes an empty array after the change in the previous line, calling .max() on it will raise a ValueError. This also needs to be handled to prevent a crash.
```suggestion
mrope_position_delta = (llm_positions.max() + 1 - len(input_tokens)).item() if llm_positions.size > 0 else 0
```
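The follow-on failure can also be shown in isolation (illustrative values): `.max()` on a zero-size array raises `ValueError`, so the delta computation needs its own guard.

```python
import numpy as np

# Illustrative repro: .max() on an empty array raises ValueError.
llm_positions = np.empty((3, 0), dtype=np.int64)
input_tokens: list[int] = []
try:
    llm_positions.max()
    crashed = False
except ValueError:
    crashed = True  # "zero-size array to reduction operation"

# Guarded version: fall back to a delta of 0 for empty inputs.
mrope_position_delta = (
    (llm_positions.max() + 1 - len(input_tokens)).item()
    if llm_positions.size > 0
    else 0
)
print(crashed, mrope_position_delta)  # True 0
```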
```diff
 )
-llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
```
np.concatenate will raise a ValueError if llm_pos_ids_list is empty (e.g., for an empty prompt), which will cause a crash. This should be handled to avoid unexpected failures on valid edge cases.
```suggestion
llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1) if llm_pos_ids_list else np.empty((3, 0), dtype=np.int64)
```
```diff
-llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
+llm_positions = np.concatenate(llm_pos_ids_list, axis=1).reshape(3, -1)
 mrope_position_delta = (llm_positions.max() + 1 - len(input_tokens)).item()
```
If llm_positions becomes an empty array after the change in the previous line, calling .max() on it will raise a ValueError. This also needs to be handled to prevent a crash.
```suggestion
mrope_position_delta = (llm_positions.max() + 1 - len(input_tokens)).item() if llm_positions.size > 0 else 0
```
/CC @DarkLight1337, thanks for taking a look!

Hi @Isotr0py,
Out of curiosity - have you done profiling to quantify the perf gain from this PR (similar to #28730 (comment))? Thanks
vllm-project#32126) Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
@ywang96 Here are the trace results for Qwen2-VL:

**Image**

### Before
### After

**Video**

### Before
### After
Refactor `get_mrope_input_positions()` to use `mm_feature.mm_position.offset` directly instead of searching through `input_tokens` token-by-token.

Changes:
- Add `iter_mm_features()` iterator that yields `(offset, modality, data)` sorted by `mm_position.offset` for all 3 modalities (audio, image, video)
- Add `_get_audio_for_video_mapping()` for `use_audio_in_video` pairing
- Add `_compute_audio_token_count()` and `_compute_interleaved_positions()` helper methods
- Refactor `get_mrope_input_positions()` to iterate by offset using numpy
- Remove unused `get_llm_pos_ids_for_vision` import

Follows the pattern established in PR vllm-project#32126 for Qwen2-VL/2.5-VL.

Resolves: vllm-project#32656 (Qwen2.5-Omni item)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>




Purpose
Followup to #28399 and #28730. This PR optimizes the M-RoPE position computation for both Qwen2-VL and Qwen2.5-VL.
Test Plan
```shell
VLLM_WORKER_MULTIPROC_METHOD=spawn lm_eval --model vllm-vlm --model_args "pretrained=Qwen/Qwen2-VL-7B-Instruct,max_model_len=8192" --tasks chartqa --batch_size 1 --apply_chat_template --seed 42
```

Test Result
Qwen2-VL test results:
Before:
After:
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Note

Optimizes M-RoPE position generation by leveraging pre-parsed multimodal features and vectorized index math.

- Adds `iter_mm_grid_thw` to iterate `mm_features` by `mm_position.offset`, yielding `(t, h, w)` grids and temporal scale (`t_factor`) for images/videos
- Uses vectorized NumPy (`np.indices`, `np.broadcast_to`) to build 3D position IDs, then converts back to torch
- Applies spatial merging (`spatial_merge_size`) and temporal scaling using `second_per_grid_ts * tokens_per_second` for video
- Imports `Iterator`, adds NumPy; updates both `qwen2_vl.py` and `qwen2_5_vl.py` consistently

Written by Cursor Bugbot for commit c370ddc7807818b64d98d78873931f675ab66d60. This will update automatically on new commits. Configure here.
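As a rough illustration of the vectorized construction described above, the 3D (temporal, height, width) position IDs for one vision segment can be built with `np.indices` and flattened. The grid sizes, `t_factor`, and start offset below are made-up example values for the sketch, not the PR's exact code:

```python
import numpy as np

# Sketch: build M-RoPE position IDs for one vision feature with
# grid (t, h, w), applying spatial merging and temporal scaling.
t, h, w = 2, 4, 4
spatial_merge_size = 2
t_factor = 2.0   # e.g. second_per_grid_ts * tokens_per_second for video
start = 10       # position index right before this feature's tokens

llm_h, llm_w = h // spatial_merge_size, w // spatial_merge_size
# np.indices yields the three (t, llm_h, llm_w) index grids in one call.
t_idx, h_idx, w_idx = np.indices((t, llm_h, llm_w))
t_idx = (t_idx * t_factor).astype(np.int64)  # temporal scaling (video)
pos = np.stack([t_idx, h_idx, w_idx]).reshape(3, -1) + start
print(pos.shape)  # (3, 8) == (3, t * llm_h * llm_w)
```

Building all three axes in one `np.indices` call replaces the per-step torch index construction of the old token-scanning loop, which is where the speedup comes from.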