
[BugFix] Fix 3D rope in transformers backend#35097

Merged
hmellor merged 9 commits into vllm-project:main from zucchini-nlp:qwen2-vl
Feb 27, 2026

Conversation

@zucchini-nlp
Contributor

@zucchini-nlp zucchini-nlp commented Feb 23, 2026

We will require mm_token_type_ids to prepare 3D position ids in the Qwen-VL model family after huggingface/transformers#43972 is merged. This PR ensures that the Transformers backend keeps functioning and remains forward- and backward-compatible. Tested with tests/models/multimodal/generation/test_common.py::test_single_image_models[qwen2_5_vl-transformers-test_case53] to verify that the arguments are passed correctly and the rope index can be computed.
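As a rough illustration of what mm_token_type_ids encode, here is a minimal sketch: positions holding multimodal placeholder tokens are marked 1, text positions 0, so downstream code can build 3D (temporal/height/width) rope position ids only for the multimodal spans. The token ids and the helper name below are made up for illustration, not the actual transformers or vLLM API.

```python
# Hypothetical sketch: mark which input positions are multimodal tokens.
# IMAGE_TOKEN_ID is a placeholder value chosen for this example.
IMAGE_TOKEN_ID = 151655

def make_mm_token_type_ids(input_ids: list[int]) -> list[int]:
    """Return 1 where a token is a multimodal placeholder, else 0."""
    return [1 if tok == IMAGE_TOKEN_ID else 0 for tok in input_ids]

ids = [101, 151655, 151655, 2023, 102]
print(make_mm_token_type_ids(ids))  # [0, 1, 1, 0, 0]
```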

Also fixes the GLM-V model's video-input handling to be consistent with transformers: video timestamps are usually kept as floats to preserve fine-grained information about each frame. This fixes the currently failing GLM-OCR processing test in vLLM.

Signed-off-by: raushan <raushan@huggingface.co>
@mergify mergify bot added the bug Something isn't working label Feb 23, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request addresses a bug in the GLM-OCR processing test in vLLM by ensuring that video timestamps are kept as floats, consistent with transformers. It also makes changes to support mm_token_type_ids for 3D position IDs in the Qwen-VL model family, ensuring forward/backward compatibility. The changes primarily involve modifying how video timestamps are handled and updating multimodal processing logic to incorporate mm_token_type_ids.

Signed-off-by: raushan <raushan@huggingface.co>
@hmellor hmellor added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2026
Member


cc @Isotr0py for this change

@@ -472,10 +468,16 @@ def get_mrope_input_positions(
video_grid_thw
)

# In v4 this utility didn't accept any `kwargs`, thus we filter
Member


I don't understand this comment.

Will mm_token_type_ids only exist in v5, and do we keep kwargs empty otherwise because get_rope_index would error in v4 if we explicitly passed the None value?
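The backward-compatibility pattern being discussed can be sketched as follows: optional arguments are only forwarded when they are actually set, so an older function that lacks the parameter never sees it and never raises TypeError. `get_rope_index_v4` below is a stand-in for illustration, not the real Transformers signature.

```python
# Sketch of the "filter kwargs" compat pattern: only pass new optional
# arguments when they are not None, so a v4-style callee without the
# parameter still works.

def get_rope_index_v4(input_ids):  # v4-style: accepts no extra kwargs
    return list(range(len(input_ids)))

def call_with_compat(fn, input_ids, mm_token_type_ids=None):
    kwargs = {}
    if mm_token_type_ids is not None:
        kwargs["mm_token_type_ids"] = mm_token_type_ids
    return fn(input_ids, **kwargs)

print(call_with_compat(get_rope_index_v4, [1, 2, 3]))  # [0, 1, 2]
```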

@hmellor
Member

hmellor commented Feb 25, 2026

@@ -1011,7 +990,7 @@ def _get_video_second_idx_glm4v(
uniq.append(uniq[-1])
frame_indices = uniq

full_second_idxs = [int(idx / video_fps) for idx in frame_indices]
full_second_idxs = [idx / video_fps for idx in frame_indices]
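The one-line diff above drops the `int()` truncation so per-frame timestamps stay as floats (seconds) and keep sub-second detail. A small self-contained illustration, with a made-up frame rate:

```python
# Illustration of the change: int() truncation collapses distinct frame
# times, while plain float division keeps fine-grained timestamps.
video_fps = 2.0
frame_indices = [0, 1, 2, 3]

old = [int(idx / video_fps) for idx in frame_indices]  # truncated seconds
new = [idx / video_fps for idx in frame_indices]       # fine-grained seconds

print(old)  # [0, 0, 1, 1]
print(new)  # [0.0, 0.5, 1.0, 1.5]
```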
Member


Is this change also necessary for GLM4.1V? I remember GLM4.1V uses int for timestamps while GLM4.6V uses float with decimal seconds:

<|end_of_image|>0<|end_of_video|>
<|end_of_image|>0.0 seconds<|end_of_video|>

Contributor Author


In transformers we use the same timestamp format for all GLM models, so I am relying on that. If you want to check in with the GLM authors, I can ask in Slack.


zucchini-nlp and others added 5 commits February 25, 2026 10:21
Signed-off-by: raushan <raushan@huggingface.co>
@hmellor hmellor enabled auto-merge (squash) February 27, 2026 15:41
@hmellor hmellor merged commit fd6de37 into vllm-project:main Feb 27, 2026
59 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in Transformers backend Feb 27, 2026
sergey-zinchenko pushed a commit to sergey-zinchenko/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: raushan <raushan@huggingface.co>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Sergey Zinchenko <sergey.zinchenko.rnd@gmail.com>
EanWang211123 pushed a commit to EanWang211123/vllm that referenced this pull request Mar 2, 2026
Signed-off-by: raushan <raushan@huggingface.co>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>
@AndreasKaratzas
Collaborator

This PR introduced a regression for this test:

pytest -s -v tests/models/multimodal/generation/test_common.py::test_single_image_models[qwen2_5_vl-transformers-test_case53]

The test is part of Multi-Modal Models (Extended) 2. I have already put up a fix for that (#35711).

@hmellor
Member

hmellor commented Mar 2, 2026

Thanks for letting us know, I'll look into your fix

bhoomit pushed a commit to bhoomit/vllm that referenced this pull request Mar 2, 2026
Signed-off-by: raushan <raushan@huggingface.co>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: raushan <raushan@huggingface.co>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

4 participants