[BugFix] Fix 3D rope in transformers backend#35097
[BugFix] Fix 3D rope in transformers backend#35097hmellor merged 9 commits intovllm-project:mainfrom
Conversation
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: raushan <raushan@huggingface.co>
There was a problem hiding this comment.
Code Review
The pull request addresses a bug in the GLM-OCR processing test in vLLM by ensuring that video timestamps are kept as floats, consistent with transformers. It also makes changes to support mm_token_type_ids for 3D position IDs in the Qwen-VL model family, ensuring forward/backward compatibility. The changes primarily involve modifying how video timestamps are handled and updating multimodal processing logic to incorporate mm_token_type_ids.
Signed-off-by: raushan <raushan@huggingface.co>
| @@ -472,10 +468,16 @@ def get_mrope_input_positions( | |||
| video_grid_thw | |||
| ) | |||
|
|
|||
| # In v4 this utility didn't accept any `kwargs`, thus we filter | |||
There was a problem hiding this comment.
I don't understand this comment.
Will mm_token_type_ids only exist in v5 and we keep kwargs empty otherwise because get_rope_index would error in v4 if we explicitly passed the None value?
| @@ -1011,7 +990,7 @@ def _get_video_second_idx_glm4v( | |||
| uniq.append(uniq[-1]) | |||
| frame_indices = uniq | |||
|
|
|||
| full_second_idxs = [int(idx / video_fps) for idx in frame_indices] | |||
| full_second_idxs = [idx / video_fps for idx in frame_indices] | |||
There was a problem hiding this comment.
Is this change also necessary for GLM4.1V? I remember GLM4.1V use int for timestamp while GLM4.6V is float with decimal seconds:
<|end_of_image|>0<|end_of_video|>
<|end_of_image|>0.0 seconds<|end_of_video|>
There was a problem hiding this comment.
in transformers we use the same timestamps format for all GLM models, so I am relying on it. Do you want to check-in with GLM authors, I can ask in slack?
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: raushan <raushan@huggingface.co> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Sergey Zinchenko <sergey.zinchenko.rnd@gmail.com>
Signed-off-by: raushan <raushan@huggingface.co> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>
|
This PR introduced a regression for this test: pytest -s -v tests/models/multimodal/generation/test_common.py::test_single_image_models[qwen2_5_vl-transformers-test_case53]The test is part of |
|
Thanks for letting us know, I'll look into your fix |
Signed-off-by: raushan <raushan@huggingface.co> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: raushan <raushan@huggingface.co> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
We will require
mm_token_type_idsto prepare 3D position ids in Qwen-VL model family after huggingface/transformers#43972 is merged. This PR makes sure that transformers backend keeps functioning and is forward/backwards compatible. Tested withtests/models/multimodal/generation/test_common.py::test_single_image_models[qwen2_5_vl-transformers-test_case53]that the args are passed correctly and rope index can be computedAlso, fixes the glmv model with video input to be consistent with transformers, video timestamps are usually kept as float to get finegrained information about each frame. It will fix the currently failing Glm-OCR processing test in vLLM