[Bugfix] Fix getting vision features in Transformer Multimodal backend#32933
Conversation
Signed-off-by: raushan <raushan@huggingface.co>
Code Review
The pull request effectively addresses the compatibility issue with the transformers library's v5 release, where the self.model.get_image_features method now returns a tuple or dict instead of a single tensor. The added logic correctly extracts the vision embeddings from these new output formats, ensuring the multimodal backend continues to function as expected. The changes are concise and directly resolve the reported bug.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Nice
(vllm-project#32933) Signed-off-by: raushan <raushan@huggingface.co> Signed-off-by: 陈建华 <1647430658@qq.com>
Makes sure that the Transformers multimodal backend keeps working after the v5 release.
PR huggingface/transformers#42564 changed the output of `self.model.get_image_features` to a `tuple | dict` format. Previously we expected the output to always be a single tensor, or a list of tensors for non-homogeneous image sizes. The default output format currently depends on `model.config.return_dict`, so I added a simple check that handles both formats.
cc @hmellor
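The normalization described above can be sketched roughly as follows. This is a hypothetical standalone helper, not the actual vLLM patch: the function name `extract_image_features` and the assumption that the embeddings sit at the first position of a tuple (or the first value of a dict-like output) are illustrative, based on the PR description.

```python
def extract_image_features(output):
    """Normalize `get_image_features` output across transformers versions.

    Hypothetical sketch: pre-v5 returned a tensor (or a list of tensors
    for non-homogeneous image sizes); v5 may return a tuple or a
    dict-like output depending on `model.config.return_dict`.
    """
    # v5 with return_dict=False: tuple output; assume the vision
    # embeddings are the first element.
    if isinstance(output, tuple):
        return output[0]
    # v5 with return_dict=True: dict-like output (ModelOutput subclasses
    # also pass this check); take the first value.
    if isinstance(output, dict):
        return next(iter(output.values()))
    # pre-v5 behavior: already a tensor or a list of tensors.
    return output
```

Downstream code can then always treat the result as a tensor (or list of tensors), regardless of the transformers version or `return_dict` setting.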