fix: Add method to get image features in PaliGemmaForConditionalGeneration #38730
Conversation
zucchini-nlp
left a comment
Nice catch! Indeed it is a breaking change, and we need to allow access through the generative model as well.
Can you update the other VLMs as well, since the changes were done on all models?
Yes, I would like to be a contributor to transformers. Besides, I have a question:

```python
if hasattr(self.paligemma, "get_image_features"):
    return self.paligemma.get_image_features(image)
else:
    return self.paligemma.model.get_image_features(image)
```

With this modification I can run the generated models no matter what.
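As an illustration, the fallback pattern above can be exercised with toy stand-in classes (hypothetical names, not the real transformers classes): the call succeeds whether or not the wrapper exposes the helper itself.

```python
# Toy sketch of the hasattr fallback: prefer the wrapper's own
# get_image_features, otherwise reach into the inner .model attribute.
class InnerModel:
    def get_image_features(self, image):
        return f"features({image})"

class WrapperWithoutHelper:
    """Stands in for a wrapper that lost the helper after the refactor."""
    def __init__(self):
        self.model = InnerModel()

class WrapperWithHelper(WrapperWithoutHelper):
    """Stands in for a wrapper that delegates the helper explicitly."""
    def get_image_features(self, image):
        return self.model.get_image_features(image)

def extract(wrapper, image):
    # Same branching as in the workaround above
    if hasattr(wrapper, "get_image_features"):
        return wrapper.get_image_features(image)
    return wrapper.model.get_image_features(image)

print(extract(WrapperWithHelper(), "img"))     # features(img)
print(extract(WrapperWithoutHelper(), "img"))  # features(img)
```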
Yeah, it can be used as a hacky workaround in your code for now. Btw, after the fixes you need to run `make fix-copies`.
I modified 6 files, adding the method, and used `make fix-copies` to generate the rest.
zucchini-nlp
left a comment
Thanks a lot ❤️
Just left a few comments about missing parts; some models use different helpers for the video modality or quantized vision tokens.
```python
def get_image_features(self, pixel_values):
    return self.model.get_image_features(pixel_values)
```
For chameleon and emu3, the helper was `get_image_tokens`. Can we propagate that too?
```python
    vision_feature_layer: Optional[Union[int, List[int]]] = None,
    vision_feature_select_strategy: Optional[str] = None,
):
    return self.model.get_image_features(pixel_values_images, vision_feature_layer, vision_feature_select_strategy)
```
Qwen-VL has no `vision_feature_layer` and `vision_feature_select_strategy`; the naming should be the same as in the base model.
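To illustrate the point, the consistency the comment asks for can be checked with a small, self-contained sketch (illustrative classes and parameter names, not transformers code): the delegating wrapper should keep exactly the base model's signature.

```python
# Illustrative check that a wrapper's helper signature matches its base model's.
import inspect

class BaseModel:
    # Stand-in for a base model whose helper takes model-specific parameters
    def get_image_features(self, pixel_values, image_grid_thw=None):
        return "features"

class GoodWrapper:
    # Delegating wrapper that mirrors the base signature exactly
    def get_image_features(self, pixel_values, image_grid_thw=None):
        return BaseModel().get_image_features(pixel_values, image_grid_thw)

def signatures_match(wrapper_cls, base_cls, name="get_image_features"):
    return (inspect.signature(getattr(wrapper_cls, name))
            == inspect.signature(getattr(base_cls, name)))

print(signatures_match(GoodWrapper, BaseModel))  # True
```

A check like this could even live in a test so that wrappers and base models cannot drift apart silently.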
Oh, this is my fault. Thanks!
```python
def get_image_features(
    self,
    pixel_values_images: torch.FloatTensor,
    vision_feature_layer: Optional[Union[int, List[int]]] = None,
    vision_feature_select_strategy: Optional[str] = None,
):
    return self.model.get_image_features(pixel_values_images, vision_feature_layer, vision_feature_select_strategy)
```
same
```python
def get_decoder(self):
    return self.model

def get_image_features(
```
Can we also add `get_video_features` when it exists? I think it's only for llava-onevision, llava-next-video and video-llava.
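For context, the delegation the comment asks for could look like the minimal sketch below (dummy classes; real signatures differ per model): the generative wrapper forwards `get_video_features` to its inner model exactly like `get_image_features`.

```python
# Dummy sketch of get_video_features delegation from a wrapper to its inner model.
class DummyModel:
    def get_video_features(self, pixel_values_videos):
        # Stand-in computation for real video feature extraction
        return [len(frames) for frames in pixel_values_videos]

class DummyForConditionalGeneration:
    def __init__(self):
        self.model = DummyModel()

    def get_video_features(self, pixel_values_videos):
        # Thin delegation, same pattern as get_image_features
        return self.model.get_video_features(pixel_values_videos)

wrapper = DummyForConditionalGeneration()
print(wrapper.get_video_features([[1, 2], [3]]))  # [2, 1]
```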
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…across multiple models

modified:
- modeling_chameleon.py
- modeling_llava_next.py
- modular_llava_next_video.py
- modeling_qwen2_vl.py

and generated:
- modeling_llava_next_video.py
- modeling_llava_onevision.py
- modeling_qwen2_5_vl.py
zucchini-nlp
left a comment
Perfect! I think the last rebase went wrong, I see unrelated commits in the history. Can you fix that and we'll merge :)
…lava models with updated parameters
Force-pushed 55a2fbc to 67461fb
Sure, it is my fault. I used `git fetch upstream` and `git rebase upstream/main`, which caused something wrong.
Commit 67461fb

Let me explain the reason for this commit. For example, `AriaModel` overrides the helper with a different signature than its `LlavaModel` parent:

```python
class AriaModel(LlavaModel):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        pixel_mask: Optional[torch.FloatTensor] = None,
        vision_feature_layer: int = -1,
    ):
        ...
```

```python
class LlavaModel(LlavaPreTrainedModel):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        vision_feature_layer: Optional[Union[int, List[int]]] = None,
        vision_feature_select_strategy: Optional[str] = None,
        **kwargs,
    ):
        ...
```

If I do not add the method explicitly, the generated class keeps the Llava-style signature:

```python
class AriaForConditionalGeneration(AriaPreTrainedModel, GenerationMixin):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        vision_feature_layer: Optional[Union[int, List[int]]] = None,
        vision_feature_select_strategy: Optional[str] = None,
        **kwargs,
    ):
        return self.model.get_image_features(
            pixel_values=pixel_values,
            vision_feature_layer=vision_feature_layer,
            vision_feature_select_strategy=vision_feature_select_strategy,
            **kwargs,
        )
```
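The mismatch described above can be reproduced with simplified stand-ins (illustrative classes, not the real transformers code): a wrapper generated with the Llava-style signature cannot delegate cleanly to an Aria-style model whose helper takes different parameters, which is why the commit pins the signature explicitly.

```python
# Simplified stand-ins showing why blind delegation with the parent's
# signature fails when the child model changes the parameter set.
class AriaStyleModel:
    def get_image_features(self, pixel_values, pixel_mask=None, vision_feature_layer=-1):
        return ("aria", pixel_values, pixel_mask, vision_feature_layer)

class LlavaStyleWrapper:
    def __init__(self):
        self.model = AriaStyleModel()

    def get_image_features(self, pixel_values, vision_feature_layer=None,
                           vision_feature_select_strategy=None, **kwargs):
        # Blind delegation forwards a keyword the Aria-style model does not accept
        return self.model.get_image_features(
            pixel_values=pixel_values,
            vision_feature_layer=vision_feature_layer,
            vision_feature_select_strategy=vision_feature_select_strategy,
            **kwargs,
        )

try:
    LlavaStyleWrapper().get_image_features("px")
except TypeError as e:
    print("mismatch:", e)  # unexpected keyword argument
```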
…ation (huggingface#38730)

* fix: Add method to retrieve image features in PaliGemmaForConditionalGeneration
* feat: Add get_image_features method to multiple models for image feature extraction
* fix: reformat the files with ruff
* feat: Add methods for packing and retrieving image and video features across multiple models
  modified: modeling_chameleon.py, modeling_llava_next.py, modular_llava_next_video.py, modeling_qwen2_vl.py
  and generated: modeling_llava_next_video.py, modeling_llava_onevision.py, modeling_qwen2_5_vl.py
* feat: Implement get_image_features method in Aria, Mistral3, and VipLlava models with updated parameters
* fix: reformatted the code with fix-style
What does this PR do?
In the v4.52.1 release of the transformers library, PR #37033 by @zucchini-nlp introduced a bug by renaming `class PaliGemmaForConditionalGeneration(PaliGemmaPreTrainedModel, GenerationMixin)` to `class PaliGemmaModel(PaliGemmaPreTrainedModel)`, which makes the original `get_image_features` call (line 218 of huggingface/lerobot/common/policies/pi0/paligemma_with_expert.py) unusable.

This pull request adds a new `get_image_features` method across multiple generative model implementations in the `src/transformers/models` directory. The method provides a standardized interface for extracting image features, with variations in parameters depending on each model's requirements.

I modified 6 files, adding the `get_image_features` method to the corresponding `<model name>ForConditionalGeneration` classes, and used `make fix-copies` to generate the other 13 modeling files.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@amyeroberts, @qubvel, @zucchini-nlp