
Conversation

@YushunXiang
Contributor

@YushunXiang YushunXiang commented Jun 10, 2025

What does this PR do?

In the v4.52.1 release of the transformers library, PR #37033 by @zucchini-nlp introduced a breaking change: class PaliGemmaForConditionalGeneration(PaliGemmaPreTrainedModel, GenerationMixin) was refactored into class PaliGemmaModel(PaliGemmaPreTrainedModel), which breaks the get_image_features call at line 218 of huggingface/lerobot/common/policies/pi0/paligemma_with_expert.py.

This pull request adds a new get_image_features method across multiple generative model implementations in the src/transformers/models directory. The method provides a standardized interface for extracting image features from models, with variations in parameters depending on the specific model's requirements.

I modified 6 files, adding a get_image_features method to the corresponding <model name>ForConditionalGeneration class:

  • src/transformers/models/idefics2/modeling_idefics2.py
  • src/transformers/models/llava/modeling_llava.py
  • src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
  • src/transformers/models/chameleon/modeling_chameleon.py
  • src/transformers/models/paligemma/modeling_paligemma.py
  • src/transformers/models/video_llava/modeling_video_llava.py

and used make fix-copies to generate the other 13 modeling files.
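The delegation the PR adds can be sketched in isolation. The dummy classes below stand in for the real transformers classes and only illustrate the wiring, not the actual feature extraction:

```python
class PaliGemmaModel:  # stand-in for the refactored base model
    def get_image_features(self, pixel_values):
        # the real method runs the vision tower and multimodal projector;
        # here we just tag the call so the delegation is visible
        return ("features", pixel_values)


class PaliGemmaForConditionalGeneration:  # stand-in for the generative wrapper
    def __init__(self):
        self.model = PaliGemmaModel()

    def get_image_features(self, pixel_values):
        # restored public entry point: forward to the inner base model
        return self.model.get_image_features(pixel_values)
```

Downstream code such as lerobot can then keep calling get_image_features on the generative class exactly as before the refactor.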

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts, @qubvel, @zucchini-nlp

Member

@zucchini-nlp zucchini-nlp left a comment


Nice catch! Indeed it is a breaking change and we need to allow access through the generative model as well

Can you update other VLMs as well, since the changes were done on all models?

@YushunXiang
Contributor Author

Nice catch! Indeed it is a breaking change and we need to allow access through the generative model as well

Can you update other VLMs as well, since the changes were done on all models?

Yes, I would like to be a contributor to transformers.

Besides, I have a question:

if hasattr(self.paligemma, "get_image_features"):
    return self.paligemma.get_image_features(image)
else:
    return self.paligemma.model.get_image_features(image)

With this modification I can run the generated models regardless of the transformers version, but is it a good approach?

@zucchini-nlp
Member

zucchini-nlp commented Jun 10, 2025

With this modification I can run the generated models no matter what transformers version, but is it a good modification?

Yeah, it can be used as a hacky workaround in your code for transformers==v4.52. For the transformers codebase, your solution looks good and can be propagated to all models. We'll then include your fixes in the next release :)

Btw, after the fixes you need to run make fix-copies to make our CI happy

@YushunXiang
Contributor Author

YushunXiang commented Jun 10, 2025

I modified 6 files, adding a get_image_features method to the corresponding <model name>ForConditionalGeneration class:

  • src/transformers/models/idefics2/modeling_idefics2.py
  • src/transformers/models/llava/modeling_llava.py
  • src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
  • src/transformers/models/chameleon/modeling_chameleon.py
  • src/transformers/models/paligemma/modeling_paligemma.py
  • src/transformers/models/video_llava/modeling_video_llava.py

and used make fix-copies to generate the other 13 modeling files.

Member

@zucchini-nlp zucchini-nlp left a comment


Thanks a lot ❤️

Just left a few comments about missing parts; some models use different helpers for the video modality or for quantized vision tokens.

Comment on lines +1232 to +1234
def get_image_features(self, pixel_values):
    return self.model.get_image_features(pixel_values)

Member


for chameleon and emu3, the helper was get_image_tokens. Can we propagate that too?
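For those models the same delegation pattern would wrap the token helper instead of the feature helper. A minimal sketch with stand-in classes (the real Chameleon helper quantizes pixels into discrete VQ token ids; the values here are placeholders):

```python
class ChameleonModel:  # stand-in for the base model
    def get_image_tokens(self, pixel_values):
        # the real helper runs a VQ-VAE and returns discrete token ids;
        # placeholder ids are returned here for illustration
        return [101, 102, 103]


class ChameleonForConditionalGeneration:  # stand-in for the generative wrapper
    def __init__(self):
        self.model = ChameleonModel()

    def get_image_tokens(self, pixel_values):
        # same thin delegation, but for the token-based helper
        return self.model.get_image_tokens(pixel_values)
```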

    vision_feature_layer: Optional[Union[int, List[int]]] = None,
    vision_feature_select_strategy: Optional[str] = None,
):
    return self.model.get_image_features(pixel_values_images, vision_feature_layer, vision_feature_select_strategy)
Member


Qwen-VL has no vision_feature_layer and vision_feature_select_strategy, naming should be the same as in base model
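A sketch of what matching the base model's signature looks like for Qwen2-VL. The parameter name image_grid_thw is drawn from the Qwen2-VL interface (treat it as an assumption here), and the classes are dummies for illustration:

```python
class Qwen2VLModel:  # stand-in for the base model
    def get_image_features(self, pixel_values, image_grid_thw=None):
        # Qwen2-VL's vision path takes grid shapes, not a feature layer index
        return ("features", image_grid_thw)


class Qwen2VLForConditionalGeneration:  # stand-in for the generative wrapper
    def __init__(self):
        self.model = Qwen2VLModel()

    def get_image_features(self, pixel_values, image_grid_thw=None):
        # mirror the base model's own parameter names exactly
        return self.model.get_image_features(pixel_values, image_grid_thw)
```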

Contributor Author


Oh, this is my fault. Thanks!

Comment on lines 1390 to 1397
def get_image_features(
    self,
    pixel_values_images: torch.FloatTensor,
    vision_feature_layer: Optional[Union[int, List[int]]] = None,
    vision_feature_select_strategy: Optional[str] = None,
):
    return self.model.get_image_features(pixel_values_images, vision_feature_layer, vision_feature_select_strategy)

Member


same

def get_decoder(self):
    return self.model

def get_image_features(
Member


Can we also add get_video_features where it exists? I think it's only for llava-onevision, llava-next-video and video-llava.
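For those models the wrapper would gain a second delegating method alongside the image one. A minimal sketch with stand-in classes:

```python
class VideoLlavaModel:  # stand-in for the base model
    def get_image_features(self, pixel_values_images):
        return "image_feats"  # placeholder for extracted image features

    def get_video_features(self, pixel_values_videos):
        return "video_feats"  # placeholder for extracted video features


class VideoLlavaForConditionalGeneration:  # stand-in for the generative wrapper
    def __init__(self):
        self.model = VideoLlavaModel()

    def get_image_features(self, pixel_values_images):
        return self.model.get_image_features(pixel_values_images)

    def get_video_features(self, pixel_values_videos):
        # video counterpart; only models with a video pathway define it
        return self.model.get_video_features(pixel_values_videos)
```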

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

… across multiple models

modified:
- modeling_chameleon.py
- modeling_llava_next.py
- modular_llava_next_video.py
- modeling_qwen2_vl.py

and generate the:
- modeling_llava_next_video.py
- modeling_llava_onevision.py
- modeling_qwen2_5_vl.py
Member

@zucchini-nlp zucchini-nlp left a comment


Perfect! I think the last rebase went wrong; I see unrelated commits in the history. Can you fix that and we'll merge :)

@YushunXiang
Contributor Author

Perfect! I think the last rebase went wrong; I see unrelated commits in the history. Can you fix that and we'll merge :)

Sure, it is my fault. I ran

git fetch upstream
git rebase upstream/main

which caused the problem.

@YushunXiang
Copy link
Contributor Author

YushunXiang commented Jun 11, 2025

Commit 67461fb
implement get_image_features method in Aria, Mistral3, and VipLlava models with updated parameters.

Let me explain the reason for this commit.

For example, in the file src/transformers/models/aria/modular_aria.py, AriaForConditionalGeneration inherits from LlavaForConditionalGeneration. However, the get_image_features interface in AriaModel is inconsistent with the get_image_features interface in LlavaForConditionalGeneration.

class AriaModel(LlavaModel):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        pixel_mask: Optional[torch.FloatTensor] = None,
        vision_feature_layer: int = -1,
    ):
        ...

class LlavaModel(LlavaPreTrainedModel):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        vision_feature_layer: Optional[Union[int, List[int]]] = None,
        vision_feature_select_strategy: Optional[str] = None,
        **kwargs,
    ):
        ...

If I do not add the get_image_features method to the AriaForConditionalGeneration class in modular_aria.py, and then use make fix-copies to generate the modeling files, it generates something wrong like:

class AriaForConditionalGeneration(AriaPreTrainedModel, GenerationMixin):
    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        vision_feature_layer: Optional[Union[int, List[int]]] = None,
        vision_feature_select_strategy: Optional[str] = None,
        **kwargs,
    ):
        return self.model.get_image_features(
            pixel_values=pixel_values,
            vision_feature_layer=vision_feature_layer,
            vision_feature_select_strategy=vision_feature_select_strategy,
            **kwargs,
        )
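The commit therefore adds an explicit override whose parameters match AriaModel, so fix-copies keeps it instead of copying LLaVA's incompatible version. A self-contained sketch of that fix, with dummy classes in place of the real transformers ones:

```python
class AriaModel:  # stand-in for the base model
    def get_image_features(self, pixel_values, pixel_mask=None, vision_feature_layer=-1):
        # Aria's own interface: a pixel mask and an int feature-layer index
        return ("features", pixel_mask, vision_feature_layer)


class AriaForConditionalGeneration:  # stand-in for the generative wrapper
    def __init__(self):
        self.model = AriaModel()

    # explicit override: the signature mirrors AriaModel exactly, so the
    # modular code generator does not substitute LLaVA's parameter list
    def get_image_features(self, pixel_values, pixel_mask=None, vision_feature_layer=-1):
        return self.model.get_image_features(pixel_values, pixel_mask, vision_feature_layer)
```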

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) June 11, 2025 10:14
@zucchini-nlp zucchini-nlp merged commit 56a7cf5 into huggingface:main Jun 11, 2025
15 checks passed
@YushunXiang YushunXiang deleted the fix-paligemma branch June 11, 2025 11:35
lmarshall12 pushed a commit to lmarshall12/transformers that referenced this pull request Jun 12, 2025
…ation (huggingface#38730)

* fix: Add method to retrieve image features in PaliGemmaForConditionalGeneration

* feat: Add get_image_features method to multiple models for image feature extraction

* fix: reformat the files with ruff.

* feat: Add methods for packing and retrieving image and video features across multiple models

modified:
- modeling_chameleon.py
- modeling_llava_next.py
- modular_llava_next_video.py
- modeling_qwen2_vl.py

and generate the:
- modeling_llava_next_video.py
- modeling_llava_onevision.py
- modeling_qwen2_5_vl.py

* feat: Implement get_image_features method in Aria, Mistral3, and VipLlava models with updated parameters

* fix: reformatted the code with fix-style