Add I-JEPA #33125
Merged
Changes from all commits (47 commits, all by jmtzt):

3f53abd first draft
b9d7c03 add IJepaEmbeddings class
7af8961 fix copy-from for IJepa model
a4c8eec add weight conversion script
bf70f98 update attention class names in IJepa model
64f2208 style changes
1dd4e7d Add push_to_hub option to convert_ijepa_checkpoint function
9826f99 add initial tests for I-JEPA
d78e468 minor style changes to conversion script
7a64b83 make fixup related
66773ee rename conversion script
9b7e8b4 Add I-JEPA to sdpa docs
edd2ac9 Merge branch 'huggingface:main' into add_ijepa
40cf528 minor fixes
2bae64a adjust conversion script
4ccf28c update conversion script
851ed7e adjust sdpa docs
b7a027c [run_slow] ijepa
552e800 [run-slow] ijepa
f2f7eb8 [run-slow] ijepa
51b950d Merge branch 'main' of github.com:huggingface/transformers into add_i…
f24ef12 [run-slow] ijepa
6f9acc9 [run-slow] ijepa
d663ea3 [run-slow] ijepa
5c80f00 Merge branch 'main' of github.com:huggingface/transformers into add_i…
7da705b formatting issues
52f2173 adjust modeling to modular code
b13a24e add IJepaModel to objects to ignore in docstring checks
2b154ce [run-slow] ijepa
3f0c027 fix formatting issues
2ea53eb add usage instruction snippet to docs
13ccd82 change pos encoding, add checkpoint for doc
10cbda2 add verify logits for all models
0ccd96e [run-slow] ijepa
d2d47d4 update docs to include image feature extraction instructions
8e8df55 remove pooling layer from IJepaModel in image classification class
50f93d4 [run-slow] ijepa
db79009 remove pooling layer from IJepaModel constructor
57e5407 update docs
8236816 [run-slow] ijepa
ce6499f [run-slow] ijepa
81a6e66 small changes
7a0fc39 [run-slow] ijepa
37a38f9 style adjustments
491d5a5 update copyright in init file
2afaba0 adjust modular ijepa
db4dfc0 [run-slow] ijepa
@@ -0,0 +1,78 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# I-JEPA

## Overview

The I-JEPA model was proposed in [Image-based Joint-Embedding Predictive Architecture](https://arxiv.org/pdf/2301.08243.pdf) by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas.
I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

The abstract from the paper is the following:

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

This model was contributed by [jmtzt](https://huggingface.co/jmtzt).
The original code can be found [here](https://github.com/facebookresearch/ijepa).

## How to use

Here is how to use this model for image feature extraction:

```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity

from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```

## IJepaConfig

[[autodoc]] IJepaConfig

## IJepaModel

[[autodoc]] IJepaModel
    - forward

## IJepaForImageClassification

[[autodoc]] IJepaForImageClassification
    - forward
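The usage snippet in the new doc compares two images by mean-pooling the patch embeddings in `last_hidden_state` and taking the cosine similarity of the pooled vectors. The arithmetic itself can be checked standalone on toy vectors, with no model download; the helper names and the tiny 4-dimensional "patch embeddings" below are purely illustrative, not part of the PR:

```python
import math

def mean_pool(patch_embeddings):
    """Average a list of patch-embedding vectors into one image embedding,
    mirroring outputs.last_hidden_state.mean(dim=1) in the doc snippet."""
    dim = len(patch_embeddings[0])
    n = len(patch_embeddings)
    return [sum(vec[i] for vec in patch_embeddings) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two toy "images", each represented by three 4-dim patch embeddings.
img1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
img2 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]

e1, e2 = mean_pool(img1), mean_pool(img2)
print(cosine_similarity(e1, e1))  # identical embeddings -> 1.0 (up to float rounding)
print(cosine_similarity(e1, e2))  # embeddings in disjoint dimensions -> 0.0
```

The real snippet returns a similarity near 1 for near-duplicate images and lower values for unrelated ones; the scale and threshold are up to the application.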
@@ -117,6 +117,7 @@
```diff
     idefics,
     idefics2,
     idefics3,
+    ijepa,
     imagegpt,
     informer,
     instructblip,
```
@@ -0,0 +1,55 @@
```python
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {"configuration_ijepa": ["IJepaConfig"]}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_ijepa"] = [
        "IJepaForImageClassification",
        "IJepaModel",
        "IJepaPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_ijepa import IJepaConfig

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_ijepa import (
            IJepaForImageClassification,
            IJepaModel,
            IJepaPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
```
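This `__init__.py` follows the transformers lazy-import convention: `_import_structure` declares what the package exports, and `_LazyModule` defers the actual submodule imports (and thus the torch import) until an attribute is first accessed. As a rough standalone sketch of that idea, using PEP 562 module `__getattr__` instead of the real `_LazyModule` class; `make_lazy_module` and the `json` stand-in for `modeling_ijepa` are hypothetical names for illustration:

```python
import importlib
import types

def make_lazy_module(name, import_structure):
    """Build a module whose declared attributes import their submodule on demand.
    import_structure maps a submodule name to the attribute names it provides,
    like transformers' _import_structure dicts."""
    attr_to_submodule = {
        attr: sub for sub, attrs in import_structure.items() for attr in attrs
    }
    mod = types.ModuleType(name)

    def __getattr__(attr):
        if attr not in attr_to_submodule:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        # The expensive import only happens here, on first access.
        submodule = importlib.import_module(attr_to_submodule[attr])
        value = getattr(submodule, attr)
        setattr(mod, attr, value)  # cache: later lookups are plain dict hits
        return value

    # PEP 562: a __getattr__ in the module dict handles missing attributes.
    mod.__getattr__ = __getattr__
    return mod

# "json" stands in for ".modeling_ijepa", "dumps" for "IJepaModel".
lazy = make_lazy_module("demo", {"json": ["dumps"]})
print(lazy.dumps({"ok": True}))  # json is imported only at this point
```

The real `_LazyModule` also installs itself in `sys.modules` (the last line of the file above), so `from transformers.models.ijepa import IJepaModel` resolves through the same deferred path.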
@@ -0,0 +1,108 @@
```python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""I-JEPA model configuration"""

from ...configuration_utils import PretrainedConfig


class IJepaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of an [`IJepaModel`]. It is used to instantiate an IJEPA
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the I-JEPA
    [google/ijepa-base-patch16-224](https://huggingface.co/google/ijepa-base-patch16-224) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.

    Example:

    ```python
    >>> from transformers import IJepaConfig, IJepaModel

    >>> # Initializing a IJEPA ijepa-base-patch16-224 style configuration
    >>> configuration = IJepaConfig()

    >>> # Initializing a model (with random weights) from the ijepa-base-patch16-224 style configuration
    >>> model = IJepaModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "ijepa"

    def __init__(
        self,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        attention_probs_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        image_size=224,
        patch_size=16,
        num_channels=3,
        qkv_bias=True,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.qkv_bias = qkv_bias
```
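The defaults above are ViT-Base-style. Assuming I-JEPA tokenizes images the usual ViT way, one token per non-overlapping patch (the doc snippet mean-pools all tokens rather than reading a [CLS] token, which suggests no extra tokens, though the modeling code is authoritative), the sequence length and the MLP expansion ratio follow directly from these defaults:

```python
# Back-of-the-envelope check of the config defaults; plain arithmetic only.
hidden_size = 768
intermediate_size = 3072
image_size = 224
patch_size = 16

# A 224x224 image cut into non-overlapping 16x16 patches yields a
# (224 / 16) x (224 / 16) grid of patch tokens.
num_patches = (image_size // patch_size) ** 2
print(num_patches)  # 196 patch embeddings per image

# intermediate_size is the standard 4x expansion of the transformer MLP.
print(intermediate_size // hidden_size)  # 4
```

So with these defaults `last_hidden_state` would have shape `(batch, 196, 768)`, which is what the mean-pooling in the usage snippet reduces over.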