huggingface · EduardoPach · Sep 21, 2023 · Sep 21, 2023 · Sep 22, 2023 · Sep 22, 2023
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -649,6 +649,8 @@
         title: GLPN
       - local: model_doc/hiera
         title: Hiera
+      - local: model_doc/imagebind
+        title: ImageBind
       - local: model_doc/imagegpt
         title: ImageGPT
       - local: model_doc/levit

diff --git a/docs/source/en/index.md b/docs/source/en/index.md
@@ -170,6 +170,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                       [IDEFICS](model_doc/idefics)                       |       ✅        |         ✅         |      ❌      |
 |                      [Idefics2](model_doc/idefics2)                      |       ✅        |         ❌         |      ❌      |
 |                      [Idefics3](model_doc/idefics3)                      |       ✅        |         ❌         |      ❌      |
+|                     [ImageBind](model_doc/imagebind)                     |       ✅        |         ❌         |      ❌      |
 |                      [ImageGPT](model_doc/imagegpt)                      |       ✅        |         ❌         |      ❌      |
 |                      [Informer](model_doc/informer)                      |       ✅        |         ❌         |      ❌      |
 |                  [InstructBLIP](model_doc/instructblip)                  |       ✅        |         ❌         |      ❌      |

diff --git a/docs/source/en/model_doc/imagebind.md b/docs/source/en/model_doc/imagebind.md
@@ -0,0 +1,141 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# ImageBind
+
+## Overview
+
+The ImageBind model was proposed in [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra.
+ImageBind is a multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images.
+For an input from any of these six modalities, it outputs an embedding of the same size, which can be used for cross-modal and multimodal tasks.
+
+The abstract from the paper is the following:
+
+*We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.*
+
+This model was contributed by [EduardoPacheco](https://huggingface.co/EduardoPacheco), [ruffy369](https://huggingface.co/ruffy369), [dg845](https://huggingface.co/dg845), and [shehan97](https://huggingface.co/shehan97).
+The original code can be found [here](https://github.com/facebookresearch/ImageBind).
+
+## Usage tips
+
+- ImageBind can be used for multi-modality similarity and zero-shot tasks.
+- Currently, only the vision (image and video), audio, and text modalities are supported.
+- Use [`ImageBindProcessor`] to prepare all of the available modalities, or any pair of them.
+- The `forward` method of [`ImageBindModel`] expects exactly one pair of modalities, one of which must be the vision modality.
+- If you only need the embeddings of a modality, use the matching `get_xxx_features` method of [`ImageBindModel`] or the appropriate `ImageBindXxxModelWithProjection` class.
+- The ImageBind vision and text encoders were frozen during training and are initialized from OpenCLIP ViT-H. An application that already uses that model could therefore be extended to other modalities by adding the corresponding ImageBind encoders.
+
+Here is an example of how to get the embeddings for images, text, and audio (it requires `torchaudio`):
+
+```python
+import torch
+import torchaudio
+from datasets import load_dataset
+from transformers import ImageBindModel, ImageBindProcessor
+
+ds = load_dataset("EduardoPacheco/imagebind-example-data", split="train")
+images = ds["image"]
+text = ds["text"]
+audios = ds["audio"]  # Each item is a dict with the keys `array` and `sampling_rate`
+audios = [
+    torchaudio.functional.resample(
+        torch.from_numpy(audio["array"]),
+        orig_freq=audio["sampling_rate"],
+        new_freq=16000,
+    ).numpy()
+    for audio in audios
+]
+
+model = ImageBindModel.from_pretrained("EduardoPacheco/imagebind-huge")
+processor = ImageBindProcessor.from_pretrained("EduardoPacheco/imagebind-huge")
+
+inputs = processor(text=text, images=images, audios=audios, padding=True, return_tensors="pt")
+
+with torch.no_grad():
+    audio_embeds = model.get_audio_features(input_features=inputs.input_features)
+    image_embeds = model.get_image_features(pixel_values=inputs.pixel_values)
+    text_embeds = model.get_text_features(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
+
+# Compute probabilities that can be used for retrieval or zero-shot workflows.
+probs_image_text = (image_embeds @ text_embeds.T).softmax(dim=-1)
+probs_text_audio = (text_embeds @ audio_embeds.T).softmax(dim=-1)
+probs_image_audio = (image_embeds @ audio_embeds.T).softmax(dim=-1)
+```
+
+## ImageBindConfig
+
+[[autodoc]] ImageBindConfig
+    - from_text_vision_configs
+
+## ImageBindTextConfig
+
+[[autodoc]] ImageBindTextConfig
+
+## ImageBindVisionConfig
+
+[[autodoc]] ImageBindVisionConfig
+
+## ImageBindAudioConfig
+
+[[autodoc]] ImageBindAudioConfig
+
+## ImageBindImageProcessor
+
+[[autodoc]] ImageBindImageProcessor
+    - preprocess
+
+## ImageBindFeatureExtractor
+
+[[autodoc]] ImageBindFeatureExtractor
+
+## ImageBindProcessor
+
+[[autodoc]] ImageBindProcessor
+
+## ImageBindModel
+
+[[autodoc]] ImageBindModel
+    - forward
+    - get_text_features
+    - get_image_features
+    - get_audio_features
+
+## ImageBindTextModel
+
+[[autodoc]] ImageBindTextModel
+    - forward
+
+## ImageBindTextModelWithProjection
+
+[[autodoc]] ImageBindTextModelWithProjection
+    - forward
+
+## ImageBindVisionModel
+
+[[autodoc]] ImageBindVisionModel
+    - forward
+
+
+## ImageBindVisionModelWithProjection
+
+[[autodoc]] ImageBindVisionModelWithProjection
+    - forward
+
+## ImageBindAudioModel
+
+[[autodoc]] ImageBindAudioModel
+    - forward
+
+## ImageBindAudioModelWithProjection
+
+[[autodoc]] ImageBindAudioModelWithProjection
+    - forward
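
The last step of the usage example in `imagebind.md` turns pairwise embedding products into probabilities with a softmax. The sketch below reproduces that step in plain Python on tiny hand-written vectors, so it runs without the checkpoint. The L2 normalization and the `temperature` parameter are assumptions added for illustration (they are common in CLIP-style retrieval); the diff does not state whether `get_xxx_features` already returns normalized embeddings, and `retrieval_probs` is a hypothetical helper, not part of the library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_probs(queries, candidates, temperature=1.0):
    """For each query embedding, return a probability over the candidates.

    `temperature` is an illustrative assumption: smaller values sharpen
    the distribution around the most similar candidate.
    """
    queries = [l2_normalize(q) for q in queries]
    candidates = [l2_normalize(c) for c in candidates]
    probs = []
    for q in queries:
        sims = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
        probs.append(softmax(sims))
    return probs


# Toy stand-ins for `image_embeds` and `text_embeds` (made-up values).
image_embeds = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_embeds = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0]]

probs = retrieval_probs(image_embeds, text_embeds, temperature=0.05)
# Index of the best-matching text for each image.
best = [max(range(len(row)), key=row.__getitem__) for row in probs]
```

Each row of `probs` sums to 1, and `best` gives the retrieved candidate index per query. The same function works for the text-audio and image-audio pairs by swapping the inputs.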
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
@@ -481,6 +481,14 @@
     "models.idefics": ["IdeficsConfig"],
     "models.idefics2": ["Idefics2Config"],
     "models.idefics3": ["Idefics3Config"],
+    "models.imagebind": [
+        "ImageBindAudioConfig",
+        "ImageBindConfig",
+        "ImageBindFeatureExtractor",
+        "ImageBindProcessor",
+        "ImageBindTextConfig",
+        "ImageBindVisionConfig",
+    ],
     "models.imagegpt": ["ImageGPTConfig"],
     "models.informer": ["InformerConfig"],
     "models.instructblip": [
@@ -1200,6 +1208,7 @@
     _import_structure["models.idefics"].extend(["IdeficsImageProcessor"])
     _import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"])
     _import_structure["models.idefics3"].extend(["Idefics3ImageProcessor"])
+    _import_structure["models.imagebind"].extend(["ImageBindImageProcessor"])
     _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
     _import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
     _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
@@ -2439,6 +2448,18 @@
             "Idefics3Processor",
         ]
     )
+    _import_structure["models.imagebind"].extend(
+        [
+            "ImageBindAudioModel",
+            "ImageBindAudioModelWithProjection",
+            "ImageBindModel",
+            "ImageBindPreTrainedModel",
+            "ImageBindTextModel",
+            "ImageBindTextModelWithProjection",
+            "ImageBindVisionModel",
+            "ImageBindVisionModelWithProjection",
+        ]
+    )
     _import_structure["models.imagegpt"].extend(
         [
             "ImageGPTForCausalImageModeling",
@@ -5337,6 +5358,14 @@
     )
     from .models.idefics2 import Idefics2Config
     from .models.idefics3 import Idefics3Config
+    from .models.imagebind import (
+        ImageBindAudioConfig,
+        ImageBindConfig,
+        ImageBindFeatureExtractor,
+        ImageBindProcessor,
+        ImageBindTextConfig,
+        ImageBindVisionConfig,
+    )
     from .models.imagegpt import ImageGPTConfig
     from .models.informer import InformerConfig
     from .models.instructblip import (
@@ -6094,6 +6123,7 @@
         from .models.idefics import IdeficsImageProcessor
         from .models.idefics2 import Idefics2ImageProcessor
         from .models.idefics3 import Idefics3ImageProcessor
+        from .models.imagebind import ImageBindImageProcessor
         from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
         from .models.instructblipvideo import InstructBlipVideoImageProcessor
         from .models.layoutlmv2 import (
@@ -7136,6 +7166,16 @@
             Idefics3PreTrainedModel,
             Idefics3Processor,
         )
+        from .models.imagebind import (
+            ImageBindAudioModel,
+            ImageBindAudioModelWithProjection,
+            ImageBindModel,
+            ImageBindPreTrainedModel,
+            ImageBindTextModel,
+            ImageBindTextModelWithProjection,
+            ImageBindVisionModel,
+            ImageBindVisionModelWithProjection,
+        )
         from .models.imagegpt import (
             ImageGPTForCausalImageModeling,
             ImageGPTForImageClassification,

diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
@@ -116,6 +116,7 @@
     idefics,
     idefics2,
     idefics3,
+    imagebind,
     imagegpt,
     informer,
     instructblip,

diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
@@ -134,6 +134,7 @@
         ("idefics", "IdeficsConfig"),
         ("idefics2", "Idefics2Config"),
         ("idefics3", "Idefics3Config"),
+        ("imagebind", "ImageBindConfig"),
         ("imagegpt", "ImageGPTConfig"),
         ("informer", "InformerConfig"),
         ("instructblip", "InstructBlipConfig"),
@@ -437,6 +438,7 @@
         ("idefics", "IDEFICS"),
         ("idefics2", "Idefics2"),
         ("idefics3", "Idefics3"),
+        ("imagebind", "ImageBind"),
         ("imagegpt", "ImageGPT"),
         ("informer", "Informer"),
         ("instructblip", "InstructBLIP"),

diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py
@@ -63,6 +63,7 @@
         ("glpn", "GLPNFeatureExtractor"),
         ("groupvit", "CLIPFeatureExtractor"),
         ("hubert", "Wav2Vec2FeatureExtractor"),
+        ("imagebind", "ImageBindFeatureExtractor"),
         ("imagegpt", "ImageGPTFeatureExtractor"),
         ("layoutlmv2", "LayoutLMv2FeatureExtractor"),
         ("layoutlmv3", "LayoutLMv3FeatureExtractor"),

diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
@@ -90,6 +90,7 @@
             ("idefics", ("IdeficsImageProcessor",)),
             ("idefics2", ("Idefics2ImageProcessor",)),
             ("idefics3", ("Idefics3ImageProcessor",)),
+            ("imagebind", ("ImageBindImageProcessor",)),
             ("imagegpt", ("ImageGPTImageProcessor",)),
             ("instructblip", ("BlipImageProcessor",)),
             ("instructblipvideo", ("InstructBlipVideoImageProcessor",)),

diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
@@ -131,6 +131,7 @@
         ("idefics", "IdeficsModel"),
         ("idefics2", "Idefics2Model"),
         ("idefics3", "Idefics3Model"),
+        ("imagebind", "ImageBindModel"),
         ("imagegpt", "ImageGPTModel"),
         ("informer", "InformerModel"),
         ("jamba", "JambaModel"),
@@ -1328,6 +1329,7 @@
         ("chinese_clip", "ChineseCLIPModel"),
         ("clip", "CLIPModel"),
         ("clipseg", "CLIPSegModel"),
+        ("imagebind", "ImageBindModel"),
         ("siglip", "SiglipModel"),
     ]
 )

diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py
@@ -66,6 +66,7 @@
         ("idefics", "IdeficsProcessor"),
         ("idefics2", "Idefics2Processor"),
         ("idefics3", "Idefics3Processor"),
+        ("imagebind", "ImageBindProcessor"),
         ("instructblip", "InstructBlipProcessor"),
         ("instructblipvideo", "InstructBlipVideoProcessor"),
         ("kosmos-2", "Kosmos2Processor"),

diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
@@ -220,6 +220,13 @@
             ("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
             ("idefics2", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
             ("idefics3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
+            (
+                "imagebind",
+                (
+                    "CLIPTokenizer",
+                    "CLIPTokenizerFast" if is_tokenizers_available() else None,
+                ),
+            ),
             ("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
             ("instructblipvideo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
             (

diff --git a/src/transformers/models/imagebind/__init__.py b/src/transformers/models/imagebind/__init__.py
@@ -0,0 +1,30 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import _LazyModule
+from ...utils.import_utils import define_import_structure
+
+
+if TYPE_CHECKING:
+    from .configuration_imagebind import *
+    from .feature_extraction_imagebind import *
+    from .image_processing_imagebind import *
+    from .modeling_imagebind import *
+    from .processing_imagebind import *
+else:
+    import sys
+
+    _file = globals()["__file__"]
+    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
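
The `__init__.py` above replaces the package in `sys.modules` with a lazy module so that submodules are imported only when one of their names is first accessed. The sketch below illustrates that general idea with the standard library only. It is not the actual `_LazyModule` from `transformers.utils`; the `LazyModule` class, its `_loaded` list, and the demo import structure are hypothetical and exist only for this example.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Module-like object that imports the defining module on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported attribute name to the module that defines it.
        self._attr_to_module = {
            attr: module_name
            for module_name, attrs in import_structure.items()
            for attr in attrs
        }
        self._loaded = []  # import order, kept only to make the laziness visible

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        mapping = self.__dict__.get("_attr_to_module", {})
        if name not in mapping:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        module = importlib.import_module(mapping[name])
        self._loaded.append(mapping[name])
        value = getattr(module, name)
        setattr(self, name, value)  # cache so later lookups skip __getattr__
        return value


# Demo with standard-library modules standing in for model submodules.
lazy = LazyModule("lazy_demo", {"math": ["sqrt"], "json": ["dumps"]})
root = lazy.sqrt(9.0)  # resolves `sqrt` from `math` on first access
```

After the call, only `math` has been resolved through the lazy object; `json` is not touched until `lazy.dumps` is accessed.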