176 commits
d72c9a3
initial commit for ImageBind model
dg845 Sep 21, 2023
6be5464
add initial testing code for ImageBind model
dg845 Sep 21, 2023
190e727
Add config classes for remaining modalities (audio, depth, thermal, I…
dg845 Sep 22, 2023
3692190
Update ImageBindOutput with remaining modalities (audio, depth, therm…
dg845 Sep 22, 2023
4037f6a
Add embedding classes for image-like modalities (vision, audio, depth…
dg845 Sep 22, 2023
970dc5d
Implement IMU embedding class.
dg845 Sep 22, 2023
ffd1460
Add module to convert still images into video frames.
dg845 Sep 23, 2023
ee74943
Add implementation for shared model encoder blocks.
dg845 Sep 24, 2023
93ce319
Add key and value biases to ImageBindAttention.
dg845 Sep 24, 2023
c7968d6
Add ImageBind heads and postprocessors.
dg845 Sep 24, 2023
0000bbc
Update ImageBindModel.forward to compare images against any other mod…
dg845 Sep 24, 2023
a1bdbf7
Separate normalized embeddings into their own output field.
dg845 Sep 26, 2023
69fa517
Add initial tester/test classes for remaining modalities (audio, dept…
dg845 Sep 26, 2023
a8341e4
Create initial audio feature extractor based on ASTFeatureExtractor (…
dg845 Sep 26, 2023
ac926ad
Add image processing classes for remaining image-like modalities excl…
dg845 Sep 26, 2023
e151140
Add IMU feature extractor class declaration and add feature extractor…
dg845 Sep 26, 2023
789559a
Update ImageBindAudioFeatureExtractor to use ImageBind-specific audio…
dg845 Sep 28, 2023
84851a5
Add final dropout layer to ImageBindImuTransformer.
dg845 Sep 28, 2023
43016df
Fix typo
dg845 Sep 28, 2023
93d7749
Change model test parameters to be closer to ImageBind defaults.
dg845 Sep 28, 2023
1b4bb43
Update audio feature extractor to output batched and clipped audio.
dg845 Sep 30, 2023
d9a0a80
Add modeling support for batched and clipped vision and audio inputs.
dg845 Sep 30, 2023
b5d46cd
Update ImageBind image processor to always output video (batched and …
dg845 Oct 3, 2023
029d424
Merge branch 'main' into imagebind-model
dg845 Oct 13, 2023
a9d432c
Implement ImageBindDepthImageProcessor.
dg845 Oct 13, 2023
90543ce
Implement ImageBindImuFeatureExtractor.
dg845 Oct 16, 2023
8ce499b
Fix some modeling code bugs.
dg845 Oct 17, 2023
484cd3f
Move Image2Video logic into RGBDTPatchEmbedding.
dg845 Oct 17, 2023
284ffe5
Fix attention kv bias initialization bug.
dg845 Oct 17, 2023
c5d1e3b
Implement ImageBind conversion script.
dg845 Oct 17, 2023
4a8aaf5
Fix bugs in ImageBind conversion script.
dg845 Oct 24, 2023
06f9536
Fix conversion script test configs.
dg845 Oct 24, 2023
f691396
Fix ImageBindAudioEmbeddings.
dg845 Oct 24, 2023
ba64517
Fix num_patches calculation.
dg845 Oct 24, 2023
78e537d
Fix audio num_patches calculation in conversion script.
dg845 Oct 24, 2023
befcc26
Merge branch 'main' into imagebind-model
dg845 Nov 23, 2023
a55faed
All modalities embeddings
EduardoPach May 10, 2024
fa77a40
Improving implementation
EduardoPach May 10, 2024
5c8c223
Fix copies
EduardoPach May 10, 2024
bd3ac72
Improving conversion script
EduardoPach May 11, 2024
12bd91b
Removed tokenizer
EduardoPach May 13, 2024
ca6fa03
Forward working
EduardoPach May 13, 2024
c618bcc
Format off and on
EduardoPach May 13, 2024
835161c
Improvements on conversion script
EduardoPach May 13, 2024
f08dd8c
More improvements
EduardoPach May 13, 2024
18dcde0
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 13, 2024
460fb00
Trying to make things write
EduardoPach May 13, 2024
78ccd1f
Improving import and cos
EduardoPach May 13, 2024
6e8407d
Fix copies
EduardoPach May 13, 2024
a83bebe
ImageBindFeatureExtractor
EduardoPach May 14, 2024
0ee0902
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 17, 2024
7421c63
fix copies
EduardoPach May 17, 2024
8af30b1
Improving tests
EduardoPach May 21, 2024
99770c5
More improvements
EduardoPach May 21, 2024
8a59421
Fixing tests
EduardoPach May 21, 2024
3d3a273
Tests green
EduardoPach May 21, 2024
8fcf36c
Improving consistency
EduardoPach May 21, 2024
cfe9da6
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 22, 2024
de7f84d
Removed speech dependency
EduardoPach May 22, 2024
1c9b317
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 22, 2024
003ff10
Updated conversion script
EduardoPach May 22, 2024
a0ef219
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 22, 2024
df4c0e4
Improved ImageBindProcessor
EduardoPach May 22, 2024
8d055f1
ImageBindProcessor working
EduardoPach May 23, 2024
05ac8ba
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 23, 2024
c8ad793
Update docs and docstrings
EduardoPach May 23, 2024
5b39d85
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 23, 2024
97c4bd5
ImageBindFeatureExtractor tests
EduardoPach May 23, 2024
d9c6c84
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 23, 2024
987f404
ImageBindProcessor tests
EduardoPach May 23, 2024
709613c
Make tests green
EduardoPach May 23, 2024
9fdcce4
Improve feature extractor
EduardoPach May 23, 2024
d0f788a
fix style and copies
EduardoPach May 23, 2024
4d2dd20
fix style new
EduardoPach May 23, 2024
2f2b511
nits
EduardoPach May 23, 2024
45ce871
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 24, 2024
6fa3611
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 27, 2024
7f6684d
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach May 28, 2024
d04ab40
Update src/transformers/models/imagebind/__init__.py
EduardoPach Jun 11, 2024
bcd7626
Update tests/models/imagebind/test_modeling_imagebind.py
EduardoPach Jun 11, 2024
b0d5a9f
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jun 11, 2024
5a0a5ff
Fix tests
EduardoPach Jun 11, 2024
670c2f5
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jun 11, 2024
37d8f84
Fix consistency
EduardoPach Jun 11, 2024
a0639fb
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jun 13, 2024
8c5cdf5
Update src/transformers/models/imagebind/configuration_imagebind.py
EduardoPach Jun 17, 2024
0392b53
Addressed comments
EduardoPach Jun 17, 2024
48671d3
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jun 17, 2024
0ed167f
Update src/transformers/models/imagebind/processing_imagebind.py
EduardoPach Jun 17, 2024
4b112f0
Merge branch 'adding-imagebind' of https://github.com/EduardoPach/tra…
EduardoPach Jun 17, 2024
e6ffb8e
Fixed audio in processor
EduardoPach Jun 17, 2024
ad6bb42
Addressed more comments
EduardoPach Jun 18, 2024
ec8379d
Addressed more comments
EduardoPach Jun 18, 2024
53683a4
Added comments to reduce clips for audio and videos
EduardoPach Jun 18, 2024
dab1877
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jun 20, 2024
b74d808
Update ImageBindConfig
EduardoPach Jun 21, 2024
ae0b489
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jul 23, 2024
55bd10f
Added video functionality to ImageBindImageProcessor
EduardoPach Jul 29, 2024
c151d6b
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Jul 29, 2024
a9a5539
chore:add func and classes to get vid clips from user given paths
RUFFY-369 Aug 4, 2024
d1c33d0
chore:update uniform_chunk_sampling()
RUFFY-369 Aug 4, 2024
53fe080
chore:change chunk duration val and type
RUFFY-369 Aug 4, 2024
99306ab
chore:update uniform_temporal_subsample()
RUFFY-369 Aug 4, 2024
082be8b
chore:update video transforms and few nits
RUFFY-369 Aug 4, 2024
1d6c4ea
fix:bug in image processor call on video paths
RUFFY-369 Aug 4, 2024
229a779
fixed: math.ceil instead of int when getting clips from video
EduardoPach Aug 5, 2024
8bea22a
Fixed copies
EduardoPach Aug 5, 2024
64d6c38
chore:revert to original to test for unmatched outputs
RUFFY-369 Aug 6, 2024
558f544
chore:make transformers compliant and few nits
RUFFY-369 Aug 7, 2024
9314a57
style:make fixup
RUFFY-369 Aug 7, 2024
79c4089
fix:make fix copies
RUFFY-369 Aug 7, 2024
f64778d
chore:resolve necessary conflicts
RUFFY-369 Aug 7, 2024
8d717d0
Video is now matching
EduardoPach Aug 12, 2024
02cb2ab
Merge remote-tracking branch 'imagebind/adding-imagebind' into imageb…
RUFFY-369 Aug 24, 2024
4d0edbf
resolve merge/change conflicts by pull
RUFFY-369 Aug 24, 2024
bc8821f
chore:make everything similar about files
RUFFY-369 Aug 24, 2024
fbbb108
test:add image processor tests
RUFFY-369 Aug 26, 2024
4099c8c
fix:failing image processor tests
RUFFY-369 Aug 26, 2024
2d4cb59
chore:add contributor name for video output matching and image proces…
RUFFY-369 Aug 26, 2024
a283626
test:add Processor kwargs and its test
RUFFY-369 Aug 27, 2024
04a9e07
fix:ProcessorTesterMixin test failures
RUFFY-369 Aug 27, 2024
4b7f5a8
fix:test failure for len of input ids
RUFFY-369 Aug 27, 2024
e2f3064
chore:add custom image and audio kwargs class and some nits
RUFFY-369 Aug 29, 2024
030027d
Merge pull request #2 from RUFFY-369/imagebind_hf
EduardoPach Sep 1, 2024
c4f19bb
fix: style
EduardoPach Sep 2, 2024
8a53076
Merge remote-tracking branch 'upstream/main' into adding-imagebind
EduardoPach Sep 13, 2024
12b9abf
fix: copies and import
EduardoPach Sep 14, 2024
237954f
Update src/transformers/models/imagebind/processing_imagebind.py
RUFFY-369 Sep 30, 2024
43de0d7
Merge branch 'main' into adding-imagebind
RUFFY-369 Sep 30, 2024
6e6f581
chore:add suggested changes related to #31330
RUFFY-369 Sep 30, 2024
40a1170
style:make style;make quality
RUFFY-369 Sep 30, 2024
1bc9d74
chore:move assertions to modeling test file from ckpt conversion file…
RUFFY-369 Oct 1, 2024
3bf1476
style:make style
RUFFY-369 Oct 1, 2024
1d32a1d
chore:weights conversion file suggested changes
RUFFY-369 Oct 1, 2024
977179e
chore:add suggested changes for audio and images kwargs
RUFFY-369 Oct 1, 2024
3eec1eb
chore:typo changes
RUFFY-369 Oct 1, 2024
b284d4e
chore:remove use_square_size
RUFFY-369 Oct 1, 2024
fcb2fac
chore:add videos as input for processor as suggested
RUFFY-369 Oct 2, 2024
d7c1b70
chore:add suggested changes
RUFFY-369 Oct 2, 2024
2e7c000
chore:add suggested changes
RUFFY-369 Oct 2, 2024
fe32980
reverting previous config commit
RUFFY-369 Oct 2, 2024
eb1f17a
chore:decouple image_to_video from modeling as mentioned in suggested…
RUFFY-369 Oct 2, 2024
92a6ad1
chore:add more suggested changes
RUFFY-369 Oct 2, 2024
bc1b722
chore:refactoring _init_weights from suggested changes
RUFFY-369 Oct 2, 2024
8c8f563
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 2, 2024
6182b3e
chore:decouple build_attention_mask
RUFFY-369 Oct 2, 2024
be79290
chore:some more suggested changes
RUFFY-369 Oct 2, 2024
14f6cb5
chore: remove suggested changes
RUFFY-369 Oct 2, 2024
1b6716e
fix:test failures
RUFFY-369 Oct 5, 2024
ce49517
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 5, 2024
21f11cd
style:make style
RUFFY-369 Oct 5, 2024
ac95d27
chore: apply suggested changes
RUFFY-369 Oct 8, 2024
c2fb254
chore:address suggested changes
RUFFY-369 Oct 9, 2024
e853fc9
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 9, 2024
e0f741b
chore:suggested deprecate_kwarg for return_numpy
RUFFY-369 Oct 9, 2024
85337c7
chore:suggested nit for image_to_video
RUFFY-369 Oct 9, 2024
f9fae40
test:update atol due to observed flakyness
RUFFY-369 Oct 9, 2024
f878996
test:remove unwanted tests as they are already available with Process…
RUFFY-369 Oct 9, 2024
58e1c3a
chore: make suggested changes
RUFFY-369 Oct 11, 2024
e3353e5
chore:do nit suggested changes
RUFFY-369 Oct 11, 2024
76f99ab
test:add suggested assertion
RUFFY-369 Oct 11, 2024
17525ac
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 11, 2024
f893147
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 13, 2024
50e2ca3
chore:simplify weight conversion file with regex as suggested
RUFFY-369 Oct 14, 2024
3d3887b
style:make style
RUFFY-369 Oct 14, 2024
0951775
chore:remove unused func(from review suggestions)
RUFFY-369 Oct 14, 2024
e031e0d
chore: apply suggested changes
RUFFY-369 Oct 14, 2024
7ea5f59
chore: apply suggested changes
RUFFY-369 Oct 14, 2024
9d09258
chore: apply suggested changes
RUFFY-369 Oct 14, 2024
0adf14f
chore: apply suggested changes
RUFFY-369 Oct 14, 2024
40d50c9
chore: apply suggested changes
RUFFY-369 Oct 14, 2024
f8fa533
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 14, 2024
cfefa9b
chore:add suggested changes for single loop
RUFFY-369 Oct 14, 2024
30370f7
chore:apply suggested changes for abstract feature_size
RUFFY-369 Oct 15, 2024
106dfb0
chore:make few suggested changes
RUFFY-369 Oct 17, 2024
a637d59
Merge remote-tracking branch 'upstream/main' into adding-imagebind
RUFFY-369 Oct 17, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -649,6 +649,8 @@
title: GLPN
- local: model_doc/hiera
title: Hiera
- local: model_doc/imagebind
title: ImageBind
- local: model_doc/imagegpt
title: ImageGPT
- local: model_doc/levit
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -170,6 +170,7 @@ Flax), PyTorch, and/or TensorFlow.
| [IDEFICS](model_doc/idefics) | ✅ | ✅ | ❌ |
| [Idefics2](model_doc/idefics2) | ✅ | ❌ | ❌ |
| [Idefics3](model_doc/idefics3) | ✅ | ❌ | ❌ |
| [ImageBind](model_doc/imagebind) | ✅ | ❌ | ❌ |
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
141 changes: 141 additions & 0 deletions docs/source/en/model_doc/imagebind.md
@@ -0,0 +1,141 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ImageBind

## Overview

The ImageBind model was proposed in [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra.
ImageBind is a multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images.
For an input from any of these six modalities, it outputs an embedding of the same size, which can be used for cross-modal and multimodal tasks.

The abstract from the paper is the following:

*We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.*

This model was contributed by [EduardoPacheco](https://huggingface.co/EduardoPacheco), [ruffy369](https://huggingface.co/ruffy369), [dg845](https://huggingface.co/dg845), and [shehan97](https://huggingface.co/shehan97).
The original code can be found [here](https://github.com/facebookresearch/ImageBind).

## Usage tips

- ImageBind can be used for multimodal similarity and zero-shot tasks.
- Currently, only the vision (image and video), audio, and text modalities are supported.
- Use [`ImageBindProcessor`] to prepare inputs for all of the available modalities or for any pair of them.
- The `forward` method of [`ImageBindModel`] expects exactly one pair of modalities, one of which MUST be the vision modality.
- If you are only interested in a modality's embeddings, use the [`ImageBindModel`] `get_xxx_features` methods or the appropriate `ImageBindXxxModelWithProjection` class.
- ImageBind's vision and text encoders were frozen during training and are initialized from OpenCLIP ViT-H, so an application that already uses that model could add further modalities by including the other encoders.
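
The pairing rule in the tips above can be sketched as a small pre-flight check. The helper below is hypothetical and not part of `transformers`; it assumes the input names used in the example on this page (`pixel_values` for vision, `input_ids` for text, `input_features` for audio) and ignores auxiliary keys such as `attention_mask`.

```python
# Hypothetical helper, written only to illustrate the "one pair, one of them
# vision" rule. It is NOT an API of the transformers library.
VISION_KEY = "pixel_values"
# Non-vision modality inputs; exactly one must accompany the vision input.
OTHER_MODALITY_KEYS = ("input_ids", "input_features")


def check_forward_pair(provided_keys):
    """Return the (vision, other) input pair or raise ValueError."""
    provided = set(provided_keys)
    if VISION_KEY not in provided:
        raise ValueError("the vision input `pixel_values` is required")
    others = [key for key in OTHER_MODALITY_KEYS if key in provided]
    if len(others) != 1:
        raise ValueError(f"expected exactly one non-vision modality, got {others}")
    return VISION_KEY, others[0]


# Auxiliary keys such as `attention_mask` are ignored by the check.
pair = check_forward_pair(["pixel_values", "input_ids", "attention_mask"])
```

Whether the model itself validates its inputs this way is not shown here; the sketch only restates the constraint in code form.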

Here is an example of how to get embeddings for images, text, and audio (this example requires `torchaudio`):

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import ImageBindModel, ImageBindProcessor

ds = load_dataset("EduardoPacheco/imagebind-example-data", split="train")
images = ds["image"]
text = ds["text"]
audios = ds["audio"]  # Each entry is a dict with the keys `array` and `sampling_rate`
# Resample every clip to 16 kHz
audios = [
    torchaudio.functional.resample(
        torch.from_numpy(audio["array"]),
        orig_freq=audio["sampling_rate"],
        new_freq=16000,
    ).numpy()
    for audio in audios
]

model = ImageBindModel.from_pretrained("EduardoPacheco/imagebind-huge")
processor = ImageBindProcessor.from_pretrained("EduardoPacheco/imagebind-huge")

inputs = processor(text=text, images=images, audios=audios, padding=True, return_tensors="pt")

with torch.no_grad():
    audio_embeds = model.get_audio_features(input_features=inputs.input_features)
    image_embeds = model.get_image_features(pixel_values=inputs.pixel_values)
    text_embeds = model.get_text_features(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

# we can compute probs to use for retrieval or zero-shot workflows.
probs_image_text = (image_embeds @ text_embeds.T).softmax(dim=-1)
probs_text_audio = (text_embeds @ audio_embeds.T).softmax(dim=-1)
probs_image_audio = (image_embeds @ audio_embeds.T).softmax(dim=-1)
```
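
To make the last three lines of the example concrete, here is a dependency-free sketch of the same computation: a dot-product similarity matrix followed by a row-wise softmax. The vectors are hand-written toy values, not model outputs, and the helper names are made up for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_probs(queries, candidates):
    """Row i holds a distribution over candidates for query i."""
    return [softmax([dot(q, c) for c in candidates]) for q in queries]


# Toy embeddings (assumed values for illustration only).
image_embeds = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[0.9, 0.1], [0.2, 0.8]]

probs = retrieval_probs(image_embeds, text_embeds)
# Index of the most likely text for each image.
best_match = [row.index(max(row)) for row in probs]
```

Each row of `probs` sums to one, so taking the row-wise argmax gives the retrieved candidate for that query.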

## ImageBindConfig

[[autodoc]] ImageBindConfig
    - from_text_vision_configs

## ImageBindTextConfig

[[autodoc]] ImageBindTextConfig

## ImageBindVisionConfig

[[autodoc]] ImageBindVisionConfig

## ImageBindAudioConfig

[[autodoc]] ImageBindAudioConfig

## ImageBindImageProcessor

[[autodoc]] ImageBindImageProcessor
    - preprocess

## ImageBindFeatureExtractor

[[autodoc]] ImageBindFeatureExtractor

## ImageBindProcessor

[[autodoc]] ImageBindProcessor

## ImageBindModel

[[autodoc]] ImageBindModel
    - forward
    - get_text_features
    - get_image_features
    - get_audio_features

## ImageBindTextModel

[[autodoc]] ImageBindTextModel
    - forward

## ImageBindTextModelWithProjection

[[autodoc]] ImageBindTextModelWithProjection
    - forward

## ImageBindVisionModel

[[autodoc]] ImageBindVisionModel
    - forward


## ImageBindVisionModelWithProjection

[[autodoc]] ImageBindVisionModelWithProjection
    - forward

## ImageBindAudioModel

[[autodoc]] ImageBindAudioModel
    - forward

## ImageBindAudioModelWithProjection

[[autodoc]] ImageBindAudioModelWithProjection
    - forward
40 changes: 40 additions & 0 deletions src/transformers/__init__.py
@@ -481,6 +481,14 @@
"models.idefics": ["IdeficsConfig"],
"models.idefics2": ["Idefics2Config"],
"models.idefics3": ["Idefics3Config"],
"models.imagebind": [
"ImageBindAudioConfig",
"ImageBindConfig",
"ImageBindFeatureExtractor",
"ImageBindProcessor",
"ImageBindTextConfig",
"ImageBindVisionConfig",
],
"models.imagegpt": ["ImageGPTConfig"],
"models.informer": ["InformerConfig"],
"models.instructblip": [
@@ -1200,6 +1208,7 @@
_import_structure["models.idefics"].extend(["IdeficsImageProcessor"])
_import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"])
_import_structure["models.idefics3"].extend(["Idefics3ImageProcessor"])
_import_structure["models.imagebind"].extend(["ImageBindImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
@@ -2439,6 +2448,18 @@
"Idefics3Processor",
]
)
_import_structure["models.imagebind"].extend(
[
"ImageBindAudioModel",
"ImageBindAudioModelWithProjection",
"ImageBindModel",
"ImageBindPreTrainedModel",
"ImageBindTextModel",
"ImageBindTextModelWithProjection",
"ImageBindVisionModel",
"ImageBindVisionModelWithProjection",
]
)
_import_structure["models.imagegpt"].extend(
[
"ImageGPTForCausalImageModeling",
@@ -5337,6 +5358,14 @@
)
from .models.idefics2 import Idefics2Config
from .models.idefics3 import Idefics3Config
from .models.imagebind import (
ImageBindAudioConfig,
ImageBindConfig,
ImageBindFeatureExtractor,
ImageBindProcessor,
ImageBindTextConfig,
ImageBindVisionConfig,
)
from .models.imagegpt import ImageGPTConfig
from .models.informer import InformerConfig
from .models.instructblip import (
@@ -6094,6 +6123,7 @@
from .models.idefics import IdeficsImageProcessor
from .models.idefics2 import Idefics2ImageProcessor
from .models.idefics3 import Idefics3ImageProcessor
from .models.imagebind import ImageBindImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.layoutlmv2 import (
@@ -7136,6 +7166,16 @@
Idefics3PreTrainedModel,
Idefics3Processor,
)
from .models.imagebind import (
ImageBindAudioModel,
ImageBindAudioModelWithProjection,
ImageBindModel,
ImageBindPreTrainedModel,
ImageBindTextModel,
ImageBindTextModelWithProjection,
ImageBindVisionModel,
ImageBindVisionModelWithProjection,
)
from .models.imagegpt import (
ImageGPTForCausalImageModeling,
ImageGPTForImageClassification,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -116,6 +116,7 @@
idefics,
idefics2,
idefics3,
imagebind,
imagegpt,
informer,
instructblip,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -134,6 +134,7 @@
("idefics", "IdeficsConfig"),
("idefics2", "Idefics2Config"),
("idefics3", "Idefics3Config"),
("imagebind", "ImageBindConfig"),
("imagegpt", "ImageGPTConfig"),
("informer", "InformerConfig"),
("instructblip", "InstructBlipConfig"),
@@ -437,6 +438,7 @@
("idefics", "IDEFICS"),
("idefics2", "Idefics2"),
("idefics3", "Idefics3"),
("imagebind", "ImageBind"),
("imagegpt", "ImageGPT"),
("informer", "Informer"),
("instructblip", "InstructBLIP"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -63,6 +63,7 @@
("glpn", "GLPNFeatureExtractor"),
("groupvit", "CLIPFeatureExtractor"),
("hubert", "Wav2Vec2FeatureExtractor"),
("imagebind", "ImageBindFeatureExtractor"),
("imagegpt", "ImageGPTFeatureExtractor"),
("layoutlmv2", "LayoutLMv2FeatureExtractor"),
("layoutlmv3", "LayoutLMv3FeatureExtractor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -90,6 +90,7 @@
("idefics", ("IdeficsImageProcessor",)),
("idefics2", ("Idefics2ImageProcessor",)),
("idefics3", ("Idefics3ImageProcessor",)),
("imagebind", ("ImageBindImageProcessor",)),
("imagegpt", ("ImageGPTImageProcessor",)),
("instructblip", ("BlipImageProcessor",)),
("instructblipvideo", ("InstructBlipVideoImageProcessor",)),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -131,6 +131,7 @@
("idefics", "IdeficsModel"),
("idefics2", "Idefics2Model"),
("idefics3", "Idefics3Model"),
("imagebind", "ImageBindModel"),
("imagegpt", "ImageGPTModel"),
("informer", "InformerModel"),
("jamba", "JambaModel"),
@@ -1328,6 +1329,7 @@
("chinese_clip", "ChineseCLIPModel"),
("clip", "CLIPModel"),
("clipseg", "CLIPSegModel"),
("imagebind", "ImageBindModel"),
("siglip", "SiglipModel"),
]
)
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -66,6 +66,7 @@
("idefics", "IdeficsProcessor"),
("idefics2", "Idefics2Processor"),
("idefics3", "Idefics3Processor"),
("imagebind", "ImageBindProcessor"),
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("kosmos-2", "Kosmos2Processor"),
7 changes: 7 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -220,6 +220,13 @@
("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics2", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
(
"imagebind",
(
"CLIPTokenizer",
"CLIPTokenizerFast" if is_tokenizers_available() else None,
),
),
("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("instructblipvideo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
(
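
Note on the `tokenization_auto.py` hunk: the `"CLIPTokenizerFast" if is_tokenizers_available() else None` entry follows the pattern of the neighboring rows, where the fast tokenizer is registered only when its optional backend can be imported. A rough sketch of that pattern follows. `is_package_available` and `tokenizer_entry` are simplified stand-ins written for this note; the actual `is_tokenizers_available` in `transformers` may check more than this.

```python
import importlib.util


def is_package_available(name):
    """Stand-in availability check: True if a top-level package can be found."""
    return importlib.util.find_spec(name) is not None


def tokenizer_entry(slow_name, fast_name, backend="tokenizers"):
    """Build a (slow, fast) pair, dropping the fast class if its backend is missing."""
    return (slow_name, fast_name if is_package_available(backend) else None)


# `json` ships with Python, so the fast name is kept here.
with_backend = tokenizer_entry("CLIPTokenizer", "CLIPTokenizerFast", backend="json")
# A package name that should not exist, so the fast name is replaced by None.
without_backend = tokenizer_entry(
    "CLIPTokenizer", "CLIPTokenizerFast", backend="surely_missing_backend_pkg_0x1"
)
```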
30 changes: 30 additions & 0 deletions src/transformers/models/imagebind/__init__.py
@@ -0,0 +1,30 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_imagebind import *
    from .feature_extraction_imagebind import *
    from .image_processing_imagebind import *
    from .modeling_imagebind import *
    from .processing_imagebind import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
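
For readers unfamiliar with the lazy-import pattern used in this `__init__.py`, the idea is that a submodule is imported only when one of its names is first accessed. The class below is a simplified illustration of that idea, not the actual `_LazyModule` implementation; `LazyNamespace` and its attribute map are invented for this sketch.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Module-like object that imports the backing module on first attribute access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps an exported attribute name to the module that defines it.
        self._attr_to_module = dict(attr_to_module)

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails, i.e. before caching.
        try:
            module_name = self._attr_to_module[attr]
        except KeyError:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Standard-library modules stand in for the model's submodules here.
lazy = LazyNamespace("lazy_demo", {"sqrt": "math", "dumps": "json"})
```

Accessing `lazy.sqrt` triggers the import of `math` and caches the function on the object, so the cost is paid once and only for names that are actually used.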