diff --git a/README.md b/README.md index e65d512defc4..7a88c12b99ec 100644 --- a/README.md +++ b/README.md @@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. 
**[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/README_es.md b/README_es.md index d42750237fc2..5e6a03688870 100644 --- a/README_es.md +++ b/README_es.md @@ -358,6 +358,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. 
Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/README_hd.md b/README_hd.md index e4dd69943f4c..84b24e6707cc 100644 --- a/README_hd.md +++ b/README_hd.md @@ -332,6 +332,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा। 1. 
**[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा। 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce से) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. 
द्वाराअनुसंधान पत्र [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) के साथ जारी किया गया diff --git a/README_ja.md b/README_ja.md index ea8de35ff3de..2e1f9283ac55 100644 --- a/README_ja.md +++ b/README_ja.md @@ -392,6 +392,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI から) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever から公開された研究論文: [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 1. 
**[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce から) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. から公開された研究論文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) diff --git a/README_ko.md b/README_ko.md index c88d62a9ad9f..ff4a6f929fe2 100644 --- a/README_ko.md +++ b/README_ko.md @@ -307,6 +307,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. 
**[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI 에서) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 의 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 논문과 함께 발표했습니다. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce 에서 제공)은 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.의 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)논문과 함께 발표했습니다. diff --git a/README_zh-hans.md b/README_zh-hans.md index e6e6ab59cd06..4afd7161cafc 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -331,6 +331,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。 1. 
**[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (来自 OpenAI) 伴随论文 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 由 Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 发布。 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. 
**[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (来自 Salesforce) 伴随论文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 由 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index 21cbe14be804..bfeb64487f5f 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -343,6 +343,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. 
**[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index d7d593b21e62..39aba210f0f9 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -529,6 +529,8 @@ title: FocalNet - local: model_doc/glpn title: GLPN + - local: model_doc/imagebind + title: ImageBind - local: model_doc/imagegpt title: ImageGPT - local: model_doc/levit diff --git a/docs/source/en/model_doc/imagebind.md b/docs/source/en/model_doc/imagebind.md new file mode 100644 index 000000000000..66784c31e165 --- /dev/null +++ b/docs/source/en/model_doc/imagebind.md @@ -0,0 +1,97 @@ + + +# ImageBind + +## Overview + +The ImageBind model was proposed in [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 
+ImageBind is a multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images. +For any input from these six modalities, it outputs an embedding of the same size that can be used for cross-modal and multimodal tasks. + +The abstract from the paper is the following: + +*We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.* + +Tips: + + + +This model was contributed by [dg845](https://huggingface.co/dg845) and [shehan97](https://huggingface.co/shehan97). +The original code can be found [here](https://github.com/facebookresearch/ImageBind).
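+Because all six modalities land in one embedding space, cross-modal retrieval reduces to cosine similarity between vectors of the same dimensionality. The following is a minimal, self-contained sketch of that retrieval step — the toy vectors are made-up stand-ins for real ImageBind outputs, and no model is exercised here:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Index of the candidate embedding closest to the query.

    Modality does not matter: a text query can score audio or image
    candidates, since all embeddings share the same space.
    """
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


# Toy 4-dim embeddings standing in for a text query and two audio clips.
text_emb = [1.0, 0.0, 0.0, 0.0]
audio_embs = [
    [0.9, 0.1, 0.0, 0.0],  # close to the text embedding
    [0.0, 0.0, 1.0, 0.0],  # unrelated
]
print(retrieve(text_emb, audio_embs))  # → 0
```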
+ + +## ImageBindConfig + +[[autodoc]] ImageBindConfig + - from_text_vision_configs + +## ImageBindTextConfig + +[[autodoc]] ImageBindTextConfig + +## ImageBindVisionConfig + +[[autodoc]] ImageBindVisionConfig + +## ImageBindTokenizer + +[[autodoc]] ImageBindTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## ImageBindTokenizerFast + +[[autodoc]] ImageBindTokenizerFast + +## ImageBindImageProcessor + +[[autodoc]] ImageBindImageProcessor + - preprocess + +## ImageBindFeatureExtractor + +[[autodoc]] ImageBindFeatureExtractor + +## ImageBindProcessor + +[[autodoc]] ImageBindProcessor + +## ImageBindModel + +[[autodoc]] ImageBindModel + - forward + - get_text_features + - get_image_features + +## ImageBindTextModel + +[[autodoc]] ImageBindTextModel + - forward + +## ImageBindTextModelWithProjection + +[[autodoc]] ImageBindTextModelWithProjection + - forward + +## ImageBindVisionModelWithProjection + +[[autodoc]] ImageBindVisionModelWithProjection + - forward + + +## ImageBindVisionModel + +[[autodoc]] ImageBindVisionModel + - forward \ No newline at end of file diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index e09752f5f39c..c183df75ecd8 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -383,6 +383,17 @@ "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP", "IdeficsConfig", ], + "models.imagebind": [ + "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ImageBindAudioConfig", + "ImageBindConfig", + "ImageBindDepthConfig", + "ImageBindImuConfig", + "ImageBindOnnxConfig", + "ImageBindTextConfig", + "ImageBindThermalConfig", + "ImageBindVisionConfig", + ], "models.imagegpt": ["IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ImageGPTConfig"], "models.informer": ["INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "InformerConfig"], "models.instructblip": [ @@ -889,6 +900,7 @@ _import_structure["models.gpt_neox"].append("GPTNeoXTokenizerFast") 
_import_structure["models.gpt_neox_japanese"].append("GPTNeoXJapaneseTokenizer") _import_structure["models.herbert"].append("HerbertTokenizerFast") + _import_structure["models.imagebind"].append("ImageBindTokenizerFast") _import_structure["models.layoutlm"].append("LayoutLMTokenizerFast") _import_structure["models.layoutlmv2"].append("LayoutLMv2TokenizerFast") _import_structure["models.layoutlmv3"].append("LayoutLMv3TokenizerFast") @@ -999,6 +1011,7 @@ _import_structure["models.fuyu"].extend(["FuyuImageProcessor", "FuyuProcessor"]) _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"]) _import_structure["models.idefics"].extend(["IdeficsImageProcessor"]) + _import_structure["models.imagebind"].extend(["ImageBindFeatureExtractor", "ImageBindImageProcessor"]) _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"]) _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"]) _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"]) @@ -2041,6 +2054,25 @@ "IdeficsProcessor", ] ) + _import_structure["models.imagebind"].extend( + [ + "IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST", + "ImageBindAudioModel", + "ImageBindAudioModelWithProjection", + "ImageBindDepthModel", + "ImageBindDepthModelWithProjection", + "ImageBindImuModel", + "ImageBindImuModelWithProjection", + "ImageBindModel", + "ImageBindPreTrainedModel", + "ImageBindTextModel", + "ImageBindTextModelWithProjection", + "ImageBindThermalModel", + "ImageBindThermalModelWithProjection", + "ImageBindVisionModel", + "ImageBindVisionModelWithProjection", + ] + ) _import_structure["models.imagegpt"].extend( [ "IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4622,6 +4654,17 @@ IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP, IdeficsConfig, ) + from .models.imagebind import ( + IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP, + ImageBindAudioConfig, + ImageBindConfig, + 
ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindOnnxConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, + ) from .models.imagegpt import IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, ImageGPTConfig from .models.informer import INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, InformerConfig from .models.instructblip import ( @@ -5093,6 +5136,7 @@ from .models.gpt_neox import GPTNeoXTokenizerFast from .models.gpt_neox_japanese import GPTNeoXJapaneseTokenizer from .models.herbert import HerbertTokenizerFast + from .models.imagebind import ImageBindTokenizerFast from .models.layoutlm import LayoutLMTokenizerFast from .models.layoutlmv2 import LayoutLMv2TokenizerFast from .models.layoutlmv3 import LayoutLMv3TokenizerFast @@ -5179,6 +5223,7 @@ from .models.fuyu import FuyuImageProcessor, FuyuProcessor from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor from .models.idefics import IdeficsImageProcessor + from .models.imagebind import ImageBindFeatureExtractor, ImageBindImageProcessor from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor @@ -6053,6 +6098,23 @@ IdeficsPreTrainedModel, IdeficsProcessor, ) + from .models.imagebind import ( + IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST, + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) from .models.imagegpt import ( IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST, ImageGPTForCausalImageModeling, diff --git a/src/transformers/models/__init__.py 
b/src/transformers/models/__init__.py index 997ee82b4324..f223710d6cdf 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -106,6 +106,7 @@ hubert, ibert, idefics, + imagebind, imagegpt, informer, instructblip, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index 78a33270e7ac..9e5b16be3193 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -114,6 +114,7 @@ ("hubert", "HubertConfig"), ("ibert", "IBertConfig"), ("idefics", "IdeficsConfig"), + ("imagebind", "ImageBindConfig"), ("imagegpt", "ImageGPTConfig"), ("informer", "InformerConfig"), ("instructblip", "InstructBlipConfig"), @@ -333,6 +334,7 @@ ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("idefics", "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("imagebind", "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("informer", "INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("instructblip", "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -553,6 +555,7 @@ ("hubert", "Hubert"), ("ibert", "I-BERT"), ("idefics", "IDEFICS"), + ("imagebind", "ImageBind"), ("imagegpt", "ImageGPT"), ("informer", "Informer"), ("instructblip", "InstructBLIP"), diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index 395875dfa14b..07d0d8c86a57 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -61,6 +61,7 @@ ("glpn", "GLPNFeatureExtractor"), ("groupvit", "CLIPFeatureExtractor"), ("hubert", "Wav2Vec2FeatureExtractor"), + ("imagebind", "ImageBindFeatureExtractor"), ("imagegpt", "ImageGPTFeatureExtractor"), ("layoutlmv2", "LayoutLMv2FeatureExtractor"), ("layoutlmv3", "LayoutLMv3FeatureExtractor"), diff --git 
a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py index 168b7a5dff3a..ab875ea332a1 100644 --- a/src/transformers/models/auto/image_processing_auto.py +++ b/src/transformers/models/auto/image_processing_auto.py @@ -69,6 +69,7 @@ ("glpn", "GLPNImageProcessor"), ("groupvit", "CLIPImageProcessor"), ("idefics", "IdeficsImageProcessor"), + ("imagebind", "ImageBindImageProcessor"), ("imagegpt", "ImageGPTImageProcessor"), ("instructblip", "BlipImageProcessor"), ("layoutlmv2", "LayoutLMv2ImageProcessor"), diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index a62880d32696..c8042b8b7896 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -110,6 +110,7 @@ ("hubert", "HubertModel"), ("ibert", "IBertModel"), ("idefics", "IdeficsModel"), + ("imagebind", "ImageBindModel"), ("imagegpt", "ImageGPTModel"), ("informer", "InformerModel"), ("jukebox", "JukeboxModel"), @@ -1074,6 +1075,7 @@ ("chinese_clip", "ChineseCLIPModel"), ("clip", "CLIPModel"), ("clipseg", "CLIPSegModel"), + ("imagebind", "ImageBindModel"), ] ) diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py index e1b3bac2de05..3e4447aebdb4 100644 --- a/src/transformers/models/auto/processing_auto.py +++ b/src/transformers/models/auto/processing_auto.py @@ -60,6 +60,7 @@ ("groupvit", "CLIPProcessor"), ("hubert", "Wav2Vec2Processor"), ("idefics", "IdeficsProcessor"), + ("imagebind", "ImageBindProcessor"), ("instructblip", "InstructBlipProcessor"), ("kosmos-2", "Kosmos2Processor"), ("layoutlmv2", "LayoutLMv2Processor"), diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 04a1bc77e655..5c1831532067 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -181,6 
+181,13 @@ ("hubert", ("Wav2Vec2CTCTokenizer", None)), ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), ("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)), + ( + "imagebind", + ( + "ImageBindTokenizer", + "ImageBindTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("jukebox", ("JukeboxTokenizer", None)), ( diff --git a/src/transformers/models/imagebind/__init__.py b/src/transformers/models/imagebind/__init__.py new file mode 100644 index 000000000000..d6d328d9822e --- /dev/null +++ b/src/transformers/models/imagebind/__init__.py @@ -0,0 +1,163 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_speech_available, + is_tokenizers_available, + is_torch_available, + is_vision_available, +) + + +_import_structure = { + "configuration_imagebind": [ + "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ImageBindAudioConfig", + "ImageBindConfig", + "ImageBindDepthConfig", + "ImageBindImuConfig", + "ImageBindOnnxConfig", + "ImageBindTextConfig", + "ImageBindThermalConfig", + "ImageBindVisionConfig", + ], + "feature_extraction_imagebind": ["ImageBindImuFeatureExtractor"], + "processing_imagebind": ["ImageBindProcessor"], + "tokenization_imagebind": ["ImageBindTokenizer"], +} + +# TODO: add dependencies for other modalities, if necessary + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["tokenization_imagebind_fast"] = ["ImageBindTokenizerFast"] + +try: + if not is_vision_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["feature_extraction_imagebind"].extend(["ImageBindFeatureExtractor"]) + _import_structure["image_processing_imagebind"] = ["ImageBindImageProcessor", "ImageBindDepthImageProcessor", "ImageBindThermalImageProcessor"] + +try: + if not is_speech_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["feature_extraction_imagebind"].extend(["ImageBindAudioFeatureExtractor"]) + + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_imagebind"] = [ + "IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST", + "ImageBindAudioModel", + "ImageBindAudioModelWithProjection", + "ImageBindDepthModel", + "ImageBindDepthModelWithProjection", + "ImageBindImuModel", + "ImageBindImuModelWithProjection", + 
"ImageBindModel", + "ImageBindPreTrainedModel", + "ImageBindTextModel", + "ImageBindTextModelWithProjection", + "ImageBindThermalModel", + "ImageBindThermalModelWithProjection", + "ImageBindVisionModel", + "ImageBindVisionModelWithProjection", + ] + +if TYPE_CHECKING: + from .configuration_imagebind import ( + IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP, + ImageBindAudioConfig, + ImageBindConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindOnnxConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, + ) + from .feature_extraction_imagebind import ImageBindImuFeatureExtractor + from .processing_imagebind import ImageBindProcessor + from .tokenization_imagebind import ImageBindTokenizer + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .tokenization_imagebind_fast import ImageBindTokenizerFast + + try: + if not is_vision_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .feature_extraction_imagebind import ImageBindFeatureExtractor + from .image_processing_imagebind import ImageBindImageProcessor, ImageBindDepthImageProcessor, ImageBindThermalImageProcessor + + try: + if not is_speech_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .feature_extraction_imagebind import ImageBindAudioFeatureExtractor + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_imagebind import ( + IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST, + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + 
ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) \ No newline at end of file diff --git a/src/transformers/models/imagebind/configuration_imagebind.py b/src/transformers/models/imagebind/configuration_imagebind.py new file mode 100644 index 000000000000..3a9355bba583 --- /dev/null +++ b/src/transformers/models/imagebind/configuration_imagebind.py @@ -0,0 +1,1178 @@ +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" ImageBind model configuration""" + + +import copy +import os +from collections import OrderedDict +from typing import TYPE_CHECKING, Any, Mapping, Optional, Union + + +if TYPE_CHECKING: + from ...processing_utils import ProcessorMixin + from ...utils import TensorType + +from ...configuration_utils import PretrainedConfig +from ...onnx import OnnxConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/config.json", +} + + +class ImageBindTextConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindTextModel`]. 
It is used to instantiate an ImageBind + text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the text encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 49408): + Vocabulary size of the ImageBind text model. Defines the number of different tokens that can be represented by + the `input_ids` passed when calling [`ImageBindModel`]. + hidden_size (`int`, *optional*, defaults to 1024): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 4096): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind text model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 24): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder. + max_position_embeddings (`int`, *optional*, defaults to 77): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
+ layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `False`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `14.2857`): + The initial value of the `logit_scale` parameter for the text component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `True`): + Whether the `logit_scale` is learnable or fixed.
+ + Example: + + ```python + >>> from transformers import ImageBindTextConfig, ImageBindTextModel + + >>> # Initializing a ImageBindTextConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindTextConfig() + + >>> # Initializing a ImageBindTextModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindTextModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_text_model" + + def __init__( + self, + vocab_size=49408, + hidden_size=1024, + intermediate_size=4096, + projection_dim=1024, + num_hidden_layers=24, + num_attention_heads=16, + max_position_embeddings=77, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=False, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=14.2857, + learnable_logit_scale=True, + pad_token_id=1, + bos_token_id=0, + eos_token_id=2, + **kwargs, + ): + super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = 
cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the text config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["text_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindVisionModel`]. It is used to instantiate an + ImageBind vision encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the vision encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 1280): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 5120): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind vision model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder.
+ num_channels (`int`, *optional*, defaults to 3): + The number of channels in the input images. + num_frames (`int`, *optional*, defaults to 2): + If using video (spatiotemporal) input, the number of video frames in the spatiotemporal data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int` or `Tuple[int]`, *optional*, defaults to `(2, 14, 14)`): + The size (resolution) of each spatiotemporal patch. If `patch_size` is an int, spatial patches of shape + `(patch_size, patch_size)` will be used; otherwise, `patch_size` should be a tuple of shape + `(time_patch_size, height_patch_size, width_patch_size)`. + stride (`int` or `Tuple[int]`, *optional*, defaults to `(2, 14, 14)`): + The stride of the image patch embedding. If `stride` is an int, spatial strides of shape + `(stride, stride)` will be used; otherwise, `stride` should be a tuple of shape + `(time_stride, height_stride, width_stride)`. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `False`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers.
+ initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `None`): + The initial value of the `logit_scale` parameter for the vision component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + + ```python + >>> from transformers import ImageBindVisionConfig, ImageBindVisionModel + + >>> # Initializing a ImageBindVisionConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindVisionConfig() + + >>> # Initializing a ImageBindVisionModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindVisionModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "imagebind_vision_model" + + def __init__( + self, + hidden_size=1280, + intermediate_size=5120, + projection_dim=1024, + num_hidden_layers=32, + num_attention_heads=16, + num_channels=3, + num_frames=2, + image_size=224, + patch_size=(2, 14, 14), + stride=(2, 14, 14), + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=False, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=None, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.num_frames = num_frames + 
self.patch_size = patch_size + self.stride = stride + self.image_size = image_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the vision config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["vision_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindAudioConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindAudioModel`]. It is used to instantiate a + ImageBind audio encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the audio encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. 
+ + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind audio model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_mel_bins (`int`, *optional*, defaults to 128): + The number of frequency bins in the log-mel spectrogram. + target_len (`int`, *optional*, defaults to 204): + TODO + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input audio data. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 10): + The stride of the patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.1): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `20.0`): + The initial value of the `logit_scale` parameter for the audio component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindAudioConfig, ImageBindAudioModel + + >>> # Initializing a ImageBindAudioConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindAudioConfig() + + >>> # Initializing a ImageBindAudioModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindAudioModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_audio_model" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=12, + num_mel_bins=128, + target_len=204, + num_channels=1, + patch_size=16, + stride=10, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.1, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=20.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads +
self.num_mel_bins = num_mel_bins + self.target_len = target_len + self.num_channels = num_channels + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the audio config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["audio_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindDepthConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindDepthModel`]. It is used to instantiate a + ImageBind depth encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the depth encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. 
+ + Args: + hidden_size (`int`, *optional*, defaults to 384): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind depth model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 8): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input depth data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the depth patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 16): + The stride of the depth patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `5.0`): + The initial value of the `logit_scale` parameter for the depth component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindDepthConfig, ImageBindDepthModel + + >>> # Initializing a ImageBindDepthConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindDepthConfig() + + >>> # Initializing a ImageBindDepthModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindDepthModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_depth_model" + + def __init__( + self, + hidden_size=384, + intermediate_size=1536, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=8, + num_channels=1, + image_size=224, + patch_size=16, + stride=16, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels =
num_channels + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the depth config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["depth_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindThermalConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindThermalModel`]. It is used to instantiate an + ImageBind thermal encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the thermal encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information.
+ + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind thermal model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input thermal data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the thermal patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 16): + The stride of the thermal patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `10.0`): + The initial value of the `logit_scale` parameter for the thermal component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindThermalConfig, ImageBindThermalModel + + >>> # Initializing a ImageBindThermalConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindThermalConfig() + + >>> # Initializing a ImageBindThermalModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindThermalModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_thermal_model" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=12, + num_channels=1, + image_size=224, + patch_size=16, + stride=16, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=10.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads +
self.num_channels = num_channels + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the thermal config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["thermal_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindImuConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindImuModel`]. It is used to instantiate an + ImageBind IMU encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the IMU encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information.
+ + Args: + hidden_size (`int`, *optional*, defaults to 512): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 2048): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind IMU model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 6): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 8): + Number of attention heads for each attention layer in the Transformer encoder. + input_shape (`Tuple[int]`, *optional*, defaults to `(6, 2000)`): + The shape of the input IMU data. + kernel_size (`int`, *optional*, defaults to 8): + The kernel size of the 2D convolution layers. (TODO) + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.7): + The dropout probability for the DropPath (stochastic) regularization layers.
+ final_dropout (`float`, *optional*, defaults to 0.5): + The dropout probability for the dropout layer that occurs after the post layer norm and before the linear + projection is applied. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `5.0`): + The initial value of the `logit_scale` parameter for the IMU component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindImuConfig, ImageBindImuModel + + >>> # Initializing an ImageBindImuConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindImuConfig() + + >>> # Initializing an ImageBindImuModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindImuModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + def __init__( + self, + hidden_size=512, + intermediate_size=2048, + projection_dim=1024, + num_hidden_layers=6, + num_attention_heads=8, + input_shape=(6, 2000), + kernel_size=8, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.7, + final_dropout=0.5, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads =
num_attention_heads + self.input_shape = input_shape + self.kernel_size = kernel_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.final_dropout = final_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the imu config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["imu_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindConfig(PretrainedConfig): + r""" + [`ImageBindConfig`] is the configuration class to store the configuration of a [`ImageBindModel`]. It is used to instantiate + an ImageBind model according to the specified arguments, defining the model configs for each modality (text, vision, audio, + depth, thermal and IMU). Instantiating a configuration with the defaults will yield a similar configuration to that of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs.
Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindTextConfig`]. + vision_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindVisionConfig`]. + audio_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindAudioConfig`]. + depth_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindDepthConfig`]. + thermal_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindThermalConfig`]. + imu_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindImuConfig`]. + projection_dim (`int`, *optional*, defaults to 1024): + Dimensionality of the modality projection layers. + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import ImageBindConfig, ImageBindModel + + >>> # Initializing an ImageBindConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindConfig() + + >>> # Initializing an ImageBindModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize an ImageBindConfig from an ImageBindTextConfig and an ImageBindVisionConfig + >>> from transformers import ImageBindTextConfig, ImageBindVisionConfig + + >>> # Initializing ImageBindText and ImageBindVision configurations + >>> config_text = ImageBindTextConfig() + >>> config_vision = ImageBindVisionConfig() + + >>> config = ImageBindConfig.from_text_vision_configs(config_text, config_vision) + ```""" + + model_type = "imagebind" + is_composition = True + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + depth_config=None, + thermal_config=None, + imu_config=None, + projection_dim=1024, + **kwargs, + ): + # If `_config_dict` arguments exist, we use them for backward compatibility.
+ # We pop out these attributes before calling `super().__init__` to avoid them being saved (which causes a lot + of confusion!). + text_config_dict = kwargs.pop("text_config_dict", None) + vision_config_dict = kwargs.pop("vision_config_dict", None) + audio_config_dict = kwargs.pop("audio_config_dict", None) + depth_config_dict = kwargs.pop("depth_config_dict", None) + thermal_config_dict = kwargs.pop("thermal_config_dict", None) + imu_config_dict = kwargs.pop("imu_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be the same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. + if text_config_dict is not None: + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = ImageBindTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but are different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `ImageBindTextConfig`. The " + f'value `text_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`.
+ text_config.update(_text_config_dict) + + if vision_config_dict is not None: + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. + _vision_config_dict = ImageBindVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but are different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize `ImageBindVisionConfig`. " + f'The value `vision_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) + + if audio_config_dict is not None: + if audio_config is None: + audio_config = {} + + # This is the complete result when using `audio_config_dict`. + _audio_config_dict = ImageBindAudioConfig(**audio_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _audio_config_dict: + _audio_config_dict["id2label"] = { + str(key): value for key, value in _audio_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_audio_config_dict` and `audio_config` but are different.
+ for key, value in _audio_config_dict.items(): + if key in audio_config and value != audio_config[key] and key not in ["transformers_version"]: + # If specified in `audio_config_dict` + if key in audio_config_dict: + message = ( + f"`{key}` is found in both `audio_config_dict` and `audio_config` but with different " + f'values. The value `audio_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`audio_config_dict` is provided which will be used to initialize `ImageBindAudioConfig`. " + f'The value `audio_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `audio_config` with the ones in `_audio_config_dict`. + audio_config.update(_audio_config_dict) + + if depth_config_dict is not None: + if depth_config is None: + depth_config = {} + + # This is the complete result when using `depth_config_dict`. + _depth_config_dict = ImageBindDepthConfig(**depth_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _depth_config_dict: + _depth_config_dict["id2label"] = { + str(key): value for key, value in _depth_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_depth_config_dict` and `depth_config` but are different. + for key, value in _depth_config_dict.items(): + if key in depth_config and value != depth_config[key] and key not in ["transformers_version"]: + # If specified in `depth_config_dict` + if key in depth_config_dict: + message = ( + f"`{key}` is found in both `depth_config_dict` and `depth_config` but with different " + f'values. The value `depth_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`depth_config_dict` is provided which will be used to initialize `ImageBindDepthConfig`. " + f'The value `depth_config["{key}"]` will be overridden.'
+ ) + logger.warning(message) + + # Update all values in `depth_config` with the ones in `_depth_config_dict`. + depth_config.update(_depth_config_dict) + + if thermal_config_dict is not None: + if thermal_config is None: + thermal_config = {} + + # This is the complete result when using `thermal_config_dict`. + _thermal_config_dict = ImageBindThermalConfig(**thermal_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _thermal_config_dict: + _thermal_config_dict["id2label"] = { + str(key): value for key, value in _thermal_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_thermal_config_dict` and `thermal_config` but are different. + for key, value in _thermal_config_dict.items(): + if key in thermal_config and value != thermal_config[key] and key not in ["transformers_version"]: + # If specified in `thermal_config_dict` + if key in thermal_config_dict: + message = ( + f"`{key}` is found in both `thermal_config_dict` and `thermal_config` but with different " + f'values. The value `thermal_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`thermal_config_dict` is provided which will be used to initialize `ImageBindThermalConfig`. " + f'The value `thermal_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `thermal_config` with the ones in `_thermal_config_dict`. + thermal_config.update(_thermal_config_dict) + + if imu_config_dict is not None: + if imu_config is None: + imu_config = {} + + # This is the complete result when using `imu_config_dict`.
+ _imu_config_dict = ImageBindImuConfig(**imu_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _imu_config_dict: + _imu_config_dict["id2label"] = { + str(key): value for key, value in _imu_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_imu_config_dict` and `imu_config` but are different. + for key, value in _imu_config_dict.items(): + if key in imu_config and value != imu_config[key] and key not in ["transformers_version"]: + # If specified in `imu_config_dict` + if key in imu_config_dict: + message = ( + f"`{key}` is found in both `imu_config_dict` and `imu_config` but with different " + f'values. The value `imu_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`imu_config_dict` is provided which will be used to initialize `ImageBindImuConfig`. " + f'The value `imu_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `imu_config` with the ones in `_imu_config_dict`. + imu_config.update(_imu_config_dict) + + if text_config is None: + text_config = {} + logger.info("`text_config` is `None`. Initializing the `ImageBindTextConfig` with default values.") + + if vision_config is None: + vision_config = {} + logger.info("`vision_config` is `None`. Initializing the `ImageBindVisionConfig` with default values.") + + if audio_config is None: + audio_config = {} + logger.info("`audio_config` is `None`. Initializing the `ImageBindAudioConfig` with default values.") + + if depth_config is None: + depth_config = {} + logger.info("`depth_config` is `None`. Initializing the `ImageBindDepthConfig` with default values.") + + if thermal_config is None: + thermal_config = {} + logger.info("`thermal_config` is `None`. Initializing the `ImageBindThermalConfig` with default values.") + + if imu_config is None: + imu_config = {} + logger.info("`imu_config` is `None`.
Initializing the `ImageBindImuConfig` with default values.") + + self.text_config = ImageBindTextConfig(**text_config) + self.vision_config = ImageBindVisionConfig(**vision_config) + self.audio_config = ImageBindAudioConfig(**audio_config) + self.depth_config = ImageBindDepthConfig(**depth_config) + self.thermal_config = ImageBindThermalConfig(**thermal_config) + self.imu_config = ImageBindImuConfig(**imu_config) + + self.projection_dim = projection_dim + self.initializer_factor = 1.0 + + @classmethod + def from_text_vision_configs(cls, text_config: ImageBindTextConfig, vision_config: ImageBindVisionConfig, **kwargs): + r""" + Instantiate a [`ImageBindConfig`] (or a derived class) from ImageBind text model configuration and ImageBind vision model + configuration. + + Returns: + [`ImageBindConfig`]: An instance of a configuration object + """ + + return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) + + def to_dict(self): + """ + Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
+ + Returns: + `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, + """ + output = copy.deepcopy(self.__dict__) + output["text_config"] = self.text_config.to_dict() + output["vision_config"] = self.vision_config.to_dict() + output["audio_config"] = self.audio_config.to_dict() + output["depth_config"] = self.depth_config.to_dict() + output["thermal_config"] = self.thermal_config.to_dict() + output["imu_config"] = self.imu_config.to_dict() + output["model_type"] = self.__class__.model_type + return output + +# TODO: add other modalities +class ImageBindOnnxConfig(OnnxConfig): + @property + def inputs(self) -> Mapping[str, Mapping[int, str]]: + return OrderedDict( + [ + ("input_ids", {0: "batch", 1: "sequence"}), + ("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}), + ("attention_mask", {0: "batch", 1: "sequence"}), + ] + ) + + @property + def outputs(self) -> Mapping[str, Mapping[int, str]]: + return OrderedDict( + [ + ("logits_per_image", {0: "batch"}), + ("logits_per_text", {0: "batch"}), + ("text_embeds", {0: "batch"}), + ("image_embeds", {0: "batch"}), + ] + ) + + @property + def atol_for_validation(self) -> float: + return 1e-4 + + def generate_dummy_inputs( + self, + processor: "ProcessorMixin", + batch_size: int = -1, + seq_length: int = -1, + framework: Optional["TensorType"] = None, + ) -> Mapping[str, Any]: + text_input_dict = super().generate_dummy_inputs( + processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework + ) + image_input_dict = super().generate_dummy_inputs( + processor.feature_extractor, batch_size=batch_size, framework=framework + ) + return {**text_input_dict, **image_input_dict} + + @property + def default_onnx_opset(self) -> int: + return 14 \ No newline at end of file diff --git a/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py b/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py new file 
mode 100644 index 000000000000..5e0dbb70f5f0 --- /dev/null +++ b/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py @@ -0,0 +1,452 @@ +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import torch +# from imagebind import load + +from transformers import ( + ImageBindAudioConfig, + ImageBindConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindModel, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) + +SPATIOTEMPORAL_MODALITY_LIST = ["vision"] +IMAGELIKE_MODALITY_LIST = ["vision", "audio", "depth", "thermal"] +MODALITY_LIST = ["text", *IMAGELIKE_MODALITY_LIST, "imu"] + + +# Holds configs common to all test ImageBind encoders +IMAGEBIND_TEST_TRUNK_CONFIG = { + "hidden_size": 32, + "projection_dim": 32, + "num_hidden_layers": 5, + "num_attention_heads": 4, + "intermediate_size": 37, + "dropout": 0.0, + "layer_norm_eps": 1e-6, +} + +IMAGEBIND_TEST_TEXT_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "vocab_size": 99, + "max_position_embeddings": 512, + "logit_scale_init_value": 14.2857, + "learnable_logit_scale": True, +} + +IMAGEBIND_TEST_VISION_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": (2, 2, 2), + "stride": (2, 2, 2), + "num_channels": 3, + "num_frames": 2, + "logit_scale_init_value": None, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_AUDIO_CONFIG = { + 
**IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 16, + "stride": 10, + "num_channels": 1, + "num_mel_bins": 128, + "target_len": 204, + "add_kv_bias": True, + "drop_path_rate": 0.1, + "logit_scale_init_value": 20.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_DEPTH_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 2, + "stride": 2, + "num_channels": 1, + "add_kv_bias": True, + "logit_scale_init_value": 5.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_THERMAL_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 2, + "stride": 2, + "num_channels": 1, + "add_kv_bias": True, + "logit_scale_init_value": 10.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_IMU_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "input_shape": (6, 30), + "kernel_size": 2, + "add_kv_bias": True, + "drop_path_rate": 0.7, + "logit_scale_init_value": 5.0, + "learnable_logit_scale": False, +} + + +def get_modality_config(config, modality): + if modality == "text": + return config.text_config + elif modality == "vision": + return config.vision_config + elif modality == "audio": + return config.audio_config + elif modality == "depth": + return config.depth_config + elif modality == "thermal": + return config.thermal_config + elif modality == "imu": + return config.imu_config + else: + raise ValueError(f"Modality {modality} is not currently supported.") + + +def convert_embeddings(config, model_state_dict): + # Create position_ids buffer for text model + text_position_ids_buffer = torch.arange(config.text_config.max_position_embeddings).expand((1, -1)) + model_state_dict["text_model.embeddings.position_ids"] = text_position_ids_buffer + + # Create position_ids buffer for IMU model + imu_num_patches = config.imu_config.input_shape[1] // config.imu_config.kernel_size + imu_num_positions = imu_num_patches + 1 + imu_position_ids_buffer = torch.arange(imu_num_positions).expand((1, -1)) +
model_state_dict["imu_model.embeddings.position_ids"] = imu_position_ids_buffer + + for modality in ["text", "imu"]: + # Convert position embeddings for text and IMU modalities + pos_embed_key = f"modality_preprocessors.{modality}.pos_embed" + pos_embed = model_state_dict[pos_embed_key] + converted_pos_embed = pos_embed.squeeze() + model_state_dict[pos_embed_key] = converted_pos_embed + + for modality in IMAGELIKE_MODALITY_LIST: + # Convert position embeddings for image-like modalities + pos_embed_key = f"modality_preprocessors.{modality}.pos_embedding_helper.pos_embed" + pos_embed = model_state_dict[pos_embed_key] + converted_pos_embed = pos_embed.squeeze() + model_state_dict[pos_embed_key] = converted_pos_embed + + # Create position_ids buffer for image-like modalities + modality_config = get_modality_config(config, modality) + # Recalculate num_positions + if modality in SPATIOTEMPORAL_MODALITY_LIST: + patches_along_time_dim = modality_config.num_frames // modality_config.patch_size[0] + patches_along_spatial_dims = (modality_config.image_size // modality_config.patch_size[1]) ** 2 + num_patches = patches_along_spatial_dims * patches_along_time_dim + elif modality == "audio": + patch_size = modality_config.patch_size + stride = modality_config.stride + patches_along_mel_dim = ((modality_config.num_mel_bins - patch_size) // stride) + 1 + patches_along_frame_dim = ((modality_config.target_len - patch_size) // stride) + 1 + num_patches = patches_along_mel_dim * patches_along_frame_dim + else: + num_patches = (modality_config.image_size // modality_config.patch_size) ** 2 + num_positions = num_patches + 1 + position_ids_buffer = torch.arange(num_positions).expand((1, -1)) + model_state_dict[f"{modality}_model.embeddings.position_ids"] = position_ids_buffer + + for modality in IMAGELIKE_MODALITY_LIST + ["imu"]: + # Convert class embeddings + class_embed_key = f"modality_preprocessors.{modality}.cls_token" + class_embed = model_state_dict[class_embed_key] +
converted_class_embed = class_embed.squeeze() + model_state_dict[class_embed_key] = converted_class_embed + + +def convert_attention(config, model_state_dict): + for modality in MODALITY_LIST: + old_prefix = f"modality_trunks.{modality}.blocks" + new_prefix = f"{modality}_model.encoder.layers" + modality_config = get_modality_config(config, modality) + for i in range(modality_config.num_hidden_layers): + attn_weight_key = f"{old_prefix}.{i}.attn.in_proj_weight" + attn_bias_key = f"{old_prefix}.{i}.attn.in_proj_bias" + attn_weight = model_state_dict[attn_weight_key] + attn_bias = model_state_dict[attn_bias_key] + + # Split up the attention projections/bias into q, k, v projections/bias + q_proj, k_proj, v_proj = attn_weight.chunk(3, dim=0) + q_proj_bias, k_proj_bias, v_proj_bias = attn_bias.chunk(3, dim=0) + + model_state_dict[f"{new_prefix}.{i}.self_attn.q_proj.weight"] = q_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.q_proj.bias"] = q_proj_bias + + model_state_dict[f"{new_prefix}.{i}.self_attn.k_proj.weight"] = k_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.k_proj.bias"] = k_proj_bias + + model_state_dict[f"{new_prefix}.{i}.self_attn.v_proj.weight"] = v_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.v_proj.bias"] = v_proj_bias + + +def map_preprocessor_keys(prefix="modality_preprocessors"): + mapping = {} + keys_to_remove = [] + + # Text preprocessor + mapping[f"{prefix}.text.token_embedding.weight"] = "text_model.embeddings.token_embedding.weight" + mapping[f"{prefix}.text.pos_embed"] = "text_model.embeddings.position_embedding.weight" + + # NOTE: no need to map causal attention mask buffer + keys_to_remove.append("modality_preprocessors.text.mask") + + # Image-like modalities common + for modality in IMAGELIKE_MODALITY_LIST: + mapping[f"{prefix}.{modality}.cls_token"] = f"{modality}_model.embeddings.class_embedding" + mapping[f"{prefix}.{modality}.pos_embedding_helper.pos_embed"] = f"{modality}_model.embeddings.position_embedding.weight" +
+ # Vision preprocessor specific + mapping[f"{prefix}.vision.rgbt_stem.proj.1.weight"] = "vision_model.embeddings.patch_embedding.weight" + + # Audio preprocessor specific + mapping[f"{prefix}.audio.rgbt_stem.proj.weight"] = "audio_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.audio.rgbt_stem.norm_layer.weight"] = "audio_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.audio.rgbt_stem.norm_layer.bias"] = "audio_model.embeddings.norm_layer.bias" + + # Depth preprocessor specific + mapping[f"{prefix}.depth.depth_stem.proj.weight"] = "depth_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.depth.depth_stem.norm_layer.weight"] = "depth_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.depth.depth_stem.norm_layer.bias"] = "depth_model.embeddings.norm_layer.bias" + + # Thermal preprocessor specific + mapping[f"{prefix}.thermal.rgbt_stem.proj.weight"] = "thermal_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.thermal.rgbt_stem.norm_layer.weight"] = "thermal_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.thermal.rgbt_stem.norm_layer.bias"] = "thermal_model.embeddings.norm_layer.bias" + + # IMU preprocessor + mapping[f"{prefix}.imu.cls_token"] = "imu_model.embeddings.class_embedding" + mapping[f"{prefix}.imu.pos_embed"] = "imu_model.embeddings.position_embedding.weight" + mapping[f"{prefix}.imu.imu_stem.proj.weight"] = "imu_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.imu.imu_stem.norm_layer.weight"] = "imu_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.imu.imu_stem.norm_layer.bias"] = "imu_model.embeddings.norm_layer.bias" + + return mapping, keys_to_remove + + +def map_transformer_keys(config, old_prefix, new_prefix): + mapping = {} + keys_to_remove = [] + + for i in range(config.num_hidden_layers): + # NOTE: q, k, v proj/bias are added to the state dict with the correct names in convert_attention + keys_to_remove.append(f"{old_prefix}.{i}.attn.in_proj_weight") + 
+        keys_to_remove.append(f"{old_prefix}.{i}.attn.in_proj_bias")
+
+        mapping[f"{old_prefix}.{i}.attn.out_proj.weight"] = f"{new_prefix}.{i}.self_attn.out_proj.weight"
+        mapping[f"{old_prefix}.{i}.attn.out_proj.bias"] = f"{new_prefix}.{i}.self_attn.out_proj.bias"
+
+        mapping[f"{old_prefix}.{i}.norm_1.weight"] = f"{new_prefix}.{i}.layer_norm1.weight"
+        mapping[f"{old_prefix}.{i}.norm_1.bias"] = f"{new_prefix}.{i}.layer_norm1.bias"
+
+        mapping[f"{old_prefix}.{i}.mlp.fc1.weight"] = f"{new_prefix}.{i}.mlp.fc1.weight"
+        mapping[f"{old_prefix}.{i}.mlp.fc1.bias"] = f"{new_prefix}.{i}.mlp.fc1.bias"
+        mapping[f"{old_prefix}.{i}.mlp.fc2.weight"] = f"{new_prefix}.{i}.mlp.fc2.weight"
+        mapping[f"{old_prefix}.{i}.mlp.fc2.bias"] = f"{new_prefix}.{i}.mlp.fc2.bias"
+
+        mapping[f"{old_prefix}.{i}.norm_2.weight"] = f"{new_prefix}.{i}.layer_norm2.weight"
+        mapping[f"{old_prefix}.{i}.norm_2.bias"] = f"{new_prefix}.{i}.layer_norm2.bias"
+
+        if config.add_kv_bias:
+            mapping[f"{old_prefix}.{i}.attn.bias_k"] = f"{new_prefix}.{i}.self_attn.k_bias"
+            mapping[f"{old_prefix}.{i}.attn.bias_v"] = f"{new_prefix}.{i}.self_attn.v_bias"
+
+    return mapping, keys_to_remove
+
+
+def get_encoder_key_mapping(config, prefix="modality_trunks"):
+    mapping = {}
+    keys_to_remove = []
+
+    # 1. Handle any pre-transformer layers, if available.
+
+    # Vision specific
+    mapping["modality_trunks.vision.pre_transformer_layer.0.weight"] = "vision_model.pre_layernorm.weight"
+    mapping["modality_trunks.vision.pre_transformer_layer.0.bias"] = "vision_model.pre_layernorm.bias"
+
+    # 2. Map transformer trunk keys
+    for modality in MODALITY_LIST:
+        old_prefix = f"{prefix}.{modality}.blocks"
+        new_prefix = f"{modality}_model.encoder.layers"
+        modality_config = get_modality_config(config, modality)
+        transformer_mapping, transformer_keys_to_remove = map_transformer_keys(modality_config, old_prefix, new_prefix)
+        mapping.update(transformer_mapping)
+        keys_to_remove.extend(transformer_keys_to_remove)
+
+    return mapping, keys_to_remove
+
+
+def map_transformer_head_keys(prefix="modality_heads"):
+    mapping = {}
+    keys_to_remove = []
+
+    # Text final layer norm
+    mapping[f"{prefix}.text.proj.0.weight"] = "text_model.final_layer_norm.weight"
+    mapping[f"{prefix}.text.proj.0.bias"] = "text_model.final_layer_norm.bias"
+
+    for modality in IMAGELIKE_MODALITY_LIST + ["imu"]:
+        mapping[f"{prefix}.{modality}.0.weight"] = f"{modality}_model.post_layernorm.weight"
+        mapping[f"{prefix}.{modality}.0.bias"] = f"{modality}_model.post_layernorm.bias"
+
+    # Modality heads
+    mapping[f"{prefix}.text.proj.1.weight"] = "text_projection.weight"
+    for modality in IMAGELIKE_MODALITY_LIST:
+        if modality == "vision":
+            mapping[f"{prefix}.{modality}.2.weight"] = "visual_projection.weight"
+        else:
+            mapping[f"{prefix}.{modality}.2.weight"] = f"{modality}_projection.weight"
+    mapping[f"{prefix}.imu.3.weight"] = "imu_projection.weight"
+
+    return mapping, keys_to_remove
+
+
+def map_postprocessor_keys(prefix="modality_postprocessors"):
+    mapping = {}
+    keys_to_remove = []
+
+    for modality in ["text", "audio", "depth", "thermal", "imu"]:
+        mapping[f"{prefix}.{modality}.1.log_logit_scale"] = f"{modality}_postprocessor.log_logit_scale"
+
+    return mapping, keys_to_remove
+
+
+def get_key_mapping(config):
+    mapping = {}
+    keys_to_remove = []
+
+    # 1. Map preprocessor keys
+    preprocessor_mapping, preprocessor_keys_to_remove = map_preprocessor_keys(prefix="modality_preprocessors")
+    mapping.update(preprocessor_mapping)
+    keys_to_remove.extend(preprocessor_keys_to_remove)
+
+    # 2. Map transformer keys
+    encoder_mapping, encoder_keys_to_remove = get_encoder_key_mapping(config, prefix="modality_trunks")
+    mapping.update(encoder_mapping)
+    keys_to_remove.extend(encoder_keys_to_remove)
+
+    # 3. Map transformer head keys
+    head_mapping, head_keys_to_remove = map_transformer_head_keys(prefix="modality_heads")
+    mapping.update(head_mapping)
+    keys_to_remove.extend(head_keys_to_remove)
+
+    # 4. Map postprocessor keys
+    postprocessor_mapping, postprocessor_keys_to_remove = map_postprocessor_keys(prefix="modality_postprocessors")
+    mapping.update(postprocessor_mapping)
+    keys_to_remove.extend(postprocessor_keys_to_remove)
+
+    return mapping, keys_to_remove
+
+
+def rename_state_dict(state_dict, keys_to_modify, keys_to_remove):
+    model_state_dict = {}
+    for key, value in state_dict.items():
+        if key in keys_to_remove:
+            continue
+
+        if key in keys_to_modify:
+            new_key = keys_to_modify[key]
+            model_state_dict[new_key] = value
+        else:
+            model_state_dict[key] = value
+    return model_state_dict
+
+
+def convert_imagebind_checkpoint(
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    config_path=None,
+    repo_id=None,
+    use_test_config=False,
+    safe_serialization=False,
+):
+    """
+    Copy/paste/tweak model's weights to transformers design.
+    """
+    if config_path is not None:
+        config = ImageBindConfig.from_pretrained(config_path)
+    elif use_test_config:
+        config = ImageBindConfig(
+            text_config=IMAGEBIND_TEST_TEXT_CONFIG,
+            vision_config=IMAGEBIND_TEST_VISION_CONFIG,
+            audio_config=IMAGEBIND_TEST_AUDIO_CONFIG,
+            depth_config=IMAGEBIND_TEST_DEPTH_CONFIG,
+            thermal_config=IMAGEBIND_TEST_THERMAL_CONFIG,
+            imu_config=IMAGEBIND_TEST_IMU_CONFIG,
+            projection_dim=32,
+        )
+    else:
+        # The default config corresponds to the original ImageBind model.
+        config = ImageBindConfig()
+
+    hf_model = ImageBindModel(config)
+
+    # Original ImageBind checkpoint is a PyTorch state dict
+    model_state_dict = torch.load(checkpoint_path, map_location="cpu")
+
+    # Fix embedding shapes
+    convert_embeddings(config, model_state_dict)
+    # Convert attention parameters to transformers
+    convert_attention(config, model_state_dict)
+
+    keys_to_modify, keys_to_remove = get_key_mapping(config)
+    keys_to_remove = set(keys_to_remove)
+    hf_state_dict = rename_state_dict(model_state_dict, keys_to_modify, keys_to_remove)
+
+    hf_model.load_state_dict(hf_state_dict)
+
+    hf_model.save_pretrained(pytorch_dump_folder_path, safe_serialization=safe_serialization)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        hf_model.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to ImageBind checkpoint")
+    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
+    parser.add_argument("--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub.")
+    parser.add_argument("--test", action="store_true", help="Whether to use the test config for ImageBind models.")
+    parser.add_argument("--safe_serialization", action="store_true", help="Whether to save the model using `safetensors`.")
+
+    args = parser.parse_args()
+
+    convert_imagebind_checkpoint(
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.config_path,
+        args.push_to_hub,
+        args.test,
+        args.safe_serialization,
+    )
diff --git a/src/transformers/models/imagebind/feature_extraction_imagebind.py b/src/transformers/models/imagebind/feature_extraction_imagebind.py
new file mode 100644
index 000000000000..02b23aab046e
--- /dev/null
+++ b/src/transformers/models/imagebind/feature_extraction_imagebind.py
@@ -0,0 +1,470 @@
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Feature extractor class for ImageBind."""
+
+
+import warnings
+from typing import List, Optional, Union
+
+import numpy as np
+import torch
+import torchaudio.compliance.kaldi as ta_kaldi
+
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ...feature_extraction_utils import BatchFeature
+from ...utils import PaddingStrategy, TensorType, logging
+from .image_processing_imagebind import ImageBindImageProcessor
+
+
+logger = logging.get_logger(__name__)
+
+
+def valid_batched_clipped_audio(raw_speech):
+    """
+    Determines whether raw mono-channel audio input (or any other 1D data) is batched and clipped. The following
+    conditions will be recognized as valid audio:
+
+    - unbatched: `List[float]`, `np.ndarray` (`ndim=1`)
+    - batched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+    - batched and clipped: `List[List[List[float]]]`, `List[List[np.ndarray]]` (`ndim=1`), `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+    """
+    valid_audio = False
+    if isinstance(raw_speech, np.ndarray) and (1 <= len(raw_speech.shape) <= 3):
+        # unbatched, batched, or batched and clipped np.ndarray
+        valid_audio = True
+    elif isinstance(raw_speech, (list, tuple)):
+        if isinstance(raw_speech[0], np.ndarray) and (1 <= len(raw_speech[0].shape) <= 2):
+            # batched or batched and clipped List[np.ndarray]
+            valid_audio = True
+        elif isinstance(raw_speech[0], float):
+            # unbatched List[float]
+            valid_audio = True
+        elif isinstance(raw_speech[0], (list, tuple)):
+            if isinstance(raw_speech[0][0], np.ndarray) and (len(raw_speech[0][0].shape) == 1):
+                # batched and clipped List[List[np.ndarray]]
+                valid_audio = True
+            elif isinstance(raw_speech[0][0], (float, list, tuple)):
+                # batched List[List[float]], batched and clipped List[List[List[float]]]
+                valid_audio = True
+    return valid_audio
+
+
+def batch_and_clip_ndarray(array, data_dim=1, dtype=np.float32):
+    """
+    Turns a possibly nested list of np.ndarrays into a batched and clipped output of type `List[List[np.ndarray]]`.
+    """
+    if isinstance(array, (list, tuple)) and isinstance(array[0], (list, tuple)) and isinstance(array[0][0], np.ndarray):
+        if array[0][0].ndim == data_dim:
+            return [[base_array.astype(dtype=dtype) for base_array in clip] for clip in array]
+        else:
+            raise ValueError(
+                f"For `List[List[np.ndarray]]` inputs the internal `np.ndarray`s are expected to have dimension"
+                f" {data_dim} but got dimension {array[0][0].ndim}"
+            )
+    elif isinstance(array, (list, tuple)) and isinstance(array[0], np.ndarray):
+        if array[0].ndim == data_dim + 1:
+            return [[np.asarray(base_array, dtype=dtype) for base_array in clip] for clip in array]
+        elif array[0].ndim == data_dim:
+            return [[base_array.astype(dtype=dtype) for base_array in array]]
+        else:
+            raise ValueError(
+                f"For `List[np.ndarray]` inputs the internal `np.ndarray`s are expected to have dimension"
+                f" {data_dim} or {data_dim + 1} but got dimension {array[0].ndim}"
+            )
+    elif isinstance(array, np.ndarray):
+        if array.ndim == data_dim + 2:
+            return [[np.asarray(raw_input, dtype=dtype) for raw_input in clip] for clip in array]
+        elif array.ndim == data_dim + 1:
+            return [[np.asarray(raw_input, dtype=dtype) for raw_input in array]]
+        elif array.ndim == data_dim:
+            return [[array.astype(dtype=dtype)]]
+        else:
+            raise ValueError(
+                f"`np.ndarray` inputs are expected to have dimension in"
+                f" `[{data_dim}, {data_dim + 1}, {data_dim + 2}]` but instead got {array.ndim}"
+            )
+    else:
+        raise ValueError(f"Could not make batched and clipped audio from {array}")
+
+
+class ImageBindFeatureExtractor(ImageBindImageProcessor):
+    def __init__(self, *args, **kwargs) -> None:
+        warnings.warn(
+            "The class ImageBindFeatureExtractor is deprecated and will be removed in version 5 of Transformers."
+            " Please use ImageBindImageProcessor instead.",
+            FutureWarning,
+        )
+        super().__init__(*args, **kwargs)
+
+
+# NOTE: ImageBind follows the Audio Spectrogram Transformer for audio processing
+# Based on ASTFeatureExtractor
+class ImageBindAudioFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs an ImageBind audio feature extractor, based on the Audio Spectrogram Transformer (AST) feature
+    extractor.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    This class extracts mel-filter bank features from raw speech using TorchAudio, pads/truncates them to a fixed
+    length and normalizes them using a mean and standard deviation.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 1):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 16000):
+            The sampling rate at which the audio files should be digitalized, expressed in hertz (Hz).
+        num_mel_bins (`int`, *optional*, defaults to 128):
+            Number of Mel-frequency bins.
+        max_length (`int`, *optional*, defaults to 204):
+            Maximum length to which to pad/truncate the extracted features.
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value to pad with when applying the padding strategy defined by the `padding` argument to
+            [`ImageBindAudioFeatureExtractor.__call__`].
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether or not to normalize the log-Mel features using `mean` and `std`.
+        mean (`float`, *optional*, defaults to -4.268):
+            The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.
+        std (`float`, *optional*, defaults to 9.138):
+            The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation
+            by default.
+        return_attention_mask (`bool`, *optional*, defaults to `False`):
+            Whether or not [`~ImageBindAudioFeatureExtractor.__call__`] should return `attention_mask`.
+    """
+
+    model_input_names = ["input_features", "attention_mask"]
+
+    def __init__(
+        self,
+        feature_size=1,
+        sampling_rate=16000,
+        num_mel_bins=128,
+        max_length=204,
+        padding_value=0.0,
+        do_normalize=True,
+        mean=-4.268,
+        std=9.138,
+        return_attention_mask=False,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+        self.num_mel_bins = num_mel_bins
+        self.max_length = max_length
+        self.do_normalize = do_normalize
+        self.mean = mean
+        self.std = std
+        self.return_attention_mask = return_attention_mask
+
+    def _extract_fbank_features(
+        self,
+        waveform: np.ndarray,
+        max_length: int,
+    ) -> np.ndarray:
+        """
+        Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
+        and hence the waveform should not be normalized before feature extraction.
+        """
+        # waveform = waveform * (2**15)  # Kaldi compliance: 16-bit signed integers
+        # Mean center the waveform
+        waveform -= waveform.mean()
+        waveform = torch.from_numpy(waveform).unsqueeze(0)
+        fbank = ta_kaldi.fbank(
+            waveform,
+            htk_compat=True,
+            sample_frequency=self.sampling_rate,
+            use_energy=False,
+            window_type="hanning",
+            num_mel_bins=self.num_mel_bins,
+            dither=0.0,
+            frame_shift=10,
+        )
+
+        n_frames = fbank.shape[0]
+        difference = max_length - n_frames
+
+        # pad or truncate, depending on difference
+        if difference > 0:
+            pad_module = torch.nn.ZeroPad2d((0, 0, 0, difference))
+            fbank = pad_module(fbank)
+        elif difference < 0:
+            fbank = fbank[0:max_length, :]
+
+        fbank = fbank.numpy()
+
+        return fbank
+
+    def normalize(self, input_values: np.ndarray) -> np.ndarray:
+        return (input_values - self.mean) / (self.std * 2)
+
+    def __call__(
+        self,
+        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]], List[List[List[float]]]],
+        sampling_rate: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of numpy
+                arrays or a (possibly nested) list of float values. The supported input types are as follows:
+
+                - unbatched: `List[float]`, `np.ndarray` (`ndim=1`)
+                - batched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+                - batched with clips: `List[List[List[float]]]`, `List[List[np.ndarray]]` (`ndim=1`), `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+
+                The input will always be interpreted as mono channel audio, not stereo, i.e. a single float per timestep.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of lists of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        if not valid_batched_clipped_audio(raw_speech):
+            raise ValueError(
+                f"Only unbatched, batched, and batched and clipped mono-channel audio is supported for input to {self}"
+            )
+
+        # Handle the cases where there are no np.ndarrays in raw_speech
+        if isinstance(raw_speech, (list, tuple)) and isinstance(raw_speech[0], float):
+            raw_speech = [[np.asarray(raw_speech, dtype=np.float32)]]
+        elif isinstance(raw_speech, (list, tuple)) and isinstance(raw_speech[0], (list, tuple)):
+            if isinstance(raw_speech[0][0], float):
+                # List[List[float]]
+                raw_speech = [[np.asarray(audio, dtype=np.float32) for audio in raw_speech]]
+            elif isinstance(raw_speech[0][0], (list, tuple)):
+                # List[List[List[float]]]
+                raw_speech = [[np.asarray(audio, dtype=np.float32) for audio in clip] for clip in raw_speech]
+
+        # always return batched and clipped audio of type `List[List[np.ndarray]]`
+        raw_speech = batch_and_clip_ndarray(raw_speech, data_dim=1, dtype=np.float32)
+
+        # extract fbank features and pad/truncate to max_length
+        features = [
+            [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in clip]
+            for clip in raw_speech
+        ]
+
+        # convert into BatchFeature
+        padded_inputs = BatchFeature({"input_features": features})
+
+        # make sure spectrograms are in array format
+        input_values = padded_inputs.get("input_features")
+        if isinstance(input_values[0][0], list):
+            padded_inputs["input_features"] = [
+                [np.asarray(feature, dtype=np.float32) for feature in clip] for clip in input_values
+            ]
+
+        # normalization
+        if self.do_normalize:
+            padded_inputs["input_features"] = [
+                [self.normalize(feature) for feature in clip] for clip in padded_inputs["input_features"]
+            ]
+
+        if return_tensors is not None:
+            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
+
+        return padded_inputs
+
+
+class ImageBindImuFeatureExtractor(SequenceFeatureExtractor):
+    """
+    Constructs an ImageBind IMU feature extractor.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    This class takes in raw IMU time series data, converts it to a standard sampling rate, and pads/truncates it to a
+    fixed length.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 6):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 200):
+            The sampling rate at which the IMU data should be digitalized, expressed in hertz (Hz).
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value to pad with when applying the padding strategy defined by the `padding` argument to
+            [`ImageBindImuFeatureExtractor.__call__`].
+        imu_len_in_s (`float`, *optional*, defaults to 10):
+            Length in seconds to which the extracted features will be padded/truncated.
+        return_attention_mask (`bool`, *optional*, defaults to `False`):
+            Whether or not [`~ImageBindImuFeatureExtractor.__call__`] should return `attention_mask`.
+    """
+
+    model_input_names = ["input_features", "attention_mask"]
+
+    def __init__(
+        self,
+        feature_size=6,
+        sampling_rate=200,
+        padding_value=0.0,
+        imu_len_in_s=10,
+        return_attention_mask=False,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+
+        self.imu_len_in_s = imu_len_in_s
+        self.return_attention_mask = return_attention_mask
+
+    def __call__(
+        self,
+        raw_imu: Union[np.ndarray, List[np.ndarray], List[List[float]], List[List[List[float]]]],
+        sampling_rate: Optional[int] = None,
+        padding: Union[bool, str, PaddingStrategy] = "max_length",
+        max_length: Optional[int] = None,
+        truncation: bool = True,
+        pad_to_multiple_of: Optional[int] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ):
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_imu (`np.ndarray`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of numpy
+                arrays or a (possibly nested) list of float values. The supported input types are as follows:
+
+                - unbatched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+                - batched: `List[List[List[float]]]`, `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+
+                The input will always be interpreted as a multiple-channel time series signal.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_imu` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `'max_length'`):
+                Select a strategy to pad the input `raw_imu` sequences (according to the model's padding side and
+                padding index) among:
+
+                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+                  sequence is provided).
+                - `'max_length'` (default): Pad to a maximum length specified with the argument `max_length` or to the
+                  maximum acceptable input length for the model if that argument is not provided.
+                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
+                  lengths).
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see above).
+            truncation (`bool`, *optional*, defaults to `True`):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            pad_to_multiple_of (`int`, *optional*):
+                If set will pad the sequence to a multiple of the provided value.
+
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+            return_attention_mask (`bool`, *optional*):
+                Whether to return the attention mask. If left to the default, will return the attention mask according
+                to the specific feature_extractor's default.
+
+                [What are attention masks?](../glossary#attention-mask)
+
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of lists of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided `raw_imu` input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], float):
+            raise ValueError(
+                "The expected IMU input is a multichannel (rather than single channel) time series, so `List[float]`"
+                " inputs are not accepted."
+            )
+
+        # Handle nested list inputs
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], (list, tuple)):
+            if isinstance(raw_imu[0][0], float):
+                # List[List[float]] -> unbatched IMU input
+                raw_imu = [np.asarray(raw_imu, dtype=np.float32)]
+            elif isinstance(raw_imu[0][0], (list, tuple)):
+                # List[List[List[float]]] -> batched IMU input
+                raw_imu = [np.asarray(imu, dtype=np.float32) for imu in raw_imu]
+
+        # Handle inputs with ndarrays
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], np.ndarray):
+            if raw_imu[0].ndim == 1:
+                # Unbatched IMU input
+                raw_imu = [np.asarray(raw_imu, dtype=np.float32)]
+            elif raw_imu[0].ndim != 2:
+                raise ValueError(
+                    f"For `List[np.ndarray]` inputs expected the internal arrays to have dim 1 or 2, but got"
+                    f" {raw_imu[0].ndim}"
+                )
+
+        if isinstance(raw_imu, np.ndarray):
+            if raw_imu.ndim == 2:
+                # Unbatched IMU input
+                raw_imu = [raw_imu.astype(np.float32)]
+            elif raw_imu.ndim == 3:
+                # Batched IMU input
+                raw_imu = [np.asarray(imu, dtype=np.float32) for imu in raw_imu]
+            else:
+                raise ValueError(
+                    f"For `np.ndarray` inputs expected the array to have dim 2 or 3, but got {raw_imu.ndim}"
+                )
+
+        # raw_imu should be of form `List[np.ndarray]` where raw_imu[0].ndim == 2
+        # convert into BatchFeature
+        batched_imu = BatchFeature({"input_features": raw_imu})
+
+        # Pad/truncate batched features. `imu_len_in_s` is in seconds, so convert it to a number of timesteps.
+        padded_inputs = self.pad(
+            batched_imu,
+            padding=padding,
+            max_length=max_length if max_length is not None else int(self.imu_len_in_s * self.sampling_rate),
+            truncation=truncation,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+
+        # Convert attention_mask to correct format
+        attention_mask = padded_inputs.get("attention_mask")
+        if attention_mask is not None:
+            padded_inputs["attention_mask"] = [np.asarray(array, dtype=np.int32) for array in attention_mask]
+
+        # Convert tensors if desired
+        if return_tensors is not None:
+            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
+
+        return padded_inputs
diff --git a/src/transformers/models/imagebind/image_processing_imagebind.py b/src/transformers/models/imagebind/image_processing_imagebind.py
new file mode 100644
index 000000000000..a9d7ddc638b7
--- /dev/null
+++ b/src/transformers/models/imagebind/image_processing_imagebind.py
@@ -0,0 +1,1058 @@
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for ImageBind."""
+
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import (
+    center_crop,
+    convert_to_rgb,
+    get_resize_output_image_size,
+    normalize,
+    rescale,
+    resize,
+    to_channel_dimension_format,
+)
+from ...image_utils import (
+    OPENAI_CLIP_MEAN,
+    OPENAI_CLIP_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    infer_channel_dimension_format,
+    is_scaled_image,
+    is_valid_image,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+)
+from ...utils import TensorType, is_vision_available, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+if is_vision_available():
+    import PIL
+
+
+# Copied from transformers.models.videomae.image_processing_videomae.make_batched
+def make_batched(videos) -> List[List[ImageInput]]:
+    if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0]):
+        return videos
+
+    elif isinstance(videos, (list, tuple)) and is_valid_image(videos[0]):
+        return [videos]
+
+    elif is_valid_image(videos):
+        return [[videos]]
+
+    raise ValueError(f"Could not make batched video from {videos}")
+
+
+class ImageBindImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs an ImageBind image processor.
+
+    Args:
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
+            `do_resize` in the `preprocess` method.
+        size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`):
+            Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
+            the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
+            method.
+        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
+        do_center_crop (`bool`, *optional*, defaults to `True`):
+            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
+            `preprocess` method.
+        crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
+            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
+            method.
+        do_rescale (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
+            the `preprocess` method.
+        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
+            method.
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
+        image_mean (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
+            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+        image_std (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
+            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_rgb (`bool`, *optional*, defaults to `True`):
+            Whether to convert the image to RGB. Can be overridden by the `do_convert_rgb` parameter in the
+            `preprocess` method.
+    """
+
+    model_input_names = ["pixel_values"]
+
+    def __init__(
+        self,
+        do_resize: bool = True,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        do_center_crop: bool = True,
+        crop_size: Dict[str, int] = None,
+        do_rescale: bool = True,
+        rescale_factor: Union[int, float] = 1 / 255,
+        do_normalize: bool = True,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        do_convert_rgb: bool = True,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"shortest_edge": 224}
+        size = get_size_dict(size, default_to_square=False)
+        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
+        crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
+
+        self.do_resize = do_resize
+        self.size = size
+        self.resample = resample
+        self.do_center_crop = do_center_crop
+        self.crop_size = crop_size
+        self.do_rescale = do_rescale
+        self.rescale_factor = rescale_factor
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
+        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
+        self.do_convert_rgb = do_convert_rgb
+
+    def resize(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
+        resized to keep the input aspect ratio.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Size of the output image.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+                Resampling filter to use when resizing the image.
+ data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. 
+ """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. image = (image - image_mean) / image_std. + + Args: + image (`np.ndarray`): + Image to normalize. + image_mean (`float` or `List[float]`): + Image mean. + image_std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess_single_image( + self, + image: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: Dict[str, int] = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + ) -> np.ndarray: + """ + Process a single image. + """ + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # All transformations expect numpy arrays. 
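The pixel math applied by the transformations in this pipeline (rescale by `1/255`, then per-channel normalize with the OpenAI CLIP statistics) can be sanity-checked with a minimal standalone NumPy sketch; this is an illustration only, not the processor's actual `resize`/`center_crop` helpers:

```python
import numpy as np

# Toy (height, width, channels) uint8 image with constant pixel value 128
image = np.full((4, 4, 3), 128, dtype=np.uint8)

# Rescale: map [0, 255] -> [0, 1], as when do_rescale=True with rescale_factor=1/255
rescaled = image * (1 / 255)

# Normalize per channel: (x - mean) / std, using the OpenAI CLIP statistics
mean = np.array([0.48145466, 0.4578275, 0.40821073])
std = np.array([0.26862954, 0.26130258, 0.27577711])
normalized = (rescaled - mean) / std

print(normalized.shape)
```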
+ image = to_numpy_array(image) + + if is_scaled_image(image) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + input_data_format = infer_channel_dimension_format(image) + + if do_resize: + image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + + if do_center_crop: + image = self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) + + if do_rescale: + image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + + if do_normalize: + image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + + image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) + return image + + def preprocess( + self, + images: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: int = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> PIL.Image.Image: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. 
Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. + resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. + crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. 
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. + """ + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + if not valid_images(images): + raise ValueError( + "Invalid image type. 
Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." + ) + + # Batch images into clips of video frames + videos = make_batched(images) + + videos = [ + [ + self.preprocess_single_image( + image=img, + do_resize=do_resize, + size=size, + resample=resample, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + data_format=data_format, + input_data_format=input_data_format, + ) + for img in clip + ] + for clip in videos + ] + + data = {"pixel_values": videos} + return BatchFeature(data=data, tensor_type=return_tensors) + + +class ImageBindDepthImageProcessor(BaseImageProcessor): + r""" + Constructs an ImageBind depth image processor. + + Args: + do_depth_norm (`bool`, *optional*, defaults to `True`): + Whether to perform depth normalization (following Omnivore). Can be overridden by `do_depth_norm` in the + `preprocess` method. + max_depth (`float`, *optional*, defaults to 75.0): + The max depth value, which will be used to scale the depth values by dividing them by `max_depth`. Can be + overridden by `max_depth` in the `preprocess` method. + min_depth (`float`, *optional*, defaults to 0.0): + The min depth value to clamp to. This is typically used to prevent negative depth values, which correspond + to far-away distances. Can be overridden by `min_depth` in the `preprocess` method. + clamp_max_before_scale (`bool`, *optional*, defaults to `True`): + Whether to clamp the depth values to `max_depth` before scaling by `max_depth`. If `True`, this will ensure + that the max depth value is 1. Can be overridden by `clamp_max_before_scale` in the `preprocess` method. + do_resize (`bool`, *optional*, defaults to `True`): + Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by + `do_resize` in the `preprocess` method. 
+ size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`): + Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` + method. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. + do_center_crop (`bool`, *optional*, defaults to `True`): + Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the + `preprocess` method. + crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`): + Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` + method. + do_rescale (`bool`, *optional*, defaults to `True`): + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in + the `preprocess` method. + rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): + Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` + method. + do_normalize (`bool`, *optional*, defaults to `True`): + Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. + image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`): + Mean to use if normalizing the image. This is a float or list of floats the length of the number of + channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. + image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`): + Standard deviation to use if normalizing the image. This is a float or list of floats the length of the + number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. Can be overridden by `do_convert_rgb` in the `preprocess` method.
+ """ + + model_input_names = ["pixel_values"] + + def __init__( + self, + do_depth_norm: bool = True, + max_depth: float = 75.0, + min_depth: float = 0.0, + clamp_max_before_scale: bool = True, + do_resize: bool = True, + size: Dict[str, int] = None, + resample: PILImageResampling = PILImageResampling.BICUBIC, + do_center_crop: bool = True, + crop_size: Dict[str, int] = None, + do_rescale: bool = True, + rescale_factor: Union[int, float] = 1 / 255, + do_normalize: bool = True, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = True, + **kwargs, + ) -> None: + super().__init__(**kwargs) + size = size if size is not None else {"shortest_edge": 224} + size = get_size_dict(size, default_to_square=False) + crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224} + crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") + + self.do_depth_norm = do_depth_norm + self.max_depth = max_depth + self.min_depth = min_depth + self.clamp_max_before_scale = clamp_max_before_scale + self.do_resize = do_resize + self.size = size + self.resample = resample + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_rescale = do_rescale + self.rescale_factor = rescale_factor + self.do_normalize = do_normalize + self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN + self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD + self.do_convert_rgb = do_convert_rgb + + def depth_norm( + self, + image: np.ndarray, + max_depth: float, + min_depth: float = 0.0, + clamp_max_before_scale: bool = True, + data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): 
+ """ + Normalize the depth channel. This will apply to the single channel of a depth input. + + Args: + image (`np.ndarray`): + Single channel depth image to normalize. + max_depth (`float`, *optional*, defaults to 75.0): + The max depth value for the data. + min_depth (`float`, *optional*, defaults to 0.0): + The minimum value to clamp the depth values to. This is done to prevent negative depth values, which + correspond to far away distances. + clamp_max_before_scale (`bool`, *optional*, defaults to `True`): + Whether to clamp the depth values to `max_depth` before scaling them by dividing by `max_depth`. + """ + # Clamp depth values to min_depth to prevent negative depths + image = np.clip(image, a_min=min_depth, a_max=None) + + if clamp_max_before_scale: + image = np.clip(image, a_min=None, a_max=max_depth) + + image = image / max_depth + + if data_format is not None: + image = to_channel_dimension_format(image, data_format, input_data_format) + return image + + def resize( + self, + image: np.ndarray, + size: Dict[str, int], + resample: PILImageResampling = PILImageResampling.BICUBIC, + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge + resized to keep the input aspect ratio. + + Args: + image (`np.ndarray`): + Image to resize. + size (`Dict[str, int]`): + Size of the output image. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use when resizing the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. 
Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. 
image = (image - image_mean) / image_std. + + Args: + image (`np.ndarray`): + Image to normalize. + image_mean (`float` or `List[float]`): + Image mean. + image_std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess( + self, + images: ImageInput, + do_depth_norm: bool = None, + max_depth: float = None, + min_depth: float = None, + clamp_max_before_scale: bool = None, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: int = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + **kwargs, + ) -> PIL.Image.Image: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_depth_norm (`bool`, *optional*, defaults to `self.do_depth_norm`): + Whether to normalize the depth channel by clamping and dividing by `max_depth`. + max_depth (`float`, *optional*, defaults to `self.max_depth`): + The max depth value used to scale the depth values. + min_depth (`float`, *optional*, defaults to `self.min_depth`): + The min depth value to clamp the depth values to. + clamp_max_before_scale (`bool`, *optional*, defaults to `self.clamp_max_before_scale`): + Whether to clamp the depth values to `max_depth` before scaling. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. + resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. 
+ crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. + data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. 
+ """ + do_depth_norm = do_depth_norm if do_depth_norm is not None else self.do_depth_norm + max_depth = max_depth if max_depth is not None else self.max_depth + min_depth = min_depth if min_depth is not None else self.min_depth + clamp_max_before_scale = clamp_max_before_scale if clamp_max_before_scale is not None else self.clamp_max_before_scale + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + images = make_list_of_images(images) + + if not valid_images(images): + raise ValueError( + "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." 
+ ) + + if do_depth_norm and max_depth is None: + raise ValueError("Max depth must be specified if do_depth_norm is True.") + + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # PIL RGBA images are converted to RGB + if do_convert_rgb: + images = [convert_to_rgb(image) for image in images] + + # All transformations expect numpy arrays. + images = [to_numpy_array(image) for image in images] + + if do_depth_norm: + images = [self.depth_norm(image=image, max_depth=max_depth, min_depth=min_depth, clamp_max_before_scale=clamp_max_before_scale) for image in images] + + if do_resize: + images = [self.resize(image=image, size=size, resample=resample) for image in images] + + if do_center_crop: + images = [self.center_crop(image=image, size=crop_size) for image in images] + + if do_rescale: + images = [self.rescale(image=image, scale=rescale_factor) for image in images] + + if do_normalize: + images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] + + images = [to_channel_dimension_format(image, data_format) for image in images] + + data = {"pixel_values": images} + return BatchFeature(data=data, tensor_type=return_tensors) + + +# NOTE: currently based on autogenerated ImageBindImageProcessor +class ImageBindThermalImageProcessor(BaseImageProcessor): + r""" + Constructs an ImageBind thermal image processor. + + Args: + do_resize (`bool`, *optional*, defaults to `True`): + Whether to resize the image's (height, width) dimensions to the specified `size`. 
Can be overridden by + `do_resize` in the `preprocess` method. + size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`): + Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` + method. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. + do_center_crop (`bool`, *optional*, defaults to `True`): + Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the + `preprocess` method. + crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`): + Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` + method. + do_rescale (`bool`, *optional*, defaults to `True`): + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in + the `preprocess` method. + rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): + Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` + method. + do_normalize (`bool`, *optional*, defaults to `True`): + Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. + image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`): + Mean to use if normalizing the image. This is a float or list of floats the length of the number of + channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. + image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`): + Standard deviation to use if normalizing the image. This is a float or list of floats the length of the + number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. Can be overridden by `do_convert_rgb` in the `preprocess` method.
+ """ + + model_input_names = ["pixel_values"] + + def __init__( + self, + do_resize: bool = True, + size: Dict[str, int] = None, + resample: PILImageResampling = PILImageResampling.BICUBIC, + do_center_crop: bool = True, + crop_size: Dict[str, int] = None, + do_rescale: bool = True, + rescale_factor: Union[int, float] = 1 / 255, + do_normalize: bool = True, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = True, + **kwargs, + ) -> None: + super().__init__(**kwargs) + size = size if size is not None else {"shortest_edge": 224} + size = get_size_dict(size, default_to_square=False) + crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224} + crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") + + self.do_resize = do_resize + self.size = size + self.resample = resample + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_rescale = do_rescale + self.rescale_factor = rescale_factor + self.do_normalize = do_normalize + self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN + self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD + self.do_convert_rgb = do_convert_rgb + + def resize( + self, + image: np.ndarray, + size: Dict[str, int], + resample: PILImageResampling = PILImageResampling.BICUBIC, + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge + resized to keep the input aspect ratio. + + Args: + image (`np.ndarray`): + Image to resize. + size (`Dict[str, int]`): + Size of the output image. 
+ resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use when resizing the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. 
+ data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. image = (image - mean) / std. + + Args: + image (`np.ndarray`): + Image to normalize. + mean (`float` or `List[float]`): + Image mean. + std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess( + self, + images: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: Dict[str, int] = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + **kwargs, + ) -> BatchFeature: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio.
+ resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. + crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. + data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. 
+ - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. + """ + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + images = make_list_of_images(images) + + if not valid_images(images): + raise ValueError( + "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." 
+ ) + + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # PIL RGBA images are converted to RGB + if do_convert_rgb: + images = [convert_to_rgb(image) for image in images] + + # All transformations expect numpy arrays. + images = [to_numpy_array(image) for image in images] + + if do_resize: + images = [self.resize(image=image, size=size, resample=resample) for image in images] + + if do_center_crop: + images = [self.center_crop(image=image, size=crop_size) for image in images] + + if do_rescale: + images = [self.rescale(image=image, scale=rescale_factor) for image in images] + + if do_normalize: + images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] + + images = [to_channel_dimension_format(image, data_format) for image in images] + + data = {"pixel_values": images} + return BatchFeature(data=data, tensor_type=return_tensors) diff --git a/src/transformers/models/imagebind/modeling_imagebind.py b/src/transformers/models/imagebind/modeling_imagebind.py new file mode 100644 index 000000000000..ca4ef7202adf --- /dev/null +++ b/src/transformers/models/imagebind/modeling_imagebind.py @@ -0,0 +1,2985 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch ImageBind model.""" + + +from dataclasses import dataclass +from typing import Any, List, Optional, Tuple, Union + +import numpy as np +import torch +import torch.utils.checkpoint +from torch import nn +from timm.layers import DropPath + +from ...activations import ACT2FN +from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling +from ...modeling_utils import PreTrainedModel +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_imagebind import ( + ImageBindConfig, + ImageBindAudioConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "facebook/imagebind-huge" + +IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "facebook/imagebind-huge", + # See all ImageBind models at https://huggingface.co/models?filter=imagebind +] + + +# Copied from transformers.models.bart.modeling_bart._expand_mask +def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): + """ + Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
+ """ + bsz, src_len = mask.size() + tgt_len = tgt_len if tgt_len is not None else src_len + + expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) + + inverted_mask = 1.0 - expanded_mask + + return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) + + +# TODO: can use code already in transformers? +# contrastive loss function, adapted from +# https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/ImageBind.html +def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: + return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device)) + + +# Copied from transformers.models.clip.modeling_clip.clip_loss with clip->imagebind +def imagebind_loss(similarity: torch.Tensor) -> torch.Tensor: + caption_loss = contrastive_loss(similarity) + image_loss = contrastive_loss(similarity.t()) + return (caption_loss + image_loss) / 2.0 + + +# BaseModelOutputWithPooling + num_clips field for modalities which have clips (vision, audio) +@dataclass +class ImageBindTransformerOutput(ModelOutput): + """ + The output class for ImageBind*Transformer models. This is [`BaseModelOutputWithPooling`] with an additional + `num_clips` field for modalities which are organized into clips as well as batches (vision, audio). + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + Last layer hidden-state of the first token of the sequence (classification token) after further processing + through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns + the classification token after processing through a linear layer and a tanh activation function. 
The linear + layer weights are trained from the next sentence prediction (classification) objective during pretraining. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + num_clips (`int`, *optional*): + The number of clips for modalities which have both a batch dimension (dim 0) and clip dimension (dim 1). + In the original ImageBind model, these modalities are vision (image/video) and audio. + """ + + last_hidden_state: torch.FloatTensor = None + pooler_output: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + num_clips: Optional[int] = None + + +@dataclass +# CLIPTextModelOutput + normalized embeddings +class ImageBindTextModelOutput(ModelOutput): + """ + Base class for text model's outputs that also contains a pooling of the last hidden states. + + Args: + text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The text embeddings obtained by applying the projection layer to the pooler_output.
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized text embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + text_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_text_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +# ClipVisionModelOutput + normalized embeddings +class ImageBindVisionModelOutput(ModelOutput): + """ + Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. 
+ + Args: + image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized image embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. 
+ """ + + image_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_image_embeds: Optional[torch.FloatTensor] = None + + +# CLAPAudioModelOutput + normalized embeddings +@dataclass +class ImageBindAudioModelOutput(ModelOutput): + """ + ClapAudio model output to mimic the output of the original implementation. + + Args: + audio_embeds (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + The Audio embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + normalized_audio_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized audio embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. 
+ """ + + audio_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_audio_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindDepthModelOutput(ModelOutput): + """ + Base class for depth model's outputs that also contains a pooling of the last hidden states. + + Args: + depth_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The depth embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. 
+ normalized_depth_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized depth embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + depth_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_depth_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindThermalModelOutput(ModelOutput): + """ + Base class for thermal model's outputs that also contains a pooling of the last hidden states. + + Args: + thermal_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The thermal embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. 
+ + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_thermal_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized thermal embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + thermal_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_thermal_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindImuModelOutput(ModelOutput): + """ + Base class for IMU model's outputs that also contains a pooling of the last hidden states. + + Args: + imu_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The IMU embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_imu_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized IMU embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + imu_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_imu_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->ImageBind +class ImageBindOutput(ModelOutput): + """ + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): + Contrastive loss for image-text similarity. + logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): + The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text + similarity scores. + logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): + The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image + similarity scores. + logits_per_audio:(`torch.FloatTensor` of shape `(audio_batch_size, image_batch_size)`): + The scaled dot product scores between `audio_embeds` and `image_embeds`. This represents the audio-image + similarity scores.
+ logits_per_depth:(`torch.FloatTensor` of shape `(depth_batch_size, image_batch_size)`): + The scaled dot product scores between `depth_embeds` and `image_embeds`. This represents the depth-image + similarity scores. + logits_per_thermal:(`torch.FloatTensor` of shape `(thermal_batch_size, image_batch_size)`): + The scaled dot product scores between `thermal_embeds` and `image_embeds`. This represents the thermal-image + similarity scores. + logits_per_imu:(`torch.FloatTensor` of shape `(imu_batch_size, image_batch_size)`): + The scaled dot product scores between `imu_embeds` and `image_embeds`. This represents the IMU-image + similarity scores. + text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized text embeddings obtained by applying the projection layer to the pooled output of [`ImageBindTextModel`], then applying L2 normalization and logit scaling. + image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized image embeddings obtained by applying the projection layer to the pooled output of [`ImageBindVisionModel`], then applying L2 normalization and logit scaling. + audio_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized audio embeddings obtained by applying the projection layer to the pooled output of [`ImageBindAudioModel`], then applying L2 normalization and logit scaling. + depth_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized depth embeddings obtained by applying the projection layer to the pooled output of [`ImageBindDepthModel`], then applying L2 normalization and logit scaling. + thermal_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized thermal embeddings obtained by applying the projection layer to the pooled output of [`ImageBindThermalModel`], then applying L2 normalization and logit scaling.
+ imu_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized IMU embeddings obtained by applying the projection layer to the pooled output of [`ImageBindImuModel`], then applying L2 normalization and logit scaling. + text_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindTextModel`]. + vision_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindVisionModel`]. + audio_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindAudioModel`]. + depth_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindDepthModel`]. + thermal_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindThermalModel`]. + imu_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindImuModel`]. + """ + + loss: Optional[torch.FloatTensor] = None + logits_per_image: torch.FloatTensor = None + logits_per_text: torch.FloatTensor = None + logits_per_audio: torch.FloatTensor = None + logits_per_depth: torch.FloatTensor = None + logits_per_thermal: torch.FloatTensor = None + logits_per_imu: torch.FloatTensor = None + text_embeds: torch.FloatTensor = None + image_embeds: torch.FloatTensor = None + audio_embeds: torch.FloatTensor = None + depth_embeds: torch.FloatTensor = None + thermal_embeds: torch.FloatTensor = None + imu_embeds: torch.FloatTensor = None + text_model_output: BaseModelOutputWithPooling = None + vision_model_output: BaseModelOutputWithPooling = None + audio_model_output: BaseModelOutputWithPooling = None + depth_model_output: BaseModelOutputWithPooling = None + thermal_model_output: BaseModelOutputWithPooling = None + imu_model_output: BaseModelOutputWithPooling = None + + def to_tuple(self) -> Tuple[Any]: + fields_to_exclude = [ + "text_model_output", + "vision_model_output", + "audio_model_output", + "depth_model_output", + "thermal_model_output", + "imu_model_output", + ] + return tuple( + self[k] if k not in fields_to_exclude else
getattr(self, k).to_tuple() + for k in self.keys() + ) + + +# Copied from transformers.models.clip.modeling_clip.CLIPTextEmbeddings with CLIP->ImageBind +class ImageBindTextEmbeddings(nn.Module): + def __init__(self, config: ImageBindTextConfig): + super().__init__() + embed_dim = config.hidden_size + + self.token_embedding = nn.Embedding(config.vocab_size, embed_dim) + self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim) + + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + ) -> torch.Tensor: + seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if inputs_embeds is None: + inputs_embeds = self.token_embedding(input_ids) + + position_embeddings = self.position_embedding(position_ids) + embeddings = inputs_embeds + position_embeddings + + return embeddings + + +class RGBDTPatchEmbedding(nn.Module): + """ + Creates patch embeddings for spatiotemporal data (e.g. images, video, depth etc.). This handles patch embeddings + for all image-like modalities (image/video, depth, thermal). 
+ """ + def __init__( + self, + config: Union[ImageBindAudioConfig, ImageBindDepthConfig, ImageBindThermalConfig, ImageBindVisionConfig], + image_shape: Union[List[int], Tuple[int]], + norm_layer: Optional[nn.Module] = None, + ): + super().__init__() + self.config = config + self.image_shape = image_shape + self.embed_dim = config.hidden_size + self.patch_size = config.patch_size + self.stride = config.stride + self.num_frames = config.num_frames if hasattr(config, "num_frames") else None + self.is_temporal = self.num_frames is not None + + self.class_embedding = nn.Parameter(torch.randn(self.embed_dim)) + + if self.is_temporal: + patch_embedding_cls = nn.Conv3d + else: + patch_embedding_cls = nn.Conv2d + + self.patch_embedding = patch_embedding_cls( + in_channels=image_shape[0], + out_channels=self.embed_dim, + kernel_size=self.patch_size, + stride=self.stride, + bias=False, + ) + self.norm_layer = norm_layer if norm_layer is not None else nn.Identity() + + if self.is_temporal: + patches_along_time_dim = (config.num_frames // self.patch_size[0]) + patches_along_height_dim = ((self.image_shape[-2] - self.patch_size[-2]) // self.stride[-2]) + 1 + patches_along_width_dim = ((self.image_shape[-1] - self.patch_size[-1]) // self.stride[-1]) + 1 + else: + patches_along_time_dim = 1 + patches_along_height_dim = ((self.image_shape[-2] - self.patch_size) // self.stride) + 1 + patches_along_width_dim = ((self.image_shape[-1] - self.patch_size) // self.stride) + 1 + self.num_patches = patches_along_height_dim * patches_along_width_dim * patches_along_time_dim + self.num_positions = self.num_patches + 1 + self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + + def image_to_video(self, image: torch.FloatTensor, time_dim: int = 2, ntimes: int = 2, pad_type: str = "repeat"): + """ + Maps 4-dim image tensors of shape (B, C, H, W) to 5-dim video tensors, possibly 
repeating the image along the
+        time dimension. For example, if `time_dim == 2`, RGB images of shape (B, C, H, W) will be transformed to
+        video of shape (B, C, 1, H, W), and the image will then be repeated along the time dimension `ntimes` times
+        to get shape (B, C, N, H, W).
+        """
+        if image.ndim not in [4, 5]:
+            raise ValueError(
+                f"The input `image` tensor should be 4- or 5-dimensional but has {image.ndim} dimensions."
+            )
+
+        # Add a time dimension at the specified index
+        if image.ndim == 4:
+            image = image.unsqueeze(time_dim)
+
+        # Repeat the image across the time dimension `ntimes` times.
+        if image.shape[time_dim] == 1:
+            if pad_type == "repeat":
+                new_shape = [1] * len(image.shape)
+                new_shape[time_dim] = ntimes
+                video = image.repeat(new_shape)
+            elif pad_type == "zero":
+                # `nn.functional.pad` expects padding amounts ordered from the *last* dimension backwards
+                pad_arg = [0, 0] * len(image.shape)
+                pad_arg[2 * (image.ndim - 1 - time_dim) + 1] = ntimes - image.shape[time_dim]
+                video = nn.functional.pad(image, pad_arg)
+            else:
+                raise ValueError(f"`pad_type` should be either 'repeat' or 'zero' but got {pad_type}.")
+        else:
+            # The input already has more than one frame; leave it unchanged.
+            video = image
+        return video
+
+    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
+        batch_size = pixel_values.shape[0]
+        if self.is_temporal:
+            # nn.Conv3d expects inputs of shape (B, C, T, H, W), so the time dimension is inserted at index 2.
+            pixel_values = self.image_to_video(pixel_values, time_dim=2, ntimes=self.num_frames)
+
+        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
+        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+        patch_embeds = self.norm_layer(patch_embeds)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+
+
+class ImageBindVisionEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindVisionConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        super().__init__(config, image_shape, norm_layer=None)
+
+
+class ImageBindAudioEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindAudioConfig):
+        image_shape = (config.num_channels, config.num_mel_bins, config.target_len)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, audio: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=audio)
+
+
+class ImageBindDepthEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindDepthConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, depth: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=depth)
+
+
+class ImageBindThermalEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindThermalConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, thermal: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=thermal)
+
+
+class ImageBindImuEmbeddings(nn.Module):
+    def __init__(self, config: ImageBindImuConfig):
+        super().__init__()
+        self.config = config
+        self.embed_dim = config.hidden_size
+        self.kernel_size = config.kernel_size
+        self.in_features = config.input_shape[0] * self.kernel_size
+
+        self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
+
+        self.patch_embedding = nn.Linear(self.in_features, self.embed_dim, bias=False)
+        self.norm_layer = nn.LayerNorm(self.embed_dim)
+
+        self.num_patches = config.input_shape[1] // self.kernel_size
+        self.num_positions = self.num_patches + 1
+        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
+        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)))
+
+    def forward(self, imu: torch.FloatTensor) -> torch.Tensor:
+        batch_size = imu.shape[0]
+
+        # Patchify
+        # (B, C, L) -> (B, C, L // K, K) -> (B, L // K, C, K)
+        patches = imu.unfold(-1, self.kernel_size,
self.kernel_size).permute(0, 2, 1, 3)
+        patches = patches.reshape(batch_size, patches.shape[1], -1)
+
+        # The linear patch embedding already yields (B, num_patches, embed_dim), so no reshaping is needed
+        # before the layer norm (unlike the convolutional patch embeddings).
+        patch_embeds = self.patch_embedding(patches)
+        patch_embeds = self.norm_layer(patch_embeds)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+
+
+# CLIPAttention + key/value biases
+class ImageBindAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.embed_dim = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.embed_dim // self.num_heads
+        if self.head_dim * self.num_heads != self.embed_dim:
+            raise ValueError(
+                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
+                f" {self.num_heads})."
+            )
+        self.scale = self.head_dim**-0.5
+        self.dropout = config.attention_dropout
+
+        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
+
+        # Create bias parameters for key and value sequences.
+        if config.add_kv_bias:
+            self.k_bias = nn.Parameter(torch.empty((1, 1, self.embed_dim)))
+            self.v_bias = nn.Parameter(torch.empty((1, 1, self.embed_dim)))
+        else:
+            self.k_bias = None
+            self.v_bias = None
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        causal_attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        """Input shape: Batch x Time x Channel"""
+
+        bsz, tgt_len, embed_dim = hidden_states.size()
+
+        # get query proj
+        query_states = self.q_proj(hidden_states) * self.scale
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+
+        # Add key/value biases if necessary
+        if self.k_bias is not None and self.v_bias is not None:
+            # Repeat the bias along the batch dimension and append it along the sequence dimension
+            key_states = torch.cat([key_states, self.k_bias.repeat(bsz, 1, 1)], dim=1)
+            value_states = torch.cat([value_states, self.v_bias.repeat(bsz, 1, 1)], dim=1)
+
+        key_states = self._shape(key_states, -1, bsz)
+        value_states = self._shape(value_states, -1, bsz)
+
+        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
+        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
+        key_states = key_states.view(*proj_shape)
+        value_states = value_states.view(*proj_shape)
+
+        src_len = key_states.size(1)
+        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
+
+        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
+            raise ValueError(
+                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
+                f" {attn_weights.size()}"
+            )
+
+        # apply the causal_attention_mask first
+        if causal_attention_mask is not None:
+            if causal_attention_mask.size() != (bsz, 1, tgt_len,
src_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
+                    f" {causal_attention_mask.size()}"
+                )
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        if attention_mask is not None:
+            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
+                )
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+
+        if output_attentions:
+            # this operation is a bit awkward, but it's required in order to
+            # make sure that attn_weights keeps its gradient.
+            # To do so, attn_weights has to be reshaped twice and reused in
+            # the following computation.
+            attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
+        else:
+            attn_weights_reshaped = None
+
+        attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
+
+        attn_output = torch.bmm(attn_probs, value_states)
+
+        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+
+        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)
+
+        attn_output = self.out_proj(attn_output)
+
+        return attn_output, attn_weights_reshaped
+
+
+# Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->ImageBind
+class ImageBindMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.activation_fn = ACT2FN[config.hidden_act]
+        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
+        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.fc1(hidden_states)
+        hidden_states = self.activation_fn(hidden_states)
+        hidden_states = self.fc2(hidden_states)
+        return hidden_states
+
+
+# CLIPEncoderLayer with DropPath layer after each residual subblock (attention, feedforward)
+class ImageBindEncoderLayer(nn.Module):
+    def __init__(self, config: ImageBindConfig, drop_path_rate: float = 0.0):
+        super().__init__()
+        self.embed_dim = config.hidden_size
+        self.self_attn = ImageBindAttention(config)
+        self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        self.mlp = ImageBindMLP(config)
+        self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        if drop_path_rate > 0.0:
+            self.drop_path = DropPath(drop_path_rate)
+        else:
+            self.drop_path = nn.Identity()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        causal_attention_mask: torch.Tensor,
+        output_attentions: Optional[bool] = False,
+    ) -> Tuple[torch.FloatTensor]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            causal_attention_mask (`torch.FloatTensor`): causal attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+ """ + residual = hidden_states + + hidden_states = self.layer_norm1(hidden_states) + hidden_states, attn_weights = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + causal_attention_mask=causal_attention_mask, + output_attentions=output_attentions, + ) + hidden_states = self.drop_path(hidden_states) + hidden_states = residual + hidden_states + + residual = hidden_states + hidden_states = self.layer_norm2(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = self.drop_path(hidden_states) + hidden_states = residual + hidden_states + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class ImageBindPostProcessor(nn.Module): + """ + Post-processes ImageBind embeddings by using a normalize layer followed by an optional logit scaling layer. + """ + def __init__( + self, + config, + dim: int = -1, + max_logit_scale: float = 100, + ): + super().__init__() + self.dim = dim + self.scale_logits = config.logit_scale_init_value is not None + + if self.scale_logits: + self.logit_scale_init = config.logit_scale_init_value + self.max_logit_scale = max_logit_scale + self.learnable = config.learnable_logit_scale + + log_logit_scale = torch.ones([]) * np.log(self.logit_scale_init) + if self.learnable: + self.log_logit_scale = nn.Parameter(log_logit_scale) + else: + self.register_buffer("log_logit_scale", log_logit_scale) + + def forward(self, logits: torch.FloatTensor) -> torch.FloatTensor: + logits = nn.functional.normalize(logits, dim=self.dim, p=2) + if self.scale_logits: + logits = torch.clip(self.log_logit_scale.exp(), max=self.max_logit_scale) * logits + return logits + + +class ImageBindPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = ImageBindConfig + base_model_prefix = "imagebind" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_factor + if isinstance(module, ImageBindTextEmbeddings): + module.token_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02) + module.position_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02) + elif isinstance(module, RGBDTPatchEmbedding): + factor = self.config.initializer_factor + nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor) + nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor) + nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor) + elif isinstance(module, ImageBindImuEmbeddings): + factor = self.config.initializer_factor + nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor) + nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor) + nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor) + elif isinstance(module, ImageBindAttention): + factor = self.config.initializer_factor + in_proj_std = (module.embed_dim**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor + out_proj_std = (module.embed_dim**-0.5) * factor + nn.init.normal_(module.q_proj.weight, std=in_proj_std) + nn.init.normal_(module.k_proj.weight, std=in_proj_std) + nn.init.normal_(module.v_proj.weight, std=in_proj_std) + nn.init.normal_(module.out_proj.weight, std=out_proj_std) + if module.k_bias is not None: + nn.init.normal_(module.k_bias, std=in_proj_std) + if module.v_bias is not None: + nn.init.normal_(module.v_bias, std=in_proj_std) + elif isinstance(module, ImageBindMLP): + factor = self.config.initializer_factor + in_proj_std = ( + (module.config.hidden_size**-0.5) 
* ((2 * module.config.num_hidden_layers) ** -0.5) * factor + ) + fc_std = (2 * module.config.hidden_size) ** -0.5 * factor + nn.init.normal_(module.fc1.weight, std=fc_std) + nn.init.normal_(module.fc2.weight, std=in_proj_std) + elif isinstance(module, ImageBindModel): + nn.init.normal_( + module.text_projection.weight, + std=module.text_embed_dim**-0.5 * self.config.initializer_factor, + ) + nn.init.normal_( + module.visual_projection.weight, + std=module.vision_embed_dim**-0.5 * self.config.initializer_factor, + ) + elif isinstance(module, ImageBindVisionModelWithProjection): + nn.init.normal_( + module.visual_projection.weight, + std=self.config.hidden_size**-0.5 * self.config.initializer_factor, + ) + elif isinstance(module, ImageBindTextModelWithProjection): + nn.init.normal_( + module.text_projection.weight, + std=self.config.hidden_size**-0.5 * self.config.initializer_factor, + ) + + if isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, ImageBindEncoder): + module.gradient_checkpointing = value + + +IMAGEBIND_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`ImageBindConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. 
Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +IMAGEBIND_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +IMAGEBIND_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`ImageBindImageProcessor.__call__`] for details. 
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+# TODO: add inputs docstrings for remaining modalities (audio, depth, thermal, IMU)
+IMAGEBIND_AUDIO_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_DEPTH_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_THERMAL_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_IMU_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+# TODO: update inputs docstring with remaining modalities
+IMAGEBIND_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+            config.max_position_embeddings - 1]`.
+ + [What are position IDs?](../glossary#position-ids) + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`ImageBindImageProcessor.__call__`] for details. + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +# CLIPEncoder with DropPath support +class ImageBindEncoder(nn.Module): + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`ImageBindEncoderLayer`]. 
+
+    Args:
+        config: ImageBindConfig
+    """
+
+    def __init__(self, config: ImageBindConfig, drop_path_type: str = "progressive"):
+        super().__init__()
+        self.config = config
+
+        if drop_path_type == "progressive":
+            drop_path_rates = [
+                prob.item() for prob in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)
+            ]
+        elif drop_path_type == "uniform":
+            drop_path_rates = [config.drop_path_rate for _ in range(config.num_hidden_layers)]
+        else:
+            raise ValueError(
+                f"`drop_path_type` is expected to be in `['uniform', 'progressive']` but got {drop_path_type}"
+            )
+
+        self.layers = nn.ModuleList(
+            [ImageBindEncoderLayer(config, drop_path_rate) for drop_path_rate in drop_path_rates]
+        )
+        self.gradient_checkpointing = False
+
+    def forward(
+        self,
+        inputs_embeds,
+        attention_mask: Optional[torch.Tensor] = None,
+        causal_attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutput]:
+        r"""
+        Args:
+            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
+                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+                than the model's internal embedding lookup matrix.
+            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+                [What are attention masks?](../glossary#attention-mask)
+            causal_attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Causal mask for the text model. Mask values selected in `[0, 1]`:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+ + [What are attention masks?](../glossary#attention-mask) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + attention_mask, + causal_attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + causal_attention_mask, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, 
hidden_states=encoder_states, attentions=all_attentions + ) + + +# TODO: copied from CLIP? +class ImageBindTextTransformer(nn.Module): + def __init__(self, config: ImageBindTextConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + self.embeddings = ImageBindTextEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is None: + raise ValueError("You have to specify input_ids") + + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + + hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids) + + bsz, seq_len = input_shape + # ImageBind's text model uses causal mask, prepare it here. 
+ # https://github.com/facebookresearch/ImageBind/blob/95d27c7fd5a8362f3527e176c3a80ae5a4d880c0/imagebind/models/imagebind_model.py#L172 + causal_attention_mask = self._build_causal_attention_mask( + bsz, seq_len, hidden_states.dtype, device=hidden_states.device + ) + # expand attention_mask + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _expand_mask(attention_mask, hidden_states.dtype) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + attention_mask=attention_mask, + causal_attention_mask=causal_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = encoder_outputs[0] + last_hidden_state = self.final_layer_norm(last_hidden_state) + + # text_embeds.shape = [batch_size, sequence_length, transformer.width] + # take features from the eot embedding (eot_token is the highest number in each sequence) + # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), + ] + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,) + + return ImageBindTransformerOutput( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + num_clips=None, + ) + + def _build_causal_attention_mask(self, bsz, seq_len, dtype, device=None): + # lazily create causal attention mask, with full attention between the vision tokens + # pytorch uses additive attention mask; fill with -inf + mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype, device=device) + mask.fill_(torch.finfo(dtype).min) + mask.triu_(1) # zero out the lower diagonal + mask = 
mask.unsqueeze(1) # expand mask + return mask + + +# TODO: copied from CLIP? +@add_start_docstrings( + """The text model from ImageBind without any head or projection on top.""", + IMAGEBIND_START_DOCSTRING, +) +class ImageBindTextModel(ImageBindPreTrainedModel): + config_class = ImageBindTextConfig + + _no_split_modules = ["ImageBindEncoderLayer"] + + def __init__(self, config: ImageBindTextConfig): + super().__init__(config) + self.text_model = ImageBindTextTransformer(config) + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.text_model.embeddings.token_embedding + + def set_input_embeddings(self, value): + self.text_model.embeddings.token_embedding = value + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from transformers import AutoTokenizer, ImageBindTextModel + + >>> model = ImageBindTextModel.from_pretrained("facebook/imagebind-huge") + >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled (EOS token) states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + 
position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindVisionTransformer(nn.Module): + def __init__(self, config: ImageBindVisionConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindVisionEmbeddings(config) + self.pre_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + num_clips = None + reduce_clips = pixel_values.ndim >= 5 + if reduce_clips: + batch_size, num_clips = pixel_values.shape[:2] + pixel_values = pixel_values.reshape(batch_size * num_clips, *pixel_values.shape[2:]) + + hidden_states = self.embeddings(pixel_values) + hidden_states = self.pre_layernorm(hidden_states) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = 
encoder_outputs[0] + pooled_output = last_hidden_state[:, 0, :] + pooled_output = self.post_layernorm(pooled_output) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (num_clips,) + + return ImageBindTransformerOutput( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + num_clips=num_clips, + ) + + +# TODO: copied from CLIP? +@add_start_docstrings( + """The vision model from ImageBind without any head or projection on top.""", + IMAGEBIND_START_DOCSTRING, +) +class ImageBindVisionModel(ImageBindPreTrainedModel): + config_class = ImageBindVisionConfig + _no_split_modules = ["ImageBindEncoderLayer"] + + main_input_name = "pixel_values" + + def __init__(self, config: ImageBindVisionConfig): + super().__init__(config) + self.vision_model = ImageBindVisionTransformer(config) + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindVisionModel + + >>> model = ImageBindVisionModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> 
inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindAudioTransformer(nn.Module): + def __init__(self, config: ImageBindAudioConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindAudioEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindAudioConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_features is None: + raise ValueError("You have to specify input_features") + + num_clips = None + reduce_clips = input_features.ndim >= 5 + if reduce_clips: + batch_size, num_clips = input_features.shape[:2] + input_features = input_features.reshape(batch_size * num_clips, *input_features.shape[2:]) + + hidden_states = 
self.embeddings(input_features)
+
+        encoder_outputs = self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (num_clips,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=num_clips,
+        )
+
+
+@add_start_docstrings(
+    """The audio model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindAudioModel(ImageBindPreTrainedModel):
+    config_class = ImageBindAudioConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "input_features"
+
+    def __init__(self, config: ImageBindAudioConfig):
+        super().__init__(config)
+        self.audio_model = ImageBindAudioTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.audio_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindAudioConfig)
+    def forward(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from datasets import load_dataset
+        >>> from transformers import AutoProcessor, ImageBindAudioModel
+
+        >>> model = ImageBindAudioModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+        >>> audio_sample = ds[0]["audio"]
+
+        >>> inputs = processor(
+        ...     audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
+        ... )
+
+        >>> outputs = model(**inputs)
+        >>> last_hidden_state = outputs.last_hidden_state
+        >>> pooled_output = outputs.pooler_output  # pooled CLS states
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        return self.audio_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+
+# TODO: copied from CLIP?
+class ImageBindDepthTransformer(nn.Module):
+    def __init__(self, config: ImageBindDepthConfig):
+        super().__init__()
+        self.config = config
+        embed_dim = config.hidden_size
+
+        self.embeddings = ImageBindDepthEmbeddings(config)
+        self.encoder = ImageBindEncoder(config)
+        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindDepthConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        """
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        hidden_states = self.embeddings(pixel_values)
+
+        encoder_outputs =
self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The depth model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindDepthModel(ImageBindPreTrainedModel):
+    config_class = ImageBindDepthConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "pixel_values"  # TODO: rename to something better?
+
+    def __init__(self, config: ImageBindDepthConfig):
+        super().__init__(config)
+        self.depth_model = ImageBindDepthTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.depth_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindDepthConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindDepthModel
+
+        >>> model = ImageBindDepthModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor =
AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.depth_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindThermalTransformer(nn.Module): + def __init__(self, config: ImageBindThermalConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindThermalEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindThermalConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + hidden_states = self.embeddings(pixel_values) + + encoder_outputs = 
self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The thermal model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindThermalModel(ImageBindPreTrainedModel):
+    config_class = ImageBindThermalConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "pixel_values"  # TODO: rename to something better?
+
+    def __init__(self, config: ImageBindThermalConfig):
+        super().__init__(config)
+        self.thermal_model = ImageBindThermalTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.thermal_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindThermalConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindThermalModel
+
+        >>> model = ImageBindThermalModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor =
AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.thermal_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindImuTransformer(nn.Module): + def __init__(self, config: ImageBindImuConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindImuEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + self.post_dropout = nn.Dropout(p=config.final_dropout) + + @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindImuConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_features is None: + raise ValueError("You have to specify input_features") + + hidden_states = 
self.embeddings(input_features)
+
+        encoder_outputs = self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+        pooled_output = self.post_dropout(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The IMU model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindImuModel(ImageBindPreTrainedModel):
+    config_class = ImageBindImuConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "input_features"
+
+    def __init__(self, config: ImageBindImuConfig):
+        super().__init__(config)
+        self.imu_model = ImageBindImuTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.imu_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindImuConfig)
+    def forward(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> import torch
+        >>> from transformers import ImageBindImuModel
+
+        >>> model = ImageBindImuModel.from_pretrained("facebook/imagebind-huge")
+
+        >>> # dummy IMU clip: 6 channels (3-axis accelerometer + 3-axis gyroscope) over time
+        >>> input_features = torch.randn(1, 6, 2000)
+
+        >>> outputs = model(input_features)
+        >>> last_hidden_state = outputs.last_hidden_state
+        >>> pooled_output = outputs.pooler_output  # pooled CLS states
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        return self.imu_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+
+@add_start_docstrings(IMAGEBIND_START_DOCSTRING)
+class ImageBindModel(ImageBindPreTrainedModel):
+    config_class = ImageBindConfig
+
+    def __init__(self, config: ImageBindConfig):
+        super().__init__(config)
+
+        if not isinstance(config.text_config, ImageBindTextConfig):
+            raise ValueError(
+                "config.text_config is expected to be of type ImageBindTextConfig but is of type"
+                f" {type(config.text_config)}."
+            )
+
+        if not isinstance(config.vision_config, ImageBindVisionConfig):
+            raise ValueError(
+                "config.vision_config is expected to be of type ImageBindVisionConfig but is of type"
+                f" {type(config.vision_config)}."
+            )
+
+        if not isinstance(config.audio_config, ImageBindAudioConfig):
+            raise ValueError(
+                "config.audio_config is expected to be of type ImageBindAudioConfig but is of type"
+                f" {type(config.audio_config)}."
+            )
+
+        if not isinstance(config.depth_config, ImageBindDepthConfig):
+            raise ValueError(
+                "config.depth_config is expected to be of type ImageBindDepthConfig but is of type"
+                f" {type(config.depth_config)}."
+            )
+
+        if not isinstance(config.thermal_config, ImageBindThermalConfig):
+            raise ValueError(
+                "config.thermal_config is expected to be of type ImageBindThermalConfig but is of type"
+                f" {type(config.thermal_config)}."
+ ) + + if not isinstance(config.imu_config, ImageBindImuConfig): + raise ValueError( + "config.imu_config is expected to be of type ImageBindImuConfig but is of type" + f" {type(config.imu_config)}." + ) + + text_config = config.text_config + vision_config = config.vision_config + audio_config = config.audio_config + depth_config = config.depth_config + thermal_config = config.thermal_config + imu_config = config.imu_config + + self.projection_dim = config.projection_dim + self.text_embed_dim = text_config.hidden_size + self.vision_embed_dim = vision_config.hidden_size + self.audio_embed_dim = audio_config.hidden_size + self.depth_embed_dim = depth_config.hidden_size + self.thermal_embed_dim = thermal_config.hidden_size + self.imu_embed_dim = imu_config.hidden_size + + self.text_model = ImageBindTextTransformer(text_config) + self.vision_model = ImageBindVisionTransformer(vision_config) + self.audio_model = ImageBindAudioTransformer(audio_config) + self.depth_model = ImageBindDepthTransformer(depth_config) + self.thermal_model = ImageBindThermalTransformer(thermal_config) + self.imu_model = ImageBindImuTransformer(imu_config) + + self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) + self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) + self.audio_projection = nn.Linear(self.audio_embed_dim, self.projection_dim, bias=False) + self.depth_projection = nn.Linear(self.depth_embed_dim, self.projection_dim, bias=False) + self.thermal_projection = nn.Linear(self.thermal_embed_dim, self.projection_dim, bias=False) + self.imu_projection = nn.Linear(self.imu_embed_dim, self.projection_dim, bias=False) + + self.text_postprocessor = ImageBindPostProcessor(text_config) + self.vision_postprocessor = ImageBindPostProcessor(vision_config) + self.audio_postprocessor = ImageBindPostProcessor(audio_config) + self.depth_postprocessor = ImageBindPostProcessor(depth_config) + self.thermal_postprocessor = 
ImageBindPostProcessor(thermal_config)
+        self.imu_postprocessor = ImageBindPostProcessor(imu_config)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING)
+    def get_text_features(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindTextModel`].
+
+        Examples:
+
+        ```python
+        >>> from transformers import AutoTokenizer, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge")
+
+        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
+        >>> text_features = model.get_text_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the text component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        text_outputs = self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = text_outputs[1]
+        text_features = self.text_projection(pooled_output)
+
+        return text_features
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING)
+    def get_image_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindVisionModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> image_features = model.get_image_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the vision components.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        batch_size = pixel_values.shape[0]
+
+        vision_outputs = self.vision_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = vision_outputs[1]  # pooled_output
+        image_features = self.visual_projection(pooled_output)
+
+        num_clips = vision_outputs[-1]
+        if num_clips is not None:
+            image_features = image_features.reshape(batch_size, num_clips, -1)
+            # Take mean over all clips
+            image_features = image_features.mean(dim=1)
+
+        return image_features
+
+    # TODO: make sure inputs match with ImageBindAudioModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING)
+    def get_audio_features(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            audio_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The audio embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindAudioModel`].
+
+        Examples:
+
+        ```python
+        >>> from datasets import load_dataset
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+        >>> audio_sample = ds[0]["audio"]
+
+        >>> inputs = processor(
+        ...     audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
+        ... )
+
+        >>> audio_features = model.get_audio_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the audio component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        batch_size = input_features.shape[0]
+
+        audio_outputs = self.audio_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = audio_outputs[1]  # pooled_output
+        audio_features = self.audio_projection(pooled_output)
+
+        num_clips = audio_outputs[-1]
+        if num_clips is not None:
+            audio_features = audio_features.reshape(batch_size, num_clips, -1)
+            # Take mean over all clips
+            audio_features = audio_features.mean(dim=1)
+
+        return audio_features
+
+    # TODO: make sure inputs match with ImageBindDepthModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    def get_depth_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            depth_features (`torch.FloatTensor` of shape
`(batch_size, output_dim)`): The depth embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindDepthModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> depth_features = model.get_depth_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the depth component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        depth_outputs = self.depth_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = depth_outputs[1]  # pooled_output
+        depth_features = self.depth_projection(pooled_output)
+
+        return depth_features
+
+    # TODO: make sure inputs match with ImageBindThermalModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING)
+    def get_thermal_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            thermal_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The thermal embeddings obtained by
+            applying the projection layer to the pooled
output of [`ImageBindThermalModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> thermal_features = model.get_thermal_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the thermal component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        thermal_outputs = self.thermal_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = thermal_outputs[1]  # pooled_output
+        thermal_features = self.thermal_projection(pooled_output)
+
+        return thermal_features
+
+    # TODO: make sure inputs match with ImageBindImuModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING)
+    def get_imu_features(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            imu_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The IMU embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindImuModel`].
+ + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindModel + + >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> imu_features = model.get_imu_features(**inputs) + ```""" + # Use ImageBind model's config for some fields (if specified) instead of those in the IMU component. + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + imu_outputs = self.imu_model( + input_features=input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = imu_outputs[1] # pooled_output + imu_features = self.imu_projection(pooled_output) + + return imu_features + + @add_start_docstrings_to_model_forward(IMAGEBIND_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindOutput, config_class=ImageBindConfig) + def forward( + self, + input_features: Optional[torch.Tensor] = None, + pixel_values: Optional[torch.FloatTensor] = None, + modality: Optional[str] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + return_loss: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import 
requests + >>> from transformers import AutoProcessor, ImageBindModel + + >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor( + ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True + ... ) + + >>> outputs = model(**inputs) + >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score + >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities + ```""" + # Use ImageBind model's config for some fields (if specified) instead of those of vision & text components. + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + image_batch_size = pixel_values.shape[0] + other_batch_size = input_features.shape[0] + + other_model, other_projection, other_postprocessor = self._resolve_modality_models(modality) + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + if modality == "text": + other_outputs = other_model( + input_ids=input_features, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + other_outputs = other_model( + # Non-text encoders name their first argument `pixel_values` or `input_features`, so pass it positionally. + input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + image_embeds =
vision_outputs[1] + image_embeds = self.visual_projection(image_embeds) + + other_embeds = other_outputs[1] + other_embeds = other_projection(other_embeds) + + # normalized features: postprocessor performs normalization and logit scaling + image_embeds = self.vision_postprocessor(image_embeds) + other_embeds = other_postprocessor(other_embeds) + + # If modality input was batched and clipped, reduce embedding over clips dimension + image_num_clips = vision_outputs[-1] + if image_num_clips is not None: + image_embeds = image_embeds.reshape(image_batch_size, image_num_clips, -1) + # Take mean over all clips + image_embeds = image_embeds.mean(dim=1) + other_num_clips = other_outputs[-1] + if other_num_clips is not None: + other_embeds = other_embeds.reshape(other_batch_size, other_num_clips, -1) + other_embeds = other_embeds.mean(dim=1) + + # cosine similarity as logits + logits_per_other = torch.matmul(other_embeds, image_embeds.t()) + logits_per_image = logits_per_other.t() + + loss = None + if return_loss: + loss = imagebind_loss(logits_per_other) + + if not return_dict: + output = (logits_per_image, logits_per_other, other_embeds, image_embeds, other_outputs, vision_outputs) + return ((loss,) + output) if loss is not None else output + + output_kwargs = self._resolve_output_keys(modality, logits_per_other, other_embeds, other_outputs) + + return ImageBindOutput( + loss=loss, + logits_per_image=logits_per_image, + image_embeds=image_embeds, + vision_model_output=vision_outputs, + **output_kwargs, + ) + + def _resolve_modality_models(self, modality: str): + if modality == "text": + model = self.text_model + projection = self.text_projection + postprocessor = self.text_postprocessor + elif modality == "vision": + model = self.vision_model + projection = self.visual_projection + postprocessor = self.vision_postprocessor + elif modality == "audio": + model = self.audio_model + projection = self.audio_projection + postprocessor = self.audio_postprocessor + elif modality 
== "depth": + model = self.depth_model + projection = self.depth_projection + postprocessor = self.depth_postprocessor + elif modality == "thermal": + model = self.thermal_model + projection = self.thermal_projection + postprocessor = self.thermal_postprocessor + elif modality == "imu": + model = self.imu_model + projection = self.imu_projection + postprocessor = self.imu_postprocessor + else: + raise ValueError( + f"`modality` is expected to be in `['text', 'vision', 'audio', 'depth', 'thermal', 'imu']` but got" + f" {modality}" + ) + return model, projection, postprocessor + + def _resolve_output_keys(self, modality: str, logits, embeds, model_outputs): + output_kwargs = {} + if modality == "vision": + # Different naming pattern + output_kwargs["logits_per_image"] = logits + output_kwargs["image_embeds"] = embeds + output_kwargs["vision_model_output"] = model_outputs + else: + output_kwargs[f"logits_per_{modality}"] = logits + output_kwargs[f"{modality}_embeds"] = embeds + output_kwargs[f"{modality}_model_output"] = model_outputs + return output_kwargs + + +@add_start_docstrings( + """ + ImageBind Text Model with a projection layer on top (a linear layer on top of the pooled output). 
+ """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindTextModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindTextConfig + + _no_split_modules = ["ImageBindEncoderLayer"] + + def __init__(self, config: ImageBindTextConfig): + super().__init__(config) + + self.text_model = ImageBindTextTransformer(config) + + self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.text_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.text_model.embeddings.token_embedding + + def set_input_embeddings(self, value): + self.text_model.embeddings.token_embedding = value + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTextModelOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTextModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from transformers import AutoTokenizer, ImageBindTextModelWithProjection + + >>> model = ImageBindTextModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> text_embeds = outputs.text_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = text_outputs[1] + + text_embeds = self.text_projection(pooled_output) + normalized_text_embeds = self.text_postprocessor(text_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (text_embeds, text_outputs[0]) + text_outputs[2:-1] + (normalized_text_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindTextModelOutput( + text_embeds=text_embeds, + last_hidden_state=text_outputs.last_hidden_state, + hidden_states=text_outputs.hidden_states, + attentions=text_outputs.attentions, + normalized_text_embeds=normalized_text_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Vision Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindVisionModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindVisionConfig + main_input_name = "pixel_values" + + def __init__(self, config: ImageBindVisionConfig): + super().__init__(config) + + self.vision_model = ImageBindVisionTransformer(config) + + self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.vision_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindVisionModelOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindVisionModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> 
import requests + >>> from transformers import AutoProcessor, ImageBindVisionModelWithProjection + + >>> model = ImageBindVisionModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> image_embeds = outputs.image_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + batch_size = pixel_values.shape[0] + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = vision_outputs[1] # pooled_output + + image_embeds = self.visual_projection(pooled_output) + normalized_image_embeds = self.vision_postprocessor(image_embeds) + + num_clips = vision_outputs[-1] + if num_clips is not None: + image_embeds = image_embeds.reshape(batch_size, num_clips, -1) + # Take mean over all clips + image_embeds = image_embeds.mean(dim=1) + + normalized_image_embeds = normalized_image_embeds.reshape(batch_size, num_clips, -1) + normalized_image_embeds = normalized_image_embeds.mean(dim=1) + + if not return_dict: + # Exclude num_clips output + outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:-1] + (normalized_image_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindVisionModelOutput( + image_embeds=image_embeds, + last_hidden_state=vision_outputs.last_hidden_state, + hidden_states=vision_outputs.hidden_states, + attentions=vision_outputs.attentions, + normalized_image_embeds=normalized_image_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Audio Model with a projection layer on top (a linear layer on top of the pooled 
output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindAudioModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindAudioConfig + main_input_name = "input_features" + + def __init__(self, config: ImageBindAudioConfig): + super().__init__(config) + + self.audio_model = ImageBindAudioTransformer(config) + + self.audio_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.audio_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.audio_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindAudioModelOutput, config_class=ImageBindAudioConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindAudioModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindAudioModelWithProjection + + >>> model = ImageBindAudioModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> audio_embeds = outputs.audio_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + batch_size = input_features.shape[0] + + audio_outputs = self.audio_model( + input_features=input_features, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = audio_outputs[1] # pooled_output + + audio_embeds = self.audio_projection(pooled_output) + normalized_audio_embeds = self.audio_postprocessor(audio_embeds) + + num_clips = audio_outputs[-1] + if num_clips is not None: + audio_embeds = audio_embeds.reshape(batch_size, num_clips, -1) + # Take mean over all clips + audio_embeds = audio_embeds.mean(dim=1) + + normalized_audio_embeds = normalized_audio_embeds.reshape(batch_size, num_clips, -1) + normalized_audio_embeds = normalized_audio_embeds.mean(dim=1) + + if not return_dict: + # Exclude num_clips output + outputs = (audio_embeds, audio_outputs[0]) + audio_outputs[2:-1] + (normalized_audio_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindAudioModelOutput( + audio_embeds=audio_embeds, + last_hidden_state=audio_outputs.last_hidden_state, + hidden_states=audio_outputs.hidden_states, + attentions=audio_outputs.attentions, + normalized_audio_embeds=normalized_audio_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Depth Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindDepthModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindDepthConfig + main_input_name = "pixel_values" # TODO: rename to something better? 
+ + def __init__(self, config: ImageBindDepthConfig): + super().__init__(config) + + self.depth_model = ImageBindDepthTransformer(config) + + self.depth_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.depth_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.depth_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindDepthModelOutput, config_class=ImageBindDepthConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindDepthModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindDepthModelWithProjection + + >>> model = ImageBindDepthModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> depth_embeds = outputs.depth_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + depth_outputs = self.depth_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = depth_outputs[1] # pooled_output + + depth_embeds = self.depth_projection(pooled_output) + normalized_depth_embeds = self.depth_postprocessor(depth_embeds) + + if not return_dict: + 
# Exclude num_clips output + outputs = (depth_embeds, depth_outputs[0]) + depth_outputs[2:-1] + (normalized_depth_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindDepthModelOutput( + depth_embeds=depth_embeds, + last_hidden_state=depth_outputs.last_hidden_state, + hidden_states=depth_outputs.hidden_states, + attentions=depth_outputs.attentions, + normalized_depth_embeds=normalized_depth_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Thermal Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindThermalModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindThermalConfig + main_input_name = "pixel_values" # TODO: rename to something better? + + def __init__(self, config: ImageBindThermalConfig): + super().__init__(config) + + self.thermal_model = ImageBindThermalTransformer(config) + + self.thermal_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.thermal_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.thermal_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindThermalModelOutput, config_class=ImageBindThermalConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindThermalModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindThermalModelWithProjection + + >>> model = ImageBindThermalModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>>
processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> thermal_embeds = outputs.thermal_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + thermal_outputs = self.thermal_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = thermal_outputs[1] # pooled_output + + thermal_embeds = self.thermal_projection(pooled_output) + normalized_thermal_embeds = self.thermal_postprocessor(thermal_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (thermal_embeds, thermal_outputs[0]) + thermal_outputs[2:-1] + (normalized_thermal_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindThermalModelOutput( + thermal_embeds=thermal_embeds, + last_hidden_state=thermal_outputs.last_hidden_state, + hidden_states=thermal_outputs.hidden_states, + attentions=thermal_outputs.attentions, + normalized_thermal_embeds=normalized_thermal_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind IMU Model with a projection layer on top (a linear layer on top of the pooled output).
+ """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindImuModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindImuConfig + main_input_name = "input_features" + + def __init__(self, config: ImageBindImuConfig): + super().__init__(config) + + self.imu_model = ImageBindImuTransformer(config) + + self.imu_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.imu_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.imu_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindImuModelOutput, config_class=ImageBindImuConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindImuModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindDepthModelWithProjection + + >>> model = ImageBindDepthModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> depth_embeds = outputs.depth_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + imu_outputs = self.imu_model( + input_features=input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = imu_outputs[1] # 
pooled_output + + imu_embeds = self.imu_projection(pooled_output) + normalized_imu_embeds = self.imu_postprocessor(imu_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (imu_embeds, imu_outputs[0]) + imu_outputs[2:-1] + (normalized_imu_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindImuModelOutput( + imu_embeds=imu_embeds, + last_hidden_state=imu_outputs.last_hidden_state, + hidden_states=imu_outputs.hidden_states, + attentions=imu_outputs.attentions, + normalized_imu_embeds=normalized_imu_embeds, + ) diff --git a/src/transformers/models/imagebind/processing_imagebind.py b/src/transformers/models/imagebind/processing_imagebind.py new file mode 100644 index 000000000000..03b3671fe8c7 --- /dev/null +++ b/src/transformers/models/imagebind/processing_imagebind.py @@ -0,0 +1,141 @@ +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Image/Text processor class for ImageBind +""" + +import warnings + +from ...processing_utils import ProcessorMixin +from ...tokenization_utils_base import BatchEncoding + + +# NOTE: currently copied from previous PR (#23284) + + +class ImageBindProcessor(ProcessorMixin): + r""" + Constructs an ImageBind processor which wraps an ImageBind image processor and an ImageBind tokenizer into a single processor. + [`ImageBindProcessor`] offers all the functionalities of [`ImageBindImageProcessor`] and [`ImageBindTokenizerFast`].
See the + [`~ImageBindProcessor.__call__`] and [`~ImageBindProcessor.decode`] for more information. + Args: + image_processor ([`ImageBindImageProcessor`]): + The image processor is a required input. + tokenizer ([`ImageBindTokenizerFast`]): + The tokenizer is a required input. + """ + attributes = ["image_processor", "tokenizer"] + image_processor_class = "ImageBindImageProcessor" + tokenizer_class = ("ImageBindTokenizer", "ImageBindTokenizerFast") + + def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None + if "feature_extractor" in kwargs: + warnings.warn( + "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" + " instead.", + FutureWarning, + ) + feature_extractor = kwargs.pop("feature_extractor") + + image_processor = image_processor if image_processor is not None else feature_extractor + if image_processor is None: + raise ValueError("You need to specify an `image_processor`.") + if tokenizer is None: + raise ValueError("You need to specify a `tokenizer`.") + + super().__init__(image_processor, tokenizer) + + def __call__(self, text=None, images=None, return_tensors=None, **kwargs): + """ + Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text` + and `kwargs` arguments to ImageBindTokenizerFast's [`~ImageBindTokenizerFast.__call__`] if `text` is not `None` to encode + the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to + ImageBindImageProcessor's [`~ImageBindImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring + of the above two methods for more information. + Args: + text (`str`, `List[str]`, `List[List[str]]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string).
If the sequences are provided as list of strings (pretokenized), you must set + `is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`): + The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch + tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a + number of channels, H and W are image height and width. + return_tensors (`str` or [`~utils.TensorType`], *optional*): + If set, will return tensors of a particular framework. Acceptable values are: + - `'tf'`: Return TensorFlow `tf.constant` objects. + - `'pt'`: Return PyTorch `torch.Tensor` objects. + - `'np'`: Return NumPy `np.ndarray` objects. + - `'jax'`: Return JAX `jnp.ndarray` objects. + Returns: + [`BatchEncoding`]: A [`BatchEncoding`] with the following fields: + - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`. + - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when + `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not + `None`). + - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`. + """ + + if text is None and images is None: + raise ValueError("You have to specify either text or images. 
Both cannot be none.") + + if text is not None: + encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs) + + if images is not None: + image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs) + + if text is not None and images is not None: + encoding["pixel_values"] = image_features.pixel_values + return encoding + elif text is not None: + return encoding + else: + return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors) + + def batch_decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ImageBindTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, **kwargs) + + def decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ImageBindTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. + """ + return self.tokenizer.decode(*args, **kwargs) + + @property + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + image_processor_input_names = self.image_processor.model_input_names + return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names)) + + @property + def feature_extractor_class(self): + warnings.warn( + "`feature_extractor_class` is deprecated and will be removed in v5. Use `image_processor_class` instead.", + FutureWarning, + ) + return self.image_processor_class + + @property + def feature_extractor(self): + warnings.warn( + "`feature_extractor` is deprecated and will be removed in v5. 
Use `image_processor` instead.", + FutureWarning, + ) + return self.image_processor \ No newline at end of file diff --git a/src/transformers/models/imagebind/tokenization_imagebind.py b/src/transformers/models/imagebind/tokenization_imagebind.py new file mode 100644 index 000000000000..084406c774c8 --- /dev/null +++ b/src/transformers/models/imagebind/tokenization_imagebind.py @@ -0,0 +1,525 @@ +# Copyright 2023 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization classes for ImageBind.""" + +import json +import os +import unicodedata +from functools import lru_cache +from typing import List, Optional, Tuple + +import regex as re + +from ...tokenization_utils import AddedToken, PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace +from ...utils import logging + + +# NOTE: currently copied from previous PR (#23284) + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = { + "vocab_file": "vocab.json", + "merges_file": "merges.txt", +} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/vocab.json", + }, + "merges_file": { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "facebook/imagebind-huge": 77, +} + + +PRETRAINED_INIT_CONFIGURATION = { + "facebook/imagebind-huge": {}, +} + + +@lru_cache() +def bytes_to_unicode(): + """ + Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control + characters the bpe code barfs on. + + The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab + if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for + decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup + tables between utf-8 bytes and unicode strings. + """ + bs = ( + list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) + ) + cs = bs[:] + n = 0 + for b in range(2**8): + if b not in bs: + bs.append(b) + cs.append(2**8 + n) + n += 1 + cs = [chr(n) for n in cs] + return dict(zip(bs, cs)) + + +def get_pairs(word): + """ + Return set of symbol pairs in a word. 
+ + Word is represented as tuple of symbols (symbols being variable-length strings). + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + + +def whitespace_clean(text): + text = re.sub(r"\s+", " ", text) + text = text.strip() + return text + + +# Copied from transformers.models.bert.tokenization_bert.whitespace_tokenize +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer +class BasicTokenizer(object): + """ + Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.). + + Args: + do_lower_case (`bool`, *optional*, defaults to `True`): + Whether or not to lowercase the input when tokenizing. + never_split (`Iterable`, *optional*): + Collection of tokens which will never be split during tokenization. Only has an effect when + `do_basic_tokenize=True` + tokenize_chinese_chars (`bool`, *optional*, defaults to `True`): + Whether or not to tokenize Chinese characters. + + This should likely be deactivated for Japanese (see this + [issue](https://github.com/huggingface/transformers/issues/328)). + strip_accents (`bool`, *optional*): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for `lowercase` (as in the original BERT). + """ + + def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + if never_split is None: + never_split = [] + self.do_lower_case = do_lower_case + self.never_split = set(never_split) + self.tokenize_chinese_chars = tokenize_chinese_chars + self.strip_accents = strip_accents + + def tokenize(self, text, never_split=None): + """ + Basic Tokenization of a piece of text. 
Split on "white spaces" only, for sub-word tokenization, see + WordPieceTokenizer. + + Args: + never_split (`List[str]`, *optional*) + Kept for backward compatibility purposes. Now implemented directly at the base class level (see + [`PreTrainedTokenizer.tokenize`]) List of token not to split. + """ + # union() returns a new set by concatenating the two sets. + never_split = self.never_split.union(set(never_split)) if never_split else self.never_split + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + if self.tokenize_chinese_chars: + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if token not in never_split: + if self.do_lower_case: + token = token.lower() + if self.strip_accents is not False: + token = self._run_strip_accents(token) + elif self.strip_accents: + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token, never_split)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text, never_split=None): + """Splits punctuation on a piece of text.""" + if never_split is not None and text in never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = 
chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class ImageBindTokenizer(PreTrainedTokenizer): + """ + Construct a ImageBind tokenizer. Based on byte-level Byte-Pair-Encoding. 
+
+    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
+    this superclass for more information regarding those methods.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        merges_file (`str`):
+            Path to the merges file.
+        errors (`str`, *optional*, defaults to `"replace"`):
+            Paradigm to follow when decoding bytes to UTF-8. See
+            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
+        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `<|startoftext|>`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The end of sequence token.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(
+        self,
+        vocab_file,
+        merges_file,
+        errors="replace",
+        unk_token="<|endoftext|>",
+        bos_token="<|startoftext|>",
+        eos_token="<|endoftext|>",
+        pad_token="<|endoftext|>",  # hack to enable padding
+        **kwargs,
+    ):
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        try:
+            import ftfy
+
+            self.fix_text = ftfy.fix_text
+        except ImportError:
+            logger.info("ftfy or spacy is not installed; using custom BasicTokenizer instead of ftfy.")
+            self.nlp = BasicTokenizer(do_lower_case=True)
+            self.fix_text = None
+
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.encoder = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.encoder.items()}
+        self.errors = errors  # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+        with open(merges_file, encoding="utf-8") as merges_handle:
+            bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
+        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}
+
+        self.pat = re.compile(
+            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
+            re.IGNORECASE,
+        )
+
+    @property
+    def vocab_size(self):
+        return len(self.encoder)
+
+    def get_vocab(self):
+        return dict(self.encoder, **self.added_tokens_encoder)
+
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
+        and adding special tokens. An ImageBind sequence has the following format:
+
+        - single sequence: `<|startoftext|> X <|endoftext|>`
+
+        Pairs of sequences are not the expected use case, but they will be handled without a separator.
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return bos_token + token_ids_0 + eos_token + return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed. ImageBind does not make use of token type ids, therefore a list of + zeros is returned. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + + Returns: + `List[int]`: List of zeros. 
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return len(bos_token + token_ids_0 + eos_token) * [0] + return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0] + + def bpe(self, token): + if token in self.cache: + return self.cache[token] + word = tuple(token[:-1]) + (token[-1] + "",) + pairs = get_pairs(word) + + if not pairs: + return token + "" + + while True: + bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + except ValueError: + new_word.extend(word[i:]) + break + else: + new_word.extend(word[i:j]) + i = j + + if word[i] == first and i < len(word) - 1 and word[i + 1] == second: + new_word.append(first + second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = " ".join(word) + self.cache[token] = word + return word + + def _tokenize(self, text): + """Tokenize a string.""" + bpe_tokens = [] + if self.fix_text is None: + text = " ".join(self.nlp.tokenize(text)) + else: + text = whitespace_clean(self.fix_text(text)).lower() + + for token in re.findall(self.pat, text): + token = "".join( + self.byte_encoder[b] for b in token.encode("utf-8") + ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case) + bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" ")) + return bpe_tokens + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.decoder.get(index) + + def convert_tokens_to_string(self, 
tokens): + """Converts a sequence of tokens (string) in a single string.""" + text = "".join(tokens) + byte_array = bytearray([self.byte_decoder[c] for c in text]) + text = byte_array.decode("utf-8", errors=self.errors).replace("", " ").strip() + return text + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + merge_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] + ) + + with open(vocab_file, "w", encoding="utf-8") as f: + f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + writer.write("#version: 0.2\n") + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + "Saving vocabulary to {}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!".format(merge_file) + ) + index = token_index + writer.write(" ".join(bpe_tokens) + "\n") + index += 1 + + return vocab_file, merge_file \ No newline at end of file diff --git a/src/transformers/models/imagebind/tokenization_imagebind_fast.py b/src/transformers/models/imagebind/tokenization_imagebind_fast.py new file mode 100644 index 000000000000..a28a29a7efcf --- /dev/null +++ b/src/transformers/models/imagebind/tokenization_imagebind_fast.py @@ -0,0 +1,169 @@ +# Copyright 2023 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for ImageBind."""
+
+
+from typing import List, Optional, Tuple
+
+from tokenizers import pre_tokenizers
+
+from ...tokenization_utils_fast import PreTrainedTokenizerFast
+from ...utils import logging
+from .tokenization_imagebind import ImageBindTokenizer
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+logger = logging.get_logger(__name__)
+
+VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/vocab.json",
+    },
+    "merges_file": {
+        "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/merges.txt",
+    },
+    "tokenizer_file": {
+        "facebook/imagebind-huge": (
+            "https://huggingface.co/facebook/imagebind-huge/resolve/main/tokenizer.json"
+        ),
+    },
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    "facebook/imagebind-huge": 77,
+}
+
+
+class ImageBindTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Construct a "fast" ImageBind tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
+    Byte-Pair-Encoding.
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods.
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        merges_file (`str`):
+            Path to the merges file.
+        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `<|startoftext|>`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The end of sequence token.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+    slow_tokenizer_class = ImageBindTokenizer
+
+    def __init__(
+        self,
+        vocab_file=None,
+        merges_file=None,
+        tokenizer_file=None,
+        unk_token="<|endoftext|>",
+        bos_token="<|startoftext|>",
+        eos_token="<|endoftext|>",
+        pad_token="<|endoftext|>",  # hack to enable padding
+        **kwargs,
+    ):
+        super().__init__(
+            vocab_file,
+            merges_file,
+            tokenizer_file=tokenizer_file,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        if not isinstance(self.backend_tokenizer.pre_tokenizer, pre_tokenizers.Sequence):
+            raise ValueError(
+                "The `backend_tokenizer` provided does not match the expected format. The ImageBind tokenizer has been"
+                " heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using"
+                " to be compatible with this version. The easiest way to do so is"
+                ' `ImageBindTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True)`. If you want'
+                " to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of"
+                " transformers."
+            )
+
+        self._wrap_decode_method_backend_tokenizer()
+
+    # Very ugly hack to enable padding to have a correct decoding see https://github.com/huggingface/tokenizers/issues/872
+    def _wrap_decode_method_backend_tokenizer(self):
+        orig_decode_method = self.backend_tokenizer.decode
+
+        def new_decode_method(*args, **kwargs):
+            text = orig_decode_method(*args, **kwargs)
+            text = text.replace(self.backend_tokenizer.model.end_of_word_suffix, " ").strip()
+            return text
+
+        self.backend_tokenizer.decode = new_decode_method
+
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
+        and adding special tokens. An ImageBind sequence has the following format:
+        - single sequence: `<|startoftext|> X <|endoftext|>`
+        Pairs of sequences are not the expected use case, but they will be handled without a separator.
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+        Returns:
+            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+        """
+        bos_token = [self.bos_token_id]
+        eos_token = [self.eos_token_id]
+
+        if token_ids_1 is None:
+            return bos_token + token_ids_0 + eos_token
+        return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token
+
+    def create_token_type_ids_from_sequences(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Create a mask from the two sequences passed. ImageBind does not make use of token type ids, therefore a list of
+        zeros is returned.
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+        Returns:
+            `List[int]`: List of zeros.
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return len(bos_token + token_ids_0 + eos_token) * [0] + return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0] + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + files = self._tokenizer.model.save(save_directory, name=filename_prefix) + return tuple(files) \ No newline at end of file diff --git a/tests/models/imagebind/__init__.py b/tests/models/imagebind/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/tests/models/imagebind/test_image_processing_imagebind.py b/tests/models/imagebind/test_image_processing_imagebind.py new file mode 100644 index 000000000000..67c11c2d4ffd --- /dev/null +++ b/tests/models/imagebind/test_image_processing_imagebind.py @@ -0,0 +1,305 @@ +# Copyright 2023 HuggingFace Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import unittest + +import numpy as np + +from transformers.testing_utils import require_torch, require_vision +from transformers.utils import is_torch_available, is_vision_available + +from ...test_image_processing_common import ImageProcessingSavingTestMixin + + +# NOTE: currently copied from previous PR (#23284) + + +if is_torch_available(): + import torch + +if is_vision_available(): + from PIL import Image + + from transformers import ImageBindImageProcessor + + +class ImageBindImageProcessingTester(unittest.TestCase): + def __init__( + self, + parent, + batch_size=7, + num_channels=3, + image_size=18, + min_resolution=30, + max_resolution=400, + do_resize=True, + size=None, + do_center_crop=True, + crop_size=None, + do_normalize=True, + image_mean=[0.48145466, 0.4578275, 0.40821073], + image_std=[0.26862954, 0.26130258, 0.27577711], + do_convert_rgb=True, + ): + size = size if size is not None else {"shortest_edge": 20} + crop_size = crop_size if crop_size is not None else {"height": 18, "width": 18} + self.parent = parent + self.batch_size = batch_size + self.num_channels = num_channels + self.image_size = image_size + self.min_resolution = min_resolution + self.max_resolution = max_resolution + self.do_resize = do_resize + self.size = size + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_normalize = do_normalize + self.image_mean = image_mean + self.image_std = image_std + self.do_convert_rgb = do_convert_rgb + + def prepare_image_processor_dict(self): + return { + "do_resize": self.do_resize, + "size": self.size, + "do_center_crop": self.do_center_crop, + "crop_size": self.crop_size, + "do_normalize": self.do_normalize, + "image_mean": self.image_mean, + "image_std": self.image_std, + "do_convert_rgb": self.do_convert_rgb, + } + + def prepare_inputs(self, equal_resolution=False, numpify=False, torchify=False): + """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True, + or a 
list of PyTorch tensors if one specifies torchify=True. + """ + + assert not (numpify and torchify), "You cannot specify both numpy and PyTorch tensors at the same time" + + if equal_resolution: + image_inputs = [] + for i in range(self.batch_size): + image_inputs.append( + np.random.randint( + 255, size=(self.num_channels, self.max_resolution, self.max_resolution), dtype=np.uint8 + ) + ) + else: + image_inputs = [] + for i in range(self.batch_size): + width, height = np.random.choice(np.arange(self.min_resolution, self.max_resolution), 2) + image_inputs.append(np.random.randint(255, size=(self.num_channels, width, height), dtype=np.uint8)) + + if not numpify and not torchify: + # PIL expects the channel dimension as last dimension + image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs] + + if torchify: + image_inputs = [torch.from_numpy(x) for x in image_inputs] + + return image_inputs + + +@require_torch +@require_vision +class ImageBindImageProcessingTest(ImageProcessingSavingTestMixin, unittest.TestCase): + image_processing_class = ImageBindImageProcessor if is_vision_available() else None + + def setUp(self): + self.image_processor_tester = ImageBindImageProcessingTester(self) + + @property + def image_processor_dict(self): + return self.image_processor_tester.prepare_image_processor_dict() + + def test_image_processor_properties(self): + image_processing = self.image_processing_class(**self.image_processor_dict) + self.assertTrue(hasattr(image_processing, "do_resize")) + self.assertTrue(hasattr(image_processing, "size")) + self.assertTrue(hasattr(image_processing, "do_center_crop")) + self.assertTrue(hasattr(image_processing, "center_crop")) + self.assertTrue(hasattr(image_processing, "do_normalize")) + self.assertTrue(hasattr(image_processing, "image_mean")) + self.assertTrue(hasattr(image_processing, "image_std")) + self.assertTrue(hasattr(image_processing, "do_convert_rgb")) + + def test_image_processor_from_dict_with_kwargs(self): 
+ image_processor = self.image_processing_class.from_dict(self.image_processor_dict) + self.assertEqual(image_processor.size, {"shortest_edge": 20}) + self.assertEqual(image_processor.crop_size, {"height": 18, "width": 18}) + + image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42, crop_size=84) + self.assertEqual(image_processor.size, {"shortest_edge": 42}) + self.assertEqual(image_processor.crop_size, {"height": 84, "width": 84}) + + def test_batch_feature(self): + pass + + def test_call_pil(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PIL images + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False) + for image in image_inputs: + self.assertIsInstance(image, Image.Image) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + def test_call_numpy(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random numpy tensors + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False, numpify=True) + for image in image_inputs: + self.assertIsInstance(image, np.ndarray) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + 
encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + def test_call_pytorch(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PyTorch tensors + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False, torchify=True) + for image in image_inputs: + self.assertIsInstance(image, torch.Tensor) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + +@require_torch +@require_vision +class ImageBindImageProcessingTestFourChannels(ImageProcessingSavingTestMixin, unittest.TestCase): + image_processing_class = ImageBindImageProcessor if is_vision_available() else None + + def setUp(self): + self.image_processor_tester = ImageBindImageProcessingTester(self, num_channels=4) + self.expected_encoded_image_num_channels = 3 + + @property + def image_processor_dict(self): + return 
self.image_processor_tester.prepare_image_processor_dict() + + def test_image_processor_properties(self): + image_processing = self.image_processing_class(**self.image_processor_dict) + self.assertTrue(hasattr(image_processing, "do_resize")) + self.assertTrue(hasattr(image_processing, "size")) + self.assertTrue(hasattr(image_processing, "do_center_crop")) + self.assertTrue(hasattr(image_processing, "center_crop")) + self.assertTrue(hasattr(image_processing, "do_normalize")) + self.assertTrue(hasattr(image_processing, "image_mean")) + self.assertTrue(hasattr(image_processing, "image_std")) + self.assertTrue(hasattr(image_processing, "do_convert_rgb")) + + def test_batch_feature(self): + pass + + def test_call_pil_four_channels(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PIL images + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False) + for image in image_inputs: + self.assertIsInstance(image, Image.Image) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.expected_encoded_image_num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.expected_encoded_image_num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) \ No newline at end of file diff --git a/tests/models/imagebind/test_modeling_imagebind.py b/tests/models/imagebind/test_modeling_imagebind.py new file mode 100644 index 000000000000..e64276216f9a --- /dev/null +++ b/tests/models/imagebind/test_modeling_imagebind.py @@ -0,0 +1,1546 @@ +# 
Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the PyTorch ImageBind model. """ + + +import inspect +import os +import tempfile +import unittest + +import numpy as np +import requests + +import transformers +from transformers import ( + ImageBindConfig, + ImageBindAudioConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) +from transformers.testing_utils import ( + is_flax_available, + is_pt_flax_cross_test, + require_torch, + require_vision, + slow, + torch_device, +) +from transformers.utils import is_torch_available, is_vision_available + +from ...test_configuration_common import ConfigTester +from ...test_modeling_common import ( + ModelTesterMixin, + _config_zero_init, + floats_tensor, + ids_tensor, + random_attention_mask, +) +from ...test_pipeline_mixin import PipelineTesterMixin + + +if is_torch_available(): + import torch + from torch import nn + + from transformers import ( + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) + from 
transformers.models.imagebind.modeling_imagebind import IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST + + +if is_vision_available(): + from PIL import Image + + from transformers import ImageBindProcessor + + +if is_flax_available(): + import jax.numpy as jnp + + from transformers.modeling_flax_pytorch_utils import ( + convert_pytorch_state_dict_to_flax, + load_flax_weights_in_pytorch_model, + ) + + +class ImageBindTextModelTester: + def __init__( + self, + parent, + batch_size=12, + seq_length=7, + is_training=True, + use_input_mask=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + attention_dropout=0.0, + max_position_embeddings=512, + layer_norm_eps=1e-6, + initializer_range=0.02, + logit_scale_init_value=14.2857, + learnable_logit_scale=True, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = random_attention_mask([self.batch_size, self.seq_length]) + + if input_mask is not None: + batch_size, seq_length = input_mask.shape + rnd_start_indices = 
np.random.randint(1, seq_length - 1, size=(batch_size,)) + for batch_idx, start_index in enumerate(rnd_start_indices): + input_mask[batch_idx, :start_index] = 1 + input_mask[batch_idx, start_index:] = 0 + + config = self.get_config() + + return config, input_ids, input_mask + + def get_config(self): + return ImageBindTextConfig( + vocab_size=self.vocab_size, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + max_position_embeddings=self.max_position_embeddings, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, input_ids, input_mask): + model = ImageBindTextModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(input_ids, attention_mask=input_mask) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, input_ids, input_mask): + model = ImageBindTextModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(input_ids, attention_mask=input_mask) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.text_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, input_ids, input_mask = config_and_inputs 
+ inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask} + return config, inputs_dict + + +@require_torch +class ImageBindTextModelTest(ModelTesterMixin, unittest.TestCase): + all_model_classes = (ImageBindTextModel, ImageBindTextModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindTextModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindTextConfig, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + @unittest.skip(reason="ImageBindTextModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindTextModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindTextModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindTextModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "text_projection")) + + +class ImageBindVisionModelTester: + def 
__init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=(2, 2, 2), + stride=(2, 2, 2), + num_channels=3, + num_frames=2, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + attention_dropout=0.0, + initializer_range=0.02, + logit_scale_init_value=None, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.num_frames = num_frames + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + # Resolve spatiotemporal patch size + patches_along_time_dim = num_frames // patch_size[0] + patches_along_height_dim = ((image_size - patch_size[1]) // stride[1]) + 1 + patches_along_width_dim = ((image_size - patch_size[2]) // stride[2]) + 1 + num_patches = patches_along_time_dim * patches_along_height_dim * patches_along_width_dim + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.num_frames, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindVisionConfig( + image_size=self.image_size, + patch_size=self.patch_size, + 
stride=self.stride, + num_channels=self.num_channels, + num_frames=self.num_frames, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindVisionModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token); the spatiotemporal + # patch count is already resolved in __init__, so reuse self.seq_length here + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindVisionModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + 
def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindVisionModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. + """ + + all_model_classes = (ImageBindVisionModel, ImageBindVisionModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindVisionModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindVisionConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="IMAGEBIND does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + 
self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindVisionModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindVisionModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindVisionModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindVisionModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "visual_projection")) + + +class ImageBindAudioModelTester: + def __init__( + self, + parent, + batch_size=12, + patch_size=16, + stride=10, + num_channels=1, + is_training=True, + num_mel_bins=128, + target_len=204, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.1, + initializer_range=0.02, + logit_scale_init_value=20.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.num_mel_bins = num_mel_bins + self.target_len = target_len + self.hidden_size = hidden_size + self.projection_dim = projection_dim + 
self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + patches_along_height_dim = ((num_mel_bins - patch_size) // stride) + 1 + patches_along_width_dim = ((target_len - patch_size) // stride) + 1 + num_patches = patches_along_height_dim * patches_along_width_dim + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + # audio inputs are log-mel spectrograms of shape (batch, channels, num_mel_bins, target_len); + # this tester defines no image_size, so build the input from the spectrogram dimensions + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.num_mel_bins, self.target_len]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindAudioConfig( + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + num_mel_bins=self.num_mel_bins, + target_len=self.target_len, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindAudioModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token), computed from the + # spectrogram dimensions in __init__, so reuse self.seq_length here + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindAudioModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindAudioModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length.
+ """ + + all_model_classes = (ImageBindAudioModel, ImageBindAudioModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindAudioModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindAudioConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindAudioModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + 
pass + + @unittest.skip(reason="ImageBindAudioModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindAudioModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindAudioModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "audio_projection")) + + +class ImageBindDepthModelTester: + def __init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=2, + stride=2, + num_channels=1, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = (((image_size - patch_size) // stride) + 1) ** 2 + # in ViT, 
the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindDepthConfig( + image_size=self.image_size, + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindDepthModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindDepthModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, 
self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindDepthModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. + """ + + all_model_classes = (ImageBindDepthModel, ImageBindDepthModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindDepthModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindDepthConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = 
inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindDepthModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindDepthModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindDepthModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindDepthModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "depth_projection")) + + +class ImageBindThermalModelTester: + def __init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=2, + stride=2, + num_channels=1, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + logit_scale_init_value=10.0, + learnable_logit_scale=False, + 
scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = (((image_size - patch_size) // stride) + 1) ** 2 + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindThermalConfig( + image_size=self.image_size, + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindThermalModel(config=config) + model.to(torch_device) + 
model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindThermalModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindThermalModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. 
+ """ + + all_model_classes = (ImageBindThermalModel, ImageBindThermalModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindThermalModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindThermalConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindThermalModel has no base class and is not available in MODEL_MAPPING") + def 
test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindThermalModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindThermalModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindThermalModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "thermal_projection")) + + +class ImageBindImuModelTester: + def __init__( + self, + parent, + batch_size=12, + input_shape=(6, 30), + kernel_size=2, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.7, + initializer_range=0.02, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.input_shape = input_shape + self.kernel_size = kernel_size + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = input_shape[1] // kernel_size + # The seq length is the number of patches + 1 (for the 
[CLS] token)
+        self.seq_length = num_patches + 1
+
+    def prepare_config_and_inputs(self):
+        # IMU inputs have shape (batch_size, *input_shape); this tester has no image_size or
+        # num_channels attributes, so the tensor is built from input_shape directly.
+        pixel_values = floats_tensor([self.batch_size, *self.input_shape])
+        config = self.get_config()
+
+        return config, pixel_values
+
+    def get_config(self):
+        return ImageBindImuConfig(
+            input_shape=self.input_shape,
+            kernel_size=self.kernel_size,
+            hidden_size=self.hidden_size,
+            projection_dim=self.projection_dim,
+            num_hidden_layers=self.num_hidden_layers,
+            num_attention_heads=self.num_attention_heads,
+            intermediate_size=self.intermediate_size,
+            dropout=self.dropout,
+            attention_dropout=self.attention_dropout,
+            layer_norm_eps=self.layer_norm_eps,
+            add_kv_bias=self.add_kv_bias,
+            initializer_range=self.initializer_range,
+            logit_scale_init_value=self.logit_scale_init_value,
+            learnable_logit_scale=self.learnable_logit_scale,
+        )
+
+    def create_and_check_model(self, config, pixel_values):
+        model = ImageBindImuModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        with torch.no_grad():
+            result = model(pixel_values)
+        # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
+        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+        self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
+
+    def create_and_check_model_with_projection(self, config, pixel_values):
+        model = ImageBindImuModelWithProjection(config=config)
+        model.to(torch_device)
+        model.eval()
+        with torch.no_grad():
+            result = model(pixel_values)
+        # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
+        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+        self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim))
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        config, pixel_values = config_and_inputs
+        inputs_dict = {"pixel_values": pixel_values}
+        return config, inputs_dict
+
+
+@require_torch
+class ImageBindImuModelTest(ModelTesterMixin, unittest.TestCase):
+    """
+    Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds,
+    attention_mask and seq_length.
+    """
+
+    all_model_classes = (ImageBindImuModel, ImageBindImuModelWithProjection) if is_torch_available() else ()
+    fx_compatible = False
+    test_pruning = False
+    test_resize_embeddings = False
+    test_head_masking = False
+
+    def setUp(self):
+        self.model_tester = ImageBindImuModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=ImageBindImuConfig, has_text_modality=False, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    @unittest.skip(reason="ImageBind does not use inputs_embeds")
+    def test_inputs_embeds(self):
+        pass
+
+    def test_model_common_attributes(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            self.assertIsInstance(model.get_input_embeddings(), nn.Module)
+            x = model.get_output_embeddings()
+            self.assertTrue(x is None or isinstance(x, nn.Linear))
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names =
[*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindImuModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindImuModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindImuModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindImuModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "imu_projection")) + + +class ImageBindModelTester: + def __init__(self, parent, text_kwargs=None, vision_kwargs=None, is_training=True): + if text_kwargs is None: + text_kwargs = {} + if vision_kwargs is None: + vision_kwargs = {} + + self.parent = parent + self.text_model_tester = ImageBindTextModelTester(parent, **text_kwargs) + self.vision_model_tester = ImageBindVisionModelTester(parent, **vision_kwargs) + self.is_training = is_training + + def prepare_config_and_inputs(self): + text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs() + vision_config, pixel_values = 
self.vision_model_tester.prepare_config_and_inputs() + + config = self.get_config() + + return config, input_ids, attention_mask, pixel_values + + def get_config(self): + return ImageBindConfig.from_text_vision_configs( + self.text_model_tester.get_config(), self.vision_model_tester.get_config(), projection_dim=64 + ) + + def create_and_check_model(self, config, input_ids, attention_mask, pixel_values): + model = ImageBindModel(config).to(torch_device).eval() + with torch.no_grad(): + result = model(input_ids, pixel_values, attention_mask) + self.parent.assertEqual( + result.logits_per_image.shape, (self.vision_model_tester.batch_size, self.text_model_tester.batch_size) + ) + self.parent.assertEqual( + result.logits_per_text.shape, (self.text_model_tester.batch_size, self.vision_model_tester.batch_size) + ) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, input_ids, attention_mask, pixel_values = config_and_inputs + inputs_dict = { + "input_ids": input_ids, + "attention_mask": attention_mask, + "pixel_values": pixel_values, + "return_loss": True, + } + return config, inputs_dict + + +@require_torch +class ImageBindModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase): + all_model_classes = (ImageBindModel,) if is_torch_available() else () + fx_compatible = False + test_head_masking = False + test_pruning = False + test_resize_embeddings = False + test_attention_outputs = False + + def setUp(self): + self.model_tester = ImageBindModelTester(self) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + @unittest.skip(reason="Hidden_states is tested in individual model tests") + def test_hidden_states_output(self): + pass + + @unittest.skip(reason="Inputs_embeds is tested in individual model tests") + def test_inputs_embeds(self): + pass + + @unittest.skip(reason="Retain_grad is 
tested in individual model tests")
+    def test_retain_grad_hidden_states_attentions(self):
+        pass
+
+    @unittest.skip(reason="ImageBindModel does not have input/output embeddings")
+    def test_model_common_attributes(self):
+        pass
+
+    # override as the `logit_scale` parameter initialization is different for IMAGEBIND
+    def test_initialization(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        configs_no_init = _config_zero_init(config)
+        for model_class in self.all_model_classes:
+            model = model_class(config=configs_no_init)
+            for name, param in model.named_parameters():
+                if param.requires_grad:
+                    # check if `logit_scale` is initialized as per the original implementation
+                    if name == "logit_scale":
+                        self.assertAlmostEqual(
+                            param.data.item(),
+                            np.log(1 / 0.07),
+                            delta=1e-3,
+                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
+                        )
+                    else:
+                        self.assertIn(
+                            ((param.data.mean() * 1e9).round() / 1e9).item(),
+                            [0.0, 1.0],
+                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
+                        )
+
+    def _create_and_check_torchscript(self, config, inputs_dict):
+        if not self.test_torchscript:
+            return
+
+        configs_no_init = _config_zero_init(config)  # To be sure we have no Nan
+        configs_no_init.torchscript = True
+        configs_no_init.return_dict = False
+        for model_class in self.all_model_classes:
+            model = model_class(config=configs_no_init)
+            model.to(torch_device)
+            model.eval()
+
+            try:
+                input_ids = inputs_dict["input_ids"]
+                pixel_values = inputs_dict["pixel_values"]  # IMAGEBIND needs pixel_values
+                traced_model = torch.jit.trace(model, (input_ids, pixel_values))
+            except RuntimeError:
+                self.fail("Couldn't trace module.")
+
+            with tempfile.TemporaryDirectory() as tmp_dir_name:
+                pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
+
+                try:
+                    torch.jit.save(traced_model, pt_file_name)
+                except Exception:
+                    self.fail("Couldn't save module.")
+
+                try:
+                    loaded_model =
torch.jit.load(pt_file_name) + except Exception: + self.fail("Couldn't load module.") + + model.to(torch_device) + model.eval() + + loaded_model.to(torch_device) + loaded_model.eval() + + model_state_dict = model.state_dict() + loaded_model_state_dict = loaded_model.state_dict() + + self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys())) + + models_equal = True + for layer_name, p1 in model_state_dict.items(): + p2 = loaded_model_state_dict[layer_name] + if p1.data.ne(p2.data).sum() > 0: + models_equal = False + + self.assertTrue(models_equal) + + def test_load_vision_text_config(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + # Save ImageBindConfig and check if we can load ImageBindVisionConfig from it + with tempfile.TemporaryDirectory() as tmp_dir_name: + config.save_pretrained(tmp_dir_name) + vision_config = ImageBindVisionConfig.from_pretrained(tmp_dir_name) + self.assertDictEqual(config.vision_config.to_dict(), vision_config.to_dict()) + + # Save ImageBindConfig and check if we can load ImageBindTextConfig from it + with tempfile.TemporaryDirectory() as tmp_dir_name: + config.save_pretrained(tmp_dir_name) + text_config = ImageBindTextConfig.from_pretrained(tmp_dir_name) + self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict()) + + # overwrite from common since FlaxImageBindModel returns nested output + # which is not supported in the common test + @is_pt_flax_cross_test + def test_equivalence_pt_to_flax(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + with self.subTest(model_class.__name__): + # load PyTorch class + pt_model = model_class(config).eval() + # Flax models don't use the `use_cache` option and cache is not returned as a default. + # So we disable `use_cache` here for PyTorch model. 
+ pt_model.config.use_cache = False + + fx_model_class_name = "Flax" + model_class.__name__ + + if not hasattr(transformers, fx_model_class_name): + return + + fx_model_class = getattr(transformers, fx_model_class_name) + + # load Flax class + fx_model = fx_model_class(config, dtype=jnp.float32) + # make sure only flax inputs are forward that actually exist in function args + fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys() + + # prepare inputs + pt_inputs = self._prepare_for_class(inputs_dict, model_class) + + # remove function args that don't exist in Flax + pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys} + + fx_state = convert_pytorch_state_dict_to_flax(pt_model.state_dict(), fx_model) + fx_model.params = fx_state + + with torch.no_grad(): + pt_outputs = pt_model(**pt_inputs).to_tuple() + + # convert inputs to Flax + fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)} + fx_outputs = fx_model(**fx_inputs).to_tuple() + self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch") + for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + with tempfile.TemporaryDirectory() as tmpdirname: + pt_model.save_pretrained(tmpdirname) + fx_model_loaded = fx_model_class.from_pretrained(tmpdirname, from_pt=True) + + fx_outputs_loaded = fx_model_loaded(**fx_inputs).to_tuple() + self.assertEqual( + len(fx_outputs_loaded), len(pt_outputs), "Output lengths differ between Flax and PyTorch" + ) + for fx_output_loaded, pt_output in zip(fx_outputs_loaded[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output_loaded, pt_output.numpy(), 4e-2) + + # overwrite from common since FlaxImageBindModel returns nested output + # which is not supported in the common test + @is_pt_flax_cross_test + def test_equivalence_flax_to_pt(self): + config, inputs_dict = 
self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + with self.subTest(model_class.__name__): + # load corresponding PyTorch class + pt_model = model_class(config).eval() + + # So we disable `use_cache` here for PyTorch model. + pt_model.config.use_cache = False + + fx_model_class_name = "Flax" + model_class.__name__ + + if not hasattr(transformers, fx_model_class_name): + # no flax model exists for this class + return + + fx_model_class = getattr(transformers, fx_model_class_name) + + # load Flax class + fx_model = fx_model_class(config, dtype=jnp.float32) + # make sure only flax inputs are forward that actually exist in function args + fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys() + + pt_model = load_flax_weights_in_pytorch_model(pt_model, fx_model.params) + + # make sure weights are tied in PyTorch + pt_model.tie_weights() + + # prepare inputs + pt_inputs = self._prepare_for_class(inputs_dict, model_class) + + # remove function args that don't exist in Flax + pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys} + + with torch.no_grad(): + pt_outputs = pt_model(**pt_inputs).to_tuple() + + fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)} + + fx_outputs = fx_model(**fx_inputs).to_tuple() + self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch") + + for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + with tempfile.TemporaryDirectory() as tmpdirname: + fx_model.save_pretrained(tmpdirname) + pt_model_loaded = model_class.from_pretrained(tmpdirname, from_flax=True) + + with torch.no_grad(): + pt_outputs_loaded = pt_model_loaded(**pt_inputs).to_tuple() + + self.assertEqual( + len(fx_outputs), len(pt_outputs_loaded), "Output lengths differ between Flax and PyTorch" + ) + for fx_output, pt_output in zip(fx_outputs[:4], 
pt_outputs_loaded[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + +# We will verify our results on an image of cute cats +def prepare_img(): + url = "http://images.cocodataset.org/val2017/000000039769.jpg" + im = Image.open(requests.get(url, stream=True).raw) + return im + + +@require_vision +@require_torch +class ImageBindModelIntegrationTest(unittest.TestCase): + @slow + def test_inference(self): + model_name = "facebook/imagebind-huge" + model = ImageBindModel.from_pretrained(model_name).to(torch_device) + processor = ImageBindProcessor.from_pretrained(model_name) + + image = prepare_img() + inputs = processor( + text=["a photo of a cat", "a photo of a dog"], images=image, padding=True, return_tensors="pt" + ).to(torch_device) + + # forward pass + with torch.no_grad(): + outputs = model(**inputs) + + # verify the logits + self.assertEqual( + outputs.logits_per_image.shape, + torch.Size((inputs.pixel_values.shape[0], inputs.input_ids.shape[0])), + ) + self.assertEqual( + outputs.logits_per_text.shape, + torch.Size((inputs.input_ids.shape[0], inputs.pixel_values.shape[0])), + ) + + expected_logits = torch.tensor([[24.5701, 19.3049]], device=torch_device) + + self.assertTrue(torch.allclose(outputs.logits_per_image, expected_logits, atol=1e-3)) \ No newline at end of file diff --git a/tests/models/imagebind/test_processor_imagebind.py b/tests/models/imagebind/test_processor_imagebind.py new file mode 100644 index 000000000000..ff27287c4e79 --- /dev/null +++ b/tests/models/imagebind/test_processor_imagebind.py @@ -0,0 +1,205 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import shutil
+import tempfile
+import unittest
+
+import numpy as np
+import pytest
+
+from transformers import ImageBindTokenizer, ImageBindTokenizerFast
+from transformers.models.imagebind.tokenization_imagebind import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_vision
+from transformers.utils import IMAGE_PROCESSOR_NAME, is_vision_available
+
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import ImageBindImageProcessor, ImageBindProcessor
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+@require_vision
+class ImageBindProcessorTest(unittest.TestCase):
+    def setUp(self):
+        self.tmpdirname = tempfile.mkdtemp()
+
+        # fmt: off
+        vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "lo", "l</w>", "w</w>", "r</w>", "t</w>", "low</w>", "er</w>", "lowest</w>", "newer</w>", "wider", "<unk>", "<|startoftext|>", "<|endoftext|>"]
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        merges = ["#version: 0.2", "l o", "lo w</w>", "e r</w>"]
+        self.special_tokens_map = {"unk_token": "<unk>"}
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+        with open(self.merges_file, "w", encoding="utf-8") as fp:
+            fp.write("\n".join(merges))
+
+        image_processor_map = {
+            "do_resize": True,
+            "size": 20,
+            "do_center_crop": True,
+            "crop_size": 18,
+            "do_normalize": True,
+            "image_mean": [0.48145466,
0.4578275, 0.40821073], + "image_std": [0.26862954, 0.26130258, 0.27577711], + } + self.image_processor_file = os.path.join(self.tmpdirname, IMAGE_PROCESSOR_NAME) + with open(self.image_processor_file, "w", encoding="utf-8") as fp: + json.dump(image_processor_map, fp) + + def get_tokenizer(self, **kwargs): + return ImageBindTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_rust_tokenizer(self, **kwargs): + return ImageBindTokenizerFast.from_pretrained(self.tmpdirname, **kwargs) + + def get_image_processor(self, **kwargs): + return ImageBindImageProcessor.from_pretrained(self.tmpdirname, **kwargs) + + def tearDown(self): + shutil.rmtree(self.tmpdirname) + + def prepare_image_inputs(self): + """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True, + or a list of PyTorch tensors if one specifies torchify=True. + """ + + image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)] + + image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs] + + return image_inputs + + def test_save_load_pretrained_default(self): + tokenizer_slow = self.get_tokenizer() + tokenizer_fast = self.get_rust_tokenizer() + image_processor = self.get_image_processor() + + processor_slow = ImageBindProcessor(tokenizer=tokenizer_slow, image_processor=image_processor) + processor_slow.save_pretrained(self.tmpdirname) + processor_slow = ImageBindProcessor.from_pretrained(self.tmpdirname, use_fast=False) + + processor_fast = ImageBindProcessor(tokenizer=tokenizer_fast, image_processor=image_processor) + processor_fast.save_pretrained(self.tmpdirname) + processor_fast = ImageBindProcessor.from_pretrained(self.tmpdirname) + + self.assertEqual(processor_slow.tokenizer.get_vocab(), tokenizer_slow.get_vocab()) + self.assertEqual(processor_fast.tokenizer.get_vocab(), tokenizer_fast.get_vocab()) + self.assertEqual(tokenizer_slow.get_vocab(), tokenizer_fast.get_vocab()) + 
self.assertIsInstance(processor_slow.tokenizer, ImageBindTokenizer) + self.assertIsInstance(processor_fast.tokenizer, ImageBindTokenizerFast) + + self.assertEqual(processor_slow.image_processor.to_json_string(), image_processor.to_json_string()) + self.assertEqual(processor_fast.image_processor.to_json_string(), image_processor.to_json_string()) + self.assertIsInstance(processor_slow.image_processor, ImageBindImageProcessor) + self.assertIsInstance(processor_fast.image_processor, ImageBindImageProcessor) + + def test_save_load_pretrained_additional_features(self): + processor = ImageBindProcessor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor()) + processor.save_pretrained(self.tmpdirname) + + tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)") + image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0) + + processor = ImageBindProcessor.from_pretrained( + self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0 + ) + + self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab()) + self.assertIsInstance(processor.tokenizer, ImageBindTokenizerFast) + + self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string()) + self.assertIsInstance(processor.image_processor, ImageBindImageProcessor) + + def test_image_processor(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + image_input = self.prepare_image_inputs() + + input_image_proc = image_processor(image_input, return_tensors="np") + input_processor = processor(images=image_input, return_tensors="np") + + for key in input_image_proc.keys(): + self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2) + + def test_tokenizer(self): + image_processor = 
self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + + encoded_processor = processor(text=input_str) + + encoded_tok = tokenizer(input_str) + + for key in encoded_tok.keys(): + self.assertListEqual(encoded_tok[key], encoded_processor[key]) + + def test_processor(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + image_input = self.prepare_image_inputs() + + inputs = processor(text=input_str, images=image_input) + + self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "pixel_values"]) + + # test if it raises when no input is passed + with pytest.raises(ValueError): + processor() + + def test_tokenizer_decode(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]] + + decoded_processor = processor.batch_decode(predicted_ids) + decoded_tok = tokenizer.batch_decode(predicted_ids) + + self.assertListEqual(decoded_tok, decoded_processor) + + def test_model_input_names(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + image_input = self.prepare_image_inputs() + + inputs = processor(text=input_str, images=image_input) + + self.assertListEqual(list(inputs.keys()), processor.model_input_names) \ No newline at end of file diff --git a/tests/models/imagebind/test_tokenization_imagebind.py b/tests/models/imagebind/test_tokenization_imagebind.py new file mode 100644 index 000000000000..1f465dc547a1 --- /dev/null +++ 
b/tests/models/imagebind/test_tokenization_imagebind.py
@@ -0,0 +1,187 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import json
+import os
+import unittest
+
+from transformers import ImageBindTokenizer, ImageBindTokenizerFast
+from transformers.models.imagebind.tokenization_imagebind import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_ftfy, require_tokenizers
+
+from ...test_tokenization_common import TokenizerTesterMixin
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+@require_tokenizers
+class ImageBindTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
+    tokenizer_class = ImageBindTokenizer
+    rust_tokenizer_class = ImageBindTokenizerFast
+    test_rust_tokenizer = True
+    from_pretrained_kwargs = {}
+    test_seq2seq = False
+
+    def setUp(self):
+        super().setUp()
+
+        # fmt: off
+        vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "lo", "l</w>", "w</w>", "r</w>", "t</w>", "low</w>", "er</w>", "lowest</w>", "newer</w>", "wider", "<unk>", "<|startoftext|>", "<|endoftext|>"]
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        merges = ["#version: 0.2", "l o", "lo w</w>", "e r</w>"]
+        self.special_tokens_map = {"unk_token": "<unk>"}
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens)
+ "\n") + with open(self.merges_file, "w", encoding="utf-8") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return ImageBindTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_rust_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return ImageBindTokenizerFast.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self, tokenizer): + input_text = "lower newer" + output_text = "lower newer" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = ImageBindTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map) + text = "lower newer" + bpe_tokens = ["lo", "w", "er", "n", "e", "w", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [tokenizer.unk_token] + input_bpe_tokens = [10, 2, 16, 9, 3, 2, 16, 20] + self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + @require_ftfy + def test_check_encoding_slow_fast(self): + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): + tokenizer_s = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs) + tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) + + text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat" + text_tokenized_s = tokenizer_s.tokenize(text) + text_tokenized_r = tokenizer_r.tokenize(text) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on an example containing a character (Latin Small Letter A + # with Tilde) encoded in 2 different ways + text = "xa\u0303y" + " " + "x\xe3y" + text_tokenized_s = tokenizer_s.tokenize(text) + text_tokenized_r = tokenizer_r.tokenize(text) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on 
unicode of space type + spaces_unicodes = [ + "\u0009", # (horizontal tab, '\t') + "\u000B", # (vertical tab) + "\u000C", # (form feed) + "\u0020", # (space, ' ') + "\u200E", # (left-to-right mark):w + "\u200F", # (right-to-left mark) + ] + for unicode_seq in spaces_unicodes: + text_tokenized_s = tokenizer_s.tokenize(unicode_seq) + text_tokenized_r = tokenizer_r.tokenize(unicode_seq) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on unicode of line break type + line_break_unicodes = [ + "\u000A", # (line feed, '\n') + "\r\n", # (carriage return and line feed, '\r\n') + "\u000D", # (carriage return, '\r') + "\r", # (carriage return, '\r') + "\u000D", # (carriage return, '\r') + "\u2028", # (line separator) + "\u2029", # (paragraph separator) + # "\u0085", # (next line) + ] + + # The tokenization is not identical for the character "\u0085" (next line). The slow version transforms + # it into the Horizontal Ellipsis character "…" ("\u2026") while the fast version transforms it into a + # space (and thus into an empty list). 
+ + for unicode_seq in line_break_unicodes: + text_tokenized_s = tokenizer_s.tokenize(unicode_seq) + text_tokenized_r = tokenizer_r.tokenize(unicode_seq) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + def test_offsets_mapping_with_different_add_prefix_space_argument(self): + # Test which aims to verify that the offsets are well adapted to the argument `add_prefix_space` + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): + text_of_1_token = "hello" # `hello` is a token in the vocabulary of `pretrained_name` + text = f"{text_of_1_token} {text_of_1_token}" + + tokenizer_r = self.rust_tokenizer_class.from_pretrained( + pretrained_name, + use_fast=True, + ) + encoding = tokenizer_r(text, return_offsets_mapping=True, add_special_tokens=False) + self.assertEqual(encoding.offset_mapping[0], (0, len(text_of_1_token))) + self.assertEqual( + encoding.offset_mapping[1], + (len(text_of_1_token) + 1, len(text_of_1_token) + 1 + len(text_of_1_token)), + ) + + text = f" {text}" + + tokenizer_r = self.rust_tokenizer_class.from_pretrained( + pretrained_name, + use_fast=True, + ) + encoding = tokenizer_r(text, return_offsets_mapping=True, add_special_tokens=False) + self.assertEqual(encoding.offset_mapping[0], (1, 1 + len(text_of_1_token))) + self.assertEqual( + encoding.offset_mapping[1], + (1 + len(text_of_1_token) + 1, 1 + len(text_of_1_token) + 1 + len(text_of_1_token)), + ) + + def test_log_warning(self): + # Test related to the breaking change introduced in transformers v4.17.0 + # We need to check that an error in raised when the user try to load a previous version of the tokenizer. + with self.assertRaises(ValueError) as context: + self.rust_tokenizer_class.from_pretrained("robot-test/old-imagebind-tokenizer") + + self.assertTrue( + context.exception.args[0].startswith( + "The `backend_tokenizer` provided does not match the expected format." 
+ ) + ) + + @require_ftfy + def test_tokenization_python_rust_equals(self): + super().test_tokenization_python_rust_equals() + + # overwrite common test + def test_added_tokens_do_lower_case(self): + # ImageBind always lower cases letters + pass \ No newline at end of file
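The toy vocab/merges fixture in `setUp` exercises CLIP-style BPE, where the last symbol of each word carries an end-of-word marker. As a minimal sketch (a hypothetical standalone helper, not the actual `ImageBindTokenizer` implementation) of the greedy merge loop behind the expected tokens in `test_full_tokenizer`:

```python
def bpe(word, merges):
    """Greedily apply BPE merge rules, lowest-ranked (earliest-learned) pair first.

    `word` is a sequence of symbols; CLIP-style vocabularies append `</w>`
    to the final symbol to mark the end of a word.
    """
    # Rank each merge rule by its position in the merges list.
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Score every adjacent pair; un-mergeable pairs get rank infinity.
        pairs = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(pairs)  # lowest rank wins, ties break leftmost
        if best_rank == float("inf"):
            break  # no applicable merge rule left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# The fixture's merge rules, minus the "#version: 0.2" header line.
merges = ["l o", "lo w</w>", "e r</w>"]
print(bpe(("l", "o", "w", "e", "r</w>"), merges))  # -> ['lo', 'w', 'er</w>']
```

This is why the test expects `"lower newer"` to produce `["lo", "w", "er</w>", "n", "e", "w", "er</w>"]`: `l o` merges first (rank 0), `e r</w>` merges next (rank 2), and `lo w</w>` never fires because plain `w` is not word-final here.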