diff --git a/README.md b/README.md index e65d512defc4..7a88c12b99ec 100644 --- a/README.md +++ b/README.md @@ -383,6 +383,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. 
**[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/README_es.md b/README_es.md index d42750237fc2..5e6a03688870 100644 --- a/README_es.md +++ b/README_es.md @@ -358,6 +358,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. 
Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/README_hd.md b/README_hd.md index e4dd69943f4c..84b24e6707cc 100644 --- a/README_hd.md +++ b/README_hd.md @@ -332,6 +332,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा। 1. 
**[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा। 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce से) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. 
द्वाराअनुसंधान पत्र [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) के साथ जारी किया गया diff --git a/README_ja.md b/README_ja.md index ea8de35ff3de..2e1f9283ac55 100644 --- a/README_ja.md +++ b/README_ja.md @@ -392,6 +392,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI から) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever から公開された研究論文: [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 1. 
**[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce から) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. から公開された研究論文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) diff --git a/README_ko.md b/README_ko.md index c88d62a9ad9f..ff4a6f929fe2 100644 --- a/README_ko.md +++ b/README_ko.md @@ -307,6 +307,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. 
**[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI 에서) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 의 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 논문과 함께 발표했습니다. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce 에서 제공)은 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.의 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)논문과 함께 발표했습니다. diff --git a/README_zh-hans.md b/README_zh-hans.md index e6e6ab59cd06..4afd7161cafc 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -331,6 +331,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。 1. 
**[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (来自 OpenAI) 伴随论文 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 由 Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 发布。 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. 
**[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (来自 Salesforce) 伴随论文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 由 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index 21cbe14be804..bfeb64487f5f 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -343,6 +343,7 @@ conda install -c huggingface transformers 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageBind](https://huggingface.co/docs/transformers/model_doc/imagebind)** (from FAIR and Meta AI) released with the paper [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 1. 
**[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index d7d593b21e62..39aba210f0f9 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -529,6 +529,8 @@ title: FocalNet - local: model_doc/glpn title: GLPN + - local: model_doc/imagebind + title: ImageBind - local: model_doc/imagegpt title: ImageGPT - local: model_doc/levit diff --git a/docs/source/en/model_doc/imagebind.md b/docs/source/en/model_doc/imagebind.md new file mode 100644 index 000000000000..66784c31e165 --- /dev/null +++ b/docs/source/en/model_doc/imagebind.md @@ -0,0 +1,97 @@ + + +# ImageBind + +## Overview + +The ImageBind model was proposed in [ImageBind: One Embedding Space To Bind Them All](https://arxiv.org/abs/2305.05665) by Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. 
+ImageBind is a multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images. +For any input from these six modalities, it outputs an embedding of the same size that can be used for cross-modal and multimodal tasks. + +The abstract from the paper is the following: + +*We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.* + +Tips: + + + +This model was contributed by [dg845](https://huggingface.co/dg845) and [shehan97](https://huggingface.co/shehan97). +The original code can be found [here](https://github.com/facebookresearch/ImageBind).
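+Because all six modalities land in one embedding space, cross-modal retrieval reduces to cosine similarity between vectors of the same dimensionality. The following is a minimal, self-contained sketch of that retrieval step — the toy vectors are made-up stand-ins for real ImageBind outputs, and no model is exercised here:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Index of the candidate embedding closest to the query.

    Modality does not matter: a text query can score audio or image
    candidates, since all embeddings share the same space.
    """
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


# Toy 4-dim embeddings standing in for a text query and two audio clips.
text_emb = [1.0, 0.0, 0.0, 0.0]
audio_embs = [
    [0.9, 0.1, 0.0, 0.0],  # close to the text embedding
    [0.0, 0.0, 1.0, 0.0],  # unrelated
]
print(retrieve(text_emb, audio_embs))  # → 0
```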
+ + +## ImageBindConfig + +[[autodoc]] ImageBindConfig + - from_text_vision_configs + +## ImageBindTextConfig + +[[autodoc]] ImageBindTextConfig + +## ImageBindVisionConfig + +[[autodoc]] ImageBindVisionConfig + +## ImageBindTokenizer + +[[autodoc]] ImageBindTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## ImageBindTokenizerFast + +[[autodoc]] ImageBindTokenizerFast + +## ImageBindImageProcessor + +[[autodoc]] ImageBindImageProcessor + - preprocess + +## ImageBindFeatureExtractor + +[[autodoc]] ImageBindFeatureExtractor + +## ImageBindProcessor + +[[autodoc]] ImageBindProcessor + +## ImageBindModel + +[[autodoc]] ImageBindModel + - forward + - get_text_features + - get_image_features + +## ImageBindTextModel + +[[autodoc]] ImageBindTextModel + - forward + +## ImageBindTextModelWithProjection + +[[autodoc]] ImageBindTextModelWithProjection + - forward + +## ImageBindVisionModelWithProjection + +[[autodoc]] ImageBindVisionModelWithProjection + - forward + + +## ImageBindVisionModel + +[[autodoc]] ImageBindVisionModel + - forward \ No newline at end of file diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index e09752f5f39c..c183df75ecd8 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -383,6 +383,17 @@ "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP", "IdeficsConfig", ], + "models.imagebind": [ + "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ImageBindAudioConfig", + "ImageBindConfig", + "ImageBindDepthConfig", + "ImageBindImuConfig", + "ImageBindOnnxConfig", + "ImageBindTextConfig", + "ImageBindThermalConfig", + "ImageBindVisionConfig", + ], "models.imagegpt": ["IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ImageGPTConfig"], "models.informer": ["INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "InformerConfig"], "models.instructblip": [ @@ -889,6 +900,7 @@ _import_structure["models.gpt_neox"].append("GPTNeoXTokenizerFast") 
_import_structure["models.gpt_neox_japanese"].append("GPTNeoXJapaneseTokenizer") _import_structure["models.herbert"].append("HerbertTokenizerFast") + _import_structure["models.imagebind"].append("ImageBindTokenizerFast") _import_structure["models.layoutlm"].append("LayoutLMTokenizerFast") _import_structure["models.layoutlmv2"].append("LayoutLMv2TokenizerFast") _import_structure["models.layoutlmv3"].append("LayoutLMv3TokenizerFast") @@ -999,6 +1011,7 @@ _import_structure["models.fuyu"].extend(["FuyuImageProcessor", "FuyuProcessor"]) _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"]) _import_structure["models.idefics"].extend(["IdeficsImageProcessor"]) + _import_structure["models.imagebind"].extend(["ImageBindFeatureExtractor", "ImageBindImageProcessor"]) _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"]) _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"]) _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"]) @@ -2041,6 +2054,25 @@ "IdeficsProcessor", ] ) + _import_structure["models.imagebind"].extend( + [ + "IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST", + "ImageBindAudioModel", + "ImageBindAudioModelWithProjection", + "ImageBindDepthModel", + "ImageBindDepthModelWithProjection", + "ImageBindImuModel", + "ImageBindImuModelWithProjection", + "ImageBindModel", + "ImageBindPreTrainedModel", + "ImageBindTextModel", + "ImageBindTextModelWithProjection", + "ImageBindThermalModel", + "ImageBindThermalModelWithProjection", + "ImageBindVisionModel", + "ImageBindVisionModelWithProjection", + ] + ) _import_structure["models.imagegpt"].extend( [ "IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4622,6 +4654,17 @@ IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP, IdeficsConfig, ) + from .models.imagebind import ( + IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP, + ImageBindAudioConfig, + ImageBindConfig, + 
ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindOnnxConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, + ) from .models.imagegpt import IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, ImageGPTConfig from .models.informer import INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, InformerConfig from .models.instructblip import ( @@ -5093,6 +5136,7 @@ from .models.gpt_neox import GPTNeoXTokenizerFast from .models.gpt_neox_japanese import GPTNeoXJapaneseTokenizer from .models.herbert import HerbertTokenizerFast + from .models.imagebind import ImageBindTokenizerFast from .models.layoutlm import LayoutLMTokenizerFast from .models.layoutlmv2 import LayoutLMv2TokenizerFast from .models.layoutlmv3 import LayoutLMv3TokenizerFast @@ -5179,6 +5223,7 @@ from .models.fuyu import FuyuImageProcessor, FuyuProcessor from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor from .models.idefics import IdeficsImageProcessor + from .models.imagebind import ImageBindFeatureExtractor, ImageBindImageProcessor from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor @@ -6053,6 +6098,23 @@ IdeficsPreTrainedModel, IdeficsProcessor, ) + from .models.imagebind import ( + IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST, + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) from .models.imagegpt import ( IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST, ImageGPTForCausalImageModeling, diff --git a/src/transformers/models/__init__.py 
b/src/transformers/models/__init__.py index 997ee82b4324..f223710d6cdf 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -106,6 +106,7 @@ hubert, ibert, idefics, + imagebind, imagegpt, informer, instructblip, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index 78a33270e7ac..9e5b16be3193 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -114,6 +114,7 @@ ("hubert", "HubertConfig"), ("ibert", "IBertConfig"), ("idefics", "IdeficsConfig"), + ("imagebind", "ImageBindConfig"), ("imagegpt", "ImageGPTConfig"), ("informer", "InformerConfig"), ("instructblip", "InstructBlipConfig"), @@ -333,6 +334,7 @@ ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("idefics", "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("imagebind", "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("informer", "INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("instructblip", "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -553,6 +555,7 @@ ("hubert", "Hubert"), ("ibert", "I-BERT"), ("idefics", "IDEFICS"), + ("imagebind", "ImageBind"), ("imagegpt", "ImageGPT"), ("informer", "Informer"), ("instructblip", "InstructBLIP"), diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index 395875dfa14b..07d0d8c86a57 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -61,6 +61,7 @@ ("glpn", "GLPNFeatureExtractor"), ("groupvit", "CLIPFeatureExtractor"), ("hubert", "Wav2Vec2FeatureExtractor"), + ("imagebind", "ImageBindFeatureExtractor"), ("imagegpt", "ImageGPTFeatureExtractor"), ("layoutlmv2", "LayoutLMv2FeatureExtractor"), ("layoutlmv3", "LayoutLMv3FeatureExtractor"), diff --git 
a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py index 168b7a5dff3a..ab875ea332a1 100644 --- a/src/transformers/models/auto/image_processing_auto.py +++ b/src/transformers/models/auto/image_processing_auto.py @@ -69,6 +69,7 @@ ("glpn", "GLPNImageProcessor"), ("groupvit", "CLIPImageProcessor"), ("idefics", "IdeficsImageProcessor"), + ("imagebind", "ImageBindImageProcessor"), ("imagegpt", "ImageGPTImageProcessor"), ("instructblip", "BlipImageProcessor"), ("layoutlmv2", "LayoutLMv2ImageProcessor"), diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index a62880d32696..c8042b8b7896 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -110,6 +110,7 @@ ("hubert", "HubertModel"), ("ibert", "IBertModel"), ("idefics", "IdeficsModel"), + ("imagebind", "ImageBindModel"), ("imagegpt", "ImageGPTModel"), ("informer", "InformerModel"), ("jukebox", "JukeboxModel"), @@ -1074,6 +1075,7 @@ ("chinese_clip", "ChineseCLIPModel"), ("clip", "CLIPModel"), ("clipseg", "CLIPSegModel"), + ("imagebind", "ImageBindModel"), ] ) diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py index e1b3bac2de05..3e4447aebdb4 100644 --- a/src/transformers/models/auto/processing_auto.py +++ b/src/transformers/models/auto/processing_auto.py @@ -60,6 +60,7 @@ ("groupvit", "CLIPProcessor"), ("hubert", "Wav2Vec2Processor"), ("idefics", "IdeficsProcessor"), + ("imagebind", "ImageBindProcessor"), ("instructblip", "InstructBlipProcessor"), ("kosmos-2", "Kosmos2Processor"), ("layoutlmv2", "LayoutLMv2Processor"), diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 04a1bc77e655..5c1831532067 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -181,6 
+181,13 @@ ("hubert", ("Wav2Vec2CTCTokenizer", None)), ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), ("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)), + ( + "imagebind", + ( + "ImageBindTokenizer", + "ImageBindTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("jukebox", ("JukeboxTokenizer", None)), ( diff --git a/src/transformers/models/imagebind/__init__.py b/src/transformers/models/imagebind/__init__.py new file mode 100644 index 000000000000..d6d328d9822e --- /dev/null +++ b/src/transformers/models/imagebind/__init__.py @@ -0,0 +1,163 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_speech_available, + is_tokenizers_available, + is_torch_available, + is_vision_available, +) + + +_import_structure = { + "configuration_imagebind": [ + "IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ImageBindAudioConfig", + "ImageBindConfig", + "ImageBindDepthConfig", + "ImageBindImuConfig", + "ImageBindOnnxConfig", + "ImageBindTextConfig", + "ImageBindThermalConfig", + "ImageBindVisionConfig", + ], + "feature_extraction_imagebind": ["ImageBindImuFeatureExtractor"], + "processing_imagebind": ["ImageBindProcessor"], + "tokenization_imagebind": ["ImageBindTokenizer"], +} + +# TODO: add dependencies for other modalities, if necessary + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["tokenization_imagebind_fast"] = ["ImageBindTokenizerFast"] + +try: + if not is_vision_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["feature_extraction_imagebind"].extend(["ImageBindFeatureExtractor"]) + _import_structure["image_processing_imagebind"] = ["ImageBindImageProcessor", "ImageBindDepthImageProcessor", "ImageBindThermalImageProcessor"] + +try: + if not is_speech_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["feature_extraction_imagebind"].extend(["ImageBindAudioFeatureExtractor"]) + + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_imagebind"] = [ + "IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST", + "ImageBindAudioModel", + "ImageBindAudioModelWithProjection", + "ImageBindDepthModel", + "ImageBindDepthModelWithProjection", + "ImageBindImuModel", + "ImageBindImuModelWithProjection", + 
"ImageBindModel", + "ImageBindPreTrainedModel", + "ImageBindTextModel", + "ImageBindTextModelWithProjection", + "ImageBindThermalModel", + "ImageBindThermalModelWithProjection", + "ImageBindVisionModel", + "ImageBindVisionModelWithProjection", + ] + +if TYPE_CHECKING: + from .configuration_imagebind import ( + IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP, + ImageBindAudioConfig, + ImageBindConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindOnnxConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, + ) + from .feature_extraction_imagebind import ImageBindImuFeatureExtractor + from .processing_imagebind import ImageBindProcessor + from .tokenization_imagebind import ImageBindTokenizer + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .tokenization_imagebind_fast import ImageBindTokenizerFast + + try: + if not is_vision_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .feature_extraction_imagebind import ImageBindFeatureExtractor + from .image_processing_imagebind import ImageBindImageProcessor, ImageBindDepthImageProcessor, ImageBindThermalImageProcessor + + try: + if not is_speech_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .feature_extraction_imagebind import ImageBindAudioFeatureExtractor + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_imagebind import ( + IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST, + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + 
ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) \ No newline at end of file diff --git a/src/transformers/models/imagebind/configuration_imagebind.py b/src/transformers/models/imagebind/configuration_imagebind.py new file mode 100644 index 000000000000..3a9355bba583 --- /dev/null +++ b/src/transformers/models/imagebind/configuration_imagebind.py @@ -0,0 +1,1178 @@ +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" ImageBind model configuration""" + + +import copy +import os +from collections import OrderedDict +from typing import TYPE_CHECKING, Any, Mapping, Optional, Union + + +if TYPE_CHECKING: + from ...processing_utils import ProcessorMixin + from ...utils import TensorType + +from ...configuration_utils import PretrainedConfig +from ...onnx import OnnxConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +IMAGEBIND_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/config.json", +} + + +class ImageBindTextConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindTextModel`]. 
It is used to instantiate an ImageBind + text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the text encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 49408): + Vocabulary size of the ImageBind text model. Defines the number of different tokens that can be represented by + the `input_ids` passed when calling [`ImageBindModel`]. + hidden_size (`int`, *optional*, defaults to 1024): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 4096): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind text model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 24): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder. + max_position_embeddings (`int`, *optional*, defaults to 77): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
+ layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `False`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `14.2857`): + The initial value of the `logit_scale` parameter for the text component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `True`): + Whether the `logit_scale` is learnable or fixed.
+ + Example: + + ```python + >>> from transformers import ImageBindTextConfig, ImageBindTextModel + + >>> # Initializing a ImageBindTextConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindTextConfig() + + >>> # Initializing a ImageBindTextModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindTextModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_text_model" + + def __init__( + self, + vocab_size=49408, + hidden_size=1024, + intermediate_size=4096, + projection_dim=1024, + num_hidden_layers=24, + num_attention_heads=16, + max_position_embeddings=77, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=False, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=14.2857, + learnable_logit_scale=True, + pad_token_id=1, + bos_token_id=0, + eos_token_id=2, + **kwargs, + ): + super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = 
cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the text config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["text_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindVisionModel`]. It is used to instantiate an + ImageBind vision encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the vision encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 1280): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 5120): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind vision model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder.
+ num_channels (`int`, *optional*, defaults to 3): + The number of channels in the input images. + num_frames (`int`, *optional*, defaults to 2): + If using video (spatiotemporal) input, the number of video frames in the spatiotemporal data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int` or `Tuple[int]`, *optional*, defaults to `(2, 14, 14)`): + The size (resolution) of each spatiotemporal patch. If `patch_size` is an int, spatial patches of shape + `(patch_size, patch_size)` will be used; otherwise, `patch_size` should be a tuple of shape + `(time_patch_size, height_patch_size, width_patch_size)`. + stride (`int` or `Tuple[int]`, *optional*, defaults to `(2, 14, 14)`): + The stride of the image patch embedding. If `stride` is an int, spatial strides of shape + `(stride, stride)` will be used; otherwise, `stride` should be a tuple of shape + `(time_stride, height_stride, width_stride)`. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `False`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers.
+ initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `None`): + The initial value of the `logit_scale` parameter for the vision component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + + ```python + >>> from transformers import ImageBindVisionConfig, ImageBindVisionModel + + >>> # Initializing a ImageBindVisionConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindVisionConfig() + + >>> # Initializing a ImageBindVisionModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindVisionModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "imagebind_vision_model" + + def __init__( + self, + hidden_size=1280, + intermediate_size=5120, + projection_dim=1024, + num_hidden_layers=32, + num_attention_heads=16, + num_channels=3, + num_frames=2, + image_size=224, + patch_size=(2, 14, 14), + stride=(2, 14, 14), + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=False, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=None, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.num_frames = num_frames + 
self.patch_size = patch_size + self.stride = stride + self.image_size = image_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the vision config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["vision_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindAudioConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindAudioModel`]. It is used to instantiate a + ImageBind audio encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the audio encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. 
+ + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind audio model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_mel_bins (`int`, *optional*, defaults to 128): + The number of frequency bins in the log-mel spectrogram. + target_len (`int`, *optional*, defaults to 204): + TODO + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input audio data. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 10): + The stride of the patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.1): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `20.0`): + The initial value of the `logit_scale` parameter for the audio component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindAudioConfig, ImageBindAudioModel + + >>> # Initializing a ImageBindAudioConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindAudioConfig() + + >>> # Initializing a ImageBindAudioModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindAudioModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_audio_model" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=12, + num_mel_bins=128, + target_len=204, + num_channels=1, + patch_size=16, + stride=10, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.1, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=20.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads +
self.num_mel_bins = num_mel_bins + self.target_len = target_len + self.num_channels = num_channels + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the audio config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["audio_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindDepthConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindDepthModel`]. It is used to instantiate a + ImageBind depth encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the depth encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. 
+ + Args: + hidden_size (`int`, *optional*, defaults to 384): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind depth model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 8): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input depth data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the depth patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 16): + The stride of the depth patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `5.0`): + The initial value of the `logit_scale` parameter for the depth component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindDepthConfig, ImageBindDepthModel + + >>> # Initializing a ImageBindDepthConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindDepthConfig() + + >>> # Initializing a ImageBindDepthModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindDepthModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_depth_model" + + def __init__( + self, + hidden_size=384, + intermediate_size=1536, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=8, + num_channels=1, + image_size=224, + patch_size=16, + stride=16, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels =
num_channels + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the depth config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["depth_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindThermalConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindThermalModel`]. It is used to instantiate an + ImageBind thermal encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the thermal encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information.
+ + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind thermal model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 1): + The number of channels in the input thermal data. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The kernel size of the thermal patch embedding 2D convolution layer. + stride (`int`, *optional*, defaults to 16): + The stride of the thermal patch embedding 2D convolution layer. + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities.
+ drop_path_rate (`float`, *optional*, defaults to 0.0): + The dropout probability for the DropPath (stochastic depth) regularization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `10.0`): + The initial value of the `logit_scale` parameter for the thermal component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindThermalConfig, ImageBindThermalModel + + >>> # Initializing a ImageBindThermalConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindThermalConfig() + + >>> # Initializing a ImageBindThermalModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindThermalModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "imagebind_thermal_model" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + projection_dim=1024, + num_hidden_layers=12, + num_attention_heads=12, + num_channels=1, + image_size=224, + patch_size=16, + stride=16, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=10.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads +
self.num_channels = num_channels + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the thermal config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["thermal_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindImuConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ImageBindImuModel`]. It is used to instantiate an + ImageBind IMU encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the IMU encoder of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information.
+ + Args: + hidden_size (`int`, *optional*, defaults to 512): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 2048): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 1024): + If the ImageBind IMU model has an output projection layer, the dimension to which that projection layer + maps. + num_hidden_layers (`int`, *optional*, defaults to 6): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 8): + Number of attention heads for each attention layer in the Transformer encoder. + input_shape (`Tuple[int]`, *optional*, defaults to `(6, 2000)`): + The shape of the input IMU data. + kernel_size (`int`, *optional*, defaults to 8): + The kernel size of the 2D convolution layers. (TODO) + hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-6): + The epsilon used by the layer normalization layers. + add_kv_bias (`bool`, *optional*, defaults to `True`): + Whether to add an extra learnable bias token to the attention key and value sequences. This is based on the + `add_bias_kv` argument to [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.7): + The dropout probability for the DropPath (stochastic) regularization layers.
+ final_dropout (`float`, *optional*, defaults to 0.5): + The dropout probability for the dropout layer that occurs after the post layer norm and before the linear + projection is applied. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + logit_scale_init_value (`float`, *optional*, defaults to `5.0`): + The initial value of the `logit_scale` parameter for the IMU component. If `None`, the logits will not + be scaled. + learnable_logit_scale (`bool`, *optional*, defaults to `False`): + Whether the `logit_scale` is learnable or fixed. + + Example: + ```python + >>> from transformers import ImageBindImuConfig, ImageBindImuModel + + >>> # Initializing an ImageBindImuConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindImuConfig() + + >>> # Initializing an ImageBindImuModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindImuModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + def __init__( + self, + hidden_size=512, + intermediate_size=2048, + projection_dim=1024, + num_hidden_layers=6, + num_attention_heads=8, + input_shape=(6, 2000), + kernel_size=8, + hidden_act="quick_gelu", + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.7, + final_dropout=0.5, + initializer_range=0.02, + initializer_factor=1.0, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads =
num_attention_heads + self.input_shape = input_shape + self.kernel_size = kernel_size + self.initializer_range = initializer_range + self.initializer_factor = initializer_factor + self.add_kv_bias = add_kv_bias + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.final_dropout = final_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the imu config dict if we are loading from ImageBindConfig + if config_dict.get("model_type") == "imagebind": + config_dict = config_dict["imu_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ImageBindConfig(PretrainedConfig): + r""" + [`ImageBindConfig`] is the configuration class to store the configuration of a [`ImageBindModel`]. It is used to instantiate + an ImageBind model according to the specified arguments, defining the model configs for each modality (text, vision, audio, + depth, thermal and IMU). Instantiating a configuration with the defaults will yield a similar configuration to that of the ImageBind + [facebook/imagebind-huge](https://huggingface.co/facebook/imagebind-huge) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs.
Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindTextConfig`]. + vision_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindVisionConfig`]. + audio_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindAudioConfig`]. + depth_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindDepthConfig`]. + thermal_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindThermalConfig`]. + imu_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ImageBindImuConfig`]. + projection_dim (`int`, *optional*, defaults to 1024): + Dimensionality of the modality projection layers. + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import ImageBindConfig, ImageBindModel + + >>> # Initializing an ImageBindConfig with facebook/imagebind-huge style configuration + >>> configuration = ImageBindConfig() + + >>> # Initializing an ImageBindModel (with random weights) from the facebook/imagebind-huge style configuration + >>> model = ImageBindModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize an ImageBindConfig from an ImageBindTextConfig and an ImageBindVisionConfig + >>> from transformers import ImageBindTextConfig, ImageBindVisionConfig + + >>> # Initializing ImageBindText and ImageBindVision configurations + >>> config_text = ImageBindTextConfig() + >>> config_vision = ImageBindVisionConfig() + + >>> config = ImageBindConfig.from_text_vision_configs(config_text, config_vision) + ```""" + + model_type = "imagebind" + is_composition = True + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + depth_config=None, + thermal_config=None, + imu_config=None, + projection_dim=1024, + **kwargs, + ): + # If `_config_dict` arguments exist, we use them for backward compatibility.
+ # We pop out these attributes before calling `super().__init__` to avoid them being saved (which causes a lot + of confusion!). + text_config_dict = kwargs.pop("text_config_dict", None) + vision_config_dict = kwargs.pop("vision_config_dict", None) + audio_config_dict = kwargs.pop("audio_config_dict", None) + depth_config_dict = kwargs.pop("depth_config_dict", None) + thermal_config_dict = kwargs.pop("thermal_config_dict", None) + imu_config_dict = kwargs.pop("imu_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be the same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. + if text_config_dict is not None: + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = ImageBindTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but are different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `ImageBindTextConfig`. The " + f'value `text_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`.
+ text_config.update(_text_config_dict) + + if vision_config_dict is not None: + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. + _vision_config_dict = ImageBindVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but are different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize `ImageBindVisionConfig`. " + f'The value `vision_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) + + if audio_config_dict is not None: + if audio_config is None: + audio_config = {} + + # This is the complete result when using `audio_config_dict`. + _audio_config_dict = ImageBindAudioConfig(**audio_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _audio_config_dict: + _audio_config_dict["id2label"] = { + str(key): value for key, value in _audio_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_audio_config_dict` and `audio_config` but are different.
+ for key, value in _audio_config_dict.items(): + if key in audio_config and value != audio_config[key] and key not in ["transformers_version"]: + # If specified in `audio_config_dict` + if key in audio_config_dict: + message = ( + f"`{key}` is found in both `audio_config_dict` and `audio_config` but with different " + f'values. The value `audio_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`audio_config_dict` is provided which will be used to initialize `ImageBindAudioConfig`. " + f'The value `audio_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `audio_config` with the ones in `_audio_config_dict`. + audio_config.update(_audio_config_dict) + + if depth_config_dict is not None: + if depth_config is None: + depth_config = {} + + # This is the complete result when using `depth_config_dict`. + _depth_config_dict = ImageBindDepthConfig(**depth_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _depth_config_dict: + _depth_config_dict["id2label"] = { + str(key): value for key, value in _depth_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_depth_config_dict` and `depth_config` but are different. + for key, value in _depth_config_dict.items(): + if key in depth_config and value != depth_config[key] and key not in ["transformers_version"]: + # If specified in `depth_config_dict` + if key in depth_config_dict: + message = ( + f"`{key}` is found in both `depth_config_dict` and `depth_config` but with different " + f'values. The value `depth_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`depth_config_dict` is provided which will be used to initialize `ImageBindDepthConfig`. " + f'The value `depth_config["{key}"]` will be overridden.'
+ ) + logger.warning(message) + + # Update all values in `depth_config` with the ones in `_depth_config_dict`. + depth_config.update(_depth_config_dict) + + if thermal_config_dict is not None: + if thermal_config is None: + thermal_config = {} + + # This is the complete result when using `thermal_config_dict`. + _thermal_config_dict = ImageBindThermalConfig(**thermal_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _thermal_config_dict: + _thermal_config_dict["id2label"] = { + str(key): value for key, value in _thermal_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_thermal_config_dict` and `thermal_config` but are different. + for key, value in _thermal_config_dict.items(): + if key in thermal_config and value != thermal_config[key] and key not in ["transformers_version"]: + # If specified in `thermal_config_dict` + if key in thermal_config_dict: + message = ( + f"`{key}` is found in both `thermal_config_dict` and `thermal_config` but with different " + f'values. The value `thermal_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`thermal_config_dict` is provided which will be used to initialize `ImageBindThermalConfig`. " + f'The value `thermal_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `thermal_config` with the ones in `_thermal_config_dict`. + thermal_config.update(_thermal_config_dict) + + if imu_config_dict is not None: + if imu_config is None: + imu_config = {} + + # This is the complete result when using `imu_config_dict`.
+ _imu_config_dict = ImageBindImuConfig(**imu_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _imu_config_dict: + _imu_config_dict["id2label"] = { + str(key): value for key, value in _imu_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_imu_config_dict` and `imu_config` but are different. + for key, value in _imu_config_dict.items(): + if key in imu_config and value != imu_config[key] and key not in ["transformers_version"]: + # If specified in `imu_config_dict` + if key in imu_config_dict: + message = ( + f"`{key}` is found in both `imu_config_dict` and `imu_config` but with different " + f'values. The value `imu_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`imu_config_dict` is provided which will be used to initialize `ImageBindImuConfig`. " + f'The value `imu_config["{key}"]` will be overridden.' + ) + logger.warning(message) + + # Update all values in `imu_config` with the ones in `_imu_config_dict`. + imu_config.update(_imu_config_dict) + + if text_config is None: + text_config = {} + logger.info("`text_config` is `None`. Initializing the `ImageBindTextConfig` with default values.") + + if vision_config is None: + vision_config = {} + logger.info("`vision_config` is `None`. Initializing the `ImageBindVisionConfig` with default values.") + + if audio_config is None: + audio_config = {} + logger.info("`audio_config` is `None`. Initializing the `ImageBindAudioConfig` with default values.") + + if depth_config is None: + depth_config = {} + logger.info("`depth_config` is `None`. Initializing the `ImageBindDepthConfig` with default values.") + + if thermal_config is None: + thermal_config = {} + logger.info("`thermal_config` is `None`. Initializing the `ImageBindThermalConfig` with default values.") + + if imu_config is None: + imu_config = {} + logger.info("`imu_config` is `None`.
Initializing the `ImageBindImuConfig` with default values.") + + self.text_config = ImageBindTextConfig(**text_config) + self.vision_config = ImageBindVisionConfig(**vision_config) + self.audio_config = ImageBindAudioConfig(**audio_config) + self.depth_config = ImageBindDepthConfig(**depth_config) + self.thermal_config = ImageBindThermalConfig(**thermal_config) + self.imu_config = ImageBindImuConfig(**imu_config) + + self.projection_dim = projection_dim + self.initializer_factor = 1.0 + + @classmethod + def from_text_vision_configs(cls, text_config: ImageBindTextConfig, vision_config: ImageBindVisionConfig, **kwargs): + r""" + Instantiate a [`ImageBindConfig`] (or a derived class) from ImageBind text model configuration and ImageBind vision model + configuration. + + Returns: + [`ImageBindConfig`]: An instance of a configuration object + """ + + return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) + + def to_dict(self): + """ + Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
+ + Returns: + `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, + """ + output = copy.deepcopy(self.__dict__) + output["text_config"] = self.text_config.to_dict() + output["vision_config"] = self.vision_config.to_dict() + output["audio_config"] = self.audio_config.to_dict() + output["depth_config"] = self.depth_config.to_dict() + output["thermal_config"] = self.thermal_config.to_dict() + output["imu_config"] = self.imu_config.to_dict() + output["model_type"] = self.__class__.model_type + return output + +# TODO: add other modalities +class ImageBindOnnxConfig(OnnxConfig): + @property + def inputs(self) -> Mapping[str, Mapping[int, str]]: + return OrderedDict( + [ + ("input_ids", {0: "batch", 1: "sequence"}), + ("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}), + ("attention_mask", {0: "batch", 1: "sequence"}), + ] + ) + + @property + def outputs(self) -> Mapping[str, Mapping[int, str]]: + return OrderedDict( + [ + ("logits_per_image", {0: "batch"}), + ("logits_per_text", {0: "batch"}), + ("text_embeds", {0: "batch"}), + ("image_embeds", {0: "batch"}), + ] + ) + + @property + def atol_for_validation(self) -> float: + return 1e-4 + + def generate_dummy_inputs( + self, + processor: "ProcessorMixin", + batch_size: int = -1, + seq_length: int = -1, + framework: Optional["TensorType"] = None, + ) -> Mapping[str, Any]: + text_input_dict = super().generate_dummy_inputs( + processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework + ) + image_input_dict = super().generate_dummy_inputs( + processor.feature_extractor, batch_size=batch_size, framework=framework + ) + return {**text_input_dict, **image_input_dict} + + @property + def default_onnx_opset(self) -> int: + return 14 \ No newline at end of file diff --git a/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py b/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py new file 
mode 100644 index 000000000000..5e0dbb70f5f0 --- /dev/null +++ b/src/transformers/models/imagebind/convert_imagebind_original_pytorch_to_hf.py @@ -0,0 +1,452 @@ +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import torch +# from imagebind import load + +from transformers import ( + ImageBindAudioConfig, + ImageBindConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindModel, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) + +SPATIOTEMPORAL_MODALITY_LIST = ["vision"] +IMAGELIKE_MODALITY_LIST = ["vision", "audio", "depth", "thermal"] +MODALITY_LIST = ["text", *IMAGELIKE_MODALITY_LIST, "imu"] + + +# Holds configs common to all test ImageBind encoders +IMAGEBIND_TEST_TRUNK_CONFIG = { + "hidden_size": 32, + "projection_dim": 32, + "num_hidden_layers": 5, + "num_attention_heads": 4, + "intermediate_size": 37, + "dropout": 0.0, + "layer_norm_eps": 1e-6, +} + +IMAGEBIND_TEST_TEXT_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "vocab_size": 99, + "max_position_embeddings": 512, + "logit_scale_init_value": 14.2857, + "learnable_logit_scale": True, +} + +IMAGEBIND_TEST_VISION_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": (2, 2, 2), + "stride": (2, 2, 2), + "num_channels": 3, + "num_frames": 2, + "logit_scale_init_value": None, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_AUDIO_CONFIG = { + 
**IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 16, + "stride": 10, + "num_channels": 1, + "num_mel_bins": 128, + "target_len": 204, + "add_kv_bias": True, + "drop_path_rate": 0.1, + "logit_scale_init_value": 20.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_DEPTH_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 2, + "stride": 2, + "num_channels": 1, + "add_kv_bias": True, + "logit_scale_init_value": 5.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_THERMAL_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "image_size": 30, + "patch_size": 2, + "stride": 2, + "num_channels": 1, + "add_kv_bias": True, + "logit_scale_init_value": 10.0, + "learnable_logit_scale": False, +} + +IMAGEBIND_TEST_IMU_CONFIG = { + **IMAGEBIND_TEST_TRUNK_CONFIG, + "input_shape": (6, 30), + "kernel_size": 2, + "add_kv_bias": True, + "drop_path_rate": 0.7, + "logit_scale_init_value": 5.0, + "learnable_logit_scale": False, +} + + +def get_modality_config(config, modality): + if modality == "text": + return config.text_config + elif modality == "vision": + return config.vision_config + elif modality == "audio": + return config.audio_config + elif modality == "depth": + return config.depth_config + elif modality == "thermal": + return config.thermal_config + elif modality == "imu": + return config.imu_config + else: + raise ValueError(f"Modality {modality} is not currently supported.") + + +def convert_embeddings(config, model_state_dict): + # Create position_ids buffer for text model + text_position_ids_buffer = torch.arange(config.text_config.max_position_embeddings).expand((1, -1)) + model_state_dict["text_model.embeddings.position_ids"] = text_position_ids_buffer + + # Create position_ids buffer for IMU model + imu_num_patches = config.imu_config.input_shape[1] // config.imu_config.kernel_size + imu_num_positions = imu_num_patches + 1 + imu_position_ids_buffer = torch.arange(imu_num_positions).expand((1, -1)) +
model_state_dict["imu_model.embeddings.position_ids"] = imu_position_ids_buffer + + for modality in ["text", "imu"]: + # Convert position embeddings for text and IMU modalities + pos_embed_key = f"modality_preprocessors.{modality}.pos_embed" + pos_embed = model_state_dict[pos_embed_key] + converted_pos_embed = pos_embed.squeeze() + model_state_dict[pos_embed_key] = converted_pos_embed + + for modality in IMAGELIKE_MODALITY_LIST: + # Convert position embeddings for image-like modalities + pos_embed_key = f"modality_preprocessors.{modality}.pos_embedding_helper.pos_embed" + pos_embed = model_state_dict[pos_embed_key] + converted_pos_embed = pos_embed.squeeze() + model_state_dict[pos_embed_key] = converted_pos_embed + + # Create position_ids buffer for image-like modalities + modality_config = get_modality_config(config, modality) + # Recalculate num_positions + if modality in SPATIOTEMPORAL_MODALITY_LIST: + patches_along_time_dim = modality_config.num_frames // modality_config.patch_size[0] + patches_along_spatial_dims = (modality_config.image_size // modality_config.patch_size[1]) ** 2 + num_patches = patches_along_spatial_dims * patches_along_time_dim + elif modality == "audio": + patch_size = modality_config.patch_size + stride = modality_config.stride + patches_along_mel_dim = ((modality_config.num_mel_bins - patch_size) // stride) + 1 + patches_along_frame_dim = ((modality_config.target_len - patch_size) // stride) + 1 + num_patches = patches_along_mel_dim * patches_along_frame_dim + else: + num_patches = (modality_config.image_size // modality_config.patch_size) ** 2 + num_positions = num_patches + 1 + position_ids_buffer = torch.arange(num_positions).expand((1, -1)) + model_state_dict[f"{modality}_model.embeddings.position_ids"] = position_ids_buffer + + for modality in IMAGELIKE_MODALITY_LIST + ["imu"]: + # Convert class embeddings + class_embed_key = f"modality_preprocessors.{modality}.cls_token" + class_embed = model_state_dict[class_embed_key] +
converted_class_embed = class_embed.squeeze() + model_state_dict[class_embed_key] = converted_class_embed + + +def convert_attention(config, model_state_dict): + for modality in MODALITY_LIST: + old_prefix = f"modality_trunks.{modality}.blocks" + new_prefix = f"{modality}_model.encoder.layers" + modality_config = get_modality_config(config, modality) + for i in range(modality_config.num_hidden_layers): + attn_weight_key = f"{old_prefix}.{i}.attn.in_proj_weight" + attn_bias_key = f"{old_prefix}.{i}.attn.in_proj_bias" + attn_weight = model_state_dict[attn_weight_key] + attn_bias = model_state_dict[attn_bias_key] + + # Split up the attention projections/bias into q, k, v projections/bias + q_proj, k_proj, v_proj = attn_weight.chunk(3, dim=0) + q_proj_bias, k_proj_bias, v_proj_bias = attn_bias.chunk(3, dim=0) + + model_state_dict[f"{new_prefix}.{i}.self_attn.q_proj.weight"] = q_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.q_proj.bias"] = q_proj_bias + + model_state_dict[f"{new_prefix}.{i}.self_attn.k_proj.weight"] = k_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.k_proj.bias"] = k_proj_bias + + model_state_dict[f"{new_prefix}.{i}.self_attn.v_proj.weight"] = v_proj + model_state_dict[f"{new_prefix}.{i}.self_attn.v_proj.bias"] = v_proj_bias + + +def map_preprocessor_keys(prefix="modality_preprocessors"): + mapping = {} + keys_to_remove = [] + + # Text preprocessor + mapping[f"{prefix}.text.token_embedding.weight"] = "text_model.embeddings.token_embedding.weight" + mapping[f"{prefix}.text.pos_embed"] = "text_model.embeddings.position_embedding.weight" + + # NOTE: no need to map causal attention mask buffer + keys_to_remove.append("modality_preprocessors.text.mask") + + # Image-like modalities common + for modality in IMAGELIKE_MODALITY_LIST: + mapping[f"{prefix}.{modality}.cls_token"] = f"{modality}_model.embeddings.class_embedding" + mapping[f"{prefix}.{modality}.pos_embedding_helper.pos_embed"] = f"{modality}_model.embeddings.position_embedding.weight" +
+ # Vision preprocessor specific + mapping[f"{prefix}.vision.rgbt_stem.proj.1.weight"] = "vision_model.embeddings.patch_embedding.weight" + + # Audio preprocessor specific + mapping[f"{prefix}.audio.rgbt_stem.proj.weight"] = "audio_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.audio.rgbt_stem.norm_layer.weight"] = "audio_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.audio.rgbt_stem.norm_layer.bias"] = "audio_model.embeddings.norm_layer.bias" + + # Depth preprocessor specific + mapping[f"{prefix}.depth.depth_stem.proj.weight"] = "depth_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.depth.depth_stem.norm_layer.weight"] = "depth_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.depth.depth_stem.norm_layer.bias"] = "depth_model.embeddings.norm_layer.bias" + + # Thermal preprocessor specific + mapping[f"{prefix}.thermal.rgbt_stem.proj.weight"] = "thermal_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.thermal.rgbt_stem.norm_layer.weight"] = "thermal_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.thermal.rgbt_stem.norm_layer.bias"] = "thermal_model.embeddings.norm_layer.bias" + + # IMU preprocessor + mapping[f"{prefix}.imu.cls_token"] = "imu_model.embeddings.class_embedding" + mapping[f"{prefix}.imu.pos_embed"] = "imu_model.embeddings.position_embedding.weight" + mapping[f"{prefix}.imu.imu_stem.proj.weight"] = "imu_model.embeddings.patch_embedding.weight" + mapping[f"{prefix}.imu.imu_stem.norm_layer.weight"] = "imu_model.embeddings.norm_layer.weight" + mapping[f"{prefix}.imu.imu_stem.norm_layer.bias"] = "imu_model.embeddings.norm_layer.bias" + + return mapping, keys_to_remove + + +def map_transformer_keys(config, old_prefix, new_prefix): + mapping = {} + keys_to_remove = [] + + for i in range(config.num_hidden_layers): + # NOTE: q, k, v proj/bias are added to the state dict with the correct names in convert_attention + keys_to_remove.append(f"{old_prefix}.{i}.attn.in_proj_weight") + 
+        keys_to_remove.append(f"{old_prefix}.{i}.attn.in_proj_bias")
+
+        mapping[f"{old_prefix}.{i}.attn.out_proj.weight"] = f"{new_prefix}.{i}.self_attn.out_proj.weight"
+        mapping[f"{old_prefix}.{i}.attn.out_proj.bias"] = f"{new_prefix}.{i}.self_attn.out_proj.bias"
+
+        mapping[f"{old_prefix}.{i}.norm_1.weight"] = f"{new_prefix}.{i}.layer_norm1.weight"
+        mapping[f"{old_prefix}.{i}.norm_1.bias"] = f"{new_prefix}.{i}.layer_norm1.bias"
+
+        mapping[f"{old_prefix}.{i}.mlp.fc1.weight"] = f"{new_prefix}.{i}.mlp.fc1.weight"
+        mapping[f"{old_prefix}.{i}.mlp.fc1.bias"] = f"{new_prefix}.{i}.mlp.fc1.bias"
+        mapping[f"{old_prefix}.{i}.mlp.fc2.weight"] = f"{new_prefix}.{i}.mlp.fc2.weight"
+        mapping[f"{old_prefix}.{i}.mlp.fc2.bias"] = f"{new_prefix}.{i}.mlp.fc2.bias"
+
+        mapping[f"{old_prefix}.{i}.norm_2.weight"] = f"{new_prefix}.{i}.layer_norm2.weight"
+        mapping[f"{old_prefix}.{i}.norm_2.bias"] = f"{new_prefix}.{i}.layer_norm2.bias"
+
+        if config.add_kv_bias:
+            mapping[f"{old_prefix}.{i}.attn.bias_k"] = f"{new_prefix}.{i}.self_attn.k_bias"
+            mapping[f"{old_prefix}.{i}.attn.bias_v"] = f"{new_prefix}.{i}.self_attn.v_bias"
+
+    return mapping, keys_to_remove
+
+
+def get_encoder_key_mapping(config, prefix="modality_trunks"):
+    mapping = {}
+    keys_to_remove = []
+
+    # 1. Handle any pre-transformer layers, if available.
+
+    # Vision specific
+    mapping["modality_trunks.vision.pre_transformer_layer.0.weight"] = "vision_model.pre_layernorm.weight"
+    mapping["modality_trunks.vision.pre_transformer_layer.0.bias"] = "vision_model.pre_layernorm.bias"
+
+    # 2. Map transformer trunk keys
+    for modality in MODALITY_LIST:
+        old_prefix = f"{prefix}.{modality}.blocks"
+        new_prefix = f"{modality}_model.encoder.layers"
+        modality_config = get_modality_config(config, modality)
+        transformer_mapping, transformer_keys_to_remove = map_transformer_keys(modality_config, old_prefix, new_prefix)
+        mapping.update(transformer_mapping)
+        keys_to_remove.extend(transformer_keys_to_remove)
+
+    return mapping, keys_to_remove
+
+
+def map_transformer_head_keys(prefix="modality_heads"):
+    mapping = {}
+    keys_to_remove = []
+
+    # Text final layer norm
+    mapping[f"{prefix}.text.proj.0.weight"] = "text_model.final_layer_norm.weight"
+    mapping[f"{prefix}.text.proj.0.bias"] = "text_model.final_layer_norm.bias"
+
+    for modality in IMAGELIKE_MODALITY_LIST + ["imu"]:
+        mapping[f"{prefix}.{modality}.0.weight"] = f"{modality}_model.post_layernorm.weight"
+        mapping[f"{prefix}.{modality}.0.bias"] = f"{modality}_model.post_layernorm.bias"
+
+    # Modality heads
+    mapping[f"{prefix}.text.proj.1.weight"] = "text_projection.weight"
+    for modality in IMAGELIKE_MODALITY_LIST:
+        if modality == "vision":
+            mapping[f"{prefix}.{modality}.2.weight"] = "visual_projection.weight"
+        else:
+            mapping[f"{prefix}.{modality}.2.weight"] = f"{modality}_projection.weight"
+    mapping[f"{prefix}.imu.3.weight"] = "imu_projection.weight"
+
+    return mapping, keys_to_remove
+
+
+def map_postprocessor_keys(prefix="modality_postprocessors"):
+    mapping = {}
+    keys_to_remove = []
+
+    for modality in ["text", "audio", "depth", "thermal", "imu"]:
+        mapping[f"{prefix}.{modality}.1.log_logit_scale"] = f"{modality}_postprocessor.log_logit_scale"
+
+    return mapping, keys_to_remove
+
+
+def get_key_mapping(config):
+    mapping = {}
+    keys_to_remove = []
+
+    # 1. Map preprocessor keys
+    preprocessor_mapping, preprocessor_keys_to_remove = map_preprocessor_keys(prefix="modality_preprocessors")
+    mapping.update(preprocessor_mapping)
+    keys_to_remove.extend(preprocessor_keys_to_remove)
+
+    # 2. Map transformer keys
+    encoder_mapping, encoder_keys_to_remove = get_encoder_key_mapping(config, prefix="modality_trunks")
+    mapping.update(encoder_mapping)
+    keys_to_remove.extend(encoder_keys_to_remove)
+
+    # 3. Map transformer head keys
+    head_mapping, head_keys_to_remove = map_transformer_head_keys(prefix="modality_heads")
+    mapping.update(head_mapping)
+    keys_to_remove.extend(head_keys_to_remove)
+
+    # 4. Map postprocessor keys
+    postprocessor_mapping, postprocessor_keys_to_remove = map_postprocessor_keys(prefix="modality_postprocessors")
+    mapping.update(postprocessor_mapping)
+    keys_to_remove.extend(postprocessor_keys_to_remove)
+
+    return mapping, keys_to_remove
+
+
+def rename_state_dict(state_dict, keys_to_modify, keys_to_remove):
+    model_state_dict = {}
+    for key, value in state_dict.items():
+        if key in keys_to_remove:
+            continue
+
+        if key in keys_to_modify:
+            new_key = keys_to_modify[key]
+            model_state_dict[new_key] = value
+        else:
+            model_state_dict[key] = value
+    return model_state_dict
+
+
+def convert_imagebind_checkpoint(
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    config_path=None,
+    repo_id=None,
+    use_test_config=False,
+    safe_serialization=False,
+):
+    """
+    Copy/paste/tweak model's weights to transformers design.
+    """
+    if config_path is not None:
+        config = ImageBindConfig.from_pretrained(config_path)
+    elif use_test_config:
+        config = ImageBindConfig(
+            text_config=IMAGEBIND_TEST_TEXT_CONFIG,
+            vision_config=IMAGEBIND_TEST_VISION_CONFIG,
+            audio_config=IMAGEBIND_TEST_AUDIO_CONFIG,
+            depth_config=IMAGEBIND_TEST_DEPTH_CONFIG,
+            thermal_config=IMAGEBIND_TEST_THERMAL_CONFIG,
+            imu_config=IMAGEBIND_TEST_IMU_CONFIG,
+            projection_dim=32,
+        )
+    else:
+        # The default config corresponds to the original ImageBind model.
+        config = ImageBindConfig()
+
+    hf_model = ImageBindModel(config)
+
+    # Original ImageBind checkpoint is a PyTorch state dict
+    model_state_dict = torch.load(checkpoint_path, map_location="cpu")
+
+    # Fix embedding shapes
+    convert_embeddings(config, model_state_dict)
+    # Convert attention parameters to transformers
+    convert_attention(config, model_state_dict)
+
+    keys_to_modify, keys_to_remove = get_key_mapping(config)
+    keys_to_remove = set(keys_to_remove)
+    hf_state_dict = rename_state_dict(model_state_dict, keys_to_modify, keys_to_remove)
+
+    hf_model.load_state_dict(hf_state_dict)
+
+    hf_model.save_pretrained(pytorch_dump_folder_path, safe_serialization=safe_serialization)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        hf_model.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to ImageBind checkpoint")
+    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
+    parser.add_argument("--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub.")
+    parser.add_argument("--test", action="store_true", help="Whether to use the test config for ImageBind models.")
+    parser.add_argument("--safe_serialization", action="store_true", help="Whether to save the model using `safetensors`.")
+
+    args = parser.parse_args()
+
+    convert_imagebind_checkpoint(
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.config_path,
+        args.push_to_hub,
+        args.test,
+        args.safe_serialization,
+    )
diff --git a/src/transformers/models/imagebind/feature_extraction_imagebind.py b/src/transformers/models/imagebind/feature_extraction_imagebind.py
new file mode 100644
index 000000000000..02b23aab046e
--- /dev/null
+++ b/src/transformers/models/imagebind/feature_extraction_imagebind.py
@@ -0,0 +1,470 @@
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Feature extractor class for ImageBind."""
+
+
+import warnings
+from typing import List, Optional, Union
+
+import numpy as np
+import torch
+import torchaudio.compliance.kaldi as ta_kaldi
+
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ...feature_extraction_utils import BatchFeature
+from ...utils import PaddingStrategy, TensorType, logging
+from .image_processing_imagebind import ImageBindImageProcessor
+
+
+logger = logging.get_logger(__name__)
+
+
+def valid_batched_clipped_audio(raw_speech):
+    """
+    Determines whether raw mono-channel audio input (or any other 1D data) is batched and clipped. The following
+    conditions will be recognized as valid audio:
+
+    - unbatched: `List[float]`, `np.ndarray` (`ndim=1`)
+    - batched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+    - batched and clipped: `List[List[List[float]]]`, `List[List[np.ndarray]]` (`ndim=1`), `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+    """
+    valid_audio = False
+    if isinstance(raw_speech, np.ndarray) and (1 <= len(raw_speech.shape) <= 3):
+        # unbatched, batched, or batched and clipped np.ndarray
+        valid_audio = True
+    elif isinstance(raw_speech, (list, tuple)):
+        if isinstance(raw_speech[0], np.ndarray) and (1 <= len(raw_speech[0].shape) <= 2):
+            # batched or batched and clipped List[np.ndarray]
+            valid_audio = True
+        elif isinstance(raw_speech[0], float):
+            # unbatched List[float]
+            valid_audio = True
+        elif isinstance(raw_speech[0], (list, tuple)):
+            if isinstance(raw_speech[0][0], np.ndarray) and (len(raw_speech[0][0].shape) == 1):
+                # batched and clipped List[List[np.ndarray]]
+                valid_audio = True
+            elif isinstance(raw_speech[0][0], (float, list, tuple)):
+                # batched List[List[float]], batched and clipped List[List[List[float]]]
+                valid_audio = True
+    return valid_audio
+
+
+def batch_and_clip_ndarray(array, data_dim=1, dtype=np.float32):
+    """
+    Turns a possibly nested list of np.ndarrays into a batched and clipped output of type `List[List[np.ndarray]]`.
+    """
+    if isinstance(array, (list, tuple)) and isinstance(array[0], (list, tuple)) and isinstance(array[0][0], np.ndarray):
+        if array[0][0].ndim == data_dim:
+            return [[base_array.astype(dtype=dtype) for base_array in clip] for clip in array]
+        else:
+            raise ValueError(
+                f"For `List[List[np.ndarray]]` inputs the internal `np.ndarray`s are expected to have dimension"
+                f" {data_dim} but got dimension {array[0][0].ndim}"
+            )
+    elif isinstance(array, (list, tuple)) and isinstance(array[0], np.ndarray):
+        if array[0].ndim == data_dim + 1:
+            return [[np.asarray(base_array, dtype=dtype) for base_array in clip] for clip in array]
+        elif array[0].ndim == data_dim:
+            return [[base_array.astype(dtype=dtype) for base_array in array]]
+        else:
+            raise ValueError(
+                f"For `List[np.ndarray]` inputs the internal `np.ndarray`s are expected to have dimension"
+                f" {data_dim} or {data_dim + 1} but got dimension {array[0].ndim}"
+            )
+    elif isinstance(array, np.ndarray):
+        if array.ndim == data_dim + 2:
+            return [[np.asarray(raw_input, dtype=dtype) for raw_input in clip] for clip in array]
+        elif array.ndim == data_dim + 1:
+            return [[np.asarray(raw_input, dtype=dtype) for raw_input in array]]
+        elif array.ndim == data_dim:
+            return [[array.astype(dtype=dtype)]]
+        else:
+            raise ValueError(
+                f"`np.ndarray` inputs are expected to have dimension in"
+                f" `[{data_dim}, {data_dim + 1}, {data_dim + 2}]` but instead got {array.ndim}"
+            )
+    else:
+        raise ValueError(f"Could not make batched and clipped audio from {array}")
+
+
+class ImageBindFeatureExtractor(ImageBindImageProcessor):
+    def __init__(self, *args, **kwargs) -> None:
+        warnings.warn(
+            "The class ImageBindFeatureExtractor is deprecated and will be removed in version 5 of Transformers."
+            " Please use ImageBindImageProcessor instead.",
+            FutureWarning,
+        )
+        super().__init__(*args, **kwargs)
+
+
+# NOTE: ImageBind follows the Audio Spectrogram Transformer for audio processing
+# Based on ASTFeatureExtractor
+class ImageBindAudioFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs an ImageBind audio feature extractor, based on the Audio Spectrogram Transformer (AST) feature
+    extractor.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    This class extracts mel-filter bank features from raw speech using TorchAudio, pads/truncates them to a fixed
+    length and normalizes them using a mean and standard deviation.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 1):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 16000):
+            The sampling rate at which the audio files should be digitalized, expressed in hertz (Hz).
+        num_mel_bins (`int`, *optional*, defaults to 128):
+            Number of Mel-frequency bins.
+        max_length (`int`, *optional*, defaults to 204):
+            Maximum length to which to pad/truncate the extracted features.
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value to pad with when applying the padding strategy defined by the `padding` argument to
+            [`ImageBindAudioFeatureExtractor.__call__`].
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether or not to normalize the log-Mel features using `mean` and `std`.
+        mean (`float`, *optional*, defaults to -4.268):
+            The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.
+        std (`float`, *optional*, defaults to 9.138):
+            The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation
+            by default.
+        return_attention_mask (`bool`, *optional*, defaults to `False`):
+            Whether or not [`~ImageBindAudioFeatureExtractor.__call__`] should return `attention_mask`.
+    """
+
+    model_input_names = ["input_features", "attention_mask"]
+
+    def __init__(
+        self,
+        feature_size=1,
+        sampling_rate=16000,
+        num_mel_bins=128,
+        max_length=204,
+        padding_value=0.0,
+        do_normalize=True,
+        mean=-4.268,
+        std=9.138,
+        return_attention_mask=False,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+        self.num_mel_bins = num_mel_bins
+        self.max_length = max_length
+        self.do_normalize = do_normalize
+        self.mean = mean
+        self.std = std
+        self.return_attention_mask = return_attention_mask
+
+    def _extract_fbank_features(
+        self,
+        waveform: np.ndarray,
+        max_length: int,
+    ) -> np.ndarray:
+        """
+        Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
+        and hence the waveform should not be normalized before feature extraction.
+        """
+        # waveform = waveform * (2**15)  # Kaldi compliance: 16-bit signed integers
+        # Mean center the waveform
+        waveform -= waveform.mean()
+        waveform = torch.from_numpy(waveform).unsqueeze(0)
+        fbank = ta_kaldi.fbank(
+            waveform,
+            htk_compat=True,
+            sample_frequency=self.sampling_rate,
+            use_energy=False,
+            window_type="hanning",
+            num_mel_bins=self.num_mel_bins,
+            dither=0.0,
+            frame_shift=10,
+        )
+
+        n_frames = fbank.shape[0]
+        difference = max_length - n_frames
+
+        # pad or truncate, depending on difference
+        if difference > 0:
+            pad_module = torch.nn.ZeroPad2d((0, 0, 0, difference))
+            fbank = pad_module(fbank)
+        elif difference < 0:
+            fbank = fbank[0:max_length, :]
+
+        fbank = fbank.numpy()
+
+        return fbank
+
+    def normalize(self, input_values: np.ndarray) -> np.ndarray:
+        return (input_values - self.mean) / (self.std * 2)
+
+    def __call__(
+        self,
+        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]], List[List[List[float]]]],
+        sampling_rate: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of numpy
+                arrays or a (possibly nested) list of float values. The supported input types are as follows:
+
+                - unbatched: `List[float]`, `np.ndarray` (`ndim=1`)
+                - batched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+                - batched with clips: `List[List[List[float]]]`, `List[List[np.ndarray]]` (`ndim=1`), `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+
+                The input will always be interpreted as mono channel audio, not stereo, i.e. a single float per timestep.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of lists of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        if not valid_batched_clipped_audio(raw_speech):
+            raise ValueError(
+                f"Only unbatched, batched, and batched and clipped mono-channel audio is supported for input to {self}"
+            )
+
+        # Handle the cases where there are no np.ndarrays in raw_speech
+        if isinstance(raw_speech, (list, tuple)) and isinstance(raw_speech[0], float):
+            raw_speech = [[np.asarray(raw_speech, dtype=np.float32)]]
+        elif isinstance(raw_speech, (list, tuple)) and isinstance(raw_speech[0], (list, tuple)):
+            if isinstance(raw_speech[0][0], float):
+                # List[List[float]]
+                raw_speech = [[np.asarray(audio, dtype=np.float32) for audio in raw_speech]]
+            elif isinstance(raw_speech[0][0], (list, tuple)):
+                # List[List[List[float]]]
+                raw_speech = [[np.asarray(audio, dtype=np.float32) for audio in clip] for clip in raw_speech]
+
+        # always return batched and clipped audio of type `List[List[np.ndarray]]`
+        raw_speech = batch_and_clip_ndarray(raw_speech, data_dim=1, dtype=np.float32)
+
+        # extract fbank features and pad/truncate to max_length
+        features = [
+            [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in clip]
+            for clip in raw_speech
+        ]
+
+        # convert into BatchFeature
+        padded_inputs = BatchFeature({"input_features": features})
+
+        # make sure spectrograms are in array format
+        input_values = padded_inputs.get("input_features")
+        if isinstance(input_values[0][0], list):
+            padded_inputs["input_features"] = [
+                [np.asarray(feature, dtype=np.float32) for feature in clip] for clip in input_values
+            ]
+
+        # normalization
+        if self.do_normalize:
+            padded_inputs["input_features"] = [
+                [self.normalize(feature) for feature in clip] for clip in padded_inputs["input_features"]
+            ]
+
+        if return_tensors is not None:
+            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
+
+        return padded_inputs
+
+
+class ImageBindImuFeatureExtractor(SequenceFeatureExtractor):
+    """
+    Constructs an ImageBind IMU feature extractor.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    This class takes in raw IMU time series data, converts it to a standard sampling rate, and pads/truncates it to a
+    fixed length.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 6):
+            The feature dimension of the extracted features.
+        sampling_rate (`int`, *optional*, defaults to 200):
+            The sampling rate at which the IMU data should be digitalized, expressed in hertz (Hz).
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value to pad with when applying the padding strategy defined by the `padding` argument to
+            [`ImageBindImuFeatureExtractor.__call__`].
+        imu_len_in_s (`float`, *optional*, defaults to 10):
+            Length in seconds to which the extracted features will be padded/truncated.
+        return_attention_mask (`bool`, *optional*, defaults to `False`):
+            Whether or not [`~ImageBindImuFeatureExtractor.__call__`] should return `attention_mask`.
+    """
+
+    model_input_names = ["input_features", "attention_mask"]
+
+    def __init__(
+        self,
+        feature_size=6,
+        sampling_rate=200,
+        padding_value=0.0,
+        imu_len_in_s=10,
+        return_attention_mask=False,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+
+        self.imu_len_in_s = imu_len_in_s
+        self.return_attention_mask = return_attention_mask
+
+    def __call__(
+        self,
+        raw_imu: Union[np.ndarray, List[np.ndarray], List[List[float]], List[List[List[float]]]],
+        sampling_rate: Optional[int] = None,
+        padding: Union[bool, str, PaddingStrategy] = "max_length",
+        max_length: Optional[int] = None,
+        truncation: bool = True,
+        pad_to_multiple_of: Optional[int] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ):
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_imu (`np.ndarray`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
+                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of numpy
+                arrays or a (possibly nested) list of float values. The supported input types are as follows:
+
+                - unbatched: `List[List[float]]`, `List[np.ndarray]` (`ndim=1`), `np.ndarray` (`ndim=2`)
+                - batched: `List[List[List[float]]]`, `List[np.ndarray]` (`ndim=2`), `np.ndarray` (`ndim=3`)
+
+                The input will always be interpreted as a multiple-channel time series signal.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `raw_imu` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `'max_length'`):
+                Select a strategy to pad the input `raw_imu` sequences (according to the model's padding side and
+                padding index) among:
+
+                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+                  sequence is provided).
+                - `'max_length'` (default): Pad to a maximum length specified with the argument `max_length` or to the
+                  maximum acceptable input length for the model if that argument is not provided.
+                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
+                  lengths).
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see above).
+            truncation (`bool`, *optional*, defaults to `True`):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            pad_to_multiple_of (`int`, *optional*):
+                If set will pad the sequence to a multiple of the provided value.
+
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
+            return_attention_mask (`bool`, *optional*):
+                Whether to return the attention mask. If left to the default, will return the attention mask according
+                to the specific feature_extractor's default.
+
+                [What are attention masks?](../glossary#attention-mask)
+
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of lists of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+        """
+
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided `raw_imu` input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], float):
+            raise ValueError(
+                "The expected IMU input is a multichannel (rather than single channel) time series, so `List[float]`"
+                " inputs are not accepted."
+            )
+
+        # Handle nested list inputs
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], (list, tuple)):
+            if isinstance(raw_imu[0][0], float):
+                # List[List[float]] -> unbatched IMU input
+                raw_imu = [np.asarray(raw_imu, dtype=np.float32)]
+            elif isinstance(raw_imu[0][0], (list, tuple)):
+                # List[List[List[float]]] -> batched IMU input
+                raw_imu = [np.asarray(imu, dtype=np.float32) for imu in raw_imu]
+
+        # Handle inputs with ndarrays
+        if isinstance(raw_imu, (list, tuple)) and isinstance(raw_imu[0], np.ndarray):
+            if raw_imu[0].ndim == 1:
+                # Unbatched IMU input
+                raw_imu = [np.asarray(raw_imu, dtype=np.float32)]
+            elif raw_imu[0].ndim != 2:
+                raise ValueError(
+                    f"For `List[np.ndarray]` inputs expected the internal arrays to have dim 1 or 2, but got"
+                    f" {raw_imu[0].ndim}"
+                )
+
+        if isinstance(raw_imu, np.ndarray):
+            if raw_imu.ndim == 2:
+                # Unbatched IMU input
+                raw_imu = [raw_imu.astype(np.float32)]
+            elif raw_imu.ndim == 3:
+                # Batched IMU input
+                raw_imu = [np.asarray(imu, dtype=np.float32) for imu in raw_imu]
+            else:
+                raise ValueError(
+                    f"For `np.ndarray` inputs expected the array to have dim 2 or 3, but got {raw_imu.ndim}"
+                )
+
+        # raw_imu should be of form `List[np.ndarray]` where raw_imu[0].ndim == 2
+        # convert into BatchFeature
+        batched_imu = BatchFeature({"input_features": raw_imu})
+
+        # Pad/truncate batched features. `imu_len_in_s` is in seconds, so convert it to a number of timesteps.
+        padded_inputs = self.pad(
+            batched_imu,
+            padding=padding,
+            max_length=max_length if max_length is not None else int(self.imu_len_in_s * self.sampling_rate),
+            truncation=truncation,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+
+        # Convert attention_mask to correct format
+        attention_mask = padded_inputs.get("attention_mask")
+        if attention_mask is not None:
+            padded_inputs["attention_mask"] = [np.asarray(array, dtype=np.int32) for array in attention_mask]
+
+        # Convert tensors if desired
+        if return_tensors is not None:
+            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
+
+        return padded_inputs
diff --git a/src/transformers/models/imagebind/image_processing_imagebind.py b/src/transformers/models/imagebind/image_processing_imagebind.py
new file mode 100644
index 000000000000..a9d7ddc638b7
--- /dev/null
+++ b/src/transformers/models/imagebind/image_processing_imagebind.py
@@ -0,0 +1,1058 @@
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for ImageBind."""
+
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import (
+    center_crop,
+    convert_to_rgb,
+    get_resize_output_image_size,
+    normalize,
+    rescale,
+    resize,
+    to_channel_dimension_format,
+)
+from ...image_utils import (
+    OPENAI_CLIP_MEAN,
+    OPENAI_CLIP_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    infer_channel_dimension_format,
+    is_scaled_image,
+    is_valid_image,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+)
+from ...utils import TensorType, is_vision_available, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+if is_vision_available():
+    import PIL
+
+
+# Copied from transformers.models.videomae.image_processing_videomae.make_batched
+def make_batched(videos) -> List[List[ImageInput]]:
+    if isinstance(videos, (list, tuple)) and isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0]):
+        return videos
+
+    elif isinstance(videos, (list, tuple)) and is_valid_image(videos[0]):
+        return [videos]
+
+    elif is_valid_image(videos):
+        return [[videos]]
+
+    raise ValueError(f"Could not make batched video from {videos}")
+
+
+class ImageBindImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs an ImageBind image processor.
+
+    Args:
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
+            `do_resize` in the `preprocess` method.
+        size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`):
+            Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
+            the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
+            method.
+        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
+        do_center_crop (`bool`, *optional*, defaults to `True`):
+            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
+            `preprocess` method.
+        crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
+            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
+            method.
+        do_rescale (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
+            the `preprocess` method.
+        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
+            method.
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
+        image_mean (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
+            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+        image_std (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
+            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_rgb (`bool`, *optional*, defaults to `True`):
+            Whether to convert the image to RGB. Can be overridden by the `do_convert_rgb` parameter in the
+            `preprocess` method.
+    """
+
+    model_input_names = ["pixel_values"]
+
+    def __init__(
+        self,
+        do_resize: bool = True,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        do_center_crop: bool = True,
+        crop_size: Dict[str, int] = None,
+        do_rescale: bool = True,
+        rescale_factor: Union[int, float] = 1 / 255,
+        do_normalize: bool = True,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        do_convert_rgb: bool = True,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"shortest_edge": 224}
+        size = get_size_dict(size, default_to_square=False)
+        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
+        crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
+
+        self.do_resize = do_resize
+        self.size = size
+        self.resample = resample
+        self.do_center_crop = do_center_crop
+        self.crop_size = crop_size
+        self.do_rescale = do_rescale
+        self.rescale_factor = rescale_factor
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
+        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
+        self.do_convert_rgb = do_convert_rgb
+
+    def resize(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
+        resized to keep the input aspect ratio.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Size of the output image.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+                Resampling filter to use when resizing the image.
+ data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. 
+ """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. image = (image - image_mean) / image_std. + + Args: + image (`np.ndarray`): + Image to normalize. + image_mean (`float` or `List[float]`): + Image mean. + image_std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess_single_image( + self, + image: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: Dict[str, int] = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + ) -> np.ndarray: + """ + Process a single image. + """ + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # All transformations expect numpy arrays. 
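The pixel math applied by the transformations in this pipeline (rescale by `1/255`, then per-channel normalize with the OpenAI CLIP statistics) can be sanity-checked with a minimal standalone NumPy sketch; this is an illustration only, not the processor's actual `resize`/`center_crop` helpers:

```python
import numpy as np

# Toy (height, width, channels) uint8 image with constant pixel value 128
image = np.full((4, 4, 3), 128, dtype=np.uint8)

# Rescale: map [0, 255] -> [0, 1], as when do_rescale=True with rescale_factor=1/255
rescaled = image * (1 / 255)

# Normalize per channel: (x - mean) / std, using the OpenAI CLIP statistics
mean = np.array([0.48145466, 0.4578275, 0.40821073])
std = np.array([0.26862954, 0.26130258, 0.27577711])
normalized = (rescaled - mean) / std

print(normalized.shape)
```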
+ image = to_numpy_array(image) + + if is_scaled_image(image) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + input_data_format = infer_channel_dimension_format(image) + + if do_resize: + image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + + if do_center_crop: + image = self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) + + if do_rescale: + image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + + if do_normalize: + image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + + image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) + return image + + def preprocess( + self, + images: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: int = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> PIL.Image.Image: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. 
Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. + resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. + crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. 
+ data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. + """ + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + if not valid_images(images): + raise ValueError( + "Invalid image type. 
Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." + ) + + # Batch images into clips of video frames + videos = make_batched(images) + + videos = [ + [ + self.preprocess_single_image( + image=img, + do_resize=do_resize, + size=size, + resample=resample, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + data_format=data_format, + input_data_format=input_data_format, + ) + for img in clip + ] + for clip in videos + ] + + data = {"pixel_values": videos} + return BatchFeature(data=data, tensor_type=return_tensors) + + +class ImageBindDepthImageProcessor(BaseImageProcessor): + r""" + Constructs an ImageBind depth image processor. + + Args: + do_depth_norm (`bool`, *optional*, defaults to `True`): + Whether to perform depth normalization (following Omnivore). Can be overridden by `do_depth_norm` in the + `preprocess` method. + max_depth (`float`, *optional*, defaults to 75.0): + The max depth value, which will be used to scale the depth values by dividing them by `max_depth`. Can be + overridden by `max_depth` in the `preprocess` method. + min_depth (`float`, *optional*, defaults to 0.0): + The min depth value to clamp to. This is typically used to prevent negative depth values, which correspond + to far-away distances. Can be overridden by `min_depth` in the `preprocess` method. + clamp_max_before_scale (`bool`, *optional*, defaults to `True`): + Whether to clamp the depth values to `max_depth` before scaling by `max_depth`. If `True`, this will ensure + that the max depth value is 1. Can be overridden by `clamp_max_before_scale` in the `preprocess` method. + do_resize (`bool`, *optional*, defaults to `True`): + Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by + `do_resize` in the `preprocess` method. 
+ size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`): + Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` + method. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. + do_center_crop (`bool`, *optional*, defaults to `True`): + Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the + `preprocess` method. + crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`): + Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` + method. + do_rescale (`bool`, *optional*, defaults to `True`): + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in + the `preprocess` method. + rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): + Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` + method. + do_normalize (`bool`, *optional*, defaults to `True`): + Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. + image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`): + Mean to use if normalizing the image. This is a float or list of floats the length of the number of + channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. + image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`): + Standard deviation to use if normalizing the image. This is a float or list of floats the length of the + number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. Can be overridden by `do_convert_rgb` in the `preprocess` method.
+ """ + + model_input_names = ["pixel_values"] + + def __init__( + self, + do_depth_norm: bool = True, + max_depth: float = 75.0, + min_depth: float = 0.0, + clamp_max_before_scale: bool = True, + do_resize: bool = True, + size: Dict[str, int] = None, + resample: PILImageResampling = PILImageResampling.BICUBIC, + do_center_crop: bool = True, + crop_size: Dict[str, int] = None, + do_rescale: bool = True, + rescale_factor: Union[int, float] = 1 / 255, + do_normalize: bool = True, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = True, + **kwargs, + ) -> None: + super().__init__(**kwargs) + size = size if size is not None else {"shortest_edge": 224} + size = get_size_dict(size, default_to_square=False) + crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224} + crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") + + self.do_depth_norm = do_depth_norm + self.max_depth = max_depth + self.min_depth = min_depth + self.clamp_max_before_scale = clamp_max_before_scale + self.do_resize = do_resize + self.size = size + self.resample = resample + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_rescale = do_rescale + self.rescale_factor = rescale_factor + self.do_normalize = do_normalize + self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN + self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD + self.do_convert_rgb = do_convert_rgb + + def depth_norm( + self, + image: np.ndarray, + max_depth: float, + min_depth: float = 0.0, + clamp_max_before_scale: bool = True, + data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): 
+ """ + Normalize the depth channel. This will apply to the single channel of a depth input. + + Args: + image (`np.ndarray`): + Single channel depth image to normalize. + max_depth (`float`, *optional*, defaults to 75.0): + The max depth value for the data. + min_depth (`float`, *optional*, defaults to 0.0): + The minimum value to clamp the depth values to. This is done to prevent negative depth values, which + correspond to far away distances. + clamp_max_before_scale (`bool`, *optional*, defaults to `True`): + Whether to clamp the depth values to `max_depth` before scaling them by dividing by `max_depth`. + """ + # Clamp depth values to min_depth to prevent negative depths + image = np.clip(image, a_min=min_depth, a_max=None) + + if clamp_max_before_scale: + image = np.clip(image, a_min=None, a_max=max_depth) + + image = image / max_depth + + if data_format is not None: + image = to_channel_dimension_format(image, data_format, input_data_format) + return image + + def resize( + self, + image: np.ndarray, + size: Dict[str, int], + resample: PILImageResampling = PILImageResampling.BICUBIC, + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge + resized to keep the input aspect ratio. + + Args: + image (`np.ndarray`): + Image to resize. + size (`Dict[str, int]`): + Size of the output image. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use when resizing the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. 
Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. 
image = (image - image_mean) / image_std. + + Args: + image (`np.ndarray`): + Image to normalize. + image_mean (`float` or `List[float]`): + Image mean. + image_std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess( + self, + images: ImageInput, + do_depth_norm: bool = None, + max_depth: float = None, + min_depth: float = None, + clamp_max_before_scale: bool = None, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: int = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + **kwargs, + ) -> PIL.Image.Image: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_depth_norm (`bool`, *optional*, defaults to `self.do_depth_norm`): + Whether to normalize the depth channel by clamping and dividing by `max_depth`. + max_depth (`float`, *optional*, defaults to `self.max_depth`): + The max depth value used to scale the depth values. + min_depth (`float`, *optional*, defaults to `self.min_depth`): + The min depth value to clamp the depth values to. + clamp_max_before_scale (`bool`, *optional*, defaults to `self.clamp_max_before_scale`): + Whether to clamp the depth values to `max_depth` before scaling. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. + resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. 
+ crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. + data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. 
+ """ + do_depth_norm = do_depth_norm if do_depth_norm is not None else self.do_depth_norm + max_depth = max_depth if max_depth is not None else self.max_depth + min_depth = min_depth if min_depth is not None else self.min_depth + clamp_max_before_scale = clamp_max_before_scale if clamp_max_before_scale is not None else self.clamp_max_before_scale + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + images = make_list_of_images(images) + + if not valid_images(images): + raise ValueError( + "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." 
+ ) + + if do_depth_norm and max_depth is None: + raise ValueError("Max depth must be specified if do_depth_norm is True.") + + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # PIL RGBA images are converted to RGB + if do_convert_rgb: + images = [convert_to_rgb(image) for image in images] + + # All transformations expect numpy arrays. + images = [to_numpy_array(image) for image in images] + + if do_depth_norm: + images = [self.depth_norm(image=image, max_depth=max_depth, min_depth=min_depth, clamp_max_before_scale=clamp_max_before_scale) for image in images] + + if do_resize: + images = [self.resize(image=image, size=size, resample=resample) for image in images] + + if do_center_crop: + images = [self.center_crop(image=image, size=crop_size) for image in images] + + if do_rescale: + images = [self.rescale(image=image, scale=rescale_factor) for image in images] + + if do_normalize: + images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] + + images = [to_channel_dimension_format(image, data_format) for image in images] + + data = {"pixel_values": images} + return BatchFeature(data=data, tensor_type=return_tensors) + + +# NOTE: currently based on autogenerated ImageBindImageProcessor +class ImageBindThermalImageProcessor(BaseImageProcessor): + r""" + Constructs an ImageBind thermal image processor. + + Args: + do_resize (`bool`, *optional*, defaults to `True`): + Whether to resize the image's (height, width) dimensions to the specified `size`. 
Can be overridden by + `do_resize` in the `preprocess` method. + size (`Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}`): + Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` + method. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. + do_center_crop (`bool`, *optional*, defaults to `True`): + Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the + `preprocess` method. + crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`): + Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` + method. + do_rescale (`bool`, *optional*, defaults to `True`): + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in + the `preprocess` method. + rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): + Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` + method. + do_normalize (`bool`, *optional*, defaults to `True`): + Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. + image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`): + Mean to use if normalizing the image. This is a float or list of floats the length of the number of + channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. + image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`): + Standard deviation to use if normalizing the image. This is a float or list of floats the length of the + number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. Can be overridden by `do_convert_rgb` in the `preprocess` method.
+ """ + + model_input_names = ["pixel_values"] + + def __init__( + self, + do_resize: bool = True, + size: Dict[str, int] = None, + resample: PILImageResampling = PILImageResampling.BICUBIC, + do_center_crop: bool = True, + crop_size: Dict[str, int] = None, + do_rescale: bool = True, + rescale_factor: Union[int, float] = 1 / 255, + do_normalize: bool = True, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = True, + **kwargs, + ) -> None: + super().__init__(**kwargs) + size = size if size is not None else {"shortest_edge": 224} + size = get_size_dict(size, default_to_square=False) + crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224} + crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size") + + self.do_resize = do_resize + self.size = size + self.resample = resample + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_rescale = do_rescale + self.rescale_factor = rescale_factor + self.do_normalize = do_normalize + self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN + self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD + self.do_convert_rgb = do_convert_rgb + + def resize( + self, + image: np.ndarray, + size: Dict[str, int], + resample: PILImageResampling = PILImageResampling.BICUBIC, + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge + resized to keep the input aspect ratio. + + Args: + image (`np.ndarray`): + Image to resize. + size (`Dict[str, int]`): + Size of the output image. 
+ resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + Resampling filter to use when resizing the image. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size, default_to_square=False) + if "shortest_edge" not in size: + raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") + output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) + return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) + + def center_crop( + self, + image: np.ndarray, + size: Dict[str, int], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the + returned result will always be of size `size`). + + Args: + image (`np.ndarray`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form of a dictionary with keys `height` and `width`. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") + return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) + + def rescale( + self, + image: np.ndarray, + scale: Union[int, float], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ): + """ + Rescale an image by a scale factor. image = image * scale. + + Args: + image (`np.ndarray`): + Image to rescale. + scale (`int` or `float`): + Scale to apply to the image. 
+ data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return rescale(image, scale=scale, data_format=data_format, **kwargs) + + def normalize( + self, + image: np.ndarray, + mean: Union[float, List[float]], + std: Union[float, List[float]], + data_format: Optional[Union[str, ChannelDimension]] = None, + **kwargs, + ) -> np.ndarray: + """ + Normalize an image. image = (image - mean) / std. + + Args: + image (`np.ndarray`): + Image to normalize. + mean (`float` or `List[float]`): + Image mean. + std (`float` or `List[float]`): + Image standard deviation. + data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the image. If not provided, it will be the same as the input image. + """ + return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + + def preprocess( + self, + images: ImageInput, + do_resize: bool = None, + size: Dict[str, int] = None, + resample: PILImageResampling = None, + do_center_crop: bool = None, + crop_size: Dict[str, int] = None, + do_rescale: bool = None, + rescale_factor: float = None, + do_normalize: bool = None, + image_mean: Optional[Union[float, List[float]]] = None, + image_std: Optional[Union[float, List[float]]] = None, + do_convert_rgb: bool = None, + return_tensors: Optional[Union[str, TensorType]] = None, + data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + **kwargs, + ) -> BatchFeature: + """ + Preprocess an image or batch of images. + + Args: + images (`ImageInput`): + Image to preprocess. + do_resize (`bool`, *optional*, defaults to `self.do_resize`): + Whether to resize the image. + size (`Dict[str, int]`, *optional*, defaults to `self.size`): + Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with + the longest edge resized to keep the input aspect ratio.
+ resample (`int`, *optional*, defaults to `self.resample`): + Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only + has an effect if `do_resize` is set to `True`. + do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): + Whether to center crop the image. + crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the center crop. Only has an effect if `do_center_crop` is set to `True`. + do_rescale (`bool`, *optional*, defaults to `self.do_rescale`): + Whether to rescale the image. + rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`): + Rescale factor to rescale the image by if `do_rescale` is set to `True`. + do_normalize (`bool`, *optional*, defaults to `self.do_normalize`): + Whether to normalize the image. + image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`): + Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`. + image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`): + Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to + `True`. + do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`): + Whether to convert the image to RGB. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. + data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): + The channel dimension format for the output image. Can be one of: + - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. 
+ - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: defaults to the channel dimension format of the input image. + """ + do_resize = do_resize if do_resize is not None else self.do_resize + size = size if size is not None else self.size + size = get_size_dict(size, param_name="size", default_to_square=False) + resample = resample if resample is not None else self.resample + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop + crop_size = crop_size if crop_size is not None else self.crop_size + crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True) + do_rescale = do_rescale if do_rescale is not None else self.do_rescale + rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor + do_normalize = do_normalize if do_normalize is not None else self.do_normalize + image_mean = image_mean if image_mean is not None else self.image_mean + image_std = image_std if image_std is not None else self.image_std + do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb + + images = make_list_of_images(images) + + if not valid_images(images): + raise ValueError( + "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " + "torch.Tensor, tf.Tensor or jax.ndarray." 
+ ) + + if do_resize and size is None: + raise ValueError("Size must be specified if do_resize is True.") + + if do_center_crop and crop_size is None: + raise ValueError("Crop size must be specified if do_center_crop is True.") + + if do_rescale and rescale_factor is None: + raise ValueError("Rescale factor must be specified if do_rescale is True.") + + if do_normalize and (image_mean is None or image_std is None): + raise ValueError("Image mean and std must be specified if do_normalize is True.") + + # PIL RGBA images are converted to RGB + if do_convert_rgb: + images = [convert_to_rgb(image) for image in images] + + # All transformations expect numpy arrays. + images = [to_numpy_array(image) for image in images] + + if do_resize: + images = [self.resize(image=image, size=size, resample=resample) for image in images] + + if do_center_crop: + images = [self.center_crop(image=image, size=crop_size) for image in images] + + if do_rescale: + images = [self.rescale(image=image, scale=rescale_factor) for image in images] + + if do_normalize: + images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] + + images = [to_channel_dimension_format(image, data_format) for image in images] + + data = {"pixel_values": images} + return BatchFeature(data=data, tensor_type=return_tensors) diff --git a/src/transformers/models/imagebind/modeling_imagebind.py b/src/transformers/models/imagebind/modeling_imagebind.py new file mode 100644 index 000000000000..ca4ef7202adf --- /dev/null +++ b/src/transformers/models/imagebind/modeling_imagebind.py @@ -0,0 +1,2985 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch ImageBind model.""" + + +from dataclasses import dataclass +from typing import Any, List, Optional, Tuple, Union + +import numpy as np +import torch +import torch.utils.checkpoint +from torch import nn +from timm.layers import DropPath + +from ...activations import ACT2FN +from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling +from ...modeling_utils import PreTrainedModel +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_imagebind import ( + ImageBindConfig, + ImageBindAudioConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "facebook/imagebind-huge" + +IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "facebook/imagebind-huge", + # See all ImageBind models at https://huggingface.co/models?filter=imagebind +] + + +# Copied from transformers.models.bart.modeling_bart._expand_mask +def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): + """ + Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
+ """ + bsz, src_len = mask.size() + tgt_len = tgt_len if tgt_len is not None else src_len + + expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) + + inverted_mask = 1.0 - expanded_mask + + return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) + + +# TODO: can use code already in transformers? +# contrastive loss function, adapted from +# https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/ImageBind.html +def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: + return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device)) + + +# Copied from transformers.models.clip.modeling_clip.clip_loss with clip->imagebind +def imagebind_loss(similarity: torch.Tensor) -> torch.Tensor: + caption_loss = contrastive_loss(similarity) + image_loss = contrastive_loss(similarity.t()) + return (caption_loss + image_loss) / 2.0 + + +# BaseModelOutputWithPooling + num_clips field for modalities which have clips (vision, audio) +@dataclass +class ImageBindTransformerOutput(ModelOutput): + """ + The output class for ImageBind*Transformer models. This is [`BaseModelOutputWithPooling`] with an additional + `num_clips` field for modalities which are organized into clips as well as batches (vision, audio). + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + Last layer hidden-state of the first token of the sequence (classification token) after further processing + through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns + the classification token after processing through a linear layer and a tanh activation function. 
The linear + layer weights are trained from the next sentence prediction (classification) objective during pretraining. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + num_clips (`int`, *optional*): + The number of clips for modalities which have both a batch dimension (dim 0) and clip dimension (dim 1). + In the original ImageBind model, these modalities are vision (image/video) and audio. + """ + + last_hidden_state: torch.FloatTensor = None + pooler_output: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + num_clips: Optional[int] = None + + +@dataclass +# CLIPTextModelOutput + normalized embeddings +class ImageBindTextModelOutput(ModelOutput): + """ + Base class for text model's outputs that also contains a pooling of the last hidden states. + + Args: + text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The text embeddings obtained by applying the projection layer to the pooler_output.
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized text embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + text_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_text_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +# ClipVisionModelOutput + normalized embeddings +class ImageBindVisionModelOutput(ModelOutput): + """ + Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. 
+ + Args: + image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized image embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. 
+ """ + + image_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_image_embeds: Optional[torch.FloatTensor] = None + + +# CLAPAudioModelOutput + normalized embeddings +@dataclass +class ImageBindAudioModelOutput(ModelOutput): + """ + ClapAudio model output to mimic the output of the original implementation. + + Args: + audio_embeds (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + The Audio embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + normalized_audio_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized audio embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. 
+ """ + + audio_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_audio_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindDepthModelOutput(ModelOutput): + """ + Base class for depth model's outputs that also contains a pooling of the last hidden states. + + Args: + depth_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The depth embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. 
+ normalized_depth_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized depth embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + depth_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_depth_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindThermalModelOutput(ModelOutput): + """ + Base class for thermal model's outputs that also contains a pooling of the last hidden states. + + Args: + thermal_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The thermal embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. 
+ + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_thermal_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized thermal embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + thermal_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_thermal_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +class ImageBindImuModelOutput(ModelOutput): + """ + Base class for IMU model's outputs that also contains a pooling of the last hidden states. + + Args: + imu_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The IMU embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + normalized_imu_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The normalized IMU embeddings obtained by applying the projection layer to the pooler_output, then + applying L2 normalization and scaling the logits. + """ + + imu_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + normalized_imu_embeds: Optional[torch.FloatTensor] = None + + +@dataclass +# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->ImageBind +class ImageBindOutput(ModelOutput): + """ + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): + Contrastive loss for image-text similarity. + logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): + The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text + similarity scores. + logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): + The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image + similarity scores. + logits_per_audio:(`torch.FloatTensor` of shape `(audio_batch_size, image_batch_size)`): + The scaled dot product scores between `audio_embeds` and `image_embeds`. This represents the audio-image + similarity scores.
+ logits_per_depth:(`torch.FloatTensor` of shape `(depth_batch_size, image_batch_size)`): + The scaled dot product scores between `depth_embeds` and `image_embeds`. This represents the depth-image + similarity scores. + logits_per_thermal:(`torch.FloatTensor` of shape `(thermal_batch_size, image_batch_size)`): + The scaled dot product scores between `thermal_embeds` and `image_embeds`. This represents the thermal-image + similarity scores. + logits_per_imu:(`torch.FloatTensor` of shape `(imu_batch_size, image_batch_size)`): + The scaled dot product scores between `imu_embeds` and `image_embeds`. This represents the IMU-image + similarity scores. + text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized text embeddings obtained by applying the projection layer to the pooled output of [`ImageBindTextModel`], then applying L2 normalization and logit scaling. + image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized image embeddings obtained by applying the projection layer to the pooled output of [`ImageBindVisionModel`], then applying L2 normalization and logit scaling. + audio_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized audio embeddings obtained by applying the projection layer to the pooled output of [`ImageBindAudioModel`], then applying L2 normalization and logit scaling. + depth_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized depth embeddings obtained by applying the projection layer to the pooled output of [`ImageBindDepthModel`], then applying L2 normalization and logit scaling. + thermal_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized thermal embeddings obtained by applying the projection layer to the pooled output of [`ImageBindThermalModel`], then applying L2 normalization and logit scaling.
+ imu_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim)`): + The normalized IMU embeddings obtained by applying the projection layer to the pooled output of [`ImageBindImuModel`], then applying L2 normalization and logit scaling. + text_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindTextModel`]. + vision_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindVisionModel`]. + audio_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindAudioModel`]. + depth_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindDepthModel`]. + thermal_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindThermalModel`]. + imu_model_output(`BaseModelOutputWithPooling`): + The output of the [`ImageBindImuModel`]. + """ + + loss: Optional[torch.FloatTensor] = None + logits_per_image: torch.FloatTensor = None + logits_per_text: torch.FloatTensor = None + logits_per_audio: torch.FloatTensor = None + logits_per_depth: torch.FloatTensor = None + logits_per_thermal: torch.FloatTensor = None + logits_per_imu: torch.FloatTensor = None + text_embeds: torch.FloatTensor = None + image_embeds: torch.FloatTensor = None + audio_embeds: torch.FloatTensor = None + depth_embeds: torch.FloatTensor = None + thermal_embeds: torch.FloatTensor = None + imu_embeds: torch.FloatTensor = None + text_model_output: BaseModelOutputWithPooling = None + vision_model_output: BaseModelOutputWithPooling = None + audio_model_output: BaseModelOutputWithPooling = None + depth_model_output: BaseModelOutputWithPooling = None + thermal_model_output: BaseModelOutputWithPooling = None + imu_model_output: BaseModelOutputWithPooling = None + + def to_tuple(self) -> Tuple[Any]: + fields_to_exclude = [ + "text_model_output", + "vision_model_output", + "audio_model_output", + "depth_model_output", + "thermal_model_output", + "imu_model_output", + ] + return tuple( + self[k] if k not in fields_to_exclude else
getattr(self, k).to_tuple() + for k in self.keys() + ) + + +# Copied from transformers.models.clip.modeling_clip.CLIPTextEmbeddings with CLIP->ImageBind +class ImageBindTextEmbeddings(nn.Module): + def __init__(self, config: ImageBindTextConfig): + super().__init__() + embed_dim = config.hidden_size + + self.token_embedding = nn.Embedding(config.vocab_size, embed_dim) + self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim) + + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + ) -> torch.Tensor: + seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if inputs_embeds is None: + inputs_embeds = self.token_embedding(input_ids) + + position_embeddings = self.position_embedding(position_ids) + embeddings = inputs_embeds + position_embeddings + + return embeddings + + +class RGBDTPatchEmbedding(nn.Module): + """ + Creates patch embeddings for spatiotemporal data (e.g. images, video, depth etc.). This handles patch embeddings + for all image-like modalities (image/video, depth, thermal). 
+ """ + def __init__( + self, + config: Union[ImageBindAudioConfig, ImageBindDepthConfig, ImageBindThermalConfig, ImageBindVisionConfig], + image_shape: Union[List[int], Tuple[int]], + norm_layer: Optional[nn.Module] = None, + ): + super().__init__() + self.config = config + self.image_shape = image_shape + self.embed_dim = config.hidden_size + self.patch_size = config.patch_size + self.stride = config.stride + self.num_frames = config.num_frames if hasattr(config, "num_frames") else None + self.is_temporal = self.num_frames is not None + + self.class_embedding = nn.Parameter(torch.randn(self.embed_dim)) + + if self.is_temporal: + patch_embedding_cls = nn.Conv3d + else: + patch_embedding_cls = nn.Conv2d + + self.patch_embedding = patch_embedding_cls( + in_channels=image_shape[0], + out_channels=self.embed_dim, + kernel_size=self.patch_size, + stride=self.stride, + bias=False, + ) + self.norm_layer = norm_layer if norm_layer is not None else nn.Identity() + + if self.is_temporal: + patches_along_time_dim = (config.num_frames // self.patch_size[0]) + patches_along_height_dim = ((self.image_shape[-2] - self.patch_size[-2]) // self.stride[-2]) + 1 + patches_along_width_dim = ((self.image_shape[-1] - self.patch_size[-1]) // self.stride[-1]) + 1 + else: + patches_along_time_dim = 1 + patches_along_height_dim = ((self.image_shape[-2] - self.patch_size) // self.stride) + 1 + patches_along_width_dim = ((self.image_shape[-1] - self.patch_size) // self.stride) + 1 + self.num_patches = patches_along_height_dim * patches_along_width_dim * patches_along_time_dim + self.num_positions = self.num_patches + 1 + self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + + def image_to_video(self, image: torch.FloatTensor, time_dim: int = 2, ntimes: int = 2, pad_type: str = "repeat"): + """ + Maps 4-dim image tensors of shape (B, C, H, W) to 5-dim video tensors, possibly 
repeating the image along the
+        time dimension. For example, if `time_dim == 2`, RGB images of shape (B, C, H, W) will be transformed to
+        video of shape (B, C, 1, H, W), and the image will then be repeated along the time dimension `ntimes` times
+        to get shape (B, C, N, H, W).
+        """
+        if image.ndim not in [4, 5]:
+            raise ValueError(
+                f"The input `image` tensor should be 4- or 5-dimensional but has {image.ndim} dimensions."
+            )
+
+        # Add a time dimension at the specified index
+        if image.ndim == 4:
+            image = image.unsqueeze(time_dim)
+
+        # Repeat the image across the time dimension `ntimes` times.
+        if image.shape[time_dim] == 1:
+            if pad_type == "repeat":
+                new_shape = [1] * len(image.shape)
+                new_shape[time_dim] = ntimes
+                video = image.repeat(new_shape)
+            elif pad_type == "zero":
+                # `nn.functional.pad` expects padding amounts ordered from the *last* dimension backwards
+                pad_arg = [0, 0] * len(image.shape)
+                pad_arg[2 * (image.ndim - 1 - time_dim) + 1] = ntimes - image.shape[time_dim]
+                video = nn.functional.pad(image, pad_arg)
+            else:
+                raise ValueError(f"`pad_type` should be either 'repeat' or 'zero' but got {pad_type}.")
+        else:
+            # The input already has more than one frame; leave it unchanged.
+            video = image
+        return video
+
+    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
+        batch_size = pixel_values.shape[0]
+        if self.is_temporal:
+            # nn.Conv3d expects inputs of shape (B, C, T, H, W), so the time dimension is inserted at index 2.
+            pixel_values = self.image_to_video(pixel_values, time_dim=2, ntimes=self.num_frames)
+
+        patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
+        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+        patch_embeds = self.norm_layer(patch_embeds)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+
+
+class ImageBindVisionEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindVisionConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        super().__init__(config, image_shape, norm_layer=None)
+
+
+class ImageBindAudioEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindAudioConfig):
+        image_shape = (config.num_channels, config.num_mel_bins, config.target_len)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, audio: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=audio)
+
+
+class ImageBindDepthEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindDepthConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, depth: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=depth)
+
+
+class ImageBindThermalEmbeddings(RGBDTPatchEmbedding):
+    def __init__(self, config: ImageBindThermalConfig):
+        image_shape = (config.num_channels, config.image_size, config.image_size)
+        layer_norm = nn.LayerNorm(config.hidden_size)
+        super().__init__(config, image_shape, norm_layer=layer_norm)
+
+    def forward(self, thermal: torch.FloatTensor) -> torch.Tensor:
+        return super().forward(pixel_values=thermal)
+
+
+class ImageBindImuEmbeddings(nn.Module):
+    def __init__(self, config: ImageBindImuConfig):
+        super().__init__()
+        self.config = config
+        self.embed_dim = config.hidden_size
+        self.kernel_size = config.kernel_size
+        self.in_features = config.input_shape[0] * self.kernel_size
+
+        self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
+
+        self.patch_embedding = nn.Linear(self.in_features, self.embed_dim, bias=False)
+        self.norm_layer = nn.LayerNorm(self.embed_dim)
+
+        self.num_patches = config.input_shape[1] // self.kernel_size
+        self.num_positions = self.num_patches + 1
+        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
+        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)))
+
+    def forward(self, imu: torch.FloatTensor) -> torch.Tensor:
+        batch_size = imu.shape[0]
+
+        # Patchify
+        # (B, C, L) -> (B, C, L // K, K) -> (B, L // K, C, K)
+        patches = imu.unfold(-1, self.kernel_size,
self.kernel_size).permute(0, 2, 1, 3)
+        patches = patches.reshape(batch_size, patches.shape[1], -1)
+
+        # The linear patch embedding already yields (B, num_patches, embed_dim), so no reshaping is needed
+        # before the layer norm (unlike the convolutional patch embeddings).
+        patch_embeds = self.patch_embedding(patches)
+        patch_embeds = self.norm_layer(patch_embeds)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+
+
+# CLIPAttention + key/value biases
+class ImageBindAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.embed_dim = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.embed_dim // self.num_heads
+        if self.head_dim * self.num_heads != self.embed_dim:
+            raise ValueError(
+                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
+                f" {self.num_heads})."
+            )
+        self.scale = self.head_dim**-0.5
+        self.dropout = config.attention_dropout
+
+        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
+        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
+
+        # Create bias parameters for key and value sequences.
+        if config.add_kv_bias:
+            self.k_bias = nn.Parameter(torch.empty((1, 1, self.embed_dim)))
+            self.v_bias = nn.Parameter(torch.empty((1, 1, self.embed_dim)))
+        else:
+            self.k_bias = None
+            self.v_bias = None
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        causal_attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        """Input shape: Batch x Time x Channel"""
+
+        bsz, tgt_len, embed_dim = hidden_states.size()
+
+        # get query proj
+        query_states = self.q_proj(hidden_states) * self.scale
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+
+        # Add key/value biases if necessary
+        if self.k_bias is not None and self.v_bias is not None:
+            # Repeat the bias along the batch dimension and append it along the sequence dimension
+            key_states = torch.cat([key_states, self.k_bias.repeat(bsz, 1, 1)], dim=1)
+            value_states = torch.cat([value_states, self.v_bias.repeat(bsz, 1, 1)], dim=1)
+
+        key_states = self._shape(key_states, -1, bsz)
+        value_states = self._shape(value_states, -1, bsz)
+
+        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
+        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
+        key_states = key_states.view(*proj_shape)
+        value_states = value_states.view(*proj_shape)
+
+        src_len = key_states.size(1)
+        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
+
+        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
+            raise ValueError(
+                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
+                f" {attn_weights.size()}"
+            )
+
+        # apply the causal_attention_mask first
+        if causal_attention_mask is not None:
+            if causal_attention_mask.size() != (bsz, 1, tgt_len,
src_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
+                    f" {causal_attention_mask.size()}"
+                )
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        if attention_mask is not None:
+            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
+                )
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+
+        if output_attentions:
+            # this operation is a bit awkward, but it's required in order to
+            # make sure that attn_weights keeps its gradient.
+            # To do so, attn_weights has to be reshaped twice and reused in
+            # the following computation.
+            attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
+        else:
+            attn_weights_reshaped = None
+
+        attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
+
+        attn_output = torch.bmm(attn_probs, value_states)
+
+        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+
+        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)
+
+        attn_output = self.out_proj(attn_output)
+
+        return attn_output, attn_weights_reshaped
+
+
+# Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->ImageBind
+class ImageBindMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.activation_fn = ACT2FN[config.hidden_act]
+        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
+        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.fc1(hidden_states)
+        hidden_states = self.activation_fn(hidden_states)
+        hidden_states = self.fc2(hidden_states)
+        return hidden_states
+
+
+# CLIPEncoderLayer with DropPath layer after each residual subblock (attention, feedforward)
+class ImageBindEncoderLayer(nn.Module):
+    def __init__(self, config: ImageBindConfig, drop_path_rate: float = 0.0):
+        super().__init__()
+        self.embed_dim = config.hidden_size
+        self.self_attn = ImageBindAttention(config)
+        self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        self.mlp = ImageBindMLP(config)
+        self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        if drop_path_rate > 0.0:
+            self.drop_path = DropPath(drop_path_rate)
+        else:
+            self.drop_path = nn.Identity()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        causal_attention_mask: torch.Tensor,
+        output_attentions: Optional[bool] = False,
+    ) -> Tuple[torch.FloatTensor]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            causal_attention_mask (`torch.FloatTensor`): causal attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+ """ + residual = hidden_states + + hidden_states = self.layer_norm1(hidden_states) + hidden_states, attn_weights = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + causal_attention_mask=causal_attention_mask, + output_attentions=output_attentions, + ) + hidden_states = self.drop_path(hidden_states) + hidden_states = residual + hidden_states + + residual = hidden_states + hidden_states = self.layer_norm2(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = self.drop_path(hidden_states) + hidden_states = residual + hidden_states + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class ImageBindPostProcessor(nn.Module): + """ + Post-processes ImageBind embeddings by using a normalize layer followed by an optional logit scaling layer. + """ + def __init__( + self, + config, + dim: int = -1, + max_logit_scale: float = 100, + ): + super().__init__() + self.dim = dim + self.scale_logits = config.logit_scale_init_value is not None + + if self.scale_logits: + self.logit_scale_init = config.logit_scale_init_value + self.max_logit_scale = max_logit_scale + self.learnable = config.learnable_logit_scale + + log_logit_scale = torch.ones([]) * np.log(self.logit_scale_init) + if self.learnable: + self.log_logit_scale = nn.Parameter(log_logit_scale) + else: + self.register_buffer("log_logit_scale", log_logit_scale) + + def forward(self, logits: torch.FloatTensor) -> torch.FloatTensor: + logits = nn.functional.normalize(logits, dim=self.dim, p=2) + if self.scale_logits: + logits = torch.clip(self.log_logit_scale.exp(), max=self.max_logit_scale) * logits + return logits + + +class ImageBindPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = ImageBindConfig + base_model_prefix = "imagebind" + supports_gradient_checkpointing = True + _keys_to_ignore_on_load_missing = [r"position_ids"] + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_factor + if isinstance(module, ImageBindTextEmbeddings): + module.token_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02) + module.position_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02) + elif isinstance(module, RGBDTPatchEmbedding): + factor = self.config.initializer_factor + nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor) + nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor) + nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor) + elif isinstance(module, ImageBindImuEmbeddings): + factor = self.config.initializer_factor + nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor) + nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor) + nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor) + elif isinstance(module, ImageBindAttention): + factor = self.config.initializer_factor + in_proj_std = (module.embed_dim**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor + out_proj_std = (module.embed_dim**-0.5) * factor + nn.init.normal_(module.q_proj.weight, std=in_proj_std) + nn.init.normal_(module.k_proj.weight, std=in_proj_std) + nn.init.normal_(module.v_proj.weight, std=in_proj_std) + nn.init.normal_(module.out_proj.weight, std=out_proj_std) + if module.k_bias is not None: + nn.init.normal_(module.k_bias, std=in_proj_std) + if module.v_bias is not None: + nn.init.normal_(module.v_bias, std=in_proj_std) + elif isinstance(module, ImageBindMLP): + factor = self.config.initializer_factor + in_proj_std = ( + (module.config.hidden_size**-0.5) 
* ((2 * module.config.num_hidden_layers) ** -0.5) * factor + ) + fc_std = (2 * module.config.hidden_size) ** -0.5 * factor + nn.init.normal_(module.fc1.weight, std=fc_std) + nn.init.normal_(module.fc2.weight, std=in_proj_std) + elif isinstance(module, ImageBindModel): + nn.init.normal_( + module.text_projection.weight, + std=module.text_embed_dim**-0.5 * self.config.initializer_factor, + ) + nn.init.normal_( + module.visual_projection.weight, + std=module.vision_embed_dim**-0.5 * self.config.initializer_factor, + ) + elif isinstance(module, ImageBindVisionModelWithProjection): + nn.init.normal_( + module.visual_projection.weight, + std=self.config.hidden_size**-0.5 * self.config.initializer_factor, + ) + elif isinstance(module, ImageBindTextModelWithProjection): + nn.init.normal_( + module.text_projection.weight, + std=self.config.hidden_size**-0.5 * self.config.initializer_factor, + ) + + if isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, ImageBindEncoder): + module.gradient_checkpointing = value + + +IMAGEBIND_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`ImageBindConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. 
Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +IMAGEBIND_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +IMAGEBIND_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`ImageBindImageProcessor.__call__`] for details. 
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+# TODO: add inputs docstrings for remaining modalities (audio, depth, thermal, IMU)
+IMAGEBIND_AUDIO_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_DEPTH_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_THERMAL_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+IMAGEBIND_IMU_INPUTS_DOCSTRING = r"""
+    Args:
+        TODO
+"""
+
+# TODO: update inputs docstring with remaining modalities
+IMAGEBIND_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+            config.max_position_embeddings - 1]`.
+ + [What are position IDs?](../glossary#position-ids) + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`ImageBindImageProcessor.__call__`] for details. + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +# CLIPEncoder with DropPath support +class ImageBindEncoder(nn.Module): + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`ImageBindEncoderLayer`]. 
+
+    Args:
+        config: ImageBindConfig
+    """
+
+    def __init__(self, config: ImageBindConfig, drop_path_type: str = "progressive"):
+        super().__init__()
+        self.config = config
+
+        if drop_path_type == "progressive":
+            drop_path_rates = [
+                prob.item() for prob in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)
+            ]
+        elif drop_path_type == "uniform":
+            drop_path_rates = [config.drop_path_rate for _ in range(config.num_hidden_layers)]
+        else:
+            raise ValueError(
+                f"`drop_path_type` is expected to be in `['uniform', 'progressive']` but got {drop_path_type}"
+            )
+
+        self.layers = nn.ModuleList(
+            [ImageBindEncoderLayer(config, drop_path_rate) for drop_path_rate in drop_path_rates]
+        )
+        self.gradient_checkpointing = False
+
+    def forward(
+        self,
+        inputs_embeds,
+        attention_mask: Optional[torch.Tensor] = None,
+        causal_attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutput]:
+        r"""
+        Args:
+            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
+                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+                than the model's internal embedding lookup matrix.
+            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+                [What are attention masks?](../glossary#attention-mask)
+            causal_attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Causal mask for the text model. Mask values selected in `[0, 1]`:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+ + [What are attention masks?](../glossary#attention-mask) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + attention_mask, + causal_attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + causal_attention_mask, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, 
hidden_states=encoder_states, attentions=all_attentions + ) + + +# TODO: copied from CLIP? +class ImageBindTextTransformer(nn.Module): + def __init__(self, config: ImageBindTextConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + self.embeddings = ImageBindTextEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is None: + raise ValueError("You have to specify input_ids") + + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + + hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids) + + bsz, seq_len = input_shape + # ImageBind's text model uses causal mask, prepare it here. 
+ # https://github.com/facebookresearch/ImageBind/blob/95d27c7fd5a8362f3527e176c3a80ae5a4d880c0/imagebind/models/imagebind_model.py#L172 + causal_attention_mask = self._build_causal_attention_mask( + bsz, seq_len, hidden_states.dtype, device=hidden_states.device + ) + # expand attention_mask + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _expand_mask(attention_mask, hidden_states.dtype) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + attention_mask=attention_mask, + causal_attention_mask=causal_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = encoder_outputs[0] + last_hidden_state = self.final_layer_norm(last_hidden_state) + + # text_embeds.shape = [batch_size, sequence_length, transformer.width] + # take features from the eot embedding (eot_token is the highest number in each sequence) + # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), + ] + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,) + + return ImageBindTransformerOutput( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + num_clips=None, + ) + + def _build_causal_attention_mask(self, bsz, seq_len, dtype, device=None): + # lazily create causal attention mask, with full attention between the vision tokens + # pytorch uses additive attention mask; fill with -inf + mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype, device=device) + mask.fill_(torch.finfo(dtype).min) + mask.triu_(1) # zero out the lower diagonal + mask = 
mask.unsqueeze(1) # expand mask + return mask + + +# TODO: copied from CLIP? +@add_start_docstrings( + """The text model from ImageBind without any head or projection on top.""", + IMAGEBIND_START_DOCSTRING, +) +class ImageBindTextModel(ImageBindPreTrainedModel): + config_class = ImageBindTextConfig + + _no_split_modules = ["ImageBindEncoderLayer"] + + def __init__(self, config: ImageBindTextConfig): + super().__init__(config) + self.text_model = ImageBindTextTransformer(config) + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.text_model.embeddings.token_embedding + + def set_input_embeddings(self, value): + self.text_model.embeddings.token_embedding = value + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from transformers import AutoTokenizer, ImageBindTextModel + + >>> model = ImageBindTextModel.from_pretrained("facebook/imagebind-huge") + >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled (EOS token) states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + 
position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindVisionTransformer(nn.Module): + def __init__(self, config: ImageBindVisionConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindVisionEmbeddings(config) + self.pre_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + num_clips = None + reduce_clips = pixel_values.ndim >= 5 + if reduce_clips: + batch_size, num_clips = pixel_values.shape[:2] + pixel_values = pixel_values.reshape(batch_size * num_clips, *pixel_values.shape[2:]) + + hidden_states = self.embeddings(pixel_values) + hidden_states = self.pre_layernorm(hidden_states) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = 
encoder_outputs[0] + pooled_output = last_hidden_state[:, 0, :] + pooled_output = self.post_layernorm(pooled_output) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (num_clips,) + + return ImageBindTransformerOutput( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + num_clips=num_clips, + ) + + +# TODO: copied from CLIP? +@add_start_docstrings( + """The vision model from ImageBind without any head or projection on top.""", + IMAGEBIND_START_DOCSTRING, +) +class ImageBindVisionModel(ImageBindPreTrainedModel): + config_class = ImageBindVisionConfig + _no_split_modules = ["ImageBindEncoderLayer"] + + main_input_name = "pixel_values" + + def __init__(self, config: ImageBindVisionConfig): + super().__init__(config) + self.vision_model = ImageBindVisionTransformer(config) + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindVisionModel + + >>> model = ImageBindVisionModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> 
inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindAudioTransformer(nn.Module): + def __init__(self, config: ImageBindAudioConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindAudioEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindAudioConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_features is None: + raise ValueError("You have to specify input_features") + + num_clips = None + reduce_clips = input_features.ndim >= 5 + if reduce_clips: + batch_size, num_clips = input_features.shape[:2] + input_features = input_features.reshape(batch_size * num_clips, *input_features.shape[2:]) + + hidden_states = 
self.embeddings(input_features)
+
+        encoder_outputs = self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (num_clips,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=num_clips,
+        )
+
+
+@add_start_docstrings(
+    """The audio model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindAudioModel(ImageBindPreTrainedModel):
+    config_class = ImageBindAudioConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "input_features"
+
+    def __init__(self, config: ImageBindAudioConfig):
+        super().__init__(config)
+        self.audio_model = ImageBindAudioTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.audio_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindAudioConfig)
+    def forward(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from datasets import load_dataset
+        >>> from transformers import AutoProcessor, ImageBindAudioModel
+
+        >>> model = ImageBindAudioModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+        >>> audio_sample = ds[0]["audio"]
+
+        >>> inputs = processor(
+        ...     audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
+        ... )
+
+        >>> outputs = model(**inputs)
+        >>> last_hidden_state = outputs.last_hidden_state
+        >>> pooled_output = outputs.pooler_output  # pooled CLS states
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        return self.audio_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+
+# TODO: copied from CLIP?
+class ImageBindDepthTransformer(nn.Module):
+    def __init__(self, config: ImageBindDepthConfig):
+        super().__init__()
+        self.config = config
+        embed_dim = config.hidden_size
+
+        self.embeddings = ImageBindDepthEmbeddings(config)
+        self.encoder = ImageBindEncoder(config)
+        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindDepthConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        """
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        hidden_states = self.embeddings(pixel_values)
+
+        encoder_outputs =
self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The depth model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindDepthModel(ImageBindPreTrainedModel):
+    config_class = ImageBindDepthConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "pixel_values"  # TODO: rename to something better?
+
+    def __init__(self, config: ImageBindDepthConfig):
+        super().__init__(config)
+        self.depth_model = ImageBindDepthTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.depth_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindDepthConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindDepthModel
+
+        >>> model = ImageBindDepthModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor =
AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.depth_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindThermalTransformer(nn.Module): + def __init__(self, config: ImageBindThermalConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindThermalEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + + @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindThermalConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + hidden_states = self.embeddings(pixel_values) + + encoder_outputs = 
self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The thermal model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindThermalModel(ImageBindPreTrainedModel):
+    config_class = ImageBindThermalConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "pixel_values"  # TODO: rename to something better?
+
+    def __init__(self, config: ImageBindThermalConfig):
+        super().__init__(config)
+        self.thermal_model = ImageBindThermalTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.thermal_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindThermalConfig)
+    def forward(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindThermalModel
+
+        >>> model = ImageBindThermalModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor =
AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + return self.thermal_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + +# TODO: copied from CLIP? +class ImageBindImuTransformer(nn.Module): + def __init__(self, config: ImageBindImuConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = ImageBindImuEmbeddings(config) + self.encoder = ImageBindEncoder(config) + self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + self.post_dropout = nn.Dropout(p=config.final_dropout) + + @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindImuConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTransformerOutput]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_features is None: + raise ValueError("You have to specify input_features") + + hidden_states = 
self.embeddings(input_features)
+
+        encoder_outputs = self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = last_hidden_state[:, 0, :]
+        pooled_output = self.post_layernorm(pooled_output)
+        pooled_output = self.post_dropout(pooled_output)
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:] + (None,)
+
+        return ImageBindTransformerOutput(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+            num_clips=None,
+        )
+
+
+@add_start_docstrings(
+    """The IMU model from ImageBind without any head or projection on top.""",
+    IMAGEBIND_START_DOCSTRING,
+)
+class ImageBindImuModel(ImageBindPreTrainedModel):
+    config_class = ImageBindImuConfig
+    _no_split_modules = ["ImageBindEncoderLayer"]
+
+    main_input_name = "input_features"
+
+    def __init__(self, config: ImageBindImuConfig):
+        super().__init__(config)
+        self.imu_model = ImageBindImuTransformer(config)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.imu_model.embeddings.patch_embedding
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=ImageBindTransformerOutput, config_class=ImageBindImuConfig)
+    def forward(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageBindTransformerOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> import torch
+        >>> from transformers import ImageBindImuModel
+
+        >>> model = ImageBindImuModel.from_pretrained("facebook/imagebind-huge")
+
+        >>> # dummy IMU clip: 6 channels (3-axis accelerometer + 3-axis gyroscope) over time
+        >>> input_features = torch.randn(1, 6, 2000)
+
+        >>> outputs = model(input_features)
+        >>> last_hidden_state = outputs.last_hidden_state
+        >>> pooled_output = outputs.pooler_output  # pooled CLS states
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        return self.imu_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+
+@add_start_docstrings(IMAGEBIND_START_DOCSTRING)
+class ImageBindModel(ImageBindPreTrainedModel):
+    config_class = ImageBindConfig
+
+    def __init__(self, config: ImageBindConfig):
+        super().__init__(config)
+
+        if not isinstance(config.text_config, ImageBindTextConfig):
+            raise ValueError(
+                "config.text_config is expected to be of type ImageBindTextConfig but is of type"
+                f" {type(config.text_config)}."
+            )
+
+        if not isinstance(config.vision_config, ImageBindVisionConfig):
+            raise ValueError(
+                "config.vision_config is expected to be of type ImageBindVisionConfig but is of type"
+                f" {type(config.vision_config)}."
+            )
+
+        if not isinstance(config.audio_config, ImageBindAudioConfig):
+            raise ValueError(
+                "config.audio_config is expected to be of type ImageBindAudioConfig but is of type"
+                f" {type(config.audio_config)}."
+            )
+
+        if not isinstance(config.depth_config, ImageBindDepthConfig):
+            raise ValueError(
+                "config.depth_config is expected to be of type ImageBindDepthConfig but is of type"
+                f" {type(config.depth_config)}."
+            )
+
+        if not isinstance(config.thermal_config, ImageBindThermalConfig):
+            raise ValueError(
+                "config.thermal_config is expected to be of type ImageBindThermalConfig but is of type"
+                f" {type(config.thermal_config)}."
+ ) + + if not isinstance(config.imu_config, ImageBindImuConfig): + raise ValueError( + "config.imu_config is expected to be of type ImageBindImuConfig but is of type" + f" {type(config.imu_config)}." + ) + + text_config = config.text_config + vision_config = config.vision_config + audio_config = config.audio_config + depth_config = config.depth_config + thermal_config = config.thermal_config + imu_config = config.imu_config + + self.projection_dim = config.projection_dim + self.text_embed_dim = text_config.hidden_size + self.vision_embed_dim = vision_config.hidden_size + self.audio_embed_dim = audio_config.hidden_size + self.depth_embed_dim = depth_config.hidden_size + self.thermal_embed_dim = thermal_config.hidden_size + self.imu_embed_dim = imu_config.hidden_size + + self.text_model = ImageBindTextTransformer(text_config) + self.vision_model = ImageBindVisionTransformer(vision_config) + self.audio_model = ImageBindAudioTransformer(audio_config) + self.depth_model = ImageBindDepthTransformer(depth_config) + self.thermal_model = ImageBindThermalTransformer(thermal_config) + self.imu_model = ImageBindImuTransformer(imu_config) + + self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) + self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) + self.audio_projection = nn.Linear(self.audio_embed_dim, self.projection_dim, bias=False) + self.depth_projection = nn.Linear(self.depth_embed_dim, self.projection_dim, bias=False) + self.thermal_projection = nn.Linear(self.thermal_embed_dim, self.projection_dim, bias=False) + self.imu_projection = nn.Linear(self.imu_embed_dim, self.projection_dim, bias=False) + + self.text_postprocessor = ImageBindPostProcessor(text_config) + self.vision_postprocessor = ImageBindPostProcessor(vision_config) + self.audio_postprocessor = ImageBindPostProcessor(audio_config) + self.depth_postprocessor = ImageBindPostProcessor(depth_config) + self.thermal_postprocessor = 
ImageBindPostProcessor(thermal_config)
+        self.imu_postprocessor = ImageBindPostProcessor(imu_config)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING)
+    def get_text_features(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindTextModel`].
+
+        Examples:
+
+        ```python
+        >>> from transformers import AutoTokenizer, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge")
+
+        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
+        >>> text_features = model.get_text_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the text component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        text_outputs = self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = text_outputs[1]
+        text_features = self.text_projection(pooled_output)
+
+        return text_features
+
+    @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING)
+    def get_image_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindVisionModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> image_features = model.get_image_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the vision components.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        batch_size = pixel_values.shape[0]
+
+        vision_outputs = self.vision_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = vision_outputs[1]  # pooled_output
+        image_features = self.visual_projection(pooled_output)
+
+        num_clips = vision_outputs[-1]
+        if num_clips is not None:
+            image_features = image_features.reshape(batch_size, num_clips, -1)
+            # Take mean over all clips
+            image_features = image_features.mean(dim=1)
+
+        return image_features
+
+    # TODO: make sure inputs match with ImageBindAudioModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING)
+    def get_audio_features(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            audio_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The audio embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindAudioModel`].
+
+        Examples:
+
+        ```python
+        >>> from datasets import load_dataset
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+        >>> audio_sample = ds[0]["audio"]
+
+        >>> inputs = processor(
+        ...     audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
+        ... )
+
+        >>> audio_features = model.get_audio_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the audio component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        batch_size = input_features.shape[0]
+
+        audio_outputs = self.audio_model(
+            input_features=input_features,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = audio_outputs[1]  # pooled_output
+        audio_features = self.audio_projection(pooled_output)
+
+        num_clips = audio_outputs[-1]
+        if num_clips is not None:
+            audio_features = audio_features.reshape(batch_size, num_clips, -1)
+            # Take mean over all clips
+            audio_features = audio_features.mean(dim=1)
+
+        return audio_features
+
+    # TODO: make sure inputs match with ImageBindDepthModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING)
+    def get_depth_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            depth_features (`torch.FloatTensor` of shape
`(batch_size, output_dim)`): The depth embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindDepthModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> depth_features = model.get_depth_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the depth component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        depth_outputs = self.depth_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = depth_outputs[1]  # pooled_output
+        depth_features = self.depth_projection(pooled_output)
+
+        return depth_features
+
+    # TODO: make sure inputs match with ImageBindThermalModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING)
+    def get_thermal_features(
+        self,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            thermal_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The thermal embeddings obtained by
+            applying the projection layer to the pooled
output of [`ImageBindThermalModel`].
+
+        Examples:
+
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from transformers import AutoProcessor, ImageBindModel
+
+        >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge")
+        >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> inputs = processor(images=image, return_tensors="pt")
+
+        >>> thermal_features = model.get_thermal_features(**inputs)
+        ```"""
+        # Use ImageBind model's config for some fields (if specified) instead of those in the thermal component.
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        thermal_outputs = self.thermal_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        pooled_output = thermal_outputs[1]  # pooled_output
+        thermal_features = self.thermal_projection(pooled_output)
+
+        return thermal_features
+
+    # TODO: make sure inputs match with ImageBindImuModel
+    @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING)
+    def get_imu_features(
+        self,
+        input_features: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> torch.FloatTensor:
+        r"""
+        Returns:
+            imu_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The IMU embeddings obtained by
+            applying the projection layer to the pooled output of [`ImageBindImuModel`].
+ + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindModel + + >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> imu_features = model.get_imu_features(**inputs) + ```""" + # Use ImageBind model's config for some fields (if specified) instead of those in the IMU component. + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + imu_outputs = self.imu_model( + input_features=input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = imu_outputs[1] # pooled_output + imu_features = self.imu_projection(pooled_output) + + return imu_features + + @add_start_docstrings_to_model_forward(IMAGEBIND_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindOutput, config_class=ImageBindConfig) + def forward( + self, + input_features: Optional[torch.Tensor] = None, + pixel_values: Optional[torch.FloatTensor] = None, + modality: Optional[str] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + return_loss: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import 
requests + >>> from transformers import AutoProcessor, ImageBindModel + + >>> model = ImageBindModel.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor( + ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True + ... ) + + >>> outputs = model(**inputs) + >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score + >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities + ```""" + # Use ImageBind model's config for some fields (if specified) instead of those of vision & text components. + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + image_batch_size = pixel_values.shape[0] + other_batch_size = input_features.shape[0] + + other_model, other_projection, other_postprocessor = self._resolve_modality_models(modality) + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + if modality == "text": + other_outputs = other_model( + input_ids=input_features, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + other_outputs = other_model( + # Non-text encoders name their first argument `pixel_values` or `input_features`, so pass it positionally. + input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + image_embeds =
vision_outputs[1] + image_embeds = self.visual_projection(image_embeds) + + other_embeds = other_outputs[1] + other_embeds = other_projection(other_embeds) + + # normalized features: postprocessor performs normalization and logit scaling + image_embeds = self.vision_postprocessor(image_embeds) + other_embeds = other_postprocessor(other_embeds) + + # If modality input was batched and clipped, reduce embedding over clips dimension + image_num_clips = vision_outputs[-1] + if image_num_clips is not None: + image_embeds = image_embeds.reshape(image_batch_size, image_num_clips, -1) + # Take mean over all clips + image_embeds = image_embeds.mean(dim=1) + other_num_clips = other_outputs[-1] + if other_num_clips is not None: + other_embeds = other_embeds.reshape(other_batch_size, other_num_clips, -1) + other_embeds = other_embeds.mean(dim=1) + + # cosine similarity as logits + logits_per_other = torch.matmul(other_embeds, image_embeds.t()) + logits_per_image = logits_per_other.t() + + loss = None + if return_loss: + loss = imagebind_loss(logits_per_other) + + if not return_dict: + output = (logits_per_image, logits_per_other, other_embeds, image_embeds, other_outputs, vision_outputs) + return ((loss,) + output) if loss is not None else output + + output_kwargs = self._resolve_output_keys(modality, logits_per_other, other_embeds, other_outputs) + + return ImageBindOutput( + loss=loss, + logits_per_image=logits_per_image, + image_embeds=image_embeds, + vision_model_output=vision_outputs, + **output_kwargs, + ) + + def _resolve_modality_models(self, modality: str): + if modality == "text": + model = self.text_model + projection = self.text_projection + postprocessor = self.text_postprocessor + elif modality == "vision": + model = self.vision_model + projection = self.visual_projection + postprocessor = self.vision_postprocessor + elif modality == "audio": + model = self.audio_model + projection = self.audio_projection + postprocessor = self.audio_postprocessor + elif modality 
== "depth": + model = self.depth_model + projection = self.depth_projection + postprocessor = self.depth_postprocessor + elif modality == "thermal": + model = self.thermal_model + projection = self.thermal_projection + postprocessor = self.thermal_postprocessor + elif modality == "imu": + model = self.imu_model + projection = self.imu_projection + postprocessor = self.imu_postprocessor + else: + raise ValueError( + f"`modality` is expected to be in `['text', 'vision', 'audio', 'depth', 'thermal', 'imu']` but got" + f" {modality}" + ) + return model, projection, postprocessor + + def _resolve_output_keys(self, modality: str, logits, embeds, model_outputs): + output_kwargs = {} + if modality == "vision": + # Different naming pattern + output_kwargs["logits_per_image"] = logits + output_kwargs["image_embeds"] = embeds + output_kwargs["vision_model_output"] = model_outputs + else: + output_kwargs[f"logits_per_{modality}"] = logits + output_kwargs[f"{modality}_embeds"] = embeds + output_kwargs[f"{modality}_model_output"] = model_outputs + return output_kwargs + + +@add_start_docstrings( + """ + ImageBind Text Model with a projection layer on top (a linear layer on top of the pooled output). 
+ """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindTextModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindTextConfig + + _no_split_modules = ["ImageBindEncoderLayer"] + + def __init__(self, config: ImageBindTextConfig): + super().__init__(config) + + self.text_model = ImageBindTextTransformer(config) + + self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.text_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.text_model.embeddings.token_embedding + + def set_input_embeddings(self, value): + self.text_model.embeddings.token_embedding = value + + @add_start_docstrings_to_model_forward(IMAGEBIND_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindTextModelOutput, config_class=ImageBindTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindTextModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from transformers import AutoTokenizer, ImageBindTextModelWithProjection + + >>> model = ImageBindTextModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> tokenizer = AutoTokenizer.from_pretrained("facebook/imagebind-huge") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> text_embeds = outputs.text_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = text_outputs[1] + + text_embeds = self.text_projection(pooled_output) + normalized_text_embeds = self.text_postprocessor(text_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (text_embeds, text_outputs[0]) + text_outputs[2:-1] + (normalized_text_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindTextModelOutput( + text_embeds=text_embeds, + last_hidden_state=text_outputs.last_hidden_state, + hidden_states=text_outputs.hidden_states, + attentions=text_outputs.attentions, + normalized_text_embeds=normalized_text_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Vision Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindVisionModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindVisionConfig + main_input_name = "pixel_values" + + def __init__(self, config: ImageBindVisionConfig): + super().__init__(config) + + self.vision_model = ImageBindVisionTransformer(config) + + self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.vision_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindVisionModelOutput, config_class=ImageBindVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindVisionModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> 
import requests + >>> from transformers import AutoProcessor, ImageBindVisionModelWithProjection + + >>> model = ImageBindVisionModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> image_embeds = outputs.image_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + batch_size = pixel_values.shape[0] + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = vision_outputs[1] # pooled_output + + image_embeds = self.visual_projection(pooled_output) + normalized_image_embeds = self.vision_postprocessor(image_embeds) + + num_clips = vision_outputs[-1] + if num_clips is not None: + image_embeds = image_embeds.reshape(batch_size, num_clips, -1) + # Take mean over all clips + image_embeds = image_embeds.mean(dim=1) + + normalized_image_embeds = normalized_image_embeds.reshape(batch_size, num_clips, -1) + normalized_image_embeds = normalized_image_embeds.mean(dim=1) + + if not return_dict: + # Exclude num_clips output + outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:-1] + (normalized_image_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindVisionModelOutput( + image_embeds=image_embeds, + last_hidden_state=vision_outputs.last_hidden_state, + hidden_states=vision_outputs.hidden_states, + attentions=vision_outputs.attentions, + normalized_image_embeds=normalized_image_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Audio Model with a projection layer on top (a linear layer on top of the pooled 
output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindAudioModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindAudioConfig + main_input_name = "input_features" + + def __init__(self, config: ImageBindAudioConfig): + super().__init__(config) + + self.audio_model = ImageBindAudioTransformer(config) + + self.audio_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.audio_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.audio_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_AUDIO_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindAudioModelOutput, config_class=ImageBindAudioConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindAudioModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindAudioModelWithProjection + + >>> model = ImageBindAudioModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> audio_embeds = outputs.audio_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + batch_size = input_features.shape[0] + + audio_outputs = self.audio_model( + input_features=input_features, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = audio_outputs[1] # pooled_output + + audio_embeds = self.audio_projection(pooled_output) + normalized_audio_embeds = self.audio_postprocessor(audio_embeds) + + num_clips = audio_outputs[-1] + if num_clips is not None: + audio_embeds = audio_embeds.reshape(batch_size, num_clips, -1) + # Take mean over all clips + audio_embeds = audio_embeds.mean(dim=1) + + normalized_audio_embeds = normalized_audio_embeds.reshape(batch_size, num_clips, -1) + normalized_audio_embeds = normalized_audio_embeds.mean(dim=1) + + if not return_dict: + # Exclude num_clips output + outputs = (audio_embeds, audio_outputs[0]) + audio_outputs[2:-1] + (normalized_audio_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindAudioModelOutput( + audio_embeds=audio_embeds, + last_hidden_state=audio_outputs.last_hidden_state, + hidden_states=audio_outputs.hidden_states, + attentions=audio_outputs.attentions, + normalized_audio_embeds=normalized_audio_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Depth Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindDepthModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindDepthConfig + main_input_name = "pixel_values" # TODO: rename to something better? 
+ + def __init__(self, config: ImageBindDepthConfig): + super().__init__(config) + + self.depth_model = ImageBindDepthTransformer(config) + + self.depth_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.depth_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.depth_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_DEPTH_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindDepthModelOutput, config_class=ImageBindDepthConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindDepthModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindDepthModelWithProjection + + >>> model = ImageBindDepthModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> depth_embeds = outputs.depth_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + depth_outputs = self.depth_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = depth_outputs[1] # pooled_output + + depth_embeds = self.depth_projection(pooled_output) + normalized_depth_embeds = self.depth_postprocessor(depth_embeds) + + if not return_dict: + 
# Exclude num_clips output + outputs = (depth_embeds, depth_outputs[0]) + depth_outputs[2:-1] + (normalized_depth_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindDepthModelOutput( + depth_embeds=depth_embeds, + last_hidden_state=depth_outputs.last_hidden_state, + hidden_states=depth_outputs.hidden_states, + attentions=depth_outputs.attentions, + normalized_depth_embeds=normalized_depth_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind Thermal Model with a projection layer on top (a linear layer on top of the pooled output). + """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindThermalModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindThermalConfig + main_input_name = "pixel_values" # TODO: rename to something better? + + def __init__(self, config: ImageBindThermalConfig): + super().__init__(config) + + self.thermal_model = ImageBindThermalTransformer(config) + + self.thermal_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.thermal_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.thermal_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_THERMAL_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindThermalModelOutput, config_class=ImageBindThermalConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindThermalModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindThermalModelWithProjection + + >>> model = ImageBindThermalModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>>
processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> thermal_embeds = outputs.thermal_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + thermal_outputs = self.thermal_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = thermal_outputs[1] # pooled_output + + thermal_embeds = self.thermal_projection(pooled_output) + normalized_thermal_embeds = self.thermal_postprocessor(thermal_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (thermal_embeds, thermal_outputs[0]) + thermal_outputs[2:-1] + (normalized_thermal_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindThermalModelOutput( + thermal_embeds=thermal_embeds, + last_hidden_state=thermal_outputs.last_hidden_state, + hidden_states=thermal_outputs.hidden_states, + attentions=thermal_outputs.attentions, + normalized_thermal_embeds=normalized_thermal_embeds, + ) + + +@add_start_docstrings( + """ + ImageBind IMU Model with a projection layer on top (a linear layer on top of the pooled output).
+ """, + IMAGEBIND_START_DOCSTRING, +) +class ImageBindImuModelWithProjection(ImageBindPreTrainedModel): + config_class = ImageBindImuConfig + main_input_name = "input_features" + + def __init__(self, config: ImageBindImuConfig): + super().__init__(config) + + self.imu_model = ImageBindImuTransformer(config) + + self.imu_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.imu_postprocessor = ImageBindPostProcessor(config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.imu_model.embeddings.patch_embedding + + @add_start_docstrings_to_model_forward(IMAGEBIND_IMU_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ImageBindImuModelOutput, config_class=ImageBindImuConfig) + def forward( + self, + input_features: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ImageBindImuModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, ImageBindDepthModelWithProjection + + >>> model = ImageBindDepthModelWithProjection.from_pretrained("facebook/imagebind-huge") + >>> processor = AutoProcessor.from_pretrained("facebook/imagebind-huge") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") # TODO + + >>> outputs = model(**inputs) + >>> depth_embeds = outputs.depth_embeds + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + imu_outputs = self.imu_model( + input_features=input_features, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = imu_outputs[1] # 
pooled_output + + imu_embeds = self.imu_projection(pooled_output) + normalized_imu_embeds = self.imu_postprocessor(imu_embeds) + + if not return_dict: + # Exclude num_clips output + outputs = (imu_embeds, imu_outputs[0]) + imu_outputs[2:-1] + (normalized_imu_embeds,) + return tuple(output for output in outputs if output is not None) + + return ImageBindImuModelOutput( + imu_embeds=imu_embeds, + last_hidden_state=imu_outputs.last_hidden_state, + hidden_states=imu_outputs.hidden_states, + attentions=imu_outputs.attentions, + normalized_imu_embeds=normalized_imu_embeds, + ) diff --git a/src/transformers/models/imagebind/processing_imagebind.py b/src/transformers/models/imagebind/processing_imagebind.py new file mode 100644 index 000000000000..03b3671fe8c7 --- /dev/null +++ b/src/transformers/models/imagebind/processing_imagebind.py @@ -0,0 +1,141 @@ +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Image/Text processor class for ImageBind +""" + +import warnings + +from ...processing_utils import ProcessorMixin +from ...tokenization_utils_base import BatchEncoding + + +# NOTE: currently copied from previous PR (#23284) + + +class ImageBindProcessor(ProcessorMixin): + r""" + Constructs an ImageBind processor which wraps an ImageBind image processor and an ImageBind tokenizer into a single processor. + [`ImageBindProcessor`] offers all the functionalities of [`ImageBindImageProcessor`] and [`ImageBindTokenizerFast`].
See the + [`~ImageBindProcessor.__call__`] and [`~ImageBindProcessor.decode`] for more information. + Args: + image_processor ([`ImageBindImageProcessor`]): + The image processor is a required input. + tokenizer ([`ImageBindTokenizerFast`]): + The tokenizer is a required input. + """ + attributes = ["image_processor", "tokenizer"] + image_processor_class = "ImageBindImageProcessor" + tokenizer_class = ("ImageBindTokenizer", "ImageBindTokenizerFast") + + def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None + if "feature_extractor" in kwargs: + warnings.warn( + "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" + " instead.", + FutureWarning, + ) + feature_extractor = kwargs.pop("feature_extractor") + + image_processor = image_processor if image_processor is not None else feature_extractor + if image_processor is None: + raise ValueError("You need to specify an `image_processor`.") + if tokenizer is None: + raise ValueError("You need to specify a `tokenizer`.") + + super().__init__(image_processor, tokenizer) + + def __call__(self, text=None, images=None, return_tensors=None, **kwargs): + """ + Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text` + and `kwargs` arguments to ImageBindTokenizerFast's [`~ImageBindTokenizerFast.__call__`] if `text` is not `None` to encode + the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to + ImageBindImageProcessor's [`~ImageBindImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring + of the above two methods for more information. + Args: + text (`str`, `List[str]`, `List[List[str]]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string).
If the sequences are provided as list of strings (pretokenized), you must set + `is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`): + The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch + tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a + number of channels, H and W are image height and width. + return_tensors (`str` or [`~utils.TensorType`], *optional*): + If set, will return tensors of a particular framework. Acceptable values are: + - `'tf'`: Return TensorFlow `tf.constant` objects. + - `'pt'`: Return PyTorch `torch.Tensor` objects. + - `'np'`: Return NumPy `np.ndarray` objects. + - `'jax'`: Return JAX `jnp.ndarray` objects. + Returns: + [`BatchEncoding`]: A [`BatchEncoding`] with the following fields: + - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`. + - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when + `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not + `None`). + - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`. + """ + + if text is None and images is None: + raise ValueError("You have to specify either text or images. 
Both cannot be none.") + + if text is not None: + encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs) + + if images is not None: + image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs) + + if text is not None and images is not None: + encoding["pixel_values"] = image_features.pixel_values + return encoding + elif text is not None: + return encoding + else: + return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors) + + def batch_decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ImageBindTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, **kwargs) + + def decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ImageBindTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. + """ + return self.tokenizer.decode(*args, **kwargs) + + @property + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + image_processor_input_names = self.image_processor.model_input_names + return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names)) + + @property + def feature_extractor_class(self): + warnings.warn( + "`feature_extractor_class` is deprecated and will be removed in v5. Use `image_processor_class` instead.", + FutureWarning, + ) + return self.image_processor_class + + @property + def feature_extractor(self): + warnings.warn( + "`feature_extractor` is deprecated and will be removed in v5. 
Use `image_processor` instead.", + FutureWarning, + ) + return self.image_processor \ No newline at end of file diff --git a/src/transformers/models/imagebind/tokenization_imagebind.py b/src/transformers/models/imagebind/tokenization_imagebind.py new file mode 100644 index 000000000000..084406c774c8 --- /dev/null +++ b/src/transformers/models/imagebind/tokenization_imagebind.py @@ -0,0 +1,525 @@ +# Copyright 2023 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization classes for ImageBind.""" + +import json +import os +import unicodedata +from functools import lru_cache +from typing import List, Optional, Tuple + +import regex as re + +from ...tokenization_utils import AddedToken, PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace +from ...utils import logging + + +# NOTE: currently copied from previous PR (#23284) + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = { + "vocab_file": "vocab.json", + "merges_file": "merges.txt", +} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/vocab.json", + }, + "merges_file": { + "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "facebook/imagebind-huge": 77, +} + + +PRETRAINED_INIT_CONFIGURATION = { + "facebook/imagebind-huge": {}, +} + + +@lru_cache() +def bytes_to_unicode(): + """ + Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control + characters the bpe code barfs on. + + The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab + if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for + decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup + tables between utf-8 bytes and unicode strings. + """ + bs = ( + list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) + ) + cs = bs[:] + n = 0 + for b in range(2**8): + if b not in bs: + bs.append(b) + cs.append(2**8 + n) + n += 1 + cs = [chr(n) for n in cs] + return dict(zip(bs, cs)) + + +def get_pairs(word): + """ + Return set of symbol pairs in a word. 
+ + Word is represented as tuple of symbols (symbols being variable-length strings). + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + + +def whitespace_clean(text): + text = re.sub(r"\s+", " ", text) + text = text.strip() + return text + + +# Copied from transformers.models.bert.tokenization_bert.whitespace_tokenize +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer +class BasicTokenizer(object): + """ + Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.). + + Args: + do_lower_case (`bool`, *optional*, defaults to `True`): + Whether or not to lowercase the input when tokenizing. + never_split (`Iterable`, *optional*): + Collection of tokens which will never be split during tokenization. Only has an effect when + `do_basic_tokenize=True` + tokenize_chinese_chars (`bool`, *optional*, defaults to `True`): + Whether or not to tokenize Chinese characters. + + This should likely be deactivated for Japanese (see this + [issue](https://github.com/huggingface/transformers/issues/328)). + strip_accents (`bool`, *optional*): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for `lowercase` (as in the original BERT). + """ + + def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + if never_split is None: + never_split = [] + self.do_lower_case = do_lower_case + self.never_split = set(never_split) + self.tokenize_chinese_chars = tokenize_chinese_chars + self.strip_accents = strip_accents + + def tokenize(self, text, never_split=None): + """ + Basic Tokenization of a piece of text. 
Split on "white spaces" only, for sub-word tokenization, see + WordPieceTokenizer. + + Args: + never_split (`List[str]`, *optional*) + Kept for backward compatibility purposes. Now implemented directly at the base class level (see + [`PreTrainedTokenizer.tokenize`]) List of token not to split. + """ + # union() returns a new set by concatenating the two sets. + never_split = self.never_split.union(set(never_split)) if never_split else self.never_split + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + if self.tokenize_chinese_chars: + text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if token not in never_split: + if self.do_lower_case: + token = token.lower() + if self.strip_accents is not False: + token = self._run_strip_accents(token) + elif self.strip_accents: + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token, never_split)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text, never_split=None): + """Splits punctuation on a piece of text.""" + if never_split is not None and text in never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = 
chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class ImageBindTokenizer(PreTrainedTokenizer): + """ + Construct a ImageBind tokenizer. Based on byte-level Byte-Pair-Encoding. 
+
+    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
+    this superclass for more information regarding those methods.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        merges_file (`str`):
+            Path to the merges file.
+        errors (`str`, *optional*, defaults to `"replace"`):
+            Paradigm to follow when decoding bytes to UTF-8. See
+            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
+        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `<|startoftext|>`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The end of sequence token.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(
+        self,
+        vocab_file,
+        merges_file,
+        errors="replace",
+        unk_token="<|endoftext|>",
+        bos_token="<|startoftext|>",
+        eos_token="<|endoftext|>",
+        pad_token="<|endoftext|>",  # hack to enable padding
+        **kwargs,
+    ):
+        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
+
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        try:
+            import ftfy
+
+            self.fix_text = ftfy.fix_text
+        except ImportError:
+            logger.info("ftfy or spacy is not installed; using custom BasicTokenizer instead of ftfy.")
+            self.nlp = BasicTokenizer(do_lower_case=True)
+            self.fix_text = None
+
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.encoder = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.encoder.items()}
+        self.errors = errors  # how to handle errors in decoding
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+        with open(merges_file, encoding="utf-8") as merges_handle:
+            bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
+        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
+        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
+        self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}
+
+        self.pat = re.compile(
+            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
+            re.IGNORECASE,
+        )
+
+    @property
+    def vocab_size(self):
+        return len(self.encoder)
+
+    def get_vocab(self):
+        return dict(self.encoder, **self.added_tokens_encoder)
+
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
+        and adding special tokens. An ImageBind sequence has the following format:
+
+        - single sequence: `<|startoftext|> X <|endoftext|>`
+
+        Pairs of sequences are not the expected use case, but they will be handled without a separator.
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return bos_token + token_ids_0 + eos_token + return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed. ImageBind does not make use of token type ids, therefore a list of + zeros is returned. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + + Returns: + `List[int]`: List of zeros. 
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return len(bos_token + token_ids_0 + eos_token) * [0] + return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0] + + def bpe(self, token): + if token in self.cache: + return self.cache[token] + word = tuple(token[:-1]) + (token[-1] + "",) + pairs = get_pairs(word) + + if not pairs: + return token + "" + + while True: + bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + except ValueError: + new_word.extend(word[i:]) + break + else: + new_word.extend(word[i:j]) + i = j + + if word[i] == first and i < len(word) - 1 and word[i + 1] == second: + new_word.append(first + second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = " ".join(word) + self.cache[token] = word + return word + + def _tokenize(self, text): + """Tokenize a string.""" + bpe_tokens = [] + if self.fix_text is None: + text = " ".join(self.nlp.tokenize(text)) + else: + text = whitespace_clean(self.fix_text(text)).lower() + + for token in re.findall(self.pat, text): + token = "".join( + self.byte_encoder[b] for b in token.encode("utf-8") + ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case) + bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" ")) + return bpe_tokens + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.decoder.get(index) + + def convert_tokens_to_string(self, 
tokens): + """Converts a sequence of tokens (string) in a single string.""" + text = "".join(tokens) + byte_array = bytearray([self.byte_decoder[c] for c in text]) + text = byte_array.decode("utf-8", errors=self.errors).replace("", " ").strip() + return text + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error("Vocabulary path ({}) should be a directory".format(save_directory)) + return + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + merge_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] + ) + + with open(vocab_file, "w", encoding="utf-8") as f: + f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + writer.write("#version: 0.2\n") + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + "Saving vocabulary to {}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!".format(merge_file) + ) + index = token_index + writer.write(" ".join(bpe_tokens) + "\n") + index += 1 + + return vocab_file, merge_file \ No newline at end of file diff --git a/src/transformers/models/imagebind/tokenization_imagebind_fast.py b/src/transformers/models/imagebind/tokenization_imagebind_fast.py new file mode 100644 index 000000000000..a28a29a7efcf --- /dev/null +++ b/src/transformers/models/imagebind/tokenization_imagebind_fast.py @@ -0,0 +1,169 @@ +# Copyright 2023 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for ImageBind."""
+
+
+from typing import List, Optional, Tuple
+
+from tokenizers import pre_tokenizers
+
+from ...tokenization_utils_fast import PreTrainedTokenizerFast
+from ...utils import logging
+from .tokenization_imagebind import ImageBindTokenizer
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+logger = logging.get_logger(__name__)
+
+VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/vocab.json",
+    },
+    "merges_file": {
+        "facebook/imagebind-huge": "https://huggingface.co/facebook/imagebind-huge/resolve/main/merges.txt",
+    },
+    "tokenizer_file": {
+        "facebook/imagebind-huge": (
+            "https://huggingface.co/facebook/imagebind-huge/resolve/main/tokenizer.json"
+        ),
+    },
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    "facebook/imagebind-huge": 77,
+}
+
+
+class ImageBindTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Construct a "fast" ImageBind tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
+    Byte-Pair-Encoding.
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods.
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        merges_file (`str`):
+            Path to the merges file.
+        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `<|startoftext|>`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+            The end of sequence token.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+    slow_tokenizer_class = ImageBindTokenizer
+
+    def __init__(
+        self,
+        vocab_file=None,
+        merges_file=None,
+        tokenizer_file=None,
+        unk_token="<|endoftext|>",
+        bos_token="<|startoftext|>",
+        eos_token="<|endoftext|>",
+        pad_token="<|endoftext|>",  # hack to enable padding
+        **kwargs,
+    ):
+        super().__init__(
+            vocab_file,
+            merges_file,
+            tokenizer_file=tokenizer_file,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        if not isinstance(self.backend_tokenizer.pre_tokenizer, pre_tokenizers.Sequence):
+            raise ValueError(
+                "The `backend_tokenizer` provided does not match the expected format. The ImageBind tokenizer has been"
+                " heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using"
+                " to be compatible with this version. The easiest way to do so is"
+                ' `ImageBindTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True)`. If you want'
+                " to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of"
+                " transformers."
+            )
+
+        self._wrap_decode_method_backend_tokenizer()
+
+    # Very ugly hack to enable padding to have a correct decoding see https://github.com/huggingface/tokenizers/issues/872
+    def _wrap_decode_method_backend_tokenizer(self):
+        orig_decode_method = self.backend_tokenizer.decode
+
+        def new_decode_method(*args, **kwargs):
+            text = orig_decode_method(*args, **kwargs)
+            text = text.replace(self.backend_tokenizer.model.end_of_word_suffix, " ").strip()
+            return text
+
+        self.backend_tokenizer.decode = new_decode_method
+
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
+        and adding special tokens. An ImageBind sequence has the following format:
+        - single sequence: `<|startoftext|> X <|endoftext|>`
+        Pairs of sequences are not the expected use case, but they will be handled without a separator.
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+        Returns:
+            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+        """
+        bos_token = [self.bos_token_id]
+        eos_token = [self.eos_token_id]
+
+        if token_ids_1 is None:
+            return bos_token + token_ids_0 + eos_token
+        return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token
+
+    def create_token_type_ids_from_sequences(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Create a mask from the two sequences passed. ImageBind does not make use of token type ids, therefore a list of
+        zeros is returned.
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+        Returns:
+            `List[int]`: List of zeros.
+ """ + bos_token = [self.bos_token_id] + eos_token = [self.eos_token_id] + + if token_ids_1 is None: + return len(bos_token + token_ids_0 + eos_token) * [0] + return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0] + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + files = self._tokenizer.model.save(save_directory, name=filename_prefix) + return tuple(files) \ No newline at end of file diff --git a/tests/models/imagebind/__init__.py b/tests/models/imagebind/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/tests/models/imagebind/test_image_processing_imagebind.py b/tests/models/imagebind/test_image_processing_imagebind.py new file mode 100644 index 000000000000..67c11c2d4ffd --- /dev/null +++ b/tests/models/imagebind/test_image_processing_imagebind.py @@ -0,0 +1,305 @@ +# Copyright 2023 HuggingFace Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import unittest + +import numpy as np + +from transformers.testing_utils import require_torch, require_vision +from transformers.utils import is_torch_available, is_vision_available + +from ...test_image_processing_common import ImageProcessingSavingTestMixin + + +# NOTE: currently copied from previous PR (#23284) + + +if is_torch_available(): + import torch + +if is_vision_available(): + from PIL import Image + + from transformers import ImageBindImageProcessor + + +class ImageBindImageProcessingTester(unittest.TestCase): + def __init__( + self, + parent, + batch_size=7, + num_channels=3, + image_size=18, + min_resolution=30, + max_resolution=400, + do_resize=True, + size=None, + do_center_crop=True, + crop_size=None, + do_normalize=True, + image_mean=[0.48145466, 0.4578275, 0.40821073], + image_std=[0.26862954, 0.26130258, 0.27577711], + do_convert_rgb=True, + ): + size = size if size is not None else {"shortest_edge": 20} + crop_size = crop_size if crop_size is not None else {"height": 18, "width": 18} + self.parent = parent + self.batch_size = batch_size + self.num_channels = num_channels + self.image_size = image_size + self.min_resolution = min_resolution + self.max_resolution = max_resolution + self.do_resize = do_resize + self.size = size + self.do_center_crop = do_center_crop + self.crop_size = crop_size + self.do_normalize = do_normalize + self.image_mean = image_mean + self.image_std = image_std + self.do_convert_rgb = do_convert_rgb + + def prepare_image_processor_dict(self): + return { + "do_resize": self.do_resize, + "size": self.size, + "do_center_crop": self.do_center_crop, + "crop_size": self.crop_size, + "do_normalize": self.do_normalize, + "image_mean": self.image_mean, + "image_std": self.image_std, + "do_convert_rgb": self.do_convert_rgb, + } + + def prepare_inputs(self, equal_resolution=False, numpify=False, torchify=False): + """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True, + or a 
list of PyTorch tensors if one specifies torchify=True. + """ + + assert not (numpify and torchify), "You cannot specify both numpy and PyTorch tensors at the same time" + + if equal_resolution: + image_inputs = [] + for i in range(self.batch_size): + image_inputs.append( + np.random.randint( + 255, size=(self.num_channels, self.max_resolution, self.max_resolution), dtype=np.uint8 + ) + ) + else: + image_inputs = [] + for i in range(self.batch_size): + width, height = np.random.choice(np.arange(self.min_resolution, self.max_resolution), 2) + image_inputs.append(np.random.randint(255, size=(self.num_channels, width, height), dtype=np.uint8)) + + if not numpify and not torchify: + # PIL expects the channel dimension as last dimension + image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs] + + if torchify: + image_inputs = [torch.from_numpy(x) for x in image_inputs] + + return image_inputs + + +@require_torch +@require_vision +class ImageBindImageProcessingTest(ImageProcessingSavingTestMixin, unittest.TestCase): + image_processing_class = ImageBindImageProcessor if is_vision_available() else None + + def setUp(self): + self.image_processor_tester = ImageBindImageProcessingTester(self) + + @property + def image_processor_dict(self): + return self.image_processor_tester.prepare_image_processor_dict() + + def test_image_processor_properties(self): + image_processing = self.image_processing_class(**self.image_processor_dict) + self.assertTrue(hasattr(image_processing, "do_resize")) + self.assertTrue(hasattr(image_processing, "size")) + self.assertTrue(hasattr(image_processing, "do_center_crop")) + self.assertTrue(hasattr(image_processing, "center_crop")) + self.assertTrue(hasattr(image_processing, "do_normalize")) + self.assertTrue(hasattr(image_processing, "image_mean")) + self.assertTrue(hasattr(image_processing, "image_std")) + self.assertTrue(hasattr(image_processing, "do_convert_rgb")) + + def test_image_processor_from_dict_with_kwargs(self): 
+ image_processor = self.image_processing_class.from_dict(self.image_processor_dict) + self.assertEqual(image_processor.size, {"shortest_edge": 20}) + self.assertEqual(image_processor.crop_size, {"height": 18, "width": 18}) + + image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42, crop_size=84) + self.assertEqual(image_processor.size, {"shortest_edge": 42}) + self.assertEqual(image_processor.crop_size, {"height": 84, "width": 84}) + + def test_batch_feature(self): + pass + + def test_call_pil(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PIL images + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False) + for image in image_inputs: + self.assertIsInstance(image, Image.Image) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + def test_call_numpy(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random numpy tensors + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False, numpify=True) + for image in image_inputs: + self.assertIsInstance(image, np.ndarray) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + 
encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + def test_call_pytorch(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PyTorch tensors + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False, torchify=True) + for image in image_inputs: + self.assertIsInstance(image, torch.Tensor) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.image_processor_tester.num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + +@require_torch +@require_vision +class ImageBindImageProcessingTestFourChannels(ImageProcessingSavingTestMixin, unittest.TestCase): + image_processing_class = ImageBindImageProcessor if is_vision_available() else None + + def setUp(self): + self.image_processor_tester = ImageBindImageProcessingTester(self, num_channels=4) + self.expected_encoded_image_num_channels = 3 + + @property + def image_processor_dict(self): + return 
self.image_processor_tester.prepare_image_processor_dict() + + def test_image_processor_properties(self): + image_processing = self.image_processing_class(**self.image_processor_dict) + self.assertTrue(hasattr(image_processing, "do_resize")) + self.assertTrue(hasattr(image_processing, "size")) + self.assertTrue(hasattr(image_processing, "do_center_crop")) + self.assertTrue(hasattr(image_processing, "center_crop")) + self.assertTrue(hasattr(image_processing, "do_normalize")) + self.assertTrue(hasattr(image_processing, "image_mean")) + self.assertTrue(hasattr(image_processing, "image_std")) + self.assertTrue(hasattr(image_processing, "do_convert_rgb")) + + def test_batch_feature(self): + pass + + def test_call_pil_four_channels(self): + # Initialize image_processing + image_processing = self.image_processing_class(**self.image_processor_dict) + # create random PIL images + image_inputs = self.image_processor_tester.prepare_inputs(equal_resolution=False) + for image in image_inputs: + self.assertIsInstance(image, Image.Image) + + # Test not batched input + encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + 1, + self.expected_encoded_image_num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) + + # Test batched + encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values + self.assertEqual( + encoded_images.shape, + ( + self.image_processor_tester.batch_size, + self.expected_encoded_image_num_channels, + self.image_processor_tester.crop_size["height"], + self.image_processor_tester.crop_size["width"], + ), + ) \ No newline at end of file diff --git a/tests/models/imagebind/test_modeling_imagebind.py b/tests/models/imagebind/test_modeling_imagebind.py new file mode 100644 index 000000000000..e64276216f9a --- /dev/null +++ b/tests/models/imagebind/test_modeling_imagebind.py @@ -0,0 +1,1546 @@ +# 
Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the PyTorch ImageBind model. """ + + +import inspect +import os +import tempfile +import unittest + +import numpy as np +import requests + +import transformers +from transformers import ( + ImageBindConfig, + ImageBindAudioConfig, + ImageBindDepthConfig, + ImageBindImuConfig, + ImageBindTextConfig, + ImageBindThermalConfig, + ImageBindVisionConfig, +) +from transformers.testing_utils import ( + is_flax_available, + is_pt_flax_cross_test, + require_torch, + require_vision, + slow, + torch_device, +) +from transformers.utils import is_torch_available, is_vision_available + +from ...test_configuration_common import ConfigTester +from ...test_modeling_common import ( + ModelTesterMixin, + _config_zero_init, + floats_tensor, + ids_tensor, + random_attention_mask, +) +from ...test_pipeline_mixin import PipelineTesterMixin + + +if is_torch_available(): + import torch + from torch import nn + + from transformers import ( + ImageBindAudioModel, + ImageBindAudioModelWithProjection, + ImageBindDepthModel, + ImageBindDepthModelWithProjection, + ImageBindImuModel, + ImageBindImuModelWithProjection, + ImageBindModel, + ImageBindPreTrainedModel, + ImageBindTextModel, + ImageBindTextModelWithProjection, + ImageBindThermalModel, + ImageBindThermalModelWithProjection, + ImageBindVisionModel, + ImageBindVisionModelWithProjection, + ) + from 
transformers.models.imagebind.modeling_imagebind import IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST + + +if is_vision_available(): + from PIL import Image + + from transformers import ImageBindProcessor + + +if is_flax_available(): + import jax.numpy as jnp + + from transformers.modeling_flax_pytorch_utils import ( + convert_pytorch_state_dict_to_flax, + load_flax_weights_in_pytorch_model, + ) + + +class ImageBindTextModelTester: + def __init__( + self, + parent, + batch_size=12, + seq_length=7, + is_training=True, + use_input_mask=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + attention_dropout=0.0, + max_position_embeddings=512, + layer_norm_eps=1e-6, + initializer_range=0.02, + logit_scale_init_value=14.2857, + learnable_logit_scale=True, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = random_attention_mask([self.batch_size, self.seq_length]) + + if input_mask is not None: + batch_size, seq_length = input_mask.shape + rnd_start_indices = 
np.random.randint(1, seq_length - 1, size=(batch_size,)) + for batch_idx, start_index in enumerate(rnd_start_indices): + input_mask[batch_idx, :start_index] = 1 + input_mask[batch_idx, start_index:] = 0 + + config = self.get_config() + + return config, input_ids, input_mask + + def get_config(self): + return ImageBindTextConfig( + vocab_size=self.vocab_size, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + max_position_embeddings=self.max_position_embeddings, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, input_ids, input_mask): + model = ImageBindTextModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(input_ids, attention_mask=input_mask) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, input_ids, input_mask): + model = ImageBindTextModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(input_ids, attention_mask=input_mask) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.text_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, input_ids, input_mask = config_and_inputs 
+ inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask} + return config, inputs_dict + + +@require_torch +class ImageBindTextModelTest(ModelTesterMixin, unittest.TestCase): + all_model_classes = (ImageBindTextModel, ImageBindTextModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindTextModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindTextConfig, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + @unittest.skip(reason="ImageBindTextModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindTextModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindTextModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindTextModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "text_projection")) + + +class ImageBindVisionModelTester: + def 
__init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=(2, 2, 2), + stride=(2, 2, 2), + num_channels=3, + num_frames=2, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + attention_dropout=0.0, + initializer_range=0.02, + logit_scale_init_value=None, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.num_frames = num_frames + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + # Resolve spatiotemporal patch size + patches_along_time_dim = num_frames // patch_size[0] + patches_along_height_dim = ((image_size - patch_size[1]) // stride[1]) + 1 + patches_along_width_dim = ((image_size - patch_size[2]) // stride[2]) + 1 + num_patches = patches_along_time_dim * patches_along_height_dim * patches_along_width_dim + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.num_frames, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindVisionConfig( + image_size=self.image_size, + patch_size=self.patch_size, + 
stride=self.stride, + num_channels=self.num_channels, + num_frames=self.num_frames, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindVisionModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token); the spatiotemporal + # patch count is already resolved in __init__, so reuse self.seq_length here + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindVisionModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + 
def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindVisionModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. + """ + + all_model_classes = (ImageBindVisionModel, ImageBindVisionModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindVisionModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindVisionConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="IMAGEBIND does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + 
self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindVisionModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindVisionModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindVisionModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindVisionModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "visual_projection")) + + +class ImageBindAudioModelTester: + def __init__( + self, + parent, + batch_size=12, + patch_size=16, + stride=10, + num_channels=1, + is_training=True, + num_mel_bins=128, + target_len=204, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.1, + initializer_range=0.02, + logit_scale_init_value=20.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.num_mel_bins = num_mel_bins + self.target_len = target_len + self.hidden_size = hidden_size + self.projection_dim = projection_dim + 
self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + patches_along_height_dim = ((num_mel_bins - patch_size) // stride) + 1 + patches_along_width_dim = ((target_len - patch_size) // stride) + 1 + num_patches = patches_along_height_dim * patches_along_width_dim + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + # audio inputs are log-mel spectrograms of shape (batch, channels, num_mel_bins, target_len); + # this tester defines no image_size, so build the input from the spectrogram dimensions + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.num_mel_bins, self.target_len]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindAudioConfig( + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + num_mel_bins=self.num_mel_bins, + target_len=self.target_len, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindAudioModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token), computed from the + # spectrogram dimensions in __init__, so reuse self.seq_length here + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindAudioModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindAudioModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length.
+ """ + + all_model_classes = (ImageBindAudioModel, ImageBindAudioModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindAudioModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindAudioConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindAudioModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + 
pass + + @unittest.skip(reason="ImageBindAudioModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindAudioModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindAudioModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "audio_projection")) + + +class ImageBindDepthModelTester: + def __init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=2, + stride=2, + num_channels=1, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = (((image_size - patch_size) // stride) + 1) ** 2 + # in ViT, 
the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindDepthConfig( + image_size=self.image_size, + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindDepthModel(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindDepthModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, 
self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindDepthModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. + """ + + all_model_classes = (ImageBindDepthModel, ImageBindDepthModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindDepthModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindDepthConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = 
inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindDepthModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindDepthModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindDepthModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindDepthModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "depth_projection")) + + +class ImageBindThermalModelTester: + def __init__( + self, + parent, + batch_size=12, + image_size=30, + patch_size=2, + stride=2, + num_channels=1, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.0, + initializer_range=0.02, + logit_scale_init_value=10.0, + learnable_logit_scale=False, + 
scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.stride = stride + self.num_channels = num_channels + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = (((image_size - patch_size) // stride) + 1) ** 2 + # in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token) + self.seq_length = num_patches + 1 + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + config = self.get_config() + + return config, pixel_values + + def get_config(self): + return ImageBindThermalConfig( + image_size=self.image_size, + patch_size=self.patch_size, + stride=self.stride, + num_channels=self.num_channels, + hidden_size=self.hidden_size, + projection_dim=self.projection_dim, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + layer_norm_eps=self.layer_norm_eps, + add_kv_bias=self.add_kv_bias, + initializer_range=self.initializer_range, + logit_scale_init_value=self.logit_scale_init_value, + learnable_logit_scale=self.learnable_logit_scale, + ) + + def create_and_check_model(self, config, pixel_values): + model = ImageBindThermalModel(config=config) + model.to(torch_device) + 
model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + def create_and_check_model_with_projection(self, config, pixel_values): + model = ImageBindThermalModelWithProjection(config=config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + result = model(pixel_values) + # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token) + image_size = (self.image_size, self.image_size) + patch_size = (self.patch_size, self.patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size)) + self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, pixel_values = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class ImageBindThermalModelTest(ModelTesterMixin, unittest.TestCase): + """ + Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds, + attention_mask and seq_length. 
+ """ + + all_model_classes = (ImageBindThermalModel, ImageBindThermalModelWithProjection) if is_torch_available() else () + fx_compatible = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = ImageBindThermalModelTester(self) + self.config_tester = ConfigTester(self, config_class=ImageBindThermalConfig, has_text_modality=False, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + @unittest.skip(reason="ImageBind does not use inputs_embeds") + def test_inputs_embeds(self): + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindThermalModel has no base class and is not available in MODEL_MAPPING") + def 
test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindThermalModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindThermalModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindThermalModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "thermal_projection")) + + +class ImageBindImuModelTester: + def __init__( + self, + parent, + batch_size=12, + input_shape=(6, 30), + kernel_size=2, + is_training=True, + hidden_size=32, + projection_dim=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + dropout=0.0, + layer_norm_eps=1e-6, + add_kv_bias=True, + attention_dropout=0.0, + drop_path_rate=0.7, + initializer_range=0.02, + logit_scale_init_value=5.0, + learnable_logit_scale=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.input_shape = input_shape + self.kernel_size = kernel_size + self.is_training = is_training + self.hidden_size = hidden_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path_rate = drop_path_rate + self.layer_norm_eps = layer_norm_eps + self.add_kv_bias = add_kv_bias + self.initializer_range = initializer_range + self.logit_scale_init_value = logit_scale_init_value + self.learnable_logit_scale = learnable_logit_scale + self.scope = scope + + num_patches = input_shape[1] // kernel_size + # The seq length is the number of patches + 1 (for the 
[CLS] token)
+        self.seq_length = num_patches + 1
+
+    def prepare_config_and_inputs(self):
+        # IMU inputs have shape (batch_size, *input_shape); this tester has no image_size or
+        # num_channels attributes, so the tensor is built from input_shape directly.
+        pixel_values = floats_tensor([self.batch_size, *self.input_shape])
+        config = self.get_config()
+
+        return config, pixel_values
+
+    def get_config(self):
+        return ImageBindImuConfig(
+            input_shape=self.input_shape,
+            kernel_size=self.kernel_size,
+            hidden_size=self.hidden_size,
+            projection_dim=self.projection_dim,
+            num_hidden_layers=self.num_hidden_layers,
+            num_attention_heads=self.num_attention_heads,
+            intermediate_size=self.intermediate_size,
+            dropout=self.dropout,
+            attention_dropout=self.attention_dropout,
+            layer_norm_eps=self.layer_norm_eps,
+            add_kv_bias=self.add_kv_bias,
+            initializer_range=self.initializer_range,
+            logit_scale_init_value=self.logit_scale_init_value,
+            learnable_logit_scale=self.learnable_logit_scale,
+        )
+
+    def create_and_check_model(self, config, pixel_values):
+        model = ImageBindImuModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        with torch.no_grad():
+            result = model(pixel_values)
+        # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
+        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+        self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
+
+    def create_and_check_model_with_projection(self, config, pixel_values):
+        model = ImageBindImuModelWithProjection(config=config)
+        model.to(torch_device)
+        model.eval()
+        with torch.no_grad():
+            result = model(pixel_values)
+        # expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
+        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
+        self.parent.assertEqual(result.image_embeds.shape, (self.batch_size, self.projection_dim))
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        config, pixel_values = config_and_inputs
+        inputs_dict = {"pixel_values": pixel_values}
+        return config, inputs_dict
+
+
+@require_torch
+class ImageBindImuModelTest(ModelTesterMixin, unittest.TestCase):
+    """
+    Here we also overwrite some of the tests of test_modeling_common.py, as IMAGEBIND does not use input_ids, inputs_embeds,
+    attention_mask and seq_length.
+    """
+
+    all_model_classes = (ImageBindImuModel, ImageBindImuModelWithProjection) if is_torch_available() else ()
+    fx_compatible = False
+    test_pruning = False
+    test_resize_embeddings = False
+    test_head_masking = False
+
+    def setUp(self):
+        self.model_tester = ImageBindImuModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=ImageBindImuConfig, has_text_modality=False, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    @unittest.skip(reason="ImageBind does not use inputs_embeds")
+    def test_inputs_embeds(self):
+        pass
+
+    def test_model_common_attributes(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            self.assertIsInstance(model.get_input_embeddings(), nn.Module)
+            x = model.get_output_embeddings()
+            self.assertTrue(x is None or isinstance(x, nn.Linear))
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names =
[*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_model_with_projection(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_with_projection(*config_and_inputs) + + def test_training(self): + pass + + def test_training_gradient_checkpointing(self): + pass + + @unittest.skip(reason="ImageBindImuModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_from_base(self): + pass + + @unittest.skip(reason="ImageBindImuModel has no base class and is not available in MODEL_MAPPING") + def test_save_load_fast_init_to_base(self): + pass + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindImuModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + @slow + def test_model_with_projection_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindImuModelWithProjection.from_pretrained(model_name) + self.assertIsNotNone(model) + self.assertTrue(hasattr(model, "imu_projection")) + + +class ImageBindModelTester: + def __init__(self, parent, text_kwargs=None, vision_kwargs=None, is_training=True): + if text_kwargs is None: + text_kwargs = {} + if vision_kwargs is None: + vision_kwargs = {} + + self.parent = parent + self.text_model_tester = ImageBindTextModelTester(parent, **text_kwargs) + self.vision_model_tester = ImageBindVisionModelTester(parent, **vision_kwargs) + self.is_training = is_training + + def prepare_config_and_inputs(self): + text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs() + vision_config, pixel_values = 
self.vision_model_tester.prepare_config_and_inputs() + + config = self.get_config() + + return config, input_ids, attention_mask, pixel_values + + def get_config(self): + return ImageBindConfig.from_text_vision_configs( + self.text_model_tester.get_config(), self.vision_model_tester.get_config(), projection_dim=64 + ) + + def create_and_check_model(self, config, input_ids, attention_mask, pixel_values): + model = ImageBindModel(config).to(torch_device).eval() + with torch.no_grad(): + result = model(input_ids, pixel_values, attention_mask) + self.parent.assertEqual( + result.logits_per_image.shape, (self.vision_model_tester.batch_size, self.text_model_tester.batch_size) + ) + self.parent.assertEqual( + result.logits_per_text.shape, (self.text_model_tester.batch_size, self.vision_model_tester.batch_size) + ) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + config, input_ids, attention_mask, pixel_values = config_and_inputs + inputs_dict = { + "input_ids": input_ids, + "attention_mask": attention_mask, + "pixel_values": pixel_values, + "return_loss": True, + } + return config, inputs_dict + + +@require_torch +class ImageBindModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase): + all_model_classes = (ImageBindModel,) if is_torch_available() else () + fx_compatible = False + test_head_masking = False + test_pruning = False + test_resize_embeddings = False + test_attention_outputs = False + + def setUp(self): + self.model_tester = ImageBindModelTester(self) + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + @unittest.skip(reason="Hidden_states is tested in individual model tests") + def test_hidden_states_output(self): + pass + + @unittest.skip(reason="Inputs_embeds is tested in individual model tests") + def test_inputs_embeds(self): + pass + + @unittest.skip(reason="Retain_grad is 
tested in individual model tests")
+    def test_retain_grad_hidden_states_attentions(self):
+        pass
+
+    @unittest.skip(reason="ImageBindModel does not have input/output embeddings")
+    def test_model_common_attributes(self):
+        pass
+
+    # override as the `logit_scale` parameter initialization is different for IMAGEBIND
+    def test_initialization(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        configs_no_init = _config_zero_init(config)
+        for model_class in self.all_model_classes:
+            model = model_class(config=configs_no_init)
+            for name, param in model.named_parameters():
+                if param.requires_grad:
+                    # check if `logit_scale` is initialized as per the original implementation
+                    if name == "logit_scale":
+                        self.assertAlmostEqual(
+                            param.data.item(),
+                            np.log(1 / 0.07),
+                            delta=1e-3,
+                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
+                        )
+                    else:
+                        self.assertIn(
+                            ((param.data.mean() * 1e9).round() / 1e9).item(),
+                            [0.0, 1.0],
+                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
+                        )
+
+    def _create_and_check_torchscript(self, config, inputs_dict):
+        if not self.test_torchscript:
+            return
+
+        configs_no_init = _config_zero_init(config)  # To be sure we have no Nan
+        configs_no_init.torchscript = True
+        configs_no_init.return_dict = False
+        for model_class in self.all_model_classes:
+            model = model_class(config=configs_no_init)
+            model.to(torch_device)
+            model.eval()
+
+            try:
+                input_ids = inputs_dict["input_ids"]
+                pixel_values = inputs_dict["pixel_values"]  # IMAGEBIND needs pixel_values
+                traced_model = torch.jit.trace(model, (input_ids, pixel_values))
+            except RuntimeError:
+                self.fail("Couldn't trace module.")
+
+            with tempfile.TemporaryDirectory() as tmp_dir_name:
+                pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
+
+                try:
+                    torch.jit.save(traced_model, pt_file_name)
+                except Exception:
+                    self.fail("Couldn't save module.")
+
+                try:
+                    loaded_model =
torch.jit.load(pt_file_name) + except Exception: + self.fail("Couldn't load module.") + + model.to(torch_device) + model.eval() + + loaded_model.to(torch_device) + loaded_model.eval() + + model_state_dict = model.state_dict() + loaded_model_state_dict = loaded_model.state_dict() + + self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys())) + + models_equal = True + for layer_name, p1 in model_state_dict.items(): + p2 = loaded_model_state_dict[layer_name] + if p1.data.ne(p2.data).sum() > 0: + models_equal = False + + self.assertTrue(models_equal) + + def test_load_vision_text_config(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + # Save ImageBindConfig and check if we can load ImageBindVisionConfig from it + with tempfile.TemporaryDirectory() as tmp_dir_name: + config.save_pretrained(tmp_dir_name) + vision_config = ImageBindVisionConfig.from_pretrained(tmp_dir_name) + self.assertDictEqual(config.vision_config.to_dict(), vision_config.to_dict()) + + # Save ImageBindConfig and check if we can load ImageBindTextConfig from it + with tempfile.TemporaryDirectory() as tmp_dir_name: + config.save_pretrained(tmp_dir_name) + text_config = ImageBindTextConfig.from_pretrained(tmp_dir_name) + self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict()) + + # overwrite from common since FlaxImageBindModel returns nested output + # which is not supported in the common test + @is_pt_flax_cross_test + def test_equivalence_pt_to_flax(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + with self.subTest(model_class.__name__): + # load PyTorch class + pt_model = model_class(config).eval() + # Flax models don't use the `use_cache` option and cache is not returned as a default. + # So we disable `use_cache` here for PyTorch model. 
+ pt_model.config.use_cache = False + + fx_model_class_name = "Flax" + model_class.__name__ + + if not hasattr(transformers, fx_model_class_name): + return + + fx_model_class = getattr(transformers, fx_model_class_name) + + # load Flax class + fx_model = fx_model_class(config, dtype=jnp.float32) + # make sure only flax inputs are forward that actually exist in function args + fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys() + + # prepare inputs + pt_inputs = self._prepare_for_class(inputs_dict, model_class) + + # remove function args that don't exist in Flax + pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys} + + fx_state = convert_pytorch_state_dict_to_flax(pt_model.state_dict(), fx_model) + fx_model.params = fx_state + + with torch.no_grad(): + pt_outputs = pt_model(**pt_inputs).to_tuple() + + # convert inputs to Flax + fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)} + fx_outputs = fx_model(**fx_inputs).to_tuple() + self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch") + for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + with tempfile.TemporaryDirectory() as tmpdirname: + pt_model.save_pretrained(tmpdirname) + fx_model_loaded = fx_model_class.from_pretrained(tmpdirname, from_pt=True) + + fx_outputs_loaded = fx_model_loaded(**fx_inputs).to_tuple() + self.assertEqual( + len(fx_outputs_loaded), len(pt_outputs), "Output lengths differ between Flax and PyTorch" + ) + for fx_output_loaded, pt_output in zip(fx_outputs_loaded[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output_loaded, pt_output.numpy(), 4e-2) + + # overwrite from common since FlaxImageBindModel returns nested output + # which is not supported in the common test + @is_pt_flax_cross_test + def test_equivalence_flax_to_pt(self): + config, inputs_dict = 
self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + with self.subTest(model_class.__name__): + # load corresponding PyTorch class + pt_model = model_class(config).eval() + + # So we disable `use_cache` here for PyTorch model. + pt_model.config.use_cache = False + + fx_model_class_name = "Flax" + model_class.__name__ + + if not hasattr(transformers, fx_model_class_name): + # no flax model exists for this class + return + + fx_model_class = getattr(transformers, fx_model_class_name) + + # load Flax class + fx_model = fx_model_class(config, dtype=jnp.float32) + # make sure only flax inputs are forward that actually exist in function args + fx_input_keys = inspect.signature(fx_model.__call__).parameters.keys() + + pt_model = load_flax_weights_in_pytorch_model(pt_model, fx_model.params) + + # make sure weights are tied in PyTorch + pt_model.tie_weights() + + # prepare inputs + pt_inputs = self._prepare_for_class(inputs_dict, model_class) + + # remove function args that don't exist in Flax + pt_inputs = {k: v for k, v in pt_inputs.items() if k in fx_input_keys} + + with torch.no_grad(): + pt_outputs = pt_model(**pt_inputs).to_tuple() + + fx_inputs = {k: np.array(v) for k, v in pt_inputs.items() if torch.is_tensor(v)} + + fx_outputs = fx_model(**fx_inputs).to_tuple() + self.assertEqual(len(fx_outputs), len(pt_outputs), "Output lengths differ between Flax and PyTorch") + + for fx_output, pt_output in zip(fx_outputs[:4], pt_outputs[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + with tempfile.TemporaryDirectory() as tmpdirname: + fx_model.save_pretrained(tmpdirname) + pt_model_loaded = model_class.from_pretrained(tmpdirname, from_flax=True) + + with torch.no_grad(): + pt_outputs_loaded = pt_model_loaded(**pt_inputs).to_tuple() + + self.assertEqual( + len(fx_outputs), len(pt_outputs_loaded), "Output lengths differ between Flax and PyTorch" + ) + for fx_output, pt_output in zip(fx_outputs[:4], 
pt_outputs_loaded[:4]): + self.assert_almost_equals(fx_output, pt_output.numpy(), 4e-2) + + @slow + def test_model_from_pretrained(self): + for model_name in IMAGEBIND_PRETRAINED_MODEL_ARCHIVE_LIST[:1]: + model = ImageBindModel.from_pretrained(model_name) + self.assertIsNotNone(model) + + +# We will verify our results on an image of cute cats +def prepare_img(): + url = "http://images.cocodataset.org/val2017/000000039769.jpg" + im = Image.open(requests.get(url, stream=True).raw) + return im + + +@require_vision +@require_torch +class ImageBindModelIntegrationTest(unittest.TestCase): + @slow + def test_inference(self): + model_name = "facebook/imagebind-huge" + model = ImageBindModel.from_pretrained(model_name).to(torch_device) + processor = ImageBindProcessor.from_pretrained(model_name) + + image = prepare_img() + inputs = processor( + text=["a photo of a cat", "a photo of a dog"], images=image, padding=True, return_tensors="pt" + ).to(torch_device) + + # forward pass + with torch.no_grad(): + outputs = model(**inputs) + + # verify the logits + self.assertEqual( + outputs.logits_per_image.shape, + torch.Size((inputs.pixel_values.shape[0], inputs.input_ids.shape[0])), + ) + self.assertEqual( + outputs.logits_per_text.shape, + torch.Size((inputs.input_ids.shape[0], inputs.pixel_values.shape[0])), + ) + + expected_logits = torch.tensor([[24.5701, 19.3049]], device=torch_device) + + self.assertTrue(torch.allclose(outputs.logits_per_image, expected_logits, atol=1e-3)) \ No newline at end of file diff --git a/tests/models/imagebind/test_processor_imagebind.py b/tests/models/imagebind/test_processor_imagebind.py new file mode 100644 index 000000000000..ff27287c4e79 --- /dev/null +++ b/tests/models/imagebind/test_processor_imagebind.py @@ -0,0 +1,205 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import shutil
+import tempfile
+import unittest
+
+import numpy as np
+import pytest
+
+from transformers import ImageBindTokenizer, ImageBindTokenizerFast
+from transformers.models.imagebind.tokenization_imagebind import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_vision
+from transformers.utils import IMAGE_PROCESSOR_NAME, is_vision_available
+
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import ImageBindImageProcessor, ImageBindProcessor
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+@require_vision
+class ImageBindProcessorTest(unittest.TestCase):
+    def setUp(self):
+        self.tmpdirname = tempfile.mkdtemp()
+
+        # fmt: off
+        vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "lo", "l</w>", "w</w>", "r</w>", "t</w>", "low</w>", "er</w>", "lowest</w>", "newer</w>", "wider", "<unk>", "<|startoftext|>", "<|endoftext|>"]
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        merges = ["#version: 0.2", "l o", "lo w</w>", "e r</w>"]
+        self.special_tokens_map = {"unk_token": "<unk>"}
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+        with open(self.merges_file, "w", encoding="utf-8") as fp:
+            fp.write("\n".join(merges))
+
+        image_processor_map = {
+            "do_resize": True,
+            "size": 20,
+            "do_center_crop": True,
+            "crop_size": 18,
+            "do_normalize": True,
+            "image_mean": [0.48145466,
0.4578275, 0.40821073], + "image_std": [0.26862954, 0.26130258, 0.27577711], + } + self.image_processor_file = os.path.join(self.tmpdirname, IMAGE_PROCESSOR_NAME) + with open(self.image_processor_file, "w", encoding="utf-8") as fp: + json.dump(image_processor_map, fp) + + def get_tokenizer(self, **kwargs): + return ImageBindTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_rust_tokenizer(self, **kwargs): + return ImageBindTokenizerFast.from_pretrained(self.tmpdirname, **kwargs) + + def get_image_processor(self, **kwargs): + return ImageBindImageProcessor.from_pretrained(self.tmpdirname, **kwargs) + + def tearDown(self): + shutil.rmtree(self.tmpdirname) + + def prepare_image_inputs(self): + """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True, + or a list of PyTorch tensors if one specifies torchify=True. + """ + + image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)] + + image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs] + + return image_inputs + + def test_save_load_pretrained_default(self): + tokenizer_slow = self.get_tokenizer() + tokenizer_fast = self.get_rust_tokenizer() + image_processor = self.get_image_processor() + + processor_slow = ImageBindProcessor(tokenizer=tokenizer_slow, image_processor=image_processor) + processor_slow.save_pretrained(self.tmpdirname) + processor_slow = ImageBindProcessor.from_pretrained(self.tmpdirname, use_fast=False) + + processor_fast = ImageBindProcessor(tokenizer=tokenizer_fast, image_processor=image_processor) + processor_fast.save_pretrained(self.tmpdirname) + processor_fast = ImageBindProcessor.from_pretrained(self.tmpdirname) + + self.assertEqual(processor_slow.tokenizer.get_vocab(), tokenizer_slow.get_vocab()) + self.assertEqual(processor_fast.tokenizer.get_vocab(), tokenizer_fast.get_vocab()) + self.assertEqual(tokenizer_slow.get_vocab(), tokenizer_fast.get_vocab()) + 
self.assertIsInstance(processor_slow.tokenizer, ImageBindTokenizer) + self.assertIsInstance(processor_fast.tokenizer, ImageBindTokenizerFast) + + self.assertEqual(processor_slow.image_processor.to_json_string(), image_processor.to_json_string()) + self.assertEqual(processor_fast.image_processor.to_json_string(), image_processor.to_json_string()) + self.assertIsInstance(processor_slow.image_processor, ImageBindImageProcessor) + self.assertIsInstance(processor_fast.image_processor, ImageBindImageProcessor) + + def test_save_load_pretrained_additional_features(self): + processor = ImageBindProcessor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor()) + processor.save_pretrained(self.tmpdirname) + + tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)") + image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0) + + processor = ImageBindProcessor.from_pretrained( + self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0 + ) + + self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab()) + self.assertIsInstance(processor.tokenizer, ImageBindTokenizerFast) + + self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string()) + self.assertIsInstance(processor.image_processor, ImageBindImageProcessor) + + def test_image_processor(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + image_input = self.prepare_image_inputs() + + input_image_proc = image_processor(image_input, return_tensors="np") + input_processor = processor(images=image_input, return_tensors="np") + + for key in input_image_proc.keys(): + self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2) + + def test_tokenizer(self): + image_processor = 
self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + + encoded_processor = processor(text=input_str) + + encoded_tok = tokenizer(input_str) + + for key in encoded_tok.keys(): + self.assertListEqual(encoded_tok[key], encoded_processor[key]) + + def test_processor(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + image_input = self.prepare_image_inputs() + + inputs = processor(text=input_str, images=image_input) + + self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "pixel_values"]) + + # test if it raises when no input is passed + with pytest.raises(ValueError): + processor() + + def test_tokenizer_decode(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]] + + decoded_processor = processor.batch_decode(predicted_ids) + decoded_tok = tokenizer.batch_decode(predicted_ids) + + self.assertListEqual(decoded_tok, decoded_processor) + + def test_model_input_names(self): + image_processor = self.get_image_processor() + tokenizer = self.get_tokenizer() + + processor = ImageBindProcessor(tokenizer=tokenizer, image_processor=image_processor) + + input_str = "lower newer" + image_input = self.prepare_image_inputs() + + inputs = processor(text=input_str, images=image_input) + + self.assertListEqual(list(inputs.keys()), processor.model_input_names) \ No newline at end of file diff --git a/tests/models/imagebind/test_tokenization_imagebind.py b/tests/models/imagebind/test_tokenization_imagebind.py new file mode 100644 index 000000000000..1f465dc547a1 --- /dev/null +++ 
b/tests/models/imagebind/test_tokenization_imagebind.py
@@ -0,0 +1,187 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import json
+import os
+import unittest
+
+from transformers import ImageBindTokenizer, ImageBindTokenizerFast
+from transformers.models.imagebind.tokenization_imagebind import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_ftfy, require_tokenizers
+
+from ...test_tokenization_common import TokenizerTesterMixin
+
+
+# NOTE: currently copied from previous PR (#23284)
+
+
+@require_tokenizers
+class ImageBindTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
+    tokenizer_class = ImageBindTokenizer
+    rust_tokenizer_class = ImageBindTokenizerFast
+    test_rust_tokenizer = True
+    from_pretrained_kwargs = {}
+    test_seq2seq = False
+
+    def setUp(self):
+        super().setUp()
+
+        # fmt: off
+        vocab = ["l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "lo", "l</w>", "w</w>", "r</w>", "t</w>", "low</w>", "er</w>", "lowest</w>", "newer</w>", "wider", "<unk>", "<|startoftext|>", "<|endoftext|>"]
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        merges = ["#version: 0.2", "l o", "lo w</w>", "e r</w>"]
+        self.special_tokens_map = {"unk_token": "<unk>"}
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens)
+ "\n") + with open(self.merges_file, "w", encoding="utf-8") as fp: + fp.write("\n".join(merges)) + + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return ImageBindTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_rust_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + return ImageBindTokenizerFast.from_pretrained(self.tmpdirname, **kwargs) + + def get_input_output_texts(self, tokenizer): + input_text = "lower newer" + output_text = "lower newer" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = ImageBindTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map) + text = "lower newer" + bpe_tokens = ["lo", "w", "er", "n", "e", "w", "er"] + tokens = tokenizer.tokenize(text) + self.assertListEqual(tokens, bpe_tokens) + + input_tokens = tokens + [tokenizer.unk_token] + input_bpe_tokens = [10, 2, 16, 9, 3, 2, 16, 20] + self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens) + + @require_ftfy + def test_check_encoding_slow_fast(self): + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): + tokenizer_s = self.tokenizer_class.from_pretrained(pretrained_name, **kwargs) + tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) + + text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat" + text_tokenized_s = tokenizer_s.tokenize(text) + text_tokenized_r = tokenizer_r.tokenize(text) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on an example containing a character (Latin Small Letter A + # with Tilde) encoded in 2 different ways + text = "xa\u0303y" + " " + "x\xe3y" + text_tokenized_s = tokenizer_s.tokenize(text) + text_tokenized_r = tokenizer_r.tokenize(text) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on 
unicode of space type + spaces_unicodes = [ + "\u0009", # (horizontal tab, '\t') + "\u000B", # (vertical tab) + "\u000C", # (form feed) + "\u0020", # (space, ' ') + "\u200E", # (left-to-right mark):w + "\u200F", # (right-to-left mark) + ] + for unicode_seq in spaces_unicodes: + text_tokenized_s = tokenizer_s.tokenize(unicode_seq) + text_tokenized_r = tokenizer_r.tokenize(unicode_seq) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + # Test that the tokenization is identical on unicode of line break type + line_break_unicodes = [ + "\u000A", # (line feed, '\n') + "\r\n", # (carriage return and line feed, '\r\n') + "\u000D", # (carriage return, '\r') + "\r", # (carriage return, '\r') + "\u000D", # (carriage return, '\r') + "\u2028", # (line separator) + "\u2029", # (paragraph separator) + # "\u0085", # (next line) + ] + + # The tokenization is not identical for the character "\u0085" (next line). The slow version transforms + # it into the Horizontal Ellipsis character "…" ("\u2026") while the fast version transforms it into a + # space (and thus into an empty list). 
+ + for unicode_seq in line_break_unicodes: + text_tokenized_s = tokenizer_s.tokenize(unicode_seq) + text_tokenized_r = tokenizer_r.tokenize(unicode_seq) + + self.assertListEqual(text_tokenized_s, text_tokenized_r) + + def test_offsets_mapping_with_different_add_prefix_space_argument(self): + # Test which aims to verify that the offsets are well adapted to the argument `add_prefix_space` + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"): + text_of_1_token = "hello" # `hello` is a token in the vocabulary of `pretrained_name` + text = f"{text_of_1_token} {text_of_1_token}" + + tokenizer_r = self.rust_tokenizer_class.from_pretrained( + pretrained_name, + use_fast=True, + ) + encoding = tokenizer_r(text, return_offsets_mapping=True, add_special_tokens=False) + self.assertEqual(encoding.offset_mapping[0], (0, len(text_of_1_token))) + self.assertEqual( + encoding.offset_mapping[1], + (len(text_of_1_token) + 1, len(text_of_1_token) + 1 + len(text_of_1_token)), + ) + + text = f" {text}" + + tokenizer_r = self.rust_tokenizer_class.from_pretrained( + pretrained_name, + use_fast=True, + ) + encoding = tokenizer_r(text, return_offsets_mapping=True, add_special_tokens=False) + self.assertEqual(encoding.offset_mapping[0], (1, 1 + len(text_of_1_token))) + self.assertEqual( + encoding.offset_mapping[1], + (1 + len(text_of_1_token) + 1, 1 + len(text_of_1_token) + 1 + len(text_of_1_token)), + ) + + def test_log_warning(self): + # Test related to the breaking change introduced in transformers v4.17.0 + # We need to check that an error in raised when the user try to load a previous version of the tokenizer. + with self.assertRaises(ValueError) as context: + self.rust_tokenizer_class.from_pretrained("robot-test/old-imagebind-tokenizer") + + self.assertTrue( + context.exception.args[0].startswith( + "The `backend_tokenizer` provided does not match the expected format." 
+ ) + ) + + @require_ftfy + def test_tokenization_python_rust_equals(self): + super().test_tokenization_python_rust_equals() + + # overwrite common test + def test_added_tokens_do_lower_case(self): + # ImageBind always lower cases letters + pass \ No newline at end of file
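The toy vocab/merges fixture in `setUp` exercises CLIP-style BPE, where the last symbol of each word carries an end-of-word marker. As a minimal sketch (a hypothetical standalone helper, not the actual `ImageBindTokenizer` implementation) of the greedy merge loop behind the expected tokens in `test_full_tokenizer`:

```python
def bpe(word, merges):
    """Greedily apply BPE merge rules, lowest-ranked (earliest-learned) pair first.

    `word` is a sequence of symbols; CLIP-style vocabularies append `</w>`
    to the final symbol to mark the end of a word.
    """
    # Rank each merge rule by its position in the merges list.
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Score every adjacent pair; un-mergeable pairs get rank infinity.
        pairs = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(pairs)  # lowest rank wins, ties break leftmost
        if best_rank == float("inf"):
            break  # no applicable merge rule left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# The fixture's merge rules, minus the "#version: 0.2" header line.
merges = ["l o", "lo w</w>", "e r</w>"]
print(bpe(("l", "o", "w", "e", "r</w>"), merges))  # -> ['lo', 'w', 'er</w>']
```

This is why the test expects `"lower newer"` to produce `["lo", "w", "er</w>", "n", "e", "w", "er</w>"]`: `l o` merges first (rank 0), `e r</w>` merges next (rank 2), and `lo w</w>` never fires because plain `w` is not word-final here.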