diff --git a/README.md b/README.md
index cc1fd458f5af..bacda019ca59 100644
--- a/README.md
+++ b/README.md
@@ -373,6 +373,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
diff --git a/README_es.md b/README_es.md
index fcb6049870be..e447ff56afc8 100644
--- a/README_es.md
+++ b/README_es.md
@@ -348,6 +348,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
diff --git a/README_hd.md b/README_hd.md
index 9b694ed60781..eb0c3a0f1a9c 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -320,6 +320,7 @@ conda install -c huggingface transformers
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा।
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (Salesforce से) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. द्वाराअनुसंधान पत्र [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) के साथ जारी किया गया
diff --git a/README_ja.md b/README_ja.md
index 60b14191c165..6c10f026842d 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -382,6 +382,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI から) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever から公開された研究論文: [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/)
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (Salesforce から) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. から公開された研究論文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)
diff --git a/README_ko.md b/README_ko.md
index cdbeec9a4b8a..8b2849358cd0 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -297,6 +297,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다.
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다.
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI 에서) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 의 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 논문과 함께 발표했습니다.
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (Salesforce 에서 제공)은 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.의 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index db1cbd423725..d7f8a007d44c 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -321,6 +321,7 @@ conda install -c huggingface transformers
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (来自 OpenAI) 伴随论文 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 由 Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 发布。
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (来自 Salesforce) 伴随论文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 由 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index e66cd1f867a2..a9fad51839a1 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -333,6 +333,7 @@ conda install -c huggingface transformers
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](https://huggingface.co/docs/transformers/model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](https://huggingface.co/docs/transformers/main/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
diff --git a/docs/source/de/index.md b/docs/source/de/index.md
index 22f34aa84758..7b856b70ebe9 100644
--- a/docs/source/de/index.md
+++ b/docs/source/de/index.md
@@ -109,6 +109,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen,
 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
 1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index 91c57e0f393f..da745f3ebf27 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -137,6 +137,7 @@ The documentation is organized into five sections:
 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Informer](model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
 1. **[InstructBLIP](model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
@@ -348,6 +349,7 @@ Flax), PyTorch, and/or TensorFlow.
 |           GroupViT            |       ❌       |       ❌       |       ✅        |         ✅         |      ❌      |
 |            Hubert             |       ❌       |       ❌       |       ✅        |         ✅         |      ❌      |
 |            I-BERT             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|              ICT              |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |           ImageGPT            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |           Informer            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |         InstructBLIP          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
diff --git a/docs/source/en/model_doc/ict.md b/docs/source/en/model_doc/ict.md
new file mode 100644
index 000000000000..d1a738f79c55
--- /dev/null
+++ b/docs/source/en/model_doc/ict.md
@@ -0,0 +1,54 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# ICT
+
+## Overview
+
+The ICT model was proposed in [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) 
+by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao. ICT (Image Completion with Transformers) leverages both a 
+transformer and CNNs by decoupling image completion into two steps: reconstruction of pluralistic appearance priors with
+a transformer to recover coherent image structures, and low-resolution upsampling with CNNs to replenish fine textures.
+
+The abstract from the paper is the following:
+
+*Image completion has made tremendous progress with convolutional neural networks (CNNs), because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatial-invariant kernels), CNNs do not perform well in understanding global structures or naturally support pluralistic completion. Recently, transformers demonstrate their power in modeling the long-term relationship and generating diverse results, but their computation complexity is quadratic to input length, thus hampering the application in processing high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with transformer and texture replenishment with CNN. The former transformer recovers pluralistic coherent structures together with some coarse textures, while the latter CNN enhances the local texture details of coarse priors guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in terms of three aspects: 1) large performance boost on image fidelity even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic dataset, like ImageNet.*
+
+Tips:
+
+- Unlike auto-regressive methods, this model optimizes the log-likelihood of the missing pixels conditioned on
+  bidirectional context, so the transformer can complete missing regions using all the available context. The
+  objective is inspired by masked language models such as BERT.
+- The computational cost of multi-head attention grows quadratically with input length, so the appearance priors
+  are resized to low-resolution versions that contain only structural information and coarse textures. The
+  dimensionality is further reduced by using an extra visual vocabulary (512 × 3) generated from the k-means cluster
+  centers of the whole ImageNet RGB pixel space.
+- Three available checkpoints are trained on [ImageNet](https://www.image-net.org/challenges/LSVRC), 
+  [FFHQ](https://github.com/NVlabs/ffhq-dataset) and [Places2](http://places2.csail.mit.edu/).
+
+This model was contributed by [Sheon Han](https://huggingface.co/sheonhan).
+The original code can be found [here](https://github.com/raywzy/ICT).
+
+
+## IctConfig
+
+[[autodoc]] IctConfig
+
+## IctImageProcessor
+
+[[autodoc]] IctImageProcessor
+    - preprocess
+
+## IctModel
+
+[[autodoc]] IctModel
+    - forward
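A minimal sketch of the visual-vocabulary idea from the ICT tips above, assuming each RGB pixel of the low-resolution appearance prior is mapped to the index of its nearest entry in a 512 × 3 palette. The palette here is random toy data, not the actual k-means centers computed on ImageNet, and the names `build_toy_palette`, `quantize`, and `dequantize` are hypothetical helpers, not part of `transformers` or the original ICT code.

```python
import random


def build_toy_palette(size=512, seed=0):
    """Return `size` random RGB triples standing in for k-means cluster centers (assumption)."""
    rng = random.Random(seed)
    return [(rng.randrange(256), rng.randrange(256), rng.randrange(256)) for _ in range(size)]


def quantize(pixels, palette):
    """Map each RGB pixel to the index of the nearest palette entry (squared Euclidean distance)."""

    def nearest(pixel):
        return min(
            range(len(palette)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
        )

    return [nearest(pixel) for pixel in pixels]


def dequantize(tokens, palette):
    """Map token indices back to their palette colors."""
    return [palette[token] for token in tokens]


palette = build_toy_palette()
pixels = [(255, 0, 0), (0, 255, 0), (12, 34, 56)]
tokens = quantize(pixels, palette)     # one integer token per pixel
restored = dequantize(tokens, palette) # approximate colors drawn from the palette
```

With this encoding, each pixel becomes a single token from a 512-entry vocabulary instead of three 8-bit channels, which keeps the transformer's input a short sequence of discrete tokens.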
diff --git a/docs/source/es/index.md b/docs/source/es/index.md
index caefdfb7ad7b..851f348abe5d 100644
--- a/docs/source/es/index.md
+++ b/docs/source/es/index.md
@@ -97,6 +97,7 @@ La biblioteca actualmente contiene implementaciones de JAX, PyTorch y TensorFlow
 1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released with [GPTSAN](https://github.com/tanreinama/GPTSAN) by Toshiyuki Sakamoto (tanreinama).
 1. **[Hubert](model_doc/hubert)** (de Facebook) publicado con el paper [HuBERT: Self-Supervised Speech Representation Learning por Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) por Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (de Berkeley) publicado con el paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) por Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (de OpenAI) publicado con el paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) por Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (de Microsoft Research Asia) publicado con el paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) por Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
 1. **[LayoutLMv2](model_doc/layoutlmv2)** (de Microsoft Research Asia) publicado con el paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) por Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
diff --git a/docs/source/fr/index.md b/docs/source/fr/index.md
index f18ad8e57c21..7c71b4b7ab15 100644
--- a/docs/source/fr/index.md
+++ b/docs/source/fr/index.md
@@ -126,6 +126,7 @@ La documentation est organisée en 5 parties:
 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
diff --git a/docs/source/it/index.md b/docs/source/it/index.md
index 5c7d22c1e6b1..9d5bc33e7fda 100644
--- a/docs/source/it/index.md
+++ b/docs/source/it/index.md
@@ -104,6 +104,7 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p
 1. **[GPT NeoX](model_doc/gpt_neox)** (da EleutherAI) rilasciato con il paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) da Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
 1. **[Hubert](model_doc/hubert)** (da Facebook) rilasciato con il paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) da Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (da Berkeley) rilasciato con il paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) da Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (da OpenAI) rilasciato con il paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) da Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (da Microsoft Research Asia) rilasciato con il paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) da Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
 1. **[LayoutLMv2](model_doc/layoutlmv2)** (da Microsoft Research Asia) rilasciato con il paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) da Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
diff --git a/docs/source/ja/index.md b/docs/source/ja/index.md
index 364a3b34caba..9f2d05500af9 100644
--- a/docs/source/ja/index.md
+++ b/docs/source/ja/index.md
@@ -122,6 +122,7 @@ rendered properly in your Markdown viewer.
 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)
 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI から) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever から公開された研究論文: [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/)
 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (OpenAI から) Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever から公開された研究論文: [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf)
 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (Microsoft Research Asia から) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou から公開された研究論文: [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)
diff --git a/docs/source/ko/index.md b/docs/source/ko/index.md
index f0ec9ae1b8b9..49cb8e146085 100644
--- a/docs/source/ko/index.md
+++ b/docs/source/ko/index.md
@@ -114,6 +114,7 @@ rendered properly in your Markdown viewer.
 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
diff --git a/docs/source/pt/index.md b/docs/source/pt/index.md
index 08575b0bea22..483ccdda8e19 100644
--- a/docs/source/pt/index.md
+++ b/docs/source/pt/index.md
@@ -110,6 +110,7 @@ Atualmente a biblioteca contém implementações do PyTorch, TensorFlow e JAX, p
 1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama).
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
 1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
diff --git a/docs/source/zh/index.md b/docs/source/zh/index.md
index 38e758caf73c..25f180c8fb45 100644
--- a/docs/source/zh/index.md
+++ b/docs/source/zh/index.md
@@ -121,6 +121,7 @@ rendered properly in your Markdown viewer.
 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
+1. **[Image Completion Transformer](model_doc/ict)** (from City University of Hong Kong and Microsoft Cloud + AI) released with the paper [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/abs/2103.14031) by Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao.
 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
 1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever.
 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 99683306d6b4..36ca591dc503 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -330,6 +330,7 @@
     "models.herbert": ["HerbertTokenizer"],
     "models.hubert": ["HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "HubertConfig"],
     "models.ibert": ["IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "IBertConfig"],
+    "models.ict": ["ICT_PRETRAINED_CONFIG_ARCHIVE_MAP", "IctConfig"],
     "models.imagegpt": ["IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ImageGPTConfig"],
     "models.informer": ["INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "InformerConfig"],
     "models.instructblip": [
@@ -909,6 +910,7 @@
     _import_structure["models.efficientnet"].append("EfficientNetImageProcessor")
     _import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaImageProcessor", "FlavaProcessor"])
     _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
+    _import_structure["models.ict"].extend(["IctImageProcessor"])
     _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
     _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
     _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
@@ -1832,6 +1834,13 @@
             "IBertPreTrainedModel",
         ]
     )
+    _import_structure["models.ict"].extend(
+        [
+            "ICT_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "IctModel",
+            "IctPreTrainedModel",
+        ]
+    )
     _import_structure["models.imagegpt"].extend(
         [
             "IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4201,6 +4210,7 @@
     from .models.herbert import HerbertTokenizer
     from .models.hubert import HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, HubertConfig
     from .models.ibert import IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, IBertConfig
+    from .models.ict import ICT_PRETRAINED_CONFIG_ARCHIVE_MAP, IctConfig
     from .models.imagegpt import IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, ImageGPTConfig
     from .models.informer import INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, InformerConfig
     from .models.instructblip import (
@@ -4719,6 +4729,7 @@
         from .models.efficientnet import EfficientNetImageProcessor
         from .models.flava import FlavaFeatureExtractor, FlavaImageProcessor, FlavaProcessor
         from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor
+        from .models.ict import IctImageProcessor
         from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
         from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor
         from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor
@@ -5486,6 +5497,11 @@
             IBertModel,
             IBertPreTrainedModel,
         )
+        from .models.ict import (
+            ICT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            IctModel,
+            IctPreTrainedModel,
+        )
         from .models.imagegpt import (
             IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST,
             ImageGPTForCausalImageModeling,
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index d8345c9ef8c0..be6e3caa2caf 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -98,6 +98,7 @@
     herbert,
     hubert,
     ibert,
+    ict,
     imagegpt,
     informer,
     instructblip,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index 5cbaa0705afb..6dd4b1b502d7 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -105,6 +105,7 @@
         ("groupvit", "GroupViTConfig"),
         ("hubert", "HubertConfig"),
         ("ibert", "IBertConfig"),
+        ("ict", "IctConfig"),
         ("imagegpt", "ImageGPTConfig"),
         ("informer", "InformerConfig"),
         ("instructblip", "InstructBlipConfig"),
@@ -299,6 +300,7 @@
         ("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("ict", "ICT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("informer", "INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("instructblip", "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -495,6 +497,7 @@
         ("herbert", "HerBERT"),
         ("hubert", "Hubert"),
         ("ibert", "I-BERT"),
+        ("ict", "ICT"),
         ("imagegpt", "ImageGPT"),
         ("informer", "Informer"),
         ("instructblip", "InstructBLIP"),
diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
index 7d7502670126..d3c2db1af300 100644
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -65,6 +65,7 @@
         ("git", "CLIPImageProcessor"),
         ("glpn", "GLPNImageProcessor"),
         ("groupvit", "CLIPImageProcessor"),
+        ("ict", "IctImageProcessor"),
         ("imagegpt", "ImageGPTImageProcessor"),
         ("instructblip", "BlipImageProcessor"),
         ("layoutlmv2", "LayoutLMv2ImageProcessor"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 8bb6ea37aab2..889dcae8e871 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -103,6 +103,7 @@
         ("groupvit", "GroupViTModel"),
         ("hubert", "HubertModel"),
         ("ibert", "IBertModel"),
+        ("ict", "IctModel"),
         ("imagegpt", "ImageGPTModel"),
         ("informer", "InformerModel"),
         ("jukebox", "JukeboxModel"),
diff --git a/src/transformers/models/ict/__init__.py b/src/transformers/models/ict/__init__.py
new file mode 100644
index 000000000000..b58bca645d16
--- /dev/null
+++ b/src/transformers/models/ict/__init__.py
@@ -0,0 +1,74 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_flax_available,
+    is_tf_available,
+    is_torch_available,
+    is_vision_available,
+)
+
+
+_import_structure = {"configuration_ict": ["ICT_PRETRAINED_CONFIG_ARCHIVE_MAP", "IctConfig"]}
+
+try:
+    if not is_vision_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["image_processing_ict"] = ["IctImageProcessor"]
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_ict"] = [
+        "ICT_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "IctModel",
+        "IctPreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_ict import ICT_PRETRAINED_CONFIG_ARCHIVE_MAP, IctConfig
+
+    try:
+        if not is_vision_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .image_processing_ict import IctImageProcessor
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_ict import (
+            ICT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            IctModel,
+            IctPreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/ict/configuration_ict.py b/src/transformers/models/ict/configuration_ict.py
new file mode 100644
index 000000000000..06d69795e6a0
--- /dev/null
+++ b/src/transformers/models/ict/configuration_ict.py
@@ -0,0 +1,151 @@
+# coding=utf-8
+# Copyright 2023 Authors at City University of Hong Kong, Microsoft Cloud + AI,
+# The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" ICT model configuration"""
+
+import numpy as np
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+ICT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "sheonhan/ict-imagenet-256": "https://huggingface.co/sheonhan/ict-imagenet-256/resolve/main/config.json",
+}
+
+
+class IctConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of an [`IctModel`]. It is used to instantiate an ICT
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the ICT model trained with the ImageNet dataset
+    [sheonhan/ict-imagenet-256](https://huggingface.co/sheonhan/ict-imagenet-256).
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 512):
+            Vocabulary size of the ICT model. Defines the number of different tokens that can be represented by the
+            `pixel_values` passed when calling [`IctTransformer`].
+        hidden_size (`int`, *optional*, defaults to 1024):
+            Dimensionality of the embeddings and hidden states.
+        num_hidden_layers (`int`, *optional*, defaults to 35):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 8):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        num_residual_blocks (`int`, *optional*, defaults to 8):
+            The number of residual blocks in [`IctGuidedUpsampler`].
+        intermediate_size (`int`, *optional*, defaults to 4096):
+            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+        activation_function (`str`, *optional*, defaults to `"gelu"`):
+            Activation function to use. Can be any of the activation functions defined in
+            `src/transformers/activations.py`.
+        embedding_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the embeddings.
+        residual_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+            The epsilon used by the layer normalization layers.
+        image_size (`int`, *optional*, defaults to 32):
+            The size (resolution) of each image.
+        num_channels (`int`, *optional*, defaults to 3):
+            The number of input channels.
+        qkv_bias (`bool`, *optional*, defaults to `True`):
+            Whether to add a bias to the queries, keys and values.
+        temperature (`float`, *optional*, defaults to 1.0):
+            The value used to modulate the next token probabilities that will be used by default in the `generate`
+            method of the model. Must be strictly positive.
+        top_k (`int`, *optional*, defaults to 50):
+            Number of highest probability vocabulary tokens to keep for top-k-filtering that will be used by default in
+            the `generate` method of the model.
+        gan_loss_function (`str`, *optional*, defaults to `"nsgan"`):
+            GAN loss function for the guided upsampler. Must be one of `"nsgan"`, `"lsgan"`, or
+            `"hinge"`.
+        output_image_size (`int`, *optional*, defaults to 256):
+            The size (resolution) of the output image.
+        clusters (`np.ndarray`, *optional*, defaults to `None`):
+            Clusters used to quantize the image of shape `(n_clusters, 3)`. Provide the same `clusters` used for
+            `IctImageProcessor`.
+
+    Example:
+
+    ```python
+    >>> from transformers import IctConfig, IctModel
+
+    >>> # Initializing an ICT ict-imagenet-256 style configuration
+    >>> configuration = IctConfig()
+
+    >>> # Initializing a model (with random weights) from the ict-imagenet-256 style configuration
+    >>> model = IctModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "ict"
+
+    def __init__(
+        self,
+        vocab_size=512,
+        hidden_size=1024,
+        num_hidden_layers=35,
+        num_attention_heads=8,
+        num_residual_blocks=8,
+        intermediate_size=4096,
+        activation_function="gelu",
+        embedding_dropout_prob=0.0,
+        residual_dropout_prob=0.0,
+        attention_probs_dropout_prob=0.0,
+        initializer_range=0.02,
+        layer_norm_eps=1e-12,
+        image_size=32,
+        num_channels=3,
+        qkv_bias=True,
+        temperature=1.0,
+        top_k=50,
+        gan_loss_function="nsgan",
+        output_image_size=256,
+        clusters=None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_residual_blocks = num_residual_blocks
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.activation_function = activation_function
+        self.embedding_dropout_prob = embedding_dropout_prob
+        self.residual_dropout_prob = residual_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.image_size = image_size
+        self.num_channels = num_channels
+        self.qkv_bias = qkv_bias
+        self.temperature = temperature
+        self.top_k = top_k
+        self.gan_loss_function = gan_loss_function
+        self.output_image_size = output_image_size
+        self.clusters = np.array(clusters) if clusters is not None else None
diff --git a/src/transformers/models/ict/convert_ict_to_pytorch.py b/src/transformers/models/ict/convert_ict_to_pytorch.py
new file mode 100644
index 000000000000..5bdf322e01af
--- /dev/null
+++ b/src/transformers/models/ict/convert_ict_to_pytorch.py
@@ -0,0 +1,245 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert ICT checkpoints from the original library."""
+
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import requests
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from torchvision.transforms import Compose, Lambda, Resize
+
+from transformers import IctConfig, IctImageProcessor, IctModel
+from transformers.image_utils import PILImageResampling
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+# here we list all keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys(config):
+    rename_keys = []
+    rename_keys.append(("pos_emb", "transformer.embeddings.position_embedding"))
+    rename_keys.append(("tok_emb.weight", "transformer.embeddings.token_embedding.weight"))
+    # NOTE: the mask token does not exist in the original weights
+
+    for i in range(config.num_hidden_layers):
+        rename_keys.append((f"blocks.{i}.ln1.weight", f"transformer.encoder.layers.{i}.ln_1.weight"))
+        rename_keys.append((f"blocks.{i}.ln1.bias", f"transformer.encoder.layers.{i}.ln_1.bias"))
+        rename_keys.append((f"blocks.{i}.ln2.weight", f"transformer.encoder.layers.{i}.ln_2.weight"))
+        rename_keys.append((f"blocks.{i}.ln2.bias", f"transformer.encoder.layers.{i}.ln_2.bias"))
+        rename_keys.append((f"blocks.{i}.attn.key.weight", f"transformer.encoder.layers.{i}.attention.key.weight"))
+        rename_keys.append((f"blocks.{i}.attn.key.bias", f"transformer.encoder.layers.{i}.attention.key.bias"))
+        rename_keys.append((f"blocks.{i}.attn.query.weight", f"transformer.encoder.layers.{i}.attention.query.weight"))
+        rename_keys.append((f"blocks.{i}.attn.query.bias", f"transformer.encoder.layers.{i}.attention.query.bias"))
+        rename_keys.append((f"blocks.{i}.attn.value.weight", f"transformer.encoder.layers.{i}.attention.value.weight"))
+        rename_keys.append((f"blocks.{i}.attn.value.bias", f"transformer.encoder.layers.{i}.attention.value.bias"))
+        rename_keys.append((f"blocks.{i}.attn.proj.weight", f"transformer.encoder.layers.{i}.attention.output.weight"))
+        rename_keys.append((f"blocks.{i}.attn.proj.bias", f"transformer.encoder.layers.{i}.attention.output.bias"))
+        rename_keys.append((f"blocks.{i}.mlp.0.weight", f"transformer.encoder.layers.{i}.mlp.0.weight"))
+        rename_keys.append((f"blocks.{i}.mlp.0.bias", f"transformer.encoder.layers.{i}.mlp.0.bias"))
+        rename_keys.append((f"blocks.{i}.mlp.2.weight", f"transformer.encoder.layers.{i}.mlp.2.weight"))
+        rename_keys.append((f"blocks.{i}.mlp.2.bias", f"transformer.encoder.layers.{i}.mlp.2.bias"))
+
+    # Generator
+    rename_keys.append(("module.encoder.1.weight", "guided_upsampler.generator.encoder.1.weight"))
+    rename_keys.append(("module.encoder.1.bias", "guided_upsampler.generator.encoder.1.bias"))
+    rename_keys.append(("module.encoder.3.weight", "guided_upsampler.generator.encoder.3.weight"))
+    rename_keys.append(("module.encoder.3.bias", "guided_upsampler.generator.encoder.3.bias"))
+    rename_keys.append(("module.encoder.5.weight", "guided_upsampler.generator.encoder.5.weight"))
+    rename_keys.append(("module.encoder.5.bias", "guided_upsampler.generator.encoder.5.bias"))
+
+    for i in range(config.num_residual_blocks):
+        rename_keys.append(
+            (f"module.middle.{i}.conv_block.1.weight", f"guided_upsampler.generator.middle.{i}.conv_block.1.weight")
+        )
+        rename_keys.append(
+            (f"module.middle.{i}.conv_block.1.bias", f"guided_upsampler.generator.middle.{i}.conv_block.1.bias")
+        )
+        rename_keys.append(
+            (f"module.middle.{i}.conv_block.4.weight", f"guided_upsampler.generator.middle.{i}.conv_block.4.weight")
+        )
+        rename_keys.append(
+            (f"module.middle.{i}.conv_block.4.bias", f"guided_upsampler.generator.middle.{i}.conv_block.4.bias")
+        )
+
+    rename_keys.append(("module.decoder.0.weight", "guided_upsampler.generator.decoder.0.weight"))
+    rename_keys.append(("module.decoder.0.bias", "guided_upsampler.generator.decoder.0.bias"))
+    rename_keys.append(("module.decoder.2.weight", "guided_upsampler.generator.decoder.2.weight"))
+    rename_keys.append(("module.decoder.2.bias", "guided_upsampler.generator.decoder.2.bias"))
+    rename_keys.append(("module.decoder.5.weight", "guided_upsampler.generator.decoder.5.weight"))
+    rename_keys.append(("module.decoder.5.bias", "guided_upsampler.generator.decoder.5.bias"))
+
+    rename_keys.append(("ln_f.weight", "transformer.layernorm.weight"))
+    rename_keys.append(("ln_f.bias", "transformer.layernorm.bias"))
+    rename_keys.append(("head.weight", "transformer.head.weight"))
+
+    return rename_keys
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+@torch.no_grad()
+def convert_ict_checkpoint(
+    checkpoint_path: Path,
+    ict_config_file: Path,
+    ict_image_processor_config_file: Path,
+    pytorch_dump_path: Path,
+    push_to_hub: bool,
+):
+    config = IctConfig.from_json_file(ict_config_file)
+    model = IctModel(config)
+    model_name = Path(checkpoint_path).name.split(".")[0]
+
+    model_state_dict = torch.load(checkpoint_path, map_location="cpu")["model"]
+
+    generator_local_path = hf_hub_download(repo_id="sheonhan/ict-imagenet-256", filename="generator.pt")
+    generator_state_dict = torch.load(generator_local_path, map_location="cpu")
+
+    model_state_dict.update(generator_state_dict)
+    model_state_dict = {key: value for key, value in model_state_dict.items() if "attn.mask" not in key}
+
+    rename_keys = create_rename_keys(config)
+    for src, dest in rename_keys:
+        val = model_state_dict.pop(src)
+        model_state_dict[dest] = val
+
+    model.load_state_dict(model_state_dict, strict=False)
+    model.eval()
+
+    # prepare image
+    image = prepare_img()
+    image_size = 32
+    image_processor = IctImageProcessor.from_json_file(ict_image_processor_config_file)
+    clusters = image_processor.clusters
+    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
+
+    # original processing pipeline
+    image_transforms = Compose(
+        [
+            Resize((image_size, image_size), interpolation=PILImageResampling.BILINEAR),
+            Lambda(lambda img: torch.from_numpy(np.array(img)).view(-1, 3)),
+            Lambda(lambda img: ((img[:, None, :] - clusters[None, :, :]) ** 2).sum(-1).argmin(1)),
+        ]
+    )
+    original_pixel_values = image_transforms(image).unsqueeze(0)
+
+    assert torch.allclose(original_pixel_values, pixel_values)
+
+    bool_masked_pos_local_path = hf_hub_download(repo_id="sheonhan/ict-imagenet-256", filename="my_bool_masked_pos.pt")
+    bool_masked_pos = torch.load(bool_masked_pos_local_path)
+    bool_masked_pos = bool_masked_pos.unsqueeze(0)
+
+    outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos, clusters=clusters)
+    logits = outputs.logits
+
+    expected_shape = (3, 256, 256)
+
+    if "ImageNet" in model_name:
+        expected_logits = torch.Tensor(
+            [-0.1312, 0.4353, -1.0499, -0.5124, 0.4183, -0.6793, -1.3777, -0.0893, -0.7358, -2.4328]
+        )
+        assert torch.allclose(logits[0, :10], expected_logits, atol=1e-3)
+        assert logits.shape == expected_shape
+    # elif "FFHQ" in model_name:
+    #     expected_logits = torch.Tensor(
+    #         [-1.3150, -1.5456, -1.2556, -0.8496, -0.7127, -0.7897, -0.9728, -0.3052, 0.3751, -0.3127]
+    #     )
+    #     assert torch.allclose(logits[0, :10], expected_logits, atol=1e-3)
+    #     assert logits.shape == expected_shape
+    # elif "Places2_Nature" in model_name:
+    #     expected_logits = torch.Tensor(
+    #         [-1.0283, -1.4131, -0.5644, -1.3115, -0.5785, -1.2049, -0.7528, 0.1992, -0.3822, -0.0878]
+    #     )
+    #     assert logits.shape == expected_shape
+    else:
+        raise ValueError(
+            f"Unknown model checkpoint: {checkpoint_path}. Only ImageNet checkpoints are currently supported."
+        )
+
+    # Save Checkpoints
+    Path(pytorch_dump_path).mkdir(exist_ok=True)
+    model.save_pretrained(pytorch_dump_path)
+    print(f"Checkpoint successfully converted. Model saved at {pytorch_dump_path}")
+    image_processor.save_pretrained(pytorch_dump_path)
+    print(f"Image processor successfully saved at {pytorch_dump_path}")
+
+    if push_to_hub:
+        print("Pushing model to the hub...")
+
+        model.push_to_hub(
+            repo_id=f"sheonhan/{pytorch_dump_path}",
+            commit_message="Add model",
+            use_temp_dir=True,
+        )
+        image_processor.push_to_hub(
+            repo_id=f"sheonhan/{pytorch_dump_path}",
+            commit_message="Add image processor",
+            use_temp_dir=True,
+        )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--pytorch_model_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to ICT pytorch checkpoint.",
+    )
+    parser.add_argument(
+        "--config_file",
+        default=None,
+        type=str,
+        required=True,
+        help="The json file for ICT model config.",
+    )
+    parser.add_argument(
+        "--image_processor_config_file",
+        default=None,
+        type=str,
+        required=True,
+        help="The json file for IctImageProcessor config.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_path",
+        default="model",
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument("--save_model", action="store_true", help="Save model to local")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image preprocessor to the hub")
+
+    args = parser.parse_args()
+    convert_ict_checkpoint(
+        checkpoint_path=args.pytorch_model_path,
+        ict_config_file=args.config_file,
+        ict_image_processor_config_file=args.image_processor_config_file,
+        pytorch_dump_path=args.pytorch_dump_path,
+        push_to_hub=args.push_to_hub,
+    )
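The renaming loop in `convert_ict_checkpoint` pops each source key unconditionally, so the first absent key raises a bare `KeyError`, while `load_state_dict(..., strict=False)` hides mismatches on the destination side. The snippet below is a minimal sketch of a more tolerant renaming step that reports absent source keys instead of failing. `apply_rename_keys` and the toy dictionaries are hypothetical names introduced for illustration; plain floats stand in for tensors.

```python
from typing import Dict, List, Tuple


def apply_rename_keys(
    state_dict: Dict[str, object], rename_keys: List[Tuple[str, str]]
) -> Tuple[Dict[str, object], List[str]]:
    """Return a renamed copy of `state_dict` and the source keys that were absent."""
    renamed = dict(state_dict)  # shallow copy so the caller's dict is left untouched
    missing = []
    for src, dest in rename_keys:
        if src in renamed:
            renamed[dest] = renamed.pop(src)
        else:
            missing.append(src)
    return renamed, missing


# Toy stand-ins for tensors; key names follow the pattern used in create_rename_keys.
toy_state_dict = {"ln_f.weight": 1.0, "ln_f.bias": 0.0, "unrelated.weight": 2.0}
toy_rename_keys = [
    ("ln_f.weight", "transformer.layernorm.weight"),
    ("ln_f.bias", "transformer.layernorm.bias"),
    ("head.weight", "transformer.head.weight"),  # deliberately absent from the toy dict
]
renamed, missing = apply_rename_keys(toy_state_dict, toy_rename_keys)
```

Inspecting `missing` before calling `load_state_dict` would make it easier to see whether a checkpoint matches the expected layout.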
diff --git a/src/transformers/models/ict/image_processing_ict.py b/src/transformers/models/ict/image_processing_ict.py
new file mode 100644
index 000000000000..f655b3ac5557
--- /dev/null
+++ b/src/transformers/models/ict/image_processing_ict.py
@@ -0,0 +1,328 @@
+# coding=utf-8
+# Copyright 2023 Authors at City University of Hong Kong, Microsoft Cloud + AI,
+# The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for ICT."""
+
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import normalize, rescale, resize, to_channel_dimension_format
+from ...image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+)
+from ...utils import TensorType, is_torch_available, logging
+
+
+if is_torch_available():
+    import torch
+
+logger = logging.get_logger(__name__)
+
+
+class IctImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs an ICT image processor.
+
+    Args:
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image's (height, width) dimensions to the specified `(size["height"],
+            size["width"])`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
+        size (`Dict[str, int]`, *optional*, defaults to `{"height": 32, "width": 32}`):
+            Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
+            method.
+        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+            Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the
+            `preprocess` method.
+        do_rescale (`bool`, *optional*, defaults to `False`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
+            parameter in the `preprocess` method.
+        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+            Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
+            `preprocess` method.
+        do_normalize (`bool`, *optional*, defaults to `False`):
+            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
+            method.
+        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
+            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
+            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_color_quantize (`bool`, *optional*, defaults to `True`):
+            Whether to color quantize the image. Can be overridden by the `do_color_quantize` parameter in the
+            `preprocess` method.
+        clusters (`np.ndarray`, *optional*, defaults to `None`):
+            Clusters used to quantize the image of shape `(n_clusters, 3)`. Only has an effect if `do_color_quantize`
+            is set to `True`.
+    """
+
+    model_input_names = ["pixel_values"]
+
+    def __init__(
+        self,
+        do_resize: bool = True,
+        size: Optional[Dict[str, int]] = None,
+        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        do_rescale: bool = False,
+        rescale_factor: Union[int, float] = 1 / 255,
+        do_normalize: bool = False,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        do_color_quantize: bool = True,
+        clusters: Optional[Union[np.ndarray, List[float]]] = None,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"height": 32, "width": 32}
+        size = get_size_dict(size)
+        self.do_resize = do_resize
+        self.do_rescale = do_rescale
+        self.do_normalize = do_normalize
+        self.size = size
+        self.resample = resample
+        self.rescale_factor = rescale_factor
+        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
+        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
+        self.do_color_quantize = do_color_quantize
+        self.clusters = np.array(clusters) if clusters is not None else None
+
+    def resize(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resize an image to `(size["height"], size["width"])`.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
+            resample:
+                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
+            data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+
+        Returns:
+            `np.ndarray`: The resized image.
+        """
+        size = get_size_dict(size)
+        if "height" not in size or "width" not in size:
+            raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
+        return resize(
+            image, size=(size["height"], size["width"]), resample=resample, data_format=data_format, **kwargs
+        )
+
+    def rescale(
+        self, image: np.ndarray, scale: float, data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs
+    ) -> np.ndarray:
+        """
+        Rescale an image by a scale factor. image = image * scale.
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            scale (`float`):
+                The scaling factor to rescale pixel values by.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+
+        Returns:
+            `np.ndarray`: The rescaled image.
+        """
+        return rescale(image, scale=scale, data_format=data_format, **kwargs)
+
+    def normalize(
+        self,
+        image: np.ndarray,
+        mean: Union[float, List[float]],
+        std: Union[float, List[float]],
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Normalize an image. image = (image - image_mean) / image_std.
+
+        Args:
+            image (`np.ndarray`):
+                Image to normalize.
+            mean (`float` or `List[float]`):
+                Image mean to use for normalization.
+            std (`float` or `List[float]`):
+                Image standard deviation to use for normalization.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+
+        Returns:
+            `np.ndarray`: The normalized image.
+        """
+        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
+
+    def color_quantize(self, image: np.ndarray, clusters: np.ndarray):
+        """
+        Quantize the colors of an image using a visual vocabulary of shape `(num_clusters, 3)`, which was generated
+        from k-means cluster centers of the ImageNet RGB pixel space.
+
+        For example, an image of shape (32, 24, 3) is reduced to a flat array of shape (768,), where each element is
+        an integer index into `clusters`, which contains the actual RGB values.
+
+        Args:
+            image (`np.ndarray`):
+                Image whose dimension will be reduced.
+            clusters (`np.ndarray`):
+                Color clusters of shape `(n_clusters, 3)` used to quantize the image.
+
+        Returns:
+            `np.ndarray`: Flat array of shape `(height * width,)` with the nearest cluster index for each pixel.
+        """
+
+        # Modified from https://github.com/raywzy/ICT/blob/59dd12d374d47cdf0dce90923017ca3657e6aa0b/Transformer/inference.py#L98
+        image = to_channel_dimension_format(image, ChannelDimension.LAST)
+        image = image.reshape(-1, 3)
+        image = np.argmin(np.sum((image[:, None, :] - clusters[None, :, :]) ** 2, axis=-1), axis=1)
+        return image
+
+    def preprocess(
+        self,
+        images: ImageInput,
+        do_resize: Optional[bool] = None,
+        size: Optional[Dict[str, int]] = None,
+        resample: Optional[PILImageResampling] = None,
+        do_rescale: Optional[bool] = None,
+        rescale_factor: Optional[float] = None,
+        do_normalize: Optional[bool] = None,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        do_color_quantize: Optional[bool] = None,
+        clusters: Optional[np.ndarray] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        **kwargs,
+    ):
+        """
+        Preprocess an image or batch of images.
+
+        Args:
+            images (`ImageInput`):
+                Image to preprocess.
+            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                Whether to resize the image.
+            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
+                Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after
+                resizing.
+            resample (`PILImageResampling` filter, *optional*, defaults to `self.resample`):
+                `PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BILINEAR`. Only has
+                an effect if `do_resize` is set to `True`.
+            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                Whether to rescale the image values between [0 - 1].
+            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                Whether to normalize the image.
+            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use if `do_normalize` is set to `True`.
+            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use if `do_normalize` is set to `True`.
+            do_color_quantize (`bool`, *optional*, defaults to `self.do_color_quantize`):
+                Whether to color quantize the image.
+            clusters (`np.ndarray`, *optional*, defaults to `self.clusters`):
+                Clusters used to quantize the image of shape `(n_clusters, 3)`. Only has an effect if
+                `do_color_quantize` is set to `True`.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+        """
+        do_resize = do_resize if do_resize is not None else self.do_resize
+        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+        resample = resample if resample is not None else self.resample
+        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        do_color_quantize = do_color_quantize if do_color_quantize is not None else self.do_color_quantize
+        clusters = clusters if clusters is not None else self.clusters
+
+        size = size if size is not None else self.size
+        size_dict = get_size_dict(size)
+
+        images = make_list_of_images(images)
+
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+
+        if do_resize and size is None:
+            raise ValueError("Size must be specified if do_resize is True.")
+
+        if do_rescale and rescale_factor is None:
+            raise ValueError("Rescale factor must be specified if do_rescale is True.")
+
+        if do_color_quantize and clusters is None:
+            raise ValueError("Clusters must be specified if do_color_quantize is True.")
+
+        # All transformations expect numpy arrays.
+        images = [to_numpy_array(image) for image in images]
+
+        if do_resize:
+            images = [self.resize(image=image, size=size_dict, resample=resample) for image in images]
+
+        if do_rescale:
+            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+
+        if do_normalize:
+            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
+
+        if do_color_quantize:
+            images = [to_channel_dimension_format(image, ChannelDimension.LAST) for image in images]
+            # quantize and flatten each image to shape (height * width,)
+            images = [self.color_quantize(image=image, clusters=clusters) for image in images]
+        else:
+            images = [to_channel_dimension_format(image, data_format) for image in images]
+
+        data = {"pixel_values": images}
+        # `clusters` may be None when color quantization is disabled
+        if clusters is not None:
+            data["clusters"] = torch.from_numpy(np.asarray(clusters))
+
+        return BatchFeature(data=data, tensor_type=return_tensors)
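The nearest-cluster lookup performed by `IctImageProcessor.color_quantize` can be illustrated in isolation. The snippet below is a minimal NumPy sketch under stated assumptions: `quantize_to_clusters`, `toy_clusters`, and `toy_image` are hypothetical names, and the three-color palette is made up for the example rather than taken from the ICT checkpoints.

```python
import numpy as np


def quantize_to_clusters(image: np.ndarray, clusters: np.ndarray) -> np.ndarray:
    """Map each RGB pixel of a channels-last image to the index of its nearest cluster."""
    # (H*W, 3); the explicit float cast keeps the subtraction safe for uint8 inputs
    pixels = image.reshape(-1, 3).astype(np.float64)
    # Squared Euclidean distance from every pixel to every cluster: (H*W, n_clusters)
    distances = ((pixels[:, None, :] - clusters[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)  # (H*W,)


# Hypothetical 3-entry palette: black, white, pure red.
toy_clusters = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0]], dtype=np.float64)
toy_image = np.array(
    [
        [[10, 5, 0], [250, 250, 240]],
        [[200, 30, 20], [0, 0, 0]],
    ],
    dtype=np.uint8,
)  # shape (2, 2, 3)
token_ids = quantize_to_clusters(toy_image, toy_clusters)
print(token_ids)  # [0 1 2 0]
```

The result is a flat sequence with one token index per pixel, which is the form the embedding layer of the model consumes.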
diff --git a/src/transformers/models/ict/modeling_ict.py b/src/transformers/models/ict/modeling_ict.py
new file mode 100644
index 000000000000..048864d8aa3b
--- /dev/null
+++ b/src/transformers/models/ict/modeling_ict.py
@@ -0,0 +1,855 @@
+# coding=utf-8
+# Copyright 2023 Authors at City University of Hong Kong, Microsoft Cloud + AI,
+# The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch ICT model."""
+
+
+import math
+from typing import Dict, List, Optional, Set, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+import torchvision.models as models
+from torch import nn
+
+from ...activations import ACT2FN
+from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, MaskedImageModelingOutput
+from ...modeling_utils import PreTrainedModel
+from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+)
+from .configuration_ict import IctConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "IctConfig"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "sheonhan/ict-imagenet-256"
+_EXPECTED_OUTPUT_SHAPE = [3, 256, 256]
+
+
+ICT_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "sheonhan/ict-imagenet-256",
+    "sheonhan/ict-ffhq-256",
+    "sheonhan/ict-places-256",
+    # See all ICT models at https://huggingface.co/models?filter=ict
+]
+
+
+class IctEmbeddings(nn.Module):
+    """
+    Construct the embeddings. Optionally, also the mask token.
+    """
+
+    def __init__(self, config, use_mask_token=False):
+        super().__init__()
+
+        self.token_embedding = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.position_embedding = nn.Parameter(
+            torch.zeros(1, config.image_size * config.image_size, config.hidden_size)
+        )
+        self.dropout = nn.Dropout(config.embedding_dropout_prob)
+
+        self.mask_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size)) if use_mask_token else None
+
+    def forward(
+        self, pixel_values: torch.LongTensor, bool_masked_pos: Optional[torch.BoolTensor] = None
+    ) -> torch.Tensor:
+        batch_size, num_pixel = pixel_values.shape
+
+        embeddings = self.token_embedding(pixel_values)
+
+        if bool_masked_pos is not None:
+            seq_length = embeddings.shape[1]
+            mask_tokens = self.mask_token.expand(batch_size, seq_length, -1)
+            # replace the masked visual tokens by mask_tokens
+            mask = bool_masked_pos.unsqueeze(-1).type_as(mask_tokens)
+            embeddings = embeddings * (1.0 - mask) + mask_tokens * mask
+
+        # each position maps to a learnable vector
+        position_embeds = self.position_embedding[:, :num_pixel, :]
+        embeddings = embeddings + position_embeds
+        embeddings = self.dropout(embeddings)
+
+        return embeddings
+
+
+class IctSelfAttention(nn.Module):
+    def __init__(self, config: IctConfig) -> None:
+        super().__init__()
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                f"The hidden size {config.hidden_size} is not a multiple of the number of attention "
+                f"heads {config.num_attention_heads}."
+            )
+
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+
+        self.query = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+        self.key = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+        self.value = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+
+        self.output = nn.Linear(config.hidden_size, config.hidden_size)
+
+        self.attention_dropout = nn.Dropout(config.attention_probs_dropout_prob)
+        self.residual_dropout = nn.Dropout(config.residual_dropout_prob)
+        # tracks heads removed by `prune_heads`
+        self.pruned_heads = set()
+
+    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
+        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+        x = x.view(new_x_shape)
+        return x.permute(0, 2, 1, 3)
+
+    def prune_heads(self, heads: Set[int]) -> None:
+        if len(heads) == 0:
+            return
+        heads, index = find_pruneable_heads_and_indices(
+            heads, self.num_attention_heads, self.attention_head_size, self.pruned_heads
+        )
+
+        # Prune linear layers
+        self.query = prune_linear_layer(self.query, index)
+        self.key = prune_linear_layer(self.key, index)
+        self.value = prune_linear_layer(self.value, index)
+        self.output = prune_linear_layer(self.output, index, dim=1)
+
+        # Update hyper params and store pruned heads
+        self.num_attention_heads = self.num_attention_heads - len(heads)
+        self.all_head_size = self.attention_head_size * self.num_attention_heads
+        self.pruned_heads = self.pruned_heads.union(heads)
+
+    def forward(
+        self, hidden_states, output_attentions: bool = False
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
+        mixed_query_layer = self.query(hidden_states)
+
+        key_layer = self.transpose_for_scores(self.key(hidden_states))
+        value_layer = self.transpose_for_scores(self.value(hidden_states))
+        query_layer = self.transpose_for_scores(mixed_query_layer)
+
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+
+        # Normalize the attention scores to probabilities.
+        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+
+        # This is actually dropping out entire tokens to attend to, which might
+        # seem a bit unusual, but is taken from the original Transformer paper.
+        attention_probs = self.attention_dropout(attention_probs)
+
+        context_layer = torch.matmul(attention_probs, value_layer)
+
+        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+        context_layer = context_layer.view(new_context_layer_shape)
+
+        outputs = self.output(context_layer)
+        outputs = self.residual_dropout(outputs)
+
+        return (outputs, attention_probs) if output_attentions else (outputs,)
+
+
+class IctLayer(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        hidden_size = config.hidden_size
+        intermediate_size = config.intermediate_size
+        self.intermediate_act_fn = ACT2FN[config.activation_function]
+
+        self.layer_norm_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_eps)
+        self.attention = IctSelfAttention(config)
+        self.layer_norm_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_eps)
+        self.mlp = nn.Sequential(
+            nn.Linear(hidden_size, intermediate_size),
+            self.intermediate_act_fn,
+            nn.Linear(intermediate_size, hidden_size),
+            nn.Dropout(config.residual_dropout_prob),
+        )
+
+    def forward(self, hidden_states, output_attentions: bool = False):
+        self_attention_outputs = self.attention(self.layer_norm_1(hidden_states), output_attentions=output_attentions)
+        attention_output = self_attention_outputs[0]
+        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
+
+        hidden_states = hidden_states + attention_output
+        hidden_states = hidden_states + self.mlp(self.layer_norm_2(hidden_states))
+
+        outputs = (hidden_states,) + outputs
+
+        return outputs
+
+
+class IctEncoder(nn.Module):
+    def __init__(self, config: IctConfig) -> None:
+        super().__init__()
+        self.config = config
+        self.layers = nn.ModuleList([IctLayer(config) for _ in range(config.num_hidden_layers)])
+        self.gradient_checkpointing = False
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        output_attentions: bool = False,
+        output_hidden_states: bool = False,
+        return_dict: bool = True,
+    ) -> Union[tuple, BaseModelOutput]:
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+
+        for layer in self.layers:
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            if self.gradient_checkpointing and self.training:
+
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        return module(*inputs, output_attentions)
+
+                    return custom_forward
+
+                layer_outputs = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(layer),
+                    hidden_states,
+                )
+            else:
+                layer_outputs = layer(hidden_states, output_attentions)
+
+            hidden_states = layer_outputs[0]
+
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (layer_outputs[1],)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
+
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+
+
+class IctPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = IctConfig
+    base_model_prefix = "ict"
+    main_input_name = "pixel_values"
+    supports_gradient_checkpointing = True
+    _no_split_modules = []
+
+    def _init_weights(
+        self, module: Union[nn.Linear, nn.Embedding, nn.LayerNorm, nn.Conv2d, nn.ConvTranspose2d]
+    ) -> None:
+        """Initialize the weights"""
+        if isinstance(module, (nn.Linear, nn.Embedding, nn.Conv2d, nn.ConvTranspose2d)):
+            module.weight.data = nn.init.normal_(
+                module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range
+            ).to(module.weight.dtype)
+            if isinstance(module, (nn.Linear, nn.Conv2d, nn.ConvTranspose2d)) and module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+    def _set_gradient_checkpointing(self, module, value: bool = False) -> None:
+        if isinstance(module, IctEncoder):
+            module.gradient_checkpointing = value
+
+
+class IctTransformerModel(IctPreTrainedModel):
+    def __init__(self, config: IctConfig, use_mask_token: bool = False):
+        super().__init__(config)
+        self.config = config
+
+        self.embeddings = IctEmbeddings(config, use_mask_token=use_mask_token)
+        self.encoder = IctEncoder(config)
+
+        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.pooler = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embeddings.token_embedding
+
+    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
+        """
+        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
+        class PreTrainedModel
+        """
+        for layer, heads in heads_to_prune.items():
+            self.encoder.layers[layer].attention.prune_heads(heads)
+
+    def forward(
+        self,
+        pixel_values: Optional[torch.Tensor] = None,
+        bool_masked_pos: Optional[torch.BoolTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPooling]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        embedding_output = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = encoder_outputs[0]
+        sequence_output = self.layernorm(sequence_output)
+        pooled_output = self.pooler(sequence_output)
+
+        if not return_dict:
+            head_outputs = (sequence_output, pooled_output) if pooled_output is not None else (sequence_output,)
+            return head_outputs + encoder_outputs[1:]
+
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+class IctResnetBlock(nn.Module):
+    """
+    ResNet block without the final ReLU (https://torch.ch/blog/2016/02/04/resnets.html).
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.conv_block = nn.Sequential(
+            nn.ReflectionPad2d(2),
+            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=0, dilation=2),
+            nn.ReLU(True),
+            nn.ReflectionPad2d(1),
+            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=0, dilation=1),
+        )
+
+    def forward(self, x):
+        out = x + self.conv_block(x)
+        return out
+
+
+class IctInpaintGenerator(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+
+        self.encoder = nn.Sequential(
+            nn.ReflectionPad2d(3),
+            nn.Conv2d(in_channels=6, out_channels=64, kernel_size=7, padding=0),
+            nn.ReLU(True),
+            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=4, stride=2, padding=1),
+            nn.ReLU(True),
+            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=4, stride=2, padding=1),
+            nn.ReLU(True),
+        )
+
+        blocks = [IctResnetBlock() for _ in range(config.num_residual_blocks)]
+
+        self.middle = nn.Sequential(*blocks)
+
+        self.decoder = nn.Sequential(
+            nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=4, stride=2, padding=1),
+            nn.ReLU(True),
+            nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1),
+            nn.ReLU(True),
+            nn.ReflectionPad2d(3),
+            nn.Conv2d(in_channels=64, out_channels=3, kernel_size=7, padding=0),
+        )
+
+    def forward(self, x):
+        x = self.encoder(x)
+        x = self.middle(x)
+        x = self.decoder(x)
+        x = (torch.tanh(x) + 1) / 2
+
+        return x
+
+
+class VGG19(nn.Module):
+    def __init__(self):
+        super().__init__()
+        features = models.vgg19(pretrained=True).features
+        self.relu1_1 = torch.nn.Sequential()
+        self.relu1_2 = torch.nn.Sequential()
+
+        self.relu2_1 = torch.nn.Sequential()
+        self.relu2_2 = torch.nn.Sequential()
+
+        self.relu3_1 = torch.nn.Sequential()
+        self.relu3_2 = torch.nn.Sequential()
+        self.relu3_3 = torch.nn.Sequential()
+        self.relu3_4 = torch.nn.Sequential()
+
+        self.relu4_1 = torch.nn.Sequential()
+        self.relu4_2 = torch.nn.Sequential()
+        self.relu4_3 = torch.nn.Sequential()
+        self.relu4_4 = torch.nn.Sequential()
+
+        self.relu5_1 = torch.nn.Sequential()
+        self.relu5_2 = torch.nn.Sequential()
+        self.relu5_3 = torch.nn.Sequential()
+        self.relu5_4 = torch.nn.Sequential()
+
+        for x in range(2):
+            self.relu1_1.add_module(str(x), features[x])
+
+        for x in range(2, 4):
+            self.relu1_2.add_module(str(x), features[x])
+
+        for x in range(4, 7):
+            self.relu2_1.add_module(str(x), features[x])
+
+        for x in range(7, 9):
+            self.relu2_2.add_module(str(x), features[x])
+
+        for x in range(9, 12):
+            self.relu3_1.add_module(str(x), features[x])
+
+        for x in range(12, 14):
+            self.relu3_2.add_module(str(x), features[x])
+
+        for x in range(14, 16):
+            self.relu3_3.add_module(str(x), features[x])
+
+        for x in range(16, 18):
+            self.relu3_4.add_module(str(x), features[x])
+
+        for x in range(18, 21):
+            self.relu4_1.add_module(str(x), features[x])
+
+        for x in range(21, 23):
+            self.relu4_2.add_module(str(x), features[x])
+
+        for x in range(23, 25):
+            self.relu4_3.add_module(str(x), features[x])
+
+        for x in range(25, 27):
+            self.relu4_4.add_module(str(x), features[x])
+
+        for x in range(27, 30):
+            self.relu5_1.add_module(str(x), features[x])
+
+        for x in range(30, 32):
+            self.relu5_2.add_module(str(x), features[x])
+
+        for x in range(32, 34):
+            self.relu5_3.add_module(str(x), features[x])
+
+        for x in range(34, 36):
+            self.relu5_4.add_module(str(x), features[x])
+
+        # don't need the gradients, just want the features
+        for param in self.parameters():
+            param.requires_grad = False
+
+    def forward(self, x):
+        relu1_1 = self.relu1_1(x)
+        relu1_2 = self.relu1_2(relu1_1)
+
+        relu2_1 = self.relu2_1(relu1_2)
+        relu2_2 = self.relu2_2(relu2_1)
+
+        relu3_1 = self.relu3_1(relu2_2)
+        relu3_2 = self.relu3_2(relu3_1)
+        relu3_3 = self.relu3_3(relu3_2)
+        relu3_4 = self.relu3_4(relu3_3)
+
+        relu4_1 = self.relu4_1(relu3_4)
+        relu4_2 = self.relu4_2(relu4_1)
+        relu4_3 = self.relu4_3(relu4_2)
+        relu4_4 = self.relu4_4(relu4_3)
+
+        relu5_1 = self.relu5_1(relu4_4)
+        relu5_2 = self.relu5_2(relu5_1)
+        relu5_3 = self.relu5_3(relu5_2)
+        relu5_4 = self.relu5_4(relu5_3)
+
+        out = {
+            "relu1_1": relu1_1,
+            "relu1_2": relu1_2,
+            "relu2_1": relu2_1,
+            "relu2_2": relu2_2,
+            "relu3_1": relu3_1,
+            "relu3_2": relu3_2,
+            "relu3_3": relu3_3,
+            "relu3_4": relu3_4,
+            "relu4_1": relu4_1,
+            "relu4_2": relu4_2,
+            "relu4_3": relu4_3,
+            "relu4_4": relu4_4,
+            "relu5_1": relu5_1,
+            "relu5_2": relu5_2,
+            "relu5_3": relu5_3,
+            "relu5_4": relu5_4,
+        }
+        return out
+
+
+class IctAdversarialLoss(nn.Module):
+    r"""
+    ICT Adversarial loss https://arxiv.org/abs/1711.10337
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.gan_loss_function = config.gan_loss_function
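+        # Non-persistent buffers follow the module across devices/dtypes without entering the state dict
+        self.register_buffer("real_label", torch.tensor(1.0), persistent=False)
+        self.register_buffer("fake_label", torch.tensor(0.0), persistent=False)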
+
+        if self.gan_loss_function == "nsgan":
+            self.criterion = nn.BCELoss()
+
+        elif self.gan_loss_function == "lsgan":
+            self.criterion = nn.MSELoss()
+
+        elif self.gan_loss_function == "hinge":
+            self.criterion = nn.ReLU()
+
+        else:
+            raise ValueError("`gan_loss_function` has to be `nsgan`, `lsgan`, or `hinge`.")
+
+    def forward(self, outputs, is_real, is_discriminator=False):
+        if self.gan_loss_function == "hinge":
+            if is_discriminator:
+                if is_real:
+                    outputs = -outputs
+                return self.criterion(1 + outputs).mean()
+            else:
+                return (-outputs).mean()
+
+        labels = (self.real_label if is_real else self.fake_label).expand_as(outputs)
+        loss = self.criterion(outputs, labels)
+        return loss
+
+
+class IctStyleLoss(nn.Module):
+    r"""
+    Style loss, VGG-based https://arxiv.org/abs/1603.08155
+    https://github.com/dxyang/StyleTransfer/blob/master/utils.py
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.vgg = VGG19()
+        self.criterion = torch.nn.L1Loss()
+
+    def compute_gram_matrix(self, x):
+        batch_size, channels, height, width = x.size()
+        features = x.view(batch_size, channels, width * height)
+        gram = features.bmm(features.transpose(1, 2)) / (height * width * channels)
+
+        return gram
+
+    def forward(self, x, y):
+        # Compute features
+        x_vgg, y_vgg = self.vgg(x), self.vgg(y)
+
+        # Compute loss
+        style_loss = 0.0
+        style_loss += self.criterion(
+            self.compute_gram_matrix(x_vgg["relu2_2"]), self.compute_gram_matrix(y_vgg["relu2_2"])
+        )
+        style_loss += self.criterion(
+            self.compute_gram_matrix(x_vgg["relu3_4"]), self.compute_gram_matrix(y_vgg["relu3_4"])
+        )
+        style_loss += self.criterion(
+            self.compute_gram_matrix(x_vgg["relu4_4"]), self.compute_gram_matrix(y_vgg["relu4_4"])
+        )
+        style_loss += self.criterion(
+            self.compute_gram_matrix(x_vgg["relu5_2"]), self.compute_gram_matrix(y_vgg["relu5_2"])
+        )
+
+        return style_loss
+
+
+class IctPerceptualLoss(nn.Module):
+    r"""
+    Perceptual loss, VGG-based https://arxiv.org/abs/1603.08155
+    https://github.com/dxyang/StyleTransfer/blob/master/utils.py
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.vgg = VGG19()
+        self.criterion = torch.nn.L1Loss()
+        self.weights = [1.0, 1.0, 1.0, 1.0, 1.0]
+
+    def forward(self, x, y):
+        # Compute features
+        x_vgg, y_vgg = self.vgg(x), self.vgg(y)
+
+        content_loss = 0.0
+        content_loss += self.weights[0] * self.criterion(x_vgg["relu1_1"], y_vgg["relu1_1"])
+        content_loss += self.weights[1] * self.criterion(x_vgg["relu2_1"], y_vgg["relu2_1"])
+        content_loss += self.weights[2] * self.criterion(x_vgg["relu3_1"], y_vgg["relu3_1"])
+        content_loss += self.weights[3] * self.criterion(x_vgg["relu4_1"], y_vgg["relu4_1"])
+        content_loss += self.weights[4] * self.criterion(x_vgg["relu5_1"], y_vgg["relu5_1"])
+
+        return content_loss
+
+
+class IctGuidedUpsampler(IctPreTrainedModel):
+    def __init__(self, config: IctConfig):
+        super().__init__(config)
+
+        self.generator = IctInpaintGenerator(config)
+        self.adversarial_loss = IctAdversarialLoss(config)
+        self.l1_loss = nn.L1Loss()
+        self.style_loss = IctStyleLoss()
+        self.perceptual_loss = IctPerceptualLoss()
+        self.output_image_size = config.output_image_size
+
+        self.post_init()
+
+    # modified from https://github.com/raywzy/ICT/blob/59dd12d374d47cdf0dce90923017ca3657e6aa0b/Guided_Upsample/src/dataset_my.py#L203-L209
+    # and https://github.com/raywzy/ICT/blob/59dd12d374d47cdf0dce90923017ca3657e6aa0b/Guided_Upsample/src/dataset_my.py#L183-L186
+    def resize(self, img: torch.Tensor, target_height: int, target_width: int):
+        img = img.to(self.device)
+        # If the image tensor is in the format (N, H, W, C), change it to (N, C, H, W)
+        if img.dim() == 4 and img.shape[1] > img.shape[3]:
+            img = img.permute(0, 3, 1, 2)
+
+        # Masks arrive without a channel dimension (N, H, W); add one so they become (N, 1, H, W)
+        if img.dim() == 3:
+            img = img.unsqueeze(1)
+
+        # Center crop for non-square images
+        _, _, height, width = img.shape
+        if height != width:
+            side_length = min(height, width)
+            height_offset = (height - side_length) // 2
+            width_offset = (width - side_length) // 2
+            img = img[:, :, height_offset : height_offset + side_length, width_offset : width_offset + side_length]
+
+        img = img.float()
+        img = F.interpolate(img, size=(target_height, target_width), mode="bicubic")
+
+        return img
+
+    # modified from https://github.com/raywzy/ICT/blob/59dd12d374d47cdf0dce90923017ca3657e6aa0b/Guided_Upsample/src/models.py#L165-L183
+    def forward(self, images: torch.Tensor, appearance_priors: torch.Tensor, masks: torch.Tensor):
+        images = self.resize(images, self.output_image_size, self.output_image_size)
+        appearance_priors = self.resize(appearance_priors, self.output_image_size, self.output_image_size)
+        masks = self.resize(masks, self.output_image_size, self.output_image_size)
+
+        images_masked = (images * (1 - masks).float()) + masks
+
+        inputs = torch.cat((images_masked, appearance_priors), dim=1)
+        outputs = self.generator(inputs)
+
+        return outputs
+
+
+ICT_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning
+    heads, etc.).
+
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`IctConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+ICT_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, height * width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`IctImageProcessor.__call__`]
+            for details.
+        bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, height * width)`, *optional*):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Generate random
+            masks if not provided.
+        clusters (`np.ndarray` or `torch.Tensor` of shape `(n_clusters, 3)`):
+            Color clusters used to map the quantized pixel values back to RGB before they are fed to the Guided
+            Upsampler.
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(ICT_START_DOCSTRING)
+class IctModel(IctPreTrainedModel):
+    config_class = IctConfig
+
+    def __init__(self, config: IctConfig, use_mask_token: bool = True):
+        super().__init__(config)
+
+        self.config = config
+        self.transformer = IctTransformerModel(config, use_mask_token=use_mask_token)
+        self.guided_upsampler = IctGuidedUpsampler(config)
+        self.clusters = config.clusters
+        self.image_size = config.image_size
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.transformer.embeddings.token_embedding
+
+    def top_k_logits(self, logits, k):
+        values, indices = torch.topk(logits, k)
+        new_logits = torch.full_like(logits, -float("inf"))
+        # Scatter each row's top-k values back to that row's own top-k positions
+        new_logits.scatter_(-1, indices, values)
+        return new_logits
+
+    def sample_mask(self, pixel_values, logits, bool_masked_pos, temperature=1.0, top_k=50):
+        logits = logits / temperature
+        bool_masked_pos_expanded = bool_masked_pos.expand(logits.shape[0], logits.shape[1])
+
+        logits = logits[bool_masked_pos_expanded].view(-1, logits.size(-1))
+        logits = self.top_k_logits(logits, top_k)
+        probs = nn.functional.softmax(logits, dim=-1)
+        pred = torch.multinomial(probs, num_samples=1)
+
+        output = torch.zeros_like(pixel_values)
+        output[~bool_masked_pos_expanded] = pixel_values[~bool_masked_pos_expanded]
+        output[bool_masked_pos_expanded] = pred.squeeze()
+
+        return output
+
+    @add_start_docstrings_to_model_forward(ICT_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=MaskedImageModelingOutput,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def forward(
+        self,
+        pixel_values: Optional[torch.Tensor],
+        bool_masked_pos: Optional[torch.BoolTensor] = None,
+        clusters: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, MaskedImageModelingOutput]:
+        r"""
+        Returns:
+
+        Example:
+        ```python
+        >>> import torch
+        >>> import numpy as np
+        >>> from PIL import Image
+        >>> import requests
+
+        >>> from transformers import AutoImageProcessor, IctModel
+
+        >>> image_processor = AutoImageProcessor.from_pretrained("sheonhan/ict-imagenet-256")
+        >>> model = IctModel.from_pretrained("sheonhan/ict-imagenet-256")
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+        >>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
+        >>> clusters = image_processor.clusters
+
+        >>> # create random boolean mask of shape (batch_size, num_patches)
+        >>> bool_masked_pos = torch.randint(low=0, high=2, size=(pixel_values.shape[0], pixel_values.shape[1])).bool()
+
+        >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos, clusters=clusters)
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        if clusters is None:
+            raise ValueError("You have to specify clusters")
+
+        outputs = self.transformer(
+            pixel_values,
+            bool_masked_pos=bool_masked_pos,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        logits = outputs[1]
+        batch_size, sequence_length, _ = logits.shape
+        height = width = math.floor(sequence_length**0.5)
+
+        original_images = clusters[pixel_values].view(batch_size, height, width, 3)
+
+        recovered_pixel_values = self.sample_mask(
+            pixel_values, logits, bool_masked_pos, temperature=self.config.temperature, top_k=self.config.top_k
+        )
+        recovered_images = clusters[recovered_pixel_values].view(batch_size, height, width, 3)
+
+        if bool_masked_pos is None:
+            reshaped_bool_masked_pos = torch.full((batch_size, height, width), 1)
+        else:
+            reshaped_bool_masked_pos = torch.tile(bool_masked_pos, (batch_size, 1, 1))
+
+        reconstructed_pixel_values = self.guided_upsampler(original_images, recovered_images, reshaped_bool_masked_pos)
+
+        loss = None
+        if bool_masked_pos is not None:
+            bool_masked_pos = bool_masked_pos.reshape(-1, self.image_size, self.image_size)
+            bool_masked_pos.repeat_interleave(1, 1).repeat_interleave(1, 2).unsqueeze(1).contiguous()
+            # nn.functional.l1_loss(pixel_values, reconstructed_pixel_values, reduction="none")
+            # loss = (reconstruction_loss * mask).sum() / (mask.sum() + 1e-5) / self.config.num_channels
+
+        if not return_dict:
+            output = (reconstructed_pixel_values,) + outputs[2:]  # TODO
+            return ((loss,) + output) if loss is not None else output
+
+        return MaskedImageModelingOutput(
+            loss=loss,
+            reconstruction=reconstructed_pixel_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
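Review note on `top_k_logits` / `sample_mask`: the filter is meant to keep the `k` largest logits of every row and push the rest to `-inf` before the softmax, so each masked position samples from its own top-k set. Below is a minimal pure-Python sketch of that per-row behavior, with no torch dependency. The helper names `top_k_filter` and `softmax` are hypothetical and not part of this PR; the sketch only illustrates the intended semantics and is not the model's implementation.

```python
import math


def top_k_filter(rows, k):
    """Keep the k largest logits in each row; set every other entry to -inf."""
    filtered = []
    for row in rows:
        # Indices of this row's k largest entries (ties resolved by position).
        keep = set(sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k])
        filtered.append([v if i in keep else -math.inf for i, v in enumerate(row)])
    return filtered


def softmax(row):
    """Numerically stable softmax; entries at -inf get probability 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Two rows whose top-2 entries sit at different column indices.
logits = [[2.0, 0.5, 1.0, -1.0], [-3.0, 4.0, 0.0, 1.5]]
filtered = top_k_filter(logits, k=2)
probs = [softmax(row) for row in filtered]
```

Each row keeps its own top-k positions here: columns 0 and 2 for the first row, columns 1 and 3 for the second. A tensor version needs a per-row scatter of the `torch.topk` indices to get the same result.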
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 2c40f7143d4b..664231d5f868 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -3670,6 +3670,23 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["torch"])
 
 
+ICT_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class IctModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class IctPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST = None
 
 
diff --git a/src/transformers/utils/dummy_vision_objects.py b/src/transformers/utils/dummy_vision_objects.py
index bfb3cdcaff5a..9bfef1e28fe9 100644
--- a/src/transformers/utils/dummy_vision_objects.py
+++ b/src/transformers/utils/dummy_vision_objects.py
@@ -233,6 +233,13 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["vision"])
 
 
+class IctImageProcessor(metaclass=DummyObject):
+    _backends = ["vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["vision"])
+
+
 class ImageGPTFeatureExtractor(metaclass=DummyObject):
     _backends = ["vision"]
 
diff --git a/tests/models/ict/__init__.py b/tests/models/ict/__init__.py
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/models/ict/test_image_processing_ict.py b/tests/models/ict/test_image_processing_ict.py
new file mode 100644
index 000000000000..e01bef03e241
--- /dev/null
+++ b/tests/models/ict/test_image_processing_ict.py
@@ -0,0 +1,269 @@
+# coding=utf-8
+# Copyright 2023 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import os
+import tempfile
+import unittest
+
+import numpy as np
+from datasets import load_dataset
+
+from transformers.testing_utils import check_json_file_has_correct_format, require_torch, require_vision, slow
+from transformers.utils import is_torch_available, is_vision_available
+
+from ...test_image_processing_common import ImageProcessingSavingTestMixin, prepare_image_inputs
+
+
+if is_torch_available():
+    import torch
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import IctImageProcessor
+
+
+class IctImageProcessingTester(unittest.TestCase):
+    def __init__(
+        self,
+        parent,
+        batch_size=7,
+        num_channels=3,
+        image_size=18,
+        min_resolution=30,
+        max_resolution=400,
+        do_resize=True,
+        size=None,
+        do_normalize=True,
+        image_mean=[0.5, 0.5, 0.5],
+        image_std=[0.5, 0.5, 0.5],
+    ):
+        size = size if size is not None else {"height": 18, "width": 18}
+        self.parent = parent
+        self.batch_size = batch_size
+        self.num_channels = num_channels
+        self.image_size = image_size
+        self.min_resolution = min_resolution
+        self.max_resolution = max_resolution
+        self.do_resize = do_resize
+        self.size = size
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean
+        self.image_std = image_std
+
+    def prepare_image_processor_dict(self):
+        return {
+            # here we create 2 clusters for the sake of simplicity
+            "clusters": np.asarray([[241.0, 212.0, 177.0], [50.0, 125.0, 197.0]]),
+            "image_mean": self.image_mean,
+            "image_std": self.image_std,
+            "do_normalize": self.do_normalize,
+            "do_resize": self.do_resize,
+            "size": self.size,
+        }
+
+
+@require_torch
+@require_vision
+class IctImageProcessingTest(ImageProcessingSavingTestMixin, unittest.TestCase):
+    image_processing_class = IctImageProcessor if is_vision_available() else None
+
+    def setUp(self):
+        self.image_processor_tester = IctImageProcessingTester(self)
+
+    @property
+    def image_processor_dict(self):
+        return self.image_processor_tester.prepare_image_processor_dict()
+
+    def test_image_processor_properties(self):
+        image_processing = self.image_processing_class(**self.image_processor_dict)
+        self.assertTrue(hasattr(image_processing, "clusters"))
+        self.assertTrue(hasattr(image_processing, "image_mean"))
+        self.assertTrue(hasattr(image_processing, "image_std"))
+        self.assertTrue(hasattr(image_processing, "do_normalize"))
+        self.assertTrue(hasattr(image_processing, "do_resize"))
+        self.assertTrue(hasattr(image_processing, "size"))
+
+    def test_image_processor_from_dict_with_kwargs(self):
+        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
+        self.assertEqual(image_processor.size, {"height": 18, "width": 18})
+
+        image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
+        self.assertEqual(image_processor.size, {"height": 42, "width": 42})
+
+    def test_image_processor_to_json_file(self):
+        image_processor_first = self.image_processing_class(**self.image_processor_dict)
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            json_file_path = os.path.join(tmpdirname, "image_processor.json")
+            image_processor_first.to_json_file(json_file_path)
+            image_processor_second = self.image_processing_class.from_json_file(json_file_path).to_dict()
+
+        image_processor_first = image_processor_first.to_dict()
+        for key, value in image_processor_first.items():
+            if key == "clusters":
+                self.assertTrue(np.array_equal(value, image_processor_second[key]))
+            else:
+                self.assertEqual(image_processor_second[key], value)
+
+    def test_image_processor_to_json_string(self):
+        image_processor = self.image_processing_class(**self.image_processor_dict)
+        obj = json.loads(image_processor.to_json_string())
+        for key, value in self.image_processor_dict.items():
+            if key == "clusters":
+                self.assertTrue(np.array_equal(value, obj[key]))
+            else:
+                self.assertEqual(obj[key], value)
+
+    def test_image_processor_from_and_save_pretrained(self):
+        image_processor_first = self.image_processing_class(**self.image_processor_dict)
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            saved_file = image_processor_first.save_pretrained(tmpdirname)[0]
+            check_json_file_has_correct_format(saved_file)
+            image_processor_second = self.image_processing_class.from_pretrained(tmpdirname).to_dict()
+
+        image_processor_first = image_processor_first.to_dict()
+        for key, value in image_processor_first.items():
+            if key == "clusters":
+                self.assertTrue(np.array_equal(value, image_processor_second[key]))
+            else:
+                self.assertEqual(image_processor_second[key], value)
+
+    def test_batch_feature(self):
+        pass
+
+    def test_call_pil(self):
+        # Initialize image_processing
+        image_processing = self.image_processing_class(**self.image_processor_dict)
+        # create random PIL images
+        image_inputs = prepare_image_inputs(self.image_processor_tester, equal_resolution=False)
+        for image in image_inputs:
+            self.assertIsInstance(image, Image.Image)
+
+        # Test not batched input
+        encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                1,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+        # Test batched
+        encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                self.image_processor_tester.batch_size,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+
+    def test_call_numpy(self):
+        # Initialize image_processing
+        image_processing = self.image_processing_class(**self.image_processor_dict)
+        # create random numpy tensors
+        image_inputs = prepare_image_inputs(self.image_processor_tester, equal_resolution=False, numpify=True)
+        for image in image_inputs:
+            self.assertIsInstance(image, np.ndarray)
+
+        # Test not batched input
+        encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                1,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+
+        # Test batched
+        encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                self.image_processor_tester.batch_size,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+
+    def test_call_pytorch(self):
+        # Initialize image_processing
+        image_processing = self.image_processing_class(**self.image_processor_dict)
+        # create random PyTorch tensors
+        image_inputs = prepare_image_inputs(self.image_processor_tester, equal_resolution=False, torchify=True)
+        for image in image_inputs:
+            self.assertIsInstance(image, torch.Tensor)
+
+        # Test not batched input
+        encoded_images = image_processing(image_inputs[0], return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                1,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+
+        # Test batched
+        encoded_images = image_processing(image_inputs, return_tensors="pt").pixel_values
+        self.assertEqual(
+            encoded_images.shape,
+            (
+                self.image_processor_tester.batch_size,
+                self.image_processor_tester.size["height"] * self.image_processor_tester.size["width"],
+            ),
+        )
+
+
+def prepare_images():
+    dataset = load_dataset("hf-internal-testing/fixtures_image_utils", split="test")
+
+    image1 = Image.open(dataset[4]["file"])
+    image2 = Image.open(dataset[5]["file"])
+
+    images = [image1, image2]
+
+    return images
+
+
+@require_vision
+@require_torch
+class IctImageProcessorIntegrationTest(unittest.TestCase):
+    @slow
+    def test_image(self):
+        image_processing = IctImageProcessor.from_pretrained("sheonhan/ict-imagenet-256")
+
+        images = prepare_images()
+
+        # test non-batched
+        encoding = image_processing(images[0], return_tensors="pt")
+
+        self.assertIsInstance(encoding.pixel_values, torch.LongTensor)
+        self.assertEqual(encoding.pixel_values.shape, (1, 1024))
+
+        expected_slice = [306, 191, 191]
+        self.assertEqual(encoding.pixel_values[0, :3].tolist(), expected_slice)
+
+        # test batched
+        encoding = image_processing(images, return_tensors="pt")
+
+        self.assertIsInstance(encoding.pixel_values, torch.LongTensor)
+        self.assertEqual(encoding.pixel_values.shape, (2, 1024))
+
+        expected_slice = [303, 13, 13]
+        self.assertEqual(encoding.pixel_values[1, -3:].tolist(), expected_slice)
diff --git a/tests/models/ict/test_modeling_ict.py b/tests/models/ict/test_modeling_ict.py
new file mode 100644
index 000000000000..a7d7e1a969c9
--- /dev/null
+++ b/tests/models/ict/test_modeling_ict.py
@@ -0,0 +1,259 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch ICT model. """
+
+
+import inspect
+import unittest
+
+from transformers import IctConfig
+from transformers.testing_utils import (
+    require_torch,
+    require_vision,
+    slow,
+    torch_device,
+)
+from transformers.utils import cached_property, is_torch_available, is_vision_available
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, ids_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+    import torch
+    from torch import nn
+
+    torch.manual_seed(3)
+
+    from transformers import IctModel
+    from transformers.models.ict.modeling_ict import ICT_PRETRAINED_MODEL_ARCHIVE_LIST
+
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import IctImageProcessor
+
+
+class IctModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=13,
+        vocab_size=512,
+        hidden_size=32,
+        num_hidden_layers=6,
+        num_attention_heads=4,
+        num_residual_blocks=8,
+        intermediate_size=37,
+        activation_function="gelu",
+        embedding_dropout_prob=0.0,
+        residual_dropout_prob=0.0,
+        attention_probs_dropout_prob=0.0,
+        initializer_range=0.02,
+        layer_norm_eps=1e-12,
+        image_size=32,
+        num_channels=3,
+        qkv_bias=False,
+        temperature=1.0,
+        top_k=50,
+        gan_loss_function="nsgan",
+        output_image_size=256,
+        scope=None,
+        is_training=True,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_residual_blocks = num_residual_blocks
+        self.intermediate_size = intermediate_size
+        self.activation_function = activation_function
+        self.embedding_dropout_prob = embedding_dropout_prob
+        self.residual_dropout_prob = residual_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.image_size = image_size
+        self.num_channels = num_channels
+        self.qkv_bias = qkv_bias
+        self.temperature = temperature
+        self.top_k = top_k
+        self.gan_loss_function = gan_loss_function
+        self.output_image_size = output_image_size
+
+        self.seq_length = image_size * image_size
+        self.scope = scope
+        self.is_training = is_training
+
+    def prepare_config_and_inputs(self):
+        pixel_values = ids_tensor([self.batch_size, self.image_size * self.image_size], self.vocab_size)
+        bool_masked_pos = torch.randint(low=0, high=2, size=(1, pixel_values.shape[1]), device=torch_device).bool()
+
+        clusters = torch.rand(512, 3, device=torch_device)
+
+        config = self.get_config()
+
+        return config, pixel_values, bool_masked_pos, clusters
+
+    def get_config(self):
+        return IctConfig(
+            vocab_size=self.vocab_size,
+            hidden_size=self.hidden_size,
+            num_hidden_layers=self.num_hidden_layers,
+            num_attention_heads=self.num_attention_heads,
+            num_residual_blocks=self.num_residual_blocks,
+            intermediate_size=self.intermediate_size,
+            activation_function=self.activation_function,
+            embedding_dropout_prob=self.embedding_dropout_prob,
+            residual_dropout_prob=self.residual_dropout_prob,
+            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+            initializer_range=self.initializer_range,
+            layer_norm_eps=self.layer_norm_eps,
+            image_size=self.image_size,
+            num_channels=self.num_channels,
+            qkv_bias=self.qkv_bias,
+            temperature=self.temperature,
+            top_k=self.top_k,
+            gan_loss_function=self.gan_loss_function,
+            output_image_size=self.output_image_size,
+        )
+
+    def create_and_check_model(self, config, pixel_values, bool_masked_pos, clusters):
+        model = IctModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        result = model(pixel_values, bool_masked_pos, clusters)
+        self.parent.assertEqual(
+            result.reconstruction.shape,
+            (self.batch_size, self.num_channels, self.output_image_size, self.output_image_size),
+        )
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        (config, pixel_values, bool_masked_pos, clusters) = config_and_inputs
+        inputs_dict = {
+            "pixel_values": pixel_values,
+            "bool_masked_pos": bool_masked_pos,
+            "clusters": clusters,
+        }
+        return config, inputs_dict
+
+
+@require_torch
+class IctModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
+    """
+    Here we also overwrite some of the tests of test_modeling_common.py, as ICT does not use input_ids, inputs_embeds,
+    attention_mask and seq_length.
+    """
+
+    all_model_classes = (IctModel,) if is_torch_available() else ()
+    pipeline_model_mapping = {"feature-extraction": IctModel} if is_torch_available() else {}
+    fx_compatible = False
+
+    test_pruning = False
+    test_resize_embeddings = False
+    test_head_masking = False
+
+    def setUp(self):
+        self.model_tester = IctModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=IctConfig, has_text_modality=False, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    @unittest.skip(reason="ICT does not use inputs_embeds")
+    def test_inputs_embeds(self):
+        pass
+
+    def test_model_common_attributes(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            self.assertIsInstance(model.get_input_embeddings(), nn.Module)
+            x = model.get_output_embeddings()
+            self.assertTrue(x is None or isinstance(x, nn.Linear))
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names = [*signature.parameters.keys()]
+
+            expected_arg_names = ["pixel_values"]
+            self.assertListEqual(arg_names[:1], expected_arg_names)
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    @slow
+    def test_model_from_pretrained(self):
+        for model_name in ICT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
+            model = IctModel.from_pretrained(model_name)
+            self.assertIsNotNone(model)
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
+    return image
+
+
+@require_torch
+@require_vision
+class IctModelIntegrationTest(unittest.TestCase):
+    @cached_property
+    def default_image_processor(self):
+        return IctImageProcessor.from_pretrained("sheonhan/ict-imagenet-256") if is_vision_available() else None
+
+    @slow
+    def test_inference_masked_image_modeling(self):
+        model = IctModel.from_pretrained("sheonhan/ict-imagenet-256").to(torch_device)
+
+        image_processor = self.default_image_processor
+        image = prepare_img()
+        inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
+
+        pixel_values = inputs.pixel_values
+        seq_length = pixel_values.shape[1]
+
+        bool_masked_pos = torch.randint(low=0, high=2, size=(1, seq_length)).bool().to(torch_device)
+        clusters = inputs.clusters
+
+        # forward pass
+        with torch.no_grad():
+            outputs = model(
+                pixel_values=pixel_values,
+                bool_masked_pos=bool_masked_pos,
+                clusters=clusters,
+            )
+
+        # verify the reconstruction
+        expected_shape = torch.Size((1, 3, 256, 256))
+        self.assertEqual(outputs.reconstruction.shape, expected_shape)
+
+        expected_slice = torch.tensor(
+            [[2.3445, 2.6889, 2.7313], [1.0530, 1.2416, 0.5699], [0.2205, 0.7749, 0.3953]]
+        ).to(torch_device)
+
+        self.assertTrue(torch.allclose(outputs.reconstruction[0, :3, :3, :3], expected_slice, atol=1e-4))
diff --git a/utils/check_copies.py b/utils/check_copies.py
index 959c7b2d329b..89bcc9640660 100644
--- a/utils/check_copies.py
+++ b/utils/check_copies.py
@@ -499,6 +499,7 @@ def check_model_list_copy(overwrite=False, max_per_line=119):
     "DonutSwin": "Swin Transformer",
     "Marian": "MarianMT",
     "MaskFormerSwin": "Swin Transformer",
+    "ICT": "Image Completion Transformer",
     "OpenAI GPT-2": "GPT-2",
     "OpenAI GPT": "GPT",
     "Perceiver": "Perceiver IO",
diff --git a/utils/check_repo.py b/utils/check_repo.py
index db947d834bad..118109b0f4a8 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -52,6 +52,8 @@
     "MaskFormerSwinPreTrainedModel",
     "BridgeTowerTextModel",
     "BridgeTowerVisionModel",
+    "IctGuidedUpsampler",
+    "IctTransformerModel",
 ]
 
 # Update this list for models that are not tested with a comment explaining the reason it should not be.