diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 122a884c58a9..0bf7e59df226 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -139,15 +139,15 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
 2. Clone your fork to your local disk, and add the base repository as a remote:
 
    ```bash
-   $ git clone git@github.com:<your Github handle>/transformers.git
-   $ cd transformers
-   $ git remote add upstream https://github.com/huggingface/transformers.git
+   git clone git@github.com:<your Github handle>/transformers.git
+   cd transformers
+   git remote add upstream https://github.com/huggingface/transformers.git
    ```
 
 3. Create a new branch to hold your development changes:
 
    ```bash
-   $ git checkout -b a-descriptive-name-for-my-changes
+   git checkout -b a-descriptive-name-for-my-changes
    ```
 
    🚨 **Do not** work on the `main` branch!
@@ -155,7 +155,7 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
 4. Set up a development environment by running the following command in a virtual environment:
 
    ```bash
-   $ pip install -e ".[dev]"
+   pip install -e ".[dev]"
    ```
 
    If 🤗 Transformers was already installed in the virtual environment, remove
@@ -176,7 +176,7 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    passes. Run the tests impacted by your changes like this:
 
    ```bash
-   $ pytest tests/<TEST_TO_RUN>.py
+   pytest tests/<TEST_TO_RUN>.py
    ```
 
    For more information about tests, check out the
@@ -187,7 +187,7 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    that can't be automated in one go with:
 
    ```bash
-   $ make fixup
+   make fixup
    ```
 
    This target is also optimized to only work with files modified by the PR you're working on.
@@ -196,21 +196,21 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    style corrections:
 
    ```bash
-   $ make style
+   make style
    ```
 
    🤗 Transformers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality
    controls are run by the CI, but you can run the same checks with:
 
    ```bash
-   $ make quality
+   make quality
    ```
 
    Finally, we have a lot of scripts to make sure we didn't forget to update
    some files when adding a new model. You can run these scripts with:
 
    ```bash
-   $ make repo-consistency
+   make repo-consistency
    ```
 
    To learn more about those checks and how to fix any issues with them, check out the
@@ -220,13 +220,13 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    make sure you install the documentation builder:
    
    ```bash
-   $ pip install ".[docs]"
+   pip install ".[docs]"
    ```
 
    Run the following command from the root of the repository:
 
    ```bash
-   $ doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build
+   doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build
    ```
 
    This will build the documentation in the `~/tmp/test-build` folder where you can inspect the generated
@@ -236,8 +236,8 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    record your changes locally with `git commit`:
 
    ```bash
-   $ git add modified_file.py
-   $ git commit
+   git add modified_file.py
+   git commit
    ```
 
    Please remember to write [good commit
@@ -247,14 +247,14 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai
    repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer:
 
    ```bash
-   $ git fetch upstream
-   $ git rebase upstream/main
+   git fetch upstream
+   git rebase upstream/main
    ```
 
    Push your changes to your branch:
 
    ```bash
-   $ git push -u origin a-descriptive-name-for-my-changes
+   git push -u origin a-descriptive-name-for-my-changes
    ```
 
    If you've already opened a pull request, you'll need to force push with the `--force` flag. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally.
@@ -307,14 +307,14 @@ We like `pytest` and `pytest-xdist` because it's faster. From the root of the
 repository, specify a *path to a subfolder or a test file* to run the test.
 
 ```bash
-$ python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
+python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
 ```
 
 Similarly, for the `examples` directory, specify a *path to a subfolder or test file* to run the test. For example, the following command tests the text classification subfolder in the PyTorch `examples` directory:
 
 ```bash
-$ pip install -r examples/xxx/requirements.txt  # only needed the first time
-$ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
+pip install -r examples/xxx/requirements.txt  # only needed the first time
+python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
 ```
 
 In fact, this is actually how our `make test` and `make test-examples` commands are implemented (not including the `pip install`)!
@@ -333,8 +333,8 @@ Remember to specify a *path to a subfolder or a test file* to run the test. Othe
 </Tip>
 
 ```bash
-$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
-$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
+RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model
+RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification
 ```
 
 Like the slow tests, there are other environment variables available which not enabled by default during testing:
@@ -351,8 +351,8 @@ This means `unittest` is fully supported. Here's how to run tests with
 `unittest`:
 
 ```bash
-$ python -m unittest discover -s tests -t . -v
-$ python -m unittest discover -s examples -t examples -v
+python -m unittest discover -s tests -t . -v
+python -m unittest discover -s examples -t examples -v
 ```
 
 ### Style guide
@@ -386,8 +386,8 @@ When updating the main branch of a forked repository, please follow these steps
 2. If a PR is absolutely necessary, use the following steps after checking out your branch:
 
 ```bash
-$ git checkout -b your-branch-for-syncing
-$ git pull --squash --no-commit upstream main
-$ git commit -m '<your message without GitHub references>'
-$ git push --set-upstream origin your-branch-for-syncing
+git checkout -b your-branch-for-syncing
+git pull --squash --no-commit upstream main
+git commit -m '<your message without GitHub references>'
+git push --set-upstream origin your-branch-for-syncing
 ```
diff --git a/README.md b/README.md
index 4110c524c069..4c81448e2ab9 100644
--- a/README.md
+++ b/README.md
@@ -377,6 +377,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
@@ -498,3 +499,4 @@ We now have a [paper](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) you
     pages = "38--45"
 }
 ```
+
diff --git a/README_es.md b/README_es.md
index 728f596eabe2..aa83571ca0ce 100644
--- a/README_es.md
+++ b/README_es.md
@@ -365,6 +365,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
diff --git a/README_hd.md b/README_hd.md
index 0e61392f26ce..ff49b28426d4 100644
--- a/README_hd.md
+++ b/README_hd.md
@@ -337,6 +337,7 @@ conda install -c huggingface transformers
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [एक्स्टेंसिबल बहुभाषी प्रीट्रेनिंग और फाइनट्यूनिंग के साथ बहुभाषी अनुवाद](https://arxiv युकिंग टैंग, चाउ ट्रान, जियान ली, पेंग-जेन चेन, नमन गोयल, विश्रव चौधरी, जियाताओ गु, एंजेला फैन द्वारा .org/abs/2008.00401)।
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA से) कागज के साथ [Megatron-LM: मॉडल का उपयोग करके बहु-अरब पैरामीटर भाषा मॉडल का प्रशिक्षण Parallelism](https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा।
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA से) साथ वाला पेपर [Megatron-LM: ट्रेनिंग मल्टी-बिलियन पैरामीटर लैंग्वेज मॉडल्स यूजिंग मॉडल पैरेललिज़्म] (https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा पोस्ट किया गया।
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research से) Peng Wang, Cheng Da, and Cong Yao द्वारा अनुसंधान पत्र [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) के साथ जारी किया गया।
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (फ्रॉम Studio Ousia) साथ में पेपर [mLUKE: द पावर ऑफ एंटिटी रिप्रेजेंटेशन इन मल्टीलिंगुअल प्रीट्रेन्ड लैंग्वेज मॉडल्स](https://arxiv.org/abs/2110.08151) रयोकन री, इकुया यामाडा, और योशिमासा त्सुरोका द्वारा।
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [मोबाइलबर्ट: संसाधन-सीमित उपकरणों के लिए एक कॉम्पैक्ट टास्क-अज्ञेय बीईआरटी] (https://arxiv.org/abs/2004.02984) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, और Denny Zhou द्वारा पोस्ट किया गया।
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
diff --git a/README_ja.md b/README_ja.md
index 4edaeffb83b6..3803a6a2c5ee 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -399,6 +399,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook から) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan から公開された研究論文: [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401)
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research から) Peng Wang, Cheng Da, and Cong Yao から公開された研究論文: [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592)
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia から) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka から公開された研究論文: [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151)
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain から) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou から公開された研究論文: [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984)
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. から) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam から公開された研究論文: [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)
diff --git a/README_ko.md b/README_ko.md
index ce957d6a925e..18056a12a62e 100644
--- a/README_ko.md
+++ b/README_ko.md
@@ -314,6 +314,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook 에서) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 의 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 논문과 함께 발표했습니다.
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다.
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다.
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research 에서) Peng Wang, Cheng Da, and Cong Yao 의 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) 논문과 함께 발표했습니다.
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia 에서) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 의 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 논문과 함께 발표했습니다.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain 에서) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 의 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 논문과 함께 발표했습니다.
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. 에서) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 의 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 논문과 함께 발표했습니다.
diff --git a/README_zh-hans.md b/README_zh-hans.md
index ca224409143a..54fb9346b2de 100644
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -338,6 +338,7 @@ conda install -c huggingface transformers
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 由 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 发布。
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (来自 Alibaba Research) 伴随论文 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) 由 Peng Wang, Cheng Da, and Cong Yao 发布。
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (来自 Studio Ousia) 伴随论文 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 由 Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 发布。
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (来自 CMU/Google Brain) 伴随论文 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 由 Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 发布。
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (来自 Google Inc.) 伴随论文 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 由 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 发布。
diff --git a/README_zh-hant.md b/README_zh-hant.md
index e496a6f86c89..08ce35cbfff5 100644
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -350,6 +350,7 @@ conda install -c huggingface transformers
 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 773438798682..f164348a94aa 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -580,6 +580,8 @@
         title: LiLT
       - local: model_doc/lxmert
         title: LXMERT
+      - local: model_doc/mgp-str
+        title: MGP-STR
       - local: model_doc/oneformer
         title: OneFormer
       - local: model_doc/owlvit
diff --git a/docs/source/en/index.mdx b/docs/source/en/index.mdx
index 5a547ba8ef56..36964e924a81 100644
--- a/docs/source/en/index.mdx
+++ b/docs/source/en/index.mdx
@@ -151,6 +151,7 @@ The documentation is organized into five sections:
 1. **[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
 1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
 1. **[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
+1. **[MGP-STR](model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
 1. **[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileNetV1](model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
@@ -340,6 +341,7 @@ Flax), PyTorch, and/or TensorFlow.
 |        MaskFormerSwin         |       ❌       |       ❌       |       ❌        |         ❌         |      ❌      |
 |             mBART             |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
 |         Megatron-BERT         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|            MGP-STR            |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
 |          MobileBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
 |          MobileNetV1          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |          MobileNetV2          |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
diff --git a/docs/source/en/installation.mdx b/docs/source/en/installation.mdx
index 4ff4e04436c7..2c59d9146ba4 100644
--- a/docs/source/en/installation.mdx
+++ b/docs/source/en/installation.mdx
@@ -54,19 +54,31 @@ pip install transformers
 For CPU-support only, you can conveniently install 🤗 Transformers and a deep learning library in one line. For example, install 🤗 Transformers and PyTorch with:
 
 ```bash
-pip install transformers[torch]
+pip install 'transformers[torch]'
 ```
 
 🤗 Transformers and TensorFlow 2.0:
 
 ```bash
-pip install transformers[tf-cpu]
+pip install 'transformers[tf-cpu]'
 ```
 
+<Tip warning={true}>
+
+M1 / ARM Users
+    
+You will need to install the following before installing TensorFlow 2.0:
+```bash
+brew install cmake
+brew install pkg-config
+```
+
+</Tip>
+
 🤗 Transformers and Flax:
 
 ```bash
-pip install transformers[flax]
+pip install 'transformers[flax]'
 ```
 
 Finally, check if 🤗 Transformers has been properly installed by running the following command. It will download a pretrained model:
@@ -237,4 +249,4 @@ Once your file is downloaded and locally cached, specify it's local path to load
 
 See the [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) section for more details on downloading files stored on the Hub.
 
-</Tip>
\ No newline at end of file
+</Tip>
diff --git a/docs/source/en/model_doc/align.mdx b/docs/source/en/model_doc/align.mdx
index 5ffec6bebcdb..043de683c17a 100644
--- a/docs/source/en/model_doc/align.mdx
+++ b/docs/source/en/model_doc/align.mdx
@@ -14,15 +14,57 @@ specific language governing permissions and limitations under the License.
 
 ## Overview
 
-The ALIGN model was proposed in [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ALIGN features a dual-encoder architecture with [EfficientNet](efficientnet) as its vision encoder and [BERT](bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
+The ALIGN model was proposed in [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with [EfficientNet](efficientnet) as its vision encoder and [BERT](bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
 
 The abstract from the paper is the following:
 
 *Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.*
 
+## Usage
+
+ALIGN uses EfficientNet to extract visual features and BERT to extract text features. Both the text and visual features are then projected into a latent space of the same dimension. The dot product between the projected image and text features is then used as the similarity score.
+
+[`AlignProcessor`] wraps [`EfficientNetImageProcessor`] and [`BertTokenizer`] into a single instance to both encode the text and preprocess the images. The following example shows how to get the image-text similarity scores using [`AlignProcessor`] and [`AlignModel`].
+
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import AlignProcessor, AlignModel
+
+processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
+model = AlignModel.from_pretrained("kakaobrain/align-base")
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+candidate_labels = ["an image of a cat", "an image of a dog"]
+
+inputs = processor(text=candidate_labels, images=image, return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# this is the image-text similarity score
+logits_per_image = outputs.logits_per_image
+
+# we can take the softmax to get the label probabilities
+probs = logits_per_image.softmax(dim=1)
+print(probs)
+```
+
 This model was contributed by [Alara Dirik](https://huggingface.co/adirik).
 The original code is not released, this implementation is based on the Kakao Brain implementation based on the original paper.
 
+## Resources
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALIGN.
+
+- A blog post on [ALIGN and the COYO-700M dataset](https://huggingface.co/blog).
+- A zero-shot image classification [demo](https://huggingface.co/spaces/adirik/ALIGN-zero-shot-image-classification).
+- [Model card](https://huggingface.co/kakaobrain/align-base) of the `kakaobrain/align-base` model.
+
+If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.
+
 
 ## AlignConfig
 
diff --git a/docs/source/en/model_doc/auto.mdx b/docs/source/en/model_doc/auto.mdx
index 9df4fa9c995d..39b0645eb5c8 100644
--- a/docs/source/en/model_doc/auto.mdx
+++ b/docs/source/en/model_doc/auto.mdx
@@ -258,6 +258,14 @@ The following auto classes are available for the following computer vision tasks
 
 [[autodoc]] AutoModelForUniversalSegmentation
 
+### AutoModelForZeroShotImageClassification
+
+[[autodoc]] AutoModelForZeroShotImageClassification
+
+### TFAutoModelForZeroShotImageClassification
+
+[[autodoc]] TFAutoModelForZeroShotImageClassification
+
 ### AutoModelForZeroShotObjectDetection
 
 [[autodoc]] AutoModelForZeroShotObjectDetection
diff --git a/docs/source/en/model_doc/flan-ul2.mdx b/docs/source/en/model_doc/flan-ul2.mdx
index d1687b2be261..bcc00295420e 100644
--- a/docs/source/en/model_doc/flan-ul2.mdx
+++ b/docs/source/en/model_doc/flan-ul2.mdx
@@ -31,7 +31,7 @@ One can refer to [T5's documentation page](t5) for all tips, code examples and n
 The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints).
 
 
-## Running on low ressources devices
+## Running on low-resource devices
 
 The model is pretty heavy (~40GB in half precision) so if you just want to run the model, make sure you load your model in 8bit, and use `device_map="auto"` to make sure  you don't have any OOM issue!
 
@@ -49,4 +49,4 @@ The model is pretty heavy (~40GB in half precision) so if you just want to run t
 
 ## Inference
 
-The inference protocol is exaclty the same as any `T5` model, please have a look at the [T5's documentation page](t5) for more details.
\ No newline at end of file
+The inference protocol is exactly the same as for any `T5` model. Please have a look at [T5's documentation page](t5) for more details.
diff --git a/docs/source/en/model_doc/gptj.mdx b/docs/source/en/model_doc/gptj.mdx
index 98247fcfb044..2b407d7b2d40 100644
--- a/docs/source/en/model_doc/gptj.mdx
+++ b/docs/source/en/model_doc/gptj.mdx
@@ -21,21 +21,22 @@ This model was contributed by [Stella Biderman](https://huggingface.co/stellaath
 
 Tips:
 
-- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size CPU
-  RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU
-  RAM to just load the model. To reduce the CPU RAM usage there are a few options. The `torch_dtype` argument can be
-  used to initialize the model in half-precision. And the `low_cpu_mem_usage` argument can be used to keep the RAM
-  usage to 1x. There is also a [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) which stores
-  the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly
-  12.1GB of CPU RAM to load the model.
+- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32, one needs at least 2x the model size in
+  RAM: 1x for the initial weights and another 1x to load the checkpoint. So for GPT-J it takes at least 48GB of
+  RAM just to load the model. There are a few options to reduce the RAM usage. The `torch_dtype` argument can be
+  used to initialize the model in half-precision on a CUDA device only. There is also an fp16 branch which stores the fp16 weights,
+  which can be used to further minimize the RAM usage:
 
 ```python
 >>> from transformers import GPTJForCausalLM
 >>> import torch
 
+>>> device = "cuda"
 >>> model = GPTJForCausalLM.from_pretrained(
-...     "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
-... )
+...     "EleutherAI/gpt-j-6B",
+...     revision="float16",
+...     torch_dtype=torch.float16,
+... ).to(device)
 ```
 
 - The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam
@@ -85,7 +86,8 @@ model.
 >>> from transformers import GPTJForCausalLM, AutoTokenizer
 >>> import torch
 
->>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
+>>> device = "cuda"
+>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to(device)
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
 
 >>> prompt = (
@@ -94,7 +96,7 @@ model.
 ...     "researchers was the fact that the unicorns spoke perfect English."
 ... )
 
->>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
 
 >>> gen_tokens = model.generate(
 ...     input_ids,
diff --git a/docs/source/en/model_doc/mgp-str.mdx b/docs/source/en/model_doc/mgp-str.mdx
new file mode 100644
index 000000000000..fc3d9f41526d
--- /dev/null
+++ b/docs/source/en/model_doc/mgp-str.mdx
@@ -0,0 +1,86 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# MGP-STR
+
+## Overview
+
+The MGP-STR model was proposed in [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. MGP-STR is a conceptually **simple** yet **powerful** vision Scene Text Recognition (STR) model, which is built upon the [Vision Transformer (ViT)](vit). To integrate linguistic knowledge, the Multi-Granularity Prediction (MGP) strategy is proposed to inject information from the language modality into the model in an implicit way.
+
+The abstract from the paper is the following:
+
+*Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e. , subword representations (BPE and WordPiece) widely-used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks.*
+
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/mgp_str_architecture.png"
+alt="drawing" width="600"/>
+
+<small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>
+
+Tips:
+
+- MGP-STR is trained on two synthetic datasets, [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST), without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
+- This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).
+
+## Inference
+
+[`MgpstrModel`] accepts images as input and generates three types of predictions, which represent textual information at different granularities.
+The three types of predictions are fused to give the final prediction result.
+
+The [`ViTImageProcessor`] class is responsible for preprocessing the input image and
+[`MgpstrTokenizer`] decodes the generated character tokens to the target string. The
+[`MgpstrProcessor`] wraps [`ViTImageProcessor`] and [`MgpstrTokenizer`]
+into a single instance to both extract the input features and decode the predicted token ids.
+
+- Step-by-step Optical Character Recognition (OCR)
+
+```py
+>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
+>>> import requests
+>>> from PIL import Image
+
+>>> processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
+>>> model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')
+
+>>> # load image from the IIIT-5k dataset
+>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
+>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+>>> pixel_values = processor(image, return_tensors="pt").pixel_values
+>>> outputs = model(pixel_values)
+
+>>> generated_text = processor.batch_decode(outputs.logits)['generated_text']
+```
+
+## MgpstrConfig
+
+[[autodoc]] MgpstrConfig
+
+## MgpstrTokenizer
+
+[[autodoc]] MgpstrTokenizer
+    - save_vocabulary
+
+## MgpstrProcessor
+
+[[autodoc]] MgpstrProcessor
+    - __call__
+    - batch_decode
+
+## MgpstrModel
+
+[[autodoc]] MgpstrModel
+    - forward
+
+## MgpstrForSceneTextRecognition
+
+[[autodoc]] MgpstrForSceneTextRecognition
+    - forward
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index cceefc393718..f66bf314f04f 100644
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -370,6 +370,7 @@
     "models.mctct": ["MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MCTCTConfig", "MCTCTProcessor"],
     "models.megatron_bert": ["MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MegatronBertConfig"],
     "models.megatron_gpt2": [],
+    "models.mgp_str": ["MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP", "MgpstrConfig", "MgpstrProcessor", "MgpstrTokenizer"],
     "models.mluke": [],
     "models.mmbt": ["MMBTConfig"],
     "models.mobilebert": ["MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileBertConfig", "MobileBertTokenizer"],
@@ -1001,6 +1002,7 @@
             "MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING",
             "MODEL_FOR_VISION_2_SEQ_MAPPING",
             "MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING",
+            "MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING",
             "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING",
             "MODEL_MAPPING",
             "MODEL_WITH_LM_HEAD_MAPPING",
@@ -1033,6 +1035,7 @@
             "AutoModelForVideoClassification",
             "AutoModelForVision2Seq",
             "AutoModelForVisualQuestionAnswering",
+            "AutoModelForZeroShotImageClassification",
             "AutoModelForZeroShotObjectDetection",
             "AutoModelWithLMHead",
         ]
@@ -1902,6 +1905,14 @@
             "MegatronBertPreTrainedModel",
         ]
     )
+    _import_structure["models.mgp_str"].extend(
+        [
+            "MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "MgpstrForSceneTextRecognition",
+            "MgpstrModel",
+            "MgpstrPreTrainedModel",
+        ]
+    )
     _import_structure["models.mmbt"].extend(["MMBTForClassification", "MMBTModel", "ModalEmbeddings"])
     _import_structure["models.mobilebert"].extend(
         [
@@ -2785,6 +2796,7 @@
             "TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING",
             "TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING",
             "TF_MODEL_FOR_VISION_2_SEQ_MAPPING",
+            "TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING",
             "TF_MODEL_MAPPING",
             "TF_MODEL_WITH_LM_HEAD_MAPPING",
             "TFAutoModel",
@@ -2803,6 +2815,7 @@
             "TFAutoModelForTableQuestionAnswering",
             "TFAutoModelForTokenClassification",
             "TFAutoModelForVision2Seq",
+            "TFAutoModelForZeroShotImageClassification",
             "TFAutoModelWithLMHead",
         ]
     )
@@ -3958,6 +3971,7 @@
     from .models.mbart import MBartConfig
     from .models.mctct import MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, MCTCTConfig, MCTCTProcessor
     from .models.megatron_bert import MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MegatronBertConfig
+    from .models.mgp_str import MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP, MgpstrConfig, MgpstrProcessor, MgpstrTokenizer
     from .models.mmbt import MMBTConfig
     from .models.mobilebert import MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileBertConfig, MobileBertTokenizer
     from .models.mobilenet_v1 import MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileNetV1Config
@@ -4514,6 +4528,7 @@
             MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING,
             MODEL_FOR_VISION_2_SEQ_MAPPING,
             MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING,
+            MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING,
             MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING,
             MODEL_MAPPING,
             MODEL_WITH_LM_HEAD_MAPPING,
@@ -4546,6 +4561,7 @@
             AutoModelForVideoClassification,
             AutoModelForVision2Seq,
             AutoModelForVisualQuestionAnswering,
+            AutoModelForZeroShotImageClassification,
             AutoModelForZeroShotObjectDetection,
             AutoModelWithLMHead,
         )
@@ -5243,6 +5259,12 @@
             MegatronBertModel,
             MegatronBertPreTrainedModel,
         )
+        from .models.mgp_str import (
+            MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST,
+            MgpstrForSceneTextRecognition,
+            MgpstrModel,
+            MgpstrPreTrainedModel,
+        )
         from .models.mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings
         from .models.mobilebert import (
             MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
@@ -5971,6 +5993,7 @@
             TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING,
             TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
             TF_MODEL_FOR_VISION_2_SEQ_MAPPING,
+            TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING,
             TF_MODEL_MAPPING,
             TF_MODEL_WITH_LM_HEAD_MAPPING,
             TFAutoModel,
@@ -5989,6 +6012,7 @@
             TFAutoModelForTableQuestionAnswering,
             TFAutoModelForTokenClassification,
             TFAutoModelForVision2Seq,
+            TFAutoModelForZeroShotImageClassification,
             TFAutoModelWithLMHead,
         )
         from .models.bart import (
@@ -6222,7 +6246,6 @@
         from .models.mbart import TFMBartForConditionalGeneration, TFMBartModel, TFMBartPreTrainedModel
         from .models.mobilebert import (
             TF_MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
-            TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
             TFMobileBertForMaskedLM,
             TFMobileBertForMultipleChoice,
             TFMobileBertForNextSentencePrediction,
@@ -6233,6 +6256,9 @@
             TFMobileBertMainLayer,
             TFMobileBertModel,
             TFMobileBertPreTrainedModel,
+        )
+        from .models.mobilevit import (
+            TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
             TFMobileViTForImageClassification,
             TFMobileViTForSemanticSegmentation,
             TFMobileViTModel,
diff --git a/src/transformers/commands/pt_to_tf.py b/src/transformers/commands/pt_to_tf.py
index 669f7a98003b..e025db5d1344 100644
--- a/src/transformers/commands/pt_to_tf.py
+++ b/src/transformers/commands/pt_to_tf.py
@@ -68,6 +68,7 @@ def convert_command_factory(args: Namespace):
         args.no_pr,
         args.push,
         args.extra_commit_description,
+        args.override_model_class,
     )
 
 
@@ -126,6 +127,13 @@ def register_subcommand(parser: ArgumentParser):
             default="",
             help="Optional additional commit description to use when opening a PR (e.g. to tag the owner).",
         )
+        train_parser.add_argument(
+            "--override-model-class",
+            type=str,
+            default=None,
+            help="If you think you know better than the auto-detector, you can specify the model class here. "
+            "Can be either an AutoModel class or a specific model class like BertForSequenceClassification.",
+        )
         train_parser.set_defaults(func=convert_command_factory)
 
     @staticmethod
@@ -175,6 +183,7 @@ def __init__(
         no_pr: bool,
         push: bool,
         extra_commit_description: str,
+        override_model_class: str,
         *args,
     ):
         self._logger = logging.get_logger("transformers-cli/pt_to_tf")
@@ -185,6 +194,7 @@ def __init__(
         self._no_pr = no_pr
         self._push = push
         self._extra_commit_description = extra_commit_description
+        self._override_model_class = override_model_class
 
     def get_inputs(self, pt_model, config):
         """
@@ -269,7 +279,20 @@ def run(self):
         # Load config and get the appropriate architecture -- the latter is needed to convert the head's weights
         config = AutoConfig.from_pretrained(self._local_dir)
         architectures = config.architectures
-        if architectures is None:  # No architecture defined -- use auto classes
+        if self._override_model_class is not None:
+            if self._override_model_class.startswith("TF"):
+                architectures = [self._override_model_class[2:]]
+            else:
+                architectures = [self._override_model_class]
+            try:
+                pt_class = getattr(import_module("transformers"), architectures[0])
+            except AttributeError:
+                raise ValueError(f"Model class {self._override_model_class} not found in transformers.")
+            try:
+                tf_class = getattr(import_module("transformers"), "TF" + architectures[0])
+            except AttributeError:
+                raise ValueError(f"TF model class TF{architectures[0]} not found in transformers.")
+        elif architectures is None:  # No architecture defined -- use auto classes
             pt_class = getattr(import_module("transformers"), "AutoModel")
             tf_class = getattr(import_module("transformers"), "TFAutoModel")
             self._logger.warning("No detected architecture, using AutoModel/TFAutoModel")
@@ -287,7 +310,6 @@ def run(self):
         pt_model = pt_class.from_pretrained(self._local_dir)
         pt_model.eval()
 
-        tf_from_pt_model = tf_class.from_pretrained(self._local_dir, from_pt=True)
         pt_input, tf_input = self.get_inputs(pt_model, config)
 
         with torch.no_grad():
diff --git a/src/transformers/data/data_collator.py b/src/transformers/data/data_collator.py
index 587a4f4d0043..cd363588757b 100644
--- a/src/transformers/data/data_collator.py
+++ b/src/transformers/data/data_collator.py
@@ -883,6 +883,8 @@ def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> D
         return {"input_ids": inputs, "labels": labels}
 
     def tf_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
+        import tensorflow as tf
+
         if isinstance(examples[0], Mapping):
             input_ids = [e["input_ids"] for e in examples]
         else:
@@ -907,7 +909,7 @@ def tf_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict
                         ref_tokens[i] = "##" + ref_tokens[i]
             mask_labels.append(self._whole_word_mask(ref_tokens))
         batch_mask = _tf_collate_batch(mask_labels, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)
-        inputs, labels = self.tf_mask_tokens(batch_input, batch_mask)
+        inputs, labels = self.tf_mask_tokens(tf.cast(batch_input, tf.int64), batch_mask)
         return {"input_ids": inputs, "labels": labels}
 
     def numpy_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
diff --git a/src/transformers/generation/flax_utils.py b/src/transformers/generation/flax_utils.py
index 440a0ae26275..4ff1164c88e9 100644
--- a/src/transformers/generation/flax_utils.py
+++ b/src/transformers/generation/flax_utils.py
@@ -140,7 +140,7 @@ class FlaxGenerationMixin:
               `do_sample=False`
 
     You do not need to call any of the above methods directly. Pass custom parameter values to 'generate' instead. To
-    learn more about decoding strategies refer to the [text generation strategies guide](./generation_strategies).
+    learn more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).
     """
 
     def prepare_inputs_for_generation(self, *args, **kwargs):
diff --git a/src/transformers/generation/tf_utils.py b/src/transformers/generation/tf_utils.py
index 2cb9c7cad709..4a9140f8853d 100644
--- a/src/transformers/generation/tf_utils.py
+++ b/src/transformers/generation/tf_utils.py
@@ -455,7 +455,7 @@ class TFGenerationMixin:
         - *beam-search decoding* by calling [`~generation.TFGenerationMixin.beam_search`] if `num_beams>1`
 
     You do not need to call any of the above methods directly. Pass custom parameter values to 'generate' instead. To
-    learn more about decoding strategies refer to the [text generation strategies guide](./generation_strategies).
+    learn more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).
     """
 
     _seed_generator = None
@@ -681,7 +681,7 @@ def generate(
         parameters to generate, e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
 
         For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 0624550493b5..7091b49e3ecc 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -483,7 +483,7 @@ class GenerationMixin:
           `constraints!=None` or `force_words_ids!=None`
 
     You do not need to call any of the above methods directly. Pass custom parameter values to 'generate' instead. To
-    learn more about decoding strategies refer to the [text generation strategies guide](./generation_strategies).
+    learn more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).
     """
 
     def prepare_inputs_for_generation(self, *args, **kwargs):
@@ -1129,7 +1129,7 @@ def generate(
         parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.
 
         For an overview of generation strategies and code examples, check out the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -1699,7 +1699,7 @@ def contrastive_search(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.contrastive_search`] directly. Use
         generate() instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -2057,7 +2057,7 @@ def greedy_search(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.greedy_search`] directly. Use generate()
         instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -2304,7 +2304,7 @@ def sample(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use generate() instead.
         For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -2573,7 +2573,7 @@ def beam_search(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.beam_search`] directly. Use generate()
         instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -2897,7 +2897,7 @@ def beam_sample(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.beam_sample`] directly. Use generate()
         instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -3229,7 +3229,7 @@ def group_beam_search(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.group_beam_search`] directly. Use
         generate() instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
@@ -3607,7 +3607,7 @@ def constrained_beam_search(
 
         In most cases, you do not need to call [`~generation.GenerationMixin.constrained_beam_search`] directly. Use
         generate() instead. For an overview of generation strategies and code examples, check the [following
-        guide](./generation_strategies).
+        guide](../generation_strategies).
 
         </Tip>
 
diff --git a/src/transformers/integrations.py b/src/transformers/integrations.py
index a2effeac6361..14857f830831 100644
--- a/src/transformers/integrations.py
+++ b/src/transformers/integrations.py
@@ -346,7 +346,7 @@ def dynamic_modules_import_trainable(*args, **kwargs):
         **kwargs,
     )
     best_trial = analysis.get_best_trial(metric="objective", mode=direction[:3], scope=trainer.args.ray_scope)
-    best_run = BestRun(best_trial.trial_id, best_trial.last_result["objective"], best_trial.config)
+    best_run = BestRun(best_trial.trial_id, best_trial.last_result["objective"], best_trial.config, analysis)
     if _tb_writer is not None:
         trainer.add_callback(_tb_writer)
     return best_run
diff --git a/src/transformers/modelcard.py b/src/transformers/modelcard.py
index ac954272cda4..e89216b0d8b3 100644
--- a/src/transformers/modelcard.py
+++ b/src/transformers/modelcard.py
@@ -43,6 +43,7 @@
     MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES,
     MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING_NAMES,
     MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
+    MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES,
 )
 from .training_args import ParallelMode
 from .utils import (
@@ -70,6 +71,7 @@
     "token-classification": MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
     "audio-classification": MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES,
     "automatic-speech-recognition": {**MODEL_FOR_CTC_MAPPING_NAMES, **MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES},
+    "zero-shot-image-classification": MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES,
 }
 
 logger = logging.get_logger(__name__)
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 494367b813bf..e85d8e304d09 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -807,7 +807,7 @@ def create_extended_attention_mask_for_decoder(input_shape, attention_mask, devi
         return extended_attention_mask
 
     def get_extended_attention_mask(
-        self, attention_mask: Tensor, input_shape: Tuple[int], device: device = None, dtype: torch.float = None
+        self, attention_mask: Tensor, input_shape: Tuple[int], device: torch.device = None, dtype: torch.float = None
     ) -> Tensor:
         """
         Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index 35017e3e0320..29d023fc6a5d 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -116,6 +116,7 @@
     mctct,
     megatron_bert,
     megatron_gpt2,
+    mgp_str,
     mluke,
     mmbt,
     mobilebert,
diff --git a/src/transformers/models/align/modeling_align.py b/src/transformers/models/align/modeling_align.py
index 6e74c8205d73..09ee6eca6265 100644
--- a/src/transformers/models/align/modeling_align.py
+++ b/src/transformers/models/align/modeling_align.py
@@ -122,7 +122,7 @@
     Args:
         pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
             Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
-            [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
+            [`AutoImageProcessor`]. See [`EfficientNetImageProcessor.__call__`] for details.
         output_hidden_states (`bool`, *optional*):
             Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
             more detail.
@@ -172,7 +172,7 @@
             model's internal embedding lookup matrix.
         pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
             Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
-            [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
+            [`AutoImageProcessor`]. See [`EfficientNetImageProcessor.__call__`] for details.
         return_loss (`bool`, *optional*):
             Whether or not to return the contrastive loss.
         output_attentions (`bool`, *optional*):
diff --git a/src/transformers/models/align/processing_align.py b/src/transformers/models/align/processing_align.py
index 0a26aaa379a3..0863c11310e3 100644
--- a/src/transformers/models/align/processing_align.py
+++ b/src/transformers/models/align/processing_align.py
@@ -31,7 +31,7 @@ class AlignProcessor(ProcessorMixin):
     Args:
         image_processor ([`EfficientNetImageProcessor`]):
             The image processor is a required input.
-        tokenizer ([`BERTTokenizer`, `BertTokenizerFast`]):
+        tokenizer ([`BertTokenizer`, `BertTokenizerFast`]):
             The tokenizer is a required input.
     """
 
diff --git a/src/transformers/models/auto/__init__.py b/src/transformers/models/auto/__init__.py
index 73965b657f3b..4eccfded5b6a 100644
--- a/src/transformers/models/auto/__init__.py
+++ b/src/transformers/models/auto/__init__.py
@@ -69,6 +69,7 @@
         "MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING",
         "MODEL_MAPPING",
         "MODEL_WITH_LM_HEAD_MAPPING",
+        "MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING",
         "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING",
         "AutoModel",
         "AutoBackbone",
@@ -100,6 +101,7 @@
         "AutoModelForVisualQuestionAnswering",
         "AutoModelForDocumentQuestionAnswering",
         "AutoModelWithLMHead",
+        "AutoModelForZeroShotImageClassification",
         "AutoModelForZeroShotObjectDetection",
     ]
 
@@ -126,6 +128,7 @@
         "TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING",
         "TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING",
         "TF_MODEL_FOR_VISION_2_SEQ_MAPPING",
+        "TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING",
         "TF_MODEL_MAPPING",
         "TF_MODEL_WITH_LM_HEAD_MAPPING",
         "TFAutoModel",
@@ -144,6 +147,7 @@
         "TFAutoModelForTableQuestionAnswering",
         "TFAutoModelForTokenClassification",
         "TFAutoModelForVision2Seq",
+        "TFAutoModelForZeroShotImageClassification",
         "TFAutoModelWithLMHead",
     ]
 
@@ -226,6 +230,7 @@
             MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING,
             MODEL_FOR_VISION_2_SEQ_MAPPING,
             MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING,
+            MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING,
             MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING,
             MODEL_MAPPING,
             MODEL_WITH_LM_HEAD_MAPPING,
@@ -258,6 +263,7 @@
             AutoModelForVideoClassification,
             AutoModelForVision2Seq,
             AutoModelForVisualQuestionAnswering,
+            AutoModelForZeroShotImageClassification,
             AutoModelForZeroShotObjectDetection,
             AutoModelWithLMHead,
         )
@@ -285,6 +291,7 @@
             TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING,
             TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
             TF_MODEL_FOR_VISION_2_SEQ_MAPPING,
+            TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING,
             TF_MODEL_MAPPING,
             TF_MODEL_WITH_LM_HEAD_MAPPING,
             TFAutoModel,
@@ -303,6 +310,7 @@
             TFAutoModelForTableQuestionAnswering,
             TFAutoModelForTokenClassification,
             TFAutoModelForVision2Seq,
+            TFAutoModelForZeroShotImageClassification,
             TFAutoModelWithLMHead,
         )
 
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index ccd516c2d15b..1ab0b62c793a 100755
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -121,6 +121,7 @@
         ("mbart", "MBartConfig"),
         ("mctct", "MCTCTConfig"),
         ("megatron-bert", "MegatronBertConfig"),
+        ("mgp-str", "MgpstrConfig"),
         ("mobilebert", "MobileBertConfig"),
         ("mobilenet_v1", "MobileNetV1Config"),
         ("mobilenet_v2", "MobileNetV2Config"),
@@ -294,6 +295,7 @@
         ("mbart", "MBART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("mctct", "MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("megatron-bert", "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("mgp-str", "MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("mobilenet_v1", "MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("mobilenet_v2", "MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("mobilevit", "MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -476,6 +478,7 @@
         ("mctct", "M-CTC-T"),
         ("megatron-bert", "Megatron-BERT"),
         ("megatron_gpt2", "Megatron-GPT2"),
+        ("mgp-str", "MGP-STR"),
         ("mluke", "mLUKE"),
         ("mobilebert", "MobileBERT"),
         ("mobilenet_v1", "MobileNetV1"),
diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py
index 8b45c4d65140..6274c88f7d31 100644
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -69,6 +69,7 @@
         ("levit", "LevitImageProcessor"),
         ("mask2former", "Mask2FormerImageProcessor"),
         ("maskformer", "MaskFormerImageProcessor"),
+        ("mgp-str", "ViTImageProcessor"),
         ("mobilenet_v1", "MobileNetV1ImageProcessor"),
         ("mobilenet_v2", "MobileNetV2ImageProcessor"),
         ("mobilenet_v2", "MobileNetV2ImageProcessor"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index 446ab8ec572d..f871de8e55f2 100755
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -119,6 +119,7 @@
         ("mbart", "MBartModel"),
         ("mctct", "MCTCTModel"),
         ("megatron-bert", "MegatronBertModel"),
+        ("mgp-str", "MgpstrForSceneTextRecognition"),
         ("mobilebert", "MobileBertModel"),
         ("mobilenet_v1", "MobileNetV1Model"),
         ("mobilenet_v2", "MobileNetV2Model"),
@@ -920,7 +921,7 @@
     ]
 )
 
-_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
+MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
     [
         # Model for Zero Shot Image Classification mapping
         ("align", "AlignModel"),
@@ -955,6 +956,9 @@
 MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping(
     CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES
 )
+MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping(
+    CONFIG_MAPPING_NAMES, MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES
+)
 MODEL_FOR_IMAGE_SEGMENTATION_MAPPING = _LazyAutoMapping(
     CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES
 )
@@ -1142,6 +1146,15 @@ class AutoModelForImageClassification(_BaseAutoModelClass):
 AutoModelForImageClassification = auto_class_update(AutoModelForImageClassification, head_doc="image classification")
 
 
+class AutoModelForZeroShotImageClassification(_BaseAutoModelClass):
+    _model_mapping = MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
+
+
+AutoModelForZeroShotImageClassification = auto_class_update(
+    AutoModelForZeroShotImageClassification, head_doc="zero-shot image classification"
+)
+
+
 class AutoModelForImageSegmentation(_BaseAutoModelClass):
     _model_mapping = MODEL_FOR_IMAGE_SEGMENTATION_MAPPING
 
diff --git a/src/transformers/models/auto/modeling_tf_auto.py b/src/transformers/models/auto/modeling_tf_auto.py
index 4d48e6181ebc..caf5ba71dc03 100644
--- a/src/transformers/models/auto/modeling_tf_auto.py
+++ b/src/transformers/models/auto/modeling_tf_auto.py
@@ -209,6 +209,15 @@
     ]
 )
 
+
+TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
+    [
+        # Model for Zero Shot Image Classification mapping
+        ("clip", "TFCLIPModel"),
+    ]
+)
+
+
 TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES = OrderedDict(
     [
         # Model for Semantic Segmentation mapping
@@ -424,6 +433,9 @@
 TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping(
     CONFIG_MAPPING_NAMES, TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES
 )
+TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping(
+    CONFIG_MAPPING_NAMES, TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES
+)
 TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING = _LazyAutoMapping(
     CONFIG_MAPPING_NAMES, TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES
 )
@@ -505,6 +517,15 @@ class TFAutoModelForImageClassification(_BaseAutoModelClass):
 )
 
 
+class TFAutoModelForZeroShotImageClassification(_BaseAutoModelClass):
+    _model_mapping = TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
+
+
+TFAutoModelForZeroShotImageClassification = auto_class_update(
+    TFAutoModelForZeroShotImageClassification, head_doc="zero-shot image classification"
+)
+
+
 class TFAutoModelForSemanticSegmentation(_BaseAutoModelClass):
     _model_mapping = TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING
 
diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py
index 197cfe8e79c5..29726fde4f08 100644
--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -57,6 +57,7 @@
         ("layoutlmv2", "LayoutLMv2Processor"),
         ("layoutlmv3", "LayoutLMv3Processor"),
         ("markuplm", "MarkupLMProcessor"),
+        ("mgp-str", "MgpstrProcessor"),
         ("oneformer", "OneFormerProcessor"),
         ("owlvit", "OwlViTProcessor"),
         ("sew", "Wav2Vec2Processor"),
diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
index f5035ab33180..3f81c5e1f935 100644
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -194,6 +194,7 @@
                 ),
             ),
             ("megatron-bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
+            ("mgp-str", ("MgpstrTokenizer", None)),
             ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
             ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
             ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
diff --git a/src/transformers/models/detr/configuration_detr.py b/src/transformers/models/detr/configuration_detr.py
index b3da5f86b016..955b71de1ec5 100644
--- a/src/transformers/models/detr/configuration_detr.py
+++ b/src/transformers/models/detr/configuration_detr.py
@@ -239,6 +239,7 @@ def hidden_size(self) -> int:
     @classmethod
     def from_backbone_config(cls, backbone_config: PretrainedConfig, **kwargs):
         """Instantiate a [`DetrConfig`] (or a derived class) from a pre-trained backbone model configuration.
+
         Args:
             backbone_config ([`PretrainedConfig`]):
                 The backbone configuration.
diff --git a/src/transformers/models/donut/image_processing_donut.py b/src/transformers/models/donut/image_processing_donut.py
index 325a2bb9b602..c94927472915 100644
--- a/src/transformers/models/donut/image_processing_donut.py
+++ b/src/transformers/models/donut/image_processing_donut.py
@@ -63,12 +63,14 @@ class DonutImageProcessor(BaseImageProcessor):
             method.
         resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
             Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
-        do_center_crop (`bool`, *optional*, defaults to `True`):
-            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
-            `preprocess` method.
-        crop_size (`Dict[str, int]` *optional*, defaults to 224):
-            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
-            method.
+        do_thumbnail (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image using the thumbnail method.
+        do_align_long_axis (`bool`, *optional*, defaults to `False`):
+            Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
+        do_pad (`bool`, *optional*, defaults to `True`):
+            Whether to pad the image. If `random_padding` is set to `True` in `preprocess`, each image is padded with a
+            random amount of padding on each side, up to the largest image size in the batch. Otherwise, all images are
+            padded to the largest image size in the batch.
         do_rescale (`bool`, *optional*, defaults to `True`):
             Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
             the `preprocess` method.
@@ -82,9 +84,6 @@ class DonutImageProcessor(BaseImageProcessor):
             channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
             Image standard deviation.
-        do_convert_rgb (`bool`, *optional*, defaults to `True`):
-            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
-            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
     """
 
     model_input_names = ["pixel_values"]
diff --git a/src/transformers/models/mgp_str/__init__.py b/src/transformers/models/mgp_str/__init__.py
new file mode 100644
index 000000000000..01c0ab7fd9f0
--- /dev/null
+++ b/src/transformers/models/mgp_str/__init__.py
@@ -0,0 +1,62 @@
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all.
+
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+
+
+_import_structure = {
+    "configuration_mgp_str": ["MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP", "MgpstrConfig"],
+    "processing_mgp_str": ["MgpstrProcessor"],
+    "tokenization_mgp_str": ["MgpstrTokenizer"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_mgp_str"] = [
+        "MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "MgpstrModel",
+        "MgpstrPreTrainedModel",
+        "MgpstrForSceneTextRecognition",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_mgp_str import MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP, MgpstrConfig
+    from .processing_mgp_str import MgpstrProcessor
+    from .tokenization_mgp_str import MgpstrTokenizer
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_mgp_str import (
+            MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST,
+            MgpstrForSceneTextRecognition,
+            MgpstrModel,
+            MgpstrPreTrainedModel,
+        )
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/mgp_str/configuration_mgp_str.py b/src/transformers/models/mgp_str/configuration_mgp_str.py
new file mode 100644
index 000000000000..e77248cd6449
--- /dev/null
+++ b/src/transformers/models/mgp_str/configuration_mgp_str.py
@@ -0,0 +1,137 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" MGP-STR model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "alibaba-damo/mgp-str-base": "https://huggingface.co/alibaba-damo/mgp-str-base/resolve/main/config.json",
+}
+
+
+class MgpstrConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of an [`MgpstrModel`]. It is used to instantiate an
+    MGP-STR model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the MGP-STR
+    [alibaba-damo/mgp-str-base](https://huggingface.co/alibaba-damo/mgp-str-base) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        image_size (`List[int]`, *optional*, defaults to `[32, 128]`):
+            The size (resolution) of each image.
+        patch_size (`int`, *optional*, defaults to 4):
+            The size (resolution) of each patch.
+        num_channels (`int`, *optional*, defaults to 3):
+            The number of input channels.
+        max_token_length (`int`, *optional*, defaults to 27):
+            The max number of output tokens.
+        num_character_labels (`int`, *optional*, defaults to 38):
+            The number of classes for the character head.
+        num_bpe_labels (`int`, *optional*, defaults to 50257):
+            The number of classes for the bpe head.
+        num_wordpiece_labels (`int`, *optional*, defaults to 30522):
+            The number of classes for the wordpiece head.
+        hidden_size (`int`, *optional*, defaults to 768):
+            The embedding dimension.
+        num_hidden_layers (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        mlp_ratio (`float`, *optional*, defaults to 4.0):
+            The ratio of mlp hidden dim to embedding dim.
+        qkv_bias (`bool`, *optional*, defaults to `True`):
+            Whether to add a bias to the queries, keys and values.
+        distilled (`bool`, *optional*, defaults to `False`):
+            Model includes a distillation token and head as in DeiT models.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
+            The epsilon used by the layer normalization layers.
+        drop_rate (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder.
+        attn_drop_rate (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        drop_path_rate (`float`, *optional*, defaults to 0.0):
+            The stochastic depth rate.
+        output_a3_attentions (`bool`, *optional*, defaults to `False`):
+            Whether or not the model should return A^3 module attentions.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+
+    Example:
+
+    ```python
+    >>> from transformers import MgpstrConfig, MgpstrForSceneTextRecognition
+
+    >>> # Initializing a Mgpstr mgp-str-base style configuration
+    >>> configuration = MgpstrConfig()
+
+    >>> # Initializing a model (with random weights) from the mgp-str-base style configuration
+    >>> model = MgpstrForSceneTextRecognition(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "mgp-str"
+
+    def __init__(
+        self,
+        image_size=[32, 128],
+        patch_size=4,
+        num_channels=3,
+        max_token_length=27,
+        num_character_labels=38,
+        num_bpe_labels=50257,
+        num_wordpiece_labels=30522,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        mlp_ratio=4.0,
+        qkv_bias=True,
+        distilled=False,
+        layer_norm_eps=1e-5,
+        drop_rate=0.0,
+        attn_drop_rate=0.0,
+        drop_path_rate=0.0,
+        output_a3_attentions=False,
+        initializer_range=0.02,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.num_channels = num_channels
+        self.max_token_length = max_token_length
+        self.num_character_labels = num_character_labels
+        self.num_bpe_labels = num_bpe_labels
+        self.num_wordpiece_labels = num_wordpiece_labels
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.mlp_ratio = mlp_ratio
+        self.distilled = distilled
+        self.layer_norm_eps = layer_norm_eps
+        self.drop_rate = drop_rate
+        self.qkv_bias = qkv_bias
+        self.attn_drop_rate = attn_drop_rate
+        self.drop_path_rate = drop_path_rate
+        self.output_a3_attentions = output_a3_attentions
+        self.initializer_range = initializer_range
diff --git a/src/transformers/models/mgp_str/modeling_mgp_str.py b/src/transformers/models/mgp_str/modeling_mgp_str.py
new file mode 100644
index 000000000000..35ed55f5f578
--- /dev/null
+++ b/src/transformers/models/mgp_str/modeling_mgp_str.py
@@ -0,0 +1,512 @@
+# coding=utf-8
+# Copyright 2023 Alibaba Research and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch MGP-STR model."""
+
+import collections.abc
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+
+from ...modeling_outputs import BaseModelOutput
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+    ModelOutput,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from .configuration_mgp_str import MgpstrConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "MgpstrConfig"
+_TOKENIZER_FOR_DOC = "MgpstrTokenizer"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "alibaba-damo/mgp-str-base"
+
+MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "alibaba-damo/mgp-str-base",
+    # See all MGP-STR models at https://huggingface.co/models?filter=mgp-str
+]
+
+
+# Copied from transformers.models.beit.modeling_beit.drop_path
+def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+    """
+    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+
+    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
+    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
+    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
+    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
+    argument.
+    """
+    if drop_prob == 0.0 or not training:
+        return input
+    keep_prob = 1 - drop_prob
+    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
+    random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
+    random_tensor.floor_()  # binarize
+    output = input.div(keep_prob) * random_tensor
+    return output
+
+
+# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->Mgpstr
+class MgpstrDropPath(nn.Module):
+    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+    def __init__(self, drop_prob: Optional[float] = None) -> None:
+        super().__init__()
+        self.drop_prob = drop_prob
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return drop_path(hidden_states, self.drop_prob, self.training)
+
+    def extra_repr(self) -> str:
+        return "p={}".format(self.drop_prob)
+
+
+@dataclass
+class MgpstrModelOutput(ModelOutput):
+    """
+    Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
+
+    Args:
+        logits (`tuple(torch.FloatTensor)` of shape `(batch_size, config.num_character_labels)`):
+            Tuple of `torch.FloatTensor` (one for the output of character of shape `(batch_size,
+            config.max_token_length, config.num_character_labels)`, + one for the output of bpe of shape `(batch_size,
+            config.max_token_length, config.num_bpe_labels)`, + one for the output of wordpiece of shape `(batch_size,
+            config.max_token_length, config.num_wordpiece_labels)`).
+
+            Classification scores (before SoftMax) of character, bpe and wordpiece.
+        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, config.max_token_length,
+            sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+        a3_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_a3_attentions=True` is passed or when `config.output_a3_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for the attention of character, + one for the attention of bpe, + one
+            for the attention of wordpiece) of shape `(batch_size, config.max_token_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+    """
+
+    logits: Tuple[torch.FloatTensor] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    attentions: Optional[Tuple[torch.FloatTensor]] = None
+    a3_attentions: Optional[Tuple[torch.FloatTensor]] = None
+
+
+class MgpstrEmbeddings(nn.Module):
+    """2D Image to Patch Embedding"""
+
+    def __init__(self, config: MgpstrConfig):
+        super().__init__()
+        image_size = (
+            config.image_size
+            if isinstance(config.image_size, collections.abc.Iterable)
+            else (config.image_size, config.image_size)
+        )
+        patch_size = (
+            config.patch_size
+            if isinstance(config.patch_size, collections.abc.Iterable)
+            else (config.patch_size, config.patch_size)
+        )
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1])
+        self.num_patches = self.grid_size[0] * self.grid_size[1]
+        self.num_tokens = 2 if config.distilled else 1
+
+        self.proj = nn.Conv2d(config.num_channels, config.hidden_size, kernel_size=patch_size, stride=patch_size)
+
+        self.cls_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
+
+        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + self.num_tokens, config.hidden_size))
+        self.pos_drop = nn.Dropout(p=config.drop_rate)
+
+    def forward(self, pixel_values):
+        batch_size, channel, height, width = pixel_values.shape
+        if height != self.image_size[0] or width != self.image_size[1]:
+            raise ValueError(
+                f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."
+            )
+
+        patch_embeddings = self.proj(pixel_values)
+        patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2)  # BCHW -> BNC
+
+        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
+        embedding_output = torch.cat((cls_tokens, patch_embeddings), dim=1)
+        embedding_output = embedding_output + self.pos_embed
+        embedding_output = self.pos_drop(embedding_output)
+
+        return embedding_output
+
+
+class MgpstrMlp(nn.Module):
+    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
+
+    def __init__(self, config: MgpstrConfig, hidden_features):
+        super().__init__()
+        hidden_features = hidden_features or config.hidden_size
+        self.fc1 = nn.Linear(config.hidden_size, hidden_features)
+        self.act = nn.GELU()
+        self.fc2 = nn.Linear(hidden_features, config.hidden_size)
+        self.drop = nn.Dropout(config.drop_rate)
+
+    def forward(self, hidden_states):
+        hidden_states = self.fc1(hidden_states)
+        hidden_states = self.act(hidden_states)
+        hidden_states = self.drop(hidden_states)
+        hidden_states = self.fc2(hidden_states)
+        hidden_states = self.drop(hidden_states)
+        return hidden_states
+
+
+class MgpstrAttention(nn.Module):
+    def __init__(self, config: MgpstrConfig):
+        super().__init__()
+        self.num_heads = config.num_attention_heads
+        head_dim = config.hidden_size // config.num_attention_heads
+        self.scale = head_dim**-0.5
+
+        self.qkv = nn.Linear(config.hidden_size, config.hidden_size * 3, bias=config.qkv_bias)
+        self.attn_drop = nn.Dropout(config.attn_drop_rate)
+        self.proj = nn.Linear(config.hidden_size, config.hidden_size)
+        self.proj_drop = nn.Dropout(config.drop_rate)
+
+    def forward(self, hidden_states):
+        batch_size, num, channel = hidden_states.shape
+        qkv = (
+            self.qkv(hidden_states)
+            .reshape(batch_size, num, 3, self.num_heads, channel // self.num_heads)
+            .permute(2, 0, 3, 1, 4)
+        )
+        query, key, value = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
+
+        attention_probs = (query @ key.transpose(-2, -1)) * self.scale
+        attention_probs = attention_probs.softmax(dim=-1)
+        attention_probs = self.attn_drop(attention_probs)
+
+        context_layer = (attention_probs @ value).transpose(1, 2).reshape(batch_size, num, channel)
+        context_layer = self.proj(context_layer)
+        context_layer = self.proj_drop(context_layer)
+        return (context_layer, attention_probs)
+
+
+class MgpstrLayer(nn.Module):
+    def __init__(self, config: MgpstrConfig, drop_path=None):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.attn = MgpstrAttention(config)
+        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
+        self.drop_path = MgpstrDropPath(drop_path) if drop_path is not None else nn.Identity()
+        self.norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        mlp_hidden_dim = int(config.hidden_size * config.mlp_ratio)
+        self.mlp = MgpstrMlp(config, mlp_hidden_dim)
+
+    def forward(self, hidden_states):
+        self_attention_outputs = self.attn(self.norm1(hidden_states))
+        attention_output = self_attention_outputs[0]
+        outputs = self_attention_outputs[1]
+
+        # first residual connection
+        hidden_states = self.drop_path(attention_output) + hidden_states
+
+        # second residual connection is done here
+        layer_output = hidden_states + self.drop_path(self.mlp(self.norm2(hidden_states)))
+
+        outputs = (layer_output, outputs)
+        return outputs
+
+
+class MgpstrEncoder(nn.Module):
+    def __init__(self, config: MgpstrConfig):
+        super().__init__()
+        # stochastic depth decay rule
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+
+        self.blocks = nn.Sequential(
+            *[MgpstrLayer(config=config, drop_path=dpr[i]) for i in range(config.num_hidden_layers)]
+        )
+
+    def forward(self, hidden_states, output_attentions=False, output_hidden_states=False, return_dict=True):
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+
+        for blk in self.blocks:
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            layer_outputs = blk(hidden_states)
+            hidden_states = layer_outputs[0]
+
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (layer_outputs[1],)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+
+
+class MgpstrA3Module(nn.Module):
+    def __init__(self, config: MgpstrConfig):
+        super().__init__()
+        self.token_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.tokenLearner = nn.Sequential(
+            nn.Conv2d(config.hidden_size, config.hidden_size, kernel_size=(1, 1), stride=1, groups=8, bias=False),
+            nn.Conv2d(config.hidden_size, config.max_token_length, kernel_size=(1, 1), stride=1, bias=False),
+        )
+        self.feat = nn.Conv2d(
+            config.hidden_size, config.hidden_size, kernel_size=(1, 1), stride=1, groups=8, bias=False
+        )
+        self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+    def forward(self, hidden_states):
+        hidden_states = self.token_norm(hidden_states)
+        hidden_states = hidden_states.transpose(1, 2).unsqueeze(-1)
+        selected = self.tokenLearner(hidden_states)
+        selected = selected.flatten(2)
+        attentions = F.softmax(selected, dim=-1)
+
+        feat = self.feat(hidden_states)
+        feat = feat.flatten(2).transpose(1, 2)
+        feat = torch.einsum("...si,...id->...sd", attentions, feat)
+        a3_out = self.norm(feat)
+
+        return (a3_out, attentions)
+
+
+class MgpstrPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = MgpstrConfig
+    base_model_prefix = "mgp_str"
+
+    def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
+        """Initialize the weights"""
+        if isinstance(module, MgpstrEmbeddings):
+            nn.init.trunc_normal_(module.pos_embed, mean=0.0, std=self.config.initializer_range)
+            nn.init.trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range)
+        elif isinstance(module, (nn.Linear, nn.Conv2d)):
+            module.weight.data = nn.init.trunc_normal_(module.weight.data, mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+    def _set_gradient_checkpointing(self, module: MgpstrEncoder, value: bool = False) -> None:
+        if isinstance(module, MgpstrEncoder):
+            module.gradient_checkpointing = value
+
+
+MGP_STR_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`MgpstrConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+MGP_STR_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`ViTImageProcessor.__call__`]
+            for details.
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare MGP-STR Model transformer outputting raw hidden-states without any specific head on top.",
+    MGP_STR_START_DOCSTRING,
+)
+class MgpstrModel(MgpstrPreTrainedModel):
+    def __init__(self, config: MgpstrConfig):
+        super().__init__(config)
+        self.config = config
+        self.embeddings = MgpstrEmbeddings(config)
+        self.encoder = MgpstrEncoder(config)
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.embeddings.proj
+
+    @add_start_docstrings_to_model_forward(MGP_STR_INPUTS_DOCSTRING)
+    def forward(self, pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None):
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        embedding_output = self.embeddings(pixel_values)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        if not return_dict:
+            return encoder_outputs
+        return BaseModelOutput(
+            last_hidden_state=encoder_outputs.last_hidden_state,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    MGP-STR Model transformer with three classification heads on top (three A^3 modules and three linear layers on
+    top of the transformer encoder output) for scene text recognition (STR).
+    """,
+    MGP_STR_START_DOCSTRING,
+)
+class MgpstrForSceneTextRecognition(MgpstrPreTrainedModel):
+    config_class = MgpstrConfig
+    main_input_name = "pixel_values"
+
+    def __init__(self, config: MgpstrConfig) -> None:
+        super().__init__(config)
+
+        self.num_labels = config.num_labels
+        self.mgp_str = MgpstrModel(config)
+
+        self.char_a3_module = MgpstrA3Module(config)
+        self.bpe_a3_module = MgpstrA3Module(config)
+        self.wp_a3_module = MgpstrA3Module(config)
+
+        self.char_head = nn.Linear(config.hidden_size, config.num_character_labels)
+        self.bpe_head = nn.Linear(config.hidden_size, config.num_bpe_labels)
+        self.wp_head = nn.Linear(config.hidden_size, config.num_wordpiece_labels)
+
+    @add_start_docstrings_to_model_forward(MGP_STR_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=MgpstrModelOutput, config_class=MgpstrConfig)
+    def forward(
+        self,
+        pixel_values,
+        output_attentions=None,
+        output_a3_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+    ):
+        r"""
+        output_a3_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of a3 modules. See `a3_attentions` under returned tensors
+            for more detail.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import (
+        ...     MgpstrProcessor,
+        ...     MgpstrForSceneTextRecognition,
+        ... )
+        >>> import requests
+        >>> from PIL import Image
+
+        >>> # load image from the IIIT-5k dataset
+        >>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
+        >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+        >>> processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
+        >>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
+
+        >>> model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")
+
+        >>> # inference
+        >>> outputs = model(pixel_values)
+        >>> out_strs = processor.batch_decode(outputs.logits)
+        >>> out_strs["generated_text"]
+        ['ticket']
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        mgp_outputs = self.mgp_str(
+            pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        sequence_output = mgp_outputs[0]
+
+        char_a3_out, char_attention = self.char_a3_module(sequence_output)
+        bpe_a3_out, bpe_attention = self.bpe_a3_module(sequence_output)
+        wp_a3_out, wp_attention = self.wp_a3_module(sequence_output)
+
+        char_logits = self.char_head(char_a3_out)
+        bpe_logits = self.bpe_head(bpe_a3_out)
+        wp_logits = self.wp_head(wp_a3_out)
+
+        all_a3_attentions = (char_attention, bpe_attention, wp_attention) if output_a3_attentions else None
+        all_logits = (char_logits, bpe_logits, wp_logits)
+
+        if not return_dict:
+            outputs = (all_logits, all_a3_attentions) + mgp_outputs[1:]
+            return tuple(output for output in outputs if output is not None)
+        return MgpstrModelOutput(
+            logits=all_logits,
+            hidden_states=mgp_outputs.hidden_states,
+            attentions=mgp_outputs.attentions,
+            a3_attentions=all_a3_attentions,
+        )
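Review note: the aggregation step of `MgpstrA3Module` (a softmax over the patch positions followed by the `...si,...id->...sd` einsum) can be sketched for a single sample in plain Python. This is a minimal illustration under stated assumptions, not the module itself: it leaves out the layer norms and the 1x1 convolutions, and `softmax`, `a3_aggregate`, and the toy inputs are hypothetical names introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def a3_aggregate(scores, feats):
    """Aggregate N patch features into S output tokens.

    scores: S x N raw selection scores (one row per output token).
    feats:  N x D patch features.
    Returns (S x D aggregated features, S x N attention weights).
    """
    attentions = [softmax(row) for row in scores]
    dim = len(feats[0])
    aggregated = [
        [sum(w * feat[d] for w, feat in zip(weights, feats)) for d in range(dim)]
        for weights in attentions
    ]
    return aggregated, attentions


# Hypothetical toy inputs: 2 output tokens, 3 patches, 2 feature channels.
scores = [[0.0, 50.0, 0.0], [1.0, 1.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
a3_out, a3_attn = a3_aggregate(scores, feats)
```

The first output token attends almost exclusively to the second patch, while the second token has uniform scores and therefore averages all three patches.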
diff --git a/src/transformers/models/mgp_str/processing_mgp_str.py b/src/transformers/models/mgp_str/processing_mgp_str.py
new file mode 100644
index 000000000000..76143deead64
--- /dev/null
+++ b/src/transformers/models/mgp_str/processing_mgp_str.py
@@ -0,0 +1,228 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Processor class for MGP-STR."""
+
+import warnings
+
+from transformers import AutoTokenizer
+from transformers.utils import is_torch_available
+from transformers.utils.generic import ExplicitEnum
+
+from ...processing_utils import ProcessorMixin
+
+
+if is_torch_available():
+    import torch
+
+
+class DecodeType(ExplicitEnum):
+    CHARACTER = "char"
+    BPE = "bpe"
+    WORDPIECE = "wp"
+
+
+SUPPORTED_ANNOTATION_FORMATS = (DecodeType.CHARACTER, DecodeType.BPE, DecodeType.WORDPIECE)
+
+
+class MgpstrProcessor(ProcessorMixin):
+    r"""
+    Constructs an MGP-STR processor which wraps an image processor and MGP-STR tokenizers into a single processor.
+
+    [`MgpstrProcessor`] offers all the functionalities of [`ViTImageProcessor`] and [`MgpstrTokenizer`]. See
+    [`~MgpstrProcessor.__call__`] and [`~MgpstrProcessor.batch_decode`] for more information.
+
+    Args:
+        image_processor (`ViTImageProcessor`):
+            An instance of `ViTImageProcessor`. The image processor is a required input.
+        tokenizer ([`MgpstrTokenizer`]):
+            The tokenizer is a required input.
+    """
+    attributes = ["image_processor", "char_tokenizer"]
+    image_processor_class = "ViTImageProcessor"
+    char_tokenizer_class = "MgpstrTokenizer"
+
+    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
+        feature_extractor = kwargs.pop("feature_extractor", None)
+        if feature_extractor is not None:
+            warnings.warn(
+                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
+                " instead.",
+                FutureWarning,
+            )
+
+        image_processor = image_processor if image_processor is not None else feature_extractor
+        if image_processor is None:
+            raise ValueError("You need to specify an `image_processor`.")
+        if tokenizer is None:
+            raise ValueError("You need to specify a `tokenizer`.")
+
+        self.char_tokenizer = tokenizer
+        self.bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
+        self.wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+        super().__init__(image_processor, tokenizer)
+
+    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
+        """
+        When used in normal mode, this method forwards all its arguments to ViTImageProcessor's
+        [`~ViTImageProcessor.__call__`] and returns its output. This method also forwards the `text` and `kwargs`
+        arguments to MgpstrTokenizer's [`~MgpstrTokenizer.__call__`] if `text` is not `None` to encode the text. Please
+        refer to the docstring of the above methods for more information.
+        """
+        if images is None and text is None:
+            raise ValueError("You need to specify either an `images` or `text` input to process.")
+
+        if images is not None:
+            inputs = self.image_processor(images, return_tensors=return_tensors, **kwargs)
+        if text is not None:
+            encodings = self.char_tokenizer(text, return_tensors=return_tensors, **kwargs)
+
+        if text is None:
+            return inputs
+        elif images is None:
+            return encodings
+        else:
+            inputs["labels"] = encodings["input_ids"]
+            return inputs
+
+    def batch_decode(self, sequences):
+        """
+        Decode the character, BPE, and wordpiece logits and keep, per sample, the most confident prediction.
+
+        Args:
+            sequences (`Tuple[torch.Tensor]`):
+                The `(char_logits, bpe_logits, wp_logits)` tuple returned by the model as `logits`.
+
+        Returns:
+            `Dict[str, Any]`: Dictionary of all the outputs of the decoded results.
+                generated_text (`List[str]`): The final results after fusion of char, bpe, and wp.
+                scores (`List[float]`): The final scores after fusion of char, bpe, and wp.
+                char_preds (`List[str]`): The list of character decoded sentences.
+                bpe_preds (`List[str]`): The list of bpe decoded sentences.
+                wp_preds (`List[str]`): The list of wp decoded sentences.
+
+        Each granularity is decoded with its own tokenizer through [`~PreTrainedTokenizer.batch_decode`].
+        """
+        char_preds, bpe_preds, wp_preds = sequences
+        batch_size = char_preds.size(0)
+
+        char_strs, char_scores = self._decode_helper(char_preds, "char")
+        bpe_strs, bpe_scores = self._decode_helper(bpe_preds, "bpe")
+        wp_strs, wp_scores = self._decode_helper(wp_preds, "wp")
+
+        final_strs = []
+        final_scores = []
+        for i in range(batch_size):
+            scores = [char_scores[i], bpe_scores[i], wp_scores[i]]
+            strs = [char_strs[i], bpe_strs[i], wp_strs[i]]
+            max_score_index = scores.index(max(scores))
+            final_strs.append(strs[max_score_index])
+            final_scores.append(scores[max_score_index])
+
+        out = {}
+        out["generated_text"] = final_strs
+        out["scores"] = final_scores
+        out["char_preds"] = char_strs
+        out["bpe_preds"] = bpe_strs
+        out["wp_preds"] = wp_strs
+        return out
+
+    def _decode_helper(self, pred_logits, format):
+        """
+        Decode the prediction logits of one granularity into strings and per-sample confidence scores.
+
+        Args:
+            pred_logits (`torch.Tensor`):
+                List of model prediction logits.
+            format (`Union[DecodeType, str]`):
+                Type of model prediction. Must be one of ['char', 'bpe', 'wp'].
+        Returns:
+            `tuple`:
+                dec_strs (`List[str]`): The decoded strings of model prediction.
+                conf_scores (`List[float]`): The confidence score of model prediction.
+        """
+        if format == DecodeType.CHARACTER:
+            decoder = self.char_decode
+            eos_token = 1
+            eos_str = "[s]"
+        elif format == DecodeType.BPE:
+            decoder = self.bpe_decode
+            eos_token = 2
+            eos_str = "#"
+        elif format == DecodeType.WORDPIECE:
+            decoder = self.wp_decode
+            eos_token = 102
+            eos_str = "[SEP]"
+        else:
+            raise ValueError(f"Format {format} is not supported.")
+
+        dec_strs, conf_scores = [], []
+        batch_size = pred_logits.size(0)
+        batch_max_length = pred_logits.size(1)
+        _, preds_index = pred_logits.topk(1, dim=-1, largest=True, sorted=True)
+        preds_index = preds_index.view(-1, batch_max_length)[:, 1:]
+        preds_str = decoder(preds_index)
+        preds_max_prob, _ = torch.nn.functional.softmax(pred_logits, dim=2).max(dim=2)
+        preds_max_prob = preds_max_prob[:, 1:]
+
+        for index in range(batch_size):
+            pred_eos = preds_str[index].find(eos_str)
+            pred = preds_str[index][:pred_eos]
+            pred_index = preds_index[index].cpu().tolist()
+            pred_eos_index = pred_index.index(eos_token) if eos_token in pred_index else -1
+            pred_max_prob = preds_max_prob[index][: pred_eos_index + 1]
+            confidence_score = pred_max_prob.cumprod(dim=0)[-1] if pred_max_prob.nelement() != 0 else 0.0
+            dec_strs.append(pred)
+            conf_scores.append(confidence_score)
+
+        return dec_strs, conf_scores
+
+    def char_decode(self, sequences):
+        """
+        Convert a list of lists of char token ids into a list of strings by calling char tokenizer.
+
+        Args:
+            sequences (`torch.Tensor`):
+                List of tokenized input ids.
+        Returns:
+            `List[str]`: The list of char decoded sentences.
+        """
+        decode_strs = [seq.replace(" ", "") for seq in self.char_tokenizer.batch_decode(sequences)]
+        return decode_strs
+
+    def bpe_decode(self, sequences):
+        """
+        Convert a list of lists of bpe token ids into a list of strings by calling bpe tokenizer.
+
+        Args:
+            sequences (`torch.Tensor`):
+                List of tokenized input ids.
+        Returns:
+            `List[str]`: The list of bpe decoded sentences.
+        """
+        return self.bpe_tokenizer.batch_decode(sequences)
+
+    def wp_decode(self, sequences):
+        """
+        Convert a list of lists of word piece token ids into a list of strings by calling word piece tokenizer.
+
+        Args:
+            sequences (`torch.Tensor`):
+                List of tokenized input ids.
+        Returns:
+            `List[str]`: The list of wp decoded sentences.
+        """
+        decode_strs = [seq.replace(" ", "") for seq in self.wp_tokenizer.batch_decode(sequences)]
+        return decode_strs
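Review note: the scoring and fusion rule in `_decode_helper` and `batch_decode` can be restated without tensors. The sketch below is a simplified reading of the code above, assuming the leading token has already been dropped; `confidence_score`, `fuse`, and the toy values are hypothetical names, not part of the processor.

```python
import math


def confidence_score(token_ids, max_probs, eos_token):
    """Product of the per-step max probabilities up to and including the first EOS.

    Mirrors the slice `[: pred_eos_index + 1]` followed by `cumprod(...)[-1]`;
    a sequence with no EOS gets an empty slice and therefore a score of 0.0.
    """
    if eos_token not in token_ids:
        return 0.0
    end = token_ids.index(eos_token) + 1
    return math.prod(max_probs[:end])


def fuse(candidates):
    """Return the (text, score) pair with the highest score; ties keep the earliest."""
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical predictions for one image: (decoded text, confidence).
char = ("ticket", confidence_score([5, 6, 1, 9], [0.9, 0.8, 0.5, 0.1], eos_token=1))
bpe = ("tick", confidence_score([7, 8, 9], [0.99, 0.99, 0.99], eos_token=2))
best_text, best_score = fuse([char, bpe])
```

Here the BPE candidate never emits its EOS id, so it scores 0.0 and the character-level prediction wins despite its lower per-step probabilities.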
diff --git a/src/transformers/models/mgp_str/tokenization_mgp_str.py b/src/transformers/models/mgp_str/tokenization_mgp_str.py
new file mode 100644
index 000000000000..9d4fddcc7e83
--- /dev/null
+++ b/src/transformers/models/mgp_str/tokenization_mgp_str.py
@@ -0,0 +1,110 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for MGP-STR CHAR."""
+
+import json
+import os
+from typing import Optional, Tuple
+
+from ...tokenization_utils import PreTrainedTokenizer
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "mgp-str": "https://huggingface.co/alibaba-damo/mgp-str-base/blob/main/vocab.json",
+    }
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {"mgp-str": 27}
+
+
+class MgpstrTokenizer(PreTrainedTokenizer):
+    """
+    Construct an MGP-STR char tokenizer.
+
+    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
+    this superclass for more information regarding those methods.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        unk_token (`str`, *optional*, defaults to `"[GO]"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `"[GO]"`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `"[s]"`):
+            The end of sequence token.
+        pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[GO]"`):
+            A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
+            attention mechanisms or loss computation.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+
+    def __init__(self, vocab_file, unk_token="[GO]", bos_token="[GO]", eos_token="[s]", pad_token="[GO]", **kwargs):
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+
+        with open(vocab_file, encoding="utf-8") as vocab_handle:
+            self.vocab = json.load(vocab_handle)
+        self.decoder = {v: k for k, v in self.vocab.items()}
+
+    @property
+    def vocab_size(self):
+        return len(self.vocab)
+
+    def get_vocab(self):
+        return dict(self.vocab, **self.added_tokens_encoder)
+
+    def _tokenize(self, text):
+        """Tokenize a string."""
+        char_tokens = []
+        for s in text:
+            char_tokens.extend(s)
+        return char_tokens
+
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        return self.vocab.get(token, self.vocab.get(self.unk_token))
+
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        return self.decoder.get(index)
+
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+
+        with open(vocab_file, "w", encoding="utf-8") as f:
+            f.write(json.dumps(self.vocab, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
+
+        return (vocab_file,)
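Review note: the character-level behaviour of `MgpstrTokenizer` amounts to splitting the text into single characters and looking each one up, with unknown characters falling back to the `[GO]` id. The sketch below is a standalone illustration only; the toy `VOCAB` is hypothetical (the real vocabulary is read from `vocab.json`) and the helper names are not the tokenizer's API.

```python
# Hypothetical toy vocabulary; the real one is loaded from vocab.json.
VOCAB = {"[GO]": 0, "[s]": 1, "a": 2, "b": 3, "c": 4}
ID_TO_TOKEN = {index: token for token, index in VOCAB.items()}
UNK_TOKEN = "[GO]"


def tokenize(text):
    """Split a string into single-character tokens."""
    return list(text)


def convert_tokens_to_ids(tokens):
    """Map tokens to ids, using the unknown-token id for anything outside the vocab."""
    unk_id = VOCAB[UNK_TOKEN]
    return [VOCAB.get(token, unk_id) for token in tokens]


def convert_ids_to_tokens(ids):
    """Map ids back to tokens; ids outside the vocab map to None."""
    return [ID_TO_TOKEN.get(index) for index in ids]


ids = convert_tokens_to_ids(tokenize("cab?"))
tokens = convert_ids_to_tokens(ids)
```

Because `unk_token`, `bos_token`, and `pad_token` all default to `[GO]`, an unknown character is indistinguishable from padding once it has been encoded.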
diff --git a/src/transformers/models/mt5/modeling_mt5.py b/src/transformers/models/mt5/modeling_mt5.py
index 951a68cb7678..e2235fc69055 100644
--- a/src/transformers/models/mt5/modeling_mt5.py
+++ b/src/transformers/models/mt5/modeling_mt5.py
@@ -978,6 +978,13 @@ def forward(
         else:
             encoder_extended_attention_mask = None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         # Prepare head mask if needed
         head_mask = self.get_head_mask(head_mask, self.config.num_layers)
         cross_attn_head_mask = self.get_head_mask(cross_attn_head_mask, self.config.num_layers)
@@ -1015,11 +1022,6 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning_once(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
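Review note: this hunk, like the matching ones in the files below, moves the `use_cache` check out of the per-layer loop so that it runs once before the loop starts. The toy function below sketches the resulting control flow. It is not the model code: `run_layers` and its arguments are hypothetical, and the comment about cache initialisation is an assumption about the motivation, which the diff does not state.

```python
import logging

logger = logging.getLogger("checkpointing_sketch")


def run_layers(num_layers, use_cache, gradient_checkpointing, training):
    """Toy stand-in for a decoder forward pass.

    Returns the effective use_cache flag, the collected cache entries,
    and how many warnings were emitted.
    """
    warnings_emitted = 0

    # Resolve the conflict once, before the layer loop.
    if gradient_checkpointing and training and use_cache:
        logger.warning(
            "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
        )
        warnings_emitted += 1
        use_cache = False

    # Assumption: state derived from use_cache is created after the check,
    # so it already reflects the final value of the flag.
    present_key_values = () if use_cache else None

    for layer_index in range(num_layers):
        if use_cache:
            present_key_values = present_key_values + (layer_index,)

    return use_cache, present_key_values, warnings_emitted


checkpointed = run_layers(3, use_cache=True, gradient_checkpointing=True, training=True)
cached = run_layers(2, use_cache=True, gradient_checkpointing=False, training=False)
```

With the check ahead of the loop, the flag is settled before any layer runs, and the warning fires at most once per forward call regardless of the number of layers.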
diff --git a/src/transformers/models/speech_to_text/modeling_speech_to_text.py b/src/transformers/models/speech_to_text/modeling_speech_to_text.py
index eaef470e0822..d08863f8353f 100755
--- a/src/transformers/models/speech_to_text/modeling_speech_to_text.py
+++ b/src/transformers/models/speech_to_text/modeling_speech_to_text.py
@@ -1024,6 +1024,13 @@ def forward(
         hidden_states = inputs_embeds + positions
         hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`..."
+                )
+                use_cache = False
+
         # decoder layers
         all_hidden_states = () if output_hidden_states else None
         all_self_attns = () if output_attentions else None
@@ -1048,12 +1055,6 @@ def forward(
             past_key_value = past_key_values[idx] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache ="
-                        " False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
diff --git a/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py b/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py
index e408974a5053..319589eab144 100755
--- a/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py
+++ b/src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py
@@ -632,6 +632,13 @@ def forward(
         hidden_states = inputs_embeds + positions
         hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`..."
+                )
+                use_cache = False
+
         # decoder layers
         all_hidden_states = () if output_hidden_states else None
         all_self_attns = () if output_attentions else None
@@ -657,12 +664,6 @@ def forward(
             past_key_value = past_key_values[idx] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache ="
-                        " False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
diff --git a/src/transformers/models/speecht5/modeling_speecht5.py b/src/transformers/models/speecht5/modeling_speecht5.py
index e67c55c23b4e..975f483395be 100644
--- a/src/transformers/models/speecht5/modeling_speecht5.py
+++ b/src/transformers/models/speecht5/modeling_speecht5.py
@@ -1662,6 +1662,13 @@ def forward(
 
         deepspeed_zero3_is_enabled = is_deepspeed_zero3_enabled()
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         # decoder layers
         all_hidden_states = () if output_hidden_states else None
         all_self_attentions = () if output_attentions else None
@@ -1691,11 +1698,6 @@ def forward(
             past_key_value = past_key_values[idx] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning_once(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
diff --git a/src/transformers/models/switch_transformers/modeling_switch_transformers.py b/src/transformers/models/switch_transformers/modeling_switch_transformers.py
index de24797c67b0..bcf1c4b7bc83 100644
--- a/src/transformers/models/switch_transformers/modeling_switch_transformers.py
+++ b/src/transformers/models/switch_transformers/modeling_switch_transformers.py
@@ -1039,6 +1039,13 @@ def forward(
         else:
             encoder_extended_attention_mask = None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         # Prepare head mask if needed
         head_mask = self.get_head_mask(head_mask, self.config.num_layers)
         cross_attn_head_mask = self.get_head_mask(cross_attn_head_mask, self.config.num_layers)
@@ -1060,11 +1067,6 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning_once(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
diff --git a/src/transformers/models/t5/modeling_t5.py b/src/transformers/models/t5/modeling_t5.py
index 976899863119..19cb83dac352 100644
--- a/src/transformers/models/t5/modeling_t5.py
+++ b/src/transformers/models/t5/modeling_t5.py
@@ -1007,6 +1007,13 @@ def forward(
         else:
             encoder_extended_attention_mask = None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         # Prepare head mask if needed
         head_mask = self.get_head_mask(head_mask, self.config.num_layers)
         cross_attn_head_mask = self.get_head_mask(cross_attn_head_mask, self.config.num_layers)
@@ -1044,11 +1051,6 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning_once(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
 
                 def create_custom_forward(module):
                     def custom_forward(*inputs):
diff --git a/src/transformers/models/whisper/modeling_whisper.py b/src/transformers/models/whisper/modeling_whisper.py
index cefcac389507..94d0be40476d 100644
--- a/src/transformers/models/whisper/modeling_whisper.py
+++ b/src/transformers/models/whisper/modeling_whisper.py
@@ -737,7 +737,6 @@ class WhisperEncoder(WhisperPreTrainedModel):
 
     Args:
         config: WhisperConfig
-        embed_tokens (nn.Embedding): output embedding
     """
 
     def __init__(self, config: WhisperConfig):
diff --git a/src/transformers/pipelines/__init__.py b/src/transformers/pipelines/__init__.py
index 0f83cb0dea8f..c8c0549a4674 100755
--- a/src/transformers/pipelines/__init__.py
+++ b/src/transformers/pipelines/__init__.py
@@ -103,6 +103,7 @@
         TFAutoModelForTableQuestionAnswering,
         TFAutoModelForTokenClassification,
         TFAutoModelForVision2Seq,
+        TFAutoModelForZeroShotImageClassification,
     )
 
 if is_torch_available():
@@ -135,6 +136,7 @@
         AutoModelForVideoClassification,
         AutoModelForVision2Seq,
         AutoModelForVisualQuestionAnswering,
+        AutoModelForZeroShotImageClassification,
         AutoModelForZeroShotObjectDetection,
     )
 if TYPE_CHECKING:
@@ -290,8 +292,8 @@
     },
     "zero-shot-image-classification": {
         "impl": ZeroShotImageClassificationPipeline,
-        "tf": (TFAutoModel,) if is_tf_available() else (),
-        "pt": (AutoModel,) if is_torch_available() else (),
+        "tf": (TFAutoModelForZeroShotImageClassification,) if is_tf_available() else (),
+        "pt": (AutoModelForZeroShotImageClassification,) if is_torch_available() else (),
         "default": {
             "model": {
                 "pt": ("openai/clip-vit-base-patch32", "f4881ba"),
diff --git a/src/transformers/pipelines/token_classification.py b/src/transformers/pipelines/token_classification.py
index 5dc243d1acf2..f3c78d3498d5 100644
--- a/src/transformers/pipelines/token_classification.py
+++ b/src/transformers/pipelines/token_classification.py
@@ -304,7 +304,9 @@ def gather_pre_entities(
                         start_ind = start_ind.item()
                         end_ind = end_ind.item()
                 word_ref = sentence[start_ind:end_ind]
-                if getattr(self.tokenizer._tokenizer.model, "continuing_subword_prefix", None):
+                if getattr(self.tokenizer, "_tokenizer", None) and getattr(
+                    self.tokenizer._tokenizer.model, "continuing_subword_prefix", None
+                ):
                     # This is a BPE, word aware tokenizer, there is a correct way
                     # to fuse tokens
                     is_subword = len(word) != len(word_ref)
diff --git a/src/transformers/pipelines/zero_shot_image_classification.py b/src/transformers/pipelines/zero_shot_image_classification.py
index f19a548c85db..8ba07eb018cb 100644
--- a/src/transformers/pipelines/zero_shot_image_classification.py
+++ b/src/transformers/pipelines/zero_shot_image_classification.py
@@ -18,9 +18,10 @@
     from ..image_utils import load_image
 
 if is_torch_available():
-    pass
+    from ..models.auto.modeling_auto import MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
 
 if is_tf_available():
+    from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
     from ..tf_utils import stable_softmax
 
 logger = logging.get_logger(__name__)
@@ -64,8 +65,11 @@ def __init__(self, **kwargs):
         super().__init__(**kwargs)
 
         requires_backends(self, "vision")
-        # No specific FOR_XXX available yet
-        # self.check_model_type(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING)
+        self.check_model_type(
+            TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
+            if self.framework == "tf"
+            else MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING
+        )
 
     def __call__(self, images: Union[str, List[str], "Image", List["Image"]], **kwargs):
         """
@@ -137,9 +141,11 @@ def postprocess(self, model_outputs):
         if self.framework == "pt":
             probs = logits.softmax(dim=-1).squeeze(-1)
             scores = probs.tolist()
-        else:
+        elif self.framework == "tf":
             probs = stable_softmax(logits, axis=-1)
             scores = probs.numpy().tolist()
+        else:
+            raise ValueError(f"Unsupported framework: {self.framework}")
 
         result = [
             {"score": score, "label": candidate_label}
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 518e58342afe..344523842343 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -2492,7 +2492,8 @@ def hyperparameter_search(
                 - the documentation of [sigopt](https://app.sigopt.com/docs/endpoints/experiments/create)
 
         Returns:
-            [`trainer_utils.BestRun`]: All the information about the best run.
+            [`trainer_utils.BestRun`]: All the information about the best run. The experiment summary can be found
+            in the `run_summary` attribute for the Ray backend.
         """
         if backend is None:
             backend = default_hp_search_backend()
diff --git a/src/transformers/trainer_utils.py b/src/transformers/trainer_utils.py
index 9f273ab1ed6c..a213e4b1f447 100644
--- a/src/transformers/trainer_utils.py
+++ b/src/transformers/trainer_utils.py
@@ -192,7 +192,7 @@ class HubStrategy(ExplicitEnum):
 
 class BestRun(NamedTuple):
     """
-    The best run found by an hyperparameter search (see [`~Trainer.hyperparameter_search`]).
+    The best run found by a hyperparameter search (see [`~Trainer.hyperparameter_search`]).
 
     Parameters:
         run_id (`str`):
@@ -202,11 +202,14 @@ class BestRun(NamedTuple):
             The objective that was obtained for this run.
         hyperparameters (`Dict[str, Any]`):
             The hyperparameters picked to get this run.
+        run_summary (`Optional[Any]`):
+            A summary of the tuning experiments. A `ray.tune.ExperimentAnalysis` object for the Ray backend.
     """
 
     run_id: str
     objective: float
     hyperparameters: Dict[str, Any]
+    run_summary: Optional[Any] = None
 
 
 def default_compute_objective(metrics: Dict[str, float]) -> float:
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py
index de9b36a445c2..32b1f0059063 100644
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -474,8 +474,6 @@ class TrainingArguments:
             The optimizer to use: adamw_hf, adamw_torch, adamw_apex_fused, adamw_anyprecision or adafactor.
         optim_args (`str`, *optional*):
             Optional arguments that are supplied to AnyPrecisionAdamW.
-        adafactor (`bool`, *optional*, defaults to `False`):
-            This argument is deprecated. Use `--optim adafactor` instead.
         group_by_length (`bool`, *optional*, defaults to `False`):
             Whether or not to group together samples of roughly the same length in the training dataset (to minimize
             padding applied and be more efficient). Only useful if applying dynamic padding.
@@ -1803,7 +1801,9 @@ def _no_sync_in_gradient_accumulation(self):
         """
         Whether or not to use no_sync for the gradients when doing gradient accumulation.
         """
-        return not (self.deepspeed or is_sagemaker_dp_enabled() or is_sagemaker_mp_enabled())
+        return not (
+            self.deepspeed or is_sagemaker_dp_enabled() or is_sagemaker_mp_enabled() or is_torch_neuroncore_available()
+        )
 
     @contextlib.contextmanager
     def main_process_first(self, local=True, desc="work"):
@@ -1905,6 +1905,524 @@ def to_sanitized_dict(self) -> Dict[str, Any]:
 
         return {k: v if type(v) in valid_types else str(v) for k, v in d.items()}
 
+    # The following methods are there to simplify the instantiation of `TrainingArguments`
+    def set_training(
+        self,
+        learning_rate: float = 5e-5,
+        batch_size: int = 8,
+        weight_decay: float = 0,
+        num_epochs: float = 3,
+        max_steps: int = -1,
+        gradient_accumulation_steps: int = 1,
+        seed: int = 42,
+        gradient_checkpointing: bool = False,
+    ):
+        """
+        A method that regroups all basic arguments linked to the training.
+
+        <Tip>
+
+        Calling this method will automatically set `self.do_train` to `True`.
+
+        </Tip>
+
+        Args:
+            learning_rate (`float`, *optional*, defaults to 5e-5):
+                The initial learning rate for the optimizer.
+            batch_size (`int`, *optional*, defaults to 8):
+                The batch size per device (GPU/TPU core/CPU...) used for training.
+            weight_decay (`float`, *optional*, defaults to 0):
+                The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in the
+                optimizer.
+            num_epochs (`float`, *optional*, defaults to 3.0):
+                Total number of training epochs to perform (if not an integer, the fractional part is the share of
+                the last epoch performed before training stops).
+            max_steps (`int`, *optional*, defaults to -1):
+                If set to a positive number, the total number of training steps to perform. Overrides
+                `num_epochs`. In case of using a finite iterable dataset the training may stop before reaching
+                the set number of steps when all data is exhausted.
+            gradient_accumulation_steps (`int`, *optional*, defaults to 1):
+                Number of update steps to accumulate the gradients for, before performing a backward/update pass.
+
+                <Tip warning={true}>
+
+                When using gradient accumulation, one step is counted as one step with backward pass. Therefore,
+                logging, evaluation, save will be conducted every `gradient_accumulation_steps * xxx_step` training
+                examples.
+
+                </Tip>
+
+            seed (`int`, *optional*, defaults to 42):
+                Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use
+                the [`~Trainer.model_init`] function to instantiate the model if it has some randomly initialized
+                parameters.
+            gradient_checkpointing (`bool`, *optional*, defaults to `False`):
+                If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_training(learning_rate=1e-4, batch_size=32)
+        >>> args.learning_rate
+        0.0001
+        ```
+        """
+        self.do_train = True
+        self.learning_rate = learning_rate
+        self.per_device_train_batch_size = batch_size
+        self.weight_decay = weight_decay
+        self.num_train_epochs = num_epochs
+        self.max_steps = max_steps
+        self.gradient_accumulation_steps = gradient_accumulation_steps
+        self.seed = seed
+        self.gradient_checkpointing = gradient_checkpointing
+        return self
+
+    def set_evaluate(
+        self,
+        strategy: Union[str, IntervalStrategy] = "no",
+        steps: int = 500,
+        batch_size: int = 8,
+        accumulation_steps: Optional[int] = None,
+        delay: Optional[float] = None,
+        loss_only: bool = False,
+        jit_mode: bool = False,
+    ):
+        """
+        A method that regroups all arguments linked to the evaluation.
+
+        Args:
+            strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"no"`):
+                The evaluation strategy to adopt during training. Possible values are:
+
+                    - `"no"`: No evaluation is done during training.
+                    - `"steps"`: Evaluation is done (and logged) every `steps`.
+                    - `"epoch"`: Evaluation is done at the end of each epoch.
+
+                Setting a `strategy` different from `"no"` will set `self.do_eval` to `True`.
+            steps (`int`, *optional*, defaults to 500):
+                Number of update steps between two evaluations if `strategy="steps"`.
+            batch_size (`int`, *optional*, defaults to 8):
+                The batch size per device (GPU/TPU core/CPU...) used for evaluation.
+            accumulation_steps (`int`, *optional*):
+                Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU.
+                If left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster
+                but requires more memory).
+            delay (`float`, *optional*):
+                Number of epochs or steps to wait for before the first evaluation can be performed, depending on
+                `strategy`.
+            loss_only (`bool`, *optional*, defaults to `False`):
+                Ignores all outputs except the loss.
+            jit_mode (`bool`, *optional*, defaults to `False`):
+                Whether or not to use PyTorch jit trace for inference.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_evaluate(strategy="steps", steps=100)
+        >>> args.eval_steps
+        100
+        ```
+        """
+        self.evaluation_strategy = IntervalStrategy(strategy)
+        if self.evaluation_strategy == IntervalStrategy.STEPS and steps == 0:
+            raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.")
+        self.do_eval = self.evaluation_strategy != IntervalStrategy.NO
+        self.eval_steps = steps
+        self.per_device_eval_batch_size = batch_size
+        self.eval_accumulation_steps = accumulation_steps
+        self.eval_delay = delay
+        self.prediction_loss_only = loss_only
+        self.jit_mode_eval = jit_mode
+        return self
+
+    def set_testing(
+        self,
+        batch_size: int = 8,
+        loss_only: bool = False,
+        jit_mode: bool = False,
+    ):
+        """
+        A method that regroups all basic arguments linked to testing on a held-out dataset.
+
+        <Tip>
+
+        Calling this method will automatically set `self.do_predict` to `True`.
+
+        </Tip>
+
+        Args:
+            batch_size (`int`, *optional*, defaults to 8):
+                The batch size per device (GPU/TPU core/CPU...) used for testing.
+            loss_only (`bool`, *optional*, defaults to `False`):
+                Ignores all outputs except the loss.
+            jit_mode (`bool`, *optional*, defaults to `False`):
+                Whether or not to use PyTorch jit trace for inference.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_testing(batch_size=32)
+        >>> args.per_device_eval_batch_size
+        32
+        ```
+        """
+        self.do_predict = True
+        self.per_device_eval_batch_size = batch_size
+        self.prediction_loss_only = loss_only
+        self.jit_mode_eval = jit_mode
+        return self
+
+    def set_save(
+        self,
+        strategy: Union[str, IntervalStrategy] = "steps",
+        steps: int = 500,
+        total_limit: Optional[int] = None,
+        on_each_node: bool = False,
+    ):
+        """
+        A method that regroups all arguments linked to checkpoint saving.
+
+        Args:
+            strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`):
+                The checkpoint save strategy to adopt during training. Possible values are:
+
+                    - `"no"`: No save is done during training.
+                    - `"epoch"`: Save is done at the end of each epoch.
+                    - `"steps"`: Save is done every `steps`.
+
+            steps (`int`, *optional*, defaults to 500):
+                Number of update steps between two checkpoint saves if `strategy="steps"`.
+            total_limit (`int`, *optional*):
+                If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in
+                `output_dir`.
+            on_each_node (`bool`, *optional*, defaults to `False`):
+                When doing multi-node distributed training, whether to save models and checkpoints on each node, or
+                only on the main one.
+
+                This should not be activated when the different nodes use the same storage as the files will be saved
+                with the same names for each node.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_save(strategy="steps", steps=100)
+        >>> args.save_steps
+        100
+        ```
+        """
+        self.save_strategy = IntervalStrategy(strategy)
+        if self.save_strategy == IntervalStrategy.STEPS and steps == 0:
+            raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.")
+        self.save_steps = steps
+        self.save_total_limit = total_limit
+        self.save_on_each_node = on_each_node
+        return self
+
+    def set_logging(
+        self,
+        strategy: Union[str, IntervalStrategy] = "steps",
+        steps: int = 500,
+        report_to: Union[str, List[str]] = "none",
+        level: str = "passive",
+        first_step: bool = False,
+        nan_inf_filter: bool = False,
+        on_each_node: bool = False,
+        replica_level: str = "passive",
+    ):
+        """
+        A method that regroups all arguments linked to logging.
+
+        Args:
+            strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`):
+                The logging strategy to adopt during training. Possible values are:
+
+                    - `"no"`: No logging is done during training.
+                    - `"epoch"`: Logging is done at the end of each epoch.
+                    - `"steps"`: Logging is done every `steps`.
+
+            steps (`int`, *optional*, defaults to 500):
+                Number of update steps between two logs if `strategy="steps"`.
+            level (`str`, *optional*, defaults to `"passive"`):
+                Logger log level to use on the main process. Possible choices are the log levels as strings: `"debug"`,
+                `"info"`, `"warning"`, `"error"` and `"critical"`, plus a `"passive"` level which doesn't set anything
+                and lets the application set the level.
+            report_to (`str` or `List[str]`, *optional*, defaults to `"none"`):
+                The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,
+                `"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. Use `"all"` to report
+                to all integrations installed, `"none"` for no integrations.
+            first_step (`bool`, *optional*, defaults to `False`):
+                Whether to log and evaluate the first `global_step` or not.
+            nan_inf_filter (`bool`, *optional*, defaults to `False`):
+                Whether to filter `nan` and `inf` losses for logging. If set to `True` the loss of every step that is
+                `nan` or `inf` is filtered and the average loss of the current logging window is taken instead.
+
+                <Tip>
+
+                `nan_inf_filter` only influences the logging of loss values, it does not change how the
+                gradient is computed or applied to the model.
+
+                </Tip>
+
+            on_each_node (`bool`, *optional*, defaults to `False`):
+                In multinode distributed training, whether to log using `level` once per node, or only on the main
+                node.
+            replica_level (`str`, *optional*, defaults to `"passive"`):
+                Logger log level to use on replicas. Same choices as `level`.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_logging(strategy="steps", steps=100)
+        >>> args.logging_steps
+        100
+        ```
+        """
+        self.logging_strategy = IntervalStrategy(strategy)
+        if self.logging_strategy == IntervalStrategy.STEPS and steps == 0:
+            raise ValueError("Setting `strategy` as 'steps' requires a positive value for `steps`.")
+        self.logging_steps = steps
+        self.report_to = report_to
+        self.log_level = level
+        self.logging_first_step = first_step
+        self.logging_nan_inf_filter = nan_inf_filter
+        self.log_on_each_node = on_each_node
+        self.log_level_replica = replica_level
+        return self
+
+    def set_push_to_hub(
+        self,
+        model_id: str,
+        strategy: Union[str, HubStrategy] = "every_save",
+        token: Optional[str] = None,
+        private_repo: bool = False,
+    ):
+        """
+        A method that regroups all arguments linked to synchronizing checkpoints with the Hub.
+
+        <Tip>
+
+        Calling this method will set `self.push_to_hub` to `True`, which means the `output_dir` will become a git
+        directory synced with the repo (determined by `model_id`) and the content will be pushed each time a save is
+        triggered (depending on `self.save_strategy`). Calling [`~Trainer.save_model`] will also trigger a push.
+
+        </Tip>
+
+        Args:
+            model_id (`str`):
+                The name of the repository to keep in sync with the local *output_dir*. It can be a simple model ID in
+                which case the model will be pushed in your namespace. Otherwise it should be the whole repository
+                name, for instance `"user_name/model"`, which allows you to push to an organization you are a member of
+                with `"organization_name/model"`.
+            strategy (`str` or [`~trainer_utils.HubStrategy`], *optional*, defaults to `"every_save"`):
+                Defines the scope of what is pushed to the Hub and when. Possible values are:
+
+                - `"end"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`]) and a
+                  draft of a model card when the [`~Trainer.save_model`] method is called.
+                - `"every_save"`: push the model, its configuration, the tokenizer (if passed along to the [`Trainer`])
+                  and a draft of a model card each time there is a model save.
+                  The pushes are asynchronous to not block training, and in case the saves are very frequent, a new
+                  push is only attempted if the previous one is finished. A last push is made with the final model at
+                  the end of training.
+                - `"checkpoint"`: like `"every_save"` but the latest checkpoint is also pushed in a subfolder named
+                  last-checkpoint, allowing you to resume training easily with
+                  `trainer.train(resume_from_checkpoint="last-checkpoint")`.
+                - `"all_checkpoints"`: like `"checkpoint"` but all checkpoints are pushed like they appear in the
+                  output folder (so you will get one checkpoint folder per folder in your final
+                  repository).
+
+            token (`str`, *optional*):
+                The token to use to push the model to the Hub. Will default to the token in the cache folder obtained
+                with `huggingface-cli login`.
+            private_repo (`bool`, *optional*, defaults to `False`):
+                If True, the Hub repo will be set to private.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_push_to_hub("me/awesome-model")
+        >>> args.hub_model_id
+        'me/awesome-model'
+        ```
+        """
+        self.push_to_hub = True
+        self.hub_model_id = model_id
+        self.hub_strategy = HubStrategy(strategy)
+        self.hub_token = token
+        self.hub_private_repo = private_repo
+        return self
+
+    def set_optimizer(
+        self,
+        name: Union[str, OptimizerNames] = "adamw_hf",
+        learning_rate: float = 5e-5,
+        weight_decay: float = 0,
+        beta1: float = 0.9,
+        beta2: float = 0.999,
+        epsilon: float = 1e-8,
+        args: Optional[str] = None,
+    ):
+        """
+        A method that regroups all arguments linked to the optimizer and its hyperparameters.
+
+        Args:
+            name (`str` or [`training_args.OptimizerNames`], *optional*, defaults to `"adamw_hf"`):
+                The optimizer to use: `"adamw_hf"`, `"adamw_torch"`, `"adamw_apex_fused"`, `"adamw_anyprecision"` or
+                `"adafactor"`.
+            learning_rate (`float`, *optional*, defaults to 5e-5):
+                The initial learning rate.
+            weight_decay (`float`, *optional*, defaults to 0):
+                The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights.
+            beta1 (`float`, *optional*, defaults to 0.9):
+                The beta1 hyperparameter for the Adam optimizer or its variants.
+            beta2 (`float`, *optional*, defaults to 0.999):
+                The beta2 hyperparameter for the Adam optimizer or its variants.
+            epsilon (`float`, *optional*, defaults to 1e-8):
+                The epsilon hyperparameter for the Adam optimizer or its variants.
+            args (`str`, *optional*):
+                Optional arguments that are supplied to AnyPrecisionAdamW (only useful when
+                `name="adamw_anyprecision"`).
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_optimizer(name="adamw_torch", beta1=0.8)
+        >>> args.optim
+        'adamw_torch'
+        ```
+        """
+        self.optim = OptimizerNames(name)
+        self.learning_rate = learning_rate
+        self.weight_decay = weight_decay
+        self.adam_beta1 = beta1
+        self.adam_beta2 = beta2
+        self.adam_epsilon = epsilon
+        self.optim_args = args
+        return self
+
+    def set_lr_scheduler(
+        self,
+        name: Union[str, SchedulerType] = "linear",
+        num_epochs: float = 3.0,
+        max_steps: int = -1,
+        warmup_ratio: float = 0,
+        warmup_steps: int = 0,
+    ):
+        """
+        A method that regroups all arguments linked to the learning rate scheduler and its hyperparameters.
+
+        Args:
+            name (`str` or [`SchedulerType`], *optional*, defaults to `"linear"`):
+                The scheduler type to use. See the documentation of [`SchedulerType`] for all possible values.
+            num_epochs (`float`, *optional*, defaults to 3.0):
+                Total number of training epochs to perform (if not an integer, the fractional part is the share of
+                the last epoch performed before training stops).
+            max_steps (`int`, *optional*, defaults to -1):
+                If set to a positive number, the total number of training steps to perform. Overrides
+                `num_epochs`. In case of using a finite iterable dataset the training may stop before reaching
+                the set number of steps when all data is exhausted.
+            warmup_ratio (`float`, *optional*, defaults to 0.0):
+                Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.
+            warmup_steps (`int`, *optional*, defaults to 0):
+                Number of steps used for a linear warmup from 0 to `learning_rate`. Overrides any effect of
+                `warmup_ratio`.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05)
+        >>> args.warmup_ratio
+        0.05
+        ```
+        """
+        self.lr_scheduler_type = SchedulerType(name)
+        self.num_train_epochs = num_epochs
+        self.max_steps = max_steps
+        self.warmup_ratio = warmup_ratio
+        self.warmup_steps = warmup_steps
+        return self
+
+    def set_dataloader(
+        self,
+        train_batch_size: int = 8,
+        eval_batch_size: int = 8,
+        drop_last: bool = False,
+        num_workers: int = 0,
+        pin_memory: bool = True,
+        auto_find_batch_size: bool = False,
+        ignore_data_skip: bool = False,
+        sampler_seed: Optional[int] = None,
+    ):
+        """
+        A method that groups all arguments linked to dataloader creation.
+
+        Args:
+            drop_last (`bool`, *optional*, defaults to `False`):
+                Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch
+                size) or not.
+            num_workers (`int`, *optional*, defaults to 0):
+                Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in
+                the main process.
+            pin_memory (`bool`, *optional*, defaults to `True`):
+                Whether you want to pin memory in data loaders or not. Will default to `True`.
+            auto_find_batch_size (`bool`, *optional*, defaults to `False`):
+                Whether to find a batch size that will fit into memory automatically through exponential decay,
+                avoiding CUDA Out-of-Memory errors. Requires accelerate to be installed (`pip install accelerate`).
+            ignore_data_skip (`bool`, *optional*, defaults to `False`):
+                When resuming training, whether or not to skip the epochs and batches to get the data loading at the
+                same stage as in the previous training. If set to `True`, the training will begin faster (as that
+                skipping step can take a long time) but will not yield the same results as the interrupted training
+                would have.
+            sampler_seed (`int`, *optional*):
+                Random seed to be used with data samplers. If not set, random generators for data sampling will use the
+                same seed as `self.seed`. This can be used to ensure reproducibility of data sampling, independent of
+                the model seed.
+
+        Example:
+
+        ```py
+        >>> from transformers import TrainingArguments
+
+        >>> args = TrainingArguments("working_dir")
+        >>> args = args.set_dataloader(train_batch_size=16, eval_batch_size=64)
+        >>> args.per_device_train_batch_size
+        16
+        ```
+        """
+        self.per_device_train_batch_size = train_batch_size
+        self.per_device_eval_batch_size = eval_batch_size
+        self.dataloader_drop_last = drop_last
+        self.dataloader_num_workers = num_workers
+        self.dataloader_pin_memory = pin_memory
+        self.auto_find_batch_size = auto_find_batch_size
+        self.ignore_data_skip = ignore_data_skip
+        self.data_seed = sampler_seed
+        return self
+
 
 class ParallelMode(Enum):
     NOT_PARALLEL = "not_parallel"
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 85b4010f38c4..1f5931885f83 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -526,6 +526,9 @@ def __init__(self, *args, **kwargs):
 MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING = None
 
 
+MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = None
+
+
 MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING = None
 
 
@@ -738,6 +741,13 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["torch"])
 
 
+class AutoModelForZeroShotImageClassification(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 class AutoModelForZeroShotObjectDetection(metaclass=DummyObject):
     _backends = ["torch"]
 
@@ -4209,6 +4219,30 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["torch"])
 
 
+MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class MgpstrForSceneTextRecognition(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class MgpstrModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class MgpstrPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 class MMBTForClassification(metaclass=DummyObject):
     _backends = ["torch"]
 
diff --git a/src/transformers/utils/dummy_tf_objects.py b/src/transformers/utils/dummy_tf_objects.py
index 3eac414edd5d..55eb6599f100 100644
--- a/src/transformers/utils/dummy_tf_objects.py
+++ b/src/transformers/utils/dummy_tf_objects.py
@@ -316,6 +316,9 @@ def __init__(self, *args, **kwargs):
 TF_MODEL_FOR_VISION_2_SEQ_MAPPING = None
 
 
+TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = None
+
+
 TF_MODEL_MAPPING = None
 
 
@@ -434,6 +437,13 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["tf"])
 
 
+class TFAutoModelForZeroShotImageClassification(metaclass=DummyObject):
+    _backends = ["tf"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["tf"])
+
+
 class TFAutoModelWithLMHead(metaclass=DummyObject):
     _backends = ["tf"]
 
@@ -1647,9 +1657,6 @@ def __init__(self, *args, **kwargs):
 TF_MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST = None
 
 
-TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
-
-
 class TFMobileBertForMaskedLM(metaclass=DummyObject):
     _backends = ["tf"]
 
@@ -1720,6 +1727,9 @@ def __init__(self, *args, **kwargs):
         requires_backends(self, ["tf"])
 
 
+TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
 class TFMobileViTForImageClassification(metaclass=DummyObject):
     _backends = ["tf"]
 
diff --git a/src/transformers/utils/fx.py b/src/transformers/utils/fx.py
index da9c43b17195..9da544cb45e9 100755
--- a/src/transformers/utils/fx.py
+++ b/src/transformers/utils/fx.py
@@ -50,6 +50,7 @@
     MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
     MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES,
     MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
+    MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES,
     MODEL_MAPPING_NAMES,
 )
 from ..utils import ENV_VARS_TRUE_VALUES, TORCH_FX_REQUIRED_VERSION, is_torch_fx_available
@@ -79,6 +80,7 @@ def _generate_supported_model_class_names(
         "token-classification": MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
         "masked-image-modeling": MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING_NAMES,
         "image-classification": MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES,
+        "zero-shot-image-classification": MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES,
         "ctc": MODEL_FOR_CTC_MAPPING_NAMES,
         "audio-classification": MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES,
         "semantic-segmentation": MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES,
diff --git a/src/transformers/utils/hub.py b/src/transformers/utils/hub.py
index 3403867eafe8..db00878c9ae4 100644
--- a/src/transformers/utils/hub.py
+++ b/src/transformers/utils/hub.py
@@ -390,7 +390,7 @@ def cached_file(
     if isinstance(cache_dir, Path):
         cache_dir = str(cache_dir)
 
-    if _commit_hash is not None:
+    if _commit_hash is not None and not force_download:
         # If the file is cached under that commit hash, we return it directly.
         resolved_file = try_to_load_from_cache(
             path_or_repo_id, full_filename, cache_dir=cache_dir, revision=_commit_hash
@@ -913,7 +913,13 @@ def get_checkpoint_shard_files(
 
     # At this stage pretrained_model_name_or_path is a model identifier on the Hub
     cached_filenames = []
-    for shard_filename in shard_filenames:
+    # Check whether the model is already cached. We only look for the last shard, which should cover most cases,
+    # including downloads that were interrupted partway through.
+    last_shard = try_to_load_from_cache(
+        pretrained_model_name_or_path, shard_filenames[-1], cache_dir=cache_dir, revision=_commit_hash
+    )
+    show_progress_bar = last_shard is None or force_download
+    for shard_filename in tqdm(shard_filenames, desc="Downloading shards", disable=not show_progress_bar):
         try:
             # Load from URL
             cached_filename = cached_file(
diff --git a/tests/deepspeed/test_deepspeed.py b/tests/deepspeed/test_deepspeed.py
index 9b203dfd7b95..ba9c269cd107 100644
--- a/tests/deepspeed/test_deepspeed.py
+++ b/tests/deepspeed/test_deepspeed.py
@@ -426,6 +426,7 @@ def test_hf_optimizer_with_offload(self, stage, dtype):
         del ds_config_dict["optimizer"]  # force default HF Trainer optimizer
         # force cpu offload
         ds_config_dict["zero_optimization"]["offload_optimizer"]["device"] = "cpu"
+        ds_config_dict["zero_force_ds_cpu_optimizer"] = False  # offload is not efficient w/o CPUAdam
         with mockenv_context(**self.dist_env_1_gpu):
             kwargs = {"local_rank": 0, "deepspeed": ds_config_dict}
             kwargs[dtype] = True
@@ -776,6 +777,7 @@ def test_load_best_model(self, stage, dtype):
         ds_config_dict = self.get_config_dict(stage)
         del ds_config_dict["optimizer"]  # will use HF Trainer optimizer
         del ds_config_dict["scheduler"]  # will use HF Trainer scheduler
+        ds_config_dict["zero_force_ds_cpu_optimizer"] = False  # offload is not efficient w/o CPUAdam
         # must use this setting to get the reload path exercised
         ds_config_dict["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = True
 
diff --git a/tests/generation/test_configuration_utils.py b/tests/generation/test_configuration_utils.py
index 8add735a0bf0..a12b35968283 100644
--- a/tests/generation/test_configuration_utils.py
+++ b/tests/generation/test_configuration_utils.py
@@ -17,7 +17,7 @@
 import tempfile
 import unittest
 
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from parameterized import parameterized
 from requests.exceptions import HTTPError
 
@@ -99,7 +99,6 @@ class ConfigPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/models/auto/test_processor_auto.py b/tests/models/auto/test_processor_auto.py
index a880bc0a084a..e0bb4946f702 100644
--- a/tests/models/auto/test_processor_auto.py
+++ b/tests/models/auto/test_processor_auto.py
@@ -21,7 +21,7 @@
 from pathlib import Path
 from shutil import copyfile
 
-from huggingface_hub import HfFolder, Repository, create_repo, delete_repo, set_access_token
+from huggingface_hub import HfFolder, Repository, create_repo, delete_repo
 from requests.exceptions import HTTPError
 
 from transformers import (
@@ -219,7 +219,6 @@ class ProcessorPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/models/bridgetower/test_modeling_bridgetower.py b/tests/models/bridgetower/test_modeling_bridgetower.py
index 9e70c4cdcd44..20396c8bf7bf 100644
--- a/tests/models/bridgetower/test_modeling_bridgetower.py
+++ b/tests/models/bridgetower/test_modeling_bridgetower.py
@@ -202,6 +202,7 @@ def prepare_config_and_inputs_for_common(self):
         return config, inputs_dict
 
 
+@slow
 @require_torch
 @unittest.skipIf(not is_torch_greater_or_equal_than_1_10, "BridgeTower is only available in torch v1.10+")
 class BridgeTowerModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
@@ -494,6 +495,7 @@ def test_constrastive_learning(self):
         self.assertEqual(outputs.logits.shape, expected_shape)
 
 
+@slow
 @require_torch
 @unittest.skipIf(not is_torch_greater_or_equal_than_1_10, "BridgeTower is only available in torch v1.10+")
 class BridgeTowerModelTrainingTest(unittest.TestCase):
diff --git a/tests/models/mgp_str/__init__.py b/tests/models/mgp_str/__init__.py
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/models/mgp_str/test_modeling_mgp_str.py b/tests/models/mgp_str/test_modeling_mgp_str.py
new file mode 100644
index 000000000000..ab5b6422b627
--- /dev/null
+++ b/tests/models/mgp_str/test_modeling_mgp_str.py
@@ -0,0 +1,269 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch MGP-STR model. """
+
+import inspect
+import unittest
+
+import requests
+
+from transformers import MgpstrConfig
+from transformers.testing_utils import require_torch, require_vision, slow, torch_device
+from transformers.utils import is_torch_available, is_vision_available
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
+
+
+if is_torch_available():
+    import torch
+    from torch import nn
+
+    from transformers import MgpstrForSceneTextRecognition
+
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import MgpstrProcessor
+
+
+class MgpstrModelTester:
+    def __init__(
+        self,
+        parent,
+        is_training=False,
+        batch_size=13,
+        image_size=(32, 128),
+        patch_size=4,
+        num_channels=3,
+        max_token_length=27,
+        num_character_labels=38,
+        num_bpe_labels=99,
+        num_wordpiece_labels=99,
+        hidden_size=32,
+        num_hidden_layers=5,
+        num_attention_heads=4,
+        mlp_ratio=4.0,
+        patch_embeds_hidden_size=257,
+        output_hidden_states=None,
+    ):
+        self.parent = parent
+        self.is_training = is_training
+        self.batch_size = batch_size
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.num_channels = num_channels
+        self.max_token_length = max_token_length
+        self.num_character_labels = num_character_labels
+        self.num_bpe_labels = num_bpe_labels
+        self.num_wordpiece_labels = num_wordpiece_labels
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.mlp_ratio = mlp_ratio
+        self.patch_embeds_hidden_size = patch_embeds_hidden_size
+        self.output_hidden_states = output_hidden_states
+
+    def prepare_config_and_inputs(self):
+        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size[0], self.image_size[1]])
+        config = self.get_config()
+        return config, pixel_values
+
+    def get_config(self):
+        return MgpstrConfig(
+            image_size=self.image_size,
+            patch_size=self.patch_size,
+            num_channels=self.num_channels,
+            max_token_length=self.max_token_length,
+            num_character_labels=self.num_character_labels,
+            num_bpe_labels=self.num_bpe_labels,
+            num_wordpiece_labels=self.num_wordpiece_labels,
+            hidden_size=self.hidden_size,
+            num_hidden_layers=self.num_hidden_layers,
+            num_attention_heads=self.num_attention_heads,
+            mlp_ratio=self.mlp_ratio,
+            output_hidden_states=self.output_hidden_states,
+        )
+
+    def create_and_check_model(self, config, pixel_values):
+        model = MgpstrForSceneTextRecognition(config)
+        model.to(torch_device)
+        model.eval()
+        with torch.no_grad():
+            generated_ids = model(pixel_values)
+        self.parent.assertEqual(
+            generated_ids[0][0].shape, (self.batch_size, self.max_token_length, self.num_character_labels)
+        )
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        config, pixel_values = config_and_inputs
+        inputs_dict = {"pixel_values": pixel_values}
+        return config, inputs_dict
+
+
+@require_torch
+class MgpstrModelTest(ModelTesterMixin, unittest.TestCase):
+    all_model_classes = (MgpstrForSceneTextRecognition,) if is_torch_available() else ()
+    fx_compatible = False
+
+    test_pruning = False
+    test_resize_embeddings = False
+    test_head_masking = False
+    test_attention_outputs = False
+
+    def setUp(self):
+        self.model_tester = MgpstrModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=MgpstrConfig, has_text_modality=False)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    @unittest.skip(reason="MgpstrModel does not use inputs_embeds")
+    def test_inputs_embeds(self):
+        pass
+
+    def test_model_common_attributes(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
+            x = model.get_output_embeddings()
+            self.assertTrue(x is None or isinstance(x, nn.Linear))
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names = [*signature.parameters.keys()]
+
+            expected_arg_names = ["pixel_values"]
+            self.assertListEqual(arg_names[:1], expected_arg_names)
+
+    @unittest.skip(reason="MgpstrModel does not support feedforward chunking")
+    def test_feed_forward_chunking(self):
+        pass
+
+    def test_gradient_checkpointing_backward_compatibility(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            if not model_class.supports_gradient_checkpointing:
+                continue
+
+            config.gradient_checkpointing = True
+            model = model_class(config)
+            self.assertTrue(model.is_gradient_checkpointing)
+
+    def test_hidden_states_output(self):
+        def check_hidden_states_output(inputs_dict, config, model_class):
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            with torch.no_grad():
+                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+
+            hidden_states = outputs.hidden_states
+
+            expected_num_layers = getattr(
+                self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
+            )
+            self.assertEqual(len(hidden_states), expected_num_layers)
+
+            self.assertListEqual(
+                list(hidden_states[0].shape[-2:]),
+                [self.model_tester.patch_embeds_hidden_size, self.model_tester.hidden_size],
+            )
+
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            inputs_dict["output_hidden_states"] = True
+            check_hidden_states_output(inputs_dict, config, model_class)
+
+            # check that output_hidden_states also works when set through the config
+            del inputs_dict["output_hidden_states"]
+            config.output_hidden_states = True
+
+            check_hidden_states_output(inputs_dict, config, model_class)
+
+    # override as the `logit_scale` parameter initialization is different for MgpstrModel
+    def test_initialization(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        configs_no_init = _config_zero_init(config)
+        for model_class in self.all_model_classes:
+            model = model_class(config=configs_no_init)
+            for name, param in model.named_parameters():
+                if isinstance(param, (nn.Linear, nn.Conv2d, nn.LayerNorm)):
+                    if param.requires_grad:
+                        self.assertIn(
+                            ((param.data.mean() * 1e9).round() / 1e9).item(),
+                            [0.0, 1.0],
+                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
+                        )
+
+    @unittest.skip(reason="Retain_grad is tested in individual model tests")
+    def test_retain_grad_hidden_states_attentions(self):
+        pass
+
+
+# We will verify our results on an image from the IIIT-5k dataset
+def prepare_img():
+    url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
+    im = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+    return im
+
+
+@require_vision
+@require_torch
+class MgpstrModelIntegrationTest(unittest.TestCase):
+    @slow
+    def test_inference(self):
+        model_name = "alibaba-damo/mgp-str-base"
+        model = MgpstrForSceneTextRecognition.from_pretrained(model_name).to(torch_device)
+        processor = MgpstrProcessor.from_pretrained(model_name)
+
+        image = prepare_img()
+        inputs = processor(images=image, return_tensors="pt").pixel_values.to(torch_device)
+
+        # forward pass
+        with torch.no_grad():
+            outputs = model(inputs)
+
+        # verify the logits
+        self.assertEqual(outputs.logits[0].shape, torch.Size((1, 27, 38)))
+
+        out_strs = processor.batch_decode(outputs.logits)
+        expected_text = "ticket"
+
+        self.assertEqual(out_strs["generated_text"][0], expected_text)
+
+        expected_slice = torch.tensor(
+            [[[-39.7358, -44.8562, -36.6253], [-62.3605, -64.5908, -59.0069], [-74.6127, -68.9724, -71.7150]]],
+            device=torch_device,
+        )
+
+        self.assertTrue(torch.allclose(outputs.logits[0][:, 1:4, 1:4], expected_slice, atol=1e-4))
diff --git a/tests/models/mgp_str/test_processor_mgp_str.py b/tests/models/mgp_str/test_processor_mgp_str.py
new file mode 100644
index 000000000000..387d13fad18e
--- /dev/null
+++ b/tests/models/mgp_str/test_processor_mgp_str.py
@@ -0,0 +1,211 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the MgpstrProcessor. """
+
+import json
+import os
+import shutil
+import tempfile
+import unittest
+
+import numpy as np
+import pytest
+
+from transformers import MgpstrTokenizer
+from transformers.models.mgp_str.tokenization_mgp_str import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_torch, require_vision
+from transformers.utils import IMAGE_PROCESSOR_NAME, is_torch_available, is_vision_available
+
+
+if is_torch_available():
+    import torch
+
+
+if is_vision_available():
+    from PIL import Image
+
+    from transformers import MgpstrProcessor, ViTImageProcessor
+
+
+@require_torch
+@require_vision
+class MgpstrProcessorTest(unittest.TestCase):
+    image_processing_class = ViTImageProcessor if is_vision_available() else None
+
+    @property
+    def image_processor_dict(self):
+        return self.image_processor_tester.prepare_image_processor_dict()
+
+    def setUp(self):
+        self.image_size = (3, 32, 128)
+        self.tmpdirname = tempfile.mkdtemp()
+
+        # fmt: off
+        vocab = ['[GO]', '[s]', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+
+        image_processor_map = {
+            "do_normalize": False,
+            "do_resize": True,
+            "feature_extractor_type": "ViTFeatureExtractor",
+            "resample": 3,
+            "size": {"height": 32, "width": 128},
+        }
+        self.image_processor_file = os.path.join(self.tmpdirname, IMAGE_PROCESSOR_NAME)
+        with open(self.image_processor_file, "w", encoding="utf-8") as fp:
+            json.dump(image_processor_map, fp)
+
+    def get_tokenizer(self, **kwargs):
+        return MgpstrTokenizer.from_pretrained(self.tmpdirname, **kwargs)
+
+    def get_image_processor(self, **kwargs):
+        return ViTImageProcessor.from_pretrained(self.tmpdirname, **kwargs)
+
+    def tearDown(self):
+        shutil.rmtree(self.tmpdirname)
+
+    def prepare_image_inputs(self):
+        """This function prepares a single random PIL image."""
+
+        image_input = np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)
+
+        image_input = Image.fromarray(np.moveaxis(image_input, 0, -1))
+
+        return image_input
+
+    def test_save_load_pretrained_default(self):
+        tokenizer = self.get_tokenizer()
+        image_processor = self.get_image_processor()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+        processor.save_pretrained(self.tmpdirname)
+        processor = MgpstrProcessor.from_pretrained(self.tmpdirname, use_fast=False)
+
+        self.assertEqual(processor.char_tokenizer.get_vocab(), tokenizer.get_vocab())
+        self.assertIsInstance(processor.char_tokenizer, MgpstrTokenizer)
+
+        self.assertEqual(processor.image_processor.to_json_string(), image_processor.to_json_string())
+        self.assertIsInstance(processor.image_processor, ViTImageProcessor)
+
+    def test_save_load_pretrained_additional_features(self):
+        tokenizer = self.get_tokenizer()
+        image_processor = self.get_image_processor()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+        processor.save_pretrained(self.tmpdirname)
+
+        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
+        image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)
+
+        processor = MgpstrProcessor.from_pretrained(
+            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
+        )
+
+        self.assertEqual(processor.char_tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
+        self.assertIsInstance(processor.char_tokenizer, MgpstrTokenizer)
+
+        self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
+        self.assertIsInstance(processor.image_processor, ViTImageProcessor)
+
+    def test_image_processor(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        image_input = self.prepare_image_inputs()
+
+        input_image_proc = image_processor(image_input, return_tensors="np")
+        input_processor = processor(images=image_input, return_tensors="np")
+
+        for key in input_image_proc.keys():
+            self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2)
+
+    def test_tokenizer(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        input_str = "test"
+
+        encoded_processor = processor(text=input_str)
+
+        encoded_tok = tokenizer(input_str)
+        for key in encoded_tok.keys():
+            self.assertListEqual(encoded_tok[key], encoded_processor[key])
+
+    def test_processor(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        input_str = "test"
+        image_input = self.prepare_image_inputs()
+
+        inputs = processor(text=input_str, images=image_input)
+
+        self.assertListEqual(list(inputs.keys()), ["pixel_values", "labels"])
+
+        # test if it raises when no input is passed
+        with pytest.raises(ValueError):
+            processor()
+
+    def test_tokenizer_decode(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9], [3, 4, 3, 1, 1, 8, 9]]
+
+        decoded_processor = processor.char_decode(predicted_ids)
+        decoded_tok = tokenizer.batch_decode(predicted_ids)
+        decode_strs = [seq.replace(" ", "") for seq in decoded_tok]
+
+        self.assertListEqual(decode_strs, decoded_processor)
+
+    def test_model_input_names(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        input_str = None
+        image_input = self.prepare_image_inputs()
+
+        inputs = processor(text=input_str, images=image_input)
+
+        self.assertListEqual(list(inputs.keys()), processor.model_input_names)
+
+    def test_processor_batch_decode(self):
+        image_processor = self.get_image_processor()
+        tokenizer = self.get_tokenizer()
+
+        processor = MgpstrProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+        char_input = torch.randn(1, 27, 38)
+        bpe_input = torch.randn(1, 27, 50257)
+        wp_input = torch.randn(1, 27, 30522)
+
+        results = processor.batch_decode([char_input, bpe_input, wp_input])
+
+        self.assertListEqual(list(results.keys()), ["generated_text", "scores", "char_preds", "bpe_preds", "wp_preds"])
diff --git a/tests/models/mgp_str/test_tokenization_mgp_str.py b/tests/models/mgp_str/test_tokenization_mgp_str.py
new file mode 100644
index 000000000000..a05d7f3cbf90
--- /dev/null
+++ b/tests/models/mgp_str/test_tokenization_mgp_str.py
@@ -0,0 +1,96 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import json
+import os
+import unittest
+
+from transformers import MgpstrTokenizer
+from transformers.models.mgp_str.tokenization_mgp_str import VOCAB_FILES_NAMES
+from transformers.testing_utils import require_tokenizers
+
+from ...test_tokenization_common import TokenizerTesterMixin
+
+
+@require_tokenizers
+class MgpstrTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
+    tokenizer_class = MgpstrTokenizer
+    test_rust_tokenizer = False
+    from_pretrained_kwargs = {}
+    test_seq2seq = False
+
+    def setUp(self):
+        super().setUp()
+
+        # fmt: off
+        vocab = ['[GO]', '[s]', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
+        # fmt: on
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+
+    def get_tokenizer(self, **kwargs):
+        return MgpstrTokenizer.from_pretrained(self.tmpdirname, **kwargs)
+
+    def get_input_output_texts(self, tokenizer):
+        input_text = "tester"
+        output_text = "tester"
+        return input_text, output_text
+
+    @unittest.skip("MGP-STR always lowercases letters.")
+    def test_added_tokens_do_lower_case(self):
+        pass
+
+    def test_add_special_tokens(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                special_token = "[SPECIAL_TOKEN]"
+
+                tokenizer.add_special_tokens({"cls_token": special_token})
+                encoded_special_token = tokenizer.encode([special_token], add_special_tokens=False)
+                self.assertEqual(len(encoded_special_token), 1)
+
+                decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True)
+                self.assertNotIn(special_token, decoded)
+
+    def test_internal_consistency(self):
+        tokenizers = self.get_tokenizers()
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                input_text, output_text = self.get_input_output_texts(tokenizer)
+
+                tokens = tokenizer.tokenize(input_text)
+                ids = tokenizer.convert_tokens_to_ids(tokens)
+                ids_2 = tokenizer.encode(input_text, add_special_tokens=False)
+                self.assertListEqual(ids, ids_2)
+
+                tokens_2 = tokenizer.convert_ids_to_tokens(ids)
+                self.assertNotEqual(len(tokens_2), 0)
+                text_2 = tokenizer.decode(ids)
+                self.assertIsInstance(text_2, str)
+
+                self.assertEqual(text_2.replace(" ", ""), output_text)
+
+    @unittest.skip("MGP-STR tokenizer only handles one sequence.")
+    def test_maximum_encoding_length_pair_input(self):
+        pass
+
+    @unittest.skip("inputs cannot be pretokenized in MgpstrTokenizer")
+    def test_pretokenized_inputs(self):
+        pass
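As a rough illustration of what the new tokenizer test sets up, and not the actual `MgpstrTokenizer` implementation, the sketch below builds the same 38-entry character vocabulary with `dict(zip(vocab, range(len(vocab))))`, writes it to a JSON file, and round-trips a string through character-level ids. The helper names `build_vocab`, `encode`, `decode`, and `roundtrip_through_file`, and the file name `vocab.json`, are hypothetical. Lowercasing before lookup is an assumption based on the skip reason given in the test.

```python
import json
import os
import string
import tempfile


def build_vocab():
    """Token -> id mapping as in the test: two special tokens, digits, lowercase letters."""
    vocab = ["[GO]", "[s]"] + list(string.digits) + list(string.ascii_lowercase)
    return dict(zip(vocab, range(len(vocab))))


def encode(text, token_to_id):
    # Assumption: one token per character, lowercased before lookup.
    return [token_to_id[ch] for ch in text.lower()]


def decode(ids, token_to_id):
    # Invert the mapping and join the characters without separators.
    id_to_token = {idx: tok for tok, idx in token_to_id.items()}
    return "".join(id_to_token[idx] for idx in ids)


def roundtrip_through_file(text):
    """Save the vocabulary as JSON, reload it, then encode and decode `text`."""
    token_to_id = build_vocab()
    with tempfile.TemporaryDirectory() as tmpdir:
        vocab_file = os.path.join(tmpdir, "vocab.json")  # hypothetical file name
        with open(vocab_file, "w", encoding="utf-8") as fp:
            fp.write(json.dumps(token_to_id) + "\n")
        with open(vocab_file, encoding="utf-8") as fp:
            loaded = json.load(fp)
    ids = encode(text, loaded)
    return ids, decode(ids, loaded)
```

Under these assumptions, the vocabulary size of 38 lines up with the last dimension of `char_input = torch.randn(1, 27, 38)` in the processor test above.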
diff --git a/tests/models/whisper/test_modeling_whisper.py b/tests/models/whisper/test_modeling_whisper.py
index 6fa50a7f0819..d4b252398f28 100644
--- a/tests/models/whisper/test_modeling_whisper.py
+++ b/tests/models/whisper/test_modeling_whisper.py
@@ -1525,6 +1525,18 @@ def test_forward_signature(self):
             expected_arg_names = ["input_features", "head_mask", "encoder_outputs"]
             self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
 
+    @unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
+    def test_cpu_offload(self):
+        pass
+
+    @unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
+    def test_disk_offload(self):
+        pass
+
+    @unittest.skip(reason="Some undefined behavior encountered with tiny versions of this model. Skip for now.")
+    def test_model_parallelism(self):
+        pass
+
     # input embeds is meaningless for an encoder-only acoustic model
     def test_inputs_embeds(self):
         pass
diff --git a/tests/pipelines/test_pipelines_common.py b/tests/pipelines/test_pipelines_common.py
index 68207b9ca189..f43f439ac279 100644
--- a/tests/pipelines/test_pipelines_common.py
+++ b/tests/pipelines/test_pipelines_common.py
@@ -21,7 +21,7 @@
 
 import datasets
 import numpy as np
-from huggingface_hub import HfFolder, Repository, create_repo, delete_repo, set_access_token
+from huggingface_hub import HfFolder, Repository, create_repo, delete_repo
 from requests.exceptions import HTTPError
 
 from transformers import (
@@ -759,7 +759,6 @@ class DynamicPipelineTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_configuration_common.py b/tests/test_configuration_common.py
index dc9927f19385..d2172491f982 100644
--- a/tests/test_configuration_common.py
+++ b/tests/test_configuration_common.py
@@ -23,7 +23,7 @@
 import unittest.mock as mock
 from pathlib import Path
 
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from requests.exceptions import HTTPError
 
 from transformers import AutoConfig, BertConfig, GPT2Config, is_torch_available
@@ -222,7 +222,6 @@ class ConfigPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_feature_extraction_common.py b/tests/test_feature_extraction_common.py
index ee8dfefb8406..f26709e2d9a6 100644
--- a/tests/test_feature_extraction_common.py
+++ b/tests/test_feature_extraction_common.py
@@ -22,7 +22,7 @@
 import unittest.mock as mock
 from pathlib import Path
 
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from requests.exceptions import HTTPError
 
 from transformers import AutoFeatureExtractor, Wav2Vec2FeatureExtractor
@@ -100,7 +100,6 @@ class FeatureExtractorPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_image_processing_common.py b/tests/test_image_processing_common.py
index 32be6e0e6330..12cdef7a47c7 100644
--- a/tests/test_image_processing_common.py
+++ b/tests/test_image_processing_common.py
@@ -22,7 +22,7 @@
 import unittest.mock as mock
 from pathlib import Path
 
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from requests.exceptions import HTTPError
 
 from transformers import AutoImageProcessor, ViTImageProcessor
@@ -232,7 +232,6 @@ class ImageProcessorPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_modeling_common.py b/tests/test_modeling_common.py
index 5cb04edbb192..cb06400e9a77 100755
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -30,7 +30,7 @@
 from typing import Dict, List, Tuple
 
 import numpy as np
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from huggingface_hub.file_download import http_get
 from pytest import mark
 from requests.exceptions import HTTPError
@@ -3429,7 +3429,6 @@ class ModelPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_modeling_flax_common.py b/tests/test_modeling_flax_common.py
index 76e4d6c88154..5578b7893608 100644
--- a/tests/test_modeling_flax_common.py
+++ b/tests/test_modeling_flax_common.py
@@ -21,7 +21,7 @@
 from typing import List, Tuple
 
 import numpy as np
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from requests.exceptions import HTTPError
 
 import transformers
@@ -1173,7 +1173,6 @@ class FlaxModelPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_modeling_tf_common.py b/tests/test_modeling_tf_common.py
index 42db9aea2698..26e567530ec2 100644
--- a/tests/test_modeling_tf_common.py
+++ b/tests/test_modeling_tf_common.py
@@ -28,7 +28,7 @@
 from typing import List, Tuple, get_type_hints
 
 from datasets import Dataset
-from huggingface_hub import HfFolder, Repository, delete_repo, set_access_token
+from huggingface_hub import HfFolder, Repository, delete_repo
 from huggingface_hub.file_download import http_get
 from requests.exceptions import HTTPError
 
@@ -2409,7 +2409,6 @@ class TFModelPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/test_tokenization_common.py b/tests/test_tokenization_common.py
index 965a059c7719..a8bdb7372b8f 100644
--- a/tests/test_tokenization_common.py
+++ b/tests/test_tokenization_common.py
@@ -30,7 +30,7 @@
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Dict, List, Tuple, Union
 
-from huggingface_hub import HfFolder, delete_repo, set_access_token
+from huggingface_hub import HfFolder, delete_repo
 from huggingface_hub.file_download import http_get
 from parameterized import parameterized
 from requests.exceptions import HTTPError
@@ -3971,7 +3971,6 @@ class TokenizerPushToHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
diff --git a/tests/trainer/test_data_collator.py b/tests/trainer/test_data_collator.py
index 39277ca8cc18..f5104cd37507 100644
--- a/tests/trainer/test_data_collator.py
+++ b/tests/trainer/test_data_collator.py
@@ -271,12 +271,17 @@ def test_data_collator_for_language_modeling(self):
         self._test_no_pad_and_pad(no_pad_features, pad_features)
 
     def test_data_collator_for_whole_word_mask(self):
-        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
-
         tokenizer = BertTokenizer(self.vocab_file)
         data_collator = DataCollatorForWholeWordMask(tokenizer, return_tensors="pt")
+
+        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
         batch = data_collator(features)
+        self.assertEqual(batch["input_ids"].shape, torch.Size((2, 10)))
+        self.assertEqual(batch["labels"].shape, torch.Size((2, 10)))
 
+        # Features can already be tensors
+        features = [{"input_ids": np.arange(10)}, {"input_ids": np.arange(10)}]
+        batch = data_collator(features)
         self.assertEqual(batch["input_ids"].shape, torch.Size((2, 10)))
         self.assertEqual(batch["labels"].shape, torch.Size((2, 10)))
 
@@ -553,12 +558,17 @@ def test_data_collator_for_language_modeling(self):
         self._test_no_pad_and_pad(no_pad_features, pad_features)
 
     def test_data_collator_for_whole_word_mask(self):
-        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
-
         tokenizer = BertTokenizer(self.vocab_file)
         data_collator = DataCollatorForWholeWordMask(tokenizer, return_tensors="tf")
+
+        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
         batch = data_collator(features)
+        self.assertEqual(batch["input_ids"].shape.as_list(), [2, 10])
+        self.assertEqual(batch["labels"].shape.as_list(), [2, 10])
 
+        # Features can already be tensors
+        features = [{"input_ids": np.arange(10)}, {"input_ids": np.arange(10)}]
+        batch = data_collator(features)
         self.assertEqual(batch["input_ids"].shape.as_list(), [2, 10])
         self.assertEqual(batch["labels"].shape.as_list(), [2, 10])
 
@@ -825,12 +835,17 @@ def test_data_collator_for_language_modeling(self):
         self._test_no_pad_and_pad(no_pad_features, pad_features)
 
     def test_data_collator_for_whole_word_mask(self):
-        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
-
         tokenizer = BertTokenizer(self.vocab_file)
         data_collator = DataCollatorForWholeWordMask(tokenizer, return_tensors="np")
+
+        features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]
         batch = data_collator(features)
+        self.assertEqual(batch["input_ids"].shape, (2, 10))
+        self.assertEqual(batch["labels"].shape, (2, 10))
 
+        # Features can already be tensors
+        features = [{"input_ids": np.arange(10)}, {"input_ids": np.arange(10)}]
+        batch = data_collator(features)
         self.assertEqual(batch["input_ids"].shape, (2, 10))
         self.assertEqual(batch["labels"].shape, (2, 10))
 
diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py
index 2ff81e5fe7bf..274ee76e53c8 100644
--- a/tests/trainer/test_trainer.py
+++ b/tests/trainer/test_trainer.py
@@ -29,7 +29,7 @@
 from unittest.mock import Mock, patch
 
 import numpy as np
-from huggingface_hub import HfFolder, Repository, delete_repo, set_access_token
+from huggingface_hub import HfFolder, Repository, delete_repo
 from parameterized import parameterized
 from requests.exceptions import HTTPError
 
@@ -2005,7 +2005,6 @@ class TrainerIntegrationWithHubTester(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
         cls._token = TOKEN
-        set_access_token(TOKEN)
         HfFolder.save_token(TOKEN)
 
     @classmethod
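Further down, the `utils/create_dummy_models.py` diff stops interpolating the exception into the message (`f"...: {e}"`) and instead stores a `(message, traceback.format_exc())` pair, logging only the first element. A minimal standalone sketch of that pattern is shown here. `run_step` and `failing_step` are hypothetical names used only for illustration, and the `result` dict is reduced to a single `"warnings"` list.

```python
import traceback


def run_step(description, func, result):
    """Call `func`; on failure, append a (message, traceback) pair to `result["warnings"]`."""
    try:
        return func()
    except Exception:
        # Keep a short message for logging and the full trace for the report.
        result["warnings"].append((f"Failed to {description}.", traceback.format_exc()))
        return None


def failing_step():
    # Hypothetical failure used only to exercise the pattern.
    raise ValueError("vocab too large")


result = {"warnings": []}
value = run_step("convert the tokenizer", failing_step, result)
message, trace = result["warnings"][0]
```

Keeping the message and the trace separate lets a caller log the short string, as in `logger.error(result["error"][0])`, while the full traceback stays available in the stored result.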
diff --git a/utils/check_repo.py b/utils/check_repo.py
index f16c4fb851bf..fca0cd86ef90 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -95,6 +95,7 @@
     "M2M100Encoder",  # Building part of bigger (tested) model.
     "M2M100Decoder",  # Building part of bigger (tested) model.
     "MCTCTEncoder",  # Building part of bigger (tested) model.
+    "MgpstrModel",  # Building part of bigger (tested) model.
     "Speech2TextEncoder",  # Building part of bigger (tested) model.
     "Speech2TextDecoder",  # Building part of bigger (tested) model.
     "LEDEncoder",  # Building part of bigger (tested) model.
@@ -269,6 +270,7 @@
     "LukeForEntityClassification",
     "LukeForEntityPairClassification",
     "LukeForEntitySpanClassification",
+    "MgpstrModel",
     "OpenAIGPTDoubleHeadsModel",
     "OwlViTTextModel",
     "OwlViTVisionModel",
diff --git a/utils/create_dummy_models.py b/utils/create_dummy_models.py
index 60e055d67063..176bd07dfaaa 100644
--- a/utils/create_dummy_models.py
+++ b/utils/create_dummy_models.py
@@ -23,6 +23,7 @@
 import shutil
 import sys
 import tempfile
+import traceback
 from pathlib import Path
 
 from check_config_docstrings import get_checkpoint_from_config_class
@@ -71,6 +72,67 @@
 TARGET_VOCAB_SIZE = 1024
 
 
+# This list contains the model architectures for which a tiny version could not be created.
+# Avoid adding new architectures here unless we have carefully verified that it's (almost) impossible to create them.
+# One such case is: no model tester class is implemented for a model type (like `MT5`) because its architecture is
+# identical to another one (`MT5` is based on `T5`), but trained on different datasets or with different techniques.
+UNCONVERTIBLE_MODEL_ARCHITECTURES = {
+    "BertGenerationEncoder",
+    "BertGenerationDecoder",
+    "CamembertForSequenceClassification",
+    "CamembertForMultipleChoice",
+    "CamembertForMaskedLM",
+    "CamembertForCausalLM",
+    "CamembertForTokenClassification",
+    "CamembertForQuestionAnswering",
+    "CamembertModel",
+    "TFCamembertForMultipleChoice",
+    "TFCamembertForTokenClassification",
+    "TFCamembertForQuestionAnswering",
+    "TFCamembertForSequenceClassification",
+    "TFCamembertForMaskedLM",
+    "TFCamembertModel",
+    "TFCamembertForCausalLM",
+    "DecisionTransformerModel",
+    "JukeboxModel",
+    "MarianForCausalLM",
+    "MT5Model",
+    "MT5ForConditionalGeneration",
+    "TFMT5ForConditionalGeneration",
+    "TFMT5Model",
+    "QDQBertForSequenceClassification",
+    "QDQBertForMaskedLM",
+    "QDQBertModel",
+    "QDQBertForTokenClassification",
+    "QDQBertLMHeadModel",
+    "QDQBertForMultipleChoice",
+    "QDQBertForQuestionAnswering",
+    "QDQBertForNextSentencePrediction",
+    "ReformerModelWithLMHead",
+    "RetriBertModel",
+    "Speech2Text2ForCausalLM",
+    "TimeSeriesTransformerModel",
+    "TrajectoryTransformerModel",
+    "TrOCRForCausalLM",
+    "XLMProphetNetForConditionalGeneration",
+    "XLMProphetNetForCausalLM",
+    "XLMProphetNetModel",
+    "XLMRobertaModel",
+    "XLMRobertaForTokenClassification",
+    "XLMRobertaForMultipleChoice",
+    "XLMRobertaForMaskedLM",
+    "XLMRobertaForCausalLM",
+    "XLMRobertaForSequenceClassification",
+    "XLMRobertaForQuestionAnswering",
+    "TFXLMRobertaForSequenceClassification",
+    "TFXLMRobertaForMaskedLM",
+    "TFXLMRobertaForQuestionAnswering",
+    "TFXLMRobertaModel",
+    "TFXLMRobertaForMultipleChoice",
+    "TFXLMRobertaForTokenClassification",
+}
+
+
 def get_processor_types_from_config_class(config_class, allowed_mappings=None):
     """Return a tuple of processors for `config_class`.
 
@@ -131,7 +193,7 @@ def get_architectures_from_config_class(config_class, arch_mappings):
             models = mapping[config_class]
             models = tuple(models) if isinstance(models, collections.abc.Sequence) else (models,)
             for model in models:
-                if model.__name__ not in unexportable_model_architectures:
+                if model.__name__ not in UNCONVERTIBLE_MODEL_ARCHITECTURES:
                     architectures.add(model)
 
     architectures = tuple(architectures)
@@ -186,8 +248,7 @@ def build_processor(config_class, processor_class, allow_no_checkpoint=False):
     try:
         processor = processor_class.from_pretrained(checkpoint)
     except Exception as e:
-        logger.error(e)
-        pass
+        logger.error(f"{e.__class__.__name__}: {e}")
 
     # Try to get a new processor class from checkpoint. This is helpful for a checkpoint without necessary file to load
     # processor while `processor_class` is an Auto class. For example, `sew` has `Wav2Vec2Processor` in
@@ -203,7 +264,7 @@ def build_processor(config_class, processor_class, allow_no_checkpoint=False):
         try:
             config = AutoConfig.from_pretrained(checkpoint)
         except Exception as e:
-            logger.error(e)
+            logger.error(f"{e.__class__.__name__}: {e}")
             config = None
         if config is not None:
             if not isinstance(config, config_class):
@@ -263,8 +324,7 @@ def build_processor(config_class, processor_class, allow_no_checkpoint=False):
                 try:
                     processor = processor_class(**{k: v[0] for k, v in attrs.items()})
                 except Exception as e:
-                    logger.error(e)
-                    pass
+                    logger.error(f"{e.__class__.__name__}: {e}")
         else:
             # `checkpoint` might lack some file(s) to load a processor. For example, `facebook/hubert-base-ls960`
             # has no tokenizer file to load `Wav2Vec2CTCTokenizer`. In this case, we try to build a processor
@@ -282,8 +342,7 @@ def build_processor(config_class, processor_class, allow_no_checkpoint=False):
         try:
             processor = processor_class()
         except Exception as e:
-            logger.error(e)
-            pass
+            logger.error(f"{e.__class__.__name__}: {e}")
 
     # validation
     if processor is not None:
@@ -322,12 +381,12 @@ def get_tiny_config(config_class, **model_tester_kwargs):
         module = importlib.import_module(f".models.{module_name}.test_modeling_{modeling_name}", package="tests")
         camel_case_model_name = config_class.__name__.split("Config")[0]
         model_tester_class = getattr(module, f"{camel_case_model_name}ModelTester", None)
-    except ModuleNotFoundError as e:
-        error = f"Tiny config not created for {model_type} - cannot find the testing module from the model name"
-        raise ValueError(f"{error}: {e}")
+    except ModuleNotFoundError:
+        error = f"Tiny config not created for {model_type} - cannot find the testing module from the model name."
+        raise ValueError(error)
 
     if model_tester_class is None:
-        error = f"Tiny config not created for {model_type} - no model tester is found in the testing module"
+        error = f"Tiny config not created for {model_type} - no model tester is found in the testing module."
         raise ValueError(error)
 
     # `parent` is an instance of `unittest.TestCase`, but we don't need it here.
@@ -434,9 +493,12 @@ def convert_processors(processors, tiny_config, output_folder, result):
                     # be retrained
                     if fast_tokenizer.vocab_size > TARGET_VOCAB_SIZE:
                         fast_tokenizer = convert_tokenizer(tokenizer)
-                except Exception as e:
+                except Exception:
                     result["warnings"].append(
-                        f"Failed to convert the fast tokenizer for {fast_tokenizer.__class__.__name__}: {e}"
+                        (
+                            f"Failed to convert the fast tokenizer for {fast_tokenizer.__class__.__name__}.",
+                            traceback.format_exc(),
+                        )
                     )
                     continue
         elif slow_tokenizer is None:
@@ -446,9 +508,12 @@ def convert_processors(processors, tiny_config, output_folder, result):
     if fast_tokenizer:
         try:
             fast_tokenizer.save_pretrained(output_folder)
-        except Exception as e:
+        except Exception:
             result["warnings"].append(
-                f"Failed to save the fast tokenizer for {fast_tokenizer.__class__.__name__}: {e}"
+                (
+                    f"Failed to save the fast tokenizer for {fast_tokenizer.__class__.__name__}.",
+                    traceback.format_exc(),
+                )
             )
             fast_tokenizer = None
 
@@ -456,9 +521,12 @@ def convert_processors(processors, tiny_config, output_folder, result):
     if fast_tokenizer:
         try:
             slow_tokenizer = AutoTokenizer.from_pretrained(output_folder, use_fast=False)
-        except Exception as e:
+        except Exception:
             result["warnings"].append(
-                f"Failed to load the slow tokenizer saved from {fast_tokenizer.__class__.__name__}: {e}"
+                (
+                    f"Failed to load the slow tokenizer saved from {fast_tokenizer.__class__.__name__}.",
+                    traceback.format_exc(),
+                )
             )
             # Let's just keep the fast version
             slow_tokenizer = None
@@ -467,17 +535,25 @@ def convert_processors(processors, tiny_config, output_folder, result):
     if not fast_tokenizer and slow_tokenizer:
         try:
             slow_tokenizer.save_pretrained(output_folder)
-        except Exception as e:
+        except Exception:
             result["warnings"].append(
-                f"Failed to save the slow tokenizer for {slow_tokenizer.__class__.__name__}: {e}"
+                (
+                    f"Failed to save the slow tokenizer for {slow_tokenizer.__class__.__name__}.",
+                    traceback.format_exc(),
+                )
             )
             slow_tokenizer = None
 
     # update feature extractors using the tiny config
     try:
         feature_extractors = [convert_feature_extractor(p, tiny_config) for p in feature_extractors]
-    except Exception as e:
-        result["warnings"].append(f"Failed to convert feature extractors: {e}")
+    except Exception:
+        result["warnings"].append(
+            (
+                "Failed to convert feature extractors.",
+                traceback.format_exc(),
+            )
+        )
         feature_extractors = []
 
     if hasattr(tiny_config, "max_position_embeddings") and tiny_config.max_position_embeddings > 0:
@@ -538,9 +614,9 @@ def build_model(model_arch, tiny_config, output_dir):
     return model
 
 
-def fill_result_with_error(result, error, models_to_create):
+def fill_result_with_error(result, error, trace, models_to_create):
     """Fill `result` with errors for all target model arch if we can't build processor"""
-
+    error = (error, trace)
     result["error"] = error
     for framework in FRAMEWORKS:
         if framework in models_to_create:
@@ -548,7 +624,7 @@ def fill_result_with_error(result, error, models_to_create):
             for model_arch in models_to_create[framework]:
                 result[framework][model_arch.__name__] = {"model": None, "checkpoint": None, "error": error}
 
-    result["processor"] = {type(p).__name__: p.__class__.__name__ for p in result["processor"]}
+    result["processor"] = {p.__class__.__name__: p.__class__.__name__ for p in result["processor"].values()}
 
 
 def upload_model(model_dir, organization):
@@ -572,7 +648,7 @@ def upload_model(model_dir, organization):
             except Exception as e:
                 error = e
     if error is not None:
-        raise ValueError(error)
+        raise error
 
     with tempfile.TemporaryDirectory() as tmpdir:
         repo = Repository(local_dir=tmpdir, clone_from=f"{organization}/{repo_name}")
@@ -589,13 +665,13 @@ def upload_model(model_dir, organization):
                 commit_description=f"Upload tiny models for {arch_name}",
                 create_pr=True,
             )
-            logger.warning(f"PR open in {hub_pr_url}")
+            logger.warning(f"PR open in {hub_pr_url}.")
         else:
             # Push to Hub repo directly
             repo.git_add(auto_lfs_track=True)
             repo.git_commit(f"Upload tiny models for {arch_name}")
             repo.git_push(blocking=True)  # this prints a progress bar with the upload
-            logger.warning(f"Tiny models {arch_name} pushed to {organization}/{repo_name}")
+            logger.warning(f"Tiny models {arch_name} pushed to {organization}/{repo_name}.")
 
 
 def build_composite_models(config_class, output_dir):
@@ -715,7 +791,7 @@ def build_composite_models(config_class, output_dir):
                 shutil.copytree(decoder_processor_path, model_path, dirs_exist_ok=True)
 
             # fill `result`
-            result["processor"] = tuple({x.__name__ for x in encoder_processor + decoder_processor})
+            result["processor"] = {x.__name__: x.__name__ for x in encoder_processor + decoder_processor}
 
             result["pytorch"] = {model_class.__name__: {"model": model_class.__name__, "checkpoint": model_path}}
 
@@ -724,9 +800,11 @@ def build_composite_models(config_class, output_dir):
                 result["tensorflow"] = {
                     tf_model_class.__name__: {"model": tf_model_class.__name__, "checkpoint": model_path}
                 }
-
-        except Exception as e:
-            result["error"] = f"Failed to build models for {config_class.__name__}: {e}"
+        except Exception:
+            result["error"] = (
+                f"Failed to build models for {config_class.__name__}.",
+                traceback.format_exc(),
+            )
 
     if not result["error"]:
         del result["error"]
@@ -862,8 +940,8 @@ def build(config_class, models_to_create, output_dir):
 
     if len(processor_classes) == 0:
         error = f"No processor class could be found in {config_class.__name__}."
-        fill_result_with_error(result, error, models_to_create)
-        logger.error(result["error"])
+        fill_result_with_error(result, error, None, models_to_create)
+        logger.error(result["error"][0])
         return result
 
     for processor_class in processor_classes:
@@ -871,24 +949,26 @@ def build(config_class, models_to_create, output_dir):
             processor = build_processor(config_class, processor_class, allow_no_checkpoint=True)
             if processor is not None:
                 result["processor"][processor_class] = processor
-        except Exception as e:
-            error = f"Failed to build processor for {processor_class.__name__}: {e}"
-            fill_result_with_error(result, error, models_to_create)
-            logger.error(result["error"])
+        except Exception:
+            error = f"Failed to build processor for {processor_class.__name__}."
+            trace = traceback.format_exc()
+            fill_result_with_error(result, error, trace, models_to_create)
+            logger.error(result["error"][0])
             return result
 
     if len(result["processor"]) == 0:
         error = f"No processor could be built for {config_class.__name__}."
-        fill_result_with_error(result, error, models_to_create)
-        logger.error(result["error"])
+        fill_result_with_error(result, error, None, models_to_create)
+        logger.error(result["error"][0])
         return result
 
     try:
         tiny_config = get_tiny_config(config_class)
     except Exception as e:
         error = f"Failed to get tiny config for {config_class.__name__}: {e}"
-        fill_result_with_error(result, error, models_to_create)
-        logger.error(result["error"])
+        trace = traceback.format_exc()
+        fill_result_with_error(result, error, trace, models_to_create)
+        logger.error(result["error"][0])
         return result
 
     # Convert the processors (reduce vocabulary size, smaller image size, etc.)
@@ -896,22 +976,24 @@ def build(config_class, models_to_create, output_dir):
     processor_output_folder = os.path.join(output_dir, "processors")
     try:
         processors = convert_processors(processors, tiny_config, processor_output_folder, result)
-    except Exception as e:
-        error = f"Failed to convert the processors: {e}"
-        result["warnings"].append(error)
+    except Exception:
+        error = "Failed to convert the processors."
+        trace = traceback.format_exc()
+        result["warnings"].append((error, trace))
 
     if len(processors) == 0:
         error = f"No processor is returned by `convert_processors` for {config_class.__name__}."
-        fill_result_with_error(result, error, models_to_create)
-        logger.error(result["error"])
+        fill_result_with_error(result, error, None, models_to_create)
+        logger.error(result["error"][0])
         return result
 
     try:
         config_overrides = get_config_overrides(config_class, processors)
     except Exception as e:
         error = f"Failure occurs while calling `get_config_overrides`: {e}"
-        fill_result_with_error(result, error, models_to_create)
-        logger.error(result["error"])
+        trace = traceback.format_exc()
+        fill_result_with_error(result, error, trace, models_to_create)
+        logger.error(result["error"][0])
         return result
 
     # Just for us to see this easily in the report
@@ -935,7 +1017,7 @@ def build(config_class, models_to_create, output_dir):
                 tiny_config.text_config_dict[k] = v
 
     if result["warnings"]:
-        logger.warning(result["warnings"])
+        logger.warning(result["warnings"][0][0])
 
     # update `result["processor"]`
     result["processor"] = {type(p).__name__: p.__class__.__name__ for p in processors}
@@ -948,13 +1030,14 @@ def build(config_class, models_to_create, output_dir):
         except Exception as e:
             model = None
             error = f"Failed to create the pytorch model for {pytorch_arch}: {e}"
+            trace = traceback.format_exc()
 
         result["pytorch"][pytorch_arch.__name__]["model"] = model.__class__.__name__ if model is not None else None
         result["pytorch"][pytorch_arch.__name__]["checkpoint"] = (
             get_checkpoint_dir(output_dir, pytorch_arch) if model is not None else None
         )
         if error is not None:
-            result["pytorch"][pytorch_arch.__name__]["error"] = error
+            result["pytorch"][pytorch_arch.__name__]["error"] = (error, trace)
             logger.error(f"{pytorch_arch.__name__}: {error}")
 
     for tensorflow_arch in models_to_create["tensorflow"]:
@@ -974,12 +1057,14 @@ def build(config_class, models_to_create, output_dir):
                 # Conversion may fail. Let's not create a model with different weights to avoid confusion (for now).
                 model = None
                 error = f"Failed to convert the pytorch model to the tensorflow model for {pt_arch}: {e}"
+                trace = traceback.format_exc()
         else:
             try:
                 model = build_model(tensorflow_arch, tiny_config, output_dir=output_dir)
             except Exception as e:
                 model = None
                 error = f"Failed to create the tensorflow model for {tensorflow_arch}: {e}"
+                trace = traceback.format_exc()
 
         result["tensorflow"][tensorflow_arch.__name__]["model"] = (
             model.__class__.__name__ if model is not None else None
@@ -988,7 +1073,7 @@ def build(config_class, models_to_create, output_dir):
             get_checkpoint_dir(output_dir, tensorflow_arch) if model is not None else None
         )
         if error is not None:
-            result["tensorflow"][tensorflow_arch.__name__]["error"] = error
+            result["tensorflow"][tensorflow_arch.__name__]["error"] = (error, trace)
             logger.error(f"{tensorflow_arch.__name__}: {error}")
 
     if not result["error"]:
@@ -999,6 +1084,37 @@ def build(config_class, models_to_create, output_dir):
     return result
 
 
+def build_tiny_model_summary(results):
+    """Build a summary: a dictionary of the form
+    {
+      model architecture name:
+        {
+          "tokenizer_classes": [...],
+          "processor_classes": [...]
+        }
+      ...
+    }
+    """
+    tiny_model_summary = {}
+    for config_name in results:
+        processors = list(results[config_name]["processor"])
+        tokenizer_classes = [x for x in processors if x.endswith(("TokenizerFast", "Tokenizer"))]
+        processor_classes = [x for x in processors if x not in tokenizer_classes]
+        for framework in FRAMEWORKS:
+            if framework not in results[config_name]:
+                continue
+            for arch_name in results[config_name][framework]:
+                # Skip `arch_name` if no tiny model was created for it (`build` sets its "model" entry to `None`)
+                if results[config_name][framework][arch_name]["model"] is None:
+                    continue
+                tiny_model_summary[arch_name] = {
+                    "tokenizer_classes": tokenizer_classes,
+                    "processor_classes": processor_classes,
+                }
+
+    return tiny_model_summary
+
+
 def build_failed_report(results, include_warning=True):
     failed_results = {}
     for config_name in results:
@@ -1039,10 +1155,10 @@ def build_simple_report(results):
             for arch_name in results[config_name][framework]:
                 if "error" in results[config_name][framework][arch_name]:
                     result = results[config_name][framework][arch_name]["error"]
-                    failed_text += f"{arch_name}: {result}\n"
+                    failed_text += f"{arch_name}: {result[0]}\n"
                 else:
-                    result = "OK"
-                text += f"{arch_name}: {result}\n"
+                    result = ("OK",)
+                text += f"{arch_name}: {result[0]}\n"
 
     return text, failed_text
 
@@ -1066,8 +1182,6 @@ def build_simple_report(results):
     tensorflow_arch_mappings = [getattr(transformers_module, x) for x in _tensorflow_arch_mappings]
     # flax_arch_mappings = [getattr(transformers_module, x) for x in _flax_arch_mappings]
 
-    unexportable_model_architectures = []
-
     ds = load_dataset("wikitext", "wikitext-2-raw-v1")
     training_ds = ds["train"]
     testing_ds = ds["test"]
@@ -1129,16 +1243,27 @@ def list_str(values):
     with open("tiny_model_creation_report.json", "w") as fp:
         json.dump(results, fp, indent=4)
 
-    # Build the failure report
+    # Build the tiny model summary file. `tokenizer_classes` and `processor_classes` may both be empty lists.
+    # When using the entries in this file to update `tests/utils/tiny_model_summary.json`, model architectures
+    # whose `tokenizer_classes` and `processor_classes` are both empty should **NOT** be added to
+    # `tests/utils/tiny_model_summary.json`.
+    tiny_model_summary = build_tiny_model_summary(results)
+    with open("tiny_model_summary.json", "w") as fp:
+        json.dump(tiny_model_summary, fp, indent=4)
+
+    # Build the warning/failure report (JSON): same format as the complete `results`, but containing only
+    # the entries with warnings or errors.
     failed_results = build_failed_report(results)
     with open("failed_report.json", "w") as fp:
         json.dump(failed_results, fp, indent=4)
 
-    # Build the failure report
     simple_report, failed_report = build_simple_report(results)
+    # The simplified report: a .txt file in which each line has the format:
+    # {model architecture name}: {OK or error message}
     with open("simple_report.txt", "w") as fp:
         fp.write(simple_report)
 
+    # The simplified failure report: same as above, except it only contains the lines with errors
     with open("simple_failed_report.txt", "w") as fp:
         fp.write(failed_report)
 
@@ -1160,7 +1285,7 @@ def list_str(values):
                 try:
                     upload_model(model_dir, args.organization)
                 except Exception as e:
-                    error = f"Failed to upload {model_dir}: {e}"
+                    error = f"Failed to upload {model_dir}. {e.__class__.__name__}: {e}"
                     logger.error(error)
                     upload_results[model_dir] = error
 
diff --git a/utils/notification_service.py b/utils/notification_service.py
index 25933a0ab218..0aefd5844d32 100644
--- a/utils/notification_service.py
+++ b/utils/notification_service.py
@@ -215,11 +215,15 @@ def warnings(self) -> Dict:
             # Use the actual job link
             job_link = f"{github_actions_job_links['Extract warnings in CI artifacts']}"
 
+        huggingface_hub_warnings = [x for x in self.selected_warnings if "huggingface_hub" in x]
+        text = f"There are {len(self.selected_warnings)} warnings being selected."
+        text += f"\n{len(huggingface_hub_warnings)} of them are from `huggingface_hub`."
+
         return {
             "type": "section",
             "text": {
                 "type": "plain_text",
-                "text": f"There were {len(self.selected_warnings)} warnings being selected.",
+                "text": text,
                 "emoji": True,
             },
             "accessory": {
diff --git a/utils/update_metadata.py b/utils/update_metadata.py
index f95a4575d1e7..8c34bba5d6ae 100644
--- a/utils/update_metadata.py
+++ b/utils/update_metadata.py
@@ -93,8 +93,8 @@
     ("image-to-text", "MODEL_FOR_FOR_VISION_2_SEQ_MAPPING_NAMES", "AutoModelForVision2Seq"),
     (
         "zero-shot-image-classification",
-        "_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES",
-        "AutoModel",
+        "MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES",
+        "AutoModelForZeroShotImageClassification",
     ),
     ("depth-estimation", "MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES", "AutoModelForDepthEstimation"),
     ("video-classification", "MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES", "AutoModelForVideoClassification"),