From 3ff7dc98d9cfd6a2f2c125ba470e858fd84d1782 Mon Sep 17 00:00:00 2001 From: Steven Date: Thu, 31 Mar 2022 15:29:00 -0700 Subject: [PATCH 01/34] =?UTF-8?q?=20=F0=9F=93=9D=20add=20image/vision=20cl?= =?UTF-8?q?assification=20and=20asr?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/source/task_summary.mdx | 155 +++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) diff --git a/docs/source/task_summary.mdx b/docs/source/task_summary.mdx index 95c2d9c201a5..068b376eeb25 100644 --- a/docs/source/task_summary.mdx +++ b/docs/source/task_summary.mdx @@ -967,3 +967,158 @@ Here is an example of doing translation using a model and a tokenizer. The proce We get the same translation as with the pipeline example. + +## Audio classification + +Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or the how-to guide [here](/tasks/audio_classification). + +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for audio classification inference: + +```py +>>> from transformers import pipeline + +>>> audio_classifier = pipeline( +... task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... ) +>>> audio_classifier("jfk_moon_speech.wav") +[{'label': 'calm', 'score': 0.13856211304664612}, + {'label': 'disgust', 'score': 0.13148026168346405}, + {'label': 'happy', 'score': 0.12635163962841034}, + {'label': 'angry', 'score': 0.12439591437578201}, + {'label': 'fearful', 'score': 0.12404385954141617}] +``` + +The general process for using a model and tokenizer for audio classification is: + +1. Instantiate a tokenizer and a model from the checkpoint name. +2. Process the audio signal to be classified with a feature extractor. +3. Pass the input through the model and take the `argmax` to retrieve the most likely class. +4. Convert the class id to a class name with `id2label` to return an interpretable result. + + + +```py +>>> from transformers import AutoFeatureExtractor, AutoModelForAudioClassification +>>> from datasets import load_dataset +>>> import torch + +>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") +>>> dataset = dataset.sort("id") +>>> sampling_rate = dataset.features["audio"].sampling_rate + +>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks") +>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks") + +>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") + +>>> with torch.no_grad(): +... logits = model(**inputs).logits + +>>> predicted_class_ids = torch.argmax(logits, dim=-1).item() +>>> predicted_label = model.config.id2label[predicted_class_ids] +>>> predicted_label +``` + + + +## Automatic speech recognition + +Automatic speech recognition transcribes an audio signal to text. 
The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py )scripts or the how-to guide [here](/tasks/asr). + +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for automatic speech recognition inference: + +```py +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline( + task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h" +) +>>> speech_recognizer("jfk_moon_speech.wav") +{'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"} +``` + +The general process for using a model and tokenizer for automatic speech recognition is: + +1. Instantiate a tokenizer and a model from the checkpoint name. +2. Process the audio signal and text with a processor. +3. Pass the input through the model and take the `argmax` to retrieve the predicted text. +4. Decode the text with a tokenizer to obtain the transcription. + + + +```py +>>> from transformers import AutoProcessor, AutoModelForCTC +>>> from datasets import load_dataset +>>> import torch + +>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") +>>> dataset = dataset.sort("id") +>>> sampling_rate = dataset.features["audio"].sampling_rate + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") +>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h") + +>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +>>> with torch.no_grad(): +... logits = model(**inputs).logits +>>> predicted_ids = torch.argmax(logits, dim=-1) + +>>> transcription = processor.batch_decode(predicted_ids) +>>> transcription[0] +``` + + + +## Image classification + +Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or the how-to guide [here](/tasks/image_classification). 
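(Before the inference examples that follow, a minimal sketch of peeking at the fine-tuning data; the `img` and `fine_label` column names are assumptions about how the `cifar100` dataset is laid out on the Hub:)

```py
>>> from datasets import load_dataset

>>> # load a small slice of CIFAR-100 and inspect one image/label pair
>>> dataset = load_dataset("cifar100", split="train[:10]")
>>> dataset[0]["img"]  # a PIL image (assumed column name)
>>> dataset.features["fine_label"].int2str(dataset[0]["fine_label"])  # readable class name
```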
+ +The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for image classification inference: + +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(task="image-classification") +>>> vision_classifier( +... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +[{'label': 'lynx, catamount', 'score': 0.4403027892112732}, + {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor', + 'score': 0.03433405980467796}, + {'label': 'snow leopard, ounce, Panthera uncia', + 'score': 0.032148055732250214}, + {'label': 'Egyptian cat', 'score': 0.02353910356760025}, + {'label': 'tiger cat', 'score': 0.023034192621707916}] +``` + +The general process for using a model and tokenizer for image classification is: + +1. Instantiate a tokenizer and a model from the checkpoint name. +2. Process the image to be classified with a feature extractor. +3. Pass the input through the model and take the `argmax` to retrieve the predicted class. +4. Convert the class id to a class name with `id2label` to return an interpretable result. + + + +```py +>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification +>>> import torch +>>> from datasets import load_dataset + +>>> dataset = load_dataset("huggingface/cats-image") +>>> image = dataset["test"]["image"][0] + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224") +>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224") + +>>> inputs = feature_extractor(image, return_tensors="pt") + +>>> with torch.no_grad(): +... logits = model(**inputs).logits + +>>> predicted_label = logits.argmax(-1).item() +>>> print(model.config.id2label[predicted_label]) +Egyptian cat +``` + + \ No newline at end of file From 4e8453a503ad65f3a7df1aad84e861811b19380a Mon Sep 17 00:00:00 2001 From: Steven Date: Thu, 31 Mar 2022 16:01:24 -0700 Subject: [PATCH 02/34] =?UTF-8?q?=20=F0=9F=96=8D=20minor=20formatting=20fi?= =?UTF-8?q?xes?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/source/task_summary.mdx | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/source/task_summary.mdx b/docs/source/task_summary.mdx index 068b376eeb25..fd30add50729 100644 --- a/docs/source/task_summary.mdx +++ b/docs/source/task_summary.mdx @@ -970,7 +970,7 @@ We get the same translation as with the pipeline example. ## Audio classification -Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or the how-to guide [here](/tasks/audio_classification). +Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. 
If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or the how-to guide [here](./tasks/audio_classification). The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for audio classification inference: @@ -1006,8 +1006,8 @@ The general process for using a model and tokenizer for audio classification is: >>> dataset = dataset.sort("id") >>> sampling_rate = dataset.features["audio"].sampling_rate ->>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks") ->>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks") +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks") +>>> model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks") >>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") @@ -1023,7 +1023,7 @@ The general process for using a model and tokenizer for audio classification is: ## Automatic speech recognition -Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py )scripts or the how-to guide [here](/tasks/asr). +Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or the how-to guide [here](./tasks/asr). The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for automatic speech recognition inference: @@ -1031,8 +1031,8 @@ The following examples demonstrate how to use a [`pipeline`] and a model and tok >>> from transformers import pipeline >>> speech_recognizer = pipeline( - task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h" -) +... task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h" +... 
) >>> speech_recognizer("jfk_moon_speech.wav") {'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"} ``` @@ -1071,7 +1071,7 @@ The general process for using a model and tokenizer for automatic speech recogni ## Image classification -Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or the how-to guide [here](/tasks/image_classification). +Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or the how-to guide [here](./tasks/image_classification). 
The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for image classification inference: From 4f80d31dd2ad3b2956729e49ec58c20f33e1c1a1 Mon Sep 17 00:00:00 2001 From: Cathy <815244047@qq.com> Date: Fri, 1 Apr 2022 15:17:31 +0800 Subject: [PATCH 03/34] Fixed a typo in legacy seq2seq_trainer.py (#16531) --- examples/legacy/seq2seq/seq2seq_trainer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/legacy/seq2seq/seq2seq_trainer.py b/examples/legacy/seq2seq/seq2seq_trainer.py index 1c2d7924a444..eeff082499c4 100644 --- a/examples/legacy/seq2seq/seq2seq_trainer.py +++ b/examples/legacy/seq2seq/seq2seq_trainer.py @@ -115,7 +115,7 @@ def create_optimizer_and_scheduler(self, num_training_steps: int): "eps": self.args.adam_epsilon, } optimizer_kwargs["lr"] = self.args.learning_rate - if self.sharded_dpp: + if self.sharded_ddp: self.optimizer = OSS( params=optimizer_grouped_parameters, optim=optimizer_cls, From 1f426af06a88364b5934403cc603e01dc1f06a86 Mon Sep 17 00:00:00 2001 From: Jim Rohrer Date: Fri, 1 Apr 2022 03:52:42 -0500 Subject: [PATCH 04/34] Add ONNX export for BeiT (#16498) * Add beit onnx conversion support * Updated docs * Added cross reference to ViT ONNX config --- docs/source/serialization.mdx | 1 + src/transformers/models/beit/__init__.py | 4 ++-- .../models/beit/configuration_beit.py | 23 +++++++++++++++++++ src/transformers/onnx/features.py | 2 ++ tests/onnx/test_onnx_v2.py | 6 ++--- 5 files changed, 31 insertions(+), 5 deletions(-) diff --git a/docs/source/serialization.mdx b/docs/source/serialization.mdx index fc969aac4fdd..65fb5fa5cc54 100644 --- a/docs/source/serialization.mdx +++ b/docs/source/serialization.mdx @@ -47,6 +47,7 @@ Ready-made configurations include the following architectures: - ALBERT - BART +- BEiT - BERT - Blenderbot - BlenderbotSmall diff --git a/src/transformers/models/beit/__init__.py b/src/transformers/models/beit/__init__.py index 319fb2880a1d..27c31775d34e 100644 --- a/src/transformers/models/beit/__init__.py +++ b/src/transformers/models/beit/__init__.py @@ -22,7 +22,7 @@ _import_structure = { - "configuration_beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig"], + "configuration_beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig", "BeitOnnxConfig"], } if is_vision_available(): @@ -48,7 +48,7 @@ ] if TYPE_CHECKING: - from .configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig + from .configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig, BeitOnnxConfig if is_vision_available(): from .feature_extraction_beit import BeitFeatureExtractor diff --git a/src/transformers/models/beit/configuration_beit.py b/src/transformers/models/beit/configuration_beit.py index 9a1dfa8c20fc..7c47aba0c2ab 100644 --- a/src/transformers/models/beit/configuration_beit.py +++ b/src/transformers/models/beit/configuration_beit.py @@ -13,8 +13,13 @@ # See the License for the specific language governing permissions and # limitations under the License. 
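# Usage sketch (an assumption, mirroring the ViT export flow documented in
# docs/source/serialization.mdx): once the `BeitOnnxConfig` added below is
# registered with `FeaturesManager`, a BEiT checkpoint can be exported with the
# ready-made ONNX CLI, e.g.
#   python -m transformers.onnx --model=microsoft/beit-base-patch16-224 onnx/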
""" BEiT model configuration""" +from collections import OrderedDict +from typing import Mapping + +from packaging import version from ...configuration_utils import PretrainedConfig +from ...onnx import OnnxConfig from ...utils import logging @@ -176,3 +181,21 @@ def __init__( self.auxiliary_num_convs = auxiliary_num_convs self.auxiliary_concat_input = auxiliary_concat_input self.semantic_loss_ignore_index = semantic_loss_ignore_index + + +# Copied from transformers.models.vit.configuration_vit.ViTOnnxConfig +class BeitOnnxConfig(OnnxConfig): + + torch_onnx_minimum_version = version.parse("1.11") + + @property + def inputs(self) -> Mapping[str, Mapping[int, str]]: + return OrderedDict( + [ + ("pixel_values", {0: "batch", 1: "sequence"}), + ] + ) + + @property + def atol_for_validation(self) -> float: + return 1e-4 diff --git a/src/transformers/onnx/features.py b/src/transformers/onnx/features.py index 926137c59482..cf5e55c521de 100644 --- a/src/transformers/onnx/features.py +++ b/src/transformers/onnx/features.py @@ -4,6 +4,7 @@ from .. import PretrainedConfig, PreTrainedModel, TFPreTrainedModel, is_tf_available, is_torch_available from ..models.albert import AlbertOnnxConfig from ..models.bart import BartOnnxConfig +from ..models.beit import BeitOnnxConfig from ..models.bert import BertOnnxConfig from ..models.blenderbot import BlenderbotOnnxConfig from ..models.blenderbot_small import BlenderbotSmallOnnxConfig @@ -270,6 +271,7 @@ class FeaturesManager: onnx_config_cls=ElectraOnnxConfig, ), "vit": supported_features_mapping("default", "image-classification", onnx_config_cls=ViTOnnxConfig), + "beit": supported_features_mapping("default", "image-classification", onnx_config_cls=BeitOnnxConfig), "blenderbot": supported_features_mapping( "default", "default-with-past", diff --git a/tests/onnx/test_onnx_v2.py b/tests/onnx/test_onnx_v2.py index f530515aed79..ba8d51158ff9 100644 --- a/tests/onnx/test_onnx_v2.py +++ b/tests/onnx/test_onnx_v2.py @@ -15,14 +15,13 @@ export, validate_model_outputs, ) +from transformers.onnx.utils import compute_effective_axis_dimension, compute_serialized_parameters_size +from transformers.testing_utils import require_onnx, require_tf, require_torch, require_vision, slow if is_torch_available() or is_tf_available(): from transformers.onnx.features import FeaturesManager -from transformers.onnx.utils import compute_effective_axis_dimension, compute_serialized_parameters_size -from transformers.testing_utils import require_onnx, require_tf, require_torch, require_vision, slow - @require_onnx class OnnxUtilsTestCaseV2(TestCase): @@ -181,6 +180,7 @@ def test_values_override(self): ("xlm-roberta", "xlm-roberta-base"), ("layoutlm", "microsoft/layoutlm-base-uncased"), ("vit", "google/vit-base-patch16-224"), + ("beit", "microsoft/beit-base-patch16-224"), } PYTORCH_EXPORT_WITH_PAST_MODELS = { From ef37dc48640516c70f106b94794dc5db3d4d34df Mon Sep 17 00:00:00 2001 From: Ferdinand Schlatt Date: Fri, 1 Apr 2022 14:50:47 +0200 Subject: [PATCH 05/34] call on_train_end when trial is pruned (#16536) --- src/transformers/trainer.py | 1 + 1 file changed, 1 insertion(+) diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py index 157e65d18352..948697e35127 100755 --- a/src/transformers/trainer.py +++ b/src/transformers/trainer.py @@ -991,6 +991,7 @@ def _report_to_hp_search( trial.report(self.objective, epoch) if trial.should_prune(): + self.callback_handler.on_train_end(self.args, self.state, self.control) raise optuna.TrialPruned() elif self.hp_search_backend == 
HPSearchBackend.RAY: from ray import tune From 91167b13c9055c0be7eaca27a20e9673a5ce6979 Mon Sep 17 00:00:00 2001 From: Dahlbomii <101373053+Dahlbomii@users.noreply.github.com> Date: Fri, 1 Apr 2022 06:27:41 -0700 Subject: [PATCH 06/34] Type hints added (#16529) --- .../models/openai/modeling_tf_openai.py | 98 ++++++++++--------- 1 file changed, 50 insertions(+), 48 deletions(-) diff --git a/src/transformers/models/openai/modeling_tf_openai.py b/src/transformers/models/openai/modeling_tf_openai.py index 490b3fac47e5..80d7a9abd192 100644 --- a/src/transformers/models/openai/modeling_tf_openai.py +++ b/src/transformers/models/openai/modeling_tf_openai.py @@ -16,8 +16,9 @@ """ TF 2.0 OpenAI GPT model.""" from dataclasses import dataclass -from typing import Optional, Tuple +from typing import Optional, Tuple, Union +import numpy as np import tensorflow as tf from ...activations_tf import get_tf_activation @@ -25,6 +26,7 @@ from ...modeling_tf_utils import ( TFCausalLanguageModelingLoss, TFConv1D, + TFModelInputType, TFPreTrainedModel, TFSequenceClassificationLoss, TFSequenceSummary, @@ -510,18 +512,18 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - training=False, + input_ids: Optional[TFModelInputType] = None, + attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = False, **kwargs, - ): + ) -> Union[Tuple, TFBaseModelOutput]: outputs = self.transformer( input_ids=input_ids, @@ -573,19 +575,19 @@ def set_output_embeddings(self, value): ) def call( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - training=False, + input_ids: Optional[TFModelInputType] = None, + attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + training: Optional[bool] = False, **kwargs, - ): + ) -> Union[Tuple, TFCausalLMOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the cross entropy classification loss. 
Indices should be in `[0, ..., @@ -656,19 +658,19 @@ def __init__(self, config, *inputs, **kwargs): @replace_return_docstrings(output_type=TFOpenAIGPTDoubleHeadsModelOutput, config_class=_CONFIG_FOR_DOC) def call( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - mc_token_ids=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - training=False, + input_ids: Optional[TFModelInputType] = None, + attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + mc_token_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = False, **kwargs, - ): + ) -> Union[Tuple, TFOpenAIGPTDoubleHeadsModelOutput]: r""" mc_token_ids (`tf.Tensor` or `Numpy array` of shape `(batch_size, num_choices)`, *optional*, default to index of the last token of the input): Index of the classification token in each input sequence. Selected in the range `[0, input_ids.size(-1) - @@ -800,19 +802,19 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - training=False, + input_ids: Optional[TFModelInputType] = None, + attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + training: Optional[bool] = False, **kwargs, - ): + ) -> Union[Tuple, TFSequenceClassifierOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the cross entropy classification loss. 
Indices should be in `[0, ..., From a9425ec12314745028dafd44ac62befac454c9c2 Mon Sep 17 00:00:00 2001 From: Gunjan Chhablani Date: Fri, 1 Apr 2022 19:20:22 +0530 Subject: [PATCH 07/34] Fix Bart type hints (#16297) * Add type hints to PLBart PyTorch * Remove pending merge conflicts * Fix PLBart Type Hints * Add changes from review --- .../models/plbart/modeling_plbart.py | 80 +++++++++---------- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/src/transformers/models/plbart/modeling_plbart.py b/src/transformers/models/plbart/modeling_plbart.py index b1a2088913fd..37230541e9db 100755 --- a/src/transformers/models/plbart/modeling_plbart.py +++ b/src/transformers/models/plbart/modeling_plbart.py @@ -16,7 +16,7 @@ import copy import math import random -from typing import List, Optional, Tuple, Union +from typing import Any, Dict, List, Optional, Tuple, Union import torch import torch.utils.checkpoint @@ -1142,21 +1142,21 @@ def get_decoder(self): ) def forward( self, - input_ids=None, - attention_mask=None, - decoder_input_ids=None, - decoder_attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - encoder_outputs=None, - past_key_values=None, - inputs_embeds=None, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.LongTensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[List[torch.FloatTensor]] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, decoder_inputs_embeds=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, ): output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( @@ -1271,23 +1271,23 @@ def set_output_embeddings(self, new_embeddings): @add_end_docstrings(PLBART_GENERATION_EXAMPLE) def forward( self, - input_ids=None, - attention_mask=None, - decoder_input_ids=None, - decoder_attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - encoder_outputs=None, - past_key_values=None, - inputs_embeds=None, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.LongTensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[List[torch.FloatTensor]] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, decoder_inputs_embeds=None, - labels=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + labels: Optional[torch.Tensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], Seq2SeqLMOutput]: r""" labels (`torch.LongTensor` of 
shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., @@ -1345,16 +1345,16 @@ def forward( def prepare_inputs_for_generation( self, - decoder_input_ids, - past=None, - attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - use_cache=None, - encoder_outputs=None, + decoder_input_ids: torch.LongTensor, + past: Optional[List[torch.FloatTensor]] = None, + attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + use_cache: Optional[bool] = None, + encoder_outputs: Optional[List[torch.FloatTensor]] = None, **kwargs # TODO: Check if this is needed. It is unused? - ): + ) -> Dict[str, Any]: # cut decoder_input_ids if past is used if past is not None: decoder_input_ids = decoder_input_ids[:, -1:] From a1dfe0064eab59b95fc0becd94a195a79341751a Mon Sep 17 00:00:00 2001 From: Gunjan Chhablani Date: Fri, 1 Apr 2022 19:32:58 +0530 Subject: [PATCH 08/34] Add VisualBert type hints (#16544) --- .../visual_bert/modeling_visual_bert.py | 184 +++++++++--------- 1 file changed, 92 insertions(+), 92 deletions(-) diff --git a/src/transformers/models/visual_bert/modeling_visual_bert.py b/src/transformers/models/visual_bert/modeling_visual_bert.py index 0e5acf32b3c4..69495785fe81 100755 --- a/src/transformers/models/visual_bert/modeling_visual_bert.py +++ b/src/transformers/models/visual_bert/modeling_visual_bert.py @@ -17,7 +17,7 @@ import math from dataclasses import dataclass -from typing import Optional, Tuple +from typing import Optional, Tuple, Union import torch import torch.utils.checkpoint @@ -720,20 +720,20 @@ class PreTrainedModel @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]: r""" Returns: @@ -893,22 +893,22 @@ def set_output_embeddings(self, new_embeddings): @replace_return_docstrings(output_type=VisualBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - sentence_image_labels=None, - ): + 
input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + sentence_image_labels: Optional[torch.LongTensor] = None, + ) -> Union[Tuple[torch.Tensor], VisualBertForPreTrainingOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size, total_sequence_length)`, *optional*): Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ..., @@ -1039,21 +1039,21 @@ def __init__(self, config): @replace_return_docstrings(output_type=MultipleChoiceModelOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - ): + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): Labels for computing the multiple choice classification loss. 
Indices should be in `[0, ..., @@ -1191,21 +1191,21 @@ def __init__(self, config): @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - ): + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size, total_sequence_length)`, *optional*): Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., @@ -1317,21 +1317,21 @@ def __init__(self, config): @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - labels=None, - ): + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): Labels for computing the sequence classification/regression loss. 
Indices should be in `[0, ..., @@ -1477,22 +1477,22 @@ def __init__(self, config): @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - visual_embeds=None, - visual_attention_mask=None, - visual_token_type_ids=None, - image_text_alignment=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - region_to_phrase_position=None, - labels=None, - ): + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + visual_embeds: Optional[torch.FloatTensor] = None, + visual_attention_mask: Optional[torch.LongTensor] = None, + visual_token_type_ids: Optional[torch.LongTensor] = None, + image_text_alignment: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + region_to_phrase_position: Optional[torch.LongTensor] = None, + labels: Optional[torch.LongTensor] = None, + ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]: r""" region_to_phrase_position (`torch.LongTensor` of shape `(batch_size, total_sequence_length)`, *optional*): The positions depicting the position of the image embedding corresponding to the textual tokens. From 8976f05a8f9524a4278b34143f010d1578c43255 Mon Sep 17 00:00:00 2001 From: Rishav Chandra Varma Date: Fri, 1 Apr 2022 19:51:26 +0530 Subject: [PATCH 09/34] Adding missing type hints for mBART model (PyTorch) (#16429) * added type hints for mbart tensorflow tf implementation * Adding missing type hints for mBART model Tensorflow Implementation model added with missing type hints * Missing Type hints - correction For TF model * Code fixup using make quality tests * Hint types - typo error * make fix-copies and make fixup * type hints * updated files * type hints update * making dependent modesls coherent Co-authored-by: matt --- .../modeling_bigbird_pegasus.py | 2 +- .../models/blenderbot/modeling_blenderbot.py | 4 +- .../models/m2m_100/modeling_m2m_100.py | 4 +- .../models/mbart/modeling_mbart.py | 112 +++++++++--------- .../models/pegasus/modeling_pegasus.py | 4 +- src/transformers/models/xglm/modeling_xglm.py | 2 +- 6 files changed, 64 insertions(+), 64 deletions(-) diff --git a/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py b/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py index 540f77944b7b..1fb8de8e1452 100755 --- a/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py +++ b/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py @@ -1478,7 +1478,7 @@ def forward( past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = True, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* diff --git a/src/transformers/models/blenderbot/modeling_blenderbot.py b/src/transformers/models/blenderbot/modeling_blenderbot.py index 928e22e860e7..d1f84d2c3917 100755 --- a/src/transformers/models/blenderbot/modeling_blenderbot.py +++ b/src/transformers/models/blenderbot/modeling_blenderbot.py @@ -294,7 +294,7 @@ def 
forward( attention_mask: torch.Tensor, layer_head_mask: torch.Tensor, output_attentions: bool = False, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* @@ -378,7 +378,7 @@ def forward( past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = True, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* diff --git a/src/transformers/models/m2m_100/modeling_m2m_100.py b/src/transformers/models/m2m_100/modeling_m2m_100.py index 3bb749564a01..d816218824e1 100755 --- a/src/transformers/models/m2m_100/modeling_m2m_100.py +++ b/src/transformers/models/m2m_100/modeling_m2m_100.py @@ -363,7 +363,7 @@ def forward( attention_mask: torch.Tensor, layer_head_mask: torch.Tensor, output_attentions: bool = False, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* @@ -447,7 +447,7 @@ def forward( past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = True, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* diff --git a/src/transformers/models/mbart/modeling_mbart.py b/src/transformers/models/mbart/modeling_mbart.py index 446a02f648cd..6ed7c24ab176 100755 --- a/src/transformers/models/mbart/modeling_mbart.py +++ b/src/transformers/models/mbart/modeling_mbart.py @@ -307,7 +307,7 @@ def forward( attention_mask: torch.Tensor, layer_head_mask: torch.Tensor, output_attentions: bool = False, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* @@ -390,7 +390,7 @@ def forward( past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = True, - ): + ) -> torch.Tensor: """ Args: hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)* @@ -722,14 +722,14 @@ def _backward_compatibility_gradient_checkpointing(self): def forward( self, - input_ids=None, - attention_mask=None, - head_mask=None, - inputs_embeds=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: r""" Args: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): @@ -913,19 +913,19 @@ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_em def forward( self, - input_ids=None, - attention_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - head_mask=None, - cross_attn_head_mask=None, - past_key_values=None, - inputs_embeds=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + 
cross_attn_head_mask: Optional[torch.Tensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]: r""" Args: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): @@ -1168,22 +1168,22 @@ def get_decoder(self): ) def forward( self, - input_ids=None, - attention_mask=None, - decoder_input_ids=None, - decoder_attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - encoder_outputs=None, - past_key_values=None, - inputs_embeds=None, - decoder_inputs_embeds=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + decoder_inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Seq2SeqModelOutput, Tuple[torch.FloatTensor]]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states @@ -1297,23 +1297,23 @@ def set_output_embeddings(self, new_embeddings): @add_end_docstrings(MBART_GENERATION_EXAMPLE) def forward( self, - input_ids=None, - attention_mask=None, - decoder_input_ids=None, - decoder_attention_mask=None, - head_mask=None, - decoder_head_mask=None, - cross_attn_head_mask=None, - encoder_outputs=None, - past_key_values=None, - inputs_embeds=None, - decoder_inputs_embeds=None, - labels=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + decoder_inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Seq2SeqLMOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. 
Indices should either be in `[0, ...,
diff --git a/src/transformers/models/pegasus/modeling_pegasus.py b/src/transformers/models/pegasus/modeling_pegasus.py
index f1d7a6ce56ef..06cf9f130a73 100755
--- a/src/transformers/models/pegasus/modeling_pegasus.py
+++ b/src/transformers/models/pegasus/modeling_pegasus.py
@@ -309,7 +309,7 @@ def forward(
         attention_mask: torch.Tensor,
         layer_head_mask: torch.Tensor,
         output_attentions: bool = False,
-    ):
+    ) -> torch.Tensor:
         """
         Args:
             hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)*
@@ -393,7 +393,7 @@ def forward(
         past_key_value: Optional[Tuple[torch.Tensor]] = None,
         output_attentions: Optional[bool] = False,
         use_cache: Optional[bool] = True,
-    ):
+    ) -> torch.Tensor:
         """
         Args:
             hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)*
diff --git a/src/transformers/models/xglm/modeling_xglm.py b/src/transformers/models/xglm/modeling_xglm.py
index af277fcd7880..8d45e2b200b7 100755
--- a/src/transformers/models/xglm/modeling_xglm.py
+++ b/src/transformers/models/xglm/modeling_xglm.py
@@ -423,7 +423,7 @@ def forward(
         past_key_value: Optional[Tuple[torch.Tensor]] = None,
         output_attentions: Optional[bool] = False,
         use_cache: Optional[bool] = True,
-    ):
+    ) -> torch.Tensor:
         """
         Args:
             hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)*

From 50ca1d1055691cd9363a511a7cc8089a2e618c61 Mon Sep 17 00:00:00 2001
From: Gunjan Chhablani
Date: Fri, 1 Apr 2022 20:09:28 +0530
Subject: [PATCH 10/34] Remove MBart subclass of XLMRoberta in tokenzier docs
 (#16546)

* Remove MBart subclass of XLMRoberta in tokenzier

* Fix style

* Copy docs from MBart50 tokenizer
---
 src/transformers/models/mbart/tokenization_mbart_fast.py | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/transformers/models/mbart/tokenization_mbart_fast.py b/src/transformers/models/mbart/tokenization_mbart_fast.py
index 1de8d62f3608..a172d37913a4 100644
--- a/src/transformers/models/mbart/tokenization_mbart_fast.py
+++ b/src/transformers/models/mbart/tokenization_mbart_fast.py
@@ -62,9 +62,8 @@ class MBartTokenizerFast(PreTrainedTokenizerFast):
     Construct a "fast" MBART tokenizer (backed by HuggingFace's *tokenizers* library). Based on
     [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).

-    [`MBartTokenizerFast`] is a subclass of [`XLMRobertaTokenizerFast`]. Refer to superclass
-    [`XLMRobertaTokenizerFast`] for usage examples and documentation concerning the initialization parameters and other
-    methods.
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods.

     The tokenization method is `<tokens> <eos> <language code>` for source language documents, and `<language code> <tokens> <eos>` for target language documents.

From 4ba9b4d663ce9dae4220aa3b69741dd6bb881c9b Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Fri, 1 Apr 2022 16:53:07 +0200
Subject: [PATCH 11/34] Use random_attention_mask for TF tests (#16517)

* use random_attention_mask for TF tests

* Fix for TFCLIP test (for now).
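For context, a sketch of what the helper guarantees (the exact implementation lives in
tests/test_modeling_tf_common.py; treat this as an approximation): unlike
`ids_tensor(..., vocab_size=2)`, which can produce an all-zero row, `random_attention_mask`
forces at least one attended position per example.

```py
import tensorflow as tf


def random_attention_mask(shape):
    # random 0/1 mask over `shape = [batch_size, seq_length]`
    mask = tf.random.uniform(shape, minval=0, maxval=2, dtype=tf.int32)
    # force the last position to 1 so every row attends to at least one token
    return tf.concat([mask[:, :-1], tf.ones_like(mask[:, -1:])], axis=-1)
```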
Co-authored-by: ydshieh --- ...est_modeling_tf_{{cookiecutter.lowercase_modelname}}.py | 4 ++-- tests/albert/test_modeling_tf_albert.py | 4 ++-- tests/bert/test_modeling_tf_bert.py | 4 ++-- tests/clip/test_modeling_tf_clip.py | 6 ++++++ tests/convbert/test_modeling_tf_convbert.py | 4 ++-- tests/ctrl/test_modeling_tf_ctrl.py | 4 ++-- tests/deberta/test_modeling_tf_deberta.py | 4 ++-- tests/deberta_v2/test_modeling_tf_deberta_v2.py | 4 ++-- tests/distilbert/test_modeling_tf_distilbert.py | 4 ++-- tests/dpr/test_modeling_tf_dpr.py | 7 +++---- tests/electra/test_modeling_tf_electra.py | 4 ++-- tests/flaubert/test_modeling_tf_flaubert.py | 4 ++-- tests/funnel/test_modeling_tf_funnel.py | 4 ++-- tests/gpt2/test_modeling_tf_gpt2.py | 4 ++-- tests/gptj/test_modeling_tf_gptj.py | 4 ++-- tests/layoutlm/test_modeling_tf_layoutlm.py | 4 ++-- tests/longformer/test_modeling_tf_longformer.py | 4 ++-- tests/lxmert/test_modeling_tf_lxmert.py | 4 ++-- tests/mobilebert/test_modeling_tf_mobilebert.py | 4 ++-- tests/mpnet/test_modeling_tf_mpnet.py | 4 ++-- tests/openai/test_modeling_tf_openai.py | 4 ++-- tests/rembert/test_modeling_tf_rembert.py | 4 ++-- tests/roberta/test_modeling_tf_roberta.py | 4 ++-- tests/roformer/test_modeling_tf_roformer.py | 4 ++-- tests/t5/test_modeling_tf_t5.py | 4 ++-- tests/tapas/test_modeling_tf_tapas.py | 4 ++-- tests/test_modeling_tf_common.py | 2 +- tests/xlm/test_modeling_tf_xlm.py | 4 ++-- tests/xlnet/test_modeling_tf_xlnet.py | 4 ++-- 29 files changed, 62 insertions(+), 57 deletions(-) diff --git a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py index 16b31500dd6c..57fd95dd3ff6 100644 --- a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py +++ b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py @@ -21,7 +21,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask if is_tf_available(): @@ -92,7 +92,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/albert/test_modeling_tf_albert.py b/tests/albert/test_modeling_tf_albert.py index 59815561c056..7eacc1f32a47 100644 --- a/tests/albert/test_modeling_tf_albert.py +++ b/tests/albert/test_modeling_tf_albert.py @@ -21,7 +21,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -96,7 +96,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, 
self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/bert/test_modeling_tf_bert.py b/tests/bert/test_modeling_tf_bert.py index 611268337ffd..8c709e093801 100644 --- a/tests/bert/test_modeling_tf_bert.py +++ b/tests/bert/test_modeling_tf_bert.py @@ -21,7 +21,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask from ..utils.test_modeling_tf_core import TFCoreModelTesterMixin @@ -96,7 +96,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/clip/test_modeling_tf_clip.py b/tests/clip/test_modeling_tf_clip.py index 02e289cd5b2a..d3c3cb9f5033 100644 --- a/tests/clip/test_modeling_tf_clip.py +++ b/tests/clip/test_modeling_tf_clip.py @@ -301,6 +301,12 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: input_mask = random_attention_mask([self.batch_size, self.seq_length]) + # make sure the first token has attention mask `1` to ensure that, after combining the causal mask, there + # is still at least one token being attended to for each batch. + # TODO: Change `random_attention_mask` in PT/TF/Flax common test file, after a discussion with the team. + input_mask = tf.concat( + [tf.ones_like(input_mask[:, :1], dtype=input_mask.dtype), input_mask[:, 1:]], axis=-1 + ) config = self.get_config() diff --git a/tests/convbert/test_modeling_tf_convbert.py b/tests/convbert/test_modeling_tf_convbert.py index ff4cbb1aa974..e2d68876263a 100644 --- a/tests/convbert/test_modeling_tf_convbert.py +++ b/tests/convbert/test_modeling_tf_convbert.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -94,7 +94,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/ctrl/test_modeling_tf_ctrl.py b/tests/ctrl/test_modeling_tf_ctrl.py index 65b984b51c9a..d17a97a3ad83 100644 --- a/tests/ctrl/test_modeling_tf_ctrl.py +++ b/tests/ctrl/test_modeling_tf_ctrl.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -69,7 +69,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/deberta/test_modeling_tf_deberta.py b/tests/deberta/test_modeling_tf_deberta.py index 
581f6f02f470..7e2a3c3110ee 100644 --- a/tests/deberta/test_modeling_tf_deberta.py +++ b/tests/deberta/test_modeling_tf_deberta.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -92,7 +92,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/deberta_v2/test_modeling_tf_deberta_v2.py b/tests/deberta_v2/test_modeling_tf_deberta_v2.py index 391afee59784..4fd967c2fa6e 100644 --- a/tests/deberta_v2/test_modeling_tf_deberta_v2.py +++ b/tests/deberta_v2/test_modeling_tf_deberta_v2.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -95,7 +95,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/distilbert/test_modeling_tf_distilbert.py b/tests/distilbert/test_modeling_tf_distilbert.py index 7a146e9c3bf8..5266723f1f86 100644 --- a/tests/distilbert/test_modeling_tf_distilbert.py +++ b/tests/distilbert/test_modeling_tf_distilbert.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -70,7 +70,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) sequence_labels = None token_labels = None diff --git a/tests/dpr/test_modeling_tf_dpr.py b/tests/dpr/test_modeling_tf_dpr.py index 7a48a2254e10..ffce36efc3a6 100644 --- a/tests/dpr/test_modeling_tf_dpr.py +++ b/tests/dpr/test_modeling_tf_dpr.py @@ -19,7 +19,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -94,9 +94,8 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor( - [self.batch_size, self.seq_length], vocab_size=2 - ) # follow test_modeling_tf_ctrl.py + # follow test_modeling_tf_ctrl.py + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/electra/test_modeling_tf_electra.py b/tests/electra/test_modeling_tf_electra.py index 4593ecff6100..ff2acd37e69f 100644 --- a/tests/electra/test_modeling_tf_electra.py +++ 
b/tests/electra/test_modeling_tf_electra.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask if is_tf_available(): @@ -71,7 +71,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/flaubert/test_modeling_tf_flaubert.py b/tests/flaubert/test_modeling_tf_flaubert.py index 62503bac2861..86bcd6ea6484 100644 --- a/tests/flaubert/test_modeling_tf_flaubert.py +++ b/tests/flaubert/test_modeling_tf_flaubert.py @@ -19,7 +19,7 @@ from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -75,7 +75,7 @@ def __init__( def prepare_config_and_inputs(self): input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) - input_mask = ids_tensor([self.batch_size, self.seq_length], 2, dtype=tf.float32) + input_mask = random_attention_mask([self.batch_size, self.seq_length], dtype=tf.float32) input_lengths = None if self.use_input_lengths: diff --git a/tests/funnel/test_modeling_tf_funnel.py b/tests/funnel/test_modeling_tf_funnel.py index 6105f9ab8035..c3ae3788d61e 100644 --- a/tests/funnel/test_modeling_tf_funnel.py +++ b/tests/funnel/test_modeling_tf_funnel.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -111,7 +111,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/gpt2/test_modeling_tf_gpt2.py b/tests/gpt2/test_modeling_tf_gpt2.py index f94387509e6a..d6470c0d1526 100644 --- a/tests/gpt2/test_modeling_tf_gpt2.py +++ b/tests/gpt2/test_modeling_tf_gpt2.py @@ -19,7 +19,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask from ..utils.test_modeling_tf_core import TFCoreModelTesterMixin @@ -74,7 +74,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/gptj/test_modeling_tf_gptj.py b/tests/gptj/test_modeling_tf_gptj.py index 32ce3f8564b0..63feffb8c62e 100644 --- a/tests/gptj/test_modeling_tf_gptj.py +++ 
b/tests/gptj/test_modeling_tf_gptj.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow, tooslow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask from ..utils.test_modeling_tf_core import TFCoreModelTesterMixin @@ -70,7 +70,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/layoutlm/test_modeling_tf_layoutlm.py b/tests/layoutlm/test_modeling_tf_layoutlm.py index f60d0c6f91d5..90e2b4fcf169 100644 --- a/tests/layoutlm/test_modeling_tf_layoutlm.py +++ b/tests/layoutlm/test_modeling_tf_layoutlm.py @@ -21,7 +21,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -107,7 +107,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/longformer/test_modeling_tf_longformer.py b/tests/longformer/test_modeling_tf_longformer.py index 37c1ce534953..6bfa708912dd 100644 --- a/tests/longformer/test_modeling_tf_longformer.py +++ b/tests/longformer/test_modeling_tf_longformer.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_sentencepiece, require_tf, require_tokenizers, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -79,7 +79,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/lxmert/test_modeling_tf_lxmert.py b/tests/lxmert/test_modeling_tf_lxmert.py index 8d91d249d90b..63ec44a1ad90 100644 --- a/tests/lxmert/test_modeling_tf_lxmert.py +++ b/tests/lxmert/test_modeling_tf_lxmert.py @@ -23,7 +23,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -124,7 +124,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_lang_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) diff --git a/tests/mobilebert/test_modeling_tf_mobilebert.py b/tests/mobilebert/test_modeling_tf_mobilebert.py index 4cbfcefee874..c0ddf043562f 100644 --- 
a/tests/mobilebert/test_modeling_tf_mobilebert.py +++ b/tests/mobilebert/test_modeling_tf_mobilebert.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -114,7 +114,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/mpnet/test_modeling_tf_mpnet.py b/tests/mpnet/test_modeling_tf_mpnet.py index 23448610cc21..f9f9e2d51201 100644 --- a/tests/mpnet/test_modeling_tf_mpnet.py +++ b/tests/mpnet/test_modeling_tf_mpnet.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -90,7 +90,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) sequence_labels = None token_labels = None diff --git a/tests/openai/test_modeling_tf_openai.py b/tests/openai/test_modeling_tf_openai.py index 227689df59aa..f74a85ee60d6 100644 --- a/tests/openai/test_modeling_tf_openai.py +++ b/tests/openai/test_modeling_tf_openai.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -70,7 +70,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/rembert/test_modeling_tf_rembert.py b/tests/rembert/test_modeling_tf_rembert.py index f8f17f30a9dd..d5d52062e8c9 100644 --- a/tests/rembert/test_modeling_tf_rembert.py +++ b/tests/rembert/test_modeling_tf_rembert.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask if is_tf_available(): @@ -95,7 +95,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/roberta/test_modeling_tf_roberta.py b/tests/roberta/test_modeling_tf_roberta.py index fa947d64f081..9771673d8748 100644 --- a/tests/roberta/test_modeling_tf_roberta.py +++ b/tests/roberta/test_modeling_tf_roberta.py @@ -20,7 +20,7 @@ from transformers.testing_utils import 
require_sentencepiece, require_tf, require_tokenizers, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask if is_tf_available(): @@ -72,7 +72,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/roformer/test_modeling_tf_roformer.py b/tests/roformer/test_modeling_tf_roformer.py index 1f26f7e2adc6..9a23ca3b83d2 100644 --- a/tests/roformer/test_modeling_tf_roformer.py +++ b/tests/roformer/test_modeling_tf_roformer.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -95,7 +95,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = None if self.use_token_type_ids: diff --git a/tests/t5/test_modeling_tf_t5.py b/tests/t5/test_modeling_tf_t5.py index a2ea255faca5..c6585f83b18e 100644 --- a/tests/t5/test_modeling_tf_t5.py +++ b/tests/t5/test_modeling_tf_t5.py @@ -20,7 +20,7 @@ from transformers.utils import cached_property from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -58,7 +58,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_labels = None if self.use_labels: diff --git a/tests/tapas/test_modeling_tf_tapas.py b/tests/tapas/test_modeling_tf_tapas.py index 936273a6ca30..9e3cb63f70b5 100644 --- a/tests/tapas/test_modeling_tf_tapas.py +++ b/tests/tapas/test_modeling_tf_tapas.py @@ -38,7 +38,7 @@ from transformers.utils import cached_property from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -158,7 +158,7 @@ def prepare_config_and_inputs(self): input_mask = None if self.use_input_mask: - input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = [] for type_vocab_size in self.type_vocab_sizes: diff --git a/tests/test_modeling_tf_common.py b/tests/test_modeling_tf_common.py index 3d2f7976cf6c..9473a50f53aa 100644 --- a/tests/test_modeling_tf_common.py +++ b/tests/test_modeling_tf_common.py @@ -1440,7 +1440,7 @@ def ids_tensor(shape, vocab_size, rng=None, name=None, dtype=None): def random_attention_mask(shape, rng=None, name=None, dtype=None): attn_mask = ids_tensor(shape, vocab_size=2, rng=None, name=None, dtype=dtype) # make sure that at least 
one token is attended to for each batch - attn_mask = tf.concat([tf.constant(value=1, shape=(shape[0], 1), dtype=dtype), attn_mask[:, 1:]], axis=1) + attn_mask = tf.concat([attn_mask[:, :-1], tf.ones_like(attn_mask[:, -1:], dtype=dtype)], axis=-1) return attn_mask diff --git a/tests/xlm/test_modeling_tf_xlm.py b/tests/xlm/test_modeling_tf_xlm.py index 5fc4d2413f9e..412a8430ad6d 100644 --- a/tests/xlm/test_modeling_tf_xlm.py +++ b/tests/xlm/test_modeling_tf_xlm.py @@ -20,7 +20,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -75,7 +75,7 @@ def __init__( def prepare_config_and_inputs(self): input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) - input_mask = ids_tensor([self.batch_size, self.seq_length], 2, dtype=tf.float32) + input_mask = random_attention_mask([self.batch_size, self.seq_length], dtype=tf.float32) input_lengths = None if self.use_input_lengths: diff --git a/tests/xlnet/test_modeling_tf_xlnet.py b/tests/xlnet/test_modeling_tf_xlnet.py index 4b92581a0efc..8cf4ca2099bd 100644 --- a/tests/xlnet/test_modeling_tf_xlnet.py +++ b/tests/xlnet/test_modeling_tf_xlnet.py @@ -22,7 +22,7 @@ from transformers.testing_utils import require_tf, slow from ..test_configuration_common import ConfigTester -from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor +from ..test_modeling_tf_common import TFModelTesterMixin, ids_tensor, random_attention_mask if is_tf_available(): @@ -75,7 +75,7 @@ def prepare_config_and_inputs(self): input_ids_1 = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) input_ids_2 = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) segment_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size) - input_mask = ids_tensor([self.batch_size, self.seq_length], 2, dtype=tf.float32) + input_mask = random_attention_mask([self.batch_size, self.seq_length], dtype=tf.float32) input_ids_q = ids_tensor([self.batch_size, self.seq_length + 1], self.vocab_size) perm_mask = tf.zeros((self.batch_size, self.seq_length + 1, self.seq_length), dtype=tf.float32) From ecd9f353c92ac62395db9b3570c3f191b3242e40 Mon Sep 17 00:00:00 2001 From: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Date: Fri, 1 Apr 2022 17:19:36 +0200 Subject: [PATCH 12/34] Improve code example (#16450) Co-authored-by: Niels Rogge --- src/transformers/models/glpn/modeling_glpn.py | 28 +++++++++++++++---- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/src/transformers/models/glpn/modeling_glpn.py b/src/transformers/models/glpn/modeling_glpn.py index c8d6bac79b36..86e53c787572 100755 --- a/src/transformers/models/glpn/modeling_glpn.py +++ b/src/transformers/models/glpn/modeling_glpn.py @@ -708,18 +708,36 @@ def forward( ```python >>> from transformers import GLPNFeatureExtractor, GLPNForDepthEstimation + >>> import torch + >>> import numpy as np >>> from PIL import Image >>> import requests - >>> feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-kitti") - >>> model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti") - >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) + >>> feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-kitti") + >>> model = 
GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti") + + >>> # prepare image for the model >>> inputs = feature_extractor(images=image, return_tensors="pt") - >>> outputs = model(**inputs) - >>> predicted_depth = outputs.predicted_depth # shape (batch_size, height, width) + + >>> with torch.no_grad(): + ... outputs = model(**inputs) + ... predicted_depth = outputs.predicted_depth + + >>> # interpolate to original size + >>> prediction = torch.nn.functional.interpolate( + ... predicted_depth.unsqueeze(1), + ... size=image.size[::-1], + ... mode="bicubic", + ... align_corners=False, + ... ) + + >>> # visualize the prediction + >>> output = prediction.squeeze().cpu().numpy() + >>> formatted = (output * 255 / np.max(output)).astype("uint8") + >>> depth = Image.fromarray(formatted) ```""" return_dict = return_dict if return_dict is not None else self.config.use_return_dict output_hidden_states = ( From f05d235c226d0b183377c23913f2565b9322af43 Mon Sep 17 00:00:00 2001 From: Lysandre Debut Date: Fri, 1 Apr 2022 17:53:18 +0200 Subject: [PATCH 13/34] Pin tokenizers version <0.13 (#16539) * Pin tokenizers version <0.13 * Style --- setup.py | 2 +- src/transformers/dependency_versions_table.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/setup.py b/setup.py index c9455eaa901d..56ba7d4c4784 100644 --- a/setup.py +++ b/setup.py @@ -151,7 +151,7 @@ "tf2onnx", "timeout-decorator", "timm", - "tokenizers>=0.11.1,!=0.11.3", + "tokenizers>=0.11.1,!=0.11.3,<0.13", "torch>=1.0", "torchaudio", "pyctcdecode>=0.3.0", diff --git a/src/transformers/dependency_versions_table.py b/src/transformers/dependency_versions_table.py index 2ba72f5b9593..334103c20a56 100644 --- a/src/transformers/dependency_versions_table.py +++ b/src/transformers/dependency_versions_table.py @@ -61,7 +61,7 @@ "tf2onnx": "tf2onnx", "timeout-decorator": "timeout-decorator", "timm": "timm", - "tokenizers": "tokenizers>=0.11.1,!=0.11.3", + "tokenizers": "tokenizers>=0.11.1,!=0.11.3,<0.13", "torch": "torch>=1.0", "torchaudio": "torchaudio", "pyctcdecode": "pyctcdecode>=0.3.0", From 085f0f7dea866862e99915981a807e4fc9bd1bc0 Mon Sep 17 00:00:00 2001 From: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Fri, 1 Apr 2022 17:54:01 +0200 Subject: [PATCH 14/34] Add code samples for TF speech models (#16494) Co-authored-by: ydshieh --- src/transformers/utils/doc.py | 63 +++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/src/transformers/utils/doc.py b/src/transformers/utils/doc.py index f395f8d4fb80..eaf59ba50215 100644 --- a/src/transformers/utils/doc.py +++ b/src/transformers/utils/doc.py @@ -794,6 +794,67 @@ def _prepare_output_docstrings(output_type, config_class, min_indent=None): ``` """ +TF_SPEECH_BASE_MODEL_SAMPLE = r""" + Example: + + ```python + >>> from transformers import {processor_class}, {model_class} + >>> from datasets import load_dataset + + >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") + >>> dataset = dataset.sort("id") + >>> sampling_rate = dataset.features["audio"].sampling_rate + + >>> processor = {processor_class}.from_pretrained("{checkpoint}") + >>> model = {model_class}.from_pretrained("{checkpoint}") + + >>> # audio file is decoded on the fly + >>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="tf") + >>> outputs = model(**inputs) + + >>> last_hidden_states = outputs.last_hidden_state + >>> list(last_hidden_states.shape) + {expected_output} + ``` +""" + 
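These sample strings are plain format templates: `{processor_class}`, `{model_class}`, `{checkpoint}`, `{expected_output}` and `{expected_loss}` are substituted per model when the docstring is assembled (in this module that job falls to the `add_code_sample_docstrings` decorator). A rough sketch of the substitution step, using assumed class and checkpoint names purely for illustration:

```python
# Toy version of the template fill; the real plumbing lives in the decorator.
DEMO_SAMPLE = """
>>> processor = {processor_class}.from_pretrained("{checkpoint}")
>>> model = {model_class}.from_pretrained("{checkpoint}")
"""

filled = DEMO_SAMPLE.format(
    processor_class="Wav2Vec2Processor",       # assumed names for the sketch
    model_class="TFWav2Vec2ForCTC",
    checkpoint="facebook/wav2vec2-base-960h",  # example checkpoint, not prescriptive
)
print(filled)
```

Keeping `{expected_output}` and `{expected_loss}` in the template lets each model pin its own doctest values instead of hardcoding them in a shared sample.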
+TF_SPEECH_CTC_SAMPLE = r""" + Example: + + ```python + >>> from transformers import {processor_class}, {model_class} + >>> from datasets import load_dataset + >>> import tensorflow as tf + + >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") + >>> dataset = dataset.sort("id") + >>> sampling_rate = dataset.features["audio"].sampling_rate + + >>> processor = {processor_class}.from_pretrained("{checkpoint}") + >>> model = {model_class}.from_pretrained("{checkpoint}") + + >>> # audio file is decoded on the fly + >>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="tf") + >>> logits = model(**inputs).logits + >>> predicted_ids = tf.math.argmax(logits, axis=-1) + + >>> # transcribe speech + >>> transcription = processor.batch_decode(predicted_ids) + >>> transcription[0] + {expected_output} + ``` + + ```python + >>> with processor.as_target_processor(): + ... inputs["labels"] = processor(dataset[0]["text"], return_tensors="tf").input_ids + + >>> # compute loss + >>> loss = model(**inputs).loss + >>> round(float(loss), 2) + {expected_loss} + ``` +""" + TF_VISION_BASE_MODEL_SAMPLE = r""" Example: @@ -848,6 +909,8 @@ def _prepare_output_docstrings(output_type, config_class, min_indent=None): "MaskedLM": TF_MASKED_LM_SAMPLE, "LMHead": TF_CAUSAL_LM_SAMPLE, "BaseModel": TF_BASE_MODEL_SAMPLE, + "SpeechBaseModel": TF_SPEECH_BASE_MODEL_SAMPLE, + "CTC": TF_SPEECH_CTC_SAMPLE, "VisionBaseModel": TF_VISION_BASE_MODEL_SAMPLE, "ImageClassification": TF_VISION_SEQ_CLASS_SAMPLE, } From cbc776aa3d77b66c573bf4330b7afed50017e308 Mon Sep 17 00:00:00 2001 From: Patrick von Platen Date: Mon, 4 Apr 2022 13:53:54 +0200 Subject: [PATCH 15/34] [FlaxSpeechEncoderDecoder] Fix dtype bug (#16581) * [FlaxSpeechEncoderDecoder] Fix dtype bug * more fixes --- .../modeling_flax_speech_encoder_decoder.py | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/transformers/models/speech_encoder_decoder/modeling_flax_speech_encoder_decoder.py b/src/transformers/models/speech_encoder_decoder/modeling_flax_speech_encoder_decoder.py index aff3953b8407..6e36703cf9d7 100644 --- a/src/transformers/models/speech_encoder_decoder/modeling_flax_speech_encoder_decoder.py +++ b/src/transformers/models/speech_encoder_decoder/modeling_flax_speech_encoder_decoder.py @@ -310,7 +310,7 @@ def __call__( decoder_hidden_states=decoder_outputs.hidden_states, decoder_attentions=decoder_outputs.attentions, cross_attentions=decoder_outputs.cross_attentions, - encoder_last_hidden_state=encoder_outputs.last_hidden_state, + encoder_last_hidden_state=encoder_hidden_states, encoder_hidden_states=encoder_outputs.hidden_states, encoder_attentions=encoder_outputs.attentions, ) @@ -363,8 +363,8 @@ def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple) -> FrozenDic encoder_input_shape, decoder_input_shape = input_shape # init input DeviceArrays - inputs = jnp.zeros(encoder_input_shape, dtype="i4") - attention_mask = jnp.ones_like(inputs) + inputs = jnp.zeros(encoder_input_shape, dtype="f4") + attention_mask = jnp.ones_like(inputs, dtype="i4") decoder_input_ids = jnp.zeros(decoder_input_shape, dtype="i4") decoder_attention_mask = jnp.ones_like(decoder_input_ids) @@ -472,7 +472,7 @@ def encode( return_dict = return_dict if return_dict is not None else self.config.return_dict if attention_mask is None: - attention_mask = jnp.ones_like(inputs) + attention_mask = jnp.ones_like(inputs, dtype="i4") # Handle any PRNG if needed rngs = {} @@ 
-485,7 +485,7 @@ def _encoder_forward(module, inputs, attention_mask, **kwargs): outputs = self.module.apply( {"params": params or self.params}, - inputs=jnp.array(inputs, dtype="i4"), + inputs=jnp.array(inputs, dtype="f4"), attention_mask=jnp.array(attention_mask, dtype="i4"), output_attentions=output_attentions, output_hidden_states=output_hidden_states, @@ -680,7 +680,7 @@ def __call__( # prepare encoder inputs if attention_mask is None: - attention_mask = jnp.ones_like(inputs) + attention_mask = jnp.ones_like(inputs, dtype="i4") # prepare decoder inputs if decoder_input_ids is None: @@ -700,7 +700,7 @@ def __call__( return self.module.apply( {"params": params or self.params}, - inputs=jnp.array(inputs, dtype="i4"), + inputs=jnp.array(inputs, dtype="f4"), attention_mask=jnp.array(attention_mask, dtype="i4"), decoder_input_ids=jnp.array(decoder_input_ids, dtype="i4"), decoder_attention_mask=jnp.array(decoder_attention_mask, dtype="i4"), From b615e7c754cc201188d55a5e1abb74f49ea0f5d7 Mon Sep 17 00:00:00 2001 From: Nicolas Patry Date: Mon, 4 Apr 2022 14:26:23 +0200 Subject: [PATCH 16/34] Making the impossible to connect error actually report the right URL. (#16446) --- src/transformers/configuration_utils.py | 3 ++- src/transformers/feature_extraction_utils.py | 3 ++- src/transformers/modeling_flax_utils.py | 3 ++- src/transformers/modeling_tf_utils.py | 3 ++- src/transformers/modeling_utils.py | 5 +++-- 5 files changed, 11 insertions(+), 6 deletions(-) diff --git a/src/transformers/configuration_utils.py b/src/transformers/configuration_utils.py index f572cd9fd5a8..f7318bf8ab84 100755 --- a/src/transformers/configuration_utils.py +++ b/src/transformers/configuration_utils.py @@ -31,6 +31,7 @@ from .dynamic_module_utils import custom_object_save from .utils import ( CONFIG_NAME, + HUGGINGFACE_CO_RESOLVE_ENDPOINT, EntryNotFoundError, PushToHubMixin, RepositoryNotFoundError, @@ -626,7 +627,7 @@ def _get_config_dict( ) except ValueError: raise EnvironmentError( - "We couldn't connect to 'https://huggingface.co/' to load this model, couldn't find it in the cached " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, couldn't find it in the cached " f"files and it looks like {pretrained_model_name_or_path} is not the path to a directory containing a " "{configuration_file} file.\nCheckout your internet connection or see how to run the library in " "offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'." 
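The pattern repeated across the files below is the same one applied above to `configuration_utils.py`: the error message interpolates `HUGGINGFACE_CO_RESOLVE_ENDPOINT` instead of a hardcoded `https://huggingface.co/`, so anyone pointed at a mirror or private hub sees the endpoint they actually failed to reach. A sketch of the idea; the exact definition of the constant (environment override via `HF_ENDPOINT`) is an assumption here, not part of this diff:

```python
import os

# Assumed definition for the sketch: transformers lets the Hub endpoint be
# redirected, e.g. for private mirrors or the staging hub.
HUGGINGFACE_CO_RESOLVE_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")


def raise_connection_error(model_name: str):
    # Report the endpoint we actually tried instead of a hardcoded URL.
    raise EnvironmentError(
        f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, "
        f"couldn't find it in the cached files and it looks like {model_name} is not a "
        "local directory containing the expected files."
    )
```

With `HF_ENDPOINT=https://my-mirror.example.com`, the message now names the mirror, which is the URL the user actually needs to debug.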
diff --git a/src/transformers/feature_extraction_utils.py b/src/transformers/feature_extraction_utils.py index 953ef41ba7db..bb719b98f6e7 100644 --- a/src/transformers/feature_extraction_utils.py +++ b/src/transformers/feature_extraction_utils.py @@ -29,6 +29,7 @@ from .dynamic_module_utils import custom_object_save from .utils import ( FEATURE_EXTRACTOR_NAME, + HUGGINGFACE_CO_RESOLVE_ENDPOINT, EntryNotFoundError, PushToHubMixin, RepositoryNotFoundError, @@ -433,7 +434,7 @@ def get_feature_extractor_dict( ) except ValueError: raise EnvironmentError( - "We couldn't connect to 'https://huggingface.co/' to load this model, couldn't find it in the cached " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, couldn't find it in the cached " f"files and it looks like {pretrained_model_name_or_path} is not the path to a directory containing a " f"{FEATURE_EXTRACTOR_NAME} file.\nCheckout your internet connection or see how to run the library in " "offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'." diff --git a/src/transformers/modeling_flax_utils.py b/src/transformers/modeling_flax_utils.py index dd9a7dc29fd7..3ff9ae387582 100644 --- a/src/transformers/modeling_flax_utils.py +++ b/src/transformers/modeling_flax_utils.py @@ -34,6 +34,7 @@ from .modeling_flax_pytorch_utils import load_pytorch_checkpoint_in_flax_state_dict from .utils import ( FLAX_WEIGHTS_NAME, + HUGGINGFACE_CO_RESOLVE_ENDPOINT, WEIGHTS_NAME, EntryNotFoundError, PushToHubMixin, @@ -530,7 +531,7 @@ def from_pretrained( ) except ValueError: raise EnvironmentError( - "We couldn't connect to 'https://huggingface.co/' to load this model, couldn't find it in the cached " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, couldn't find it in the cached " f"files and it looks like {pretrained_model_name_or_path} is not the path to a directory " f"containing a file named {FLAX_WEIGHTS_NAME} or {WEIGHTS_NAME}.\n" "Checkout your internet connection or see how to run the library in offline mode at " diff --git a/src/transformers/modeling_tf_utils.py b/src/transformers/modeling_tf_utils.py index d46226a5a1d1..a28a09425087 100644 --- a/src/transformers/modeling_tf_utils.py +++ b/src/transformers/modeling_tf_utils.py @@ -43,6 +43,7 @@ from .tokenization_utils_base import BatchEncoding from .utils import ( DUMMY_INPUTS, + HUGGINGFACE_CO_RESOLVE_ENDPOINT, TF2_WEIGHTS_NAME, WEIGHTS_NAME, EntryNotFoundError, @@ -1685,7 +1686,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): ) except ValueError: raise EnvironmentError( - "We couldn't connect to 'https://huggingface.co/' to load this model, couldn't find it in the cached " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, couldn't find it in the cached " f"files and it looks like {pretrained_model_name_or_path} is not the path to a directory " f"containing a file named {TF2_WEIGHTS_NAME} or {WEIGHTS_NAME}.\n" "Checkout your internet connection or see how to run the library in offline mode at " diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py index 21b8f2269110..33401c3c093f 100644 --- a/src/transformers/modeling_utils.py +++ b/src/transformers/modeling_utils.py @@ -40,6 +40,7 @@ from .utils import ( DUMMY_INPUTS, FLAX_WEIGHTS_NAME, + HUGGINGFACE_CO_RESOLVE_ENDPOINT, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME, WEIGHTS_INDEX_NAME, @@ -331,7 +332,7 @@ def get_checkpoint_shard_files( ) except HTTPError: raise 
EnvironmentError( - f"We couldn't connect to 'https://huggingface.co/' to load {shard_filename}. You should try again " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load {shard_filename}. You should try again " "after checking your internet connection." ) @@ -1749,7 +1750,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P ) except ValueError: raise EnvironmentError( - "We couldn't connect to 'https://huggingface.co/' to load this model, couldn't find it in the cached " + f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this model, couldn't find it in the cached " f"files and it looks like {pretrained_model_name_or_path} is not the path to a directory " f"containing a file named {WEIGHTS_NAME}, {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or " f"{FLAX_WEIGHTS_NAME}.\n" From 0e1dc49aa3d9829ead6d71ff04d12168929f9e68 Mon Sep 17 00:00:00 2001 From: Daniel Stancl <46073029+stancld@users.noreply.github.com> Date: Mon, 4 Apr 2022 14:54:25 +0200 Subject: [PATCH 17/34] Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (#16556) --- src/transformers/models/xglm/__init__.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/transformers/models/xglm/__init__.py b/src/transformers/models/xglm/__init__.py index ddc79c678769..d5934dea6666 100644 --- a/src/transformers/models/xglm/__init__.py +++ b/src/transformers/models/xglm/__init__.py @@ -67,7 +67,7 @@ from .modeling_xglm import XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, XGLMForCausalLM, XGLMModel, XGLMPreTrainedModel if is_flax_available(): - from .modeling_xglm import FlaxXGLMForCausalLM, FlaxXGLMModel, FlaxXGLMPreTrainedModel + from .modeling_flax_xglm import FlaxXGLMForCausalLM, FlaxXGLMModel, FlaxXGLMPreTrainedModel else: From 90ea20434c62b72d0e56749ff3db8b109b805caf Mon Sep 17 00:00:00 2001 From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Date: Mon, 4 Apr 2022 10:06:57 -0400 Subject: [PATCH 18/34] Add utility to find model labels (#16526) * Add utility to find model labels * Use it in the Trainer * Update src/transformers/utils/generic.py Co-authored-by: Matt * Quality Co-authored-by: Matt --- src/transformers/trainer.py | 8 ++---- src/transformers/utils/__init__.py | 1 + src/transformers/utils/generic.py | 21 ++++++++++++++++ tests/utils/test_file_utils.py | 39 +++++++++++++++++++++++++++--- 4 files changed, 59 insertions(+), 10 deletions(-) diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py index 948697e35127..921b9d27ac08 100755 --- a/src/transformers/trainer.py +++ b/src/transformers/trainer.py @@ -67,7 +67,6 @@ from .dependency_versions_check import dep_version_check from .modelcard import TrainingSummary from .modeling_utils import PreTrainedModel, unwrap_model -from .models.auto.modeling_auto import MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES from .optimization import Adafactor, get_scheduler from .tokenization_utils_base import PreTrainedTokenizerBase from .trainer_callback import ( @@ -124,6 +123,7 @@ from .utils import ( CONFIG_NAME, WEIGHTS_NAME, + find_labels, get_full_repo_name, is_apex_available, is_datasets_available, @@ -495,11 +495,7 @@ def __init__( self.current_flos = 0 self.hp_search_backend = None self.use_tune_checkpoints = False - default_label_names = ( - ["start_positions", "end_positions"] - if type(self.model).__name__ in MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES.values() - else ["labels"] - ) + default_label_names = find_labels(self.model.__class__) self.label_names = default_label_names 
if self.args.label_names is None else self.args.label_names self.control = self.callback_handler.on_init_end(self.args, self.state, self.control) diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py index af326b53e86c..45364fb8fd33 100644 --- a/src/transformers/utils/__init__.py +++ b/src/transformers/utils/__init__.py @@ -37,6 +37,7 @@ PaddingStrategy, TensorType, cached_property, + find_labels, is_tensor, to_numpy, to_py_obj, diff --git a/src/transformers/utils/generic.py b/src/transformers/utils/generic.py index e455cdc6adb0..bea5b3dd4775 100644 --- a/src/transformers/utils/generic.py +++ b/src/transformers/utils/generic.py @@ -15,6 +15,7 @@ Generic utilities """ +import inspect from collections import OrderedDict, UserDict from contextlib import ExitStack from dataclasses import fields @@ -289,3 +290,23 @@ def __enter__(self): def __exit__(self, *args, **kwargs): self.stack.__exit__(*args, **kwargs) + + +def find_labels(model_class): + """ + Find the labels used by a given model. + + Args: + model_class (`type`): The class of the model. + """ + model_name = model_class.__name__ + if model_name.startswith("TF"): + signature = inspect.signature(model_class.call) + elif model_name.startswith("Flax"): + signature = inspect.signature(model_class.__call__) + else: + signature = inspect.signature(model_class.forward) + if "QuestionAnswering" in model_name: + return [p for p in signature.parameters if "label" in p or p in ("start_positions", "end_positions")] + else: + return [p for p in signature.parameters if "label" in p] diff --git a/tests/utils/test_file_utils.py b/tests/utils/test_file_utils.py index decc7fd17c01..75c4f19caa1d 100644 --- a/tests/utils/test_file_utils.py +++ b/tests/utils/test_file_utils.py @@ -35,10 +35,14 @@ RepositoryNotFoundError, RevisionNotFoundError, filename_to_url, + find_labels, get_file_from_repo, get_from_cache, has_file, hf_bucket_url, + is_flax_available, + is_tf_available, + is_torch_available, ) @@ -158,24 +162,51 @@ def test_get_file_from_repo_local(self): self.assertIsNone(get_file_from_repo(tmp_dir, "b.txt")) -class ContextManagerTests(unittest.TestCase): +class GenericUtilTests(unittest.TestCase): @unittest.mock.patch("sys.stdout", new_callable=io.StringIO) - def test_no_context(self, mock_stdout): + def test_context_managers_no_context(self, mock_stdout): with ContextManagers([]): print("Transformers are awesome!") # The print statement adds a new line at the end of the output self.assertEqual(mock_stdout.getvalue(), "Transformers are awesome!\n") @unittest.mock.patch("sys.stdout", new_callable=io.StringIO) - def test_one_context(self, mock_stdout): + def test_context_managers_one_context(self, mock_stdout): with ContextManagers([context_en()]): print("Transformers are awesome!") # The output should be wrapped with an English welcome and goodbye self.assertEqual(mock_stdout.getvalue(), "Welcome!\nTransformers are awesome!\nBye!\n") @unittest.mock.patch("sys.stdout", new_callable=io.StringIO) - def test_two_context(self, mock_stdout): + def test_context_managers_two_context(self, mock_stdout): with ContextManagers([context_fr(), context_en()]): print("Transformers are awesome!") # The output should be wrapped with an English and French welcome and goodbye self.assertEqual(mock_stdout.getvalue(), "Bonjour!\nWelcome!\nTransformers are awesome!\nBye!\nAu revoir!\n") + + def test_find_labels(self): + if is_torch_available(): + from transformers import BertForPreTraining, BertForQuestionAnswering, 
BertForSequenceClassification + + self.assertEqual(find_labels(BertForSequenceClassification), ["labels"]) + self.assertEqual(find_labels(BertForPreTraining), ["labels", "next_sentence_label"]) + self.assertEqual(find_labels(BertForQuestionAnswering), ["start_positions", "end_positions"]) + + if is_tf_available(): + from transformers import TFBertForPreTraining, TFBertForQuestionAnswering, TFBertForSequenceClassification + + self.assertEqual(find_labels(TFBertForSequenceClassification), ["labels"]) + self.assertEqual(find_labels(TFBertForPreTraining), ["labels", "next_sentence_label"]) + self.assertEqual(find_labels(TFBertForQuestionAnswering), ["start_positions", "end_positions"]) + + if is_flax_available(): + # Flax models don't have labels + from transformers import ( + FlaxBertForPreTraining, + FlaxBertForQuestionAnswering, + FlaxBertForSequenceClassification, + ) + + self.assertEqual(find_labels(FlaxBertForSequenceClassification), []) + self.assertEqual(find_labels(FlaxBertForPreTraining), []) + self.assertEqual(find_labels(FlaxBertForQuestionAnswering), []) From 02d49ea50275b226b50fb8f34d7bd9268c53e870 Mon Sep 17 00:00:00 2001 From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Date: Mon, 4 Apr 2022 10:25:46 -0400 Subject: [PATCH 19/34] Enable doc in Spanish (#16518) * Reorganize doc for multilingual support * Fix style * Style * Toc trees * Adapt templates --- .github/workflows/build_documentation.yml | 1 + .github/workflows/build_pr_documentation.yml | 1 + docs/source/contributing.md | 1 - docs/source/en/_config.py | 14 +++++++ docs/source/{ => en}/_toctree.yml | 0 docs/source/{ => en}/accelerate.mdx | 0 docs/source/{ => en}/add_new_model.mdx | 0 docs/source/{ => en}/add_new_pipeline.mdx | 0 docs/source/{ => en}/autoclass_tutorial.mdx | 0 docs/source/{ => en}/benchmarks.mdx | 0 docs/source/{ => en}/bertology.mdx | 0 docs/source/{ => en}/community.mdx | 0 docs/source/en/contributing.md | 1 + .../{ => en}/converting_tensorflow_models.mdx | 0 docs/source/{ => en}/create_a_model.mdx | 0 docs/source/{ => en}/custom_models.mdx | 0 docs/source/{ => en}/debugging.mdx | 0 docs/source/{ => en}/fast_tokenizers.mdx | 0 docs/source/{ => en}/glossary.mdx | 0 docs/source/{ => en}/index.mdx | 0 docs/source/{ => en}/installation.mdx | 0 docs/source/{ => en}/internal/file_utils.mdx | 0 .../{ => en}/internal/generation_utils.mdx | 0 .../{ => en}/internal/modeling_utils.mdx | 0 .../{ => en}/internal/pipelines_utils.mdx | 0 .../{ => en}/internal/tokenization_utils.mdx | 0 .../{ => en}/internal/trainer_utils.mdx | 0 .../source/{ => en}/main_classes/callback.mdx | 0 .../{ => en}/main_classes/configuration.mdx | 0 .../{ => en}/main_classes/data_collator.mdx | 0 .../{ => en}/main_classes/deepspeed.mdx | 0 .../main_classes/feature_extractor.mdx | 0 .../{ => en}/main_classes/keras_callbacks.mdx | 0 docs/source/{ => en}/main_classes/logging.mdx | 0 docs/source/{ => en}/main_classes/model.mdx | 0 docs/source/{ => en}/main_classes/onnx.mdx | 0 .../main_classes/optimizer_schedules.mdx | 0 docs/source/{ => en}/main_classes/output.mdx | 0 .../{ => en}/main_classes/pipelines.mdx | 0 .../{ => en}/main_classes/processors.mdx | 0 .../{ => en}/main_classes/text_generation.mdx | 0 .../{ => en}/main_classes/tokenizer.mdx | 0 docs/source/{ => en}/main_classes/trainer.mdx | 0 docs/source/{ => en}/migration.mdx | 0 docs/source/{ => en}/model_doc/albert.mdx | 0 docs/source/{ => en}/model_doc/auto.mdx | 0 docs/source/{ => en}/model_doc/bart.mdx | 0 docs/source/{ => en}/model_doc/barthez.mdx | 0 docs/source/{ => 
en}/model_doc/bartpho.mdx | 0 docs/source/{ => en}/model_doc/beit.mdx | 0 .../{ => en}/model_doc/bert-generation.mdx | 0 .../{ => en}/model_doc/bert-japanese.mdx | 0 docs/source/{ => en}/model_doc/bert.mdx | 0 docs/source/{ => en}/model_doc/bertweet.mdx | 0 docs/source/{ => en}/model_doc/big_bird.mdx | 0 .../{ => en}/model_doc/bigbird_pegasus.mdx | 0 .../{ => en}/model_doc/blenderbot-small.mdx | 0 docs/source/{ => en}/model_doc/blenderbot.mdx | 0 docs/source/{ => en}/model_doc/bort.mdx | 0 docs/source/{ => en}/model_doc/byt5.mdx | 0 docs/source/{ => en}/model_doc/camembert.mdx | 0 docs/source/{ => en}/model_doc/canine.mdx | 0 docs/source/{ => en}/model_doc/clip.mdx | 0 docs/source/{ => en}/model_doc/convbert.mdx | 0 docs/source/{ => en}/model_doc/convnext.mdx | 0 docs/source/{ => en}/model_doc/cpm.mdx | 0 docs/source/{ => en}/model_doc/ctrl.mdx | 0 docs/source/{ => en}/model_doc/data2vec.mdx | 0 docs/source/{ => en}/model_doc/deberta-v2.mdx | 0 docs/source/{ => en}/model_doc/deberta.mdx | 0 .../model_doc/decision_transformer.mdx | 0 docs/source/{ => en}/model_doc/deit.mdx | 0 docs/source/{ => en}/model_doc/detr.mdx | 0 docs/source/{ => en}/model_doc/dialogpt.mdx | 0 docs/source/{ => en}/model_doc/distilbert.mdx | 0 docs/source/{ => en}/model_doc/dit.mdx | 0 docs/source/{ => en}/model_doc/dpr.mdx | 0 docs/source/{ => en}/model_doc/dpt.mdx | 0 docs/source/{ => en}/model_doc/electra.mdx | 0 .../{ => en}/model_doc/encoder-decoder.mdx | 0 docs/source/{ => en}/model_doc/flaubert.mdx | 0 docs/source/{ => en}/model_doc/fnet.mdx | 0 docs/source/{ => en}/model_doc/fsmt.mdx | 0 docs/source/{ => en}/model_doc/funnel.mdx | 0 docs/source/{ => en}/model_doc/glpn.mdx | 0 docs/source/{ => en}/model_doc/gpt2.mdx | 0 docs/source/{ => en}/model_doc/gpt_neo.mdx | 0 docs/source/{ => en}/model_doc/gptj.mdx | 0 docs/source/{ => en}/model_doc/herbert.mdx | 0 docs/source/{ => en}/model_doc/hubert.mdx | 0 docs/source/{ => en}/model_doc/ibert.mdx | 0 docs/source/{ => en}/model_doc/imagegpt.mdx | 0 docs/source/{ => en}/model_doc/layoutlm.mdx | 0 docs/source/{ => en}/model_doc/layoutlmv2.mdx | 0 docs/source/{ => en}/model_doc/layoutxlm.mdx | 0 docs/source/{ => en}/model_doc/led.mdx | 0 docs/source/{ => en}/model_doc/longformer.mdx | 0 docs/source/{ => en}/model_doc/luke.mdx | 0 docs/source/{ => en}/model_doc/lxmert.mdx | 0 docs/source/{ => en}/model_doc/m2m_100.mdx | 0 docs/source/{ => en}/model_doc/marian.mdx | 0 docs/source/{ => en}/model_doc/maskformer.mdx | 0 docs/source/{ => en}/model_doc/mbart.mdx | 0 .../{ => en}/model_doc/megatron-bert.mdx | 0 .../{ => en}/model_doc/megatron_gpt2.mdx | 0 docs/source/{ => en}/model_doc/mluke.mdx | 0 docs/source/{ => en}/model_doc/mobilebert.mdx | 0 docs/source/{ => en}/model_doc/mpnet.mdx | 0 docs/source/{ => en}/model_doc/mt5.mdx | 0 .../{ => en}/model_doc/nystromformer.mdx | 0 docs/source/{ => en}/model_doc/openai-gpt.mdx | 0 docs/source/{ => en}/model_doc/pegasus.mdx | 0 docs/source/{ => en}/model_doc/perceiver.mdx | 0 docs/source/{ => en}/model_doc/phobert.mdx | 0 docs/source/{ => en}/model_doc/plbart.mdx | 0 docs/source/{ => en}/model_doc/poolformer.mdx | 0 docs/source/{ => en}/model_doc/prophetnet.mdx | 0 docs/source/{ => en}/model_doc/qdqbert.mdx | 0 docs/source/{ => en}/model_doc/rag.mdx | 0 docs/source/{ => en}/model_doc/realm.mdx | 0 docs/source/{ => en}/model_doc/reformer.mdx | 0 docs/source/{ => en}/model_doc/rembert.mdx | 0 docs/source/{ => en}/model_doc/resnet.mdx | 0 docs/source/{ => en}/model_doc/retribert.mdx | 0 docs/source/{ => en}/model_doc/roberta.mdx | 0 
docs/source/{ => en}/model_doc/roformer.mdx | 0 docs/source/{ => en}/model_doc/segformer.mdx | 0 docs/source/{ => en}/model_doc/sew-d.mdx | 0 docs/source/{ => en}/model_doc/sew.mdx | 0 .../model_doc/speech-encoder-decoder.mdx | 0 .../{ => en}/model_doc/speech_to_text.mdx | 0 .../{ => en}/model_doc/speech_to_text_2.mdx | 0 docs/source/{ => en}/model_doc/splinter.mdx | 0 .../source/{ => en}/model_doc/squeezebert.mdx | 0 docs/source/{ => en}/model_doc/swin.mdx | 0 docs/source/{ => en}/model_doc/t5.mdx | 0 docs/source/{ => en}/model_doc/t5v1.1.mdx | 0 docs/source/{ => en}/model_doc/tapas.mdx | 0 docs/source/{ => en}/model_doc/transfo-xl.mdx | 0 docs/source/{ => en}/model_doc/trocr.mdx | 0 .../{ => en}/model_doc/unispeech-sat.mdx | 0 docs/source/{ => en}/model_doc/unispeech.mdx | 0 docs/source/{ => en}/model_doc/van.mdx | 0 docs/source/{ => en}/model_doc/vilt.mdx | 0 .../model_doc/vision-encoder-decoder.mdx | 0 .../model_doc/vision-text-dual-encoder.mdx | 0 .../source/{ => en}/model_doc/visual_bert.mdx | 0 docs/source/{ => en}/model_doc/vit.mdx | 0 docs/source/{ => en}/model_doc/vit_mae.mdx | 0 docs/source/{ => en}/model_doc/wav2vec2.mdx | 0 .../{ => en}/model_doc/wav2vec2_phoneme.mdx | 0 docs/source/{ => en}/model_doc/wavlm.mdx | 0 docs/source/{ => en}/model_doc/xglm.mdx | 0 .../{ => en}/model_doc/xlm-prophetnet.mdx | 0 .../{ => en}/model_doc/xlm-roberta-xl.mdx | 0 .../source/{ => en}/model_doc/xlm-roberta.mdx | 0 docs/source/{ => en}/model_doc/xlm.mdx | 0 docs/source/{ => en}/model_doc/xlnet.mdx | 0 docs/source/{ => en}/model_doc/xls_r.mdx | 0 .../{ => en}/model_doc/xlsr_wav2vec2.mdx | 0 docs/source/{ => en}/model_doc/yoso.mdx | 0 docs/source/{ => en}/model_sharing.mdx | 0 docs/source/{ => en}/model_summary.mdx | 0 docs/source/{ => en}/multilingual.mdx | 0 docs/source/en/notebooks.md | 1 + docs/source/{ => en}/pad_truncation.mdx | 0 docs/source/{ => en}/parallelism.mdx | 0 docs/source/{ => en}/performance.mdx | 0 docs/source/{ => en}/perplexity.mdx | 0 docs/source/{ => en}/philosophy.mdx | 0 docs/source/{ => en}/pipeline_tutorial.mdx | 0 docs/source/{ => en}/pr_checks.mdx | 0 docs/source/{ => en}/preprocessing.mdx | 0 docs/source/{ => en}/quicktour.mdx | 0 docs/source/{ => en}/run_scripts.mdx | 0 docs/source/{ => en}/sagemaker.mdx | 0 docs/source/{ => en}/serialization.mdx | 0 docs/source/{ => en}/task_summary.mdx | 0 docs/source/{ => en}/tasks/asr.mdx | 0 .../{ => en}/tasks/audio_classification.mdx | 0 .../{ => en}/tasks/image_classification.mdx | 0 .../{ => en}/tasks/language_modeling.mdx | 0 .../source/{ => en}/tasks/multiple_choice.mdx | 0 .../{ => en}/tasks/question_answering.mdx | 0 .../tasks/sequence_classification.mdx | 0 docs/source/{ => en}/tasks/summarization.mdx | 0 .../{ => en}/tasks/token_classification.mdx | 0 docs/source/{ => en}/tasks/translation.mdx | 0 docs/source/{ => en}/testing.mdx | 0 docs/source/{ => en}/tokenizer_summary.mdx | 0 docs/source/{ => en}/training.mdx | 0 docs/source/{ => en}/troubleshooting.mdx | 0 docs/source/es/_config.py | 14 +++++++ docs/source/es/_toctree.yml | 17 +++++++++ docs/{source_es => source/es}/accelerate.mdx | 0 .../{source_es => source/es}/installation.mdx | 38 ++++++++----------- .../{source_es => source/es}/multilingual.mdx | 0 .../es}/pipeline_tutorial.mdx | 0 docs/{source_es => source/es}/quicktour.mdx | 0 docs/{source_es => source/es}/training.mdx | 0 docs/source/notebooks.md | 1 - src/transformers/commands/add_new_model.py | 2 +- .../commands/add_new_model_like.py | 4 +- utils/check_copies.py | 2 +- utils/check_repo.py | 2 +- 
utils/check_table.py | 2 +- 206 files changed, 71 insertions(+), 30 deletions(-) delete mode 120000 docs/source/contributing.md create mode 100644 docs/source/en/_config.py rename docs/source/{ => en}/_toctree.yml (100%) rename docs/source/{ => en}/accelerate.mdx (100%) rename docs/source/{ => en}/add_new_model.mdx (100%) rename docs/source/{ => en}/add_new_pipeline.mdx (100%) rename docs/source/{ => en}/autoclass_tutorial.mdx (100%) rename docs/source/{ => en}/benchmarks.mdx (100%) rename docs/source/{ => en}/bertology.mdx (100%) rename docs/source/{ => en}/community.mdx (100%) create mode 120000 docs/source/en/contributing.md rename docs/source/{ => en}/converting_tensorflow_models.mdx (100%) rename docs/source/{ => en}/create_a_model.mdx (100%) rename docs/source/{ => en}/custom_models.mdx (100%) rename docs/source/{ => en}/debugging.mdx (100%) rename docs/source/{ => en}/fast_tokenizers.mdx (100%) rename docs/source/{ => en}/glossary.mdx (100%) rename docs/source/{ => en}/index.mdx (100%) rename docs/source/{ => en}/installation.mdx (100%) rename docs/source/{ => en}/internal/file_utils.mdx (100%) rename docs/source/{ => en}/internal/generation_utils.mdx (100%) rename docs/source/{ => en}/internal/modeling_utils.mdx (100%) rename docs/source/{ => en}/internal/pipelines_utils.mdx (100%) rename docs/source/{ => en}/internal/tokenization_utils.mdx (100%) rename docs/source/{ => en}/internal/trainer_utils.mdx (100%) rename docs/source/{ => en}/main_classes/callback.mdx (100%) rename docs/source/{ => en}/main_classes/configuration.mdx (100%) rename docs/source/{ => en}/main_classes/data_collator.mdx (100%) rename docs/source/{ => en}/main_classes/deepspeed.mdx (100%) rename docs/source/{ => en}/main_classes/feature_extractor.mdx (100%) rename docs/source/{ => en}/main_classes/keras_callbacks.mdx (100%) rename docs/source/{ => en}/main_classes/logging.mdx (100%) rename docs/source/{ => en}/main_classes/model.mdx (100%) rename docs/source/{ => en}/main_classes/onnx.mdx (100%) rename docs/source/{ => en}/main_classes/optimizer_schedules.mdx (100%) rename docs/source/{ => en}/main_classes/output.mdx (100%) rename docs/source/{ => en}/main_classes/pipelines.mdx (100%) rename docs/source/{ => en}/main_classes/processors.mdx (100%) rename docs/source/{ => en}/main_classes/text_generation.mdx (100%) rename docs/source/{ => en}/main_classes/tokenizer.mdx (100%) rename docs/source/{ => en}/main_classes/trainer.mdx (100%) rename docs/source/{ => en}/migration.mdx (100%) rename docs/source/{ => en}/model_doc/albert.mdx (100%) rename docs/source/{ => en}/model_doc/auto.mdx (100%) rename docs/source/{ => en}/model_doc/bart.mdx (100%) rename docs/source/{ => en}/model_doc/barthez.mdx (100%) rename docs/source/{ => en}/model_doc/bartpho.mdx (100%) rename docs/source/{ => en}/model_doc/beit.mdx (100%) rename docs/source/{ => en}/model_doc/bert-generation.mdx (100%) rename docs/source/{ => en}/model_doc/bert-japanese.mdx (100%) rename docs/source/{ => en}/model_doc/bert.mdx (100%) rename docs/source/{ => en}/model_doc/bertweet.mdx (100%) rename docs/source/{ => en}/model_doc/big_bird.mdx (100%) rename docs/source/{ => en}/model_doc/bigbird_pegasus.mdx (100%) rename docs/source/{ => en}/model_doc/blenderbot-small.mdx (100%) rename docs/source/{ => en}/model_doc/blenderbot.mdx (100%) rename docs/source/{ => en}/model_doc/bort.mdx (100%) rename docs/source/{ => en}/model_doc/byt5.mdx (100%) rename docs/source/{ => en}/model_doc/camembert.mdx (100%) rename docs/source/{ => en}/model_doc/canine.mdx (100%) rename 
docs/source/{ => en}/model_doc/clip.mdx (100%) rename docs/source/{ => en}/model_doc/convbert.mdx (100%) rename docs/source/{ => en}/model_doc/convnext.mdx (100%) rename docs/source/{ => en}/model_doc/cpm.mdx (100%) rename docs/source/{ => en}/model_doc/ctrl.mdx (100%) rename docs/source/{ => en}/model_doc/data2vec.mdx (100%) rename docs/source/{ => en}/model_doc/deberta-v2.mdx (100%) rename docs/source/{ => en}/model_doc/deberta.mdx (100%) rename docs/source/{ => en}/model_doc/decision_transformer.mdx (100%) rename docs/source/{ => en}/model_doc/deit.mdx (100%) rename docs/source/{ => en}/model_doc/detr.mdx (100%) rename docs/source/{ => en}/model_doc/dialogpt.mdx (100%) rename docs/source/{ => en}/model_doc/distilbert.mdx (100%) rename docs/source/{ => en}/model_doc/dit.mdx (100%) rename docs/source/{ => en}/model_doc/dpr.mdx (100%) rename docs/source/{ => en}/model_doc/dpt.mdx (100%) rename docs/source/{ => en}/model_doc/electra.mdx (100%) rename docs/source/{ => en}/model_doc/encoder-decoder.mdx (100%) rename docs/source/{ => en}/model_doc/flaubert.mdx (100%) rename docs/source/{ => en}/model_doc/fnet.mdx (100%) rename docs/source/{ => en}/model_doc/fsmt.mdx (100%) rename docs/source/{ => en}/model_doc/funnel.mdx (100%) rename docs/source/{ => en}/model_doc/glpn.mdx (100%) rename docs/source/{ => en}/model_doc/gpt2.mdx (100%) rename docs/source/{ => en}/model_doc/gpt_neo.mdx (100%) rename docs/source/{ => en}/model_doc/gptj.mdx (100%) rename docs/source/{ => en}/model_doc/herbert.mdx (100%) rename docs/source/{ => en}/model_doc/hubert.mdx (100%) rename docs/source/{ => en}/model_doc/ibert.mdx (100%) rename docs/source/{ => en}/model_doc/imagegpt.mdx (100%) rename docs/source/{ => en}/model_doc/layoutlm.mdx (100%) rename docs/source/{ => en}/model_doc/layoutlmv2.mdx (100%) rename docs/source/{ => en}/model_doc/layoutxlm.mdx (100%) rename docs/source/{ => en}/model_doc/led.mdx (100%) rename docs/source/{ => en}/model_doc/longformer.mdx (100%) rename docs/source/{ => en}/model_doc/luke.mdx (100%) rename docs/source/{ => en}/model_doc/lxmert.mdx (100%) rename docs/source/{ => en}/model_doc/m2m_100.mdx (100%) rename docs/source/{ => en}/model_doc/marian.mdx (100%) rename docs/source/{ => en}/model_doc/maskformer.mdx (100%) rename docs/source/{ => en}/model_doc/mbart.mdx (100%) rename docs/source/{ => en}/model_doc/megatron-bert.mdx (100%) rename docs/source/{ => en}/model_doc/megatron_gpt2.mdx (100%) rename docs/source/{ => en}/model_doc/mluke.mdx (100%) rename docs/source/{ => en}/model_doc/mobilebert.mdx (100%) rename docs/source/{ => en}/model_doc/mpnet.mdx (100%) rename docs/source/{ => en}/model_doc/mt5.mdx (100%) rename docs/source/{ => en}/model_doc/nystromformer.mdx (100%) rename docs/source/{ => en}/model_doc/openai-gpt.mdx (100%) rename docs/source/{ => en}/model_doc/pegasus.mdx (100%) rename docs/source/{ => en}/model_doc/perceiver.mdx (100%) rename docs/source/{ => en}/model_doc/phobert.mdx (100%) rename docs/source/{ => en}/model_doc/plbart.mdx (100%) rename docs/source/{ => en}/model_doc/poolformer.mdx (100%) rename docs/source/{ => en}/model_doc/prophetnet.mdx (100%) rename docs/source/{ => en}/model_doc/qdqbert.mdx (100%) rename docs/source/{ => en}/model_doc/rag.mdx (100%) rename docs/source/{ => en}/model_doc/realm.mdx (100%) rename docs/source/{ => en}/model_doc/reformer.mdx (100%) rename docs/source/{ => en}/model_doc/rembert.mdx (100%) rename docs/source/{ => en}/model_doc/resnet.mdx (100%) rename docs/source/{ => en}/model_doc/retribert.mdx (100%) rename docs/source/{ 
=> en}/model_doc/roberta.mdx (100%) rename docs/source/{ => en}/model_doc/roformer.mdx (100%) rename docs/source/{ => en}/model_doc/segformer.mdx (100%) rename docs/source/{ => en}/model_doc/sew-d.mdx (100%) rename docs/source/{ => en}/model_doc/sew.mdx (100%) rename docs/source/{ => en}/model_doc/speech-encoder-decoder.mdx (100%) rename docs/source/{ => en}/model_doc/speech_to_text.mdx (100%) rename docs/source/{ => en}/model_doc/speech_to_text_2.mdx (100%) rename docs/source/{ => en}/model_doc/splinter.mdx (100%) rename docs/source/{ => en}/model_doc/squeezebert.mdx (100%) rename docs/source/{ => en}/model_doc/swin.mdx (100%) rename docs/source/{ => en}/model_doc/t5.mdx (100%) rename docs/source/{ => en}/model_doc/t5v1.1.mdx (100%) rename docs/source/{ => en}/model_doc/tapas.mdx (100%) rename docs/source/{ => en}/model_doc/transfo-xl.mdx (100%) rename docs/source/{ => en}/model_doc/trocr.mdx (100%) rename docs/source/{ => en}/model_doc/unispeech-sat.mdx (100%) rename docs/source/{ => en}/model_doc/unispeech.mdx (100%) rename docs/source/{ => en}/model_doc/van.mdx (100%) rename docs/source/{ => en}/model_doc/vilt.mdx (100%) rename docs/source/{ => en}/model_doc/vision-encoder-decoder.mdx (100%) rename docs/source/{ => en}/model_doc/vision-text-dual-encoder.mdx (100%) rename docs/source/{ => en}/model_doc/visual_bert.mdx (100%) rename docs/source/{ => en}/model_doc/vit.mdx (100%) rename docs/source/{ => en}/model_doc/vit_mae.mdx (100%) rename docs/source/{ => en}/model_doc/wav2vec2.mdx (100%) rename docs/source/{ => en}/model_doc/wav2vec2_phoneme.mdx (100%) rename docs/source/{ => en}/model_doc/wavlm.mdx (100%) rename docs/source/{ => en}/model_doc/xglm.mdx (100%) rename docs/source/{ => en}/model_doc/xlm-prophetnet.mdx (100%) rename docs/source/{ => en}/model_doc/xlm-roberta-xl.mdx (100%) rename docs/source/{ => en}/model_doc/xlm-roberta.mdx (100%) rename docs/source/{ => en}/model_doc/xlm.mdx (100%) rename docs/source/{ => en}/model_doc/xlnet.mdx (100%) rename docs/source/{ => en}/model_doc/xls_r.mdx (100%) rename docs/source/{ => en}/model_doc/xlsr_wav2vec2.mdx (100%) rename docs/source/{ => en}/model_doc/yoso.mdx (100%) rename docs/source/{ => en}/model_sharing.mdx (100%) rename docs/source/{ => en}/model_summary.mdx (100%) rename docs/source/{ => en}/multilingual.mdx (100%) create mode 120000 docs/source/en/notebooks.md rename docs/source/{ => en}/pad_truncation.mdx (100%) rename docs/source/{ => en}/parallelism.mdx (100%) rename docs/source/{ => en}/performance.mdx (100%) rename docs/source/{ => en}/perplexity.mdx (100%) rename docs/source/{ => en}/philosophy.mdx (100%) rename docs/source/{ => en}/pipeline_tutorial.mdx (100%) rename docs/source/{ => en}/pr_checks.mdx (100%) rename docs/source/{ => en}/preprocessing.mdx (100%) rename docs/source/{ => en}/quicktour.mdx (100%) rename docs/source/{ => en}/run_scripts.mdx (100%) rename docs/source/{ => en}/sagemaker.mdx (100%) rename docs/source/{ => en}/serialization.mdx (100%) rename docs/source/{ => en}/task_summary.mdx (100%) rename docs/source/{ => en}/tasks/asr.mdx (100%) rename docs/source/{ => en}/tasks/audio_classification.mdx (100%) rename docs/source/{ => en}/tasks/image_classification.mdx (100%) rename docs/source/{ => en}/tasks/language_modeling.mdx (100%) rename docs/source/{ => en}/tasks/multiple_choice.mdx (100%) rename docs/source/{ => en}/tasks/question_answering.mdx (100%) rename docs/source/{ => en}/tasks/sequence_classification.mdx (100%) rename docs/source/{ => en}/tasks/summarization.mdx (100%) rename docs/source/{ 
=> en}/tasks/token_classification.mdx (100%) rename docs/source/{ => en}/tasks/translation.mdx (100%) rename docs/source/{ => en}/testing.mdx (100%) rename docs/source/{ => en}/tokenizer_summary.mdx (100%) rename docs/source/{ => en}/training.mdx (100%) rename docs/source/{ => en}/troubleshooting.mdx (100%) create mode 100644 docs/source/es/_config.py create mode 100644 docs/source/es/_toctree.yml rename docs/{source_es => source/es}/accelerate.mdx (100%) rename docs/{source_es => source/es}/installation.mdx (92%) rename docs/{source_es => source/es}/multilingual.mdx (100%) rename docs/{source_es => source/es}/pipeline_tutorial.mdx (100%) rename docs/{source_es => source/es}/quicktour.mdx (100%) rename docs/{source_es => source/es}/training.mdx (100%) delete mode 120000 docs/source/notebooks.md diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index 4d02ef020cf5..f69edb4e897f 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -15,5 +15,6 @@ jobs: commit_sha: ${{ github.sha }} package: transformers notebook_folder: transformers_doc + languages: en es secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 2225b9cb7083..95bce32bbac0 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -14,3 +14,4 @@ jobs: commit_sha: ${{ github.event.pull_request.head.sha }} pr_number: ${{ github.event.number }} package: transformers + languages: en es diff --git a/docs/source/contributing.md b/docs/source/contributing.md deleted file mode 120000 index f939e75f21a8..000000000000 --- a/docs/source/contributing.md +++ /dev/null @@ -1 +0,0 @@ -../../CONTRIBUTING.md \ No newline at end of file diff --git a/docs/source/en/_config.py b/docs/source/en/_config.py new file mode 100644 index 000000000000..cd76263e9a5c --- /dev/null +++ b/docs/source/en/_config.py @@ -0,0 +1,14 @@ +# docstyle-ignore +INSTALL_CONTENT = """ +# Transformers installation +! pip install transformers datasets +# To install from source instead of the last release, comment the command above and uncomment the following one. +# ! 
pip install git+https://github.com/huggingface/transformers.git +""" + +notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}] +black_avoid_patterns = { + "{processor_class}": "FakeProcessorClass", + "{model_class}": "FakeModelClass", + "{object_class}": "FakeObjectClass", +} diff --git a/docs/source/_toctree.yml b/docs/source/en/_toctree.yml similarity index 100% rename from docs/source/_toctree.yml rename to docs/source/en/_toctree.yml diff --git a/docs/source/accelerate.mdx b/docs/source/en/accelerate.mdx similarity index 100% rename from docs/source/accelerate.mdx rename to docs/source/en/accelerate.mdx diff --git a/docs/source/add_new_model.mdx b/docs/source/en/add_new_model.mdx similarity index 100% rename from docs/source/add_new_model.mdx rename to docs/source/en/add_new_model.mdx diff --git a/docs/source/add_new_pipeline.mdx b/docs/source/en/add_new_pipeline.mdx similarity index 100% rename from docs/source/add_new_pipeline.mdx rename to docs/source/en/add_new_pipeline.mdx diff --git a/docs/source/autoclass_tutorial.mdx b/docs/source/en/autoclass_tutorial.mdx similarity index 100% rename from docs/source/autoclass_tutorial.mdx rename to docs/source/en/autoclass_tutorial.mdx diff --git a/docs/source/benchmarks.mdx b/docs/source/en/benchmarks.mdx similarity index 100% rename from docs/source/benchmarks.mdx rename to docs/source/en/benchmarks.mdx diff --git a/docs/source/bertology.mdx b/docs/source/en/bertology.mdx similarity index 100% rename from docs/source/bertology.mdx rename to docs/source/en/bertology.mdx diff --git a/docs/source/community.mdx b/docs/source/en/community.mdx similarity index 100% rename from docs/source/community.mdx rename to docs/source/en/community.mdx diff --git a/docs/source/en/contributing.md b/docs/source/en/contributing.md new file mode 120000 index 000000000000..c97564d93a7f --- /dev/null +++ b/docs/source/en/contributing.md @@ -0,0 +1 @@ +../../../CONTRIBUTING.md \ No newline at end of file diff --git a/docs/source/converting_tensorflow_models.mdx b/docs/source/en/converting_tensorflow_models.mdx similarity index 100% rename from docs/source/converting_tensorflow_models.mdx rename to docs/source/en/converting_tensorflow_models.mdx diff --git a/docs/source/create_a_model.mdx b/docs/source/en/create_a_model.mdx similarity index 100% rename from docs/source/create_a_model.mdx rename to docs/source/en/create_a_model.mdx diff --git a/docs/source/custom_models.mdx b/docs/source/en/custom_models.mdx similarity index 100% rename from docs/source/custom_models.mdx rename to docs/source/en/custom_models.mdx diff --git a/docs/source/debugging.mdx b/docs/source/en/debugging.mdx similarity index 100% rename from docs/source/debugging.mdx rename to docs/source/en/debugging.mdx diff --git a/docs/source/fast_tokenizers.mdx b/docs/source/en/fast_tokenizers.mdx similarity index 100% rename from docs/source/fast_tokenizers.mdx rename to docs/source/en/fast_tokenizers.mdx diff --git a/docs/source/glossary.mdx b/docs/source/en/glossary.mdx similarity index 100% rename from docs/source/glossary.mdx rename to docs/source/en/glossary.mdx diff --git a/docs/source/index.mdx b/docs/source/en/index.mdx similarity index 100% rename from docs/source/index.mdx rename to docs/source/en/index.mdx diff --git a/docs/source/installation.mdx b/docs/source/en/installation.mdx similarity index 100% rename from docs/source/installation.mdx rename to docs/source/en/installation.mdx diff --git a/docs/source/internal/file_utils.mdx b/docs/source/en/internal/file_utils.mdx 
similarity index 100% rename from docs/source/internal/file_utils.mdx rename to docs/source/en/internal/file_utils.mdx diff --git a/docs/source/internal/generation_utils.mdx b/docs/source/en/internal/generation_utils.mdx similarity index 100% rename from docs/source/internal/generation_utils.mdx rename to docs/source/en/internal/generation_utils.mdx diff --git a/docs/source/internal/modeling_utils.mdx b/docs/source/en/internal/modeling_utils.mdx similarity index 100% rename from docs/source/internal/modeling_utils.mdx rename to docs/source/en/internal/modeling_utils.mdx diff --git a/docs/source/internal/pipelines_utils.mdx b/docs/source/en/internal/pipelines_utils.mdx similarity index 100% rename from docs/source/internal/pipelines_utils.mdx rename to docs/source/en/internal/pipelines_utils.mdx diff --git a/docs/source/internal/tokenization_utils.mdx b/docs/source/en/internal/tokenization_utils.mdx similarity index 100% rename from docs/source/internal/tokenization_utils.mdx rename to docs/source/en/internal/tokenization_utils.mdx diff --git a/docs/source/internal/trainer_utils.mdx b/docs/source/en/internal/trainer_utils.mdx similarity index 100% rename from docs/source/internal/trainer_utils.mdx rename to docs/source/en/internal/trainer_utils.mdx diff --git a/docs/source/main_classes/callback.mdx b/docs/source/en/main_classes/callback.mdx similarity index 100% rename from docs/source/main_classes/callback.mdx rename to docs/source/en/main_classes/callback.mdx diff --git a/docs/source/main_classes/configuration.mdx b/docs/source/en/main_classes/configuration.mdx similarity index 100% rename from docs/source/main_classes/configuration.mdx rename to docs/source/en/main_classes/configuration.mdx diff --git a/docs/source/main_classes/data_collator.mdx b/docs/source/en/main_classes/data_collator.mdx similarity index 100% rename from docs/source/main_classes/data_collator.mdx rename to docs/source/en/main_classes/data_collator.mdx diff --git a/docs/source/main_classes/deepspeed.mdx b/docs/source/en/main_classes/deepspeed.mdx similarity index 100% rename from docs/source/main_classes/deepspeed.mdx rename to docs/source/en/main_classes/deepspeed.mdx diff --git a/docs/source/main_classes/feature_extractor.mdx b/docs/source/en/main_classes/feature_extractor.mdx similarity index 100% rename from docs/source/main_classes/feature_extractor.mdx rename to docs/source/en/main_classes/feature_extractor.mdx diff --git a/docs/source/main_classes/keras_callbacks.mdx b/docs/source/en/main_classes/keras_callbacks.mdx similarity index 100% rename from docs/source/main_classes/keras_callbacks.mdx rename to docs/source/en/main_classes/keras_callbacks.mdx diff --git a/docs/source/main_classes/logging.mdx b/docs/source/en/main_classes/logging.mdx similarity index 100% rename from docs/source/main_classes/logging.mdx rename to docs/source/en/main_classes/logging.mdx diff --git a/docs/source/main_classes/model.mdx b/docs/source/en/main_classes/model.mdx similarity index 100% rename from docs/source/main_classes/model.mdx rename to docs/source/en/main_classes/model.mdx diff --git a/docs/source/main_classes/onnx.mdx b/docs/source/en/main_classes/onnx.mdx similarity index 100% rename from docs/source/main_classes/onnx.mdx rename to docs/source/en/main_classes/onnx.mdx diff --git a/docs/source/main_classes/optimizer_schedules.mdx b/docs/source/en/main_classes/optimizer_schedules.mdx similarity index 100% rename from docs/source/main_classes/optimizer_schedules.mdx rename to 
docs/source/en/main_classes/optimizer_schedules.mdx diff --git a/docs/source/main_classes/output.mdx b/docs/source/en/main_classes/output.mdx similarity index 100% rename from docs/source/main_classes/output.mdx rename to docs/source/en/main_classes/output.mdx diff --git a/docs/source/main_classes/pipelines.mdx b/docs/source/en/main_classes/pipelines.mdx similarity index 100% rename from docs/source/main_classes/pipelines.mdx rename to docs/source/en/main_classes/pipelines.mdx diff --git a/docs/source/main_classes/processors.mdx b/docs/source/en/main_classes/processors.mdx similarity index 100% rename from docs/source/main_classes/processors.mdx rename to docs/source/en/main_classes/processors.mdx diff --git a/docs/source/main_classes/text_generation.mdx b/docs/source/en/main_classes/text_generation.mdx similarity index 100% rename from docs/source/main_classes/text_generation.mdx rename to docs/source/en/main_classes/text_generation.mdx diff --git a/docs/source/main_classes/tokenizer.mdx b/docs/source/en/main_classes/tokenizer.mdx similarity index 100% rename from docs/source/main_classes/tokenizer.mdx rename to docs/source/en/main_classes/tokenizer.mdx diff --git a/docs/source/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx similarity index 100% rename from docs/source/main_classes/trainer.mdx rename to docs/source/en/main_classes/trainer.mdx diff --git a/docs/source/migration.mdx b/docs/source/en/migration.mdx similarity index 100% rename from docs/source/migration.mdx rename to docs/source/en/migration.mdx diff --git a/docs/source/model_doc/albert.mdx b/docs/source/en/model_doc/albert.mdx similarity index 100% rename from docs/source/model_doc/albert.mdx rename to docs/source/en/model_doc/albert.mdx diff --git a/docs/source/model_doc/auto.mdx b/docs/source/en/model_doc/auto.mdx similarity index 100% rename from docs/source/model_doc/auto.mdx rename to docs/source/en/model_doc/auto.mdx diff --git a/docs/source/model_doc/bart.mdx b/docs/source/en/model_doc/bart.mdx similarity index 100% rename from docs/source/model_doc/bart.mdx rename to docs/source/en/model_doc/bart.mdx diff --git a/docs/source/model_doc/barthez.mdx b/docs/source/en/model_doc/barthez.mdx similarity index 100% rename from docs/source/model_doc/barthez.mdx rename to docs/source/en/model_doc/barthez.mdx diff --git a/docs/source/model_doc/bartpho.mdx b/docs/source/en/model_doc/bartpho.mdx similarity index 100% rename from docs/source/model_doc/bartpho.mdx rename to docs/source/en/model_doc/bartpho.mdx diff --git a/docs/source/model_doc/beit.mdx b/docs/source/en/model_doc/beit.mdx similarity index 100% rename from docs/source/model_doc/beit.mdx rename to docs/source/en/model_doc/beit.mdx diff --git a/docs/source/model_doc/bert-generation.mdx b/docs/source/en/model_doc/bert-generation.mdx similarity index 100% rename from docs/source/model_doc/bert-generation.mdx rename to docs/source/en/model_doc/bert-generation.mdx diff --git a/docs/source/model_doc/bert-japanese.mdx b/docs/source/en/model_doc/bert-japanese.mdx similarity index 100% rename from docs/source/model_doc/bert-japanese.mdx rename to docs/source/en/model_doc/bert-japanese.mdx diff --git a/docs/source/model_doc/bert.mdx b/docs/source/en/model_doc/bert.mdx similarity index 100% rename from docs/source/model_doc/bert.mdx rename to docs/source/en/model_doc/bert.mdx diff --git a/docs/source/model_doc/bertweet.mdx b/docs/source/en/model_doc/bertweet.mdx similarity index 100% rename from docs/source/model_doc/bertweet.mdx rename to 
docs/source/en/model_doc/bertweet.mdx diff --git a/docs/source/model_doc/big_bird.mdx b/docs/source/en/model_doc/big_bird.mdx similarity index 100% rename from docs/source/model_doc/big_bird.mdx rename to docs/source/en/model_doc/big_bird.mdx diff --git a/docs/source/model_doc/bigbird_pegasus.mdx b/docs/source/en/model_doc/bigbird_pegasus.mdx similarity index 100% rename from docs/source/model_doc/bigbird_pegasus.mdx rename to docs/source/en/model_doc/bigbird_pegasus.mdx diff --git a/docs/source/model_doc/blenderbot-small.mdx b/docs/source/en/model_doc/blenderbot-small.mdx similarity index 100% rename from docs/source/model_doc/blenderbot-small.mdx rename to docs/source/en/model_doc/blenderbot-small.mdx diff --git a/docs/source/model_doc/blenderbot.mdx b/docs/source/en/model_doc/blenderbot.mdx similarity index 100% rename from docs/source/model_doc/blenderbot.mdx rename to docs/source/en/model_doc/blenderbot.mdx diff --git a/docs/source/model_doc/bort.mdx b/docs/source/en/model_doc/bort.mdx similarity index 100% rename from docs/source/model_doc/bort.mdx rename to docs/source/en/model_doc/bort.mdx diff --git a/docs/source/model_doc/byt5.mdx b/docs/source/en/model_doc/byt5.mdx similarity index 100% rename from docs/source/model_doc/byt5.mdx rename to docs/source/en/model_doc/byt5.mdx diff --git a/docs/source/model_doc/camembert.mdx b/docs/source/en/model_doc/camembert.mdx similarity index 100% rename from docs/source/model_doc/camembert.mdx rename to docs/source/en/model_doc/camembert.mdx diff --git a/docs/source/model_doc/canine.mdx b/docs/source/en/model_doc/canine.mdx similarity index 100% rename from docs/source/model_doc/canine.mdx rename to docs/source/en/model_doc/canine.mdx diff --git a/docs/source/model_doc/clip.mdx b/docs/source/en/model_doc/clip.mdx similarity index 100% rename from docs/source/model_doc/clip.mdx rename to docs/source/en/model_doc/clip.mdx diff --git a/docs/source/model_doc/convbert.mdx b/docs/source/en/model_doc/convbert.mdx similarity index 100% rename from docs/source/model_doc/convbert.mdx rename to docs/source/en/model_doc/convbert.mdx diff --git a/docs/source/model_doc/convnext.mdx b/docs/source/en/model_doc/convnext.mdx similarity index 100% rename from docs/source/model_doc/convnext.mdx rename to docs/source/en/model_doc/convnext.mdx diff --git a/docs/source/model_doc/cpm.mdx b/docs/source/en/model_doc/cpm.mdx similarity index 100% rename from docs/source/model_doc/cpm.mdx rename to docs/source/en/model_doc/cpm.mdx diff --git a/docs/source/model_doc/ctrl.mdx b/docs/source/en/model_doc/ctrl.mdx similarity index 100% rename from docs/source/model_doc/ctrl.mdx rename to docs/source/en/model_doc/ctrl.mdx diff --git a/docs/source/model_doc/data2vec.mdx b/docs/source/en/model_doc/data2vec.mdx similarity index 100% rename from docs/source/model_doc/data2vec.mdx rename to docs/source/en/model_doc/data2vec.mdx diff --git a/docs/source/model_doc/deberta-v2.mdx b/docs/source/en/model_doc/deberta-v2.mdx similarity index 100% rename from docs/source/model_doc/deberta-v2.mdx rename to docs/source/en/model_doc/deberta-v2.mdx diff --git a/docs/source/model_doc/deberta.mdx b/docs/source/en/model_doc/deberta.mdx similarity index 100% rename from docs/source/model_doc/deberta.mdx rename to docs/source/en/model_doc/deberta.mdx diff --git a/docs/source/model_doc/decision_transformer.mdx b/docs/source/en/model_doc/decision_transformer.mdx similarity index 100% rename from docs/source/model_doc/decision_transformer.mdx rename to docs/source/en/model_doc/decision_transformer.mdx 
diff --git a/docs/source/model_doc/deit.mdx b/docs/source/en/model_doc/deit.mdx similarity index 100% rename from docs/source/model_doc/deit.mdx rename to docs/source/en/model_doc/deit.mdx diff --git a/docs/source/model_doc/detr.mdx b/docs/source/en/model_doc/detr.mdx similarity index 100% rename from docs/source/model_doc/detr.mdx rename to docs/source/en/model_doc/detr.mdx diff --git a/docs/source/model_doc/dialogpt.mdx b/docs/source/en/model_doc/dialogpt.mdx similarity index 100% rename from docs/source/model_doc/dialogpt.mdx rename to docs/source/en/model_doc/dialogpt.mdx diff --git a/docs/source/model_doc/distilbert.mdx b/docs/source/en/model_doc/distilbert.mdx similarity index 100% rename from docs/source/model_doc/distilbert.mdx rename to docs/source/en/model_doc/distilbert.mdx diff --git a/docs/source/model_doc/dit.mdx b/docs/source/en/model_doc/dit.mdx similarity index 100% rename from docs/source/model_doc/dit.mdx rename to docs/source/en/model_doc/dit.mdx diff --git a/docs/source/model_doc/dpr.mdx b/docs/source/en/model_doc/dpr.mdx similarity index 100% rename from docs/source/model_doc/dpr.mdx rename to docs/source/en/model_doc/dpr.mdx diff --git a/docs/source/model_doc/dpt.mdx b/docs/source/en/model_doc/dpt.mdx similarity index 100% rename from docs/source/model_doc/dpt.mdx rename to docs/source/en/model_doc/dpt.mdx diff --git a/docs/source/model_doc/electra.mdx b/docs/source/en/model_doc/electra.mdx similarity index 100% rename from docs/source/model_doc/electra.mdx rename to docs/source/en/model_doc/electra.mdx diff --git a/docs/source/model_doc/encoder-decoder.mdx b/docs/source/en/model_doc/encoder-decoder.mdx similarity index 100% rename from docs/source/model_doc/encoder-decoder.mdx rename to docs/source/en/model_doc/encoder-decoder.mdx diff --git a/docs/source/model_doc/flaubert.mdx b/docs/source/en/model_doc/flaubert.mdx similarity index 100% rename from docs/source/model_doc/flaubert.mdx rename to docs/source/en/model_doc/flaubert.mdx diff --git a/docs/source/model_doc/fnet.mdx b/docs/source/en/model_doc/fnet.mdx similarity index 100% rename from docs/source/model_doc/fnet.mdx rename to docs/source/en/model_doc/fnet.mdx diff --git a/docs/source/model_doc/fsmt.mdx b/docs/source/en/model_doc/fsmt.mdx similarity index 100% rename from docs/source/model_doc/fsmt.mdx rename to docs/source/en/model_doc/fsmt.mdx diff --git a/docs/source/model_doc/funnel.mdx b/docs/source/en/model_doc/funnel.mdx similarity index 100% rename from docs/source/model_doc/funnel.mdx rename to docs/source/en/model_doc/funnel.mdx diff --git a/docs/source/model_doc/glpn.mdx b/docs/source/en/model_doc/glpn.mdx similarity index 100% rename from docs/source/model_doc/glpn.mdx rename to docs/source/en/model_doc/glpn.mdx diff --git a/docs/source/model_doc/gpt2.mdx b/docs/source/en/model_doc/gpt2.mdx similarity index 100% rename from docs/source/model_doc/gpt2.mdx rename to docs/source/en/model_doc/gpt2.mdx diff --git a/docs/source/model_doc/gpt_neo.mdx b/docs/source/en/model_doc/gpt_neo.mdx similarity index 100% rename from docs/source/model_doc/gpt_neo.mdx rename to docs/source/en/model_doc/gpt_neo.mdx diff --git a/docs/source/model_doc/gptj.mdx b/docs/source/en/model_doc/gptj.mdx similarity index 100% rename from docs/source/model_doc/gptj.mdx rename to docs/source/en/model_doc/gptj.mdx diff --git a/docs/source/model_doc/herbert.mdx b/docs/source/en/model_doc/herbert.mdx similarity index 100% rename from docs/source/model_doc/herbert.mdx rename to docs/source/en/model_doc/herbert.mdx diff --git 
a/docs/source/model_doc/hubert.mdx b/docs/source/en/model_doc/hubert.mdx similarity index 100% rename from docs/source/model_doc/hubert.mdx rename to docs/source/en/model_doc/hubert.mdx diff --git a/docs/source/model_doc/ibert.mdx b/docs/source/en/model_doc/ibert.mdx similarity index 100% rename from docs/source/model_doc/ibert.mdx rename to docs/source/en/model_doc/ibert.mdx diff --git a/docs/source/model_doc/imagegpt.mdx b/docs/source/en/model_doc/imagegpt.mdx similarity index 100% rename from docs/source/model_doc/imagegpt.mdx rename to docs/source/en/model_doc/imagegpt.mdx diff --git a/docs/source/model_doc/layoutlm.mdx b/docs/source/en/model_doc/layoutlm.mdx similarity index 100% rename from docs/source/model_doc/layoutlm.mdx rename to docs/source/en/model_doc/layoutlm.mdx diff --git a/docs/source/model_doc/layoutlmv2.mdx b/docs/source/en/model_doc/layoutlmv2.mdx similarity index 100% rename from docs/source/model_doc/layoutlmv2.mdx rename to docs/source/en/model_doc/layoutlmv2.mdx diff --git a/docs/source/model_doc/layoutxlm.mdx b/docs/source/en/model_doc/layoutxlm.mdx similarity index 100% rename from docs/source/model_doc/layoutxlm.mdx rename to docs/source/en/model_doc/layoutxlm.mdx diff --git a/docs/source/model_doc/led.mdx b/docs/source/en/model_doc/led.mdx similarity index 100% rename from docs/source/model_doc/led.mdx rename to docs/source/en/model_doc/led.mdx diff --git a/docs/source/model_doc/longformer.mdx b/docs/source/en/model_doc/longformer.mdx similarity index 100% rename from docs/source/model_doc/longformer.mdx rename to docs/source/en/model_doc/longformer.mdx diff --git a/docs/source/model_doc/luke.mdx b/docs/source/en/model_doc/luke.mdx similarity index 100% rename from docs/source/model_doc/luke.mdx rename to docs/source/en/model_doc/luke.mdx diff --git a/docs/source/model_doc/lxmert.mdx b/docs/source/en/model_doc/lxmert.mdx similarity index 100% rename from docs/source/model_doc/lxmert.mdx rename to docs/source/en/model_doc/lxmert.mdx diff --git a/docs/source/model_doc/m2m_100.mdx b/docs/source/en/model_doc/m2m_100.mdx similarity index 100% rename from docs/source/model_doc/m2m_100.mdx rename to docs/source/en/model_doc/m2m_100.mdx diff --git a/docs/source/model_doc/marian.mdx b/docs/source/en/model_doc/marian.mdx similarity index 100% rename from docs/source/model_doc/marian.mdx rename to docs/source/en/model_doc/marian.mdx diff --git a/docs/source/model_doc/maskformer.mdx b/docs/source/en/model_doc/maskformer.mdx similarity index 100% rename from docs/source/model_doc/maskformer.mdx rename to docs/source/en/model_doc/maskformer.mdx diff --git a/docs/source/model_doc/mbart.mdx b/docs/source/en/model_doc/mbart.mdx similarity index 100% rename from docs/source/model_doc/mbart.mdx rename to docs/source/en/model_doc/mbart.mdx diff --git a/docs/source/model_doc/megatron-bert.mdx b/docs/source/en/model_doc/megatron-bert.mdx similarity index 100% rename from docs/source/model_doc/megatron-bert.mdx rename to docs/source/en/model_doc/megatron-bert.mdx diff --git a/docs/source/model_doc/megatron_gpt2.mdx b/docs/source/en/model_doc/megatron_gpt2.mdx similarity index 100% rename from docs/source/model_doc/megatron_gpt2.mdx rename to docs/source/en/model_doc/megatron_gpt2.mdx diff --git a/docs/source/model_doc/mluke.mdx b/docs/source/en/model_doc/mluke.mdx similarity index 100% rename from docs/source/model_doc/mluke.mdx rename to docs/source/en/model_doc/mluke.mdx diff --git a/docs/source/model_doc/mobilebert.mdx b/docs/source/en/model_doc/mobilebert.mdx similarity index 
100% rename from docs/source/model_doc/mobilebert.mdx rename to docs/source/en/model_doc/mobilebert.mdx diff --git a/docs/source/model_doc/mpnet.mdx b/docs/source/en/model_doc/mpnet.mdx similarity index 100% rename from docs/source/model_doc/mpnet.mdx rename to docs/source/en/model_doc/mpnet.mdx diff --git a/docs/source/model_doc/mt5.mdx b/docs/source/en/model_doc/mt5.mdx similarity index 100% rename from docs/source/model_doc/mt5.mdx rename to docs/source/en/model_doc/mt5.mdx diff --git a/docs/source/model_doc/nystromformer.mdx b/docs/source/en/model_doc/nystromformer.mdx similarity index 100% rename from docs/source/model_doc/nystromformer.mdx rename to docs/source/en/model_doc/nystromformer.mdx diff --git a/docs/source/model_doc/openai-gpt.mdx b/docs/source/en/model_doc/openai-gpt.mdx similarity index 100% rename from docs/source/model_doc/openai-gpt.mdx rename to docs/source/en/model_doc/openai-gpt.mdx diff --git a/docs/source/model_doc/pegasus.mdx b/docs/source/en/model_doc/pegasus.mdx similarity index 100% rename from docs/source/model_doc/pegasus.mdx rename to docs/source/en/model_doc/pegasus.mdx diff --git a/docs/source/model_doc/perceiver.mdx b/docs/source/en/model_doc/perceiver.mdx similarity index 100% rename from docs/source/model_doc/perceiver.mdx rename to docs/source/en/model_doc/perceiver.mdx diff --git a/docs/source/model_doc/phobert.mdx b/docs/source/en/model_doc/phobert.mdx similarity index 100% rename from docs/source/model_doc/phobert.mdx rename to docs/source/en/model_doc/phobert.mdx diff --git a/docs/source/model_doc/plbart.mdx b/docs/source/en/model_doc/plbart.mdx similarity index 100% rename from docs/source/model_doc/plbart.mdx rename to docs/source/en/model_doc/plbart.mdx diff --git a/docs/source/model_doc/poolformer.mdx b/docs/source/en/model_doc/poolformer.mdx similarity index 100% rename from docs/source/model_doc/poolformer.mdx rename to docs/source/en/model_doc/poolformer.mdx diff --git a/docs/source/model_doc/prophetnet.mdx b/docs/source/en/model_doc/prophetnet.mdx similarity index 100% rename from docs/source/model_doc/prophetnet.mdx rename to docs/source/en/model_doc/prophetnet.mdx diff --git a/docs/source/model_doc/qdqbert.mdx b/docs/source/en/model_doc/qdqbert.mdx similarity index 100% rename from docs/source/model_doc/qdqbert.mdx rename to docs/source/en/model_doc/qdqbert.mdx diff --git a/docs/source/model_doc/rag.mdx b/docs/source/en/model_doc/rag.mdx similarity index 100% rename from docs/source/model_doc/rag.mdx rename to docs/source/en/model_doc/rag.mdx diff --git a/docs/source/model_doc/realm.mdx b/docs/source/en/model_doc/realm.mdx similarity index 100% rename from docs/source/model_doc/realm.mdx rename to docs/source/en/model_doc/realm.mdx diff --git a/docs/source/model_doc/reformer.mdx b/docs/source/en/model_doc/reformer.mdx similarity index 100% rename from docs/source/model_doc/reformer.mdx rename to docs/source/en/model_doc/reformer.mdx diff --git a/docs/source/model_doc/rembert.mdx b/docs/source/en/model_doc/rembert.mdx similarity index 100% rename from docs/source/model_doc/rembert.mdx rename to docs/source/en/model_doc/rembert.mdx diff --git a/docs/source/model_doc/resnet.mdx b/docs/source/en/model_doc/resnet.mdx similarity index 100% rename from docs/source/model_doc/resnet.mdx rename to docs/source/en/model_doc/resnet.mdx diff --git a/docs/source/model_doc/retribert.mdx b/docs/source/en/model_doc/retribert.mdx similarity index 100% rename from docs/source/model_doc/retribert.mdx rename to docs/source/en/model_doc/retribert.mdx diff 
--git a/docs/source/model_doc/roberta.mdx b/docs/source/en/model_doc/roberta.mdx similarity index 100% rename from docs/source/model_doc/roberta.mdx rename to docs/source/en/model_doc/roberta.mdx diff --git a/docs/source/model_doc/roformer.mdx b/docs/source/en/model_doc/roformer.mdx similarity index 100% rename from docs/source/model_doc/roformer.mdx rename to docs/source/en/model_doc/roformer.mdx diff --git a/docs/source/model_doc/segformer.mdx b/docs/source/en/model_doc/segformer.mdx similarity index 100% rename from docs/source/model_doc/segformer.mdx rename to docs/source/en/model_doc/segformer.mdx diff --git a/docs/source/model_doc/sew-d.mdx b/docs/source/en/model_doc/sew-d.mdx similarity index 100% rename from docs/source/model_doc/sew-d.mdx rename to docs/source/en/model_doc/sew-d.mdx diff --git a/docs/source/model_doc/sew.mdx b/docs/source/en/model_doc/sew.mdx similarity index 100% rename from docs/source/model_doc/sew.mdx rename to docs/source/en/model_doc/sew.mdx diff --git a/docs/source/model_doc/speech-encoder-decoder.mdx b/docs/source/en/model_doc/speech-encoder-decoder.mdx similarity index 100% rename from docs/source/model_doc/speech-encoder-decoder.mdx rename to docs/source/en/model_doc/speech-encoder-decoder.mdx diff --git a/docs/source/model_doc/speech_to_text.mdx b/docs/source/en/model_doc/speech_to_text.mdx similarity index 100% rename from docs/source/model_doc/speech_to_text.mdx rename to docs/source/en/model_doc/speech_to_text.mdx diff --git a/docs/source/model_doc/speech_to_text_2.mdx b/docs/source/en/model_doc/speech_to_text_2.mdx similarity index 100% rename from docs/source/model_doc/speech_to_text_2.mdx rename to docs/source/en/model_doc/speech_to_text_2.mdx diff --git a/docs/source/model_doc/splinter.mdx b/docs/source/en/model_doc/splinter.mdx similarity index 100% rename from docs/source/model_doc/splinter.mdx rename to docs/source/en/model_doc/splinter.mdx diff --git a/docs/source/model_doc/squeezebert.mdx b/docs/source/en/model_doc/squeezebert.mdx similarity index 100% rename from docs/source/model_doc/squeezebert.mdx rename to docs/source/en/model_doc/squeezebert.mdx diff --git a/docs/source/model_doc/swin.mdx b/docs/source/en/model_doc/swin.mdx similarity index 100% rename from docs/source/model_doc/swin.mdx rename to docs/source/en/model_doc/swin.mdx diff --git a/docs/source/model_doc/t5.mdx b/docs/source/en/model_doc/t5.mdx similarity index 100% rename from docs/source/model_doc/t5.mdx rename to docs/source/en/model_doc/t5.mdx diff --git a/docs/source/model_doc/t5v1.1.mdx b/docs/source/en/model_doc/t5v1.1.mdx similarity index 100% rename from docs/source/model_doc/t5v1.1.mdx rename to docs/source/en/model_doc/t5v1.1.mdx diff --git a/docs/source/model_doc/tapas.mdx b/docs/source/en/model_doc/tapas.mdx similarity index 100% rename from docs/source/model_doc/tapas.mdx rename to docs/source/en/model_doc/tapas.mdx diff --git a/docs/source/model_doc/transfo-xl.mdx b/docs/source/en/model_doc/transfo-xl.mdx similarity index 100% rename from docs/source/model_doc/transfo-xl.mdx rename to docs/source/en/model_doc/transfo-xl.mdx diff --git a/docs/source/model_doc/trocr.mdx b/docs/source/en/model_doc/trocr.mdx similarity index 100% rename from docs/source/model_doc/trocr.mdx rename to docs/source/en/model_doc/trocr.mdx diff --git a/docs/source/model_doc/unispeech-sat.mdx b/docs/source/en/model_doc/unispeech-sat.mdx similarity index 100% rename from docs/source/model_doc/unispeech-sat.mdx rename to docs/source/en/model_doc/unispeech-sat.mdx diff --git 
a/docs/source/model_doc/unispeech.mdx b/docs/source/en/model_doc/unispeech.mdx similarity index 100% rename from docs/source/model_doc/unispeech.mdx rename to docs/source/en/model_doc/unispeech.mdx diff --git a/docs/source/model_doc/van.mdx b/docs/source/en/model_doc/van.mdx similarity index 100% rename from docs/source/model_doc/van.mdx rename to docs/source/en/model_doc/van.mdx diff --git a/docs/source/model_doc/vilt.mdx b/docs/source/en/model_doc/vilt.mdx similarity index 100% rename from docs/source/model_doc/vilt.mdx rename to docs/source/en/model_doc/vilt.mdx diff --git a/docs/source/model_doc/vision-encoder-decoder.mdx b/docs/source/en/model_doc/vision-encoder-decoder.mdx similarity index 100% rename from docs/source/model_doc/vision-encoder-decoder.mdx rename to docs/source/en/model_doc/vision-encoder-decoder.mdx diff --git a/docs/source/model_doc/vision-text-dual-encoder.mdx b/docs/source/en/model_doc/vision-text-dual-encoder.mdx similarity index 100% rename from docs/source/model_doc/vision-text-dual-encoder.mdx rename to docs/source/en/model_doc/vision-text-dual-encoder.mdx diff --git a/docs/source/model_doc/visual_bert.mdx b/docs/source/en/model_doc/visual_bert.mdx similarity index 100% rename from docs/source/model_doc/visual_bert.mdx rename to docs/source/en/model_doc/visual_bert.mdx diff --git a/docs/source/model_doc/vit.mdx b/docs/source/en/model_doc/vit.mdx similarity index 100% rename from docs/source/model_doc/vit.mdx rename to docs/source/en/model_doc/vit.mdx diff --git a/docs/source/model_doc/vit_mae.mdx b/docs/source/en/model_doc/vit_mae.mdx similarity index 100% rename from docs/source/model_doc/vit_mae.mdx rename to docs/source/en/model_doc/vit_mae.mdx diff --git a/docs/source/model_doc/wav2vec2.mdx b/docs/source/en/model_doc/wav2vec2.mdx similarity index 100% rename from docs/source/model_doc/wav2vec2.mdx rename to docs/source/en/model_doc/wav2vec2.mdx diff --git a/docs/source/model_doc/wav2vec2_phoneme.mdx b/docs/source/en/model_doc/wav2vec2_phoneme.mdx similarity index 100% rename from docs/source/model_doc/wav2vec2_phoneme.mdx rename to docs/source/en/model_doc/wav2vec2_phoneme.mdx diff --git a/docs/source/model_doc/wavlm.mdx b/docs/source/en/model_doc/wavlm.mdx similarity index 100% rename from docs/source/model_doc/wavlm.mdx rename to docs/source/en/model_doc/wavlm.mdx diff --git a/docs/source/model_doc/xglm.mdx b/docs/source/en/model_doc/xglm.mdx similarity index 100% rename from docs/source/model_doc/xglm.mdx rename to docs/source/en/model_doc/xglm.mdx diff --git a/docs/source/model_doc/xlm-prophetnet.mdx b/docs/source/en/model_doc/xlm-prophetnet.mdx similarity index 100% rename from docs/source/model_doc/xlm-prophetnet.mdx rename to docs/source/en/model_doc/xlm-prophetnet.mdx diff --git a/docs/source/model_doc/xlm-roberta-xl.mdx b/docs/source/en/model_doc/xlm-roberta-xl.mdx similarity index 100% rename from docs/source/model_doc/xlm-roberta-xl.mdx rename to docs/source/en/model_doc/xlm-roberta-xl.mdx diff --git a/docs/source/model_doc/xlm-roberta.mdx b/docs/source/en/model_doc/xlm-roberta.mdx similarity index 100% rename from docs/source/model_doc/xlm-roberta.mdx rename to docs/source/en/model_doc/xlm-roberta.mdx diff --git a/docs/source/model_doc/xlm.mdx b/docs/source/en/model_doc/xlm.mdx similarity index 100% rename from docs/source/model_doc/xlm.mdx rename to docs/source/en/model_doc/xlm.mdx diff --git a/docs/source/model_doc/xlnet.mdx b/docs/source/en/model_doc/xlnet.mdx similarity index 100% rename from docs/source/model_doc/xlnet.mdx rename to 
docs/source/en/model_doc/xlnet.mdx diff --git a/docs/source/model_doc/xls_r.mdx b/docs/source/en/model_doc/xls_r.mdx similarity index 100% rename from docs/source/model_doc/xls_r.mdx rename to docs/source/en/model_doc/xls_r.mdx diff --git a/docs/source/model_doc/xlsr_wav2vec2.mdx b/docs/source/en/model_doc/xlsr_wav2vec2.mdx similarity index 100% rename from docs/source/model_doc/xlsr_wav2vec2.mdx rename to docs/source/en/model_doc/xlsr_wav2vec2.mdx diff --git a/docs/source/model_doc/yoso.mdx b/docs/source/en/model_doc/yoso.mdx similarity index 100% rename from docs/source/model_doc/yoso.mdx rename to docs/source/en/model_doc/yoso.mdx diff --git a/docs/source/model_sharing.mdx b/docs/source/en/model_sharing.mdx similarity index 100% rename from docs/source/model_sharing.mdx rename to docs/source/en/model_sharing.mdx diff --git a/docs/source/model_summary.mdx b/docs/source/en/model_summary.mdx similarity index 100% rename from docs/source/model_summary.mdx rename to docs/source/en/model_summary.mdx diff --git a/docs/source/multilingual.mdx b/docs/source/en/multilingual.mdx similarity index 100% rename from docs/source/multilingual.mdx rename to docs/source/en/multilingual.mdx diff --git a/docs/source/en/notebooks.md b/docs/source/en/notebooks.md new file mode 120000 index 000000000000..10fb7a7b979a --- /dev/null +++ b/docs/source/en/notebooks.md @@ -0,0 +1 @@ +../../../notebooks/README.md \ No newline at end of file diff --git a/docs/source/pad_truncation.mdx b/docs/source/en/pad_truncation.mdx similarity index 100% rename from docs/source/pad_truncation.mdx rename to docs/source/en/pad_truncation.mdx diff --git a/docs/source/parallelism.mdx b/docs/source/en/parallelism.mdx similarity index 100% rename from docs/source/parallelism.mdx rename to docs/source/en/parallelism.mdx diff --git a/docs/source/performance.mdx b/docs/source/en/performance.mdx similarity index 100% rename from docs/source/performance.mdx rename to docs/source/en/performance.mdx diff --git a/docs/source/perplexity.mdx b/docs/source/en/perplexity.mdx similarity index 100% rename from docs/source/perplexity.mdx rename to docs/source/en/perplexity.mdx diff --git a/docs/source/philosophy.mdx b/docs/source/en/philosophy.mdx similarity index 100% rename from docs/source/philosophy.mdx rename to docs/source/en/philosophy.mdx diff --git a/docs/source/pipeline_tutorial.mdx b/docs/source/en/pipeline_tutorial.mdx similarity index 100% rename from docs/source/pipeline_tutorial.mdx rename to docs/source/en/pipeline_tutorial.mdx diff --git a/docs/source/pr_checks.mdx b/docs/source/en/pr_checks.mdx similarity index 100% rename from docs/source/pr_checks.mdx rename to docs/source/en/pr_checks.mdx diff --git a/docs/source/preprocessing.mdx b/docs/source/en/preprocessing.mdx similarity index 100% rename from docs/source/preprocessing.mdx rename to docs/source/en/preprocessing.mdx diff --git a/docs/source/quicktour.mdx b/docs/source/en/quicktour.mdx similarity index 100% rename from docs/source/quicktour.mdx rename to docs/source/en/quicktour.mdx diff --git a/docs/source/run_scripts.mdx b/docs/source/en/run_scripts.mdx similarity index 100% rename from docs/source/run_scripts.mdx rename to docs/source/en/run_scripts.mdx diff --git a/docs/source/sagemaker.mdx b/docs/source/en/sagemaker.mdx similarity index 100% rename from docs/source/sagemaker.mdx rename to docs/source/en/sagemaker.mdx diff --git a/docs/source/serialization.mdx b/docs/source/en/serialization.mdx similarity index 100% rename from docs/source/serialization.mdx rename to 
docs/source/en/serialization.mdx diff --git a/docs/source/task_summary.mdx b/docs/source/en/task_summary.mdx similarity index 100% rename from docs/source/task_summary.mdx rename to docs/source/en/task_summary.mdx diff --git a/docs/source/tasks/asr.mdx b/docs/source/en/tasks/asr.mdx similarity index 100% rename from docs/source/tasks/asr.mdx rename to docs/source/en/tasks/asr.mdx diff --git a/docs/source/tasks/audio_classification.mdx b/docs/source/en/tasks/audio_classification.mdx similarity index 100% rename from docs/source/tasks/audio_classification.mdx rename to docs/source/en/tasks/audio_classification.mdx diff --git a/docs/source/tasks/image_classification.mdx b/docs/source/en/tasks/image_classification.mdx similarity index 100% rename from docs/source/tasks/image_classification.mdx rename to docs/source/en/tasks/image_classification.mdx diff --git a/docs/source/tasks/language_modeling.mdx b/docs/source/en/tasks/language_modeling.mdx similarity index 100% rename from docs/source/tasks/language_modeling.mdx rename to docs/source/en/tasks/language_modeling.mdx diff --git a/docs/source/tasks/multiple_choice.mdx b/docs/source/en/tasks/multiple_choice.mdx similarity index 100% rename from docs/source/tasks/multiple_choice.mdx rename to docs/source/en/tasks/multiple_choice.mdx diff --git a/docs/source/tasks/question_answering.mdx b/docs/source/en/tasks/question_answering.mdx similarity index 100% rename from docs/source/tasks/question_answering.mdx rename to docs/source/en/tasks/question_answering.mdx diff --git a/docs/source/tasks/sequence_classification.mdx b/docs/source/en/tasks/sequence_classification.mdx similarity index 100% rename from docs/source/tasks/sequence_classification.mdx rename to docs/source/en/tasks/sequence_classification.mdx diff --git a/docs/source/tasks/summarization.mdx b/docs/source/en/tasks/summarization.mdx similarity index 100% rename from docs/source/tasks/summarization.mdx rename to docs/source/en/tasks/summarization.mdx diff --git a/docs/source/tasks/token_classification.mdx b/docs/source/en/tasks/token_classification.mdx similarity index 100% rename from docs/source/tasks/token_classification.mdx rename to docs/source/en/tasks/token_classification.mdx diff --git a/docs/source/tasks/translation.mdx b/docs/source/en/tasks/translation.mdx similarity index 100% rename from docs/source/tasks/translation.mdx rename to docs/source/en/tasks/translation.mdx diff --git a/docs/source/testing.mdx b/docs/source/en/testing.mdx similarity index 100% rename from docs/source/testing.mdx rename to docs/source/en/testing.mdx diff --git a/docs/source/tokenizer_summary.mdx b/docs/source/en/tokenizer_summary.mdx similarity index 100% rename from docs/source/tokenizer_summary.mdx rename to docs/source/en/tokenizer_summary.mdx diff --git a/docs/source/training.mdx b/docs/source/en/training.mdx similarity index 100% rename from docs/source/training.mdx rename to docs/source/en/training.mdx diff --git a/docs/source/troubleshooting.mdx b/docs/source/en/troubleshooting.mdx similarity index 100% rename from docs/source/troubleshooting.mdx rename to docs/source/en/troubleshooting.mdx diff --git a/docs/source/es/_config.py b/docs/source/es/_config.py new file mode 100644 index 000000000000..cd76263e9a5c --- /dev/null +++ b/docs/source/es/_config.py @@ -0,0 +1,14 @@ +# docstyle-ignore +INSTALL_CONTENT = """ +# Transformers installation +! pip install transformers datasets +# To install from source instead of the last release, comment the command above and uncomment the following one. +# ! 
pip install git+https://github.com/huggingface/transformers.git +""" + +notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}] +black_avoid_patterns = { + "{processor_class}": "FakeProcessorClass", + "{model_class}": "FakeModelClass", + "{object_class}": "FakeObjectClass", +} diff --git a/docs/source/es/_toctree.yml b/docs/source/es/_toctree.yml new file mode 100644 index 000000000000..525683955e71 --- /dev/null +++ b/docs/source/es/_toctree.yml @@ -0,0 +1,17 @@ +- sections: + - local: quicktour + title: Quick tour + - local: installation + title: Instalación + title: Get started +- sections: + - local: pipeline_tutorial + title: Pipelines para inferencia + - local: training + title: Fine-tuning a un modelo pre-entrenado + - local: accelerate + title: Entrenamiento distribuido con 🤗 Accelerate + title: Tutorials +- sections: + - local: multilingual + title: Modelos multilingües para inferencia \ No newline at end of file diff --git a/docs/source_es/accelerate.mdx b/docs/source/es/accelerate.mdx similarity index 100% rename from docs/source_es/accelerate.mdx rename to docs/source/es/accelerate.mdx diff --git a/docs/source_es/installation.mdx b/docs/source/es/installation.mdx similarity index 92% rename from docs/source_es/installation.mdx rename to docs/source/es/installation.mdx index 1e0b587e283b..cc7601c117cd 100644 --- a/docs/source_es/installation.mdx +++ b/docs/source/es/installation.mdx @@ -185,43 +185,43 @@ Otra opción para usar 🤗 Transformers offline es descargando previamente los * Utiliza el flujo de [`PreTrainedModel.from_pretrained`] y [`PreTrainedModel.save_pretrained`]: 1. Descarga previamente los archivos con [`PreTrainedModel.from_pretrained`]: - ```py - >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + ```py + >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM - >>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B") - >>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B") - ``` + >>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B") + >>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B") + ``` 2. Guarda los archivos en un directorio específico con [`PreTrainedModel.save_pretrained`]: ```py - >>> tokenizer.save_pretrained("./your/path/bigscience_t0") - >>> model.save_pretrained("./your/path/bigscience_t0") - ``` + >>> tokenizer.save_pretrained("./your/path/bigscience_t0") + >>> model.save_pretrained("./your/path/bigscience_t0") + ``` 3. Cuando te encuentres offline, recarga los archivos con [`PreTrainedModel.from_pretrained`] desde el directorio especificado: ```py - >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") - >>> model = AutoModel.from_pretrained("./your/path/bigscience_t0") - ``` + >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") + >>> model = AutoModel.from_pretrained("./your/path/bigscience_t0") + ``` * Descarga de manera programática los archivos con la biblioteca [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub): 1. Instala la biblioteca [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) en tu entorno virtual: ```bash - python -m pip install huggingface_hub - ``` + python -m pip install huggingface_hub + ``` 2. Utiliza la función [`hf_hub_download`](https://huggingface.co/docs/hub/adding-a-library#download-files-from-the-hub) para descargar un archivo a un path específico. 
Por ejemplo, el siguiente comando descarga el archivo `config.json` del modelo [T0](https://huggingface.co/bigscience/T0_3B) al path deseado: ```py - >>> from huggingface_hub import hf_hub_download + >>> from huggingface_hub import hf_hub_download - >>> hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json", cache_dir="./your/path/bigscience_t0") - ``` + >>> hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json", cache_dir="./your/path/bigscience_t0") + ``` Una vez que el archivo se descargue y se almacene en caché localmente, especifica tu ruta local para cargarlo y usarlo: @@ -236,9 +236,3 @@ Una vez que el archivo se descargue y se almacene en caché localmente, especifi Para más detalles sobre cómo descargar archivos almacenados en el Hub consulta la sección [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream). - - - - - - diff --git a/docs/source_es/multilingual.mdx b/docs/source/es/multilingual.mdx similarity index 100% rename from docs/source_es/multilingual.mdx rename to docs/source/es/multilingual.mdx diff --git a/docs/source_es/pipeline_tutorial.mdx b/docs/source/es/pipeline_tutorial.mdx similarity index 100% rename from docs/source_es/pipeline_tutorial.mdx rename to docs/source/es/pipeline_tutorial.mdx diff --git a/docs/source_es/quicktour.mdx b/docs/source/es/quicktour.mdx similarity index 100% rename from docs/source_es/quicktour.mdx rename to docs/source/es/quicktour.mdx diff --git a/docs/source_es/training.mdx b/docs/source/es/training.mdx similarity index 100% rename from docs/source_es/training.mdx rename to docs/source/es/training.mdx diff --git a/docs/source/notebooks.md b/docs/source/notebooks.md deleted file mode 120000 index 1ffa21de255f..000000000000 --- a/docs/source/notebooks.md +++ /dev/null @@ -1 +0,0 @@ -../../notebooks/README.md \ No newline at end of file diff --git a/src/transformers/commands/add_new_model.py b/src/transformers/commands/add_new_model.py index a5854863d2dc..276032eefe63 100644 --- a/src/transformers/commands/add_new_model.py +++ b/src/transformers/commands/add_new_model.py @@ -178,7 +178,7 @@ def remove_copy_lines(path): shutil.move( f"{directory}/{lowercase_model_name}.mdx", - f"{path_to_transformer_root}/docs/source/model_doc/{lowercase_model_name}.mdx", + f"{path_to_transformer_root}/docs/source/en/model_doc/{lowercase_model_name}.mdx", ) shutil.move( diff --git a/src/transformers/commands/add_new_model_like.py b/src/transformers/commands/add_new_model_like.py index 31a5d714ab68..8ef5adf445b8 100644 --- a/src/transformers/commands/add_new_model_like.py +++ b/src/transformers/commands/add_new_model_like.py @@ -541,7 +541,7 @@ def get_model_files(model_type: str, frameworks: Optional[List[str]] = None) -> model_files = list(model_module.glob("*.py")) model_files = filter_framework_files(model_files, frameworks=frameworks) - doc_file = REPO_PATH / "docs" / "source" / "model_doc" / f"{model_type}.mdx" + doc_file = REPO_PATH / "docs" / "source" / "en" / "model_doc" / f"{model_type}.mdx" # Basic pattern for test files test_files = [ @@ -1256,7 +1256,7 @@ def disable_fx_test(filename: Path) -> bool: add_model_to_auto_classes(old_model_patterns, new_model_patterns, model_classes) # 5. Add doc file - doc_file = REPO_PATH / "docs" / "source" / "model_doc" / f"{old_model_patterns.model_type}.mdx" + doc_file = REPO_PATH / "docs" / "source" / "en" / "model_doc" / f"{old_model_patterns.model_type}.mdx" duplicate_doc_file(doc_file, old_model_patterns, new_model_patterns, frameworks=frameworks) # 6. 
Warn the user for duplicate patterns diff --git a/utils/check_copies.py b/utils/check_copies.py index e823b866d2a7..5363fd1ff338 100644 --- a/utils/check_copies.py +++ b/utils/check_copies.py @@ -25,7 +25,7 @@ # All paths are set with the intent you should run this script from the root of the repo with the command # python utils/check_copies.py TRANSFORMERS_PATH = "src/transformers" -PATH_TO_DOCS = "docs/source" +PATH_TO_DOCS = "docs/source/en" REPO_PATH = "." # Mapping for files that are full copies of others (keys are copies, values the file to keep them up to data with) diff --git a/utils/check_repo.py b/utils/check_repo.py index 5f81c3bcfca6..99af09355274 100644 --- a/utils/check_repo.py +++ b/utils/check_repo.py @@ -31,7 +31,7 @@ # python utils/check_repo.py PATH_TO_TRANSFORMERS = "src/transformers" PATH_TO_TESTS = "tests" -PATH_TO_DOC = "docs/source" +PATH_TO_DOC = "docs/source/en" # Update this list with models that are supposed to be private. PRIVATE_MODELS = [ diff --git a/utils/check_table.py b/utils/check_table.py index 9d948fbb6d9f..d59f3e7b1e5a 100644 --- a/utils/check_table.py +++ b/utils/check_table.py @@ -23,7 +23,7 @@ # All paths are set with the intent you should run this script from the root of the repo with the command # python utils/check_table.py TRANSFORMERS_PATH = "src/transformers" -PATH_TO_DOCS = "docs/source" +PATH_TO_DOCS = "docs/source/en" REPO_PATH = "." From 96494dc2badde11a59a9e238f9586fd74c4169ba Mon Sep 17 00:00:00 2001 From: Karim Foda <35491698+KMFODA@users.noreply.github.com> Date: Mon, 4 Apr 2022 15:27:45 +0100 Subject: [PATCH 20/34] Add use_auth to load_datasets for private datasets to PT and TF examples (#16521) * fix formatting and remove use_auth * Add use_auth_token to Flax examples --- .../run_image_captioning_flax.py | 25 +++++++- .../flax/language-modeling/run_clm_flax.py | 58 +++++++++++++++--- .../flax/language-modeling/run_mlm_flax.py | 58 +++++++++++++++--- .../flax/language-modeling/run_t5_mlm_flax.py | 59 ++++++++++++++++--- examples/flax/question-answering/run_qa.py | 13 +++- .../summarization/run_summarization_flax.py | 53 ++++++++++++++--- .../flax/text-classification/run_flax_glue.py | 27 +++++++-- .../flax/token-classification/run_flax_ner.py | 12 +++- .../flax/vision/run_image_classification.py | 20 ++++++- .../run_audio_classification.py | 10 +++- .../contrastive-image-text/run_clip.py | 8 ++- .../run_image_classification.py | 1 + examples/pytorch/image-pretraining/run_mae.py | 1 + examples/pytorch/image-pretraining/run_mim.py | 1 + examples/pytorch/language-modeling/run_clm.py | 17 +++++- examples/pytorch/language-modeling/run_mlm.py | 16 ++++- examples/pytorch/language-modeling/run_plm.py | 9 ++- examples/pytorch/multiple-choice/run_swag.py | 14 ++++- examples/pytorch/question-answering/run_qa.py | 13 +++- .../question-answering/run_qa_beam_search.py | 13 +++- .../run_wav2vec2_pretraining_no_trainer.py | 5 +- .../run_speech_recognition_seq2seq.py | 10 +++- .../summarization/run_summarization.py | 12 +++- .../pytorch/text-classification/run_glue.py | 26 ++++++-- .../pytorch/text-classification/run_xnli.py | 30 ++++++++-- .../pytorch/token-classification/run_ner.py | 5 +- .../pytorch/translation/run_translation.py | 12 +++- .../tensorflow/language-modeling/run_clm.py | 15 ++++- .../tensorflow/language-modeling/run_mlm.py | 14 ++++- .../tensorflow/multiple-choice/run_swag.py | 14 ++++- .../tensorflow/question-answering/run_qa.py | 15 ++++- .../summarization/run_summarization.py | 12 +++- .../text-classification/run_glue.py | 7 ++- 
.../run_text_classification.py | 7 ++- .../token-classification/run_ner.py | 12 +++- .../tensorflow/translation/run_translation.py | 12 +++- 36 files changed, 544 insertions(+), 92 deletions(-) diff --git a/examples/flax/image-captioning/run_image_captioning_flax.py b/examples/flax/image-captioning/run_image_captioning_flax.py index b4b9afe0d305..b1c9012777ac 100644 --- a/examples/flax/image-captioning/run_image_captioning_flax.py +++ b/examples/flax/image-captioning/run_image_captioning_flax.py @@ -178,6 +178,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) @dataclass @@ -418,6 +425,7 @@ def main(): cache_dir=model_args.cache_dir, keep_in_memory=False, data_dir=data_args.data_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -430,7 +438,12 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - dataset = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + dataset = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. @@ -439,12 +452,18 @@ def main(): model_args.model_name_or_path, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) feature_extractor = AutoFeatureExtractor.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) tokenizer.pad_token = tokenizer.convert_ids_to_tokens(model.config.pad_token_id) diff --git a/examples/flax/language-modeling/run_clm_flax.py b/examples/flax/language-modeling/run_clm_flax.py index 82a9757d5c26..afb6d75b3857 100755 --- a/examples/flax/language-modeling/run_clm_flax.py +++ b/examples/flax/language-modeling/run_clm_flax.py @@ -165,6 +165,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) @dataclass @@ -363,7 +370,11 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
dataset = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, keep_in_memory=False + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + keep_in_memory=False, + use_auth_token=True if model_args.use_auth_token else None, ) if "validation" not in dataset.keys(): @@ -372,12 +383,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) dataset["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -390,7 +403,13 @@ def main(): if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = data_args.keep_linebreaks - dataset = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir, **dataset_args) + dataset = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + **dataset_args, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in dataset.keys(): dataset["validation"] = load_dataset( @@ -399,6 +418,7 @@ def main(): split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, **dataset_args, + use_auth_token=True if model_args.use_auth_token else None, ) dataset["train"] = load_dataset( extension, @@ -406,6 +426,7 @@ def main(): split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, **dataset_args, + use_auth_token=True if model_args.use_auth_token else None, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. @@ -416,20 +437,34 @@ def main(): # The .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
if model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) else: raise ValueError( @@ -439,11 +474,18 @@ def main(): if model_args.model_name_or_path: model = FlaxAutoModelForCausalLM.from_pretrained( - model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + model_args.model_name_or_path, + config=config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) else: model = FlaxAutoModelForCausalLM.from_config( - config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) # Preprocessing the datasets. diff --git a/examples/flax/language-modeling/run_mlm_flax.py b/examples/flax/language-modeling/run_mlm_flax.py index daa247ecaae0..6ea0f6e1564f 100755 --- a/examples/flax/language-modeling/run_mlm_flax.py +++ b/examples/flax/language-modeling/run_mlm_flax.py @@ -163,6 +163,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) @dataclass @@ -396,7 +403,12 @@ def main(): # download the dataset. if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
- datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir) + datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in datasets.keys(): datasets["validation"] = load_dataset( @@ -404,12 +416,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -420,7 +434,12 @@ def main(): extension = data_args.train_file.split(".")[-1] if extension == "txt": extension = "text" - datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in datasets.keys(): datasets["validation"] = load_dataset( @@ -428,12 +447,14 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. @@ -444,20 +465,34 @@ def main(): # The .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
if model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) else: raise ValueError( @@ -572,11 +607,18 @@ def group_texts(examples): if model_args.model_name_or_path: model = FlaxAutoModelForMaskedLM.from_pretrained( - model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + model_args.model_name_or_path, + config=config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) else: model = FlaxAutoModelForMaskedLM.from_config( - config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) # Store some constant diff --git a/examples/flax/language-modeling/run_t5_mlm_flax.py b/examples/flax/language-modeling/run_t5_mlm_flax.py index 622f11f5de2a..5b1067cd993e 100755 --- a/examples/flax/language-modeling/run_t5_mlm_flax.py +++ b/examples/flax/language-modeling/run_t5_mlm_flax.py @@ -162,6 +162,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) @dataclass @@ -525,7 +532,12 @@ def main(): # 'text' is found. You can easily tweak this behavior (see below). if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
- datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir) + datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in datasets.keys(): datasets["validation"] = load_dataset( @@ -533,12 +545,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -549,7 +563,12 @@ def main(): extension = data_args.train_file.split(".")[-1] if extension == "txt": extension = "text" - datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in datasets.keys(): datasets["validation"] = load_dataset( @@ -557,12 +576,14 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
@@ -571,11 +592,17 @@ def main(): if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) else: raise ValueError( @@ -585,10 +612,17 @@ def main(): if model_args.config_name: config = T5Config.from_pretrained( - model_args.config_name, cache_dir=model_args.cache_dir, vocab_size=len(tokenizer) + model_args.config_name, + cache_dir=model_args.cache_dir, + vocab_size=len(tokenizer), + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: - config = T5Config.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir) + config = T5Config.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") @@ -678,11 +712,20 @@ def group_texts(examples): if model_args.model_name_or_path: model = FlaxT5ForConditionalGeneration.from_pretrained( - model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + model_args.model_name_or_path, + config=config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) else: config.vocab_size = len(tokenizer) - model = FlaxT5ForConditionalGeneration(config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype)) + model = FlaxT5ForConditionalGeneration( + config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, + ) # Data collator # This one will take care of randomly masking the tokens. diff --git a/examples/flax/question-answering/run_qa.py b/examples/flax/question-answering/run_qa.py index a15cca6607cc..6ab150a762b0 100644 --- a/examples/flax/question-answering/run_qa.py +++ b/examples/flax/question-answering/run_qa.py @@ -448,7 +448,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: # Loading the dataset from local csv or json file. 
@@ -463,7 +466,13 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, field="data", cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + field="data", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. # endregion diff --git a/examples/flax/summarization/run_summarization_flax.py b/examples/flax/summarization/run_summarization_flax.py index effe3b58839f..3ebff73b98ff 100644 --- a/examples/flax/summarization/run_summarization_flax.py +++ b/examples/flax/summarization/run_summarization_flax.py @@ -176,6 +176,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) @dataclass @@ -421,7 +428,11 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. dataset = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, keep_in_memory=False + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + keep_in_memory=False, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -434,27 +445,46 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - dataset = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + dataset = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
# Load pretrained model and tokenizer if model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, ) else: raise ValueError( @@ -464,11 +494,18 @@ def main(): if model_args.model_name_or_path: model = FlaxAutoModelForSeq2SeqLM.from_pretrained( - model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + model_args.model_name_or_path, + config=config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) else: model = FlaxAutoModelForSeq2SeqLM.from_config( - config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) if model.config.decoder_start_token_id is None: diff --git a/examples/flax/text-classification/run_flax_glue.py b/examples/flax/text-classification/run_flax_glue.py index d56d23d2734e..06f9caba8943 100755 --- a/examples/flax/text-classification/run_flax_glue.py +++ b/examples/flax/text-classification/run_flax_glue.py @@ -337,7 +337,11 @@ def main(): # download the dataset. if data_args.task_name is not None: # Downloading and loading a dataset from the hub. - raw_datasets = load_dataset("glue", data_args.task_name) + raw_datasets = load_dataset( + "glue", + data_args.task_name, + use_auth_token=True if model_args.use_auth_token else None, + ) else: # Loading the dataset from local csv or json file. data_files = {} @@ -346,7 +350,11 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = (data_args.train_file if data_args.train_file is not None else data_args.valid_file).split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files) + raw_datasets = load_dataset( + extension, + data_files=data_files, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset at # https://huggingface.co/docs/datasets/loading_datasets.html. 
@@ -372,12 +380,21 @@ def main(): # Load pretrained model and tokenizer config = AutoConfig.from_pretrained( - model_args.model_name_or_path, num_labels=num_labels, finetuning_task=data_args.task_name + model_args.model_name_or_path, + num_labels=num_labels, + finetuning_task=data_args.task_name, + use_auth_token=True if model_args.use_auth_token else None, ) tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, use_fast=not model_args.use_slow_tokenizer + model_args.model_name_or_path, + use_fast=not model_args.use_slow_tokenizer, + use_auth_token=True if model_args.use_auth_token else None, + ) + model = FlaxAutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, + config=config, + use_auth_token=True if model_args.use_auth_token else None, ) - model = FlaxAutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, config=config) # Preprocessing the datasets if data_args.task_name is not None: diff --git a/examples/flax/token-classification/run_flax_ner.py b/examples/flax/token-classification/run_flax_ner.py index abf1b8d0c117..32f0104b8929 100644 --- a/examples/flax/token-classification/run_flax_ner.py +++ b/examples/flax/token-classification/run_flax_ner.py @@ -391,7 +391,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: # Loading the dataset from local csv or json file. @@ -401,7 +404,12 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = (data_args.train_file if data_args.train_file is not None else data_args.valid_file).split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/flax/vision/run_image_classification.py b/examples/flax/vision/run_image_classification.py index 7459d24c6346..0dc7b2f95742 100644 --- a/examples/flax/vision/run_image_classification.py +++ b/examples/flax/vision/run_image_classification.py @@ -154,6 +154,13 @@ class ModelArguments: "help": "Floating-point format in which the model weights should be initialized and trained. Choose one of `[float32, float16, bfloat16]`." }, ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." 
+ }, + ) @dataclass @@ -315,6 +322,7 @@ def main(): num_labels=len(train_dataset.classes), image_size=data_args.image_size, cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) elif model_args.model_name_or_path: config = AutoConfig.from_pretrained( @@ -322,6 +330,7 @@ def main(): num_labels=len(train_dataset.classes), image_size=data_args.image_size, cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -329,11 +338,18 @@ def main(): if model_args.model_name_or_path: model = FlaxAutoModelForImageClassification.from_pretrained( - model_args.model_name_or_path, config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + model_args.model_name_or_path, + config=config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) else: model = FlaxAutoModelForImageClassification.from_config( - config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype) + config, + seed=training_args.seed, + dtype=getattr(jnp, model_args.dtype), + use_auth_token=True if model_args.use_auth_token else None, ) # Store some constant diff --git a/examples/pytorch/audio-classification/run_audio_classification.py b/examples/pytorch/audio-classification/run_audio_classification.py index 14c0a026fda4..c0eb755b6a5a 100644 --- a/examples/pytorch/audio-classification/run_audio_classification.py +++ b/examples/pytorch/audio-classification/run_audio_classification.py @@ -227,10 +227,16 @@ def main(): # Initialize our dataset and prepare it for the audio classification task. raw_datasets = DatasetDict() raw_datasets["train"] = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.train_split_name, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["eval"] = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.eval_split_name, + use_auth_token=True if model_args.use_auth_token else None, ) if data_args.audio_column_name not in raw_datasets["train"].column_names: diff --git a/examples/pytorch/contrastive-image-text/run_clip.py b/examples/pytorch/contrastive-image-text/run_clip.py index 79fd123064a1..02f20936873b 100644 --- a/examples/pytorch/contrastive-image-text/run_clip.py +++ b/examples/pytorch/contrastive-image-text/run_clip.py @@ -276,6 +276,7 @@ def main(): cache_dir=model_args.cache_dir, keep_in_memory=False, data_dir=data_args.data_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -288,7 +289,12 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - dataset = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + dataset = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
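The hunks above all convert the boolean flag with `True if model_args.use_auth_token else None` rather than passing it straight through: `None` when the flag is unset falls back to the libraries' default Hub behavior instead of pinning an explicit boolean. A minimal sketch of the pattern, assuming a `ModelArguments` reduced to the one field this patch adds and a public dataset standing in for a private one:

```py
from dataclasses import dataclass, field

from datasets import load_dataset


@dataclass
class ModelArguments:
    # The same field this patch adds to each example script.
    use_auth_token: bool = field(
        default=False,
        metadata={"help": "Will use the token generated when running `transformers-cli login`."},
    )


model_args = ModelArguments(use_auth_token=False)

# None (not False) when the flag is unset, so the default behavior is kept.
raw_datasets = load_dataset(
    "glue",
    "mrpc",
    use_auth_token=True if model_args.use_auth_token else None,
)
print(raw_datasets)
```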
diff --git a/examples/pytorch/image-classification/run_image_classification.py b/examples/pytorch/image-classification/run_image_classification.py index b7de0f5f7b6e..fef52c4bf5e5 100644 --- a/examples/pytorch/image-classification/run_image_classification.py +++ b/examples/pytorch/image-classification/run_image_classification.py @@ -207,6 +207,7 @@ def main(): data_files=data_args.data_files, cache_dir=model_args.cache_dir, task="image-classification", + use_auth_token=True if model_args.use_auth_token else None, ) # If we don't have a validation split, split off a percentage of train as validation. diff --git a/examples/pytorch/image-pretraining/run_mae.py b/examples/pytorch/image-pretraining/run_mae.py index 3b634d691832..e2182ec783da 100644 --- a/examples/pytorch/image-pretraining/run_mae.py +++ b/examples/pytorch/image-pretraining/run_mae.py @@ -207,6 +207,7 @@ def main(): data_args.dataset_config_name, data_files=data_args.data_files, cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # If we don't have a validation split, split off a percentage of train as validation. diff --git a/examples/pytorch/image-pretraining/run_mim.py b/examples/pytorch/image-pretraining/run_mim.py index 0377a505e02d..323c38489589 100644 --- a/examples/pytorch/image-pretraining/run_mim.py +++ b/examples/pytorch/image-pretraining/run_mim.py @@ -266,6 +266,7 @@ def main(): data_args.dataset_config_name, data_files=data_args.data_files, cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # If we don't have a validation split, split off a percentage of train as validation. diff --git a/examples/pytorch/language-modeling/run_clm.py b/examples/pytorch/language-modeling/run_clm.py index a1cdcf9ee4a9..3d2af72ccaf6 100755 --- a/examples/pytorch/language-modeling/run_clm.py +++ b/examples/pytorch/language-modeling/run_clm.py @@ -254,7 +254,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -262,12 +265,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -284,7 +289,13 @@ def main(): if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = data_args.keep_linebreaks - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir, **dataset_args) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + **dataset_args, + ) # If no validation data is there, validation_split_percentage will be used to divide the dataset. 
if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -292,6 +303,7 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, **dataset_args, ) raw_datasets["train"] = load_dataset( @@ -299,6 +311,7 @@ def main(): data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, **dataset_args, ) diff --git a/examples/pytorch/language-modeling/run_mlm.py b/examples/pytorch/language-modeling/run_mlm.py index 6ea3c2c934d3..f829e86781f1 100755 --- a/examples/pytorch/language-modeling/run_mlm.py +++ b/examples/pytorch/language-modeling/run_mlm.py @@ -263,7 +263,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -271,12 +274,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -288,7 +293,12 @@ def main(): extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # If no validation data is there, validation_split_percentage will be used to divide the dataset. if "validation" not in raw_datasets.keys(): @@ -297,12 +307,14 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at diff --git a/examples/pytorch/language-modeling/run_plm.py b/examples/pytorch/language-modeling/run_plm.py index d1c09896d8e7..cc4ad602329c 100755 --- a/examples/pytorch/language-modeling/run_plm.py +++ b/examples/pytorch/language-modeling/run_plm.py @@ -256,7 +256,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -264,12 +267,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -288,12 +293,14 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at diff --git a/examples/pytorch/multiple-choice/run_swag.py b/examples/pytorch/multiple-choice/run_swag.py index 01c9e8bcf7d2..4578e4570aa0 100755 --- a/examples/pytorch/multiple-choice/run_swag.py +++ b/examples/pytorch/multiple-choice/run_swag.py @@ -269,10 +269,20 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = data_args.train_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: # Downloading and loading the swag dataset from the hub. - raw_datasets = load_dataset("swag", "regular", cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + "swag", + "regular", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/pytorch/question-answering/run_qa.py b/examples/pytorch/question-answering/run_qa.py index 67aaf1d84ff0..90d199b14d6d 100755 --- a/examples/pytorch/question-answering/run_qa.py +++ b/examples/pytorch/question-answering/run_qa.py @@ -262,7 +262,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -276,7 +279,13 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, field="data", cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + field="data", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/pytorch/question-answering/run_qa_beam_search.py b/examples/pytorch/question-answering/run_qa_beam_search.py index 4c79be08b91b..96aa07a8086b 100755 --- a/examples/pytorch/question-answering/run_qa_beam_search.py +++ b/examples/pytorch/question-answering/run_qa_beam_search.py @@ -260,7 +260,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -273,7 +276,13 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, field="data", cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + field="data", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
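These example scripts parse their dataclasses with `HfArgumentParser`, so the `use_auth_token` field added above surfaces as a command-line flag with no further wiring. A rough sketch of that plumbing (this relies on standard `HfArgumentParser` handling of a `bool` field with `default=False`; it is not part of the diff):

```py
from dataclasses import dataclass, field

from transformers import HfArgumentParser


@dataclass
class ModelArguments:
    use_auth_token: bool = field(default=False, metadata={"help": "Use the stored login token."})


parser = HfArgumentParser(ModelArguments)
# A bool field defaulting to False parses as a switch: passing the flag means True.
(model_args,) = parser.parse_args_into_dataclasses(args=["--use_auth_token"])
assert model_args.use_auth_token is True
```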
diff --git a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py index 51ac5191181e..88021a428503 100755 --- a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py +++ b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py @@ -403,7 +403,10 @@ def main(): for dataset_config_name, train_split_name in zip(args.dataset_config_names, args.dataset_split_names): # load dataset dataset_split = load_dataset( - args.dataset_name, dataset_config_name, split=train_split_name, cache_dir=args.cache_dir + args.dataset_name, + dataset_config_name, + split=train_split_name, + cache_dir=args.cache_dir, ) datasets_splits.append(dataset_split) diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py b/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py index 695a5b24fd18..46d4785fa8f8 100755 --- a/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py +++ b/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py @@ -278,12 +278,18 @@ def main(): if training_args.do_train: raw_datasets["train"] = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.train_split_name, + use_auth_token=True if model_args.use_auth_token else None, ) if training_args.do_eval: raw_datasets["eval"] = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.eval_split_name, + use_auth_token=True if model_args.use_auth_token else None, ) if data_args.audio_column_name not in next(iter(raw_datasets.values())).column_names: diff --git a/examples/pytorch/summarization/run_summarization.py b/examples/pytorch/summarization/run_summarization.py index 66aeb981bdf4..7b39cb8e48f9 100755 --- a/examples/pytorch/summarization/run_summarization.py +++ b/examples/pytorch/summarization/run_summarization.py @@ -341,7 +341,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -354,7 +357,12 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/pytorch/text-classification/run_glue.py b/examples/pytorch/text-classification/run_glue.py index 88be878faea2..a0730f609820 100755 --- a/examples/pytorch/text-classification/run_glue.py +++ b/examples/pytorch/text-classification/run_glue.py @@ -252,11 +252,19 @@ def main(): # download the dataset. 
if data_args.task_name is not None: # Downloading and loading a dataset from the hub. - raw_datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + "glue", + data_args.task_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) elif data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: # Loading a dataset from your local files. @@ -281,10 +289,20 @@ def main(): if data_args.train_file.endswith(".csv"): # Loading a dataset from local csv files - raw_datasets = load_dataset("csv", data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + "csv", + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: # Loading a dataset from local json files - raw_datasets = load_dataset("json", data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + "json", + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/pytorch/text-classification/run_xnli.py b/examples/pytorch/text-classification/run_xnli.py index f54b1ec2aa60..4a17a5d702ba 100755 --- a/examples/pytorch/text-classification/run_xnli.py +++ b/examples/pytorch/text-classification/run_xnli.py @@ -213,19 +213,41 @@ def main(): # Downloading and loading xnli dataset from the hub. 
if training_args.do_train: if model_args.train_language is None: - train_dataset = load_dataset("xnli", model_args.language, split="train", cache_dir=model_args.cache_dir) + train_dataset = load_dataset( + "xnli", + model_args.language, + split="train", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: train_dataset = load_dataset( - "xnli", model_args.train_language, split="train", cache_dir=model_args.cache_dir + "xnli", + model_args.train_language, + split="train", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) label_list = train_dataset.features["label"].names if training_args.do_eval: - eval_dataset = load_dataset("xnli", model_args.language, split="validation", cache_dir=model_args.cache_dir) + eval_dataset = load_dataset( + "xnli", + model_args.language, + split="validation", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) label_list = eval_dataset.features["label"].names if training_args.do_predict: - predict_dataset = load_dataset("xnli", model_args.language, split="test", cache_dir=model_args.cache_dir) + predict_dataset = load_dataset( + "xnli", + model_args.language, + split="test", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) label_list = predict_dataset.features["label"].names # Labels diff --git a/examples/pytorch/token-classification/run_ner.py b/examples/pytorch/token-classification/run_ner.py index 9ff64b37978c..5545b35862b3 100755 --- a/examples/pytorch/token-classification/run_ner.py +++ b/examples/pytorch/token-classification/run_ner.py @@ -249,7 +249,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} diff --git a/examples/pytorch/translation/run_translation.py b/examples/pytorch/translation/run_translation.py index b458a3f0cd65..f7e98276dc7b 100755 --- a/examples/pytorch/translation/run_translation.py +++ b/examples/pytorch/translation/run_translation.py @@ -306,7 +306,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -319,7 +322,12 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
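As the Flax hunks earlier in the patch show, the same kwarg is also threaded into `from_pretrained` for configs, tokenizers, feature extractors, and models, so an entirely private repository works end to end. A hedged sketch of that usage, where `your-org/private-model` is a placeholder checkpoint name, not a real repo:

```py
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

use_auth_token = True  # would normally come from model_args.use_auth_token

config = AutoConfig.from_pretrained("your-org/private-model", use_auth_token=use_auth_token)
tokenizer = AutoTokenizer.from_pretrained("your-org/private-model", use_auth_token=use_auth_token)
model = AutoModelForSequenceClassification.from_pretrained(
    "your-org/private-model", config=config, use_auth_token=use_auth_token
)
```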
diff --git a/examples/tensorflow/language-modeling/run_clm.py b/examples/tensorflow/language-modeling/run_clm.py index 4cbc00b3cdc9..84e71efe50d1 100755 --- a/examples/tensorflow/language-modeling/run_clm.py +++ b/examples/tensorflow/language-modeling/run_clm.py @@ -280,17 +280,23 @@ def main(): # download the dataset. if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. - raw_datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name) + raw_datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -303,7 +309,12 @@ def main(): if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = data_args.keep_linebreaks - raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args) + raw_datasets = load_dataset( + extension, + data_files=data_files, + use_auth_token=True if model_args.use_auth_token else None, + **dataset_args, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. # endregion diff --git a/examples/tensorflow/language-modeling/run_mlm.py b/examples/tensorflow/language-modeling/run_mlm.py index 44c5d230318b..8b32070b2dd1 100755 --- a/examples/tensorflow/language-modeling/run_mlm.py +++ b/examples/tensorflow/language-modeling/run_mlm.py @@ -292,17 +292,23 @@ def main(): # download the dataset. if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. - raw_datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name) + raw_datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + use_auth_token=True if model_args.use_auth_token else None, + ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", + use_auth_token=True if model_args.use_auth_token else None, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -313,7 +319,11 @@ def main(): extension = data_args.train_file.split(".")[-1] if extension == "txt": extension = "text" - raw_datasets = load_dataset(extension, data_files=data_files) + raw_datasets = load_dataset( + extension, + data_files=data_files, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
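The `run_clm.py` and `run_mlm.py` hunks above also show the recurring fallback for datasets without a validation split: slice syntax carves one out of `train`. A small self-contained sketch of those split expressions (the dataset name and percentage are illustrative):

```py
from datasets import load_dataset

validation_split_percentage = 5

# First 5% of train becomes validation, the remaining 95% stays as train.
validation = load_dataset("imdb", split=f"train[:{validation_split_percentage}%]")
train = load_dataset("imdb", split=f"train[{validation_split_percentage}%:]")
print(len(validation), len(train))
```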
diff --git a/examples/tensorflow/multiple-choice/run_swag.py b/examples/tensorflow/multiple-choice/run_swag.py index e14815cf81f3..2c78ab39fa60 100644 --- a/examples/tensorflow/multiple-choice/run_swag.py +++ b/examples/tensorflow/multiple-choice/run_swag.py @@ -290,10 +290,20 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = data_args.train_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: # Downloading and loading the swag dataset from the hub. - raw_datasets = load_dataset("swag", "regular", cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + "swag", + "regular", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/tensorflow/question-answering/run_qa.py b/examples/tensorflow/question-answering/run_qa.py index 50e8c7f50d96..891219d3a1a2 100755 --- a/examples/tensorflow/question-answering/run_qa.py +++ b/examples/tensorflow/question-answering/run_qa.py @@ -278,7 +278,12 @@ def main(): # download the dataset. if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. - datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir) + datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: data_files = {} if data_args.train_file is not None: @@ -291,7 +296,13 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - datasets = load_dataset(extension, data_files=data_files, field="data", cache_dir=model_args.cache_dir) + datasets = load_dataset( + extension, + data_files=data_files, + field="data", + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. # endregion diff --git a/examples/tensorflow/summarization/run_summarization.py b/examples/tensorflow/summarization/run_summarization.py index e40c763530c0..09aa8f90de3d 100644 --- a/examples/tensorflow/summarization/run_summarization.py +++ b/examples/tensorflow/summarization/run_summarization.py @@ -391,7 +391,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -404,7 +407,12 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. # endregion diff --git a/examples/tensorflow/text-classification/run_glue.py b/examples/tensorflow/text-classification/run_glue.py index 03d7df675b78..fa8cb98a5a6e 100644 --- a/examples/tensorflow/text-classification/run_glue.py +++ b/examples/tensorflow/text-classification/run_glue.py @@ -236,7 +236,12 @@ def main(): # Downloading and loading a dataset from the hub. In distributed training, the load_dataset function guarantee # that only one local process can concurrently download the dataset. - datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir) + datasets = load_dataset( + "glue", + data_args.task_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/tensorflow/text-classification/run_text_classification.py b/examples/tensorflow/text-classification/run_text_classification.py index 114caacdbf54..3f3d64b6236d 100644 --- a/examples/tensorflow/text-classification/run_text_classification.py +++ b/examples/tensorflow/text-classification/run_text_classification.py @@ -236,7 +236,12 @@ def main(): if data_args.input_file_extension == "csv": # Loading a dataset from local csv files - datasets = load_dataset("csv", data_files=data_files, cache_dir=model_args.cache_dir) + datasets = load_dataset( + "csv", + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) else: # Loading a dataset from local json files datasets = load_dataset("json", data_files=data_files, cache_dir=model_args.cache_dir) diff --git a/examples/tensorflow/token-classification/run_ner.py b/examples/tensorflow/token-classification/run_ner.py index acb72855666d..e580ed94b061 100644 --- a/examples/tensorflow/token-classification/run_ner.py +++ b/examples/tensorflow/token-classification/run_ner.py @@ -266,7 +266,11 @@ def main(): # download the dataset. if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. 
- raw_datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name) + raw_datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + use_auth_token=True if model_args.use_auth_token else None, + ) else: data_files = {} if data_args.train_file is not None: @@ -274,7 +278,11 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = data_args.train_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files) + raw_datasets = load_dataset( + extension, + data_files=data_files, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. diff --git a/examples/tensorflow/translation/run_translation.py b/examples/tensorflow/translation/run_translation.py index fce150b712ad..c6921bbf3c51 100644 --- a/examples/tensorflow/translation/run_translation.py +++ b/examples/tensorflow/translation/run_translation.py @@ -347,7 +347,10 @@ def main(): if data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. raw_datasets = load_dataset( - data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, ) else: data_files = {} @@ -357,7 +360,12 @@ def main(): if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file extension = data_args.validation_file.split(".")[-1] - raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) + raw_datasets = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + use_auth_token=True if model_args.use_auth_token else None, + ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at # https://huggingface.co/docs/datasets/loading_datasets.html. 
# endregion From cb50ff9caeb14f45abd9ac8ccf75355f175b588e Mon Sep 17 00:00:00 2001 From: SaulLu <55560583+SaulLu@users.noreply.github.com> Date: Mon, 4 Apr 2022 16:57:24 +0200 Subject: [PATCH 21/34] add a test checking the format of `convert_tokens_to_string`'s output (#16540) * add new tests * add comment to overridden tests --- tests/byt5/test_tokenization_byt5.py | 11 +++++++++++ tests/perceiver/test_tokenization_perceiver.py | 11 +++++++++++ tests/test_tokenization_common.py | 9 +++++++++ tests/wav2vec2/test_tokenization_wav2vec2.py | 11 +++++++++++ .../test_tokenization_wav2vec2_phoneme.py | 11 +++++++++++ 5 files changed, 53 insertions(+) diff --git a/tests/byt5/test_tokenization_byt5.py b/tests/byt5/test_tokenization_byt5.py index afdcae0ee389..eb210530f0f3 100644 --- a/tests/byt5/test_tokenization_byt5.py +++ b/tests/byt5/test_tokenization_byt5.py @@ -321,3 +321,14 @@ def test_pretokenized_inputs(self): # tests all ids in vocab => vocab doesn't exist so unnecessary to test def test_conversion_reversible(self): pass + + def test_convert_tokens_to_string_format(self): + # The default common tokenizer tests use invalid tokens for ByT5 that can only accept one-character strings + # and special added tokens as tokens + tokenizers = self.get_tokenizers(fast=True, do_lower_case=True) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokens = ["t", "h", "i", "s", " ", "i", "s", " ", "a", " ", "t", "e", "x", "t", "</s>"] + string = tokenizer.convert_tokens_to_string(tokens) + + self.assertIsInstance(string, str) diff --git a/tests/perceiver/test_tokenization_perceiver.py b/tests/perceiver/test_tokenization_perceiver.py index 214e6aff38e9..0b6b7d4c75a8 100644 --- a/tests/perceiver/test_tokenization_perceiver.py +++ b/tests/perceiver/test_tokenization_perceiver.py @@ -286,3 +286,14 @@ def test_pretokenized_inputs(self): # tests all ids in vocab => vocab doesn't exist so unnecessary to test def test_conversion_reversible(self): pass + + def test_convert_tokens_to_string_format(self): + # The default common tokenizer tests use invalid tokens for Perceiver that can only accept one-character + # strings and special added tokens as tokens + tokenizers = self.get_tokenizers(fast=True, do_lower_case=True) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokens = ["[CLS]", "t", "h", "i", "s", " ", "i", "s", " ", "a", " ", "t", "e", "s", "t", "[SEP]"] + string = tokenizer.convert_tokens_to_string(tokens) + + self.assertIsInstance(string, str) diff --git a/tests/test_tokenization_common.py b/tests/test_tokenization_common.py index f260fa71fff1..2d26d76b9a08 100644 --- a/tests/test_tokenization_common.py +++ b/tests/test_tokenization_common.py @@ -3713,6 +3713,15 @@ def test_saving_tokenizer_trainer(self): trainer.save_model(os.path.join(tmp_dir, "checkpoint")) self.assertIn("tokenizer.json", os.listdir(os.path.join(tmp_dir, "checkpoint"))) + def test_convert_tokens_to_string_format(self): + tokenizers = self.get_tokenizers(fast=True, do_lower_case=True) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokens = ["this", "is", "a", "test"] + string = tokenizer.convert_tokens_to_string(tokens) + + self.assertIsInstance(string, str) + def test_save_slow_from_fast_and_reload_fast(self): if not self.test_slow_tokenizer or not self.test_rust_tokenizer: # we need both slow and fast versions
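The common test above pins down the expected return type, while the two Wav2Vec2 overrides below exist because those CTC tokenizers return a dict rather than a plain string. A minimal sketch of the contract being asserted, using an assumed public checkpoint:

```py
from transformers import AutoTokenizer

# "bert-base-uncased" is an assumed checkpoint for illustration; any standard
# text tokenizer is expected to satisfy the same contract.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("this is a test")
string = tokenizer.convert_tokens_to_string(tokens)
assert isinstance(string, str)  # the format the new common test checks
```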
diff --git a/tests/wav2vec2/test_tokenization_wav2vec2.py b/tests/wav2vec2/test_tokenization_wav2vec2.py index 98c6f126bbfb..775b3916e7a6 100644 --- a/tests/wav2vec2/test_tokenization_wav2vec2.py +++ b/tests/wav2vec2/test_tokenization_wav2vec2.py @@ -753,3 +753,14 @@ def test_tf_encode_plus_sent_to_model(self): @unittest.skip("The tokenizer shouldn't be used to encode input IDs (except for labels), only to decode.") def test_torch_encode_plus_sent_to_model(self): pass + + def test_convert_tokens_to_string_format(self): + # The default common tokenizer tests assume that the output of `convert_tokens_to_string` is a string which + # is not the case for Wav2vec2. + tokenizers = self.get_tokenizers(fast=True, do_lower_case=True) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokens = ["T", "H", "I", "S", "|", "I", "S", "|", "A", "|", "T", "E", "X", "T"] + output = tokenizer.convert_tokens_to_string(tokens) + + self.assertIsInstance(output["text"], str) diff --git a/tests/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py b/tests/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py index 73f47010b777..24582cefbbd9 100644 --- a/tests/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py +++ b/tests/wav2vec2_phoneme/test_tokenization_wav2vec2_phoneme.py @@ -398,3 +398,14 @@ def test_tf_encode_plus_sent_to_model(self): @unittest.skip("The tokenizer shouldn't be used to encode input IDs (except for labels), only to decode.") def test_torch_encode_plus_sent_to_model(self): pass + + def test_convert_tokens_to_string_format(self): + # The default common tokenizer tests assume that the output of `convert_tokens_to_string` is a string which + # is not the case for Wav2Vec2PhonemeCTCTokenizer. + tokenizers = self.get_tokenizers(fast=True, do_lower_case=True) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokens = ["ð", "ɪ", "s", "ɪ", "z", "ɐ", "t", "ɛ", "k", "s", "t"] + output = tokenizer.convert_tokens_to_string(tokens) + + self.assertIsInstance(output["text"], str) From 7d6488100fe2ffdcd7726b5026382d14404c42c9 Mon Sep 17 00:00:00 2001 From: Joao Gante Date: Mon, 4 Apr 2022 16:37:33 +0100 Subject: [PATCH 22/34] TF: Finalize `unpack_inputs`-related changes (#16499) * Add unpack_inputs to remaining models * removed kwargs to `call()` in TF models * fix TF T5 tests --- src/transformers/modeling_tf_utils.py | 31 ++++++++------ .../models/albert/modeling_tf_albert.py | 8 ---- .../models/bart/modeling_tf_bart.py | 3 -- .../models/bert/modeling_tf_bert.py | 9 ----- .../blenderbot/modeling_tf_blenderbot.py | 3 -- .../modeling_tf_blenderbot_small.py | 3 -- .../models/clip/modeling_tf_clip.py | 12 ------ .../models/convbert/modeling_tf_convbert.py | 7 ---- .../models/convnext/modeling_tf_convnext.py | 3 -- .../models/ctrl/modeling_tf_ctrl.py | 4 -- .../models/deberta/modeling_tf_deberta.py | 6 --- .../deberta_v2/modeling_tf_deberta_v2.py | 6 --- .../distilbert/modeling_tf_distilbert.py | 7 ---- .../models/dpr/modeling_tf_dpr.py | 7 ---- .../models/electra/modeling_tf_electra.py | 8 ---- .../modeling_tf_encoder_decoder.py | 22 +++------- .../models/flaubert/modeling_tf_flaubert.py | 3 -- .../models/funnel/modeling_tf_funnel.py | 9 ----- .../models/gpt2/modeling_tf_gpt2.py | 5 --- .../models/gptj/modeling_tf_gptj.py | 5 --- .../models/layoutlm/modeling_tf_layoutlm.py | 5 --- .../models/led/modeling_tf_led.py | 5 +-- .../longformer/modeling_tf_longformer.py | 7 ---- .../models/lxmert/modeling_tf_lxmert.py | 3 -- .../models/marian/modeling_tf_marian.py | 3 -- .../models/mbart/modeling_tf_mbart.py | 5 +--
.../mobilebert/modeling_tf_mobilebert.py | 9 ----- .../models/mpnet/modeling_tf_mpnet.py | 6 --- .../models/openai/modeling_tf_openai.py | 5 --- .../models/pegasus/modeling_tf_pegasus.py | 3 -- .../models/rembert/modeling_tf_rembert.py | 8 ---- .../models/roberta/modeling_tf_roberta.py | 8 ---- .../models/roformer/modeling_tf_roformer.py | 8 ---- .../modeling_tf_speech_to_text.py | 2 - src/transformers/models/t5/modeling_tf_t5.py | 21 ++++++++-- .../models/tapas/modeling_tf_tapas.py | 5 --- .../transfo_xl/modeling_tf_transfo_xl.py | 4 -- .../modeling_tf_vision_encoder_decoder.py | 20 +++------- .../models/vit/modeling_tf_vit.py | 3 -- .../models/vit_mae/modeling_tf_vit_mae.py | 3 -- .../models/xlm/modeling_tf_xlm.py | 7 ---- .../models/xlnet/modeling_tf_xlnet.py | 7 ---- ...tf_{{cookiecutter.lowercase_modelname}}.py | 11 ----- tests/convbert/test_modeling_tf_convbert.py | 1 - tests/t5/test_modeling_tf_t5.py | 5 +++ tests/test_modeling_tf_common.py | 40 +++++++++++-------- 46 files changed, 78 insertions(+), 287 deletions(-) diff --git a/src/transformers/modeling_tf_utils.py b/src/transformers/modeling_tf_utils.py index a28a09425087..ee5b32886b07 100644 --- a/src/transformers/modeling_tf_utils.py +++ b/src/transformers/modeling_tf_utils.py @@ -312,10 +312,12 @@ def booleans_processing(config, **kwargs): final_booleans = {} if tf.executing_eagerly(): - # Pure conv models (such as ConvNext) do not have `output_attentions` - final_booleans["output_attentions"] = kwargs.get("output_attentions", None) - if final_booleans["output_attentions"] is None: - final_booleans["output_attentions"] = config.output_attentions + # Pure conv models (such as ConvNext) do not have `output_attentions`. If the signature has + # `output_attentions`, it will be present here in `kwargs`, even if unset (in that case, as `None`) + if "output_attentions" in kwargs: + final_booleans["output_attentions"] = ( + kwargs["output_attentions"] if kwargs["output_attentions"] is not None else config.output_attentions + ) final_booleans["output_hidden_states"] = ( kwargs["output_hidden_states"] if kwargs["output_hidden_states"] is not None @@ -330,7 +332,10 @@ def booleans_processing(config, **kwargs): kwargs["use_cache"] if kwargs["use_cache"] is not None else getattr(config, "use_cache", None) ) else: - final_booleans["output_attentions"] = config.output_attentions + # Pure conv models (such as ConvNext) do not have `output_attentions`. If the signature has + # `output_attentions`, it will be present here in `kwargs`, even if unset (in that case, as `None`) + if "output_attentions" in kwargs: + final_booleans["output_attentions"] = config.output_attentions final_booleans["output_hidden_states"] = config.output_hidden_states if kwargs.get("return_dict", None) not in (None, True): @@ -403,7 +408,7 @@ def input_processing(func, config, input_ids, **kwargs): Two lists, one for the missing layers, and another one for the unexpected layers. 
""" signature = dict(inspect.signature(func).parameters) - signature.pop("kwargs", None) + has_kwargs = bool(signature.pop("kwargs", None)) signature.pop("self", None) parameter_names = list(signature.keys()) output = {} @@ -433,12 +438,14 @@ def input_processing(func, config, input_ids, **kwargs): elif "past_key_values" in kwargs["kwargs_call"] and "past" in parameter_names: kwargs["past"] = kwargs["kwargs_call"].pop("past_key_values") - if len(kwargs["kwargs_call"]) > 0: - raise ValueError( - f"The following keyword arguments are not supported by this model: {list(kwargs['kwargs_call'].keys())}." - ) - - kwargs.pop("kwargs_call") + if has_kwargs: + output["kwargs"] = kwargs.pop("kwargs_call", {}) + else: + if len(kwargs["kwargs_call"]) > 0: + raise ValueError( + f"The following keyword arguments are not supported by this model: {list(kwargs['kwargs_call'].keys())}." + ) + kwargs.pop("kwargs_call") for k, v in kwargs.items(): if isinstance(v, allowed_types) or v is None: diff --git a/src/transformers/models/albert/modeling_tf_albert.py b/src/transformers/models/albert/modeling_tf_albert.py index 51bc5c0ae77b..ae325558cd73 100644 --- a/src/transformers/models/albert/modeling_tf_albert.py +++ b/src/transformers/models/albert/modeling_tf_albert.py @@ -551,7 +551,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -785,7 +784,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: outputs = self.albert( input_ids=input_ids, @@ -854,7 +852,6 @@ def call( labels: Optional[Union[np.ndarray, tf.Tensor]] = None, sentence_order_label: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFAlbertForPreTrainingOutput, Tuple[tf.Tensor]]: r""" Return: @@ -976,7 +973,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1064,7 +1060,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1158,7 +1153,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1244,7 +1238,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1355,7 +1348,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape 
`(batch_size,)`, *optional*): diff --git a/src/transformers/models/bart/modeling_tf_bart.py b/src/transformers/models/bart/modeling_tf_bart.py index 106a87c043c9..9cf3e04054ec 100644 --- a/src/transformers/models/bart/modeling_tf_bart.py +++ b/src/transformers/models/bart/modeling_tf_bart.py @@ -679,7 +679,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: """ Args: @@ -834,7 +833,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: r""" Args: @@ -1273,7 +1271,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSeq2SeqLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/bert/modeling_tf_bert.py b/src/transformers/models/bert/modeling_tf_bert.py index 6dfae3d5fb60..5e8775ab0deb 100644 --- a/src/transformers/models/bert/modeling_tf_bert.py +++ b/src/transformers/models/bert/modeling_tf_bert.py @@ -737,7 +737,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: if not self.config.is_decoder: @@ -1067,7 +1066,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1174,7 +1172,6 @@ def call( labels: Optional[Union[np.ndarray, tf.Tensor]] = None, next_sentence_label: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBertForPreTrainingOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1302,7 +1299,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1520,7 +1516,6 @@ def call( return_dict: Optional[bool] = None, next_sentence_label: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFNextSentencePredictorOutput, Tuple[tf.Tensor]]: r""" Return: @@ -1628,7 +1623,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1723,7 +1717,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1857,7 +1850,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: 
Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1949,7 +1941,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/blenderbot/modeling_tf_blenderbot.py b/src/transformers/models/blenderbot/modeling_tf_blenderbot.py index 80236fab0211..4225f8e14e58 100644 --- a/src/transformers/models/blenderbot/modeling_tf_blenderbot.py +++ b/src/transformers/models/blenderbot/modeling_tf_blenderbot.py @@ -662,7 +662,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -823,7 +822,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -1276,7 +1274,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFSeq2SeqLMOutput]: r""" labels (`tf.tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py b/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py index af575e6418b7..2d7fe2af6137 100644 --- a/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py @@ -667,7 +667,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -827,7 +826,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -1253,7 +1251,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFSeq2SeqLMOutput]: r""" labels (`tf.tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/clip/modeling_tf_clip.py b/src/transformers/models/clip/modeling_tf_clip.py index f8192ac7aa05..366d0a9eb1dd 100644 --- a/src/transformers/models/clip/modeling_tf_clip.py +++ b/src/transformers/models/clip/modeling_tf_clip.py @@ -504,7 +504,6 @@ def call( output_hidden_states: bool, return_dict: bool, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: input_shape = shape_list(input_ids) @@ -593,7 +592,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: if input_ids is None: raise ValueError("You have to specify input_ids") @@ -632,7 +630,6 @@ def call( output_hidden_states: bool, return_dict: bool, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: embedding_output = self.embeddings(pixel_values=pixel_values) @@ -683,7 +680,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: if pixel_values is None: @@ -762,7 +758,6 @@ def get_text_features( output_hidden_states: Optional[bool] = None, return_dict: 
Optional[bool] = None, training: bool = False, - **kwargs, ) -> tf.Tensor: if input_ids is None: @@ -796,7 +791,6 @@ def get_image_features( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> tf.Tensor: if pixel_values is None: raise ValueError("You have to specify pixel_values") @@ -826,7 +820,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFCLIPOutput, Tuple[tf.Tensor]]: if input_ids is None: @@ -1058,7 +1051,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: r""" Returns: @@ -1153,7 +1145,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: r""" Returns: @@ -1258,7 +1249,6 @@ def get_text_features( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> tf.Tensor: r""" Returns: @@ -1297,7 +1287,6 @@ def get_image_features( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> tf.Tensor: r""" Returns: @@ -1345,7 +1334,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFCLIPOutput, Tuple[tf.Tensor]]: r""" Returns: diff --git a/src/transformers/models/convbert/modeling_tf_convbert.py b/src/transformers/models/convbert/modeling_tf_convbert.py index f167325527b6..8ec1b18ae748 100644 --- a/src/transformers/models/convbert/modeling_tf_convbert.py +++ b/src/transformers/models/convbert/modeling_tf_convbert.py @@ -581,7 +581,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") @@ -751,7 +750,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.convbert( input_ids=input_ids, @@ -870,7 +868,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFMaskedLMOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -979,7 +976,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFSequenceClassifierOutput]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1073,7 +1069,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFMultipleChoiceModelOutput]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1188,7 +1183,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFTokenClassifierOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1268,7 +1262,6 @@ def call( start_positions: Optional[tf.Tensor] = None, end_positions: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> 
Union[Tuple, TFQuestionAnsweringModelOutput]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/convnext/modeling_tf_convnext.py b/src/transformers/models/convnext/modeling_tf_convnext.py index b952b6775248..1cb1b71b6130 100644 --- a/src/transformers/models/convnext/modeling_tf_convnext.py +++ b/src/transformers/models/convnext/modeling_tf_convnext.py @@ -293,7 +293,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states @@ -439,7 +438,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: r""" Returns: @@ -518,7 +516,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/ctrl/modeling_tf_ctrl.py b/src/transformers/models/ctrl/modeling_tf_ctrl.py index 89d3ef561141..2a58467119ae 100644 --- a/src/transformers/models/ctrl/modeling_tf_ctrl.py +++ b/src/transformers/models/ctrl/modeling_tf_ctrl.py @@ -268,7 +268,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): # If using past key value states, only the last tokens @@ -541,7 +540,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.transformer( input_ids=input_ids, @@ -653,7 +651,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -765,7 +762,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/deberta/modeling_tf_deberta.py b/src/transformers/models/deberta/modeling_tf_deberta.py index c97b676596fb..90ec5ca2c89e 100644 --- a/src/transformers/models/deberta/modeling_tf_deberta.py +++ b/src/transformers/models/deberta/modeling_tf_deberta.py @@ -928,7 +928,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -1096,7 +1095,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: outputs = self.deberta( input_ids=input_ids, @@ -1156,7 +1154,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1242,7 +1239,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or 
`np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1325,7 +1321,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1404,7 +1399,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py b/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py index 0a77a6057d9d..39cf57a146f6 100644 --- a/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py +++ b/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py @@ -1028,7 +1028,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -1198,7 +1197,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: outputs = self.deberta( input_ids=input_ids, @@ -1259,7 +1257,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1346,7 +1343,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1430,7 +1426,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1510,7 +1505,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/distilbert/modeling_tf_distilbert.py b/src/transformers/models/distilbert/modeling_tf_distilbert.py index ccae454ebe05..07aeee9e1f97 100644 --- a/src/transformers/models/distilbert/modeling_tf_distilbert.py +++ b/src/transformers/models/distilbert/modeling_tf_distilbert.py @@ -372,7 +372,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") @@ -543,7 +542,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - 
**kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: outputs = self.distilbert( input_ids=input_ids, @@ -647,7 +645,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -735,7 +732,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -817,7 +813,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -911,7 +906,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1021,7 +1015,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/dpr/modeling_tf_dpr.py b/src/transformers/models/dpr/modeling_tf_dpr.py index f2b1a1606e4d..df290f6f5d72 100644 --- a/src/transformers/models/dpr/modeling_tf_dpr.py +++ b/src/transformers/models/dpr/modeling_tf_dpr.py @@ -174,7 +174,6 @@ def call( output_hidden_states: bool = None, return_dict: bool = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor, ...]]: outputs = self.bert_model( input_ids=input_ids, @@ -235,7 +234,6 @@ def call( output_hidden_states: bool = False, return_dict: bool = False, training: bool = False, - **kwargs, ) -> Union[TFDPRReaderOutput, Tuple[tf.Tensor, ...]]: # notations: N - number of questions in a batch, M - number of passages per questions, L - sequence length n_passages, sequence_length = shape_list(input_ids) if input_ids is not None else shape_list(inputs_embeds)[:2] @@ -294,7 +292,6 @@ def call( output_hidden_states: bool = False, return_dict: bool = False, training: bool = False, - **kwargs, ) -> Union[TFDPRReaderOutput, Tuple[tf.Tensor, ...]]: outputs = self.encoder( input_ids=input_ids, @@ -328,7 +325,6 @@ def call( output_hidden_states: bool = False, return_dict: bool = False, training: bool = False, - **kwargs, ) -> Union[TFDPRReaderOutput, Tuple[tf.Tensor, ...]]: outputs = self.encoder( input_ids=input_ids, @@ -560,7 +556,6 @@ def call( output_hidden_states=None, return_dict=None, training: bool = False, - **kwargs, ) -> Union[TFDPRContextEncoderOutput, Tuple[tf.Tensor, ...]]: r""" Return: @@ -648,7 +643,6 @@ def call( output_hidden_states=None, return_dict=None, training: bool = False, - **kwargs, ) -> Union[TFDPRQuestionEncoderOutput, Tuple[tf.Tensor, ...]]: r""" Return: @@ -734,7 +728,6 @@ def call( output_hidden_states: bool = None, return_dict=None, training: bool = False, - **kwargs, ) -> Union[TFDPRReaderOutput, Tuple[tf.Tensor, ...]]: r""" Return: diff --git 
a/src/transformers/models/electra/modeling_tf_electra.py b/src/transformers/models/electra/modeling_tf_electra.py index 9cbbd4b7e1e5..eccb321f1005 100644 --- a/src/transformers/models/electra/modeling_tf_electra.py +++ b/src/transformers/models/electra/modeling_tf_electra.py @@ -719,7 +719,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: if not self.config.is_decoder: use_cache = False @@ -953,7 +952,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1043,7 +1041,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFElectraForPreTrainingOutput, Tuple[tf.Tensor]]: r""" Returns: @@ -1180,7 +1177,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1290,7 +1286,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1383,7 +1378,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1501,7 +1495,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1583,7 +1576,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py b/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py index 1c59493e1bf7..9e92e767b1b8 100644 --- a/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py +++ b/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py @@ -23,7 +23,7 @@ from ...configuration_utils import PretrainedConfig from ...modeling_tf_outputs import TFBaseModelOutput, TFSeq2SeqLMOutput -from ...modeling_tf_utils import TFCausalLanguageModelingLoss, TFPreTrainedModel, get_initializer, input_processing +from ...modeling_tf_utils import TFCausalLanguageModelingLoss, TFPreTrainedModel, get_initializer, unpack_inputs from ...tf_utils import shape_list from ...utils import ( DUMMY_INPUTS, @@ -491,6 +491,7 @@ def from_encoder_decoder_pretrained( config = 
EncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config, **kwargs) return cls(encoder=encoder, decoder=decoder, config=config) + @unpack_inputs @add_start_docstrings_to_model_forward(ENCODER_DECODER_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @replace_return_docstrings(output_type=TFSeq2SeqLMOutput, config_class=_CONFIG_FOR_DOC) def call( @@ -559,9 +560,7 @@ def call( if encoder_outputs is None: - encoder_processing_inputs = { - "func": self.encoder.call, - "config": self.encoder.config, + encoder_inputs = { "input_ids": input_ids, "attention_mask": attention_mask, "inputs_embeds": inputs_embeds, @@ -569,14 +568,10 @@ def call( "output_hidden_states": output_hidden_states, "return_dict": return_dict, "training": training, - "kwargs_call": {}, } # Add arguments to encoder from `kwargs_encoder` - for k, v in kwargs_encoder.items(): - encoder_processing_inputs[k] = v - - encoder_inputs = input_processing(**encoder_processing_inputs) + encoder_inputs.update(kwargs_encoder) # Handle the case where the inputs are passed as a single dict which contains `labels`. # The `labels` shouldn't be passed to `self.encoder` below, because it is a based model without this @@ -607,9 +602,7 @@ def call( labels, self.config.pad_token_id, self.config.decoder_start_token_id ) - decoder_processing_inputs = { - "func": self.decoder.call, - "config": self.decoder.config, + decoder_inputs = { "input_ids": decoder_input_ids, "attention_mask": decoder_attention_mask, "encoder_hidden_states": encoder_hidden_states, @@ -621,14 +614,11 @@ def call( "past_key_values": past_key_values, "return_dict": return_dict, "training": training, - "kwargs_call": {}, } # Add arguments to decoder from `kwargs_decoder` - for k, v in kwargs_decoder.items(): - decoder_processing_inputs[k] = v + decoder_inputs.update(kwargs_decoder) - decoder_inputs = input_processing(**decoder_processing_inputs) decoder_outputs = self.decoder(**decoder_inputs) logits = decoder_outputs[0] diff --git a/src/transformers/models/flaubert/modeling_tf_flaubert.py b/src/transformers/models/flaubert/modeling_tf_flaubert.py index 8441e1801730..f751c0f22502 100644 --- a/src/transformers/models/flaubert/modeling_tf_flaubert.py +++ b/src/transformers/models/flaubert/modeling_tf_flaubert.py @@ -258,7 +258,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFBaseModelOutput]: outputs = self.transformer( input_ids=input_ids, @@ -490,7 +489,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFBaseModelOutput]: # removed: src_enc=None, src_len=None @@ -808,7 +806,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFFlaubertWithLMHeadModelOutput]: transformer_outputs = self.transformer( diff --git a/src/transformers/models/funnel/modeling_tf_funnel.py b/src/transformers/models/funnel/modeling_tf_funnel.py index 56e6bf13b494..c1ddef0ad9cd 100644 --- a/src/transformers/models/funnel/modeling_tf_funnel.py +++ b/src/transformers/models/funnel/modeling_tf_funnel.py @@ -761,7 +761,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: @@ -835,7 +834,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - 
**kwargs, ): if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") @@ -1117,7 +1115,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFBaseModelOutput]: return self.funnel( input_ids=input_ids, @@ -1165,7 +1162,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFBaseModelOutput]: return self.funnel( @@ -1293,7 +1289,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFMaskedLMOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1369,7 +1364,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFSequenceClassifierOutput]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1455,7 +1449,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFMultipleChoiceModelOutput]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1566,7 +1559,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFTokenClassifierOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1645,7 +1637,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: bool = False, - **kwargs, ) -> Union[Tuple[tf.Tensor], TFQuestionAnsweringModelOutput]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/gpt2/modeling_tf_gpt2.py b/src/transformers/models/gpt2/modeling_tf_gpt2.py index 88b4fb5ed607..8a35208b52e8 100644 --- a/src/transformers/models/gpt2/modeling_tf_gpt2.py +++ b/src/transformers/models/gpt2/modeling_tf_gpt2.py @@ -367,7 +367,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -730,7 +729,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -920,7 +918,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1038,7 +1035,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFGPT2DoubleHeadsModelOutput, Tuple[tf.Tensor]]: r""" 
mc_token_ids (`tf.Tensor` or `Numpy array` of shape `(batch_size, num_choices)`, *optional*, default to index of the last token of the input): @@ -1195,7 +1191,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutputWithPast, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/gptj/modeling_tf_gptj.py b/src/transformers/models/gptj/modeling_tf_gptj.py index ce5c5d78e5ae..702b163f4719 100644 --- a/src/transformers/models/gptj/modeling_tf_gptj.py +++ b/src/transformers/models/gptj/modeling_tf_gptj.py @@ -390,7 +390,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: @@ -672,7 +671,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ): r""" use_cache (`bool`, *optional*, defaults to `True`): @@ -781,7 +779,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ): r""" labels (`np.ndarray` or `tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -886,7 +883,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ): r""" labels (`np.ndarray` or `tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1011,7 +1007,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ): r""" start_positions (`np.ndarray` or `tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/layoutlm/modeling_tf_layoutlm.py b/src/transformers/models/layoutlm/modeling_tf_layoutlm.py index e6fd771d37e2..86b2fc5a38ae 100644 --- a/src/transformers/models/layoutlm/modeling_tf_layoutlm.py +++ b/src/transformers/models/layoutlm/modeling_tf_layoutlm.py @@ -706,7 +706,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -928,7 +927,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: r""" Returns: @@ -1048,7 +1046,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1172,7 +1169,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1303,7 +1299,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or 
`np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/led/modeling_tf_led.py b/src/transformers/models/led/modeling_tf_led.py index 4519f5df9808..8381d81afb4d 100644 --- a/src/transformers/models/led/modeling_tf_led.py +++ b/src/transformers/models/led/modeling_tf_led.py @@ -1666,7 +1666,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -1911,7 +1910,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -2333,7 +2331,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): """ Returns: @@ -2429,7 +2426,7 @@ def prepare_inputs_for_generation( decoder_head_mask=None, use_cache=None, encoder_outputs=None, - **kwargs, + **kwargs ): # cut decoder_input_ids if past is used if past is not None: diff --git a/src/transformers/models/longformer/modeling_tf_longformer.py b/src/transformers/models/longformer/modeling_tf_longformer.py index 762f872ee709..850a8113f6ad 100644 --- a/src/transformers/models/longformer/modeling_tf_longformer.py +++ b/src/transformers/models/longformer/modeling_tf_longformer.py @@ -1676,7 +1676,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: @@ -2023,7 +2022,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerBaseModelOutputWithPooling, Tuple[tf.Tensor]]: outputs = self.longformer( @@ -2100,7 +2098,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -2194,7 +2191,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -2340,7 +2336,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerSequenceClassifierOutput, Tuple[tf.Tensor]]: if global_attention_mask is None and input_ids is not None: @@ -2450,7 +2445,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -2580,7 +2574,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.array, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFLongformerTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/lxmert/modeling_tf_lxmert.py b/src/transformers/models/lxmert/modeling_tf_lxmert.py index efa812a59654..2101b7cf1f54 100644 --- a/src/transformers/models/lxmert/modeling_tf_lxmert.py +++ b/src/transformers/models/lxmert/modeling_tf_lxmert.py @@ -685,7 +685,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, 
): if input_ids is not None and inputs_embeds is not None: @@ -946,7 +945,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.lxmert( input_ids, @@ -1282,7 +1280,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" masked_lm_labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/marian/modeling_tf_marian.py b/src/transformers/models/marian/modeling_tf_marian.py index aa766e681544..a696a5648fe4 100644 --- a/src/transformers/models/marian/modeling_tf_marian.py +++ b/src/transformers/models/marian/modeling_tf_marian.py @@ -707,7 +707,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -866,7 +865,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -1296,7 +1294,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/mbart/modeling_tf_mbart.py b/src/transformers/models/mbart/modeling_tf_mbart.py index 3f2ea655f455..021dc21f21a1 100644 --- a/src/transformers/models/mbart/modeling_tf_mbart.py +++ b/src/transformers/models/mbart/modeling_tf_mbart.py @@ -684,7 +684,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: """ Args: @@ -848,7 +847,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[ TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor] ]: @@ -1278,7 +1276,7 @@ def call( decoder_head_mask: Optional[tf.Tensor] = None, cross_attn_head_mask: Optional[tf.Tensor] = None, encoder_outputs: Optional[TFBaseModelOutput] = None, - past_key_values: [Tuple[Tuple[tf.Tensor]]] = None, + past_key_values: Tuple[Tuple[tf.Tensor]] = None, inputs_embeds: Optional[tf.Tensor] = None, decoder_inputs_embeds: Optional[tf.Tensor] = None, use_cache: Optional[bool] = None, @@ -1287,7 +1285,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[tf.Tensor] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSeq2SeqLMOutput, Tuple[tf.Tensor]]: """ labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/mobilebert/modeling_tf_mobilebert.py b/src/transformers/models/mobilebert/modeling_tf_mobilebert.py index 007be43f5f06..5d1c74252e9b 100644 --- a/src/transformers/models/mobilebert/modeling_tf_mobilebert.py +++ b/src/transformers/models/mobilebert/modeling_tf_mobilebert.py @@ -692,7 +692,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") @@ -928,7 +927,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.mobilebert( input_ids=input_ids, @@ -993,7 +991,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Return: @@ -1092,7 +1089,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ 
-1176,7 +1172,6 @@ def call( return_dict=None, next_sentence_label=None, training=False, - **kwargs, ): r""" Return: @@ -1287,7 +1282,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1381,7 +1375,6 @@ def call( start_positions=None, end_positions=None, training=False, - **kwargs, ): r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1498,7 +1491,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1626,7 +1618,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/mpnet/modeling_tf_mpnet.py b/src/transformers/models/mpnet/modeling_tf_mpnet.py index 5edd73c4170b..0e8c61e3403c 100644 --- a/src/transformers/models/mpnet/modeling_tf_mpnet.py +++ b/src/transformers/models/mpnet/modeling_tf_mpnet.py @@ -497,7 +497,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: @@ -686,7 +685,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.mpnet( input_ids=input_ids, @@ -803,7 +801,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -909,7 +906,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1000,7 +996,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1112,7 +1107,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/openai/modeling_tf_openai.py b/src/transformers/models/openai/modeling_tf_openai.py index 80d7a9abd192..40a94c18815e 100644 --- a/src/transformers/models/openai/modeling_tf_openai.py +++ b/src/transformers/models/openai/modeling_tf_openai.py @@ -249,7 +249,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if input_ids is not None and inputs_embeds is not None: @@ -522,7 +521,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFBaseModelOutput]: outputs = self.transformer( @@ -586,7 +584,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFCausalLMOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -669,7 +666,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFOpenAIGPTDoubleHeadsModelOutput]: r""" mc_token_ids (`tf.Tensor` or `Numpy array` of shape `(batch_size, num_choices)`, *optional*, default to index of the last token of the input): @@ -813,7 +809,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFSequenceClassifierOutput]: r""" 
labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/pegasus/modeling_tf_pegasus.py b/src/transformers/models/pegasus/modeling_tf_pegasus.py index 26f3ef461198..d7eea1660a40 100644 --- a/src/transformers/models/pegasus/modeling_tf_pegasus.py +++ b/src/transformers/models/pegasus/modeling_tf_pegasus.py @@ -710,7 +710,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -872,7 +871,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -1305,7 +1303,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): """ labels (`tf.tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/rembert/modeling_tf_rembert.py b/src/transformers/models/rembert/modeling_tf_rembert.py index 9a3892f409fe..f40ea6f6f1c4 100644 --- a/src/transformers/models/rembert/modeling_tf_rembert.py +++ b/src/transformers/models/rembert/modeling_tf_rembert.py @@ -660,7 +660,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: if not self.config.is_decoder: @@ -959,7 +958,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1060,7 +1058,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1155,7 +1152,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1283,7 +1279,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1374,7 +1369,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1494,7 +1488,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1575,7 +1568,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or 
`np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/roberta/modeling_tf_roberta.py b/src/transformers/models/roberta/modeling_tf_roberta.py index a62659582b7e..b63d99a901b7 100644 --- a/src/transformers/models/roberta/modeling_tf_roberta.py +++ b/src/transformers/models/roberta/modeling_tf_roberta.py @@ -624,7 +624,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPoolingAndCrossAttentions, Tuple[tf.Tensor]]: if not self.config.is_decoder: @@ -936,7 +935,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFBaseModelOutputWithPoolingAndCrossAttentions]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1093,7 +1091,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1196,7 +1193,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1353,7 +1349,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1449,7 +1444,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1567,7 +1561,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1655,7 +1648,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/roformer/modeling_tf_roformer.py b/src/transformers/models/roformer/modeling_tf_roformer.py index bed8ecf975c2..020824bb37e2 100644 --- a/src/transformers/models/roformer/modeling_tf_roformer.py +++ b/src/transformers/models/roformer/modeling_tf_roformer.py @@ -614,7 +614,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -817,7 +816,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, 
Tuple[tf.Tensor]]: outputs = self.roformer( input_ids=input_ids, @@ -877,7 +875,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -953,7 +950,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFCausalLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1064,7 +1060,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1155,7 +1150,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1269,7 +1263,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1348,7 +1341,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/speech_to_text/modeling_tf_speech_to_text.py b/src/transformers/models/speech_to_text/modeling_tf_speech_to_text.py index 6c78ab1b58f3..7848630314d4 100755 --- a/src/transformers/models/speech_to_text/modeling_tf_speech_to_text.py +++ b/src/transformers/models/speech_to_text/modeling_tf_speech_to_text.py @@ -791,7 +791,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -957,7 +956,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: diff --git a/src/transformers/models/t5/modeling_tf_t5.py b/src/transformers/models/t5/modeling_tf_t5.py index 133103f3e855..d7fd5b30145d 100644 --- a/src/transformers/models/t5/modeling_tf_t5.py +++ b/src/transformers/models/t5/modeling_tf_t5.py @@ -654,7 +654,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ) -> Tuple: if input_ids is not None and inputs_embeds is not None: @@ -1152,7 +1151,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFSeq2SeqModelOutput]: r""" Returns: @@ -1329,7 +1327,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFSeq2SeqLMOutput]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1611,6 +1608,10 @@ def __init__(self, config, *inputs, **kwargs): 
encoder_config.use_cache = False self.encoder = TFT5MainLayer(encoder_config, embed_tokens, name="encoder") + @property + def dummy_inputs(self): + return {"input_ids": tf.constant(DUMMY_INPUTS)} + def get_encoder(self): return self.encoder @@ -1627,7 +1628,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFBaseModelOutput]: r""" Returns: @@ -1670,6 +1670,19 @@ def call( attentions=encoder_outputs.attentions, ) + @tf.function( + input_signature=[ + { + "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), + "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), + } + ] + ) + def serving(self, inputs): + output = self.call(inputs) + + return self.serving_output(output) + # Copied from transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertModel.serving_output def serving_output(self, output): hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None diff --git a/src/transformers/models/tapas/modeling_tf_tapas.py b/src/transformers/models/tapas/modeling_tf_tapas.py index 8f2138f2fbad..b6a2f10d1205 100644 --- a/src/transformers/models/tapas/modeling_tf_tapas.py +++ b/src/transformers/models/tapas/modeling_tf_tapas.py @@ -770,7 +770,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: if input_ids is not None and inputs_embeds is not None: @@ -980,7 +979,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: r""" Returns: @@ -1067,7 +1065,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1285,7 +1282,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTableQuestionAnsweringOutput, Tuple[tf.Tensor]]: r""" table_mask (`tf.Tensor` of shape `(batch_size, seq_length)`, *optional*): @@ -1602,7 +1598,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py b/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py index d5dc28c36503..8ad931150edc 100644 --- a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py +++ b/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py @@ -550,7 +550,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library @@ -898,7 +897,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.transformer( input_ids=input_ids, @@ -979,7 +977,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): if input_ids is not None: bsz, 
tgt_len = shape_list(input_ids)[:2] @@ -1088,7 +1085,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[Tuple, TFTransfoXLSequenceClassifierOutputWithPast]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): diff --git a/src/transformers/models/vision_encoder_decoder/modeling_tf_vision_encoder_decoder.py b/src/transformers/models/vision_encoder_decoder/modeling_tf_vision_encoder_decoder.py index eeaca58c5a01..edc2973a0734 100644 --- a/src/transformers/models/vision_encoder_decoder/modeling_tf_vision_encoder_decoder.py +++ b/src/transformers/models/vision_encoder_decoder/modeling_tf_vision_encoder_decoder.py @@ -23,7 +23,7 @@ from ...configuration_utils import PretrainedConfig from ...modeling_tf_outputs import TFBaseModelOutput, TFSeq2SeqLMOutput -from ...modeling_tf_utils import TFCausalLanguageModelingLoss, TFPreTrainedModel, get_initializer, input_processing +from ...modeling_tf_utils import TFCausalLanguageModelingLoss, TFPreTrainedModel, get_initializer, unpack_inputs from ...tf_utils import shape_list from ...utils import ( DUMMY_INPUTS, @@ -510,6 +510,7 @@ def from_encoder_decoder_pretrained( config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder.config, decoder.config, **kwargs) return cls(encoder=encoder, decoder=decoder, config=config) + @unpack_inputs @add_start_docstrings_to_model_forward( VISION_ENCODER_DECODER_INPUTS_DOCSTRING.format("batch_size, sequence_length") ) @@ -585,21 +586,16 @@ def call( if encoder_outputs is None: - encoder_processing_inputs = { - "func": self.encoder.call, - "config": self.encoder.config, + encoder_inputs = { "input_ids": pixel_values, "output_attentions": output_attentions, "output_hidden_states": output_hidden_states, "return_dict": return_dict, "training": training, - "kwargs_call": {}, } # Add arguments to encoder from `kwargs_encoder` - encoder_processing_inputs.update(kwargs_encoder) - - encoder_inputs = input_processing(**encoder_processing_inputs) + encoder_inputs.update(kwargs_encoder) if "input_ids" in encoder_inputs: encoder_inputs["pixel_values"] = encoder_inputs.pop("input_ids") @@ -639,9 +635,7 @@ def call( batch_size, sequence_length = shape_list(encoder_hidden_states)[:2] encoder_attention_mask = tf.ones(shape=(batch_size, sequence_length), dtype=tf.int32) - decoder_processing_inputs = { - "func": self.decoder.call, - "config": self.decoder.config, + decoder_inputs = { "input_ids": decoder_input_ids, "attention_mask": decoder_attention_mask, "encoder_hidden_states": encoder_hidden_states, @@ -653,13 +647,11 @@ def call( "past_key_values": past_key_values, "return_dict": return_dict, "training": training, - "kwargs_call": {}, } # Add arguments to decoder from `kwargs_decoder` - decoder_processing_inputs.update(kwargs_decoder) + decoder_inputs.update(kwargs_decoder) - decoder_inputs = input_processing(**decoder_processing_inputs) decoder_outputs = self.decoder(**decoder_inputs) logits = decoder_outputs[0] diff --git a/src/transformers/models/vit/modeling_tf_vit.py b/src/transformers/models/vit/modeling_tf_vit.py index e2e946d8c9f4..cbf935f4f743 100644 --- a/src/transformers/models/vit/modeling_tf_vit.py +++ b/src/transformers/models/vit/modeling_tf_vit.py @@ -486,7 +486,6 @@ def call( interpolate_pos_encoding: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: if pixel_values 
is None: @@ -656,7 +655,6 @@ def call( interpolate_pos_encoding: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]: r""" Returns: @@ -757,7 +755,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/vit_mae/modeling_tf_vit_mae.py b/src/transformers/models/vit_mae/modeling_tf_vit_mae.py index 40f100b64ff1..6ff588fce3d4 100644 --- a/src/transformers/models/vit_mae/modeling_tf_vit_mae.py +++ b/src/transformers/models/vit_mae/modeling_tf_vit_mae.py @@ -647,7 +647,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFViTMAEModelOutput, Tuple[tf.Tensor]]: embedding_output, mask, ids_restore = self.embeddings( pixel_values=pixel_values, training=training, noise=noise @@ -811,7 +810,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFViTMAEModelOutput, Tuple[tf.Tensor]]: r""" Returns: @@ -1028,7 +1026,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFViTMAEForPreTrainingOutput, Tuple[tf.Tensor]]: r""" Returns: diff --git a/src/transformers/models/xlm/modeling_tf_xlm.py b/src/transformers/models/xlm/modeling_tf_xlm.py index dbb994ed47c1..46b41fba3ae7 100644 --- a/src/transformers/models/xlm/modeling_tf_xlm.py +++ b/src/transformers/models/xlm/modeling_tf_xlm.py @@ -360,7 +360,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): # removed: src_enc=None, src_len=None @@ -707,7 +706,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.transformer( input_ids=input_ids, @@ -843,7 +841,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): transformer_outputs = self.transformer( input_ids=input_ids, @@ -917,7 +914,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1025,7 +1021,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): if input_ids is not None: num_choices = shape_list(input_ids)[1] @@ -1150,7 +1145,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1237,7 +1231,6 @@ def call( start_positions=None, end_positions=None, training=False, - **kwargs, ): r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/src/transformers/models/xlnet/modeling_tf_xlnet.py b/src/transformers/models/xlnet/modeling_tf_xlnet.py index 3a77c4845dfd..d81924d3451e 100644 --- a/src/transformers/models/xlnet/modeling_tf_xlnet.py +++ b/src/transformers/models/xlnet/modeling_tf_xlnet.py @@ -597,7 +597,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): if training and use_mems is None: @@ -1152,7 +1151,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): outputs = self.transformer( input_ids=input_ids, @@ -1262,7 +1260,6 @@ def 
call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFXLNetLMHeadModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1394,7 +1391,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFXLNetForSequenceClassificationOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1501,7 +1497,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFXLNetForMultipleChoiceOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size,)`, *optional*): @@ -1623,7 +1618,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFXLNetForTokenClassificationOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): @@ -1711,7 +1705,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFXLNetForQuestionAnsweringSimpleOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` of shape `(batch_size,)`, *optional*): diff --git a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/modeling_tf_{{cookiecutter.lowercase_modelname}}.py b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/modeling_tf_{{cookiecutter.lowercase_modelname}}.py index da2a0a3828f9..2d9914eebd6c 100644 --- a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/modeling_tf_{{cookiecutter.lowercase_modelname}}.py +++ b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/modeling_tf_{{cookiecutter.lowercase_modelname}}.py @@ -653,7 +653,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: if not self.config.is_decoder: @@ -949,7 +948,6 @@ def call( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFBaseModelOutputWithPastAndCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1049,7 +1047,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1146,7 +1143,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: r""" encoder_hidden_states (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): @@ -1289,7 +1285,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> 
Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1379,7 +1374,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -1506,7 +1500,6 @@ def call( return_dict: Optional[bool] = None, labels: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*): @@ -1588,7 +1581,6 @@ def call( start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, training: Optional[bool] = False, - **kwargs, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): @@ -2262,7 +2254,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): """ Args: @@ -2421,7 +2412,6 @@ def call( output_hidden_states=None, return_dict=None, training=False, - **kwargs, ): r""" Args: @@ -2876,7 +2866,6 @@ def call( return_dict=None, labels=None, training=False, - **kwargs, ): """ Returns: diff --git a/tests/convbert/test_modeling_tf_convbert.py b/tests/convbert/test_modeling_tf_convbert.py index e2d68876263a..2ae29c3e4a5a 100644 --- a/tests/convbert/test_modeling_tf_convbert.py +++ b/tests/convbert/test_modeling_tf_convbert.py @@ -355,7 +355,6 @@ def check_encoder_attentions_output(outputs): for model_class in self.all_model_classes: inputs_dict["output_attentions"] = True - inputs_dict["use_cache"] = False config.output_hidden_states = False model = model_class(config) outputs = model(self._prepare_for_class(inputs_dict, model_class)) diff --git a/tests/t5/test_modeling_tf_t5.py b/tests/t5/test_modeling_tf_t5.py index c6585f83b18e..7ac0b33e426b 100644 --- a/tests/t5/test_modeling_tf_t5.py +++ b/tests/t5/test_modeling_tf_t5.py @@ -346,6 +346,11 @@ def test_resize_embeddings(self): self.assertEqual(model.get_input_embeddings().weight.shape[0], len(tokenizer)) self.assertNotEqual(model.get_input_embeddings().weight.shape[0], original_vocab_size) + # This test is run in `TFT5EncoderOnlyModelTest`, where the main layer has the same inputs as the model + @unittest.skip(reason="The inputs of the Main Layer are different.") + def test_keras_save_load(self): + pass + class TFT5EncoderOnlyModelTester: def __init__( diff --git a/tests/test_modeling_tf_common.py b/tests/test_modeling_tf_common.py index 9473a50f53aa..b72034de6958 100644 --- a/tests/test_modeling_tf_common.py +++ b/tests/test_modeling_tf_common.py @@ -573,7 +573,12 @@ def check_pt_tf_models(tf_model, pt_model): pt_model = pt_model_class(config) tf_inputs_dict = self._prepare_for_class(inputs_dict, model_class) - tf_inputs_dict_maybe_with_labels = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + tf_inputs_dict_maybe_with_labels = self._prepare_for_class( + inputs_dict, + model_class, + # Not all models accept "labels" in the forward pass (yet :) ) + return_labels=True if "labels" in inspect.signature(model_class.call).parameters.keys() else False, + ) # Check we can load pt model in tf and vice-versa with model => model functions tf_model = 
transformers.load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=tf_inputs_dict) @@ -722,7 +727,6 @@ def check_encoder_attentions_output(outputs): for model_class in self.all_model_classes: inputs_dict["output_attentions"] = True - inputs_dict["use_cache"] = False config.output_hidden_states = False model = model_class(config) outputs = model(self._prepare_for_class(inputs_dict, model_class)) @@ -944,10 +948,6 @@ def recursive_check(tuple_object, dict_object): dict_inputs = self._prepare_for_class(inputs_dict, model_class) check_equivalence(model, tuple_inputs, dict_inputs) - tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - check_equivalence(model, tuple_inputs, dict_inputs) - tuple_inputs = self._prepare_for_class(inputs_dict, model_class) dict_inputs = self._prepare_for_class(inputs_dict, model_class) check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) @@ -956,19 +956,25 @@ def recursive_check(tuple_object, dict_object): dict_inputs = self._prepare_for_class(inputs_dict, model_class) check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True}) - tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) + # Not all models accept "labels" in the forward pass (yet :) ) + if "labels" in inspect.signature(model.call).parameters.keys(): + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs) - tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True}) + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) - tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) - check_equivalence( - model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True} - ) + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True}) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence( + model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True} + ) def test_inputs_embeds(self): config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() From b442b3348585c871c355a39171e89d6f047aeb95 Mon Sep 17 00:00:00 2001 From: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Mon, 4 Apr 2022 17:50:56 +0200 Subject: [PATCH 23/34] 
[SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output (#16586) --- .../speech_encoder_decoder/modeling_speech_encoder_decoder.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py b/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py index 45262ad940fe..3722c123c3bb 100644 --- a/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py +++ b/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py @@ -572,7 +572,7 @@ def forward( decoder_hidden_states=decoder_outputs.hidden_states, decoder_attentions=decoder_outputs.attentions, cross_attentions=decoder_outputs.cross_attentions, - encoder_last_hidden_state=encoder_outputs.last_hidden_state, + encoder_last_hidden_state=encoder_hidden_states, encoder_hidden_states=encoder_outputs.hidden_states, encoder_attentions=encoder_outputs.attentions, ) From 29a3b42737a1ee129f517491e90b627bf8d5c899 Mon Sep 17 00:00:00 2001 From: Andres Codas Date: Mon, 4 Apr 2022 12:20:26 -0400 Subject: [PATCH 24/34] initialize the default rank set on TrainerState (#16530) * initialize the default rank set on TrainerState * fix style --- src/transformers/trainer.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py index 921b9d27ac08..a2fb10b9e040 100755 --- a/src/transformers/trainer.py +++ b/src/transformers/trainer.py @@ -488,7 +488,11 @@ def __init__( else: self.label_smoother = None - self.state = TrainerState() + self.state = TrainerState( + is_local_process_zero=self.is_local_process_zero(), + is_world_process_zero=self.is_world_process_zero(), + ) + self.control = TrainerControl() # Internal variable to count flos in each process, will be accumulated in `self.state.total_flos` then # returned to 0 every time flos need to be logged From b96e629676ea57a09ee6e027d27af08ce34a392c Mon Sep 17 00:00:00 2001 From: Sylvain Gugger Date: Mon, 4 Apr 2022 14:06:49 -0400 Subject: [PATCH 25/34] Trigger doc build From 82ad581c3bdf294bff55810c6dac2796c0025c27 Mon Sep 17 00:00:00 2001 From: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Tue, 5 Apr 2022 10:00:03 +0200 Subject: [PATCH 26/34] Fix CI: test_inference_for_pretraining in ViTMAEModelTest (#16591) Co-authored-by: ydshieh --- tests/vit_mae/test_modeling_vit_mae.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/vit_mae/test_modeling_vit_mae.py b/tests/vit_mae/test_modeling_vit_mae.py index 6ae62cb1c2c3..8cbde5b2ce92 100644 --- a/tests/vit_mae/test_modeling_vit_mae.py +++ b/tests/vit_mae/test_modeling_vit_mae.py @@ -561,7 +561,7 @@ def test_inference_for_pretraining(self): # forward pass with torch.no_grad(): - outputs = model(**inputs, noise=torch.from_numpy(noise)) + outputs = model(**inputs, noise=torch.from_numpy(noise).to(device=torch_device)) # verify the logits expected_shape = torch.Size((1, 196, 768)) From cfb63da0bffe11a0609fdaa117d2d60d7b6b9ac1 Mon Sep 17 00:00:00 2001 From: SaulLu <55560583+SaulLu@users.noreply.github.com> Date: Tue, 5 Apr 2022 10:50:22 +0200 Subject: [PATCH 27/34] add a template to add missing tokenization test (#16553) * add a template to add missing tokenization test * add cookiecutter setting * improve doc * Update templates/adding_a_missing_tokenization_test/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger 
<35901082+sgugger@users.noreply.github.com> --- .../README.md | 39 ++++++++++ ...on_{{cookiecutter.lowercase_modelname}}.py | 78 +++++++++++++++++++ .../cookiecutter.json | 10 +++ 3 files changed, 127 insertions(+) create mode 100644 templates/adding_a_missing_tokenization_test/README.md create mode 100644 templates/adding_a_missing_tokenization_test/cookiecutter-template-{{cookiecutter.modelname}}/test_tokenization_{{cookiecutter.lowercase_modelname}}.py create mode 100644 templates/adding_a_missing_tokenization_test/cookiecutter.json diff --git a/templates/adding_a_missing_tokenization_test/README.md b/templates/adding_a_missing_tokenization_test/README.md new file mode 100644 index 000000000000..935f21c5ca8a --- /dev/null +++ b/templates/adding_a_missing_tokenization_test/README.md @@ -0,0 +1,39 @@ + + +This folder contains a template to add a tokenization test. + +## Usage + +Using the `cookiecutter` utility requires having all the `dev` dependencies installed. + +Let's first [fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the `transformers` repo on GitHub. Once that's done, you can clone your fork and install `transformers` in your environment: + +```shell script +git clone https://github.com/YOUR-USERNAME/transformers +cd transformers +pip install -e ".[dev]" +``` + +Once the installation is done, you can generate the template by running the following command. Be careful: the template will be generated inside a new folder in your current working directory. + +```shell script +cookiecutter path-to-the-folder/adding_a_missing_tokenization_test/ +``` + +You will then have to answer some questions about the tokenizer for which you want to add tests. The `modelname` should be cased according to the plain text casing, i.e., BERT, RoBERTa, DeBERTa. + +Once the command has finished, you should have one new file inside the newly created folder named `test_tokenization_Xxx.py`. At this point, the template is finished and you can move it to the sub-folder of the corresponding model in the test folder. diff --git a/templates/adding_a_missing_tokenization_test/cookiecutter-template-{{cookiecutter.modelname}}/test_tokenization_{{cookiecutter.lowercase_modelname}}.py b/templates/adding_a_missing_tokenization_test/cookiecutter-template-{{cookiecutter.modelname}}/test_tokenization_{{cookiecutter.lowercase_modelname}}.py new file mode 100644 index 000000000000..631886f6b2eb --- /dev/null +++ b/templates/adding_a_missing_tokenization_test/cookiecutter-template-{{cookiecutter.modelname}}/test_tokenization_{{cookiecutter.lowercase_modelname}}.py @@ -0,0 +1,78 @@ +# coding=utf-8 +# Copyright 2022 {{cookiecutter.authors}}. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the {{cookiecutter.modelname}} tokenizer.
""" + + +import unittest + +{% if cookiecutter.has_slow_class == "True" and cookiecutter.has_fast_class == "True" -%} +from transformers import {{cookiecutter.camelcase_modelname}}Tokenizer, {{cookiecutter.camelcase_modelname}}TokenizerFast +{% elif cookiecutter.has_slow_class == "True" -%} +from transformers import {{cookiecutter.camelcase_modelname}}Tokenizer +{% elif cookiecutter.has_fast_class == "True" -%} +from transformers import {{cookiecutter.camelcase_modelname}}TokenizerFast +{% endif -%} +{% if cookiecutter.has_fast_class == "True" and cookiecutter.slow_tokenizer_use_sentencepiece == "True" -%} +from transformers.testing_utils import require_sentencepiece, require_tokenizers +from ..test_tokenization_common import TokenizerTesterMixin + + +@require_sentencepiece +@require_tokenizers +{% elif cookiecutter.slow_tokenizer_use_sentencepiece == "True" -%} +from transformers.testing_utils import require_sentencepiece +from ..test_tokenization_common import TokenizerTesterMixin + + +@require_sentencepiece +{% elif cookiecutter.has_fast_class == "True" -%} +from transformers.testing_utils import require_tokenizers +from ..test_tokenization_common import TokenizerTesterMixin + + +@require_tokenizers +{% else -%} +from ..test_tokenization_common import TokenizerTesterMixin + + +{% endif -%} +class {{cookiecutter.camelcase_modelname}}TokenizationTest(TokenizerTesterMixin, unittest.TestCase): + {% if cookiecutter.has_slow_class == "True" -%} + tokenizer_class = {{cookiecutter.camelcase_modelname}}Tokenizer + test_slow_tokenizer = True + {% else -%} + tokenizer_class = None + test_slow_tokenizer = False + {% endif -%} + {% if cookiecutter.has_fast_class == "True" -%} + rust_tokenizer_class = {{cookiecutter.camelcase_modelname}}TokenizerFast + test_rust_tokenizer = True + {% else -%} + rust_tokenizer_class = None + test_rust_tokenizer = False + {% endif -%} + {% if cookiecutter.slow_tokenizer_use_sentencepiece == "True" -%} + test_sentencepiece = True + {% endif -%} + # TODO: Check in `TokenizerTesterMixin` if other attributes need to be changed + def setUp(self): + super().setUp() + + raise NotImplementedError( + "Here you have to implement the saving of a toy tokenizer in " + "`self.tmpdirname`." 
+ ) + + # TODO: add tests with hard-coded target values \ No newline at end of file diff --git a/templates/adding_a_missing_tokenization_test/cookiecutter.json b/templates/adding_a_missing_tokenization_test/cookiecutter.json new file mode 100644 index 000000000000..2e53818f9bb6 --- /dev/null +++ b/templates/adding_a_missing_tokenization_test/cookiecutter.json @@ -0,0 +1,10 @@ +{ + "modelname": "BrandNewBERT", + "uppercase_modelname": "BRAND_NEW_BERT", + "lowercase_modelname": "brand_new_bert", + "camelcase_modelname": "BrandNewBert", + "has_slow_class": ["True", "False"], + "has_fast_class": ["True", "False"], + "slow_tokenizer_use_sentencepiece": ["True", "False"], + "authors": "The HuggingFace Team" +} From e98825cf54a9a8065aa68dc49543fc6d3373cd46 Mon Sep 17 00:00:00 2001 From: Francesco Saverio Zuppichini Date: Tue, 5 Apr 2022 11:56:36 +0200 Subject: [PATCH 28/34] made _load_pretrained_model_low_mem static + bug fix (#16548) --- src/transformers/modeling_utils.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py index 33401c3c093f..0719700c0964 100644 --- a/src/transformers/modeling_utils.py +++ b/src/transformers/modeling_utils.py @@ -2103,8 +2103,8 @@ def retrieve_modules_from_names(self, names, add_prefix=False, remove_prefix=Fal return retrieved_modules - @classmethod - def _load_pretrained_model_low_mem(cls, model, loaded_state_dict_keys, resolved_archive_file): + @staticmethod + def _load_pretrained_model_low_mem(model, loaded_state_dict_keys, resolved_archive_file): """ This is an experimental function that loads the model using ~1.x model size CPU memory @@ -2159,7 +2159,7 @@ def find_submodule_and_param_name(model, long_key): resolved_archive_file = [resolved_archive_file] for archive_file in resolved_archive_file: - state_dict = torch.load(resolved_archive_file, map_location="cpu") + state_dict = torch.load(archive_file, map_location="cpu") # materialize state_dict entries one by one on CPU for k in loaded_state_dict_keys: From 83362822cedd2df4319ac6a23fd69461b16fa813 Mon Sep 17 00:00:00 2001 From: Suraj Patil Date: Tue, 5 Apr 2022 12:26:03 +0200 Subject: [PATCH 29/34] handle torch_dtype in low cpu mem usage (#16580) --- src/transformers/modeling_utils.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py index 0719700c0964..a1a0ad7d36fd 100644 --- a/src/transformers/modeling_utils.py +++ b/src/transformers/modeling_utils.py @@ -2165,7 +2165,8 @@ def find_submodule_and_param_name(model, long_key): for k in loaded_state_dict_keys: submodule, param_name = find_submodule_and_param_name(model, k) if submodule is not None: - new_val = state_dict[k] + param_dtype = getattr(submodule, param_name).dtype + new_val = state_dict[k].to(param_dtype) if isinstance(getattr(submodule, param_name), torch.nn.Parameter): new_val = torch.nn.Parameter(new_val) setattr(submodule, param_name, new_val) From 85f2bd96c2cac9a7842684b5fdf30329c74f0d0c Mon Sep 17 00:00:00 2001 From: Patrick von Platen Date: Tue, 5 Apr 2022 14:15:02 +0200 Subject: [PATCH 30/34] [Doctests] Correct filenaming (#16599) * [Doctests] Correct filenaming * improve quicktour * make style --- docs/source/en/quicktour.mdx | 14 +++++++------- docs/source/es/quicktour.mdx | 13 ++++++------- utils/documentation_tests.txt | 18 +++--------------- 3 files changed, 16 insertions(+), 29 deletions(-) diff --git a/docs/source/en/quicktour.mdx 
b/docs/source/en/quicktour.mdx index 1fc4f8b865dc..0d7edd630702 100644 --- a/docs/source/en/quicktour.mdx +++ b/docs/source/en/quicktour.mdx @@ -115,23 +115,23 @@ Create a [`pipeline`] with the task you want to solve for and the model you want >>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") ``` -Next, load a dataset (see the 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) for more details) you'd like to iterate over. For example, let's load the [SUPERB](https://huggingface.co/datasets/superb) dataset: +Next, load a dataset (see the 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) for more details) you'd like to iterate over. For example, let's load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset: ```py >>> import datasets ->>> dataset = datasets.load_dataset("superb", name="asr", split="test") # doctest: +IGNORE_RESULT +>>> dataset = datasets.load_dataset("minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT ``` You can pass a whole dataset pipeline: ```py ->>> files = dataset["file"] +>>> files = dataset["path"] >>> speech_recognizer(files[:4]) -[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'}, - {'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'}, - {'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'}, - {'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}] +[{'text': 'I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT'}, + {'text': "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE"}, + {'text': "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS"}, + {'text': 'HOW DO I FURN A JOINA COUT'}] ``` For a larger dataset where the inputs are big (like in speech or vision), you will want to pass along a generator instead of a list that loads all the inputs in memory. See the [pipeline documentation](./main_classes/pipelines) for more information. diff --git a/docs/source/es/quicktour.mdx b/docs/source/es/quicktour.mdx index 8b400867099e..67ed7e7bb5c2 100644 --- a/docs/source/es/quicktour.mdx +++ b/docs/source/es/quicktour.mdx @@ -115,23 +115,22 @@ Crea un [`pipeline`] con la tarea que deseas resolver y el modelo que quieres usar >>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0) ``` -A continuación, carga el dataset (ve 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) para más detalles) sobre el que quisieras iterar. Por ejemplo, vamos a cargar el dataset [SUPERB](https://huggingface.co/datasets/superb): +A continuación, carga el dataset (ve 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) para más detalles) sobre el que quisieras iterar.
Por ejemplo, vamos a cargar el dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14): ```py >>> import datasets ->>> dataset = datasets.load_dataset("superb", name="asr", split="test") # doctest: +IGNORE_RESULT +>>> dataset = datasets.load_dataset("minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT ``` Puedes pasar un pipeline para un dataset: ```py ->>> files = dataset["file"] +>>> files = dataset["path"] >>> speech_recognizer(files[:4]) -[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'}, - {'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'}, - {'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'}, - {'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}] +[{'text': 'I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT'}, + {'text': "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE"}, + {'text': "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS"}, ``` Para un dataset más grande, donde los inputs son de mayor tamaño (como en habla/audio o visión), querrás pasar un generador en lugar de una lista que carga todos los inputs en memoria. Ve la [documentación del pipeline](./main_classes/pipelines) para más información. diff --git a/utils/documentation_tests.txt b/utils/documentation_tests.txt index 372e63ad232b..f88974ed434e 100644 --- a/utils/documentation_tests.txt +++ b/utils/documentation_tests.txt @@ -1,17 +1,10 @@ -docs/source/quicktour.mdx -docs/source/quicktour.mdx -docs/source/task_summary.mdx -docs/source/task_summary.mdx +docs/source/en/quicktour.mdx +docs/source/en/task_summary.mdx src/transformers/generation_utils.py -src/transformers/generation_utils.py -src/transformers/models/bart/modeling_bart.py src/transformers/models/bart/modeling_bart.py src/transformers/models/beit/modeling_beit.py src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py -src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py src/transformers/models/blenderbot/modeling_blenderbot.py -src/transformers/models/blenderbot/modeling_blenderbot.py -src/transformers/models/blenderbot_small/modeling_blenderbot_small.py src/transformers/models/blenderbot_small/modeling_blenderbot_small.py src/transformers/models/convnext/modeling_convnext.py src/transformers/models/data2vec/modeling_data2vec_audio.py @@ -20,16 +13,11 @@ src/transformers/models/dpt/modeling_dpt.py src/transformers/models/glpn/modeling_glpn.py src/transformers/models/hubert/modeling_hubert.py src/transformers/models/marian/modeling_marian.py -src/transformers/models/marian/modeling_marian.py -src/transformers/models/mbart/modeling_mbart.py src/transformers/models/mbart/modeling_mbart.py src/transformers/models/pegasus/modeling_pegasus.py -src/transformers/models/pegasus/modeling_pegasus.py -src/transformers/models/plbart/modeling_plbart.py src/transformers/models/plbart/modeling_plbart.py src/transformers/models/poolformer/modeling_poolformer.py src/transformers/models/resnet/modeling_resnet.py -src/transformers/models/resnet/modeling_resnet.py
src/transformers/models/roberta/modeling_roberta.py src/transformers/models/roberta/modeling_tf_roberta.py src/transformers/models/segformer/modeling_segformer.py @@ -50,4 +38,4 @@ src/transformers/models/vit_mae/modeling_vit_mae.py src/transformers/models/wav2vec2/modeling_wav2vec2.py src/transformers/models/wav2vec2/tokenization_wav2vec2.py src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py -src/transformers/models/wavlm/modeling_wavlm.py \ No newline at end of file +src/transformers/models/wavlm/modeling_wavlm.py From d726e679e48e82ed42e7646521f1172e80033ea8 Mon Sep 17 00:00:00 2001 From: Matt Date: Tue, 5 Apr 2022 14:23:27 +0100 Subject: [PATCH 31/34] Adding new train_step logic to make things less confusing for users (#15994) * Adding new train_step logic to make things less confusing for users * DO NOT ASK WHY WE NEED THAT SUBCLASS * Metrics now working, at least for single-output models with type annotations! * Updates and TODOs for the new train_step * Make fixup * Temporary test workaround until T5 has types * Temporary test workaround until T5 has types * I think this actually works! Needs a lot of tests though * MAke style/quality * Revert changes to T5 tests * Deleting the aforementioned unmentionable subclass * Deleting the aforementioned unmentionable subclass * Adding a Keras API test * Style fixes * Removing unneeded TODO and comments * Update test_step too * Stop trying to compute metrics with the dummy_loss, patch up test * Make style * make fixup * Docstring cleanup * make fixup * make fixup * Stop expanding 1D input tensors when using dummy loss * Adjust T5 test given the new compile() * make fixup * Skipping test for convnext * Removing old T5-specific Keras test now that we have a common one * make fixup * make fixup * Only skip convnext test on CPU * Update src/transformers/modeling_tf_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/modeling_tf_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Avoiding TF import issues * make fixup * Update compile() to support TF 2.3 * Skipping model.fit() on template classes for now * Skipping model.fit() on template class tests for now * Replace ad-hoc solution with find_labels * make fixup Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --- src/transformers/modeling_tf_utils.py | 171 ++++++++++++------ ...tf_{{cookiecutter.lowercase_modelname}}.py | 9 + tests/convnext/test_modeling_tf_convnext.py | 7 + tests/t5/test_modeling_tf_t5.py | 30 --- tests/test_modeling_tf_common.py | 50 +++++ 5 files changed, 184 insertions(+), 83 deletions(-) diff --git a/src/transformers/modeling_tf_utils.py b/src/transformers/modeling_tf_utils.py index ee5b32886b07..efa37e32bd75 100644 --- a/src/transformers/modeling_tf_utils.py +++ b/src/transformers/modeling_tf_utils.py @@ -38,7 +38,6 @@ from .configuration_utils import PretrainedConfig from .dynamic_module_utils import custom_object_save from .generation_tf_utils import TFGenerationMixin -from .modeling_tf_outputs import TFSeq2SeqLMOutput from .tf_utils import shape_list from .tokenization_utils_base import BatchEncoding from .utils import ( @@ -53,6 +52,7 @@ RevisionNotFoundError, cached_path, copy_func, + find_labels, has_file, hf_bucket_url, is_offline_mode, @@ -715,6 +715,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, Pu base_model_prefix = "" main_input_name = "input_ids" _auto_class = None + _using_dummy_loss = 
None # a list of re pattern of tensor names to ignore from the model when loading the model weights # (and avoid unnecessary warnings). @@ -899,24 +900,46 @@ def compile( function themselves. """ if loss == "passthrough": + if metrics is not None: + raise ValueError( + "Passing metrics as a dict is not supported when using the internal loss! " + "Please either compile the model with a loss, or remove the metrics argument. " + "Note that advanced metrics using the `KerasMetricCallback` can still be used with the internal " + "loss." + ) logger.warning( "No loss specified in compile() - the model's internal loss computation will be used as the " "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! " - "Please ensure your labels are passed as keys in the input dict so that they are " - "accessible to the model during the forward pass. To disable this behaviour, please pass a " - "loss argument, or explicitly pass loss=None if you do not want your model to compute a loss." + "To disable this behaviour, please pass a loss argument, or explicitly pass " + "`loss=None` if you do not want your model to compute a loss." + ) + loss = dummy_loss + self._using_dummy_loss = True + else: + self._using_dummy_loss = False + parent_args = list(inspect.signature(tf.keras.Model.compile).parameters.keys()) + if "steps_per_execution" in parent_args: + super().compile( + optimizer=optimizer, + loss=loss, + metrics=metrics, + loss_weights=loss_weights, + weighted_metrics=weighted_metrics, + run_eagerly=run_eagerly, + steps_per_execution=steps_per_execution, + **kwargs, + ) + else: + super().compile( + optimizer=optimizer, + loss=loss, + metrics=metrics, + loss_weights=loss_weights, + weighted_metrics=weighted_metrics, + run_eagerly=run_eagerly, + experimental_steps_per_execution=steps_per_execution, + **kwargs, ) - loss = {"loss": dummy_loss} - super().compile( - optimizer=optimizer, - loss=loss, - metrics=metrics, - loss_weights=loss_weights, - weighted_metrics=weighted_metrics, - run_eagerly=run_eagerly, - steps_per_execution=steps_per_execution, - **kwargs, - ) def compute_loss(self, *args, **kwargs): if hasattr(tf.keras.Model, "compute_loss"): @@ -935,40 +958,54 @@ def compute_loss(self, *args, **kwargs): def train_step(self, data): """ A modification of Keras's default `train_step` that cleans up the printed metrics when we use a dummy loss. If - a user specifies a loss at model compile time, this function behaves as the original Keras `train_step`. In - this case, it expects the same `data` as the original function (i.e. `(inputs, labels)`). - - However, when the model is compiled without specifying the loss AND the expected label columns are passed as - part of the input dictionary, the loss is computed internally (inside the model class) and is used in the - backwards pass. In this case, `data` is a singleton tuple containing `(inputs,)`. + a user specifies a loss at model compile time, this function behaves as the original Keras `train_step`. - This is possible under the aforementioned circumstances because our overriden compile function can set an - additional loss function that reduces a `loss` output, and the model will output a `loss` component (notice the - name matching) containing the loss that was used to train the pre-trained model. + When the model is compiled without specifying the loss, our overridden compile function can set a simple dummy + loss that just reads the loss output head of the model. 
When using this dummy loss, labels can be passed either
+        as keys in the input dictionary, or as normal Keras labels.
         """
+        # These are the only transformations `Model.fit` applies to user-input
+        # data when a `tf.data.Dataset` is provided.
-        data = data_adapter.expand_1d(data)
+        if not self._using_dummy_loss:
+            data = data_adapter.expand_1d(data)
         x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)
-        # These next two lines differ from the base method - they avoid issues when the labels are in
-        # the input dict (and loss is computed internally)
-        if y is None and "labels" in x:
-            y = x["labels"]  # Stops confusion with metric computations
-        elif y is None and "input_ids" in x:
-            # Just make any kind of dummy array to make loss work
-            y = tf.zeros(tf.shape(x["input_ids"])[0], dtype=tf.int64)
+
+        # When using a dummy loss, we ensure that separate labels are copied to the correct model arguments,
+        # if those keys are not already present in the input dict
+        if self._using_dummy_loss and y is not None:
+            arg_names = list(dict(inspect.signature(self.call).parameters).keys())
+            label_kwargs = find_labels(self.__class__)
+            # If y is a tensor and the model only has one label-like input, map y to that input
+            if len(label_kwargs) == 1 and isinstance(y, tf.Tensor):
+                if isinstance(x, tf.Tensor):
+                    x = {arg_names[0]: x}
+                label_kwarg = next(iter(label_kwargs))
+                if label_kwarg not in x:
+                    x[label_kwarg] = y
+            # Otherwise, copy keys from y to x as long as they weren't already present in x
+            elif isinstance(y, dict):
+                if isinstance(x, tf.Tensor):
+                    x = {arg_names[0]: x}
+                for key, val in y.items():
+                    if key in arg_names and key not in x:
+                        x[key] = val
+
+        # Run forward pass.
         with tf.GradientTape() as tape:
             y_pred = self(x, training=True)
-            loss = self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses)
+            if self._using_dummy_loss:
+                loss = self.compiled_loss(y_pred.loss, y_pred.loss, sample_weight, regularization_losses=self.losses)
+            else:
+                loss = self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses)
         # Run backwards pass.
         self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
-        # When y_pred is a ModelOutput and y is a tf.Tensor the metrics update
-        # should be done only with the relevant ModelOutput param that is
-        # considered by the loss.
-        if isinstance(y_pred, TFSeq2SeqLMOutput) and isinstance(y, tf.Tensor):
-            y_pred = y_pred["logits"]
-        self.compiled_metrics.update_state(y, y_pred, sample_weight)
+
+        # When using the dummy_loss we know metrics are not present, so we can skip a lot of this
+        if self._using_dummy_loss:
+            self.compiled_metrics.update_state(y_pred.loss, y_pred.loss, sample_weight)
+        else:
+            self.compiled_metrics.update_state(y, y_pred, sample_weight)
         # Collect metrics to return
         return_metrics = {}
         for metric in self.metrics:
@@ -985,23 +1022,51 @@ def test_step(self, data):
         """
-        A modification of Keras's default test_step that cleans up the printed metrics when we use a dummy loss.
+        A modification of Keras's default `test_step` that cleans up the printed metrics when we use a dummy loss. If a
+        user specifies a loss at model compile time, this function behaves as the original Keras `test_step`.
+
+        When the model is compiled without specifying the loss, our overridden compile function can set a simple dummy
+        loss that just reads the loss output head of the model. When using this dummy loss, labels can be passed either
+        as keys in the input dictionary, or as normal Keras labels.
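+
+        As an illustrative sketch (an editor's addition, not lines from the original patch), the two
+        `evaluate` calls below are equivalent once the model is compiled without a loss:
+
+            model.compile(optimizer="adam")  # no loss argument, so the internal loss is used
+            model.evaluate({"input_ids": input_ids, "labels": labels})  # labels inside the input dict
+            model.evaluate({"input_ids": input_ids}, labels)  # labels passed the standard Keras way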
""" - data = data_adapter.expand_1d(data) + # These are the only transformations `Model.fit` applies to user-input + # data when a `tf.data.Dataset` is provided. + if not self._using_dummy_loss: + data = data_adapter.expand_1d(data) x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data) - # These next two lines differ from the base method - they avoid issues when the labels are in - # the input dict (and loss is computed internally) - if y is None and "labels" in x: - y = x["labels"] # Stops confusion with metric computations - elif y is None and "input_ids" in x: - # Just make any kind of dummy array to make loss work - y = tf.zeros(tf.shape(x["input_ids"])[0], dtype=tf.int64) + + # When using a dummy loss, we ensure that separate labels are copied to the correct model arguments, + # if those keys are not already present in the input dict + if self._using_dummy_loss and y is not None: + arg_names = list(dict(inspect.signature(self.call).parameters).keys()) + label_kwargs = find_labels(self.__class__) + # If y is a tensor and the model only has one label-like input, map y to that input + if len(label_kwargs) == 1 and isinstance(y, tf.Tensor): + if isinstance(x, tf.Tensor): + x = {arg_names[0]: x} + label_kwarg = next(iter(label_kwargs)) + if label_kwarg not in x: + x[label_kwarg] = y + # Otherwise, copy keys from y to x as long as they weren't already present in x + elif isinstance(y, dict): + if isinstance(x, tf.Tensor): + x = {arg_names[0]: x} + for key, val in y.items(): + if key in arg_names and key not in x: + x[key] = val + + # Run forward pass. y_pred = self(x, training=False) - self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses) - # Updates stateful loss metrics. - if isinstance(y_pred, TFSeq2SeqLMOutput) and isinstance(y, tf.Tensor): - y_pred = y_pred["logits"] - self.compiled_metrics.update_state(y, y_pred, sample_weight) + if self._using_dummy_loss: + self.compiled_loss(y_pred.loss, y_pred.loss, sample_weight, regularization_losses=self.losses) + else: + self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses) + + # When using the dummy_loss we know metrics are not present, so we can skip a lot of this + if self._using_dummy_loss: + self.compiled_metrics.update_state(y_pred.loss, y_pred.loss, sample_weight) + else: + self.compiled_metrics.update_state(y, y_pred, sample_weight) # Collect metrics to return return_metrics = {} for metric in self.metrics: diff --git a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py index 57fd95dd3ff6..0f4d7824c164 100644 --- a/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py +++ b/templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/test_modeling_tf_{{cookiecutter.lowercase_modelname}}.py @@ -259,6 +259,7 @@ def create_and_check_causal_lm_model_as_decoder( list(prediction_scores.numpy().shape), [self.batch_size, self.seq_length, self.vocab_size] ) + def create_and_check_causal_lm_model_past( self, config, @@ -597,6 +598,10 @@ def test_model(self): config_and_inputs = self.model_tester.prepare_config_and_inputs() self.model_tester.create_and_check_model(*config_and_inputs) + @unittest.skip(reason="Template classes interact badly with this test.") + def 
test_keras_fit(self): + pass + def test_causal_lm_base_model(self): """Test the base model of the causal LM model @@ -947,6 +952,10 @@ def _get_word_embedding_weight(model, embedding_layer): models_equal = False self.assertTrue(models_equal) + @unittest.skip(reason="Template classes interact badly with this test.") + def test_keras_fit(self): + pass + def _assert_tensors_equal(a, b, atol=1e-12, prefix=""): """If tensors not close, or a and b arent both tensors, raise a nice Assertion error.""" diff --git a/tests/convnext/test_modeling_tf_convnext.py b/tests/convnext/test_modeling_tf_convnext.py index edab09fb69b9..579c27dd27a6 100644 --- a/tests/convnext/test_modeling_tf_convnext.py +++ b/tests/convnext/test_modeling_tf_convnext.py @@ -143,6 +143,13 @@ def setUp(self): def test_inputs_embeds(self): pass + @unittest.skipIf( + not is_tf_available() or len(tf.config.list_physical_devices("GPU")) == 0, + reason="TF (<=2.8) does not support backprop for grouped convolutions on CPU.", + ) + def test_keras_fit(self): + pass + @unittest.skip(reason="ConvNext does not support input and output embeddings") def test_model_common_attributes(self): pass diff --git a/tests/t5/test_modeling_tf_t5.py b/tests/t5/test_modeling_tf_t5.py index 7ac0b33e426b..7445aae53001 100644 --- a/tests/t5/test_modeling_tf_t5.py +++ b/tests/t5/test_modeling_tf_t5.py @@ -804,33 +804,3 @@ def test_translation_en_to_ro(self): translation = tok.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) self.assertEqual(translation, expected_translation) - - def test_finetune_keras_trainer(self): - """Ensure that the model can be fine-tuned via the keras API and - that metrics work as expected. - """ - - # This metric expects to be called with the logits output - def _accuracy(y_true, y_pred): - return tf.keras.metrics.sparse_categorical_crossentropy(y_true[:, 0], y_pred[:, 0]) - - # measure the accuracy of the first token - class FirstTokenAccuracy(tf.keras.metrics.MeanMetricWrapper): - def __init__(self, name="accuracy", **kwargs): - super().__init__(_accuracy, name=name, **kwargs) - - model = self.model - model.compile("adam", metrics=FirstTokenAccuracy()) - tokenizer = T5Tokenizer.from_pretrained("t5-small") - - examples = [ - ("sentiment: Everything is awesome!", "positive"), - ("sentiment: Tensorflow datasets are hard to use", "negative"), - ] - - inputs = dict(tokenizer([x[0] for x in examples], padding=True, return_tensors="tf")) - inputs["labels"] = tokenizer([x[1] for x in examples], return_tensors="tf").input_ids - - model.fit(inputs) - m = model.evaluate(inputs) - self.assertEqual(len(m), 2) diff --git a/tests/test_modeling_tf_common.py b/tests/test_modeling_tf_common.py index b72034de6958..b7b4b68414a7 100644 --- a/tests/test_modeling_tf_common.py +++ b/tests/test_modeling_tf_common.py @@ -1302,6 +1302,56 @@ def test_loss_computation(self): self.assertEqual(loss.shape, [loss_size]) + def test_keras_fit(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + for model_class in self.all_model_classes: + model = model_class(config) + if getattr(model, "hf_compute_loss", None): + # Test that model correctly compute the loss with kwargs + prepared_for_class = self._prepare_for_class(inputs_dict.copy(), model_class, return_labels=True) + # Is there a better way to remove these decoder inputs? 
+ prepared_for_class = { + key: val + for key, val in prepared_for_class.items() + if key not in ("head_mask", "decoder_head_mask", "cross_attn_head_mask", "decoder_input_ids") + } + + possible_label_cols = { + "labels", + "label", + "label_ids", + "start_positions", + "start_position", + "end_positions", + "end_position", + "next_sentence_label", + } + label_names = possible_label_cols.intersection(set(prepared_for_class)) + self.assertGreater(len(label_names), 0, msg="No matching label names found!") + labels = {key: val for key, val in prepared_for_class.items() if key in label_names} + inputs_minus_labels = {key: val for key, val in prepared_for_class.items() if key not in label_names} + self.assertGreater(len(inputs_minus_labels), 0) + model.compile(optimizer=tf.keras.optimizers.SGD(0.0), run_eagerly=True) + # Make sure the model fits without crashing regardless of where we pass the labels + history1 = model.fit( + prepared_for_class, + validation_data=prepared_for_class, + steps_per_epoch=1, + validation_steps=1, + shuffle=False, + ) + val_loss1 = history1.history["val_loss"][0] + history2 = model.fit( + inputs_minus_labels, + labels, + validation_data=(inputs_minus_labels, labels), + steps_per_epoch=1, + validation_steps=1, + shuffle=False, + ) + val_loss2 = history2.history["val_loss"][0] + self.assertTrue(np.allclose(val_loss1, val_loss2, atol=1e-2, rtol=1e-3)) + def test_generate_with_headmasking(self): attention_names = ["encoder_attentions", "decoder_attentions", "cross_attentions"] config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() From 7744ce7befeaf94007e44d66007de9a6be27d58f Mon Sep 17 00:00:00 2001 From: Rishav Chandra Varma Date: Tue, 5 Apr 2022 19:20:45 +0530 Subject: [PATCH 32/34] Adding missing type hints for BigBird model (#16555) * added type hints for mbart tensorflow tf implementation * Adding missing type hints for mBART model Tensorflow Implementation model added with missing type hints * Missing Type hints - correction For TF model * Code fixup using make quality tests * Hint types - typo error * make fix-copies and make fixup * type hints * updated files * type hints update * making dependent modesls coherent * Type hints for BigBird * removing typos Co-authored-by: matt --- .../models/big_bird/modeling_big_bird.py | 202 +++++++++--------- 1 file changed, 101 insertions(+), 101 deletions(-) diff --git a/src/transformers/models/big_bird/modeling_big_bird.py b/src/transformers/models/big_bird/modeling_big_bird.py index b765a854009d..85b48170f70c 100755 --- a/src/transformers/models/big_bird/modeling_big_bird.py +++ b/src/transformers/models/big_bird/modeling_big_bird.py @@ -18,7 +18,7 @@ import math import os from dataclasses import dataclass -from typing import Optional, Tuple +from typing import Optional, Tuple, Union import numpy as np import torch @@ -1592,7 +1592,7 @@ def forward( to_mask=None, blocked_encoder_mask=None, return_dict=True, - ): + ) -> Union[BaseModelOutputWithPastAndCrossAttentions, Tuple]: all_hidden_states = () if output_hidden_states else None all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None @@ -1986,20 +1986,20 @@ def set_attention_type(self, value: str): ) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - 
output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[BaseModelOutputWithPoolingAndCrossAttentions, Tuple[torch.FloatTensor]]: r""" encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if @@ -2280,18 +2280,18 @@ def set_output_embeddings(self, new_embeddings): @replace_return_docstrings(output_type=BigBirdForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - labels=None, - next_sentence_label=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.FloatTensor] = None, + next_sentence_label: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[BigBirdForPreTrainingOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ..., @@ -2395,19 +2395,19 @@ def set_output_embeddings(self, new_embeddings): ) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - labels=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[MaskedLMOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. 
Indices should be in `[-100, 0, ..., @@ -2493,21 +2493,21 @@ def set_output_embeddings(self, new_embeddings): @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - labels=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[CausalLMOutputWithCrossAttentions, Tuple[torch.FloatTensor]]: r""" encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if @@ -2664,17 +2664,17 @@ def __init__(self, config): ) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - labels=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[SequenceClassifierOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., @@ -2762,17 +2762,17 @@ def __init__(self, config): ) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - labels=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[MultipleChoiceModelOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): Labels for computing the multiple choice classification loss. 
Indices should be in `[0, ..., @@ -2858,17 +2858,17 @@ def __init__(self, config): ) def forward( self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - labels=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[TokenClassifierOutput, Tuple[torch.FloatTensor]]: r""" labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`. @@ -2957,19 +2957,19 @@ def __init__(self, config, add_pooling_layer=False): ) def forward( self, - input_ids=None, - attention_mask=None, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.FloatTensor] = None, question_lengths=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - start_positions=None, - end_positions=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + start_positions: Optional[torch.LongTensor] = None, + end_positions: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[BigBirdForQuestionAnsweringModelOutput, Tuple[torch.FloatTensor]]: r""" start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*): Labels for position (index) of the start of the labelled span for computing the token classification loss. 
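
The pattern applied throughout this patch is uniform: untyped `=None` keyword arguments gain `Optional[...]` tensor annotations, `input_ids` keeps a plain `torch.LongTensor` type, and each `forward` gains a `Union[...]` return annotation covering both the model output class and the tuple fallback. A minimal sketch of the convention on a hypothetical toy module (an illustration, not code from `modeling_big_bird.py`):

```py
from typing import Optional, Tuple, Union

import torch
from torch import nn


class ToyClassifier(nn.Module):
    """A stand-in model illustrating the annotation convention above; not code from the patch."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8, num_labels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[dict, Tuple[torch.FloatTensor]]:
        hidden_states = self.embed(input_ids).mean(dim=1)  # crude mean pooling over the sequence
        logits = self.classifier(hidden_states)
        if return_dict:
            return {"logits": logits}
        return (logits,)
```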
From 20ef1a0ac58061a0bb6bb1ffbe4863845df27c7b Mon Sep 17 00:00:00 2001 From: Stas Bekman Date: Tue, 5 Apr 2022 08:13:12 -0700 Subject: [PATCH 33/34] [deepspeed] fix typo, adjust config name (#16597) --- src/transformers/deepspeed.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/transformers/deepspeed.py b/src/transformers/deepspeed.py index 46cfb9730760..6feabdaa8095 100644 --- a/src/transformers/deepspeed.py +++ b/src/transformers/deepspeed.py @@ -250,7 +250,7 @@ def trainer_config_process(self, args): self.fill_match("bf16.enabled", (args.bf16 or args.bf16_full_eval), "bf16|bf16_full_eval") # deepspeed's default mode is fp16 unless there is a config that says differently - if self.is_true("bfoat16.enabled"): + if self.is_true("bf16.enabled"): self._dtype = torch.bfloat16 elif self.is_false("fp16.enabled"): self._dtype = torch.float32 From 578abb1632b4e68581257c8fd33d11135630080b Mon Sep 17 00:00:00 2001 From: Steven Date: Tue, 5 Apr 2022 10:02:50 -0700 Subject: [PATCH 34/34] =?UTF-8?q?=20=F0=9F=96=8D=20apply=20feedback?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/source/en/task_summary.mdx | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/source/en/task_summary.mdx b/docs/source/en/task_summary.mdx index fd30add50729..8323e182a607 100644 --- a/docs/source/en/task_summary.mdx +++ b/docs/source/en/task_summary.mdx @@ -970,7 +970,7 @@ We get the same translation as with the pipeline example. ## Audio classification -Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or the how-to guide [here](./tasks/audio_classification). +Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or this [how-to guide](./tasks/audio_classification). The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for audio classification inference: @@ -988,9 +988,9 @@ The following examples demonstrate how to use a [`pipeline`] and a model and tok {'label': 'fearful', 'score': 0.12404385954141617}] ``` -The general process for using a model and tokenizer for audio classification is: +The general process for using a model and feature extractor for audio classification is: -1. Instantiate a tokenizer and a model from the checkpoint name. +1. Instantiate a feature extractor and a model from the checkpoint name. 2. Process the audio signal to be classified with a feature extractor. 3. Pass the input through the model and take the `argmax` to retrieve the most likely class. 4. 
Convert the class id to a class name with `id2label` to return an interpretable result. @@ -1023,7 +1023,7 @@ The general process for using a model and tokenizer for audio classification is: ## Automatic speech recognition -Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or the how-to guide [here](./tasks/asr). +Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or this [how-to guide](./tasks/asr). The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for automatic speech recognition inference: @@ -1037,9 +1037,9 @@ The following examples demonstrate how to use a [`pipeline`] and a model and tok {'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"} ``` -The general process for using a model and tokenizer for automatic speech recognition is: +The general process for using a model and processor for automatic speech recognition is: -1. Instantiate a tokenizer and a model from the checkpoint name. +1. Instantiate a processor (which regroups a feature extractor for input processing and a tokenizer for decoding) and a model from the checkpoint name. 2. Process the audio signal and text with a processor. 3. Pass the input through the model and take the `argmax` to retrieve the predicted text. 4. Decode the text with a tokenizer to obtain the transcription. @@ -1071,7 +1071,7 @@ The general process for using a model and tokenizer for automatic speech recogni ## Image classification -Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. 
If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or the how-to guide [here](./tasks/image_classification). +Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or this [how-to guide](./tasks/image_classification). The following examples demonstrate how to use a [`pipeline`] and a model and tokenizer for image classification inference: @@ -1091,9 +1091,9 @@ The following examples demonstrate how to use a [`pipeline`] and a model and tok {'label': 'tiger cat', 'score': 0.023034192621707916}] ``` -The general process for using a model and tokenizer for image classification is: +The general process for using a model and feature extractor for image classification is: -1. Instantiate a tokenizer and a model from the checkpoint name. +1. Instantiate a feature extractor and a model from the checkpoint name. 2. Process the image to be classified with a feature extractor. 3. Pass the input through the model and take the `argmax` to retrieve the predicted class. 4. Convert the class id to a class name with `id2label` to return an interpretable result.
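
The four steps above map directly onto code. A minimal sketch of them (an editor's illustration; the `google/vit-base-patch16-224` checkpoint and the local image path are assumptions, not taken from this patch):

```py
>>> import torch
>>> from PIL import Image
>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> image = Image.open("cat.jpg")  # placeholder path; any RGB image works
>>> inputs = feature_extractor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
>>> model.config.id2label[predicted_class_id]
```

This mirrors the structure of the audio-classification example earlier in the document, with the feature extractor taking the place of the tokenizer.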