diff --git a/.github/workflows/build_doc_test.yml b/.github/workflows/build_doc_test.yml
index 348cc484a93d..f18121791cbc 100644
--- a/.github/workflows/build_doc_test.yml
+++ b/.github/workflows/build_doc_test.yml
@@ -46,4 +46,4 @@ jobs:
- name: Make documentation
run: |
- doc-builder build transformers ./transformers/docs/source
+ doc-builder build transformers ./docs/source
diff --git a/docs/source/main_classes/trainer.mdx b/docs/source/main_classes/trainer.mdx
new file mode 100644
index 000000000000..19ee38c90302
--- /dev/null
+++ b/docs/source/main_classes/trainer.mdx
@@ -0,0 +1,550 @@
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Trainer
+
+The [`Trainer`] class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the [example scripts](../examples).
+
+Before instantiating your [`Trainer`], create a [`TrainingArguments`] to access all the points of customization during training.
+
+The API supports distributed training on multiple GPUs/TPUs, mixed precision through [NVIDIA Apex](https://github.com/NVIDIA/apex) and Native AMP for PyTorch.
+
+The [`Trainer`] contains the basic training loop which supports the above features. To inject custom behavior you can subclass it and override the following methods (a minimal sketch follows the list):
+
+- **get_train_dataloader** -- Creates the training DataLoader.
+- **get_eval_dataloader** -- Creates the evaluation DataLoader.
+- **get_test_dataloader** -- Creates the test DataLoader.
+- **log** -- Logs information on the various objects watching training.
+- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
+  init. Note that you can also subclass or override the `create_optimizer` and `create_scheduler` methods
+ separately.
+- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
+- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
+- **compute_loss** -- Computes the loss on a batch of training inputs.
+- **training_step** -- Performs a training step.
+- **prediction_step** -- Performs an evaluation/test step.
+- **evaluate** -- Runs an evaluation loop and returns metrics.
+- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
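+
+For instance, here is a minimal sketch (not part of the library) of a subclass that overrides `log` to also print every set of logged metrics to stdout:
+
+```python
+from transformers import Trainer
+
+
+class PrintLogTrainer(Trainer):
+    def log(self, logs):
+        # print the metrics in addition to reporting them to the configured integrations
+        print(f"step {self.state.global_step}: {logs}")
+        super().log(logs)
+```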
+
+<Tip warning={true}>
+
+The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors
+when you use it on other models. When using it on your own model, make sure:
+
+- your model always returns tuples or subclasses of [`~file_utils.ModelOutput`].
+- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first
+  element of the tuple (if your model returns tuples).
+- your model can accept multiple label arguments (use the `label_names` in your [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"`; see the sketch after this tip.
+
+</Tip>
+
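+For example, if your model expects two label tensors, as extractive question-answering models do with `start_positions` and `end_positions`, you could declare them like this (a minimal sketch, the argument names are model-specific):
+
+```python
+from transformers import TrainingArguments
+
+# tell the Trainer which input keys are labels for this model
+args = TrainingArguments(output_dir="output_dir", label_names=["start_positions", "end_positions"])
+```
+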
+Here is an example of how to customize [`Trainer`] using a custom loss function for multi-label classification:
+
+```python
+from torch import nn
+from transformers import Trainer
+
+class MultilabelTrainer(Trainer):
+    def compute_loss(self, model, inputs, return_outputs=False):
+        labels = inputs.get("labels")
+        outputs = model(**inputs)
+        logits = outputs.get("logits")
+        # BCEWithLogitsLoss treats each of the num_labels outputs as an independent binary problem
+        loss_fct = nn.BCEWithLogitsLoss()
+        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
+                        labels.float().view(-1, self.model.config.num_labels))
+        return (loss, outputs) if return_outputs else loss
+```
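+
+You would then use it exactly like the regular [`Trainer`]; for instance (a minimal sketch, with `model`, `args` and `train_dataset` created as usual):
+
+```python
+trainer = MultilabelTrainer(model=model, args=args, train_dataset=train_dataset)
+trainer.train()
+```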
+
+Another way to customize the training loop behavior for the PyTorch [`Trainer`] is to use [callbacks](callback) that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early stopping).
+
+
+## Trainer
+
+[[autodoc]] Trainer
+ - all
+
+## Seq2SeqTrainer
+
+[[autodoc]] Seq2SeqTrainer
+ - evaluate
+ - predict
+
+## TrainingArguments
+
+[[autodoc]] TrainingArguments
+ - all
+
+## Seq2SeqTrainingArguments
+
+[[autodoc]] Seq2SeqTrainingArguments
+ - all
+
+## Checkpoints
+
+By default, [`Trainer`] will save all checkpoints in the `output_dir` you set in the
+[`TrainingArguments`] you are using. Those will go in a subfolder named `checkpoint-xxx`, with xxx
+being the step the training was at.
+
+Resuming training from a checkpoint can be done when calling [`Trainer.train`] with either:
+
+- `resume_from_checkpoint=True` which will resume training from the latest checkpoint
+- `resume_from_checkpoint=checkpoint_dir` which will resume training from the specific checkpoint in the directory
+ passed.
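+
+For instance, a minimal sketch (the checkpoint folder name is a placeholder):
+
+```python
+# resume from the latest checkpoint in output_dir
+trainer.train(resume_from_checkpoint=True)
+
+# or resume from a specific checkpoint
+trainer.train(resume_from_checkpoint="output_dir/checkpoint-500")
+```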
+
+In addition, you can easily save your checkpoints on the Model Hub when using `push_to_hub=True`. By default, all
+the models saved in intermediate checkpoints are saved in different commits, but not the optimizer state. You can adapt
+the `hub_strategy` value of your [`TrainingArguments`] to either:
+
+- `"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to
+ resume training easily with `trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`.
+- `"all_checkpoints"`: all checkpoints are pushed like they appear in the output folder (so you will get one
+ checkpoint folder per folder in your final repository)
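+
+As a minimal sketch (the output directory name is a placeholder), this is how you could enable pushing the latest checkpoint alongside the saved models:
+
+```python
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    output_dir="output_dir",
+    push_to_hub=True,
+    hub_strategy="checkpoint",  # also push the latest checkpoint into last-checkpoint
+)
+```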
+
+
+## Logging
+
+By default, [`Trainer`] will use `logging.INFO` for the main process and `logging.WARNING` for the replicas, if any.
+
+These defaults can be overridden to use any of the 5 `logging` levels with [`TrainingArguments`]'s
+arguments:
+
+- `log_level` - for the main process
+- `log_level_replica` - for the replicas
+
+Further, if [`TrainingArguments`]'s `log_on_each_node` is set to `False`, only the main node will
+use the log level settings for its main process, while all other nodes will use the log level settings for replicas.
+
+Note that [`Trainer`] is going to set `transformers`'s log level separately for each node in its
+[`Trainer.__init__`]. So you may want to set this sooner (see the next example) if you tap into other
+`transformers` functionality before creating the [`Trainer`] object.
+
+Here is an example of how this can be used in an application:
+
+```python
+import logging
+import sys
+
+import datasets
+import transformers
+
+[...]
+
+logger = logging.getLogger(__name__)
+
+# Setup logging
+logging.basicConfig(
+    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+    datefmt="%m/%d/%Y %H:%M:%S",
+    handlers=[logging.StreamHandler(sys.stdout)],
+)
+
+# set the main code and the modules it uses to the same log-level according to the node
+log_level = training_args.get_process_log_level()
+logger.setLevel(log_level)
+datasets.utils.logging.set_verbosity(log_level)
+transformers.utils.logging.set_verbosity(log_level)
+
+trainer = Trainer(...)
+```
+
+Then, if you only want to see warnings on the main node and want all other nodes to suppress their (most likely
+duplicated) warnings, you could run it as:
+
+```bash
+my_app.py ... --log_level warning --log_level_replica error
+```
+
+In a multi-node environment, if you also don't want the logs to repeat for each node's main process, you will want to
+change the above to:
+
+```bash
+my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0
+```
+
+and then only the main process of the first node will log at the "warning" level, and all other processes on the main
+node and all processes on other nodes will log at the "error" level.
+
+If you need your application to be as quiet as possible you could do:
+
+```bash
+my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0
+```
+
+(add `--log_on_each_node 0` if in a multi-node environment)
+
+
+## Randomness
+
+When resuming from a checkpoint generated by [`Trainer`], all efforts are made to restore the
+_python_, _numpy_ and _pytorch_ RNG states to the same states as they were at the moment of saving that checkpoint,
+which should make the "stop and resume" style of training as close as possible to non-stop training.
+
+However, due to various default non-deterministic pytorch settings this might not fully work. If you want full
+determinism please refer to [Controlling sources of randomness](https://pytorch.org/docs/stable/notes/randomness). As explained in that document, some of the settings
+that make things deterministic (e.g., `torch.backends.cudnn.deterministic`) may slow things down, therefore this
+can't be done by default, but you can enable those yourself if needed.
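+
+For instance, to opt in to the (slower) deterministic cuDNN kernels yourself, a minimal sketch:
+
+```python
+import torch
+
+# trade speed for reproducibility; see the PyTorch notes linked above for
+# the other sources of randomness you may need to control
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False
+```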
+
+
+## Trainer Integrations
+
+The [`Trainer`] has been extended to support libraries that may dramatically improve your training
+time and fit much bigger models.
+
+Currently it supports two third-party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FairScale](https://github.com/facebookresearch/fairscale/), which implement parts of the paper [ZeRO: Memory Optimizations
+Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054).
+
+This provided support is new and experimental as of this writing.
+
+<a id='zero-install-notes'></a>
+
+### CUDA Extension Installation Notes
+
+As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code before they can be used.
+
+While all installation issues should be dealt with through the corresponding GitHub Issues of [FairScale](https://github.com/facebookresearch/fairscale/issues) and [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
+any PyTorch extension that needs to compile CUDA code.
+
+Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
+
+```bash
+pip install fairscale
+pip install deepspeed
+```
+
+please read the following notes first.
+
+In these notes we give examples of what to do when `pytorch` has been built with CUDA `10.2`. If your situation is
+different, remember to adjust the version number to the one you are after.
+
+#### Possible problem #1
+
+While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
+installed system-wide.
+
+For example, if you installed `pytorch` with `cudatoolkit==10.2` in the Python environment, you also need to have
+CUDA `10.2` installed system-wide.
+
+The exact location may vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many
+Unix systems. When CUDA is correctly set up and added to the `PATH` environment variable, one can find the
+installation location by doing:
+
+```bash
+which nvcc
+```
+
+If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
+search engine. For example, if you're on Ubuntu you may want to search for: [ubuntu cuda 10.2 install](https://www.google.com/search?q=ubuntu+cuda+10.2+install).
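+
+To double-check that the two versions match, you can compare what `nvcc --version` reports with the CUDA version your `pytorch` build was compiled against; a quick sketch:
+
+```python
+import torch
+
+# the CUDA version this pytorch build was compiled with; it should match
+# the system-wide toolkit that `nvcc --version` reports
+print(torch.version.cuda)
+```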
+
+#### Possible problem #2
+
+Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
+may have:
+
+```bash
+/usr/local/cuda-10.2
+/usr/local/cuda-11.0
+```
+
+Now, in this situation you need to make sure that your `PATH` and `LD_LIBRARY_PATH` environment variables contain
+the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever version
+was installed last. If you encounter a problem where the package build fails because it can't find the right
+CUDA version despite it being installed system-wide, it means that you need to adjust the 2 aforementioned
+environment variables.
+
+First, you may look at their contents:
+
+```bash
+echo $PATH
+echo $LD_LIBRARY_PATH
+```
+
+so you get an idea of what is inside.
+
+It's possible that `LD_LIBRARY_PATH` is empty.
+
+`PATH` lists the locations where executables can be found and `LD_LIBRARY_PATH` is where shared libraries
+are looked for. In both cases, earlier entries have priority over later ones. `:` is used to separate multiple
+entries.
+
+Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
+doing:
+
+```bash
+export PATH=/usr/local/cuda-10.2/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
+```
+
+Note that we aren't overwriting the existing values, but prepending instead.
+
+Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
+exist. The `lib64` sub-directory is where the various CUDA `.so` objects, like `libcudart.so`, reside. It's unlikely
+that your system will have it named differently, but if it does, adjust it to reflect your reality.
+
+
+#### Possible problem #3
+
+Some older CUDA versions may refuse to build with newer compilers. For example, you may have `gcc-9` but it wants
+`gcc-7`.
+
+There are various ways to go about it.
+
+If you can install the latest CUDA toolkit, it typically should support the newer compiler.
+
+Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
+already have it but it's not the default one, so the build system can't see it. If you have `gcc-7` installed but the
+build system complains it can't find it, the following might do the trick:
+
+```bash
+sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
+sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
+```
+
+Here, we are making a symlink to `gcc-7` from `/usr/local/cuda-10.2/bin/gcc` and since
+`/usr/local/cuda-10.2/bin/` should be in the `PATH` environment variable (see the previous problem's solution), it
+should find `gcc-7` (and `g++-7`) and then the build will succeed.
+
+As always, make sure to edit the paths in the example to match your situation.
+
+### FairScale
+
+By integrating [FairScale](https://github.com/facebookresearch/fairscale/) the [`Trainer`]
+provides support for the following features from [the ZeRO paper](https://arxiv.org/abs/1910.02054):
+
+1. Optimizer State Sharding
+2. Gradient Sharding
+3. Model Parameters Sharding (new and very experimental)
+4. CPU offload (new and very experimental)
+
+You will need at least two GPUs to use this feature.
+
+
+**Installation**:
+
+Install the library via pypi:
+
+```bash
+pip install fairscale
+```
+
+or via `transformers`' `extras`:
+
+```bash
+pip install transformers[fairscale]
+```
+
+(available starting from `transformers==4.6.0`), or find more details on [FairScale's GitHub page](https://github.com/facebookresearch/fairscale/#installation).
+
+If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](#zero-install-notes).
+
+If that still doesn't resolve the build issue, here are a few more ideas.
+
+`fairscale` seems to have an issue with pip's recently introduced build isolation feature. If you have a problem
+with it, you may want to try one of:
+
+```bash
+pip install fairscale --no-build-isolation
+```
+
+or:
+
+```bash
+git clone https://github.com/facebookresearch/fairscale/
+cd fairscale
+rm -r dist build
+python setup.py bdist_wheel
+pip uninstall -y fairscale
+pip install dist/fairscale-*.whl
+```
+
+`fairscale` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:
+
+```bash
+pip uninstall -y fairscale; pip install fairscale --pre \
+-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
+--no-cache --no-build-isolation
+```
+
+or:
+
+```bash
+pip install -v --disable-pip-version-check . \
+-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly --pre
+```
+
+Of course, adjust the URLs to match the CUDA version you use.
+
+If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of
+[FairScale](https://github.com/facebookresearch/fairscale/issues).
+
+
+
+**Usage**:
+
+To use the first version of Sharded data-parallelism, add `--sharded_ddp simple` to the command line arguments, and
+make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.
+
+For example here is how you could use it for `run_translation.py` with 2 GPUs:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
+--model_name_or_path t5-small --per_device_train_batch_size 1 \
+--output_dir output_dir --overwrite_output_dir \
+--do_train --max_train_samples 500 --num_train_epochs 1 \
+--dataset_name wmt16 --dataset_config "ro-en" \
+--source_lang en --target_lang ro \
+--fp16 --sharded_ddp simple
+```
+
+Notes:
+
+- This feature requires distributed training (so multiple GPUs).
+- It is not implemented for TPUs.
+- It works with `--fp16` too, to make things even faster.
+- One of the main benefits of enabling `--sharded_ddp simple` is that it uses a lot less GPU memory, so you should be
+ able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
+ significantly shorter training time.
+
+To use the second version of Sharded data-parallelism, add `--sharded_ddp zero_dp_2` or `--sharded_ddp zero_dp_3` to the command line arguments, and make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.
+
+For example here is how you could use it for `run_translation.py` with 2 GPUs:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
+--model_name_or_path t5-small --per_device_train_batch_size 1 \
+--output_dir output_dir --overwrite_output_dir \
+--do_train --max_train_samples 500 --num_train_epochs 1 \
+--dataset_name wmt16 --dataset_config "ro-en" \
+--source_lang en --target_lang ro \
+--fp16 --sharded_ddp zero_dp_2
+```
+
+`zero_dp_2` is an optimized version of the simple wrapper, while `zero_dp_3` fully shards model weights,
+gradients and optimizer states.
+
+Both are compatible with adding `cpu_offload` to enable ZeRO-offload (activate it like this: `--sharded_ddp "zero_dp_2 cpu_offload"`).
+
+Notes:
+
+- This feature requires distributed training (so multiple GPUs).
+- It is not implemented for TPUs.
+- It works with `--fp16` too, to make things even faster.
+- The `cpu_offload` additional option requires `--fp16`.
+- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
+ some bugs you encounter may have been fixed there already.
+
+Known caveats:
+
+- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
+- Using `--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
+  `FullyShardedDataParallel` of fairscale. It should be used with the option `auto_wrap` if you are not
+  doing this yourself: `--sharded_ddp "zero_dp_3 auto_wrap"`.
+
+
+### DeepSpeed
+
+
+Moved to [Trainer DeepSpeed integration](deepspeed#trainer-deepspeed-integration).
+
+
+#### Installation
+
+Moved to [Installation](deepspeed#deepspeed-installation).
+
+
+#### Deployment with multiple GPUs
+
+Moved to [Deployment with multiple GPUs](deepspeed#deepspeed-multi-gpu).
+
+
+#### Deployment with one GPU
+
+Moved to [Deployment with one GPU](deepspeed#deepspeed-one-gpu).
+
+
+#### Deployment in Notebooks
+
+Moved to [Deployment in Notebooks](deepspeed#deepspeed-notebook).
+
+
+#### Configuration
+
+Moved to [Configuration](deepspeed#deepspeed-config).
+
+
+#### Passing Configuration
+
+Moved to [Passing Configuration](deepspeed#deepspeed-config-passing).
+
+
+#### Shared Configuration
+
+Moved to [Shared Configuration](deepspeed#deepspeed-config-shared).
+
+#### ZeRO
+
+Moved to [ZeRO](deepspeed#deepspeed-zero).
+
+##### ZeRO-2 Config
+
+Moved to [ZeRO-2 Config](deepspeed#deepspeed-zero2-config).
+
+##### ZeRO-3 Config
+
+Moved to [ZeRO-3 Config](deepspeed#deepspeed-zero3-config).
+
+
+#### NVMe Support
+
+Moved to [NVMe Support](deepspeed#deepspeed-nvme).
+
+##### ZeRO-2 vs ZeRO-3 Performance
+
+Moved to [ZeRO-2 vs ZeRO-3 Performance](deepspeed#deepspeed-zero2-zero3-performance).
+
+##### ZeRO-2 Example
+
+Moved to [ZeRO-2 Example](deepspeed#deepspeed-zero2-example).
+
+##### ZeRO-3 Example
+
+Moved to [ZeRO-3 Example](deepspeed#deepspeed-zero3-example).
+
+
+#### Optimizer and Scheduler
+
+##### Optimizer
+
+Moved to [Optimizer](deepspeed#deepspeed-optimizer).
+
+
+##### Scheduler
+
+Moved to [Scheduler](deepspeed#deepspeed-scheduler).
+
+#### fp32 Precision
+
+Moved to [fp32 Precision](deepspeed#deepspeed-fp32).
+
+#### Automatic Mixed Precision
+
+Moved to [Automatic Mixed Precision](deepspeed#deepspeed-amp).
+
+#### Batch Size
+
+Moved to [Batch Size](deepspeed#deepspeed-bs).
+
+#### Gradient Accumulation
+
+Moved to [Gradient Accumulation](deepspeed#deepspeed-grad-acc).
+
+
+#### Gradient Clipping
+
+Moved to [Gradient Clipping](deepspeed#deepspeed-grad-clip).
+
+
+#### Getting The Model Weights Out
+
+Moved to [Getting The Model Weights Out](deepspeed#deepspeed-weight-extraction).
diff --git a/docs/source/main_classes/trainer.rst b/docs/source/main_classes/trainer.rst
deleted file mode 100644
index 9429136c49e9..000000000000
--- a/docs/source/main_classes/trainer.rst
+++ /dev/null
@@ -1,632 +0,0 @@
-..
- Copyright 2020 The HuggingFace Team. All rights reserved.
-
- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
- an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
- specific language governing permissions and limitations under the License.
-
-Trainer
------------------------------------------------------------------------------------------------------------------------
-
-The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
-training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
-
-Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a
-:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
-customization during training.
-
-The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
-<https://github.com/NVIDIA/apex>`__ and Native AMP for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
-
-Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop which supports
-the above features. To inject custom behavior you can subclass them and override the following methods:
-
-- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
-- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
-- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
-- **log** -- Logs information on the various objects watching training.
-- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at
- init. Note, that you can also subclass or override the ``create_optimizer`` and ``create_scheduler`` methods
- separately.
-- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
-- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
-- **compute_loss** - Computes the loss on a batch of training inputs.
-- **training_step** -- Performs a training step.
-- **prediction_step** -- Performs an evaluation/test step.
-- **run_model** (TensorFlow only) -- Basic pass through the model.
-- **evaluate** -- Runs an evaluation loop and returns metrics.
-- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
-
-.. warning::
-
- The :class:`~transformers.Trainer` class is optimized for 🤗 Transformers models and can have surprising behaviors
- when you use it on other models. When using it on your own model, make sure:
-
- - your model always return tuples or subclasses of :class:`~transformers.file_utils.ModelOutput`.
- - your model can compute the loss if a :obj:`labels` argument is provided and that loss is returned as the first
- element of the tuple (if your model returns tuples)
- - your model can accept multiple label arguments (use the :obj:`label_names` in your
- :class:`~transformers.TrainingArguments` to indicate their name to the :class:`~transformers.Trainer`) but none
- of them should be named :obj:`"label"`.
-
-Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function for multi-label
-classification:
-
-.. code-block:: python
-
- from torch import nn
- from transformers import Trainer
-
- class MultilabelTrainer(Trainer):
- def compute_loss(self, model, inputs, return_outputs=False):
- labels = inputs.get("labels")
- outputs = model(**inputs)
- logits = outputs.get('logits')
- loss_fct = nn.BCEWithLogitsLoss()
- loss = loss_fct(logits.view(-1, self.model.config.num_labels),
- labels.float().view(-1, self.model.config.num_labels))
- return (loss, outputs) if return_outputs else loss
-
-Another way to customize the training loop behavior for the PyTorch :class:`~transformers.Trainer` is to use
-:doc:`callbacks <callback>` that can inspect the training loop state (for progress reporting, logging on TensorBoard or
-other ML platforms...) and take decisions (like early stopping).
-
-
-Trainer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.Trainer
- :members:
-
-
-Seq2SeqTrainer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.Seq2SeqTrainer
- :members: evaluate, predict
-
-
-TFTrainer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TFTrainer
- :members:
-
-
-TrainingArguments
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TrainingArguments
- :members:
-
-
-Seq2SeqTrainingArguments
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.Seq2SeqTrainingArguments
- :members:
-
-
-TFTrainingArguments
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TFTrainingArguments
- :members:
-
-
-Checkpoints
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default, :class:`~transformers.Trainer` will save all checkpoints in the :obj:`output_dir` you set in the
-:class:`~transformers.TrainingArguments` you are using. Those will go in subfolder named :obj:`checkpoint-xxx` with xxx
-being the step at which the training was at.
-
-Resuming training from a checkpoint can be done when calling :meth:`~transformers.Trainer.train` with either:
-
-- :obj:`resume_from_checkpoint=True` which will resume training from the latest checkpoint
-- :obj:`resume_from_checkpoint=checkpoint_dir` which will resume training from the specific checkpoint in the directory
- passed.
-
-In addition, you can easily save your checkpoints on the Model Hub when using :obj:`push_to_hub=True`. By default, all
-the models saved in intermediate checkpoints are saved in different commits, but not the optimizer state. You can adapt
-the :obj:`hub-strategy` value of your :class:`~transformers.TrainingArguments` to either:
-
-- :obj:`"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to
- resume training easily with :obj:`trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`.
-- :obj:`"all_checkpoints"`: all checkpoints are pushed like they appear in the output folder (so you will get one
- checkpoint folder per folder in your final repository)
-
-
-Logging
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-By default :class:`~transformers.Trainer` will use ``logging.INFO`` for the main process and ``logging.WARNING`` for
-the replicas if any.
-
-These defaults can be overridden to use any of the 5 ``logging`` levels with :class:`~transformers.TrainingArguments`'s
-arguments:
-
-- ``log_level`` - for the main process
-- ``log_level_replica`` - for the replicas
-
-Further, if :class:`~transformers.TrainingArguments`'s ``log_on_each_node`` is set to ``False`` only the main node will
-use the log level settings for its main process, all other nodes will use the log level settings for replicas.
-
-Note that :class:`~transformers.Trainer` is going to set ``transformers``'s log level separately for each node in its
-:meth:`~transformers.Trainer.__init__`. So you may want to set this sooner (see the next example) if you tap into other
-``transformers`` functionality before creating the :class:`~transformers.Trainer` object.
-
-Here is an example of how this can be used in an application:
-
-.. code-block:: python
-
- [...]
- logger = logging.getLogger(__name__)
-
- # Setup logging
- logging.basicConfig(
- format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
- datefmt="%m/%d/%Y %H:%M:%S",
- handlers=[logging.StreamHandler(sys.stdout)],
- )
-
- # set the main code and the modules it uses to the same log-level according to the node
- log_level = training_args.get_process_log_level()
- logger.setLevel(log_level)
- datasets.utils.logging.set_verbosity(log_level)
- transformers.utils.logging.set_verbosity(log_level)
-
- trainer = Trainer(...)
-
-And then if you only want to see warnings on the main node and all other nodes to not print any most likely duplicated
-warnings you could run it as:
-
-.. code-block:: bash
-
- my_app.py ... --log_level warning --log_level_replica error
-
-In the multi-node environment if you also don't want the logs to repeat for each node's main process, you will want to
-change the above to:
-
-.. code-block:: bash
-
- my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0
-
-and then only the main process of the first node will log at the "warning" level, and all other processes on the main
-node and all processes on other nodes will log at the "error" level.
-
-If you need your application to be as quiet as possible you could do:
-
-.. code-block:: bash
-
- my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0
-
-(add ``--log_on_each_node 0`` if on multi-node environment)
-
-
-
-Randomness
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When resuming from a checkpoint generated by :class:`~transformers.Trainer` all efforts are made to restore the
-`python`, `numpy` and `pytorch` RNG states to the same states as they were at the moment of saving that checkpoint,
-which should make the "stop and resume" style of training as close as possible to non-stop training.
-
-However, due to various default non-deterministic pytorch settings this might not fully work. If you want full
-determinism please refer to `Controlling sources of randomness
-<https://pytorch.org/docs/stable/notes/randomness>`__. As explained in the document, that some of those settings
-that make things deterministic (.e.g., ``torch.backends.cudnn.deterministic``) may slow things down, therefore this
-can't be done by default, but you can enable those yourself if needed.
-
-
-Trainer Integrations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-
-
-The :class:`~transformers.Trainer` has been extended to support libraries that may dramatically improve your training
-time and fit much bigger models.
-
-Currently it supports third party solutions, `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ and `FairScale
-<https://github.com/facebookresearch/fairscale/>`__, which implement parts of the paper `ZeRO: Memory Optimizations
-Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
-<https://arxiv.org/abs/1910.02054>`__.
-
-This provided support is new and experimental as of this writing.
-
-.. _zero-install-notes:
-
-CUDA Extension Installation Notes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used.
-
-While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
-<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
-<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
-any PyTorch extension that needs to build CUDA extensions.
-
-Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
-
-.. code-block:: bash
-
- pip install fairscale
- pip install deepspeed
-
-please, read the following notes first.
-
-In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
-different remember to adjust the version number to the one you are after.
-
-Possible problem #1
-=======================================================================================================================
-
-While, Pytorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
-installed system-wide.
-
-For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
-CUDA ``10.2`` installed system-wide.
-
-The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
-Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
-installation location by doing:
-
-.. code-block:: bash
-
- which nvcc
-
-If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
-search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
-<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
-
-Possible problem #2
-=======================================================================================================================
-
-Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
-may have:
-
-.. code-block:: bash
-
- /usr/local/cuda-10.2
- /usr/local/cuda-11.0
-
-Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
-the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the
-last version was installed. If you encounter the problem, where the package build fails because it can't find the right
-CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned
-environment variables.
-
-First, you may look at their contents:
-
-.. code-block:: bash
-
- echo $PATH
- echo $LD_LIBRARY_PATH
-
-so you get an idea of what is inside.
-
-It's possible that ``LD_LIBRARY_PATH`` is empty.
-
-``PATH`` lists the locations of where executables can be found and ``LD_LIBRARY_PATH`` is for where shared libraries
-are to looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple
-entries.
-
-Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
-doing:
-
-.. code-block:: bash
-
- export PATH=/usr/local/cuda-10.2/bin:$PATH
- export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
-
-Note that we aren't overwriting the existing values, but prepending instead.
-
-Of course, adjust the version number, the full path if need be. Check that the directories you assign actually do
-exist. ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so`` reside, it's unlikely
-that your system will have it named differently, but if it is adjust it to reflect your reality.
-
-
-Possible problem #3
-=======================================================================================================================
-
-Some older CUDA versions may refuse to build with newer compilers. For example, you my have ``gcc-9`` but it wants
-``gcc-7``.
-
-There are various ways to go about it.
-
-If you can install the latest CUDA toolkit it typically should support the newer compiler.
-
-Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
-already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
-build system complains it can't find it, the following might do the trick:
-
-.. code-block:: bash
-
- sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
- sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
-
-
-Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
-``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
-should find ``gcc-7`` (and ``g++7``) and then the build will succeed.
-
-As always make sure to edit the paths in the example to match your situation.
-
-FairScale
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-By integrating `FairScale <https://github.com/facebookresearch/fairscale/>`__ the :class:`~transformers.Trainer`
-provides support for the following features from `the ZeRO paper <https://arxiv.org/abs/1910.02054>`__:
-
-1. Optimizer State Sharding
-2. Gradient Sharding
-3. Model Parameters Sharding (new and very experimental)
-4. CPU offload (new and very experimental)
-
-You will need at least two GPUs to use this feature.
-
-
-**Installation**:
-
-Install the library via pypi:
-
-.. code-block:: bash
-
- pip install fairscale
-
-or via ``transformers``' ``extras``:
-
-.. code-block:: bash
-
- pip install transformers[fairscale]
-
-(will become available starting from ``transformers==4.6.0``)
-
-or find more details on `the FairScale's GitHub page <https://github.com/facebookresearch/fairscale/#installation>`__.
-
-If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.
-
-If it's still not resolved the build issue, here are a few more ideas.
-
-``fairscale`` seems to have an issue with the recently introduced by pip build isolation feature. If you have a problem
-with it, you may want to try one of:
-
-.. code-block:: bash
-
- pip install fairscale --no-build-isolation .
-
-or:
-
-.. code-block:: bash
-
- git clone https://github.com/facebookresearch/fairscale/
- cd fairscale
- rm -r dist build
- python setup.py bdist_wheel
- pip uninstall -y fairscale
- pip install dist/fairscale-*.whl
-
-``fairscale`` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:
-
-.. code-block:: bash
-
- pip uninstall -y fairscale; pip install fairscale --pre \
- -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
- --no-cache --no-build-isolation
-
-or:
-
-.. code-block:: bash
-
- pip install -v --disable-pip-version-check . \
- -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html --pre
-
-Of course, adjust the urls to match the cuda version you use.
-
-If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of
-`FairScale <https://github.com/facebookresearch/fairscale/issues>`__.
-
-
-
-**Usage**:
-
-To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments, and
-make sure you have added the distributed launcher ``-m torch.distributed.launch
---nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
-
-For example here is how you could use it for ``run_translation.py`` with 2 GPUs:
-
-.. code-block:: bash
-
- python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
- --model_name_or_path t5-small --per_device_train_batch_size 1 \
- --output_dir output_dir --overwrite_output_dir \
- --do_train --max_train_samples 500 --num_train_epochs 1 \
- --dataset_name wmt16 --dataset_config "ro-en" \
- --source_lang en --target_lang ro \
- --fp16 --sharded_ddp simple
-
-Notes:
-
-- This feature requires distributed training (so multiple GPUs).
-- It is not implemented for TPUs.
-- It works with ``--fp16`` too, to make things even faster.
-- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
- able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
- significantly shorter training time.
-
-3. To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp
- zero_dp_3`` to the command line arguments, and make sure you have added the distributed launcher ``-m
- torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
-
-For example here is how you could use it for ``run_translation.py`` with 2 GPUs:
-
-.. code-block:: bash
-
- python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
- --model_name_or_path t5-small --per_device_train_batch_size 1 \
- --output_dir output_dir --overwrite_output_dir \
- --do_train --max_train_samples 500 --num_train_epochs 1 \
- --dataset_name wmt16 --dataset_config "ro-en" \
- --source_lang en --target_lang ro \
- --fp16 --sharded_ddp zero_dp_2
-
-:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
-gradients and optimizer states.
-
-Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
-"zero_dp_2 cpu_offload"`).
-
-Notes:
-
-- This feature requires distributed training (so multiple GPUs).
-- It is not implemented for TPUs.
-- It works with ``--fp16`` too, to make things even faster.
-- The ``cpu_offload`` additional option requires ``--fp16``.
-- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
- some bugs you encounter may have been fixed there already.
-
-Known caveats:
-
-- This feature is incompatible with :obj:`--predict_with_generate` in the `run_translation.py` script.
-- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
- :obj:`FullyShardedDataParallelism` of fairscale. It should be used with the option :obj:`auto_wrap` if you are not
- doing this yourself: :obj:`--sharded_ddp "zero_dp_3 auto_wrap"`.
-
-
-DeepSpeed
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-
-Moved to :ref:`deepspeed-trainer-integration`.
-
-
-Installation
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-installation`.
-
-
-Deployment with multiple GPUs
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-multi-gpu`.
-
-
-Deployment with one GPU
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-one-gpu`.
-
-
-Deployment in Notebooks
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-notebook`.
-
-
-Configuration
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-config`.
-
-
-Passing Configuration
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-config-passing`.
-
-
-Shared Configuration
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-config-shared`.
-
-ZeRO
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-zero`.
-
-ZeRO-2 Config
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-zero2-config`.
-
-ZeRO-3 Config
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-zero3-config`.
-
-
-NVMe Support
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-nvme`.
-
-ZeRO-2 vs ZeRO-3 Performance
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-zero2-zero3-performance`.
-
-ZeRO-2 Example
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-zero2-example`.
-
-ZeRO-3 Example
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-zero3-example`.
-
-Optimizer and Scheduler
-=======================================================================================================================
-
-
-
-Optimizer
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-optimizer`.
-
-
-Scheduler
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Moved to :ref:`deepspeed-scheduler`.
-
-fp32 Precision
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-fp32`.
-
-Automatic Mixed Precision
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-amp`.
-
-Batch Size
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-bs`.
-
-Gradient Accumulation
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-grad-acc`.
-
-
-Gradient Clipping
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-grad-clip`.
-
-
-Getting The Model Weights Out
-=======================================================================================================================
-
-Moved to :ref:`deepspeed-weight-extraction`.
diff --git a/utils/check_repo.py b/utils/check_repo.py
index 0522e900dcd5..988b8f214542 100644
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -507,6 +507,8 @@ def find_all_documented_objects():
"xnli_output_modes",
"xnli_processors",
"xnli_tasks_num_labels",
+ "TFTrainer",
+ "TFTrainingArguments",
]
# Exceptionally, some objects should not be documented after all rules passed.