From 27a9cf8323c3c156eacaef036981b350c400d7c3 Mon Sep 17 00:00:00 2001
From: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date: Mon, 27 Feb 2023 12:15:02 +0530
Subject: [PATCH 1/4] update FSDP and add XLA-FSDP documentation

---
 docs/source/en/main_classes/trainer.mdx | 65 +++++++++++++++++++------
 1 file changed, 51 insertions(+), 14 deletions(-)
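
Reviewer note: the options this patch moves from command-line flags into `--fsdp_config` may be easier to picture as a file. The sketch below is a minimal illustration, assuming the JSON keys are exactly the option names documented in the diff and that the two auto wrap policies are used one at a time, as the "either ... or" wording suggests. `build_fsdp_config` and `write_fsdp_config` are hypothetical helpers, not part of `transformers`, and the default values are placeholders written as strings because the documentation quotes them as `"True"`.

```python
import json


def build_fsdp_config(layer_classes=None, min_num_params=None):
    """Hypothetical helper: assemble a dict with the keys named in this patch."""
    if layer_classes is not None and min_num_params is not None:
        # Assumption: the docs present the two policies as alternatives.
        raise ValueError("pick one auto wrap policy, not both")
    config = {
        # Placeholder defaults; adjust to the training setup.
        "fsdp_backward_prefetch": "backward_pre",
        "fsdp_forward_prefetch": "False",
        "limit_all_gathers": "False",
    }
    if layer_classes is not None:
        # Transformer based policy: list of case-sensitive layer class names.
        config["fsdp_transformer_layer_cls_to_wrap"] = list(layer_classes)
    if min_num_params is not None:
        # Size based policy: minimum parameter count for auto wrapping.
        config["fsdp_min_num_params"] = int(min_num_params)
    return config


def write_fsdp_config(path, config):
    """Write the config dict as JSON so it can be passed via --fsdp_config."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return path
```

For example, `write_fsdp_config("fsdp_config.json", build_fsdp_config(layer_classes=["T5Block"]))` produces a file that could be passed as `--fsdp_config fsdp_config.json` alongside `--fsdp "full_shard auto_wrap"`.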
diff --git a/docs/source/en/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx
index a0b914cd40af..1a77985ea6d2 100644
--- a/docs/source/en/main_classes/trainer.mdx
+++ b/docs/source/en/main_classes/trainer.mdx
@@ -574,22 +574,59 @@ add `--fsdp "full_shard offload"` or `--fsdp "shard_grad_op offload"` to the com
 add `--fsdp "full_shard auto_wrap"` or `--fsdp "shard_grad_op auto_wrap"` to the command line arguments.
 - To enable both CPU offloading and auto wrapping, 
 add `--fsdp "full_shard offload auto_wrap"` or `--fsdp "shard_grad_op offload auto_wrap"` to the command line arguments.
-- If auto wrapping is enabled, you can either use transformer based auto wrap policy or size based auto wrap policy.
-  - For transformer based auto wrap policy, please add `--fsdp_transformer_layer_cls_to_wrap <value>` to command line arguments.
-  This specifies the transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` ....
-  This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units. 
-  Using this policy, wrapping happens for each block containing Multi-Head Attention followed by couple of MLP layers. 
-  Remaining layers including the shared embeddings are conveniently wrapped in same outermost FSDP unit.
-  Therefore, use this for transformer based models.
-  - For size based auto wrap policy, please add `--fsdp_min_num_params <number>` to command line arguments.
-  It specifies FSDP's minimum number of parameters for auto wrapping.
+- Remaining FSDP config is passed via `--fsdp_config <path_to_fsdp_config.json>`. It is either a location of
+FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`. 
+  - If auto wrapping is enabled, you can either use transformer based auto wrap policy or size based auto wrap policy.
+    - For transformer based auto wrap policy, please specify `fsdp_transformer_layer_cls_to_wrap` in the config file. 
+    This specifies the list of transformer layer class name (case-sensitive) to wrap ,e.g, [`BertLayer`], [`GPTJBlock`], [`T5Block`] ....
+    This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units.
+    Using this policy, wrapping happens for each block containing Multi-Head Attention followed by couple of MLP layers. 
+    Remaining layers including the shared embeddings are conveniently wrapped in same outermost FSDP unit.
+    Therefore, use this for transformer based models.
+  - For size based auto wrap policy, please add `fsdp_min_num_params` in the config file. 
+    It specifies FSDP's minimum number of parameters for auto wrapping.
+  - `fsdp_backward_prefetch` can be specified in the config file. It controls when to prefetch next set of parameters. 
+    `backward_pre` and `backward_post` are available options. 
+    For more information refer `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`
+  - `fsdp_forward_prefetch` can be specified in the config file. It controls whether the next set of parameters is prefetched during the forward pass. 
+    If `"True"`, FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass. 
+  - `limit_all_gathers` can be specified in the config file. 
+    If `"True"`, FSDP explicitly synchronizes the CPU thread to prevent too many in-flight all-gathers.
 
 **Few caveats to be aware of**
-- Mixed precision is currently not supported with FSDP as we wait for PyTorch to fix support for it.
-More details in this [issues](https://github.com/pytorch/pytorch/issues/75676).
-- FSDP currently doesn't support multiple parameter groups. 
-More details mentioned in this [issue](https://github.com/pytorch/pytorch/issues/76501)
-(`The original model parameters' .grads are not set, meaning that they cannot be optimized separately (which is why we cannot support multiple parameter groups)`).
+- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
+This feature is incompatible with `generate()` method of `PreTrainedModel` class. 
+Please refer issue [#21667](https://github.com/huggingface/transformers/issues/21667)
+
+### PyTorch/XLA Fully Sharded Data parallel
+
+For all the TPU users, great news! PyTorch/XLA now supports FSDP.
+All the latest Fully Sharded Data Parallel (FSDP) training features are supported.
+For more information, refer to [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) and the [PyTorch/XLA implementation of FSDP](https://github.com/pytorch/xla/tree/master/torch_xla/distributed/fsdp).
+All you need to do is enable it through the config.
+
+**Required PyTorch/XLA version for FSDP support**: >=2.0
+
+**Usage**:
+
+Pass `--fsdp True` along with following changes to be made in `--fsdp_config <path_to_fsdp_config.json>`:
+- `xla` should be set to `True` to enable PyTorch/XLA FSDP.
+- `xla_fsdp_settings` The value is a dictionary which stores the XLA FSDP wrapping parameters.
+  For a complete list of options, please see [here](
+  https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py).
+- `xla_fsdp_grad_ckpt`. When `True`, uses gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be
+used when the xla flag is set to true, and an auto wrapping policy is specified through
+`fsdp_min_num_params` or `fsdp_transformer_layer_cls_to_wrap`. 
+- You can either use transformer based auto wrap policy or size based auto wrap policy.
+  - For transformer based auto wrap policy, please specify `fsdp_transformer_layer_cls_to_wrap` in the config file. 
+    This specifies the list of transformer layer class name (case-sensitive) to wrap ,e.g, [`BertLayer`], [`GPTJBlock`], [`T5Block`] ....
+    This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units.
+    Using this policy, wrapping happens for each block containing Multi-Head Attention followed by a couple of MLP layers. 
+    The remaining layers, including the shared embeddings, are conveniently wrapped in the same outermost FSDP unit.
+    Therefore, use this for transformer based models.
+  - For size based auto wrap policy, please add `fsdp_min_num_params` in the config file. 
+    It specifies FSDP's minimum number of parameters for auto wrapping.
+    
 
 ### Using Trainer for accelerated PyTorch Training on Mac 
 

From 86d73c5bf35a84a75a66f183a65e5e9ab9236190 Mon Sep 17 00:00:00 2001
From: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date: Mon, 27 Feb 2023 13:07:11 +0530
Subject: [PATCH 2/4] resolving comments

---
 docs/source/en/main_classes/trainer.mdx | 40 ++++++++++++-------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/source/en/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx
index 1a77985ea6d2..40325217ff60 100644
--- a/docs/source/en/main_classes/trainer.mdx
+++ b/docs/source/en/main_classes/trainer.mdx
@@ -564,27 +564,27 @@ as the model saving with FSDP activated is only available with recent fixes.
 
 - **Sharding Strategy**: 
   - FULL_SHARD : Shards optimizer states + gradients + model parameters across data parallel workers/GPUs.
-  For this, add `--fsdp full_shard` to the command line arguments. 
+    For this, add `--fsdp full_shard` to the command line arguments. 
   - SHARD_GRAD_OP : Shards optimizer states + gradients across data parallel workers/GPUs.
     For this, add `--fsdp shard_grad_op` to the command line arguments.
   - NO_SHARD : No sharding. For this, add `--fsdp no_shard` to the command line arguments.
 - To offload the parameters and gradients to the CPU, 
-add `--fsdp "full_shard offload"` or `--fsdp "shard_grad_op offload"` to the command line arguments.
--  To automatically recursively wrap layers with FSDP using `default_auto_wrap_policy`, 
-add `--fsdp "full_shard auto_wrap"` or `--fsdp "shard_grad_op auto_wrap"` to the command line arguments.
+  add `--fsdp "full_shard offload"` or `--fsdp "shard_grad_op offload"` to the command line arguments.
+- To automatically recursively wrap layers with FSDP using `default_auto_wrap_policy`, 
+  add `--fsdp "full_shard auto_wrap"` or `--fsdp "shard_grad_op auto_wrap"` to the command line arguments.
 - To enable both CPU offloading and auto wrapping, 
-add `--fsdp "full_shard offload auto_wrap"` or `--fsdp "shard_grad_op offload auto_wrap"` to the command line arguments.
+  add `--fsdp "full_shard offload auto_wrap"` or `--fsdp "shard_grad_op offload auto_wrap"` to the command line arguments.
 - Remaining FSDP config is passed via `--fsdp_config <path_to_fsdp_config.json>`. It is either a location of
-FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`. 
+  FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file as `dict`. 
   - If auto wrapping is enabled, you can either use transformer based auto wrap policy or size based auto wrap policy.
     - For transformer based auto wrap policy, please specify `fsdp_transformer_layer_cls_to_wrap` in the config file. 
-    This specifies the list of transformer layer class name (case-sensitive) to wrap ,e.g, [`BertLayer`], [`GPTJBlock`], [`T5Block`] ....
-    This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units.
-    Using this policy, wrapping happens for each block containing Multi-Head Attention followed by couple of MLP layers. 
-    Remaining layers including the shared embeddings are conveniently wrapped in same outermost FSDP unit.
-    Therefore, use this for transformer based models.
-  - For size based auto wrap policy, please add `fsdp_min_num_params` in the config file. 
-    It specifies FSDP's minimum number of parameters for auto wrapping.
+      This specifies the list of transformer layer class names (case-sensitive) to wrap, e.g., [`BertLayer`], [`GPTJBlock`], [`T5Block`] ...
+      This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units.
+      Using this policy, wrapping happens for each block containing Multi-Head Attention followed by a couple of MLP layers. 
+      The remaining layers, including the shared embeddings, are conveniently wrapped in the same outermost FSDP unit.
+      Therefore, use this for transformer based models.
+    - For size based auto wrap policy, please add `fsdp_min_num_params` in the config file. 
+      It specifies FSDP's minimum number of parameters for auto wrapping.
   - `fsdp_backward_prefetch` can be specified in the config file. It controls when to prefetch next set of parameters. 
     `backward_pre` and `backward_post` are available options. 
     For more information refer `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`
@@ -594,9 +594,9 @@ FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file
     If `"True"`, FSDP explicitly synchronizes the CPU thread to prevent too many in-flight all-gathers.
 
 **Few caveats to be aware of**
-- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
-This feature is incompatible with `generate()` method of `PreTrainedModel` class. 
-Please refer issue [#21667](https://github.com/huggingface/transformers/issues/21667)
+- it is incompatible with `generate`, thus is incompatible with `--predict_with_generate` 
+  in all seq2seq/clm scripts (translation/summarization/clm etc.) and 
+  Please refer issue [#21667](https://github.com/huggingface/transformers/issues/21667)
 
 ### PyTorch/XLA Fully Sharded Data parallel
 
@@ -614,9 +614,9 @@ Pass `--fsdp True` along with following changes to be made in `--fsdp_config <pa
 - `xla_fsdp_settings` The value is a dictionary which stores the XLA FSDP wrapping parameters.
   For a complete list of options, please see [here](
   https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py).
-- `xla_fsdp_grad_ckpt`. When `True`, uses gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be
-used when the xla flag is set to true, and an auto wrapping policy is specified through
-`fsdp_min_num_params` or `fsdp_transformer_layer_cls_to_wrap`. 
+- `xla_fsdp_grad_ckpt`. When `True`, uses gradient checkpointing over each nested XLA FSDP wrapped layer. 
+  This setting can only be used when the xla flag is set to true, and an auto wrapping policy is specified through
+  `fsdp_min_num_params` or `fsdp_transformer_layer_cls_to_wrap`. 
 - You can either use transformer based auto wrap policy or size based auto wrap policy.
   - For transformer based auto wrap policy, please specify `fsdp_transformer_layer_cls_to_wrap` in the config file. 
     This specifies the list of transformer layer class name (case-sensitive) to wrap ,e.g, [`BertLayer`], [`GPTJBlock`], [`T5Block`] ....
@@ -626,7 +626,7 @@ used when the xla flag is set to true, and an auto wrapping policy is specified
     Therefore, use this for transformer based models.
   - For size based auto wrap policy, please add `fsdp_min_num_params` in the config file. 
     It specifies FSDP's minimum number of parameters for auto wrapping.
-    
+
 
 ### Using Trainer for accelerated PyTorch Training on Mac 
 

From 5bea5fecea92ea9255c59de093d8b9b7eb4f1101 Mon Sep 17 00:00:00 2001
From: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date: Mon, 27 Feb 2023 13:11:07 +0530
Subject: [PATCH 3/4] minor update

---
 docs/source/en/main_classes/trainer.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx
index 40325217ff60..e549e19a5ff9 100644
--- a/docs/source/en/main_classes/trainer.mdx
+++ b/docs/source/en/main_classes/trainer.mdx
@@ -595,7 +595,7 @@ as the model saving with FSDP activated is only available with recent fixes.
 
 **Few caveats to be aware of**
 - it is incompatible with `generate`, thus is incompatible with `--predict_with_generate` 
-  in all seq2seq/clm scripts (translation/summarization/clm etc.) and 
+  in all seq2seq/clm scripts (translation/summarization/clm etc.).  
   Please refer issue [#21667](https://github.com/huggingface/transformers/issues/21667)
 
 ### PyTorch/XLA Fully Sharded Data parallel

From d539e68c64971f641e30f204d52fc01eccd2892e Mon Sep 17 00:00:00 2001
From: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date: Tue, 28 Feb 2023 13:02:44 +0530
Subject: [PATCH 4/4] fix xla-fsdp docs

---
 docs/source/en/main_classes/trainer.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
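
Reviewer note: the XLA section documents a constraint that is easy to miss, namely that `xla_fsdp_grad_ckpt` only applies when `xla` is enabled and an auto wrapping policy is given through `fsdp_min_num_params` or `fsdp_transformer_layer_cls_to_wrap`. The sketch below restates that rule as a small check over a config dict. It is an illustration under assumptions: `check_xla_fsdp_config` is a hypothetical helper, not a `transformers` API, the values are taken to be JSON booleans and integers, and `xla_fsdp_settings` is left empty because its options are defined by PyTorch/XLA rather than by this patch.

```python
import json


def check_xla_fsdp_config(config):
    """Hypothetical helper: list violations of the rules stated in the docs."""
    problems = []
    # An auto wrap policy is present if either documented key is set.
    has_policy = bool(config.get("fsdp_transformer_layer_cls_to_wrap")) or (
        config.get("fsdp_min_num_params", 0) > 0
    )
    if config.get("xla_fsdp_grad_ckpt", False):
        if not config.get("xla", False):
            problems.append("xla_fsdp_grad_ckpt requires xla to be true")
        if not has_policy:
            problems.append("xla_fsdp_grad_ckpt requires an auto wrap policy")
    # The docs describe xla_fsdp_settings as a dictionary.
    if not isinstance(config.get("xla_fsdp_settings", {}), dict):
        problems.append("xla_fsdp_settings must be a dictionary")
    return problems


# Example config text using only the keys named in the documentation.
EXAMPLE_CONFIG = json.loads(
    """
    {
      "xla": true,
      "xla_fsdp_settings": {},
      "xla_fsdp_grad_ckpt": true,
      "fsdp_transformer_layer_cls_to_wrap": ["T5Block"]
    }
    """
)
```

Calling `check_xla_fsdp_config(EXAMPLE_CONFIG)` returns an empty list, while a config that sets `xla_fsdp_grad_ckpt` without `xla` or a wrapping policy yields one message per violated rule.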

diff --git a/docs/source/en/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx
index e549e19a5ff9..67ab6aba42ef 100644
--- a/docs/source/en/main_classes/trainer.mdx
+++ b/docs/source/en/main_classes/trainer.mdx
@@ -609,7 +609,7 @@ All you need to do is enable it through the config.
 
 **Usage**:
 
-Pass `--fsdp True` along with following changes to be made in `--fsdp_config <path_to_fsdp_config.json>`:
+Pass `--fsdp "full_shard"` along with the following changes to be made in `--fsdp_config <path_to_fsdp_config.json>`:
 - `xla` should be set to `True` to enable PyTorch/XLA FSDP.
 - `xla_fsdp_settings` The value is a dictionary which stores the XLA FSDP wrapping parameters.
   For a complete list of options, please see [here](