diff --git a/README.md b/README.md
index b1ec9ab394a2..9763e3a8cf09 100755
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art
NVIDIA Megatron on Azure GPUs.
- *Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md)
+ *Read more*: [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/)
@@ -106,10 +106,10 @@ combination. ZeRO boosts the scaling capability and efficiency further.
significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054),
- and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md).
+ and [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/).
-
+
The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.
@@ -121,7 +121,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
-*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md),
+*Read more*: [Tuning tutorial](https://www.deepspeed.ai/tutorials/1Cycle/),
+
### Constant Buffer Optimization (CBO)
CBO enables high network and memory throughput while restricting memory usage to a
constant size. For memory- and network-bound operations such as normalization or
@@ -131,18 +102,18 @@ The DeepSpeed core API consists of just a handful of methods:
DeepSpeed supports all the features described in this document via the use of these APIs,
along with a `deepspeed_config` JSON file for enabling and disabling the features.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
### Gradient Clipping
DeepSpeed handles gradient clipping under the hood based on the max gradient norm
specified by the user.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
### Automatic loss scaling with mixed precision
DeepSpeed internally handles loss scaling for mixed precision training. The parameters
for loss scaling can be specified in the `deepspeed_config` JSON file.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
## Training Optimizers
@@ -176,19 +147,19 @@ more details see [ZeRO paper](https://arxiv.org/abs/1910.02054) .
DeepSpeed can simplify checkpointing for you regardless of whether you are using data
parallel training, model parallel training, mixed-precision training, a mix of these
three, or using the ZeRO optimizer to enable larger model sizes.
-Please see the [Getting Started](../README.md#getting-started) guide
+Please see the [Getting Started](/getting-started/) guide
and the
-[core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+[core API doc](https://deepspeed.readthedocs.io/) for more details.
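+
+As a rough sketch (directory and tag names here are illustrative), saving and
+loading through the DeepSpeed engine look like:
+
+```python
+# Save: bundle any client state to restore alongside DeepSpeed's own state.
+client_sd = {'step': step}
+model_engine.save_checkpoint(ckpt_dir, ckpt_tag, client_state=client_sd)
+
+# Load: DeepSpeed restores its internal state and returns the client state.
+_, client_sd = model_engine.load_checkpoint(ckpt_dir, ckpt_tag)
+step = client_sd['step']
+```
+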
## Advanced parameter search
DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for
large batch scaling.
### Learning Rate Range Test
-Please refer to the [Learning Rate Range Test](tutorials/lrrt.md) tutorial.
+Please refer to the [Learning Rate Range Test](/tutorials/lrrt/) tutorial.
### 1Cycle Learning Rate Schedule
-Please refer to the [1Cycle Learning Rate Schedule](tutorials/1Cycle.md) tutorial.
+Please refer to the [1Cycle Learning Rate Schedule](/tutorials/1Cycle/) tutorial.
## Simplified Data Loader
@@ -200,7 +171,7 @@ can automatically handle batch creation appropriately.
For performance debugging, DeepSpeed can give you a detailed breakdown of the time spent
in different parts of the training by simply enabling it in the `deepspeed_config`
file.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
```json
{
"wall_clock_breakdown": true
diff --git a/docs/_tutorials/1Cycle.md b/docs/_tutorials/1Cycle.md
new file mode 100644
index 000000000000..0cb8a45f31f0
--- /dev/null
+++ b/docs/_tutorials/1Cycle.md
@@ -0,0 +1,144 @@
+---
+title: "1-Cycle Schedule"
+---
+
+This tutorial shows how to implement 1Cycle schedules for learning rate and
+momentum in PyTorch.
+
+## 1-Cycle Schedule
+Recent research has demonstrated that the slow convergence problems of large
+batch size training can be addressed by tuning critical hyperparameters,
+such as learning rate and momentum, during training using cyclic and decay
+schedules. In DeepSpeed, we have implemented a state-of-the-art schedule called
+[1-Cycle](https://arxiv.org/abs/1803.09820) to help data scientists
+effectively use larger batch sizes to train their models in PyTorch.
+
+## Prerequisites
+
+To use the 1-cycle schedule for model training, you should satisfy these two requirements:
+
+1. Integrate DeepSpeed into your training script using the [Getting
+Started](/getting-started/) guide.
+2. Add the parameters to configure a 1-Cycle schedule to the parameters of your
+model. We will define the 1-Cycle parameters below.
+
+## Overview
+The 1-cycle schedule operates in two phases, a cycle phase and a decay phase,
+which span one iteration over the training data. For concreteness, we will
+review how the 1-cycle schedule of the learning rate works. In the cycle phase,
+the learning rate oscillates between a minimum value and a maximum value over a
+number of training steps. In the decay phase, the learning rate decays starting
+from the minimum value of the cycle phase. An example of a 1-cycle learning rate
+schedule during model training is illustrated below.
+
+
+
+### 1-Cycle Parameters
+
+The 1-Cycle schedule is defined by a number of parameters that allow users to
+explore different configurations. The literature recommends concurrent tuning
+of learning rate and momentum because they are correlated hyperparameters. We
+have leveraged this recommendation to reduce the configuration burden by
+organizing the 1-cycle parameters into two groups:
+
+1. Global parameters for configuring the cycle and decay phase
+2. Local parameters for configuring learning rate and momentum
+
+The global parameters for configuring the 1-cycle phases are:
+
+1. `cycle_first_step_size`: The count of training steps to complete the first step of the cycle phase
+2. `cycle_first_stair_count`: The count of updates (or stairs) in the first step of the cycle phase
+3. `cycle_second_step_size`: The count of training steps to complete the second step of the cycle phase
+4. `cycle_second_stair_count`: The count of updates (or stairs) in the second step of the cycle phase
+5. `decay_step_size`: The interval, in training steps, at which to decay the hyperparameter in the decay phase
+
+The local parameters for the hyperparameters are:
+
+**Learning rate**:
+
+1. `cycle_min_lr`: minimum learning rate in cycle phase
+2. `cycle_max_lr`: maximum learning rate in cycle phase
+3. `decay_lr_rate`: decay rate for learning rate in decay phase
+
+Although appropriate `cycle_min_lr` and `cycle_max_lr` values can be
+selected based on experience or expertise, we recommend using the [learning rate
+range test](/tutorials/lrrt/) feature of DeepSpeed to configure them.
+
+**Momentum**:
+
+1. `cycle_min_mom`: minimum momentum in cycle phase
+2. `cycle_max_mom`: maximum momentum in cycle phase
+3. `decay_mom_rate`: decay rate for momentum in decay phase
+
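+To make this concrete, below is a minimal sketch of how the learning rate
+parameters could combine over training steps. It only illustrates the shape of
+the schedule; it is not DeepSpeed's implementation, and the decay formula in
+particular is a simplifying assumption (stair counts are ignored):
+
+```python
+def one_cycle_lr(step, first_size, second_size, decay_size,
+                 min_lr, max_lr, decay_rate):
+    """Illustrative 1-cycle learning rate at a given training step."""
+    if step <= first_size:
+        # First step of the cycle: ramp up from min_lr to max_lr.
+        return min_lr + (max_lr - min_lr) * step / first_size
+    step -= first_size
+    if step <= second_size:
+        # Second step of the cycle: ramp back down to min_lr.
+        return max_lr - (max_lr - min_lr) * step / second_size
+    # Decay phase: shrink from min_lr once per decay interval.
+    num_decays = (step - second_size) // decay_size
+    return min_lr / (1 + num_decays * decay_rate)
+```
+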
+## Required Model Configuration Changes
+
+To illustrate the required model configuration changes to use 1-Cycle schedule
+in model training, we will use a schedule with the following properties:
+
+1. A symmetric cycle phase, where each half of the cycle spans the same number
+of training steps. For this example, it will take 1000 training steps for the
+learning rate to increase from 0.0001 to 0.0010 (10X scale), and then to
+decrease back to 0.0001. The momentum will correspondingly cycle between 0.85
+and 0.99 in a similar number of steps.
+2. A decay phase, where learning rate decays by 0.001 every 1000 steps, while
+momentum is not decayed.
+
+Note that these parameters are processed by DeepSpeed as session parameters,
+and so should be added to the appropriate section of the model configuration.
+
+### PyTorch Model
+
+PyTorch versions 1.0.1 and newer provide a feature for implementing schedulers
+for hyper-parameters, called [learning rate
+schedulers](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html).
+We have implemented the 1-Cycle schedule using this feature. You will add a
+scheduler entry of type **"OneCycle"** as illustrated below.
+
+```json
+"scheduler": {
+ "type": "OneCycle",
+ "params": {
+ "cycle_first_step_size": 1000,
+ "cycle_first_stair_count": 500,
+ "cycle_second_step_size": 1000,
+ "cycle_second_stair_count": 500,
+ "decay_step_size": 1000,
+ "cycle_min_lr": 0.0001,
+ "cycle_max_lr": 0.0010,
+ "decay_lr_rate": 0.001,
+ "cycle_min_mom": 0.85,
+ "cycle_max_mom": 0.99,
+ "decay_mom_rate": 0.0
+ }
+},
+```
+
+## Batch Scaling Example
+
+As an example of how the 1-Cycle schedule can enable effective batch scaling, we
+briefly share our experience with an internal model at Microsoft. In this case,
+the model was well-tuned for fast convergence (in data samples) on a single
+GPU, but was converging slowly to target performance (AUC) when training on 8
+GPUs (8X batch size). The plot below shows model convergence with 8 GPUs for
+these learning rate schedules:
+
+1. **Fixed**: using an optimal fixed learning rate for 1-GPU training.
+2. **LinearScale**: using a fixed learning rate that is 8X of **Fixed**.
+3. **1Cycle**: using 1-Cycle schedule.
+
+
+
+With **1Cycle**, the model converges faster than the other schedules to the
+target AUC. In fact, **1Cycle** converges as fast as the optimal 1-GPU
+training (not shown). For **Fixed**, convergence is about 5X slower (needs 5X
+more data samples). With **LinearScale**, the model diverges because the
+learning rate is too high. The plot below illustrates the schedules by
+reporting the learning rate values during 8-GPU training.
+
+
+
+We see that the learning rate for **1Cycle** is always larger than **Fixed**
+and is briefly larger than **LinearScale** to achieve faster convergence. Also
+**1Cycle** lowers the learning rate later during training to avoid model
+divergence, in contrast to **LinearScale**. In summary, by configuring an
+appropriate 1-Cycle schedule we were able to effectively scale the training batch
+size for this model by 8X without loss of convergence speed.
diff --git a/docs/_tutorials/azure.md b/docs/_tutorials/azure.md
new file mode 100644
index 000000000000..8f3ed2fd9959
--- /dev/null
+++ b/docs/_tutorials/azure.md
@@ -0,0 +1,131 @@
+---
+title: "Getting Started with DeepSpeed on Azure"
+---
+
+This tutorial will help you get started running DeepSpeed on [Azure virtual
+machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
+Looking forward, we will be integrating these techniques and additional enhancements
+into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) platform to
+benefit all your large model training jobs.
+
+If you don't already have an Azure account please see more details here: [https://azure.microsoft.com/](https://azure.microsoft.com/).
+
+To help with launching Azure instances we suggest using the [Azure
+CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
+several helper scripts to get you quickly started using DeepSpeed with Azure.
+ * Install Azure CLI on your local box: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
+ * Alternatively you can use the Azure in-browser shell: https://shell.azure.com/
+
+## Create an SSH key
+Generate an SSH key that will be used across this tutorial to SSH into your VMs and
+between Docker containers. `ssh-keygen` is the recommended way of doing this. Our scripts
+assume your key is located inside the same directory as the Azure scripts.
+
+## Azure Config JSON
+Our helper scripts depend on the following configuration JSON for deployment
+and setup. We have provided a simple example JSON in `azure_config.json` that
+sets up a basic environment with two VMs. This config uses the NV6_Promo
+instance type which has one NVIDIA Tesla M60 GPU per VM. You can read more
+details about the VM on the [Linux Virtual Machines
+Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
+page.
+
+See the example below:
+```json
+{
+  "num_vms": 2,
+  "location": "southcentralus",
+  "azure_sku": "Standard_NV6_Promo",
+  "ssh_private_key": "id_rsa",
+  "docker_ssh_port": 2222
+}
+```
+
+## Dependencies
+The scripts in this tutorial require [jq](https://stedolan.github.io/jq/) to help with
+parsing JSON from the command line. Also, it is recommended to install
+[pdsh](https://linux.die.net/man/1/pdsh) to help launch SSH connections in parallel.
+
+## Create Azure VMs
+We first need to allocate the VMs. We provide a script
+```bash
+./create_vms.sh
+```
+to create VMs with the Azure SKU in the region specified in `azure_config.json`. Feel
+free to customize your JSON to your desired region/SKU. This step will take a few minutes
+to complete while it sets up all of your VMs on Azure.
+
+## Setup VM environment to use DeepSpeed
+Next, we need to configure the VM environment for DeepSpeed. We provide a script
+```bash
+./setup_vms.sh
+```
+to generate a [hostfile](/getting-started/#resource-configuration-multi-node) and SSH
+configuration on all of the VMs. This configuration will be used by the DeepSpeed
+Docker containers in the next step.
+
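+For reference, a DeepSpeed hostfile simply lists one SSH alias and its GPU slot
+count per line. As a minimal sketch, the generated file for this two-VM,
+one-GPU-per-VM example could look like the output of the following (the
+`worker-N` aliases match the container setup described below):
+
+```python
+# Illustrative hostfile contents: "<alias> slots=<num_gpus>" per line.
+with open("hostfile", "w") as f:
+    for i in range(2):  # two NV6 VMs, one M60 GPU each
+        f.write(f"worker-{i} slots=1\n")
+```
+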
+## Start the DeepSpeed docker container
+We now setup the DeepSpeed Docker containers on the VMs. We provide a script
+```bash
+./setup_docker.sh
+```
+to pull the DeepSpeed image onto all VMs and start a container instance in the
+background. This will take several minutes since it needs to pull the entire Docker
+image.
+
+## Access VMs
+The tool `azure_ssh.sh` will let you SSH into any of the VMs with this
+syntax:
+```bash
+./azure_ssh.sh <node-id> [command]
+```
+where the `node-id` is a number between `0` and `num_vms-1`. This script will find the
+public IP address of your VM and use the SSH key provided in the Azure configuration
+JSON.
+
+## Access DeepSpeed container
+Everything should be up and running at this point. Let's access the running DeepSpeed
+container on the first VM and make sure we can talk to the other containers in our deployment.
+
+ * SSH into the first VM via: `./azure_ssh.sh 0`
+ * Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
+ * Attach the running docker container via: `./attach.sh`
+ * You should now be able to `ssh` into any other docker container; the containers can be
+ accessed via their SSH alias of `worker-N`, where `N` is the VM number between `0`
+ and `num_vms-1`. In this example we should be able to successfully run `ssh worker-1
+ hostname`, which will return the hostname of worker-1.
+
+## Parallel SSH across containers
+ DeepSpeed comes installed with a helper script `ds_ssh` which is a wrapper around
+ the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you issue commands
+ to groups of hosts (via SSH) in parallel. This wrapper simply connects with the
+ hostfile that defines all the containers in your deployment. For example, if you run
+ `ds_ssh hostname` you should see a list of all the hostnames in your deployment.
+
+## Run CIFAR-10 example model
+We will now run the DeepSpeed CIFAR-10 model example to test the VM setup. From inside
+the first DeepSpeed container:
+
+ 1) Install the python dependencies necessary to run the CIFAR-10 example model. You can
+ do this across your cluster via:
+ ```bash
+ ds_ssh pip install -r ~/workdir/DeepSpeed/DeepSpeedExamples/cifar/requirements.txt
+ ```
+
+ 2) Now change directories to the CIFAR example:
+ ```bash
+ cd ~/workdir/DeepSpeed/DeepSpeedExamples/cifar
+ ```
+
+ 3) Finally, launch training across all VMs:
+ ```bash
+ deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
+ ```
+
+## Megatron-LM GPT2
+DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full
+[Megatron tutorial](/tutorials/megatron/) for more details.
+ * In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of
+ Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and
+ a batch size of 1536 you should be able to complete 100k training steps (153.6 million
+ samples) in less than 2 weeks of training.
diff --git a/docs/_tutorials/getting-started.md b/docs/_tutorials/getting-started.md
index c6be8a05638d..69ffab990eb0 100644
--- a/docs/_tutorials/getting-started.md
+++ b/docs/_tutorials/getting-started.md
@@ -6,9 +6,10 @@ excerpt: "First steps with DeepSpeed"
## Installation
-* Please see our [Azure tutorial](docs/azure.md) to get started with DeepSpeed on Azure!
+* Please see our [Azure tutorial](/tutorials/azure/) to get started with DeepSpeed on Azure!
* If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
-* If you want to install DeepSpeed manually, we provide an install script [install.sh](install.sh) to help install on a local machine or across an entire cluster.
+* If you want to install DeepSpeed manually, we provide an install script
+  `install.sh` to help install on a local machine or across an entire cluster.
## Writing DeepSpeed Models
DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
@@ -114,8 +115,8 @@ the `step` value is stored as part of the `client_sd`.
## DeepSpeed Configuration
DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
-is shown below. For a full set of features see [core API
-doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html).
+is shown below. For a full set of features see the [API
+doc](/docs/config_json/).
```json
{
diff --git a/docs/_tutorials/lrrt.md b/docs/_tutorials/lrrt.md
new file mode 100644
index 000000000000..d2e1e4051934
--- /dev/null
+++ b/docs/_tutorials/lrrt.md
@@ -0,0 +1,148 @@
+---
+title: "Learning Rate Range Test"
+---
+This tutorial shows how to perform learning rate range tests in PyTorch.
+
+## Learning Rate Range Test (LRRT)
+
+The learning rate range test ([LRRT](https://arxiv.org/abs/1803.09820)) is a
+method for discovering the largest learning rate values that can be used to
+train a model without divergence. Data scientists are often interested in this
+information because large learning rates lead to faster model convergence than
+small learning rates. Moreover, large learning rates are crucial in learning
+rate schedules such as [CLR](https://arxiv.org/abs/1506.01186) and
+[1Cycle](https://arxiv.org/abs/1803.09820), which are used to train effectively
+with large batch sizes. DeepSpeed provides LRRT for model training in
+PyTorch.
+
+## Prerequisites
+
+To use DeepSpeed's LRRT, you must satisfy the following two conditions:
+
+1. Integrate DeepSpeed into your training script using the [Getting
+Started](/getting-started/) guide.
+2. Add the parameters to configure LRRT to the parameters of your model. The
+LRRT parameters are defined below.
+
+## LRRT Parameters
+
+LRRT works by linearly increasing the learning rate by a predefined amount, at
+predefined intervals. Thus, LRRT is a form of learning rate schedule because it
+defines how and when the learning rate should change during model training. To
+configure LRRT, you will need to set these parameters (see the sketch after this list):
+
+1. `lr_range_test_min_lr`: The initial learning rate for training `(float)`
+2. `lr_range_test_step_size`: The interval for scaling up the learning rate,
+defined in training steps `(integer)`
+3. `lr_range_test_step_rate`: The scaling factor for increasing the learning rate
+`(float)`
+4. `lr_range_test_staircase`: If true, the learning rate is changed every
+`lr_range_test_step_size` training steps; otherwise the learning rate is changed at
+every training step `(boolean)`
+
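+Putting these together, here is a minimal sketch of the schedule shape these
+parameters imply. It is an illustration of the parameter semantics above, not
+DeepSpeed's exact implementation:
+
+```python
+def lrrt_lr(step, min_lr, step_size, step_rate, staircase):
+    """Illustrative LRRT learning rate at a given training step."""
+    # Progress through the test, in whole intervals (staircase) or continuously.
+    interval = step // step_size if staircase else step / step_size
+    # Linear growth: each interval adds another step_rate multiple of min_lr.
+    return min_lr * (1 + step_rate * interval)
+```
+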
+## Required Model Configuration Changes
+
+We will illustrate the required model configuration changes with an example LRRT
+schedule that:
+
+1. Starts training with an initial learning rate of 0.0001
+2. Uses a scaling rate of 5
+3. Uses a scaling interval of 200 training steps
+4. Scales learning rate at every training step, i.e., does not use staircase
+
+### PyTorch
+
+For PyTorch models, LRRT is implemented as a [learning rate
+scheduler](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html),
+a feature that is available in PyTorch versions 1.0.1 and newer. Thus, you can
+add a `"scheduler"` entry of type `"LRRangeTest"` into your model configuration
+as illustrated below:
+
+```json
+"scheduler": {
+ "type": "LRRangeTest",
+ "params": {
+ "lr_range_test_min_lr": 0.0001,
+ "lr_range_test_step_size": 200,
+ "lr_range_test_step_rate": 5,
+ "lr_range_test_staircase": false
+ }
+}
+```
+
+
+## Example: Tuning for Large Batch Sizes
+
+We illustrate how LRRT can benefit data scientists with a snippet of our
+experience of tuning an internal production model to converge efficiently on
+larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs
+(batch size 2048). Our goal was to train the model with the larger batch size
+to match the performance of the smaller batch size using the same amount of
+data samples. The challenge here is the well-known problem of slow convergence
+of large batch size training. Our approach was to use a
+[1Cycle](/tutorials/1Cycle/) schedule in DeepSpeed to tackle
+this problem, and we used LRRT to configure the schedule.
+
+In the plots below, we illustrate using LRRT to discover the maximum learning
+rates for effective training with batch size 2048. The plot on the left shows
+the impact of large learning rates on validation loss over the first 9000
+batches of training. The plot on the right shows the learning rate values
+during the same period of training. Using grid search, we discovered that the
+best fixed learning rate for batch size 2048 is 0.0002. The blue line
+(`lr=0.0002`) represents training with this fixed learning rate. We compare
+two LRRT schedules with this fixed learning rate. The orange
+(`lr_range_test_step_rate=5`) and gray (`lr_range_test_step_rate=50`) lines
+represent training with similar LRRT schedules that differ only in their
+`lr_range_test_step_rate` values. Although the LRRT schedules start from the
+same base learning rate, the gray line's learning rate grows about 10 times
+faster than the orange line's. Also, within the plotted range the learning
+rates of the LRRT schedules grow larger than that of the blue line. We
+subsequently refer to the gray and orange lines as the "fast growing" and
+"slow growing" LRRT schedules respectively.
+
+
+
+We make the following observations from this small example.
+
+1. Larger learning rates clearly benefit model performance, up to some point.
+The fast growing LRRT schedule achieves a validation loss of 0.46 after 3000
+batches, which the fixed learning rate does not achieve within 9000 batches. The
+slow growing LRRT does not match that score until after 6000 batches; however,
+it maintains an increasing performance advantage over the fixed learning rate.
+
+2. There is an upper bound on learning rate values that are useful for training
+the model. The fast growing LRRT schedule hits this boundary quickly and
+diverges, while the slow growing LRRT will later diverge for the same reason.
+LRRT helped us discover these boundaries quickly, using less than 2% of the
+training data. These boundaries are useful information for constructing
+learning rate schedules.
+
+These observations from LRRT helped us to configure the learning rate
+boundaries and the cycle span for a 1Cycle schedule that solves the problem, as
+shown below.
+
+```json
+"OneCycle": {
+ "cycle_min_lr": 0.002,
+ "cycle_max_lr": 0.005,
+ "cycle_first_step_size": 2000,
+ "cycle_second_step_size": 2000,
+ ...
+}
+```
+
+In our experience, these are the four most critical parameters of 1Cycle schedules.
+
+1. We chose to use the slower LRRT schedule (`lr_range_test_step_rate=5`) to
+set `cycle_min_lr` because it achieves the best loss and the faster schedule
+diverges fairly quickly.
+2. We set `cycle_max_lr` to 0.005 even though the plot shows that performance
+was still improving at slightly higher learning rates. This is because we
+observed that if we wait until the maximum learning rate, the model could be at
+the point of divergence and impossible to recover.
+3. Since it takes 8000 batches for the learning rate to become 0.005, we set
+`cycle_first_step_size` (and `cycle_second_step_size`) to 2000, which is the
+number of steps it takes for four GPUs to process 8000 batches.
+
+We hope this brief example sparks your imagination on using LRRT for your own
+unique tuning challenges.
diff --git a/docs/_tutorials/megatron.md b/docs/_tutorials/megatron.md
new file mode 100644
index 000000000000..4575b32c3241
--- /dev/null
+++ b/docs/_tutorials/megatron.md
@@ -0,0 +1,421 @@
+---
+title: "Megatron-LM GPT2"
+---
+
+If you haven't already, we advise you to first read through the [Getting
+Started](/getting-started/) guide before stepping through this tutorial.
+
+In this tutorial we will be adding DeepSpeed to the Megatron-LM GPT2 model, which
+is a large, powerful transformer. Megatron-LM supports model-parallel and multi-node
+training. Please see the corresponding paper for more details: [Megatron-LM:
+Training Multi-Billion Parameter Language Models Using Model
+Parallelism](https://arxiv.org/abs/1909.08053).
+
+First, we discuss data and environment setup and how to train the GPT-2 model with the
+original Megatron-LM. Next, we proceed step-by-step in enabling this model to run with
+DeepSpeed. Finally, we demonstrate the **_performance gains_** and **_memory footprint
+reduction_** from using DeepSpeed.
+
+## Training GPT-2 with the Original Megatron-LM
+
+The original model code is from
+[Megatron-LM](https://github.com/NVIDIA/Megatron-LM). We've copied this repo
+under
+[DeepSpeedExamples/Megatron-LM/](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM)
+and made it available as a submodule. To download, execute:
+```bash
+git submodule update --init --recursive
+```
+
+### Training Data Setup
+* Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data)
+ to download the webtext data and place a symbolic link under `DeepSpeedExamples/Megatron-LM/data`.
+
+### Running Unmodified Megatron-LM GPT2 model
+
+* For a single GPU run:
+ - change `scripts/pretrain_gpt2.sh` to set its `--train-data` argument to `"webtext"`.
+ - run `bash scripts/pretrain_gpt2.sh`
+
+* For multiple GPUs and/or nodes run:
+ - change `scripts/pretrain_gpt2_model_parallel.sh`
+ - set its `--train-data` argument as `"webtext"`
+ - `GPUS_PER_NODE` indicates how many GPUs per node are involved in the run
+ - `NNODES` indicates how many nodes are involved in the run
+
+ - run `bash scripts/pretrain_gpt2_model_parallel.sh`
+
+
+## Enabling DeepSpeed
+
+To use DeepSpeed we will modify three files:
+
+* `arguments.py`: Argument configuration
+* `pretrain_gpt2.py`: Main entry point for training
+* `utils.py`: Checkpoint saving and loading utilities
+
+
+### Argument Parsing
+The first step to apply DeepSpeed is adding DeepSpeed arguments to the
+Megatron-LM GPT2 model, using `deepspeed.add_config_arguments()` in
+`arguments.py`.
+
+```python
+def get_args():
+    """Parse all the args."""
+
+    parser = argparse.ArgumentParser(description='PyTorch BERT Model')
+    parser = add_model_config_args(parser)
+    parser = add_fp16_config_args(parser)
+    parser = add_training_args(parser)
+    parser = add_evaluation_args(parser)
+    parser = add_text_generate_args(parser)
+    parser = add_data_args(parser)
+
+    # Include DeepSpeed configuration arguments
+    parser = deepspeed.add_config_arguments(parser)
+```
+
+
+
+### Initialization and Training
+We modify `pretrain_gpt2.py` to enable training with DeepSpeed.
+
+#### Initialization
+We use `deepspeed.initialize` to create `model_engine`, `optimizer` and LR
+`scheduler`. Below is its definition:
+```python
+def initialize(args,
+               model,
+               optimizer=None,
+               model_parameters=None,
+               training_data=None,
+               lr_scheduler=None,
+               mpu=None,
+               dist_init_required=True,
+               collate_fn=None):
+```
+
+For the Megatron-LM GPT2 model, we initialize DeepSpeed in its
+`setup_model_and_optimizer()` function as below, to pass the raw `model`,
+`optimizer`, `args`, `lr_scheduler` and `mpu`.
+```python
+def setup_model_and_optimizer(args):
+    """Setup model and optimizer."""
+
+    model = get_model(args)
+    optimizer = get_optimizer(model, args)
+    lr_scheduler = get_learning_rate_scheduler(optimizer, args)
+
+    if args.deepspeed:
+        import deepspeed
+
+        print_rank_0("DeepSpeed is enabled.")
+
+        model, optimizer, _, lr_scheduler = deepspeed.initialize(
+            model=model,
+            optimizer=optimizer,
+            args=args,
+            lr_scheduler=lr_scheduler,
+            mpu=mpu,
+            dist_init_required=False
+        )
+```
+
+
+Note that when FP16 is enabled, Megatron-LM GPT2 adds a wrapper to the `Adam`
+optimizer. DeepSpeed has its own FP16 Optimizer, so we need to pass the `Adam`
+optimizer to DeepSpeed directly without any wrapper. We return the unwrapped
+Adam optimizer from `get_optimizer()` when DeepSpeed is enabled.
+```python
+def get_optimizer(model, args):
+    """Setup the optimizer."""
+
+    ......
+
+    # Use Adam.
+    optimizer = Adam(param_groups,
+                     lr=args.lr, weight_decay=args.weight_decay)
+
+    if args.deepspeed:
+        # fp16 wrapper is not required for DeepSpeed.
+        return optimizer
+```
+
+#### Using the Training API
+The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_
+that we will use to train the model using the forward, backward and step API.
+
+
+##### Forward Propagation
+The forward propagation API is compatible with PyTorch, so no change is required.
+
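+As a minimal sketch (the argument names below are illustrative; the actual
+forward signature is the Megatron model's own):
+
+```python
+# The engine wraps the client model, so the forward pass is an ordinary call.
+lm_loss = model(tokens, position_ids, attention_mask, labels)
+```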
+
+##### Backward Propagation
+Backward propagation is done by calling `backward(loss)` directly on the model engine.
+
+```python
+def backward_step(optimizer, model, lm_loss, args, timers):
+    """Backward step."""
+
+    # Total loss.
+    loss = lm_loss
+
+    # Backward pass.
+    if args.deepspeed:
+        model.backward(loss)
+    else:
+        optimizer.zero_grad()
+        if args.fp16:
+            optimizer.backward(loss, update_master_grads=False)
+        else:
+            loss.backward()
+```
+
+Zeroing the gradients is handled automatically by DeepSpeed after the weights
+have been updated using a mini-batch.
+
+Furthermore, DeepSpeed addresses distributed data parallel and FP16 under the
+hood, simplifying code in multiple places.
+
+(A) DeepSpeed also performs gradient averaging automatically at the gradient
+accumulation boundaries. So we skip the allreduce communication.
+
+ ```python
+if args.deepspeed:
+    # DeepSpeed backward propagation already handles allreduce communication.
+    # Reset the timer to avoid breaking timer logs below.
+    timers('allreduce').reset()
+else:
+    torch.distributed.all_reduce(reduced_losses.data)
+    reduced_losses.data = reduced_losses.data / args.world_size
+    if not USE_TORCH_DDP:
+        timers('allreduce').start()
+        model.allreduce_params(reduce_after=False,
+                               fp32_allreduce=args.fp32_allreduce)
+        timers('allreduce').stop()
+
+ ```
+
+(B) We also skip updating master gradients, since DeepSpeed addresses it internally.
+
+ ```python
+# Update master gradients.
+if not args.deepspeed:
+    if args.fp16:
+        optimizer.update_master_grads()
+
+    # Clipping gradients helps prevent the exploding gradient.
+    if args.clip_grad > 0:
+        if not args.fp16:
+            mpu.clip_grad_norm(model.parameters(), args.clip_grad)
+        else:
+            optimizer.clip_master_grads(args.clip_grad)
+
+return lm_loss_reduced
+
+ ```
+
+##### Updating the Model Parameters
+The `step()` function in the DeepSpeed engine updates the model parameters as well
+as the learning rate.
+
+```python
+if args.deepspeed:
+    model.step()
+else:
+    optimizer.step()
+
+    # Update learning rate.
+    if not (args.fp16 and optimizer.overflow):
+        lr_scheduler.step()
+    else:
+        skipped_iter = 1
+
+```
+
+
+
+##### Loss Scaling
+The GPT2 training script logs the loss scaling value during training. Inside
+the DeepSpeed optimizer, this value is stored as `cur_scale`, instead of
+`loss_scale` as in Megatron's optimizer. Therefore, we appropriately replace it in
+the logging string.
+
+```python
+if args.fp16:
+    log_string += ' loss scale {:.1f} |'.format(
+        optimizer.cur_scale if args.deepspeed else optimizer.loss_scale)
+
+```
+
+
+### Checkpoints Saving & Loading
+
+The DeepSpeed engine has flexible APIs for checkpoint saving and loading, to handle
+the states of both the client model and its own internals.
+
+```python
+def save_checkpoint(self, save_dir, tag, client_state={})
+def load_checkpoint(self, load_dir, tag)
+```
+
+Applying DeepSpeed requires updating `utils.py`, in which Megatron-LM GPT2 saves and
+loads its checkpoints.
+
+A new function `save_ds_checkpoint()` is created as below for DeepSpeed. It
+collects the client model states and passes them to the DeepSpeed engine by
+calling DeepSpeed's `save_checkpoint()`.
+
+```python
+def save_ds_checkpoint(iteration, model, args):
+    """Save a model checkpoint."""
+
+    sd = {}
+    sd['iteration'] = iteration
+    # rng states.
+    if not args.no_save_rng:
+        sd['random_rng_state'] = random.getstate()
+        sd['np_rng_state'] = np.random.get_state()
+        sd['torch_rng_state'] = torch.get_rng_state()
+        sd['cuda_rng_state'] = torch.cuda.get_rng_state()
+        sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
+
+    model.save_checkpoint(args.save, iteration, client_state=sd)
+
+```
+
+In the Megatron-LM GPT2 `save_checkpoint()` function, add the following lines to
+invoke the above function for DeepSpeed.
+
+```python
+def save_checkpoint(iteration, model, optimizer,
+                    lr_scheduler, args):
+    """Save a model checkpoint."""
+    if args.deepspeed:
+        save_ds_checkpoint(iteration, model, args)
+    else:
+        ......
+
+```
+
+In the `load_checkpoint()` function, use the DeepSpeed checkpoint loading API as below,
+and return the states for the client model.
+
+```python
+def load_checkpoint(model, optimizer, lr_scheduler, args):
+    """Load a model checkpoint."""
+
+    iteration, release = get_checkpoint_iteration(args)
+
+    if args.deepspeed:
+        checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
+
+        if checkpoint_name is None:
+            if mpu.get_data_parallel_rank() == 0:
+                print("Unable to load checkpoint.")
+            return iteration
+    else:
+        ......
+
+```
+
+### Train scripts
+Assuming the webtext data was prepared in the previous step, execute the
+following commands to start training the Megatron-LM GPT2 model with DeepSpeed
+applied.
+
+- Single GPU run
+ - run `bash scripts/ds_pretrain_gpt2.sh`
+- Multiple GPUs/Nodes run
+ - run `bash scripts/ds_pretrain_gpt2_model_parallel.sh`
+
+
+
+## Performance Improvements
+DeepSpeed enables training very large models effectively via the advanced [ZeRO
+optimizer](https://arxiv.org/abs/1910.02054v2). ZeRO significantly reduces the memory
+footprint for training large models, which means large models can be trained with i) less
+model parallelism and ii) larger batch sizes. A lower model parallelism degree improves
+training efficiency by increasing the granularity of the computation such as the matrix
+multiplication where performance is directly related to the size of the matrices.
+Furthermore, less model parallelism also results in less communication between model
+parallel GPUs, which further boosts performance. Larger batch size has a similar effect
+of increasing the computational granularity as well as reducing communication, also
+resulting in better performance. Therefore, DeepSpeed combines ZeRO-powered data parallelism with
+Megatron-LM tensor-slicing model parallelism, which is
+significantly faster than using Megatron-LM alone.
+
+The observed performance improvements depend on several factors such as the memory per
+GPU, the local GPU interconnect (e.g., PCI-E vs NVLINK vs NVSwitch), the model size,
+the inter-node network interconnect, etc. Below, we show some of the performance improvements
+from using DeepSpeed over Megatron on a 16 GPU Low Bandwidth (40 Gbps) cluster and a 400 GPU DGX-2 High Bandwidth (800 Gbps) cluster.
+For details please see the [ZeRO Paper](https://arxiv.org/abs/1910.02054v2). We also
+present performance improvement on a 64 GPU cluster along with detailed configuration
+analysis to show where the improvements come from.
+
+
+
+The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.
+
+
+
+### On Low Bandwidth GPU Cluster
+The figure above shows that training a 1.5B parameter model with DeepSpeed is
+nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPUs per
+node, and 16 GPUs total. These GPUs have 16GB of memory each; PCI-E
+interconnects GPUs within a node, and 40 Gbps InfiniBand connects nodes.
+
+The performance improvement comes from lower model parallelism degree and
+larger batch size as discussed earlier. Training the 1.5B parameter model with
+Megatron-LM alone requires 4-way model parallelism, and can only fit an effective
+batch size of 32 using all 16 GPUs. On the other hand, DeepSpeed does not
+require any model-parallelism to train this model, and can support an
+effective batch size of 128 without running out of memory, resulting in
+significantly higher performance.
+
+
+### On High bandwidth DGX-2 GPU Cluster
+Each GPU on the DGX-2 cluster has 32 GB of memory, and GPUs inside a box are connected via
+the high-bandwidth NVSwitch. DGX-2 nodes are connected to each other via an 800 Gbps (8 x 100 Gbps) InfiniBand interconnect. As such, running a 1.5B model on DGX-2 requires less model
+parallelism, and the performance improvement from DeepSpeed for this model size is less
+significant. However, at larger model sizes, Megatron still requires a significantly larger
+model parallelism degree, and can only run much smaller batch sizes than DeepSpeed.
+Therefore, as model sizes get larger, DeepSpeed, by combining ZeRO with Megatron model parallelism, starts to significantly outperform
+using Megatron-LM alone.
+
+
+### Performance Improvements with Configuration Details
+The figure below compares DeepSpeed with Megatron on a 64 GPU cluster with 4
+DGX-2 nodes. To give the readers a clear idea of the source of the performance
+improvements, we also present the configuration table for both Megatron and
+DeepSpeed. It shows the smallest model parallelism degree and the largest batch
+size that can be used to train these models without running out of memory. As
+discussed above, the tables demonstrate that DeepSpeed runs with a smaller model parallelism degree
+and achieves better performance.
+
+
+
+The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.
+
+
+
+**a) Megatron-LM GPT2 Baseline**
+
+| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
+| 1.5B | 2 | 32 | 64 | 512 | 48 | 1600 | 16 | 128.56 |
+| 4B | 4 | 16 | 64 | 128 | 64 | 2304 | 16 | 49.36 |
+| 8B | 4 | 16 | 64 | 128 | 72 | 3072 | 24 | 24.57 |
+| 20B | 16 | 4 | 64 | 16 | 111 | 3808 | 32 | 3.42 |
+
+
+
+**b) Megatron-LM GPT2 with DeepSpeed**
+
+| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
+| 1.5B | 1 | 64 | 64 | 2048 | 48 | 1600 | 16 | 151.35 |
+| 4B | 1 | 64 | 64 | 512 | 64 | 2304 | 16 | 75.13 |
+| 8B | 2 | 32 | 64 | 512 | 72 | 3072 | 24 | 43.52 |
+| 20B | 4 | 16 | 64 | 128 | 111 | 3808 | 32 | 12.65 |
diff --git a/docs/assets/css/main.scss b/docs/assets/css/main.scss
index c2583467e4b7..26a771784d01 100644
--- a/docs/assets/css/main.scss
+++ b/docs/assets/css/main.scss
@@ -31,8 +31,7 @@
border-radius: $border-radius;
-webkit-box-shadow: $box-shadow;
box-shadow: $box-shadow;
- position: fixed;
-
+ //position: fixed;
.nav__title {
color: #fff;
font-size: $type-size-6;
diff --git a/docs/assets/images/1cycle_lr.png b/docs/assets/images/1cycle_lr.png
new file mode 100644
index 000000000000..18246a6d4125
Binary files /dev/null and b/docs/assets/images/1cycle_lr.png differ
diff --git a/docs/assets/images/loss_and_lr.png b/docs/assets/images/loss_and_lr.png
new file mode 100644
index 000000000000..dedd62360ba7
Binary files /dev/null and b/docs/assets/images/loss_and_lr.png differ
diff --git a/docs/assets/images/lr_schedule.png b/docs/assets/images/lr_schedule.png
new file mode 100644
index 000000000000..cdda1b8decea
Binary files /dev/null and b/docs/assets/images/lr_schedule.png differ
diff --git a/docs/assets/images/megatron-gpt2-perf-test.png b/docs/assets/images/megatron-gpt2-perf-test.png
new file mode 100644
index 000000000000..9fe5e66b239e
Binary files /dev/null and b/docs/assets/images/megatron-gpt2-perf-test.png differ
diff --git a/docs/assets/images/model_convergence.png b/docs/assets/images/model_convergence.png
new file mode 100644
index 000000000000..b88899bf5fb4
Binary files /dev/null and b/docs/assets/images/model_convergence.png differ
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 000000000000..938c1a8b9c4b
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,74 @@
+---
+title: "Contributing"
+permalink: /contributing/
+---
+
+DeepSpeed welcomes your contributions!
+
+## Prerequisites
+DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting is
+consistent across DeepSpeed. First, ensure that `pre-commit` is installed, either by
+installing DeepSpeed or via `pip install pre-commit`. Next, the pre-commit hooks must be
+installed once before commits can be made:
+```bash
+pre-commit install
+```
+
+Afterwards, our suite of formatting tests runs automatically before each `git commit`. You
+can also run these manually:
+```bash
+pre-commit run --all-files
+```
+If a formatting test fails, it will fix the modified code in place and abort
+the `git commit`. After looking over the changes, you can `git add` the modified
+files and then repeat the previous `git commit` command.
+
+
+## Testing
+DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests.
+The model convergence tests train
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure
+end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and
+the model convergence tests are found in `tests/model/`.
+
+### Unit Tests
+[PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be
+installed from PyPI via `pip install pytest`. Simply invoke `pytest --forked` to run the
+unit tests:
+```bash
+pytest --forked tests/unit/
+```
+You can also provide the `-v` flag to `pytest` to see additional information about the
+tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
+`--forked` flag are required to test CUDA functionality in distributed tests.
+
+### Model Tests
+Model tests require four GPUs and training data downloaded for
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/).
+
+To execute model tests, first [install DeepSpeed](/getting-started/). The
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
+as part of this process. Next, execute the model test driver:
+```bash
+cd tests/model/
+pytest run_sanity_check.py
+```
+Note that the `--forked` flag is not necessary for the model tests.
+
+## Contributor License Agreement
+This project welcomes contributions and suggestions. Most contributions require you to
+agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
+actually do, grant us the rights to use your contribution. For details, visit
+https://cla.opensource.microsoft.com.
+
+When you submit a pull request, a CLA bot will automatically determine whether you need
+to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
+follow the instructions provided by the bot. You will only need to do this once across
+all repos using our CLA.
+
+## Code of Conduct
+This project has adopted the [Microsoft Open Source Code of
+Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
+[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
+[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
+comments.
diff --git a/docs/index.md b/docs/index.md
index a7a7e0e428b4..af6bdd6d1b4c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -71,7 +71,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art
NVIDIA Megatron on Azure GPUs.
- *Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md)
+ *Read more*: [GPT tutorial](/tutorials/megatron/)
@@ -105,8 +105,7 @@ combination. ZeRO boosts the scaling capability and efficiency further.
significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054),
- and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md).
-
+ and [GPT tutorial](/tutorials/megatron/).
@@ -120,13 +119,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
-*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md),
-
+*Read more*: [Tuning tutorial](/tutorials/1Cycle/).
## Good Usability
@@ -165,24 +158,9 @@ overview](features) for descriptions and usage.
* [Performance Analysis and Debugging](features.md#performance-analysis-and-debugging)
-
-# Further Reading
-
-| Article | Description |
-| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
-| [DeepSpeed Features](features.md) | DeepSpeed features |
-| [DeepSpeed JSON Configuration](config_json.md) | Configuring DeepSpeed |
-| [API Documentation](/code-docs/) | Generated DeepSpeed API documentation |
-| [CIFAR-10 Tutorial](./docs/tutorials/CIFAR-10.md) | Getting started with CIFAR-10 and DeepSpeed |
-| [Megatron-LM Tutorial](./docs/tutorials/MegatronGPT2Tutorial.md) | Train GPT2 with DeepSpeed and Megatron-LM |
-| [Learning Rate Range Test Tutorial](./docs/tutorials/lrrt.md) | Faster training with large learning rates |
-| [1Cycle Tutorial](./docs/tutorials/1Cycle.md) | SOTA learning schedule in DeepSpeed |
-
-
-
# Contributing
DeepSpeed welcomes your contributions! Please see our
-[contributing](CONTRIBUTING.md) guide for more details on formatting, testing,
+[contributing](/contributing/) guide for more details on formatting, testing,
etc.
## Contributor License Agreement