diff --git a/README.md b/README.md
index b1ec9ab394a2..9763e3a8cf09 100755
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
 * DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than
   state-of-the-art NVIDIA Megatron on Azure GPUs.
-  *Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md)
+  *Read more*: [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/)
@@ -106,10 +106,10 @@ combination. ZeRO boosts the scaling capability and efficiency further.
 significant performance gains compared to using model parallelism alone.
 *Read more*: [technical report](https://arxiv.org/abs/1910.02054),
- and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md).
+ and [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/).
-![DeepSpeed-vs-Megatron](./docs/figures/DeepSpeed-vs-Megatron.png)
+![DeepSpeed-vs-Megatron](./docs/assets/images/DeepSpeed-vs-Megatron.png)

The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.

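The ZeRO savings referenced above can be made concrete with a back-of-envelope
calculation. The sketch below follows the model-state accounting in the cited
ZeRO technical report (2-byte fp16 parameters and gradients plus 12 bytes of
fp32 optimizer states per parameter under mixed-precision Adam); it is an
illustration of the arithmetic, not output from DeepSpeed, and the GPU count is
chosen only for the example.

```python
# Back-of-envelope model-state memory for mixed-precision Adam, following the
# accounting in the ZeRO report (an illustrative assumption, not DeepSpeed output):
# 2 B fp16 params + 2 B fp16 grads + 12 B fp32 optimizer states per parameter.
params = 1.5e9                       # GPT2 1.5B, as in the example above
full_gb = params * (2 + 2 + 12) / 1024**3
print(f"Model states per GPU without ZeRO: {full_gb:.1f} GB")   # ~22.4 GB

# ZeRO's first stage partitions the 12 B of optimizer states across the
# data-parallel group, so each of N GPUs holds only 12/N bytes per parameter.
n_gpus = 16
zero_gb = params * (2 + 2 + 12 / n_gpus) / 1024**3
print(f"With optimizer-state partitioning: {zero_gb:.1f} GB")   # ~6.6 GB
```

This is why a 1.5B-parameter model that cannot fit alongside its optimizer
states on a single 16 GB GPU becomes trainable once those states are spread
across the data-parallel group.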
@@ -121,7 +121,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve
 the effectiveness of model training and reduce the number of samples required
 to converge to the desired accuracy.
-*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md),
+*Read more*: [Tuning tutorial](https://www.deepspeed.ai/tutorials/1Cycle/),
+
 ### Constant Buffer Optimization (CBO)
 CBO enables high network and memory throughput while restricting memory usage
 to a constant size. For memory- and network-bound operations such as normalization or
@@ -131,18 +102,18 @@ The DeepSpeed core API consists of just a handful of methods:
 DeepSpeed supports all the features described in this document via the use of
 these APIs, along with a `deepspeed_config` JSON file for enabling and
 disabling the features.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.

 ### Gradient Clipping
 DeepSpeed handles gradient clipping under the hood based on the max gradient
 norm specified by the user.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.

 ### Automatic loss scaling with mixed precision
 DeepSpeed internally handles loss scaling for mixed precision training. The
 parameters for loss scaling can be specified in the `deepspeed_config` JSON
 file.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.

 ## Training Optimizers
@@ -176,19 +147,19 @@ more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054).
 DeepSpeed can simplify checkpointing for you regardless of whether you are
 using data parallel training, model parallel training, mixed-precision
 training, a mix of these three, or using the ZeRO optimizer to enable larger
 model sizes.
-Please see the [Getting Started](../README.md#getting-started) guide
+Please see the [Getting Started](/getting-started/) guide
 and the
-[core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+[core API doc](https://deepspeed.readthedocs.io/) for more details.

 ## Advanced parameter search
 DeepSpeed supports multiple Learning Rate Schedules to enable faster
 convergence for large batch scaling.

 ### Learning Rate Range Test
-Please refer to the [Learning Rate Range Test](tutorials/lrrt.md) tutorial.
+Please refer to the [Learning Rate Range Test](/tutorials/lrrt/) tutorial.

 ### 1Cycle Learning Rate Schedule
-Please refer to the [1Cycle Learning Rate Schedule](tutorials/1Cycle.md) tutorial.
+Please refer to the [1Cycle Learning Rate Schedule](/tutorials/1Cycle/) tutorial.

 ## Simplified Data Loader
@@ -200,7 +171,7 @@ can automatically handle batch creation appropriately.
 For performance debugging, DeepSpeed can give you a detailed breakdown of the
 time spent in different parts of the training by simply enabling it in the
 `deepspeed_config` file.
-Please see the [core API doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html) for more details.
+Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
 ```json
 {
   "wall_clock_breakdown": true
 }
 ```
diff --git a/docs/_tutorials/1Cycle.md b/docs/_tutorials/1Cycle.md
new file mode 100644
index 000000000000..0cb8a45f31f0
--- /dev/null
+++ b/docs/_tutorials/1Cycle.md
@@ -0,0 +1,144 @@
+---
+title: "1-Cycle Schedule"
+---
+
+This tutorial shows how to implement 1Cycle schedules for learning rate and
+momentum in PyTorch.
+
+## 1-Cycle Schedule
+Recent research has demonstrated that the slow convergence problems of large
+batch size training can be addressed by tuning critical hyperparameters, such
+as learning rate and momentum, during training using cyclic and decay
+schedules. In DeepSpeed, we have implemented a state-of-the-art schedule called
+[1-Cycle](https://arxiv.org/abs/1803.09820) to help data scientists
+effectively use larger batch sizes to train their models in PyTorch.
+
+## Prerequisites
+
+To use the 1-cycle schedule for model training, you should satisfy these two
+requirements:
+
+1. Integrate DeepSpeed into your training script using the [Getting
+Started](/getting-started/) guide.
+2. Add the parameters to configure a 1-Cycle schedule to the parameters of your
+model. We will define the 1-Cycle parameters below.
+
+## Overview
+The 1-cycle schedule operates in two phases, a cycle phase and a decay phase,
+which span one iteration over the training data. For concreteness, we will
+review how the 1-cycle schedule of the learning rate works. In the cycle phase,
+the learning rate oscillates between a minimum value and a maximum value over a
+number of training steps. In the decay phase, the learning rate decays starting
+from the minimum value of the cycle phase. An example of a 1-cycle learning
+rate schedule during model training is illustrated below.
+
+![1cycle_lr](/assets/images/1cycle_lr.png)
+
+### 1-Cycle Parameters
+
+The 1-Cycle schedule is defined by a number of parameters that allow users to
+explore different configurations. The literature recommends concurrent tuning
+of learning rate and momentum because they are correlated hyperparameters. We
+have leveraged this recommendation to reduce the configuration burden by
+organizing the 1-cycle parameters into two groups:
+
+1. Global parameters for configuring the cycle and decay phases
+2. Local parameters for configuring learning rate and momentum
+
+The global parameters for configuring the 1-cycle phases are:
+
+1. `cycle_first_step_size`: The count of training steps to complete the first step of the cycle phase
+2. `cycle_first_stair_count`: The count of updates (or stairs) in the first step of the cycle phase
+3. `cycle_second_step_size`: The count of training steps to complete the second step of the cycle phase
+4. `cycle_second_stair_count`: The count of updates (or stairs) in the second step of the cycle phase
+5. `decay_step_size`: The interval, in training steps, at which to decay the hyperparameter in the decay phase
+
+The local parameters for the hyperparameters are:
+
+**Learning rate**:
+
+1. `cycle_min_lr`: minimum learning rate in the cycle phase
+2. `cycle_max_lr`: maximum learning rate in the cycle phase
+3. `decay_lr_rate`: decay rate for the learning rate in the decay phase
+
+Although appropriate `cycle_min_lr` and `cycle_max_lr` values can be selected
+based on experience or expertise, we recommend using the [learning rate range
+test](/tutorials/lrrt/) feature of DeepSpeed to configure them.
+
+**Momentum**:
+1. `cycle_min_mom`: minimum momentum in the cycle phase
+2. `cycle_max_mom`: maximum momentum in the cycle phase
+3. `decay_mom_rate`: decay rate for momentum in the decay phase
+
+## Required Model Configuration Changes
+
+To illustrate the required model configuration changes to use the 1-Cycle
+schedule in model training, we will use a schedule with the following
+properties:
+
+1. A symmetric cycle phase, where each half of the cycle spans the same number
+of training steps. For this example, it will take 1000 training steps for the
+learning rate to increase from 0.0001 to 0.0010 (10X scale), and then to
+decrease back to 0.0001. The momentum will correspondingly cycle between 0.85
+and 0.99 in a similar number of steps.
+2. A decay phase, where the learning rate decays at a rate of 0.001 every 1000
+steps, while momentum is not decayed.
+
+Note that these parameters are processed by DeepSpeed as session parameters,
+and so should be added to the appropriate section of the model configuration.
+
+### **PyTorch model**
+
+PyTorch versions 1.0.1 and newer provide a feature for implementing schedulers
+for hyper-parameters, called [learning rate
+schedulers](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html).
+We have implemented the 1-Cycle schedule using this feature. You will add a
+scheduler entry of type **"OneCycle"** as illustrated below.
+
+```json
+"scheduler": {
+    "type": "OneCycle",
+    "params": {
+        "cycle_first_step_size": 1000,
+        "cycle_first_stair_count": 500,
+        "cycle_second_step_size": 1000,
+        "cycle_second_stair_count": 500,
+        "decay_step_size": 1000,
+        "cycle_min_lr": 0.0001,
+        "cycle_max_lr": 0.0010,
+        "decay_lr_rate": 0.001,
+        "cycle_min_mom": 0.85,
+        "cycle_max_mom": 0.99,
+        "decay_mom_rate": 0.0
+    }
+},
+```
+
+## Batch Scaling Example
+
+As an example of how the 1-Cycle schedule can enable effective batch scaling,
+we briefly share our experience with an internal model at Microsoft. In this
+case, the model was well-tuned for fast convergence (in data samples) on a
+single GPU, but was converging slowly to target performance (AUC) when training
+on 8 GPUs (8X batch size). The plot below shows model convergence with 8 GPUs
+for these learning rate schedules:
+
+1. **Fixed**: using an optimal fixed learning rate for 1-GPU training.
+2. **LinearScale**: using a fixed learning rate that is 8X of **Fixed**.
+3. **1Cycle**: using the 1-Cycle schedule.
+
+![model_convergence](/assets/images/model_convergence.png)
+
+With **1Cycle**, the model converges faster than the other schedules to the
+target AUC. In fact, **1Cycle** converges as fast as the optimal 1-GPU
+training (not shown). For **Fixed**, convergence is about 5X slower (needs 5X
+more data samples). With **LinearScale**, the model diverges because the
+learning rate is too high. The plot below illustrates the schedules by
+reporting the learning rate values during 8-GPU training.
+
+![lr_schedule](/assets/images/lr_schedule.png)
+
+We see that the learning rate for **1Cycle** is always larger than **Fixed**
+and is briefly larger than **LinearScale** to achieve faster convergence. Also,
+**1Cycle** lowers the learning rate later during training to avoid model
+divergence, in contrast to **LinearScale**. In summary, by configuring an
+appropriate 1-Cycle schedule we were able to effectively scale the training
+batch size for this model by 8X without loss of convergence speed.
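To spell out the schedule mechanics, here is a minimal Python sketch of the
1-cycle learning rate computation described in the tutorial above, using the
example parameter values from its configuration. This is an illustration, not
DeepSpeed's implementation: the function name is hypothetical, stair-casing is
omitted, and the decay rule shown is one plausible reading of `decay_lr_rate`.

```python
def one_cycle_lr(step,
                 cycle_first_step_size=1000,
                 cycle_second_step_size=1000,
                 cycle_min_lr=0.0001,
                 cycle_max_lr=0.0010,
                 decay_step_size=1000,
                 decay_lr_rate=0.001):
    """Learning rate at a given training step (hypothetical sketch)."""
    span = cycle_max_lr - cycle_min_lr
    if step <= cycle_first_step_size:
        # Cycle phase, first step: ramp linearly from min to max.
        return cycle_min_lr + span * step / cycle_first_step_size
    cycle_len = cycle_first_step_size + cycle_second_step_size
    if step <= cycle_len:
        # Cycle phase, second step: ramp linearly back down to min.
        frac = (step - cycle_first_step_size) / cycle_second_step_size
        return cycle_max_lr - span * frac
    # Decay phase: shrink the rate once per decay interval, starting from the
    # cycle minimum (one plausible decay rule, stated here as an assumption).
    intervals = (step - cycle_len) // decay_step_size
    return cycle_min_lr / (1.0 + intervals * decay_lr_rate)

# The momentum schedule mirrors this shape, cycling between cycle_min_mom and
# cycle_max_mom in the opposite direction to the learning rate.
print(one_cycle_lr(500))   # mid-ramp: 0.00055
print(one_cycle_lr(1000))  # peak: 0.0010
print(one_cycle_lr(5000))  # decay phase: slightly below 0.0001
```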
diff --git a/docs/_tutorials/azure.md b/docs/_tutorials/azure.md
new file mode 100644
index 000000000000..8f3ed2fd9959
--- /dev/null
+++ b/docs/_tutorials/azure.md
@@ -0,0 +1,131 @@
+---
+title: "Getting Started with DeepSpeed on Azure"
+---
+
+This tutorial will help you get started running DeepSpeed on [Azure virtual
+machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
+Looking forward, we will be integrating these techniques and additional
+enhancements into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/)
+platform to benefit all your large model training jobs.
+
+If you don't already have an Azure account, please see more details here:
+[https://azure.microsoft.com/](https://azure.microsoft.com/).
+
+To help with launching Azure instances, we suggest using the [Azure
+CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We
+have created several helper scripts to get you quickly started using DeepSpeed
+with Azure.
+ * Install Azure CLI on your local box: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
+ * Alternatively, you can use the Azure in-browser shell: https://shell.azure.com/
+
+## Create an SSH key
+Generate an SSH key that will be used across this tutorial to SSH into your
+VMs and between Docker containers. `ssh-keygen` is the recommended way of
+doing this. Our scripts assume your key is located in the same directory as
+the Azure scripts.
+
+## Azure Config JSON
+Our helper scripts depend on the following configuration JSON for deployment
+and setup. We have provided a simple example JSON in `azure_config.json` that
+sets up a basic environment with two VMs. This config uses the NV6_Promo
+instance type, which has one NVIDIA Tesla M60 GPU per VM. You can read more
+details about the VM on the [Linux Virtual Machines
+Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
+page.
+
+See the example below:
+```json
+{
+  "num_vms": 2,
+  "location": "southcentralus",
+  "azure_sku": "Standard_NV6_Promo",
+  "ssh_private_key": "id_rsa",
+  "docker_ssh_port": 2222
+}
+```
+
+## Dependencies
+The scripts in this tutorial require [jq](https://stedolan.github.io/jq/) to
+help with parsing JSON from the command line. Also, it is recommended to
+install [pdsh](https://linux.die.net/man/1/pdsh) to help launch ssh
+connections in parallel.
+
+## Create Azure VMs
+We first need to allocate the VMs. We provide a script
+```bash
+./create_vms.sh
+```
+to create VMs with the Azure SKU in the region specified in
+`azure_config.json`. Feel free to customize your JSON to your desired
+region/SKU. This step will take a few minutes to complete while it sets up all
+of your VMs on Azure.
+
+## Setup VM environment to use DeepSpeed
+Next, we need to configure the VM environment for DeepSpeed. We provide a script
+```bash
+./setup_vms.sh
+```
+to generate a [hostfile](/getting-started/#resource-configuration-multi-node)
+and SSH configuration on all of the VMs. This configuration will be used by
+the DeepSpeed Docker containers in the next step.
+
+## Start the DeepSpeed docker container
+We now set up the DeepSpeed Docker containers on the VMs. We provide a script
+```bash
+./setup_docker.sh
+```
+to pull the DeepSpeed image onto all VMs and start a container instance in the
+background. This will take several minutes since it needs to pull the entire
+Docker image.
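As a point of reference before moving on, a DeepSpeed hostfile simply lists
one SSH-reachable host per line together with a `slots=` count giving the
number of GPUs on that host. The snippet below is an illustrative guess at
what `setup_vms.sh` produces for the two-VM, one-GPU-per-VM example config
above, using the `worker-N` aliases described in the next section; the actual
generated file may differ.

```
worker-0 slots=1
worker-1 slots=1
```

Utilities such as `ds_ssh` and the `deepspeed` launcher read this file to
determine which containers participate in a job.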
+
+## Access VMs
+The tool `azure_ssh.sh` will let you SSH into any of the VMs with this
+syntax:
+```bash
+./azure_ssh.sh <node-id> [command]
+```
+where the `node-id` is a number between `0` and `num_vms-1`. This script will
+find the public IP address of your VM and use the SSH key provided in the
+Azure configuration JSON.
+
+## Access DeepSpeed container
+Everything should be up and running at this point. Let's access the running
+DeepSpeed container on the first VM and make sure we can talk to the other
+containers in our deployment.
+
+ * SSH into the first VM via: `./azure_ssh.sh 0`
+ * Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
+ * Attach to the running docker container via: `./attach.sh`
+ * You should now be able to `ssh` into any other docker container; the
+   containers can be accessed via their SSH alias of `worker-N`, where `N` is
+   the VM number between `0` and `num_vms-1`. In this example, we should be
+   able to successfully run `ssh worker-1 hostname`, which will return the
+   hostname of worker-1.
+
+## Parallel SSH across containers
+DeepSpeed comes installed with a helper script `ds_ssh`, which is a wrapper
+around the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you
+issue commands to groups of hosts (via SSH) in parallel. This wrapper simply
+uses the hostfile that defines all the containers in your deployment. For
+example, if you run `ds_ssh hostname` you should see a list of all the
+hostnames in your deployment.
+
+## Run CIFAR-10 example model
+We will now run the DeepSpeed CIFAR-10 model example to test the VM setup.
+From inside the first DeepSpeed container:
+
+ 1) Install the python dependencies necessary to run the CIFAR-10 example
+    model. You can do this across your cluster via:
+    ```bash
+    ds_ssh pip install -r ~/workdir/DeepSpeed/DeepSpeedExamples/cifar/requirements.txt
+    ```
+
+ 2) Now change directories to the CIFAR example:
+    ```bash
+    cd ~/workdir/DeepSpeed/DeepSpeedExamples/cifar
+    ```
+
+ 3) Finally, launch training across all VMs:
+    ```bash
+    deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
+    ```
+
+## Megatron-LM GPT2
+DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to
+the full [Megatron tutorial](/tutorials/megatron/) for more details.
+ * In order to fully train GPT2 with DeepSpeed and ZeRO, we recommend using 8
+   instances of Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100
+   GPUs. With this setup and a batch size of 1536, you should be able to
+   complete 100k training steps (153.6 million samples) in less than 2 weeks
+   of training.
diff --git a/docs/_tutorials/getting-started.md b/docs/_tutorials/getting-started.md
index c6be8a05638d..69ffab990eb0 100644
--- a/docs/_tutorials/getting-started.md
+++ b/docs/_tutorials/getting-started.md
@@ -6,9 +6,10 @@ excerpt: "First steps with DeepSpeed"

 ## Installation

-* Please see our [Azure tutorial](docs/azure.md) to get started with DeepSpeed on Azure!
+* Please see our [Azure tutorial](/tutorials/azure/) to get started with DeepSpeed on Azure!
 * If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
-* If you want to install DeepSpeed manually, we provide an install script [install.sh](install.sh) to help install on a local machine or across an entire cluster.
+* If you want to install DeepSpeed manually, we provide an install script
+  `install.sh` to help install on a local machine or across an entire cluster.

 ## Writing DeepSpeed Models
 DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
@@ -114,8 +115,8 @@ the `step` value is stored as part of the `client_sd`.

 ## DeepSpeed Configuration
 DeepSpeed features can be enabled, disabled, or configured using a config JSON
 file that should be specified as `args.deepspeed_config`. A sample config file
-is shown below. For a full set of features see [core API
-doc](https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/index.html).
+is shown below. For a full set of features see the [API
+doc](/docs/config_json/).

 ```json
 {
diff --git a/docs/_tutorials/lrrt.md b/docs/_tutorials/lrrt.md
new file mode 100644
index 000000000000..d2e1e4051934
--- /dev/null
+++ b/docs/_tutorials/lrrt.md
@@ -0,0 +1,148 @@
+---
+title: "Learning Rate Range Test"
+---
+This tutorial shows how to perform learning rate range tests in PyTorch.
+
+## Learning Rate Range Test (LRRT)
+
+The learning rate range test ([LRRT](https://arxiv.org/abs/1803.09820)) is a
+method for discovering the largest learning rate values that can be used to
+train a model without divergence. Data scientists are often interested in this
+information because large learning rates lead to faster model convergence than
+small learning rates. Moreover, large learning rates are crucial in learning
+rate schedules such as [CLR](https://arxiv.org/abs/1506.01186) and
+[1Cycle](https://arxiv.org/abs/1803.09820), which are used to train effectively
+with large batch sizes. DeepSpeed provides LRRT for model training in PyTorch.
+
+## Prerequisites
+
+To use DeepSpeed's LRRT, you must satisfy the following two conditions:
+
+1. Integrate DeepSpeed into your training script using the [Getting
+Started](/getting-started/) guide.
+2. Add the parameters to configure LRRT to the parameters of your model. The
+LRRT parameters are defined below.
+
+## LRRT Parameters
+
+LRRT works by linearly increasing the learning rate by a predefined amount at
+predefined intervals. Thus, LRRT is a form of learning rate schedule because it
+defines how and when the learning rate should change during model training. To
+configure LRRT, you will need to set these parameters:
+
+1. `lr_range_test_min_lr`: The initial learning rate for training `(float)`
+2. `lr_range_test_step_size`: The interval for scaling up the learning rate,
+defined in training steps `(integer)`
+3. `lr_range_test_step_rate`: The scaling factor for increasing the learning
+rate `(float)`
+4. `lr_range_test_staircase`: If true, the learning rate is changed every
+`lr_range_test_step_size` training steps; otherwise, the learning rate is
+changed at every training step `(boolean)`
+
+## Required Model Configuration Changes
+
+We will illustrate the required model configuration changes with an example
+LRRT schedule that:
+
+1. Starts training with an initial learning rate of 0.0001
+2. Uses a scaling rate of 5
+3. Uses a scaling interval of 200 training steps
+4. Scales the learning rate at every training step, i.e., does not use the
+staircase
+
+### PyTorch
+
+For PyTorch models, LRRT is implemented as a [learning rate
+scheduler](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html),
+a feature that is available in PyTorch versions 1.0.1 and newer.
+Thus, you can add a `"scheduler"` entry of type `"LRRangeTest"` into your
+model configuration as illustrated below:
+
+```json
+"scheduler": {
+    "type": "LRRangeTest",
+    "params": {
+        "lr_range_test_min_lr": 0.0001,
+        "lr_range_test_step_size": 200,
+        "lr_range_test_step_rate": 5,
+        "lr_range_test_staircase": false
+    }
+}
+```
+
+## Example: Tuning for Large Batch Sizes
+
+We illustrate how LRRT can benefit data scientists with a snippet of our
+experience in tuning an internal production model to converge efficiently on
+larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs
+(batch size 2048). Our goal was to train the model with the larger batch size
+to match the performance of the smaller batch size using the same amount of
+data samples. The challenge here is the well-known problem of slow convergence
+of large batch size training. Our approach was to use a
+[1Cycle](/tutorials/1Cycle/) schedule in DeepSpeed to tackle this problem, and
+we used LRRT to configure the schedule.
+
+In the plots below, we illustrate using LRRT to discover the maximum learning
+rates for effective training with batch size 2048. The plot on the left shows
+the impact of large learning rates on validation loss over the first 9000
+batches of training. The plot on the right shows the learning rate values
+during the same period of training. Using grid search, we discover that the
+best fixed learning rate for batch size 2048 is 0.0002. The blue line
+(`lr=0.0002`) represents training with this fixed learning rate. We compare
+two LRRT schedules with this fixed learning rate. The orange
+(`lr_range_test_step_rate=5`) and gray (`lr_range_test_step_rate=50`) lines
+represent training with similar LRRT schedules that differ only in their
+`lr_range_test_step_rate` values. Although the LRRT schedules start from the
+same base learning rate, the gray line's learning rate grows about 10 times
+faster than the orange line's. Also, within the presented data points, the
+learning rates of the LRRT schedules grow larger than that of the blue line.
+We subsequently refer to the gray line as the "fast growing" LRRT schedule and
+the orange line as the "slow growing" one.
+
+![validation_loss](/assets/images/loss_and_lr.png)
+
+We make the following observations from this small example.
+
+1. Larger learning rates clearly benefit model performance, up to some point.
+The fast growing LRRT schedule achieves a validation loss of 0.46 after 3000
+batches, which the fixed learning rate does not achieve within 9000 batches.
+The slow growing LRRT does not match that score until after 6000 batches;
+however, it maintains an increasing performance advantage over the fixed
+learning rate.
+
+2. There is an upper bound on learning rate values that are useful for
+training the model. The fast growing LRRT schedule hits this boundary quickly
+and diverges, while the slow growing LRRT will later diverge for the same
+reason. LRRT helped us discover these boundaries quickly, using less than 2%
+of the training data. These boundaries are useful information for constructing
+learning rate schedules.
+
+These observations from LRRT helped us to configure the learning rate
+boundaries and the cycle span for a 1Cycle schedule that solves the problem,
+as shown below.
+
+```json
+"OneCycle": {
+    "cycle_min_lr": 0.002,
+    "cycle_max_lr": 0.005,
+    "cycle_first_step_size": 2000,
+    "cycle_second_step_size": 2000,
+    ...
+}
+```
+
+In our experience, these are the four most critical parameters of 1Cycle
+schedules.
+
+1. We chose to use the slower LRRT schedule (`lr_range_test_step_rate=5`) to
+set `cycle_min_lr` because it achieves the best loss and the faster schedule
+diverges fairly quickly.
+2. We set `cycle_max_lr` to 0.005 even though the plot shows that performance
+was still improving at slightly higher learning rates. This is because we
+observed that if we wait until the maximum learning rate, the model could be
+at the point of divergence and impossible to recover.
+3. Since it takes 8000 batches for the learning rate to reach 0.005, we set
+`cycle_first_step_size` and `cycle_second_step_size` to 2000, which is the
+number of steps it takes for four GPUs to process 8000 batches.
+
+We hope this brief example sparks your imagination on using LRRT for your own
+unique tuning challenges.
diff --git a/docs/_tutorials/megatron.md b/docs/_tutorials/megatron.md
new file mode 100644
index 000000000000..4575b32c3241
--- /dev/null
+++ b/docs/_tutorials/megatron.md
@@ -0,0 +1,421 @@
+---
+title: "Megatron-LM GPT2"
+---
+
+If you haven't already, we advise you to first read through the [Getting
+Started](/getting-started/) guide before stepping through this tutorial.
+
+In this tutorial we will be adding DeepSpeed to the Megatron-LM GPT2 model,
+which is a large, powerful transformer. Megatron-LM supports model-parallel
+and multi-node training. Please see the corresponding paper for more details:
+[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model
+Parallelism](https://arxiv.org/abs/1909.08053).
+
+First, we discuss data and environment setup and how to train the GPT-2 model
+with the original Megatron-LM. Next, we proceed step-by-step in enabling this
+model to run with DeepSpeed. Finally, we demonstrate the **_performance
+gains_** and **_memory footprint reduction_** from using DeepSpeed.
+
+## Training GPT-2 with the Original Megatron-LM
+
+The original model code is from
+[Megatron-LM](https://github.com/NVIDIA/Megatron-LM). We've copied this repo
+under
+[DeepSpeedExamples/Megatron-LM/](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM)
+and made it available as a submodule. To download, execute:
+```bash
+git submodule update --init --recursive
+```
+
+### Training Data Setup
+* Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data)
+  to download the webtext data and place a symbolic link under
+  `DeepSpeedExamples/Megatron-LM/data`.
+
+### Running the Unmodified Megatron-LM GPT2 model
+
+* For a single GPU run:
+  - change `scripts/pretrain_gpt2.sh`, set its `--train-data` argument as `"webtext"`.
+  - run `bash scripts/pretrain_gpt2.sh`
+
+* For multiple GPUs and/or nodes run:
+  - change `scripts/pretrain_gpt2_model_parallel.sh`
+    - set its `--train-data` argument as `"webtext"`
+    - `GPUS_PER_NODE` indicates how many GPUs per node are involved in the testing
+    - `NNODES` indicates how many nodes are involved in the testing
+  - run `bash scripts/pretrain_gpt2_model_parallel.sh`
+
+## Enabling DeepSpeed
+
+To use DeepSpeed, we will modify three files:
+
+* `arguments.py` : Arguments configurations
+* `pretrain_gpt2.py` : Main entry point for training
+* `utils.py` : Checkpoints saving and loading utilities
+
+### Argument Parsing
+The first step in applying DeepSpeed is adding DeepSpeed arguments to the
+Megatron-LM GPT2 model, using `deepspeed.add_config_arguments()` in
+`arguments.py`.
+
+```python
+def get_args():
+    """Parse all the args."""
+
+    parser = argparse.ArgumentParser(description='PyTorch BERT Model')
+    parser = add_model_config_args(parser)
+    parser = add_fp16_config_args(parser)
+    parser = add_training_args(parser)
+    parser = add_evaluation_args(parser)
+    parser = add_text_generate_args(parser)
+    parser = add_data_args(parser)
+
+    # Include DeepSpeed configuration arguments
+    parser = deepspeed.add_config_arguments(parser)
+```
+
+### Initialization and Training
+We modify `pretrain_gpt2.py` to enable training with DeepSpeed.
+
+#### Initialization
+We use `deepspeed.initialize` to create the `model_engine`, `optimizer`, and
+LR `scheduler`. Below is its definition:
+```python
+def initialize(args,
+               model,
+               optimizer=None,
+               model_parameters=None,
+               training_data=None,
+               lr_scheduler=None,
+               mpu=None,
+               dist_init_required=True,
+               collate_fn=None):
+```
+
+For the Megatron-LM GPT2 model, we initialize DeepSpeed in its
+`setup_model_and_optimizer()` function as below, passing it the raw `model`,
+`optimizer`, `args`, `lr_scheduler`, and `mpu`.
+```python
+def setup_model_and_optimizer(args):
+    """Setup model and optimizer."""
+
+    model = get_model(args)
+    optimizer = get_optimizer(model, args)
+    lr_scheduler = get_learning_rate_scheduler(optimizer, args)
+
+    if args.deepspeed:
+        import deepspeed
+
+        print_rank_0("DeepSpeed is enabled.")
+
+        model, optimizer, _, lr_scheduler = deepspeed.initialize(
+            model=model,
+            optimizer=optimizer,
+            args=args,
+            lr_scheduler=lr_scheduler,
+            mpu=mpu,
+            dist_init_required=False
+        )
+```
+
+Note that when FP16 is enabled, Megatron-LM GPT2 adds a wrapper to the `Adam`
+optimizer. DeepSpeed has its own FP16 optimizer, so we need to pass the `Adam`
+optimizer to DeepSpeed directly without any wrapper. We return the unwrapped
+Adam optimizer from `get_optimizer()` when DeepSpeed is enabled.
+```python
+def get_optimizer(model, args):
+    """Setup the optimizer."""
+
+    ......
+
+    # Use Adam.
+    optimizer = Adam(param_groups,
+                     lr=args.lr, weight_decay=args.weight_decay)
+
+    if args.deepspeed:
+        # fp16 wrapper is not required for DeepSpeed.
+        return optimizer
+```
+
+#### Using the Training API
+The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_
+that we will use to train the model using the forward, backward, and step APIs.
+
+##### Forward Propagation
+The forward propagation API is compatible with PyTorch, so no change is
+required.
+
+##### Backward Propagation
+Backward propagation is done by calling `backward(loss)` directly on the
+model engine.
+
+```python
+def backward_step(optimizer, model, lm_loss, args, timers):
+    """Backward step."""
+
+    # Total loss.
+    loss = lm_loss
+
+    # Backward pass.
+    if args.deepspeed:
+        model.backward(loss)
+    else:
+        optimizer.zero_grad()
+        if args.fp16:
+            optimizer.backward(loss, update_master_grads=False)
+        else:
+            loss.backward()
+```
+
+Zeroing the gradients is handled automatically by DeepSpeed after the weights
+have been updated using a mini-batch.
+
+Furthermore, DeepSpeed addresses distributed data parallelism and FP16 under
+the hood, simplifying code in multiple places.
+
+(A) DeepSpeed also performs gradient averaging automatically at the gradient
+accumulation boundaries, so we skip the allreduce communication.
+
+```python
+    if args.deepspeed:
+        # DeepSpeed's backward pass already handles the allreduce communication.
+        # Reset the timer to avoid breaking timer logs below.
+        timers('allreduce').reset()
+    else:
+        torch.distributed.all_reduce(reduced_losses.data)
+        reduced_losses.data = reduced_losses.data / args.world_size
+        if not USE_TORCH_DDP:
+            timers('allreduce').start()
+            model.allreduce_params(reduce_after=False,
+                                   fp32_allreduce=args.fp32_allreduce)
+            timers('allreduce').stop()
+```
+
+(B) We also skip updating master gradients, since DeepSpeed addresses it
+internally.
+
+```python
+    # Update master gradients.
+    if not args.deepspeed:
+        if args.fp16:
+            optimizer.update_master_grads()
+
+        # Clipping gradients helps prevent the exploding gradient.
+        if args.clip_grad > 0:
+            if not args.fp16:
+                mpu.clip_grad_norm(model.parameters(), args.clip_grad)
+            else:
+                optimizer.clip_master_grads(args.clip_grad)
+
+    return lm_loss_reduced
+```
+
+##### Updating the Model Parameters
+The `step()` function in the DeepSpeed engine updates the model parameters as
+well as the learning rate.
+
+```python
+    if args.deepspeed:
+        model.step()
+    else:
+        optimizer.step()
+
+        # Update learning rate.
+        if not (args.fp16 and optimizer.overflow):
+            lr_scheduler.step()
+        else:
+            skipped_iter = 1
+```
+
+##### Loss Scaling
+The GPT2 training script logs the loss scaling value during training. Inside
+the DeepSpeed optimizer, this value is stored as `cur_scale` rather than as
+`loss_scale`, as in Megatron's optimizer. Therefore, we appropriately replace
+it in the logging string.
+
+```python
+    if args.fp16:
+        log_string += ' loss scale {:.1f} |'.format(
+            optimizer.cur_scale if args.deepspeed else optimizer.loss_scale)
+```
+
+### Checkpoints Saving & Loading
+
+The DeepSpeed engine has flexible APIs for checkpoint saving and loading that
+handle the states of both the client model and DeepSpeed's own internal state.
+
+```python
+def save_checkpoint(self, save_dir, tag, client_state={})
+def load_checkpoint(self, load_dir, tag)
+```
+
+Applying DeepSpeed requires updating `utils.py`, in which Megatron-LM GPT2
+saves and loads its checkpoints.
+
+A new function `save_ds_checkpoint()` is created as below for DeepSpeed; it
+collects the client model states and passes them to the DeepSpeed engine by
+calling DeepSpeed's `save_checkpoint()`.
+
+```python
+def save_ds_checkpoint(iteration, model, args):
+    """Save a model checkpoint."""
+
+    sd = {}
+    sd['iteration'] = iteration
+    # rng states.
+    if not args.no_save_rng:
+        sd['random_rng_state'] = random.getstate()
+        sd['np_rng_state'] = np.random.get_state()
+        sd['torch_rng_state'] = torch.get_rng_state()
+        sd['cuda_rng_state'] = torch.cuda.get_rng_state()
+        sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
+
+    model.save_checkpoint(args.save, iteration, client_state=sd)
+```
+
+In Megatron-LM GPT2's `save_checkpoint()` function, add the following lines to
+invoke the above function for DeepSpeed.
+
+```python
+def save_checkpoint(iteration, model, optimizer,
+                    lr_scheduler, args):
+    """Save a model checkpoint."""
+    if args.deepspeed:
+        save_ds_checkpoint(iteration, model, args)
+    else:
+        ......
+```
+
+In the `load_checkpoint()` function, use the DeepSpeed checkpoint loading API
+as below, and return the states for the client model.
+
+```python
+def load_checkpoint(model, optimizer, lr_scheduler, args):
+    """Load a model checkpoint."""
+
+    iteration, release = get_checkpoint_iteration(args)
+
+    if args.deepspeed:
+        checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
+
+        if checkpoint_name is None:
+            if mpu.get_data_parallel_rank() == 0:
+                print("Unable to load checkpoint.")
+            return iteration
+    else:
+        ......
+
+```
+
+### Train scripts
+Assuming the webtext data was prepared in the previous steps, execute the
+following commands to start training the Megatron-LM GPT2 model with DeepSpeed
+applied.
+
+- Single GPU run
+  - run `bash scripts/ds_pretrain_gpt2.sh`
+- Multiple GPUs/Nodes run
+  - run `bash scripts/ds_pretrain_gpt2_model_parallel.sh`
+
+## Performance Improvements
+DeepSpeed enables training very large models effectively via the advanced
+[ZeRO optimizer](https://arxiv.org/abs/1910.02054v2). ZeRO significantly
+reduces the memory footprint for training large models, which means large
+models can be trained with i) less model parallelism and ii) larger batch
+sizes. A lower model parallelism degree improves training efficiency by
+increasing the granularity of computations such as matrix multiplication,
+where performance is directly related to the size of the matrices.
+Furthermore, less model parallelism also results in less communication between
+model-parallel GPUs, which further boosts performance. A larger batch size has
+a similar effect of increasing the computational granularity as well as
+reducing communication, also resulting in better performance. Therefore,
+DeepSpeed combines ZeRO-powered data parallelism with Megatron-LM
+tensor-slicing model parallelism, which is significantly faster than using
+Megatron-LM alone.
+
+The observed performance improvements depend on several factors such as the
+memory per GPU, the local GPU interconnect (i.e., PCI-E vs NVLINK vs NVSwitch),
+the model size, the inter-node network interconnect, etc. Below, we show some
+of the performance improvements from using DeepSpeed over Megatron on a 16 GPU
+low bandwidth (40 Gbps) cluster and a 400 GPU DGX-2 high bandwidth (800 Gbps)
+cluster. For details please see the [ZeRO paper](https://arxiv.org/abs/1910.02054v2).
+We also present performance improvements on a 64 GPU cluster along with a
+detailed configuration analysis to show where the improvements come from.
+
+![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)
+

+The figure depicts system throughput improvements of DeepSpeed (combining
+ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM)
+over using Megatron-LM alone.
+

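For readers who want to connect these results to a concrete configuration, the
JSON below sketches the kind of `deepspeed_config` that enables ZeRO, FP16,
and a large effective batch size for a run like the 1.5B-parameter one (batch
size 2048 in the configuration tables further below). The exact files used by
the `ds_pretrain_gpt2*.sh` scripts are not reproduced in this tutorial, and
the `zero_optimization` schema has changed across DeepSpeed releases, so treat
every field here as an assumption rather than the recorded setup.

```json
{
  "train_batch_size": 2048,
  "fp16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 1
  }
}
```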
+
+
+### On Low Bandwidth GPU Cluster
+The figure above shows that training a 1.5B parameter model with DeepSpeed is
+nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPUs per
+node, and 16 GPUs total. Each GPU has 16 GB of memory; PCI-E interconnects the
+GPUs within a node, and 40 Gbps InfiniBand connects the nodes.
+
+The performance improvement comes from the lower model parallelism degree and
+larger batch size discussed earlier. Training the 1.5B parameter model with
+Megatron-LM alone requires 4-way model parallelism and can only fit an
+effective batch size of 32 using all 16 GPUs. On the other hand, DeepSpeed
+does not require any model parallelism to train this model and can support an
+effective batch size of 128 without running out of memory, resulting in
+significantly higher performance.
+
+### On High Bandwidth DGX-2 GPU Cluster
+Each GPU on the DGX-2 cluster has 32 GB of memory, and GPUs inside a node are
+connected via the high-bandwidth NVSwitch. DGX-2 nodes are connected to each
+other via an 800 Gbps (8 x 100 Gbps) InfiniBand interconnect. As such, running
+a 1.5B model on DGX-2 requires less model parallelism, and the performance
+improvement from DeepSpeed for this model size is less significant. However,
+at larger model sizes, Megatron still requires a significantly larger model
+parallelism degree and can only run much smaller batch sizes than DeepSpeed.
+Therefore, as the model sizes get larger, DeepSpeed, by combining ZeRO with
+Megatron model parallelism, starts to significantly outperform using
+Megatron-LM alone.
+
+### Performance Improvements with Configuration Details
+The figure below compares DeepSpeed with Megatron on a 64 GPU cluster with 4
+DGX-2 nodes. To give the readers a clear idea of the source of the performance
+improvements, we also present the configuration tables for both Megatron and
+DeepSpeed. They show the smallest model parallelism degree and the largest
+batch size that can be used to train these models without running out of
+memory. As discussed above, the tables demonstrate that DeepSpeed runs with a
+smaller model parallelism degree and achieves better performance.
+
+![DeepSpeed Performance SpeedUp](/assets/images/megatron-gpt2-perf-test.png)
+

+The figure depicts system throughput improvements of DeepSpeed (combining
+ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM)
+over using Megatron-LM alone.
+

+
+
+**a) Megatron-LM GPT2 Baseline**
+
+|      | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | ----------: | --------------: | ------------: |
+| 1.5B | 2                 | 32               | 64    | 512        | 48     | 1600        | 16              | 128.56        |
+| 4B   | 4                 | 16               | 64    | 128        | 64     | 2304        | 16              | 49.36         |
+| 8B   | 4                 | 16               | 64    | 128        | 72     | 3072        | 24              | 24.57         |
+| 20B  | 16                | 4                | 64    | 16         | 111    | 3808        | 32              | 3.42          |
+
+**b) Megatron-LM GPT2 with DeepSpeed**
+
+|      | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | ----------: | --------------: | ------------: |
+| 1.5B | 1                 | 64               | 64    | 2048       | 48     | 1600        | 16              | 151.35        |
+| 4B   | 1                 | 64               | 64    | 512        | 64     | 2304        | 16              | 75.13         |
+| 8B   | 2                 | 32               | 64    | 512        | 72     | 3072        | 24              | 43.52         |
+| 20B  | 4                 | 16               | 64    | 128        | 111    | 3808        | 32              | 12.65         |
diff --git a/docs/assets/css/main.scss b/docs/assets/css/main.scss
index c2583467e4b7..26a771784d01 100644
--- a/docs/assets/css/main.scss
+++ b/docs/assets/css/main.scss
@@ -31,8 +31,7 @@
     border-radius: $border-radius;
     -webkit-box-shadow: $box-shadow;
     box-shadow: $box-shadow;
-    position: fixed;
-
+    //position: fixed;
     .nav__title {
       color: #fff;
       font-size: $type-size-6;
diff --git a/docs/assets/images/1cycle_lr.png b/docs/assets/images/1cycle_lr.png
new file mode 100644
index 000000000000..18246a6d4125
Binary files /dev/null and b/docs/assets/images/1cycle_lr.png differ
diff --git a/docs/assets/images/loss_and_lr.png b/docs/assets/images/loss_and_lr.png
new file mode 100644
index 000000000000..dedd62360ba7
Binary files /dev/null and b/docs/assets/images/loss_and_lr.png differ
diff --git a/docs/assets/images/lr_schedule.png b/docs/assets/images/lr_schedule.png
new file mode 100644
index 000000000000..cdda1b8decea
Binary files /dev/null and b/docs/assets/images/lr_schedule.png differ
diff --git a/docs/assets/images/megatron-gpt2-perf-test.png b/docs/assets/images/megatron-gpt2-perf-test.png
new file mode 100644
index 000000000000..9fe5e66b239e
Binary files /dev/null and b/docs/assets/images/megatron-gpt2-perf-test.png differ
diff --git a/docs/assets/images/model_convergence.png b/docs/assets/images/model_convergence.png
new file mode 100644
index 000000000000..b88899bf5fb4
Binary files /dev/null and b/docs/assets/images/model_convergence.png differ
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 000000000000..938c1a8b9c4b
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,74 @@
+---
+title: "Contributing"
+permalink: /contributing/
+---
+
+DeepSpeed welcomes your contributions!
+
+## Prerequisites
+DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting
+is consistent across DeepSpeed. First, ensure that `pre-commit` is installed,
+either by installing DeepSpeed or via `pip install pre-commit`. Next, the
+pre-commit hooks must be installed once before commits can be made:
+```bash
+pre-commit install
+```
+
+Afterwards, our suite of formatting tests runs automatically before each `git
+commit`. You can also run these manually:
+```bash
+pre-commit run --all-files
+```
+If a formatting test fails, it will fix the modified code in place and abort
+the `git commit`. After looking over the changes, you can `git add <file>`
+and then repeat the previous `git commit` command.
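To make the fix-and-retry loop concrete, a typical session might look like the
following sketch, where `<file>` is a placeholder for whatever the hooks
reformatted:

```bash
git commit -m "Add new tutorial"   # a formatting hook fails and rewrites files in place
git diff                           # review the automatic fixes
git add <file>                     # re-stage the corrected file(s)
git commit -m "Add new tutorial"   # the hooks pass this time and the commit succeeds
```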
+
+
+## Testing
+DeepSpeed tracks two types of tests: unit tests and more costly model
+convergence tests. The model convergence tests train
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and
+measure end-to-end convergence and related metrics. Unit tests are found in
+`tests/unit/` and the model convergence tests are found in `tests/model/`.
+
+### Unit Tests
+[PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest
+can be installed from PyPI via `pip install pytest`. Simply invoke `pytest
+--forked` to run the unit tests:
+```bash
+pytest --forked tests/unit/
+```
+You can also provide the `-v` flag to `pytest` to see additional information
+about the tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked)
+and the `--forked` flag are required to test CUDA functionality in distributed
+tests.
+
+### Model Tests
+Model tests require four GPUs and training data downloaded for
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/).
+
+To execute model tests, first [install DeepSpeed](#installation). The
+[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository
+is cloned as part of this process. Next, execute the model test driver:
+```bash
+cd tests/model/
+pytest run_sanity_check.py
+```
+Note that the `--forked` flag is not necessary for the model tests.
+
+## Contributor License Agreement
+This project welcomes contributions and suggestions. Most contributions require
+you to agree to a Contributor License Agreement (CLA) declaring that you have
+the right to, and actually do, grant us the rights to use your contribution.
+For details, visit https://cla.opensource.microsoft.com.
+
+When you submit a pull request, a CLA bot will automatically determine whether
+you need to provide a CLA and decorate the PR appropriately (e.g., status
+check, comment). Simply follow the instructions provided by the bot. You will
+only need to do this once across all repos using our CLA.
+
+## Code of Conduct
+This project has adopted the [Microsoft Open Source Code of
+Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
+see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
+or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any
+additional questions or comments.
diff --git a/docs/index.md b/docs/index.md
index a7a7e0e428b4..af6bdd6d1b4c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -71,7 +71,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
 * DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than
   state-of-the-art NVIDIA Megatron on Azure GPUs.
-  *Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md)
+  *Read more*: [GPT tutorial](/tutorials/megatron/)
@@ -105,8 +105,7 @@ combination. ZeRO boosts the scaling capability and efficiency further.
 significant performance gains compared to using model parallelism alone.
 *Read more*: [technical report](https://arxiv.org/abs/1910.02054),
- and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md).
-
+ and [GPT tutorial](/tutorials/megatron).
 ![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)

@@ -120,13 +119,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
 effectiveness of model training and reduce the number of samples required to
 converge to the desired accuracy.
-*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md),
-
+*Read more*: [Tuning tutorial](/tutorials/1Cycle).

 ## Good Usability
@@ -165,24 +158,9 @@ overview](features) for descriptions and usage.
 * [Performance Analysis and Debugging](features.md#performance-analysis-and-debugging)
-
-# Further Reading
-
-| Article                                                          | Description                                 |
-| ---------------------------------------------------------------- | -------------------------------------------- |
-| [DeepSpeed Features](features.md)                                | DeepSpeed features                          |
-| [DeepSpeed JSON Configuration](config_json.md)                   | Configuring DeepSpeed                       |
-| [API Documentation](/code-docs/)                                 | Generated DeepSpeed API documentation       |
-| [CIFAR-10 Tutorial](./docs/tutorials/CIFAR-10.md)                | Getting started with CIFAR-10 and DeepSpeed |
-| [Megatron-LM Tutorial](./docs/tutorials/MegatronGPT2Tutorial.md) | Train GPT2 with DeepSpeed and Megatron-LM   |
-| [Learning Rate Range Test Tutorial](./docs/tutorials/lrrt.md)    | Faster training with large learning rates   |
-| [1Cycle Tutorial](./docs/tutorials/1Cycle.md)                    | SOTA learning schedule in DeepSpeed         |
-
-
 # Contributing
 DeepSpeed welcomes your contributions! Please see our
-[contributing](CONTRIBUTING.md) guide for more details on formatting, testing,
+[contributing](/contributing/) guide for more details on formatting, testing,
 etc.

 ## Contributor License Agreement