⚡ DeepSpeed ZeRO Stage 2 model parallel training #2
Merged
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective! Using the PyPI source for now until the conda-forge package is released. Also need to install a newer gcc version to prevent the error `Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch! Please use a compiler that is ABI-compatible with GCC 5.0 and above` on the HPC server.
Working towards conserving GPU memory (for inference on full-size images). Using DeepSpeed ZeRO Stage 2, which shards optimizer states (Stage 1) and gradients (Stage 2) across multiple GPUs. Have set `devices` to `auto` instead of `2` so that I can run on 1 GPU on my laptop or 2 GPUs on the HPC server without changing values. Also needed to explicitly convert the input Sentinel-2 image tensor to float16 (if using 16-bit training) to avoid `RuntimeError: Input type (torch.cuda.ShortTensor) and weight type (torch.cuda.HalfTensor) should be the same`.
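A minimal sketch of the dtype fix described above. The tensor shape and value range are made up for illustration (13-band Sentinel-2 reflectance values load as int16); the example runs on CPU, but the same cast applies to the CUDA tensors in the error message:

```python
import torch

# Sentinel-2 imagery loads as int16 (a ShortTensor); under 16-bit training
# the model weights are HalfTensors, so cast the input before the forward pass
image = torch.randint(low=0, high=10000, size=(1, 13, 64, 64), dtype=torch.int16)
image = image.to(dtype=torch.float16)  # now matches the model's float16 weights
```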
weiji14 commented on May 20, 2022
- conda-forge::rioxarray=0.10.1
- conda-forge::torchgeo=0.2.0
- pip:
    - deepspeed==0.6.4
TODO wait for conda-forge package at conda-forge/staged-recipes#19021 so this doesn't need to be installed from PyPI. Also check if using conda-forge package means gcc/gxx_linux-64 isn't needed.
To prevent out-of-memory (OOM) errors when running the Transformer models, change from distributed data parallel (DDP) to a data+model parallel strategy.
Current State
As of 9793587, we have been using distributed data parallel (DDP) to split the data batch-wise across multiple GPUs. However, when running on a full-size Sentinel-2 image (batch_size=1) during the test phase (#1), this can already cause out-of-memory issues for our Super-Resolution Segmentation task.
Future State
One possible solution is to shard the neural network model itself across multiple GPUs. This reduces the GPU memory requirements and allows for larger models and/or bigger datasets to be used for training/inference.
Specifically, we'll be switching to DeepSpeed (https://github.com/microsoft/DeepSpeed), which offers several 'levels' of model sharding. See https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71 and https://huggingface.co/blog/zero-deepspeed-fairscale for good explainers.
Main DeepSpeed stages (from https://pytorch-lightning.readthedocs.io/en/1.6.3/advanced/model_parallel.html#deepspeed):
- ZeRO Stage 1: shards the optimizer states across GPUs
- ZeRO Stage 2: additionally shards the gradients
- ZeRO Stage 3: additionally shards the model parameters themselves
💡 Suggest using Stage 2 instead of Stage 3, because while Stage 3 improves memory use further, it comes with increased latency from the cost of extra distributed communication.
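Under pytorch-lightning 1.6, the suggestion above translates to something like the following Trainer configuration (a sketch, not the exact code in this PR; `strategy="deepspeed_stage_2"` is an equivalent shorthand):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    devices="auto",                       # 1 GPU on a laptop, 2 on the HPC server
    accelerator="gpu",
    precision=16,                         # half-precision to conserve GPU memory
    strategy=DeepSpeedStrategy(stage=2),  # shard optimizer states + gradients;
                                          # stage=3 would also shard parameters,
                                          # at the cost of extra communication
)
```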
Other benefits of using DeepSpeed:
Alternative strategies (and why they were not considered)
Pytorch-Lightning offers several other advanced training strategies. These might work well for other cases, but probably not for our specific project.
TODO:
Use Meta Tensors, c.f. https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71. Currently blocked by `NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend`. See also General MPS op coverage tracking issue pytorch/pytorch#77764