
Commit

Merge branch 'master' into 2.0/precision_PL
justusschock authored Feb 17, 2023
2 parents b553824 + ac5fa03 commit 8321309
Showing 56 changed files with 509 additions and 471 deletions.
9 changes: 0 additions & 9 deletions dockers/base-cuda/Dockerfile
@@ -98,18 +98,9 @@ RUN \
pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
rm assistant.py

RUN \
# install ColossalAI
# TODO: 1.13 wheels are not released, remove skip once they are
if [[ $PYTORCH_VERSION != "1.13" ]]; then \
pip install "colossalai==0.2.4"; \
python -c "import colossalai; print(colossalai.__version__)" ; \
fi

RUN \
# install rest of strategies
# remove colossalai from requirements since they are installed separately
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
cat requirements/pytorch/strategies.txt && \
pip install -r requirements/pytorch/devel.txt -r requirements/pytorch/strategies.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html

2 changes: 0 additions & 2 deletions dockers/nvidia/Dockerfile
@@ -43,8 +43,6 @@ RUN \

# Installations \
pip install "Pillow>=8.2, !=8.3.0" "cryptography>=3.4" "py>=1.10" --no-cache-dir && \
# remove colossalai from requirements since they are installed separately
python -c "fname = 'lightning/requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
PACKAGE_NAME=pytorch pip install './lightning[extra,loggers,strategies]' --no-cache-dir && \
rm -rf lightning && \
pip list
6 changes: 3 additions & 3 deletions docs/source-pytorch/accelerators/tpu_faq.rst
@@ -61,7 +61,7 @@ How to resolve the replication issue?
.format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8
This error is raised when the XLA device is called outside the spawn process. Internally in `TPUSpawn` Strategy for training on multiple tpu cores, we use XLA's `xmp.spawn`.
This error is raised when the XLA device is called outside the spawn process. Internally, the XLA strategy uses XLA's `xmp.spawn` for training on multiple TPU cores.
Don't use ``xm.xla_device()`` while working on Lightning + TPUs!

----
@@ -91,7 +91,7 @@ How to setup the debug mode for Training on TPUs?
import pytorch_lightning as pl
my_model = MyLightningModule()
trainer = pl.Trainer(accelerator="tpu", devices=8, strategy="tpu_spawn_debug")
trainer = pl.Trainer(accelerator="tpu", devices=8, strategy="xla_debug")
trainer.fit(my_model)
Example Metrics report:
@@ -108,7 +108,7 @@ Example Metrics report:
Many PyTorch operations aren't lowered to XLA, which can lead to a significant slowdown of the training process.
These operations are moved to CPU memory and evaluated there, and the results are then transferred back to the XLA device(s).
By using the `tpu_spawn_debug` Strategy, users could create a metrics report to diagnose issues.
By using the `xla_debug` Strategy, users can generate a metrics report to diagnose issues.

The report includes things like (`XLA Reference <https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#troubleshooting>`_):

128 changes: 6 additions & 122 deletions docs/source-pytorch/advanced/model_parallel.rst
@@ -37,7 +37,7 @@ This means we cannot sacrifice throughput as much as if we were fine-tuning, bec
Overall:

* When **fine-tuning** a model, use advanced memory efficient strategies such as :ref:`fully-sharded-training`, :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute
* When **pre-training** a model, use simpler optimizations such :ref:`sharded-training` or :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
* When **pre-training** a model, use simpler optimizations such as :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
* For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` as the throughput degradation is not significant

For example, when using 128 GPUs, you can **pre-train** large 10 to 20 billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit from more advanced optimized multi-GPU strategies.
@@ -52,133 +52,17 @@ Sharding techniques help when model sizes are fairly large; roughly 500M+ parame
* When your model is small (e.g. ResNet50 with around 80M parameters), unless you are using unusually large batch sizes or inputs.
* Due to the high volume of distributed communication between devices, training on a slow network/interconnect might be much slower than expected, and it's then up to you to determine the tradeoff here.

----------

.. _colossalai:

***********
Colossal-AI
***********

:class:`~pytorch_lightning.strategies.colossalai.ColossalAIStrategy` implements ZeRO-DP with chunk-based memory management.
With this chunk mechanism, really large models can be trained with a small number of GPUs.
It supports larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragments and CPU memory consumption.
Also, it speeds up this kind of heterogeneous training by fully utilizing all kinds of resources.

When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes.
This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.

Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
During the first training step, the warmup phase will sample the maximum non-model data memory (memory usage excluding parameters, gradients, and optimizer states).
In later training, it will use the collected memory usage information to evict chunks dynamically.
Gemini allows you to fit much larger models with limited GPU memory.

According to our benchmark results, we can train models with up to 24 billion parameters in 1 GPU.
You can install colossalai by consulting `how to download colossalai <https://colossalai.org/download>`_.
Then, run this benchmark in `Colossalai-PL/gpt <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_.

Here is an example showing how to use ColossalAI:

.. code-block:: python
from colossalai.nn.optimizer import HybridAdam
class MyBert(LightningModule):
...
def configure_sharded_model(self) -> None:
# create your model here
self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
def configure_optimizers(self):
# use the specified optimizer
optimizer = HybridAdam(self.model.parameters(), self.lr)
...
model = MyBert()
trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
trainer.fit(model)
You can find more examples in the `Colossalai-PL <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning>`_ repository.

.. note::

* The only accelerator which ColossalAI supports is ``"gpu"``. But CPU resources will be used when the placement policy is set to "auto" or "cpu".

* The only precision which ColossalAI allows is 16 (FP16).
Cutting-edge and Experimental Strategies
========================================

* It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.HybridAdam``. You can set ``adamw_mode`` to ``False`` to use plain Adam. Note that ``HybridAdam`` is highly optimized: it uses a fused CUDA kernel and a parallel CPU kernel. It is recommended to use ``HybridAdam``, since it updates parameters on both GPU and CPU.
Cutting-edge Lightning strategies are being developed by third parties outside of Lightning.
If you want to be the first to try the latest and greatest experimental features for model-parallel training, check out the :doc:`Colossal-AI Strategy <./third_party/colossalai>` integration.

* Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.

* ``ColossalAIStrategy`` doesn't support gradient accumulation as of now.

.. _colossal_placement_policy:

Placement Policy
================

Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
There are three options for the placement policy.
They are "cpu", "cuda" and "auto" respectively.

When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
In this way, "cpu" placement policy uses the least CUDA memory.
It is the best choice for users who want to exceptionally enlarge their model size or training batch size.

When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training.
It is for users who get plenty of CUDA memory.

The third option, "auto", enables Gemini.
It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations.
In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information.
It is the fastest option when CUDA memory is enough.

Here's an example of changing the placement policy to "cpu".

.. code-block:: python
from pytorch_lightning.strategies import ColossalAIStrategy
model = MyModel()
my_strategy = ColossalAIStrategy(placement_policy="cpu")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
trainer.fit(model)
.. _sharded-training:

****************
Sharded Training
****************

The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
however the implementation is built from the ground up to be PyTorch compatible and standalone.
Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect near-normal linear scaling (if your network allows), and significantly reduced memory usage when training large models.

Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.

The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models).
A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.

.. code-block:: python
# train using Sharded DDP
trainer = Trainer(strategy="ddp_sharded")
Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.

----


.. _fully-sharded-training:

**********************
2 changes: 1 addition & 1 deletion docs/source-pytorch/advanced/strategy_registry.rst
@@ -18,7 +18,7 @@ It also returns the optional description and parameters for initialising the Str
trainer = Trainer(strategy="deepspeed_stage_3_offload", accelerator="gpu", devices=3)
# Training with the TPU Spawn Strategy with `debug` as True
trainer = Trainer(strategy="tpu_spawn_debug", accelerator="tpu", devices=8)
trainer = Trainer(strategy="xla_debug", accelerator="tpu", devices=8)
Additionally, you can pass your custom registered training strategies to the ``strategy`` argument.
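
For illustration, here is a minimal sketch of registering and then selecting a custom strategy. It assumes that ``StrategyRegistry.register`` accepts a name, a strategy class, an optional description, and init parameters; the ``MyCustomDDPStrategy`` subclass is hypothetical.

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy, StrategyRegistry


    class MyCustomDDPStrategy(DDPStrategy):
        """Hypothetical DDP variant, used purely for illustration."""


    # register the strategy under a short name so it can be selected via `strategy="..."`
    StrategyRegistry.register(
        "my_custom_ddp",
        MyCustomDDPStrategy,
        description="DDP with project-specific defaults",
        find_unused_parameters=False,  # forwarded to the strategy constructor
    )

    trainer = Trainer(strategy="my_custom_ddp", accelerator="gpu", devices=2)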
92 changes: 92 additions & 0 deletions docs/source-pytorch/advanced/third_party/colossalai.rst
@@ -0,0 +1,92 @@
:orphan:

###########
Colossal-AI
###########


The Colossal-AI strategy implements ZeRO-DP with chunk-based memory management.
With this chunk mechanism, really large models can be trained with a small number of GPUs.
It supports larger trainable model sizes and batch sizes than usual heterogeneous training by reducing CUDA memory fragmentation and CPU memory consumption.
It also speeds up this kind of heterogeneous training by fully utilizing all available resources.

With the chunk mechanism enabled, a set of consecutive parameters is stored in a chunk, and the chunk is then sharded across different processes.
This reduces communication and data transmission frequency and fully utilizes communication and PCI-E bandwidth, which makes training faster.

Unlike traditional implementations, which adopt static memory partitioning, Colossal-AI implements a dynamic heterogeneous memory management system named Gemini.
During the first training step, the warmup phase samples the maximum non-model data memory (memory usage excluding parameters, gradients, and optimizer states).
In later training steps, it uses the collected memory usage information to evict chunks dynamically.
Gemini allows you to fit much larger models with limited GPU memory.

According to our benchmark results, we can train models with up to 24 billion parameters on a single GPU.

You can install the Colossal-AI integration by running

.. code-block:: bash
pip install lightning-colossalai
This will install both the `colossalai <https://colossalai.org/download>`_ package and the ``ColossalAIStrategy`` for the Lightning Trainer:

.. code-block:: python
trainer = Trainer(strategy="colossalai", precision=16, devices=...)
You can tune several settings by instantiating the strategy object and passing in options:

.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
strategy = ColossalAIStrategy(...)
trainer = Trainer(strategy=strategy, precision=16, devices=...)
See a full benchmark example with a `GPT-2 model <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_ of up to 24 billion parameters.

.. note::

* The only accelerator which ColossalAI supports is ``"gpu"``, but CPU resources will still be used when the placement policy is set to "auto" or "cpu".

* The only precision which ColossalAI allows is 16-bit mixed precision (FP16).

* It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.HybridAdam``. You can set ``adamw_mode`` to ``False`` to use plain Adam. Note that ``HybridAdam`` is highly optimized: it uses a fused CUDA kernel and a parallel CPU kernel. It is recommended to use ``HybridAdam``, since it updates parameters on both GPU and CPU (see the sketch after this list).

* Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.

* ``ColossalAIStrategy`` doesn't support gradient accumulation as of now.
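
A minimal sketch of the setup described in the notes above, mirroring the earlier Lightning example and assuming the Hugging Face ``transformers`` package for the model:

.. code-block:: python

    from colossalai.nn.optimizer import HybridAdam
    from transformers import BertForSequenceClassification

    from pytorch_lightning import LightningModule, Trainer


    class MyBert(LightningModule):
        # training_step(), dataloaders, etc. are omitted for brevity, as in the original example
        def __init__(self, lr: float = 1e-5):
            super().__init__()
            self.lr = lr

        def configure_sharded_model(self) -> None:
            # create the model here so the ColossalAI strategy can shard its parameters into chunks
            self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

        def configure_optimizers(self):
            # HybridAdam updates parameters on both GPU and CPU
            return HybridAdam(self.model.parameters(), lr=self.lr)


    model = MyBert()
    trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
    trainer.fit(model)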

.. _colossal_placement_policy:

Placement Policy
================

Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
There are three options for the placement policy: "cpu", "cuda", and "auto".

When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
In this way, "cpu" placement policy uses the least CUDA memory.
It is the best choice for users who want to exceptionally enlarge their model size or training batch size.

When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training.
It is for users who get plenty of CUDA memory.

The third option, "auto", enables Gemini.
It monitors the consumption of CUDA memory during the warmup phase and collects the CUDA memory usage of all autograd operations.
In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to the collected CUDA memory usage information.
It is the fastest option when CUDA memory is sufficient.

Here's an example of changing the placement policy to "cpu".

.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
model = MyModel()
my_strategy = ColossalAIStrategy(placement_policy="cpu")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
trainer.fit(model)
2 changes: 1 addition & 1 deletion docs/source-pytorch/api_references.rst
@@ -222,7 +222,7 @@ strategies
SingleHPUStrategy
SingleTPUStrategy
Strategy
TPUSpawnStrategy
XLAStrategy

tuner
-----
31 changes: 25 additions & 6 deletions docs/source-pytorch/extensions/strategy.rst
@@ -23,7 +23,7 @@ plugin and other optional plugins such as the :ref:`ClusterEnvironment <extensio
We expose Strategies mainly for expert users who want to extend Lightning for new hardware support or new distributed backends (e.g. a backend not yet supported by `PyTorch <https://pytorch.org/docs/stable/distributed.html#backends>`_ itself).


----------
----

*****************************
Selecting a Built-in Strategy
@@ -69,9 +69,6 @@ The below table lists all relevant strategies available in Lightning with their
* - Name
- Class
- Description
* - colossalai
- :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
- Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__
* - fsdp
- :class:`~pytorch_lightning.strategies.FSDPStrategy`
- Strategy for Fully Sharded Data Parallel training. :ref:`Learn more. <advanced/model_parallel:Fully Sharded Training>`
@@ -93,15 +90,37 @@ The below table lists all relevant strategies available in Lightning with their
* - ipu_strategy
- :class:`~pytorch_lightning.strategies.IPUStrategy`
- Plugin for training on IPU devices. :doc:`Learn more. <../accelerators/ipu>`
* - tpu_spawn
- :class:`~pytorch_lightning.strategies.TPUSpawnStrategy`
* - xla
- :class:`~pytorch_lightning.strategies.XLAStrategy`
- Strategy for training on multiple TPU devices using the :func:`torch_xla.distributed.xla_multiprocessing.spawn` method. :doc:`Learn more. <../accelerators/tpu>`
* - single_tpu
- :class:`~pytorch_lightning.strategies.SingleTPUStrategy`
- Strategy for training on a single TPU device. :doc:`Learn more. <../accelerators/tpu>`
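
As a quick illustration, any of the registered names in the table above can be passed directly to the Trainer (a minimal sketch, assuming a multi-GPU machine):

.. code-block:: python

    from pytorch_lightning import Trainer

    # select a built-in strategy from the table above by its registered name
    trainer = Trainer(strategy="fsdp", accelerator="gpu", devices=4)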

----


**********************
Third-party Strategies
**********************

There are powerful third-party strategies that integrate well with Lightning but aren't maintained as part of the ``lightning`` package.

.. list-table:: List of third-party strategy implementations
:widths: 20 20 20
:header-rows: 1

* - Name
- Package
- Description
* - colossalai
- `Lightning-AI/lightning-colossalai <https://github.com/Lightning-AI/lightning-colossalai>`_
- Colossal-AI provides a collection of parallel components. It aims to let you write your distributed deep learning models just like you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__


----


************************
Create a Custom Strategy
************************
3 changes: 3 additions & 0 deletions docs/source-pytorch/fabric/api/fabric_args.rst
@@ -122,6 +122,9 @@ This can result in improved performance, achieving significant speedups on moder
# Default used by the Fabric
fabric = Fabric(precision="32-true", devices=1)
# the same as:
fabric = Fabric(precision="32", devices=1)
# 16-bit (mixed) precision
fabric = Fabric(precision="16-mixed", devices=1)