Skip to content

Commit

Permalink
Update Colossal AI docs and integration (#16778)
Browse files Browse the repository at this point in the history
  • Loading branch information
awaelchli authored Feb 16, 2023
1 parent cc22ddc commit ad698f0
Show file tree
Hide file tree
Showing 7 changed files with 125 additions and 137 deletions.
9 changes: 0 additions & 9 deletions dockers/base-cuda/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -98,18 +98,9 @@ RUN \
pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
rm assistant.py

RUN \
# install ColossalAI
# TODO: 1.13 wheels are not released, remove skip once they are
if [[ $PYTORCH_VERSION != "1.13" ]]; then \
pip install "colossalai==0.2.4"; \
python -c "import colossalai; print(colossalai.__version__)" ; \
fi

RUN \
# install rest of strategies
# remove colossalai from requirements since they are installed separately
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
cat requirements/pytorch/strategies.txt && \
pip install -r requirements/pytorch/devel.txt -r requirements/pytorch/strategies.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html

Expand Down
2 changes: 0 additions & 2 deletions dockers/nvidia/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,6 @@ RUN \

# Installations \
pip install "Pillow>=8.2, !=8.3.0" "cryptography>=3.4" "py>=1.10" --no-cache-dir && \
# remove colossalai from requirements since they are installed separately
python -c "fname = 'lightning/requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
PACKAGE_NAME=pytorch pip install './lightning[extra,loggers,strategies]' --no-cache-dir && \
rm -rf lightning && \
pip list
Expand Down
128 changes: 6 additions & 122 deletions docs/source-pytorch/advanced/model_parallel.rst
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ This means we cannot sacrifice throughput as much as if we were fine-tuning, bec
Overall:

* When **fine-tuning** a model, use advanced memory efficient strategies such as :ref:`fully-sharded-training`, :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute
* When **pre-training** a model, use simpler optimizations such :ref:`sharded-training` or :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
* When **pre-training** a model, use simpler optimizations such as :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
* For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` as the throughput degradation is not significant

For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit with more advanced optimized multi-gpu strategy.
Expand All @@ -52,133 +52,17 @@ Sharding techniques help when model sizes are fairly large; roughly 500M+ parame
* When your model is small (ResNet50 of around 80M Parameters), unless you are using unusually large batch sizes or inputs.
* Due to high distributed communication between devices, if running on a slow network/interconnect, the training might be much slower than expected and then it's up to you to determince the tradeoff here.

----------

.. _colossalai:

***********
Colossal-AI
***********

:class:`~pytorch_lightning.strategies.colossalai.ColossalAIStrategy` implements ZeRO-DP with chunk-based memory management.
With this chunk mechanism, really large models can be trained with a small number of GPUs.
It supports larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragments and CPU memory consumption.
Also, it speeds up this kind of heterogeneous training by fully utilizing all kinds of resources.

When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes.
This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.

Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
During the first training step, the warmup phase will sample the maximum non-model data memory (memory usage expect parameters, gradients, and optimizer states).
In later training, it will use the collected memory usage information to evict chunks dynamically.
Gemini allows you to fit much larger models with limited GPU memory.

According to our benchmark results, we can train models with up to 24 billion parameters in 1 GPU.
You can install colossalai by consulting `how to download colossalai <https://colossalai.org/download>`_.
Then, run this benchmark in `Colossalai-PL/gpt <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_.

Here is an example showing how to use ColossalAI:

.. code-block:: python
from colossalai.nn.optimizer import HybridAdam
class MyBert(LightningModule):
...
def configure_sharded_model(self) -> None:
# create your model here
self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
def configure_optimizers(self):
# use the specified optimizer
optimizer = HybridAdam(self.model.parameters(), self.lr)
...
model = MyBert()
trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
trainer.fit(model)
You can find more examples in the `Colossalai-PL <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning>`_ repository.

.. note::

* The only accelerator which ColossalAI supports is ``"gpu"``. But CPU resources will be used when the placement policy is set to "auto" or "cpu".

* The only precision which ColossalAI allows is 16 (FP16).
Cutting-edge and Experimental Strategies
========================================

* It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.
HybridAdam`` now. You can set ``adamw_mode`` to False to use normal Adam. Noticing that ``HybridAdam`` is highly optimized, it uses fused CUDA kernel and parallel CPU kernel.
It is recomended to use ``HybridAdam``, since it updates parameters in GPU and CPU both.
Cutting-edge Lightning strategies are being developed by third-parties outside of Lightning.
If you want to be the first to try the latest and greatest experimental features for model-parallel training, check out the :doc:`Colossal-AI Strategy <./third_party/colossalai>` integration.

* Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.

* ``ColossalaiStrategy`` doesn't support gradient accumulation as of now.

.. _colossal_placement_policy:

Placement Policy
================

Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
There are three options for the placement policy.
They are "cpu", "cuda" and "auto" respectively.

When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
In this way, "cpu" placement policy uses the least CUDA memory.
It is the best choice for users who want to exceptionally enlarge their model size or training batch size.

When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training.
It is for users who get plenty of CUDA memory.

The third option, "auto", enables Gemini.
It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations.
In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information.
It is the fastest option when CUDA memory is enough.

Here's an example of changing the placement policy to "cpu".

.. code-block:: python
from pytorch_lightning.strategies import ColossalAIStrategy
model = MyModel()
my_strategy = ColossalAIStrategy(placement_policy="cpu")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
trainer.fit(model)
.. _sharded-training:

****************
Sharded Training
****************

The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
however the implementation is built from the ground up to be PyTorch compatible and standalone.
Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect near-normal linear scaling (if your network allows), and significantly reduced memory usage when training large models.

Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs.
This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.

The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of efficient communication,
these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.

It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models).
A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.

.. code-block:: python
# train using Sharded DDP
trainer = Trainer(strategy="ddp_sharded")
Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.

----


.. _fully-sharded-training:

**********************
Expand Down
92 changes: 92 additions & 0 deletions docs/source-pytorch/advanced/third_party/colossalai.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
:orphan:

###########
Colossal-AI
###########


The Colossal-AI strategy implements ZeRO-DP with chunk-based memory management.
With this chunk mechanism, really large models can be trained with a small number of GPUs.
It supports larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragments and CPU memory consumption.
Also, it speeds up this kind of heterogeneous training by fully utilizing all kinds of resources.

When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes.
This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.

Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
During the first training step, the warmup phase will sample the maximum non-model data memory (memory usage expect parameters, gradients, and optimizer states).
In later training, it will use the collected memory usage information to evict chunks dynamically.
Gemini allows you to fit much larger models with limited GPU memory.

According to our benchmark results, we can train models with up to 24 billion parameters in 1 GPU.

You can install the Colossal-AI integration by running

.. code-block:: bash
pip install lightning-colossalai
This will install both the `colossalai <https://colossalai.org/download>`_ package as well as the ``ColossalAIStrategy`` for the Lightning Trainer:

.. code-block:: python
trainer = Trainer(strategy="colossalai", precision=16, devices=...)
You can tune several settings by instantiating the strategy objects and pass options in:

.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
strategy = ColossalAIStrategy(...)
trainer = Trainer(strategy=strategy, precision=16, devices=...)
See a full example of a benchmark with the a `GPT-2 model <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_ of up to 24 billion parameters

.. note::

* The only accelerator which ColossalAI supports is ``"gpu"``. But CPU resources will be used when the placement policy is set to "auto" or "cpu".

* The only precision which ColossalAI allows is 16-bit mixed precision (FP16).

* It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.
HybridAdam`` now. You can set ``adamw_mode`` to False to use normal Adam. Noticing that ``HybridAdam`` is highly optimized, it uses fused CUDA kernel and parallel CPU kernel.
It is recomended to use ``HybridAdam``, since it updates parameters in GPU and CPU both.

* Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.

* ``ColossalaiStrategy`` doesn't support gradient accumulation as of now.

.. _colossal_placement_policy:

Placement Policy
================

Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
There are three options for the placement policy.
They are "cpu", "cuda" and "auto" respectively.

When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
In this way, "cpu" placement policy uses the least CUDA memory.
It is the best choice for users who want to exceptionally enlarge their model size or training batch size.

When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training.
It is for users who get plenty of CUDA memory.

The third option, "auto", enables Gemini.
It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations.
In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information.
It is the fastest option when CUDA memory is enough.

Here's an example of changing the placement policy to "cpu".

.. code-block:: python
from lightning_colossalai import ColossalAIStrategy
model = MyModel()
my_strategy = ColossalAIStrategy(placement_policy="cpu")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
trainer.fit(model)
27 changes: 23 additions & 4 deletions docs/source-pytorch/extensions/strategy.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ plugin and other optional plugins such as the :ref:`ClusterEnvironment <extensio
We expose Strategies mainly for expert users that want to extend Lightning for new hardware support or new distributed backends (e.g. a backend not yet supported by `PyTorch <https://pytorch.org/docs/stable/distributed.html#backends>`_ itself).


----------
----

*****************************
Selecting a Built-in Strategy
Expand Down Expand Up @@ -69,9 +69,6 @@ The below table lists all relevant strategies available in Lightning with their
* - Name
- Class
- Description
* - colossalai
- :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
- Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__
* - fsdp
- :class:`~pytorch_lightning.strategies.FSDPStrategy`
- Strategy for Fully Sharded Data Parallel training. :ref:`Learn more. <advanced/model_parallel:Fully Sharded Training>`
Expand Down Expand Up @@ -102,6 +99,28 @@ The below table lists all relevant strategies available in Lightning with their

----


**********************
Third-party Strategies
**********************

There are powerful third-party strategies that integrate well with Lightning but aren't maintained as part of the ``lightning`` package.

.. list-table:: List of third-party strategy implementations
:widths: 20 20 20
:header-rows: 1

* - Name
- Package
- Description
* - colossalai
- `Lightning-AI/lightning-colossalai <https://github.com/Lightning-AI/lightning-colossalai>`_
- Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__


----


************************
Create a Custom Strategy
************************
Expand Down
1 change: 1 addition & 0 deletions requirements/pytorch/strategies.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@
# in case you want to preserve/enforce restrictions on the latest compatible version, add "strict" as an in-line comment

deepspeed>=0.6.0, <0.8.0 # TODO: Include 0.8.x after https://github.com/microsoft/DeepSpeed/commit/b587c7e85470329ac25df7c7c2521ff9b2833db7 gets released
lightning-colossalai==0.1.0dev
3 changes: 3 additions & 0 deletions src/lightning/pytorch/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -310,6 +310,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed the `QuantizationAwareTraining` callback ([#16750](https://github.com/Lightning-AI/lightning/pull/16750))


- Removed the `ColossalAIStrategy` and `ColossalAIPrecisionPlugin` in favor of the new [lightning-colossalai](https://github.com/Lightning-AI/lightning-colossalai) package ([#16757](https://github.com/Lightning-AI/lightning/pull/16757), [#16778](https://github.com/Lightning-AI/lightning/pull/16778))


### Fixed

- Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer ([#16693](https://github.com/Lightning-AI/lightning/pull/16693))
Expand Down

0 comments on commit ad698f0

Please sign in to comment.