Update Colossal AI docs and integration (#16778)

Lightning-AI · Feb 16, 2023
1 parent cc22ddc
commit ad698f0

Showing 7 changed files with 125 additions and 137 deletions.
diff --git a/dockers/base-cuda/Dockerfile b/dockers/base-cuda/Dockerfile
@@ -98,18 +98,9 @@ RUN \
     pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
     rm assistant.py
 
-RUN \
-    # install ColossalAI
-    # TODO: 1.13 wheels are not released, remove skip once they are
-    if [[ $PYTORCH_VERSION != "1.13" ]]; then \
-        pip install "colossalai==0.2.4"; \
-        python -c "import colossalai; print(colossalai.__version__)" ; \
-    fi
 
 RUN \
     # install rest of strategies
-    # remove colossalai from requirements since they are installed separately
-    python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
     cat requirements/pytorch/strategies.txt && \
     pip install -r requirements/pytorch/devel.txt -r requirements/pytorch/strategies.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html
 

diff --git a/dockers/nvidia/Dockerfile b/dockers/nvidia/Dockerfile
@@ -43,8 +43,6 @@ RUN \
 
 # Installations \
     pip install "Pillow>=8.2, !=8.3.0" "cryptography>=3.4" "py>=1.10" --no-cache-dir && \
-    # remove colossalai from requirements since they are installed separately
-    python -c "fname = 'lightning/requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
     PACKAGE_NAME=pytorch pip install './lightning[extra,loggers,strategies]' --no-cache-dir && \
     rm -rf lightning && \
     pip list

diff --git a/docs/source-pytorch/advanced/model_parallel.rst b/docs/source-pytorch/advanced/model_parallel.rst
@@ -37,7 +37,7 @@ This means we cannot sacrifice throughput as much as if we were fine-tuning, bec
 Overall:
 
 * When **fine-tuning** a model, use advanced memory efficient strategies such as :ref:`fully-sharded-training`, :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute
-* When **pre-training** a model, use simpler optimizations such :ref:`sharded-training` or :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
+* When **pre-training** a model, use simpler optimizations such as :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
 * For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` as the throughput degradation is not significant
 
 For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit with more advanced optimized multi-gpu strategy.
@@ -52,133 +52,17 @@ Sharding techniques help when model sizes are fairly large; roughly 500M+ parame
 * When your model is small (ResNet50 of around 80M Parameters), unless you are using unusually large batch sizes or inputs.
 * Due to high distributed communication between devices, if running on a slow network/interconnect, the training might be much slower than expected and then it's up to you to determince the tradeoff here.
 
-----------
-
-.. _colossalai:
-
-***********
-Colossal-AI
-***********
-
-:class:`~pytorch_lightning.strategies.colossalai.ColossalAIStrategy` implements ZeRO-DP with chunk-based memory management.
-With this chunk mechanism, really large models can be trained with a small number of GPUs.
-It supports larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragments and CPU memory consumption.
-Also, it speeds up this kind of heterogeneous training by fully utilizing all kinds of resources.
-
-When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes.
-This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.
-
-Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
-During the first training step, the warmup phase will sample the maximum non-model data memory (memory usage expect parameters, gradients, and optimizer states).
-In later training, it will use the collected memory usage information to evict chunks dynamically.
-Gemini allows you to fit much larger models with limited GPU memory.
-
-According to our benchmark results, we can train models with up to 24 billion parameters in 1 GPU.
-You can install colossalai by consulting `how to download colossalai <https://colossalai.org/download>`_.
-Then, run this benchmark in `Colossalai-PL/gpt <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_.
-
-Here is an example showing how to use ColossalAI:
-
-.. code-block:: python
-
-    from colossalai.nn.optimizer import HybridAdam
-
-
-    class MyBert(LightningModule):
-        ...
-
-        def configure_sharded_model(self) -> None:
-            # create your model here
-            self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
-
-        def configure_optimizers(self):
-            # use the specified optimizer
-            optimizer = HybridAdam(self.model.parameters(), self.lr)
-
-        ...
-
-
-    model = MyBert()
-    trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
-    trainer.fit(model)
-
-You can find more examples in the `Colossalai-PL <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning>`_ repository.
-
-.. note::
-
-    *   The only accelerator which ColossalAI supports is ``"gpu"``. But CPU resources will be used when the placement policy is set to "auto" or "cpu".
 
-    *   The only precision which ColossalAI allows is 16 (FP16).
+Cutting-edge and Experimental Strategies
+========================================
 
-    *   It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.
-        HybridAdam`` now. You can set ``adamw_mode`` to False to use normal Adam. Noticing that ``HybridAdam`` is highly optimized, it uses fused CUDA kernel and parallel CPU kernel.
-        It is recomended to use ``HybridAdam``, since it updates parameters in GPU and CPU both.
+Cutting-edge Lightning strategies are being developed by third parties outside of Lightning.
+If you want to be the first to try the latest and greatest experimental features for model-parallel training, check out the :doc:`Colossal-AI Strategy <./third_party/colossalai>` integration.
 
-    *   Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.
-
-    *   ``ColossalaiStrategy`` doesn't support gradient accumulation as of now.
-
-.. _colossal_placement_policy:
-
-Placement Policy
-================
-
-Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
-There are three options for the placement policy.
-They are "cpu", "cuda" and "auto" respectively.
-
-When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
-In this way, "cpu" placement policy uses the least CUDA memory.
-It is the best choice for users who want to exceptionally enlarge their model size or training batch size.
-
-When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training.
-It is for users who get plenty of CUDA memory.
-
-The third option, "auto", enables Gemini.
-It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations.
-In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information.
-It is the fastest option when CUDA memory is enough.
-
-Here's an example of changing the placement policy to "cpu".
-
-.. code-block:: python
-
-    from pytorch_lightning.strategies import ColossalAIStrategy
-
-    model = MyModel()
-    my_strategy = ColossalAIStrategy(placement_policy="cpu")
-    trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
-    trainer.fit(model)
-
-.. _sharded-training:
-
-****************
-Sharded Training
-****************
-
-The technique can be found within `DeepSpeed ZeRO <https://arxiv.org/abs/1910.02054>`_ and
-`ZeRO-2 <https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/>`_,
-however the implementation is built from the ground up to be PyTorch compatible and standalone.
-Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect near-normal linear scaling (if your network allows), and significantly reduced memory usage when training large models.
-
-Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs.
-This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients.
-
-The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of efficient communication,
-these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups.
-
-It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models).
-A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful.
-
-.. code-block:: python
-
-    # train using Sharded DDP
-    trainer = Trainer(strategy="ddp_sharded")
-
-Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
 
 ----
 
+
 .. _fully-sharded-training:
 
 **********************

diff --git a/docs/source-pytorch/advanced/third_party/colossalai.rst b/docs/source-pytorch/advanced/third_party/colossalai.rst
@@ -0,0 +1,92 @@
+:orphan:
+
+###########
+Colossal-AI
+###########
+
+
+The Colossal-AI strategy implements ZeRO-DP with chunk-based memory management.
+With this chunk mechanism, very large models can be trained with a small number of GPUs.
+It supports larger trainable model sizes and batch sizes than typical heterogeneous training by reducing CUDA memory fragmentation and CPU memory consumption.
+It also speeds up this kind of heterogeneous training by making fuller use of both GPU and CPU resources.
+
+When the chunk mechanism is enabled, a set of consecutive parameters is stored in a chunk, and the chunk is then sharded across different processes.
+This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster.
+
+Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
+During the first training step, the warmup phase samples the maximum non-model data memory (memory usage except parameters, gradients, and optimizer states).
+In later training, it will use the collected memory usage information to evict chunks dynamically.
+Gemini allows you to fit much larger models with limited GPU memory.
+
+According to our benchmark results, we can train models with up to 24 billion parameters on a single GPU.
+
+You can install the Colossal-AI integration by running:
+
+.. code-block:: bash
+
+    pip install lightning-colossalai
+
+This will install both the `colossalai <https://colossalai.org/download>`_ package as well as the ``ColossalAIStrategy`` for the Lightning Trainer:
+
+.. code-block:: python
+
+    trainer = Trainer(strategy="colossalai", precision=16, devices=...)
+
+
+You can tune several settings by instantiating the strategy object and passing options in:
+
+.. code-block:: python
+
+    from lightning_colossalai import ColossalAIStrategy
+
+    strategy = ColossalAIStrategy(...)
+    trainer = Trainer(strategy=strategy, precision=16, devices=...)
+
+
+See a full example of a benchmark with a `GPT-2 model <https://github.com/hpcaitech/ColossalAI-Pytorch-lightning/tree/main/benchmark/gpt>`_ of up to 24 billion parameters.
+
+.. note::
+
+    *   The only accelerator that ColossalAI supports is ``"gpu"``. However, CPU resources will be used when the placement policy is set to "auto" or "cpu".
+
+    *   The only precision which ColossalAI allows is 16-bit mixed precision (FP16).
+
+    *   It only supports a single optimizer, which must currently be ``colossalai.nn.optimizer.CPUAdam`` or
+        ``colossalai.nn.optimizer.HybridAdam``. You can set ``adamw_mode`` to False to use normal Adam. Note that ``HybridAdam`` is highly optimized: it uses a fused CUDA kernel and a parallel CPU kernel.
+        ``HybridAdam`` is recommended, since it updates parameters on both GPU and CPU.
+
+    *   Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.
+
+    *   ``ColossalAIStrategy`` doesn't support gradient accumulation as of now.
+
+.. _colossal_placement_policy:
+
+Placement Policy
+================
+
+Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
+There are three options for the placement policy:
+"cpu", "cuda" and "auto".
+
+When the placement policy is set to "cpu", all participating parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation.
+This way, the "cpu" placement policy uses the least CUDA memory.
+It is the best choice for users who want to maximize their model size or training batch size.
+
+When using the "cuda" option, all parameters are placed in CUDA memory and no CPU resources are used during training.
+It is for users who have plenty of CUDA memory.
+
+The third option, "auto", enables Gemini.
+It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations.
+In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information.
+It is the fastest option when CUDA memory is sufficient.
+
+Here's an example of changing the placement policy to "cpu".
+
+.. code-block:: python
+
+    from lightning_colossalai import ColossalAIStrategy
+
+    model = MyModel()
+    my_strategy = ColossalAIStrategy(placement_policy="cpu")
+    trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
+    trainer.fit(model)
diff --git a/docs/source-pytorch/extensions/strategy.rst b/docs/source-pytorch/extensions/strategy.rst
@@ -23,7 +23,7 @@ plugin and other optional plugins such as the :ref:`ClusterEnvironment <extensio
 We expose Strategies mainly for expert users that want to extend Lightning for new hardware support or new distributed backends (e.g. a backend not yet supported by `PyTorch <https://pytorch.org/docs/stable/distributed.html#backends>`_ itself).
 
 
-----------
+----
 
 *****************************
 Selecting a Built-in Strategy
@@ -69,9 +69,6 @@ The below table lists all relevant strategies available in Lightning with their
    * - Name
      - Class
      - Description
-   * - colossalai
-     - :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
-     - Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__
    * - fsdp
      - :class:`~pytorch_lightning.strategies.FSDPStrategy`
      - Strategy for Fully Sharded Data Parallel training. :ref:`Learn more. <advanced/model_parallel:Fully Sharded Training>`
@@ -102,6 +99,28 @@ The below table lists all relevant strategies available in Lightning with their
 
 ----
 
+
+**********************
+Third-party Strategies
+**********************
+
+There are powerful third-party strategies that integrate well with Lightning but aren't maintained as part of the ``lightning`` package.
+
+.. list-table:: List of third-party strategy implementations
+   :widths: 20 20 20
+   :header-rows: 1
+
+   * - Name
+     - Package
+     - Description
+   * - colossalai
+     - `Lightning-AI/lightning-colossalai <https://github.com/Lightning-AI/lightning-colossalai>`_
+     - Colossal-AI provides a collection of parallel components. It aims to let you write distributed deep learning models the same way you write a model on your laptop. `Learn more. <https://www.colossalai.org/>`__
+
+
+----
+
+
 ************************
 Create a Custom Strategy
 ************************

diff --git a/requirements/pytorch/strategies.txt b/requirements/pytorch/strategies.txt
@@ -2,3 +2,4 @@
 #  in case you want to preserve/enforce restrictions on the latest compatible version, add "strict" as an in-line comment
 
 deepspeed>=0.6.0, <0.8.0  # TODO: Include 0.8.x after https://github.com/microsoft/DeepSpeed/commit/b587c7e85470329ac25df7c7c2521ff9b2833db7 gets released
+lightning-colossalai==0.1.0dev
diff --git a/src/lightning/pytorch/CHANGELOG.md b/src/lightning/pytorch/CHANGELOG.md
@@ -310,6 +310,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed the `QuantizationAwareTraining` callback ([#16750](https://github.com/Lightning-AI/lightning/pull/16750))
 
 
+- Removed the `ColossalAIStrategy` and `ColossalAIPrecisionPlugin` in favor of the new [lightning-colossalai](https://github.com/Lightning-AI/lightning-colossalai) package ([#16757](https://github.com/Lightning-AI/lightning/pull/16757), [#16778](https://github.com/Lightning-AI/lightning/pull/16778))
+
+
 ### Fixed
 
 - Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer ([#16693](https://github.com/Lightning-AI/lightning/pull/16693))
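
Migration note (a sketch, not part of the commit): the changelog entry above says `ColossalAIStrategy` now lives in the separate `lightning-colossalai` package, and the doc examples in this diff change the import from `pytorch_lightning.strategies` to `lightning_colossalai`. User code that still uses the old import could be updated along these lines. The helper `migrate_imports` is hypothetical and only handles the exact single-name import form shown in the removed docs; any other import style would need manual attention.

```python
# Import lines taken from the removed and the added doc examples in this commit.
OLD_IMPORT = "from pytorch_lightning.strategies import ColossalAIStrategy"
NEW_IMPORT = "from lightning_colossalai import ColossalAIStrategy"


def migrate_imports(source: str) -> str:
    """Rewrite the old ColossalAIStrategy import to the lightning_colossalai one.

    Hypothetical helper: only lines that consist of exactly OLD_IMPORT
    (ignoring surrounding whitespace) are rewritten; indentation and line
    endings are preserved, and every other line is left untouched.
    """
    migrated = []
    for line in source.splitlines(keepends=True):
        if line.strip() == OLD_IMPORT:
            line = line.replace(OLD_IMPORT, NEW_IMPORT)
        migrated.append(line)
    return "".join(migrated)


if __name__ == "__main__":
    old_script = (
        "from pytorch_lightning.strategies import ColossalAIStrategy\n"
        "\n"
        'my_strategy = ColossalAIStrategy(placement_policy="cpu")\n'
    )
    print(migrate_imports(old_script))
```

The rewritten import only resolves once the new package is installed, which the added docs do with `pip install lightning-colossalai`.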