diff --git a/dockers/base-cuda/Dockerfile b/dockers/base-cuda/Dockerfile
index d2bd534e33776..424b82ce532ce 100644
--- a/dockers/base-cuda/Dockerfile
+++ b/dockers/base-cuda/Dockerfile
@@ -98,18 +98,9 @@ RUN \
     pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
     rm assistant.py

-RUN \
-    # install ColossalAI
-    # TODO: 1.13 wheels are not released, remove skip once they are
-    if [[ $PYTORCH_VERSION != "1.13" ]]; then \
-        pip install "colossalai==0.2.4"; \
-        python -c "import colossalai; print(colossalai.__version__)" ; \
-    fi
 RUN \
     # install rest of strategies
-    # remove colossalai from requirements since they are installed separately
-    python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
     cat requirements/pytorch/strategies.txt && \
     pip install -r requirements/pytorch/devel.txt -r requirements/pytorch/strategies.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html

diff --git a/dockers/nvidia/Dockerfile b/dockers/nvidia/Dockerfile
index 9bb97e92af04e..cb76595f3eac7 100644
--- a/dockers/nvidia/Dockerfile
+++ b/dockers/nvidia/Dockerfile
@@ -43,8 +43,6 @@ RUN \
     # Installations \
     pip install "Pillow>=8.2, !=8.3.0" "cryptography>=3.4" "py>=1.10" --no-cache-dir && \
-    # remove colossalai from requirements since they are installed separately
-    python -c "fname = 'lightning/requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'colossalai' not in line] ; open(fname, 'w').writelines(lines)" ; \
     PACKAGE_NAME=pytorch pip install './lightning[extra,loggers,strategies]' --no-cache-dir && \
     rm -rf lightning && \
     pip list

diff --git a/docs/source-pytorch/advanced/model_parallel.rst b/docs/source-pytorch/advanced/model_parallel.rst
index 9b3030f02ec8c..6603eae0da6c9 100644
--- a/docs/source-pytorch/advanced/model_parallel.rst
+++ b/docs/source-pytorch/advanced/model_parallel.rst
@@ -37,7 +37,7 @@ This means we cannot sacrifice throughput as much as if we were fine-tuning, bec
 Overall:

 * When **fine-tuning** a model, use advanced memory efficient strategies such as :ref:`fully-sharded-training`, :ref:`deepspeed-zero-stage-3` or :ref:`deepspeed-zero-stage-3-offload`, allowing you to fine-tune larger models if you are limited on compute
-* When **pre-training** a model, use simpler optimizations such :ref:`sharded-training` or :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
+* When **pre-training** a model, use simpler optimizations such as :ref:`deepspeed-zero-stage-2`, scaling the number of GPUs to reach larger parameter sizes
 * For both fine-tuning and pre-training, use :ref:`deepspeed-activation-checkpointing` as the throughput degradation is not significant

 For example when using 128 GPUs, you can **pre-train** large 10 to 20 Billion parameter models using :ref:`deepspeed-zero-stage-2` without having to take a performance hit with more advanced optimized multi-gpu strategy.

@@ -52,133 +52,17 @@ Sharding techniques help when model sizes are fairly large; roughly 500M+ parame
 * When your model is small (ResNet50 of around 80M Parameters), unless you are using unusually large batch sizes or inputs.
* Due to high distributed communication between devices, if running on a slow network/interconnect, the training might be much slower than expected and then it's up to you to determince the tradeoff here. ----------- - -.. _colossalai: - -*********** -Colossal-AI -*********** - -:class:`~pytorch_lightning.strategies.colossalai.ColossalAIStrategy` implements ZeRO-DP with chunk-based memory management. -With this chunk mechanism, really large models can be trained with a small number of GPUs. -It supports larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragments and CPU memory consumption. -Also, it speeds up this kind of heterogeneous training by fully utilizing all kinds of resources. - -When enabling chunk mechanism, a set of consecutive parameters are stored in a chunk, and then the chunk is sharded across different processes. -This can reduce communication and data transmission frequency and fully utilize communication and PCI-E bandwidth, which makes training faster. - -Unlike traditional implementations, which adopt static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini. -During the first training step, the warmup phase will sample the maximum non-model data memory (memory usage expect parameters, gradients, and optimizer states). -In later training, it will use the collected memory usage information to evict chunks dynamically. -Gemini allows you to fit much larger models with limited GPU memory. - -According to our benchmark results, we can train models with up to 24 billion parameters in 1 GPU. -You can install colossalai by consulting `how to download colossalai `_. -Then, run this benchmark in `Colossalai-PL/gpt `_. - -Here is an example showing how to use ColossalAI: - -.. code-block:: python - - from colossalai.nn.optimizer import HybridAdam - - - class MyBert(LightningModule): - ... - - def configure_sharded_model(self) -> None: - # create your model here - self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased") - - def configure_optimizers(self): - # use the specified optimizer - optimizer = HybridAdam(self.model.parameters(), self.lr) - - ... - - - model = MyBert() - trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai") - trainer.fit(model) - -You can find more examples in the `Colossalai-PL `_ repository. - -.. note:: - - * The only accelerator which ColossalAI supports is ``"gpu"``. But CPU resources will be used when the placement policy is set to "auto" or "cpu". - * The only precision which ColossalAI allows is 16 (FP16). +Cutting-edge and Experimental Strategies +======================================== - * It only supports a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer. - HybridAdam`` now. You can set ``adamw_mode`` to False to use normal Adam. Noticing that ``HybridAdam`` is highly optimized, it uses fused CUDA kernel and parallel CPU kernel. - It is recomended to use ``HybridAdam``, since it updates parameters in GPU and CPU both. +Cutting-edge Lightning strategies are being developed by third-parties outside of Lightning. +If you want to be the first to try the latest and greatest experimental features for model-parallel training, check out the :doc:`Colossal-AI Strategy <./third_party/colossalai>` integration. - * Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method. 
- - * ``ColossalaiStrategy`` doesn't support gradient accumulation as of now. - -.. _colossal_placement_policy: - -Placement Policy -================ - -Placement policies can help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency. -There are three options for the placement policy. -They are "cpu", "cuda" and "auto" respectively. - -When the placement policy is set to "cpu", all participated parameters will be offloaded into CPU memory immediately at the end of every auto-grad operation. -In this way, "cpu" placement policy uses the least CUDA memory. -It is the best choice for users who want to exceptionally enlarge their model size or training batch size. - -When using "cuda" option, all parameters are placed in the CUDA memory, no CPU resources will be used during the training. -It is for users who get plenty of CUDA memory. - -The third option, "auto", enables Gemini. -It monitors the consumption of CUDA memory during the warmup phase and collects CUDA memory usage of all auto-grad operations. -In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to collected CUDA memory usage information. -It is the fastest option when CUDA memory is enough. - -Here's an example of changing the placement policy to "cpu". - -.. code-block:: python - - from pytorch_lightning.strategies import ColossalAIStrategy - - model = MyModel() - my_strategy = ColossalAIStrategy(placement_policy="cpu") - trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy) - trainer.fit(model) - -.. _sharded-training: - -**************** -Sharded Training -**************** - -The technique can be found within `DeepSpeed ZeRO `_ and -`ZeRO-2 `_, -however the implementation is built from the ground up to be PyTorch compatible and standalone. -Sharded Training allows you to maintain GPU scaling efficiency, whilst reducing memory overhead drastically. In short, expect near-normal linear scaling (if your network allows), and significantly reduced memory usage when training large models. - -Sharded Training still utilizes Data Parallel Training under the hood, except optimizer states and gradients are sharded across GPUs. -This means the memory overhead per GPU is lower, as each GPU only has to maintain a partition of your optimizer state and gradients. - -The benefits vary by model and parameter sizes, but we've recorded up to a 63% memory reduction per GPU allowing us to double our model sizes. Because of efficient communication, -these benefits in multi-GPU setups are almost free and throughput scales well with multi-node setups. - -It is highly recommended to use Sharded Training in multi-GPU environments where memory is limited, or where training larger models are beneficial (500M+ parameter models). -A technical note: as batch size scales, storing activations for the backwards pass becomes the bottleneck in training. As a result, sharding optimizer state and gradients becomes less impactful. - -.. code-block:: python - - # train using Sharded DDP - trainer = Trainer(strategy="ddp_sharded") - -Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required. ---- + .. 
_fully-sharded-training:

 **********************

diff --git a/docs/source-pytorch/advanced/third_party/colossalai.rst b/docs/source-pytorch/advanced/third_party/colossalai.rst
new file mode 100644
index 0000000000000..5223bdc0ad60d
--- /dev/null
+++ b/docs/source-pytorch/advanced/third_party/colossalai.rst
@@ -0,0 +1,92 @@
+:orphan:
+
+###########
+Colossal-AI
+###########
+
+
+The Colossal-AI strategy implements ZeRO-DP with chunk-based memory management.
+With this chunk mechanism, really large models can be trained with a small number of GPUs.
+It supports a larger trainable model size and batch size than usual heterogeneous training by reducing CUDA memory fragmentation and CPU memory consumption.
+It also speeds up this kind of heterogeneous training by fully utilizing all available resources.
+
+When the chunk mechanism is enabled, a set of consecutive parameters is stored in a chunk, and the chunk is then sharded across different processes.
+This reduces communication and data transmission frequency and fully utilizes communication and PCI-E bandwidth, which makes training faster.
+
+Unlike traditional implementations, which adopt a static memory partition, we implemented a dynamic heterogeneous memory management system named Gemini.
+During the first training step, the warmup phase samples the maximum non-model data memory (memory usage except parameters, gradients, and optimizer states).
+In later training steps, it uses the collected memory usage information to evict chunks dynamically.
+Gemini allows you to fit much larger models within limited GPU memory.
+
+According to our benchmark results, we can train models with up to 24 billion parameters on a single GPU.
+
+You can install the Colossal-AI integration by running:
+
+.. code-block:: bash
+
+    pip install lightning-colossalai
+
+This will install both the `colossalai `_ package as well as the ``ColossalAIStrategy`` for the Lightning Trainer:
+
+.. code-block:: python
+
+    trainer = Trainer(strategy="colossalai", precision=16, devices=...)
+
+
+You can tune several settings by instantiating the strategy object yourself and passing options to it:
+
+.. code-block:: python
+
+    from lightning_colossalai import ColossalAIStrategy
+
+    strategy = ColossalAIStrategy(...)
+    trainer = Trainer(strategy=strategy, precision=16, devices=...)
+
+
+See a full benchmark example with a `GPT-2 model `_ of up to 24 billion parameters.
+
+.. note::
+
+    * The only accelerator ColossalAI supports is ``"gpu"``, but CPU resources will still be used when the placement policy is set to ``"auto"`` or ``"cpu"``.
+
+    * The only precision ColossalAI allows is 16-bit mixed precision (FP16).
+
+    * It supports only a single optimizer, which must be ``colossalai.nn.optimizer.CPUAdam`` or ``colossalai.nn.optimizer.HybridAdam``.
+      You can set ``adamw_mode`` to ``False`` to use plain Adam. Note that ``HybridAdam`` is highly optimized: it uses a fused CUDA kernel and a parallel CPU kernel.
+      ``HybridAdam`` is recommended, since it updates parameters on both GPU and CPU.
+
+    * Your model must be created using the :meth:`~pytorch_lightning.core.module.LightningModule.configure_sharded_model` method.
+
+    * ``ColossalAIStrategy`` doesn't support gradient accumulation as of now.
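+
+As a sketch adapted from the example previously shipped in the in-tree documentation, a model can create its layers inside ``configure_sharded_model`` and return a ``HybridAdam`` optimizer. The Hugging Face ``BertForSequenceClassification`` model and the learning rate below are purely illustrative, and ``training_step`` plus data loading are omitted for brevity:
+
+.. code-block:: python
+
+    from colossalai.nn.optimizer import HybridAdam
+    from pytorch_lightning import LightningModule, Trainer
+    from transformers import BertForSequenceClassification
+
+
+    class MyBert(LightningModule):
+        def __init__(self, lr=1e-5):
+            super().__init__()
+            self.lr = lr
+
+        def configure_sharded_model(self) -> None:
+            # create the model here so the strategy can shard it as it is built
+            self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
+
+        def configure_optimizers(self):
+            # use one of the ColossalAI-optimized optimizers
+            return HybridAdam(self.model.parameters(), self.lr)
+
+
+    model = MyBert()
+    trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
+    trainer.fit(model)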
+
+.. _colossal_placement_policy:
+
+Placement Policy
+================
+
+Placement policies help users fully exploit their GPU-CPU heterogeneous memory space for better training efficiency.
+There are three options for the placement policy: "cpu", "cuda", and "auto".
+
+When the placement policy is set to "cpu", all participating parameters are offloaded into CPU memory immediately at the end of every autograd operation.
+In this way, the "cpu" placement policy uses the least CUDA memory.
+It is the best choice for users who want to exceptionally enlarge their model size or training batch size.
+
+When using the "cuda" option, all parameters are placed in CUDA memory and no CPU resources are used during training.
+It is for users who have plenty of CUDA memory.
+
+The third option, "auto", enables Gemini.
+It monitors the consumption of CUDA memory during the warmup phase and collects the CUDA memory usage of all autograd operations.
+In later training steps, Gemini automatically manages the data transmission between GPU and CPU according to the collected CUDA memory usage information.
+It is the fastest option when there is enough CUDA memory.
+
+Here's an example of changing the placement policy to "cpu":
+
+.. code-block:: python
+
+    from lightning_colossalai import ColossalAIStrategy
+
+    model = MyModel()
+    my_strategy = ColossalAIStrategy(placement_policy="cpu")
+    trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=my_strategy)
+    trainer.fit(model)

diff --git a/docs/source-pytorch/extensions/strategy.rst b/docs/source-pytorch/extensions/strategy.rst
index 429131ef03944..034d508474745 100644
--- a/docs/source-pytorch/extensions/strategy.rst
+++ b/docs/source-pytorch/extensions/strategy.rst
@@ -23,7 +23,7 @@ plugin and other optional plugins such as the :ref:`ClusterEnvironment `_ itself).

-----------
+----

 *****************************
 Selecting a Built-in Strategy
@@ -69,9 +69,6 @@ The below table lists all relevant strategies available in Lightning with their
    * - Name
      - Class
      - Description
-   * - colossalai
-     - :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
-     - Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. `__
    * - fsdp
      - :class:`~pytorch_lightning.strategies.FSDPStrategy`
      - Strategy for Fully Sharded Data Parallel training. :ref:`Learn more. `
@@ -102,6 +99,28 @@ The below table lists all relevant strategies available in Lightning with their

 ----

+
+**********************
+Third-party Strategies
+**********************
+
+There are powerful third-party strategies that integrate well with Lightning but aren't maintained as part of the ``lightning`` package.
+
+.. list-table:: List of third-party strategy implementations
+   :widths: 20 20 20
+   :header-rows: 1
+
+   * - Name
+     - Package
+     - Description
+   * - colossalai
+     - `Lightning-AI/lightning-colossalai `_
+     - Colossal-AI provides a collection of parallel components for you. It aims to support you in writing your distributed deep learning models just like you write your model on your laptop. `Learn more. `__
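+
+Once such a package is installed, its strategy can typically be selected like a built-in one. A rough sketch using the Colossal-AI integration listed above (the precision and device settings are illustrative; see the strategy's own documentation for its constraints):
+
+.. code-block:: python
+
+    # pip install lightning-colossalai
+    from lightning_colossalai import ColossalAIStrategy
+
+    # select the strategy by its registered name ...
+    trainer = Trainer(strategy="colossalai", accelerator="gpu", devices=2, precision=16)
+
+    # ... or configure and pass the strategy object explicitly
+    trainer = Trainer(
+        strategy=ColossalAIStrategy(placement_policy="auto"),
+        accelerator="gpu",
+        devices=2,
+        precision=16,
+    )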
+
+
+----
+
+
 ************************
 Create a Custom Strategy
 ************************

diff --git a/requirements/pytorch/strategies.txt b/requirements/pytorch/strategies.txt
index c8a5c9531fe3d..4db2eb301121b 100644
--- a/requirements/pytorch/strategies.txt
+++ b/requirements/pytorch/strategies.txt
@@ -2,3 +2,4 @@
 # in case you want to preserve/enforce restrictions on the latest compatible version, add "strict" as an in-line comment
 deepspeed>=0.6.0, <0.8.0  # TODO: Include 0.8.x after https://github.com/microsoft/DeepSpeed/commit/b587c7e85470329ac25df7c7c2521ff9b2833db7 gets released
+lightning-colossalai==0.1.0dev

diff --git a/src/lightning/pytorch/CHANGELOG.md b/src/lightning/pytorch/CHANGELOG.md
index 988023827b3b9..420cb9213dc87 100644
--- a/src/lightning/pytorch/CHANGELOG.md
+++ b/src/lightning/pytorch/CHANGELOG.md
@@ -310,6 +310,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed the `QuantizationAwareTraining` callback ([#16750](https://github.com/Lightning-AI/lightning/pull/16750))

+- Removed the `ColossalAIStrategy` and `ColossalAIPrecisionPlugin` in favor of the new [lightning-colossalai](https://github.com/Lightning-AI/lightning-colossalai) package ([#16757](https://github.com/Lightning-AI/lightning/pull/16757), [#16778](https://github.com/Lightning-AI/lightning/pull/16778))
+
+
 ### Fixed

 - Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer ([#16693](https://github.com/Lightning-AI/lightning/pull/16693))