docs/source/multi_gpu.rst (+81 -6)
@@ -612,6 +612,7 @@ This is useful when dealing with large Transformer based models, or in environme
Lightning currently offers the following methods to leverage model parallelism:
- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential <torch.nn.Sequential>` module across multiple GPUs, leverage checkpointing and microbatching for further memory improvements and device utilization)
Sharded Training
^^^^^^^^^^^^^^^^
@@ -666,7 +667,7 @@ To use Sharded Training, you need to first install FairScale using the command b
@@ -678,6 +679,80 @@ Sharded Training can work across all DDP variants by adding the additional ``--p
Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
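For example, enabling Sharded Training on top of a standard DDP run is a one-line change to the ``Trainer``; the snippet below is a minimal sketch, assuming the ``ddp_sharded`` plugin identifier.

.. code-block:: python

    from pytorch_lightning import Trainer

    # Sketch: enable Sharded Training by passing the ``ddp_sharded`` plugin string
    # alongside a regular DDP accelerator (identifier assumed, verify against your version).
    trainer = Trainer(gpus=4, accelerator="ddp", plugins="ddp_sharded")
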
----------
.. _sequential-parallelism:
Sequential Model Parallelism with Checkpointing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
We also provide auto-balancing techniques through FairScale to find an optimal balance for the model across GPUs.
In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimize device under-utilization automatically.
Reference: https://arxiv.org/abs/1811.06965
.. note:: ``DDPSequentialPlugin`` is currently supported only for PyTorch 1.6.
To get started, install FairScale through the extras with ``pip install pytorch-lightning["extra"]``.
To use Sequential Model Parallelism, you must define a :class:`nn.Sequential <torch.nn.Sequential>` module that defines the layers you wish to parallelize across GPUs.
This should be kept within the ``sequential_module`` variable within your ``LightningModule``, as shown in the sketch below.

.. code-block:: python

    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin
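    # The rest of this snippet is a sketch of typical usage: the layer sizes and the
    # ``balance`` value below are illustrative assumptions, not required settings.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from pytorch_lightning import LightningModule, Trainer


    class MyModel(LightningModule):
        def __init__(self):
            super().__init__()
            # Keep the layers you want pipelined inside ``sequential_module``.
            self.sequential_module = torch.nn.Sequential(
                torch.nn.Linear(32, 32),
                torch.nn.ReLU(),
                torch.nn.Linear(32, 2),
            )

        def forward(self, x):
            return self.sequential_module(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)


    # Toy data so the sketch is self-contained.
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    train_loader = DataLoader(dataset, batch_size=8)

    # ``balance`` lists how many layers of ``sequential_module`` to place on each GPU.
    model = MyModel()
    trainer = Trainer(
        gpus=2,
        accelerator="ddp",
        plugins=[DDPSequentialPlugin(balance=[2, 1])],
    )
    trainer.fit(model, train_loader)
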
We provide a minimal example of Sequential Model Parallelism using a convolutional model trained on CIFAR-10, split onto GPUs, `here <https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/basic_examples/conv_sequential_example.py>`_.
To run the example, you need to install `Bolts <https://github.com/PyTorchLightning/pytorch-lightning-bolts>`_. Install with ``pip install pytorch-lightning-bolts``.
When running the Sequential Model Parallelism example on 2 GPUs, we achieve the following memory savings:
.. list-table:: GPU Memory Utilization
   :widths: 25 25 50
   :header-rows: 1

   * - GPUs
     - Without Balancing
     - With Balancing
   * - GPU 0
     - 4436 MB
     - 1554 MB
   * - GPU 1
     - ~0
     - 994 MB
To run the example with Sequential Model Parallelism, invoke the example script directly; the command below is an illustrative sketch, and the exact flags are defined by the script itself:
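.. code-block:: bash

    # Illustrative sketch; check ``conv_sequential_example.py`` for the exact arguments it accepts.
    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential
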
@@ -728,17 +803,17 @@ Lightning supports the use of TorchElastic to enable fault-tolerant and elastic
.. code-block:: python

    Trainer(gpus=8, accelerator='ddp')
Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts; the command below is a sketch based on that quickstart:
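.. code-block:: bash

    # Sketch based on the TorchElastic quickstart; adjust ports and the advertised hostname for your cluster.
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379
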
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.