Merge branch 'master' into mypy_utilities_seed
stancld committed Sep 2, 2021
2 parents b1b62f0 + 530caef commit 0994577
Showing 32 changed files with 804 additions and 316 deletions.
23 changes: 15 additions & 8 deletions CHANGELOG.md
@@ -65,7 +65,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Loop customization:
* Added `Closure` and `AbstractClosure` classes ([#8642](https://github.com/PyTorchLightning/pytorch-lightning/pull/8642))

* Refactored `TrainingBatchLoop` and extracted `OptimizerLoop`, splitting off automatic optimization into its own loop ([#9191](https://github.com/PyTorchLightning/pytorch-lightning/pull/9191))

- Added support for saving and loading state of multiple callbacks of the same type ([#7187](https://github.com/PyTorchLightning/pytorch-lightning/pull/7187))

@@ -251,8 +251,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed deprecated property `ModelCheckpoint.period` in favor of `ModelCheckpoint.every_n_epochs` ([#9213](https://github.com/PyTorchLightning/pytorch-lightning/pull/9213))


- Removed deprecated property `LightningModule.datamodule` in favor of `Trainer.datamodule` ([#9233](https://github.com/PyTorchLightning/pytorch-lightning/pull/9233))


- Removed deprecated properties `DeepSpeedPlugin.cpu_offload*` in favor of `offload_optimizer`, `offload_parameters` and `pin_memory` ([#9244](https://github.com/PyTorchLightning/pytorch-lightning/pull/9244))


### Fixed

- Fixed save/load/resume from checkpoint for DeepSpeed Plugin (
@@ -267,19 +271,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed bug where data-loading functions were not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))


- Fixed a bug in the binary search mode of auto batch size scaling where exception was thrown if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))

## [1.4.5] - 2021-08-31

- Fixed reduction using `self.log(sync_dist=True, reduce_fx={mean,max})` ([#9142](https://github.com/PyTorchLightning/pytorch-lightning/pull/9142))


- Fixed not setting a default value for `max_epochs` if `max_time` was specified on the `Trainer` constructor ([#9072](https://github.com/PyTorchLightning/pytorch-lightning/pull/9072))
- Fixed the CometLogger, no longer modifies the metrics in place. Instead creates a copy of metrics before performing any operations ([#9150](https://github.com/PyTorchLightning/pytorch-lightning/pull/9150))
- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` ([#9239](https://github.com/PyTorchLightning/pytorch-lightning/pull/9239))


- Fixed the CometLogger, no longer modifies the metrics in place. Instead creates a copy of metrics before performing any operations ([#9150](https://github.com/PyTorchLightning/pytorch-lightning/pull/9150))
## [1.4.4] - 2021-08-24

- Fixed a bug in the binary search mode of auto batch size scaling where exception was raised if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))
- Fixed a bug causing logging with `log_gpu_memory='min_max'` not working ([#9013](https://github.com/PyTorchLightning/pytorch-lightning/pull/9013))

- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` ([#9239](https://github.com/PyTorchLightning/pytorch-lightning/pull/9239))

- Fixed wrapping issue: avoid wrapping LightningModule with data-parallel modules when not fitting in `DDPPlugin`, `DDPSpawnPlugin`, `DDPShardedPlugin`, `DDPSpawnShardedPlugin` ([#9096](https://github.com/PyTorchLightning/pytorch-lightning/pull/9096))


## [1.4.3] - 2021-08-17
@@ -312,7 +318,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `accelerator=ddp` choice for CPU ([#8645](https://github.com/PyTorchLightning/pytorch-lightning/pull/8645))


- Fixed a bug causing logging with `log_gpu_memory='min_max'` not working ([#9013](https://github.com/PyTorchLightning/pytorch-lightning/pull/9013))
- Fixed an issue with export to ONNX format when a model has multiple inputs ([#8800](https://github.com/PyTorchLightning/pytorch-lightning/pull/8800))


## [1.4.0] - 2021-07-27

150 changes: 150 additions & 0 deletions docs/source/advanced/fault_tolerant_training.rst
@@ -0,0 +1,150 @@
Fault-tolerant Training
=======================

.. warning:: Fault-tolerant Training is currently an experimental feature within Lightning.

Fault-tolerant Training is an internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure.
This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time.

Until now, a ``Trainer.fit()`` failing in the middle of an epoch during training or validation
would require the user to restart that epoch completely, losing any progress made within it.
This also made benchmarking non-reproducible, as optimization was interrupted and only partially restored.

With Fault Tolerant Training, when ``Trainer.fit()`` fails in the middle of an epoch during training or validation,
Lightning will restart exactly where it failed, and everything will be restored.

Fault Tolerance requires PyTorch 1.7 or higher and can be enabled as follows:

.. code-block:: bash

    PL_FAULT_TOLERANT_TRAINING=1 python script.py

Under The Hood
--------------

Lightning keeps track of the following state updates during training:

* Sampler indices and random states across multiple processes and workers: this enables restoring random transforms and batch fetching to the exact state they were in right before the failure.
* Optimizers, learning rate schedulers, callbacks, etc.
* Loop progression
* Logging internal states, so that metric reductions on epoch end are not affected by the failure and model selection can continue as expected.

Currently Supported
-------------------

If you are using a single map-based dataset by sub-classing :class:`~torch.utils.data.Dataset`, everything should work as expected.

.. code-block:: python

    import torch
    from torch.utils.data import Dataset, DataLoader


    class RandomDataset(Dataset):
        def __init__(self, size: int, length: int):
            self.len = length
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return self.len

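A minimal usage sketch is shown below; the ``LitModel`` module and the ``Trainer`` arguments are illustrative assumptions, not part of the feature itself:

.. code-block:: python

    from pytorch_lightning import Trainer

    # ``LitModel`` is a hypothetical LightningModule; any model works here
    model = LitModel()
    train_loader = DataLoader(RandomDataset(size=32, length=64), batch_size=4)

    # fault tolerance only needs PL_FAULT_TOLERANT_TRAINING=1 in the environment;
    # the training code itself is unchanged
    trainer = Trainer(max_epochs=2)
    trainer.fit(model, train_loader)
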
If you are using a single iterable-based dataset, there are some limitations. To support fault-tolerance, you will need to use and expose a sampler within your dataset.

For example, the following implementation of an iterable dataset sub-classing :class:`~torch.utils.data.IterableDataset` won't be supported.

.. code-block:: python

    import torch
    from torch.utils.data import IterableDataset, DataLoader


    # does not support fault-tolerant training!
    class RandomIterableDataset(IterableDataset):
        def __init__(self, size: int, count: int):
            self.count = count
            self.size = size

        def __iter__(self):
            for _ in range(self.count):
                yield torch.randn(self.size)

There are two primary reasons why Lightning can't support the previous implementation.

* Lightning cannot infer what you are iterating over, making it difficult to restart training. Lightning Fault Tolerant Training requires a :class:`~torch.utils.data.Sampler` to be used to encapsulate the fetching logic, requiring both the sampler and an iterator to be made available as attributes within the dataset, so Lightning can access them to track progress.
* Implementing the ``__next__`` method is required as it separates iterator creation from its consumption, which is essential for Lightning to wrap the iterator before it is consumed.

If your iterable dataset is implemented in the following way, everything should work as expected.

.. code-block:: python

    import torch
    from torch.utils.data import IterableDataset, DataLoader, RandomSampler


    class RandomIterableDataset(IterableDataset):
        def __init__(self, size: int, length: int):
            self.data = torch.randn(length, size)
            # expose the sampler as an attribute
            self.sampler = RandomSampler(range(length))

        def __iter__(self) -> "RandomIterableDataset":
            # expose the iterator from the sampler as an attribute
            # ``sampler_iter`` will be wrapped by Lightning to ensure
            # the random seeds and iteration count can be captured
            # for fast-forwarding the sampler when restarting
            self.sampler_iter = iter(self.sampler)
            return self

        def __next__(self) -> torch.Tensor:
            # call next on the iterator and get the associated data.
            # the logic here can become more complex, but the sampler
            # should be the central piece for fetching the next sample
            index = next(self.sampler_iter)
            return self.data[index]

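As a hedged sketch, the dataset above is consumed through a regular ``DataLoader``; Lightning wraps the exposed ``sampler_iter`` internally, so no extra code is required on the user side:

.. code-block:: python

    dataset = RandomIterableDataset(size=32, length=64)
    loader = DataLoader(dataset, batch_size=4)

    for batch in loader:
        # each batch is a tensor of shape (4, 32)
        ...
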
Current Known Limitations
-------------------------

If you are using multiple training dataloaders, Lightning won't be able to restore the random state properly.

.. testcode::

class LitModel(LightningModule):
def train_dataloader(self):
loader_a = torch.utils.data.DataLoader(range(8), batch_size=4)
loader_b = torch.utils.data.DataLoader(range(16), batch_size=4)
return {"loader_a": loader_a, "loader_b": loader_b}

def training_step(self, batch, batch_idx):
# access the data in the same format as the collection of dataloaders.
# dict, list are supported.
loader_a = batch["loader_a"]
loader_b = batch["loader_b"]


If you believe this to be useful, please open a `feature request <https://github.com/PyTorchLightning/pytorch-lightning/issues>`_.


Performance Impacts
-------------------

Fault-tolerant Training was tested on common and worst-case scenarios in order to measure the impact of the internal state tracking on the total training time.
On tiny models like the `BoringModel and RandomDataset <https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py>`_
which have virtually no data loading and processing overhead, we noticed up to 50% longer training time with fault tolerance enabled.
In this worst-case scenario, fault-tolerant training adds an overhead that is noticeable in comparison to the compute time for data loading itself.
However, for more realistic training workloads where data loading and preprocessing are more expensive, the constant overhead that fault tolerance adds becomes less noticeable or not noticeable at all.
For example, when training with ResNet50 on CIFAR10, we observed a 0.5% to 1% increase in training time depending on the batch size and the number of workers.

More detailed benchmarks will be shared in the future.

.. note::

    The extra time comes from several parts:

    - Capturing the iteration count and random states for each sample within each DataLoader worker and passing them through the data queue.
    - Extra logic to handle and store the dataloader states for each batch.
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -56,6 +56,7 @@ PyTorch Lightning Documentation
advanced/advanced_gpu
advanced/mixed_precision
common/weights_loading
advanced/fault_tolerant_training
advanced/checkpoint_io
common/optimizers
advanced/profiler
5 changes: 4 additions & 1 deletion pytorch_lightning/core/hooks.py
@@ -768,11 +768,14 @@ def transfer_batch_to_device(self, batch: Any, device: torch.device, dataloader_
Example::
def transfer_batch_to_device(self, batch, device):
def transfer_batch_to_device(self, batch, device, dataloader_idx):
if isinstance(batch, CustomBatch):
# move all tensors in your custom data structure to the device
batch.samples = batch.samples.to(device)
batch.targets = batch.targets.to(device)
elif dataloader_idx == 0:
# skip device transfer for the first dataloader or anything you wish
pass
else:
batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
return batch
21 changes: 4 additions & 17 deletions pytorch_lightning/core/lightning.py
@@ -66,7 +66,6 @@ class LightningModule(
# since none of these are important when using JIT, we are going to ignore them.
__jit_unused_properties__ = (
[
"datamodule",
"example_input_array",
"on_gpu",
"current_epoch",
@@ -104,7 +103,6 @@ def __init__(self, *args: Any, **kwargs: Any) -> None:

# optionally can be set by user
self._example_input_array = None
self._datamodule = None
self._current_fx_name: Optional[str] = None
self._current_dataloader_idx: Optional[int] = None
self._automatic_optimization: bool = True
@@ -203,15 +201,6 @@ def local_rank(self) -> int:
"""The index of the current process within a single node."""
return self.trainer.local_rank if self.trainer else 0

@property
def datamodule(self) -> Any:
warning_cache.deprecation(
"The `LightningModule.datamodule` property is deprecated in v1.3 and will be removed in v1.5."
" Access the datamodule through using `self.trainer.datamodule` instead.",
stacklevel=6,
)
return self._datamodule

@property
def loaded_optimizer_states_dict(self) -> dict:
warning_cache.deprecation(
@@ -230,10 +219,6 @@ def loaded_optimizer_states_dict(self, val: dict) -> None:
)
self._loaded_optimizer_states_dict = val

@datamodule.setter
def datamodule(self, datamodule: Any) -> None:
self._datamodule = datamodule

@property
def on_gpu(self):
"""
@@ -1858,7 +1843,10 @@ def to_onnx(self, file_path: Union[str, Path], input_sample: Optional[Any] = Non

if "example_outputs" not in kwargs:
self.eval()
kwargs["example_outputs"] = self(input_sample)
if isinstance(input_sample, Tuple):
kwargs["example_outputs"] = self(*input_sample)
else:
kwargs["example_outputs"] = self(input_sample)

torch.onnx.export(self, input_sample, file_path, **kwargs)
self.train(mode)
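
For context, a hedged usage sketch of the behaviour this change enables; the two-input ``MultiInputModel`` is an illustrative assumption:

    import torch

    # hypothetical LightningModule whose ``forward`` takes two tensors
    model = MultiInputModel()
    input_sample = (torch.randn(1, 32), torch.randn(1, 16))

    # the tuple is now unpacked when computing ``example_outputs``,
    # so models with multiple inputs can be exported directly
    model.to_onnx("model.onnx", input_sample, export_params=True)
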
@@ -1990,7 +1978,6 @@ def __getstate__(self) -> Dict[str, Any]:
state = dict(self.__dict__)
if self._should_prevent_trainer_and_dataloaders_deepcopy:
state["trainer"] = None
state["_datamodule"] = None
state.pop("train_dataloader", None)
state.pop("val_dataloader", None)
state.pop("test_dataloader", None)
14 changes: 14 additions & 0 deletions pytorch_lightning/core/mixins/hparams_mixin.py
@@ -126,12 +126,26 @@ def _to_hparams_dict(hp: Union[MutableMapping, Namespace, str]):

@property
def hparams(self) -> Union[AttributeDict, dict, Namespace]:
"""
The collection of hyperparameters saved with :meth:`save_hyperparameters`. It is mutable by the user.
For the frozen set of initial hyperparameters, use :attr:`hparams_initial`.
Returns:
Union[AttributeDict, dict, Namespace]: mutable hyperparameters dictionary
"""
if not hasattr(self, "_hparams"):
self._hparams = AttributeDict()
return self._hparams

@property
def hparams_initial(self) -> AttributeDict:
"""
The collection of hyperparameters saved with :meth:`save_hyperparameters`. These contents are read-only.
Manual updates to the saved hyperparameters can instead be performed through :attr:`hparams`.
Returns:
AttributeDict: immutable initial hyperparameters
"""
if not hasattr(self, "_hparams_initial"):
return AttributeDict()
# prevent any change
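
For context, a hedged sketch of how the two properties documented above differ; the ``LitModel`` below is an illustrative assumption:

    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):
        def __init__(self, learning_rate: float = 1e-3):
            super().__init__()
            # records ``learning_rate`` into both ``hparams`` and ``hparams_initial``
            self.save_hyperparameters()

    model = LitModel()
    model.hparams.learning_rate = 1e-2  # mutable, editable by the user
    print(model.hparams_initial)        # still reports the initial value, 1e-3
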
1 change: 1 addition & 0 deletions pytorch_lightning/loops/__init__.py
@@ -17,3 +17,4 @@
from pytorch_lightning.loops.dataloader import DataLoaderLoop, EvaluationLoop, PredictionLoop # noqa: F401
from pytorch_lightning.loops.epoch import EvaluationEpochLoop, PredictionEpochLoop, TrainingEpochLoop # noqa: F401
from pytorch_lightning.loops.fit_loop import FitLoop # noqa: F401
from pytorch_lightning.loops.optimizer.optimizer_loop import OptimizerLoop # noqa: F401
