Merge branch 'master' into mypy_utilities_seed
stancld committed Sep 2, 2021
2 parents b1b62f0 + 530caef commit 0994577
Showing 32 changed files with 804 additions and 316 deletions.
23 changes: 15 additions & 8 deletions CHANGELOG.md
@@ -65,7 +65,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Loop customization:
* Added `Closure` and `AbstractClosure` classes ([#8642](https://github.com/PyTorchLightning/pytorch-lightning/pull/8642))

* Refactored `TrainingBatchLoop` and extracted `OptimizerLoop`, splitting off automatic optimization into its own loop ([#9191](https://github.com/PyTorchLightning/pytorch-lightning/pull/9191))

- Added support for saving and loading state of multiple callbacks of the same type ([#7187](https://github.com/PyTorchLightning/pytorch-lightning/pull/7187))

@@ -251,8 +251,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed deprecated property `ModelCheckpoint.period` in favor of `ModelCheckpoint.every_n_epochs` ([#9213](https://github.com/PyTorchLightning/pytorch-lightning/pull/9213))


- Removed deprecated property `LightningModule.datamodule` in favor of `Trainer.datamodule` ([#9233](https://github.com/PyTorchLightning/pytorch-lightning/pull/9233))


- Removed deprecated properties `DeepSpeedPlugin.cpu_offload*` in favor of `offload_optimizer`, `offload_parameters` and `pin_memory` ([#9244](https://github.com/PyTorchLightning/pytorch-lightning/pull/9244))


### Fixed

- Fixed save/load/resume from checkpoint for DeepSpeed Plugin (
@@ -267,19 +271,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed bug where data-loading functions were not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))


- Fixed a bug in the binary search mode of auto batch size scaling where exception was thrown if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))

## [1.4.5] - 2021-08-31

- Fixed reduction using `self.log(sync_dist=True, reduce_fx={mean,max})` ([#9142](https://github.com/PyTorchLightning/pytorch-lightning/pull/9142))


- Fixed not setting a default value for `max_epochs` if `max_time` was specified on the `Trainer` constructor ([#9072](https://github.com/PyTorchLightning/pytorch-lightning/pull/9072))
- Fixed the CometLogger, no longer modifies the metrics in place. Instead creates a copy of metrics before performing any operations ([#9150](https://github.com/PyTorchLightning/pytorch-lightning/pull/9150))
- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` ([#9239](https://github.com/PyTorchLightning/pytorch-lightning/pull/9239))


- Fixed the CometLogger, no longer modifies the metrics in place. Instead creates a copy of metrics before performing any operations ([#9150](https://github.com/PyTorchLightning/pytorch-lightning/pull/9150))
## [1.4.4] - 2021-08-24

- Fixed a bug in the binary search mode of auto batch size scaling where exception was raised if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))
- Fixed a bug causing logging with `log_gpu_memory='min_max'` not working ([#9013](https://github.com/PyTorchLightning/pytorch-lightning/pull/9013))

- Fixed `DDP` "CUDA error: initialization error" due to a `copy` instead of `deepcopy` on `ResultCollection` ([#9239](https://github.com/PyTorchLightning/pytorch-lightning/pull/9239))

- Fixed wrapping issue: avoid wrapping LightningModule with data-parallel modules when not fitting in `DDPPlugin`, `DDPSpawnPlugin`, `DDPShardedPlugin`, `DDPSpawnShardedPlugin` ([#9096](https://github.com/PyTorchLightning/pytorch-lightning/pull/9096))


## [1.4.3] - 2021-08-17
@@ -312,7 +318,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `accelerator=ddp` choice for CPU ([#8645](https://github.com/PyTorchLightning/pytorch-lightning/pull/8645))


- Fixed a bug causing logging with `log_gpu_memory='min_max'` not working ([#9013](https://github.com/PyTorchLightning/pytorch-lightning/pull/9013))
- Fixed an issue with export to ONNX format when a model has multiple inputs ([#8800](https://github.com/PyTorchLightning/pytorch-lightning/pull/8800))


## [1.4.0] - 2021-07-27

150 changes: 150 additions & 0 deletions docs/source/advanced/fault_tolerant_training.rst
@@ -0,0 +1,150 @@
Fault-tolerant Training
=======================

.. warning:: Fault-tolerant Training is currently an experimental feature within Lightning.

Fault-tolerant Training is an internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure.
This is particularly useful when training in the cloud on preemptible instances, which can shut down at any time.

Until now, a ``Trainer.fit()`` failing in the middle of an epoch during training or validation
would require the user to restart that epoch completely, losing any progress made within it.
This also made benchmarking non-reproducible, as optimization was interrupted and only partially restored.

With Fault Tolerant Training, when ``Trainer.fit()`` fails in the middle of an epoch during training or validation,
Lightning will restart exactly where it failed, and everything will be restored.

Fault Tolerance requires PyTorch 1.7 or higher and can be enabled as follows:

.. code-block:: bash

    PL_FAULT_TOLERANT_TRAINING=1 python script.py

Under The Hood
--------------

Lightning keeps track of the following state updates during training:

* Sampler indices and random states across multiple processes and workers: this enables restoring random transforms and batch fetching to the exact state they were in right before the failure.
* Optimizers, learning rate schedulers, callbacks, etc.
* Loop progression
* Logging internal states, so that metric reductions on epoch end are not affected by the failure and model selection can continue as expected.

Currently Supported
-------------------

If you are using a single map-based dataset by sub-classing :class:`~torch.utils.data.Dataset`, everything should work as expected.

.. code-block:: python

    import torch
    from torch.utils.data import Dataset, DataLoader


    class RandomDataset(Dataset):
        def __init__(self, size: int, length: int):
            self.len = length
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return self.len

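A minimal usage sketch is shown below; the ``LitModel`` module and the ``Trainer`` arguments are illustrative assumptions, not part of the feature itself:

.. code-block:: python

    from pytorch_lightning import Trainer

    # ``LitModel`` is a hypothetical LightningModule; any model works here
    model = LitModel()
    train_loader = DataLoader(RandomDataset(size=32, length=64), batch_size=4)

    # fault tolerance only needs PL_FAULT_TOLERANT_TRAINING=1 in the environment;
    # the training code itself is unchanged
    trainer = Trainer(max_epochs=2)
    trainer.fit(model, train_loader)
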
If you are using a single iterable-based dataset, there are some limitations. To support fault-tolerance, you will need to use and expose a sampler within your dataset.

For example, the following implementation of an iterable dataset sub-classing :class:`~torch.utils.data.IterableDataset` won't be supported.

.. code-block:: python

    import torch
    from torch.utils.data import IterableDataset, DataLoader


    # does not support fault-tolerant training!
    class RandomIterableDataset(IterableDataset):
        def __init__(self, size: int, count: int):
            self.count = count
            self.size = size

        def __iter__(self):
            for _ in range(self.count):
                yield torch.randn(self.size)

There are two primary reasons why Lightning can't support the previous implementation.

* Lightning cannot infer what you are iterating over, making it difficult to restart training. Lightning Fault Tolerant Training requires a :class:`~torch.utils.data.Sampler` to be used to encapsulate the fetching logic, requiring both the sampler and an iterator to be made available as attributes within the dataset, so Lightning can access them to track progress.
* Implementing the ``__next__`` method is required as it separates iterator creation from its consumption, which is essential for Lightning to wrap the iterator before it is consumed.

If your iterable dataset is implemented in the following way, everything should work as expected.

.. code-block:: python

    import torch
    from torch.utils.data import IterableDataset, DataLoader, RandomSampler


    class RandomIterableDataset(IterableDataset):
        def __init__(self, size: int, length: int):
            self.data = torch.randn(length, size)
            # expose the sampler as an attribute
            self.sampler = RandomSampler(range(length))

        def __iter__(self) -> "RandomIterableDataset":
            # expose the iterator from the sampler as an attribute
            # ``sampler_iter`` will be wrapped by Lightning to ensure
            # the random seeds and iteration count can be captured
            # for fast-forwarding the sampler when restarting
            self.sampler_iter = iter(self.sampler)
            return self

        def __next__(self) -> torch.Tensor:
            # call next on the iterator and get the associated data.
            # the logic here can become more complex, but the sampler
            # should be the central piece for fetching the next sample
            index = next(self.sampler_iter)
            return self.data[index]

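As a hedged sketch, the dataset above is consumed through a regular ``DataLoader``; Lightning wraps the exposed ``sampler_iter`` internally, so no extra code is required on the user side:

.. code-block:: python

    dataset = RandomIterableDataset(size=32, length=64)
    loader = DataLoader(dataset, batch_size=4)

    for batch in loader:
        # each batch is a tensor of shape (4, 32)
        ...
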
Current Known Limitations
-------------------------

If you are using multiple training dataloaders, Lightning won't be able to restore the random state properly.

.. testcode::

class LitModel(LightningModule):
def train_dataloader(self):
loader_a = torch.utils.data.DataLoader(range(8), batch_size=4)
loader_b = torch.utils.data.DataLoader(range(16), batch_size=4)
return {"loader_a": loader_a, "loader_b": loader_b}

def training_step(self, batch, batch_idx):
# access the data in the same format as the collection of dataloaders.
# dict, list are supported.
loader_a = batch["loader_a"]
loader_b = batch["loader_b"]


If you believe this to be useful, please open a `feature request <https://github.com/PyTorchLightning/pytorch-lightning/issues>`_.


Performance Impacts
-------------------

Fault-tolerant Training was tested on common and worst-case scenarios in order to measure the impact of the internal state tracking on the total training time.
On tiny models like the `BoringModel and RandomDataset <https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py>`_
which have virtually no data loading and processing overhead, we noticed up to 50% longer training time with fault tolerance enabled.
In this worst-case scenario, fault-tolerant training adds an overhead that is noticeable in comparison to the compute time for data loading itself.
However, for more realistic training workloads where data loading and preprocessing are more expensive, the constant overhead that fault tolerance adds becomes less noticeable or not noticeable at all.
For example, when training with ResNet50 on CIFAR10, we observed a 0.5% to 1% increase in training time depending on the batch size and the number of workers.

More detailed benchmarks will be shared in the future.

.. note::

    The extra time comes from several parts:

    - Capturing the iteration count and random states for each sample within each DataLoader worker and passing them through the data queue.
    - Extra logic to handle and store the dataloader states for each batch.
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -56,6 +56,7 @@ PyTorch Lightning Documentation
advanced/advanced_gpu
advanced/mixed_precision
common/weights_loading
advanced/fault_tolerant_training
advanced/checkpoint_io
common/optimizers
advanced/profiler
5 changes: 4 additions & 1 deletion pytorch_lightning/core/hooks.py
@@ -768,11 +768,14 @@ def transfer_batch_to_device(self, batch: Any, device: torch.device, dataloader_
Example::
def transfer_batch_to_device(self, batch, device):
def transfer_batch_to_device(self, batch, device, dataloader_idx):
if isinstance(batch, CustomBatch):
# move all tensors in your custom data structure to the device
batch.samples = batch.samples.to(device)
batch.targets = batch.targets.to(device)
elif dataloader_idx == 0:
# skip device transfer for the first dataloader or anything you wish
pass
else:
batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
return batch
21 changes: 4 additions & 17 deletions pytorch_lightning/core/lightning.py
@@ -66,7 +66,6 @@ class LightningModule(
# since none of these are important when using JIT, we are going to ignore them.
__jit_unused_properties__ = (
[
"datamodule",
"example_input_array",
"on_gpu",
"current_epoch",
@@ -104,7 +103,6 @@ def __init__(self, *args: Any, **kwargs: Any) -> None:

# optionally can be set by user
self._example_input_array = None
self._datamodule = None
self._current_fx_name: Optional[str] = None
self._current_dataloader_idx: Optional[int] = None
self._automatic_optimization: bool = True
@@ -203,15 +201,6 @@ def local_rank(self) -> int:
"""The index of the current process within a single node."""
return self.trainer.local_rank if self.trainer else 0

@property
def datamodule(self) -> Any:
warning_cache.deprecation(
"The `LightningModule.datamodule` property is deprecated in v1.3 and will be removed in v1.5."
" Access the datamodule through using `self.trainer.datamodule` instead.",
stacklevel=6,
)
return self._datamodule

@property
def loaded_optimizer_states_dict(self) -> dict:
warning_cache.deprecation(
@@ -230,10 +219,6 @@ def loaded_optimizer_states_dict(self, val: dict) -> None:
)
self._loaded_optimizer_states_dict = val

@datamodule.setter
def datamodule(self, datamodule: Any) -> None:
self._datamodule = datamodule

@property
def on_gpu(self):
"""
@@ -1858,7 +1843,10 @@ def to_onnx(self, file_path: Union[str, Path], input_sample: Optional[Any] = Non

if "example_outputs" not in kwargs:
self.eval()
kwargs["example_outputs"] = self(input_sample)
if isinstance(input_sample, Tuple):
kwargs["example_outputs"] = self(*input_sample)
else:
kwargs["example_outputs"] = self(input_sample)

torch.onnx.export(self, input_sample, file_path, **kwargs)
self.train(mode)
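
For context, a hedged usage sketch of the behaviour this change enables; the two-input ``MultiInputModel`` is an illustrative assumption:

    import torch

    # hypothetical LightningModule whose ``forward`` takes two tensors
    model = MultiInputModel()
    input_sample = (torch.randn(1, 32), torch.randn(1, 16))

    # the tuple is now unpacked when computing ``example_outputs``,
    # so models with multiple inputs can be exported directly
    model.to_onnx("model.onnx", input_sample, export_params=True)
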
@@ -1990,7 +1978,6 @@ def __getstate__(self) -> Dict[str, Any]:
state = dict(self.__dict__)
if self._should_prevent_trainer_and_dataloaders_deepcopy:
state["trainer"] = None
state["_datamodule"] = None
state.pop("train_dataloader", None)
state.pop("val_dataloader", None)
state.pop("test_dataloader", None)
14 changes: 14 additions & 0 deletions pytorch_lightning/core/mixins/hparams_mixin.py
@@ -126,12 +126,26 @@ def _to_hparams_dict(hp: Union[MutableMapping, Namespace, str]):

@property
def hparams(self) -> Union[AttributeDict, dict, Namespace]:
"""
The collection of hyperparameters saved with :meth:`save_hyperparameters`. It is mutable by the user.
For the frozen set of initial hyperparameters, use :attr:`hparams_initial`.
Returns:
Union[AttributeDict, dict, Namespace]: mutable hyperparameters dictionary
"""
if not hasattr(self, "_hparams"):
self._hparams = AttributeDict()
return self._hparams

@property
def hparams_initial(self) -> AttributeDict:
"""
The collection of hyperparameters saved with :meth:`save_hyperparameters`. These contents are read-only.
Manual updates to the saved hyperparameters can instead be performed through :attr:`hparams`.
Returns:
AttributeDict: immutable initial hyperparameters
"""
if not hasattr(self, "_hparams_initial"):
return AttributeDict()
# prevent any change
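
For context, a hedged sketch of how the two properties documented above differ; the ``LitModel`` below is an illustrative assumption:

    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):
        def __init__(self, learning_rate: float = 1e-3):
            super().__init__()
            # records ``learning_rate`` into both ``hparams`` and ``hparams_initial``
            self.save_hyperparameters()

    model = LitModel()
    model.hparams.learning_rate = 1e-2  # mutable, editable by the user
    print(model.hparams_initial)        # still reports the initial value, 1e-3
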
1 change: 1 addition & 0 deletions pytorch_lightning/loops/__init__.py
@@ -17,3 +17,4 @@
from pytorch_lightning.loops.dataloader import DataLoaderLoop, EvaluationLoop, PredictionLoop # noqa: F401
from pytorch_lightning.loops.epoch import EvaluationEpochLoop, PredictionEpochLoop, TrainingEpochLoop # noqa: F401
from pytorch_lightning.loops.fit_loop import FitLoop # noqa: F401
from pytorch_lightning.loops.optimizer.optimizer_loop import OptimizerLoop # noqa: F401
