Commit

Merge remote-tracking branch 'upstream/master' into training_step_dataloader_iter
Yifu Wang committed Aug 16, 2021
2 parents 7fb52df + 89156b7 commit 5f8bdd5
Showing 76 changed files with 1,213 additions and 837 deletions.
17 changes: 1 addition & 16 deletions .azure-pipelines/ipu-tests.yml
@@ -72,26 +72,11 @@ jobs:
python -c "import poptorch; print(poptorch.__version__)"
displayName: "Check poptorch installation"
- bash: |
wget https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip -P legacy/
unzip -o legacy/checkpoints.zip -d legacy/
ls -l legacy/checkpoints/
displayName: 'Get legacy checkpoints'
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
export POPTORCH_WAIT_FOR_IPU=1
python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
python -m coverage run --source pytorch_lightning -m pytest tests/accelerators/test_ipu.py -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
env:
MKL_THREADING_LAYER: "GNU"
displayName: 'Testing: standard'
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
export POPTORCH_WAIT_FOR_IPU=1
bash tests/special_tests.sh
env:
MKL_THREADING_LAYER: "GNU"
displayName: 'Testing: special'
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `state_id` property to the `Callback` base class ([#6886](https://github.com/PyTorchLightning/pytorch-lightning/pull/6886))


- Progress tracking
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598))


- Added `batch_size` and `rank_zero_only` arguments for `log_dict` to match `log` ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628))


@@ -37,6 +41,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Fault-tolerant training:
* Added `FastForwardSampler` and `CaptureIterableDataset` injection to data loading utilities ([#8366](https://github.com/PyTorchLightning/pytorch-lightning/pull/8366))
* Added `LightningDataFetcher` to control fetching flow ([#8890](https://github.com/PyTorchLightning/pytorch-lightning/pull/8890))
* Added `SharedCycleIteratorState` to prevent infinite loop ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))


- Added `CheckpointIO` to expose checkpoint IO from training type plugin ([#8743](https://github.com/PyTorchLightning/pytorch-lightning/pull/8743))


### Changed
@@ -75,6 +84,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
(https://github.com/PyTorchLightning/pytorch-lightning/pull/8608))


- `Trainer.request_dataloader` now takes a `RunningStage` enum instance ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))

### Deprecated

- Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()`
@@ -123,8 +134,17 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed the deprecated `Trainer.truncated_bptt_steps` in favor of `LightningModule.truncated_bptt_steps` ([#8826](https://github.com/PyTorchLightning/pytorch-lightning/pull/8826))


- Removed `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` ([#8850](https://github.com/PyTorchLightning/pytorch-lightning/pull/8850))


- Removed the reset dataloader hooks from Training Plugins and Accelerators ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))



### Fixed

- Restored the original loaders if they were replaced by an entrypoint ([#8885](https://github.com/PyTorchLightning/pytorch-lightning/pull/8885))

- Fixed `trainer.fit_loop.split_idx` always returning `None` ([#8601](https://github.com/PyTorchLightning/pytorch-lightning/pull/8601))


@@ -152,9 +172,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed an issue with logger outputs not being finalized correctly after prediction runs ([#8333](https://github.com/PyTorchLightning/pytorch-lightning/issues/8333))


- Fixed `StochasticWeightAveraging` with a list of learning rates not applying them to each param group ([#8747](https://github.com/PyTorchLightning/pytorch-lightning/issues/8747))


- Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer ([#8804](https://github.com/PyTorchLightning/pytorch-lightning/pull/8804/))


- Fixed plateau scheduler stepping on incomplete epoch ([#8861](https://github.com/PyTorchLightning/pytorch-lightning/pull/8861))


- Fixed infinite loop with CycleIterator and multiple loaders ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))


- Fixed a bug where data-loading functions were not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))


## [1.4.0] - 2021-07-27

### Added
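As an illustration of the `log_dict` entry above ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628)), here is a minimal sketch of how the new arguments could be used. The model and metric names are hypothetical; the only assumption taken from the changelog is that `log_dict` now mirrors `log`'s `batch_size` and `rank_zero_only` arguments.

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=-1) == y).float().mean()
        # per #8628, log_dict now accepts batch_size (and rank_zero_only), mirroring self.log
        self.log_dict({"val_loss": loss, "val_acc": acc}, batch_size=x.size(0))
```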
2 changes: 1 addition & 1 deletion benchmarks/test_basic_parity.py
@@ -51,7 +51,7 @@ def assert_parity_absolute(pl_values, pt_values, norm_by: float = 1, max_diff: f
"cls_model,max_diff_speed,max_diff_memory,num_epochs,num_runs",
[
(ParityModuleRNN, 0.05, 0.001, 4, 3),
(ParityModuleMNIST, 0.25, 0.001, 4, 3), # todo: lower this thr
(ParityModuleMNIST, 0.3, 0.001, 4, 3), # todo: lower this thr
pytest.param(ParityModuleCIFAR, 4.0, 0.0002, 2, 2, marks=_MARK_SHORT_BM),
],
)
52 changes: 52 additions & 0 deletions docs/source/advanced/checkpoint_io.rst
@@ -0,0 +1,52 @@
Custom Checkpointing IO
=======================

.. warning:: The Checkpoint IO API is experimental and subject to change.

Lightning supports modifying the checkpointing save/load functionality through the ``CheckpointIO``. This encapsulates the save/load logic
that is managed by the ``TrainingTypePlugin``.

``CheckpointIO`` can be extended to include your custom save/load functionality to and from a path. The ``CheckpointIO`` object can be passed to either a ``Trainer`` object or a ``TrainingTypePlugin`` as shown below.

.. code-block:: python

    from pathlib import Path
    from typing import Any, Dict, Optional, Union

    import torch  # needed for torch.device below

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.plugins import CheckpointIO, SingleDevicePlugin


    class CustomCheckpointIO(CheckpointIO):
        def save_checkpoint(
            self, checkpoint: Dict[str, Any], path: Union[str, Path], storage_options: Optional[Any] = None
        ) -> None:
            ...

        def load_checkpoint(self, path: Union[str, Path], storage_options: Optional[Any] = None) -> Dict[str, Any]:
            ...


    custom_checkpoint_io = CustomCheckpointIO()

    # Pass into the Trainer object (MyModel is your LightningModule)
    model = MyModel()
    trainer = Trainer(
        plugins=[custom_checkpoint_io],
        callbacks=ModelCheckpoint(save_last=True),
    )
    trainer.fit(model)

    # Pass into a TrainingTypePlugin
    model = MyModel()
    device = torch.device("cpu")
    trainer = Trainer(
        plugins=SingleDevicePlugin(device, checkpoint_io=custom_checkpoint_io),
        callbacks=ModelCheckpoint(save_last=True),
    )
    trainer.fit(model)
.. note::

    Some ``TrainingTypePlugins`` do not support custom ``CheckpointIO`` as the checkpointing logic is not modifiable.
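Beyond the stubbed ``...`` bodies in the new docs example above, a minimal concrete implementation could look like the following sketch. The class name is hypothetical; it simply round-trips the checkpoint dict through `torch.save`/`torch.load` using the same method signatures shown in the docs.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union

import torch

from pytorch_lightning.plugins import CheckpointIO


class TorchFileCheckpointIO(CheckpointIO):
    """Hypothetical CheckpointIO that round-trips checkpoints through torch.save/torch.load."""

    def save_checkpoint(
        self, checkpoint: Dict[str, Any], path: Union[str, Path], storage_options: Optional[Any] = None
    ) -> None:
        # persist the checkpoint dict to disk; storage_options is unused in this sketch
        torch.save(checkpoint, path)

    def load_checkpoint(self, path: Union[str, Path], storage_options: Optional[Any] = None) -> Dict[str, Any]:
        # load back onto CPU so the training type plugin can place tensors where it needs them
        return torch.load(path, map_location="cpu")
```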
12 changes: 12 additions & 0 deletions docs/source/api_references.rst
@@ -127,6 +127,18 @@ Cluster Environments
KubeflowEnvironment
SLURMEnvironment

Checkpoint IO Plugins
^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: pytorch_lightning.plugins.io

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

CheckpointIO
TorchCheckpointIO

Profiler API
------------
12 changes: 0 additions & 12 deletions docs/source/common/lightning_module.rst
@@ -853,18 +853,6 @@ validation_epoch_end
.. automethod:: pytorch_lightning.core.lightning.LightningModule.validation_epoch_end
:noindex:

write_prediction
~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.write_prediction
:noindex:

write_prediction_dict
~~~~~~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.write_prediction_dict
:noindex:

------------

Properties
2 changes: 1 addition & 1 deletion docs/source/common/optimizers.rst
@@ -185,7 +185,7 @@ defined in your :meth:`~pytorch_lightning.core.lightning.LightningModule.configu
.. warning::
* Before 1.3, Lightning automatically called ``lr_scheduler.step()`` in both automatic and manual optimization. From
1.3, ``lr_scheduler.step()`` is now for the user to call at arbitrary intervals.
* Note that the ``lr_dict`` keys, such as ``"step"`` and ``""interval"``, will be ignored even if they are provided in
* Note that the ``lr_dict`` keys, such as ``"step"`` and ``"interval"``, will be ignored even if they are provided in
your :meth:`~pytorch_lightning.core.lightning.LightningModule.configure_optimizers` during manual optimization.

Here is an example calling ``lr_scheduler.step()`` every step.
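The example referenced by "Here is an example calling ``lr_scheduler.step()`` every step" is collapsed in this diff; the following is only a minimal sketch of that manual-optimization pattern, assuming a single optimizer/scheduler pair returned from `configure_optimizers` and a hypothetical `compute_loss` helper.

```python
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        sch = self.lr_schedulers()

        loss = self.compute_loss(batch)  # hypothetical loss helper
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()

        # since 1.3, stepping the scheduler in manual optimization is up to the user
        sch.step()
        return loss
```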
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -55,6 +55,7 @@ PyTorch Lightning Documentation
advanced/multi_gpu
advanced/advanced_gpu
common/weights_loading
advanced/checkpoint_io
common/optimizers
advanced/profiler
advanced/sequences
22 changes: 2 additions & 20 deletions pytorch_lightning/accelerators/accelerator.py
@@ -155,9 +155,7 @@ def teardown(self) -> None:
"""
self.training_type_plugin.teardown()

def batch_to_device(
self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: Optional[int] = None
) -> Any:
def batch_to_device(self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: int = 0) -> Any:
"""Moves the batch to the correct device.
The returned batch is of the same type as the input batch, just having all tensors on the correct device.
@@ -171,7 +169,7 @@ def batch_to_device(

if model is not None and not isinstance(self.training_type_plugin, DataParallelPlugin):
# no need to transfer batch to device in DP mode
return model._apply_batch_transfer_handler(batch, device, dataloader_idx)
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)

return move_data_to_device(batch, device)

@@ -410,22 +408,6 @@ def process_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[I
"""
return self.training_type_plugin.process_dataloader(dataloader)

def on_reset_train_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the train dataloader."""
return self.training_type_plugin.on_reset_train_dataloader(dataloader)

def on_reset_val_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the val dataloader."""
return self.training_type_plugin.on_reset_val_dataloader(dataloader)

def on_reset_test_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the test dataloader."""
return self.training_type_plugin.on_reset_test_dataloader(dataloader)

def on_reset_predict_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the predict dataloader."""
return self.training_type_plugin.on_reset_predict_dataloader(dataloader)

@property
def results(self) -> Any:
"""
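The `batch_to_device` change above is what the user-facing batch-transfer hooks ultimately see: `dataloader_idx` now arrives as a plain int (default 0) rather than possibly `None`. A sketch, assuming the 1.4-era hook signature that already accepts `dataloader_idx`:

```python
import pytorch_lightning as pl


class TransferAwareModel(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        # delegate the actual move to the default logic, then report which loader it came from
        batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
        print(f"moved a batch from dataloader {dataloader_idx} to {device}")
        return batch
```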
12 changes: 9 additions & 3 deletions pytorch_lightning/callbacks/early_stopping.py
@@ -91,7 +91,7 @@ def __init__(
check_finite: bool = True,
stopping_threshold: Optional[float] = None,
divergence_threshold: Optional[float] = None,
check_on_train_epoch_end: bool = True,
check_on_train_epoch_end: Optional[bool] = None,
):
super().__init__()
self.min_delta = min_delta
@@ -120,6 +120,12 @@ def __init__(
)
self.monitor = monitor or "early_stop_on"

def on_pretrain_routine_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
if self._check_on_train_epoch_end is None:
# if the user runs validation multiple times per training epoch, we try to check after
# validation instead of on train epoch end
self._check_on_train_epoch_end = trainer.val_check_interval == 1.0

def _validate_condition_metric(self, logs):
monitor_val = logs.get(self.monitor)

@@ -191,7 +197,7 @@ def _run_early_stopping_check(self, trainer: "pl.Trainer") -> None:
# when in dev debugging
trainer.dev_debugger.track_early_stopping_history(self, current)

should_stop, reason = self._evalute_stopping_criteria(current)
should_stop, reason = self._evaluate_stopping_criteria(current)

# stop every ddp process if any world process decides to stop
should_stop = trainer.training_type_plugin.reduce_boolean_decision(should_stop)
@@ -201,7 +207,7 @@ def _run_early_stopping_check(self, trainer: "pl.Trainer") -> None:
if reason and self.verbose:
self._log_info(trainer, reason)

def _evalute_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
def _evaluate_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
should_stop = False
reason = None
if self.check_finite and not torch.isfinite(current):
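A sketch of how the new `check_on_train_epoch_end=None` default above plays out, assuming a model that logs a `val_loss` metric:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# With the default of None, the callback resolves the check location itself:
# validating mid-epoch (val_check_interval < 1.0) moves the early-stopping check
# to the end of validation rather than the end of the training epoch.
early_stop = EarlyStopping(monitor="val_loss", patience=3)
trainer = Trainer(callbacks=[early_stop], val_check_interval=0.5, max_epochs=20)
```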
4 changes: 2 additions & 2 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -332,9 +332,9 @@ def on_train_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
rank_zero_info("Saving latest checkpoint...")
# as we advance one step at end of training, we use `global_step - 1` to avoid saving duplicates
monitor_candidates = self._monitor_candidates(trainer, trainer.current_epoch, trainer.global_step - 1)
trainer.train_loop.global_step -= 1
trainer.fit_loop.global_step -= 1
self._save_last_checkpoint(trainer, monitor_candidates)
trainer.train_loop.global_step += 1
trainer.fit_loop.global_step += 1

def on_save_checkpoint(
self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", checkpoint: Dict[str, Any]
25 changes: 9 additions & 16 deletions pytorch_lightning/callbacks/stochastic_weight_avg.py
@@ -166,25 +166,18 @@ def on_train_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningMo
# move average model to request device.
self._average_model = self._average_model.to(self._device or pl_module.device)

optimizers = trainer.optimizers
optimizer = trainer.optimizers[0]
if self._swa_lrs is None:
self._swa_lrs = [param_group["lr"] for param_group in optimizer.param_groups]
if isinstance(self._swa_lrs, float):
self._swa_lrs = [self._swa_lrs] * len(optimizer.param_groups)

for param_group in optimizers[0].param_groups:
if self._swa_lrs is None:
initial_lr = param_group["lr"]

elif isinstance(self._swa_lrs, float):
initial_lr = self._swa_lrs

else:
initial_lr = self._swa_lrs[0]

param_group["initial_lr"] = initial_lr

self._swa_lrs = initial_lr
for lr, group in zip(self._swa_lrs, optimizer.param_groups):
group["initial_lr"] = lr

self._swa_scheduler = SWALR(
optimizers[0],
swa_lr=initial_lr,
optimizer,
swa_lr=self._swa_lrs,
anneal_epochs=self._annealing_epochs,
anneal_strategy=self._annealing_strategy,
last_epoch=trainer.max_epochs if self._annealing_strategy == "cos" else -1,
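As a usage sketch of the simplified logic above (and the per-param-group fix noted in the changelog, #8747), assuming `swa_lrs` accepts either a float or a list with one entry per parameter group:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# a float is broadcast to every parameter group; a list is zipped against the
# optimizer's param_groups, setting one initial_lr per group
swa = StochasticWeightAveraging(swa_lrs=[1e-2, 1e-3])
trainer = Trainer(callbacks=[swa], max_epochs=20)
```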
2 changes: 1 addition & 1 deletion pytorch_lightning/core/decorators.py
@@ -97,7 +97,7 @@ def inner_fn(self, *args, **kwargs):

if not pre_layer_count == post_layer_count:
rank_zero_warn(
f"The model layers do not match after moving to the target device."
"The model layers do not match after moving to the target device."
" If your model employs weight sharing on TPU,"
" please tie your weights using the `on_post_move_to_device` model hook.\n"
f"Layer count: [Before: {pre_layer_count} After: {post_layer_count}]"