Commit

Merge remote-tracking branch 'upstream/master' into training_step_dataloader_iter
Yifu Wang committed Aug 16, 2021
2 parents 7fb52df + 89156b7 commit 5f8bdd5
Showing 76 changed files with 1,213 additions and 837 deletions.
17 changes: 1 addition & 16 deletions .azure-pipelines/ipu-tests.yml
@@ -72,26 +72,11 @@ jobs:
python -c "import poptorch; print(poptorch.__version__)"
displayName: "Check poptorch installation"
- bash: |
wget https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip -P legacy/
unzip -o legacy/checkpoints.zip -d legacy/
ls -l legacy/checkpoints/
displayName: 'Get legacy checkpoints'
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
export POPTORCH_WAIT_FOR_IPU=1
python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
python -m coverage run --source pytorch_lightning -m pytest tests/accelerators/test_ipu.py -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
env:
MKL_THREADING_LAYER: "GNU"
displayName: 'Testing: standard'
- bash: |
source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
export POPTORCH_WAIT_FOR_IPU=1
bash tests/special_tests.sh
env:
MKL_THREADING_LAYER: "GNU"
displayName: 'Testing: special'
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `state_id` property to the `Callback` base class ([#6886](https://github.com/PyTorchLightning/pytorch-lightning/pull/6886))


- Progress tracking
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598))


- Added `batch_size` and `rank_zero_only` arguments for `log_dict` to match `log` ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628))


@@ -37,6 +41,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Fault-tolerant training:
* Added `FastForwardSampler` and `CaptureIterableDataset` injection to data loading utilities ([#8366](https://github.com/PyTorchLightning/pytorch-lightning/pull/8366))
* Added `LightningDataFetcher` to control fetching flow ([#8890](https://github.com/PyTorchLightning/pytorch-lightning/pull/8890))
* Added `SharedCycleIteratorState` to prevent infinite loop ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))


- Added `CheckpointIO` to expose checkpoint IO from training type plugin ([#8743](https://github.com/PyTorchLightning/pytorch-lightning/pull/8743))


### Changed
@@ -75,6 +84,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
(https://github.com/PyTorchLightning/pytorch-lightning/pull/8608))


- `Trainer.request_dataloader` now takes a `RunningStage` enum instance ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))

### Deprecated

- Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()`
@@ -123,8 +134,17 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed the deprecated `Trainer.truncated_bptt_steps` in favor of `LightningModule.truncated_bptt_steps` ([#8826](https://github.com/PyTorchLightning/pytorch-lightning/pull/8826))


- Removed `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` ([#8850](https://github.com/PyTorchLightning/pytorch-lightning/pull/8850))


- Removed the reset dataloader hooks from Training Plugins and Accelerators ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))



### Fixed

- Restored the original loaders if they were replaced by an entrypoint ([#8885](https://github.com/PyTorchLightning/pytorch-lightning/pull/8885))

- Fixed `trainer.fit_loop.split_idx` always returning `None` ([#8601](https://github.com/PyTorchLightning/pytorch-lightning/pull/8601))


@@ -152,9 +172,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed an issue with logger outputs not being finalized correctly after prediction runs ([#8333](https://github.com/PyTorchLightning/pytorch-lightning/issues/8333))


- Fixed `StochasticWeightAveraging` with a list of learning rates not applying them to each param group ([#8747](https://github.com/PyTorchLightning/pytorch-lightning/issues/8747))


- Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer ([#8804](https://github.com/PyTorchLightning/pytorch-lightning/pull/8804/))


- Fixed plateau scheduler stepping on incomplete epoch ([#8861](https://github.com/PyTorchLightning/pytorch-lightning/pull/8861))


- Fixed infinite loop with CycleIterator and multiple loaders ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))


- Fixed a bug where data-loading functions were not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))


## [1.4.0] - 2021-07-27

### Added
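As an illustration of the `log_dict` entry above ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628)), here is a minimal sketch of how the new arguments could be used. The model and metric names are hypothetical; the only assumption taken from the changelog is that `log_dict` now mirrors `log`'s `batch_size` and `rank_zero_only` arguments.

```python
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=-1) == y).float().mean()
        # per #8628, log_dict now accepts batch_size (and rank_zero_only), mirroring self.log
        self.log_dict({"val_loss": loss, "val_acc": acc}, batch_size=x.size(0))
```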
2 changes: 1 addition & 1 deletion benchmarks/test_basic_parity.py
@@ -51,7 +51,7 @@ def assert_parity_absolute(pl_values, pt_values, norm_by: float = 1, max_diff: f
"cls_model,max_diff_speed,max_diff_memory,num_epochs,num_runs",
[
(ParityModuleRNN, 0.05, 0.001, 4, 3),
(ParityModuleMNIST, 0.25, 0.001, 4, 3), # todo: lower this thr
(ParityModuleMNIST, 0.3, 0.001, 4, 3), # todo: lower this thr
pytest.param(ParityModuleCIFAR, 4.0, 0.0002, 2, 2, marks=_MARK_SHORT_BM),
],
)
52 changes: 52 additions & 0 deletions docs/source/advanced/checkpoint_io.rst
@@ -0,0 +1,52 @@
Custom Checkpointing IO
=======================

.. warning:: The Checkpoint IO API is experimental and subject to change.

Lightning supports modifying the checkpointing save/load functionality through the ``CheckpointIO``. This encapsulates the save/load logic
that is managed by the ``TrainingTypePlugin``.

``CheckpointIO`` can be extended to include your custom save/load functionality to and from a path. The ``CheckpointIO`` object can be passed to either a ``Trainer`` object or a ``TrainingTypePlugin`` as shown below.

.. code-block:: python

    from pathlib import Path
    from typing import Any, Dict, Optional, Union

    import torch  # needed for torch.device below

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.plugins import CheckpointIO, SingleDevicePlugin


    class CustomCheckpointIO(CheckpointIO):
        def save_checkpoint(
            self, checkpoint: Dict[str, Any], path: Union[str, Path], storage_options: Optional[Any] = None
        ) -> None:
            ...

        def load_checkpoint(self, path: Union[str, Path], storage_options: Optional[Any] = None) -> Dict[str, Any]:
            ...


    custom_checkpoint_io = CustomCheckpointIO()

    # Pass into the Trainer object (MyModel is your LightningModule)
    model = MyModel()
    trainer = Trainer(
        plugins=[custom_checkpoint_io],
        callbacks=ModelCheckpoint(save_last=True),
    )
    trainer.fit(model)

    # Pass into a TrainingTypePlugin
    model = MyModel()
    device = torch.device("cpu")
    trainer = Trainer(
        plugins=SingleDevicePlugin(device, checkpoint_io=custom_checkpoint_io),
        callbacks=ModelCheckpoint(save_last=True),
    )
    trainer.fit(model)
.. note::

    Some ``TrainingTypePlugins`` do not support custom ``CheckpointIO`` as the checkpointing logic is not modifiable.
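Beyond the stubbed ``...`` bodies in the new docs example above, a minimal concrete implementation could look like the following sketch. The class name is hypothetical; it simply round-trips the checkpoint dict through `torch.save`/`torch.load` using the same method signatures shown in the docs.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union

import torch

from pytorch_lightning.plugins import CheckpointIO


class TorchFileCheckpointIO(CheckpointIO):
    """Hypothetical CheckpointIO that round-trips checkpoints through torch.save/torch.load."""

    def save_checkpoint(
        self, checkpoint: Dict[str, Any], path: Union[str, Path], storage_options: Optional[Any] = None
    ) -> None:
        # persist the checkpoint dict to disk; storage_options is unused in this sketch
        torch.save(checkpoint, path)

    def load_checkpoint(self, path: Union[str, Path], storage_options: Optional[Any] = None) -> Dict[str, Any]:
        # load back onto CPU so the training type plugin can place tensors where it needs them
        return torch.load(path, map_location="cpu")
```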
12 changes: 12 additions & 0 deletions docs/source/api_references.rst
@@ -127,6 +127,18 @@ Cluster Environments
KubeflowEnvironment
SLURMEnvironment

Checkpoint IO Plugins
^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: pytorch_lightning.plugins.io

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

CheckpointIO
TorchCheckpointIO

Profiler API
------------
12 changes: 0 additions & 12 deletions docs/source/common/lightning_module.rst
@@ -853,18 +853,6 @@ validation_epoch_end
.. automethod:: pytorch_lightning.core.lightning.LightningModule.validation_epoch_end
:noindex:

write_prediction
~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.write_prediction
:noindex:

write_prediction_dict
~~~~~~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.write_prediction_dict
:noindex:

------------

Properties
2 changes: 1 addition & 1 deletion docs/source/common/optimizers.rst
@@ -185,7 +185,7 @@ defined in your :meth:`~pytorch_lightning.core.lightning.LightningModule.configu
.. warning::
* Before 1.3, Lightning automatically called ``lr_scheduler.step()`` in both automatic and manual optimization. From
1.3, ``lr_scheduler.step()`` is now for the user to call at arbitrary intervals.
* Note that the ``lr_dict`` keys, such as ``"step"`` and ``""interval"``, will be ignored even if they are provided in
* Note that the ``lr_dict`` keys, such as ``"step"`` and ``"interval"``, will be ignored even if they are provided in
your :meth:`~pytorch_lightning.core.lightning.LightningModule.configure_optimizers` during manual optimization.

Here is an example calling ``lr_scheduler.step()`` every step.
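The example referenced by "Here is an example calling ``lr_scheduler.step()`` every step" is collapsed in this diff; the following is only a minimal sketch of that manual-optimization pattern, assuming a single optimizer/scheduler pair returned from `configure_optimizers` and a hypothetical `compute_loss` helper.

```python
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        sch = self.lr_schedulers()

        loss = self.compute_loss(batch)  # hypothetical loss helper
        opt.zero_grad()
        self.manual_backward(loss)
        opt.step()

        # since 1.3, stepping the scheduler in manual optimization is up to the user
        sch.step()
        return loss
```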
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -55,6 +55,7 @@ PyTorch Lightning Documentation
advanced/multi_gpu
advanced/advanced_gpu
common/weights_loading
advanced/checkpoint_io
common/optimizers
advanced/profiler
advanced/sequences
22 changes: 2 additions & 20 deletions pytorch_lightning/accelerators/accelerator.py
@@ -155,9 +155,7 @@ def teardown(self) -> None:
"""
self.training_type_plugin.teardown()

def batch_to_device(
self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: Optional[int] = None
) -> Any:
def batch_to_device(self, batch: Any, device: Optional[torch.device] = None, dataloader_idx: int = 0) -> Any:
"""Moves the batch to the correct device.
The returned batch is of the same type as the input batch, just having all tensors on the correct device.
@@ -171,7 +169,7 @@ def batch_to_device(

if model is not None and not isinstance(self.training_type_plugin, DataParallelPlugin):
# no need to transfer batch to device in DP mode
return model._apply_batch_transfer_handler(batch, device, dataloader_idx)
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)

return move_data_to_device(batch, device)

@@ -410,22 +408,6 @@ def process_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[I
"""
return self.training_type_plugin.process_dataloader(dataloader)

def on_reset_train_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the train dataloader."""
return self.training_type_plugin.on_reset_train_dataloader(dataloader)

def on_reset_val_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the val dataloader."""
return self.training_type_plugin.on_reset_val_dataloader(dataloader)

def on_reset_test_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the test dataloader."""
return self.training_type_plugin.on_reset_test_dataloader(dataloader)

def on_reset_predict_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Called before resetting the predict dataloader."""
return self.training_type_plugin.on_reset_predict_dataloader(dataloader)

@property
def results(self) -> Any:
"""
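The `batch_to_device` change above is what the user-facing batch-transfer hooks ultimately see: `dataloader_idx` now arrives as a plain int (default 0) rather than possibly `None`. A sketch, assuming the 1.4-era hook signature that already accepts `dataloader_idx`:

```python
import pytorch_lightning as pl


class TransferAwareModel(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        # delegate the actual move to the default logic, then report which loader it came from
        batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
        print(f"moved a batch from dataloader {dataloader_idx} to {device}")
        return batch
```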
12 changes: 9 additions & 3 deletions pytorch_lightning/callbacks/early_stopping.py
@@ -91,7 +91,7 @@ def __init__(
check_finite: bool = True,
stopping_threshold: Optional[float] = None,
divergence_threshold: Optional[float] = None,
check_on_train_epoch_end: bool = True,
check_on_train_epoch_end: Optional[bool] = None,
):
super().__init__()
self.min_delta = min_delta
@@ -120,6 +120,12 @@ def __init__(
)
self.monitor = monitor or "early_stop_on"

def on_pretrain_routine_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
if self._check_on_train_epoch_end is None:
# if the user runs validation multiple times per training epoch, we try to check after
# validation instead of on train epoch end
self._check_on_train_epoch_end = trainer.val_check_interval == 1.0

def _validate_condition_metric(self, logs):
monitor_val = logs.get(self.monitor)

@@ -191,7 +197,7 @@ def _run_early_stopping_check(self, trainer: "pl.Trainer") -> None:
# when in dev debugging
trainer.dev_debugger.track_early_stopping_history(self, current)

should_stop, reason = self._evalute_stopping_criteria(current)
should_stop, reason = self._evaluate_stopping_criteria(current)

# stop every ddp process if any world process decides to stop
should_stop = trainer.training_type_plugin.reduce_boolean_decision(should_stop)
@@ -201,7 +207,7 @@ def _run_early_stopping_check(self, trainer: "pl.Trainer") -> None:
if reason and self.verbose:
self._log_info(trainer, reason)

def _evalute_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
def _evaluate_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
should_stop = False
reason = None
if self.check_finite and not torch.isfinite(current):
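A sketch of how the new `check_on_train_epoch_end=None` default above plays out, assuming a model that logs a `val_loss` metric:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# With the default of None, the callback resolves the check location itself:
# validating mid-epoch (val_check_interval < 1.0) moves the early-stopping check
# to the end of validation rather than the end of the training epoch.
early_stop = EarlyStopping(monitor="val_loss", patience=3)
trainer = Trainer(callbacks=[early_stop], val_check_interval=0.5, max_epochs=20)
```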
4 changes: 2 additions & 2 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -332,9 +332,9 @@ def on_train_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -
rank_zero_info("Saving latest checkpoint...")
# as we advance one step at end of training, we use `global_step - 1` to avoid saving duplicates
monitor_candidates = self._monitor_candidates(trainer, trainer.current_epoch, trainer.global_step - 1)
trainer.train_loop.global_step -= 1
trainer.fit_loop.global_step -= 1
self._save_last_checkpoint(trainer, monitor_candidates)
trainer.train_loop.global_step += 1
trainer.fit_loop.global_step += 1

def on_save_checkpoint(
self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", checkpoint: Dict[str, Any]
25 changes: 9 additions & 16 deletions pytorch_lightning/callbacks/stochastic_weight_avg.py
@@ -166,25 +166,18 @@ def on_train_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningMo
# move average model to request device.
self._average_model = self._average_model.to(self._device or pl_module.device)

optimizers = trainer.optimizers
optimizer = trainer.optimizers[0]
if self._swa_lrs is None:
self._swa_lrs = [param_group["lr"] for param_group in optimizer.param_groups]
if isinstance(self._swa_lrs, float):
self._swa_lrs = [self._swa_lrs] * len(optimizer.param_groups)

for param_group in optimizers[0].param_groups:
if self._swa_lrs is None:
initial_lr = param_group["lr"]

elif isinstance(self._swa_lrs, float):
initial_lr = self._swa_lrs

else:
initial_lr = self._swa_lrs[0]

param_group["initial_lr"] = initial_lr

self._swa_lrs = initial_lr
for lr, group in zip(self._swa_lrs, optimizer.param_groups):
group["initial_lr"] = lr

self._swa_scheduler = SWALR(
optimizers[0],
swa_lr=initial_lr,
optimizer,
swa_lr=self._swa_lrs,
anneal_epochs=self._annealing_epochs,
anneal_strategy=self._annealing_strategy,
last_epoch=trainer.max_epochs if self._annealing_strategy == "cos" else -1,
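As a usage sketch of the simplified logic above (and the per-param-group fix noted in the changelog, #8747), assuming `swa_lrs` accepts either a float or a list with one entry per parameter group:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# a float is broadcast to every parameter group; a list is zipped against the
# optimizer's param_groups, setting one initial_lr per group
swa = StochasticWeightAveraging(swa_lrs=[1e-2, 1e-3])
trainer = Trainer(callbacks=[swa], max_epochs=20)
```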
2 changes: 1 addition & 1 deletion pytorch_lightning/core/decorators.py
@@ -97,7 +97,7 @@ def inner_fn(self, *args, **kwargs):

if not pre_layer_count == post_layer_count:
rank_zero_warn(
f"The model layers do not match after moving to the target device."
"The model layers do not match after moving to the target device."
" If your model employs weight sharing on TPU,"
" please tie your weights using the `on_post_move_to_device` model hook.\n"
f"Layer count: [Before: {pre_layer_count} After: {post_layer_count}]"