TPU support & profiling
Overview
This is the first joint release between pytorch-bearer and Lightning; here we come ...
This release adds support for training models on Tensor Processing Units (TPUs). We can now train models on GPUs and TPUs by changing a single Trainer parameter
(see docs). We are also bringing the flexibility of Bearer into Lightning by allowing arbitrary user-defined callbacks (see docs).
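Roughly, both features look like this (a minimal sketch; the argument names `num_tpu_cores` and `callbacks`, and the callback hook signatures, reflect our understanding of this release and should be checked against the docs):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback


# A hypothetical user-defined callback. The hook names and the
# (trainer, pl_module) signature are assumptions to check against the
# Callback docs introduced with #889/#950.
class PrintingCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        print("Training is starting")

    def on_train_end(self, trainer, pl_module):
        print("Training is done")


model = MyLightningModule()  # hypothetical LightningModule, defined elsewhere

# Switching from GPU to TPU training is a single Trainer argument:
# trainer = pl.Trainer(gpus=1, callbacks=[PrintingCallback()])   # GPUs
trainer = pl.Trainer(num_tpu_cores=8,                            # TPUs
                     callbacks=[PrintingCallback()])
trainer.fit(model)
```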
We are also including a profiler that lets Lightning users identify training bottlenecks (see docs).
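A minimal sketch of turning it on, assuming `profiler=True` enables the basic built-in profiler and that `AdvancedProfiler` is the more detailed option exposed by the profiler module:

```python
from pytorch_lightning import Trainer

# Enable the built-in profiler; a summary of the time spent in each
# training-loop event is reported once training completes.
trainer = Trainer(profiler=True)
trainer.fit(model)  # `model` is your LightningModule

# For per-function detail, a profiler object can be passed instead
# (class name assumed from this release's profiler module):
# from pytorch_lightning.profiler import AdvancedProfiler
# trainer = Trainer(profiler=AdvancedProfiler())
```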
This release also includes automatic sampler setup: depending on the selected backend, Lightning configures the sampler correctly (no need for user input).
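In practice that means returning a plain `DataLoader` and letting Lightning attach the right distributed sampler; a sketch, assuming the `distributed_backend='ddp'` flag and a `train_dataset` attribute defined elsewhere:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class MyLightningModule(pl.LightningModule):
    # ... __init__, training_step, configure_optimizers, etc. ...

    def train_dataloader(self):
        # Return a plain DataLoader -- no DistributedSampler needed;
        # Lightning now attaches the correct sampler for DDP/TPU itself.
        return DataLoader(self.train_dataset, batch_size=32, shuffle=True)


# Hypothetical multi-GPU DDP run; the sampler is configured for you.
trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
trainer.fit(MyLightningModule())
```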
Logging has also been extended so that multiple concurrent loggers can be passed to the Trainer
as an iterable (docs), and we have added support for step-based learning rate scheduling.
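For example (a sketch; the logger usage and the scheduler-dict keys reflect this release and should be verified against the docs):

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Any iterable of loggers can now be handed to the Trainer.
loggers = [
    TensorBoardLogger(save_dir='logs/', name='run_a'),
    TensorBoardLogger(save_dir='logs/', name='run_b'),
]
trainer = pl.Trainer(logger=loggers)


class MyModule(pl.LightningModule):
    # ... __init__, training_step, etc. ...

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)
        # 'interval': 'step' asks Lightning to step the scheduler every batch
        # rather than every epoch (dict keys as understood from #941).
        return [optimizer], [{'scheduler': scheduler, 'interval': 'step', 'frequency': 1}]
```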
Finally, lots of bug fixes (see below).
Detail changes
Added
- Added automatic sampler setup. Depending on DDP or TPU, Lightning configures the sampler correctly (user needs to do nothing) (#926)
- Added `reload_dataloaders_every_epoch=False` flag for Trainer. Some users require reloading data every epoch (#926)
- Added `progress_bar_refresh_rate=50` flag for Trainer, the refresh rate in notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added `optimizer_idx` argument to `backward` hook (#733)
- Added `entity` argument to `WandbLogger` to be passed to `wandb.init` (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs; `version` can now be set to a `str` to just save to that directory, and `name=''` prevents the experiment-name directory (#804)
- Added option to specify `step` key when logging metrics (#808)
- Added `train_dataloader`, `val_dataloader` and `test_dataloader` arguments to `Trainer.fit()`, for alternative data parsing (#759) (see the sketch after this list)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks into multiple files (#849)
- Support for user-defined callbacks (#889, #950)
- Added support for multiple loggers to be passed to `Trainer` as an iterable (e.g. list, tuple, etc.) (#903)
- Added support for step-based learning rate scheduling (#941)
- Added support for logging `hparams` as a `dict` (#1029)
- Checkpoint and early stopping now work without a validation step (#1041)
- Support graceful training cleanup after KeyboardInterrupt (#856, #1019)
- Added type hints for function arguments (#912)
- Added default `argparser` for `Trainer` (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in `Trainer` (#728)
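As referenced in the entry for #759 above, a sketch of passing dataloaders straight to `Trainer.fit()`; the keyword names follow that entry, and `train_dataset`, `val_dataset` and `model` are assumed to exist:

```python
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# Dataloaders can now be passed straight to fit() instead of (or alongside)
# the *_dataloader methods on the LightningModule.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

trainer = Trainer(max_epochs=10)
trainer.fit(model, train_dataloader=train_loader, val_dataloader=val_loader)
```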
Changed
- Changed default TQDM to use `tqdm.auto` for prettier outputs in IPython notebooks (#752)
- Changed `pytorch_lightning.logging` to `pytorch_lightning.loggers` (#767)
- Moved the default `tqdm_dict` definition from `Trainer` to `LightningModule`, so it can be overridden by the user (#749)
- Moved functionality of `LightningModule.load_from_metrics` into `LightningModule.load_from_checkpoint` (#995)
- Changed checkpoint path parameter from `filepath` to `dirpath` (#1016)
- Froze model `hparams` as a `Namespace` property (#1029)
- Dropped `logging` config in package init (#1015)
- Renamed model steps (#1051) (see the sketch after this list):
  - `training_end` >> `training_epoch_end`
  - `validation_end` >> `validation_epoch_end`
  - `test_end` >> `test_epoch_end`
- Refactored dataloading, supports infinite dataloaders (#955)
- Create a single file in `TensorBoardLogger` (#777)
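As referenced in the rename entry above, a sketch of the new epoch-end hooks; the aggregation logic and the `'log'` return dict are assumptions following this release's conventions:

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    # ... training_step / validation_step / test_step as before ...

    # Previously `training_end`: receives the list of training_step outputs.
    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'log': {'train_loss_epoch': avg_loss}}

    # Previously `validation_end`.
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'val_loss_epoch': avg_loss}}

    # Previously `test_end`.
    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        return {'test_loss': avg_loss, 'log': {'test_loss_epoch': avg_loss}}
```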
Deprecated
- Deprecated `pytorch_lightning.logging` (#767)
- Deprecated `LightningModule.load_from_metrics` in favour of `LightningModule.load_from_checkpoint` (#995, #1079)
- Deprecated the `@data_loader` decorator (#926)
- Deprecated model steps `training_end`, `validation_end` and `test_end` (#1051, #1056)
Removed
- Removed dependency on `pandas` (#736)
- Removed dependency on `torchvision` (#797)
- Removed dependency on `scikit-learn` (#801)
Fixed
- Fixed a bug where early stopping `on_end_epoch` would be called inconsistently when `check_val_every_n_epoch == 0` (#743)
- Fixed a bug where the model checkpoint didn't write to the same directory as the logger (#771)
- Fixed a bug where the `TensorBoardLogger` class would create an additional empty log file during fitting (#777)
- Fixed a bug where `global_step` was advanced incorrectly when using `accumulate_grad_batches > 1` (#832)
- Fixed a bug when calling `self.logger.experiment` with multiple loggers (#1009)
- Fixed a bug when calling `logger.append_tags` on a `NeptuneLogger` with a single tag (#1009)
- Fixed sending back data from `.spawn` by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed Comet logger to log after train (#892)
- Removed deprecated args to the learning rate step function (#890)
Contributors
@airglow, @akshaykvnit, @AljoSt, @AntixK, @awaelchli, @baeseongsu, @bobkemp, @Borda, @calclavia, @Calysto, @djbyrne, @ethanwharris, @fdelrio89, @hadim, @hanbyul-kim, @jeremyjordan, @kuynzereb, @luiscape, @MattPainter01, @neggert, @onkyo14taro, @peteriz, @shoarora, @SkafteNicki, @smallzzy, @srush, @theevann, @tullie, @williamFalcon, @xeTaiz, @xssChauhan, @yukw777
If we forgot someone because their commit email does not match their GitHub account, let us know :]