All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Added a new `DETAIL` log level to provide useful logs for improving monitoring and debugging of batch jobs
- Added a flag `SLURMEnvironment(auto_requeue=True|False)` to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
    - Add `_SupportsStateDict` protocol to detect if classes are stateful (#10646)
    - Add `_FaultTolerantMode` enum used to track different supported fault tolerant modes (#10645)
    - Add a `_rotate_worker_indices` utility to reload the state according to the latest worker (#10647)
    - Add stateful workers (#10674)
    - Add a utility to collect the states across processes (#10639)
    - Add logic to reload the states across data loading components (#10699)
    - Clean up some fault tolerant utilities (#10703)
    - Enable Fault Tolerant Manual Training (#10707)
    - Broadcast the `_terminate_gracefully` to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) `DataLoaders` returned in the `*_dataloader()` methods, i.e., automatic replacement of samplers now works with custom types of `DataLoader` (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Show a better error message when a custom `DataLoader` implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the `Loop`'s state by default in the checkpoint (#10784)
- Added `Loop.replace` to easily switch one loop for another (#10324)
- Added support for `--lr_scheduler=ReduceLROnPlateau` to the `LightningCLI` (#10860)
- Added `LightningCLI.configure_optimizers` to override the `configure_optimizers` return value (#10860)
- Added a warning that shows when `max_epochs` in the `Trainer` is not set (#10700)
- Added support for returning a single Callback from `LightningModule.configure_callbacks` without wrapping it into a list (#11060)
- Added `console_kwargs` for `RichProgressBar` to initialize inner Console (#10875)
- Added a `PrecisionPlugin.teardown` method (#10990)
- Added `LightningModule.lr_scheduler_step` (#10249)
- Added `opt_idx` to scheduler config if not assigned by user (#11247)
- Added a `MisconfigurationException` if the user-provided `opt_idx` in the scheduler config doesn't match the actual index of its respective optimizer (#11247)
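The `_SupportsStateDict` entry above describes a protocol used to detect stateful classes. A minimal sketch of that idea with `typing.Protocol` follows; the names `SupportsStateDict`, `Counter`, `Stateless` and `is_stateful` are hypothetical and this is not Lightning's private implementation.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class SupportsStateDict(Protocol):
    """Hypothetical stand-in for a 'stateful object' protocol."""

    def state_dict(self) -> Dict[str, Any]:
        ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        ...


class Counter:
    """Stateful: exposes both methods, so it matches the protocol."""

    def __init__(self) -> None:
        self.count = 0

    def state_dict(self) -> Dict[str, Any]:
        return {"count": self.count}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.count = state_dict["count"]


class Stateless:
    """Has neither method, so the isinstance check fails."""


def is_stateful(obj: object) -> bool:
    # A runtime-checkable protocol only checks that the methods exist,
    # not that their signatures match.
    return isinstance(obj, SupportsStateDict)
```

Structural typing means user classes do not need to inherit from a base class to be picked up.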
-
Set the
prog_bar
flag to False inLightningModule.log_grad_norm
(#11472) -
Raised exception in
init_dist_connection()
when torch distributed is not available (#10418) -
The
monitor
argument in theEarlyStopping
callback is no longer optional (#10328) -
Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
-
Raised
MisconfigurationException
whenenable_progress_bar=False
and a progress bar instance has been passed in the callback list (#10520) -
Moved
trainer.connectors.env_vars_connector._defaults_from_env_vars
to utilities.argparse._defaults_from_env_vars
(#10501) -
Changes in
LightningCLI
required for the new major release of jsonargparse v4.0.0 (#10426) -
Renamed
refresh_rate_per_second
parameter torefresh_rate
forRichProgressBar
signature (#10497) -
Moved ownership of the
PrecisionPlugin
intoTrainingTypePlugin
and updated all references (#10570) -
Fault Tolerant relies on
signal.SIGTERM
to gracefully exit instead ofsignal.SIGUSR1
(#10605) -
Loop.restarting=...
now sets the value recursively for all subloops (#11442) -
Raised an error if the
batch_size
cannot be inferred from the current batch if it contained a string or was a custom batch object (#10541) -
The validation loop is now disabled when
overfit_batches > 0
is set in the Trainer (#9709) -
Moved optimizer related logics from
Accelerator
toTrainingTypePlugin
(#10596) -
Moved ownership of the lightning optimizers from the
Trainer
to theStrategy
(#11444) -
Moved
batch_to_device
method fromAccelerator
toTrainingTypePlugin
(#10649) -
The
DDPSpawnPlugin
no longer overrides thepost_dispatch
plugin hook (#10034) -
Integrate the progress bar implementation with progress tracking (#11213)
-
The
LightningModule.{add_to_queue,get_from_queue}
hooks no longer get atorch.multiprocessing.SimpleQueue
and instead receive a list based queue (#10034) -
Changed
training_step
,validation_step
,test_step
andpredict_step
method signatures inAccelerator
and updated input from caller side (#10908) -
Changed the name of the temporary checkpoint that the
DDPSpawnPlugin
and related plugins save (#10934) -
LoggerCollection
returns only unique logger names and versions (#10976) -
Redesigned process creation for spawn-based plugins (`DDPSpawnPlugin`, `TPUSpawnPlugin`, etc.) (#10896)
    - All spawn-based plugins now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`
    - The hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` now run under an initialized process group for spawn-based plugins, just like their non-spawn counterparts
    - Some configuration errors that were previously raised as `MisconfigurationException`s will now be raised as `ProcessRaisedException` (torch>=1.8) or as `Exception` (torch<1.8)
    - Removed the `TrainingTypePlugin.pre_dispatch()` method and merged it with `TrainingTypePlugin.setup()` (#11137)
-
Changed profiler to index and display the names of the hooks with a new pattern []. (#11026)
-
Changed
batch_to_device
entry in profiling from stage-specific to generic, to match profiling of other hooks (#11031) -
Changed the info message for finalizing ddp-spawn worker processes to a debug-level message (#10864)
-
Removed duplicated file extension when uploading model checkpoints with
NeptuneLogger
(#11015) -
Changed
LSFEnvironment
to useLSB_DJOB_RANKFILE
environment variable instead ofLSB_HOSTS
for determining node rank and main address (#10825) -
Removed
__getstate__
and__setstate__
ofRichProgressBar
(#11100) -
The
DDPPlugin
andDDPSpawnPlugin
and their subclasses now remove theSyncBatchNorm
wrappers inteardown()
to enable proper support at inference after fitting (#11078) -
Moved ownership of the
Accelerator
instance to theTrainingTypePlugin
; all training-type plugins now take an optional parameteraccelerator
(#11022) -
Renamed the `TrainingTypePlugin` to `Strategy` (#11120)
    - Renamed the `ParallelPlugin` to `ParallelStrategy` (#11123)
    - Renamed the `DataParallelPlugin` to `DataParallelStrategy` (#11183)
    - Renamed the `DDPPlugin` to `DDPStrategy` (#11142)
    - Renamed the `DDP2Plugin` to `DDP2Strategy` (#11185)
    - Renamed the `DDPShardedPlugin` to `DDPShardedStrategy` (#11186)
    - Renamed the `DDPFullyShardedPlugin` to `DDPFullyShardedStrategy` (#11143)
    - Renamed the `DDPSpawnPlugin` to `DDPSpawnStrategy` (#11145)
    - Renamed the `DDPSpawnShardedPlugin` to `DDPSpawnShardedStrategy` (#11210)
    - Renamed the `DeepSpeedPlugin` to `DeepSpeedStrategy` (#11194)
    - Renamed the `HorovodPlugin` to `HorovodStrategy` (#11195)
    - Renamed the `TPUSpawnPlugin` to `TPUSpawnStrategy` (#11190)
    - Renamed the `IPUPlugin` to `IPUStrategy` (#11193)
    - Renamed the `SingleDevicePlugin` to `SingleDeviceStrategy` (#11182)
    - Renamed the `SingleTPUPlugin` to `SingleTPUStrategy` (#11182)
    - Renamed the `TrainingTypePluginsRegistry` to `StrategyRegistry` (#11233)
-
Marked the
ResultCollection
,ResultMetric
, andResultMetricCollection
classes as protected (#11130) -
The epoch start/end hooks are now called by the
FitLoop
instead of theTrainingEpochLoop
(#11201) -
DeepSpeed does not require lightning module zero 3 partitioning (#10655)
-
Deprecated
training_type_plugin
property in favor ofstrategy
inTrainer
and updated the references (#11141) -
Moved
Strategy
classes to thestrategies
directory (#11226) -
Renamed
training_type_plugin
file tostrategy
(#11239) -
Changed
DeviceStatsMonitor
to group metrics based on the logger'sgroup_separator
(#11254) -
Raised
UserWarning
if evaluation is triggered withbest
ckpt and trainer is configured with multiple checkpoint callbacks (#11274) -
Trainer.logged_metrics
now always contains scalar tensors, even when a Python scalar was logged (#11270) -
Changed
MisconfigurationException
toModuleNotFoundError
whenrich
isn't available (#11360)
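The `*Plugin` to `*Strategy` renames listed above mostly swap a suffix, with two irregular cases. As a migration aid, the mapping could be captured in a small lookup. This is a sketch only: `strategy_name` and the two tables are hypothetical helpers, not part of Lightning, and they cover only the class names from the rename list.

```python
# Irregular renames from the list above; all other entries only swap the suffix.
_SPECIAL_RENAMES = {
    "TrainingTypePlugin": "Strategy",
    "TrainingTypePluginsRegistry": "StrategyRegistry",
}

# Class names from the rename list that follow the "Plugin" -> "Strategy" rule.
_RENAMED_PLUGINS = {
    "ParallelPlugin",
    "DataParallelPlugin",
    "DDPPlugin",
    "DDP2Plugin",
    "DDPShardedPlugin",
    "DDPFullyShardedPlugin",
    "DDPSpawnPlugin",
    "DDPSpawnShardedPlugin",
    "DeepSpeedPlugin",
    "HorovodPlugin",
    "TPUSpawnPlugin",
    "IPUPlugin",
    "SingleDevicePlugin",
    "SingleTPUPlugin",
}


def strategy_name(old_name: str) -> str:
    """Return the post-rename class name for a training-type plugin."""
    if old_name in _SPECIAL_RENAMES:
        return _SPECIAL_RENAMES[old_name]
    if old_name in _RENAMED_PLUGINS:
        return old_name[: -len("Plugin")] + "Strategy"
    # Other plugins (for example precision plugins) were not part of this rename.
    raise KeyError(f"{old_name!r} is not in the rename list")
```

Restricting the suffix rule to an explicit set avoids renaming classes such as `PrecisionPlugin` that keep their names.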
-
Deprecated
ClusterEnvironment.master_{address,port}
in favor ofClusterEnvironment.main_{address,port}
(#10103) -
Deprecated
DistributedType
in favor of_StrategyType
(#10505) -
Deprecated the
precision_plugin
constructor argument fromAccelerator
(#10570) -
Deprecated
DeviceType
in favor of_AcceleratorType
(#10503) -
Deprecated the property
Trainer.slurm_job_id
in favor of the newSLURMEnvironment.job_id()
method (#10622) -
Deprecated the access to the attribute
IndexBatchSamplerWrapper.batch_indices
in favor ofIndexBatchSamplerWrapper.seen_batch_indices
(#10870) -
Deprecated
on_init_start
andon_init_end
callback hooks (#10940) -
Deprecated
Trainer.call_hook
in favor ofTrainer._call_callback_hooks
,Trainer._call_lightning_module_hook
,Trainer._call_ttp_hook
, andTrainer._call_accelerator_hook
(#10979) -
Deprecated
TrainingTypePlugin.post_dispatch
in favor ofTrainingTypePlugin.teardown
(#10939) -
Deprecated
ModelIO.on_hpc_{save/load}
in favor ofCheckpointHooks.on_{save/load}_checkpoint
(#10911) -
Deprecated
Trainer.run_stage
in favor ofTrainer.{fit,validate,test,predict}
(#11000) -
Deprecated
Trainer.lr_schedulers
in favor ofTrainer.lr_scheduler_configs
which returns a list of dataclasses instead of dictionaries (#11443) -
Deprecated
Trainer.verbose_evaluate
in favor ofEvaluationLoop(verbose=...)
(#10931) -
Deprecated
Trainer.should_rank_save_checkpoint
Trainer property (#11068) -
Deprecated
Trainer.lightning_optimizers
(#11444) -
Deprecated
TrainerOptimizersMixin
and moved functionality tocore/optimizer.py
(#11155) -
Deprecated
TrainerCallbackHookMixin
(#11148) -
Deprecated
TrainerDataLoadingMixin
and moved functionality toTrainer
andDataConnector
(#11282) -
Deprecated function
pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys
(#11254)
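Several of the deprecations above keep an old name working while pointing users to its replacement, such as `master_{address,port}` in favor of `main_{address,port}`. A common way to do this is a forwarding alias that emits a `DeprecationWarning`. The class below is a hypothetical stand-in with assumed default values, not the real `ClusterEnvironment`.

```python
import warnings


class ClusterEnvironmentSketch:
    """Hypothetical stand-in used only to illustrate a deprecation alias."""

    def __init__(self, main_address: str = "127.0.0.1", main_port: int = 12910) -> None:
        # The default address and port are placeholder values for this sketch.
        self._main_address = main_address
        self._main_port = main_port

    @property
    def main_address(self) -> str:
        return self._main_address

    @property
    def main_port(self) -> int:
        return self._main_port

    @property
    def master_address(self) -> str:
        # Old name: warn, then forward to the new property.
        warnings.warn(
            "`master_address` is deprecated, use `main_address` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.main_address

    @property
    def master_port(self) -> int:
        warnings.warn(
            "`master_port` is deprecated, use `main_port` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.main_port
```

`stacklevel=2` makes the warning point at the caller's line rather than at the alias itself.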
-
Removed deprecated parameter
method
inpytorch_lightning.utilities.model_helpers.is_overridden
(#10507) -
Remove deprecated method
ClusterEnvironment.creates_children
(#10339) -
Removed deprecated
TrainerModelHooksMixin.is_function_implemented
andTrainerModelHooksMixin.has_arg
(#10322) -
Removed deprecated
pytorch_lightning.utilities.device_dtype_mixin.DeviceDtypeModuleMixin
in favor ofpytorch_lightning.core.mixins.device_dtype_mixin.DeviceDtypeModuleMixin
(#10442) -
Removed deprecated
LightningModule.loaded_optimizer_states_dict
property (#10346) -
Removed deprecated
Trainer.fit(train_dataloader=)
,Trainer.validate(val_dataloaders=)
, andTrainer.test(test_dataloader=)
(#10325) -
Removed deprecated
has_prepared_data
,has_setup_fit
,has_setup_validate
,has_setup_test
,has_setup_predict
,has_teardown_fit
,has_teardown_validate
,has_teardown_test
andhas_teardown_predict
datamodule lifecycle properties (#10350) -
Removed deprecated
every_n_val_epochs
parameter of ModelCheckpoint (#10366) -
Removed deprecated
import pytorch_lightning.profiler.profilers
in favor ofimport pytorch_lightning.profiler
(#10443) -
Removed deprecated property
configure_slurm_ddp
from accelerator connector (#10370) -
Removed deprecated arguments
num_nodes
andsync_batchnorm
fromDDPPlugin
,DDPSpawnPlugin
,DeepSpeedPlugin
(#10357) -
Removed deprecated property
is_slurm_managing_tasks
from AcceleratorConnector (#10353) -
Removed deprecated
LightningModule.log(tbptt_reduce_fx, tbptt_reduce_token, sync_dist_op)
(#10423) -
Removed deprecated
Plugin.task_idx
(#10441) -
Removed deprecated method
master_params
from PrecisionPlugin (#10372) -
Removed the automatic detachment of "extras" returned from
training_step
. For example,return {'loss': ..., 'foo': foo.detach()}
will now be necessary iffoo
has gradients which you do not want to store (#10424) -
Removed deprecated passthrough methods and properties from
Accelerator
base class: -
Removed deprecated signature for
transfer_batch_to_device
hook. The new argumentdataloader_idx
is now required (#10480) -
Removed deprecated
utilities.distributed.rank_zero_{warn/deprecation}
(#10451) -
Removed deprecated
mode
argument fromModelSummary
class (#10449) -
Removed deprecated
Trainer.train_loop
property in favor ofTrainer.fit_loop
(#10482) -
Removed deprecated
disable_validation
property from Trainer (#10450) -
Removed deprecated
CheckpointConnector.hpc_load
property in favor ofCheckpointConnector.restore
(#10525) -
Removed deprecated
reload_dataloaders_every_epoch
fromTrainer
in favour ofreload_dataloaders_every_n_epochs
(#10481) -
Removed the
precision_plugin
attribute fromAccelerator
in favor of its equivalent attributeprecision_plugin
in theTrainingTypePlugin
(#10570) -
Removed
DeepSpeedPlugin.{precision,amp_type,amp_level}
properties (#10657) -
Removed argument
return_result
from theDDPSpawnPlugin.spawn()
method (#10867) -
Removed the property
TrainingTypePlugin.results
and corresponding properties in subclasses (#10034) -
Removed the
mp_queue
attribute fromDDPSpawnPlugin
andTPUSpawnPlugin
(#10034) -
Removed unnecessary
_move_optimizer_state
method overrides fromTPUSpawnPlugin
andSingleTPUPlugin
(#10849) -
Removed
should_rank_save_checkpoint
property fromTrainingTypePlugin
(#11070) -
Removed
model_sharded_context
method fromAccelerator
(#10886) -
Removed method
pre_dispatch
from thePrecisionPlugin
(#10887) -
Removed the `setup_optimizers_in_pre_dispatch` method from the `strategies`; the same logic is now handled in the `setup` and `pre_dispatch` methods (#10906)
-
Removed methods
pre_dispatch
,dispatch
andpost_dispatch
from theAccelerator
(#10885) -
Removed methods
training_step
,test_step
,validation_step
andpredict_step
from theAccelerator
(#10890) -
Removed
TrainingTypePlugin.start_{training,evaluating,predicting}
hooks and the same in all subclasses (#10989, #10896) -
Removed
Accelerator.on_train_start
(#10999) -
Removed support for Python 3.6 (#11117)
-
Removed
Strategy.init_optimizers
in favor ofStrategy.setup_optimizers
(#11236) -
Removed `profile("training_step_and_backward")` in the `Closure` class since calls to `training_step` and `backward` are already profiled (#11222)
-
Removed
Strategy.optimizer_zero_grad
(#11246)
-
Fixed security vulnerabilities CVE-2020-1747 and CVE-2020-14343 caused by the
PyYAML
dependency (#11099) -
Fixed logging on
{test,validation}_epoch_end
with multiple dataloaders (#11132) -
Reset the validation progress tracking state after sanity checking (#11218)
-
Fixed double evaluation bug with fault-tolerance enabled where the second call was completely skipped (#11119)
-
Fixed an issue with the
TPUSpawnPlugin
handling theXLA_USE_BF16
environment variable incorrectly (#10990) -
Fixed wrong typehint for
Trainer.lightning_optimizers
(#11155) -
Fixed type promotion when tensors of higher category than float are logged (#11401)
-
Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy (#11307)
-
Fixed
SimpleProfiler
summary (#11414) -
Disabled sampler replacement when using
IterableDataset
(#11507)
- Fixed
LightningCLI
race condition while saving the config (#11199) - Fixed the default value used with
log(reduce_fx=min|max)
(#11310) - Fixed data fetcher selection (#11294)
- Fixed a race condition that could result in incorrect (zero) values being observed in prediction writer callbacks (#11288)
- Fixed dataloaders not getting reloaded the correct amount of times when setting
reload_dataloaders_every_n_epochs
andcheck_val_every_n_epoch
(#10948) - Fixed deepspeed strategy not restoring the lr-scheduler states when lr-scheduler(s) are configured through
LightningModule.configure_optimizers
(#11322)
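The `log(reduce_fx=min|max)` fix above concerns the value a running reduction starts from. The changelog does not state which defaults were chosen, so the sketch below assumes the usual identity elements (`+inf` for `min`, `-inf` for `max`). `initial_value` and `running_reduce` are hypothetical helpers, not Lightning code.

```python
import math
from typing import Callable, Iterable


def initial_value(reduce_fx: Callable[[float, float], float]) -> float:
    """Pick a start value that can never win the reduction (assumed defaults)."""
    if reduce_fx is min:
        return math.inf
    if reduce_fx is max:
        return -math.inf
    # Zero suits additive binary reductions such as operator.add.
    return 0.0


def running_reduce(values: Iterable[float], reduce_fx: Callable[[float, float], float]) -> float:
    """Fold the logged values into a single number, two at a time."""
    result = initial_value(reduce_fx)
    for value in values:
        result = reduce_fx(result, value)
    return result
```

Starting from zero would be wrong here: `min(0.0, 3.0, 5.0)` is `0.0` even though zero was never logged.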
- Fixed
NeptuneLogger
when using DDP (#11030) - Fixed a bug to disable logging hyperparameters in logger if there are no hparams (#11105)
- Avoid the deprecated
onnx.export(example_outputs=...)
in torch 1.10 (#11116) - Fixed an issue when torch-scripting a
LightningModule
after training withTrainer(sync_batchnorm=True)
(#11078) - Fixed an
AttributeError
occurring when using a CombinedLoader
(multiple dataloaders) for prediction (#11111) - Fixed bug where
Trainer(track_grad_norm=..., logger=False)
would fail (#11114) - Fixed an incorrect warning being produced by the model summary when using
bf16
precision on CPU (#11161)
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- The
ModelCheckpoint
callback now saves and restores attributesbest_k_models
,kth_best_model_path
,kth_value
, andlast_model_path
(#10995)
- Fixed a bug where the DeepSpeedPlugin arguments
cpu_checkpointing
andcontiguous_memory_optimization
were not being forwarded to deepspeed correctly (#10874) - Fixed an issue with
NeptuneLogger
causing checkpoints to be uploaded with a duplicated file extension (#11015) - Fixed support for logging within callbacks returned from
LightningModule
(#10991) - Fixed running sanity check with
RichProgressBar
(#10913) - Fixed support for
CombinedLoader
while checking for warning raised with eval dataloaders (#10994) - The TQDM progress bar now correctly shows the
on_epoch
logged values on train epoch end (#11069) - Fixed bug where the TQDM updated the training progress bar during
trainer.validate
(#11069)
- Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally (#10815)
- Fixed an issue with
SignalConnector
not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled (#10611) - Fixed
SignalConnector._has_already_handler
check for callable type (#10483) - Fixed an issue to return the results for each dataloader separately instead of duplicating them for each (#10810)
- Improved exception message if
rich
version is less than10.2.2
(#10839) - Fixed uploading best model checkpoint in NeptuneLogger (#10369)
- Fixed early schedule reset logic in PyTorch profiler that was causing data leak (#10837)
- Fixed a bug that caused incorrect batch indices to be passed to the
BasePredictionWriter
hooks when using a dataloader withnum_workers > 0
(#10870) - Fixed an issue with item assignment on the logger on rank > 0 for those who support it (#10917)
- Fixed importing
torch_xla.debug
fortorch-xla<1.8
(#10836) - Fixed an issue with
DDPSpawnPlugin
and related plugins leaving a temporary checkpoint behind (#10934) - Fixed a
TypeError
occurring in the SignalConnector.teardown()
method (#10961)
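Two fixes above concern signal handlers being restored on teardown. The general pattern is to remember the previous handler before installing a new one and to put it back afterwards. The sketch below uses only the standard `signal` module; `SignalGuard` is a hypothetical helper, not the `SignalConnector` implementation, and it must run in the main thread.

```python
import signal
from types import FrameType
from typing import Any, Dict, Optional


class SignalGuard:
    """Hypothetical helper: installs a handler and restores the old one on teardown."""

    def __init__(self) -> None:
        self._original: Dict[int, Any] = {}
        self.received = False

    def register(self, signum: int = signal.SIGTERM) -> None:
        # Remember whatever handler was active before replacing it.
        self._original[signum] = signal.getsignal(signum)
        signal.signal(signum, self._handle)

    def _handle(self, signum: int, frame: Optional[FrameType]) -> None:
        # Only set a flag; a training loop would poll it and exit gracefully.
        self.received = True

    def teardown(self) -> None:
        for signum, handler in self._original.items():
            # getsignal() may return None for handlers not installed from
            # Python; passing None to signal.signal() raises a TypeError.
            if handler is not None:
                signal.signal(signum, handler)
        self._original.clear()
```

Keeping the handler body to a flag assignment avoids doing real work inside the signal context.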
- Fixed support for
--key.help=class
with theLightningCLI
(#10767) - Fixed
_compare_version
for python packages (#10762) - Fixed TensorBoardLogger
SummaryWriter
not being closed before spawning the processes (#10777)
- Fixed a consolidation error in Lite when attempting to save the state dict of a sharded optimizer (#10746)
- Fixed the default logging level for batch hooks associated with training from
on_step=False, on_epoch=True
toon_step=True, on_epoch=False
(#10756)
- Fixed `ShardedTensor` state dict hook registration to check if torch distributed is available (#10621)
- Fixed an issue with `self.log` not respecting a tensor's `dtype` when applying computations (#10076)
- Fixed LightningLite `_wrap_init` popping nonexistent keys from DataLoader signature parameters (#10613)
- Fixed signals being registered within threads (#10610)
- Fixed an issue that caused Lightning to extract the batch size even though it was set by the user in
LightningModule.log
(#10408) - Fixed
Trainer(move_metrics_to_cpu=True)
not moving the evaluation logged results to CPU (#10631) - Fixed the
{validation,test}_step
outputs getting moved to CPU withTrainer(move_metrics_to_cpu=True)
(#10631) - Fixed an issue with collecting logged test results with multiple dataloaders (#10522)
- Fixed `CombinedLoader` and `max_size_cycle` not receiving a `DistributedSampler` (#10374)
- Fixed an issue where class or init-only variables of dataclasses were passed to the dataclass constructor in `utilities.apply_to_collection` (#9702)
- Fixed `isinstance` not working with `init_meta_context`, materialized model not being moved to the device (#10493)
- Fixed an issue that prevented the Trainer from shutting down workers when execution is interrupted due to failure (#10463)
- Squeeze the early stopping monitor to remove empty tensor dimensions (#10461)
- Fixed sampler replacement logic with `overfit_batches` to only replace the sampler when `SequentialSampler` is not used (#10486)
- Fixed scripting causing false positive deprecation warnings (#10470, #10555)
- Do not fail if batch size could not be inferred for logging when using DeepSpeed (#10438)
- Fixed propagation of device and dtype information to submodules of LightningLite when they inherit from `DeviceDtypeModuleMixin` (#10559)
- Fixed
apply_to_collection(defaultdict)
(#10316) - Fixed failure when
DataLoader(batch_size=None)
is passed (#10345) - Fixed interception of
__init__
arguments for sub-classed DataLoader re-instantiation in Lite (#10334) - Fixed issue with pickling
CSVLogger
after a call toCSVLogger.save
(#10388) - Fixed an import error being caused by
PostLocalSGD
whentorch.distributed
not available (#10359) - Fixed the logging with
on_step=True
in epoch-level hooks causing unintended side-effects. Logging withon_step=True
in epoch-level hooks will now correctly raise an error (#10409) - Fixed deadlocks for distributed training with
RichProgressBar
(#10428) - Fixed an issue where the model wrapper in Lite converted non-floating point tensors to float (#10429)
- Fixed an issue with inferring the dataset type in fault-tolerant training (#10432)
- Fixed dataloader workers with
persistent_workers
being deleted on every iteration (#10434)
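The `apply_to_collection(defaultdict)` fix above touches a subtle point: a `defaultdict` cannot be rebuilt the way a plain `dict` can, because its first constructor argument is the default factory. The function below is a simplified stand-in written for illustration and assumes only dicts, lists and tuples as containers; it is not Lightning's implementation.

```python
from collections import defaultdict
from typing import Any, Callable


def apply_to_collection(data: Any, dtype: type, function: Callable[[Any], Any]) -> Any:
    """Apply `function` to every `dtype` element while keeping container types."""
    if isinstance(data, dtype):
        return function(data)
    if isinstance(data, defaultdict):
        # Rebuild with the same default_factory. Calling defaultdict(items)
        # would treat the items as the factory and raise a TypeError.
        return defaultdict(
            data.default_factory,
            ((k, apply_to_collection(v, dtype, function)) for k, v in data.items()),
        )
    if isinstance(data, dict):
        return type(data)((k, apply_to_collection(v, dtype, function)) for k, v in data.items())
    if isinstance(data, (list, tuple)):
        return type(data)(apply_to_collection(v, dtype, function) for v in data)
    # Anything else is returned unchanged.
    return data
```

The `defaultdict` branch has to come before the generic `dict` branch, since a `defaultdict` is also a `dict`.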
- Added support for monitoring the learning rate without schedulers in
LearningRateMonitor
(#9786) - Added registration of
ShardedTensor
state dict hooks inLightningModule.__init__
if the PyTorch version supportsShardedTensor
(#8944) - Added error handling including calling of
on_keyboard_interrupt()
andon_exception()
for all entrypoints (fit, validate, test, predict) (#8819) - Added a flavor of
training_step
that takesdataloader_iter
as an argument (#8807) - Added a
state_key
property to theCallback
base class (#6886) - Added progress tracking to loops:
- Integrated
TrainingEpochLoop.total_batch_idx
(#8598) - Added
BatchProgress
and integratedTrainingEpochLoop.is_last_batch
(#9657) - Avoid optional
Tracker
attributes (#9320) - Reset
current
progress counters when restarting an epoch loop that had already finished (#9371) - Call
reset_on_restart
in the loop'sreset
hook instead of when loading a checkpoint (#9561) - Use
completed
overprocessed
inreset_on_restart
(#9656) - Renamed
reset_on_epoch
toreset_on_run
(#9658)
- Added
batch_size
andrank_zero_only
arguments forlog_dict
to matchlog
(#8628) - Added a check for unique GPU ids (#8666)
- Added
ResultCollection
state_dict to the Loopstate_dict
and added support for distributed reload (#8641) - Added DeepSpeed collate checkpoint utility function (#8701)
- Added a
handles_accumulate_grad_batches
property to the training type plugins (#8856) - Added a warning to
WandbLogger
when reusing a wandb run (#8714) - Added
log_graph
argument forwatch
method ofWandbLogger
(#8662) LightningCLI
additions:- Added
LightningCLI(run=False|True)
to choose whether to run aTrainer
subcommand (#8751) - Added support to call any trainer function from the
LightningCLI
via subcommands (#7508) - Allow easy trainer re-instantiation (#7508)
- Automatically register all optimizers and learning rate schedulers (#9565)
- Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
- Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
- Support passing lists of callbacks via command line (#8815)
- Support shorthand notation to instantiate models (#9588)
- Support shorthand notation to instantiate datamodules (#10011)
- Added
multifile
option toLightningCLI
to enable/disable config saving to preserve multiple files structure (#9073)
- Fault-tolerant training:
- Added
FastForwardSampler
andCaptureIterableDataset
injection to data loading utilities (#8366) - Added
DataFetcher
to control fetching flow (#8890) - Added
SharedCycleIteratorState
to prevent infinite loop (#8889) - Added
CaptureMapDataset
for state management in map-style datasets (#8891) - Added Fault Tolerant Training to
DataFetcher
(#8891) - Replaced old prefetch iterator with new
DataFetcher
in training loop (#8953) - Added partial support for global random state fault-tolerance in map-style datasets (#8950)
- Converted state to tuple explicitly when setting Python random state (#9401)
- Added support for restarting an optimizer loop (multiple optimizers) (#9537)
- Added support for restarting within Evaluation Loop (#9563)
- Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
- Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
- Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
- Checkpoint saving and loading extensibility:
- Added
CheckpointIO
plugin to expose checkpoint IO from training type plugin (#8743) - Refactored
CheckpointConnector
to offload validation logic to theCheckpointIO
plugin (#9045) - Added
remove_checkpoint
toCheckpointIO
plugin by moving the responsibility out of theModelCheckpoint
callback (#9373) - Added
XLACheckpointIO
plugin (#9972)
- Loop customization:
- Added
Closure
andAbstractClosure
classes (#8642) - Refactored
TrainingBatchLoop
and extractedOptimizerLoop
, splitting off automatic optimization into its own loop (#9191) - Removed
TrainingBatchLoop.backward()
; manual optimization now calls directly intoAccelerator.backward()
and automatic optimization handles backward in newOptimizerLoop
(#9265) - Extracted
ManualOptimization
logic fromTrainingBatchLoop
into its own separate loop class (#9266) - Added
OutputResult
andManualResult
classes (#9437, #9424) - Marked
OptimizerLoop.backward
as protected (#9514) - Marked
FitLoop.should_accumulate
as protected (#9515) - Marked several methods in
PredictionLoop
as protected:on_predict_start
,on_predict_epoch_end
,on_predict_end
,on_predict_model_eval
(#9516) - Marked several methods in
EvaluationLoop
as protected:get_max_batches
,on_evaluation_model_eval
,on_evaluation_model_train
,on_evaluation_start
,on_evaluation_epoch_start
,on_evaluation_epoch_end
,on_evaluation_end
,reload_evaluation_dataloaders
(#9516) - Marked several methods in
EvaluationEpochLoop
as protected:on_evaluation_batch_start
,evaluation_step
,evaluation_step_end
(#9516) - Added
yielding_training_step
example (#9983)
- Added support for saving and loading state of multiple callbacks of the same type (#7187)
- Added DeepSpeed Stage 1 support (#8974)
- Added
Python dataclass
support forLightningDataModule
(#8272) - Added sanitization of tensors when they get logged as hyperparameters in
TensorBoardLogger
(#9031) - Added
InterBatchParallelDataFetcher
(#9020) - Added
DataLoaderIterDataFetcher
(#9020) - Added
DataFetcher
withinFit / Evaluation
Loop (#9047) - Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
- Added Rich integration:
- Added input validation logic for precision (#9080)
- Added support for CPU AMP autocast (#9084)
- Added
on_exception
callback hook (#9183) - Added a warning to DeepSpeed when inferring batch size (#9221)
- Added
ModelSummary
callback (#9344) - Added
log_images
,log_text
andlog_table
toWandbLogger
(#9545) - Added
PL_RECONCILE_PROCESS
environment variable to enable process reconciliation regardless of cluster environment settings (#9389) - Added
get_device_stats
to the Accelerator interface and added its implementation for GPU and TPU (#9586) - Added a warning when an unknown key is encountered in the optimizer configuration, and when
OneCycleLR
is used with"interval": "epoch"
(#9666) - Added
DeviceStatsMonitor
callback (#9712) - Added
enable_progress_bar
to the Trainer constructor (#9664) - Added
pl_legacy_patch
load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166) - Added support for
torch.use_deterministic_algorithms
(#9121) - Added automatic parameters tying for TPUs (#9525)
- Added support for
torch.autograd.set_detect_anomaly
throughTrainer
constructor argumentdetect_anomaly
(#9848) - Added
enable_model_summary
flag to Trainer (#9699) - Added
strategy
argument to Trainer (#8597) - Added
init_meta_context
,materialize_module
utilities (#9920) - Added
TPUPrecisionPlugin
(#10020) - Added
torch.bfloat16
support: - Added
kfold
example for loop customization (#9965) - LightningLite:
- Added
PrecisionPlugin.forward_context
, making it the default implementation for all{train,val,test,predict}_step_context()
methods (#9988) - Added
DDPSpawnPlugin.spawn()
for spawning new processes of a given function (#10018, #10022) - Added
TrainingTypePlugin.{_setup_model, _setup_optimizer}
methods (#9994, #10064) - Implemented
DataParallelPlugin._setup_model
(#10010) - Implemented
DeepSpeedPlugin._setup_model_and_optimizers
(#10009, #10064) - Implemented
{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers
(#10028, #10064) - Added optional
model
argument to theoptimizer_step
methods in accelerators and plugins (#10023) - Updated precision attributes in
DeepSpeedPlugin
(#10164) - Added the ability to return a result from rank 0 in
DDPSpawnPlugin.spawn
(#10162) - Added
pytorch_lightning.lite
package (#10175) - Added
LightningLite
documentation (#10043) - Added
LightningLite
examples (#9987) - Make the
_LiteDataLoader
an iterator and add support for custom dataloaders (#10279)
- Added
use_omegaconf
argument tosave_hparams_to_yaml
plugin (#9170) - Added
ckpt_path
argument forTrainer.fit()
(#10061) - Added
auto_device_count
method toAccelerators
(#10222) - Added support for
devices="auto"
(#10264) - Added a
filename
argument inModelCheckpoint.format_checkpoint_name
(#9818) - Added support for empty
gpus
list to run on CPU (#10246) - Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
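The `state_key` property and the support for saving state of multiple callbacks of the same type, both listed above, come down to giving each callback instance its own key in the checkpoint. A minimal sketch of the idea follows. The `Callback` and `Checkpointer` classes and `collect_callback_states` are hypothetical, and the key format is an assumption rather than Lightning's exact scheme.

```python
from typing import Any, Dict, Iterable, Optional


class Callback:
    """Hypothetical minimal base class."""

    @property
    def state_key(self) -> str:
        # Default: one checkpoint entry per callback class.
        return type(self).__qualname__

    def state_dict(self) -> Dict[str, Any]:
        return {}


class Checkpointer(Callback):
    """Hypothetical callback that can be configured more than once."""

    def __init__(self, monitor: str) -> None:
        self.monitor = monitor
        self.best: Optional[float] = None

    @property
    def state_key(self) -> str:
        # Include the distinguishing argument so two instances get different keys.
        return type(self).__qualname__ + repr({"monitor": self.monitor})

    def state_dict(self) -> Dict[str, Any]:
        return {"best": self.best}


def collect_callback_states(callbacks: Iterable[Callback]) -> Dict[str, Dict[str, Any]]:
    """Build the mapping that would be stored in a checkpoint."""
    return {cb.state_key: cb.state_dict() for cb in callbacks}
```

Keying by a string instead of the class object also avoids pickling the callback type into the checkpoint.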
- Trainer now raises a `MisconfigurationException` when its methods are called with `ckpt_path="best"` but a checkpoint callback isn't configured (#9841)
- Setting `Trainer(accelerator="ddp_cpu")` now does not spawn a subprocess if `num_processes` is kept `1` along with `num_nodes > 1` (#9603)
- Module imports are now catching `ModuleNotFoundError` instead of `ImportError` (#9867)
- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with the new neptune-client API; the old neptune-client API is supported by `NeptuneClient` from the neptune-contrib repo (#6867)
- Parsing of `enums` type hyperparameters to be saved in the `hparams.yaml` file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
- Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
- `iteration_count` and other index attributes in the loops have been replaced with progress dataclasses (#8477)
- The `trainer.lightning_module` reference is now properly set at the very beginning of a run (#8536)
- The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
- The `model` argument of the `Trainer` functions `reset_{train,val,test,predict}_dataloader`, `reset_train_val_dataloaders`, and `request_dataloader` is now optional (#8536)
- Saved checkpoints will no longer use the type of a `Callback` as the key to avoid issues with unpickling (#6886)
- Improved string conversion for `ResultCollection` (#8622)
- `LightningCLI` changes:
    - `LightningCLI.init_parser` now returns the parser instance (#8721)
    - `LightningCLI.add_core_arguments_to_parser`, `LightningCLI.parse_arguments` now take a `parser` argument (#8721)
    - `LightningCLI.instantiate_trainer` now takes a config and a list of callbacks (#8721)
    - Split `LightningCLI.add_core_arguments_to_parser` into `LightningCLI.add_default_arguments_to_parser` + `LightningCLI.add_core_arguments_to_parser` (#8721)
- The accelerator and training type plugin
setup
hooks no longer have amodel
argument (#8536) - The accelerator and training type plugin
update_global_step
hook has been removed (#8856) - The coverage of
self.log
-ing in anyLightningModule
orCallback
hook has been improved (#8498) self.log
-ing without aTrainer
reference now raises a warning instead of an exception (#9733)- Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
Trainer.request_dataloader
now takes aRunningStage
enum instance (#8858)- Changed
rank_zero_warn
toNotImplementedError
in the{train, val, test, predict}_dataloader
hooks thatLightning(Data)Module
uses (#9161) - Moved
block_ddp_sync_behaviour
out ofTrainingBatchLoop
to loop utilities (#9192) - Executing the
optimizer_closure
is now required when overriding theoptimizer_step
hook (#9360) - Changed logging of
LightningModule
andLightningDataModule
hyperparameters to raise an exception only if there are colliding keys with different values (#9496) seed_everything
now fails when an invalid seed value is passed instead of selecting a random seed (#8787)- The Trainer now calls
TrainingTypePlugin
collective APIs directly instead of going through the Accelerator reference (#9677, #9901) - The tuner now uses a unique filename to save a temporary checkpoint (#9682)
- Changed
HorovodPlugin.all_gather
to return atorch.Tensor
instead of a list (#9696) - Changed Trainer connectors to be protected attributes:
- Configuration Validator (#9779)
- The
current_epoch
andglobal_step
attributes now get restored irrespective of the Trainer task (#9413) - Trainer now raises an exception when requesting
amp_level
with nativeamp_backend
(#9755) - Update the logic to check for accumulation steps with deepspeed (#9826)
pytorch_lightning.utilities.grads.grad_norm
now raises an exception if parameternorm_type <= 0
(#9765)- Updated error message for interactive incompatible plugins (#9896)
- Moved the
optimizer_step
andclip_gradients
hook from theAccelerator
andTrainingTypePlugin
into thePrecisionPlugin
(#10143, #10029) NativeMixedPrecisionPlugin
and its subclasses now take an optionalGradScaler
instance (#10055)- Trainer is now raising a
MisconfigurationException
instead of a warning ifTrainer.{validate/test}
is missing required methods (#10016) - Changed default value of the
max_steps
Trainer argument fromNone
to -1 (#9460) - LightningModule now raises an error when calling
log(on_step=False, on_epoch=False)
(#10227) - Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
- Raised
MisconfigurationException
when the total length of dataloader
across ranks is zero, and give a warning when the total length is non-zero but the local rank length is zero (#9827) - Changed the model size calculation using
ByteCounter
(#10123) - Enabled
on_load_checkpoint
forLightningDataModule
for alltrainer_fn
(#10238) - Allowed separate config files for parameters with class type when LightningCLI is in
subclass_mode=False
(#10286)
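The gpus parsing change above (#8770) is easy to misread, so here is a minimal sketch of the new rule in plain Python. The helper first_n_devices and its available argument are hypothetical and are not part of Lightning; the sketch only models the gpus="n" string case described in that entry.

```python
from typing import List


def first_n_devices(gpus: str, available: List[int]) -> List[int]:
    """Hypothetical helper: treat gpus="n" as "use the first n devices".

    Before #8770 a string such as "3" selected the device with index 3;
    afterwards it selects the first three devices.
    """
    count = int(gpus)  # only plain integer strings are modelled here
    if count < 0 or count > len(available):
        raise ValueError(f"cannot select {count} of {len(available)} devices")
    return available[:count]


print(first_n_devices("3", [0, 1, 2, 3]))  # [0, 1, 2]
```

Under the old reading the same call would have picked only device 3, so scripts that pass the value as a string are worth re-checking after upgrading.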
- Deprecated Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
- Deprecated Trainer.terminate_on_nan public attribute access (#9849)
- Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
- Deprecated LightningModule.model_size (#8343)
- Deprecated DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
- Deprecated add_to_queue, get_from_queue from LightningModule in favor of corresponding methods in the DDPSpawnPlugin (#9118)
- Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
- Deprecated prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
- Deprecated the TestTubeLogger (#9065)
- Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
- Deprecated on_keyboard_interrupt callback hook in favor of new on_exception hook (#9260)
- Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
- Deprecated passing flush_logs_every_n_steps as a Trainer argument; instead, pass it to the logger init if supported (#9366)
- Deprecated LightningLoggerBase.close, LoggerCollection.close in favor of LightningLoggerBase.finalize, LoggerCollection.finalize (#9422)
- Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616)
- Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
- Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
- Deprecated Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
- Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
- Deprecated the LightningModule.on_post_move_to_device method (#9525)
- Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
- Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
- Deprecated log_gpu_memory, gpu_metrics, and utility functions in favor of DeviceStatsMonitor callback (#9921)
- Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of DeviceStatsMonitor callback (#9924)
- Deprecated setting Trainer(max_steps=None); to turn off the limit, set Trainer(max_steps=-1) (default) (#9460)
- Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
- Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
- Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
- Deprecated ClusterEnvironment.creates_children() in favor of ClusterEnvironment.creates_processes_externally (property) (#10106)
- Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
- Deprecated lr_sch_names from LearningRateMonitor (#10066)
- Deprecated ProgressBar callback in favor of TQDMProgressBar (#10134)
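For the max_steps deprecation (#9460), the following is a minimal sketch of how calling code could treat the old and new sentinel values. Both functions are hypothetical illustrations written for this note and are not Lightning APIs; only the values None and -1 come from the entry.

```python
import warnings
from typing import Optional


def normalize_max_steps(max_steps: Optional[int] = -1) -> int:
    """Hypothetical shim: map the deprecated None to the new default -1."""
    if max_steps is None:
        warnings.warn(
            "max_steps=None is deprecated; use max_steps=-1 to disable the limit",
            DeprecationWarning,
            stacklevel=2,
        )
        return -1
    return max_steps


def step_limit_reached(global_step: int, max_steps: int) -> bool:
    """-1 means "no limit"; any other value is an upper bound on steps."""
    return max_steps != -1 and global_step >= max_steps
```

Using an integer sentinel keeps the argument a plain int everywhere, so comparisons such as the one in step_limit_reached need no None check.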
- Removed deprecated
metrics
(#8586) - Removed the deprecated
outputs
argument in both theLightningModule.on_train_epoch_end
andCallback.on_train_epoch_end
hooks (#8587) - Removed the deprecated
TrainerLoggingMixin
class (#8609) - Removed the deprecated
TrainerTrainingTricksMixin
class (#8679) - Removed the deprecated
optimizer_idx
fromtraining_step
as an accepted argument in manual optimization (#8576) - Removed support for the deprecated
on_save_checkpoint
signature. The hook now takes acheckpoint
positional parameter (#8697) - Removed support for the deprecated
on_load_checkpoint
signature. The hook now takes apl_module
positional parameter (#8697) - Removed the deprecated
save_function
property inModelCheckpoint
(#8680) - Removed the deprecated
model
argument fromModelCheckpoint.save_checkpoint
(#8688) - Removed the deprecated
sync_step
argument fromWandbLogger
(#8763) - Removed the deprecated
Trainer.truncated_bptt_steps
in favor ofLightningModule.truncated_bptt_steps
(#8826) - Removed
LightningModule.write_predictions
andLightningModule.write_predictions_dict
(#8850) - Removed
on_reset_*_dataloader
hooks in TrainingType Plugins and Accelerators (#8858) - Removed deprecated
GradInformation
module in favor ofpytorch_lightning.utilities.grads
(#8831) - Removed
TrainingTypePlugin.on_save
andAccelerator.on_save
(#9023) - Removed
{Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step
(#9746) - Removed deprecated
connect_precision_plugin
andconnect_training_type_plugin
fromAccelerator
(#9019) - Removed
on_train_epoch_end
fromAccelerator
(#9035) - Removed
InterBatchProcessor
in favor ofDataLoaderIterDataFetcher
(#9052) - Removed
Plugin
inbase_plugin.py
in favor of accessingTrainingTypePlugin
andPrecisionPlugin
directly instead (#9066) - Removed
teardown
fromParallelPlugin
(#8943) - Removed deprecated
profiled_functions
argument fromPyTorchProfiler
(#9178) - Removed deprecated
pytorch_lightning.utilities.argparse_utils
module (#9166) - Removed deprecated property
Trainer.running_sanity_check
in favor ofTrainer.sanity_checking
(#9209) - Removed deprecated
BaseProfiler.output_filename
arg from it and its descendants in favor ofdirpath
andfilename
(#9214) - Removed deprecated property
ModelCheckpoint.period
in favor ofModelCheckpoint.every_n_epochs
(#9213) - Removed deprecated
auto_move_data
decorator (#9231) - Removed deprecated property
LightningModule.datamodule
in favor ofTrainer.datamodule
(#9233) - Removed deprecated properties
DeepSpeedPlugin.cpu_offload*
in favor ofoffload_optimizer
,offload_parameters
andpin_memory
(#9244) - Removed deprecated property
AcceleratorConnector.is_using_torchelastic
in favor ofTorchElasticEnvironment.is_using_torchelastic()
(#9729) - Removed
pytorch_lightning.utilities.debugging.InternalDebugger
(#9680) - Removed
call_configure_sharded_model_hook
property fromAccelerator
andTrainingTypePlugin
(#9612) - Removed
TrainerProperties
mixin and moved property definitions directly intoTrainer
(#9495) - Removed a redundant warning with
ModelCheckpoint(monitor=None)
callback (#9875) - Remove
epoch
fromtrainer.logged_metrics
(#9904) - Remove deprecated
distributed_backend
fromTrainer
(#10017) - Removed
process_idx
from the{DDPSpawnPlugin,TPUSpawnPlugin}.new_process
methods (#10022) - Removed automatic patching of
{train,val,test,predict}_dataloader()
on theLightningModule
(#9764) - Removed
pytorch_lightning.trainer.connectors.OptimizerConnector
(#10120)
- Fixed ImageNet evaluation in example (#10179)
- Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
- Fixed
move_metrics_to_cpu
moving the loss to CPU while training on device (#9308) - Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
- Fixed an issue with freeing memory of datafetchers during teardown (#9387)
- Fixed a bug where the training step output needed to be
deepcopy
-ed (#9349) - Fixed an issue with freeing memory allocated by the data iterators in
Loop.on_run_end
(#9386, #9915) - Fixed
BasePredictionWriter
not returning the batch indices in a non-distributed setting (#9432) - Fixed an error when running in XLA environments with no TPU attached (#9572)
- Fixed check on torchmetrics logged whose
compute()
output is a multielement tensor (#9582) - Fixed gradient accumulation for
DDPShardedPlugin
(#9122) - Fixed missing DeepSpeed distributed call (#9540)
- Fixed an issue with wrapped LightningModule during evaluation; The LightningModule no longer gets wrapped with data-parallel modules when not fitting in
DDPPlugin
,DDPSpawnPlugin
,DDPShardedPlugin
,DDPSpawnShardedPlugin
(#9096) - Fixed
trainer.accumulate_grad_batches
to be an int on init. The default value for it is nowNone
inside Trainer (#9652) - Fixed
broadcast
inDDPPlugin
andDDPSpawnPlugin
to respect thesrc
input (#9691) - Fixed
self.log(on_epoch=True, reduce_fx=sum)
for theon_batch_start
andon_train_batch_start
hooks (#9791) - Fixed
self.log(on_epoch=True)
for theon_batch_start
andon_train_batch_start
hooks (#9780) - Fixed restoring training state during
Trainer.fit
only (#9413) - Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
- Fixed DeepSpeed GPU device IDs (#9847)
- Reset
val_dataloader
intuner/batch_size_scaling
(#9857) - Fixed use of
LightningCLI
in computer_vision_fine_tuning.py example (#9934) - Fixed issue with non-init dataclass fields in
apply_to_collection
(#9963) - Reset
val_dataloader
intuner/batch_size_scaling
for binsearch (#9975) - Fixed logic to check for spawn in dataloader
TrainerDataLoadingMixin._worker_check
(#9902) - Fixed
train_dataloader
getting loaded twice when resuming from a checkpoint duringTrainer.fit()
(#9671) - Fixed
LearningRateMonitor
logging with multiple param groups optimizer with no scheduler (#10044) - Fixed undesired side effects being caused by
Trainer
patching dataloader methods on theLightningModule
(#9764) - Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
- Fixed
on_before_optimizer_step
getting called before the optimizer closure (including backward) has run (#10167) - Fixed monitor value in
ModelCheckpoint
getting moved to the wrong device in a special case where it becomes NaN (#10118) - Fixed creation of
dirpath
inBaseProfiler
if it doesn't exist (#10073) - Fixed incorrect handling of sigterm (#10189)
- Fixed bug where
log(on_step=True, on_epoch=True, sync_dist=True)
wouldn't reduce the value on step (#10227) - Fixed an issue with
pl.utilities.seed.reset_seed
converting thePL_SEED_WORKERS
environment variable tobool
(#10099) - Fixed iterating over a logger collection when
fast_dev_run > 0
(#10232) - Fixed
batch_size
inResultCollection
not being reset to 1 on epoch end (#10242) - Fixed
distrib_type
not being set when training plugin instances are being passed to the Trainer (#10251)
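The PL_SEED_WORKERS entry above (#10099) touches a common Python pitfall: every non-empty string is truthy, so bool("0") is True. The sketch below assumes this is the kind of conversion that went wrong. The helper seed_workers_enabled is hypothetical; only the variable name PL_SEED_WORKERS comes from the entry.

```python
import os
from typing import Mapping


def seed_workers_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: read a "0"/"1" flag without the bool(str) trap."""
    raw = environ.get("PL_SEED_WORKERS", "0")
    return bool(int(raw))  # int() first, so "0" becomes 0 and then False


print(bool("0"))                                       # True (the pitfall)
print(seed_workers_enabled({"PL_SEED_WORKERS": "0"}))  # False
print(seed_workers_enabled({"PL_SEED_WORKERS": "1"}))  # True
```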
- Fixed
lr_find
to generate same results on multiple calls (#9704) - Fixed
reset
metrics on validation epoch end (#9717) - Fixed input validation for
gradient_clip_val
,gradient_clip_algorithm
,track_grad_norm
andterminate_on_nan
Trainer arguments (#9595) - Reset metrics before each task starts (#9410)
- Fixed error reporting in DDP process reconciliation when processes are launched by an external agent (#9389)
- Added PL_RECONCILE_PROCESS environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
- Fixed
add_argparse_args
raisingTypeError
when args are typed astyping.Generic
in Python 3.6 (#9554) - Fixed back-compatibility for saving hyperparameters from a single container and inferring its argument name by reverting #9125 (#9642)
- Fixed logging of nan parameters (#9364)
- Fixed
replace_sampler
missing the batch size under specific conditions (#9367) - Pass init args to ShardedDataParallel (#9483)
- Fixed collision of user argument when using ShardedDDP (#9512)
- Fixed DeepSpeed crash for RNNs (#9489)
- Fixed an issue with export to ONNX format when a model has multiple inputs (#8800)
- Removed deprecation warnings being called for
on_{task}_dataloader
(#9279) - Fixed save/load/resume from checkpoint for DeepSpeed Plugin (#8397, #8644, #8627)
- Fixed
EarlyStopping
running on train epoch end whencheck_val_every_n_epoch>1
is set (#9156) - Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8333)
- Fixed the Apex and DeepSpeed plugin closure running after the
on_before_optimizer_step
hook (#9288) - Fixed the Native AMP plugin closure not running with manual optimization (#9288)
- Fixed bug where data-loading functions were not getting the correct running stage passed (#8858)
- Fixed intra-epoch evaluation outputs staying in memory when the respective
*_epoch_end
hook wasn't overridden (#9261) - Fixed error handling in DDP process reconciliation when
_sync_dir
was not initialized (#9267) - Fixed PyTorch Profiler not enabled for manual optimization (#9316)
- Fixed inspection of other args when a container is specified in
save_hyperparameters
(#9125) - Fixed signature of
Timer.on_train_epoch_end
andStochasticWeightAveraging.on_train_epoch_end
to prevent unwanted deprecation warnings (#9347)
- Fixed reduction using self.log(sync_dist=True, reduce_fx={mean,max}) (#9142)
- Fixed not setting a default value for max_epochs if max_time was specified on the Trainer constructor (#9072)
- Fixed the CometLogger modifying the metrics in place; it now creates a copy of the metrics before performing any operations (#9150)
- Fixed DDP "CUDA error: initialization error" due to a copy instead of deepcopy on ResultCollection (#9239)
- Fixed a bug in the binary search mode of auto batch size scaling where exception was raised if the first trainer run resulted in OOM (#8954)
- Fixed a bug causing logging with
log_gpu_memory='min_max'
not working (#9013)
- Fixed plateau scheduler stepping on incomplete epoch (#8861)
- Fixed infinite loop with
CycleIterator
and multiple loaders (#8889) - Fixed
StochasticWeightAveraging
with a list of learning rates not applying them to each param group (#8747) - Restore original loaders if replaced by entrypoint (#8885)
- Fixed lost reference to
_Metadata
object inResultMetricCollection
(#8932) - Ensure the existence of
DDPPlugin._sync_dir
inreconciliate_processes
(#8939)
- Fixed recursive call for
apply_to_collection(include_none=False)
(#8719) - Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer (#8804)
- Fixed comments and exception message for metrics_to_scalars (#8782)
- Fixed typo error in LightningLoggerBase.after_save_checkpoint docstring (#8737)
- Fixed
trainer.fit_loop.split_idx
always returningNone
(#8601) - Fixed references for
ResultCollection.extra
(#8622) - Fixed reference issues during epoch end result collection (#8621)
- Fixed horovod auto-detection when horovod is not installed and the launcher is
mpirun
(#8610) - Fixed an issue with
training_step
outputs not getting collected correctly fortraining_epoch_end
(#8613) - Fixed distributed types support for CPUs (#8667)
- Fixed a deadlock issue with DDP and torchelastic (#8655)
- Fixed
accelerator=ddp
choice for CPU (#8645)
- Added
extract_batch_size
utility and corresponding tests to extract batch dimension from multiple batch types (#8357) - Added support for named parameter groups in
LearningRateMonitor
(#7987) - Added
dataclass
support forpytorch_lightning.utilities.apply_to_collection
(#7935) - Added support to
LightningModule.to_torchscript
for saving to custom filesystems withfsspec
(#7617) - Added
KubeflowEnvironment
for use with thePyTorchJob
operator in Kubeflow - Added LightningCLI support for config files on object stores (#7521)
- Added
ModelPruning(prune_on_train_epoch_end=True|False)
to choose when to apply pruning (#7704) - Added support for checkpointing based on a provided time interval during training (#7515)
- Progress tracking
- Added support for passing a
LightningDataModule
positionally as the second argument totrainer.{validate,test,predict}
(#7431) - Added argument
trainer.predict(ckpt_path)
(#7430) - Added
clip_grad_by_value
support for TPUs (#7025) - Added support for passing any class to
is_overridden
(#7918) - Added
sub_dir
parameter toTensorBoardLogger
(#6195) - Added correct
dataloader_idx
to batch transfer hooks (#6241) - Added
include_none=bool
argument toapply_to_collection
(#7769) - Added
apply_to_collections
to apply a function to two zipped collections (#7769) - Added
ddp_fully_sharded
support (#7487) - Added
should_rank_save_checkpoint
property to Training Plugins (#7684) - Added
log_grad_norm
hook toLightningModule
to customize the logging of gradient norms (#7873) - Added
save_config_filename
init argument toLightningCLI
to ease resolving name conflicts (#7741) - Added
save_config_overwrite
init argument toLightningCLI
to ease overwriting existing config files (#8059) - Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
- Added trainer stage hooks for Training Plugins and Accelerators (#7864)
- Added the
on_before_optimizer_step
hook (#8048) - Added IPU Accelerator (#7867)
- Fault-tolerant training
  - Added {,load_}state_dict to ResultCollection (#7948)
  - Added {,load_}state_dict to Loops (#8197)
  - Added FastForwardSampler and CaptureIterableDataset (#8307)
  - Set Loop.restarting=False at the end of the first iteration (#8362)
  - Save the loops state with the checkpoint (opt-in) (#8362)
  - Save a checkpoint to restore the state on exception (opt-in) (#8362)
  - Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
- Added
rank_zero_only
toLightningModule.log
function (#7966) - Added
metric_attribute
toLightningModule.log
function (#7966) - Added a warning if
Trainer(log_every_n_steps)
is a value too high for the training dataloader (#7734) - Added LightningCLI support for argument links applied on instantiation (#7895)
- Added LightningCLI support for configurable callbacks that should always be present (#7964)
- Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
- Added support for
torch.nn.UninitializedParameter
inModelSummary
(#7642) - Added support for
LightningModule.save_hyperparameters
when LightningModule
is a dataclass (#7992) - Added support for overriding
optimizer_zero_grad
andoptimizer_step
when using accumulate_grad_batches (#7980) - Added
logger
boolean flag tosave_hyperparameters
(#7960) - Added support for calling scripts using the module syntax (
python -m package.script
) (#8073) - Added support for optimizers and learning rate schedulers to
LightningCLI
(#8093) - Added XLA Profiler (#8014)
- Added
PrecisionPlugin.{pre,post}_backward
(#8328) - Added
on_load_checkpoint
andon_save_checkpoint
hooks to thePrecisionPlugin
base class (#7831) - Added
max_depth
parameter inModelSummary
(#8062) - Added
XLAStatsMonitor
callback (#8235) - Added
restore
function andrestarting
attribute to baseLoop
(#8247) - Added support for
save_hyperparameters
inLightningDataModule
(#3792) - Added the
ModelCheckpoint(save_on_train_epoch_end)
to choose when to run the saving logic (#8389) - Added
LSFEnvironment
for distributed training with the LSF resource managerjsrun
(#5102) - Added support for
accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'
(#7808) - Added
tpu_spawn_debug
to plugin registry (#7933) - Enabled traditional/manual launching of DDP processes through
LOCAL_RANK
andNODE_RANK
environment variable assignments (#7480) - Added
quantize_on_fit_end
argument toQuantizationAwareTraining
(#8464) - Added experimental support for loop specialization (#8226)
- Added support for
devices
flag to Trainer (#8440) - Added private
prevent_trainer_and_dataloaders_deepcopy
context manager on theLightningModule
(#8472) - Added support for providing callables to the Lightning CLI instead of types (#8400)
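The include_none argument added to apply_to_collection (#7769) can be illustrated with a simplified stand-in. This is a sketch written for this note, not the library implementation: the name apply_to_collection_sketch is hypothetical, it handles only dicts, lists and tuples, and it assumes that include_none=False drops elements for which the function returns None.

```python
from typing import Any, Callable, Tuple, Type, Union


def apply_to_collection_sketch(
    data: Any,
    dtype: Union[Type, Tuple[Type, ...]],
    function: Callable[[Any], Any],
    include_none: bool = True,
) -> Any:
    """Simplified stand-in: apply function to every element of type dtype."""
    if isinstance(data, dtype):
        return function(data)
    if isinstance(data, dict):
        out = {}
        for key, value in data.items():
            value = apply_to_collection_sketch(value, dtype, function, include_none)
            if include_none or value is not None:
                out[key] = value
        return out
    if isinstance(data, (list, tuple)):
        items = [apply_to_collection_sketch(v, dtype, function, include_none) for v in data]
        if not include_none:
            items = [v for v in items if v is not None]
        return type(data)(items)  # note: namedtuples are not handled here
    return data  # anything else is returned unchanged


def double_if_large(x: int):
    """Return 2 * x for x > 1, otherwise None."""
    return x * 2 if x > 1 else None


batch = {"a": 1, "b": [2, 3], "name": "x"}
print(apply_to_collection_sketch(batch, int, double_if_large))
# {'a': None, 'b': [4, 6], 'name': 'x'}
print(apply_to_collection_sketch(batch, int, double_if_large, include_none=False))
# {'b': [4, 6], 'name': 'x'}
```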
- Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
- Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
- Log epoch metrics before the on_evaluation_end hook (#7272)
- Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
- Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
- Changed
metrics_to_scalars
to work with any collection or value (#7888) - Changed
clip_grad_norm
to usetorch.nn.utils.clip_grad_norm_
(#7025) - Validation is now always run inside the training epoch scope (#7357)
- ModelCheckpoint now runs at the end of the training epoch by default (#8389)
- EarlyStopping now runs at the end of the training epoch by default (#8286)
- Refactored Loops
  - Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
  - Refactored result handling in training loop (#7506)
  - Moved attributes hiddens and split_idx to TrainLoop (#7507)
  - Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
  - Simplified "should run validation" logic (#7682)
  - Simplified logic for updating the learning rate for schedulers (#7682)
  - Removed the on_epoch guard from the "should stop" validation check (#7701)
  - Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
  - Removed pytorch_lightning/trainer/training_loop.py (#7985)
  - Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
  - Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
  - Restricted public access to several internal functions (#8024)
  - Refactored trainer _run_* functions and separate evaluation loops (#8065)
  - Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
  - Removed pytorch_lightning/trainer/predict_loop.py (#8094)
  - Moved result teardown to the loops (#8245)
  - Improve Loop API to better handle children state_dict and progress (#8334)
- Refactored logging
  - Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
  - Dramatically simplify the LoggerConnector (#7882)
  - trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
  - Completely overhaul the Result object in favor of ResultMetric (#7882)
  - Improve epoch-level reduction time and overall memory usage (#7882)
  - Allow passing self.log(batch_size=...) (#7891)
  - Each of the training loops now keeps its own results collection (#7891)
  - Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
  - Remove MetricsHolder (#7909)
- Moved
ignore_scalar_return_in_dp
warning suppression to the DataParallelPlugin class (#7421) - Changed the behaviour when logging evaluation step metrics to no longer append
/epoch_*
to the metric name (#7351) - Raised
ValueError
when aNone
value isself.log
-ed (#7771) - Changed
resolve_training_type_plugins
to allow settingnum_nodes
andsync_batchnorm
fromTrainer
setting (#7026) - Default
seed_everything(workers=True)
in theLightningCLI
(#7504) - Changed
model.state_dict()
inCheckpointConnector
to allowtraining_type_plugin
to customize the model'sstate_dict()
(#7474) - MLFlowLogger
now uses the env variable MLFLOW_TRACKING_URI
as default tracking URI (#7457) - Changed
Trainer
arg and functionality fromreload_dataloaders_every_epoch
toreload_dataloaders_every_n_epochs
(#5043) - Changed
WandbLogger(log_model={True/'all'})
to log models as artifacts (#6231) - MLFlowLogger now accepts
run_name
as a constructor argument (#7622) - Changed
teardown()
inAccelerator
to allowtraining_type_plugin
to customizeteardown
logic (#7579) Trainer.fit
now raises an error when using manual optimization with unsupported features such asgradient_clip_val
oraccumulate_grad_batches
(#7788) - Accelerator hooks are called regardless of whether
LightningModule
overrides the same hooks (#7826) - Moved profilers to their own file (#7822)
- The
on_after_backward
hook is now called on accumulating iterations. Use theon_before_optimizer_step
hook to mimic the old behaviour (#8328) - The mixed precision loss is no longer unscaled before the
on_after_backward
hook. Use theon_before_optimizer_step
hook to mimic the old behaviour (#8328) - The
TrainingTypePlugin.{pre,post}_backward
hooks no longer take theoptimizer, opt_idx, should_accumulate
arguments (#8328) - The
PrecisionPlugin.backward
hook no longer returns a value (#8328) - The
PrecisionPlugin.backward
hook no longer takes a should_accumulate
argument (#8328) - Added the
on_before_backward
hook (#7865) LightningCLI
now aborts with a clearer message if config already exists and disables save config duringfast_dev_run
(#7963)- Saved the
LightningCLI
config onsetup
and only on the main process (#8017) - Dropped the
LightningCLI
ArgumentParser
when pickling (#8017) - Skip
broadcast
if distributed not initialized for the spawn plugins (#8017) Trainer(resume_from_checkpoint=...)
now restores the model directly afterLightningModule.setup()
, which is beforeLightningModule.configure_sharded_model()
(#7652)- Moved
torch.cuda.set_device()
to enable collective calls earlier in setup (#8312) - Used XLA utility API to move data to CPU (Single TPU core) (#8078)
- Improved error messages in
replace_sampler
when theDataLoader
attributes are not included in the signature or the signature is missing optional arguments (#8519) - Moved
DeviceDtypeModuleMixin
andHyperparametersMixin
mixin tocore
(#8396) - Return the
default_root_dir
as thelog_dir
when the logger is aLoggerCollection
(#8187)
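The on_after_backward change above (#8328) is easiest to see with gradient accumulation. The toy loop below is not Lightning code. It is a sketch that assumes an optimizer step on every accumulate_grad_batches-th batch and ignores special cases such as a final partial accumulation window. The function simulate_hooks is hypothetical; only the two hook names come from the entries.

```python
from typing import List, Tuple


def simulate_hooks(num_batches: int, accumulate_grad_batches: int) -> List[Tuple[str, int]]:
    """Toy loop (not Lightning code) recording which hook fires on which batch."""
    calls = []
    for batch_idx in range(num_batches):
        # backward runs for every batch, including accumulating ones
        calls.append(("on_after_backward", batch_idx))
        stepping = (batch_idx + 1) % accumulate_grad_batches == 0
        if stepping:
            # only here are the accumulated gradients complete
            calls.append(("on_before_optimizer_step", batch_idx))
    return calls


for name, idx in simulate_hooks(num_batches=4, accumulate_grad_batches=2):
    print(idx, name)
```

Code that inspects or logs fully accumulated gradients therefore fits on_before_optimizer_step, which is the migration path the entries recommend.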
- Deprecated LightningModule.loaded_optimizer_states_dict (#8229)
- Standardized the dataloaders arguments of trainer.{fit,validate,test,tune} (#7431)
- Deprecated DataModule properties: has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test, has_teardown_predict (#7657)
- Deprecated TrainerModelHooksMixin in favor of pytorch_lightning.utilities.signature_utils (#7422)
- Deprecated num_nodes and sync_batchnorm arguments in DDPPlugin and DDPSpawnPlugin (#7026)
- Deprecated self.log(sync_dist_op) in favor of self.log(reduce_fx) (#7891)
- Deprecated is_overridden(model=...) in favor of is_overridden(instance=...) (#7918)
- Deprecated automatically detaching returned extras with grads (#7994)
- Deprecated default value of monitor argument in EarlyStopping callback to enforce monitor as a required argument (#7907)
- Deprecated importing rank_zero_{warn,deprecation} directly from pytorch_lightning.utilities.distributed (#8085)
- Deprecated the use of CheckpointConnector.hpc_load() in favor of CheckpointConnector.restore() (#7652)
- Deprecated ModelCheckpoint(every_n_val_epochs) in favor of ModelCheckpoint(every_n_epochs) (#8383)
- Deprecated DDPPlugin.task_idx in favor of DDPPlugin.local_rank (#8203)
- Deprecated the Trainer.train_loop property in favor of Trainer.fit_loop (#8025)
- Deprecated the Trainer.disable_validation property in favor of not Trainer.enable_validation (#8291)
- Deprecated mode parameter in ModelSummary in favor of max_depth (#8062)
- Deprecated reload_dataloaders_every_epoch argument of Trainer in favor of reload_dataloaders_every_n_epochs (#5043)
- Deprecated distributed_backend argument for Trainer (#8575)
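The move from is_overridden(model=...) to is_overridden(instance=...) (#7918) reflects that the check applies to any object, not only a model. The sketch below shows one way such a check could work and why a pass-through wrapper should not count as an override. It is an assumption-laden illustration rather than the library code: is_overridden_sketch, passthrough and the example classes are all hypothetical.

```python
import functools
import inspect


def is_overridden_sketch(method_name: str, instance: object, parent: type) -> bool:
    """Simplified check: does instance implement method_name differently from parent?"""
    instance_attr = getattr(instance, method_name, None)
    parent_attr = getattr(parent, method_name, None)
    if instance_attr is None or parent_attr is None:
        return False
    # Follow functools.wraps chains so a pass-through wrapper does not count.
    instance_attr = inspect.unwrap(instance_attr)
    return instance_attr.__code__ is not parent_attr.__code__


class Base:
    def training_step(self):
        return "base"

    def setup(self):
        return None


class Child(Base):
    def training_step(self):
        return "child"


def passthrough(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


class Wrapped(Base):
    setup = passthrough(Base.setup)


print(is_overridden_sketch("training_step", Child(), Base))  # True
print(is_overridden_sketch("setup", Child(), Base))          # False
print(is_overridden_sketch("setup", Wrapped(), Base))        # False
```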
- Dropped official support/testing for PyTorch <1.6 (#8288)
- Removed ProfilerConnector (#7654)
- Pruned deprecated classification metrics from pytorch_lightning.metrics.functional.classification (#7499)
- Removed deprecated data parallel classes LightningDataParallel and LightningDistributedDataParallel from pytorch_lightning.overrides.data_parallel (#7510)
- Removed deprecated trainer attributes: get_model and accelerator_backend (#7502)
- Removed support for automatically monitoring the val_loss key with ModelCheckpoint. Pass your monitor of choice to the ModelCheckpoint instance instead (#8293)
- Removed support for self.log(tbptt_reduce_fx) and self.log(tbptt_pad_token). Please open a discussion explaining your use case if you relied on these (#7644)
- Removed deprecated utils modules model_utils, warning_utils, xla_device_utils and partially argparse_utils (#7503)
- Removed RPCPlugin and RPCSequentialPlugin. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
- Removed deprecated trainer attributes: on_cpu, on_tpu, use_tpu, on_gpu, use_dp, use_ddp, use_ddp2, use_horovod, use_single_gpu (#7501)
- Removed deprecated optimizer argument in LightningModule.manual_backward(); toggling optimizers in manual optimization should be done using LightningModule.{un}toggle_optimizer() (#8287)
- Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
- Removed environment variable PL_EXP_VERSION from DDP subprocesses (#7403)
- Fixed the
GPUStatsMonitor
callbacks to use the correct GPU IDs if CUDA_VISIBLE_DEVICES
is set (#8260) - Fixed
lr_scheduler
checkpointed state by callingupdate_lr_schedulers
before saving checkpoints (#7877) - Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
- Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
- Fixed
None
loss keys getting added intraining_epoch_end
when using manual optimization and not returning a loss (#7772) - Fixed a bug where
precision=64
withaccelerator='ddp_spawn'
would throw a pickle error (#6924) - Do not override the existing
epoch
value inlogged_metrics
when already logged by the user (#7982) - Support for manual optimization with DeepSpeed (#7970)
- Fixed
dataloader_idx
argument value when predicting with only oneDataLoader
(#7941) - Fixed passing the
stage
argument ofCallback.{setup,teardown}
as a keyword (#7973) - Fixed metrics generated during
validation sanity checking
so that they are cleaned up at the end (#8171) - Fixed
log_gpu_memory
metrics not being added tologging
when nothing else is logged (#8174) - Fixed a bug where calling
log
with aMetric
instance would raise an error if it was a nested attribute of the model (#8181) - Fixed a bug where using
precision=64
would cause buffers with complex dtype to be cast to real (#8208) - Fixed
is_overridden
returning true for wrapped functions with no changes (#8296) - Fixed a bug where
truncated_bptt_steps
would throw an AttributeError when the target RNN has multiple hidden states (#8145) - Fixed
self.optimizers()
not returning a single optimizer if it had been wrapped (#8326) - Fixed the
on_after_backward
hook not getting called when using manual optimization and no plugins (#8328) - Fixed the
LightningModule.backward
hook only getting called with theapex
plugin when using manual optimization (#8328) - Fixed moving batch to device before sending it to the
on_*_batch_start
/on_*_batch_end
callbacks and model hooks (#7378) - Fixed passing a custom
DDPPlugin
when choosingaccelerator="ddp_cpu"
for the accelerator (#6208) - Fixed missing call to
LightningModule.untoggle_optimizer
in training loop when running gradient accumulation with multiple optimizers (#8284) - Fixed hash of LightningEnum to work with value instead of name (#8421).
- Fixed a bug where an extra checkpoint was saved at the end of training if the
val_check_interval
did not align with the number of training batches (#7724)
- Fixed
move_data_to_device
to return the batch if the objectto
function didn't returnself
(#8433) - Fixed progress bar updates for Pod Training (#8258)
- Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (#8442)
- Fixed memory leaks on GPU by moving
optimizer_states
,ResultCollection.extra
,ResultMetric
attributes, andLoggerConnector
metrics tocpu
. Also, delete the DDP wrapper onteardown
(#8490) - Fixed
SWA
callback using LightningModuleprevent_trainer_and_dataloaders_deepcopy
to avoid OOM (#8472) - Fixed
ModelPruning
callbackon_save_checkpoint
to avoid making adeepcopy
potentially leading to OOM (#8472) - Fixed the sampler replacement logic for
DataLoader
s which do not define allDataLoader
attributes as__init__
parameters (#8519) - Fixed DeepSpeed Windows support (#8488)
- Fixed DeepSpeed not properly setting the trainer
lr_schedulers
attribute (#8527) - Fixed experiment version and log-dir divergence in DDP when using multiple
Trainer
instances in sequence (#7403) - Enabled manual optimization for TPUs (#8458)
- Fixed
accumulate_grad_batches
not being recomputed during model reload (#5334) - Fixed a
TypeError
when wrapping optimizers in theHorovodPlugin
and runningTrainer.test
(#7840) - Fixed
BackboneFinetuning
restoration (#8501) - Fixed
lr_scheduler
with metric (e.g.torch.optim.lr_scheduler.ReduceLROnPlateau
) when usingautomatic_optimization = False
(#7643) - Fixed
DeepSpeed
breaking with no schedulers (#8580)
- Fixed a sync deadlock when checkpointing a
LightningModule
that uses a torchmetrics 0.4Metric
(#8218) - Fixed compatibility with TorchMetrics v0.4 (#8206)
- Added torchelastic check when sanitizing GPUs (#8095)
- Fixed a DDP info message that was never shown (#8111)
- Fixed metrics deprecation message at module import level (#8163)
- Fixed a bug where an infinite recursion would be triggered when using the
BaseFinetuning
callback on a model that contains aModuleDict
(#8170) - Added a mechanism to detect
deadlock
forDDP
when only 1 process triggers an Exception. The mechanism will kill the processes
when it happens (#8167) - Fixed NCCL error when selecting non-consecutive device ids (#8165)
- Fixed SWA to also work with
IterableDataset
(#8172)
- Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error (#7975)
- Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
- Fixed setting a
DistributedSampler
when using a distributed plugin in a custom accelerator (#7814) - Improved
PyTorchProfiler
chrome traces names (#8009) - Fixed moving the best score to device in
EarlyStopping
callback for TPU devices (#7959) - Fixed access to
callback_metrics
in ddp_spawn (#7916)
- Fixed logs overwriting issue for remote filesystems (#7889)
- Fixed
DataModule.prepare_data
could only be called on the global rank 0 process (#7945) - Fixed setting
worker_init_fn
to seed dataloaders correctly when using DDP (#7942) - Fixed
BaseFinetuning
callback to properly handle parent modules with parameters (#7931)
- Added warning to Training Step output (#7779)
- Fixed
LearningRateMonitor
andBackboneFinetuning
(#7835) - Minor improvements to
apply_to_collection
and type signature oflog_dict
(#7851) - Fixed docker versions (#7834)
- Fixed sharded training check for fp16 precision (#7825)
- Fixed support for torch Module type hints in LightningCLI (#7807)
- Move
training_output
validation to aftertrain_step_end
(#7868)
- Fixed info message when max training time reached (#7780)
- Fixed missing
__len__
method toIndexBatchSamplerWrapper
(#7681)
- Changed calling of
untoggle_optimizer(opt_idx)
out of the closure function (#7563)
- Fixed
ProgressBar
pickling after callingtrainer.predict
(#7608) - Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
- Fixed dataloaders not being reset when tuning the model (#7566)
- Fixed print errors in
ProgressBar
whentrainer.fit
is not called (#7674) - Fixed global step update when the epoch is skipped (#7677)
- Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
DataModule
s now avoid duplicate{setup,teardown,prepare_data}
calls for the same stage (#7238)
- Fixed parsing of multiple training dataloaders (#7433)
- Fixed recursive passing of
wrong_type
keyword argument inpytorch_lightning.utilities.apply_to_collection
(#7433) - Fixed setting correct
DistribType
forddp_cpu
(spawn) backend (#7492) - Fixed incorrect number of calls to LR scheduler when
check_val_every_n_epoch > 1
(#7032)
- Fixed DeepSpeed with IterableDatasets (#7362)
- Fixed
Trainer.current_epoch
not getting restored after tuning (#7434) - Fixed local rank displayed in console log (#7395)
- Added support for the
EarlyStopping
callback to run at the end of the training epoch (#6944) - Added synchronization points before and after
setup
hooks are run (#7202) - Added a
teardown
hook toClusterEnvironment
(#6942) - Added utils for metrics to scalar conversions (#7180)
- Added utils for NaN/Inf detection for gradients and parameters (#6834)
- Added more explicit exception message when trying to execute
trainer.test()
ortrainer.validate()
withfast_dev_run=True
(#6667) - Added
LightningCLI
class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299) - Added
gradient_clip_algorithm
argument to Trainer for gradient clipping by value (#6123) - Added a way to print to terminal without breaking up the progress bar (#5470)
- Added support to checkpoint after training steps in
ModelCheckpoint
callback (#6146) - Added
TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED}
(#7173) - Added
Trainer.validate()
method to perform one evaluation epoch over the validation set (#4948) - Added
LightningEnvironment
for Lightning-specific DDP (#5915) - Added
teardown()
hook to LightningDataModule (#4673) - Added
auto_insert_metric_name
parameter toModelCheckpoint
(#6277) - Added arg to
self.log
that enables users to give custom names when dealing with multiple dataloaders (#6274) - Added
teardown
method toBaseProfiler
to enable subclasses defining post-profiling steps outside of__del__
(#6370) - Added
setup
method toBaseProfiler
to enable subclasses defining pre-profiling steps for every process (#6633) - Added no return warning to predict (#6139)
- Added
Trainer.predict
config validation (#6543) - Added
AbstractProfiler
interface (#6621) - Added support for including module names for forward in the autograd trace of
PyTorchProfiler
(#6349) - Added support for the PyTorch 1.8.1 autograd profiler (#6618)
- Added
outputs
parameter to callback'son_validation_epoch_end
&on_test_epoch_end
hooks (#6120) - Added
configure_sharded_model
hook (#6679) - Added support for
precision=64
, enabling training with double precision (#6595) - Added support for DDP communication hooks (#6736)
- Added
artifact_location
argument toMLFlowLogger
which will be passed to theMlflowClient.create_experiment
call (#6677) - Added
model
parameter to precision plugins'clip_gradients
signature (#6764, #7231) - Added
is_last_batch
attribute toTrainer
(#6825) - Added
LightningModule.lr_schedulers()
for manual optimization (#6567) - Added
MpModelWrapper
in TPU Spawn (#7045) - Added
max_time
Trainer argument to limit training time (#6823) - Added
on_predict_{batch,epoch}_{start,end}
hooks (#7141) - Added new
EarlyStopping
parametersstopping_threshold
anddivergence_threshold
(#6868) - Added
debug
flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219) - Added new
UnrepeatedDistributedSampler
andIndexBatchSamplerWrapper
for tracking distributed predictions (#7215) - Added
trainer.predict(return_predictions=None|False|True)
(#7215) - Added
BasePredictionWriter
callback to implement prediction saving (#7127) - Added
trainer.tune(scale_batch_size_kwargs, lr_find_kwargs)
arguments to configure the tuning algorithms (#7258) - Added
tpu_distributed
check for TPU Spawn barrier (#7241) - Added device updates to TPU Spawn for Pod training (#7243)
- Added warning when missing
Callback
and usingresume_from_checkpoint
(#7254) - DeepSpeed single file saving (#6900)
- Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
- Added
ignore
param tosave_hyperparameters
(#6056)
- Changed
LightningModule.truncated_bptt_steps
to be property (#7323) - Changed
EarlyStopping
callback to no longer run EarlyStopping.on_validation_end by default
if only training is run. Set check_on_train_epoch_end
to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069) - Renamed
pytorch_lightning.callbacks.swa
topytorch_lightning.callbacks.stochastic_weight_avg
(#6259) - Refactor
RunningStage
andTrainerState
usage (#4945, #7173) - Added
RunningStage.SANITY_CHECKING
- Added
TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
- Changed
trainer.evaluating
to returnTrue
if validating or testing
- Changed
setup()
andteardown()
stage argument to take any of{fit,validate,test,predict}
(#6386) - Changed profilers to save separate report files per state and rank (#6621)
- The trainer no longer tries to save a checkpoint on exception or run callback's
on_train_end
functions (#6864) - Changed
PyTorchProfiler
to usetorch.autograd.profiler.record_function
to record functions (#6349) - Disabled
lr_scheduler.step()
in manual optimization (#6825) - Changed warnings and recommendations for dataloaders in
ddp_spawn
(#6762) pl.seed_everything
will now also set the seed on theDistributedSampler
(#7024)- Changed default setting for communication of multi-node training using
DDPShardedPlugin
(#6937) trainer.tune()
now returns the tuning result (#7258) LightningDataModule.from_datasets()
now acceptsIterableDataset
instances as training datasets (#7503) - Changed
resume_from_checkpoint
warning to an error when the checkpoint file does not exist (#7075) - Automatically set
sync_batchnorm
fortraining_type_plugin
(#6536) - Allowed training type plugin to delay optimizer creation (#6331)
- Removed ModelSummary validation from train loop on_trainer_init (#6610)
- Moved
save_function
to accelerator (#6689) - Updated DeepSpeed ZeRO (#6546, #6752, #6142, #6321)
- Improved verbose logging for
EarlyStopping
callback (#6811) - Run ddp_spawn dataloader checks on Windows (#6930)
- Updated mlflow to use
resolve_tags
(#6746) - Moved
save_hyperparameters
to its own function (#7119) - Replaced
_DataModuleWrapper
with__new__
(#7289) - Reset
current_fx
properties on lightning module in teardown (#7247) - Auto-set
DataLoader.worker_init_fn
withseed_everything
(#6960) - Remove
model.trainer
call inside of dataloading mixin (#7317) - Split profilers module (#6261)
- Ensure accelerator is valid if running interactively (#5970)
- Disabled batch transfer in DP mode (#6098)
- Deprecated
outputs
in bothLightningModule.on_train_epoch_end
andCallback.on_train_epoch_end
hooks (#7339) - Deprecated
Trainer.truncated_bptt_steps
in favor ofLightningModule.truncated_bptt_steps
(#7323) - Deprecated
LightningModule.grad_norm
in favor ofpytorch_lightning.utilities.grads.grad_norm
(#7292) - Deprecated the
save_function
property from theModelCheckpoint
callback (#7201) - Deprecated
LightningModule.write_predictions
andLightningModule.write_predictions_dict
(#7066) - Deprecated
TrainerLoggingMixin
in favor of a separate utilities module for metric handling (#7180) - Deprecated
TrainerTrainingTricksMixin
in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834) period
has been deprecated in favor ofevery_n_val_epochs
in theModelCheckpoint
callback (#6146)- Deprecated
trainer.running_sanity_check
in favor oftrainer.sanity_checking
(#4945) - Deprecated
Profiler(output_filename)
in favor ofdirpath
andfilename
(#6621) - Deprecated
PyTorchProfiler(profiled_functions)
in favor ofrecord_functions
(#6349) - Deprecated
@auto_move_data
in favor oftrainer.predict
(#6993) - Deprecated
Callback.on_load_checkpoint(checkpoint)
in favor ofCallback.on_load_checkpoint(trainer, pl_module, checkpoint)
(#7253) - Deprecated metrics in favor of
torchmetrics
(#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131) - Deprecated the
LightningModule.datamodule
getter and setter methods; access them throughTrainer.datamodule
instead (#7168) - Deprecated the use of
Trainer(gpus="i")
(string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
- Removed the
exp_save_path
property from theLightningModule
(#7266) - Removed training loop explicitly calling
EarlyStopping.on_validation_end
if no validation is run (#7069) - Removed
automatic_optimization
as a property from the training loop in favor ofLightningModule.automatic_optimization
(#7130) - Removed evaluation loop legacy returns for
*_epoch_end
hooks (#6973) - Removed support for passing a bool value to
profiler
argument of Trainer (#6164) - Removed no return warning from val/test step (#6139)
- Removed passing a
ModelCheckpoint
instance toTrainer(checkpoint_callback)
(#6166) - Removed deprecated Trainer argument
enable_pl_optimizer
andautomatic_optimization
(#6163) - Removed deprecated metrics (#6161)
- from
pytorch_lightning.metrics.functional.classification
removedto_onehot
,to_categorical
,get_num_classes
,roc
,multiclass_roc
,average_precision
,precision_recall_curve
,multiclass_precision_recall_curve
- from
pytorch_lightning.metrics.functional.reduction
removedreduce
,class_reduce
- Removed deprecated
ModelCheckpoint
argumentsprefix
,mode="auto"
(#6162) - Removed
mode='auto'
fromEarlyStopping
(#6167) - Removed
epoch
andstep
arguments fromModelCheckpoint.format_checkpoint_name()
, these are now included in themetrics
argument (#7344) - Removed legacy references for magic keys in the
Result
object (#6016) - Removed deprecated
LightningModule
hparams
setter (#6207) - Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the
"log"/"progress_bar"
magic keys. Useself.log
instead (#6734) - Removed
trainer.fit()
return value of1
. It has no return now (#7237) - Removed
logger_connector
legacy code (#6733) - Removed unused mixin attributes (#6487)
- Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
- Fixed attaching train and validation dataloaders when
reload_dataloaders_every_epoch=True
andnum_sanity_val_steps=0
(#7207) - Added a barrier in the accelerator
teardown
to synchronize processes before execution finishes (#6814) - Fixed multi-node DDP sub-process launch by using
local_rank
instead ofglobal_rank
for main process assertion (#7061) - Fixed incorrect removal of
WORLD_SIZE
environment variable in DDP training when launching with torch distributed/torchelastic (#6942) - Made the
Plugin.reduce
method more consistent across all Plugins to reflect a mean-reduction by default (#6011) - Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
- Do not print top-k verbose log with
ModelCheckpoint(monitor=None)
(#6109) - Fixed
ModelCheckpoint(save_top_k=0, save_last=True)
not saving thelast
checkpoint (#6136) - Fixed
.teardown(stage='fit')
and.on_fit_{start,end}()
getting called duringtrainer.test
(#6386) - Fixed LightningModule
all_gather
on cpu tensors (#6416) - Fixed torch distributed not available in setup hook for DDP (#6506)
- Fixed
trainer.tuner.{lr_find,scale_batch_size}
not setting theTrainer
state properly (#7258) - Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
- Fixed pickle error checker to now check for
pickle.PickleError
to catch all pickle errors (#6917) - Fixed a bug where the outputs object passed to
LightningModule.training_epoch_end
was different from the object passed to the on_train_epoch_end
hook (#6969) - Fixed a bug where the outputs passed to
train_batch_end
would be lists even when using a single optimizer and no truncated backprop through time steps (#6969) - Fixed bug for trainer error handling which would cause hang for distributed training (#6864)
- Fixed
self.device
not returning the correct device in replicas of data-parallel (#6414) - Fixed
lr_find
trying beyondnum_training
steps and suggesting a too high learning rate (#7076) - Fixed logger creating incorrect version folder in DDP with repeated
Trainer.fit
calls (#7077) - Fixed metric objects passed directly to
self.log
not being reset correctly (#7055) - Fixed
CombinedLoader
in distributed settings for validation / testing (#7102) - Fixed the save_dir in
WandbLogger
when the run was initiated externally (#7106) - Fixed
num_sanity_val_steps
affecting reproducibility of training data shuffling (#7014) - Fixed resetting device after
fitting/evaluating/predicting
(#7188) - Fixed bug where
trainer.tuner.scale_batch_size(max_trials=0)
would not return the correct batch size result (#7262) - Fixed metrics not being properly logged with
precision=16
andmanual_optimization
(#7228) - Fixed
BaseFinetuning
properly reloadingoptimizer_states
when usingresume_from_checkpoint
(#6891) - Fixed
parameters_to_ignore
not properly set to DDPWrapper (#7239) - Fixed parsing of
fast_dev_run=True
with the built-inArgumentParser
(#7240) - Fixed handling an
IterableDataset
that fails to produce a batch at the beginning of an epoch (#7294) - Fixed
LightningModule.save_hyperparameters()
when attempting to save an empty container (#7268) - Fixed
apex
not properly instantiated when running withddp
(#7274) - Fixed optimizer
state
not moved toGPU
(#7277) - Fixed custom init args for
WandbLogger
(#6989) - Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
- Fixed examples (#6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398)
- Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
- Updated logic for checking TPUs availability (#6767)
- Resolve TPU miss rendezvous (#6781)
- Fixed auto-scaling mode when calling tune method on trainer (#7321)
- Fixed finetuning so that complex models are unfrozen correctly (#6880)
- Ensure we set the eval/train flag correctly on accelerator model (#6877)
- Set better defaults for
rank_zero_only.rank
when training is launched with SLURM and torchelastic (#6802) - Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
- Fixed the
gradient_clip_algorithm
having no effect (#6928) - Fixed CUDA OOM detection and handling (#6934)
- Fixed
unfreeze_and_add_param_group
expectsmodules
rather thanmodule
(#6822) - Fixed DDP + SyncBN when moving to device (#6838)
- Fixed missing arguments in
lr_find
call (#6784) - Fixed
set_default_tensor_type
totorch.DoubleTensor
with precision=64 (#7108) - Fixed
NeptuneLogger.log_text(step=None)
(#7194) - Fixed importing torchtext batch (#6365, #6323, #6211)
- Fixed the order to call for world ranks & the
root_device
property inTPUSpawnPlugin
(#7074) - Fixed multi-gpu join for Horovod (#6954)
- Fixed parsing for pre-release package versions (#6999)
- Added TPUSpawn + IterableDataset error message (#6875)
- Fixed process rank not being available right away after
Trainer
instantiation (#6941) - Fixed
sync_dist
for tpus (#6950) - Fixed
AttributeError
forrequire_backward_grad_sync
when running manual optimization with sharded plugin (#6915) - Fixed
--gpus
default for parser returned byTrainer.add_argparse_args
(#6898) - Fixed TPU Spawn all gather (#6896)
- Fixed
EarlyStopping
logic whenmin_epochs
ormin_steps
requirement is not met (#6705) - Fixed csv extension check (#6436)
- Fixed checkpoint issue when using Horovod distributed backend (#6958)
- Fixed tensorboard exception raising (#6901)
- Fixed setting the eval/train flag correctly on accelerator model (#6983)
- Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
- Fixed bug where
BaseFinetuning.flatten_modules()
was duplicating leaf node parameters (#6879) - Set better defaults for
rank_zero_only.rank
when training is launched with SLURM and torchelastic:
- Fixed a bug with omegaconf and xm.save (#6741)
- Fixed an issue with IterableDataset when len is not defined (#6828)
- Sanitize None params during pruning (#6836)
- Enforce an epoch scheduler interval when using SWA (#6588)
- Fixed TPU Colab hang issue, post training (#6816)
- Fixed a bug where
TensorBoardLogger
would give a warning and not log correctly to a symbolic linksave_dir
(#6730) - Fixed bug where
predict
could not be used whenprogress_bar_refresh_rate=0
(#6884)
- Changed the behavior of
on_epoch_start
to run at the beginning of validation & test epoch (#6498)
- Removed legacy code to include
step
dictionary returns incallback_metrics
. Useself.log_dict
instead. (#6682)
- Fixed
DummyLogger.log_hyperparams
raising aTypeError
when running withfast_dev_run=True
(#6398) - Fixed error on TPUs when there was no
ModelCheckpoint
(#6654) - Fixed
trainer.test
freeze on TPUs (#6654) - Fixed a bug where gradients were disabled after calling
Trainer.predict
(#6657) - Fixed bug where no TPUs were detected in a TPU pod env (#6719)
- Update Gradient Clipping for the TPU Accelerator (#6576)
- Refactored setup to be typing friendly (#6590)
- Fixed a bug where
all_gather
would not work correctly withtpu_cores=8
(#6587) - Fixed comparing required versions (#6434)
- Fixed duplicate logs appearing in console when using the python logging module (#6275)
- Added Autocast in validation, test and predict modes for Native AMP (#6565)
- Changed the default of
find_unused_parameters
back toTrue
in DDP and DDP Spawn (#6438)
- Expose DeepSpeed loss parameters to allow users to fix loss instability (#6115)
- Fixed DP reduction with collection (#6324)
- Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size (#4688)
- Fixed broadcast to use PyTorch
broadcast_object_list
and addreduce_decision
(#6410) - Fixed logger creating directory structure too early in DDP (#6380)
- Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough (#6460)
- Fixed an issue with
Tuner.scale_batch_size
not finding the batch size attribute in the datamodule (#5968) - Fixed an exception in the layer summary when the model contains torch.jit scripted submodules (#6511)
- Fixed train loop config being run during
Trainer.predict
(#6541)
- Fixed
ModelPruning(make_pruning_permanent=True)
pruning buffers getting removed when saved during training (#6073) - Fixed
_stable_1d_sort
to work whenn >= N
(#6177) - Fixed
AttributeError
whenlogger=None
on TPU (#6221) - Fixed PyTorch Profiler with
emit_nvtx
(#6260) - Fixed
trainer.test
frombest_path
hanging after calling trainer.fit
(#6272) - Fixed
SingleTPU
callingall_gather
(#6296) - Ensure we check DeepSpeed/Sharded in multi-node DDP (#6297)
- Check
LightningOptimizer
doesn't delete optimizer hooks (#6305) - Resolve memory leak for evaluation (#6326)
- Ensure that clip gradients is only called if the value is greater than 0 (#6330)
- Fixed
Trainer
not resettinglightning_optimizers
when callingTrainer.fit()
multiple times (#6372)
- Added
checkpoint
parameter to callback'son_save_checkpoint
hook (#6072)
- Changed the order of
backward
,step
,zero_grad
tozero_grad
,backward
,step
(#6147) - Changed default for DeepSpeed CPU Offload to False, due to prohibitively slow speeds at smaller scale (#6262)
- Fixed epoch level schedulers not being called when
val_check_interval < 1.0
(#6075) - Fixed multiple early stopping callbacks (#6197)
- Fixed incorrect usage of
detach()
,cpu()
,to()
(#6216) - Fixed LBFGS optimizer support which didn't converge in automatic optimization (#6147)
- Prevent
WandbLogger
from dropping values (#5931) - Fixed error thrown when using valid distributed mode in multi node (#6297
- Fixed incorrect yield logic for the amp autocast context manager (#6080)
- Fixed priority of plugin/accelerator when setting distributed mode (#6089)
- Fixed error message for AMP + CPU incompatibility (#6107)
- Disabled batch transfer in DP mode (#6093)
- Added
DataType
,AverageMethod
andMDMCAverageMethod
enum in metrics (#5657) - Added support for summarized model total params size in megabytes (#5590)
- Added support for multiple train loaders (#1959)
- Added
Accuracy
metric now generalizes to Top-k accuracy for (multi-dimensional) multi-class inputs using thetop_k
parameter (#4838) - Added
Accuracy
metric now enables the computation of subset accuracy for multi-label or multi-dimensional multi-class inputs with thesubset_accuracy
parameter (#4838) - Added
HammingDistance
metric to compute the hamming distance (loss) (#4838) - Added
max_fpr
parameter toauroc
metric for computing partial auroc metric (#3790) - Added
StatScores
metric to compute the number of true positives, false positives, true negatives and false negatives (#4839) - Added
R2Score
metric (#5241) - Added
LambdaCallback
(#5347) - Added
BackboneLambdaFinetuningCallback
(#5377) - Accelerator
all_gather
supports collection (#5221) - Added
image_gradients
functional metric to compute the image gradients of a given input image. (#5056) - Added
MetricCollection
(#4318) - Added
.clone()
method to metrics (#4318) - Added
IoU
class interface (#4704) - Support to tie weights after moving model to TPU via
on_post_move_to_device
hook - Added missing val/test hooks in
LightningModule
(#5467) - The
Recall
andPrecision
metrics (and their functional counterpartsrecall
andprecision
) can now be generalized to Recall@K and Precision@K with the use oftop_k
parameter (#4842) - Added
ModelPruning
Callback (#5618, #5825, #6045) - Added
PyTorchProfiler
(#5560) - Added compositional metrics (#5464)
- Added Trainer method
predict(...)
for high-performance predictions (#5579) - Added
on_before_batch_transfer
andon_after_batch_transfer
data hooks (#3671) - Added AUC/AUROC class interface (#5479)
- Added
PredictLoop
object (#5752) - Added
QuantizationAwareTraining
callback (#5706, #6040) - Added
LightningModule.configure_callbacks
to enable the definition of model-specific callbacks (#5621) - Added
dim
toPSNR
metric for mean-squared-error reduction (#5957) - Added proximal policy optimization template to pl_examples (#5394)
- Added
log_graph
toCometLogger
(#5295) - Added possibility for nested loaders (#5404)
- Added
sync_step
to Wandb logger (#5351) - Added
StochasticWeightAveraging
callback (#5640) - Added
LightningDataModule.from_datasets(...)
(#5133) - Added
PL_TORCH_DISTRIBUTED_BACKEND
env variable to select backend (#5981) - Added
Trainer
flag to activate Stochastic Weight Averaging (SWA)Trainer(stochastic_weight_avg=True)
(#6038) - Added DeepSpeed integration (#5954, #6042)
- Changed
stat_scores
metric now calculates stat scores over all classes and gains new parameters, in line with the newStatScores
metric (#4839) - Changed
computer_vision_fine_tunning
example to useBackboneLambdaFinetuningCallback
(#5377) - Changed
automatic casting
for LoggerConnectormetrics
(#5218) - Changed
iou
[func] to allow float input (#4704) - Metric
compute()
method will no longer automatically callreset()
(#5409) - Set PyTorch 1.4 as min requirements, also for testing and examples
torchvision>=0.5
andtorchtext>=0.5
(#5418) - Changed
callbacks
argument inTrainer
to allowCallback
input (#5446) - Changed the default of
find_unused_parameters
toFalse
in DDP (#5185) - Changed
ModelCheckpoint
version suffixes to start at 1 (#5008) - Progress bar metrics tensors are now converted to float (#5692)
- Changed the default value for the
progress_bar_refresh_rate
Trainer argument in Google COLAB notebooks to 20 (#5516) - Extended support for purely iteration-based training (#5726)
- Made
LightningModule.global_rank
,LightningModule.local_rank
andLightningModule.logger
read-only properties (#5730) - Forced
ModelCheckpoint
callbacks to run after all others to guarantee all states are saved to the checkpoint (#5731) - Refactored Accelerators and Plugins:
- Added base classes for plugins (#5715)
- Added parallel plugins for DP, DDP, DDPSpawn, DDP2 and Horovod (#5714)
- Precision Plugins (#5718)
- Added new Accelerators for CPU, GPU and TPU (#5719)
- Added RPC and Sharded plugins (#5732)
- Added missing
LightningModule
-wrapper logic to new plugins and accelerator (#5734) - Moved device-specific teardown logic from training loop to accelerator (#5973)
- Moved accelerator_connector.py to the connectors subfolder (#6033)
- Trainer only references accelerator (#6039)
- Made parallel devices optional across all plugins (#6051)
- Cleaning (#5948, #5949, #5950)
- Enabled
self.log
in callbacks (#5094) - Renamed xxx_AVAILABLE as protected (#5082)
- Unified module names in Utils (#5199)
- Separated utils: imports & enums (#5256, #5874)
- Refactor: clean trainer device & distributed getters (#5300)
- Simplified training phase as LightningEnum (#5419)
- Updated metrics to use LightningEnum (#5689)
- Changed the sequence of
on_train_batch_end
,on_batch_end
&on_train_epoch_end
,on_epoch_end hooks
(#5688) - Refactored
setup_training
and removed test_mode
(#5388) - Disabled training with zero
num_training_batches
when insufficientlimit_train_batches
(#5703) - Refactored
EpochResultStore
(#5522) - Update
lr_finder
to check for attribute if not runningfast_dev_run
(#5990) - LightningOptimizer manual optimizer is more flexible and exposes
toggle_model
(#5771) MlflowLogger
now limits parameter value length to 250 characters (#5893) - Re-introduced fix for Hydra directory sync with multiple processes (#5993)
- Function
stat_scores_multiple_classes
is deprecated in favor of stat_scores
(#4839) - Moved accelerators and plugins to their
legacy
pkg (#5645) - Deprecated
LightningDistributedDataParallel
in favor of new wrapper module LightningDistributedModule
(#5185) - Deprecated
LightningDataParallel
in favor of new wrapper module LightningParallelModule
(#5670) - Renamed utils modules (#5199)
argparse_utils
>>argparse
model_utils
>>model_helpers
warning_utils
>>warnings
xla_device_utils
>>xla_device
- Deprecated using
'val_loss'
to set the ModelCheckpoint
monitor (#6012) - Deprecated
.get_model()
in favor of the explicit .lightning_module
property (#6035) - Deprecated Trainer attribute
accelerator_backend
in favor of accelerator
(#6034)
- Removed deprecated checkpoint argument
filepath
(#5321) - Removed deprecated
Fbeta
, f1_score
and fbeta_score
metrics (#5322) - Removed deprecated
TrainResult
(#5323) - Removed deprecated
EvalResult
(#5633) - Removed
LoggerStages
(#5673)
- Fixed distributed setting and
ddp_cpu
only with num_processes>1
(#5297) - Fixed
num_workers
for Windows example (#5375) - Fixed loading yaml (#5619)
- Fixed support custom DataLoader with DDP if they can be re-instantiated (#5745)
- Fixed repeated
.fit()
calls ignoring the max_steps iteration bound (#5936) - Fixed throwing
MisconfigurationError
on unknown mode (#5255) - Resolve bug with Finetuning (#5744)
- Fixed
ModelCheckpoint
race condition in file existence check (#5155) - Fixed some compatibility with PyTorch 1.8 (#5864)
- Fixed forward cache (#5895)
- Fixed recursive detach of tensors to CPU (#6007)
- Fixed passing wrong strings for the scheduler interval not throwing an error (#5923)
- Fixed wrong
requires_grad
state after return None
with multiple optimizers (#5738) - Fixed add
on_epoch_end
hook at the end of validation
, test
epoch (#5986) - Fixed missing
process_dataloader
call for TPUSpawn
when in distributed mode (#6015) - Fixed progress bar flickering by appending 0 to floats/strings (#6009)
- Fixed synchronization issues with TPU training (#6027)
- Fixed
hparams.yaml
saved twice when using TensorBoardLogger
(#5953) - Fixed basic examples (#5912, #5985)
- Fixed
fairscale
compatibility with PyTorch 1.8 (#5996) - Ensured
process_dataloader
is called when tpu_cores > 1
to use Parallel DataLoader (#6015) - Attempted SLURM auto resume call when non-shell call fails (#6002)
- Fixed wrapping optimizers upon assignment (#6006)
- Fixed allowing hashing of metrics with lists in their state (#5939)
- Separate epoch validation from step validation (#5208)
- Fixed
toggle_optimizers
not handling all optimizer parameters (#5775)
- Fixed
TensorBoardLogger
not closing SummaryWriter
on finalize
(#5696) - Fixed filtering of pytorch "unsqueeze" warning when using DP (#5622)
- Fixed
num_classes
argument in F1 metric (#5663) - Fixed
log_dir
property (#5537) - Fixed a race condition in
ModelCheckpoint
when checking if a checkpoint file exists (#5144) - Remove unnecessary intermediate layers in Dockerfiles (#5697)
- Fixed auto learning rate ordering (#5638)
- Increased TPU check timeout from 20s to 100s (#5598)
- Ignored
step
param in Neptune logger's log_metric method (#5510) - Pass batch outputs to
on_train_batch_end
instead of epoch_end
outputs (#4369)
- Fixed
toggle_optimizer
to reset requires_grad
state (#5574) - Fixed FileNotFoundError for best checkpoint when using DDP with Hydra (#5629)
- Fixed an error when logging a progress bar metric with a reserved name (#5620)
- Fixed
Metric
's state_dict
not included when child modules (#5614) - Fixed Neptune logger creating multiple experiments when GPUs > 1 (#3256)
- Fixed duplicate logs appearing in console when using the python logging module (#5509)
- Fixed tensor printing in
trainer.test()
(#5138) - Fixed not using dataloader when
hparams
present (#4559)
- Fixed a visual bug in the progress bar display initialization (#4579)
- Fixed logging
on_train_batch_end
in a callback with multiple optimizers (#5521) - Fixed
reinit_scheduler_properties
with correct optimizer (#5519) - Fixed
val_check_interval
with fast_dev_run
(#5540)
- Add automatic optimization property setter to lightning module (#5169)
- Changed deprecated
enable_pl_optimizer=True
(#5244)
- Fixed
transfer_batch_to_device
for DDP with len(devices_ids) == 1
(#5195) - Logging only on
not should_accumulate()
during training (#5417) - Resolve interpolation bug with Hydra (#5406)
- Check environ before selecting a seed to prevent warning message (#4743)
- Fixed signature mismatch in
model_to_device
of DDPCPUHPCAccelerator
(#5505)
- Added a check for optimizer attached to
lr_scheduler
(#5338) - Added support for passing non-existing filepaths to
resume_from_checkpoint
(#4402)
- Skip restore from
resume_from_checkpoint
while testing
(#5161) - Allowed
log_momentum
for adaptive optimizers in LearningRateMonitor
(#5333) - Disabled checkpointing, earlystopping and logging with
fast_dev_run
(#5277) - Distributed group defaults to
WORLD
if None
(#5125)
- Fixed
trainer.test
returning non-test metrics (#5214) - Fixed metric state reset (#5273)
- Fixed
--num-nodes
on DDPSequentialPlugin
(#5327) - Fixed invalid value for
weights_summary
(#5296) - Fixed
Trainer.test
not using the latest best_model_path
(#5161) - Fixed existence check for hparams not using underlying filesystem (#5250)
- Fixed
LightningOptimizer
AMP bug (#5191) - Fixed casted key to string in
_flatten_dict
(#5354)
- Support number for logging with
sync_dist=True
(#5080) - Added offset logging step when resuming for Wandb logger (#5050)
enable_pl_optimizer=False
by default to temporarily fix AMP issues (#5163)
- Metric reduction with Logging (#5150)
- Remove nan loss in manual optimization (#5121)
- Un-balanced logging properly supported (#5119)
- Fix hanging in DDP HPC accelerators (#5157)
- Fix reset
TensorRunningAccum
(#5106) - Updated
DALIClassificationLoader
to not use deprecated arguments (#4925) - Corrected call to
torch.no_grad
(#5124)
- Add a notebook example to reach a quick baseline of ~94% accuracy on CIFAR10 using Resnet in Lightning (#4818)
- Simplify accelerator steps (#5015)
- Refactor load in checkpoint connector (#4593)
- Fixed the saved filename in
ModelCheckpoint
when it already exists (#4861)
- Fixed trainer by default
None
in DDPAccelerator
(#4915) - Fixed
LightningOptimizer
to expose optimizer attributes (#5095) - Do not warn when the
name
key is used in the lr_scheduler
dict (#5057) - Check if optimizer supports closure (#4981)
- Add deprecated metric utility functions back to functional (#5067, #5068)
- Allow any input in
to_onnx
and to_torchscript
(#4378) - Fixed
DDPHPCAccelerator
hangs in DDP construction by calling init_device
(#5157)
- Added "monitor" key to saved
ModelCheckpoints
(#4383) - Added
ConfusionMatrix
class interface (#4348) - Added multiclass AUROC metric (#4236)
- Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
- Added optimizer hooks in callbacks (#4379)
- Added option to log momentum (#4384)
- Added
current_score
to ModelCheckpoint.on_save_checkpoint
(#4721) - Added logging using
self.log
in train and evaluation for epoch end hooks (#4552, #4495, #4439, #4684, #4913) - Added ability for DDP plugin to modify optimizer state saving (#4675)
- Added
prefix
argument in loggers (#4557) - Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
- Added
PrecisionRecallCurve, ROC, AveragePrecision
class metric (#4549) - Added custom
Apex
and NativeAMP
as Precision plugins
(#4355) - Added
DALI MNIST
example (#3721) - Added
sharded plugin
for DDP for multi-gpu training memory optimizations (#4639, #4686, #4737, #4773) - Added
experiment_id
to the NeptuneLogger (#3462) - Added
Pytorch Geometric
integration example with Lightning (#4568) - Added
all_gather
method to LightningModule
which allows gradient based tensor synchronizations for use-cases such as negative sampling. (#5012) - Enabled
self.log
in most functions (#4969) - Added changeable extension variable for
ModelCheckpoint
(#4977)
- Tuner algorithms will be skipped if
fast_dev_run=True
(#3903) WandbLogger
does not force wandb reinit
arg to True anymore and creates a run only when needed (#4648)- Changed
automatic_optimization
to be a model attribute (#4602) - Changed
Simple Profiler
report to order by percentage time spent + num calls (#4880) - Simplify optimization Logic (#4984)
- Classification metrics overhaul (#4837)
- Updated
fast_dev_run
to accept integer representing num_batches (#4629) - Refactored optimizer (#4658)
- Deprecated
prefix
argument in ModelCheckpoint
(#4765) - Deprecated the old way of assigning hyper-parameters through
self.hparams = ...
(#4813) - Deprecated
mode='auto'
from ModelCheckpoint
and EarlyStopping
(#4695)
- Removed
reorder
parameter of the auc
metric (#5004) - Removed
multiclass_roc
andmulticlass_precision_recall_curve
, use roc
and precision_recall_curve
instead (#4549)
- Added feature to move tensors to CPU before saving (#4309)
- Fixed
LoggerConnector
to have logged metrics on root device in DP (#4138) - Auto convert tensors to contiguous format when
gather_all
(#4907) - Fixed
PYTHONPATH
for ddp test model (#4528) - Fixed allowing logger to support indexing (#4595)
- Fixed DDP and manual_optimization (#4976)
- Added casting to python types for numpy scalars when logging
hparams
(#4647) - Added warning when progress bar refresh rate is less than 20 on Google Colab to prevent crashing (#4654)
- Added
F1
class metric (#4656)
- Consistently use
step=trainer.global_step
in LearningRateMonitor
independently of logging_interval
(#4376) - Metric states are no longer added by default to
state_dict
(#4685) - Renamed class metric
Fbeta
>>FBeta
(#4656) - Model summary: add 1 decimal place (#4745)
- Do not override
PYTHONWARNINGS
(#4700) - Moved
init_ddp_connection
from DDP
to DDPPlugin
(#4407)
- Fixed checkpoint
hparams
dict casting when omegaconf
is available (#4770) - Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
- Updated SSIM metric (#4566)
- Fixed batch_arg_name bug - add
batch_arg_name
to all calls to _adjust_batch_size
(#4812) - Fixed
torchtext
data to GPU (#4785) - Fixed a crash bug in MLFlow logger (#4716)
- Added lambda closure to
manual_optimizer_step
(#4618)
- Change Metrics
persistent
default mode to False
(#4685) - LoggerConnector log_metrics will use
total_batch_idx
instead of global_step
when logging on training step
(#4738)
- Prevent crash if
sync_dist=True
on CPU (#4626) - Fixed average pbar Metrics (#4534)
- Fixed
setup
callback hook to correctly pass the LightningModule through (#4608) - Allowing decorate model init with saving
hparams
inside (#4662) - Fixed
split_idx
set by LoggerConnector
in on_trainer_init
to Trainer
(#4697)
- Added metrics aggregation in Horovod and fixed early stopping (#3775)
- Added
manual_optimizer_step
which works with AMP Native
and accumulated_grad_batches
(#4485) - Added
persistent(mode)
method to metrics, to enable and disable metric states being added to state_dict
(#4482) - Added congratulations at the end of our notebooks (#4555)
- Added parameters
move_metrics_to_cpu
in Trainer to disable gpu leak (#4592)
- Changed
fsspec
to tuner (#4458) - Unify SLURM/TorchElastic under backend plugin (#4578, #4580, #4581, #4582, #4583)
- Fixed feature-lack in
hpc_load
(#4526) - Fixed metrics states being overridden in DDP mode (#4482)
- Fixed
lightning_getattr
, lightning_hasattr
not finding the correct attributes in datamodule (#4347) - Fixed automatic optimization AMP by
manual_optimization_step
(#4485) - Replace
MisconfigurationException
with warning in ModelCheckpoint
Callback (#4560) - Fixed logged keys in mlflow logger (#4412)
- Fixed
is_picklable
by catching AttributeError
(#4508) - Fixed multi test dataloaders dict
AttributeError
error (#4480) - Fixed show progress bar only for
progress_rank 0
on DDP_SLURM
(#4437)
- Added PyTorch 1.7 Stable support (#3821)
- Added timeout for
tpu_device_exists
to ensure process does not hang indefinitely (#4340)
- W&B log in sync with
Trainer
step (#4405) - Hook
on_after_backward
is called only when optimizer_step
is being called (#4439) - Moved
track_and_norm_grad
into training loop
and called only when optimizer_step
is being called (#4439) - Changed type checker with explicit cast of
ref_model
object (#4457) - Changed
distributed_backend
->accelerator
(#4429)
- Deprecated passing
ModelCheckpoint
instance to checkpoint_callback
Trainer argument (#4336)
- Disable saving checkpoints if not trained (#4372)
- Fixed error using
auto_select_gpus=True
with gpus=-1
(#4209) - Disabled training when
limit_train_batches=0
(#4371) - Fixed that metrics do not store computational graph for all seen data (#4313)
- Fixed AMP unscale for
on_after_backward
(#4439) - Fixed TorchScript export when module includes Metrics (#4428)
- Fixed TorchScript trace method's data to device and docstring (#4360)
- Fixed CSV logger warning (#4419)
- Fixed skip DDP parameter sync (#4301)
- Fixed
WandbLogger
_sanitize_callable function (#4422) - Fixed
AMP Native
_unscale
gradient (#4441)
- Added
dirpath
and filename
parameter in ModelCheckpoint
(#4213) - Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
- Added
strict
option to the scheduler dictionary (#3586) - Added
fsspec
support for profilers (#4162) - Added autogenerated helptext to
Trainer.add_argparse_args
(#4344) - Added support for string values in
Trainer
's profiler
parameter (#3656) - Added
optimizer_closure
to optimizer.step
when supported (#4190) - Added unification of regression metrics (#4166)
- Added checkpoint load from Bytes (#4314)
- Improved error messages for invalid
configure_optimizers
returns (#3587) - Allow changing the logged step value in
validation_step
(#4130) - Allow setting
replace_sampler_ddp=True
with a distributed sampler already added (#4273) - Fixed santized parameters for
WandbLogger.log_hyperparams
(#4320)
- Deprecated
filepath
in ModelCheckpoint
(#4213) - Deprecated
reorder
parameter of the auc
metric (#4237) - Deprecated bool values in
Trainer
's profiler
parameter (#3656)
- Fixed setting device ids in DDP (#4297)
- Fixed synchronization of best model path in
ddp_accelerator
(#4323) - Fixed
WandbLogger
not uploading checkpoint artifacts at the end of training (#4341) - Fixed
FBeta
computation (#4183) - Fixed
accumulation across batches
has completed before breaking training loop
(#4278) - Fixed
ModelCheckpoint
to not increase current_epoch and global_step when not training (#4291) - Fixed
COMET_EXPERIMENT_KEY
environment variable usage in comet logger (#4230)
- Added persistent flag to
Metric.add_state
(#4195)
- Added trace functionality to the function
to_torchscript
(#4142)
- Called
on_load_checkpoint
before loading state_dict
(#4057)
- Removed duplicate metric vs step log for train loop (#4173)
- Fixed the
self.log
problem in validation_step()
(#4169) - Fixed
hparams
saving - save the state when save_hyperparameters()
is called [in __init__
] (#4163) - Fixed runtime failure while exporting
hparams
to yaml (#4158)
- Added getstate/setstate method for torch.save serialization (#4127)
- Added Explained Variance Metric + metric fix (#4013)
- Added Metric <-> Lightning Module integration tests (#4008)
- Added parsing OS env vars in
Trainer
(#4022) - Added classification metrics (#4043)
- Updated explained variance metric (#4024)
- Enabled plugins (#4041)
- Enabled custom clusters (#4048)
- Enabled passing in custom accelerators (#4050)
- Added
LightningModule.toggle_optimizer
(#4058) - Added
LightningModule.manual_backward
(#4063) - Added
output
argument to *_batch_end
hooks (#3965, #3966) - Added
output
argument to *_epoch_end
hooks (#3967)
- Integrated metrics API with self.log (#3961)
- Decoupled Apex (#4052, #4054, #4055, #4056, #4058, #4060, #4061, #4062, #4063, #4064, #4065)
- Renamed all backends to
Accelerator
(#4066) - Enabled manual returns (#4089)
- Removed support for EvalResult and TrainResult (#3968)
- Removed deprecated trainer flags:
overfit_pct
, log_save_interval
, row_log_interval
(#3969) - Removed deprecated early_stop_callback (#3982)
- Removed deprecated model hooks (#3980)
- Removed deprecated callbacks (#3979)
- Removed
trainer
argument in LightningModule.backward
(#4056)
- Fixed
current_epoch
property update to reflect true epoch number inside LightningDataModule
, when reload_dataloaders_every_epoch=True
. (#3974) - Fixed to print scaler value in progress bar (#4053)
- Fixed mismatch between docstring and code regarding when
on_load_checkpoint
hook is called (#3996)
- Added new Metrics API. (#3868, #3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added
LightningModule.to_torchscript
to support exporting as ScriptModule
(#3258) - Added warning when dropping unpicklable
hparams
(#2874) - Added EMB similarity (#3349)
- Added
ModelCheckpoint.to_yaml
method (#3048) - Allow
ModelCheckpoint
monitor to be None
, meaning it will always save (#3630) - Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563)
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added
broadcast
to TPUBackend
(#3814) - Added
XLADeviceUtils
class to check XLA device type (#3274)
- Refactored accelerator backends:
- moved TPU
xxx_step
to backend (#3118) - refactored DDP backend
forward
(#3119) - refactored GPU backend
__step
(#3120) - refactored Horovod backend (#3121, #3122)
- remove obscure forward call in eval + CPU backend
___step
(#3123) - reduced all simplified forward (#3126)
- added hook base method (#3127)
- refactor eval loop to use hooks - use
test_mode
for if so we can split later (#3129) - moved
___step_end
hooks (#3130) - training forward refactor (#3134)
- training AMP scaling refactor (#3135)
- eval step scaling factor (#3136)
- add eval loop object to streamline eval loop (#3138)
- refactored dataloader process hook (#3139)
- refactored inner eval loop (#3141)
- final inner eval loop hooks (#3154)
- clean up hooks in
run_evaluation
(#3156) - clean up data reset (#3161)
- expand eval loop out (#3165)
- moved hooks around in eval loop (#3195)
- remove
_evaluate
fx (#3197) Trainer.fit
hook clean up (#3198)- DDPs train hooks (#3203)
- refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
- reduced accelerator selection (#3211)
- group prepare data hook (#3212)
- added data connector (#3285)
- modular is_overridden (#3290)
- adding
Trainer.tune()
(#3293) - move
run_pretrain_routine
->setup_training
(#3294) - move train outside of setup training (#3297)
- move
prepare_data
to data connector (#3307) - moved accelerator router (#3309)
- train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
- duplicate data interface definition up into DataHooks class (#3344)
- inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
- all logging related calls in a connector (#3395)
- device parser (#3400, #3405)
- added model connector (#3407)
- moved eval loop logging to loggers (#3408)
- moved eval loop (#3412, #3408)
- trainer/separate argparse (#3421, #3428, #3432)
- move
lr_finder
(#3434) - organize args (#3435, #3442, #3447, #3448, #3449, #3456)
- move specific accelerator code (#3457)
- group connectors (#3472)
- accelerator connector methods x/n (#3469, #3470, #3474)
- merge backends x/n (#3476, #3477, #3478, #3480, #3482)
- apex plugin (#3502)
- precision plugins (#3504)
- Result - make monitor default to
checkpoint_on
to simplify (#3571) - reference to the Trainer on the
LightningDataModule
(#3684) - add
.log
to lightning module (#3686, #3699, #3701, #3704, #3715) - enable tracking original metric when step and epoch are both true (#3685)
- deprecated results obj, added support for simpler comms (#3681)
- move backends back to individual files (#3712)
- fixes logging for eval steps (#3763)
- decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806, #3817, #3819, #3927)
- remove weight loading hack for ddp_cpu (#3808)
- separate
torchelastic
from DDP (#3810) - separate SLURM from DDP (#3809)
- decoupled DDP2 (#3816)
- bug fix with logging val epoch end + monitor (#3812)
- callback system and init DDP (#3836)
- adding compute environments (#3837, #3842)
- epoch can now log independently (#3843)
- test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
- fixed
init_slurm_connection
causing hostname errors (#3856) - moves init apex from LM to apex connector (#3923)
- moves sync bn to each backend (#3925)
- moves configure ddp to each backend (#3924)
- Deprecation warning (#3844)
- Changed
LearningRateLogger
to LearningRateMonitor
(#3251) - Used
fsspec
instead of gfile
for all IO (#3320) - Swapped
torch.load
for fsspec
load in DDP spawn backend (#3787) - Swapped
torch.load
for fsspec
load in cloud_io loading (#3692) - Added support for
to_disk()
to use remote filepaths with fsspec
(#3930) - Updated model_checkpoint's to_yaml to use
fsspec
open (#3801) - Fixed
fsspec
is inconsistent when doing fs.ls
(#3805)
- Refactor
GPUStatsMonitor
to improve training speed (#3257) - Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU
remove_bg
bool to ignore_index
optional int (#3098) - Changed defaults of
save_top_k
and save_last
to None
in ModelCheckpoint (#3680) row_log_interval
and log_save_interval
are now based on training loop's global_step
instead of epoch-internal batch index (#3667) - Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow
ModelCheckpoint
monitor to be None
(#3633) - Enable
None
model checkpoint default (#3669) - Skipped
best_model_path
if checkpoint_callback
is None
(#2962) - Used
raise .. from ..
to explicitly chain exceptions (#3750) - Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult (#3882)
- Deprecated
TrainResult
and EvalResult
, use self.log
and self.write
from the LightningModule
to log metrics and write predictions. training_step
can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681) - Deprecate
early_stop_callback
Trainer argument (#3845) - Rename Trainer arguments
row_log_interval
>>log_every_n_steps
and log_save_interval
>>flush_logs_every_n_steps
(#3748)
- Removed experimental Metric API (#3943, #3949, #3946), listed changes before final removal:
- Added
EmbeddingSimilarity
metric (#3349, #3358) - Added hooks to metric module interface (#2528)
- Added error when AUROC metric is used for multiclass problems (#3350)
- Fixed
ModelCheckpoint
with save_top_k=-1
option not tracking the best models when a monitor metric is available (#3735) - Fixed counter-intuitive error being thrown in
Accuracy
metric for zero target tensor (#3764) - Fixed aggregation of metrics (#3517)
- Fixed Metric aggregation (#3321)
- Fixed RMSLE metric (#3188)
- Renamed
reduction
to class_reduction
in classification metrics (#3322) - Changed
class_reduction
similar to sklearn for classification metrics (#3322) - Renaming of precision recall metric (#3308)
- Fixed
on_train_batch_start
hook to end epoch early (#3700) - Fixed
num_sanity_val_steps
is clipped to limit_val_batches
(#2917) - Fixed ONNX model save on GPU (#3145)
- Fixed
GpuUsageLogger
to work on different platforms (#3008) - Fixed auto-scale batch size not dumping
auto_lr_find
parameter (#3151) - Fixed
batch_outputs
with optimizer frequencies (#3229) - Fixed setting batch size in
LightningModule.datamodule
when using auto_scale_batch_size
(#3266) - Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting
experiment_id
from MLFlow only once instead of each training loop (#3394) - Fixed
overfit_batches
which now correctly disables shuffling for the training loader. (#3501) - Fixed gradient norm tracking for
row_log_interval > 1
(#3489) - Fixed
ModelCheckpoint
name formatting (#3164) - Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change
t()
to transpose()
as XLA devices do not support .t()
on 1-dim tensor (#3252) - Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed
gather_all_tensors
cross GPUs in DDP (#3319) - Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when
training_epoch_end
hook is used (#3673) - Fixed dataloader shuffling not getting turned off with
overfit_batches > 0
and distributed_backend = "ddp"
(#3534) - Fixed determinism in
DDPSpawnBackend
when using seed_everything
in main process (#3335) - Fixed
ModelCheckpoint
period
to actually save every period
epochs (#3630) - Fixed
val_progress_bar
total with num_sanity_val_steps
(#3751) - Fixed Tuner dump: add
current_epoch
to dumped_params (#3261) - Fixed
current_epoch
and global_step
properties mismatch between Trainer
and LightningModule
(#3785) - Fixed learning rate scheduler for optimizers with internal state (#3897)
- Fixed
tbptt_reduce_fx
when non-floating tensors are logged (#3796) - Fixed model checkpoint frequency (#3852)
- Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
- Fixed
TrainerEvaluationLoopMixin
activates model.train()
at the end (#3858) - Fixed
overfit_batches
when using with multiple val/test_dataloaders (#3857) - Fixed enables
training_step
to return None
(#3862) - Fixed init nan for checkpointing (#3863)
- Fixed for
load_from_checkpoint
(#2776) - Fixes incorrect
batch_sizes
when Dataloader returns a dict with multiple tensors (#3668) - Fixed unexpected signature for
validation_step
(#3947)
- Added SyncBN for DDP (#2801, #2838)
- Added basic
CSVLogger
(#2721) - Added SSIM metrics (#2671)
- Added BLEU metrics (#2535)
- Added support to export a model to ONNX format (#2596)
- Added support for
Trainer(num_sanity_val_steps=-1)
to check all validation data before training (#2246) - Added struct. output:
- Added class
LightningDataModule
(#2668) - Added support for PyTorch 1.6 (#2745)
- Added call DataModule hooks implicitly in trainer (#2755)
- Added support for Mean in DDP Sync (#2568)
- Added remaining
sklearn
metrics: AveragePrecision
, BalancedAccuracy
, CohenKappaScore
, DCG
, Hamming
, Hinge
, Jaccard
, MeanAbsoluteError
, MeanSquaredError
, MeanSquaredLogError
, MedianAbsoluteError
, R2Score
, MeanPoissonDeviance
, MeanGammaDeviance
, MeanTweedieDeviance
, ExplainedVariance
(#2562) - Added support for
limit_{mode}_batches (int)
to work with infinite dataloader (IterableDataset) (#2840) - Added support returning python scalars in DP (#1935)
- Added support to Tensorboard logger for OmegaConf
hparams
(#2846) - Added tracking of basic states in
Trainer
(#2541) - Tracks all outputs including TBPTT and multiple optimizers (#2890)
- Added GPU Usage Logger (#2932)
- Added
strict=False
for load_from_checkpoint
(#2819) - Added saving test predictions on multiple GPUs (#2926)
- Auto log the computational graph for loggers that support this (#3003)
- Added warning when changing monitor and using results obj (#3014)
- Added a hook
transfer_batch_to_device
to the LightningDataModule
(#3038)
- Truncated long version numbers in progress bar (#2594)
- Enabling val/test loop disabling (#2692)
- Refactored into
accelerator
module: - Using
.comet.config
file for CometLogger
(#1913) - Updated hooks arguments - breaking for
setup
and teardown
(#2850) - Using
gfile
to support remote directories (#2164) - Moved optimizer creation after device placement for DDP backends (#2904)
- Support
**DictConfig
for hparam
serialization (#2519) - Removed callback metrics from test results obj (#2994)
- Re-enabled naming metrics in ckpt name (#3060)
- Changed progress bar epoch counting to start from 0 (#3061)
- Deprecated Trainer attribute
ckpt_path
, which will now be set by weights_save_path
(#2681)
- Removed deprecated: (#2760)
- core decorator
data_loader
- Module hook
on_sanity_check_start
and loading load_from_metrics
- package
pytorch_lightning.logging
- Trainer arguments:
show_progress_bar
, num_tpu_cores
, use_amp
, print_nan_grads
- LR Finder argument
num_accumulation_steps
- Fixed
accumulate_grad_batches
for last batch (#2853) - Fixed setup call while testing (#2624)
- Fixed local rank zero casting (#2640)
- Fixed single scalar return from training (#2587)
- Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
- Fixed
dtype
and device
properties not getting updated in submodules (#2657) - Fixed
fast_dev_run
to run for all dataloaders (#2581) - Fixed
save_dir
in loggers getting ignored by default value of weights_save_path
when user did not specify weights_save_path
(#2681) - Fixed
weights_save_path
getting ignored when logger=False
is passed to Trainer (#2681) - Fixed TPU multi-core and Float16 (#2632)
- Fixed test metrics not being logged with
LoggerCollection
(#2723) - Fixed data transfer to device when using
torchtext.data.Field
and include_lengths is True
(#2689) - Fixed shuffle argument for distributed sampler (#2789)
- Fixed logging interval (#2694)
- Fixed loss value in the progress bar is wrong when
accumulate_grad_batches > 1
(#2738) - Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
- Fixed selecting GPUs using
CUDA_VISIBLE_DEVICES
(#2739) - Fixed false
num_classes
warning in metrics (#2781) - Fixed shell injection vulnerability in subprocess call (#2786)
- Fixed LR finder and
hparams
compatibility (#2821) - Fixed
ModelCheckpoint
not saving the latest information when save_last=True
(#2881) - Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
- Fixed apex gradient clipping (#2829)
- Fixed save apex scaler states (#2828)
- Fixed a model loading issue with inheritance and variable positional arguments (#2911)
- Fixed passing
non_blocking=True
when transferring a batch object that does not support it (#2910) - Fixed checkpointing to remote file paths (#2925)
- Fixed adding val step argument to metrics (#2986)
- Fixed an issue that caused
Trainer.test()
to stall in ddp mode (#2997) - Fixed gathering of results with tensors of varying shape (#3020)
- Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
- Fixed automatic batch scaling not working with half precision (#3045)
- Fixed setting device to root gpu (#3042)
- Removed auto val reduce (#2462)
- Flattening Wandb Hyperparameters (#2459)
- Fixed using the same DDP python interpreter and actually running (#2482)
- Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
- Made
TensorBoardLogger
and CometLogger
pickleable (#2518) - Fixed a problem with
MLflowLogger
creating multiple run folders (#2502) - Fixed global_step increment (#2455)
- Fixed TPU hanging example (#2488)
- Fixed
argparse
default value bug (#2526) - Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
- Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
- Fixed Trainer
.fit()
returning last not best weights in "ddp_spawn" (#2565) - Fixed passing (do not pass) TPU weights back on test (#2566)
- Fixed DDP tests and
.test()
(#2512, #2570)
- Added reduce ddp results on eval (#2434)
- Added a warning when an
IterableDataset
has __len__
defined (#2437)
- Enabled no returns from eval (#2446)
- Fixes train outputs (#2428)
- Fixes Conda dependencies (#2412)
- Fixed Apex scaling with decoupled backward (#2433)
- Fixed crashing or incorrectly displayed progress bar caused by missing ipywidgets (#2417)
- Fixed TPU saving dir (fc26078e, 04e68f02)
- Fixed logging on rank 0 only (#2425)
- Added TorchText support for moving data to GPU (#2379)
- Changed epoch indexing from 0 instead of 1 (#2289)
- Refactor Model `backward` (#2276)
- Refactored `training_batch` + tests to verify correctness (#2327, #2328)
- Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
- Moved `TrainsLogger` to Bolts (#2384)
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number of batches in case of multiple dataloaders and `limit_{*}_batches` (#1920, #2226)
- Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fix for `load_from_checkpoint()` not working with absolute path on Windows (#2294)
- Fixed an issue with how `_has_len` handles `NotImplementedError`, e.g. raised by `torchtext.data.Iterator` (#2293), (#2307)
- Fixed `average_precision` metric (#2319)
- Fixed ROC metric for CUDA tensors (#2304)
- Fixed lost compatibility with custom datatypes implementing `.to` (#2335)
- Fixed loading model with kwargs (#2387)
- Fixed sum(0) for `trainer.num_val_batches` (#2268)
- Fixed checking if the parameters are a `DictConfig` Object (#2216)
- Fixed SLURM weights saving (#2341)
- Fixed swaps LR scheduler order (#2356)
- Fixed adding tensorboard `hparams` logging test (#2342)
- Fixed use model ref for tear down (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
- Fixed Windows compatibility issue (#2358)
- Fixed the `load_from_checkpoint` path detected as URL bug (#2244)
- Fixed hooks - added barrier (#2245, #2257, #2260)
- Fixed `hparams` - remove frame inspection on `self.hparams` (#2253)
- Fixed setup and on fit calls (#2252)
- Fixed GPU template (#2255)
- Added `overfit_batches`, `limit_{val|test}_batches` flags (overfit now uses training set for all three) (#2213)
- Added metrics
- Allow dataloaders without sampler field present (#1907)
- Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` (#1908)
- Early stopping checks `on_validation_end` (#1458)
- Speed up single-core TPU training by loading data using `ParallelLoader` (#2033)
- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device (#1756)
- Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as `ddp_spawn` (#2115)
- Added loading checkpoints from URLs (#1667)
- Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training (#2134)
- Added a decorator `auto_move_data` that moves data to the correct device when using the LightningModule for inference (#1905)
- Added `ckpt_path` option to `LightningModule.test(...)` to load particular checkpoint (#2190)
- Added `setup` and `teardown` hooks for model (#2229)
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in `LRFinder` (#1862)
- Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed `ModelCheckpoint`'s attributes `best` to `best_model_score` and `kth_best_model` to `kth_best_model_path` (#1799)
- Re-Enable Logger's `ImportError`s (#1938)
- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` (#2029)
- Raise an error when lightning replaces an existing sampler (#2020)
- Enabled `prepare_data` from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from tensorboard logger (#2126)
- Changed epoch indexing from 1 instead of 0 (#2206)
- Deprecated flags: (#2213)
  - `overfit_pct` in favour of `overfit_batches`
  - `val_percent_check` in favour of `limit_val_batches`
  - `test_percent_check` in favour of `limit_test_batches`
- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` (#1799)
- Dropped official support/testing for older PyTorch versions <1.3 (#1917)
- Deprecated Trainer `proc_rank` in favour of `global_rank` (#2166, #2269)
- Removed unintended Trainer argument `progress_bar_callback`, the callback should be passed in by `Trainer(callbacks=[...])` instead (#1855)
- Removed obsolete `self._device` in Trainer (#1849)
- Removed deprecated API (#2073)
  - Packages: `pytorch_lightning.pt_overrides`, `pytorch_lightning.root_module`
  - Modules: `pytorch_lightning.logging.comet_logger`, `pytorch_lightning.logging.mlflow_logger`, `pytorch_lightning.logging.test_tube_logger`, `pytorch_lightning.overrides.override_data_parallel`, `pytorch_lightning.core.model_saving`, `pytorch_lightning.core.root_module`
  - Trainer arguments: `add_row_log_interval`, `default_save_path`, `gradient_clip`, `nb_gpu_nodes`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`
  - Trainer attributes: `nb_gpu_nodes`, `num_gpu_nodes`, `gradient_clip`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`, `default_save_path`, `tng_tqdm_dic`
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of `EarlyStopping` callback (#1863)
- Fixed an issue with `Trainer.from_argparse_args` when passing in unknown Trainer args (#1932)
- Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in host name (#1954)
- Fixed `LearningRateLogger` in multi-scheduler setting (#1944)
- Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
- Fixed `save_weights_only` in ModelCheckpoint (#1780)
- Allow use of same `WandbLogger` instance for multiple training loops (#2055)
- Fixed an issue with `_auto_collect_arguments` collecting local variables that are not constructor arguments and not working for signatures that have the instance not named `self` (#2048)
- Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and `example_input_array` depending on a specific ordering of the submodules in a LightningModule (#1773)
- Fixed TPU logging (#2230)
- Fixed Pid port + duplicate `rank_zero` logging (#2140, #2231)
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723).
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in `training_epoch_end` (#1724)
- Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in `load_from_ckpt` (#1797)
- Added support for multi-node distributed execution under `torchelastic` (#1811, #1818)
- Added using `store_true` for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
- Enable `non-blocking` for device transfers to GPU (#1843)
- Replace meta_tags.csv with hparams.yaml (#1271)
- Reduction when `batch_size < num_gpus` (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert `namedtuple` to `tuple` when transferring the batch to target device (#1589)
- Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
- Args should come after the last positional argument (#1807)
- Made ddp the default if no backend specified with multiple GPUs (#1789)
- Deprecated `tags_csv` in favor of `hparams_file` (#1271)
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not None checking filepath (#1654)
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
- Fixed sampler logic for ddp with iterable dataset (#1734)
- Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
- Fixed wandb logger `global_step` affects other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn't (#1748)
- Fixed lr key name in case of param groups in LearningRateLogger (#1719)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes wasn't being set properly and auto sampler was ddp failing (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1777)
- Fixed native amp + ddp (#1788)
- Fixed `hparam` logging with metrics (#1647)
- Allow logging of metrics together with `hparams` (#1630)
- Removed Warning from trainer loop (#1634)
- Fixed ModelCheckpoint not being fixable (#1632)
- Fixed CPU DDP breaking change and DDP change (#1635)
- Tested pickling (#1636)
- Added flag `replace_sampler_ddp` to manually disable sampler replacement in DDP (#1513)
- Added `auto_select_gpus` flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
- Added learning rate finder (#1347)
- Added support for DDP mode in clusters without SLURM (#1387)
- Added `test_dataloaders` parameter to `Trainer.test()` (#1434)
- Added `terminate_on_nan` flag to trainer that performs a NaN check with each training iteration when set to `True` (#1475)
- Added speed parity tests (max 1 sec difference per epoch) (#1482)
- Added `ddp_cpu` backend for testing ddp without GPUs (#1158)
- Added Horovod support as a distributed backend `Trainer(distributed_backend='horovod')` (#1529)
- Added support for 8 core distributed training on Kaggle TPU's (#1568)
- Added support for native AMP (#1561, #1580)
- Changed the default behaviour to no longer include a NaN check with each training iteration (#1475)
- Decoupled the progress bar from the trainer; it is now a callback and can be customized or even replaced entirely (#1450).
- Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
- Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
- Updated semantic segmentation example with custom U-Net and logging (#1371)
- Disabled val and test shuffling (#1600)
- Deprecated `training_tqdm_dict` in favor of `progress_bar_dict` (#1450).
- Removed `test_dataloaders` parameter from `Trainer.fit()` (#1434)
- Added the possibility to pass nested metrics dictionaries to loggers (#1582)
- Fixed memory leak from opt return (#1528)
- Fixed saving checkpoint before deleting old ones (#1453)
- Fixed loggers - flushing last logged metrics even before continue, e.g. `trainer.test()` results (#1459)
- Fixed optimizer configuration when `configure_optimizers` returns dict without `lr_scheduler` (#1443)
- Fixed `LightningModule` - mixing hparams and arguments in `LightningModule.__init__()` crashes load_from_checkpoint() (#1505)
- Added a missing call to the `on_before_zero_grad` model hook (#1493).
- Allow use of sweeps with `WandbLogger` (#1512)
- Fixed a bug that caused the `callbacks` Trainer argument to reference a global variable (#1534).
- Fixed a bug that set all boolean CLI arguments from `Trainer.add_argparse_args` always to True (#1571)
- Fixed do not copy the batch when training on a single GPU (#1576, #1579)
- Fixed soft checkpoint removing on DDP (#1408)
- Fixed automatic parser bug (#1585)
- Fixed bool conversion from string (#1606)
- Added `rank_zero_warn` for warning only in rank 0 (#1428)
- Fixed default `DistributedSampler` for DDP training (#1425)
- Fixed workers warning not on windows (#1430)
- Fixed returning tuple from `run_training_batch` (#1431)
- Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)
- Added same step loggers' metrics aggregation (#1278)
- Added parity test between a vanilla MNIST model and lightning model (#1284)
- Added parity test between a vanilla RNN model and lightning model (#1351)
- Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
- Added support for hierarchical `dict` (#1152)
- Added `TrainsLogger` class (#1122)
- Added type hints to `pytorch_lightning.core` (#946)
- Added support for `IterableDataset` in validation and testing (#1104)
- Added support for non-primitive types in `hparams` for `TensorboardLogger` (#1130)
- Added a check that stops the training when loss or weights contain `NaN` or `inf` values. (#1097)
- Added support for `IterableDataset` when `val_check_interval=1.0` (default), this will trigger validation at the end of each epoch. (#1283)
- Added `summary` method to Profilers. (#1259)
- Added informative errors if user defined dataloader has zero length (#1280)
- Added testing for python 3.8 (#915)
- Added model configuration checking (#1199)
- Added support for optimizer frequencies through `LightningModule.configure_optimizers()` (#1269)
- Added option to run without an optimizer by returning `None` from `configure_optimizers`. (#1279)
- Added a warning when the number of data loader workers is small. (#1378)
- Changed (renamed and refactored) `TensorRunningMean` -> `TensorRunningAccum`: running accumulations were generalized. (#1278)
- Changed `progress_bar_refresh_rate` trainer flag to disable progress bar when set to 0. (#1108)
- Enhanced `load_from_checkpoint` to also forward params to the model (#1307)
- Updated references to `self.forward()` to instead use the `__call__` interface. (#1211)
- Changed default behaviour of `configure_optimizers` to use no optimizer rather than Adam. (#1279)
- Allow to upload models on W&B (#1339)
- On DP and DDP2 unsqueeze is automated now (#1319)
- Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
- Did not interfere with a default sampler (#1318)
- Remove default Adam optimizer (#1317)
- Give warnings for unimplemented required lightning methods (#1317)
- Made `evaluate` method private >> `Trainer._evaluate(...)`. (#1260)
- Simplify the PL examples structure (shallower and more readable) (#1247)
- Changed min max gpu memory to be on their own plots (#1358)
- Remove `.item` which causes sync issues (#1254)
- Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
- Change default logger to dedicated one (#1064)
- Deprecated Trainer argument `print_nan_grads` (#1097)
- Deprecated Trainer argument `show_progress_bar` (#1108)
- Removed test for no test dataloader in .fit (#1495)
- Removed duplicated module `pytorch_lightning.utilities.arg_parse` for loading CLI arguments (#1167)
- Removed wandb logger's `finalize` method (#1193)
- Dropped `torchvision` dependency in tests and added own MNIST dataset class instead (#986)
- Fixed `model_checkpoint` when saving all models (#1359)
- `Trainer.add_argparse_args` classmethod fixed. Now it adds a type for the arguments (#1147)
- Fixed bug related to type checking of `ReduceLROnPlateau` lr schedulers (#1126)
- Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
- Fixed a bug that created an extra dataloader with active `reload_dataloaders_every_epoch` (#1196)
- Fixed all warnings and errors in the docs build process (#1191)
- Fixed an issue where `val_percent_check=0` would not disable validation (#1251)
- Fixed average of incomplete `TensorRunningMean` (#1309)
- Fixed `WandbLogger.watch` with `wandb.init()` (#1311)
- Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
- Fixed a bug that would cause `trainer.test()` to run on the validation set when overloading `validation_epoch_end` and `test_end` (#1353)
- Fixed `WandbLogger.watch` - use of the watch method without importing `wandb` (#1311)
- Fixed `WandbLogger` to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
- Made `training_epoch_end` behave like `validation_epoch_end` (#1357)
- Fixed `fast_dev_run` running validation twice (#1365)
- Fixed pickle error from quick patch `__code__` (#1352)
- Fixed memory leak on GPU0 (#1094, #1349)
- Fixed checkpointing interval (#1272)
- Fixed validation and training loops run the partial dataset (#1192)
- Fixed running `on_validation_end` only on main process in DDP (#1125)
- Fixed `load_spawn_weights` only in proc rank 0 (#1385)
- Fixes using deprecated `use_amp` attribute (#1145)
- Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1377)
- Fixed `Unimplemented backend XLA` error on TPU (#1387)
- Fixes `print` issues and `data_loader` (#1080)
- Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
- Added `reload_dataloaders_every_epoch=False` flag for trainer. Some users require reloading data every epoch (#926)
- Added `progress_bar_refresh_rate=50` flag for trainer. Throttle refresh rate on notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added `optimizer_idx` argument to `backward` hook (#733)
- Added `entity` argument to `WandbLogger` to be passed to `wandb.init` (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs, can now set `version` to a `str` to just save to that directory, and use `name=''` to prevent experiment-name directory (#804)
- Added option to specify `step` key when logging metrics (#808)
- Added `train_dataloader`, `val_dataloader` and `test_dataloader` arguments to `Trainer.fit()`, for alternative data parsing (#759)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks in multiple files (#849)
- Support for user defined callbacks (#889 and #950)
- Added support for multiple loggers to be passed to `Trainer` as an iterable (e.g. list, tuple, etc.) (#903)
- Added support for step-based learning rate scheduling (#941)
- Added support for logging `hparams` as dict (#1029)
- Checkpoint and early stopping now work without val. step (#1041)
- Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
- Added type hints for function arguments (#912, )
- Added default `argparser` for `Trainer` (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in `Trainer` (#728)
- Improved `NeptuneLogger` by adding `close_after_fit` argument to allow logging after training (#908)
- Changed default TQDM to use `tqdm.auto` for prettier outputs in IPython notebooks (#752)
- Changed `pytorch_lightning.logging` to `pytorch_lightning.loggers` (#767)
- Moved the default `tqdm_dict` definition from Trainer to `LightningModule`, so it can be overridden by the user (#749)
- Moved functionality of `LightningModule.load_from_metrics` into `LightningModule.load_from_checkpoint` (#995)
- Changed Checkpoint path parameter from `filepath` to `dirpath` (#1016)
- Froze model's `hparams` as `Namespace` property (#1029)
- Dropped `logging` config in package init (#1015)
- Renames model steps (#1051)
  - `training_end` >> `training_epoch_end`
  - `validation_end` >> `validation_epoch_end`
  - `test_end` >> `test_epoch_end`
- Refactor dataloading, supports infinite dataloader (#955)
- Create single file in `TensorBoardLogger` (#777)
- Deprecated `pytorch_lightning.logging` (#767)
- Deprecated `LightningModule.load_from_metrics` in favour of `LightningModule.load_from_checkpoint` (#995, #1079)
- Deprecated `@data_loader` decorator (#926)
- Deprecated model steps `training_end`, `validation_end` and `test_end` (#1051, #1056)
- Removed dependency on `pandas` (#736)
- Removed dependency on `torchvision` (#797)
- Removed dependency on `scikit-learn` (#801)
- Fixed a bug where early stopping `on_end_epoch` would be called inconsistently when `check_val_every_n_epoch == 0` (#743)
- Fixed a bug where the model checkpointer didn't write to the same directory as the logger (#771)
- Fixed a bug where the `TensorBoardLogger` class would create an additional empty log file during fitting (#777)
- Fixed a bug where `global_step` was advanced incorrectly when using `accumulate_grad_batches > 1` (#832)
- Fixed a bug when calling `self.logger.experiment` with multiple loggers (#1009)
- Fixed a bug when calling `logger.append_tags` on a `NeptuneLogger` with a single tag (#1009)
- Fixed sending back data from `.spawn` by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed comet logger to log after train (#892)
- Remove deprecated args to learning rate step function (#890)
- Added support for resuming from a specific checkpoint via `resume_from_checkpoint` argument (#516)
- Added support for `ReduceLROnPlateau` scheduler (#320)
- Added support for Apex mode `O2` in conjunction with Data Parallel (#493)
- Added option (`save_top_k`) to save the top k models in the `ModelCheckpoint` class (#128)
- Added `on_train_start` and `on_train_end` hooks to `ModelHooks` (#598)
- Added `TensorBoardLogger` (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added `map_location` argument to `load_from_metrics` and `load_from_checkpoint` (#625)
- Added option to disable validation by setting `val_percent_check=0` (#649)
- Added `NeptuneLogger` class (#648)
- Added `WandbLogger` class (#627)
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed `step_idx` to `step`, `epoch_idx` to `epoch`, `max_num_epochs` to `max_epochs` and `min_num_epochs` to `min_epochs` (#589)
- Renamed `total_batch_nb` to `total_batches`, `nb_val_batches` to `num_val_batches`, `nb_training_batches` to `num_training_batches`, `max_nb_epochs` to `max_epochs`, `min_nb_epochs` to `min_epochs`, `nb_test_batches` to `num_test_batches`, and `nb_val_batches` to `num_val_batches` (#567)
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to `TensorBoardLogger` (#609)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
- Deprecated `max_nb_epochs` and `min_nb_epochs` (#567)
- Deprecated the `on_sanity_check_start` hook in `ModelHooks` (#598)
- Removed the `save_best_only` argument from `ModelCheckpoint`, use `save_top_k=1` instead (#128)
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting `gpus=0` or `gpus=[]` (#561)
- Fixed an error with `print_nan_gradients` when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a `val_check_interval < 1.0` in `Trainer` (#492)
- Fixed bugs relating to the `CometLogger` object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning `-1` from `on_batch_start` following an early exit or when the batch was `None` (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using `truncated_bptt > 1` (#532)
- Fixed a bug when using `IterableDataset` (#547)
- Fixed a bug where `.item` was called on non-tensor objects (#602)
- Fixed a bug where `Trainer.train` would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at `max_epochs` (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where `num_training_batches` and `num_test_batches` would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting `num_training_batches` (#653)
- Fixed a bug when batches did not have a `.copy` method (#701)
- Fixed a bug when using `log_gpu_memory=True` in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where `on_train_end` was not called when early stopping (#723)
- Added option to disable default logger, checkpointer, and early stopping by passing `logger=False`, `checkpoint_callback=False` and `early_stop_callback=False` respectively
- Added `CometLogger` for use with Comet.ml
- Added `val_check_interval` argument to `Trainer` allowing validation to be performed at every given number of batches
- Added functionality to save and load hyperparameters using the standard checkpoint mechanism
- Added call to `torch.cuda.empty_cache` before training starts
- Added option for user to override the call to `backward`
- Added support for truncated backprop through time via the `truncated_bptt_steps` argument in `Trainer`
- Added option to operate on all outputs from `training_step` in DDP2
- Added a hook for modifying DDP init
- Added a hook for modifying Apex
- Changed experiment version to be padded with zeros (e.g. `/dir/version_9` becomes `/dir/version_0009`)
- Changed callback metrics to include any metrics given in logs or progress bar
- Changed the default for `save_best_only` in `ModelCheckpoint` to `True`
- Added `tng_data_loader` for backwards compatibility
- Renamed `MLFlowLogger.client` to `MLFlowLogger.experiment` for consistency
- Moved `global_step` increment to happen after the batch has been processed
- Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
- Changed progress bar functionality to add multiple progress bars for train/val/test
- Changed calls to `print` to use `logging` instead
- Deprecated `tng_dataloader`
- Fixed an issue where the number of batches was off by one during training
- Fixed a bug that occurred when setting a checkpoint callback and `early_stop_callback=False`
- Fixed an error when importing CometLogger
- Fixed a bug where the `gpus` argument had some unexpected behaviour
- Fixed a bug where the computed total number of batches was sometimes incorrect
- Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
- Fixed a bug when using the `log_gpu_memory='min_max'` option in `Trainer`
- Fixed a bug where checkpointing would sometimes erase the current directory
- Added `weights_summary` argument to `Trainer` to be set to `full` (full summary), `top` (just top level modules) or other
- Added `tags` argument to `MLFlowLogger`
- Changed default for `amp_level` to `O1`
- Removed the `print_weights_summary` argument from `Trainer`
- Fixed a bug where logs were not written properly
- Fixed a bug where `logger.finalize` wasn't called after training is complete
- Fixed callback metric errors in DDP
- Fixed a bug where `TestTubeLogger` didn't log to the correct directory
- Added the `LightningLoggerBase` class for experiment loggers
- Added `MLFlowLogger` for logging with `mlflow`
- Added `TestTubeLogger` for logging with `test_tube`
- Added a different implementation of DDP (`distributed_backend='ddp2'`) where every node has one model using all GPUs
- Added support for optimisers which require a closure (e.g. LBFGS)
- Added automatic `MASTER_PORT` default for DDP when not set manually
- Added new GPU memory logging options `'min_max'` (log only the min/max utilization) and `'all'` (log all the GPU memory)
- Changed schedulers to always be called with the current epoch
- Changed `test_tube` to an optional dependency
- Changed data loaders to internally use a getter instead of a python property
- Disabled auto GPU loading when restoring weights to prevent out of memory errors
- Changed logging, early stopping and checkpointing to occur by default
- Fixed a bug with samplers that do not specify `set_epoch`
- Fixed a bug when using the `MLFlowLogger` with unsupported data types, this will now raise a warning
- Fixed a bug where gradient norms were always zero using `track_grad_norm`
- Fixed a bug which causes a crash when logging memory
- Changed `data_batch` argument to `batch` throughout
- Changed `batch_i` argument to `batch_idx` throughout
- Changed `tng_dataloader` method to `train_dataloader`
- Changed `on_tng_metrics` method to `on_training_metrics`
- Changed `gradient_clip` argument to `gradient_clip_val`
- Changed `add_log_row_interval` to `row_log_interval`
- Fixed a bug with tensorboard logging in multi-gpu setup
- Added the flag `log_gpu_memory` to `Trainer` to deactivate logging of GPU memory utilization
- Added SLURM resubmit functionality (port from test-tube)
- Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
- Added option to use single gpu per node with `DistributedDataParallel`
- Changed functionality of `validation_end` and `test_end` with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
- Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
- Changed gpu API to take integers as well (e.g. `gpus=2` instead of `gpus=[0, 1]`)
- All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
- Fixed a bug where data types that implement `.to` but not `.cuda` would not be properly moved onto the GPU
- Fixed a bug where data would not be re-shuffled every epoch when using a `DistributedSampler`
- Added `test_step` and `test_end` methods, used when `Trainer.test` is called
- Added `GradientAccumulationScheduler` callback which can be used to schedule changes to the number of accumulation batches
- Added option to skip the validation sanity check by setting `nb_sanity_val_steps = 0`
- Fixed a bug when setting `nb_sanity_val_steps = 0`
- Changed the default `val_check_interval` to `1.0`
- Changed defaults for `nb_val_batches`, `nb_tng_batches` and `nb_test_batches` to 0
- Fixed a bug where the full validation set was used despite setting `val_percent_check`
- Fixed a bug where an `Exception` was thrown when using a data set containing a single batch
- Fixed a bug where an `Exception` was thrown if no `val_dataloader` was given
- Fixed a bug where tuples were not properly transferred to the GPU
- Fixed a bug where data of a non standard type was not properly handled by the trainer
- Fixed a bug when loading data as a tuple
- Fixed a bug where `AttributeError` could be suppressed by the `Trainer`
- Added support for data to be given as a `dict` or `list` with a single gpu
- Added support for `configure_optimizers` to return a single optimizer, two lists (optimizers and schedulers), or a single list
- Fixed a bug where returning just an optimizer list (i.e. without schedulers) from `configure_optimizers` would throw an `Exception`
- Added `optimizer_step` method that can be overridden to change the standard optimizer behaviour
- Added support for multiple validation dataloaders
- Added support for latest test-tube logger (optimised for `torch==1.2.0`)
- `validation_step` and `val_dataloader` are now optional
- `lr_scheduler` is now activated after epoch
- Fixed a bug where a warning would show when using `lr_scheduler` in `torch>1.1.0`
- Fixed a bug where an `Exception` would be thrown if using `torch.DistributedDataParallel` without using a `DistributedSampler`, this now throws a `Warning` instead
- Fixed a bug where accumulate gradients would scale the loss incorrectly
- Changed install requirement to `torch==1.2.0`
- Changed install requirement to `torch==1.1.0`
- Added 16-bit support for a single GPU
- Added support for training continuation (preserves epoch, global step etc.)
- Changed `training_step` and `validation_step`, outputs will no longer be automatically reduced
- Removed need for `Experiment` object in `Trainer`
- Fixed issues with reducing outputs from generative models (such as images and text)
- Added a decorator to do lazy data loading internally
- Fixed a bug where `Experiment` object was not process safe, potentially causing logs to be overwritten