PyTorch Lightning 1.6: Support for Intel's Habana Accelerator, a new efficient DDP strategy (Bagua), manual fault-tolerance, and stability and reliability improvements.
The core team is excited to announce the PyTorch Lightning 1.6 release ⚡
Highlights
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel's Habana Accelerator
Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Gaudi's heterogeneous architecture combines a cluster of fully programmable Tensor Processing Cores (TPCs) with a configurable Matrix Math engine, together with the associated development tools and libraries.
You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:
trainer = pl.Trainer(accelerator="hpu")
# single Gaudi training
trainer = pl.Trainer(accelerator="hpu", devices=1)
# distributed training with 8 Gaudi
trainer = pl.Trainer(accelerator="hpu", devices=8)
The Bagua Strategy
The Bagua Strategy integrates Bagua, a deep learning training acceleration framework that supports multiple advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:
trainer = pl.Trainer(strategy="bagua")
# or, to choose a custom algorithm
from pytorch_lightning.strategies import BaguaStrategy
trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # the default algorithm
Towards stable Accelerator, Strategy, and Plugin APIs
The `Accelerator`, `Strategy`, and `Plugin` APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.
In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of `Accelerator` and `Strategy` (`TrainingTypePlugin`) as well as certain `Plugin`s. In particular, we want to highlight the following changes:
- All `TrainingTypePlugin`s have been renamed to `Strategy` (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is aligned with the changes we implemented in 1.5, which introduced the new `strategy` and `devices` flags to the Trainer.

  # Before
  from pytorch_lightning.plugins import DDPPlugin
  # New
  from pytorch_lightning.strategies import DDPStrategy

- The `Accelerator` and `PrecisionPlugin` have moved into `Strategy`. All strategies now take an optional parameter `accelerator` and `precision_plugin` (#11022, #10570).
- Custom `Accelerator` implementations must now implement two new abstract methods: `is_available()` (#11797) and `auto_device_count()` (#10222). The latter determines how many devices get used by default when specifying `Trainer(accelerator=..., devices="auto")`.
- We redesigned process creation for spawn-based strategies such as `DDPSpawnStrategy` and `TPUSpawnStrategy` (#10896). All spawn-based strategies now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`, which means the hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` all run under an initialized process group. These changes align the spawn-based strategies with their non-spawn counterparts (such as `DDPStrategy`).
We've also exposed the process group backend for use. For example, you can now easily enable `fairring` like this:
# Explicitly specify the process group backend if you choose to
ddp = pl.strategies.DDPStrategy(process_group_backend="fairring")
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)
Similarly, if you have torch>=1.11 installed, you can enable DDP static graph to apply special runtime optimizations:
trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
from pytorch_lightning.utilities.cli import LightningCLI
LightningCLI(auto_registry=True)
We have also added support for the `ReduceLROnPlateau` scheduler with shorthand notation:
$ python script.py fit --optimizer=Adam --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=metric_to_track
If you need to customize the learning rate scheduler configuration, you can do so by overriding:
class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": lr_scheduler, ...}}
Finally, loggers are also now configurable with shorthand:
$ python script.py fit --trainer.logger=WandbLogger --trainer.logger.name="my_lightning_run"
Control SLURM's re-queueing
We've added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
from pytorch_lightning.plugins.environments import SLURMEnvironment
trainer = pl.Trainer(plugins=SLURMEnvironment(auto_requeue=False))
Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from `SIGUSR1` to `SIGTERM` for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive `trainer.fit()` calls:
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model)
# now, run 2 more epochs
trainer.fit_loop.max_epochs = 4
trainer.fit(model)
Loop customization improvements
The `Loop`'s state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
We've also made it easier to replace Lightning's loops with your own. For example:
class MyCustomLoop(pl.loops.TrainingEpochLoop):
    ...

trainer = pl.Trainer(...)
trainer.fit_loop.replace(epoch_loop=MyCustomLoop)
# Trainer runs the fit loop with your new epoch loop!
trainer.fit(model)
Data-Loading improvements
In previous versions, Lightning required that the `DataLoader` instance set its input arguments as instance attributes. This meant that custom `DataLoader`s also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloaders=MyDataLoader())
As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn't need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV's. You can now define your own pre-fetching value like this:
class MyCustomLoop(pl.loops.FitLoop):
    @property
    def prefetch_batches(self):
        return 7  # lucky number 7

trainer = pl.Trainer(...)
trainer.fit_loop = MyCustomLoop(min_epochs=trainer.min_epochs, max_epochs=trainer.max_epochs)
New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren't natively available in PyTorch. A great example of this is Timm Schedulers.
When using a custom learning rate scheduler that relies on an API other than PyTorch's, you can now define `LightningModule.lr_scheduler_step` with your desired logic.
from timm.scheduler import TanhLRScheduler


class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = TanhLRScheduler(optimizer, ...)
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"}}

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        scheduler.step(epoch=self.current_epoch)  # timm's schedulers need the epoch value
A new stateful API
This release introduces new hooks to standardize all stateful components to use `state_dict` and `load_state_dict`, mimicking the PyTorch API. The new hooks receive their own component's state and replace most usages of the previous `on_save_checkpoint` and `on_load_checkpoint` hooks.
class MyCallback(pl.Callback):
-   def on_save_checkpoint(self, trainer, pl_module, checkpoint):
-       return {'x': self.x}

-   def on_load_checkpoint(self, trainer, pl_module, checkpoint):
-       self.x = checkpoint['x']

+   def state_dict(self):
+       return {'x': self.x}

+   def load_state_dict(self, state_dict):
+       self.x = state_dict['x']
New properties
Trainer.estimated_stepping_batches
You can use the built-in `Trainer.estimated_stepping_batches` property to compute the total number of stepping batches needed for the complete training.
The property takes the gradient accumulation factor and the distributed setting into consideration when performing the computation, so you don't have to derive it manually:
class MyLightningModule(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = ...
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return {"optimizer": optimizer, "lr_scheduler": scheduler}
Trainer.num_devices and Trainer.device_ids
In the past, retrieving the number of devices used, or their IDs, posed a considerable challenge, and doing so required knowing which property to access based on the current `Trainer` configuration.
To simplify this, we've deprecated the per-accelerator properties in favor of accelerator-agnostic ones. For example:
- num_devices = max(1, trainer.num_gpus, trainer.num_processes)
- if trainer.tpu_cores:
-     num_devices = max(num_devices, trainer.tpu_cores)
+ num_devices = trainer.num_devices
Experimental Features
Manual Fault-tolerance
Fault-tolerant training has limitations that require specific information about your data-loading structure.
It is now possible to work around those limitations by enabling manual fault tolerance, where you write your own logic and specify exactly how to checkpoint your datasets and samplers. You can do so using this environment flag:
$ PL_FAULT_TOLERANT_TRAINING=MANUAL python script.py
Check out this video for a dive into the internals of this flag.
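To give a feel for what this unlocks, here is a minimal sketch of a dataset carrying its own resumable state. The class and attribute names are illustrative, not a Lightning API; under manual fault tolerance, Lightning is meant to pick up `state_dict`/`load_state_dict` on your data objects (see the `_Stateful` entries in the CHANGELOG), so check the fault-tolerance docs for the exact contract:
import torch
from torch.utils.data import Dataset

class MyStatefulDataset(Dataset):
    # A toy map-style dataset that tracks how many samples it has served,
    # so training can resume from roughly the same point after an interruption.
    def __init__(self, size=100):
        self.data = torch.arange(size)
        self.samples_served = 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        self.samples_served += 1
        return self.data[index]

    def state_dict(self):
        # the state you choose to checkpoint
        return {"samples_served": self.samples_served}

    def load_state_dict(self, state_dict):
        # how you choose to restore it
        self.samples_served = state_dict["samples_served"]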
Customizing the layer synchronization
We introduced a new plugin class for wrapping layers of a model with synchronization logic for multiprocessing.
class MyLayerSync(pl.plugins.LayerSync):
    ...

layer_sync = MyLayerSync(...)
trainer = Trainer(sync_batchnorm=True, plugins=layer_sync, strategy="ddp")
Registering Custom Accelerators
There has been much progress in the field of ML accelerators, and the list of accelerators is constantly expanding.
We've made it easier for users to try out new accelerators by enabling support for registering custom `Accelerator` classes in Lightning.
from pytorch_lightning.accelerators import Accelerator, AcceleratorRegistry


class SOTAAccelerator(Accelerator):
    def __init__(self, x):
        ...


AcceleratorRegistry.register("sota_accelerator", SOTAAccelerator, x=123)

# the following works now:
trainer = Trainer(accelerator="sota_accelerator")
Backward Incompatible Changes
Here is a selection of notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Drop PyTorch 1.7 support
In line with our policy of supporting the four most recent PyTorch releases, this release supports PyTorch 1.8 to 1.11. Support for PyTorch 1.7 has been removed.
Drop Python 3.6 support
Following Python 3.6's end-of-life, support for Python 3.6 has been removed.
AcceleratorConnector rewrite
To support new accelerator and strategy features, we completely rewrote our internal `AcceleratorConnector` class. No backwards compatibility was maintained, so the rewrite is likely to have broken your code if it was using this class.
Re-define the current_epoch boundary
To resolve fault-tolerance issues, we changed where the current epoch value gets increased.
`trainer.current_epoch` is now increased by 1 during `on_train_end`. This means that if a model is run for 3 epochs (0, 1, 2), `trainer.current_epoch` will now return 3 instead of 2 after `trainer.fit()`. This can also impact custom callbacks that access this property inside this hook.
This also impacts checkpoints saved during an epoch (e.g. in `on_train_epoch_end`). For example, a `Trainer(max_epochs=1, limit_train_batches=1)` instance that saves a checkpoint will have the `current_epoch=0` value saved instead of `current_epoch=1`.
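To make the new boundary concrete, a small sketch of the behavior described above (the model definition is omitted):
import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=3)
trainer.fit(model)  # runs epochs 0, 1, 2

# previously this returned 2; as of 1.6 it returns 3
print(trainer.current_epoch)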
Re-define the global_step boundary
To resolve fault-tolerance issues, we changed where the global step value gets increased.
Access to `trainer.global_step` during an intra-training validation hook will now correctly return the number of optimizer steps taken already. In pseudocode:

training_step()
+ global_step += 1
validation_if_necessary()
- global_step += 1

Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the same reason. A checkpoint saved after 1 step will now be named `step=1.ckpt` instead of `step=0.ckpt`.
The `trainer.global_step` value will now account for TBPTT or multiple optimizers. Users setting `Trainer({min,max}_steps=...)` under these circumstances will need to adjust their values.
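For example, a sketch of reading the new value from a validation hook (the callback below is illustrative):
class LogGlobalStep(pl.Callback):
    def on_validation_start(self, trainer, pl_module):
        # As of 1.6, this already includes the optimizer step that
        # preceded this intra-training validation run.
        print(f"optimizer steps taken so far: {trainer.global_step}")

trainer = pl.Trainer(callbacks=[LogGlobalStep()], val_check_interval=0.25)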
Removed automatic reduction of outputs in training_step when using DataParallel
When using `Trainer(strategy="dp")`, all the tensors returned by `training_step` were previously reduced to a scalar (#11594). This behavior was especially confusing when outputs needed to be collected in the `training_epoch_end` hook.
From now on, outputs are no longer reduced except for the `loss` tensor, unless you implement `training_step_end`, in which case the loss won't get reduced either.
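If you relied on the old automatic reduction, here is a minimal sketch of reducing the outputs yourself in `training_step_end` (the module and layer below are illustrative):
import torch
import pytorch_lightning as pl


class MyDPModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)  # illustrative model

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        return {"loss": loss, "logits": logits}

    def training_step_end(self, step_output):
        # With strategy="dp", step_output holds the gathered per-device results.
        # Reduce the loss (and anything else you need) explicitly here.
        step_output["loss"] = step_output["loss"].mean()
        return step_output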
No longer fall back to CPU when no devices are available
Previous versions were lenient in that the lack of GPU devices defaulted to running on CPU. This meant that users' code could be running much slower without them ever noticing that it was running on CPU.
We suggest passing `Trainer(accelerator="auto")` when this leniency is desired.
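For example (a minimal sketch; `model` is assumed to be an existing LightningModule):
# selects an available accelerator (e.g. GPU) if one is found,
# otherwise explicitly runs on CPU instead of silently falling back
trainer = pl.Trainer(accelerator="auto")
trainer.fit(model)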
CHANGELOG
Added
- Allow logging to an existing run ID in MLflow with `MLFlowLogger` (#12290)
- Enable gradient accumulation using Horovod's `backward_passes_per_step` (#11911)
- Add new `DETAIL` log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
- Added a flag `SLURMEnvironment(auto_requeue=True|False)` to control whether Lightning handles the requeuing (#10601)
- Fault Tolerant Manual
  - Add `_Stateful` protocol to detect if classes are stateful (#10646)
  - Add `_FaultTolerantMode` enum used to track different supported fault tolerant modes (#10645)
  - Add a `_rotate_worker_indices` utility to reload the state according to the latest worker (#10647)
  - Add stateful workers (#10674)
  - Add a utility to collect the states across processes (#10639)
  - Add logic to reload the states across data loading components (#10699)
  - Clean up some fault tolerant utilities (#10703)
  - Enable Fault Tolerant Manual Training (#10707)
  - Broadcast the `_terminate_gracefully` flag to all processes and add support for DDP (#10638)
- Added support for re-instantiation of custom (subclasses of) `DataLoaders` returned in the `*_dataloader()` methods, i.e., automatic replacement of samplers now works with custom types of `DataLoader` (#10680)
- Added a function to validate if fault tolerant training is supported (#10465)
- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
- Show a better error message when a custom `DataLoader` implementation is not well implemented and we need to reconstruct it (#10719)
- Show a better error message when a frozen dataclass is used as a batch (#10927)
- Save the `Loop`'s state by default in the checkpoint (#10784)
- Added `Loop.replace` to easily switch one loop for another (#10324)
- Added support for `--lr_scheduler=ReduceLROnPlateau` to the `LightningCLI` (#10860)
- Added `LightningCLI.configure_optimizers` to override the `configure_optimizers` return value (#10860)
- Added the `LightningCLI(auto_registry)` flag to register all subclasses of the registerable components automatically (#12108)
- Added a warning that shows when `max_epochs` in the `Trainer` is not set (#10700)
- Added support for returning a single Callback from `LightningModule.configure_callbacks` without wrapping it into a list (#11060)
- Added `console_kwargs` for `RichProgressBar` to initialize the inner Console (#10875)
- Added support for shorthand notation to instantiate loggers with the `LightningCLI` (#11533)
- Added a `LOGGER_REGISTRY` instance to register custom loggers to the `LightningCLI` (#11533)
- Added an info message when the `Trainer` arguments `limit_*_batches`, `overfit_batches`, or `val_check_interval` are set to `1` or `1.0` (#11950)
- Added a `PrecisionPlugin.teardown` method (#10990)
- Added `LightningModule.lr_scheduler_step` (#10249)
- Added support for no pre-fetching to `DataFetcher` (#11606)
- Added support for optimizer step progress tracking with manual optimization (#11848)
- Return the output of the `optimizer.step`. This can be useful for `LightningLite` users, manual optimization users, or users overriding `LightningModule.optimizer_step` (#11711)
- Teardown the active loop and strategy on exception (#11620)
- Added a `MisconfigurationException` if the user-provided `opt_idx` in the scheduler config doesn't match the actual optimizer index of its respective optimizer (#11247)
- Added a `loggers` property to `Trainer` which returns a list of loggers provided by the user (#11683)
- Added a `loggers` property to `LightningModule` which retrieves the `loggers` property from `Trainer` (#11683)
- Added support for DDP when using a `CombinedLoader` for the training data (#11648)
- Added a warning when using `DistributedSampler` during validation/testing (#11479)
- Added support for the `Bagua` training strategy (#11146)
- Added support for manually returning a `poptorch.DataLoader` in a `*_dataloader` hook (#12116)
- Added a `rank_zero` module to centralize utilities (#11747)
- Added `_Stateful` support for `LightningDataModule` (#11637)
- Added `_Stateful` support for `PrecisionPlugin` (#11638)
- Added `Accelerator.is_available` to check device availability (#11797)
- Enabled static type-checking on the signature of `Trainer` (#11888)
- Added utility functions for moving optimizers to devices (#11758)
- Added a warning when saving an instance of `nn.Module` with `save_hyperparameters()` (#12068)
- Added the `estimated_stepping_batches` property to `Trainer` (#11599)
- Added support for pluggable Accelerators (#12030)
- Added profiling for `on_load_checkpoint`/`on_save_checkpoint` callback and LightningModule hooks (#12149)
- Added the `LayerSync` and `NativeSyncBatchNorm` plugins (#11754)
- Added an optional `storage_options` argument to `Trainer.save_checkpoint()` to pass to custom `CheckpointIO` implementations (#11891)
- Added support to explicitly specify the process group backend for parallel strategies (#11745)
- Added `device_ids` and `num_devices` properties to `Trainer` (#12151)
- Added `Callback.state_dict()` and `Callback.load_state_dict()` methods (#12232)
- Added `AcceleratorRegistry` (#12180)
- Added support for the Habana Accelerator (HPU) (#11808)
- Added support for dataclasses in `apply_to_collections` (#11889)
Changed
- Drop PyTorch 1.7 support (#12191), (#12432)
- Make the `benchmark` flag optional and set its value based on the deterministic flag (#11944)
- Implemented a new native and rich format in the `_print_results` method of the `EvaluationLoop` (#11332)
- Do not print an empty table at the end of the `EvaluationLoop` (#12427)
- Set the `prog_bar` flag to False in `LightningModule.log_grad_norm` (#11472)
- Raised an exception in `init_dist_connection()` when torch distributed is not available (#10418)
- The `monitor` argument in the `EarlyStopping` callback is no longer optional (#10328)
- Do not fail if the batch size could not be inferred for logging when using DeepSpeed (#10438)
- Raised `MisconfigurationException` when `enable_progress_bar=False` and a progress bar instance has been passed in the callback list (#10520)
- Moved `trainer.connectors.env_vars_connector._defaults_from_env_vars` to `utilities.argsparse._defaults_from_env_vars` (#10501)
- Changes in `LightningCLI` required for the new major release of jsonargparse v4.0.0 (#10426)
- Renamed the `refresh_rate_per_second` parameter to `refresh_rate` in the `RichProgressBar` signature (#10497)
- Moved ownership of the `PrecisionPlugin` into the `TrainingTypePlugin` and updated all references (#10570)
- Fault Tolerant training relies on `signal.SIGTERM` to gracefully exit instead of `signal.SIGUSR1` (#10605)
- `Loop.restarting=...` now sets the value recursively for all subloops (#11442)
- Raised an error if the `batch_size` cannot be inferred from the current batch because it contained a string or was a custom batch object (#10541)
- The validation loop is now disabled when `overfit_batches > 0` is set in the Trainer (#9709)
- Moved optimizer-related logic from `Accelerator` to `TrainingTypePlugin` (#10596)
- Moved ownership of the lightning optimizers from the `Trainer` to the `Strategy` (#11444)
- Moved ownership of the data fetchers from the DataConnector to the Loops (#11621)
- Moved the `batch_to_device` method from `Accelerator` to `TrainingTypePlugin` (#10649)
- The `DDPSpawnPlugin` no longer overrides the `post_dispatch` plugin hook (#10034)
- Integrated the progress bar implementation with progress tracking (#11213)
- The `LightningModule.{add_to_queue,get_from_queue}` hooks no longer get a `torch.multiprocessing.SimpleQueue` and instead receive a list-based queue (#10034)
- Changed the `training_step`, `validation_step`, `test_step` and `predict_step` method signatures in `Accelerator` and updated the input from the caller side (#10908)
- Changed the name of the temporary checkpoint that the `DDPSpawnPlugin` and related plugins save (#10934)
- `LoggerCollection` returns only unique logger names and versions (#10976)
- Redesigned process creation for spawn-based plugins (`DDPSpawnPlugin`, `TPUSpawnPlugin`, etc.) (#10896)
  - All spawn-based plugins now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`
  - The hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` now run under an initialized process group for spawn-based plugins, just like their non-spawn counterparts
  - Some configuration errors that were previously raised as `MisconfigurationException`s will now be raised as `ProcessRaisedException` (torch>=1.8) or as `Exception` (torch<1.8)
  - Removed the `TrainingTypePlugin.pre_dispatch()` method and merged it with `TrainingTypePlugin.setup()` (#11137)
- Changed the profiler to index and display the names of the hooks with a new pattern [] (#11026)
- Changed the `batch_to_device` entry in profiling from stage-specific to generic, to match the profiling of other hooks (#11031)
- Changed the info message for finalizing ddp-spawn worker processes to a debug-level message (#10864)
- Removed the duplicated file extension when uploading model checkpoints with `NeptuneLogger` (#11015)
- Removed `__getstate__` and `__setstate__` of `RichProgressBar` (#11100)
- The `DDPPlugin` and `DDPSpawnPlugin` and their subclasses now remove the `SyncBatchNorm` wrappers in `teardown()` to enable proper support at inference after fitting (#11078)
- Moved ownership of the `Accelerator` instance to the `TrainingTypePlugin`; all training-type plugins now take an optional parameter `accelerator` (#11022)
- Renamed the `TrainingTypePlugin` to `Strategy` (#11120)
  - Renamed the `ParallelPlugin` to `ParallelStrategy` (#11123)
  - Renamed the `DataParallelPlugin` to `DataParallelStrategy` (#11183)
  - Renamed the `DDPPlugin` to `DDPStrategy` (#11142)
  - Renamed the `DDP2Plugin` to `DDP2Strategy` (#11185)
  - Renamed the `DDPShardedPlugin` to `DDPShardedStrategy` (#11186)
  - Renamed the `DDPFullyShardedPlugin` to `DDPFullyShardedStrategy` (#11143)
  - Renamed the `DDPSpawnPlugin` to `DDPSpawnStrategy` (#11145)
  - Renamed the `DDPSpawnShardedPlugin` to `DDPSpawnShardedStrategy` (#11210)
  - Renamed the `DeepSpeedPlugin` to `DeepSpeedStrategy` (#11194)
  - Renamed the `HorovodPlugin` to `HorovodStrategy` (#11195)
  - Renamed the `TPUSpawnPlugin` to `TPUSpawnStrategy` (#11190)
  - Renamed the `IPUPlugin` to `IPUStrategy` (#11193)
  - Renamed the `SingleDevicePlugin` to `SingleDeviceStrategy` (#11182)
  - Renamed the `SingleTPUPlugin` to `SingleTPUStrategy` (#11182)
  - Renamed the `TrainingTypePluginsRegistry` to `StrategyRegistry` (#11233)
- Marked the `ResultCollection`, `ResultMetric`, and `ResultMetricCollection` classes as protected (#11130)
- Marked `trainer.checkpoint_connector` as protected (#11550)
- The epoch start/end hooks are now called by the `FitLoop` instead of the `TrainingEpochLoop` (#11201)
- DeepSpeed does not require lightning module zero 3 partitioning (#10655)
- Moved the `Strategy` classes to the `strategies` directory (#11226)
- Renamed the `training_type_plugin` file to `strategy` (#11239)
- Changed `DeviceStatsMonitor` to group metrics based on the logger's `group_separator` (#11254)
- Raised a `UserWarning` if evaluation is triggered with the `best` ckpt and the trainer is configured with multiple checkpoint callbacks (#11274)
- `Trainer.logged_metrics` now always contains scalar tensors, even when a Python scalar was logged (#11270)
- The tuner now uses the checkpoint connector to copy and restore its state (#11518)
- Changed `MisconfigurationException` to `ModuleNotFoundError` when `rich` isn't available (#11360)
- The `trainer.current_epoch` value is now increased by 1 during and after `on_train_end` (#8578)
- The `trainer.global_step` value now accounts for multiple optimizers and TBPTT splits (#11805)
- The `trainer.global_step` value is now increased right after the `optimizer.step()` call, which will impact users who access it during an intra-training validation hook (#11805)
- The filename of checkpoints created with `ModelCheckpoint(filename='{step}')` is different compared to previous versions. A checkpoint saved after 1 step will be named `step=1.ckpt` instead of `step=0.ckpt` (#11805)
- Inherit from `ABC` for `Accelerator`: users need to implement `auto_device_count` (#11521)
- Changed the `parallel_devices` property in `ParallelStrategy` to be lazily initialized (#11572)
- Updated `TQDMProgressBar` to run a separate progress bar for each eval dataloader (#11657)
- Sorted the `SimpleProfiler(extended=False)` summary based on the mean duration of each hook (#11671)
- Avoid enforcing `shuffle=False` for eval dataloaders (#11575)
- When using DP (data-parallel), Lightning will no longer automatically reduce all tensors returned in `training_step`; it will only reduce the loss unless `training_step_end` is overridden (#11594)
- When using DP (data-parallel), the `training_epoch_end` hook will no longer receive reduced outputs from `training_step` and instead get the full tensor of results from all GPUs (#11594)
- Changed the default logger name to `lightning_logs` for consistency (#11762)
- Rewrote `accelerator_connector` (#11448)
- When manual optimization is used with DDP, we no longer force `find_unused_parameters=True` (#12425)
- Disable loading dataloaders if the corresponding `limit_batches=0` (#11576)
- Removed the `is_global_zero` check in `training_epoch_loop` before `logger.save`. If you have a custom logger that implements `save`, the Trainer will now call `save` on all ranks by default. To change this behavior, add `@rank_zero_only` to your `save` implementation (#12134)
- Disabled the tuner with distributed strategies (#12179)
- Marked `trainer.logger_connector` as protected (#12195)
- Moved the `Strategy.process_dataloader` function call from `fit/evaluation/predict_loop.py` to `data_connector.py` (#12251)
- `ModelCheckpoint(save_last=True, every_n_epochs=N)` now saves a "last" checkpoint every epoch (disregarding `every_n_epochs`) instead of only once at the end of training (#12418)
- The strategies that support `sync_batchnorm` now only apply it when fitting (#11919)
- Avoided falling back on CPU if no devices are provided for other accelerators (#12410)
- Modified `supporters.py` so that the accumulator element (for loss) is created directly on the device (#12430)
- Removed `EarlyStopping.on_save_checkpoint` and `EarlyStopping.on_load_checkpoint` in favor of `EarlyStopping.state_dict` and `EarlyStopping.load_state_dict` (#11887)
- Removed `BaseFinetuning.on_save_checkpoint` and `BaseFinetuning.on_load_checkpoint` in favor of `BaseFinetuning.state_dict` and `BaseFinetuning.load_state_dict` (#11887)
- Removed `BackboneFinetuning.on_save_checkpoint` and `BackboneFinetuning.on_load_checkpoint` in favor of `BackboneFinetuning.state_dict` and `BackboneFinetuning.load_state_dict` (#11887)
- Removed `ModelCheckpoint.on_save_checkpoint` and `ModelCheckpoint.on_load_checkpoint` in favor of `ModelCheckpoint.state_dict` and `ModelCheckpoint.load_state_dict` (#11887)
- Removed `Timer.on_save_checkpoint` and `Timer.on_load_checkpoint` in favor of `Timer.state_dict` and `Timer.load_state_dict` (#11887)
- Replaced `PostLocalSGDOptimizer` with a dedicated model averaging component (#12378)
Deprecated
- Deprecated the `training_type_plugin` property in favor of `strategy` in `Trainer` and updated the references (#11141)
- Deprecated `Trainer.{validated,tested,predicted}_ckpt_path` and replaced them with the read-only property `Trainer.ckpt_path`, set when checkpoints are loaded via `Trainer.{fit,validate,test,predict}` (#11696)
- Deprecated `ClusterEnvironment.master_{address,port}` in favor of `ClusterEnvironment.main_{address,port}` (#10103)
- Deprecated `DistributedType` in favor of `_StrategyType` (#10505)
- Deprecated the `precision_plugin` constructor argument from `Accelerator` (#10570)
- Deprecated `DeviceType` in favor of `_AcceleratorType` (#10503)
- Deprecated the property `Trainer.slurm_job_id` in favor of the new `SLURMEnvironment.job_id()` method (#10622)
- Deprecated access to the attribute `IndexBatchSamplerWrapper.batch_indices` in favor of `IndexBatchSamplerWrapper.seen_batch_indices` (#10870)
- Deprecated the `on_init_start` and `on_init_end` callback hooks (#10940)
- Deprecated `Trainer.call_hook` in favor of `Trainer._call_callback_hooks`, `Trainer._call_lightning_module_hook`, `Trainer._call_ttp_hook`, and `Trainer._call_accelerator_hook` (#10979)
- Deprecated `TrainingTypePlugin.post_dispatch` in favor of `TrainingTypePlugin.teardown` (#10939)
- Deprecated `ModelIO.on_hpc_{save/load}` in favor of `CheckpointHooks.on_{save/load}_checkpoint` (#10911)
- Deprecated `Trainer.run_stage` in favor of `Trainer.{fit,validate,test,predict}` (#11000)
- Deprecated `Trainer.lr_schedulers` in favor of `Trainer.lr_scheduler_configs`, which returns a list of dataclasses instead of dictionaries (#11443)
- Deprecated `Trainer.verbose_evaluate` in favor of `EvaluationLoop(verbose=...)` (#10931)
- Deprecated the `Trainer.should_rank_save_checkpoint` Trainer property (#11068)
- Deprecated `Trainer.lightning_optimizers` (#11444)
- Deprecated `TrainerOptimizersMixin` and moved the functionality to `core/optimizer.py` (#11155)
- Deprecated the `on_train_batch_end(outputs)` format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated the `training_epoch_end(outputs)` format when multiple optimizers are used and TBPTT is enabled (#12182)
- Deprecated `TrainerCallbackHookMixin` (#11148)
- Deprecated `TrainerDataLoadingMixin` and moved the functionality to `Trainer` and `DataConnector` (#11282)
- Deprecated the function `pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys` (#11254)
- Deprecated the `Callback.on_epoch_start` hook in favour of `Callback.on_{train/val/test}_epoch_start` (#11578)
- Deprecated the `Callback.on_epoch_end` hook in favour of `Callback.on_{train/val/test}_epoch_end` (#11578)
- Deprecated the `LightningModule.on_epoch_start` hook in favor of `LightningModule.on_{train/val/test}_epoch_start` (#11578)
- Deprecated the `LightningModule.on_epoch_end` hook in favor of `LightningModule.on_{train/val/test}_epoch_end` (#11578)
- Deprecated the `on_before_accelerator_backend_setup` callback hook in favour of `setup` (#11568)
- Deprecated the `on_batch_start` and `on_batch_end` callback hooks in favor of `on_train_batch_start` and `on_train_batch_end` (#11577)
- Deprecated the `on_configure_sharded_model` callback hook in favor of `setup` (#11627)
- Deprecated `pytorch_lightning.utilities.distributed.rank_zero_only` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_only` (#11747)
- Deprecated `pytorch_lightning.utilities.distributed.rank_zero_debug` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_debug` (#11747)
- Deprecated `pytorch_lightning.utilities.distributed.rank_zero_info` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_info` (#11747)
- Deprecated `pytorch_lightning.utilities.warnings.rank_zero_warn` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_warn` (#11747)
- Deprecated `pytorch_lightning.utilities.warnings.rank_zero_deprecation` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_deprecation` (#11747)
- Deprecated `pytorch_lightning.utilities.warnings.LightningDeprecationWarning` in favor of `pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning`
- Deprecated the `on_pretrain_routine_start` and `on_pretrain_routine_end` callback hooks in favor of `on_fit_start` (#11794)
- Deprecated the `LightningModule.on_pretrain_routine_start` and `LightningModule.on_pretrain_routine_end` hooks in favor of `on_fit_start` (#12122)
- Deprecated the `agg_key_funcs` and `agg_default_func` parameters from `LightningLoggerBase` (#11871)
- Deprecated `LightningLoggerBase.update_agg_funcs` (#11871)
- Deprecated `LightningLoggerBase.agg_and_log_metrics` in favor of `LightningLoggerBase.log_metrics` (#11832)
- Deprecated passing `weights_save_path` to the `Trainer` constructor in favor of adding the `ModelCheckpoint` callback with `dirpath` directly to the list of callbacks (#12084)
- Deprecated `pytorch_lightning.profiler.AbstractProfiler` in favor of `pytorch_lightning.profiler.Profiler` (#12106)
- Deprecated `pytorch_lightning.profiler.BaseProfiler` in favor of `pytorch_lightning.profiler.Profiler` (#12150)
- Deprecated `BaseProfiler.profile_iterable` (#12102)
- Deprecated `LoggerCollection` in favor of `trainer.loggers` (#12147)
- Deprecated `PrecisionPlugin.on_{save,load}_checkpoint` in favor of `PrecisionPlugin.{state_dict,load_state_dict}` (#11978)
- Deprecated `LightningDataModule.on_save/load_checkpoint` in favor of `state_dict/load_state_dict` (#11893)
- Deprecated `Trainer.use_amp` in favor of `Trainer.amp_backend` (#12312)
- Deprecated `LightningModule.use_amp` in favor of `Trainer.amp_backend` (#12315)
- Deprecated specifying the process group backend through the environment variable `PL_TORCH_DISTRIBUTED_BACKEND` (#11745)
- Deprecated `ParallelPlugin.torch_distributed_backend` in favor of the `DDPStrategy.process_group_backend` property (#11745)
- Deprecated `ModelCheckpoint.save_checkpoint` in favor of `Trainer.save_checkpoint` (#12456)
- Deprecated `Trainer.devices` in favor of `Trainer.num_devices` and `Trainer.device_ids` (#12151)
- Deprecated `Trainer.root_gpu` in favor of `Trainer.strategy.root_device.index` when GPU is used (#12262)
- Deprecated `Trainer.num_gpus` in favor of `Trainer.num_devices` when GPU is used (#12384)
- Deprecated `Trainer.ipus` in favor of `Trainer.num_devices` when IPU is used (#12386)
- Deprecated `Trainer.num_processes` in favor of `Trainer.num_devices` (#12388)
- Deprecated `Trainer.data_parallel_device_ids` in favor of `Trainer.device_ids` (#12072)
- Deprecated returning state from `Callback.on_save_checkpoint` in favor of returning state in `Callback.state_dict` for checkpointing (#11887)
- Deprecated passing only the callback state to `Callback.on_load_checkpoint(callback_state)` in favor of passing the callback state to `Callback.load_state_dict` and, in 1.8, passing the entire checkpoint dictionary to `Callback.on_load_checkpoint(checkpoint)` (#11887)
- Deprecated `Trainer.gpus` in favor of `Trainer.device_ids` or `Trainer.num_devices` (#12436)
- Deprecated `Trainer.tpu_cores` in favor of `Trainer.num_devices` (#12437)
Removed
- Removed the deprecated parameter `method` in `pytorch_lightning.utilities.model_helpers.is_overridden` (#10507)
- Removed the deprecated method `ClusterEnvironment.creates_children` (#10339)
- Removed the deprecated `TrainerModelHooksMixin.is_function_implemented` and `TrainerModelHooksMixin.has_arg` (#10322)
- Removed the deprecated `pytorch_lightning.utilities.device_dtype_mixin.DeviceDtypeModuleMixin` in favor of `pytorch_lightning.core.mixins.device_dtype_mixin.DeviceDtypeModuleMixin` (#10442)
- Removed the deprecated `LightningModule.loaded_optimizer_states_dict` property (#10346)
- Removed the deprecated `Trainer.fit(train_dataloader=)`, `Trainer.validate(val_dataloaders=)`, and `Trainer.test(test_dataloader=)` (#10325)
- Removed the deprecated `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test` and `has_teardown_predict` datamodule lifecycle properties (#10350)
- Removed the deprecated `every_n_val_epochs` parameter of ModelCheckpoint (#10366)
- Removed the deprecated `import pytorch_lightning.profiler.profilers` in favor of `import pytorch_lightning.profiler` (#10443)
- Removed the deprecated property `configure_slurm_dpp` from the accelerator connector (#10370)
- Removed the deprecated arguments `num_nodes` and `sync_batchnorm` from `DDPPlugin`, `DDPSpawnPlugin`, `DeepSpeedPlugin` (#10357)
- Removed the deprecated property `is_slurm_managing_tasks` from AcceleratorConnector (#10353)
- Removed the deprecated `LightningModule.log(tbptt_reduce_fx, tbptt_reduce_token, sync_dist_op)` (#10423)
- Removed the deprecated `Plugin.task_idx` (#10441)
- Removed the deprecated method `master_params` from PrecisionPlugin (#10372)
- Removed the automatic detachment of "extras" returned from `training_step`. For example, `return {'loss': ..., 'foo': foo.detach()}` will now be necessary if `foo` has gradients which you do not want to store (#10424)
- Removed deprecated passthrough methods and properties from the `Accelerator` base class
- Removed the deprecated signature for the `transfer_batch_to_device` hook. The new argument `dataloader_idx` is now required (#10480)
- Removed the deprecated `utilities.distributed.rank_zero_{warn/deprecation}` (#10451)
- Removed the deprecated `mode` argument from the `ModelSummary` class (#10449)
- Removed the deprecated `Trainer.train_loop` property in favor of `Trainer.fit_loop` (#10482)
- Removed the deprecated `disable_validation` property from Trainer (#10450)
- Removed the deprecated `CheckpointConnector.hpc_load` property in favor of `CheckpointConnector.restore` (#10525)
- Removed the deprecated `reload_dataloaders_every_epoch` from `Trainer` in favour of `reload_dataloaders_every_n_epochs` (#10481)
- Removed the `precision_plugin` attribute from `Accelerator` in favor of its equivalent attribute `precision_plugin` in the `TrainingTypePlugin` (#10570)
- Removed the `DeepSpeedPlugin.{precision,amp_type,amp_level}` properties (#10657)
- Removed the patching of the `on_before_batch_transfer`, `transfer_batch_to_device` and `on_after_batch_transfer` hooks in `LightningModule` (#10603)
- Removed the argument `return_result` from the `DDPSpawnPlugin.spawn()` method (#10867)
- Removed the property `TrainingTypePlugin.results` and corresponding properties in subclasses (#10034)
- Removed the `mp_queue` attribute from `DDPSpawnPlugin` and `TPUSpawnPlugin` (#10034)
- Removed unnecessary `_move_optimizer_state` method overrides from `TPUSpawnPlugin` and `SingleTPUPlugin` (#10849)
- Removed the `should_rank_save_checkpoint` property from `TrainingTypePlugin` (#11070)
- Removed the `model_sharded_context` method from `Accelerator` (#10886)
- Removed the method `pre_dispatch` from the `PrecisionPlugin` (#10887)
- Removed the method `setup_optimizers_in_pre_dispatch` from the `strategies` and achieved the same logic in the `setup` and `pre_dispatch` methods (#10906)
- Removed the methods `pre_dispatch`, `dispatch` and `post_dispatch` from the `Accelerator` (#10885)
- Removed the methods `training_step`, `test_step`, `validation_step` and `predict_step` from the `Accelerator` (#10890)
- Removed the `TrainingTypePlugin.start_{training,evaluating,predicting}` hooks and the same in all subclasses (#10989, #10896)
- Removed `Accelerator.on_train_start` (#10999)
- Removed support for Python 3.6 (#11117)
- Removed `Strategy.init_optimizers` in favor of `Strategy.setup_optimizers` (#11236)
- Removed `profile("training_step_and_backward")` in the `Closure` class, since we already profile the calls `training_step` and `backward` (#11222)
- Removed `Strategy.optimizer_zero_grad` (#11246)
- Removed `Strategy.on_gpu` (#11537)
- Removed the `Strategy.on_tpu` property (#11536)
- Removed the abstract property `LightningLoggerBase.experiment` (#11603)
- Removed the `FitLoop.current_epoch` getter and setter (#11562)
- Removed access to `_short_id` in `NeptuneLogger` (#11517)
- Removed `log_text` and `log_image` from the `LightningLoggerBase` API (#11857)
- Removed calls to `profile("model_forward")` in favor of profiling `training_step` (#12032)
- Removed `get_mp_spawn_kwargs` from `DDPSpawnStrategy` and `TPUSpawnStrategy` in favor of configuration in the `_SpawnLauncher` (#11966)
- Removed `_aggregate_metrics`, `_reduce_agg_metrics`, and `_finalize_agg_metrics` from `LightningLoggerBase` (#12053)
- Removed the `AcceleratorConnector.device_type` property (#12081)
- Removed `AcceleratorConnector.num_nodes` (#12107)
- Removed the `AcceleratorConnector.has_ipu` property (#12111)
- Removed the `AcceleratorConnector.use_ipu` property (#12110)
- Removed the `AcceleratorConnector.has_tpu` property (#12109)
- Removed the `AcceleratorConnector.use_dp` property (#12112)
- Removed `configure_sync_batchnorm` from `ParallelStrategy` and all other strategies that inherit from it (#11754)
- Removed the public attribute `sync_batchnorm` from strategies (#11754)
- Removed the `AcceleratorConnector.root_gpu` property (#12262)
- Removed the `AcceleratorConnector.tpu_id` property (#12387)
- Removed the `AcceleratorConnector.num_gpus` property (#12384)
- Removed the `AcceleratorConnector.num_ipus` property (#12386)
- Removed the `AcceleratorConnector.num_processes` property (#12388)
- Removed the `AcceleratorConnector.parallel_device_ids` property (#12072)
- Removed the `AcceleratorConnector.devices` property (#12435)
- Removed the `AcceleratorConnector.parallel_devices` property (#12075)
- Removed the `AcceleratorConnector.tpu_cores` property (#12437)
Fixed
- Fixed an issue where `ModelCheckpoint` could delete the last checkpoint from the old directory when `dirpath` has changed during resumed training (#12225)
- Fixed an issue where `ModelCheckpoint` could delete older checkpoints when `dirpath` has changed during resumed training (#12045)
- Fixed an issue where `HorovodStrategy.teardown()` did not complete gracefully if an exception was thrown during callback setup (#11752)
- Fixed security vulnerabilities CVE-2020-1747 and CVE-2020-14343 caused by the `PyYAML` dependency (#11099)
- Fixed security vulnerability "CWE-94: Improper Control of Generation of Code (Code Injection)" (#12212)
- Fixed logging on `{test,validation}_epoch_end` with multiple dataloaders (#11132)
- Reset the validation progress tracking state after sanity checking (#11218)
- Fixed a double evaluation bug with fault-tolerance enabled where the second call was completely skipped (#11119)
- Fixed an issue with the `TPUSpawnPlugin` handling the `XLA_USE_BF16` environment variable incorrectly (#10990)
- Fixed a wrong typehint for `Trainer.lightning_optimizers` (#11155)
- Fixed the lr-scheduler state not being dumped to the checkpoint when using the deepspeed strategy (#11307)
- Fixed a bug that forced overriding `configure_optimizers` with the CLI (#11672)
- Fixed type promotion when tensors of higher category than float are logged (#11401)
- Fixed the `SimpleProfiler` summary (#11414)
- No longer set a `DistributedSampler` on the `poptorch.DataLoader` when IPUs are used (#12114)
- Fixed a bug where the progress bar was not being disabled when not in rank zero during predict (#11377)
- Fixed the mid-epoch warning call while resuming training (#11556)
- Fixed `LightningModule.{un,}toggle_model` when only 1 optimizer is used (#12088)
- Fixed an issue in `RichProgressBar` to display the metrics logged only on the main progress bar (#11690)
- Fixed `RichProgressBar` progress when the refresh rate does not evenly divide the total counter (#11668)
- Fixed the `RichProgressBar` validation bar total when using multiple validation runs within a single training epoch (#11668)
- Configure native DeepSpeed schedulers with interval='step' (#11788), (#12031)
- Update `RichProgressBarTheme` styles after detecting a light theme on Colab (#10993)
- Fixed passing `_ddp_params_and_buffers_to_ignore` (#11949)
- Fixed an `AttributeError` when calling `save_hyperparameters` and no parameters need saving (#11827)
- Fixed environment variable priority for global rank determination (#11406)
- Fixed an issue that caused the Trainer to produce identical results on subsequent runs without explicit re-seeding (#11870)
- Fixed an issue that caused the Tuner to affect the random state (#11870)
- Fixed to avoid the common hook warning if no hook is overridden (#12131)
- Fixed deepspeed keeping old sub-folders in the same ckpt path (#12194)
- Fixed returning logged metrics instead of callback metrics during evaluation (#12224)
- Fixed the case where `logger=None` is passed to the Trainer (#12249)
- Fixed a bug where the global step tracked by `ModelCheckpoint` was still set even if no checkpoint was saved (#12418)
- Fixed a bug where `ModelCheckpoint` was overriding the `epoch` and `step` logged values (#12418)
- Fixed a bug where monitoring the default `epoch` and `step` values with `ModelCheckpoint` would fail (#12418)
- Fixed initializing optimizers unnecessarily in `DDPFullyShardedStrategy` (#12267)
- Fixed the check for the horovod module (#12377)
- Fixed logging to loggers with multiple eval dataloaders (#12454)
- Fixed an issue with resuming from a checkpoint trained with QAT (#11346)
Full commit list: 1.5.0...1.6.0
Contributors
Veteran
@akihironitta @ananthsub @awaelchli @Borda @borisdayma @carmocca @daniellepintz @edward-io @ethanwharris @four4fish @jjenniferdai @kaushikb11 @kingyiusuen @kragniz @mauvilsa @ninginthecloud @popfido @rohitgr7 @SeanNaren @speediedan @tchaton @tshu-w @twsl @williamFalcon
New
@a-gardner1 @abhi-rf @abhinavarora @adamreeve @adamviola @AJSVB @akashkw @amin-nejad @AndresAlgaba @ant0nsc @armanal @bhadreshpsavani @CAIQT @catalys1 @chaddy1004 @chunyang-wen @circlecrystal @Code-Cornelius @Cyber-Machine @dennisbappert @DuYicong515 @edpizzi @franp9am @ftorres16 @ggare-cmu @guyang3532 @Honzys @idiomaticrefactoring @isvogor-foi @jerome-habana @jgibson2 @jlhbaseball15 @jona-0 @JoostvDoorn @josafatburmeister @konstantinjdobler @Kr4is @krishnakalyan3 @krshrimali @lemairecarl @lucmos @manangoel99 @mathemusician @mayeroa @mbortolon97 @NathanGodey @Nesqulck @nithinraok @ORippler @os1ma @peterdudfield @Piyush-97 @puhuk @qqueing @quancs @Raahul-Singh @Raalsky @Rajathbharadwaj @rasbt @rharish101 @rhjohnstone @rjkilpatrick @RobertLaurella @roschly @rsokl @rusty1s @SauravMaheshkar @sethvargo @shabie @shivammehta007 @srb-cv @ThomVett @wangraying @whokilleddb @zredeaux65
If we forgot someone or have any suggestion, let us know in Slack ⚡