Lightning 1.8: Colossal-AI Strategy, Commands and Secrets for Apps, FSDP Improvements and More!
The core team is excited to announce the release of Lightning 1.8 ⚡
Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug fixes, and documentation for a total of over 550 commits since v1.7.
Highlights
Colossal-AI
Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:
```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import ColossalAIStrategy

# Select the strategy with good defaults
trainer = Trainer(strategy="colossalai")

# or tune parameters to your liking
trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))
```
You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.
Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:
- Data Parallelism
- Pipeline Parallelism
- 1D, 2D, 2.5D, 3D Tensor Parallelism
- Sequence Parallelism
- Zero Redundancy Optimization
Learn how to install and use Colossal-AI effectively with Lightning here.
NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.
Secrets for Lightning Apps
Introducing encrypted secrets (#14612), a feature requested by Lightning App users 🎉!
Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.
- Add a secret to your Lightning account in lightning.ai (read more here).

- Add an environment variable to your app to read the secret:

  ```python
  # somewhere in your Flow or Work:
  GitHubComponent(api_token=os.environ["API_TOKEN"])
  ```

- Pass the secret to your app run with the following command:

  ```
  lightning run app app.py --cloud --secret API_TOKEN=github_api_token
  ```
These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.
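For illustration, here is a minimal sketch of how a Work could consume the injected variable at runtime. The `GitHubComponent` name and what it does with the token are assumptions for this example; only the `os.environ["API_TOKEN"]` lookup comes from the steps above.

```python
import os

import lightning as L


class GitHubComponent(L.LightningWork):
    """Hypothetical Work that needs an API token at runtime."""

    def __init__(self, api_token: str):
        super().__init__()
        self.api_token = api_token

    def run(self):
        # Authenticate against an external service with the token here.
        print("API token is set:", bool(self.api_token))


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        # The value is injected by `--secret API_TOKEN=github_api_token` at runtime.
        self.github = GitHubComponent(api_token=os.environ["API_TOKEN"])

    def run(self):
        self.github.run()


app = L.LightningApp(Flow())
```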
NOTE: This is an experimental feature.
CLI Commands for Lightning Apps
Introducing CLI commands for apps (#13602)!
As a Lightning App builder, if you want to easily create a CLI interface for users to interact with your app, then this is for you.
Here is an example where users can dynamically create notebooks from the CLI.
All you need to do is implement the `configure_commands` hook on the `LightningFlow`:
```python
import lightning as L
from commands.notebook.run import RunNotebook


class Flow(L.LightningFlow):
    ...

    def configure_commands(self):
        # Return a list of dictionaries with commands:
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]


app = L.LightningApp(Flow())
```
Once the app is running with `lightning run app app.py`, you can connect to the app with the following command:
lightning connect {app name} -y
and run the command that was configured:
lightning run notebook --name=my_notebook_name
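To complete the picture, the `run_notebook` method that the `RunNotebook` command binds to could look roughly like this. This is a sketch: the exact payload the command passes to the method (here a `config` object with a `name` field) depends on how `RunNotebook` is implemented in your app and is an assumption for illustration.

```python
import lightning as L
from commands.notebook.run import RunNotebook  # part of your app's source tree, not Lightning


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.notebook_names = []

    def run_notebook(self, config):
        # `config` stands for whatever payload your RunNotebook command forwards
        # from the CLI, e.g. an object carrying the --name argument.
        self.notebook_names.append(config.name)
        print(f"launching notebook {config.name}")

    def configure_commands(self):
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]


app = L.LightningApp(Flow())
```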
For a full tutorial and running example, visit our docs.
NOTE: This is an experimental feature.
Auto-wrapping for FSDP Strategy
In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.
# Native FSDP implementation
trainer = Trainer(strategy="fsdp_native")
We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).
Here are some examples:
Case 1: Model is so large that it does not fit into CPU memory.
Construct your layers in the `configure_sharded_model` hook and wrap the large ones you want to shard across GPUs:
```python
from lightning.pytorch import LightningModule
from torch.distributed.fsdp.wrap import wrap  # for the native FSDP strategy


class MassiveModel(LightningModule):
    ...

    # Create model here and wrap the large layers for sharding
    def configure_sharded_model(self):
        for i, layer in enumerate(self.block):
            self.block[i] = wrap(layer)
        ...
```
Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:
```python
model = MassiveModel()

trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="fsdp_native",  # or strategy="fsdp" for fairscale
    precision=16,
)

# Automatically wraps the layers here:
trainer.fit(model)
```
Case 3: Model fits into GPU memory. No action required, use any strategy you want.
Note: if you want to manually wrap layers for more control, you can still do that!
Read more about FSDP and how layer wrapping works in our docs.
New Tuner Callbacks
In this release, we focused on Tuner improvements and introduced two new callbacks that help you customize the batch size finder and learning rate finder to fit your use case.
Batch Size Finder (#11089)
- You can customize the `BatchSizeFinder` callback to run at different epochs. This feature is useful while fine-tuning models, since you can't always use the same batch size after unfreezing the backbone:

  ```python
  from lightning.pytorch.callbacks import BatchSizeFinder


  class FineTuneBatchSizeFinder(BatchSizeFinder):
      def __init__(self, milestones, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.milestones = milestones

      def on_fit_start(self, *args, **kwargs):
          return

      def on_train_epoch_start(self, trainer, pl_module):
          if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
              self.scale_batch_size(trainer, pl_module)


  trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(5, 10))])
  trainer.fit(...)
  ```
- Run the batch size finder for `validate`/`test`/`predict`:

  ```python
  from lightning.pytorch.callbacks import BatchSizeFinder


  class EvalBatchSizeFinder(BatchSizeFinder):
      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)

      def on_fit_start(self, *args, **kwargs):
          return

      def on_test_start(self, trainer, pl_module):
          self.scale_batch_size(trainer, pl_module)


  trainer = Trainer(callbacks=[EvalBatchSizeFinder()])
  trainer.test(...)
  ```
Learning Rate Finder (#13802)
You can now use the `LearningRateFinder` callback to run at different intervals. This feature is useful when fine-tuning models, for example.
```python
from lightning.pytorch.callbacks import LearningRateFinder


class FineTuneLearningRateFinder(LearningRateFinder):
    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        return

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
            self.lr_find(trainer, pl_module)


trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
trainer.fit(...)
```
LightningCLI Improvements
Even though the `LightningCLI` class is designed to help in the implementation of command line tools, there are cases where it is more desirable to run it directly from Python. In Lightning 1.8, you can now do this (#14596):
```python
from lightning.pytorch.cli import LightningCLI


def cli_main(args):
    cli = LightningCLI(MyModel, ..., args=args)
    ...
```
Anywhere in your program, you can now call the CLI directly:
cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])
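Because the CLI is now an ordinary function call, you can, for example, drive a small sweep from plain Python. This is a minimal sketch that simply reuses the `cli_main` function defined above.

```python
# Drive a small sweep by calling the CLI as a regular function.
for max_epochs in (10, 50, 100):
    cli_main([f"--trainer.max_epochs={max_epochs}", "--model.encoder_layers=24"])
```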
Learn about all features of the LightningCLI!
Improvements to the SLURM Support
Multi-node training on a SLURM cluster has been supported since the inception of the Lightning Trainer and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality-of-life improvements:
- The preemption/termination signal is now configurable (#14626):

  ```python
  import signal

  from lightning.pytorch.plugins.environments import SLURMEnvironment

  # the default signal is SIGUSR1
  trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])

  # customize it for your cluster
  trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])
  ```
- Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.
Read more about our SLURM integration here.
Backward Incompatible Changes
This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Callback hooks for loading and saving checkpoints
The signature and behavior of the `on_load_checkpoint` and `on_save_checkpoint` callback hooks have changed (#14835):
Before:
```python
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # previously, we were able to return state here
    return state


def on_load_checkpoint(self, trainer, pl_module, callback_state):
    # previously, only the state for this callback was passed in as argument
    ...
```
Now:
```python
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # returning a value here is no longer supported
    # you can modify the checkpoint dict directly
    return None


def state_dict(self):
    ...
    # return the state of this callback from this new method
    return state


def on_load_checkpoint(self, trainer, pl_module, checkpoint):
    # the full checkpoint dictionary is now passed in (previously only this callback's state)
    ...


def load_state_dict(self, state):
    # the state for this callback now gets passed to this new method
    ...
```
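As a concrete example, a callback that previously returned its state from `on_save_checkpoint` can be migrated like this. The `BestScoreTracker` callback and its `best_score` attribute are hypothetical and only illustrate the new `state_dict`/`load_state_dict` pattern.

```python
from lightning.pytorch.callbacks import Callback


class BestScoreTracker(Callback):
    """Hypothetical callback that keeps a single value across restarts."""

    def __init__(self):
        self.best_score = float("-inf")

    def state_dict(self):
        # Return only this callback's state; Lightning stores it in the checkpoint.
        return {"best_score": self.best_score}

    def load_state_dict(self, state_dict):
        # Receives exactly what state_dict() returned.
        self.best_score = state_dict["best_score"]

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # The full checkpoint dict can still be inspected or modified in place.
        checkpoint["best_score_epoch"] = trainer.current_epoch
```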
DataModule hooks for loading and saving checkpoints
The `on_save_checkpoint` and `on_load_checkpoint` hooks on the `LightningDataModule` have been removed in favor of the `state_dict` and `load_state_dict` methods:
```diff
-def on_save_checkpoint(self, checkpoint):
-    checkpoint["banana"] = self.banana
+def state_dict(self):
+    return dict(banana=self.banana)

-def on_load_checkpoint(self, checkpoint):
-    self.banana = checkpoint["banana"]
+def load_state_dict(self, state):
+    self.banana = state["banana"]
```
Callback hooks
We removed some deprecated `Callback` hooks that were ambiguous to use (#14834):
| Old name | New name |
|---|---|
| `on_batch_start` | `on_train_batch_start` |
| `on_batch_end` | `on_train_batch_end` |
| `on_epoch_start` | `on_train_epoch_start` |
| `on_epoch_start` | `on_validation_epoch_start` |
| `on_epoch_start` | `on_test_epoch_start` |
| `on_pretrain_routine_start` | `on_fit_start` |
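In practice, a callback that relied on the stage-agnostic `on_epoch_start` hook now implements the stage-specific hooks explicitly. A minimal sketch with a hypothetical `EpochPrinter` callback:

```python
from lightning.pytorch.callbacks import Callback


class EpochPrinter(Callback):
    # Before 1.8, a single on_epoch_start hook fired for train, validation and test.
    # Now each stage has its own explicit hook.
    def on_train_epoch_start(self, trainer, pl_module):
        print(f"train epoch {trainer.current_epoch} started")

    def on_validation_epoch_start(self, trainer, pl_module):
        print("validation epoch started")

    def on_test_epoch_start(self, trainer, pl_module):
        print("test epoch started")
```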
Trainer Device Attributes
We cleaned up the properties related to device indices (#14829).
The attributes `Trainer.{devices,gpus,num_gpus,ipus,tpu_cores,num_processes,root_gpu,data_parallel_device_ids}` have been removed in favor of accelerator-agnostic attributes:
trainer = Trainer(...)
# access the number of devices the trainer uses on this machine ...
print(trainer.num_devices)
# ... or the device IDs
print(trainer.device_ids)
Setting the torch-distributed backend
In previous versions of Lightning, switching between the "gloo" and "nccl" backends for multi-GPU, multi-node training was possible through setting an environment variable like so:
PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py
But not all strategies support changing the backend in this way.
From now on, the backend has to be set in the code (#14693):
```python
from lightning.pytorch.strategies import DDPStrategy

trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"))
```
The default remains "nccl", and you should choose "gloo" only for debugging purposes.
Logging with multiple loggers
Logging with multiple loggers can be super useful (and super easy with Lightning). For example, you could be using one logger to record sensitive image logs to a hosted MLFlow server within your organization, and at the same time log loss curves online to WandB.
```python
trainer = Trainer(
    logger=[WandbLogger(...), MLFlowLogger(...)],
)
```
Here are two major changes that apply when using multiple loggers in 1.8:
- Checkpoints and profiler reports no longer go to a strange folder with a long, hard-to-remember name (#14325). From now on, these artifacts will land in the version folder of the first logger in the list.
- The loggers used to be wrapped by a `LoggerCollection` object, so that when you accessed `trainer.logger` you could log to all of them simultaneously. However, this "magic" caused confusion and errors among users, and we decided to simplify this (#14283):

  ```python
  # now returns the first logger in the list
  print(trainer.logger)

  # access all loggers in a list with plural
  loggers = trainer.loggers
  for logger in loggers:
      logger.do_something()
  ```
Deprecations
Why is Lightning deprecating APIs in every release?
Many users have this question, and it is a fair one! Deprecations are a normal part of API evolution in all software. We continually improve Lightning, which means we make APIs like class names, methods, hooks and arguments clear, easy to remember, and general enough to adopt more functionality in the future. Sometimes we have to let old things go to build new and better products.
Learn more about our deprecation window here.
So far, we have followed the pattern of removing deprecated functionality and APIs after two minor versions of deprecation. From Lightning 1.8 onward, we will additionally convert warnings to error messages after the deprecation phase ends. This way, we can greatly improve the upgrade experience with helpful messages for users who skip more than two minor Lightning versions. The exception to this rule is experimental features, which are marked as such in our documentation.
Here is a summary of major deprecations introduced in 1.8:
| API | Removal version | Alternative |
|---|---|---|
| Argument `Trainer(amp_level=...)` | 1.10 | `Trainer(plugins=[ApexMixedPrecisionPlugin(amp_level=...)])` |
| Function `unwrap_lightning_module` | 1.10 | `Strategy.lightning_module` |
| Function `unwrap_lightning_module_sharded` | 1.10 | `Strategy.lightning_module` |
| Import `pl.core.mixins.DeviceDtypeModuleMixin` | 1.10 | No longer supported |
| Argument `LightningCLI(save_config_filename=...)` | 1.10 | `LightningCLI(save_config_kwargs=dict(config_filename=...))` |
| Argument `LightningCLI(save_config_overwrite=...)` | 1.10 | `LightningCLI(save_config_kwargs=dict(overwrite=...))` |
| Argument `LightningCLI(save_config_multifile=...)` | 1.10 | `LightningCLI(save_config_kwargs=dict(multifile=...))` |
| Enum `TrainerFn.TUNING` | 1.10 | No longer supported |
| Enum `RunningStage.TUNING` | 1.10 | No longer supported |
| Attribute `Trainer.tuning` | 1.10 | No longer supported |
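For example, the three deprecated `LightningCLI` save-config arguments collapse into the single `save_config_kwargs` dictionary. Here is a sketch based on the table above, with `MyModel` standing in for your model class:

```python
from lightning.pytorch.cli import LightningCLI

# Before (deprecated in 1.8, removal planned for 1.10):
# cli = LightningCLI(MyModel, save_config_overwrite=True, save_config_multifile=True)

# After:
cli = LightningCLI(MyModel, save_config_kwargs=dict(overwrite=True, multifile=True))
```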
CHANGELOG
Lightning App
Added
- Added `load_state_dict` and `state_dict` hooks for `LightningFlow` components (#14100)
- Added a `--secret` option to CLI to allow binding secrets to app environment variables when running in the cloud (#14612)
- Added support for running the works without cloud compute in the default container (#14819)
- Added an HTTPQueue as an optional replacement for the default redis queue (#14978)
- Added support for configuring flow cloud compute (#14831)
- Added support for adding descriptions to commands either through a docstring or the `DESCRIPTION` attribute (#15193)
- Added a try / catch mechanism around request processing to avoid killing the flow (#15187)
- Added a Database Component (#14995)
- Added authentication to HTTP queue (#15202)
- Added support to pass a `LightningWork` to the `LightningApp` (#15215)
- Added support for getting CLI help for connected apps even if the app isn't running (#15196)
- Added support for adding requirements to commands and installing them when missing when running an app command (#15198)
- Added Lightning CLI Connection to be terminal session instead of global (#15241)
- Added support for managing SSH-keys via CLI (#15291)
- Added a `JustPyFrontend` to ease UI creation with https://github.com/justpy-org/justpy (#15002)
- Added a layout endpoint to the Rest API and enable to disable pulling or pushing to the state (#15367)
- Added support for functions for `configure_api` and `configure_commands` to be executed in the Rest API process (#15098)
- Added support to start lightning app on cloud without needing to install dependencies locally (#15019)
Changed
Fixed
- Fixed an issue when using the CLI without arguments (#14877)
- Fixed a bug where the upload files endpoint would raise an error when running locally (#14924)
- Fixed BYOC cluster region selector: hidden from help since only us-east-1 has been tested and is recommended (#15277)
- Fixed a bug when launching an app on multiple clusters (#15226)
- Fixed a bug with a default CloudCompute for Lightning flows (#15371)
Lightning Trainer
Added
- Added support for requeueing slurm array jobs (#15040)
- Added native AMP support for `ddp_fork` (and associated alias strategies) with CUDA GPUs (#14983)
- Added `BatchSizeFinder` callback (#11089)
- Added `LearningRateFinder` callback (#13802)
- Tuner now supports a new `method` argument which will determine when to run the `BatchSizeFinder`: one of `fit`, `validate`, `test` or `predict` (#11089)
- Added prefix to log message in `seed_everything` with rank info (#14031)
- Added support for auto wrapping for `DDPFullyShardedNativeStrategy` (#14252)
- Added support for passing extra init-parameters to the `LightningDataModule.from_datasets` (#14185)
- Added support for saving sharded optimizer state dict outside of `DDPShardedStrategy` (#14208)
- Added support for auto wrapping for `DDPFullyShardedStrategy` (#14383)
- Integrate the `lightning_utilities` package (#14475, #14537, #14556, #14558, #14575, #14620)
- Added `args` parameter to `LightningCLI` to ease running from within Python (#14596)
- Added `WandbLogger.download_artifact` and `WandbLogger.use_artifact` for managing artifacts with Weights and Biases (#14551)
- Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)
- Added a warning when the model passed to `LightningLite.setup()` does not have all parameters on the same device (#14822)
- The `CometLogger` now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)
- Introduce `ckpt_path="hpc"` keyword for checkpoint loading (#14911)
- Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)
- Added support for custom parameters in subclasses of `SaveConfigCallback` (#14998)
- Added `inference_mode` flag to Trainer to let users enable/disable inference mode during evaluation (#15034)
- Added `LightningLite.no_backward_sync` for control over efficient gradient accumulation with distributed strategies (#14966)
- Added a sanity check that scripts are executed with the `srun` command in SLURM and that environment variables are not conflicting (#15011)
- Added an error message when attempting to launch processes with `python -i` and an interactive-incompatible strategy (#15293)
Changed
- The `Trainer.{fit,validate,test,predict,tune}` methods now raise a useful error message if the input is not a `LightningModule` (#13892)
- Raised a `MisconfigurationException` if batch transfer hooks are overridden with `IPUAccelerator` (#13961)
- Replaced the unwrapping logic in strategies with direct access to the unwrapped `LightningModule` (#13738)
- Enabled `on_before_batch_transfer` for `DPStrategy` and `IPUAccelerator` (#14023)
- When resuming training with Apex enabled, the `Trainer` will now raise an error (#14341)
- Included `torch.cuda` rng state to the aggregate `_collect_rng_states()` and `_set_rng_states()` (#14384)
- Changed `trainer.should_stop` to not stop in between an epoch and run until `min_steps`/`min_epochs` only (#13890)
- The `pyDeprecate` dependency is no longer installed (#14472)
- When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)
- In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)
- Removed fall-back to `LightningEnvironment` when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)
- Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)
- Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the `lightning_lite.precision.Precision` base class (#14798)
  - The `PrecisionPlugin.backward` signature changed: the `closure_loss` argument was renamed to `tensor`
  - The `PrecisionPlugin.{pre_,post_}backward` signature changed: the `closure_loss` argument was renamed to `tensor` and moved as the first argument
  - The `PrecisionPlugin.optimizer_step` signature changed: the `model`, `optimizer_idx` and `closure` arguments need to be passed as keyword arguments now
- Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the `PL_DISABLE_FORK` environment variable introduced in v1.7.4 (#14631)
- The `MLFlowLogger.finalize()` now sets the status to `FAILED` when an exception occurred in `Trainer`, and sets the status to `FINISHED` on successful completion (#12292)
- It is no longer needed to call `model.double()` when using `precision=64` in Lightning Lite (#14827)
- HPC checkpoints are now loaded automatically only in SLURM environment when no specific value for `ckpt_path` has been set (#14911)
- The `Callback.on_load_checkpoint` now gets the full checkpoint dictionary and the `callback_state` argument was renamed `checkpoint` (#14835)
- Moved the warning about saving nn.Module in `save_hyperparameters()` to before the deepcopy (#15132)
- To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for `torch.cuda.device_count`, and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use an NVML-based check for `torch.cuda.is_available` (#15110, #15133)
- The `NeptuneLogger` now uses `neptune.init_run` instead of the deprecated `neptune.init` to initialize a run (#15393)
Deprecated
- Deprecated `LightningDeepSpeedModule` (#14000)
- Deprecated `amp_level` from `Trainer` in favour of passing it explicitly via precision plugin (#13898)
- Deprecated the calls to `pytorch_lightning.utilities.meta` functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
- Deprecated the `unwrap_lightning_module` and `unwrap_lightning_module_sharded` utility functions in favor of accessing the unwrapped `LightningModule` on the strategy directly (#13738)
- Deprecated the `pl_module` argument in `LightningParallelModule`, `LightningDistributedModule`, `LightningShardedDataParallel`, `LightningBaguaModule` and `LightningDeepSpeedModule` wrapper classes (#13738)
- Deprecated the `on_colab_kaggle` function (#14247)
- Deprecated the internal `pl.core.mixins.DeviceDtypeModuleMixin` class (#14511, #14548)
- Deprecated all functions in `pytorch_lightning.utilities.xla_device` (#14514, #14550)
  - Deprecated the internal `inner_f` function
  - Deprecated the internal `pl_multi_process` function
  - Deprecated the internal `XLADeviceUtils.xla_available` staticmethod
  - Deprecated the `XLADeviceUtils.tpu_device_exists` staticmethod in favor of `pytorch_lightning.accelerators.TPUAccelerator.is_available()`
- Deprecated `pytorch_lightning.utilities.distributed.tpu_distributed` in favor of `lightning_lite.accelerators.tpu.tpu_distributed` (#14550)
- Deprecated all functions in `pytorch_lightning.utilities.cloud_io` in favor of `lightning_lite.utilities.cloud_io` (#14515)
- Deprecated the functions in `pytorch_lightning.utilities.apply_func` in favor of `lightning_utilities.core.apply_func` (#14516, #14537)
- Deprecated all functions in `pytorch_lightning.utilities.device_parser` (#14492, #14753)
  - Deprecated the `pytorch_lightning.utilities.device_parser.determine_root_gpu_device` in favor of `lightning_lite.utilities.device_parser.determine_root_gpu_device`
  - Deprecated the `pytorch_lightning.utilities.device_parser.parse_gpu_ids` in favor of `lightning_lite.utilities.device_parser.parse_gpu_ids`
  - Deprecated the `pytorch_lightning.utilities.device_parser.is_cuda_available` in favor of `lightning_lite.accelerators.cuda.is_cuda_available`
  - Deprecated the `pytorch_lightning.utilities.device_parser.num_cuda_devices` in favor of `lightning_lite.accelerators.cuda.num_cuda_devices`
  - Deprecated the `pytorch_lightning.utilities.device_parser.parse_cpu_cores` in favor of `lightning_lite.accelerators.cpu.parse_cpu_cores`
  - Deprecated the `pytorch_lightning.utilities.device_parser.parse_tpu_cores` in favor of `lightning_lite.accelerators.tpu.parse_tpu_cores`
  - Deprecated the `pytorch_lightning.utilities.device_parser.parse_hpus` in favor of `pytorch_lightning.accelerators.hpu.parse_hpus`
- Deprecated duplicate `SaveConfigCallback` parameters in `LightningCLI.__init__`: `save_config_filename`, `save_config_overwrite` and `save_config_multifile`. New `save_config_kwargs` parameter should be used instead (#14998)
- Deprecated `TrainerFn.TUNING`, `RunningStage.TUNING` and `trainer.tuning` property (#15100)
- Deprecated custom `pl.utilities.distributed.AllGatherGrad` implementation in favor of PyTorch's (#15364)
Removed
- Removed the deprecated `Trainer.training_type_plugin` property in favor of `Trainer.strategy` (#14011)
- Removed all deprecated training type plugins (#14011)
- Removed the deprecated `DDP2Strategy` (#14026)
- Removed the deprecated `DistributedType` and `DeviceType` enum classes (#14045)
- Removed deprecated support for passing the `rank_zero_warn` warning category positionally (#14470)
- Removed the legacy and unused `Trainer.get_deprecated_arg_names()` (#14415)
- Removed the deprecated `on_train_batch_end(outputs)` format when multiple optimizers are used and TBPTT is enabled (#14373)
- Removed the deprecated `training_epoch_end(outputs)` format when multiple optimizers are used and TBPTT is enabled (#14373)
- Removed the experimental `pytorch_lightning.utilities.meta` functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
- Removed the deprecated `LoggerCollection`; `Trainer.logger` and `LightningModule.logger` now return the first logger when more than one gets passed to the Trainer (#14283)
- Removed the deprecated `trainer.lr_schedulers` (#14408)
- Removed the deprecated `LightningModule.{on_hpc_load,on_hpc_save}` hooks in favor of the general purpose hooks `LightningModule.{on_load_checkpoint,on_save_checkpoint}` (#14315)
- Removed deprecated support for old torchtext versions (#14375)
- Removed deprecated support for the old `neptune-client` API in the `NeptuneLogger` (#14727)
- Removed the deprecated `weights_save_path` Trainer argument and `Trainer.weights_save_path` property (#14424)
- Removed the deprecated (#14471)
  - `pytorch_lightning.utilities.distributed.rank_zero_only` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_only`
  - `pytorch_lightning.utilities.distributed.rank_zero_debug` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_debug`
  - `pytorch_lightning.utilities.distributed.rank_zero_info` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_info`
  - `pytorch_lightning.utilities.warnings.rank_zero_warn` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_warn`
  - `pytorch_lightning.utilities.warnings.rank_zero_deprecation` in favor of `pytorch_lightning.utilities.rank_zero.rank_zero_deprecation`
  - `pytorch_lightning.utilities.warnings.LightningDeprecationWarning` in favor of `pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning`
- Removed deprecated `Trainer.num_processes` attribute in favour of `Trainer.num_devices` (#14423)
- Removed the deprecated `Trainer.data_parallel_device_ids` hook in favour of `Trainer.device_ids` (#14422)
- Removed the deprecated class `TrainerCallbackHookMixin` (#14401)
- Removed the deprecated `BaseProfiler` and `AbstractProfiler` classes (#14404)
- Removed the deprecated way to set the distributed backend via the environment variable `PL_TORCH_DISTRIBUTED_BACKEND`, in favor of setting the `process_group_backend` in the strategy constructor (#14693)
- Removed deprecated callback hooks (#14834)
  - `Callback.on_configure_sharded_model` in favor of `Callback.setup`
  - `Callback.on_before_accelerator_backend_setup` in favor of `Callback.setup`
  - `Callback.on_batch_start` in favor of `Callback.on_train_batch_start`
  - `Callback.on_batch_end` in favor of `Callback.on_train_batch_end`
  - `Callback.on_epoch_start` in favor of `Callback.on_{train,validation,test}_epoch_start`
  - `Callback.on_epoch_end` in favor of `Callback.on_{train,validation,test}_epoch_end`
  - `Callback.on_pretrain_routine_{start,end}` in favor of `Callback.on_fit_start`
- Removed the deprecated device attributes `Trainer.{devices,gpus,num_gpus,ipus,tpu_cores}` in favor of the accelerator-agnostic `Trainer.num_devices` (#14829)
- Removed the deprecated `LightningIPUModule` (#14830)
- Removed the deprecated `Logger.agg_and_log_metrics` hook in favour of `Logger.log_metrics` and the `agg_key_funcs` and `agg_default_func` arguments (#14840)
- Removed the deprecated precision plugin checkpoint hooks `PrecisionPlugin.on_load_checkpoint` and `PrecisionPlugin.on_save_checkpoint` (#14833)
- Removed the deprecated `Trainer.root_gpu` attribute in favor of `Trainer.strategy.root_device` (#14829)
- Removed the deprecated `Trainer.use_amp` and `LightningModule.use_amp` attributes (#14832)
- Removed the deprecated callback hooks `Callback.on_init_start` and `Callback.on_init_end` (#14867)
- Removed the deprecated `Trainer.run_stage` in favor of `Trainer.{fit,validate,test,predict}` (#14870)
- Removed the deprecated `SimpleProfiler.profile_iterable` and `AdvancedProfiler.profile_iterable` attributes (#14864)
- Removed the deprecated `Trainer.verbose_evaluate` (#14884)
- Removed the deprecated `Trainer.should_rank_save_checkpoint` (#14885)
- Removed the deprecated `TrainerOptimizersMixin` (#14887)
- Removed the deprecated `Trainer.lightning_optimizers` (#14889)
- Removed the deprecated `TrainerDataLoadingMixin` (#14888)
- Removed the deprecated `Trainer.call_hook` in favor of `Trainer._call_callback_hooks`, `Trainer._call_lightning_module_hook`, `Trainer._call_ttp_hook`, and `Trainer._call_accelerator_hook` (#14869)
- Removed the deprecated `Trainer.{validated,tested,predicted}_ckpt_path` (#14897)
- Removed the deprecated `device_stats_monitor_prefix_metric_keys` (#14890)
- Removed the deprecated `LightningDataModule.on_save/load_checkpoint` hooks (#14909)
- Removed support for returning a value in `Callback.on_save_checkpoint` in favor of implementing `Callback.state_dict` (#14835)
Fixed
- Fixed an issue with `LightningLite.setup()` not setting the `.device` attribute correctly on the returned wrapper (#14822)
- Fixed an attribute error when running the tuner together with the `StochasticWeightAveraging` callback (#14836)
- Fixed MissingFieldException in offline mode for the `NeptuneLogger()` (#14919)
- Fixed wandb `save_dir` is overridden by `None` `dir` when using CLI (#14878)
- Fixed a missing call to `LightningDataModule.load_state_dict` hook while restoring checkpoint using `LightningDataModule.load_from_checkpoint` (#14883)
- Fixed torchscript error with containers of LightningModules (#14904)
- Fixed reloading of the last checkpoint on run restart (#14907)
- `SaveConfigCallback` instances should only save the config once to allow having the `overwrite=False` safeguard when using `LightningCLI(..., run=False)` (#14927)
- Fixed an issue with terminating the trainer profiler when a `StopIteration` exception is raised while using an `IterableDataset` (#14940)
- Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)
- Fixed `Trainer` support for PyTorch built without distributed support (#14971)
- Fixed batch normalization statistics calculation in `StochasticWeightAveraging` callback (#14866)
- Avoided initializing optimizers during deepspeed inference (#14944)
- Fixed `LightningCLI` parse_env and description in subcommands (#15138)
- Fixed an exception that would occur when creating a `multiprocessing.Pool` after importing Lightning (#15292)
- Fixed a pickling error when using `RichProgressBar` together with checkpointing (#15319)
- Fixed the `RichProgressBar` crashing when used with distributed strategies (#15376)
- Fixed an issue with `RichProgressBar` not resetting the internal state for the sanity check progress (#15377)
- Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)
Full commit list: 1.7.0...1.8.0
Contributors
Veteran
@akihironitta @ananthsub @AndresAlgaba @ar90n @Atharva-Phatak @awaelchli @BongYang @Borda @carmocca @dependabot @donlapark @ethanwharris @Felonious-Spellfire @hhsecond @jerome-habana @JustinGoheen @justusschock @kaushikb11 @krishnakalyan3 @krshrimali @luca-medeiros @manangoel99 @manskx @mauvilsa @MrShevan @nicolai86 @nmiculinic @otaj @Queuecumber @rlizzo @rohitgr7 @rschireman @SeanNaren @speediedan @tchaton @tshu-w
New
@Birch-san @clementpoiret @HalestormAI @thongonary @alecmerdler @adam-lightning @yurijmikhalevich @lijm1358 @robert-s-lee @panos-is @kacperlukawski @alro923 @dmitsf @Anner-deJong @cschell @nishantb06 @Callidior @j0rd1smit @MarcSkovMadsen @KralaBenjamin @robertomest @daniel347x @pierocor @datumbox @nohalon @pritamsoni-hsr @nandwalritik @gilfree @ritsuki1227 @christopher-nguyen-re @JulesGM @jgbos @dconathan @jsr-p @NeoKish @Blaizzy @suyash-811 @alexkuzmik @ziyadsheeba @geoffrey-g-delhomme @amrutha1098 @AlessioQuercia @ver217 @Helias @zxvix @1SAA @fabiofumarola @luca3rd @kimpty @PaulLerner @rbracco @wouterzwerink
If we forgot somebody or you have a suggestion, find support here ⚡
Did you know?
Chuck Norris can write functions of infinite recursion ... and have them return.