The core team is excited to announce the release of Lightning 1.8 ⚡
Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug fixes, and documentation for a total of over 550 commits since v1.7.
Highlights
Colossal-AI
Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:
# Select the strategy with good defaults
trainer = Trainer(strategy="colossalai")

# or tune parameters to your liking
from lightning.pytorch.strategies import ColossalAIStrategy

trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))
You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.
Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:
Data Parallelism
Pipeline Parallelism
1D, 2D, 2.5D, 3D Tensor Parallelism
Sequence Parallelism
Zero Redundancy Optimization
Learn how to install and use Colossal-AI effectively with Lightning here.
NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.
Secrets for Lightning Apps
Introducing encrypted secrets (#14612), a feature requested by Lightning App users 🎉!
Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.
Add a secret to your Lightning account. TODO: add link how to
Add an environment variable to your app to read the secret:
# somewhere in your Flow or Work:
GitHubComponent(api_token=os.environ["API_TOKEN"])
Pass the secret to your app run with the following command:
lightning run app app.py --cloud --secret API_TOKEN=github_api_token
These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.
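For illustration, the GitHubComponent used above could be any LightningWork that receives the token as a plain argument. Here is a minimal sketch (the component body is hypothetical and not part of the release):

import os
import lightning as L


class GitHubComponent(L.LightningWork):
    def __init__(self, api_token: str):
        super().__init__()
        self.api_token = api_token

    def run(self):
        # the value arrives through the API_TOKEN environment variable bound to the secret
        print("GitHub token configured:", bool(self.api_token))


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.github = GitHubComponent(api_token=os.environ["API_TOKEN"])

    def run(self):
        self.github.run()


app = L.LightningApp(Flow())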
NOTE: This is an experimental feature.
CLI Commands for Lightning Apps
Introducing CLI commands for apps (#13602)!
As a Lightning App builder, if you want to easily create a CLI interface for users to interact with your app, then this is for you.
Here is an example where users can dynamically create notebooks from the CLI.
All you need to do is implement the configure_commands hook on the LightningFlow:
import lightning as L
from commands.notebook.run import RunNotebook


class Flow(L.LightningFlow):
    ...

    def configure_commands(self):
        # Return a list of dictionaries with commands:
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]


app = L.LightningApp(Flow())
Once the app is running with lightning run app app.py, you can connect to the app with the following command:
lightning connect {app name} -y
and run the command that was configured:
lightning run notebook --name=my_notebook_name
For a full tutorial and running example, visit our docs. TODO: add to docs
NOTE: This is an experimental feature.
Auto-wrapping for FSDP Strategy
In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.
We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).
Here are some examples:
Case 1: Model is so large that it does not fit into CPU memory.
Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:
class MassiveModel(LightningModule):
    ...

    # Create model here and wrap the large layers for sharding
    def configure_sharded_model(self):
        for i, layer in enumerate(self.block):
            self.block[i] = wrap(layer)
        ...
Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:
model = MassiveModel()
trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="fsdp_native",  # or strategy="fsdp" for fairscale
    precision=16,
)

# Automatically wraps the layers here:
trainer.fit(model)
Case 3: Model fits into GPU memory. No action required, use any strategy you want.
Note: if you want to manually wrap layers for more control, you can still do that!
Read more about FSDP and how layer wrapping works in our docs.
New Tuner Callbacks
In this release, we focused on Tuner improvements and introduced two new callbacks that can help you customize the batch size finder and learning rate finder as per your use case.
Batch Size Finder (#11089)
You can customize the BatchSizeFinder callback to run at different epochs. This feature is useful while fine-tuning models, since you can't always use the same batch size after unfreezing the backbone. You can also run the batch size finder for validate, test, and predict.
Learning Rate Finder (#13802)
You can now use the LearningRateFinder callback to run at different intervals. This feature is useful when fine-tuning models, for example.
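Both finders are added to the Trainer like any other callback. A minimal sketch with default arguments (model stands in for your own LightningModule); subclass the callbacks to control at which epochs or intervals they run:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import BatchSizeFinder, LearningRateFinder

# scale the batch size before training starts
trainer = Trainer(callbacks=[BatchSizeFinder()])
trainer.fit(model)

# or run a learning-rate range test
trainer = Trainer(callbacks=[LearningRateFinder()])
trainer.fit(model)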
LightningCLI Improvements
Even though the LightningCLI class is designed to help in the implementation of command line tools, there are instances when it might be more desirable to run directly from Python. In Lightning 1.8, you can now do this (#14596). Anywhere in your program, you can now call the CLI directly:
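A minimal sketch of the new args parameter; MyModel and MyDataModule stand in for your own classes:

from pytorch_lightning.cli import LightningCLI

def run_cli():
    # equivalent to: python main.py fit --trainer.max_epochs=3
    LightningCLI(MyModel, MyDataModule, args=["fit", "--trainer.max_epochs=3"])

Learn about all features of the LightningCLI!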
Improvements to the SLURM Support
Multi-node training on a SLURM cluster has been supported since the inception of the Lightning Trainer, and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality-of-life improvements:
The preemption/termination signal is now configurable (#14626):
# the default signal is SIGUSR1
trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])

# customize it for your cluster
trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])
Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.
Read more about our SLURM integration here.
Backward Incompatible Changes
This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Callback hooks for loading and saving checkpoints
The signature and behavior of the on_load_checkpoint and on_save_checkpoint callback hooks have changed (#14835):
Before:
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # previously, we were able to return state here
    return state


def on_load_checkpoint(self, trainer, pl_module, callback_state):
    # previously, only the state for this callback was passed in as argument
    ...
Now:
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # returning a value here is no longer supported
    # you can modify the checkpoint dict directly
    return None


def state_dict(self):
    ...
    # Now, return state from this new method
    return state


def on_load_checkpoint(self, trainer, pl_module, checkpoint):
    # Now, the full checkpoint dictionary gets passed in
    ...


def load_state_dict(self, state):
    # Now, the state for this callback gets passed to this new method
    ...
DataModule hooks for loading and saving checkpoints
The on_save_checkpoint and on_load_checkpoint hooks on the LightningDataModule have been removed in favor of the state_dict and load_state_dict methods:
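A sketch of the migration; current_fold is just a hypothetical piece of state a datamodule might track:

from pytorch_lightning import LightningDataModule


class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.current_fold = 0

    # previously: def on_save_checkpoint(self, checkpoint)
    def state_dict(self):
        return {"current_fold": self.current_fold}

    # previously: def on_load_checkpoint(self, checkpoint)
    def load_state_dict(self, state_dict):
        self.current_fold = state_dict["current_fold"]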
Callback hooks
We removed some Callback hooks that were ambiguous to use (#14834):
Old name → New name
on_batch_start → on_train_batch_start
on_batch_end → on_train_batch_end
on_epoch_start → on_train_epoch_start
on_epoch_start → on_validation_epoch_start
on_epoch_start → on_test_epoch_start
on_pretrain_routine_start → on_fit_start
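For example, a callback that previously relied on on_epoch_start now picks the specific hook it cares about. A minimal sketch:

from pytorch_lightning.callbacks import Callback


class PrintEpoch(Callback):
    # previously: def on_epoch_start(self, trainer, pl_module)
    def on_train_epoch_start(self, trainer, pl_module):
        print(f"Starting train epoch {trainer.current_epoch}")

    def on_validation_epoch_start(self, trainer, pl_module):
        print("Starting validation")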
Trainer Device Attributes
We cleaned up the properties related to device indices (#14829).
The attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores,num_processes,root_gpu,data_parallel_device_ids} have been removed in favor of accelerator-agnostic attributes:
trainer = Trainer(...)

# access the number of devices the trainer uses on this machine ...
print(trainer.num_devices)

# ... or the device IDs
print(trainer.device_ids)
Setting the torch-distributed backend
In previous versions of Lightning, switching between the "gloo" and "nccl" backends for multi-GPU, multi-node training was possible through an environment variable, like so:
PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py
But not all strategies support changing the backend in this way. From now on, the backend has to be set in the code (#14693):
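For example, with the DDP strategy (a sketch; other strategies that use torch.distributed expose the same process_group_backend argument):

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"), accelerator="gpu", devices=2)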
The default remains "nccl", and you should choose "gloo" only for debugging purposes.
Logging with multiple loggers
Logging with multiple loggers can be super useful (and super easy with Lightning). For example, you could be using one logger to record sensitive image logs to a hosted MLFlow server within your organization, and at the same time log loss curves online to WandB.
Here are two major changes that apply when using multiple loggers in 1.8:
Checkpoints and profiler reports no longer go to a strange folder with a long, hard-to-remember name (#14325). From now on, these artifacts will land in the version folder of the first logger in the list.
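For example, with the two loggers below (a sketch; the tracking URI and project name are placeholders), checkpoints and profiler reports end up under the MLFlow logger's version folder because it comes first in the list:

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger, WandbLogger

mlflow_logger = MLFlowLogger(tracking_uri="http://my-mlflow-server:5000")
wandb_logger = WandbLogger(project="my-project")

trainer = Trainer(logger=[mlflow_logger, wandb_logger])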
The loggers used to be wrapped by a LoggerCollection object, so that when you accessed trainer.logger you could log to all of them simultaneously. However, this "magic" caused confusion and errors among users and we decided to simplify this (#14283):
# now returns the first logger in the list
print(trainer.logger)

# access all loggers in a list with plural
loggers = trainer.loggers
for logger in loggers:
    logger.do_something()
Deprecations
Why is Lightning deprecating APIs in every release?
Many users have this question, and it is a fair one! Deprecations are a normal part of API evolution in all software. We continually improve Lightning, which means we make APIs like class names, methods, hooks and arguments clear, easy to remember, and general enough to adopt more functionality in the future. Sometimes we have to let old things go to build new and better products.
Learn more about our deprecation window here.
So far, we have followed the pattern of removing deprecated functionality and APIs after two minor versions of deprecation. From Lightning 1.8 onward, we will additionally convert warnings to error messages after the deprecation phase ends. This way, we can greatly improve the upgrade experience with helpful messages for users who skip more than two minor Lightning versions. The exception to this rule is experimental features, which are marked as such in our documentation.
Here is a summary of major deprecations introduced in 1.8:
Trainer(amp_level=...) in favor of Trainer(plugins=[ApexMixedPrecisionPlugin(amp_level=...)])
unwrap_lightning_module in favor of Strategy.lightning_module
unwrap_lightning_module_sharded in favor of Strategy.lightning_module
pl.core.mixins.DeviceDtypeModuleMixin
LightningCLI(save_config_filename=...) in favor of LightningCLI(save_config_kwargs=dict(config_filename=...))
LightningCLI(save_config_overwrite=...) in favor of LightningCLI(save_config_kwargs=dict(overwrite=...))
LightningCLI(save_config_multifile=...) in favor of LightningCLI(save_config_kwargs=dict(multifile=...))
TrainerFn.TUNING, RunningStage.TUNING and the Trainer.tuning property
CHANGELOG
Lightning Trainer
Added
Added args parameter to LightningCLI to ease running from within Python (#14596)
Added WandbLogger.download_artifact and WandbLogger.use_artifact for managing artifacts with Weights and Biases (#14551)
Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)
Added a warning when the model passed to LightningLite.setup() does not have all parameters on the same device (#14822)
The CometLogger now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)
Introduce ckpt_path="hpc" keyword for checkpoint loading (#14911)
Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)
Added support for custom parameters in subclasses of SaveConfigCallback (#14998)
Added inference_mode flag to Trainer to let users enable/disable inference mode during evaluation (#15034)
Added LightningLite.no_backward_sync for control over efficient gradient accumulation with distributed strategies (#14966)
Added a sanity check that scripts are executed with the srun command in SLURM and that environment variables are not conflicting (#15011)
Added an error message when attempting to launch processes with python -i and an interactive-incompatible strategy (#15293)
Changed
The Trainer.{fit,validate,test,predict,tune} methods now raise a useful error message if the input is not a LightningModule (#13892)
Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)
Replaced the unwrapping logic in strategies with direct access to unwrapped LightningModule (#13738)
Enabled on_before_batch_transfer for DPStrategy and IPUAccelerator (#14023)
When resuming training with Apex enabled, the Trainer will now raise an error (#14341)
Included torch.cuda rng state to the aggregate _collect_rng_states() and _set_rng_states() (#14384)
Changed trainer.should_stop to not stop in between an epoch and run until min_steps/min_epochs only (#13890)
The pyDeprecate dependency is no longer installed (#14472)
When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)
In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)
Removed fall-back to LightningEnvironment when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)
Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)
Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the lightning_lite.precision.Precision base class (#14798)
The PrecisionPlugin.backward signature changed: The closure_loss argument was renamed to tensor
The PrecisionPlugin.{pre_,post_}backward signature changed: The closure_loss argument was renamed to tensor and moved as the first argument
The PrecisionPlugin.optimizer_step signature changed: The model, optimizer_idx and closure arguments need to be passed as keyword arguments now
Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the PL_DISABLE_FORK environment variable introduced in v1.7.4 (#14631)
The MLFlowLogger.finalize() now sets the status to FAILED when an exception occurred in Trainer, and sets the status to FINISHED on successful completion (#12292)
It is no longer needed to call model.double() when using precision=64 in Lightning Lite (#14827)
HPC checkpoints are now loaded automatically only in slurm environment when no specific value for ckpt_path has been set (#14911)
The Callback.on_load_checkpoint now gets the full checkpoint dictionary and the callback_state argument was renamed checkpoint (#14835)
Moved the warning about saving nn.Module in save_hyperparameters() to before the deepcopy (#15132)
To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for torch.cuda.device_count and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use a NVML-based check for torch.cuda.is_available. (#15110, #15133)
The NeptuneLogger now uses neptune.init_run instead of the deprecated neptune.init to initialize a run (#15393)
Deprecated
Deprecated the unwrap_lightning_module and unwrap_lightning_module_sharded utility functions in favor of accessing the unwrapped LightningModule on the strategy directly (#13738)
Deprecated the pl_module argument in LightningParallelModule, LightningDistributedModule, LightningShardedDataParallel, LightningBaguaModule and LightningDeepSpeedModule wrapper classes (#13738)
Deprecated the internal pl.core.mixins.DeviceDtypeModuleMixin class (#14511, #14548)
Deprecated all functions in pytorch_lightning.utilities.xla_device (#14514, #14550)
Deprecated the internal inner_f function
Deprecated the internal pl_multi_process function
Deprecated the internal XLADeviceUtils.xla_available staticmethod
Deprecated the XLADeviceUtils.tpu_device_exists staticmethod in favor of pytorch_lightning.accelerators.TPUAccelerator.is_available()
Deprecated pytorch_lightning.utilities.distributed.tpu_distributed in favor of lightning_lite.accelerators.tpu.tpu_distributed (#14550)
Deprecated all functions in pytorch_lightning.utilities.cloud_io in favor of lightning_lite.utilities.cloud_io (#14515)
Deprecated the functions in pytorch_lightning.utilities.apply_func in favor of lightning_utilities.core.apply_func (#14516, #14537)
Deprecated all functions in pytorch_lightning.utilities.device_parser (#14492, #14753)
Deprecated the pytorch_lightning.utilities.device_parser.determine_root_gpu_device in favor of lightning_lite.utilities.device_parser.determine_root_gpu_device
Deprecated the pytorch_lightning.utilities.device_parser.parse_gpu_ids in favor of lightning_lite.utilities.device_parser.parse_gpu_ids
Deprecated the pytorch_lightning.utilities.device_parser.is_cuda_available in favor of lightning_lite.accelerators.cuda.is_cuda_available
Deprecated the pytorch_lightning.utilities.device_parser.num_cuda_devices in favor of lightning_lite.accelerators.cuda.num_cuda_devices
Deprecated the pytorch_lightning.utilities.device_parser.parse_cpu_cores in favor of lightning_lite.accelerators.cpu.parse_cpu_cores
Deprecated the pytorch_lightning.utilities.device_parser.parse_tpu_cores in favor of lightning_lite.accelerators.tpu.parse_tpu_cores
Deprecated the pytorch_lightning.utilities.device_parser.parse_hpus in favor of pytorch_lightning.accelerators.hpu.parse_hpus
Deprecated duplicate SaveConfigCallback parameters in LightningCLI.__init__: save_config_kwargs, save_config_overwrite and save_config_multifile. New save_config_kwargs parameter should be used instead (#14998)
Deprecated TrainerFn.TUNING, RunningStage.TUNING and trainer.tuning property (#15100)
Deprecated custom pl.utilities.distributed.AllGatherGrad implementation in favor of PyTorch's (#15364)
Removed
Removed the deprecated Trainer.training_type_plugin property in favor of Trainer.strategy (#14011)
Removed all deprecated training type plugins (#14011)
Removed the deprecated LoggerCollection; Trainer.logger and LightningModule.logger now returns the first logger when more than one gets passed to the Trainer (#14283)
Removed the deprecated trainer.lr_schedulers (#14408)
Removed the deprecated LightningModule.{on_hpc_load,on_hpc_save} hooks in favor of the general purpose hooks LightningModule.{on_load_checkpoint,on_save_checkpoint} (#14315)
Removed deprecated support for old torchtext versions (#14375)
Removed deprecated support for the old neptune-client API in the NeptuneLogger (#14727)
Removed the deprecated weights_save_path Trainer argument and Trainer.weights_save_path property (#14424)
pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only
pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug
pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info
pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn
pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation
pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
Removed deprecated Trainer.num_processes attribute in favour of Trainer.num_devices (#14423)
Removed the deprecated Trainer.data_parallel_device_ids hook in favour of Trainer.device_ids (#14422)
Removed the deprecated class TrainerCallbackHookMixin (#14401)
Removed the deprecated BaseProfiler and AbstractProfiler classes (#14404)
Removed the deprecated way to set the distributed backend via the environment variable PL_TORCH_DISTRIBUTED_BACKEND, in favor of setting the process_group_backend in the strategy constructor (#14693)
Callback.on_configure_sharded_model in favor of Callback.setup
Callback.on_before_accelerator_backend_setup in favor of Callback.setup
Callback.on_batch_start in favor of Callback.on_train_batch_start
Callback.on_batch_end in favor of Callback.on_train_batch_end
Callback.on_epoch_start in favor of Callback.on_{train,validation,test}_epoch_start
Callback.on_epoch_end in favor of Callback.on_{train,validation,test}_epoch_end
Callback.on_pretrain_routine_{start,end} in favor of Callback.on_fit_start
Removed the deprecated device attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores} in favor of the accelerator-agnostic Trainer.num_devices (#14829)
Removed the deprecated LightningIPUModule (#14830)
Removed the deprecated Logger.agg_and_log_metrics hook in favour of Logger.log_metrics and the agg_key_funcs and agg_default_func arguments. (#14840)
Removed the deprecated precision plugin checkpoint hooks PrecisionPlugin.on_load_checkpoint and PrecisionPlugin.on_save_checkpoint (#14833)
Removed the deprecated Trainer.root_gpu attribute in favor of Trainer.strategy.root_device (#14829)
Removed the deprecated Trainer.use_amp and LightningModule.use_amp attributes (#14832)
Removed the deprecated callback hooks Callback.on_init_start and Callback.on_init_end (#14867)
Removed the deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#14870)
Removed the deprecated SimpleProfiler.profile_iterable and AdvancedProfiler.profile_iterable attributes (#14864)
Removed the deprecated Trainer.verbose_evaluate (#14884)
Removed the deprecated Trainer.should_rank_save_checkpoint (#14885)
Removed the deprecated TrainerOptimizersMixin (#14887)
Removed the deprecated Trainer.lightning_optimizers (#14889)
Removed the deprecated TrainerDataLoadingMixin (#14888)
Removed the deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#14869)
Removed the deprecated Trainer.{validated,tested,predicted}_ckpt_path (#14897)
Removed the deprecated device_stats_monitor_prefix_metric_keys (#14890)
Removed the deprecated LightningDataModule.on_save/load_checkpoint hooks (#14909)
Removed support for returning a value in Callback.on_save_checkpoint in favor of implementing Callback.state_dict (#14835)
Fixed
Fixed an issue with LightningLite.setup() not setting the .device attribute correctly on the returned wrapper (#14822)
Fixed an attribute error when running the tuner together with the StochasticWeightAveraging callback (#14836)
Fixed MissingFieldException in offline mode for the NeptuneLogger() (#14919)
Fixed wandb save_dir is overridden by None dir when using CLI (#14878)
Fixed a missing call to LightningDataModule.load_state_dict hook while restoring checkpoint using LightningDataModule.load_from_checkpoint (#14883)
Fixed torchscript error with containers of LightningModules (#14904)
Fixed reloading of the last checkpoint on run restart (#14907)
SaveConfigCallback instances should only save the config once to allow having the overwrite=False safeguard when using LightningCLI(..., run=False) (#14927)
Fixed an issue with terminating the trainer profiler when a StopIteration exception is raised while using an IterableDataset (#14940)
Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)
Fixed Trainer support for PyTorch built without distributed support (#14971)
Fixed batch normalization statistics calculation in StochasticWeightAveraging callback (#14866)
Avoided initializing optimizers during deepspeed inference (#14944)
Fixed LightningCLI parse_env and description in subcommands (#15138)
Fixed an exception that would occur when creating a multiprocessing.Pool after importing Lightning (#15292)
Fixed a pickling error when using RichProgressBar together with checkpointing (#15319)
Fixed the RichProgressBar crashing when used with distributed strategies (#15376)
Fixed an issue with RichProgressBar not resetting the internal state for the sanity check progress (#15377)
Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)
Full commit list: 1.7.0...1.8.0
Contributors
Veteran
@akihironitta @ananthsub @AndresAlgaba @ar90n @Atharva-Phatak @awaelchli @BongYang @Borda @carmocca @dependabot @donlapark @ethanwharris @Felonious-Spellfire @hhsecond @jerome-habana @JustinGoheen @justusschock @kaushikb11 @krishnakalyan3 @krshrimali @luca-medeiros @manangoel99 @manskx @mauvilsa @MrShevan @nicolai86 @nmiculinic @otaj @Queuecumber @rlizzo @rohitgr7 @rschireman @SeanNaren @speediedan @tchaton @tshu-w
New
@Birch-san @clementpoiret @HalestormAI @thongonary @alecmerdler @adam-lightning @yurijmikhalevich @lijm1358 @robert-s-lee @panos-is @kacperlukawski @alro923 @dmitsf @Anner-deJong @cschell @nishantb06 @Callidior @j0rd1smit @MarcSkovMadsen @KralaBenjamin @robertomest @daniel347x @pierocor @datumbox @nohalon @pritamsoni-hsr @nandwalritik @gilfree @ritsuki1227 @christopher-nguyen-re @JulesGM @jgbos @dconathan @jsr-p @NeoKish @Blaizzy @suyash-811 @alexkuzmik @ziyadsheeba @geoffrey-g-delhomme @amrutha1098 @AlessioQuercia @ver217 @Helias @zxvix @1SAA @fabiofumarola @luca3rd @kimpty @PaulLerner @rbracco @wouterzwerink
If we forgot somebody or you have a suggestion, find support here ⚡
Did you know?
Chuck Norris can write functions of infinite recursion ... and have them return.