
Lightning 1.8: Colossal-AI Strategy, Commands and Secrets for Apps, FSDP Improvements and More!

Released by @awaelchli on 01 Nov 2022 · commit 7ee0994

The core team is excited to announce the release of Lightning 1.8 ⚡

Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug fixes, and documentation, totaling over 550 commits since v1.7.

Highlights

Colossal-AI

Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:

# Select the strategy with good defaults
trainer = Trainer(strategy="colossalai")

# or tune parameters to your liking
from lightning.pytorch.strategies import ColossalAIStrategy

trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))

You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.

Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:

  • Data Parallelism
  • Pipeline Parallelism
  • 1D, 2D, 2.5D, 3D Tensor Parallelism
  • Sequence Parallelism
  • Zero Redundancy Optimization

Learn how to install and use Colossal-AI effectively with Lightning here.

NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.

Secrets for Lightning Apps

Introducing encrypted secrets (#14612), a feature requested by Lightning App users 🎉!

Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.

  1. Add a secret to your Lightning account in lightning.ai (read more here)

  2. Add an environment variable to your app to read the secret:

    import os

    # somewhere in your Flow or Work:
    GitHubComponent(api_token=os.environ["API_TOKEN"])
  3. Pass the secret to your app run with the following command:

    lightning run app app.py --cloud --secret API_TOKEN=github_api_token

These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.
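
For illustration, here is a minimal sketch of what such a component could look like (GitHubComponent and its attributes are hypothetical and not part of Lightning; only the pattern of reading the secret from the environment variable is taken from the steps above):

import os

import lightning as L


class GitHubComponent(L.LightningWork):
    def __init__(self, api_token: str):
        super().__init__()
        # the token arrives through the environment variable bound to the secret
        self.api_token = api_token

    def run(self):
        # use the token here, e.g. as an Authorization header for the GitHub API
        print("API token is", "set" if self.api_token else "missing")


component = GitHubComponent(api_token=os.environ["API_TOKEN"])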

NOTE: This is an experimental feature.

CLI Commands for Lightning Apps

Introducing CLI commands for apps (#13602)!
As a Lightning App builder, if you want to easily create a CLI interface for users to interact with your app, then this is for you.

Here is an example where users can dynamically create notebooks from the CLI.
All you need to do is implement the configure_commands hook on the LightningFlow:

import lightning as L
from commands.notebook.run import RunNotebook


class Flow(L.LightningFlow):
    ...

    def configure_commands(self):
        # Return a list of dictionaries with commands:
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]


app = L.LightningApp(Flow())

Once the app is running with lightning run app app.py, you can connect to the app with the following command:

lightning connect {app name} -y

and run the command that was configured:

lightning run notebook --name=my_notebook_name
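
For context, here is a minimal sketch of the handler that such a command could invoke (the names, signature, and bookkeeping are illustrative; the actual RunNotebook command ships with the full example):

import lightning as L
from commands.notebook.run import RunNotebook


class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.notebook_names = []  # hypothetical state that run() reacts to

    def run(self):
        # create or update a notebook Work for every requested name (omitted here)
        ...

    def run_notebook(self, name: str) -> str:
        # the RunNotebook command calls this with the values parsed from the CLI
        if name not in self.notebook_names:
            self.notebook_names.append(name)
        return f"Notebook {name} requested."

    def configure_commands(self):
        return [{"run notebook": RunNotebook(method=self.run_notebook)}]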

For a full tutorial and running example, visit our docs.
NOTE: This is an experimental feature.

Auto-wrapping for FSDP Strategy

In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.

# Native FSDP implementation
trainer = Trainer(strategy="fsdp_native")

We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).

Here are some examples:

Case 1: Model is so large that it does not fit into CPU memory.
Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:

from torch.distributed.fsdp.wrap import wrap
from lightning.pytorch import LightningModule


class MassiveModel(LightningModule):
    ...

    # Create the model here and wrap the large layers for sharding
    def configure_sharded_model(self):
        for i, layer in enumerate(self.block):
            self.block[i] = wrap(layer)
        ...

Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:

model = MassiveModel()
trainer = Trainer(
    accelerator="gpu", 
    devices=8, 
    strategy="fsdp_native",  # or strategy="fsdp" for fairscale
    precision=16
)

# Automatically wraps the layers here:
trainer.fit(model)

Case 3: Model fits into GPU memory. No action required, use any strategy you want.

Note: if you want to manually wrap layers for more control, you can still do that!

Read more about FSDP and how layer wrapping works in our docs.

New Tuner Callbacks

In this release, we focused on Tuner improvements and introduced two new callbacks that can help you customize the batch size finder and learning rate finder as per your use case.

Batch Size Finder (#11089)

  1. You can customize the BatchSizeFinder callback to run at different epochs. This feature is useful while fine-tuning models since you can't always use the same batch size after unfreezing the backbone.

    from lightning.pytorch.callbacks import BatchSizeFinder
    
    
    class FineTuneBatchSizeFinder(BatchSizeFinder):
        def __init__(self, milestones, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.milestones = milestones
    
        def on_fit_start(self, *args, **kwargs):
            return
    
        def on_train_epoch_start(self, trainer, pl_module):
            if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                self.scale_batch_size(trainer, pl_module)
    
    
    trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(5, 10))])
    trainer.fit(...)
  2. Run batch size finder for validate/test/predict.

    from lightning.pytorch.callbacks import BatchSizeFinder
    
    
    class EvalBatchSizeFinder(BatchSizeFinder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
    
        def on_fit_start(self, *args, **kwargs):
            return
    
        def on_test_start(self, trainer, pl_module):
            self.scale_batch_size(trainer, pl_module)
    
    
    trainer = Trainer(callbacks=[EvalBatchSizeFinder()])
    trainer.test(...)

Learning Rate Finder (#13802)

You can now customize the LearningRateFinder callback to run at different intervals. This is useful when fine-tuning models, for example.

from lightning.pytorch.callbacks import LearningRateFinder


class FineTuneLearningRateFinder(LearningRateFinder):
    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        return

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
            self.lr_find(trainer, pl_module)

trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
trainer.fit(...)

LightningCLI Improvements

Even though the LightningCLI class is designed to help in the implementation of command line tools, there are instances when it might be more desirable to run it directly from Python. In Lightning 1.8, you can now do this (#14596):

from lightning.pytorch.cli import LightningCLI

def cli_main(args):
    cli = LightningCLI(MyModel, ..., args=args)
    ...

Anywhere in your program, you can now call the CLI directly:

cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])

Learn about all features of the LightningCLI!

Improvements to the SLURM Support

Multi-node training on a SLURM cluster has been supported since the inception of the Lightning Trainer, and the integration has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality-of-life improvements:

  • The preemption/termination signal is now configurable (#14626):

    import signal

    from lightning.pytorch.plugins.environments import SLURMEnvironment

    # the default signal is SIGUSR1
    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])

    # customize it for your cluster
    trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])
  • Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.

Read more about our SLURM integration here.

Backward Incompatible Changes

This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

Callback hooks for loading and saving checkpoints

The signature and behavior of the on_load_checkpoint and on_save_checkpoint callback hooks have changed (#14835):

Before:

def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # previously, we were able to return state here
    return state

def on_load_checkpoint(self, trainer, pl_module, callback_state):
    # previously, only the state for this callback was passed in as argument
    ...

Now:

def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    ...
    # returning a value here is no longer supported
    # you can modify the checkpoint dict directly
    return None


def state_dict(self):
    ...
    # Now, return state from this new method
    return state


def on_load_checkpoint(self, trainer, pl_module, checkpoint):
    # previously, only the state for this callback was passed in as argument
    ...
    
    
def load_state_dict(self, state):
    # Now, the state for this callback gets passed to this new method
    ...

DataModule hooks for loading and saving checkpoints

The on_save_checkpoint and on_load_checkpoint hooks on the LightningDataModule have been removed in favor of the state_dict and load_state_dict methods:

-def on_save_checkpoint(self, checkpoint):
-    checkpoint["banana"] = self.banana
+def state_dict(self):
+    return dict(banana=self.banana)


-def on_load_checkpoint(self, checkpoint):
-    self.banana = checkpoint["banana"]
+def load_state_dict(self, state):
+    self.banana = state["banana"]

Callback hooks

We removed some Callback hooks whose usage was ambiguous (#14834). Use the new hook names instead; a short migration sketch follows the table:

| Old name | New name |
| --- | --- |
| on_batch_start | on_train_batch_start |
| on_batch_end | on_train_batch_end |
| on_epoch_start | on_train_epoch_start |
| on_epoch_start | on_validation_epoch_start |
| on_epoch_start | on_test_epoch_start |
| on_pretrain_routine_start | on_fit_start |
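
As a minimal migration sketch (assuming a custom callback that previously used on_batch_end), the change amounts to switching to the train-specific hook and its signature:

from lightning.pytorch.callbacks import Callback


class MyCallback(Callback):
    # before 1.8 (removed): def on_batch_end(self, trainer, pl_module): ...
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # runs after every training batch
        ...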

Trainer Device Attributes

We cleaned up the properties related to device indices (#14829).

The attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores,num_processes,root_gpu,data_parallel_device_ids} have been removed in favor of accelerator-agnostic attributes:

trainer = Trainer(...)

# access the number of devices the trainer uses on this machine ...
print(trainer.num_devices)

# ... or the device IDs
print(trainer.device_ids)

Setting the torch-distributed backend

In previous versions of Lightning, switching between the "gloo" and "nccl" backends for multi-GPU, multi-node training was possible through setting an environment variable like so:

PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py

But not all strategies support changing the backend in this way.
From now on, the backend has to be set in the code (#14693):

from lightning.pytorch.strategies import DDPStrategy

trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"))

The default remains "nccl", and you should choose "gloo" only for debugging purposes.

Logging with multiple loggers

Logging with multiple loggers can be super useful (and super easy with Lightning). For example, you could be using one logger to record sensitive image logs to a hosted MLFlow server within your organization, and at the same time log loss curves online to WandB.

trainer = Trainer(
    logger=[WandbLogger(...), MLFlowLogger(...)]
)

Here are two major changes that apply when using multiple loggers in 1.8:

  • Checkpoints and profiler reports no longer go to a strange folder with a long, hard-to-remember name (#14325). From now on, these artifacts will land in the version folder of the first logger in the list.

  • The loggers used to be wrapped by a LoggerCollection object, so that when you accessed trainer.logger you could log to all of them simultaneously. However, this "magic" caused confusion and errors among users and we decided to simplify this (#14283):

    # now returns the first logger in the list
    print(trainer.logger)
    
    # access all loggers in a list with plural
    loggers = trainer.loggers
    
    for logger in loggers:
        logger.do_something()
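
Note that metrics logged with self.log still go to every logger in trainer.loggers; only direct access through trainer.logger now refers to the first one. A minimal sketch:

from lightning.pytorch import LightningModule


class LitModel(LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # self.log fans out to every logger passed to the Trainer
        self.log("train_loss", loss)
        return loss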

Deprecations

Why is Lightning deprecating APIs in every release?

Many users have this question, and it is a fair one! Deprecations are a normal part of API evolution in all software. We continually improve Lightning, which means we make APIs like class names, methods, hooks and arguments clear, easy to remember, and general enough to adopt more functionality in the future. Sometimes we have to let old things go to build new and better products.

Learn more about our deprecation window here.

So far, we have followed the pattern of removing deprecated functionality and APIs after two minor versions of deprecation. From Lightning 1.8 onward, we will additionally convert warnings to error messages after the deprecation phase ends. This way, we can greatly improve the upgrade experience with helpful messages for users who skip more than two minor Lightning versions. The exception to this rule is experimental features, which are marked as such in our documentation.

Here is a summary of major deprecations introduced in 1.8:

| API | Removal version | Alternative |
| --- | --- | --- |
| Argument Trainer(amp_level=...) | 1.10 | Trainer(plugins=[ApexMixedPrecisionPlugin(amp_level=...)]) |
| Function unwrap_lightning_module | 1.10 | Strategy.lightning_module |
| Function unwrap_lightning_module_sharded | 1.10 | Strategy.lightning_module |
| Import pl.core.mixins.DeviceDtypeModuleMixin | 1.10 | No longer supported |
| Argument LightningCLI(save_config_filename=...) | 1.10 | LightningCLI(save_config_kwargs=dict(config_filename=...)) |
| Argument LightningCLI(save_config_overwrite=...) | 1.10 | LightningCLI(save_config_kwargs=dict(overwrite=...)) |
| Argument LightningCLI(save_config_multifile=...) | 1.10 | LightningCLI(save_config_kwargs=dict(multifile=...)) |
| Enum TrainerFn.TUNING | 1.10 | No longer supported |
| Enum RunningStage.TUNING | 1.10 | No longer supported |
| Attribute Trainer.tuning | 1.10 | No longer supported |
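
For instance, migrating the deprecated LightningCLI arguments from the table is a one-line change (MyModel is a placeholder for your own model class):

from lightning.pytorch.cli import LightningCLI

# before (deprecated, removal planned for 1.10):
cli = LightningCLI(MyModel, save_config_overwrite=True)

# after:
cli = LightningCLI(MyModel, save_config_kwargs=dict(overwrite=True))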

CHANGELOG

Lightning App

Added
  • Added load_state_dict and state_dict hooks for LightningFlow components (#14100)
  • Added a --secret option to CLI to allow binding secrets to app environment variables when running in the cloud (#14612)
  • Added support for running Works without cloud compute in the default container (#14819)
  • Added an HTTPQueue as an optional replacement for the default Redis queue (#14978)
  • Added support for configuring flow cloud compute (#14831)
  • Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)
  • Added a try / catch mechanism around request processing to avoid killing the flow (#15187)
  • Added a Database Component (#14995)
  • Added authentication to the HTTP queue (#15202)
  • Added support for passing a LightningWork to the LightningApp (#15215)
  • Added support for getting CLI help for connected apps even if the app isn't running (#15196)
  • Added support for adding requirements to commands and installing them when missing while running an app command (#15198)
  • Made the Lightning CLI connection terminal-session scoped instead of global (#15241)
  • Added support for managing SSH keys via the CLI (#15291)
  • Added a JustPyFrontend to ease UI creation with https://github.com/justpy-org/justpy (#15002)
  • Added a layout endpoint to the Rest API and the ability to disable pulling or pushing to the state (#15367)
  • Added support for functions for configure_api and configure_commands to be executed in the Rest API process (#15098)
  • Added support for starting a Lightning App in the cloud without needing to install dependencies locally (#15019)
Changed
  • Improved the show logs command to be standalone and reusable (#15343)
  • Removed the --instance-types option when creating clusters (#15314)
Fixed
  • Fixed an issue when using the CLI without arguments (#14877)
  • Fixed a bug where the upload files endpoint would raise an error when running locally (#14924)
  • Fixed the BYOC cluster region selector by hiding it from the CLI help, since only us-east-1 has been tested and is recommended (#15277)
  • Fixed a bug when launching an app on multiple clusters (#15226)
  • Fixed a bug with a default CloudCompute for Lightning flows (#15371)

Lightning Trainer

Added
  • Added support for requeueing slurm array jobs (#15040)
  • Added native AMP support for ddp_fork (and associated alias strategies) with CUDA GPUs (#14983)
  • Added BatchSizeFinder callback (#11089)
  • Added LearningRateFinder callback (#13802)
  • Tuner now supports a new method argument which will determine when to run the BatchSizeFinder: one of fit, validate, test or predict (#11089)
  • Added prefix to log message in seed_everything with rank info (#14031)
  • Added support for auto wrapping for DDPFullyShardedNativeStrategy (#14252)
  • Added support for passing extra init-parameters to the LightningDataModule.from_datasets (#14185)
  • Added support for saving sharded optimizer state dict outside of DDPShardedStrategy (#14208)
  • Added support for auto wrapping for DDPFullyShardedStrategy (#14383)
  • Integrated the lightning_utilities package (#14475, #14537, #14556, #14558, #14575, #14620)
  • Added args parameter to LightningCLI to ease running from within Python (#14596)
  • Added WandbLogger.download_artifact and WandbLogger.use_artifact for managing artifacts with Weights and Biases (#14551)
  • Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)
  • Added a warning when the model passed to LightningLite.setup() does not have all parameters on the same device (#14822)
  • The CometLogger now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)
  • Introduce ckpt_path="hpc" keyword for checkpoint loading (#14911)
  • Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)
  • Added support for custom parameters in subclasses of SaveConfigCallback (#14998)
  • Added inference_mode flag to Trainer to let users enable/disable inference mode during evaluation (#15034)
  • Added LightningLite.no_backward_sync for control over efficient gradient accumulation with distributed strategies (#14966)
  • Added a sanity check that scripts are executed with the srun command in SLURM and that environment variables are not conflicting (#15011)
  • Added an error message when attempting to launch processes with python -i and an interactive-incompatible strategy (#15293)
Changed
  • The Trainer.{fit,validate,test,predict,tune} methods now raise a useful error message if the input is not a LightningModule (#13892)
  • Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)
  • Replaced the unwrapping logic in strategies with direct access to unwrapped LightningModule (#13738)
  • Enabled on_before_batch_transfer for DPStrategy and IPUAccelerator (#14023)
  • When resuming training with Apex enabled, the Trainer will now raise an error (#14341)
  • Included torch.cuda rng state to the aggregate _collect_rng_states() and _set_rng_states() (#14384)
  • Changed trainer.should_stop to not stop mid-epoch and instead run until min_steps/min_epochs are satisfied (#13890)
  • The pyDeprecate dependency is no longer installed (#14472)
  • When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)
  • In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)
  • Removed fall-back to LightningEnvironment when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)
  • Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)
  • Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the lightning_lite.precision.Precision base class (#14798)
    • The PrecisionPlugin.backward signature changed: The closure_loss argument was renamed to tensor
    • The PrecisionPlugin.{pre_,post_}backward signature changed: The closure_loss argument was renamed to tensor and moved as the first argument
    • The PrecisionPlugin.optimizer_step signature changed: The model, optimizer_idx and closure arguments need to be passed as keyword arguments now
  • Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the PL_DISABLE_FORK environment variable introduced in v1.7.4 (#14631)
  • The MLFlowLogger.finalize() now sets the status to FAILED when an exception occurred in Trainer, and sets the status to FINISHED on successful completion (#12292)
  • It is no longer needed to call model.double() when using precision=64 in Lightning Lite (#14827)
  • HPC checkpoints are now loaded automatically only in a SLURM environment when no specific value for ckpt_path has been set (#14911)
  • The Callback.on_load_checkpoint now gets the full checkpoint dictionary and the callback_state argument was renamed to checkpoint (#14835)
  • Moved the warning about saving nn.Module in save_hyperparameters() to before the deepcopy (#15132)
  • To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for torch.cuda.device_count, and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use an NVML-based check for torch.cuda.is_available (#15110, #15133)
  • The NeptuneLogger now uses neptune.init_run instead of the deprecated neptune.init to initialize a run (#15393)
Deprecated
  • Deprecated LightningDeepSpeedModule (#14000)
  • Deprecated amp_level from Trainer in favour of passing it explicitly via precision plugin (#13898)
  • Deprecated the calls to pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
  • Deprecated the unwrap_lightning_module and unwrap_lightning_module_sharded utility functions in favor of accessing the unwrapped LightningModule on the strategy directly (#13738)
  • Deprecated the pl_module argument in LightningParallelModule, LightningDistributedModule, LightningShardedDataParallel, LightningBaguaModule and LightningDeepSpeedModule wrapper classes (#13738)
  • Deprecated the on_colab_kaggle function (#14247)
  • Deprecated the internal pl.core.mixins.DeviceDtypeModuleMixin class (#14511, #14548)
  • Deprecated all functions in pytorch_lightning.utilities.xla_device (#14514, #14550)
    • Deprecated the internal inner_f function
    • Deprecated the internal pl_multi_process function
    • Deprecated the internal XLADeviceUtils.xla_available staticmethod
    • Deprecated the XLADeviceUtils.tpu_device_exists staticmethod in favor of pytorch_lightning.accelerators.TPUAccelerator.is_available()
  • Deprecated pytorch_lightning.utilities.distributed.tpu_distributed in favor of lightning_lite.accelerators.tpu.tpu_distributed (#14550)
  • Deprecated all functions in pytorch_lightning.utilities.cloud_io in favor of lightning_lite.utilities.cloud_io (#14515)
  • Deprecated the functions in pytorch_lightning.utilities.apply_func in favor of lightning_utilities.core.apply_func (#14516, #14537)
  • Deprecated all functions in pytorch_lightning.utilities.device_parser (#14492, #14753)
    • Deprecated the pytorch_lightning.utilities.device_parser.determine_root_gpu_device in favor of lightning_lite.utilities.device_parser.determine_root_gpu_device
    • Deprecated the pytorch_lightning.utilities.device_parser.parse_gpu_ids in favor of lightning_lite.utilities.device_parser.parse_gpu_ids
    • Deprecated the pytorch_lightning.utilities.device_parser.is_cuda_available in favor of lightning_lite.accelerators.cuda.is_cuda_available
    • Deprecated the pytorch_lightning.utilities.device_parser.num_cuda_devices in favor of lightning_lite.accelerators.cuda.num_cuda_devices
    • Deprecated the pytorch_lightning.utilities.device_parser.parse_cpu_cores in favor of lightning_lite.accelerators.cpu.parse_cpu_cores
    • Deprecated the pytorch_lightning.utilities.device_parser.parse_tpu_cores in favor of lightning_lite.accelerators.tpu.parse_tpu_cores
    • Deprecated the pytorch_lightning.utilities.device_parser.parse_hpus in favor of pytorch_lightning.accelerators.hpu.parse_hpus
  • Deprecated duplicate SaveConfigCallback parameters in LightningCLI.__init__: save_config_filename, save_config_overwrite and save_config_multifile. The new save_config_kwargs parameter should be used instead (#14998)
  • Deprecated TrainerFn.TUNING, RunningStage.TUNING and trainer.tuning property (#15100)
  • Deprecated custom pl.utilities.distributed.AllGatherGrad implementation in favor of PyTorch's (#15364)
Removed
  • Removed the deprecated Trainer.training_type_plugin property in favor of Trainer.strategy (#14011)
  • Removed all deprecated training type plugins (#14011)
  • Removed the deprecated DDP2Strategy (#14026)
  • Removed the deprecated DistributedType and DeviceType enum classes (#14045)
  • Removed deprecated support for passing the rank_zero_warn warning category positionally (#14470)
  • Removed the legacy and unused Trainer.get_deprecated_arg_names() (#14415)
  • Removed the deprecated on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
  • Removed the deprecated training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
  • Removed the experimental pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
  • Removed the deprecated LoggerCollection; Trainer.logger and LightningModule.logger now return the first logger when more than one gets passed to the Trainer (#14283)
  • Removed the deprecated trainer.lr_schedulers (#14408)
  • Removed the deprecated LightningModule.{on_hpc_load,on_hpc_save} hooks in favor of the general purpose hooks LightningModule.{on_load_checkpoint,on_save_checkpoint} (#14315)
  • Removed deprecated support for old torchtext versions (#14375)
  • Removed deprecated support for the old neptune-client API in the NeptuneLogger (#14727)
  • Removed the deprecated weights_save_path Trainer argument and Trainer.weights_save_path property (#14424)
  • Removed the following deprecated utilities (#14471):
    • pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only
    • pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug
    • pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info
    • pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn
    • pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation
    • pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
  • Removed deprecated Trainer.num_processes attribute in favour of Trainer.num_devices (#14423)
  • Removed the deprecated Trainer.data_parallel_device_ids hook in favour of Trainer.device_ids (#14422)
  • Removed the deprecated class TrainerCallbackHookMixin (#14401)
  • Removed the deprecated BaseProfiler and AbstractProfiler classes (#14404)
  • Removed the deprecated way to set the distributed backend via the environment variable PL_TORCH_DISTRIBUTED_BACKEND, in favor of setting the process_group_backend in the strategy constructor (#14693)
  • Removed deprecated callback hooks (#14834)
    • Callback.on_configure_sharded_model in favor of Callback.setup
    • Callback.on_before_accelerator_backend_setup in favor of Callback.setup
    • Callback.on_batch_start in favor of Callback.on_train_batch_start
    • Callback.on_batch_end in favor of Callback.on_train_batch_end
    • Callback.on_epoch_start in favor of Callback.on_{train,validation,test}_epoch_start
    • Callback.on_epoch_end in favor of Callback.on_{train,validation,test}_epoch_end
    • Callback.on_pretrain_routine_{start,end} in favor of Callback.on_fit_start
  • Removed the deprecated device attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores} in favor of the accelerator-agnostic Trainer.num_devices (#14829)
  • Removed the deprecated LightningIPUModule (#14830)
  • Removed the deprecated Logger.agg_and_log_metrics hook in favour of Logger.log_metrics and the agg_key_funcs and agg_default_func arguments. (#14840)
  • Removed the deprecated precision plugin checkpoint hooks PrecisionPlugin.on_load_checkpoint and PrecisionPlugin.on_save_checkpoint (#14833)
  • Removed the deprecated Trainer.root_gpu attribute in favor of Trainer.strategy.root_device (#14829)
  • Removed the deprecated Trainer.use_amp and LightningModule.use_amp attributes (#14832)
  • Removed the deprecated callback hooks Callback.on_init_start and Callback.on_init_end (#14867)
  • Removed the deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#14870)
  • Removed the deprecated SimpleProfiler.profile_iterable and AdvancedProfiler.profile_iterable attributes (#14864)
  • Removed the deprecated Trainer.verbose_evaluate (#14884)
  • Removed the deprecated Trainer.should_rank_save_checkpoint (#14885)
  • Removed the deprecated TrainerOptimizersMixin (#14887)
  • Removed the deprecated Trainer.lightning_optimizers (#14889)
  • Removed the deprecated TrainerDataLoadingMixin (#14888)
  • Removed the deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#14869)
  • Removed the deprecated Trainer.{validated,tested,predicted}_ckpt_path (#14897)
  • Removed the deprecated device_stats_monitor_prefix_metric_keys (#14890)
  • Removed the deprecated LightningDataModule.on_save/load_checkpoint hooks (#14909)
  • Removed support for returning a value in Callback.on_save_checkpoint in favor of implementing Callback.state_dict (#14835)
Fixed
  • Fixed an issue with LightningLite.setup() not setting the .device attribute correctly on the returned wrapper (#14822)
  • Fixed an attribute error when running the tuner together with the StochasticWeightAveraging callback (#14836)
  • Fixed MissingFieldException in offline mode for the NeptuneLogger() (#14919)
  • Fixed the WandbLogger save_dir being overridden by a None dir when using the CLI (#14878)
  • Fixed a missing call to LightningDataModule.load_state_dict hook while restoring checkpoint using LightningDataModule.load_from_checkpoint (#14883)
  • Fixed torchscript error with containers of LightningModules (#14904)
  • Fixed reloading of the last checkpoint on run restart (#14907)
  • Made SaveConfigCallback instances save the config only once, to allow the overwrite=False safeguard when using LightningCLI(..., run=False) (#14927)
  • Fixed an issue with terminating the trainer profiler when a StopIteration exception is raised while using an IterableDataset (#14940)
  • Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)
  • Fixed Trainer support for PyTorch built without distributed support (#14971)
  • Fixed batch normalization statistics calculation in StochasticWeightAveraging callback (#14866)
  • Avoided initializing optimizers during deepspeed inference (#14944)
  • Fixed LightningCLI parse_env and description in subcommands (#15138)
  • Fixed an exception that would occur when creating a multiprocessing.Pool after importing Lightning (#15292)
  • Fixed a pickling error when using RichProgressBar together with checkpointing (#15319)
  • Fixed the RichProgressBar crashing when used with distributed strategies (#15376)
  • Fixed an issue with RichProgressBar not resetting the internal state for the sanity check progress (#15377)
  • Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)

Full commit list: 1.7.0...1.8.0

Contributors

Veteran

@akihironitta @ananthsub @AndresAlgaba @ar90n @Atharva-Phatak @awaelchli @BongYang @Borda @carmocca @dependabot @donlapark @ethanwharris @Felonious-Spellfire @hhsecond @jerome-habana @JustinGoheen @justusschock @kaushikb11 @krishnakalyan3 @krshrimali @luca-medeiros @manangoel99 @manskx @mauvilsa @MrShevan @nicolai86 @nmiculinic @otaj @Queuecumber @rlizzo @rohitgr7 @rschireman @SeanNaren @speediedan @tchaton @tshu-w

New

@Birch-san @clementpoiret @HalestormAI @thongonary @alecmerdler @adam-lightning @yurijmikhalevich @lijm1358 @robert-s-lee @panos-is @kacperlukawski @alro923 @dmitsf @Anner-deJong @cschell @nishantb06 @Callidior @j0rd1smit @MarcSkovMadsen @KralaBenjamin @robertomest @daniel347x @pierocor @datumbox @nohalon @pritamsoni-hsr @nandwalritik @gilfree @ritsuki1227 @christopher-nguyen-re @JulesGM @jgbos @dconathan @jsr-p @NeoKish @Blaizzy @suyash-811 @alexkuzmik @ziyadsheeba @geoffrey-g-delhomme @amrutha1098 @AlessioQuercia @ver217 @Helias @zxvix @1SAA @fabiofumarola @luca3rd @kimpty @PaulLerner @rbracco @wouterzwerink

If we forgot somebody or you have a suggestion, find support here

Did you know?

Chuck Norris can write functions of infinite recursion ... and have them return.