
Remove memory-retaining epoch-end hooks #16520

Merged
merged 15 commits on Feb 6, 2023

Conversation


@carmocca carmocca commented Jan 26, 2023

Migration guide

training_epoch_end -> on_train_epoch_end
 class MyLightningModule(L.LightningModule):
+    def __init__(self):
+        super().__init__()
+        self.training_step_outputs = []

     def training_step(self, ...):
         loss = ...
+        self.training_step_outputs.append(loss)
         return loss

-    def training_epoch_end(self, outputs):
-        epoch_average = torch.stack([output["loss"] for output in outputs]).mean()
+    def on_train_epoch_end(self):
+        epoch_average = torch.stack(self.training_step_outputs).mean()
         self.log("training_epoch_average", epoch_average)
+        self.training_step_outputs.clear()  # free memory

The same suggestions apply to those implementing Callback.training_epoch_end
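For readers less familiar with Lightning internals, the migration boils down to this accumulate-and-clear bookkeeping. Here is a framework-free sketch of the same pattern; `FakeModule` and its plain float "losses" are illustrative stand-ins for a LightningModule and its tensors, not Lightning API:

```python
# Framework-free sketch of the accumulate-and-clear pattern shown above.
class FakeModule:
    def __init__(self):
        self.training_step_outputs = []

    def training_step(self, loss):
        self.training_step_outputs.append(loss)  # keep only what you need later
        return loss

    def on_train_epoch_end(self):
        outputs = self.training_step_outputs
        epoch_average = sum(outputs) / len(outputs)
        self.training_step_outputs.clear()  # free memory for the next epoch
        return epoch_average

module = FakeModule()
for loss in [1.0, 0.5, 0.75]:
    module.training_step(loss)
print(module.on_train_epoch_end())   # 0.75
print(module.training_step_outputs)  # [] -- nothing is retained across epochs
```

The key difference from the removed hooks: the user decides what to store, so nothing is retained unless explicitly appended, and `clear()` releases it as soon as the aggregation is done.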

validation_epoch_end -> on_validation_epoch_end
 class MyLightningModule(L.LightningModule):
+    def __init__(self):
+        super().__init__()
+        self.validation_step_outputs = []

     def validation_step(self, ...):
         loss = ...
+        self.validation_step_outputs.append(loss)
         return loss

-    def validation_epoch_end(self, outputs):
-        epoch_average = torch.stack(outputs).mean()
+    def on_validation_epoch_end(self):
+        epoch_average = torch.stack(self.validation_step_outputs).mean()
         self.log("validation_epoch_average", epoch_average)
+        self.validation_step_outputs.clear()  # free memory

The same suggestions apply to those implementing Callback.validation_epoch_end

test_epoch_end -> on_test_epoch_end
 class MyLightningModule(L.LightningModule):
+    def __init__(self):
+        super().__init__()
+        self.test_step_outputs = []

     def test_step(self, ...):
         loss = ...
+        self.test_step_outputs.append(loss)
         return loss

-    def test_epoch_end(self, outputs):
-        epoch_average = torch.stack(outputs).mean()
+    def on_test_epoch_end(self):
+        epoch_average = torch.stack(self.test_step_outputs).mean()
         self.log("test_epoch_average", epoch_average)
+        self.test_step_outputs.clear()  # free memory

The same suggestions apply to those implementing Callback.test_epoch_end

Example with two DataLoaders
 class MyLightningModule(L.LightningModule):
+    def __init__(self):
+        super().__init__()
+        self.test_step_outputs = [[], []]  # two dataloaders

     def test_step(self, batch, batch_idx, dataloader_idx=0):
         loss = ...
+        self.test_step_outputs[dataloader_idx].append(loss)
         return loss

-    def test_epoch_end(self, outputs):
+    def on_test_epoch_end(self):
-        for dl_idx in range(len(outputs)):
+        for dl_idx in range(len(self.test_step_outputs)):
-            dataloader_epoch_average = torch.stack(outputs[dl_idx]).mean()
+            dataloader_epoch_average = torch.stack(self.test_step_outputs[dl_idx]).mean()
             self.log(f"test_epoch_average_dl_{dl_idx}", dataloader_epoch_average)
-            outputs[dl_idx].clear()
+            self.test_step_outputs[dl_idx].clear()

     def test_dataloader(self):
         dl1 = DataLoader(RandomDataset(32, 64), batch_size=2)
         dl2 = DataLoader(RandomDataset(32, 64), batch_size=2)
         return dl1, dl2
Example with strategy="dp" (DataParallel)
 class MyLightningModule(L.LightningModule):
+    def __init__(self):
+        super().__init__()
+        self.training_step_outputs = []
+        self.validation_step_outputs = []

     def training_step(self, batch, batch_idx):
         output = ...
         return output
 
     def validation_step(self, batch, batch_idx):
         output = ...
         return output

+    def training_step_end(self, training_step_output):
+        training_step_output = self.trainer.strategy.reduce(training_step_output)
+        self.training_step_outputs.append(training_step_output)
+        return training_step_output

+    def validation_step_end(self, validation_step_output):
+        self.validation_step_outputs.append(validation_step_output)
 
-    def training_epoch_end(self, outputs):
-        epoch_average = torch.stack([output["loss"] for output in outputs]).mean()
+    def on_train_epoch_end(self):
+        epoch_average = torch.stack(self.training_step_outputs).mean()
         self.log("training_epoch_average", epoch_average)
+        self.training_step_outputs.clear()  # free memory
 
-    def validation_epoch_end(self, outputs):
+    def on_validation_epoch_end(self):
         epoch_average = torch.stack(self.validation_step_outputs).mean()
         self.log("validation_epoch_average", epoch_average)
+        self.validation_step_outputs.clear()  # free memory
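To clarify why the DP example reduces before appending: under DataParallel each step yields one output per device, and `self.trainer.strategy.reduce(...)` collapses them (a mean by default) into a single value before it is stored for the epoch. A framework-free sketch of that idea, where `reduce_mean` is an illustrative stand-in for `strategy.reduce` (real Lightning reduces tensors, not floats):

```python
# Stand-in for strategy.reduce: collapse per-device outputs into one value.
def reduce_mean(per_device_outputs):
    return sum(per_device_outputs) / len(per_device_outputs)

training_step_outputs = []

# pretend two GPUs each returned a loss for the same step
step_output = reduce_mean([0.5, 1.0])
training_step_outputs.append(step_output)

print(step_output)                 # 0.75
print(len(training_step_outputs))  # 1 -- one entry per step, not per device
```

Appending the reduced value rather than the raw per-device outputs keeps the stored list the same size regardless of how many devices are in use.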

If you have questions about how to migrate your use case, you can ask in this PR.

What does this PR do?

Removes the training_epoch_end, validation_epoch_end, and test_epoch_end hooks.
In favor of on_train_epoch_end, on_validation_epoch_end, and on_test_epoch_end.

These hooks were problematic: merely implementing them forced the loops to retain every step output for the whole epoch, which could cause memory issues for users unaware of this behavior.
They also increased the loops' complexity and were hard to hack or customize externally.

At runtime, we check whether the old hooks are overridden and, if they are, fail with an error message that points to the migration guide above.
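For the curious, detecting an overridden hook can be as simple as comparing the subclass's attribute with the base class's. This is a hedged sketch of the idea, not the actual implementation; Lightning has its own `is_overridden` utility, and `Base`, `Old`, and `New` here are illustrative stand-ins:

```python
# Sketch of detecting a removed hook override and failing with a pointer to
# the migration guide. Not Lightning code; the real check lives in the
# trainer's configuration validator.
class Base:
    def training_epoch_end(self, outputs):
        pass

def is_overridden(method_name, instance, parent=Base):
    # If the subclass did not redefine the method, attribute lookup resolves
    # to the very same function object as on the parent class.
    return getattr(type(instance), method_name) is not getattr(parent, method_name)

class Old(Base):  # still implements the removed hook -> should be rejected
    def training_epoch_end(self, outputs):
        pass

class New(Base):  # migrated -> passes validation
    pass

def validate(model):
    if is_overridden("training_epoch_end", model):
        raise RuntimeError(
            "Support for `training_epoch_end` has been removed in v2.0.0. "
            "Move your logic to `on_train_epoch_end` (see the migration guide)."
        )

validate(New())  # passes silently
try:
    validate(Old())
except RuntimeError as err:
    print("rejected:", err)
```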

Blocked by #16567

Fixes #8731
Closes #9380
Closes #9968
Closes #10878
Closes #11554

Follow-up things to address:
#8479: need to remove outputs from on_predict_epoch_end

Does your PR introduce any breaking changes? If yes, please list them.

Removes the hooks described above.

cc @Borda @justusschock @carmocca @awaelchli

@carmocca carmocca self-assigned this Jan 26, 2023
@github-actions github-actions bot added app (removed) Generic label for Lightning App package pl Generic label for PyTorch Lightning package labels Jan 26, 2023
@carmocca carmocca added breaking change Includes a breaking change lightningmodule pl.LightningModule hooks Related to the hooks API and removed app (removed) Generic label for Lightning App package labels Jan 26, 2023
@carmocca carmocca force-pushed the refactor/epoch-end-hook-removal branch from 2f301a1 to 55a5a51 Compare January 26, 2023 18:17
@github-actions github-actions bot added the app (removed) Generic label for Lightning App package label Jan 26, 2023
@carmocca carmocca added this to the 2.0 milestone Jan 30, 2023
@carmocca carmocca marked this pull request as ready for review January 30, 2023 17:55

github-actions bot commented Jan 30, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.9, 1.11) success
pl-cpu (windows-2022, lightning, 3.10, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (slow, macOS-11, lightning, 3.8, 1.11) success
pl-cpu (slow, ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (slow, windows-2022, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to src/lightning/pytorch/callbacks/callback.py, src/lightning/pytorch/core/hooks.py, src/lightning/pytorch/core/module.py, src/lightning/pytorch/demos/boring_classes.py, src/lightning/pytorch/loops/dataloader/evaluation_loop.py, src/lightning/pytorch/loops/epoch/evaluation_epoch_loop.py, src/lightning/pytorch/loops/epoch/training_epoch_loop.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/optimization/manual.py, src/lightning/pytorch/trainer/configuration_validator.py, src/lightning/pytorch/trainer/connectors/logger_connector/fx_validator.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/types.py, tests/tests_pytorch/accelerators/test_ipu.py, tests/tests_pytorch/accelerators/test_tpu.py, tests/tests_pytorch/callbacks/progress/test_tqdm_progress_bar.py, tests/tests_pytorch/callbacks/test_callback_hook_outputs.py, tests/tests_pytorch/callbacks/test_lr_monitor.py, tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py, tests/tests_pytorch/checkpointing/test_model_checkpoint.py, tests/tests_pytorch/checkpointing/test_trainer_checkpoint.py, tests/tests_pytorch/core/test_datamodules.py, tests/tests_pytorch/core/test_lightning_module.py, tests/tests_pytorch/core/test_lightning_optimizer.py, tests/tests_pytorch/helpers/deterministic_model.py, tests/tests_pytorch/loggers/test_all.py, tests/tests_pytorch/loggers/test_logger.py, tests/tests_pytorch/loggers/test_neptune.py, tests/tests_pytorch/loggers/test_tensorboard.py, tests/tests_pytorch/loops/optimization/test_optimizer_loop.py, tests/tests_pytorch/loops/test_evaluation_loop.py, tests/tests_pytorch/loops/test_evaluation_loop_flow.py, tests/tests_pytorch/loops/test_flow_warnings.py, tests/tests_pytorch/loops/test_loops.py, tests/tests_pytorch/loops/test_training_loop.py, tests/tests_pytorch/loops/test_training_loop_flow_dict.py, tests/tests_pytorch/loops/test_training_loop_flow_scalar.py, 
tests/tests_pytorch/models/test_hooks.py, tests/tests_pytorch/plugins/test_double_plugin.py, tests/tests_pytorch/strategies/test_deepspeed_strategy.py, tests/tests_pytorch/strategies/test_dp.py, tests/tests_pytorch/trainer/connectors/test_data_connector.py, tests/tests_pytorch/trainer/dynamic_args/test_multiple_eval_dataloaders.py, tests/tests_pytorch/trainer/flags/test_fast_dev_run.py, tests/tests_pytorch/trainer/flags/test_min_max_epochs.py, tests/tests_pytorch/trainer/logging_/test_distributed_logging.py, tests/tests_pytorch/trainer/logging_/test_eval_loop_logging.py, tests/tests_pytorch/trainer/logging_/test_logger_connector.py, tests/tests_pytorch/trainer/logging_/test_loop_logging.py, tests/tests_pytorch/trainer/logging_/test_train_loop_logging.py, tests/tests_pytorch/trainer/optimization/test_manual_optimization.py, tests/tests_pytorch/trainer/optimization/test_multiple_optimizers.py, tests/tests_pytorch/trainer/optimization/test_optimizers.py, tests/tests_pytorch/trainer/test_config_validator.py, tests/tests_pytorch/trainer/test_dataloaders.py, tests/tests_pytorch/trainer/test_trainer.py, tests/tests_pytorch/tuner/test_scale_batch_size.py, tests/tests_pytorch/utilities/test_all_gather_grad.py, tests/tests_pytorch/utilities/test_auto_restart.py, tests/tests_pytorch/utilities/test_fetching.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the same changes listed above.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the same changes listed above.

🟢 pytorch_lightning: Azure IPU
Check ID Status
pytorch-lightning (IPUs) success

These checks are required after the same changes listed above.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/lightning/pytorch/callbacks/callback.py, src/lightning/pytorch/core/hooks.py, src/lightning/pytorch/core/module.py, src/lightning/pytorch/demos/boring_classes.py, src/lightning/pytorch/loops/dataloader/evaluation_loop.py, src/lightning/pytorch/loops/epoch/evaluation_epoch_loop.py, src/lightning/pytorch/loops/epoch/training_epoch_loop.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/optimization/manual.py, src/lightning/pytorch/trainer/configuration_validator.py, src/lightning/pytorch/trainer/connectors/logger_connector/fx_validator.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/types.py, docs/source-pytorch/accelerators/accelerator_prepare.rst, docs/source-pytorch/common/lightning_module.rst, docs/source-pytorch/extensions/logging.rst, docs/source-pytorch/model/manual_optimization.rst, docs/source-pytorch/starter/style_guide.rst, docs/source-pytorch/visualize/logging_advanced.rst.

🟢 lightning_app: Tests workflow
Check ID Status
app-pytest (macOS-11, lightning, 3.8, latest) success
app-pytest (macOS-11, lightning, 3.8, oldest) success
app-pytest (macOS-11, app, 3.9, latest) success
app-pytest (ubuntu-20.04, lightning, 3.8, latest) success
app-pytest (ubuntu-20.04, lightning, 3.8, oldest) success
app-pytest (ubuntu-20.04, app, 3.9, latest) success
app-pytest (windows-2022, lightning, 3.8, latest) success
app-pytest (windows-2022, lightning, 3.8, oldest) success
app-pytest (windows-2022, app, 3.8, latest) success

These checks are required after the changes to src/lightning/app/utilities/introspection.py.

🟢 lightning_app: Examples
Check ID Status
app-examples (macOS-11, lightning, 3.9, latest) success
app-examples (macOS-11, lightning, 3.9, oldest) success
app-examples (macOS-11, app, 3.9, latest) success
app-examples (ubuntu-20.04, lightning, 3.9, latest) success
app-examples (ubuntu-20.04, lightning, 3.9, oldest) success
app-examples (ubuntu-20.04, app, 3.9, latest) success
app-examples (windows-2022, lightning, 3.9, latest) success
app-examples (windows-2022, lightning, 3.9, oldest) success
app-examples (windows-2022, app, 3.9, latest) success

These checks are required after the changes to src/lightning/app/utilities/introspection.py.

🟢 lightning_app: Azure
Check ID Status
App.cloud-e2e success

These checks are required after the changes to src/lightning/app/utilities/introspection.py.

🟢 lightning_app: Docs
Check ID Status
make-doctest (app) success
make-html (app) success

These checks are required after the changes to src/lightning/app/utilities/introspection.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/app/utilities/introspection.py, src/lightning/pytorch/callbacks/callback.py, src/lightning/pytorch/core/hooks.py, src/lightning/pytorch/core/module.py, src/lightning/pytorch/demos/boring_classes.py, src/lightning/pytorch/loops/dataloader/evaluation_loop.py, src/lightning/pytorch/loops/epoch/evaluation_epoch_loop.py, src/lightning/pytorch/loops/epoch/training_epoch_loop.py, src/lightning/pytorch/loops/fit_loop.py, src/lightning/pytorch/loops/optimization/manual.py, src/lightning/pytorch/trainer/configuration_validator.py, src/lightning/pytorch/trainer/connectors/logger_connector/fx_validator.py, src/lightning/pytorch/trainer/trainer.py, src/lightning/pytorch/utilities/types.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the same changes listed under mypy above.

🟢 link-check
Check ID Status
markdown-link-check success

These checks are required after the changes to src/lightning/pytorch/CHANGELOG.md.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

ddelange added a commit to ddelange/autogluon that referenced this pull request May 4, 2023
yinweisu pushed a commit to autogluon/autogluon that referenced this pull request May 8, 2023
pallaviyn referenced this pull request in talhaanwarch/youtube-tutorials Jun 2, 2023
RuslanSergeev added a commit to RuslanSergeev/im2height that referenced this pull request May 27, 2024
@edmcman

edmcman commented Oct 16, 2024

Is there any way to get the old behavior without adding boilerplate?

@42elenz

42elenz commented Oct 23, 2024

I am wondering about the following:
I am using multi-GPU training, so I assumed I am using some sort of DP. That is why I followed the DP example. This is my implementation:

```python
class Contrastive_Training_Model(MultimodalBasis):
    def __init__(self, hparams, fold=''):
        super().__init__(hparams)

        self.save_hyperparameters(hparams)
        self.train_criterion_type = hparams.model.clip_pretrain_train_criterion
        self.label_type_fn = hparams.data.label_type_false_negative_class
        self.train_criterion, self.validation_criterion = choose_contrastive_loss_fct(self.train_criterion_type, hparams)
        self.debugging = hparams.logging.debug_level
        self.correct_val_ids_file = hparams.logging.correct_val_ids_file
        self.fold = fold
        self.training_step_outputs = []
        self.validation_step_outputs = []

        # sanity_check(self.train_criterion_type, self.label_type_fn)

    # This is called automatically by the trainer class
    def training_step(self, batch, batch_idx):
        """Trains the contrastive model."""
        train_mri = batch['cor_mri']  # can be corrupted depending on the settings
        train_questionnaire = batch['cor_questionnaire']  # can be corrupted depending on the settings
        mri_data = batch['mri']
        questionnaire_data = batch['questionnaire']
        id = batch['id']
        fn_label_class = batch['fn_label_class']
        ds_label_class = batch['ds_label_class']

        if self.label_type_fn == "binary":
            fn_label_class = fn_label_class.bool()
        mri_embeddings_projected = self.forward_mri(train_mri)
        questionnaire_embeddings_projected = self.forward_quest(train_questionnaire)
        loss, logits, labels = self.train_criterion(mri_embeddings_projected, questionnaire_embeddings_projected, fn_label_class)
        # self.training_step_outputs.append({'loss': loss, 'logits': logits, 'labels': labels, "ds_label_class": ds_label_class, 'ID': id, 'mri_embeddings': mri_embeddings_projected, 'questionnaire_embeddings': questionnaire_embeddings_projected, 'questionaire_data': questionnaire_data})
        # self.log("multimodal.train.loss", loss, on_epoch=True, on_step=False)  # implement later
        # if len(im_views[0]) == self.hparams.batch_size:
        #     self.calc_and_log_train_embedding_acc(logits=logits, labels=labels, modality='multimodal')

        return {'loss': loss,
                'logits': logits,
                "ds_label_class": ds_label_class,
                'fn_label_class': fn_label_class,
                'ID': id,
                'mri_embeddings': mri_embeddings_projected,
                'questionnaire_embeddings': questionnaire_embeddings_projected,
                'questionaire_data': questionnaire_data}

    def training_step_end(self, training_step_output):
        training_step_output = self.trainer.strategy.reduce(training_step_output)
        self.training_step_outputs.append(training_step_output)
        return training_step_output

    def validation_step(self, batch, batch_idx):
        """Validates the contrastive model."""
        val_mri = batch['cor_mri']
        val_questionnaire = batch['cor_questionnaire']
        mri_data = batch['mri']
        questionnaire_data = batch['questionnaire']
        id = batch['id']
        fn_label_class = batch['fn_label_class']
        ds_label_class = batch['ds_label_class']

        mri_embeddings_projected = self.forward_mri(val_mri)
        questionnaire_embeddings_projected = self.forward_quest(val_questionnaire)
        loss, logits, quest_logits, labels = self.validation_criterion(mri_embeddings_projected, questionnaire_embeddings_projected, fn_label_class)
        # self.validation_step_outputs.append({'loss': loss, 'mri_logits': logits, 'quest_logits': quest_logits, 'logits': logits, "ds_label_class": ds_label_class, 'fn_label_class': fn_label_class, 'ID': id, 'mri_embeddings': mri_embeddings_projected, 'questionnaire_embeddings': questionnaire_embeddings_projected, 'questionaire_data': questionnaire_data})
        return {'loss': loss,
                'mri_logits': logits,
                'quest_logits': quest_logits,
                'logits': logits,
                "ds_label_class": ds_label_class,
                'fn_label_class': fn_label_class,
                'ID': id,
                'mri_embeddings': mri_embeddings_projected,
                'questionnaire_embeddings': questionnaire_embeddings_projected,
                'questionaire_data': questionnaire_data}

    def validation_step_end(self, validation_step_output):
        self.validation_step_outputs.append(validation_step_output)

    # At the end of the epoch, all outputs are in a list.
    def on_train_epoch_end(self):
        train_outputs = self.training_step_outputs
        epoch_loss, epoch_accuracy = evaluation_of_contrastive_outputs(train_outputs, self.debugging, evaluation_type="train")
        fold = self.fold
        self.log(f"cont.train.loss{fold}", epoch_loss, on_epoch=True, on_step=False)
        self.log(f"cont.train.acc{fold}", epoch_accuracy, on_epoch=True, on_step=False)
        self.training_step_outputs.clear()

    def on_validation_epoch_end(self):
        val_outputs = self.validation_step_outputs
        fold = self.fold
        epoch_loss, epoch_accuracy, mri_accuracy, questionnaire_accuracy, mri_quest_accuracy = evaluation_of_contrastive_outputs(val_outputs, self.debugging, evaluation_type="validation", correct_val_ids_file=self.correct_val_ids_file)
        self.log("cont.val.loss", epoch_loss, on_epoch=True, on_step=False)
        self.log(f"cont.val.accuracy{fold}", epoch_accuracy, on_epoch=True, on_step=False)
        self.log(f"cont.val.mri_accuracy{fold}", mri_accuracy, on_epoch=True, on_step=False)
        self.log(f"cont.val.questionnaire_accuracy{fold}", questionnaire_accuracy, on_epoch=True, on_step=False)
        self.log(f"cont.val.mri_questionnaire_accuracy{fold}", mri_quest_accuracy, on_epoch=True, on_step=False)
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
            self.parameters(),
            lr=self.hparams.training.lr,
            weight_decay=self.hparams.training.weight_decay)
        return optimizer
```



**Unfortunately, during the pre-run, when the basic validation sanity check is calculated, `validation_step_end()` is not called, so I get an error (division by zero in `evaluation_of_contrastive_outputs`). What else can I do other than checking for this case?**

@bnestor

bnestor commented Dec 20, 2024

@42elenz I had to add `self.validation_step_outputs.append(validation_step_output)` to my `validation_step`. Then everything worked fine.
