
[bugfix] Run logger.after_save_checkpoint in model checkpoint's on_train_end hook #9783

Closed
wants to merge 16 commits into from

Conversation

dalessioluca

@dalessioluca dalessioluca commented Oct 1, 2021

What does this PR do?

This PR makes sure that logger.after_save_checkpoint is called after ModelCheckpoint creates a checkpoint in its on_train_end hook.

Note: A bigger refactor has been discussed here: #6231 (comment)

Details:

import torch
from pytorch_lightning.loggers import CSVLogger 
from pytorch_lightning.trainer import Trainer
from pytorch_lightning import LightningModule
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning.callbacks import ModelCheckpoint


class Logger(CSVLogger):
    call_counter = 0
    
    def after_save_checkpoint(self, checkpoint_callback: "ReferenceType[ModelCheckpoint]") -> None:
        print("called after new ckpt is saved")
        self.call_counter += 1


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    # Together, these two callbacks ensure a checkpoint is saved every `every_n_epochs` epochs AND at the end of training
    ckpt_train_end = ModelCheckpoint(
        filename="last",
        save_on_train_epoch_end=False,
        save_last=True,
    )

    ckpt_train_interval = ModelCheckpoint(
        filename="my_checkpoint-{epoch}",
        save_on_train_epoch_end=True,
        save_last=False,
        every_n_epochs=6,
    )

    model = BoringModel()

    trainer = Trainer(
        logger=Logger(save_dir='./logging_dir', flush_logs_every_n_steps=1),
        max_epochs=9,
        log_every_n_steps=1,
        weights_save_path="saved_ckpt",
        callbacks=[ckpt_train_interval, ckpt_train_end],
    )

    trainer.fit(model, train_dataloaders=train_data)

    assert trainer.logger.call_counter == 2

if __name__ == "__main__":
    run()

This code saves two checkpoint files in the "saved_ckpt" folder, but logger.after_save_checkpoint is called only once.


Fixes #

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • [x] Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub added bug Something isn't working checkpointing Related to checkpointing labels Oct 1, 2021
Contributor

@ananthsub ananthsub left a comment

thanks for identifying this! did you see any other points in the checkpoint callback where the logger was not triggered?

@ananthsub ananthsub changed the title bugfix. ModelCheckpoint callback creates checkpoint "on_train_end" but does not trigger the logger.after_save_checkpoint [bugfix] Run logger.after_save_checkpoint in model checkpoint's on_train_end hook Oct 2, 2021
@mergify mergify bot removed the has conflicts label Oct 12, 2021
Contributor

@tchaton tchaton left a comment

LGTM !

@tchaton tchaton enabled auto-merge (squash) October 12, 2021 09:55
@codecov

codecov bot commented Oct 12, 2021

Codecov Report

Merging #9783 (5801163) into master (6da5829) will decrease coverage by 4%.
The diff coverage is 100%.

❗ Current head 5801163 differs from pull request most recent head 0c2b15c. Consider uploading reports for the commit 0c2b15c to get more accurate results

@@           Coverage Diff           @@
##           master   #9783    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         178     178            
  Lines       15648   15650     +2     
=======================================
- Hits        14503   13897   -606     
- Misses       1145    1753   +608     

Contributor

@rohitgr7 rohitgr7 left a comment

PR looks good. I think the test can be improved or combined with one of the existing tests that cover save_last.

@@ -77,6 +77,53 @@ def mock(key):
return calls


@pytest.mark.parametrize("save_last", [False, True]) # problem with save_last = True
@pytest.mark.parametrize("save_on_train_epoch_end", [True, False])
Contributor

as per PR title, the bug was that it's not calling after_save_ckpt for last.ckpt which is saved in on_train_end. so do we need this? since on_train_end is called irrespective of save_on_train_epoch_end.

Author

This test is meant to raise an error in every scenario in which logger.after_save_checkpoint is not called after a checkpoint file is saved, so we need to test both save_last=True and save_last=False.

triggered_train_epoch_end: bool = False
trainer: Optional[Trainer]

def after_save_checkpoint(self, *_):
Contributor

Suggested change
def after_save_checkpoint(self, *_):
def after_save_checkpoint(self, *_):

I think a better test would be to mock LightningLoggerBase.after_save_checkpoint and check its call count

Author

That's a good thought. Originally, I tried to implement the test as you suggested but did not succeed. The problem is that:

  1. save_checkpoint calls 3 separate saving routines (_save_top_k_checkpoint, _save_none_monitor_checkpoint, _save_last_checkpoint) followed by a single call to Logger.after_save_checkpoint.
  2. on_train_end calls a single saving routine (_save_last_checkpoint) followed by a single call to Logger.after_save_checkpoint.

This makes checking the call counts a fragile way to test that Logger.after_save_checkpoint is called appropriately.
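The asymmetry described above can be sketched with plain stand-ins and unittest.mock. Note these are hypothetical classes written for illustration, not the real ModelCheckpoint:

```python
from unittest.mock import MagicMock

# Hypothetical stand-in mirroring the structure described above (NOT the real
# ModelCheckpoint). It shows why after_save_checkpoint call counts do not map
# 1:1 to the number of checkpoint files written.
class FakeModelCheckpoint:
    def __init__(self, logger):
        self.logger = logger
        self.files_written = 0

    def save_checkpoint(self):
        # Three separate saving routines may each write a file...
        for _ in range(3):  # _save_top_k / _save_none_monitor / _save_last
            self.files_written += 1
        # ...followed by a single logger notification.
        self.logger.after_save_checkpoint(self)

    def on_train_end(self):
        self.files_written += 1  # _save_last_checkpoint only
        self.logger.after_save_checkpoint(self)

logger = MagicMock()
cb = FakeModelCheckpoint(logger)
cb.save_checkpoint()
cb.on_train_end()
print(cb.files_written)                         # 4
print(logger.after_save_checkpoint.call_count)  # 2
```

Up to four files were written but the hook fired only twice, so a bare call count cannot distinguish "every save was reported" from "some saves were silently skipped".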

Contributor

maybe you can try just this config:

ModelCheckpoint(save_top_k=0, monitor=None, save_last=True)

it only saves the last.ckpt, thus will ensure the after_save_checkpoint call for on_train_end.

Author

I have rewritten the test. It now checks that the set of ckpt files generated by the trainer matches the set of ckpt files received by logger.after_save_checkpoint.
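The strategy can be sketched with hypothetical stand-ins (not the real Lightning objects): record every checkpoint path the logger is told about, then compare against the files actually written to disk.

```python
import os
import tempfile

# Hypothetical stand-ins illustrating the set-comparison test strategy.
class RecordingLogger:
    def __init__(self):
        self.reported = set()

    def after_save_checkpoint(self, checkpoint_callback):
        self.reported.add(checkpoint_callback.last_path)

class FakeModelCheckpoint:
    def __init__(self, logger, dirpath):
        self.logger = logger
        self.dirpath = dirpath
        self.last_path = None

    def _save(self, filename):
        path = os.path.join(self.dirpath, filename)
        open(path, "w").close()  # pretend to write a checkpoint file
        self.last_path = path
        self.logger.after_save_checkpoint(self)

with tempfile.TemporaryDirectory() as tmp:
    logger = RecordingLogger()
    cb = FakeModelCheckpoint(logger, tmp)
    cb._save("my_checkpoint-epoch=5.ckpt")
    cb._save("last.ckpt")  # the on_train_end save this PR fixes
    on_disk = {os.path.join(tmp, f) for f in os.listdir(tmp)}
    assert on_disk == logger.reported  # every saved file was reported
    print(sorted(os.path.basename(p) for p in logger.reported))
```

Unlike a bare call count, a set comparison fails loudly if any saving routine writes a file without notifying the logger.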

dalessioluca added 2 commits October 12, 2021 09:46
Merge branch 'bugfix/ckpt_and_logger' of https://github.com/dalessioluca/pytorch-lightning into bugfix/ckpt_and_logger
# Conflicts:
#	tests/checkpointing/test_model_checkpoint.py
auto-merge was automatically disabled October 12, 2021 13:49

Head branch was pushed to by a user without write access

@stale

stale bot commented Oct 27, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Oct 27, 2021
@rohitgr7
Contributor

rohitgr7 commented Nov 1, 2021

hey @dalessioluca, thank you for this PR and for identifying issues here. Looks like it does log the model here, at least for wandb: https://github.com/PyTorchLightning/pytorch-lightning/blob/45c45dc7b018f9a2db60f5df1a3f7dbbb45ccb36/pytorch_lightning/loggers/wandb.py#L461-L464

but the way these are logged and how it connects to the logger after checkpointing isn't very convenient and can be done in a more optimal way. We should call after_save_checkpoint inside trainer.save_checkpoint(). I'll open another PR after some time, which might solve more issues, including this one. If you want to make the changes to the loggers and checkpointing yourself, that would be great; otherwise you can close this one. Again, thanks a lot for this.. it was really helpful in identifying more underlying issues.
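The proposal to centralize the hook can be sketched with a hypothetical stand-in (not the real Trainer): if save_checkpoint itself notifies the logger, every code path that writes a file fires the hook exactly once, including the on_train_end save.

```python
from unittest.mock import MagicMock

# Hypothetical sketch of the proposed refactor, NOT the real Trainer.
class FakeTrainer:
    def __init__(self, logger, checkpoint_callback):
        self.logger = logger
        self.checkpoint_callback = checkpoint_callback
        self.saved_paths = []

    def save_checkpoint(self, filepath):
        self.saved_paths.append(filepath)  # write the checkpoint file
        # Centralized notification: no individual saving routine can forget it.
        self.logger.after_save_checkpoint(self.checkpoint_callback)

logger = MagicMock()
trainer = FakeTrainer(logger, checkpoint_callback=object())
trainer.save_checkpoint("epoch=5.ckpt")
trainer.save_checkpoint("last.ckpt")  # the on_train_end save
print(logger.after_save_checkpoint.call_count)  # 2
```

With this shape, callbacks and hooks like on_train_end never call the logger directly, so the bug class this PR patches cannot reappear in a new saving routine.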

@stale stale bot removed the won't fix This will not be worked on label Nov 1, 2021
@tchaton
Contributor

tchaton commented Nov 1, 2021

Dear @dalessioluca,

Based on @rohitgr7's latest comment, I believe it is better to make sure after_save_checkpoint and save_checkpoint are called together.

Would you be willing to keep going with this PR based on @rohitgr7's comments?

Best,
T.C

@tchaton tchaton modified the milestones: v1.6, v1.6.x Nov 1, 2021
@dalessioluca
Author

It indeed seems cleaner to call after_save_checkpoint inside save_checkpoint.
I will modify the PR accordingly.

@rohitgr7
Contributor

rohitgr7 commented Nov 1, 2021

also you might need to update some loggers as well. For starters, wandb won't need the finalize method if we call after_save_checkpoint right after save_checkpoint.

@awaelchli awaelchli modified the milestones: v1.6.x, 1.5.x Nov 3, 2021
@Borda Borda modified the milestones: 1.5.x, 1.6 Mar 21, 2022
@rohitgr7 rohitgr7 self-assigned this Mar 21, 2022
@rohitgr7 rohitgr7 modified the milestones: 1.6, 1.6.x Mar 21, 2022
@carmocca
Contributor

I believe this is not necessary anymore. We call after_save_checkpoint already whenever we save: https://github.com/Lightning-AI/lightning/blob/233b36b185ae71b480ddd5c87ff5d12cd70c88fe/src/pytorch_lightning/callbacks/model_checkpoint.py#L386-L394

Please, correct me if I'm wrong.

@carmocca carmocca closed this Jul 28, 2022
Labels
bug Something isn't working checkpointing Related to checkpointing has conflicts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants