Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Create a ModelCheckpointBase callback #6504

Closed
ananthsub opened this issue Mar 13, 2021 · 7 comments · Fixed by #13024
Closed

[RFC] Create a ModelCheckpointBase callback #6504

ananthsub opened this issue Mar 13, 2021 · 7 comments · Fixed by #13024
Assignees
Labels
callback: model checkpoint feature Is an improvement or enhancement
Milestone

Comments

@ananthsub
Copy link
Contributor

ananthsub commented Mar 13, 2021

🚀 Feature

Create a ModelCheckpointBase callback, and have the existing checkpoint callback extend it

Motivation

The model checkpoint callback is growing in complexity. Features that have been recently added or will soon be proposed:

The decision was made in #6146 to keep these triggers mutually exclusive, at least based on the phase they run in. Why? It's very hard to get the state management right. For instance, the monitor might be added for something that's available only during validation, but the checkpoint callback is configured to run during training too, and crashes when it tries to look up the monitor key in the available metrics for tracking. Tracking top-K models and scores is another huge pain. Supporting multiple monitor metrics on top of this is another beast.

cc @Borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

Pitch

Move the existing logic for the following into a base class:

  • Core saving functionality
  • Top-K model management
  • formatting checkpoint names
  • validation (though sub-classes should override this)

And have thin wrappers on top which extend this class and implement callback hook(s) for when to save the checkpoint.

Alternatives

The checkpoint callback gets bigger and bigger as we add more features to it.

Additional context

@ananthsub ananthsub added feature Is an improvement or enhancement help wanted Open to be worked on labels Mar 13, 2021
@dalek-who
Copy link
Contributor

dalek-who commented Mar 13, 2021

Related to this issue, I want to implement a logic: saving the best checkpoint's valid predict result during saving best checkpoint, and I want to save it in another file rather than in checkpoint's binary file. Right now, I can't know who calls LightningModule.on_save_checkpoint (can be best, best-k, lask or during exception).
If I implement it in LightningModule.on_save_checkpoint, I need a flag showing who save it; but I think the better way is a ModelCheckpointBase with some hooks like on_save_checkpoint_{best, best-k, last}

@carmocca
Copy link
Contributor

Previous discussion: #4335 (comment)

Should we close that one in favor of this one? @ananthsub

@stale
Copy link

stale bot commented Apr 18, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 18, 2021
@stale stale bot closed this as completed Apr 26, 2021
@ananthsub ananthsub reopened this Mar 4, 2022
@stale stale bot removed the won't fix This will not be worked on label Mar 4, 2022
@ananthsub
Copy link
Contributor Author

ananthsub commented Mar 4, 2022

I have a different take on this now: which is extending the framework is relying on the class hierarchy to determine the intent of the callback. This is used by the Trainer especially to determine what callbacks are enabled by default. This is used for:

  • checkpointing
  • progress bar
  • model summary

I think for these, we ought to create an empty base class. This way, users don't have to worry about changes to the concrete implementations also offered by the framework. As an example:

class BaseModelCheckpoint(Callback):
    pass

class ModelCheckpoint(BaseModelCheckpoint):
    # existing code today


class MyCustomModelCheckpoint(BaseModelCheckpoint)

Essentially, someone should be able to extend BaseModelCheckpoint, not ModelCheckpoint, which is completely empty, and customize this however they see fit without worrying about keeping their code in sync with the framework's changes to ModelCheckpoint.

@jjenniferdai @carmocca

@carmocca
Copy link
Contributor

carmocca commented Mar 5, 2022

What would be the advantage internally by offering this base empty class?

@jjenniferdai
Copy link
Contributor

custom checkpoint callbacks (AnyCustomUserCheckpointCallback(BaseModelCheckpoint)) don't have to inherit ModelCheckpoint logic and can still get detected and rearranged to execute last in callback_connector

@carmocca
Copy link
Contributor

I am in favor of this.

I think it's becoming increasingly important that ModelCheckpoint gets split up. It has gotten bloated with multiple arguments that have chaotic or unintended interactions and make edge cases very confusing. For example:

We couldn't do this in the past because we didn't support multiple instances of the same callback. Related: #4335 where I propose a split. We would need to revisit it because that discussion is old

@carmocca carmocca added this to the 1.7 milestone Apr 29, 2022
@carmocca carmocca added callback: model checkpoint and removed help wanted Open to be worked on labels Apr 29, 2022
@carmocca carmocca moved this to Todo in Frameworks Planning Apr 29, 2022
@carmocca carmocca moved this from Todo to In Progress in Frameworks Planning May 10, 2022
@carmocca carmocca moved this from In Progress to In Review in Frameworks Planning May 10, 2022
Repository owner moved this from In Review to Done in Frameworks Planning Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
callback: model checkpoint feature Is an improvement or enhancement
Projects
No open projects
Status: Done
Status: No status
Development

Successfully merging a pull request may close this issue.

5 participants