[RFC] Create a ModelCheckpointBase callback #6504

ananthsub · 2021-03-13T09:15:32Z

🚀 Feature

Create a ModelCheckpointBase callback, and have the existing checkpoint callback extend it

Motivation

The model checkpoint callback is growing in complexity. Features that have been recently added or will soon be proposed:

Checkpoint every n train steps: [feat] Support iteration-based checkpointing in model checkpoint callback #6146
Checkpointing using a time-based interval during training: Support time-based checkpointing trigger #6286
Checkpointing on train end to fix this hack
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/training_loop.py#L152-L163

The decision was made in #6146 to keep these triggers mutually exclusive, at least based on the phase they run in. Why? It's very hard to get the state management right. For instance, the monitor might be added for something that's available only during validation, but the checkpoint callback is configured to run during training too, and crashes when it tries to look up the monitor key in the available metrics for tracking. Tracking top-K models and scores is another huge pain. Supporting multiple monitor metrics on top of this is another beast.

cc @Borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

Pitch

Move the existing logic for the following into a base class:

Core saving functionality
Top-K model management
Formatting checkpoint names
Validation (though subclasses should override this)

And have thin wrappers on top which extend this class and implement callback hook(s) for when to save the checkpoint.
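
A rough sketch of the split pitched above, written in plain Python rather than against the real Lightning classes. Every name here (CheckpointBaseSketch, EveryNStepsCheckpoint, on_train_step_end, the in-memory `saved` dict) is hypothetical and only illustrates the shape: the base class owns saving, naming, and top-K bookkeeping, while a thin subclass only decides when to save.

```python
from typing import Dict, Optional


class CheckpointBaseSketch:
    """Hypothetical base: owns saving, naming, and top-K bookkeeping."""

    def __init__(self, save_top_k: int = 1, mode: str = "min") -> None:
        if mode not in ("min", "max"):
            raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
        self.save_top_k = save_top_k
        self.mode = mode
        self.saved: Dict[str, float] = {}  # checkpoint name -> monitored score

    def format_name(self, step: int) -> str:
        return f"step={step}.ckpt"

    def save(self, step: int, score: float) -> Optional[str]:
        """Record a checkpoint if it belongs in the top K; return its name."""
        name = self.format_name(step)
        self.saved[name] = score
        if len(self.saved) > self.save_top_k:
            # Evict the worst entry: highest score for "min", lowest for "max".
            pick = max if self.mode == "min" else min
            worst = pick(self.saved, key=self.saved.get)
            del self.saved[worst]
            if worst == name:
                return None
        return name


class EveryNStepsCheckpoint(CheckpointBaseSketch):
    """Thin wrapper: only decides *when* to save (hypothetical hook name)."""

    def __init__(self, every_n_steps: int, **kwargs) -> None:
        super().__init__(**kwargs)
        self.every_n_steps = every_n_steps

    def on_train_step_end(self, step: int, score: float) -> None:
        if step % self.every_n_steps == 0:
            self.save(step, score)
```

A time-based or train-end trigger would be another small subclass with a different hook, leaving the top-K logic untouched.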

Alternatives

The checkpoint callback gets bigger and bigger as we add more features to it.

Additional context


dalek-who · 2021-03-13T14:09:14Z

Related to this issue: I want to save the best checkpoint's validation predictions at the moment that checkpoint is saved, and I want to write them to a separate file rather than into the checkpoint's binary file. Right now, I can't tell what triggered LightningModule.on_save_checkpoint (it could be a best, best-k, or last save, or a save during an exception).
If I implement this in LightningModule.on_save_checkpoint, I need a flag indicating what triggered the save. I think the better way is a ModelCheckpointBase with hooks like on_save_checkpoint_{best, best-k, last}.
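
One way to picture the hooks suggested in this comment, as a plain-Python sketch. The hook names on_save_checkpoint_best and on_save_checkpoint_last come from the suggestion itself, not from any existing Lightning API, and the class names, the `save` signature, and the "lower score is better" rule are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import List, Optional


class ReasonAwareCheckpointSketch:
    """Hypothetical base that tells subclasses why a checkpoint was saved."""

    def __init__(self) -> None:
        self.best_score: Optional[float] = None

    def save(self, path: str, score: float) -> None:
        # A real callback would write the checkpoint file here.
        if self.best_score is None or score < self.best_score:
            self.best_score = score
            self.on_save_checkpoint_best(path, score)
        self.on_save_checkpoint_last(path, score)

    # Hypothetical hooks; no-ops by default.
    def on_save_checkpoint_best(self, path: str, score: float) -> None:
        pass

    def on_save_checkpoint_last(self, path: str, score: float) -> None:
        pass


class SavePredictionsOnBest(ReasonAwareCheckpointSketch):
    """Writes predictions to a side file whenever a new best is saved."""

    def __init__(self, predictions_path: str) -> None:
        super().__init__()
        self.predictions_path = predictions_path
        self.latest_predictions: List[float] = []

    def on_save_checkpoint_best(self, path: str, score: float) -> None:
        payload = {"checkpoint": path, "predictions": self.latest_predictions}
        with open(self.predictions_path, "w") as f:
            json.dump(payload, f)


# Usage: the side file is only rewritten when a new best score arrives.
out_path = os.path.join(tempfile.mkdtemp(), "best_predictions.json")
cb = SavePredictionsOnBest(out_path)
cb.latest_predictions = [0.1, 0.9]
cb.save("epoch=0.ckpt", score=0.5)  # new best: side file written
cb.latest_predictions = [0.2, 0.8]
cb.save("epoch=1.ckpt", score=0.7)  # not a new best: side file untouched
```

With hooks like these, the LightningModule no longer needs a flag to work out what triggered the save.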

carmocca · 2021-03-14T00:33:19Z

Previous discussion: #4335 (comment)

Should we close that one in favor of this one? @ananthsub

stale · 2021-04-18T21:51:11Z

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

ananthsub · 2022-03-04T18:54:26Z

I have a different take on this now: the framework relies on the class hierarchy to determine the intent of a callback. The Trainer uses this in particular to determine which callbacks are enabled by default. This applies to:

checkpointing
progress bar
model summary

I think for these, we ought to create an empty base class. This way, users don't have to worry about changes to the concrete implementations also offered by the framework. As an example:

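from pytorch_lightning.callbacks import Callback


class BaseModelCheckpoint(Callback):
    pass


class ModelCheckpoint(BaseModelCheckpoint):
    ...  # existing code today


class MyCustomModelCheckpoint(BaseModelCheckpoint):
    ...  # custom logic, independent of ModelCheckpoint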

Essentially, someone should be able to extend BaseModelCheckpoint, which is completely empty, instead of ModelCheckpoint, and customize it however they see fit without worrying about keeping their code in sync with the framework's changes to ModelCheckpoint.

@jjenniferdai @carmocca

carmocca · 2022-03-05T20:11:31Z

What would be the internal advantage of offering this empty base class?

jjenniferdai · 2022-03-07T19:10:11Z

Custom checkpoint callbacks (AnyCustomUserCheckpointCallback(BaseModelCheckpoint)) don't have to inherit ModelCheckpoint logic, and they can still be detected and rearranged to execute last in callback_connector.
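
To illustrate the detection point, here is a small stand-alone sketch of type-based reordering. The classes below are toy stand-ins defined in the snippet, and move_checkpoint_callbacks_last is a hypothetical helper, not the actual callback_connector code. It only shows why an empty base class is enough for a custom checkpoint callback to be recognized.

```python
from typing import List


class Callback:
    """Toy stand-in for the framework's Callback class."""


class BaseModelCheckpoint(Callback):
    """Empty marker base, as proposed in this thread."""


class ModelCheckpoint(BaseModelCheckpoint):
    """Stand-in for the framework's concrete checkpoint callback."""


class MyCustomCheckpoint(BaseModelCheckpoint):
    """User callback that inherits no ModelCheckpoint logic."""


class EarlyStopping(Callback):
    """Stand-in for any non-checkpoint callback."""


def move_checkpoint_callbacks_last(callbacks: List[Callback]) -> List[Callback]:
    """Return callbacks with checkpoint callbacks last, keeping relative order."""
    others = [c for c in callbacks if not isinstance(c, BaseModelCheckpoint)]
    checkpoints = [c for c in callbacks if isinstance(c, BaseModelCheckpoint)]
    return others + checkpoints


ordered = move_checkpoint_callbacks_last(
    [MyCustomCheckpoint(), EarlyStopping(), ModelCheckpoint()]
)
order_names = [type(c).__name__ for c in ordered]
```

Because the check is against the marker base, MyCustomCheckpoint is moved to the end alongside ModelCheckpoint even though the two share no implementation.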

carmocca · 2022-03-30T12:11:47Z

I am in favor of this.

I think it's becoming increasingly important that ModelCheckpoint gets split up. It has gotten bloated with multiple arguments that have chaotic or unintended interactions and make edge cases very confusing. For example:

best_model_path does not retrieve the path to the best monitor checkpoint file #12485
Raise a WARNING when someone tried to load the best checkpoint when one has not been set. #12501
Fix interaction with save_last and every_n_epochs #12391
Just look at the interactions of every_n_epochs: https://github.com/PyTorchLightning/pytorch-lightning/blob/3bcaed52454f3e6c3bce5513032e34302e5b1bb6/pytorch_lightning/callbacks/model_checkpoint.py#L117-L131

We couldn't do this in the past because we didn't support multiple instances of the same callback. Related: #4335, where I propose a split. We would need to revisit it because that discussion is old.

ananthsub added the feature and help wanted labels Mar 13, 2021

carmocca mentioned this issue Apr 7, 2021

Validation metrics assumed to be logged within the first training epoch #6791

Closed

stale bot added the won't fix label Apr 18, 2021

stale bot closed this as completed Apr 26, 2021

ananthsub reopened this Mar 4, 2022

stale bot removed the won't fix label Mar 4, 2022

carmocca assigned otaj Apr 29, 2022

carmocca added this to the 1.7 milestone Apr 29, 2022

carmocca added the callback: model checkpoint label and removed the help wanted label Apr 29, 2022

carmocca added this to Frameworks Planning Apr 29, 2022

carmocca moved this to Todo in Frameworks Planning Apr 29, 2022

otaj mentioned this issue May 10, 2022

Add BaseModelCheckpoint class to inherit from #13024

Merged

11 tasks

carmocca moved this from Todo to In Progress in Frameworks Planning May 10, 2022

carmocca moved this from In Progress to In Review in Frameworks Planning May 10, 2022

otaj closed this as completed in #13024 Jun 30, 2022

Repository owner moved this from In Review to Done in Frameworks Planning Jun 30, 2022

Borda added this to Lightning RFCs Aug 8, 2022
