Support ModelCheckpoint saving at step intervals and fractional epoch intervals #6333

Closed
timothybrooks opened this issue Mar 3, 2021 · 3 comments · Fixed by #6146
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@timothybrooks

🚀 Feature

Currently, ModelCheckpoint supports the period option, which specifies the epoch interval for saving checkpoints and must be an integer. In general, and especially for extremely large datasets, finer control over when checkpoints are saved would be useful.

Motivation

I am training models on huge datasets, so a single epoch takes so long that saving checkpoints only at the end of each epoch does not meet my needs.

Pitch

I propose that, similar to the val_check_interval training flag, ModelCheckpoint should support fractional epoch intervals; e.g., period=0.25 would save a model checkpoint at each quarter of an epoch. It is sometimes preferable to specify intervals in batch steps rather than epochs, so I also propose adding a parameter for this, such as step_period, with which the caller specifies the number of steps between checkpoints.
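To make the proposed semantics concrete, here is a minimal sketch in plain Python of how a fractional period or a step_period might be resolved to a whole number of batch steps. It is not Lightning code: the helper names save_interval_in_steps and should_save are hypothetical, and rounding the fractional interval to the nearest step is an assumption rather than anything the proposal specifies.

```python
def save_interval_in_steps(steps_per_epoch, period=None, step_period=None):
    """Resolve a checkpoint interval to a whole number of batch steps.

    Hypothetical helper: `period` is in epochs (may be fractional),
    `step_period` is in batch steps. Exactly one must be given.
    """
    if (period is None) == (step_period is None):
        raise ValueError("specify exactly one of period or step_period")
    if step_period is not None:
        if step_period < 1:
            raise ValueError("step_period must be at least 1")
        return int(step_period)
    if period <= 0:
        raise ValueError("period must be positive")
    # Assumption: round to the nearest step, never less than one step.
    return max(1, int(round(period * steps_per_epoch)))


def should_save(global_step, interval):
    """True when `global_step` (completed batches, counted from 1) hits the interval."""
    return global_step > 0 and global_step % interval == 0


# With 1000 batches per epoch, period=0.25 means saving every 250 steps.
interval = save_interval_in_steps(steps_per_epoch=1000, period=0.25)
save_steps = [s for s in range(1, 1001) if should_save(s, interval)]
```

Under these assumptions, period=0.25 and step_period=250 are equivalent for an epoch of 1000 batches, and an integer period such as 2 keeps its current meaning of every second epoch.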

Alternatives

Users can implement custom callbacks that save checkpoints at the end of a batch step. However, it would be great to leverage all the smarts of ModelCheckpoint (such as top-k logic); reimplementing them quickly makes a custom callback redundant and complex. I also believe this feature would be used commonly enough to be worth exposing to users without requiring custom callbacks.
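As a rough illustration of the bookkeeping a custom step-based callback would have to duplicate, here is a plain-Python sketch of top-k tracking. The class TopKTracker is hypothetical and is not how ModelCheckpoint is implemented; it assumes a lower score is better (e.g. a validation loss) and only decides which steps to keep, leaving the actual file saving and deletion to the caller.

```python
import heapq


class TopKTracker:
    """Track the k best (lowest) scores seen so far. Hypothetical helper."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # Min-heap of (-score, step), so the root is the worst kept score.
        self._heap = []

    def update(self, step, score):
        """Return (keep, evicted_step).

        keep: whether a checkpoint for `step` should be stored.
        evicted_step: step whose checkpoint should be deleted, or None.
        """
        entry = (-score, step)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True, None
        worst_neg_score, worst_step = self._heap[0]
        if score < -worst_neg_score:
            heapq.heapreplace(self._heap, entry)
            return True, worst_step
        return False, None

    def kept_steps(self):
        """Steps whose checkpoints are currently kept, in ascending order."""
        return sorted(step for _, step in self._heap)


tracker = TopKTracker(k=2)
decisions = [
    tracker.update(100, 0.9),
    tracker.update(200, 0.5),
    tracker.update(300, 0.7),  # beats 0.9, so step 100 is evicted
    tracker.update(400, 0.8),  # worse than both kept scores, so skipped
]
```

Even this small piece leaves out monitor modes, file naming, and resuming from a saved state, which is the kind of duplication the request aims to avoid.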

@timothybrooks added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Mar 3, 2021
@ananthsub
Contributor

Would this work for you? #6145
#6286 is another extension I'd like to land to give checkpointing more flexibility.

@timothybrooks
Author

Yes, from the description, #6145 would be exactly what I need! Time-based saving is a good idea as well, although less important for my particular use case. Is there a target release for #6145?

@ananthsub
Contributor

ananthsub commented Mar 3, 2021

I'm actively working on #6145. Ideally this will be available in the next patch release, but if not, definitely within the next few weeks.
