Support ModelCheckpoint saving at step intervals and fractional epoch intervals #6333

timothybrooks · 2021-03-03T22:07:39Z

🚀 Feature

Currently, ModelCheckpoint supports the period option, which specifies the epoch interval for saving checkpoints and must be an integer. In general, and especially for extremely large datasets, it would be useful to have finer control over when checkpoints are saved.

Motivation

I am training models on huge datasets, so each epoch takes long enough that saving checkpoints only at the end of an epoch does not meet my needs.

Pitch

I propose that, similar to the val_check_interval training flag, ModelCheckpoint should support fractional epoch intervals, e.g. period=0.25 would indicate that a model checkpoint should be saved at each quarter of an epoch. It is sometimes desirable to specify intervals in terms of batch steps rather than epochs, so I also propose adding a parameter to support this, such as step_period, where the caller can specify the number of steps between saved checkpoints.
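To make the two options concrete, here is a minimal sketch of how a fractional period might map onto batch steps. The helper names (`steps_per_checkpoint`, `should_save`) and the round-down rule are assumptions for illustration only; they are not part of ModelCheckpoint.

```python
import math


def steps_per_checkpoint(period: float, batches_per_epoch: int) -> int:
    """Convert an epoch period (possibly fractional) into whole batch steps.

    Hypothetical helper: rounds down and never returns less than 1.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return max(1, math.floor(period * batches_per_epoch))


def should_save(global_step: int, step_period: int) -> bool:
    """Return True when the number of completed steps is a multiple of step_period."""
    return global_step > 0 and global_step % step_period == 0


# period=0.25 with 1000 batches per epoch: save every 250 steps.
interval = steps_per_checkpoint(0.25, 1000)
saved_at = [step for step in range(1, 1001) if should_save(step, interval)]
```

With these assumptions, `period=0.25` and a proposed `step_period=250` would describe the same schedule for a 1000-batch epoch.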

Alternatives

Users can implement custom callbacks that save checkpoints at the end of a batch step. However, it would be great to leverage all the smarts of ModelCheckpoint (such as top k logic), which quickly makes the custom callback redundant and complex. It is also a feature which I believe would be commonly used enough that it would be valuable to expose to users without the need to write custom callbacks.
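For reference, the custom-callback workaround could look roughly like the sketch below. It is written as a plain class so it runs on its own; in practice it would subclass Lightning's Callback. It assumes the trainer exposes `global_step` and `save_checkpoint(path)`, and the class name, `step_period` argument, and file name template are hypothetical.

```python
class StepCheckpoint:
    """Sketch of a step-based checkpoint hook (hypothetical, not ModelCheckpoint)."""

    def __init__(self, step_period: int, path_template: str = "step_{step}.ckpt"):
        if step_period < 1:
            raise ValueError("step_period must be at least 1")
        self.step_period = step_period
        self.path_template = path_template

    def on_train_batch_end(self, trainer, *args, **kwargs):
        # Extra hook arguments vary between versions, so they are ignored here.
        step = trainer.global_step
        if step > 0 and step % self.step_period == 0:
            trainer.save_checkpoint(self.path_template.format(step=step))
```

Note what is missing: there is no monitored metric, no top k tracking, and no cleanup of old files, which is the logic that would otherwise have to be duplicated from ModelCheckpoint.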

ananthsub · 2021-03-03T22:16:00Z

Would this work for you? #6145
#6286 is another extension i'd like to land to give checkpointing more flexibility

timothybrooks · 2021-03-03T22:37:25Z

Yes, from the description #6145 would be exactly what I need! Time based is a good idea as well, although less important for my particular use case. Is there a target release for #6145?

ananthsub · 2021-03-03T23:43:04Z

i'm actively working on #6145 - ideally this will be available by the next patch release, but if not, definitely within the next few weeks

timothybrooks added the "feature" (Is an improvement or enhancement) and "help wanted" (Open to be worked on) labels Mar 3, 2021

ananthsub linked a pull request Mar 5, 2021 that will close this issue

[feat] Support iteration based checkpointing after training batches #6145

Closed

11 tasks

ananthsub removed a link to a pull request Mar 5, 2021

[feat] Support iteration based checkpointing after training batches #6145

Closed

11 tasks

ananthsub linked a pull request Mar 5, 2021 that will close this issue

[feat] Support iteration-based checkpointing in model checkpoint callback #6146

Merged

11 tasks

ananthsub closed this as completed in #6146 Mar 11, 2021
