Currently, ModelCheckpoint supports the period option, which specifies the interval, in epochs, between saved checkpoints and must be an integer. In general, and especially for extremely large datasets, it would be useful to have finer control over when checkpoints are saved.
Motivation
I am training models on huge datasets, so a single epoch takes so long that saving checkpoints only at the end of each epoch does not meet my needs.
Pitch
I propose that, similar to the val_check_interval training flag, ModelCheckpoint should support fractional epoch intervals; e.g. period=0.25 would mean that a checkpoint is saved at each quarter of an epoch. It is sometimes desirable to specify intervals in batch steps rather than epochs, so I also propose adding a parameter for this, such as step_period, with which the caller specifies the number of steps between saved checkpoints.
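To make the proposed semantics concrete, here is a minimal sketch in plain Python of how the two options could translate into a save decision. It is not existing ModelCheckpoint behaviour: the function name `should_save_checkpoint`, the `batches_per_epoch` argument, and the rule that `step_period` takes precedence over `period` are all assumptions made for illustration.

```python
from typing import Optional


def should_save_checkpoint(
    global_step: int,
    batches_per_epoch: int,
    period: float = 1.0,
    step_period: Optional[int] = None,
) -> bool:
    """Decide whether to save after `global_step` completed training steps.

    Hypothetical helper, not part of any library. `global_step` is 1-based.
    """
    if step_period is not None:
        # Assumption: an explicit step count overrides the epoch-based period.
        if step_period < 1:
            raise ValueError("step_period must be a positive integer")
        interval = step_period
    else:
        if period <= 0:
            raise ValueError("period must be positive")
        # A fractional period becomes a whole number of steps (at least 1);
        # an integer period keeps its meaning of "every N epochs".
        interval = max(1, int(batches_per_epoch * period))
    return global_step % interval == 0
```

With 1000 batches per epoch, `period=0.25` would save after steps 250, 500, 750 and 1000, while `step_period=300` would save after steps 300, 600 and 900 regardless of epoch boundaries.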
Alternatives
Users can implement custom callbacks that save checkpoints at the end of a batch step. However, it would be great to leverage all the smarts of ModelCheckpoint (such as top-k logic); re-implementing them quickly makes a custom callback redundant and complex. I also believe this feature would be used commonly enough to be worth exposing to users without requiring custom callbacks.
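For a sense of how much a custom callback ends up duplicating, here is a rough, framework-free sketch of step-based saving combined with a simple top-k rule. Everything in it is hypothetical: the class `StepTopKSaver`, the `on_batch_end` hook, the injected `save_fn` and `remove_fn` callables, and the assumption that a lower monitored value is better. A real callback would also need to handle the trainer hooks, distributed training, and resuming.

```python
import heapq
from typing import Callable, List, Tuple


class StepTopKSaver:
    """Hypothetical stand-in for a custom callback (not a library class).

    Considers a save every `step_period` steps and keeps only the `k`
    checkpoints with the lowest monitored value.
    """

    def __init__(
        self,
        step_period: int,
        k: int,
        save_fn: Callable[[str], None],
        remove_fn: Callable[[str], None],
    ) -> None:
        self.step_period = step_period
        self.k = k
        self.save_fn = save_fn
        self.remove_fn = remove_fn
        # Min-heap on the negated value, so the worst kept checkpoint is at index 0.
        self._kept: List[Tuple[float, str]] = []

    def on_batch_end(self, global_step: int, monitored: float) -> None:
        if global_step % self.step_period != 0:
            return
        path = f"step={global_step}.ckpt"
        if len(self._kept) < self.k:
            self.save_fn(path)
            heapq.heappush(self._kept, (-monitored, path))
        elif monitored < -self._kept[0][0]:
            # Better than the worst kept checkpoint: save it and evict the worst.
            self.save_fn(path)
            _, evicted = heapq.heapreplace(self._kept, (-monitored, path))
            self.remove_fn(evicted)

    def kept_paths(self) -> List[str]:
        return sorted(path for _, path in self._kept)
```

Even this stripped-down version has to track, compare, and evict checkpoints itself, which is the logic ModelCheckpoint already provides for epoch-based saving.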
Yes, from the description, #6145 would be exactly what I need! Time-based saving is a good idea as well, although less important for my particular use case. Is there a target release for #6145?