ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch #11979
Comments
@ShaneTian Is your expectation that ModelCheckpoint would run in the 10th batch of the second epoch given your example? These values are always assumed to be constrained within the number of batches. We could raise a warning if that's not the case.
val_check_interval is constrained to the number of batches within an epoch, but every_n_train_steps of ModelCheckpoint isn't, I think. I'd suggest we can think of extending …
Yes, intuitively, it should be a global training step.
Yes. The proposal is being tracked in #8135 (comment) (just linked this issue too)
someone's already working on it 🚀
Any updates on this issue? This feature is really useful for scenarios in which the training set is small or validation is expensive.
Support has been added with #11993.
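(Not part of the original thread: my understanding of the linked change is that validation and step-based checkpointing can now follow a global-step schedule when the per-epoch validation cadence is disabled. A minimal sketch, assuming a recent pytorch_lightning release; MyModel, train_loader, and val_loader are placeholders, not anything from this issue:)

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a checkpoint every 120 optimizer steps, counted globally across epochs.
checkpoint_cb = ModelCheckpoint(every_n_train_steps=120, save_top_k=-1)

trainer = pl.Trainer(
    max_epochs=5,
    callbacks=[checkpoint_cb],
    # With check_val_every_n_epoch=None, an integer val_check_interval is
    # interpreted as a number of global training steps, so it may exceed the
    # number of batches in a single epoch.
    check_val_every_n_epoch=None,
    val_check_interval=120,
)
# trainer.fit(MyModel(), train_loader, val_loader)  # placeholders for the user's own code
```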
Hi! I'm still facing this problem. I'm training on a really small dataset, so I only have 5 iterations per epoch, and I'm not running validation at all; I set … I just want to save the trained model of the latest epoch, but it seems like it's not saving anything unless I specifically set …
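(Not from the original thread, but for this use case a sketch like the following should save a checkpoint at the end of every epoch without relying on validation at all; the specific arguments are assumptions, not the commenter's actual settings:)

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Save at every epoch boundary; no metric is monitored, so no validation is needed.
checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints",
    every_n_epochs=1,
    save_top_k=-1,   # keep a checkpoint per epoch; use save_top_k=1 to keep only the newest
    save_last=True,  # also maintain a "last.ckpt" pointer
)

trainer = pl.Trainer(
    max_epochs=10,
    limit_val_batches=0,  # skip the validation loop entirely
    callbacks=[checkpoint_cb],
)
# trainer.fit(MyModel(), train_loader)  # MyModel/train_loader stand in for the user's code
```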
🐛 Bug
ModelCheckpoint does NOT save anything if every_n_train_steps is greater than the number of training steps in an epoch.
pl does NOT call the validation loop if val_check_interval is greater than the number of training steps in an epoch.
To Reproduce
In my experiment, the number of training steps in an epoch is about 110.
- If I set every_n_train_steps and val_check_interval to 100, the ModelCheckpoint and the validation loop work well.
- If I set every_n_train_steps and val_check_interval to 120, the ModelCheckpoint and the validation loop fail: ModelCheckpoint does not save anything, and pl does NOT call the validation loop.
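(Not in the original report: a minimal sketch of the setup described above. The module, data, and sizes are invented for illustration, giving roughly 110 training batches per epoch with both intervals set to 120. Depending on the Lightning version, this either silently skips checkpointing/validation, as reported here, or raises a configuration error about val_check_interval.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# ~110 training batches per epoch (110 samples, batch size 1), as in the report.
train_loader = DataLoader(TensorDataset(torch.randn(110, 32), torch.randint(0, 2, (110,))), batch_size=1)
val_loader = DataLoader(TensorDataset(torch.randn(16, 32), torch.randint(0, 2, (16,))), batch_size=1)

# Both intervals exceed the ~110 steps available in a single epoch.
checkpoint_cb = ModelCheckpoint(every_n_train_steps=120)

trainer = pl.Trainer(
    max_epochs=3,
    val_check_interval=120,
    callbacks=[checkpoint_cb],
)
trainer.fit(TinyModel(), train_loader, val_loader)
```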
Expected behavior
Is this expected? Can't I set these two parameters to more than one epoch?
Environment
- Installation (conda, pip, source): pip
cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7