I have a network that trains slowly on a large dataset (something like 1 week per epoch). In my previous pure-PyTorch version, I saved a checkpoint of the model every hour along the way, without doing any additional validation. I just want to make sure I don't lose my progress. Is there a way to do something similar in Lightning? It doesn't necessarily need to be time-based; I just don't want to wait a week for the model to save. I like to periodically download the latest model from the server and see how it's doing along the way.
Thanks!
For what it's worth, I see that the ModelCheckpoint callback requires `period` to be at least 1, which suggests to me that it's not very easy to checkpoint the model in the middle of an epoch.
Hi Fishyai, there's another thread about this, see #2534. You can save every N steps, independent of validation, via a Callback. I agree this information should be added to the Lightning docs, because it's not so straightforward for a newcomer.
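For reference, here is a minimal sketch of what such a callback could look like. It is not the code from #2534: the class name `CheckpointEveryNSteps` and its parameters (`save_step_frequency`, `dirpath`, `prefix`) are made up for illustration. It assumes the `on_train_batch_end` hook, `trainer.global_step`, and `trainer.save_checkpoint(path)`; the exact hook signature has changed between Lightning versions, so extra arguments are absorbed with `*args, **kwargs`.

```python
import os

try:
    from pytorch_lightning import Callback
except ImportError:
    # Fallback so the sketch can be exercised without Lightning installed.
    Callback = object


class CheckpointEveryNSteps(Callback):
    """Hypothetical callback: save a checkpoint every N optimizer steps,
    independent of validation and of epoch boundaries."""

    def __init__(self, save_step_frequency, dirpath=".", prefix="latest"):
        super().__init__()
        self.save_step_frequency = save_step_frequency
        self.dirpath = dirpath
        self.prefix = prefix
        self._last_saved_step = None

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        step = trainer.global_step
        # Skip step 0 and any step that is not a multiple of the frequency.
        if step == 0 or step % self.save_step_frequency != 0:
            return
        # With gradient accumulation, several batches can share one
        # global_step; save only once per step.
        if step == self._last_saved_step:
            return
        # Overwrite a single file so the newest checkpoint is easy to find.
        path = os.path.join(self.dirpath, f"{self.prefix}.ckpt")
        trainer.save_checkpoint(path)
        self._last_saved_step = step
```

You would pass it to the trainer with something like `Trainer(callbacks=[CheckpointEveryNSteps(save_step_frequency=1000)])`. Overwriting one file keeps disk usage flat and makes "download the latest model" trivial; if you want a history instead, put the step number in the filename.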
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!