Checkpoint/saving a model in the middle of an epoch #3084

Closed
fishbotics opened this issue Aug 21, 2020 · 3 comments · Fixed by #3807
Labels
question (Further information is requested), won't fix (This will not be worked on)

Comments

fishbotics commented Aug 21, 2020

❓ Questions and Help

Hi all,

I have a network that trains slowly on a large dataset (roughly one week per epoch). In my previous pure-PyTorch version, I saved a checkpoint of the model every hour along the way, without any additional validation, just to make sure I didn't lose my progress. Is there a way to do something similar in Lightning? It doesn't need to be time-based; I just don't want to wait a week for the model to save. I also like to periodically download the latest model from the server and see how it's doing along the way.

Thanks!

fishbotics added the question label on Aug 21, 2020

fishbotics (author) commented

For what it's worth, I see that the ModelCheckpoint callback requires its period argument (counted in epochs) to be at least 1, which suggests to me that checkpointing the model in the middle of an epoch isn't straightforward.

https://github.com/PyTorchLightning/PyTorch-Lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L350

andrewjong commented Aug 22, 2020

Hi Fishyai, there's another thread about this; see #2534. We can save every N steps, independent of validation, via a Callback. I agree this information should be added to the Lightning docs, because it's not so straightforward for a newcomer.
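For readers landing here, a minimal sketch of what such a Callback might look like. This is not the code from #2534; the class name `PeriodicCheckpoint`, its arguments, and the `latest.ckpt` file name are all made up for illustration. It assumes only that the trainer exposes `global_step` and `save_checkpoint(path)`, and it falls back to a plain object base class when Lightning is not installed so the logic can be tried in isolation.

```python
import os
import time

try:
    from pytorch_lightning import Callback
except ImportError:  # fallback so the sketch can run without Lightning installed
    Callback = object


class PeriodicCheckpoint(Callback):
    """Hypothetical callback: save mid-epoch every N steps and/or every T seconds."""

    def __init__(self, dirpath, every_n_steps=None, every_n_seconds=None,
                 clock=time.monotonic):
        self.dirpath = dirpath
        self.every_n_steps = every_n_steps
        self.every_n_seconds = every_n_seconds
        self._clock = clock  # injectable so the timing logic can be tested
        self._last_save_time = clock()
        self._last_saved_step = None

    def _is_due(self, step):
        # Never save twice for the same step (e.g. with gradient accumulation,
        # global_step may stay constant across several batches).
        if step == self._last_saved_step:
            return False
        step_due = (
            self.every_n_steps is not None
            and step > 0
            and step % self.every_n_steps == 0
        )
        time_due = (
            self.every_n_seconds is not None
            and self._clock() - self._last_save_time >= self.every_n_seconds
        )
        return step_due or time_due

    # *args/**kwargs absorb the hook arguments that differ between versions.
    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        step = trainer.global_step
        if not self._is_due(step):
            return
        os.makedirs(self.dirpath, exist_ok=True)
        # Overwrite a single file so the newest checkpoint is easy to download.
        trainer.save_checkpoint(os.path.join(self.dirpath, "latest.ckpt"))
        self._last_save_time = self._clock()
        self._last_saved_step = step
```

It would be passed to the trainer along the lines of `Trainer(callbacks=[PeriodicCheckpoint("checkpoints", every_n_seconds=3600)])` for an hourly save. Newer Lightning releases also expose `every_n_train_steps` and `train_time_interval` on `ModelCheckpoint`, which may make a custom callback unnecessary; check the docs for your installed version.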

stale bot commented Oct 21, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

stale bot added the won't fix label on Oct 21, 2020
stale bot closed this as completed on Oct 28, 2020