ModelCheckpoint fails at garbage collecting checkpoint passed to Trainer.resume_from_checkpoint #5090

Closed
ORippler opened this issue Dec 11, 2020 · 3 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task), won't fix (This will not be worked on)

Comments

@ORippler
Contributor

🐛 Bug

When a checkpoint is passed to the Trainer via resume_from_checkpoint, it is not tracked or garbage collected by the ModelCheckpoint class. Instead, a new checkpoint is instantiated and garbage collected/updated as usual.

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1QJrLngpOZg1MOgAtZH5kRo_s6u-Hjh0n?usp=sharing
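
In outline, the reproduction amounts to the following (a minimal sketch, assuming a BoringModel-style LightningModule called `model`; the directory, monitor key and epoch counts are illustrative, not taken from the Colab):

```python
import os
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# First run: ModelCheckpoint saves and rotates its own checkpoints.
ckpt_cb = ModelCheckpoint(dirpath="checkpoints", monitor="val_loss", save_top_k=1)
trainer = Trainer(max_epochs=2, callbacks=[ckpt_cb])
trainer.fit(model)  # `model` is assumed to be a BoringModel-style LightningModule

resume_path = ckpt_cb.best_model_path

# Second run: resume from that checkpoint with a fresh ModelCheckpoint.
new_cb = ModelCheckpoint(dirpath="checkpoints", monitor="val_loss", save_top_k=1)
trainer = Trainer(max_epochs=4, callbacks=[new_cb],
                  resume_from_checkpoint=resume_path)
trainer.fit(model)

# The new ModelCheckpoint only rotates files it wrote itself, so the
# checkpoint we resumed from is still on disk even with save_top_k=1.
assert os.path.exists(resume_path)
```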

Expected behavior

The checkpoint passed to Trainer.resume_from_checkpoint is garbage collected.
If this is not the desired behavior, I think a sentence or two should be added to the docs on the intended behavior.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

If this is due to epoch mismatching and not a design choice, #5007, #4655 and #2401 may be related.

@ORippler ORippler added bug Something isn't working help wanted Open to be worked on labels Dec 11, 2020
@edenlightning edenlightning added the priority: 1 Medium priority task label Dec 11, 2020
@haideraltahan

haideraltahan commented Dec 18, 2020

Having the same issue. I believe this problem could be related to #5091, as I currently can neither load my model to continue training nor test my saved model. The model is always initialized from scratch.
@ORippler have you found a temporary solution?

@ORippler
Contributor Author

I don't know how this is related to your problem if your model is always initialized from scratch, i.e. fails to resume from a given checkpoint at all.

Resuming works properly for me; however, the checkpoint passed to the Trainer via resume_from_checkpoint is no longer removed during garbage collection (see also the Colab reproduction linked above).

My temporary solution is to perform the GC manually (i.e., remove the checkpoint passed to resume_from_checkpoint). I present a hotfix for #5091 inside that issue.
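
For illustration, the manual cleanup amounts to something like this (a hypothetical snippet, not ModelCheckpoint API; `new_cb` and `resume_path` are the objects from the sketch above):

```python
import os

# Hypothetical manual GC: once the fresh ModelCheckpoint has written a
# different "best" file, delete the checkpoint we resumed from.
if new_cb.best_model_path and new_cb.best_model_path != resume_path:
    if os.path.exists(resume_path):
        os.remove(resume_path)
```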

@stale

stale bot commented Jan 19, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jan 19, 2021
@stale stale bot closed this as completed Jan 26, 2021

3 participants