ModelCheckpoint fails at garbage collecting checkpoint passed to Trainer.resume_from_checkpoint #5090

Closed
ORippler opened this issue Dec 11, 2020 · 3 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task), won't fix (This will not be worked on)

Comments

@ORippler
Contributor

🐛 Bug

When a checkpoint is passed to the Trainer via resume_from_checkpoint, it is not tracked or garbage collected by the ModelCheckpoint class. Instead, a new checkpoint is instantiated and garbage collected/updated as usual.

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1QJrLngpOZg1MOgAtZH5kRo_s6u-Hjh0n?usp=sharing
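
In outline, the reproduction amounts to the following (a minimal sketch, assuming a BoringModel-style LightningModule called `model`; the directory, monitor key and epoch counts are illustrative, not taken from the Colab):

```python
import os
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# First run: ModelCheckpoint saves and rotates its own checkpoints.
ckpt_cb = ModelCheckpoint(dirpath="checkpoints", monitor="val_loss", save_top_k=1)
trainer = Trainer(max_epochs=2, callbacks=[ckpt_cb])
trainer.fit(model)  # `model` is assumed to be a BoringModel-style LightningModule

resume_path = ckpt_cb.best_model_path

# Second run: resume from that checkpoint with a fresh ModelCheckpoint.
new_cb = ModelCheckpoint(dirpath="checkpoints", monitor="val_loss", save_top_k=1)
trainer = Trainer(max_epochs=4, callbacks=[new_cb],
                  resume_from_checkpoint=resume_path)
trainer.fit(model)

# The new ModelCheckpoint only rotates files it wrote itself, so the
# checkpoint we resumed from is still on disk even with save_top_k=1.
assert os.path.exists(resume_path)
```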

Expected behavior

The checkpoint passed to Trainer.resume_from_checkpoint is garbage collected.
If this is not the desired behavior, I think a sentence or two should be added to the docs on the intended behavior.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

If this is due to epoch mismatching and not a design choice, #5007, #4655 and #2401 may be related.

@ORippler ORippler added bug Something isn't working help wanted Open to be worked on labels Dec 11, 2020
@edenlightning edenlightning added the priority: 1 Medium priority task label Dec 11, 2020
@haideraltahan

haideraltahan commented Dec 18, 2020

Having the same issue. I believe this problem could be related to #5091, as I currently can neither load my model to continue training nor test my saved model. The model is always initialized from scratch.
@ORippler have you found a temporary solution?

@ORippler
Contributor Author

I don't know how this is related to your problem if your model is always initialized from scratch, i.e. fails to resume from a given checkpoint at all.

Resuming works properly for me; however, the checkpoint passed to the Trainer via resume_from_checkpoint is no longer removed during garbage collection (see also the Colab reproduction linked above).

My temporary solution is to perform the GC manually (i.e., remove the checkpoint passed to resume_from_checkpoint). I present a hotfix for #5091 inside that issue.
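
For illustration, the manual cleanup amounts to something like this (a hypothetical snippet, not ModelCheckpoint API; `new_cb` and `resume_path` are the objects from the sketch above):

```python
import os

# Hypothetical manual GC: once the fresh ModelCheckpoint has written a
# different "best" file, delete the checkpoint we resumed from.
if new_cb.best_model_path and new_cb.best_model_path != resume_path:
    if os.path.exists(resume_path):
        os.remove(resume_path)
```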

@stale

stale bot commented Jan 19, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jan 19, 2021
@stale stale bot closed this as completed Jan 26, 2021

3 participants