Checkpoint saving is not working in custom loops KFold.py example #12098
Hey @FinnBehrendt, I've tried out the … When I set breakpoints in …
After investigating this issue more, I realized that the checkpointing state is carried over between folds. I believe what is happening is that the checkpoint state is not reset after each fold. For example, say that after fold 1 the minimum validation loss is 1.5 and the best checkpoint is saved. Fold 2 then trains and reaches a minimum validation loss of 2.0. Because the checkpoint state was not reset, fold 2's best score is worse than the 1.5 the callback still remembers from fold 1, so no checkpoint is saved for fold 2. What we need is to make the checkpoints of each fold independent of one another. Here's a related issue I opened: #12300.
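A minimal sketch of a possible workaround, assuming you stay with the loop-based example: reset the `ModelCheckpoint`'s tracked state at the start of each fold so the best-score comparison starts fresh. The attributes below do exist on `pytorch_lightning.callbacks.ModelCheckpoint`, but mutating them is not a supported API, and `reset_checkpoint_state` is a hypothetical helper:

```python
import os

import torch
from pytorch_lightning.callbacks import ModelCheckpoint


def reset_checkpoint_state(checkpoint: ModelCheckpoint, base_dir: str, fold: int) -> None:
    """Hypothetical helper: clear the state ModelCheckpoint carries between
    folds so each fold tracks its own best validation score."""
    checkpoint.best_model_score = None
    checkpoint.best_model_path = ""
    checkpoint.best_k_models = {}
    checkpoint.kth_best_model_path = ""
    checkpoint.kth_value = torch.tensor(float("inf"))  # assumes mode="min"
    # keep each fold's files separate so fold N cannot overwrite fold N-1
    checkpoint.dirpath = os.path.join(base_dir, f"fold_{fold}")
```

Calling this at the start of every fold would make each fold's "best" comparison independent, which is the behavior the example currently lacks.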
Yes, the KFold example is flawed, and checkpointing is just one place where things go wrong. To implement k-fold properly, one needs to loop externally and use a new trainer every time, without sharing any state between the runs (see the sketch below). While one could probably patch the checkpointing issue, systematic problems with this internal loop would remain. I wonder whether https://github.com/SkafteNicki/pl_cross has the same problem.
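A minimal sketch of that external-loop approach, assuming an sklearn `KFold` split and a standard `LightningModule`; `MyModel`, `dataset`, and `make_dataloaders` are placeholders for your own code:

```python
import numpy as np
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from sklearn.model_selection import KFold

# `dataset`, `MyModel`, and `make_dataloaders` are placeholders for your own code.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
    train_loader, val_loader = make_dataloaders(dataset, train_idx, val_idx)

    # a fresh model, callback, and trainer per fold -> no shared state
    model = MyModel()
    checkpoint = ModelCheckpoint(
        dirpath=f"checkpoints/fold_{fold}",
        monitor="val_loss",
        mode="min",
    )
    trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint])
    trainer.fit(model, train_loader, val_loader)
    print(f"fold {fold}: best checkpoint at {checkpoint.best_model_path}")
```

Because the `ModelCheckpoint` and `Trainer` are recreated on every iteration, each fold's best score is tracked independently and the cross-fold comparison problem described above cannot occur.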
It does have the same problem. We will be revisiting the KFold loop in the future.
Closing the issue, since special support for custom loops has been removed, and the example along with it.
🐛 Bug
When using KFold.py from the custom loop example, the checkpoint callback does not save checkpoints properly. Checkpoints are saved only for the first few folds; after that, they are no longer updated.
To Reproduce
For me, this error also appeared with the plain, unmodified KFold.py example.
Expected behavior
Checkpoints should be saved as specified in the `checkpoint_callback`.
Environment

- How you installed PyTorch Lightning (`conda`, `pip`, source): conda

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock