
Checkpoint saving is not working in custom loops KFold.py example #12098

Closed
FinnBehrendt opened this issue Feb 24, 2022 · 8 comments
Labels
bug Something isn't working checkpointing Related to checkpointing loops Related to the Loop API

Comments


FinnBehrendt commented Feb 24, 2022

🐛 Bug

When using KFold.py from the custom loop example, the checkpoint callback does not save checkpoints properly: checkpoints are saved only for the first few iterations, and after that they are no longer updated.

To Reproduce

For me, this error also appears with the plain example from here

Expected behavior

Checkpoints should be saved as specified in the checkpoint_callback

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.5.10, 1.5.0, 1.6.0
  • PyTorch Version (e.g., 1.10): 1.10
  • Python version (e.g., 3.9): 3.9
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: V100
  • How you installed PyTorch (conda, pip, source): conda

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock

@FinnBehrendt FinnBehrendt added the bug Something isn't working label Feb 24, 2022
@ananthsub ananthsub added loops Related to the Loop API checkpointing Related to checkpointing labels Mar 2, 2022
@JinLi711

Hey @FinnBehrendt,

I've tried the KFold.py example on the master branch and I can't reproduce your issue, so maybe try the master version of KFold.py.

When I set breakpoints in Kfold.py, I do see that checkpoints in lightning_logs/ are being updated continuously throughout the training process.

@JinLi711

After investigating this issue more, I realized that checkpointing in KFold.py is indeed an issue; I can reproduce your error after running KFold.py a few more times.

I believe that what is happening is that the checkpoint state is not reset between folds. For example, say that after fold 1 the minimum validation loss is 1.5 and the best checkpoint is saved. Then fold 2 trains and its minimum validation loss is 2.0. Because the checkpoint state was not reset, fold 2's best score of 2.0 never beats the persisted best score of 1.5 from fold 1, so no checkpoint is saved for fold 2.

What we need to do is to make the checkpoints of each fold independent of another.

Here's a related issue #12300 that I opened.
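The failure mode described above can be sketched without Lightning at all. This is a minimal, hypothetical simulation in which `best_score` stands in for the checkpoint callback's persisted best metric, and the per-fold losses are made-up illustrative numbers:

```python
# Minimal simulation of the stale-checkpoint behavior described above.
# "best_score" stands in for the checkpoint callback's persisted best metric.

def run_folds(fold_best_losses, reset_between_folds):
    """Return the list of folds for which a checkpoint would be saved."""
    best_score = float("inf")  # checkpoint state, shared across folds
    saved = []
    for fold, loss in enumerate(fold_best_losses):
        if reset_between_folds:
            best_score = float("inf")  # fresh state for every fold
        if loss < best_score:          # the "save a checkpoint" condition
            best_score = loss
            saved.append(fold)
    return saved

losses = [1.5, 2.0, 1.8]  # folds 2 and 3 never beat fold 1's 1.5
print(run_folds(losses, reset_between_folds=False))  # [0]
print(run_folds(losses, reset_between_folds=True))   # [0, 1, 2]
```

With shared state, only the first fold ever saves; resetting the state per fold restores the expected behavior.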

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 16, 2022
@stale stale bot closed this as completed Apr 24, 2022
@awaelchli awaelchli removed the won't fix This will not be worked on label Apr 24, 2022
@awaelchli awaelchli reopened this Apr 24, 2022
@awaelchli
Contributor

Yes, the KFold example is flawed, and checkpointing is just one place where things go wrong. To implement k-fold properly, one needs to loop externally, creating a new trainer for every fold and sharing no state between the runs. While one could probably fix the checkpointing issue, systematic issues with this internal loop would remain. I wonder if https://github.com/SkafteNicki/pl_cross has the same problem.
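The external loop suggested above can be sketched as follows. This is a minimal, hypothetical outline: the index splitting is plain Python, and the commented usage at the bottom uses `ModelCheckpoint`, `Trainer`, and `LitModel` only as stand-ins for the real PyTorch Lightning objects, showing where the fresh trainer and fresh checkpoint callback would be created per fold:

```python
# Sketch of the recommended external k-fold loop: split indices yourself,
# then build a *fresh* trainer and checkpoint callback per fold so no
# state leaks between runs.

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

# Hypothetical usage with Lightning (names are illustrative):
# for fold, (train_idx, val_idx) in enumerate(kfold_indices(len(dataset), k=5)):
#     checkpoint = ModelCheckpoint(dirpath=f"checkpoints/fold_{fold}")  # fresh state
#     trainer = Trainer(callbacks=[checkpoint])                          # fresh trainer
#     trainer.fit(LitModel(), train_loader(train_idx), val_loader(val_idx))
```

Because every fold gets its own trainer and callback instances, no best-score or other internal state survives from one fold to the next.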

@stale

stale bot commented Jun 6, 2022

@stale stale bot added the won't fix This will not be worked on label Jun 6, 2022
@carmocca
Copy link
Contributor

carmocca commented Jun 6, 2022

It does have the same problem. We will be revisiting the k-fold loop in the future.
cc @rasbt @SkafteNicki

@stale stale bot removed the won't fix This will not be worked on label Jun 6, 2022
@stale

stale bot commented Jul 10, 2022

@stale stale bot added the won't fix This will not be worked on label Jul 10, 2022
@carmocca carmocca added this to the future milestone Jul 11, 2022
@stale stale bot removed the won't fix This will not be worked on label Jul 11, 2022
@awaelchli
Contributor

Closing the issue since special support for custom loops has been removed, along with the example.

@awaelchli awaelchli closed this as not planned Apr 29, 2023
@awaelchli awaelchli removed this from the future milestone Apr 29, 2023