
Checkpoint saving is not working in custom loops KFold.py example #12098

Closed
FinnBehrendt opened this issue Feb 24, 2022 · 8 comments
Labels
bug Something isn't working checkpointing Related to checkpointing loops Related to the Loop API

Comments


FinnBehrendt commented Feb 24, 2022

🐛 Bug

When using KFold.py from the custom loop example, the checkpoint callback does not save checkpoints properly: checkpoints are saved only for the first few iterations, and after that they are no longer updated.

To Reproduce

For me, this error also appears with the plain example from here

Expected behavior

Checkpoints should be saved as specified in the checkpoint_callback

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.5.10, 1.5.0, 1.6.0
  • PyTorch Version (e.g., 1.10): 1.10
  • Python version (e.g., 3.9): 3.9
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: V100
  • How you installed PyTorch (conda, pip, source): conda

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock

@FinnBehrendt FinnBehrendt added the bug Something isn't working label Feb 24, 2022
@ananthsub ananthsub added loops Related to the Loop API checkpointing Related to checkpointing labels Mar 2, 2022
@JinLi711

Hey @FinnBehrendt,

I've tried the KFold.py example on the master branch and I can't reproduce your issue, so maybe try the master version of KFold.py.

When I set breakpoints in Kfold.py, I do see that checkpoints in lightning_logs/ are being updated continuously throughout the training process.

@JinLi711

After investigating this issue more, I realized that checkpointing in KFold.py is indeed an issue; I can reproduce your error after running KFold.py a few more times.

I believe that what is happening is that the checkpoint state is not reset between folds. For example, say that after fold 1 the minimum validation loss is 1.5 and the best checkpoint is saved. Then fold 2 trains and its minimum validation loss is 2.0. Because the checkpoint state was not reset, fold 2's best score of 2.0 never beats the persisted best score of 1.5 from fold 1, so no checkpoint is saved for fold 2.

What we need to do is to make the checkpoints of each fold independent of another.

Here's a related issue #12300 that I opened.
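The failure mode described above can be sketched without Lightning at all. This is a minimal, hypothetical simulation in which `best_score` stands in for the checkpoint callback's persisted best metric, and the per-fold losses are made-up illustrative numbers:

```python
# Minimal simulation of the stale-checkpoint behavior described above.
# "best_score" stands in for the checkpoint callback's persisted best metric.

def run_folds(fold_best_losses, reset_between_folds):
    """Return the list of folds for which a checkpoint would be saved."""
    best_score = float("inf")  # checkpoint state, shared across folds
    saved = []
    for fold, loss in enumerate(fold_best_losses):
        if reset_between_folds:
            best_score = float("inf")  # fresh state for every fold
        if loss < best_score:          # the "save a checkpoint" condition
            best_score = loss
            saved.append(fold)
    return saved

losses = [1.5, 2.0, 1.8]  # folds 2 and 3 never beat fold 1's 1.5
print(run_folds(losses, reset_between_folds=False))  # [0]
print(run_folds(losses, reset_between_folds=True))   # [0, 1, 2]
```

With shared state, only the first fold ever saves; resetting the state per fold restores the expected behavior.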

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 16, 2022
@stale stale bot closed this as completed Apr 24, 2022
@awaelchli awaelchli removed the won't fix This will not be worked on label Apr 24, 2022
@awaelchli awaelchli reopened this Apr 24, 2022
@awaelchli
Contributor

Yes, the KFold example is flawed, and checkpointing is just one place where things go wrong. To implement k-fold properly, one needs to loop externally, creating a new trainer for every fold and sharing no state between the runs. While one could probably fix the checkpointing issue, systematic issues with this internal loop would remain. I wonder if https://github.com/SkafteNicki/pl_cross has the same problem.
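The external loop suggested above can be sketched as follows. This is a minimal, hypothetical outline: the index splitting is plain Python, and the commented usage at the bottom uses `ModelCheckpoint`, `Trainer`, and `LitModel` only as stand-ins for the real PyTorch Lightning objects, showing where the fresh trainer and fresh checkpoint callback would be created per fold:

```python
# Sketch of the recommended external k-fold loop: split indices yourself,
# then build a *fresh* trainer and checkpoint callback per fold so no
# state leaks between runs.

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

# Hypothetical usage with Lightning (names are illustrative):
# for fold, (train_idx, val_idx) in enumerate(kfold_indices(len(dataset), k=5)):
#     checkpoint = ModelCheckpoint(dirpath=f"checkpoints/fold_{fold}")  # fresh state
#     trainer = Trainer(callbacks=[checkpoint])                          # fresh trainer
#     trainer.fit(LitModel(), train_loader(train_idx), val_loader(val_idx))
```

Because every fold gets its own trainer and callback instances, no best-score or other internal state survives from one fold to the next.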

@stale

stale bot commented Jun 6, 2022

@stale stale bot added the won't fix This will not be worked on label Jun 6, 2022
@carmocca
Copy link
Contributor

carmocca commented Jun 6, 2022

It does have the same problem. We will be revisiting the k-fold loop in the future.
cc @rasbt @SkafteNicki

@stale stale bot removed the won't fix This will not be worked on label Jun 6, 2022
@stale

stale bot commented Jul 10, 2022

@stale stale bot added the won't fix This will not be worked on label Jul 10, 2022
@carmocca carmocca added this to the future milestone Jul 11, 2022
@stale stale bot removed the won't fix This will not be worked on label Jul 11, 2022
@awaelchli
Contributor

Closing the issue since special support for custom loops has been removed, along with the example.

@awaelchli awaelchli closed this as not planned Apr 29, 2023
@awaelchli awaelchli removed this from the future milestone Apr 29, 2023