ModelCheckpoint Callback state loader with missing dir #15705
I wonder if it would be beneficial to allow the `ckpt_path` argument of `Trainer.fit()` to accept a dict loaded from a `.ckpt` file via `torch.load`. Then you could manually remove problematic state dicts if required, e.g.:
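A minimal sketch of that idea, assuming Lightning's checkpoint layout where callback state lives under a `"callbacks"` key (the key name and workflow here are illustrative, not an official API):

```python
# Load the checkpoint as a plain dict, strip the state that is coupled
# to the old machine's file paths, and save it back before resuming.

def strip_callback_state(checkpoint: dict) -> dict:
    """Return a shallow copy of the checkpoint dict without callback state."""
    cleaned = dict(checkpoint)
    cleaned.pop("callbacks", None)  # drop ModelCheckpoint (and other callback) state
    return cleaned

# Typical usage (requires torch):
#   import torch
#   ckpt = torch.load("model.ckpt", map_location="cpu")
#   torch.save(strip_callback_state(ckpt), "model_clean.ckpt")
#   trainer.fit(model, ckpt_path="model_clean.ckpt")
```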
@Stack-Attack Hello! I face the same issue, but on the same machine. I do the following:
Is using `torch.load` and removing the callback state the best way to handle this issue?
@ArtemSivtsov Yes. For now, if I make any large changes to a model or experiment, I create a new run, load the weights manually, and train without a checkpoint. Roughly the following logic:
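The workaround described above could look roughly like this sketch, which restores only the model weights so that none of the trainer or callback state (including `ModelCheckpoint`'s `best_model_path`) is reused; `MyLightningModule` is a placeholder name:

```python
# Pull only the weights out of a Lightning checkpoint dict, so training
# starts as a fresh run with pretrained parameters.

def extract_weights(checkpoint: dict) -> dict:
    """Return only the model state_dict from a Lightning checkpoint dict."""
    return checkpoint["state_dict"]

# Typical usage (requires torch / pytorch_lightning):
#   import torch
#   ckpt = torch.load("model.ckpt", map_location="cpu")
#   model = MyLightningModule(...)
#   model.load_state_dict(extract_weights(ckpt))
#   trainer.fit(model)  # no ckpt_path, so no callback state is restored
```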
@Stack-Attack Thank you so much for the quick reply! I hope the Lightning team will fix this behavior later :)
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Hey @Stack-Attack, @ArtemSivtsov
Bug description
Loading a checkpoint with the ModelCheckpoint callback on a different machine (or with a missing/moved `best_model_path` directory) results in an error and a crash.
A common use case for me is to train a model (with the .ckpt stored elsewhere, e.g. in Neptune), then pull the checkpoint onto another machine to continue training later. This used to work in older versions but now breaks. Currently, the code handles situations where the directory structure has changed, but not larger changes to the absolute file structure.
How to reproduce the bug
Error messages and logs
Environment
More info
It seems like the simplest and most straightforward solution is to not restore the ModelCheckpoint state at all if the directory has changed. There are more complex solutions (like checking each field), but given that this specific callback state is tightly coupled with the file structure, that seems ill-advised.
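A sketch of that proposed behavior, assuming the saved callback state records its `dirpath` (the key name is illustrative): on mismatch, the saved state is discarded instead of triggering a crash.

```python
# Compare the dirpath recorded in the saved ModelCheckpoint state with
# the callback's current dirpath; discard stale, path-coupled state.

def restorable_checkpoint_state(saved_state: dict, current_dirpath: str):
    """Return the saved callback state only if it was written under the
    same directory; otherwise return None so the callback starts fresh."""
    if saved_state.get("dirpath") != current_dirpath:
        return None  # directory moved or missing: state is stale
    return saved_state
```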