[RFC] Introduce .fit(ckpt_path="last")
#11912
Comments
Some n00b questions:
Hey @ananthsub.
Yes, this was my first idea. However, fault-tolerant checkpointing can be harder to get right, whereas an epoch-end checkpoint is quite stable. That is why we originally decided to keep them apart.
Not 100% sure I understand your question. However, the last checkpoint should ideally be the one with the most recent timestamp.
Yes, it still makes sense. The concept of "last" doesn't change. When fault tolerance is triggered, its checkpoint is indeed the latest one created. But when restarting, as soon as the new `ModelCheckpoint` creates a new checkpoint, the latest becomes that one.
I think checkpoints should be timestamped and the latest should be used.
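The timestamp-based selection Thomas describes can be sketched in a few lines of plain Python. This is an illustrative sketch, not Lightning's actual implementation; the file names and helper are hypothetical.

```python
# Hypothetical sketch: resolve "last" by file modification time, not by name.
import time
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint in ckpt_dir with the most recent mtime."""
    candidates = list(Path(ckpt_dir).glob("*.ckpt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo with fake checkpoint files (names are illustrative).
with tempfile.TemporaryDirectory() as d:
    for name in ("epoch=0-last.ckpt", ".fault_tolerant.ckpt", "epoch=1-last.ckpt"):
        Path(d, name).touch()
        time.sleep(0.01)  # ensure distinct modification times
    print(latest_checkpoint(d).name)  # prints the file touched last
```

Selecting by `st_mtime` sidesteps any assumptions about the checkpoint naming scheme, which matters once epoch numbers are appended to "last" checkpoints.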
I'll answer too, adding to Thomas' comments:
No, because the fault-tolerance checkpoint is saved on exception, so we cannot really guarantee that it will be usable on reload; on the other hand, "last" checkpoints are saved normally.
Because it's quite common to want to append the epoch number when you start multiple runs using the same directory. It would be harder if we went off the checkpoint name, so we should either use their timestamps or "listen" on the
I don't think it's redundant, for the reasons above.
Currently, they are overwritten. There's a proposal to avoid this in #5030.
🚀 Feature
Add support for passing just `"last"` to `trainer.{fit,validate,test,predict}(ckpt_path=...)`.
This would pick the latest model checkpoint saved from the set of candidates (e.g. the `ModelCheckpoint` "last" checkpoints and the fault-tolerance checkpoint).
The tracking logic would most likely be inside the checkpoint connector.
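A minimal sketch of what that tracking logic could look like, assuming the connector compares every known "last" candidate by modification time. The function name, parameters, and default-handling are hypothetical illustrations of the proposal, not Lightning's real API.

```python
# Hypothetical sketch of ckpt_path="last" resolution inside a checkpoint
# connector. All names here are illustrative, not the actual Lightning API.
import os
from typing import Optional


def resolve_ckpt_path(ckpt_path: Optional[str],
                      last_model_path: str,
                      ft_ckpt_path: str,
                      fault_tolerant: bool) -> Optional[str]:
    if ckpt_path is None and fault_tolerant:
        ckpt_path = "last"  # proposed default when fault tolerance is enabled
    if ckpt_path != "last":
        return ckpt_path  # explicit path or None: pass through unchanged
    # Gather whichever candidate checkpoints actually exist on disk.
    candidates = [p for p in (last_model_path, ft_ckpt_path)
                  if p and os.path.exists(p)]
    if not candidates:
        return None  # nothing saved yet: start training from scratch
    # The most recently written checkpoint wins, regardless of its name.
    return max(candidates, key=os.path.getmtime)
```

The key point of the sketch is the last line: the user never has to know whether the newest checkpoint came from `ModelCheckpoint` or from fault tolerance.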
Also, just for `trainer.fit`, `ckpt_path` should default to `"last"` when fault tolerance is enabled.
Motivation
Until now, we've recommended passing `trainer.model_checkpoint.last_model_path`, but with fault tolerance enabled, the last model path might be one generated by fault tolerance. Fault-tolerance logic is meant to be hidden from users, so we don't want them to track this.
Alternatives
The best alternative (after #11862) would be:
but it does not consider that `model_checkpoint.last_model_path` could have been saved after. We should track this for the user.
Additional context
Proposed by @tchaton
cc @Borda @tchaton @justusschock @awaelchli @ananthsub @ninginthecloud @rohitgr7 @carmocca