Finish Allow on_save_checkpoint... #3688
Conversation
Hello @williamFalcon! Thanks for updating this PR.
Comment last updated at 2020-09-30 19:48:43 UTC
Finish #3562
thank you for finishing this!
lol... idk haha. i thought i fixed all the tests... i need to finish the results stuff, but if someone could pick this up that'd be great :)
@@ -220,34 +218,6 @@ def test_none_monitor_save_last(tmpdir):
    ModelCheckpoint(filepath=tmpdir, save_last=False)


def test_model_checkpoint_save_last_checkpoint_contents(tmpdir):
this test is 1:1 the same as the one a few lines above, so I removed it.
if self.save_top_k:
    if self.save_top_k == -1:
        self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
    else:
        self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

# Mode 2: save the last checkpoint
self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)
i had to switch the order here. We need to run save_last after topk, because the topk code tracks the best model path.
otherwise the last.ckpt will not point to the correct "best" model
the new test below checks that.
@ananthsub @williamFalcon
there is a failing test, not sure what to do about it yet, will look into it soon
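A rough sketch of the kind of check the new test performs (the test name, the DummyModel stand-in, and the exact asserts are hypothetical, not the actual test added in this PR):

import os
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

def test_last_ckpt_points_to_best_model(tmpdir):
    # DummyModel is a placeholder for whatever LightningModule the test suite uses
    model = DummyModel()
    ckpt = ModelCheckpoint(filepath=tmpdir, monitor="val_loss", save_top_k=1, save_last=True)
    trainer = Trainer(default_root_dir=tmpdir, max_epochs=2, checkpoint_callback=ckpt)
    trainer.fit(model)
    # the top-k pass now runs before save_last, so best_model_path is already
    # populated by the time last.ckpt is written
    assert ckpt.best_model_path
    assert os.path.isfile(os.path.join(tmpdir, "last.ckpt"))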
if self.save_top_k:
    if self.save_top_k == -1:
        self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
    else:
        self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)
this logic already feels too complex. i hope solving #2586 will force us to consolidate some of this
Ananth, this is exactly what I was thinking. Saving all models is really a special case of save_top_k and should be handled there. It becomes obvious when we look at the fix here: #3735
Would appreciate your feedback there too!
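A hypothetical sketch of what that consolidation could look like (just the shape of the idea, not the code from #3735):

# if save_top_k == -1 is treated as "keep every checkpoint", the top-k bookkeeping
# simply never evicts anything, so the separate _save_all_checkpoints branch disappears
if self.save_top_k:
    self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

# save_last still runs afterwards so it sees the updated best_model_path
self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)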
yeah... i think we need to come back to this modelcheckpoint after 1.0 with a nice clean re-write.
It's gotten super messy now haha
thanks for cleaning up and fixing tests @awaelchli !
if self.save_top_k:
    if self.save_top_k == -1:
        self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
    else:
        self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

# Mode 2: save the last checkpoint
self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)

def __validate_init_configuration(self):
    if self.save_top_k is not None and self.save_top_k < -1:
        raise MisconfigurationException(
in __init_ckpt_dir, you could gate L261 behind trainer.is_global_zero to avoid unnecessary file I/O outside rank 0
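A minimal sketch of the suggested gating (it assumes a trainer reference and a self.dirpath attribute were in scope here; the reply below points out the trainer is not available at init time):

import os

# hypothetical: only rank 0 touches the filesystem
if trainer.is_global_zero:
    os.makedirs(self.dirpath, exist_ok=True)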
true, this one we missed. but we don't have access to the trainer, nor the global rank, at the point when init happens.
I'm wondering if we even need to call makedirs at this point. We could delay it, no?
yes, it could be delayed to on_pretrain_routine_start
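A sketch of the deferred version (the hook name comes from the comment above; the body and the self.dirpath attribute are assumptions, not the eventual implementation):

import os

def on_pretrain_routine_start(self, trainer, pl_module):
    # hypothetical: by this point the trainer and process ranks exist,
    # so the checkpoint directory can be created once, on rank 0 only
    if trainer.is_global_zero and self.dirpath:
        os.makedirs(self.dirpath, exist_ok=True)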
This pull request is now in conflict... :(
hey @ananthsub turns out this PR may actually become more valuable than we initially thought. Thanks for bringing this up. Seems like this is super helpful for the ddp challenges we were facing! cheers!
Finishes the stale PR #3562
cc @ananthsub