
Finish Allow on_save_checkpoint... #3688

Merged: 40 commits from r3n into master on Sep 30, 2020

Conversation

@williamFalcon (Contributor) commented Sep 28, 2020

Finishes the stale PR #3562

cc @ananthsub

@pep8speaks commented Sep 28, 2020

Hello @williamFalcon! Thanks for updating this PR.

Line 471:17: W503 line break before binary operator
Line 472:17: W503 line break before binary operator
Line 473:17: W503 line break before binary operator

Comment last updated at 2020-09-30 19:48:43 UTC

@mergify mergify bot requested a review from a team September 28, 2020 01:32
@williamFalcon (Contributor, Author) commented Sep 28, 2020

Finish #3562

@ananthsub (Contributor) left a review comment:

thank you for finishing this!

@mergify mergify bot requested a review from a team September 28, 2020 06:32
@Borda Borda changed the title Finish #3562 Finish Allow on_save_checkpoint... Sep 28, 2020
@Borda Borda added the feature Is an improvement or enhancement label Sep 28, 2020
Review threads (outdated, resolved): tests/callbacks/test_model_checkpoint.py, pytorch_lightning/callbacks/model_checkpoint.py
@mergify mergify bot requested a review from a team September 28, 2020 07:39
@williamFalcon (Contributor, Author) commented:

lol... idk haha. i thought i fixed all the tests...

i need to finish the results stuff, but if someone could pick this up that'd be great :)

@awaelchli @ananthsub

@@ -220,34 +218,6 @@ def test_none_monitor_save_last(tmpdir):
    ModelCheckpoint(filepath=tmpdir, save_last=False)


def test_model_checkpoint_save_last_checkpoint_contents(tmpdir):
@awaelchli (Contributor) commented Sep 30, 2020:

This test is 1:1 the same as the one a few lines above, so I removed it.

        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

        # Mode 2: save the last checkpoint
        self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)
Contributor commented:

I had to switch the order here. We need to run save_last after the top-k saving, because the top-k code tracks the best model path; otherwise last.ckpt would not point to the correct "best" model.
The new test below checks that.
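
For a rough sense of what such a test can assert, here is a minimal sketch (not the PR's actual test; the BoringModel import and the checkpoint_callback_best_model_path key are assumptions): after training with top-k tracking and save_last=True, the state dumped into last.ckpt should reference the same best model that the top-k bookkeeping selected.

import os
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from tests.base.boring_model import BoringModel  # assumption: any minimal LightningModule works here


def test_last_ckpt_tracks_best_model(tmpdir):
    ckpt_cb = ModelCheckpoint(filepath=os.path.join(tmpdir, "{epoch}"), save_top_k=1, save_last=True)
    trainer = Trainer(default_root_dir=tmpdir, max_epochs=2, checkpoint_callback=ckpt_cb)
    trainer.fit(BoringModel())

    # top-k saving ran before save_last, so best_model_path was already
    # up to date when last.ckpt was written
    last = torch.load(os.path.join(tmpdir, "last.ckpt"))
    assert last["checkpoint_callback_best_model_path"] == ckpt_cb.best_model_path
    assert os.path.exists(ckpt_cb.best_model_path)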

@awaelchli (Contributor) commented:

@ananthsub @williamFalcon
so I think I finished this PR. Summary

  • all the checkpoint methods run on all ranks, in particular on_save_checkpoint
  • this means modelcheckpoint tracks best model score etc on all ranks (whether this is good or not, I don't know)
  • it is the responsibility of checkpoint_callback.save_function to save files only on rank 0. The user can switch it out if they want (see the sketch at the end of this comment).
  • My test asserts that the on_save_checkpoint is called on all ranks and that torch.save is only called on rank 0.

there is a failing test, not sure what to do about it yet, will look into it soon
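
To make the rank-0 point concrete, here is a minimal sketch (not code from this PR) of how a user could swap out save_function, assuming the attribute is exposed as described above and that the Trainer does not re-assign it afterwards:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities import rank_zero_only

checkpoint_callback = ModelCheckpoint(save_top_k=3, save_last=True)
trainer = Trainer(checkpoint_callback=checkpoint_callback, gpus=2, distributed_backend="ddp")


@rank_zero_only
def save_on_rank_zero(filepath, weights_only=False):
    # delegate to the trainer's normal save; rank_zero_only turns this into
    # a no-op on every rank except global rank 0
    trainer.save_checkpoint(filepath, weights_only)


# on_save_checkpoint and the best-model bookkeeping still run on all ranks,
# but the actual file I/O now happens only on rank 0
checkpoint_callback.save_function = save_on_rank_zero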

Comment on lines 211 to 215
        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)
@ananthsub (Contributor) commented Sep 30, 2020:

this logic already feels too complex. i hope solving #2586 will force us to consolidate some of this

Contributor commented:

Ananth, this is exactly what I was thinking. Saving all models is really a special case of save_top_k and should be handled there. It becomes obvious when we look at the fix here: #3735
Would appreciate your feedback there too!
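
Roughly, the consolidation would fold the save-all branch into the top-k check, along these lines (a sketch of the idea only, not the actual diff in #3735; the best_k_models and kth_best_model_path attributes are assumed from the existing callback):

import torch


def check_monitor_top_k(self, current) -> bool:
    # save_top_k == -1 means "keep everything": every candidate qualifies,
    # so the dedicated _save_all_checkpoints branch is no longer needed
    if self.save_top_k == -1:
        return True
    if len(self.best_k_models) < self.save_top_k:
        return True

    if not isinstance(current, torch.Tensor):
        current = torch.tensor(current)
    monitor_op = {"min": torch.lt, "max": torch.gt}[self.mode]
    return monitor_op(current, self.best_k_models[self.kth_best_model_path]).item()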

Contributor (Author) commented:

yeah... i think we need to come back to this modelcheckpoint after 1.0 with a nice clean re-write.
It's gotten super messy now haha

@ananthsub (Contributor) left a review comment:

thanks for cleaning up and fixing tests @awaelchli !

        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

        # Mode 2: save the last checkpoint
        self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)

    def __validate_init_configuration(self):
        if self.save_top_k is not None and self.save_top_k < -1:
            raise MisconfigurationException(
Contributor commented:

In __init_ckpt_dir, you could gate L261 behind trainer.is_global_zero to avoid unnecessary file I/O outside rank 0.

Contributor commented:

True, this one we missed, but we don't have access to the trainer, nor the global rank, at the point where init happens.
I'm wondering if we even need to call makedirs at this point. We could delay it, no?

Contributor commented:

yes it could be delayed to on pretrain routine start
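
A minimal sketch of that delay (a hypothetical subclass for illustration; the hook name follows the suggestion above, and the dirpath attribute and hook availability here are assumptions rather than the merged code):

import os
from pytorch_lightning.callbacks import ModelCheckpoint


class LazyDirModelCheckpoint(ModelCheckpoint):
    # sketch: defer checkpoint-directory creation until the trainer is attached

    def on_pretrain_routine_start(self, trainer, pl_module):
        # by this point the global rank is known, so only rank 0 touches the
        # filesystem; the other ranks skip the unnecessary file I/O
        if trainer.is_global_zero and self.dirpath:
            os.makedirs(self.dirpath, exist_ok=True)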

mergify bot commented Sep 30, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team September 30, 2020 19:00
@williamFalcon williamFalcon merged commit cf182e8 into master Sep 30, 2020
@awaelchli (Contributor) commented:

hey @ananthsub turns out this PR may actually become more valuable than we initially thought. Thanks for bringing this up. Seems like this is super helpful for the ddp challenges we were facing! cheers!

@awaelchli awaelchli deleted the r3n branch September 30, 2020 20:23
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020
Labels: feature (Is an improvement or enhancement)
6 participants