
Finish Allow on_save_checkpoint... #3688

Merged: 40 commits from r3n into master on Sep 30, 2020

Conversation

@williamFalcon (Contributor) commented Sep 28, 2020

Finishes the stale PR #3562

cc @ananthsub

@pep8speaks commented Sep 28, 2020

Hello @williamFalcon! Thanks for updating this PR.

Line 471:17: W503 line break before binary operator
Line 472:17: W503 line break before binary operator
Line 473:17: W503 line break before binary operator

Comment last updated at 2020-09-30 19:48:43 UTC

@mergify mergify bot requested a review from a team September 28, 2020 01:32
@williamFalcon (Contributor, Author) commented Sep 28, 2020

Finish #3562

@ananthsub (Contributor) left a review comment:

thank you for finishing this!

@mergify mergify bot requested a review from a team September 28, 2020 06:32
@Borda Borda changed the title Finish #3562 Finish Allow on_save_checkpoint... Sep 28, 2020
@Borda Borda added the feature Is an improvement or enhancement label Sep 28, 2020
Review threads (outdated, resolved): tests/callbacks/test_model_checkpoint.py, pytorch_lightning/callbacks/model_checkpoint.py
@mergify mergify bot requested a review from a team September 28, 2020 07:39
@williamFalcon (Contributor, Author) commented:

lol... idk haha. i thought i fixed all the tests...

i need to finish the results stuff, but if someone could pick this up that'd be great :)

@awaelchli @ananthsub

@@ -220,34 +218,6 @@ def test_none_monitor_save_last(tmpdir):
    ModelCheckpoint(filepath=tmpdir, save_last=False)


def test_model_checkpoint_save_last_checkpoint_contents(tmpdir):
@awaelchli (Contributor) commented Sep 30, 2020:

This test is 1:1 the same as the one a few lines above, so I removed it.

        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

        # Mode 2: save the last checkpoint
        self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)
Contributor commented:

I had to switch the order here. We need to run save_last after the top-k saving, because the top-k code tracks the best model path; otherwise last.ckpt would not point to the correct "best" model.
The new test below checks that.
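
For a rough sense of what such a test can assert, here is a minimal sketch (not the PR's actual test; the BoringModel import and the checkpoint_callback_best_model_path key are assumptions): after training with top-k tracking and save_last=True, the state dumped into last.ckpt should reference the same best model that the top-k bookkeeping selected.

import os
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from tests.base.boring_model import BoringModel  # assumption: any minimal LightningModule works here


def test_last_ckpt_tracks_best_model(tmpdir):
    ckpt_cb = ModelCheckpoint(filepath=os.path.join(tmpdir, "{epoch}"), save_top_k=1, save_last=True)
    trainer = Trainer(default_root_dir=tmpdir, max_epochs=2, checkpoint_callback=ckpt_cb)
    trainer.fit(BoringModel())

    # top-k saving ran before save_last, so best_model_path was already
    # up to date when last.ckpt was written
    last = torch.load(os.path.join(tmpdir, "last.ckpt"))
    assert last["checkpoint_callback_best_model_path"] == ckpt_cb.best_model_path
    assert os.path.exists(ckpt_cb.best_model_path)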

@awaelchli (Contributor) commented:

@ananthsub @williamFalcon
so I think I finished this PR. Summary

  • all the checkpoint methods run on all ranks, in particular on_save_checkpoint
  • this means modelcheckpoint tracks best model score etc on all ranks (whether this is good or not, I don't know)
  • it is the responsibility of checkpoint_callback.save_function to save files only on rank 0. The user can switch it out if they want (see the sketch at the end of this comment).
  • My test asserts that the on_save_checkpoint is called on all ranks and that torch.save is only called on rank 0.

there is a failing test, not sure what to do about it yet, will look into it soon
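
To make the rank-0 point concrete, here is a minimal sketch (not code from this PR) of how a user could swap out save_function, assuming the attribute is exposed as described above and that the Trainer does not re-assign it afterwards:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities import rank_zero_only

checkpoint_callback = ModelCheckpoint(save_top_k=3, save_last=True)
trainer = Trainer(checkpoint_callback=checkpoint_callback, gpus=2, distributed_backend="ddp")


@rank_zero_only
def save_on_rank_zero(filepath, weights_only=False):
    # delegate to the trainer's normal save; rank_zero_only turns this into
    # a no-op on every rank except global rank 0
    trainer.save_checkpoint(filepath, weights_only)


# on_save_checkpoint and the best-model bookkeeping still run on all ranks,
# but the actual file I/O now happens only on rank 0
checkpoint_callback.save_function = save_on_rank_zero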

Comment on lines 211 to 215
        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)
@ananthsub (Contributor) commented Sep 30, 2020:

this logic already feels too complex. i hope solving #2586 will force us to consolidate some of this

Contributor commented:

Ananth, this is exactly what I was thinking. Saving all models is really a special case of save_top_k and should be handled there. It becomes obvious when we look at the fix here: #3735
Would appreciate your feedback there too!
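
Roughly, the consolidation would fold the save-all branch into the top-k check, along these lines (a sketch of the idea only, not the actual diff in #3735; the best_k_models and kth_best_model_path attributes are assumed from the existing callback):

import torch


def check_monitor_top_k(self, current) -> bool:
    # save_top_k == -1 means "keep everything": every candidate qualifies,
    # so the dedicated _save_all_checkpoints branch is no longer needed
    if self.save_top_k == -1:
        return True
    if len(self.best_k_models) < self.save_top_k:
        return True

    if not isinstance(current, torch.Tensor):
        current = torch.tensor(current)
    monitor_op = {"min": torch.lt, "max": torch.gt}[self.mode]
    return monitor_op(current, self.best_k_models[self.kth_best_model_path]).item()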

Contributor (Author) commented:

yeah... i think we need to come back to this modelcheckpoint after 1.0 with a nice clean re-write.
It's gotten super messy now haha

@ananthsub (Contributor) left a review comment:

thanks for cleaning up and fixing tests @awaelchli !

        if self.save_top_k:
            if self.save_top_k == -1:
                self._save_all_checkpoints(trainer, pl_module, epoch, filepath)
            else:
                self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)

        # Mode 2: save the last checkpoint
        self._save_last_checkpoint(trainer, pl_module, epoch, monitor_candidates, filepath)

    def __validate_init_configuration(self):
        if self.save_top_k is not None and self.save_top_k < -1:
            raise MisconfigurationException(
Contributor commented:

In __init_ckpt_dir, you could gate L261 behind trainer.is_global_zero to avoid unnecessary file I/O outside rank 0.

Contributor commented:

True, this one we missed, but we don't have access to the trainer, nor the global rank, at the point where init happens.
I'm wondering if we even need to call makedirs at this point. We could delay it, no?

Contributor commented:

yes it could be delayed to on pretrain routine start
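
A minimal sketch of that delay (a hypothetical subclass for illustration; the hook name follows the suggestion above, and the dirpath attribute and hook availability here are assumptions rather than the merged code):

import os
from pytorch_lightning.callbacks import ModelCheckpoint


class LazyDirModelCheckpoint(ModelCheckpoint):
    # sketch: defer checkpoint-directory creation until the trainer is attached

    def on_pretrain_routine_start(self, trainer, pl_module):
        # by this point the global rank is known, so only rank 0 touches the
        # filesystem; the other ranks skip the unnecessary file I/O
        if trainer.is_global_zero and self.dirpath:
            os.makedirs(self.dirpath, exist_ok=True)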

mergify bot commented Sep 30, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team September 30, 2020 19:00
@williamFalcon williamFalcon merged commit cf182e8 into master Sep 30, 2020
@awaelchli (Contributor) commented:

hey @ananthsub turns out this PR may actually become more valuable than we initially thought. Thanks for bringing this up. Seems like this is super helpful for the ddp challenges we were facing! cheers!

@awaelchli awaelchli deleted the r3n branch September 30, 2020 20:23
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020
Labels: feature (Is an improvement or enhancement)
6 participants