Add step index in checkpoint name #3807

Borda · 2020-10-02T22:39:30Z

What does this PR do?

Fixes #1764
Fixes #3789
Fixes #1966
Fixes #1758
Fixes #3084

Partially implements what is requested in #4335

TODO

fix remaining tests
update docs with option of step
integrate step into sub epoch test case

Example:

This saves a checkpoint 4 times per epoch with step and epoch index in the name of the file.

trainer = Trainer(
    val_check_interval=0.25,
    callbacks=[ModelCheckpoint(save_top_k=-1, monitor="valid_loss")],
)
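
As a rough illustration of the file names this setup is expected to produce, the sketch below works out where validation would run in a hypothetical epoch of 100 training batches and formats a name from the epoch and step indices. The `epoch=...-step=....ckpt` pattern, the helper names, and the convention that the step counts completed batches are assumptions made for illustration, not Lightning's actual implementation.

```python
from typing import List


def validation_steps(num_batches: int, val_check_interval: float) -> List[int]:
    """Return the number of completed batches at each validation check in one epoch."""
    # Assumption: a fractional interval is converted to a whole number of batches.
    every_n = max(1, int(num_batches * val_check_interval))
    return list(range(every_n, num_batches + 1, every_n))


def checkpoint_name(epoch: int, step: int) -> str:
    """Format a hypothetical checkpoint file name carrying both indices."""
    return f"epoch={epoch}-step={step}.ckpt"


# With val_check_interval=0.25 there are four validation checks per epoch,
# so four distinct file names are produced for epoch 0.
names = [checkpoint_name(0, s) for s in validation_steps(100, 0.25)]
print(names)
```

Because the step index is part of the name, the four checkpoints written within one epoch no longer collide on the same file name.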

Borda · 2020-10-02T23:07:56Z

If I understand the two priority bugs correctly, people expect a checkpoint to be saved at every Cx, or at the very least not to be saved only at C1 while all the others are skipped.

train: ----------------------------------
val: ...........-C1-...........-C2-..........-C3-

Saving only C1 is certainly not good, but saving C3 just because it is the last check in the epoch, even when C1 has the better score (for example because of overfitting), does not make sense either...
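
The scenario can be made concrete with a small simulation. It assumes a lower-is-better validation loss and uses a hypothetical `keep_top_k` helper (not part of Lightning) to decide which of the three checks in the epoch should survive. The loss values are made up.

```python
from typing import Dict, List


def keep_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the names of the checkpoints to keep; a lower score is better.

    k == -1 keeps every checkpoint.
    """
    if k == -1:
        return list(scores)
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:k]]


# Hypothetical validation losses at the three checks in one epoch;
# the model overfits after C1, so the loss rises.
val_loss = {"C1": 0.30, "C2": 0.35, "C3": 0.42}

print(keep_top_k(val_loss, k=1))   # ['C1']
print(keep_top_k(val_loss, k=-1))  # ['C1', 'C2', 'C3']
```

Under these assumptions the checkpoint that survives is chosen by score rather than by position in the epoch, and `k=-1` keeps all of them.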

williamFalcon · 2020-10-02T23:13:04Z

Context:
Not ALL epochs take 1 minute lol... Many research use cases train for only 1-2 epochs, where each epoch may take days or weeks.

Checking val every few steps or hours is therefore the correct way to go, and a checkpoint should be created whenever that condition is met.

If people want to check val only once per epoch, they can pass in the proper flags. But the expectation is that every time val is checked a checkpoint could be saved.

Let's make sure we explicitly test for this to make sure we don't regress.

pep8speaks · 2020-10-02T23:40:54Z

Hello @Borda! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-02 11:47:45 UTC

williamFalcon · 2020-10-03T12:08:39Z

@awaelchli can you review?

mergify · 2020-10-03T16:34:17Z

This pull request is now in conflict... :(

Borda · 2020-10-04T00:03:49Z

@awaelchli @williamFalcon mind checking this before I update the docs?
btw, the PT 1.7 failure is addressed in #3821 and #3833

Borda · 2020-10-04T00:08:32Z

tests/trainer/test_trainer.py

@@ -378,90 +378,6 @@ def test_dp_output_reduce():
    assert reduced['b']['c'] == out['b']['c']

These tests were moved to the checkpoint test file.

Borda · 2020-10-04T07:05:57Z

will be rebased on #3838

mergify · 2020-10-04T17:37:55Z

This pull request is now in conflict... :(

ananthsub · 2020-10-26T00:09:52Z

pytorch_lightning/callbacks/model_checkpoint.py

@@ -495,9 +496,10 @@ def _monitor_candidates(self, trainer):
        ckpt_name_metrics = deepcopy(trainer.logger_connector.logged_metrics)
        ckpt_name_metrics.update(trainer.logger_connector.callback_metrics)
        ckpt_name_metrics.update(trainer.logger_connector.progress_bar_metrics)
+        ckpt_name_metrics.update({"step": trainer.global_step, "epoch": trainer.current_epoch})


This update also happens in _format_checkpoint_name. Is it possible to consolidate it in a single place? Is there a risk the two can fall out of sync?

Borda · Oct 26, 2020

Which point do you have in mind?

SeanNaren · Nov 1, 2020

@ananthsub could you explain further what you mean? I think I understand, but I need some clarity here.
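
One way to read the consolidation request is sketched below. The two functions are simplified, hypothetical stand-ins for `_monitor_candidates` and `_format_checkpoint_name`; their signatures and the name pattern are assumptions, not the code in this PR. The idea is that step and epoch are added to the metrics dict in exactly one place, and the name formatter only reads them back.

```python
from copy import deepcopy
from typing import Any, Dict


def monitor_candidates(
    logged: Dict[str, Any],
    callback: Dict[str, Any],
    global_step: int,
    current_epoch: int,
) -> Dict[str, Any]:
    """Collect the metrics and add step/epoch exactly once."""
    metrics = deepcopy(logged)
    metrics.update(callback)
    metrics.update({"step": global_step, "epoch": current_epoch})
    return metrics


def format_checkpoint_name(metrics: Dict[str, Any]) -> str:
    """Build a name from the metrics dict only, without re-reading trainer state."""
    return f"epoch={metrics['epoch']}-step={metrics['step']}"


candidates = monitor_candidates({"train_loss": 0.9}, {"valid_loss": 0.4}, 50, 0)
print(format_checkpoint_name(candidates))
```

With a single writer for the two keys, the monitored values and the file name cannot disagree about which step or epoch they describe.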

ananthsub · 2020-10-26T00:10:28Z

pytorch_lightning/callbacks/model_checkpoint.py

+        epoch = metrics.get("epoch")
+        step = metrics.get("step")


same comment as above

williamFalcon · 2020-10-26T00:23:17Z

nice addition! excited to get this merged

tchaton

Great PR !

mergify · 2020-10-26T18:14:35Z

This pull request is now in conflict... :(

tchaton

Awesome addition !

* true final value of global step
* ch check
* tests
* save each validation interval
* wip
* add test
* add test
* wip
* fix tests, revert old edits, fix merge conflicts, update doctests
* test + bugfix
* sort files
* format test
* suggestion by ananth
* added changelog
* naming
* docs
* example
* suggestion

Co-authored-by: Carlos Mocholí <[email protected]>

* fix test
* pep
* pep

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

(cherry picked from commit ef03c39)

Borda added the bug and feature labels Oct 2, 2020

mergify bot requested a review from a team October 2, 2020 22:40

Borda added the design and discussion labels Oct 2, 2020

chiragraman mentioned this pull request Oct 3, 2020

Checkpointing and Early Stopping fail to work correctly when increasing number of train batches (in some cases) #3789

Closed

Borda changed the title from "save each validation interval" to "save each validation interval [wip]" Oct 3, 2020

awaelchli reviewed Oct 3, 2020

pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)

mergify bot requested a review from a team October 3, 2020 13:06

Borda force-pushed the fix/checkpoint-interval branch from bff3f71 to 8a20989 Compare October 3, 2020 18:58

Borda requested a review from awaelchli October 4, 2020 00:05

Borda commented Oct 4, 2020

Borda mentioned this pull request Oct 4, 2020

Fix/true end step #3838

Closed

Borda force-pushed the fix/checkpoint-interval branch from 4dfd824 to 6e2b2bc Compare October 4, 2020 08:01

Borda removed the bug label Oct 5, 2020

Borda added 7 commits October 6, 2020 00:16

true final value of global step

85e9ccf

ch check

d32396f

tests

0550bc3

save each validation interval

cbbd06c

wip

07bb754

add test

f681d31

add test

fde3fe4

ananthsub reviewed Oct 26, 2020

pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)

ananthsub reviewed Oct 26, 2020

pytorch_lightning/callbacks/model_checkpoint.py (resolved)

tchaton approved these changes Oct 26, 2020

pytorch_lightning/callbacks/model_checkpoint.py (resolved)

example

254f56c

ananthsub approved these changes Oct 26, 2020

carmocca reviewed Oct 26, 2020

pytorch_lightning/trainer/evaluation_loop.py (outdated, resolved)

rohitgr7 and others added 2 commits October 26, 2020 23:33

Merge branch 'master' into fix/checkpoint-interval

c57247d

suggestion

febd53b

Co-authored-by: Carlos Mocholí <[email protected]>

awaelchli added 3 commits October 30, 2020 07:09

Merge branch 'master' into fix/checkpoint-interval

2fcd2eb

Merge branch 'master' into fix/checkpoint-interval

ae1e3c3

fix test

9bb5fd2

awaelchli approved these changes Nov 1, 2020

pep

ad02f33

tchaton self-requested a review November 2, 2020 08:09

tchaton approved these changes Nov 2, 2020

justusschock approved these changes Nov 2, 2020

rohitgr7 reviewed Nov 2, 2020

rohitgr7 added 2 commits November 2, 2020 15:19

pep

c0882c4

Merge branch 'master' into fix/checkpoint-interval

86b5255

rohitgr7 approved these changes Nov 2, 2020

Merge branch 'master' into fix/checkpoint-interval

c5e1e91

awaelchli merged commit ef03c39 into master Nov 2, 2020

awaelchli deleted the fix/checkpoint-interval branch November 2, 2020 14:06

SeanNaren pushed a commit that referenced this pull request Nov 10, 2020

Revert "Add step index in checkpoint name (#3807)"

8bf6f79

This reverts commit e9c61cc
