
Add step index in checkpoint name #3807

Merged: 27 commits from fix/checkpoint-interval into master on Nov 2, 2020
Conversation

@Borda (Member) commented Oct 2, 2020

What does this PR do?

Fixes #1764
Fixes #3789
Fixes #1966
Fixes #1758
Fixes #3084

Partially implements what is requested in #4335

TODO

  • fix remaining tests
  • update docs with the new step option
  • integrate step into the sub-epoch test case

Example:

This saves a checkpoint four times per epoch, with the step and epoch index in the filename:

Trainer(
    val_check_interval=0.25,
    callbacks=[ModelCheckpoint(save_top_k=-1, monitor="valid_loss")],
)

[screenshot: checkpoint files named with epoch and step indices]
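As an illustration of the naming shown above, here is a minimal, hypothetical sketch (not Lightning's actual implementation) of how a filename template such as "{epoch}-{step}" could be expanded from a metrics dict:

```python
def format_checkpoint_name(template, metrics):
    # Replace each "{key}" placeholder with "key=value", mirroring
    # names like "epoch=0-step=249.ckpt" in the PR example.
    name = template
    for key, value in metrics.items():
        name = name.replace("{" + key + "}", f"{key}={value}")
    return name + ".ckpt"

print(format_checkpoint_name("{epoch}-{step}", {"epoch": 0, "step": 249}))
# epoch=0-step=249.ckpt
```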

@Borda Borda added bug Something isn't working feature Is an improvement or enhancement labels Oct 2, 2020
@mergify mergify bot requested a review from a team October 2, 2020 22:40
@Borda Borda added design Includes a design discussion discussion In a discussion stage labels Oct 2, 2020
@Borda (Member, Author) commented Oct 2, 2020

If I understand the two priority bugs correctly, people expect to save at every Cx, or at least not to save only at C1 and skip all the others:

train: ----------------------------------
val: ...........-C1-...........-C2-..........-C3-

Saving only C1 is clearly not good, but saving only C3 just because it is last in the epoch, even when C1 has the better score (for example, before overfitting sets in), does not make sense either...

@williamFalcon (Contributor)

Context: not all epochs take one minute. Many research cases train for only 1-2 epochs, where each epoch may take days or weeks.

Checking validation every few steps or hours is therefore the correct approach, and a checkpoint should be created whenever that condition is met.

If people want to check validation only once per epoch, they can pass the appropriate flags. But the expectation is that every time validation is checked, a checkpoint can be saved.

Let's make sure we explicitly test for this so we don't regress.
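To make the arithmetic concrete, here is a small, self-contained sketch (a hypothetical helper, not part of Lightning) of how many checkpoint opportunities one epoch yields, assuming val_check_interval is either a fraction of the epoch or a batch count between validation checks:

```python
import math

def val_checks_per_epoch(val_check_interval, steps_per_epoch):
    """Number of validation checks (and hence checkpoint opportunities)
    in one training epoch.

    A float is interpreted as the fraction of the epoch between checks;
    an int as the number of training batches between checks.
    """
    if isinstance(val_check_interval, float):
        return math.floor(1 / val_check_interval)
    return steps_per_epoch // val_check_interval

print(val_checks_per_epoch(0.25, steps_per_epoch=1000))  # 4
print(val_checks_per_epoch(250, steps_per_epoch=1000))   # 4
```

With val_check_interval=0.25, as in the PR example, validation runs four times per epoch, so up to four checkpoints per epoch can be written.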

@pep8speaks commented Oct 2, 2020

Hello @Borda! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-02 11:47:45 UTC

@williamFalcon (Contributor)

@awaelchli can you review?

@Borda Borda changed the title save each validation interval save each validation interval [wip] Oct 3, 2020
@mergify mergify bot requested a review from a team October 3, 2020 13:06
@mergify bot commented Oct 3, 2020

This pull request is now in conflict... :(

@Borda Borda force-pushed the fix/checkpoint-interval branch from bff3f71 to 8a20989 Compare October 3, 2020 18:58
@Borda (Member, Author) commented Oct 4, 2020

@awaelchli @williamFalcon mind checking before I update the docs?
By the way, the failing PyTorch 1.7 tests are addressed in #3821 and #3833.

@Borda Borda requested a review from awaelchli October 4, 2020 00:05
@@ -378,90 +378,6 @@ def test_dp_output_reduce():
assert reduced['b']['c'] == out['b']['c']


@Borda (Member, Author):

these are moved to checkpoint file...

@Borda Borda mentioned this pull request Oct 4, 2020
@Borda (Member, Author) commented Oct 4, 2020

will be rebased on #3838

@Borda Borda force-pushed the fix/checkpoint-interval branch from 4dfd824 to 6e2b2bc Compare October 4, 2020 08:01
@mergify bot commented Oct 4, 2020

This pull request is now in conflict... :(

@Borda Borda removed the bug Something isn't working label Oct 5, 2020
@@ -495,9 +496,10 @@ def _monitor_candidates(self, trainer):
ckpt_name_metrics = deepcopy(trainer.logger_connector.logged_metrics)
ckpt_name_metrics.update(trainer.logger_connector.callback_metrics)
ckpt_name_metrics.update(trainer.logger_connector.progress_bar_metrics)
ckpt_name_metrics.update({"step": trainer.global_step, "epoch": trainer.current_epoch})
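For reference, the merging order in the diff above can be sketched in plain Python (a simplified, hypothetical stand-in for _monitor_candidates, not the actual implementation): later sources overwrite earlier ones, and step/epoch are injected last so they are always available when the checkpoint name is formatted:

```python
from copy import deepcopy

def monitor_candidates(logged, callback, progress_bar, global_step, current_epoch):
    candidates = deepcopy(logged)   # logged_metrics first
    candidates.update(callback)     # callback_metrics override
    candidates.update(progress_bar) # progress_bar_metrics override
    # step/epoch are added last, so they always win over user metrics
    candidates.update({"step": global_step, "epoch": current_epoch})
    return candidates

cands = monitor_candidates(
    logged={"valid_loss": 0.31},
    callback={"valid_loss": 0.30},  # overrides the logged value
    progress_bar={},
    global_step=1000,
    current_epoch=3,
)
print(cands)  # {'valid_loss': 0.3, 'step': 1000, 'epoch': 3}
```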
@ananthsub (Contributor):

This update also happens in _format_checkpoint_name. Is it possible to consolidate it in a single place? Is there a risk the two can fall out of sync?

@Borda (Member, Author):

which point do you have in mind?

Reply from a Contributor:

@ananthsub could you explain further what you mean? I think I understand but need some clarity here.

Comment on lines +535 to +536
epoch = metrics.get("epoch")
step = metrics.get("step")
@ananthsub (Contributor):

same comment as above

@williamFalcon (Contributor)

nice addition! excited to get this merged

@tchaton (Contributor) left a review:

Great PR !

@mergify bot commented Oct 26, 2020

This pull request is now in conflict... :(

@tchaton tchaton self-requested a review November 2, 2020 08:09
@tchaton (Contributor) left a review:

Awesome addition !

Resolved review threads (outdated):
  • pytorch_lightning/callbacks/model_checkpoint.py
  • tests/checkpointing/test_model_checkpoint.py (3 threads)
@awaelchli awaelchli merged commit ef03c39 into master Nov 2, 2020
@awaelchli awaelchli deleted the fix/checkpoint-interval branch November 2, 2020 14:06
SeanNaren pushed a commit that referenced this pull request Nov 10, 2020
* true final value of global step

* ch check

* tests

* save each validation interval

* wip

* add test

* add test

* wip

* fix tests, revert old edits, fix merge conflicts, update doctests

* test + bugfix

* sort files

* format test

* suggestion by ananth

* added changelog

* naming

* docs

* example

* suggestion

Co-authored-by: Carlos Mocholí <[email protected]>

* fix test

* pep

* pep

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
(cherry picked from commit ef03c39)
SeanNaren pushed a commit that referenced this pull request Nov 10, 2020
rohitgr7 added a commit that referenced this pull request Nov 21, 2020
Labels: checkpointing (Related to checkpointing), design (Includes a design discussion), feature (Is an improvement or enhancement)