Add step index in checkpoint name #3807
If I understand the two Priority bugs correctly, people expect a checkpoint to be saved at every Cx, or at least not to have one saved at C1 while all the others are skipped.
Saving only C1 is certainly not good, but saving C3 just because it is the last check in the epoch, even when C1 scores better (for example because of overfitting), does not make sense either...
Context: Checking validation every few steps or hours is the correct way to go, so a checkpoint should be created whenever that condition is met. If people want to check validation only once per epoch, they can pass in the proper flags. But the expectation is that every time validation is checked, a checkpoint could be saved. Let's make sure we explicitly test for this so that we don't regress.
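The C1/C2/C3 argument above can be sketched as a comparison of two saving policies. This is only an illustration: the loss values are made up, and `pick_last_in_epoch` and `pick_best_of_all_checks` are hypothetical helpers, not functions from the library.

```python
def pick_last_in_epoch(val_losses):
    """Policy A: only the final validation check of the epoch may save."""
    return len(val_losses) - 1


def pick_best_of_all_checks(val_losses):
    """Policy B: every validation check is a candidate; keep the lowest loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Made-up validation losses for three checks within one epoch.
# The loss rises over the epoch, as it might when the model overfits.
labels = ["C1", "C2", "C3"]
val_losses = [0.30, 0.35, 0.42]

kept_a = labels[pick_last_in_epoch(val_losses)]
kept_b = labels[pick_best_of_all_checks(val_losses)]
print(kept_a, kept_b)
```

Under these assumed numbers, policy A keeps C3 even though C1 scored better, which is the behaviour the comment objects to.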
Hello @Borda! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-11-02 11:47:45 UTC
@awaelchli can you review?
This pull request is now in conflict... :(
bff3f71 to 8a20989
@awaelchli @williamFalcon would you mind checking this before I update the docs?
tests/trainer/test_trainer.py
@@ -378,90 +378,6 @@ def test_dp_output_reduce():
    assert reduced['b']['c'] == out['b']['c']
These are moved to the checkpoint file...
This will be rebased on #3838.
4dfd824 to 6e2b2bc
@@ -495,9 +496,10 @@ def _monitor_candidates(self, trainer):
    ckpt_name_metrics = deepcopy(trainer.logger_connector.logged_metrics)
    ckpt_name_metrics.update(trainer.logger_connector.callback_metrics)
    ckpt_name_metrics.update(trainer.logger_connector.progress_bar_metrics)
    ckpt_name_metrics.update({"step": trainer.global_step, "epoch": trainer.current_epoch})
This update also happens in _format_checkpoint_name. Is it possible to consolidate it into a single place? Is there a risk that the two can fall out of sync?
which point do you have in mind?
@ananthsub could you explain further what you mean? I think I understand but need some clarity here.
    epoch = metrics.get("epoch")
    step = metrics.get("step")
same comment as above
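One way to picture the consolidation asked about in this thread is sketched below: the step and epoch values are written into the metrics dictionary in exactly one function, and the name formatter only reads them. This is a minimal sketch under assumptions, not the code that was merged. `build_monitor_candidates` and `format_checkpoint_name` are hypothetical stand-ins that take plain dictionaries instead of a trainer, and the `epoch=...-step=...` pattern is assumed for illustration.

```python
from copy import deepcopy


def build_monitor_candidates(logged, callback, progress_bar, global_step, current_epoch):
    """Merge the metric dictionaries and add step/epoch in a single place."""
    candidates = deepcopy(logged)  # do not mutate the caller's dictionary
    candidates.update(callback)
    candidates.update(progress_bar)
    # The only spot where "step" and "epoch" are written.
    candidates.update({"step": global_step, "epoch": current_epoch})
    return candidates


def format_checkpoint_name(metrics):
    """Read step/epoch from the metrics; never recompute them here."""
    epoch = metrics.get("epoch")
    step = metrics.get("step")
    return f"epoch={epoch}-step={step}"


# A stale "step" value logged by the user is overwritten by the real counter.
logged = {"val_loss": 0.5, "step": 3}
candidates = build_monitor_candidates(logged, {}, {}, global_step=10, current_epoch=1)
name = format_checkpoint_name(candidates)
print(name)  # epoch=1-step=10
```

Because the formatter has no access to the trainer in this sketch, the two values cannot drift apart between the candidate dictionary and the file name.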
Nice addition! Excited to get this merged.
Great PR !
Co-authored-by: Carlos Mocholí <[email protected]>
Awesome addition !
* true final value of global step
* ch check
* tests
* save each validation interval
* wip
* add test
* add test
* wip
* fix tests, revert old edits, fix merge conflicts, update doctests
* test + bugfix
* sort files
* format test
* suggestion by ananth
* added changelog
* naming
* docs
* example
* suggestion
Co-authored-by: Carlos Mocholí <[email protected]>
* fix test
* pep
* pep
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
(cherry picked from commit ef03c39)
This reverts commit e9c61cc
What does this PR do?
Fixes #1764
Fixes #3789
Fixes #1966
Fixes #1758
Fixes #3084
Partially implements what is requested in #4335
Example:
This saves a checkpoint 4 times per epoch with step and epoch index in the name of the file.
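The original example code is not present on this page. As a stand-in, the sketch below simulates which file names a run would produce if a checkpoint were written at each of 4 validation checks per epoch. It does not use the library: `checkpoint_names` is a hypothetical helper, the `epoch=...-step=...` pattern is assumed, and the step counter is assumed to count completed batches, which may differ by one from what the library reports.

```python
def checkpoint_names(num_epochs, batches_per_epoch, val_checks_per_epoch=4):
    """Return the file names produced when every validation check saves.

    Hypothetical helper: assumes batches_per_epoch is divisible by
    val_checks_per_epoch so the checks are evenly spaced.
    """
    if batches_per_epoch % val_checks_per_epoch != 0:
        raise ValueError("batches_per_epoch must be divisible by val_checks_per_epoch")
    interval = batches_per_epoch // val_checks_per_epoch
    names = []
    global_step = 0  # counts completed training batches across epochs
    for epoch in range(num_epochs):
        for batch_idx in range(batches_per_epoch):
            global_step += 1
            # A validation check (and a save) happens every `interval` batches.
            if (batch_idx + 1) % interval == 0:
                names.append(f"epoch={epoch}-step={global_step}.ckpt")
    return names


names = checkpoint_names(num_epochs=2, batches_per_epoch=100)
for name in names:
    print(name)
```

With 100 batches per epoch, this yields four names per epoch, and the step index is what keeps the names within one epoch distinct.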