[HOTFIX] ModelCheckpoint - Don't increase current_epoch and global_step if not trained #4291

Merged
SeanNaren merged 18 commits into master from hotfix/issue_4176_repeated_save_restore_2 on Oct 23, 2020

@tchaton tchaton commented Oct 21, 2020

What does this PR do?

This PR resolves the bug where current_epoch and global_step keep increasing when a checkpoint is repeatedly dumped and restored without any training taking place in between.

Fixes #4176
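
To illustrate the failure mode described above, here is a minimal sketch that does not use PyTorch Lightning at all. `ToyTrainer`, `dump_checkpoint`, `restore`, the `guard` flag, and the "store counter + 1" convention are hypothetical names and assumptions made for this example only. It is not the code changed in this PR.

```python
from dataclasses import dataclass


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; NOT PyTorch Lightning's Trainer."""

    current_epoch: int = 0
    global_step: int = 0
    trained_since_restore: bool = False

    def train_one_epoch(self, steps: int = 10) -> None:
        # Pretend to run `steps` optimizer steps within the current epoch.
        self.global_step += steps
        self.trained_since_restore = True

    def dump_checkpoint(self, guard: bool) -> dict:
        # Assumed convention: the checkpoint stores the counters to resume
        # from, i.e. the current values plus one.
        # With guard=True the bump is skipped when nothing was trained.
        bump = 0 if (guard and not self.trained_since_restore) else 1
        return {
            "epoch": self.current_epoch + bump,
            "global_step": self.global_step + bump,
        }

    def restore(self, ckpt: dict) -> None:
        self.current_epoch = ckpt["epoch"]
        self.global_step = ckpt["global_step"]
        self.trained_since_restore = False


def save_restore_cycles(guard: bool, cycles: int = 3) -> dict:
    """Train once, then dump and restore repeatedly without training."""
    trainer = ToyTrainer()
    trainer.train_one_epoch()
    ckpt = trainer.dump_checkpoint(guard)
    for _ in range(cycles):
        trainer = ToyTrainer()
        trainer.restore(ckpt)
        ckpt = trainer.dump_checkpoint(guard)
    return ckpt


if __name__ == "__main__":
    print("without guard:", save_restore_cycles(guard=False))
    print("with guard:   ", save_restore_cycles(guard=True))
```

In this toy model, the unguarded counters drift by one per dump/restore cycle, which is how a restored checkpoint can end up with an epoch beyond a configured `max_epochs`. The guarded variant leaves the counters unchanged when no training happened.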

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks commented Oct 21, 2020

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-23 08:26:50 UTC

@mergify mergify bot requested a review from a team October 21, 2020 16:37
@tchaton tchaton changed the title from "add two tests w/wo tempdir" to "Add two tests w/wo tempdir for saving / restoring from ModelCheckpoint" Oct 21, 2020
@mergify mergify bot requested a review from a team October 22, 2020 08:48

codecov bot commented Oct 22, 2020

Codecov Report

Merging #4291 into master will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4291   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         111     111           
  Lines        8007    8017   +10     
======================================
+ Hits         7445    7456   +11     
+ Misses        562     561    -1     

@Borda Borda added the ci Continuous Integration label Oct 22, 2020
@Borda Borda added this to the 1.0.x milestone Oct 22, 2020
@tchaton tchaton changed the title from "Add two tests w/wo tempdir for saving / restoring from ModelCheckpoint" to "[WIP] MisconfigurationException: restored ckpt with current_epoch=2, but the Trainer(max_epochs=1)" Oct 22, 2020
@tchaton tchaton changed the title from "[WIP] MisconfigurationException: restored ckpt with current_epoch=2, but the Trainer(max_epochs=1)" to "[HOTFIX] ModelCheckpoint - Don't increase current_epoch and global_step if not trained" Oct 22, 2020

mergify bot commented Oct 23, 2020

This pull request is now in conflict... :(

@SeanNaren SeanNaren merged commit 3abfec8 into master Oct 23, 2020
@tchaton tchaton deleted the hotfix/issue_4176_repeated_save_restore_2 branch October 23, 2020 10:17
Labels
ci Continuous Integration
Development

Successfully merging this pull request may close these issues.

Issue with epoch count with repeated save/restore
6 participants