-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fault-tolerant: fix resetting "current" progress at the end of successful epoch #8837
Conversation
self.batch_progress.current.reset() | ||
self.scheduler_progress.current.reset() | ||
self.batch_loop.optim_progress.reset_on_epoch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as in reset() above line 90-93
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth extracting into a protected method
Codecov Report
@@ Coverage Diff @@
## master #8837 +/- ##
========================================
- Coverage 89% 43% -45%
========================================
Files 169 169
Lines 14072 14072
========================================
- Hits 12466 6117 -6349
- Misses 1606 7955 +6349 |
@@ -229,6 +229,10 @@ def on_run_end(self) -> List[List[STEP_OUTPUT]]: | |||
|
|||
self.update_lr_schedulers("epoch", update_plateau_schedulers=True) | |||
|
|||
self.batch_progress.current.reset() | |||
self.scheduler_progress.current.reset() | |||
self.batch_loop.optim_progress.reset_on_epoch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move batch_loop.optim_progress
reset to its own loop ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that it's calling reset_on_epoch
, not reset
. This loop is the epoch loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the bug happen because restarting=False
and the current
values correspond to those of the finished epoch? So we never roll over to the next one?
What does this PR do?
Fixes #8835
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃