[2/2] Remove training loop force calling early stopping callback #7069
Conversation
Hello @ananthsub! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-29 01:53:56 UTC
Codecov Report
@@           Coverage Diff            @@
##           master   #7069    +/-   ##
========================================
- Coverage      87%      87%     -0%
========================================
  Files         199      199
  Lines       12799    12791      -8
========================================
- Hits        11170    11160     -10
- Misses       1629     1631      +2
        self._run_early_stopping_check(trainer)

    def on_validation_end(self, trainer, pl_module) -> None:
        if self._check_on_train_epoch_end or self._should_skip_check(trainer):
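For context, a sketch of how this hook plausibly pairs with on_train_epoch_end after the change (simplified; the body of on_train_epoch_end is an assumption, as this hunk only shows on_validation_end):

    def on_train_epoch_end(self, trainer, pl_module) -> None:
        # Assumed counterpart: run the check here only when the user
        # opted in via check_on_train_epoch_end.
        if not self._check_on_train_epoch_end or self._should_skip_check(trainer):
            return
        self._run_early_stopping_check(trainer)

    def on_validation_end(self, trainer, pl_module) -> None:
        # Default path: check at validation end; skip when the check
        # runs at train epoch end instead, or should be skipped entirely.
        if self._check_on_train_epoch_end or self._should_skip_check(trainer):
            return
        self._run_early_stopping_check(trainer)

The two guards are complements, so exactly one of the hooks performs the early stopping check for a given configuration.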
Should we run it for both validation and training?
No, as the monitor metric might not be available in both training and validation. This lets people mix and match better, similar to what we are doing with checkpointing.
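For illustration, a usage sketch of this mix-and-match (the check_on_train_epoch_end argument appears in this PR's tests; the metric names are placeholders):

    from pytorch_lightning.callbacks import EarlyStopping

    # Checked in on_validation_end (the default): monitor a metric
    # logged during validation.
    stop_on_val = EarlyStopping(monitor="val_loss", patience=3)

    # Checked in on_train_epoch_end: monitor a metric logged during
    # training, opting in explicitly.
    stop_on_train = EarlyStopping(monitor="loss", patience=0, check_on_train_epoch_end=True)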
Side fact: at one point (pre v1.0) we actually had model checkpointing run on both training and validation epoch end, which led to writing the checkpoint twice, once with the old value of the val metric and once with the new one. And of course it was wrong when val_check_interval != 1.
Suggested change:
-        if self._check_on_train_epoch_end or self._should_skip_check(trainer):
+        if self._should_skip_check(trainer):
Just to clarify, this is a cumulative change including #6944, correct?
Yes, that's correct.
Force-pushed from 807fd72 to a7f5ab9
@carmocca @awaelchli @tchaton @Borda mind taking a look?
lgtm
@@ -548,7 +548,7 @@ def training_step(self, batch, batch_idx):
         return output

     model = TestModel()
-    early_stop = EarlyStopping(monitor="loss", patience=0)
+    early_stop = EarlyStopping(monitor="loss", patience=0, check_on_train_epoch_end=True)
Just to raise awareness: this changes behavior for users. A user who monitors a training metric now needs to set this argument after upgrading, right? Let's make this clear in the changelog, in the "Changed" section.
And are we good to include this in 1.3 during the feature freeze? I assume so since the milestone is set to 1.3, but just to make it clear to everyone.
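To illustrate the migration (hypothetical user code; "loss" stands in for any metric logged in training_step):

    from pytorch_lightning.callbacks import EarlyStopping

    # Before this change: the check also ran at train epoch end, so
    # monitoring a training metric worked without extra arguments.
    early_stop = EarlyStopping(monitor="loss", patience=3)

    # After this change: the check runs at validation end by default,
    # so the same user must now opt in explicitly.
    early_stop = EarlyStopping(monitor="loss", patience=3, check_on_train_epoch_end=True)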
Added to the changelog. @edenlightning @tchaton wdyt for 1.3?
@awaelchli FWIW, this is pretty new behavior: #5208.
But definitely, it's a change compared to before, and I don't think there's a way to make it backward compatible. Logging a warning can also get lost, and keeping this around longer blocks the general loop refactor we want to do.
Yes, it can't be made compatible. I just want to push for a better changelog.
Thanks @ananthsub.
We need to anticipate that people will report these changes as bugs when upgrading, and if we have a clear changelog and release notes we can point them there.
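For instance, the "Changed" entry could read roughly like this (a sketch, not the actual changelog wording): "Changed EarlyStopping to run its check at the end of validation by default; pass check_on_train_epoch_end=True to keep checking at the end of every training epoch when monitoring a metric logged in training."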
Nice to finally see this TODO getting resolved!
Force-pushed from 36b78ce to 29c2e3a
What does this PR do?
Fixes #7033
Part 2 - this depends on #6944