early stopping checks on_validation_end #1458
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##           master    #1458   +/- ##
======================================
- Coverage      88%      88%      -0%
======================================
  Files          74       74
  Lines        4643     4645       +2
======================================
  Hits         4068     4068
- Misses        575      577       +2
```
Does ES still work on training metrics when no validation loop is defined? I thought there was a test, and I would have expected this change to break it. |
This shall be covered with a test... |
i don't think this (changing |
@PyTorchLightning/core-contributors any suggestions about this issue? |
Thanks for the input, @jeremyjordan. In general, I'm also in favor of offering flexibility to users, but I figured in this case the intention was to provide a simple way to monitor a validation metric (the default metric is `val_loss`). This makes sense to me, since training metrics should always improve during training (e.g. when minimizing a loss with SGD), and the common practice to prevent overfitting is to keep an eye on validation metrics instead. That's why I think, at least by default, monitoring a validation metric is the most sensible thing to do. If a more flexible solution is required, I suggest something similar to:

```python
class EarlyStopping(Callback):
    def __init__(..., when='validation'):
        ...
        self.when = when

    def _should_stop(self, trainer, pl_module):
        # Existing stopping logic
        ...

    def on_epoch_end(self, trainer, pl_module):
        if self.when != 'training':
            return False
        return self._should_stop(trainer, pl_module)

    def on_validation_end(self, trainer, pl_module):
        if self.when != 'validation':
            return False
        return self._should_stop(trainer, pl_module)
```

I agree it's not beautiful, but it does the job. Or would you prefer splitting early stopping in two classes e.g. |
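For reference, a usage sketch of how the proposed flag might be wired up. The `when` argument is hypothetical (it only exists in the sketch above, not in any released API), and `early_stop_callback` refers to the Trainer argument of that era:

```python
from pytorch_lightning import Trainer

# Hypothetical usage of the EarlyStopping sketch above.
# Stop based on validation metrics (the proposed default) ...
early_stop = EarlyStopping(monitor='val_loss', patience=3, when='validation')
# ... or on training metrics, e.g. for runs without a validation loop:
# early_stop = EarlyStopping(monitor='loss', patience=3, when='training')

trainer = Trainer(early_stop_callback=early_stop, max_epochs=100)
```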
@baldassarreFe agree with @jeremyjordan that the solution might be more involved. We need to enumerate the cases and work this out properly. Early stopping should run every time validation ends (as you have changed here). But there's the case where there is no val loop... in that case, should early stopping run at the end of every epoch? But i'm also fine with requiring early stopping to work only on val... Thoughts? |
i would also be fine with requiring early stopping to only work with a validation loop. however, we would need more changes to enable this. take, for example, the patience argument, which is currently defined as the number of epochs with no improvement. if we instead change early stopping to run every time validation runs, we need to consider:
|
IMO the patience argument should be defined as the number of validations with no improvement, which offers better support for the following two cases:
|
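A minimal sketch of what counting patience per validation run (rather than per epoch) could look like. This is illustrative, not the library's implementation: the class name is made up, it assumes the monitored value appears in `trainer.callback_metrics`, and it assumes the trainer honors a `should_stop` flag (the attribute-based signalling discussed later in the thread):

```python
import torch
from pytorch_lightning.callbacks import Callback


class PatiencePerValidation(Callback):
    """Illustrative sketch: `patience` counts validation runs, not epochs."""

    def __init__(self, monitor='val_loss', patience=3, mode='min'):
        self.monitor = monitor
        self.patience = patience
        self.mode = mode
        self.best = float('inf') if mode == 'min' else float('-inf')
        self.wait = 0  # validation runs without improvement

    def on_validation_end(self, trainer, pl_module):
        current = trainer.callback_metrics.get(self.monitor)
        if current is None:
            return  # metric not logged this run; strict handling omitted
        if torch.is_tensor(current):
            current = current.item()
        improved = current < self.best if self.mode == 'min' else current > self.best
        if improved:
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # assumed stop flag; how the trainer consumes it is what
                # the rest of the thread (and #1504) works out
                trainer.should_stop = True
```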
i agree with @shijie-wu, we shall enforce that the out of the box EarlyStopping callback runs when validation is present and the arguments are defined with respect to validation runs. if a user has a different (rare) use case, they can always implement their own logic in a custom callback. this change will have an effect on existing users, but i think it's a reasonable one. |
agreed! |
So, to summarize:
If this is correct, I think the code in the initial pull request will satisfy this behavior. I would also update the docs to clarify what we changed, warn the user about possible misunderstandings, and suggest that advanced behavior can be obtained by implementing a custom early stopping callback. Then we're done. Shall I? |
yes, let's do it |
there's an existing test which we'll need to remove, as @awaelchli pointed out. i'm surprised this branch isn't failing on that test currently |
do we have a bug in tests? |
@baldassarreFe how is it going? can we finish it for the next release? |
This pull request is now in conflict... :( |
9706712 to 38ed30d
This pull request is now in conflict... :( |
@jeremyjordan Good observation. I just quickly tested this locally, and commenting out the last two if clauses does not change anything; the test runs forever. Agree that this validation step pattern should be an option in some test and not the default :) |
Found the issue. This PR accidentally disables ES because the trainer still invokes the callback's `on_epoch_end` explicitly, and after the rename that hook no longer contains the stopping logic. This is why it is disabled. |
Ah yes, I forgot about that. In my WIP PR (#1504) I updated the trainer to not invoke this directly. |
@baldassarreFe would you mind updating the PR, swapping the trainer's explicit call to the callback's `on_epoch_end` over to `on_validation_end`, and then I'll take care of removing this explicit invocation in #1504? |
This pull request is now in conflict... :( |
`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.
Co-authored-by: Adrian Wälchli <[email protected]>
0db51fa to f9c0ac5
Great job! =) |
Great, we made it! Thanks to you all ;) |
Thank you for this! But shouldn't `on_validation_end` still return the stop signal, so that the trainer actually stops? |
Oh yeah, @collinmccarthy I think you're right. The trainer relies on the return value to stop:

```python
if self.enable_early_stop:
    if (met_min_epochs and met_min_steps) or self.fast_dev_run:
        should_stop = self.early_stop_callback.on_validation_end(self, self.get_model())
        # stop training
        stop = should_stop and met_min_epochs
        if stop:
            self.run_training_teardown()
            return
```

In 006a067 I only changed the name of the method from `on_epoch_end` to `on_validation_end`.

Then in 1680c88 the stopping logic was moved to a separate method. I think that's where we missed the return value.
@williamFalcon do you think that's why that test was failing? Maybe it was because the new callback is actually not stopping training. |
this shall be addressed in #1504, i switched from returning a value to setting an attribute on the trainer |
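For context, a rough sketch of the two signalling styles being compared here; the class names and the `should_stop` attribute are illustrative assumptions, not the exact #1504 code:

```python
from pytorch_lightning.callbacks import Callback


class ReturnStyleStopping(Callback):
    """Style A: signal via return value; the trainer must capture and act on it."""

    def on_validation_end(self, trainer, pl_module):
        return self._no_improvement(trainer)  # silently lost if the caller drops it

    def _no_improvement(self, trainer):
        ...  # stopping logic


class AttributeStyleStopping(Callback):
    """Style B: signal by setting a flag on the trainer (the #1504 direction)."""

    def on_validation_end(self, trainer, pl_module):
        if self._no_improvement(trainer):
            trainer.should_stop = True  # assumed attribute name

    def _no_improvement(self, trainer):
        ...  # stopping logic
```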
`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.
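A minimal configuration illustrating that scenario (a sketch assuming the 0.7-era `Trainer` arguments; `MyModel` stands in for any `LightningModule` with a validation loop):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# EarlyStopping monitors `val_loss`, but validation only runs every 5 epochs.
# If the stopping check fires on_epoch_end, the monitored metric is missing
# on the epochs in between: a warning with strict=False, a RuntimeError with
# strict=True. Checking on_validation_end avoids the mismatch.
early_stop = EarlyStopping(monitor='val_loss', patience=3, strict=True)

trainer = Trainer(
    early_stop_callback=early_stop,
    check_val_every_n_epoch=5,
    max_epochs=50,
)
# trainer.fit(MyModel())
```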
Before submitting
What does this PR do?
Fixes #490.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃