
[2/2] Remove training loop force calling early stopping callback #7069

Merged: 8 commits from early-stop-train-2 merged into Lightning-AI:master on Apr 29, 2021

Conversation

@ananthsub (Contributor) commented Apr 17, 2021

What does this PR do?

Fixes #7033
Part 2 - this depends on #6944

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks commented Apr 17, 2021

Hello @ananthsub! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-29 01:53:56 UTC

@codecov (bot) commented Apr 17, 2021

Codecov Report

Merging #7069 (29c2e3a) into master (e272bea) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #7069   +/-   ##
======================================
- Coverage      87%     87%   -0%     
======================================
  Files         199     199           
  Lines       12799   12791    -8     
======================================
- Hits        11170   11160   -10     
- Misses       1629    1631    +2     

self._run_early_stopping_check(trainer)

def on_validation_end(self, trainer, pl_module) -> None:
if self._check_on_train_epoch_end or self._should_skip_check(trainer):
Contributor
Should we run it for both validation and training?

@ananthsub (Contributor Author)

No, as the monitor metric might not be available in both training and validation. This lets people mix and match better, similar to what we are doing with checkpointing.
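For illustration, a minimal sketch of this mix-and-match idea; the metric names, patience values, and Trainer setup are made up for this example and are not taken from the PR:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop on a metric logged during training (assumes the LightningModule logs "train_loss"):
# opt in to checking at the end of the training epoch.
train_es = EarlyStopping(monitor="train_loss", patience=3, check_on_train_epoch_end=True)

# Stop on a metric logged during validation (assumes "val_loss" is logged):
# checked when validation ends, which is the default.
val_es = EarlyStopping(monitor="val_loss", patience=3)

# Attach whichever callback matches where the monitored metric is logged (or both);
# each check fires only at its own hook.
trainer = Trainer(callbacks=[train_es, val_es])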

@awaelchli (Contributor) commented Apr 19, 2021

Side fact: at one point (pre v1.0) we actually had the model checkpoint running on both training and validation epoch end, which led to writing the checkpoint twice, once with the old value of the val metric and once with the new one. And then of course it was wrong when val_check_interval != 1.

self._run_early_stopping_check(trainer)

def on_validation_end(self, trainer, pl_module) -> None:
if self._check_on_train_epoch_end or self._should_skip_check(trainer):
Contributor

Suggested change:
- if self._check_on_train_epoch_end or self._should_skip_check(trainer):
+ if self._should_skip_check(trainer):
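For context, a rough reconstruction of how the two hooks complement each other after this change, which is why the _check_on_train_epoch_end guard appears in on_validation_end. Only the on_validation_end lines appear in the diff above; the on_train_epoch_end counterpart and its exact signature are assumptions based on that context:

def on_train_epoch_end(self, trainer, pl_module) -> None:
    # Run the check here only if the user opted into checking at train epoch end.
    if not self._check_on_train_epoch_end or self._should_skip_check(trainer):
        return
    self._run_early_stopping_check(trainer)

def on_validation_end(self, trainer, pl_module) -> None:
    # Otherwise the check runs once validation has produced the monitored metric.
    if self._check_on_train_epoch_end or self._should_skip_check(trainer):
        return
    self._run_early_stopping_check(trainer)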

@Borda (Member) left a comment

Just to clarify, this is a cumulative change including #6944, correct?

@ananthsub (Contributor Author)
> Just to clarify, this is a cumulative change including #6944, correct?

Yes, that's correct.

@ananthsub ananthsub force-pushed the early-stop-train-2 branch from 807fd72 to a7f5ab9 Compare April 28, 2021 16:29
@mergify mergify bot removed the has conflicts label Apr 28, 2021
@ananthsub (Contributor Author)
@carmocca @awaelchli @tchaton @Borda mind taking a look?

@Borda (Member) left a comment

lgtm

@Borda Borda enabled auto-merge (squash) April 28, 2021 20:33
@carmocca carmocca added the ready PRs ready to be merged label Apr 28, 2021
@@ -548,7 +548,7 @@ def training_step(self, batch, batch_idx):
return output

model = TestModel()
early_stop = EarlyStopping(monitor="loss", patience=0)
early_stop = EarlyStopping(monitor="loss", patience=0, check_on_train_epoch_end=True)
Contributor

Just want to raise awareness: this changes the behavior for users.
A user whose monitor is a training metric now needs to set this argument after upgrading, right? Let's make this clear in the changelog, in the "Changed" section?

And are we good to include this in 1.3 during the feature freeze? I assume so since the milestone is set to 1.3, but just to make it clear to everyone.
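For illustration only, a sketch of what this upgrade looks like for such a user; the metric name and patience value are hypothetical, not taken from the PR:

from pytorch_lightning.callbacks import EarlyStopping

# Before this change: the training loop force-called the early stopping check at the
# end of every training epoch, so this stopped on the train-logged metric as-is.
early_stop = EarlyStopping(monitor="train_loss", patience=3)

# After this change: the check runs at validation end by default, so monitoring a
# train-logged metric requires opting in explicitly.
early_stop = EarlyStopping(monitor="train_loss", patience=3, check_on_train_epoch_end=True)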

@ananthsub (Contributor Author)

Added to the changelog. @edenlightning @tchaton wdyt for 1.3?

@ananthsub (Contributor Author)

@awaelchli fwiw this is pretty new behavior: #5208

But definitely, it's a change compared to before, and I don't think there's a way to make this backward compatible. Logging a warning can also get lost. Keeping this around longer blocks the general loop refactor we want to do.

@awaelchli (Contributor) commented Apr 29, 2021

Yes, it can't be made compatible. I just want to push for a better changelog.
Thanks @ananthsub

We need to anticipate that people will report these changes as bugs when upgrading; if we have clear changelog/release notes, we can point them to those.

@awaelchli awaelchli disabled auto-merge April 28, 2021 21:54
@awaelchli (Contributor) left a comment

Nice to finally see this TODO getting resolved!

@ananthsub ananthsub force-pushed the early-stop-train-2 branch from 36b78ce to 29c2e3a Compare April 29, 2021 01:53
@ananthsub ananthsub merged commit 14b8dd4 into Lightning-AI:master Apr 29, 2021
@ananthsub ananthsub deleted the early-stop-train-2 branch April 29, 2021 20:57
@ananthsub ananthsub mentioned this pull request May 3, 2021
Labels: callback, ready (PRs ready to be merged), refactor
Projects: none yet
Development

Successfully merging this pull request may close these issues.

Support early stopping during training inside of the early stopping callback
6 participants