
more early stopping options (convergence and divergence threshold) #6868

Merged 18 commits into master from feature/early-stopping-threshold on Apr 19, 2021

Conversation

@awaelchli (Contributor) commented Apr 7, 2021

What does this PR do?

Part of #6795

Adds two thresholds after which we stop training immediately (no patience).

  • Divergence threshold: the monitor has reached a value from which we believe it cannot recover -> stop training.
  • Stopping threshold: the monitor has reached a target value that is close to optimal, and we do not care about further improvement -> stop training.

Now that we have multiple stopping criteria, it's best we report the reason for stopping too.
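
For reference, a minimal usage sketch of the two new arguments (the argument names match the docstring diff quoted later in this thread; the monitored metric and the threshold values are illustrative):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Stop as soon as val_loss is good enough (drops below 0.05), or immediately
# if it ever climbs past 5.0, i.e. training has likely diverged.
early_stopping = EarlyStopping(
    monitor="val_loss",
    mode="min",
    stopping_threshold=0.05,
    divergence_threshold=5.0,
)
trainer = Trainer(callbacks=[early_stopping])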

TODO:

  • tests for the new parameters
  • improve docs

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines 59 to 60
check_finite: Stops training when the monitor becomes NaN or infinite. Set this argument to ``False``
if this behavior is undesired.
@awaelchli (Contributor, Author):

do we want an option to turn this on/off at all?

Reply (Contributor):

I would say off. We will add support for occasional NaN loss training soon.
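
For context, given the docstring quoted above, opting out of the NaN/Inf check is a single argument (a sketch; only check_finite comes from the snippet above, the monitored metric name is illustrative):

from pytorch_lightning.callbacks import EarlyStopping

# Keep early stopping on val_loss, but do not stop when the metric
# becomes NaN or infinite.
early_stopping = EarlyStopping(monitor="val_loss", check_finite=False)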

codecov bot commented Apr 7, 2021

Codecov Report

Merging #6868 (e6dc765) into master (832a03a) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #6868    +/-   ##
=======================================
- Coverage      92%     87%    -5%     
=======================================
  Files         196     196            
  Lines       12571   12597    +26     
=======================================
- Hits        11594   10968   -626     
- Misses        977    1629   +652     

@awaelchli added the callback and feature (is an improvement or enhancement) labels on Apr 8, 2021
if should_stop:
    self.stopped_epoch = trainer.current_epoch
if reason:
    log.info(f"[{trainer.global_rank}] {reason}")
@awaelchli (Contributor, Author) commented Apr 8, 2021:

This is suboptimal.

If we log on rank zero only, and the user has sync_dist=False for logging, then we might not see the reason being logged because it could be rank > 0 that decided to stop.

If we log on all ranks and the user has sync_dist=True for logging, we will show the same message N times.

Should we perhaps broadcast the message and log only on rank 0?

Reply (Contributor):

Good idea!
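
A rough sketch of that idea (the helper name and the use of all_gather_object instead of an explicit broadcast are assumptions for illustration, not the code that landed in this PR): gather every rank's reason, then let rank 0 log the first non-empty one so the message appears exactly once.

import logging
from typing import Optional

import torch.distributed as dist

log = logging.getLogger(__name__)


def log_stop_reason(reason: Optional[str], rank: int, world_size: int) -> None:
    # Collect each rank's (possibly empty) reason so rank 0 sees the one
    # produced by whichever rank triggered the stop, then log it once.
    if world_size > 1 and dist.is_available() and dist.is_initialized():
        gathered = [None] * world_size
        dist.all_gather_object(gathered, reason)
        reason = next((r for r in gathered if r), None)
    if rank == 0 and reason:
        log.info(reason)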

@awaelchli marked this pull request as ready for review on April 8, 2021
Comment on lines +61 to +62
stopping_threshold: Stop training immediately once the monitored quantity reaches this threshold.
divergence_threshold: Stop training as soon as the monitored quantity becomes worse than this threshold.
Contributor:

we could use stop_limit and stop_loss to follow common financial terms

https://www.investopedia.com/articles/active-trading/091813/which-order-use-stoploss-or-stoplimit-orders.asp

@awaelchli (Contributor, Author):

@jlperla what do you think of this name suggestion?

Reply:

But it might not be a loss, and most people don't know finance. These are basically optimizer settings, which is more universal.

I think sticking with optimizer-style lingo is ideal. "Divergence" is safe and says what it means. Normally one would call the success criteria tolerances for optimizers, but that is because they are always comparing something (e.g. a value itself, changes in that value, or first-order conditions) to zero.

Since this could presumably compare against stopping values other than something close to zero (especially if you are tracking a quantity where a larger number is better), I think "threshold" is probably more general. But open-minded, of course.

Contributor:

Could also go super simple and go with min_threshold and max_threshold

Reply:

No, because that implies a direction to them.

@tchaton (Contributor) commented Apr 12, 2021:

IMO, the names are good right now.

Member:

This is not clear to me: is it that you have some converging sequence and training stops when it starts diverging again? Should there be some patience to allow for noise?
Or could it natively observe the training and validation measures and stop on overfitting, i.e. when those two start to diverge?

Reply:

If something is diverging, it is because you are in some sort of local minimum or outside of an attractor, and it could only return with some massive jumps (e.g. in simulated annealing, if you are way off in the boondocks relative to your optimum, it could in theory come back, but it might take months; you are better off just restarting). So patience isn't the right thing to think of for that.

Three further review threads on pytorch_lightning/callbacks/early_stopping.py (two outdated) were resolved.
@awaelchli requested a review from kaushikb11 as a code owner on April 8, 2021
@awaelchli added this to the 1.3 milestone on Apr 8, 2021
@carmocca (Contributor) left a comment:

Should patience and delta apply to these thresholds?

Two further review threads on pytorch_lightning/callbacks/early_stopping.py (one outdated) were resolved.
@tchaton (Contributor) left a comment:

LGTM! Great work! @jlperla, does it match what you had in mind?

The mergify bot removed the "has conflicts" label on Apr 14, 2021
elif self.divergence_threshold is not None and self.monitor_op(-current, -self.divergence_threshold):
    should_stop = True
    reason = (
        f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."
Reply:

Minor point: whether it is

f"Divergence: {self.monitor} = {current} > {self.divergence_threshold}."

or

f"Divergence: {self.monitor} = {current} < {self.divergence_threshold}."

should depend on the monitor_op, right? Maybe have an f"Divergence: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.divergence_threshold}." or something like that, where you fill op_to_string with whatever you need to turn it into a >, <, etc.?

Similarly, I think you could do the same with the "successful" convergence below/above the target:

f"Below tolerance: {self.monitor} = {current} {op_to_string(self.monitor_op)} {self.stopping_threshold}."

where you would have to ensure the order is correct, as it is going from the other direction.

As I said though, minor.
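
A minimal sketch of that suggestion (op_to_string is a hypothetical helper, not part of this PR; the mapping assumes monitor_op is torch.lt or torch.gt, per EarlyStopping's mode handling):

import torch

# Hypothetical helper: map the callback's comparison op to the symbol
# used in the human-readable stopping reason.
_OP_SYMBOLS = {torch.lt: "<", torch.gt: ">"}


def op_to_string(monitor_op) -> str:
    return _OP_SYMBOLS.get(monitor_op, repr(monitor_op))

# e.g. reason = f"Divergence: {monitor} = {current} {op_to_string(monitor_op)} {divergence_threshold}."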

@jlperla commented Apr 14, 2021:

@tchaton @awaelchli This all looks great to me. I put in one minor comment about the "reason" strings only being correct for one of the monitor_ops, but I also think that could wait and be done as a separate issue later. I personally am unlikely to use the other direction for the monitor_op anytime soon.

The mergify bot removed the "has conflicts" label on Apr 16, 2021
@awaelchli added the "priority: 0" (high priority task) label on Apr 18, 2021
@tchaton requested a review from carmocca on April 19, 2021
@kaushikb11 (Contributor) left a comment:

Love it!

@carmocca merged commit d12c6cf into master on Apr 19, 2021
@carmocca deleted the feature/early-stopping-threshold branch on April 19, 2021
@awaelchli mentioned this pull request on Apr 20, 2021
Labels: callback, feature (is an improvement or enhancement), priority: 0 (high priority task)
Projects: none yet
Development: successfully merging this pull request may close these issues.
7 participants