Check early stopping metric in the beginning of the training #542
Changes from 1 commit
```diff
@@ -53,7 +53,7 @@ class Trainer(TrainerIOMixin,
     def __init__(self,
                  logger=True,
                  checkpoint_callback=True,
-                 early_stop_callback=True,
+                 early_stop_callback=False,
                  default_save_path=None,
                  gradient_clip_val=0,
                  gradient_clip=None,  # backward compatible
```

Review thread on the changed `early_stop_callback` default:

Reviewer: why changing the default config?

Author: As I have written in the description, we should discuss whether early stopping should be turned on or off by default. I think it is better to have it turned off. Again, please look at #524.

Reviewer: Early stopping should default to True - this is the most common use case.

Author: The problem is that it is conditioned on the … But if you insist, I can suggest the following: we enable early stopping by default, but if there is no …
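For context, a minimal usage sketch of what this default governs (the `EarlyStopping` import path and constructor arguments are assumptions based on the pytorch-lightning API of this era, not part of the diff):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping  # assumed import path

# Explicitly opt in to early stopping regardless of the default discussed above;
# passing True builds a default callback, False disables it entirely.
early_stop = EarlyStopping(monitor='val_loss', patience=3)
trainer = Trainer(early_stop_callback=early_stop)
```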
```diff
@@ -185,6 +185,8 @@ def __init__(self,
         # creates a default one if none passed in
         self.early_stop_callback = None
         self.configure_early_stopping(early_stop_callback, logger)
+        if self.enable_early_stop:
+            self.nb_sanity_val_steps = max(1, self.nb_sanity_val_steps)

         # configure checkpoint callback
         self.checkpoint_callback = checkpoint_callback
```

Review thread on forcing at least one sanity-check step:

Reviewer: maybe

    if self.fast_dev_run:
        self.nb_sanity_val_steps = 1

Author: But exactly by that reason it should be … We just take the previously defined final … If we made it as you have suggested, then …

Reviewer: Let's not do this. People need to have the option of turning sanity_val_check off.

Author: But how then will we check that early stopping will work correctly? (Note that we force this check only if early stopping is turned on.)

Reviewer: I understand what you're saying, but restricting EVERYONE to a forced sanity check will certainly block some esoteric research or production cases, so we can't do this. But I think this is on the user at this point. If they turned off the sanity check then it's on them and they are willingly exposing themselves to these kinds of issues... but for people who keep it on, we use what you suggest.

Author: Done
```diff
@@ -444,6 +446,7 @@ def run_pretrain_routine(self, model):
         # run tiny validation (if validation defined)
         # to make sure program won't crash during val
         ref_model.on_sanity_check_start()
+        callback_metrics = {}
         if self.get_val_dataloaders() is not None and self.nb_sanity_val_steps > 0:
             # init progress bars for validation sanity check
             pbar = tqdm.tqdm(desc='Validation sanity check', total=self.nb_sanity_val_steps,
```
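A brief note on the `callback_metrics = {}` initialisation (this is my reading of the diff, not something stated in the thread): it keeps the dict defined even when the sanity loop is skipped, so the metric check further down can run unconditionally. A tiny standalone illustration with plain locals instead of Trainer attributes:

```python
# Sketch only: mirrors the control flow of run_pretrain_routine above.
callback_metrics = {}      # defined even if no sanity validation runs
ran_sanity_check = False   # assumed stand-in for the dataloader/steps test

if ran_sanity_check:
    callback_metrics = {'val_loss': 0.7}  # would come from process_output()

# .get() is safe either way; with an empty dict the monitored metric is
# simply reported as missing rather than raising a NameError.
missing = callback_metrics.get('val_loss') is None
```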
```diff
@@ -453,12 +456,21 @@ def run_pretrain_routine(self, model):
             # dummy validation progress bar
             self.val_progress_bar = tqdm.tqdm(disable=True)

-            self.evaluate(model, self.get_val_dataloaders(), self.nb_sanity_val_steps, self.testing)
+            eval_results = self.evaluate(model, self.get_val_dataloaders(),
+                                         self.nb_sanity_val_steps, False)
+            _, _, _, callback_metrics, _ = self.process_output(eval_results)

             # close progress bars
             self.main_progress_bar.close()
             self.val_progress_bar.close()

+        if (self.enable_early_stop and
+                callback_metrics.get(self.early_stop_callback.monitor) is None):
+            raise RuntimeError(f"Early stopping was configured to monitor "
+                               f"{self.early_stop_callback.monitor} but it is not available "
+                               f"after validation_end. Available metrics are: "
+                               f"{','.join(list(callback_metrics.keys()))}")

         # init progress bar
         pbar = tqdm.tqdm(leave=True, position=2 * self.process_position,
                          disable=not self.show_progress_bar, dynamic_ncols=True, unit='batch',
```
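As a hedged illustration of the failure mode this new RuntimeError catches (the module below is a made-up fragment, not code from the repository): a LightningModule whose `validation_end` returns only `val_acc` while the early-stopping callback monitors `val_loss` would now fail fast during the sanity check instead of silently never early-stopping.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # ... training_step, validation_step, configure_optimizers omitted ...

    def validation_end(self, outputs):
        # Only exposes 'val_acc'; the default EarlyStopping monitor is
        # 'val_loss', so the sanity-check metric check above would raise.
        avg_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_acc': avg_acc}
```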
Review thread on the early-stopping callback's behaviour when the metric is missing:

Reviewer: then you should return True

Author: Not exactly. Returning True was the previous behaviour, and it caused the interruption of the training if the required metric was not found. Now it just gives a warning and training proceeds as though without early stopping. The point is that the callback should not stop the training if it can't find the metrics. Actually, in the current implementation this branch is not reachable, because we check for the availability of the metric in the trainer initialization. But my idea was that if we decide to set early_stopping to True by default, then it can be used to give a warning but not to stop the training.
You can also look at #524 for better understanding.
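A hedged sketch of the "warn, don't stop" behaviour described above (the helper name `check_metrics` and its signature are illustrative, not the callback's actual API):

```python
import warnings


def check_metrics(logs, monitor='val_loss'):
    """Return True if the monitored metric is present; otherwise warn and
    signal that the early-stopping decision should simply be skipped."""
    if logs.get(monitor) is None:
        warnings.warn(
            f'Early stopping conditioned on metric `{monitor}` which is not available. '
            f'Available metrics are: {",".join(logs)}',
            RuntimeWarning)
        return False  # missing metric: do not interrupt training
    return True
```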