fixes for early stopping and checkpoint callbacks #1504

jeremyjordan · 2020-04-16T03:51:58Z

Before submitting

Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?
If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

For #1464
For #1463
For #1699
For #2151
Related #1458

best attribute isn't being saved
wait attribute isn't being reloaded properly
wait epoch is lagging by an epoch
early stopping callbacks are now being called twice
callback throwing an exception on epochs where validation metrics aren't available (due to check_val_every_n_epoch>1)

Adds tests to prevent future regressions.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

jeremyjordan · 2020-04-17T00:29:06Z

@PyTorchLightning/core-contributors currently, our documentation states that:

In any case, the callback will fall back to the training metrics (returned in training_step(), training_step_end()) looking for a key to monitor if validation is disabled or validation_epoch_end() is not defined.

However, this is not completely true. We only look at callback_metrics which is any key that is not loss, log, or progress. Do we want to update this to look across all values? Or correct the documentation to reflect the current reality?

pep8speaks · 2020-04-25T16:22:14Z

Hello @jeremyjordan! Thanks for updating this PR.

In the file pytorch_lightning/callbacks/model_checkpoint.py:

Line 244:71: W504 line break after binary operator
Line 245:72: W504 line break after binary operator

Comment last updated at 2020-06-28 06:34:32 UTC

jeremyjordan · 2020-04-29T02:43:50Z

@Borda any idea why some of the logger tests are failing?

Borda

it seems that the tests are failing on multiple places not only loggers... let's take it one by one

pytorch_lightning/callbacks/early_stopping.py

jeremyjordan · 2020-04-29T13:06:09Z

tests are failing when:

logger name or version is None
an opaque pickle issue
off by one error in checkpointed global step
mock error in parsing args

i need to investigate the off by one error, but not sure how the other tests failing are related to the changes in this PR

i want to get these failing tests addressed, then will write more tests for the early stopping callback.

jeremyjordan · 2020-05-01T03:26:47Z

ok, there's one remaining failing test and i've tracked down the issue. there's a thread lock being created when you create the OfflineExperiment which is preventing to object from being pickle-able. (see #1682)

shijie-wu · 2020-05-06T17:06:02Z

pytorch_lightning/callbacks/model_checkpoint.py

@@ -197,7 +197,7 @@ def format_checkpoint_name(self, epoch, metrics, ver=None):
        return filepath

    @rank_zero_only
-    def on_validation_end(self, trainer, pl_module):
+    def on_epoch_end(self, trainer, pl_module):


Hi! Does this change effect checkpointing in the middle of training epoch? Consider the usecase where we train on a large dataset and we want to checkpoint & early stop every X steps instead of every X epoches, for example X = 100, i.e. val_check_interval = 100.

Borda · 2020-05-11T22:02:52Z

@jeremyjordan is it blocked by another pr?

williamFalcon · 2020-05-17T13:00:41Z

@jeremyjordan which pr is blocking this?

jeremyjordan · 2020-05-19T02:41:41Z

the tests won't pass as is until #1682 is addressed, we'll probably want to merge #1458 and then i can have this as a follow-on PR which ensures that the EarlyStopping callback works well with Checkpointing

mergify · 2020-06-23T15:20:44Z

This pull request is now in conflict... :(

awaelchli · 2020-06-23T23:31:50Z

Strange. for me it works fine, I get these timings locally when running

tests\models>py.test -v test_hooks.py --durations=10

7.57s call     tests/models/test_hooks.py::test_on_before_zero_grad_called[2]
7.11s call     tests/models/test_hooks.py::test_on_before_zero_grad_called[3]
6.83s call     tests/models/test_hooks.py::test_on_before_zero_grad_called[1]

max_steps seems to work.
EDIT: even before the merge from @Borda I'm getting these fast timings.

jeremyjordan · 2020-06-24T00:41:18Z

@awaelchli are you running on Windows by any chance? that's the only place where tests are passing :D

awaelchli · 2020-06-24T00:44:52Z

Oh right that must be it..

mergify · 2020-06-24T03:40:11Z

This pull request is now in conflict... :(

Borda · 2020-06-24T23:40:12Z

@jeremyjordan mind rebase/merge master? and how is the last test? 🐰

mergify · 2020-06-26T13:34:01Z

This pull request is now in conflict... :(

awaelchli · 2020-06-26T23:25:34Z

I merged master into this and getting many failed tests 😭 don't know where to begin but i still have hope this can get merged. Will try to fix them this weekend.

awaelchli · 2020-06-27T11:48:52Z

pytorch_lightning/trainer/training_loop.py

-            self.run_evaluation(test_mode=self.testing)
-            self.call_checkpoint_callback()
-


@jeremyjordan The tests fail because the evaluation loop is not getting called after the epoch. Did you intend to movie it somewhere else?

ahh i think i only meant to remove the call_checkpoint_callback(), this was my mistake - glad you caught that!

awaelchli · 2020-06-27T22:09:54Z

pytorch_lightning/trainer/callback_config.py

-            if self.logger is not None:
-                save_dir = (getattr(self.logger, 'save_dir', None) or
-                            getattr(self.logger, '_save_dir', None) or
-                            self.default_root_dir)
-
-                # weights_save_path overrides anything
-                if self.weights_save_path is not None:
-                    save_dir = self.weights_save_path
-
-                version = self.logger.version if isinstance(
-                    self.logger.version, str) else f'version_{self.logger.version}'
-                ckpt_path = os.path.join(
-                    save_dir,
-                    self.logger.name,
-                    version,
-                    "checkpoints"


@jeremyjordan this code was moved to the ModelCheckpoint.on_train_start, and I understand why. However, we have the problem that the logger is already saving a meta.yaml file to the default location before the on_train_start callback is even called an the model checkpoint has the chance to update the weights_save_path.
Any idea how to decouple the checkpoint and logger ?

it may be unrelated, since it also happens here #2392. not sure

yes, I think we should provide shared configuration in the Trainer initialization and not expect these child objects (loggers and checkpoint callbacks) to reach into each other's attributes. this probably also includes moving some attributes from logging (eg. version) up into the Trainer

awaelchli · 2020-06-28T04:16:57Z

I was able to fix all tests and merge errors.
About this todo comment:

TODO support more generic way for callbacks to persist a state_dict in a checkpoint

What about a callack method on_save_checkpoint (we already have it as model hooks)? Then the checkpoint and early stop callbacks can save their state into the checkpoint and trainer doesn't need to do that.

mergify · 2020-06-28T06:35:08Z

This pull request is now in conflict... :(

jeremyjordan · 2020-06-28T15:06:02Z

What about a callack method on_save_checkpoint (we already have it as model hooks)? Then the checkpoint and early stop callbacks can save their state into the checkpoint and trainer doesn't need to do that.

Yes, I was thinking the same thing. This callback would just return a state_dict which the Trainer could store. The only thing that I am unclear how we should handle is for other callbacks how we want to reinitialize the state. If we can expect that the same exact callbacks will be passed to the Trainer then it should be trivial. Or we could expect that you only pass in a single instance of each callback class (eg. callbacks=[CustomerLogger(), EarlyStopping(), ModelCheckpoint()] and not callbacks=[CustomerLogger(params_a), CustomerLogger(params_b), EarlyStopping(), ModelCheckpoint()] and just keep a mapping of callback class to state dicts. However, if the user passed multiple callback instances of the same class I'm not sure how we would want to handle that.

Maybe for a first iteration we can just document that for on_save_checkpoint you can only have one instance per class?

Borda · 2020-06-28T15:28:10Z

@jeremyjordan we moved the PR to #2391 as it is the repo branch and much easier to maintain by other core... :]

jeremyjordan · 2020-06-28T15:54:34Z

@awaelchli I created #2401 for us to continue discussion on your comment

awaelchli · 2020-06-28T15:55:18Z

perfect!

* add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update train loop * cannot pass an int as default_save_path * refactor log message * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <[email protected]> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <[email protected]> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <[email protected]> * fix formatting * remove enable_early_stop attribute * add state_dict for early stopping * move best attr after monitor_op defined * improve early stopping and model checkpoint callbacks * fix formatting * fix attr init order * clean up setting of default_root_dir attr * logger needs default root dir set first * reorg trainer init * remove direct references to checkpoint callback * more fixes * more bugfixes * run callbacks at epoch end * update tests to use on epoch end * PR cleanup * address failing tests * refactor for homogeneity * fix merge conflict * separate tests * tests for early stopping bug regressions * small fixes * revert model checkpoint change * typo fix * fix tests * update train loop * fix test case * appease the linter * fix some doctests * move config to callback * fixes from rebase * fixes from rebase * chlog * docs * reformat * formatting * fix * fix * fixes from rebase * add new test for patience * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <[email protected]> * Update pytorch_lightning/callbacks/model_checkpoint.py Co-authored-by: Jirka Borovec <[email protected]> * Update tests/callbacks/test_early_stopping.py Co-authored-by: Jirka Borovec <[email protected]> * fix formatting * remove enable_early_stop attribute * fix test with new epoch indexing * fix progress bar totals * fix off by one error (see #2289) epoch starts at 0 now * added missing imports * fix hpc_save folderpath * fix formatting * fix tests * small fixes from a rebase * fix * tmpdir * tmpdir * tmpdir * wandb * fix merge conflict * add back evaluation after training * test_resume_early_stopping_from_checkpoint TODO * undo the horovod check * update changelog * remove a duplicate test from merge error * try fix dp_resume test * add the logger fix from master * try remove default_root_dir * try mocking numpy * try import numpy in docs test * fix wandb test * pep 8 fix * skip if no amp * dont mock when doctesting * install extra * fix the resume ES test * undo conf.py changes * revert remove comet pickle from test * Update CHANGELOG.md Co-authored-by: Jirka Borovec <[email protected]> * Update weights_loading.rst * Update weights_loading.rst * Update weights_loading.rst * renamed flag * renamed flag * revert the None check in logger experiment name/version * add the old comments * _experiment * test chckpointing on DDP * skip the ddp test on windows * cloudpickle * renamed flag * renamed flag * parentheses for clarity * apply suggestion max epochs Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Jeremy Jordan <[email protected]> Co-authored-by: Jirka <[email protected]> Co-authored-by: Jeremy Jordan <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: William Falcon <[email protected]>

mergify bot requested a review from a team April 16, 2020 03:52

jeremyjordan removed the request for review from a team April 16, 2020 03:54

mergify bot requested a review from a team April 16, 2020 03:55

jeremyjordan mentioned this pull request Apr 24, 2020

Error if callbacks are passed wrongly #1544

Closed

5 tasks

jeremyjordan changed the title ~~[WIP] fixes for early stopping callback~~ [WIP] fixes for early stopping and checkpoint callbacks Apr 26, 2020

Borda added the bug Something isn't working label Apr 29, 2020

Borda added this to the 0.7.6 milestone Apr 29, 2020

Borda reviewed Apr 29, 2020

View reviewed changes

pytorch_lightning/callbacks/early_stopping.py Outdated Show resolved Hide resolved

pytorch_lightning/callbacks/early_stopping.py Outdated Show resolved Hide resolved

mergify bot requested a review from a team April 29, 2020 10:05

jeremyjordan mentioned this pull request May 2, 2020

The behavior of EarlyStopping is not the same between 0.7.5 with 0.7.3 #1699

Closed

jeremyjordan changed the title ~~[WIP] fixes for early stopping and checkpoint callbacks~~ [blocked] fixes for early stopping and checkpoint callbacks May 5, 2020

shijie-wu reviewed May 6, 2020

View reviewed changes

Borda modified the milestones: 0.7.6, 0.8.0 May 12, 2020

SkafteNicki mentioned this pull request May 12, 2020

Learning rate scheduler's epoch off by one when resuming from checkpoint #1772

Closed

Borda modified the milestones: 0.8.0, 0.7.7 May 15, 2020

Borda changed the title ~~[blocked] fixes for early stopping and checkpoint callbacks~~ [blocked by #1458] fixes for early stopping and checkpoint callbacks May 19, 2020

jeremyjordan changed the title ~~[blocked by #1458] fixes for early stopping and checkpoint callbacks~~ [WIP] fixes for early stopping and checkpoint callbacks May 21, 2020

This was referenced May 23, 2020

early stopping checks on_validation_end #1458

Merged

Trainer.parse_argparser does not yield sensible default for default_root_dir #1916

Closed

jeremyjordan changed the title ~~[WIP] fixes for early stopping and checkpoint callbacks~~ [blocked by 1458] fixes for early stopping and checkpoint callbacks May 24, 2020

awaelchli mentioned this pull request Jun 23, 2020

Checkpointing with SLURM #2278

Closed

williamFalcon mentioned this pull request Jun 26, 2020

EarlyStopping reinitializes to .wait=0 even with Trainer resume_from_checkpoint #1463

Closed

awaelchli reviewed Jun 27, 2020

View reviewed changes

mergify bot requested a review from a team June 27, 2020 11:49

awaelchli mentioned this pull request Jun 27, 2020

Continue Jeremy's early stopping PR #1504 #2391

Merged

9 tasks

awaelchli reviewed Jun 27, 2020

View reviewed changes

mergify bot requested a review from a team June 27, 2020 22:10

Borda added 5 commits June 28, 2020 08:33

fix

a9a4159

tmpdir

3e9667a

tmpdir

df283e6

tmpdir

677f70c

wandb

ba6a5ba

Borda force-pushed the bugfix/early-stopping-state branch from ae75fa4 to ba6a5ba Compare June 28, 2020 06:34

Borda changed the base branch from master to bugfix/early-stopping-state June 28, 2020 06:35

Borda merged commit 4cabb88 into Lightning-AI:bugfix/early-stopping-state Jun 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixes for early stopping and checkpoint callbacks #1504

fixes for early stopping and checkpoint callbacks #1504

jeremyjordan commented Apr 16, 2020 •

edited by Borda

Loading

jeremyjordan commented Apr 17, 2020

pep8speaks commented Apr 25, 2020 •

edited

Loading

jeremyjordan commented Apr 29, 2020

Borda left a comment

jeremyjordan commented Apr 29, 2020

jeremyjordan commented May 1, 2020 •

edited

Loading

shijie-wu May 6, 2020

Borda commented May 11, 2020

williamFalcon commented May 17, 2020

jeremyjordan commented May 19, 2020

mergify bot commented Jun 23, 2020

awaelchli commented Jun 23, 2020 •

edited

Loading

jeremyjordan commented Jun 24, 2020

awaelchli commented Jun 24, 2020

mergify bot commented Jun 24, 2020

Borda commented Jun 24, 2020

mergify bot commented Jun 26, 2020

awaelchli commented Jun 26, 2020

awaelchli Jun 27, 2020

jeremyjordan Jun 28, 2020

awaelchli Jun 27, 2020

awaelchli Jun 28, 2020

jeremyjordan Jun 28, 2020

awaelchli commented Jun 28, 2020 •

edited

Loading

mergify bot commented Jun 28, 2020

jeremyjordan commented Jun 28, 2020

Borda commented Jun 28, 2020

jeremyjordan commented Jun 28, 2020

awaelchli commented Jun 28, 2020

		self.run_evaluation(test_mode=self.testing)
		self.call_checkpoint_callback()

fixes for early stopping and checkpoint callbacks #1504

fixes for early stopping and checkpoint callbacks #1504

Conversation

jeremyjordan commented Apr 16, 2020 • edited by Borda Loading

Before submitting

What does this PR do?

PR review

Did you have fun?

jeremyjordan commented Apr 17, 2020

pep8speaks commented Apr 25, 2020 • edited Loading

Comment last updated at 2020-06-28 06:34:32 UTC

jeremyjordan commented Apr 29, 2020

Borda left a comment

Choose a reason for hiding this comment

jeremyjordan commented Apr 29, 2020

jeremyjordan commented May 1, 2020 • edited Loading

shijie-wu May 6, 2020

Choose a reason for hiding this comment

Borda commented May 11, 2020

williamFalcon commented May 17, 2020

jeremyjordan commented May 19, 2020

mergify bot commented Jun 23, 2020

awaelchli commented Jun 23, 2020 • edited Loading

jeremyjordan commented Jun 24, 2020

awaelchli commented Jun 24, 2020

mergify bot commented Jun 24, 2020

Borda commented Jun 24, 2020

mergify bot commented Jun 26, 2020

awaelchli commented Jun 26, 2020

awaelchli Jun 27, 2020

Choose a reason for hiding this comment

jeremyjordan Jun 28, 2020

Choose a reason for hiding this comment

awaelchli Jun 27, 2020

Choose a reason for hiding this comment

awaelchli Jun 28, 2020

Choose a reason for hiding this comment

jeremyjordan Jun 28, 2020

Choose a reason for hiding this comment

awaelchli commented Jun 28, 2020 • edited Loading

mergify bot commented Jun 28, 2020

jeremyjordan commented Jun 28, 2020

Borda commented Jun 28, 2020

jeremyjordan commented Jun 28, 2020

awaelchli commented Jun 28, 2020

jeremyjordan commented Apr 16, 2020 •

edited by Borda

Loading

pep8speaks commented Apr 25, 2020 •

edited

Loading

jeremyjordan commented May 1, 2020 •

edited

Loading

awaelchli commented Jun 23, 2020 •

edited

Loading

awaelchli commented Jun 28, 2020 •

edited

Loading