EarlyStopping not updating its value after resuming training #18727

@MaugrimEP

Bug description

The EarlyStopping callback does not update its best value after resuming training from a checkpoint.
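
For context, my understanding (not verified against the source) is that EarlyStopping persists its tracking state into the checkpoint through the standard callback state_dict()/load_state_dict() hooks, so best_score and wait_count are restored on resume. A minimal sketch of that round trip; the exact keys are my assumption:

from pytorch_lightning.callbacks import EarlyStopping

es = EarlyStopping(monitor="valid/step/loss", patience=100, mode="min")

# What ends up in the checkpoint for this callback (keys assumed):
state = es.state_dict()
print(state)  # e.g. {'wait_count': 0, 'stopped_epoch': 0, 'best_score': inf, 'patience': 100}

# On resume, Lightning feeds the stored dict back in, so best_score and
# wait_count come back exactly as they were when the checkpoint was saved.
es.load_state_dict(state)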

What version are you seeing the problem on?

v2.0

How to reproduce the bug

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# model and datamodule are defined elsewhere in my project
ckpt_path = "some_path/66_best_model.ckpt"
es = EarlyStopping(
    monitor="valid/step/loss",
    min_delta=0,
    patience=100,
    verbose=True,
    mode="min",
    check_on_train_epoch_end=False,
)

trainer = pl.Trainer(callbacks=[es])
trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)
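
To confirm what the callback actually restored, one can print its state right after the checkpoint is loaded. A debugging sketch; on_train_start is a standard Callback hook, and my assumption is that it runs after the checkpointed callback state has been restored:

class ReportEarlyStoppingState(pl.Callback):
    # Print the EarlyStopping state that survived the resume.
    def on_train_start(self, trainer, pl_module):
        for cb in trainer.callbacks:
            if isinstance(cb, EarlyStopping):
                print(f"best_score={cb.best_score}, wait_count={cb.wait_count}")

trainer = pl.Trainer(callbacks=[es, ReportEarlyStoppingState()])
trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)

If best_score here already matches the old run's best, the stale value is coming straight from the checkpoint.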

Error messages and logs

Hi,
I have an EarlyStopping and a ModelCheckpoint monitoring the same metric in my callbacks, and I am resuming an existing training run.

It seems that EarlyStopping does not update its best metric when training is resumed from a checkpoint (retraining).

I got this output:

Epoch 149, global step 50700: 'epoch' reached 149.00000 (best 149.00000), saving model to '/dlocal/run/19197/_model_save/149_duration_model.ckpt' as top 100
Epoch 150, global step 51038: 'valid/step/loss' reached 0.00266 (best 0.00266), saving model to '/dlocal/run/19197/_model_save/150_best_model.ckpt' as top 1
Epoch 151, global step 51376: 'valid/step/loss' was not in top 1
Epoch 152, global step 51714: 'valid/step/loss' was not in top 1
Epoch 153, global step 52052: 'valid/step/loss' was not in top 1
Epoch 154, global step 52390: 'valid/step/loss' was not in top 1
Epoch 155, global step 52728: 'valid/step/loss' was not in top 1
Epoch 156, global step 53066: 'valid/step/loss' was not in top 1
Epoch 157, global step 53404: 'valid/step/loss' was not in top 1
Epoch 158, global step 53742: 'valid/step/loss' was not in top 1
Epoch 159, global step 54080: 'valid/step/loss' was not in top 1
Epoch 159, global step 54080: 'epoch' reached 159.00000 (best 159.00000), saving model to '/dlocal/run/19197/_model_save/159_duration_model.ckpt' as top 100
Generation Loop: 100%|██████████| 100/100 [05:37<00:00,  3.38s/it]
Generation Loop: 100%|██████████| 100/100 [05:38<00:00,  3.38s/it]
Generation Loop: 100%|██████████| 100/100 [05:38<00:00,  3.39s/it]
Generation Loop: 100%|██████████| 100/100 [05:39<00:00,  3.40s/it]
Epoch 160, global step 54418: 'valid/step/loss' was not in top 1
Epoch 161, global step 54756: 'valid/step/loss' was not in top 1
Epoch 162, global step 55094: 'valid/step/loss' was not in top 1
Epoch 163, global step 55432: 'valid/step/loss' was not in top 1
Epoch 164, global step 55770: 'valid/step/loss' was not in top 1
Epoch 165, global step 56108: 'valid/step/loss' was not in top 1
[rank: 0] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 3] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 1] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 2] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
Epoch 166, global step 56446: 'valid/step/loss' was not in top 1

As we can see, at epoch 150 the model does improve.
But at epoch 165 training stops (early stopping with a patience of 100), which should not be the case: the last improvement was only 15 epochs earlier.
In fact, I resumed training from 66_best_model.ckpt; this ckpt file is in a specific path outside my project for reasons. Stopping around epoch 166 is consistent with a patience of 100 counted from epoch 66, as if no improvement was ever registered after resuming.
It looks like the best metric never changed from the value stored in this ckpt.
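
As a workaround until the intended behaviour is clarified, the persisted state can be discarded so that a resumed run starts tracking from scratch. A sketch, assuming load_state_dict is the restore entry point; the subclass is mine, not a Lightning API:

import torch

class FreshEarlyStopping(EarlyStopping):
    # Discard the checkpointed state so a resumed run restarts its tracking.
    def load_state_dict(self, state_dict):
        self.wait_count = 0
        self.stopped_epoch = 0
        # best_score starts at +inf for mode="min" and -inf for mode="max".
        inf = torch.tensor(float("inf"))
        self.best_score = inf if self.mode == "min" else -inf

Note this also resets the patience counter, so it trades "continue exactly where we left off" for predictable behaviour after a resume.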

Environment

No response

More info

No response

cc @carmocca @awaelchli
