Labels: bug, callback: early stopping, help wanted, ver: 2.0.x
Bug description
Early stopping callback does not update its best value after resuming training from a checkpoint.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

ckpt_path = "some_path/66_best_model.ckpt"
es = EarlyStopping(
    monitor="valid/step/loss",
    min_delta=0,
    patience=100,
    verbose=True,
    mode="min",
    check_on_train_epoch_end=False,
)
trainer = pl.Trainer(callbacks=[es])
# `model` and `datamodule` are built as in the rest of the training script
trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)
Error messages and logs
Hi,
I have an EarlyStopping and a ModelCheckpoint monitoring the same metric in my callbacks, and I am resuming an existing training run.
It looks like EarlyStopping does not update its best score when training is resumed from a checkpoint.
I got this output:
Epoch 149, global step 50700: 'epoch' reached 149.00000 (best 149.00000), saving model to '/dlocal/run/19197/_model_save/149_duration_model.ckpt' as top 100
Epoch 150, global step 51038: 'valid/step/loss' reached 0.00266 (best 0.00266), saving model to '/dlocal/run/19197/_model_save/150_best_model.ckpt' as top 1
Epoch 151, global step 51376: 'valid/step/loss' was not in top 1
Epoch 152, global step 51714: 'valid/step/loss' was not in top 1
Epoch 153, global step 52052: 'valid/step/loss' was not in top 1
Epoch 154, global step 52390: 'valid/step/loss' was not in top 1
Epoch 155, global step 52728: 'valid/step/loss' was not in top 1
Epoch 156, global step 53066: 'valid/step/loss' was not in top 1
Epoch 157, global step 53404: 'valid/step/loss' was not in top 1
Epoch 158, global step 53742: 'valid/step/loss' was not in top 1
Epoch 159, global step 54080: 'valid/step/loss' was not in top 1
Epoch 159, global step 54080: 'epoch' reached 159.00000 (best 159.00000), saving model to '/dlocal/run/19197/_model_save/159_duration_model.ckpt' as top 100
Generation Loop: 100%|██████████| 100/100 [05:37<00:00, 3.38s/it]
Generation Loop: 100%|██████████| 100/100 [05:38<00:00, 3.38s/it]
Generation Loop: 100%|██████████| 100/100 [05:38<00:00, 3.39s/it]
Generation Loop: 100%|██████████| 100/100 [05:39<00:00, 3.40s/it]
Epoch 160, global step 54418: 'valid/step/loss' was not in top 1
Epoch 161, global step 54756: 'valid/step/loss' was not in top 1
Epoch 162, global step 55094: 'valid/step/loss' was not in top 1
Epoch 163, global step 55432: 'valid/step/loss' was not in top 1
Epoch 164, global step 55770: 'valid/step/loss' was not in top 1
Epoch 165, global step 56108: 'valid/step/loss' was not in top 1
[rank: 0] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 3] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 1] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
[rank: 2] Monitored metric valid/step/loss did not improve in the last 100 records. Best score: 0.003. Signaling Trainer to stop.
Epoch 166, global step 56446: 'valid/step/loss' was not in top 1
As we can see, the monitored loss does improve at epoch 150 (0.00266, better than the 0.003 that EarlyStopping later reports as best).
Yet training stops after epoch 165 because the early stopping patience of 100 runs out, which should not happen: the improvement at epoch 150 should have reset the counter.
In fact, I resumed training from 66_best_model.ckpt; this ckpt file lives in a specific path outside my project for unrelated reasons.
It looks like the best score EarlyStopping restored from that checkpoint was never updated afterwards.
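A quick way to check what EarlyStopping actually restores is to open the checkpoint and look at the saved callback state. This is only a diagnostic sketch: it assumes the checkpoint was written by Lightning (so callback states sit under the "callbacks" key) and that the EarlyStopping state dict carries best_score, wait_count and patience as in recent versions; the path is the one from the report.

import torch

ckpt_path = "some_path/66_best_model.ckpt"  # path from the report
ckpt = torch.load(ckpt_path, map_location="cpu")

# Callback states are stored under "callbacks", keyed by a state key such as
# "EarlyStopping{'monitor': 'valid/step/loss', 'mode': 'min'}".
for state_key, state in ckpt.get("callbacks", {}).items():
    if state_key.startswith("EarlyStopping"):
        print(state_key)
        print("  best_score:", state.get("best_score"))
        print("  wait_count:", state.get("wait_count"))
        print("  patience:  ", state.get("patience"))

If the best_score printed here matches the 0.003 reported at the stop, and it never moves to 0.00266 in checkpoints written after epoch 150, that would support the idea that the restored state is kept verbatim instead of being updated.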
Environment
No response
More info
No response
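A possible stopgap, if the restored callback state really is the culprit: a sketch (untested here) of a subclass that simply discards the state Lightning tries to restore, so the callback starts from its fresh initial values after resuming. The class name FreshEarlyStopping is made up for this example.

from pytorch_lightning.callbacks import EarlyStopping


class FreshEarlyStopping(EarlyStopping):
    # Hypothetical workaround: ignore the EarlyStopping state stored in the
    # checkpoint, so best_score, wait_count and stopped_epoch keep the values
    # set in __init__ when training resumes.
    def load_state_dict(self, state_dict):
        pass  # intentionally skip restoring the saved counters

Passing FreshEarlyStopping in place of EarlyStopping to the Trainer callbacks would of course also throw away any legitimately accumulated patience from before the resume, which may or may not be acceptable for a given run.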