You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
awaelchli
changed the title
resume_from_checkpoint broken on in 1.4 - does not advance loop
resume_from_checkpoint broken when fault-tolerant feature enabled
Aug 10, 2021
🐛 Bug
A trainer with the resume from checkpoint option does not continue training and stops immediately, despite increased max_epoch settings.
To Reproduce
Output:
Here the training_step does not get invoked on the second fit call. Instead, the epochs end immediately.
ONLY happens when PL_FAULT_TOLERANT_TRAINING=1. This is an experimental feature which is off by default.
Expected behavior
The training continues due to the increased max_epochs setting.
The text was updated successfully, but these errors were encountered: