I haven't tried disabling enable_pl_optimizer. @tchaton, is there anything I should take care of when setting enable_pl_optimizer=False? I use DDP to train on a Slurm cluster.
🐛 Bug
enable_pl_optimizer (the default!) causes optimizers not to be restored properly from the checkpoint specified by resume_from_checkpoint.

BoringModel Colab Reproduction
The model is trained for 3 epochs and saved to a checkpoint. The checkpoint is then restored and trained for one further epoch (with different values of enable_pl_optimizer); the training loss is printed at each step.

The setup with enable_pl_optimizer=True shows a huge loss spike after the first optimizer step, suggesting that the optimizer state is not restored properly.

https://colab.research.google.com/drive/1lHYXm4MpnmXwPZTcPem4D4wwwU5vJhHc?usp=sharing
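For reference, here is a minimal plain-PyTorch sketch (illustrative only, not the Lightning internals) of what a correct resume looks like: both the model's and the optimizer's state dicts are checkpointed and reloaded, so Adam's step counts and moment buffers carry over and the next update matches an uninterrupted run.

```python
import torch
from torch import nn

# Minimal plain-PyTorch sketch of resuming with correctly restored optimizer
# state (illustrative only; Lightning is expected to do the equivalent of
# this internally when resuming from a checkpoint).
torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# A few training steps so the optimizer accumulates state
# (Adam's step counts and moment estimates).
for _ in range(3):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Checkpoint": save BOTH the model and the optimizer state dicts.
ckpt = {"model": model.state_dict(), "optimizer": opt.state_dict()}

# "Resume": build fresh objects, then restore both state dicts.
model2 = nn.Linear(4, 1)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-2)
model2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["optimizer"])

# The restored optimizer continues from the same step count, so training
# picks up where it left off instead of spiking.
assert opt2.state_dict()["state"][0]["step"] == opt.state_dict()["state"][0]["step"]
```

If only the model weights were restored (and the optimizer state silently reinitialized), Adam would restart with zeroed moment estimates, which is exactly the kind of loss spike described above.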
Expected behavior
PL optimizers are restored such that there is no huge loss spike after resuming, just as when enable_pl_optimizer=False.

Environment

See the Colab notebook.