
enable_pl_optimizer causes optimizers to not be restored properly #5224

Closed
PiotrDabkowski opened this issue Dec 21, 2020 · 7 comments · Fixed by #5244
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)
Milestone: 1.1.x

Comments
PiotrDabkowski commented Dec 21, 2020

🐛 Bug

enable_pl_optimizer (the default!) causes optimizers not to be restored properly from the checkpoint specified by resume_from_checkpoint.

BoringModel Colab Reproduction

The model is trained for 3 epochs and saved in a checkpoint. The checkpoint is then restored and trained for 1 further epoch (with different values of enable_pl_optimizer), and the training loss is printed at each step.
The run with enable_pl_optimizer=True shows a huge loss spike after the first optimizer step, suggesting that the optimizer state is not restored properly.

https://colab.research.google.com/drive/1lHYXm4MpnmXwPZTcPem4D4wwwU5vJhHc?usp=sharing
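The symptom above can be illustrated outside Lightning entirely. The sketch below uses a hypothetical minimal momentum optimizer (not Lightning's or PyTorch's internals): when the optimizer's running state is silently reset on resume, the first post-resume update differs sharply from the one a correctly restored optimizer would take.

```python
# Toy illustration of the bug's symptom: a momentum-style optimizer whose
# state survives the checkpoint round trip takes the step it would have
# taken anyway; one whose state is silently reset takes a very different
# first step, which shows up as a loss jump after resuming.

class MomentumSGD:
    """Minimal SGD with momentum; the only state is the velocity buffer."""
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, 0.0

    def state_dict(self):
        return {"v": self.v}

    def load_state_dict(self, state):
        self.v = state["v"]

    def step(self, param, grad):
        self.v = self.beta * self.v + grad
        return param - self.lr * self.v

# Build up momentum over a few steps, then "checkpoint".
opt = MomentumSGD()
p = 1.0
for _ in range(10):
    p = opt.step(p, grad=1.0)
ckpt = {"param": p, "opt": opt.state_dict()}

# Correct resume: restore the optimizer state from the checkpoint.
good = MomentumSGD()
good.load_state_dict(ckpt["opt"])
p_good = good.step(ckpt["param"], grad=1.0)

# Buggy resume: optimizer state silently reset to its initial value.
bad = MomentumSGD()
p_bad = bad.step(ckpt["param"], grad=1.0)

# The two first post-resume parameter updates differ substantially.
print("restored step:", ckpt["param"] - p_good)
print("reset step:   ", ckpt["param"] - p_bad)
```

With these (arbitrary) hyperparameters the restored optimizer's first step is several times larger than the reset one's, so the resumed trajectory diverges immediately from the original run, matching the observed spike.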

Expected behavior

PL optimizers are restored such that there is no huge loss spike after resuming, just as when enable_pl_optimizer=False.
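The expected behavior can be phrased as a round-trip check (a sketch with illustrative stand-in data, not Lightning's API): the optimizer state dict read back after resume should equal the one that was saved, and an empty running state is the signature of this bug.

```python
# Hypothetical sketch: after resuming from a checkpoint, the optimizer's
# state_dict should round-trip unchanged. The dicts below are illustrative
# stand-ins for the shape torch-style optimizers expose via state_dict().

def state_restored(saved, restored):
    """True if the restored optimizer state matches what was saved."""
    return saved == restored

saved = {"state": {0: {"momentum_buffer": [0.5, -0.2]}},
         "param_groups": [{"lr": 0.01, "momentum": 0.9}]}

# Correct resume: the running state survives the round trip.
restored_ok = {"state": {0: {"momentum_buffer": [0.5, -0.2]}},
               "param_groups": [{"lr": 0.01, "momentum": 0.9}]}

# Buggy resume: hyperparameters survive but the running state is empty.
restored_reset = {"state": {},
                  "param_groups": [{"lr": 0.01, "momentum": 0.9}]}

ok = state_restored(saved, restored_ok)        # expected: True
reset = state_restored(saved, restored_reset)  # expected: False
```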

Environment

See Colab.

@PiotrDabkowski PiotrDabkowski added bug Something isn't working help wanted Open to be worked on labels Dec 21, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@heng-yuwen

Hi, I think you can check #4655; these problems seem related. @PiotrDabkowski

@heng-yuwen

The loaded optimiser is restored as the initial (freshly constructed) one.

@PiotrDabkowski
Author

Possibly, but I'm not sure. Does disabling pl_optimizer fix #4655 as well?

@tchaton
Contributor

tchaton commented Dec 22, 2020

Hey @PiotrDabkowski,

enable_pl_optimizer has been reset to False by default.

We will look into this bug.

Thanks,
T.C

@tchaton tchaton added the priority: 1 Medium priority task label Dec 22, 2020
@tchaton tchaton added this to the 1.1.x milestone Dec 22, 2020
@heng-yuwen

heng-yuwen commented Dec 22, 2020

I didn't try disabling enable_pl_optimizer. @tchaton, is there anything I should take care of when enable_pl_optimizer=False? I use DDP to train on a Slurm cluster.

@tchaton
Contributor

tchaton commented Dec 23, 2020

Hey @Hyw1994,

I checked your notebook! You are entirely right.
I will look into it today, as re-enabling LightningOptimizer is a high priority :)

Best regards,
T.C
