Hi, when converting my repo to Lightning format I failed to reproduce the training losses I used to get (and the model failed to converge). After some debugging, it seems there is a conflict between Lightning and the torchlars optimizer (https://github.com/kakaobrain/torchlars).
The problem only happens when using newer versions of PyTorch Lightning together with the torchlars optimizer.
After a bit of black-box debugging, here are some workarounds I found for now (they reproduce the exact losses I got before moving my code to Lightning):
- Downgrading Lightning to 1.0.3
- Using another optimizer
- Returning None in configure_optimizers() and stepping the optimizer myself, e.g.:
optimizer.zero_grad()
loss.backward()
optimizer.step()
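For completeness, here is roughly what the third workaround looks like end-to-end. The module, data, and hyperparameters below are placeholders, and the exact behavior of returning None from configure_optimizers() can differ between Lightning versions:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torchlars import LARS


class MyModel(pl.LightningModule):  # placeholder module, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        # Build the LARS-wrapped optimizer ourselves so Lightning never drives it.
        base_optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        self._optimizer = LARS(optimizer=base_optimizer)

    def configure_optimizers(self):
        # Returning None tells Lightning not to manage any optimizer (or closure).
        return None

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # Plain PyTorch stepping, identical to the snippet above.
        self._optimizer.zero_grad()
        loss.backward()
        self._optimizer.step()
        self.log("train_loss", loss)
        # Intentionally not returning the loss so Lightning doesn't try to backward it again.
```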
Thanks
After some debugging, it turns out the problem is that torchlars doesn't call the closure before performing its operations.
I'm guessing the inspection in this fix doesn't work in this case: #4981 (comment)
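For context, my understanding is that Lightning's newer loop wraps the forward and backward pass in a closure and passes it into optimizer.step(), expecting the optimizer to run it before updating the parameters. Stock torch.optim optimizers follow that convention, roughly like this (paraphrased, not torchlars code):

```python
def step(self, closure=None):
    loss = None
    if closure is not None:
        loss = closure()  # runs forward + backward, so .grad is fresh before the update
    # ... parameter update that reads the now-populated .grad fields ...
    return loss
```

Since torchlars does its LARS scaling without running the closure first, the gradients it sees are stale, which would explain the different losses.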
If anyone else encounters this issue, the easy fix is to override the optimizer_step hook and call the closure yourself (without passing it to the optimizer):
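Something along these lines worked for me; note that the optimizer_step signature below matches the Lightning releases around 1.1 and may differ in other versions:

```python
# Inside the LightningModule: run the closure ourselves, then step without it.
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                   optimizer_closure, on_tpu=False, using_native_amp=False, using_lbfgs=False):
    optimizer_closure()   # forward + backward, so gradients are fresh for LARS
    optimizer.step()      # torchlars step, called without the closure
```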
@amitz25 please reach out to the Kakao Brain team to get the torchlars optimizer updated according to the latest torch optimizer requirements! Any recent torch Optimizer that follows the standard step(closure) convention should work with Lightning.