Has there been any research on how this strategy interacts with a learning rate schedule? Especially for something extreme like the one-cycle policy (super convergence). It seems like the history of the scale of the gradient would be dominated by changes in the learning rate. I found a paper that touches on the subject but doesn't propose any theory for, or solution to, the interaction between the two.
From https://hal.science/hal-03891707v1/file/Learning_rate_scheduling_and_gradient_clipping_for_audio_source_separation.pdf:

> As expected, AutoClip doesn't interact well with cosine annealing
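For context on why a schedule could dominate the statistic: AutoClip sets the clipping threshold to a percentile of the full history of gradient norms, so any trend in norm magnitude over training (e.g. one driven by the learning rate schedule) directly shifts the threshold. A minimal NumPy sketch of the percentile-clipping step (function name, epsilon, and default percentile are illustrative, not from the AutoClip codebase):

```python
import numpy as np

def autoclip_step(grad_norm_history, new_grad_norm, percentile=10.0):
    """One AutoClip-style step (sketch): record the latest gradient norm,
    then compute the clip threshold as the p-th percentile of the whole
    history, and the scale factor that would be applied to the gradient."""
    grad_norm_history.append(new_grad_norm)
    clip_value = float(np.percentile(grad_norm_history, percentile))
    # Gradient is rescaled only when its norm exceeds the threshold
    scale = min(1.0, clip_value / (new_grad_norm + 1e-12))
    return clip_value, scale
```

Because the percentile is taken over the *entire* history, a long stretch of large early norms (high-LR phase) keeps the threshold anchored to that regime even after the schedule has annealed the norms down, which is consistent with the interaction the paper observes.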