Add a centered variance option to the ClippedAdam optimizer #3415
base: dev
Conversation
@BenZickel can you please explain your figure? I don't know how a convergence rate is computed, and I can't tell if the differences in the second plot are significant given the scale.
A bit of googling led me here. The same algo in essence? A 2-second scan suggests they do bias correction: https://edoc.hu-berlin.de/server/api/core/bitstreams/14960a8d-4c35-4d08-86d7-1e130ecd42c8/content
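(For context, the "bias correction" mentioned here is the standard Adam correction of the zero-initialized moment estimates. The snippet below is a textbook sketch for reference only, not code from this PR or from the linked thesis.)

```python
def bias_corrected(m, v, t, beta1=0.9, beta2=0.999):
    # m and v are exponential moving averages of the gradient and the squared
    # gradient after t steps; dividing by (1 - beta**t) removes the bias that
    # comes from initializing the averages at zero.
    return m / (1 - beta1**t), v / (1 - beta2**t)
```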
Thanks for the review @martinjankowiak.
… the Latent Dirichlet Allocation example.
I've added the option to use the centered variance in the Latent Dirichlet Allocation (LDA) example. When running some tests I noticed that the centered variance option improves both the convergence rate and the ultimate loss for a wide range of learning rates. Additionally, the same phenomenon seen in the example above, namely reduced sensitivity of the convergence rate to changes in the learning rate, can be observed when using the centered variance option. The centered variance option in the LDA example can be used by running
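The exact command isn't reproduced above. As a rough, self-contained sketch of enabling the option through the optimizer API (the keyword name `centered_variance` is assumed from this PR's description, and the toy model and guide are placeholders, not the LDA example itself):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import ClippedAdam

def model(data):
    # Toy model standing in for the LDA example.
    loc = pyro.sample("loc", dist.Normal(0.0, 10.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=data)

guide = AutoNormal(model)
data = torch.randn(100) + 3.0

# `centered_variance` is the (assumed) name of the new ClippedAdam option.
optimizer = ClippedAdam({"lr": 0.1, "clip_norm": 10.0, "centered_variance": True})
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())
for _ in range(1000):
    svi.step(data)
```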
 Small modification to the Adam algorithm implemented in torch.optim.Adam
-to include gradient clipping and learning rate decay.
+to include gradient clipping and learning rate decay and an option to use
+the centered variance.
can you point to the ref here?
@@ -435,3 +435,105 @@ def step(svi, optimizer):
        actual.append(step(svi, optimizer))

    assert_equal(actual, expected)


def test_centered_clipped_adam(plot_results=False):
how long does this test take?
        loss_vec.append(loss)
    return torch.Tensor(loss_vec)


def calc_convergence(loss_vec, tail_len=100, threshold=0.01):
comment what is being computed?
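For what it's worth, here is one plausible reading of what a helper with this signature computes, written out purely as an illustration. The return names match the diff just below, but the definitions of `ultimate_loss`, `convergence_vec`, and `convergence_iter` are assumptions, not the PR's actual code.

```python
import torch

def calc_convergence_sketch(loss_vec, tail_len=100, threshold=0.01):
    # Assumed semantics, for illustration only:
    # - ultimate_loss: the loss level the optimizer settles at, estimated
    #   from the mean of the last `tail_len` iterations.
    # - convergence_vec: per-iteration distance from the ultimate loss.
    # - convergence_iter: first iteration at which that distance falls below
    #   `threshold` times its initial value.
    ultimate_loss = loss_vec[-tail_len:].mean()
    convergence_vec = (loss_vec - ultimate_loss).abs()
    below = (convergence_vec < threshold * convergence_vec[0]).nonzero()
    convergence_iter = int(below[0]) if len(below) else len(loss_vec)
    # Mean log-ratio of successive distances, i.e. an average exponential
    # decay rate per iteration (this line matches the diff below).
    convergence_rate = (convergence_vec[:-1] / convergence_vec[1:]).log().mean()
    return ultimate_loss, convergence_rate, convergence_iter
```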
    convergence_rate = (convergence_vec[:-1] / convergence_vec[1:]).log().mean()
    return ultimate_loss, convergence_rate, convergence_iter


def get_convergence_vec(lr_vec, centered_variance):
comment what is being computed?
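Regarding the earlier question about how the convergence rate is computed: judging from the `calc_convergence` tail above, it appears to be the mean log-ratio of successive entries of `convergence_vec` (presumably the distance to the ultimate loss), i.e. an estimate of the per-iteration exponential decay rate, with `get_convergence_vec` presumably sweeping a vector of learning rates and collecting these metrics for a given `centered_variance` setting. A tiny numerical check of that reading, illustrative only:

```python
import torch

# If the distance to the ultimate loss decays as c * exp(-r * t), then the
# mean log-ratio of successive entries recovers the decay rate r.
r = 0.05
t = torch.arange(200, dtype=torch.float)
convergence_vec = 3.0 * torch.exp(-r * t)
estimated_rate = (convergence_vec[:-1] / convergence_vec[1:]).log().mean()
print(estimated_rate)  # tensor(0.0500)
```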
Thanks @BenZickel! The motivation makes sense, and I can imagine how this might help, though I'm perhaps somewhat surprised by the size of the effect, though I guess your …
Problem
When using the ClippedAdam optimizer on parameters whose gradient stability is highly imbalanced, the convergence rate of the parameters with stable gradients is slower than it could be.
Solution
Add an option to use the centered variance in the denominator of the step size calculation. Parameters with stable gradients will have a lower centered variance than the current uncentered variance, and will therefore get a larger step size and a higher convergence rate.
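A minimal sketch of the idea, assuming the option simply swaps the uncentered second-moment estimate for a centered one in the Adam-style denominator (the function and argument names below are illustrative, not the PR's actual implementation):

```python
import torch

def adam_like_step(param, grad, m, v, t, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8, centered_variance=False):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Standard Adam bias correction of the zero-initialized averages.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    if centered_variance:
        # Centered variance: subtract the squared first moment, so a large but
        # stable gradient yields a small denominator and hence a large step.
        denom = (v_hat - m_hat**2).clamp(min=0.0).sqrt() + eps
    else:
        # Uncentered second moment as in standard Adam: a large but stable
        # gradient still yields a large denominator and a small step.
        denom = v_hat.sqrt() + eps
    new_param = param - lr * m_hat / denom
    return new_param, m, v
```

For a gradient that is large but nearly constant, `v_hat` stays close to the squared mean while `v_hat - m_hat**2` stays close to the (small) gradient variance, so the centered denominator is much smaller and the effective step size much larger, which is the effect described above.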
Testing
The improvement in convergence rate is shown below (taken from the test function run with plotting enabled):