Adadelta #1122

mohomran · 2014-09-20T22:05:31Z

Initial implementation of the Adadelta solver as proposed in "ADADELTA: An Adaptive Learning Rate Method" (Zeiler, 2012). Motivation: http://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html

Performance on the MNIST autoencoder demo is more or less on par with standard SGD+momentum but not as good as the Nesterov solver. The lack of a learning rate does seem to be a problem towards later iterations in that loss/accuracy don't entirely converge, but this could be due to an implementation issue.

Iteration 64500, Testing net (#0)
Test loss: 59.4627
    Test net output #0: cross_entropy_loss = 59.4627 (* 1 = 59.4627 loss)
    Test net output #1: l2_error = 1.82881
Iteration 64500, Testing net (#1)
Test loss: 59.7422
    Test net output #0: cross_entropy_loss = 59.7422 (* 1 = 59.7422 loss)
    Test net output #1: l2_error = 1.92399
Iteration 65000, loss = 62.1569
Iteration 65000, Testing net (#0)
Test loss: 60.7756
    Test net output #0: cross_entropy_loss = 60.7756 (* 1 = 60.7756 loss)
    Test net output #1: l2_error = 2.05861
Iteration 65000, Testing net (#1)
Test loss: 61.0705
    Test net output #0: cross_entropy_loss = 61.0705 (* 1 = 61.0705 loss)
    Test net output #1: l2_error = 2.15354

(for comparison see: #741 (comment))

A couple of things to note:

Adadelta requires the tracking of both gradient and update history. I chose to store both sequentially in "history_" for a couple of reasons, e.g. to reuse SnapshotSolverState() and RestoreSolverState().
All the tests pass, but a couple (those with multiple iterations) are ridiculously slow, even though the MNIST demo for example is not noticeably slower with Adadelta compared to the other solvers. I still need to look into this.
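For readers unfamiliar with the method, the update rule from the paper can be sketched roughly as follows. This is a minimal plain-Python illustration, not the code in this PR: the function names, the defaults rho=0.95 and eps=1e-6, and the layout of the single `history` list (gradient history first, update history second) are all assumptions made for the example.

```python
import math


def adadelta_step(params, grads, history, rho=0.95, eps=1e-6):
    """Apply one Adadelta update in place.

    `history` has 2 * len(params) slots. Assumed layout (hypothetical):
    slots [0, n) hold the running average of squared gradients,
    slots [n, 2n) hold the running average of squared updates.
    """
    n = len(params)
    for i in range(n):
        g = grads[i]
        # Accumulate the squared gradient.
        history[i] = rho * history[i] + (1.0 - rho) * g * g
        # Step = RMS of past updates / RMS of gradients, times the gradient.
        dx = -math.sqrt(history[n + i] + eps) / math.sqrt(history[i] + eps) * g
        # Accumulate the squared update.
        history[n + i] = rho * history[n + i] + (1.0 - rho) * dx * dx
        params[i] += dx
    return params


def minimize_quadratic(steps=50):
    """Run a few steps on f(x) = x0^2 + x1^2 as a toy check."""
    params = [1.0, -2.0]
    history = [0.0] * (2 * len(params))  # both histories start at zero
    for _ in range(steps):
        grads = [2.0 * p for p in params]
        adadelta_step(params, grads, history)
    return params, history
```

Keeping both running averages in one container means a single snapshot/restore path covers all solver state, which is the reuse argument made above.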

shelhamer · 2014-09-20T23:37:05Z

Good start! Comment once you've investigated the initial issues for review.

File Changes

A examples/mnist/lenet_adadelta_solver.prototxt
https://github.com/BVLC/caffe/pull/1122/files#diff-0 (22)

A examples/mnist/mnist_autoencoder_solver_adadelta.prototxt
https://github.com/BVLC/caffe/pull/1122/files#diff-1 (17)

A examples/mnist/train_mnist_autoencoder_adadelta.sh
https://github.com/BVLC/caffe/pull/1122/files#diff-2 (4)

M include/caffe/solver.hpp
https://github.com/BVLC/caffe/pull/1122/files#diff-3 (23)

M src/caffe/proto/caffe.proto
https://github.com/BVLC/caffe/pull/1122/files#diff-4 (1)

M src/caffe/solver.cpp
https://github.com/BVLC/caffe/pull/1122/files#diff-5 (199)

M src/caffe/test/test_gradient_based_solver.cpp
https://github.com/BVLC/caffe/pull/1122/files#diff-6 (105)


muupan · 2014-11-11T09:23:51Z

Hello,

I'm currently using this PR in my project (https://github.com/muupan/dqn-in-the-caffe). I think allowing base_lr and lr_policy would be helpful in case AdaDelta does not converge. In my case, using the original AdaDelta caused divergence, so I multiplied the ~~gradients~~ updates by 0.2 and it worked.
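The workaround described above might look something like the sketch below for a single scalar parameter. The function name and the `update_scale` argument are hypothetical, and whether the update history should track the scaled or the unscaled step is not stated in this thread; the unscaled choice here is an assumption.

```python
import math


def scaled_adadelta_step(param, grad, grad_hist, update_hist,
                         update_scale=0.2, rho=0.95, eps=1e-6):
    """Return (new_param, new_grad_hist, new_update_hist) for one scalar step."""
    grad_hist = rho * grad_hist + (1.0 - rho) * grad * grad
    dx = -math.sqrt(update_hist + eps) / math.sqrt(grad_hist + eps) * grad
    # Assumption: the history tracks the unscaled update,
    # and only the step actually applied to the parameter is scaled.
    update_hist = rho * update_hist + (1.0 - rho) * dx * dx
    return param + update_scale * dx, grad_hist, update_hist
```

With `update_scale=1.0` this reduces to the unscaled rule, so a base_lr-style setting of 1.0 would keep the behaviour described in the paper.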

With respect to the slow tests, I wonder why kNumIters for AdaDeltaSolver is so large (= 500),

https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L566
https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L577

while those for other solvers are small (=4).

https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L390
https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L379

Are these 500 iterations necessary?

mohomran · 2014-11-11T17:43:57Z

slow tests: The slowness of the tests seems to have nothing to do with the number of iterations, but rather with how I'm storing both update and gradient history. They are unreasonably slow even for a small number of iterations. I haven't had a chance to look into it in detail.
learning rate: Yes, I suspect that the lack of a global learning rate might be a problem. It could also be that the history needs to be randomly initialised, since the initial step size is otherwise too large and takes a while to decrease to a reasonable value. It could also be a bug. I would need to double-check the implementation against the one in ConvNetJS, but I can only get around to it next week.
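One way to probe the initialisation hypothesis is to compute the very first step for different starting histories. The helper below is a hypothetical sketch that follows the update rule in the paper, with assumed defaults rho=0.95 and eps=1e-6; it is not taken from this PR or from ConvNetJS.

```python
import math


def first_step(grad, grad_hist0=0.0, update_hist0=0.0, rho=0.95, eps=1e-6):
    """Return the first Adadelta update for a scalar gradient.

    grad_hist0 and update_hist0 are the initial running averages
    (zero by default, matching zero-initialised history).
    """
    grad_hist = rho * grad_hist0 + (1.0 - rho) * grad * grad
    return -math.sqrt(update_hist0 + eps) / math.sqrt(grad_hist + eps) * grad


if __name__ == "__main__":
    # With zero history and grad^2 much larger than eps, the magnitude is
    # roughly sqrt(eps / (1 - rho)), whatever the size of the gradient.
    for g in (2.0, 200.0):
        print(g, first_step(g))
    # A non-zero initial gradient history shrinks the first step.
    print(first_step(2.0, grad_hist0=4.0))
```

Comparing these values against the step sizes seen a few hundred iterations into training would show whether the initial steps are out of proportion.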

Thank you for the feedback!

shelhamer · 2015-04-01T23:04:13Z

Replaced by #2204. Thanks for starting this @mohomran!

mohomran added 3 commits September 20, 2014 23:40

implemented basic AdaDelta (section 3.1 in paper) + simple test cases

29c6981

full adadelta + additional tests

599b628

mnist demos with adadelta

e69c2e0

shelhamer mentioned this pull request Sep 21, 2014

AdaDelta Solver #1101

Closed

shelhamer added in progress and removed in progress labels Sep 21, 2014

shelhamer force-pushed the dev branch from d8eb4df to 914da95 Compare October 8, 2014 16:36

sergeyk force-pushed the dev branch from 2fb4c97 to 1718903 Compare October 17, 2014 18:44

shelhamer added the ES label Mar 10, 2015

kevinbache mentioned this pull request Mar 26, 2015

AdaDelta v2 #2204

Closed

shelhamer closed this Apr 1, 2015
