
Adadelta #1122

Closed
wants to merge 3 commits

Conversation

mohomran (Contributor)

Initial implementation of the Adadelta solver as proposed in "ADADELTA: An Adaptive Learning Rate Method" (Zeiler, 2012). Motivation: http://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html

Performance on the MNIST autoencoder demo is more or less on par with standard SGD+momentum but not as good as the Nesterov solver. The lack of a learning rate does seem to be a problem towards later iterations in that loss/accuracy don't entirely converge, but this could be due to an implementation issue.

Iteration 64500, Testing net (#0)
Test loss: 59.4627
    Test net output #0: cross_entropy_loss = 59.4627 (* 1 = 59.4627 loss)
    Test net output #1: l2_error = 1.82881
Iteration 64500, Testing net (#1)
Test loss: 59.7422
    Test net output #0: cross_entropy_loss = 59.7422 (* 1 = 59.7422 loss)
    Test net output #1: l2_error = 1.92399
Iteration 65000, loss = 62.1569
Iteration 65000, Testing net (#0)
Test loss: 60.7756
    Test net output #0: cross_entropy_loss = 60.7756 (* 1 = 60.7756 loss)
    Test net output #1: l2_error = 2.05861
Iteration 65000, Testing net (#1)
Test loss: 61.0705
    Test net output #0: cross_entropy_loss = 61.0705 (* 1 = 61.0705 loss)
    Test net output #1: l2_error = 2.15354

(for comparison see: #741 (comment))

A couple of things to note:

  • Adadelta requires tracking both gradient and update history. I chose to store both sequentially in "history_" for a couple of reasons, e.g. to reuse SnapshotSolverState() and RestoreSolverState().
  • All the tests pass, but a couple (those with multiple iterations) are extremely slow, even though the MNIST demo, for example, is not noticeably slower with Adadelta than with the other solvers. I still need to look into this.
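For context, the core AdaDelta rule from section 3.1 of Zeiler's paper can be sketched in a few lines of Python. The names here (adadelta_step, sq_grad_avg, sq_update_avg) are illustrative only, not the PR's actual Caffe code; the two running averages correspond to the two halves stored sequentially in history_:

```python
import math

def adadelta_step(params, grads, sq_grad_avg, sq_update_avg,
                  rho=0.95, eps=1e-6):
    """One AdaDelta update (Zeiler 2012, Algorithm 1) over a list of
    scalar parameters. sq_grad_avg and sq_update_avg are the running
    averages of squared gradients and squared updates, updated in place.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Accumulate gradient: E[g^2] <- rho * E[g^2] + (1 - rho) * g^2
        sq_grad_avg[i] = rho * sq_grad_avg[i] + (1 - rho) * g * g
        # Compute update from the ratio of RMS values; note there is
        # no global learning rate anywhere in this rule.
        update = (-math.sqrt(sq_update_avg[i] + eps)
                  / math.sqrt(sq_grad_avg[i] + eps) * g)
        # Accumulate updates: E[dx^2] <- rho * E[dx^2] + (1 - rho) * dx^2
        sq_update_avg[i] = rho * sq_update_avg[i] + (1 - rho) * update * update
        new_params.append(p + update)
    return new_params
```

Because the accumulators start at zero, the first steps are on the order of sqrt(eps)/RMS[g], which is consistent with the slow early convergence discussed below.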

@shelhamer (Member)

Good start! Comment once you've investigated the initial issues for review.

On Saturday, September 20, 2014, Mohamed Omran wrote:

(quoted pull request description omitted — see above)

You can merge this Pull Request by running

git pull https://github.com/mohomran/caffe adadelta

Commit Summary

  • implemented basic AdaDelta (section 3.1 in paper) + simple test cases
  • full adadelta + additional tests
  • mnist demos with adadelta

Evan Shelhamer
@muupan commented Nov 11, 2014

Hello,

I'm currently using this PR in my project (https://github.com/muupan/dqn-in-the-caffe). I think allowing base_lr and lr_policy would be helpful in case AdaDelta does not converge. In my case, the original AdaDelta diverged, so I multiplied the updates by 0.2 and it worked.
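A minimal sketch of that scaling, assuming the factor multiplies the raw AdaDelta step before the step is accumulated into the update history (the function and parameter names here are hypothetical, not Caffe's API):

```python
def scaled_adadelta_update(g, sq_grad_avg, sq_update_avg,
                           base_lr=0.2, rho=0.95, eps=1e-6):
    """One scalar AdaDelta update scaled by a global factor base_lr.

    base_lr=1.0 recovers the original parameter-free rule; base_lr<1.0
    damps every step, which is one way to tame divergence. Returns the
    update and the new accumulator values.
    """
    # Accumulate squared gradient as in plain AdaDelta.
    sq_grad_avg = rho * sq_grad_avg + (1 - rho) * g * g
    # Raw AdaDelta step from the ratio of RMS values.
    update = -(sq_update_avg + eps) ** 0.5 / (sq_grad_avg + eps) ** 0.5 * g
    # Global scaling, as suggested above (0.2 in muupan's experiments).
    update *= base_lr
    # Accumulate the (scaled) squared update.
    sq_update_avg = rho * sq_update_avg + (1 - rho) * update * update
    return update, sq_grad_avg, sq_update_avg
```

Whether the squared-update average should accumulate the scaled or the unscaled step is a design choice; this sketch accumulates the scaled one.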

With respect to the slow tests, I wonder why kNumIters for AdaDeltaSolver is so large (= 500),

https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L566
https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L577

while those for other solvers are small (=4).

https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L390
https://github.com/mohomran/caffe/blob/adadelta/src/caffe/test/test_gradient_based_solver.cpp#L379

Are these 500 iterations necessary?

@mohomran (Contributor, Author)

  • slow tests: The slowness of the tests seems to have nothing to do with the number of iterations, but rather with how I'm storing both the update and gradient history. It's unreasonably slow even for a small number of iterations. I haven't had a chance to look into it in detail.
  • learning rate: Yes, I suspect that the lack of some global learning rate might be a problem. It could also be that the history needs to be randomly initialised, since the initial step size is otherwise too large and takes a while to decrease to a reasonable value. It could also be a bug. I would need to double-check the implementation against the one in ConvNetJS, but I can only get around to it next week.
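The sequential storage scheme mentioned in both comments can be pictured as one flat vector, with the squared-gradient averages for all parameter blobs in the first half and the squared-update averages in the second, so that code serialising a single history_ vector works unchanged. A toy sketch with hypothetical names:

```python
def make_history(num_blobs):
    """Flat history vector: entries 0..N-1 hold E[g^2] per blob,
    entries N..2N-1 hold E[dx^2] per blob (mirroring how this PR
    packs both accumulators sequentially into history_)."""
    return [0.0] * (2 * num_blobs)

def grad_hist(history, i):
    # Squared-gradient average for blob i lives in the first half.
    return history[i]

def update_hist(history, i):
    # Squared-update average for blob i lives in the second half.
    return history[len(history) // 2 + i]
```

With this layout, snapshot and restore only ever see one vector of length 2N, which is why SnapshotSolverState() and RestoreSolverState() can be reused as-is.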

Thank you for the feedback!

@shelhamer shelhamer added the ES label Mar 10, 2015
@kevinbache kevinbache mentioned this pull request Mar 26, 2015
@shelhamer
Copy link
Member

Replaced by #2204. Thanks for starting this @mohomran!

@shelhamer shelhamer closed this Apr 1, 2015