Decouple the computational batch size and minibatch size by accumulating gradients #1663
Conversation
Branch updated from a4d2e6d to d76653a.
With layers whose backward passes accumulate gradients, this effectively decouples the computational batch size from the SGD minibatch size: each iteration accumulates gradients over `iter_size` batches, and then the parameters are updated.
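For concreteness, here is a minimal sketch of one SGD iteration under this scheme, written against the current Caffe `Net`/`Blob` and math-function APIs; the helper function, the plain SGD update, and the 1/`iter_size` scaling are illustrative, not the solver code in this PR:

```cpp
#include <caffe/caffe.hpp>
#include <caffe/util/math_functions.hpp>

// One SGD iteration when Backward always accumulates parameter gradients.
// E.g., a data-layer batch_size of 32 with iter_size 4 gives an effective
// minibatch of 128 while only 32 examples are resident at once.
void AccumulatingStep(caffe::Net<float>& net, int iter_size, float lr) {
  // Zero the parameter diffs once per iteration (Backward no longer does this).
  for (const auto& param : net.params()) {
    caffe::caffe_set(param->count(), 0.0f, param->mutable_cpu_diff());
  }
  // Accumulate gradients over iter_size computational batches.
  for (int i = 0; i < iter_size; ++i) {
    net.Forward();
    net.Backward();  // adds this batch's gradients into the parameter diffs
  }
  // Average the accumulated gradients and apply a plain SGD update.
  for (const auto& param : net.params()) {
    caffe::caffe_scal(param->count(), 1.0f / iter_size, param->mutable_cpu_diff());
    caffe::caffe_axpy(param->count(), -lr, param->cpu_diff(),
                      param->mutable_cpu_data());
  }
}
```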
Have we thought about how to handle the case where we're sharing parameters but using different learning rates? I would be okay with simply disallowing that case, since it would probably be a pretty weird thing to do. Otherwise, the only way I can think of to handle it is pretty messy: we could have a special case where, e.g., if `blobs_lr` is 2 in one layer but 1 in all others, the Net could prescale the `top_diff` of the layer with `blobs_lr` 2 by a factor of 2... Actually, even that wouldn't work if the layer has other shared param blobs that don't also have the same relative LR.
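To make the concern concrete, here is a hypothetical numeric sketch (the layers, values, and `base_lr` are invented for illustration): once contributions from layers that share a parameter blob are summed into one diff buffer, a per-layer learning rate can no longer be applied at update time.

```cpp
#include <cstdio>

int main() {
  // Suppose layers A and B share weight blob W, with blobs_lr 2 for A and 1 for B.
  const float base_lr = 0.01f;
  const float dW_from_A = 0.5f;  // gradient contribution from A's backward pass
  const float dW_from_B = 0.3f;  // gradient contribution from B's backward pass

  // With always-on accumulation, both contributions land in the same diff buffer:
  const float W_diff = dW_from_A + dW_from_B;

  // What the per-layer rates ask for:
  const float desired_update = base_lr * (2.0f * dW_from_A + 1.0f * dW_from_B);
  // What any single scaling of the accumulated diff can produce:
  const float accumulated_update = base_lr * W_diff;

  std::printf("desired %.4f vs. accumulated %.4f\n", desired_update, accumulated_update);
  // Prescaling A's top_diff by 2 would rescue this particular case, but not a
  // layer that shares several parameter blobs with different relative LRs.
  return 0;
}
```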
Always accumulating is simple and good, but let's review the weight-sharing and solver issues before merging.
Replaced by #1977.
After #1615, so that this code already supports the deconvolution layer. (The actual diff is just +37/-40 lines.)
This PRs the gradient accumulation branch living at https://github.com/shelhamer/caffe/tree/accum-grad. I took a lighter approach here than the one there: parameter gradients are always accumulated; there is no other option. The gradient checker is made correct by zero-initing parameter diffs.
Issues:
- This changes the behavior of `Backward`: parameter diffs are no longer zeroed before gradients are added into them. External code that used `Backward` is likely to break, if there is any (see the sketch below).
- Accumulation may also interact with `SGDSolver`, but I haven't thought carefully about that yet.
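To spell out the first issue, an illustrative fragment (assuming the current `Net`/`Blob` and math-function APIs, and the pre-PR behavior of overwriting parameter diffs): calling `Backward` twice now sums the two gradients unless the caller clears the diffs in between.

```cpp
#include <caffe/caffe.hpp>
#include <caffe/util/math_functions.hpp>

// External code that drives Net directly must now manage parameter diffs:
// back-to-back Backward calls accumulate rather than overwrite.
void ExternalBackwardExample(caffe::Net<float>& net) {
  net.Forward();
  net.Backward();  // parameter diffs hold g1
  net.Forward();
  net.Backward();  // parameter diffs hold g1 + g2 (previously just g2)

  // To get a standalone gradient, zero the diffs explicitly first:
  for (const auto& param : net.params()) {
    caffe::caffe_set(param->count(), 0.0f, param->mutable_cpu_diff());
  }
  net.Forward();
  net.Backward();  // parameter diffs hold g3 only
}
```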