This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add fast implementation of LARS #16122

Merged
merged 9 commits into apache:master on Sep 30, 2019

Conversation

@Caenorst (Contributor) commented Sep 8, 2019:

Description

Add a Fast implementation of LARS.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change.

Changes

  • Add multi_sum_sq Op: computes the sum of squares of multiple arrays (one sum per array).
  • Add multi_lars Op: computes LARS from the sums of squares of weights and gradients.
  • Add preloaded_multi_sgd, a version of SGD where the learning rates are MXNet arrays instead of a list of parameters.
  • Add the SGDwFastLARS optimizer.
  • Add tests for the new Ops and the optimizer.
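
To make the relation between these Ops concrete, here is a minimal NumPy-style sketch of the per-layer LARS scaling they implement together; the function and argument names are illustrative, not the actual Op signatures:

```python
import numpy as np

def lars_learning_rates(weights, grads, base_lr, eta=0.001, eps=0.0, wd=0.0):
    """Sketch of the per-layer scaling that multi_sum_sq + multi_lars compute.

    For each layer i:
        ratio_i = eta * ||w_i|| / (||g_i|| + wd * ||w_i|| + eps)
        lr_i    = base_lr * ratio_i   (falling back to base_lr if a norm is zero)
    """
    lrs = []
    for w, g in zip(weights, grads):
        w_norm = np.sqrt(np.sum(np.square(w)))  # one "sum of squares" per array
        g_norm = np.sqrt(np.sum(np.square(g)))
        if w_norm > 0.0 and g_norm > 0.0:
            lrs.append(base_lr * eta * w_norm / (g_norm + wd * w_norm + eps))
        else:
            lrs.append(base_lr)
    return lrs
```

In the actual implementation, preloaded_multi_sgd then consumes these per-layer learning rates as MXNet arrays rather than Python scalars, as described in the Changes list above.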

Comments

This code has been used for the MLPerf v0.6 benchmarks. It is especially pertinent if you are training with a small local batch size, as the optimizer's cost usually does not scale with batch size.

Credits

@@ -781,6 +784,240 @@ def update(self, index, weight, grad, state):
ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
lr=lr, wd=wd, **kwargs)

@register
class SGDwFastLARS(Optimizer):
@anirudhacharya (Member) commented Sep 8, 2019:

Is this different from LBSGD defined here - https://github.com/apache/incubator-mxnet/pull/16122/files#diff-0c893416e9e93fbd94dfaa9fa6c13d67R1022

Currently the LBSGD implementation is buggy; maybe this should replace it, rather than creating SGDwFastLARS?

@Caenorst (Contributor Author) replied:

Thank you for the quick review.

LB stands for "large batch", indicating that it uses diverse techniques. Warmup can be implemented directly with lr_scheduler (LARS, contrary to what LBSGD suggests, is not a warmup strategy but an optimizer). I think the only reason to use such a name would be to offer a choice between the LARS and LARC optimizers (LARC will come in another PR), or even other large-batch techniques.

I think it could make sense, as both implementations are very similar, although I'm not sure it's ergonomic to have the selection of the optimizer as an argument of Optimizer.
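
On the warmup point: a warmup schedule can indeed be handled by the learning-rate scheduler rather than the optimizer. A rough sketch, assuming the standard mx.lr_scheduler.LRScheduler base class; the LinearWarmup class and its parameters are invented for illustration:

```python
import mxnet as mx

class LinearWarmup(mx.lr_scheduler.LRScheduler):
    """Toy scheduler that ramps the learning rate linearly, then holds it."""
    def __init__(self, base_lr=0.1, warmup_steps=1000):
        super(LinearWarmup, self).__init__(base_lr)
        self.warmup_steps = warmup_steps

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            # linear ramp from ~0 to base_lr over warmup_steps updates
            return self.base_lr * (num_update + 1) / float(self.warmup_steps)
        return self.base_lr

# e.g. passed to an optimizer via lr_scheduler=LinearWarmup(0.1, 1000)
```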

@anirudhacharya (Member) commented Sep 9, 2019:

Yes, I am not suggesting we have a parameter that switches the optimizer from LARS to LARC. Maybe we could just deprecate LBSGD (or mark it for deprecation), since it does not work.

@Caenorst (Contributor Author) replied:

I think deprecation is the best option; reviewers, please thumbs-up or thumbs-down this comment to give your opinion.

Member:

I am in favor of deprecating LBSGD, but I would still prefer a better name - "Fast" in "SGDwFastLARS" does not really make sense (fast compared to what?). Maybe just LARSSGD or something like that?

@Caenorst (Contributor Author) replied:

We could also just call them LARC or LARS.

As a point of reference, TF calls it LARSOptimizer:
https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LARSOptimizer

@Caenorst (Contributor Author) replied:

I renamed it LARS, and also removed the redundant "lars" prefix from eta and eps; let me know if it's fine like that.

@@ -781,6 +784,240 @@ def update(self, index, weight, grad, state):
ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
lr=lr, wd=wd, **kwargs)

@register
class SGDwFastLARS(Optimizer):
def __init__(self, momentum=0.0, lazy_update=True, lars_eta=0.001, lars_eps=0,
Member:

@Caenorst (Contributor Author) replied:

Definitely

@Caenorst (Contributor Author) replied:

Is this good enough?

@Caenorst (Contributor Author) commented Sep 9, 2019:

SyntaxError: only named arguments may follow *expression

I'm not seeing this error on my local machine; what version of Python is on the CI?
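
For context, that error comes from Python 2, which (unlike Python 3.5+ with PEP 448) does not allow positional arguments after an unpacked *expression in a call; the later commit "change optimizer code to be python2 retro-compatible" addresses this. A minimal illustration - the function and variable names below are made up, not the PR's actual code:

```python
# Valid on Python 3.5+ (PEP 448), but a SyntaxError on Python 2:
#     multi_op(*flat_args, lrs, wds, out=weights)
#     -> "SyntaxError: only named arguments may follow *expression"
# A Python-2-compatible rewrite folds the trailing positional arguments into
# the unpacked sequence instead (keyword arguments after * are fine on both):
def multi_op(*args, **kwargs):          # hypothetical op, for illustration only
    return len(args), kwargs.get('out')

weights, grads = [1, 2], [3, 4]
lrs, wds = [0.1, 0.1], [0.0, 0.0]
flat_args = [x for pair in zip(weights, grads) for x in pair]

multi_op(*(flat_args + [lrs, wds]), out=weights)   # works on Python 2 and 3
```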

return lenet

@with_seed()
def test_lars():
@anirudhacharya (Member) commented Sep 9, 2019:

How long does this test take? Should it be put in the nightly tests instead of the unit tests?

Should we add a test in python/unittest/test_optimizer.py?

@Caenorst (Contributor Author) replied:

It takes 10s on my V100.

Member:

Since we are actually training a network here, we should move this to the nightly tests and have an optimizer test in python/unittest/test_optimizer.py, building a mock optimizer like we have done for other optimizers (ref - https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L514).

@Caenorst (Contributor Author) replied:

So, I'm already testing the MXNet Ops in https://github.com/apache/incubator-mxnet/pull/16122/files#diff-4758fb9329d438de2836db2634a8f5f7R270-R422, which are the only non-Python part of the optimizer. What my test adds is showing that the optimizer behaves properly, i.e. that it lets you train a network with a bigger batch. Making a fully Python mock of it wouldn't help me test that (the Python could be wrong as well, say if I misunderstood the publication).
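
As an illustration of what such an Op-level check can look like, here is a rough NumPy-reference sketch; the exact frontend signature (mx.nd.multi_sum_sq with a num_arrays argument) and the GPU context are assumptions based on this PR, not quotes from it:

```python
import numpy as np
import mxnet as mx

def check_multi_sum_sq():
    # The PR's Op tests live in tests/python/gpu/test_operator_gpu.py,
    # so a GPU context is used here.
    ctx = mx.gpu(0)
    arrays = [mx.nd.random.uniform(shape=(5, 7), ctx=ctx) for _ in range(4)]
    # NumPy reference: one sum of squares per input array.
    expected = np.array([(a.asnumpy() ** 2).sum() for a in arrays])
    out = mx.nd.multi_sum_sq(*arrays, num_arrays=len(arrays))
    assert np.allclose(out.asnumpy(), expected, rtol=1e-4, atol=1e-5)
```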

@Caenorst (Contributor Author) replied:

@anirudhacharya Is moving my training test to the nightly tests and just unit testing my MXNet Ops in /tests/python/gpu/test_operator_gpu.py enough? If yes, I believe I'm all done for this PR :)

@anirudhacharya (Member) commented Sep 25, 2019:

@Caenorst Sorry, I missed seeing this. I will take another look. And yes, it would be good to move time-consuming tests to the nightly tests.

I understand that the Python mock optimizer could be wrong if our understanding of the paper is incorrect. Let us try to catch such errors in the code review process.

Member:

Apart from the above-mentioned test issue, everything else LGTM.

@apeforest self-assigned this Sep 13, 2019
@Caenorst (Contributor Author) commented:
I don't understand the errors in the CI; can somebody help me?

@apeforest (Contributor) left a comment:

LGTM
@Caenorst Thanks a lot for your contribution!

@apeforest merged commit 810e67c into apache:master on Sep 30, 2019
v = v.astype('float32')
if rescale:
    v *= self.rescale_grad
norm = NDnorm(v).asnumpy()[0]
Member:

Is this intended? I thought having a blocking call is bad.
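
For background on the concern: in MXNet, .asnumpy() synchronizes with the asynchronous execution engine and copies the result to the host, whereas keeping the intermediate as an NDArray lets downstream operators stay asynchronous. A minimal illustration, assuming the standard mx.nd API:

```python
import mxnet as mx

v = mx.nd.ones((1024,))

# Blocking: .asnumpy() waits for the engine to finish computing the norm
# and copies the scalar to host memory before the next Python line runs.
norm_host = mx.nd.norm(v).asnumpy()[0]

# Non-blocking alternative: keep the norm as an NDArray so later operators
# can consume it without synchronizing the Python frontend.
norm_nd = mx.nd.norm(v)
```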

apeforest pushed a commit that referenced this pull request Nov 19, 2019
* add MXNet operator for fast LARS

* add unit tests for fast LARS related MXNet Ops

* fix preloaded_multi_* dtype inference, add SGDwFastLARS optimizer and test

Conflicts:
	tests/python/gpu/test_operator_gpu.py

* remove commented out cast from lenet5 model

* fix lint

* Add more documentation, change name of SGDwFastLARS by LARS, removing redundancy of 'lars' in the parameters

* change optimizer code to be python2 retro-compatible

* fix lint

* replace push_back by emplace_back for clang-tidy