Add a utility for operator benchmarks #14977
Conversation
Thanks for initiating this. Besides the questions below, I have some high-level questions and comments:
- Why is it necessary to implement all the calls by hand? This approach seems rather inefficient. Is there any way to implement this more concisely?
- What happens when people implement new operators? Must they implement profiling logic here too?
- I see the PR is marked as complete despite the many TODOs in the code. I don't think the code can be checked in in this state.
The code per operator looks something like the snippet below.

The run_performance_test function provides all the tooling needed to run a benchmark and collect results. The user is expected to specify two things: the operator and the inputs for that operator. Providing inputs based on what needs to be tested is a crucial part, and it is made explicit. Each operator can have different criteria that need to be covered in performance tests, for example broadcasting shapes for arithmetic operators, or small vs. large tensors. If we automate this fully - for example, automatically fetch all registered operators, fetch the required inputs, and infer the input semantics (e.g. the lhs shape is equal or broadcastable to the rhs shape) - such a concise solution may mean less code, but it may also hide too many details and make it hard to use this tool generally or to integrate it with systems like PR/nightly benchmark dashboards. Having said that, there is certainly room for improvement: all binary operators like add, sub, and mul have similar expectations and could be handled concisely. But I felt that might lead to over-engineering a simple utility whose goal is to make it easy to run a benchmark test for an operator.
add_res = run_performance_test(nd.add, run_backward=True, dtype=dtype, ctx=ctx,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=warmup, runs=runs)
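For reference, a self-contained version of that snippet might look like the following; the import path is assumed from the PR's benchmark/opperf layout and the concrete values are illustrative:

```python
import mxnet as mx
import mxnet.ndarray as nd

# Import path assumed from the PR's directory layout; adjust if the module lives elsewhere.
from benchmark.opperf.utils.benchmark_utils import run_performance_test

# Benchmark elementwise add on 1024x1024 tensors, including the backward pass.
add_res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=10, runs=50)
print(add_res)  # timing (and memory) statistics collected by the utility
```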
Latest updates:
Ping @szha @apeforest @nswamy @access2rohit @Zha0q1 for review.
The registry has both the argument names and their types, along with the constraints, so it should be feasible to automatically generate the benchmark code for them. The current implementation requires additional code every time a new operator is added, which is not scalable.
@sandeep-krishnamurthy I think it would be good to provide some type hints in this code, especially in the functions. Since we seem to be moving MXNet to Python 3 anyway, this should make the code much easier to read and maintain. What do you think?
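As a rough illustration of that suggestion, a type-hinted signature could look like the sketch below; the parameter list is hypothetical and only mirrors the example call earlier in this thread, not the PR's actual function definition:

```python
from typing import Any, Callable, Dict, List, Tuple, Union

import mxnet as mx

# Hypothetical, illustrative signature; the real run_performance_test in the PR may differ.
def run_performance_test(op: Callable,
                         inputs: List[Dict[str, Union[Tuple[int, ...], int, float, str]]],
                         run_backward: bool = True,
                         dtype: str = 'float32',
                         ctx: mx.Context = mx.cpu(),
                         warmup: int = 10,
                         runs: int = 50) -> List[Dict[str, Any]]:
    ...
```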
@szha This is a valid point that we have raised internally before; the question is whether that could be implemented on top of this. To me this looks like a good level of abstraction, and the automated generation could be a layer on top - I think that's the fundamental question we need to ask. An operator benchmark would be very beneficial, and it's also possible to get it done in stages. Rome wasn't built in a day.
This is a nice addition. I will add it to the main user-facing functions to start with.
I just had a discussion with @sandeep-krishnamurthy and we agreed that: 1. the ideal state is the automated approach that uses the data types from the op registry to generate default values for performance testing, while also allowing users to override and test specific settings; 2. the current approach won't help towards that goal, since the automated approach would render the approach in this PR mostly obsolete.
The next step I suggested is to start with the simplest cases, such as elementwise operators, and build the automated approach for those.
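For reference, enumerating the registered operators is already possible through MXNet's C API, so an automated approach could start from something like the sketch below; MXListAllOpNames is the existing backend call, while the helper around it is illustrative:

```python
import ctypes

from mxnet.base import _LIB, check_call, py_str

def list_all_op_names():
    """Return the names of all operators registered with the MXNet backend."""
    size = ctypes.c_uint()
    names = ctypes.POINTER(ctypes.c_char_p)()
    check_call(_LIB.MXListAllOpNames(ctypes.byref(size), ctypes.byref(names)))
    return [py_str(names[i]) for i in range(size.value)]

# An automated benchmark generator could walk this list, read each operator's argument
# names and types from the registry, and synthesize default inputs instead of hand-writing them.
print(len(list_all_op_names()), "operators registered")
```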
Thanks @szha for the discussion and for putting the summary here. I wanted to add a few more points we discussed in the context of this PR:
I have a README file maintaining it - https://github.com/apache/incubator-mxnet/pull/14977/files#diff-d7bc2931851dce319a4523bc3bb10ac7
Updates:
That should be maintained automatically instead of manually in a doc.
I understand; I was planning to do that later as I did not see it as a blocker in phase 1. I can do that. Are we good to go after that?
Updates:
@larroy @szha @apeforest - Can you please take a look at this PR? I would like to work towards getting this merged, as I believe that in its current state this PR can add value to users and developers of MXNet. I will continue further operator coverage incrementally in new PRs.
1. **output-format** : `json` or `md` for markdown file output.
2. **ctx** : `cpu` or `gpu`. By default, `cpu` on a CPU machine and `gpu(0)` on a GPU machine. You can override this and set the global context for all operator benchmarks, for example `--ctx gpu(2)`.
What if there are multiple GPUs? Does the profiler generate results per device?
Since this is a single-operator benchmark, it runs on only one device and the profiler output is only for that device.
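If per-device numbers are wanted on a multi-GPU machine, the same single-operator benchmark can simply be repeated once per context; a sketch, reusing the import path assumed earlier:

```python
import mxnet as mx
import mxnet.ndarray as nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test  # assumed path, as above

# Repeat the benchmark on each visible GPU to get per-device results.
for gpu_id in range(mx.context.num_gpus()):
    res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.gpu(gpu_id),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
                               warmup=10, runs=50)
    print("gpu(%d):" % gpu_id, res)
```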
run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Would it be more convenient to define this shape as a global constant?
Now I see you have benchmark/opperf/rules/default_params.py. Why not just use the DEFAULT_SHAPE from there?
I think we should actually auto-discover the shapes.
run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Same for all these hyperparameters. It would be less error-prone to define global constants.
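To illustrate the suggestion, the repeated literals could live next to the existing defaults; the constant names below are made up, and the PR's default_params.py may organize this differently:

```python
# benchmark/opperf/rules/default_params.py (illustrative constant names only)
DEFAULT_DATA_SHAPE = (32, 3, 256, 256)   # NCHW: batch of 32 RGB images of size 256x256
DEFAULT_GEMM_SHAPE = (1024, 1024)

# Individual benchmarks would then reference the constants instead of repeating literals, e.g.
# inputs=[{"data": DEFAULT_DATA_SHAPE}]
```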
Why global?
from benchmark.opperf.utils.op_registry_utils import get_all_random_sampling_operators


def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', warmup=10, runs=50):
For benchmarking random operators, do we want to set a fixed seed?
Is there any case where a random operator changes its runtime based on the seed?
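If reproducibility of the sampled values ever matters for these benchmarks (the runtime itself is not expected to depend on the seed), a fixed seed can be set up front; a minimal sketch using MXNet's global RNG:

```python
import mxnet as mx

# Fix MXNet's global random seed so random-sampling benchmarks draw the same values on every run.
mx.random.seed(2019)
```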
Thanks a lot for this contribution. It's a very desirable feature for MXNet developers. Since MXNet is going to add numpy operators, it will be very useful to profile numpy operators against existing ndarray operators. Can we add a TODO item for a future contribution to extend this utility to numpy operators?
Thanks for making the changes. The automated information from this PR would be very useful as a dashboard in our wiki (along with the public nightly test results, for example).
* Initial end to end working skeleton
* Add skeleton for all other operator benchmarks
* Add Gluon Conv2D benchmarks
* Add readme and user guide, example result
* Add licence headers to all files
* fix RAT licence check issues
* Add ability to group list of operators with same inputs to benchmark. Update README
* Add comparison operator tests and more arithmetic operators
* Remove Gluon block and instead use only low level NDArray operators
* Add GEMM operators
* Add logical operations
* Add support to export results as markdown
* Add ability to query MXNet operator registry for operators and run benchmarks
* Delete duplicate arithmetic, logical, comparison operator benchmarks. Update ReadMe and main driver
* Add binary elementwise operator benchmarks
* Adding basic logging mechanisms
* Address review comments
* Few formatting issues resolved
* Add unary operators. Remove stale todo files
* Fix sanity tests
* Remove mention of hypothesis
* Add random sampling operator benchmarks.
* Add all activation operator benchmarks
* Add Pooling operator benchmarks
* Add Convolution operator benchmarks
* Add Reduction operator benchmarks
* Add an utility to get list of operator not benchmarked
* Autogenerate list of operators to cover
* Add basic nn operators - FC, dropout, batchnorm
* Add CPU result file
Description
Add a tool to easily run operator benchmarks. This tool will be useful for MXNet developers and power users who want to know more about individual operator performance. It can be integrated with CI/CD and nightly tests to catch operator performance regressions. Proposal on cwiki - https://cwiki.apache.org/confluence/display/MXNET/MXNet+Operator+Benchmarks
- 3 NDArray operator (add, sub, mul) and 1 Gluon (Conv2D) operator benchmarks
- 19 NDArray operator benchmark tests
- 80 operator benchmark tests - code and results markdown file autogenerated by the tool, to showcase how it can be used.

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
@apeforest @nswamy @access2rohit @pengzhao-intel @Zha0q1