This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add a utility for operator benchmarks #14977

Merged

merged 30 commits into apache:master on Jun 12, 2019

Conversation

sandeep-krishnamurthy
Contributor

@sandeep-krishnamurthy commented May 17, 2019

Description

Add a tool to easily run operator benchmarks. This tool will be useful for MXNet developers and power users who want to know more about individual operator performance. It can be integrated with CI/CD and nightly tests to catch operator performance regressions. Proposal on cwiki - https://cwiki.apache.org/confluence/display/MXNET/MXNet+Operator+Benchmarks

  1. This is the first PR: it sets up all the basic required utilities, the overall structure of this operator benchmarking utility, and the overall skeleton of the tool.
  2. I have added 80 operator benchmark tests, with the code and a results markdown file autogenerated by the tool, to showcase how it can be used.
  3. More operators will be covered in follow-up PRs. With this PR, we are setting up the basic structure for other contributors to pitch in.
  4. Added README explaining how to use the utility.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • Code is well-documented:
  • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Utility to do operator benchmarks

Comments

@apeforest @nswamy @access2rohit @pengzhao-intel @Zha0q1

@sandeep-krishnamurthy added the Operator, Profiler (MXNet profiling issues), pr-awaiting-review (PR is waiting for code review), and Benchmark labels on May 17, 2019
Member

@szha left a comment


Thanks for initiating this. Besides the questions below, I have some high-level questions and comments:

  • why is it necessary to implement all the calls by hand? this approach seems rather inefficient. is there any way to implement this more concisely?
  • what happens when people implement new operators? must they implement profiling logic here too?
  • I see the PR is marked as complete despite the many TODOs in the code. I don't think the code can be checked in in this state.

(Inline review threads on benchmark/opperf/README.md and benchmark/opperf/nn_operations/__init__.py, all resolved.)
@sandeep-krishnamurthy
Contributor Author

Thanks for initiating this. Besides the questions below, I have some high-level questions and comments:

  • why is it necessary to implement all the calls by hand? this approach seems rather inefficient. is there any way to implement this more concisely?
    Good point. I did think about it, and it was discussed by other community members on the proposal doc. Below is my thought process.

Code per operator is something like below:

add_res = run_performance_test(nd.add, run_backward=True, dtype=dtype, ctx=ctx,
                                   inputs=[{"lhs": (1024, 1024),
                                            "rhs": (1024, 1024)},
                                           {"lhs": (10000, 10),
                                            "rhs": (10000, 10)},
                                           {"lhs": (10000, 1),
                                            "rhs": (10000, 100)}],
                                   warmup=warmup, runs=runs)

The run_performance_test function provides all the necessary tooling to run a benchmark and collect results. The user is expected to specify two things: the operator name and the inputs for the operator. Providing inputs based on what needs to be tested is a crucial part and is made explicit. Each operator can have different criteria that need to be covered in performance tests, e.g., broadcasting shapes for arithmetic operators, small/large tensors, etc.

Hence, a fully automated solution (for example, automatically fetching all registered operators, fetching the required inputs, and understanding the input semantics, e.g., that the lhs shape must be the same as or broadcastable to the rhs shape) would reduce the amount of code, but it may hide too many details and make it harder to use this tool generally or to integrate it with systems like PR/nightly benchmark dashboards.

Having said that, there is certainly room for improvement. For example, binary operators such as add, sub, and mul have similar expectations and can be expressed more concisely. But I thought that might lead to over-engineering a simple utility whose purpose is to easily run a benchmark test for an operator.

  • what happens when people implement new operators? must they implement profiling logic here too?
    Yes, though not the profiling logic. They would add something like the snippet below, providing the name of the operator and specifying the inputs to use.
add_res = run_performance_test(nd.add, run_backward=True, dtype=dtype, ctx=ctx,
                                   inputs=[{"lhs": (1024, 1024),
                                            "rhs": (1024, 1024)}],
                                   warmup=warmup, runs=runs)
  • I see the PR is marked as complete despite the many TODOs in the code. I don't think the code can be checked in in this state.
    The motivation of this PR, as described in the description, is to set up the base infrastructure for a tool to easily run operator benchmarks. It adds 4 operator benchmarks and sets up everything required for community members to help cover the more than 200 operators we have in MXNet. The TODOs are there to lay out the overall skeleton and make clear what we want to cover next, so that the roadmap is clearly set.

(Inline review threads on benchmark/opperf/README.md, all resolved.)
@sandeep-krishnamurthy
Contributor Author

Latest updates:

  1. Add more operator benchmarks - ~19 tensor operations added so far.
  2. Support exporting results as JSON or Markdown.
  3. Add an autogenerated results markdown file.
  4. Move everything to test only core low-level operators, i.e., NDArray operators only, no Gluon.
  5. Support sending a batch of operators for testing with the same inputs.
     Example - below we run performance tests for all 5 arithmetic operators in one call.
benchmark_res = run_performance_test([nd.add, nd.subtract, nd.multiply, nd.divide, nd.modulo], run_backward=True,
                                         dtype=dtype, ctx=ctx,
                                         inputs=[{"lhs": (1024, 1024),
                                                  "rhs": (1024, 1024)},
                                                 {"lhs": (10000, 10),
                                                  "rhs": (10000, 10)},
                                                 {"lhs": (10000, 1),
                                                  "rhs": (10000, 100)}],
                                         warmup=warmup, runs=runs)
  6. Updates and clarifications in the README.

ping - @szha @apeforest @nswamy @access2rohit @Zha0q1 for review.

szha
szha previously requested changes May 21, 2019
Member

@szha left a comment


The registry has both the argument names and their types, along with the constraints, so it should be feasible to automatically generate the benchmark code for them. The current implementation requires additional code every time a new operator is added, which is not scalable.
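For illustration only, here is a minimal sketch (not code from this PR) of how operator argument information could be inspected from the Python front end; the real registry, exposed through the C API, also carries argument types and constraints that an automated generator could use:

# Hedged sketch: peek at the 'Parameters' section of an operator's
# autogenerated docstring. The actual op registry (via the C API) exposes
# argument names, types and constraints directly.
import mxnet as mx

def operator_parameters(op_name):
    """Return the 'Parameters' block of an operator's autogenerated docstring."""
    doc = getattr(mx.nd, op_name).__doc__ or ""
    params, capturing = [], False
    for line in doc.splitlines():
        if line.strip() == "Parameters":
            capturing = True
        elif line.strip() == "Returns":
            break
        elif capturing:
            params.append(line)
    return "\n".join(params)

print(operator_parameters("add"))          # lhs/rhs NDArray arguments
print(operator_parameters("Convolution"))  # data, weight, kernel, num_filter, ...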

@larroy
Contributor

larroy commented May 22, 2019

@sandeep-krishnamurthy I think it would be good to provide some type hints in this code, especially in the functions. Since we seem to be moving MXNet to Python 3 anyway, this should make the code much easier to read and maintain. What do you think?

@larroy
Contributor

larroy commented May 22, 2019

@szha This is a valid point that we have raised internally before; the question is whether that could be implemented on top of this. To me this looks like a good level of abstraction, and the automated generation could be a layer on top; I think that's the fundamental question we need to ask. Operator benchmarks would be very beneficial, and it's also possible to get them done in stages. Rome wasn't built in a day.

@sandeep-krishnamurthy
Contributor Author

@sandeep-krishnamurthy I think it would be good to provide some type hints in this code, especially in the functions. Since we seem to be moving MXNet to Python 3 anyway, this should make the code much easier to read and maintain. What do you think?

This is a nice addition. I will add it to the main user-facing functions to start with.
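For illustration, a hedged sketch of what type hints on the main user-facing helper could look like; the parameter names mirror the examples in this thread, while the return type and defaults are assumptions rather than the merged signature:

# Illustrative signature only; not the merged implementation.
from typing import Any, Callable, Dict, List, Union
import mxnet as mx

def run_performance_test(ops: Union[Callable, List[Callable]],
                         inputs: List[Dict[str, Any]],
                         run_backward: bool = True,
                         dtype: str = 'float32',
                         ctx: mx.Context = mx.cpu(),
                         warmup: int = 10,
                         runs: int = 50) -> List[Dict[str, Any]]:
    """Benchmark one or more operators on the given inputs and return per-op results."""
    raise NotImplementedError("sketch only; see benchmark/opperf in this PR")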

@szha
Member

szha commented May 22, 2019

I just had a discussion with @sandeep-krishnamurthy and we agreed that: 1. the ideal state is an automated approach that uses the data types from the op registry to generate default values for performance testing, while also allowing users to override them and test specific settings; 2. the current approach won't help towards that goal, since the automated approach would render the approach in this PR mostly obsolete.

@szha
Member

szha commented May 22, 2019

The next step I suggested is to start with the simplest case, such as elementwise operators, and build the automated approach from there.

@sandeep-krishnamurthy
Contributor Author

Thanks @szha for the discussion and for putting the summary here. I wanted to add a few more points we discussed in the context of this PR:

  1. We agreed that the current functionality in this PR is helpful for existing operators, for purposes like PR builds, nightly performance-test CI, and quick comparisons for developers (for example, those working on Int64 support). One of your suggestions was to maintain it in a separate repository or move it to the MXNet Jenkins/CI repository.
  2. The unfavorable part of this PR is that it puts an additional burden on developers adding new operators or making breaking changes to existing operators (e.g., parameter or namespace changes) to come and update this utility, which we all agree is a maintenance overhead.
  3. We also discussed that even with an automated solution, we cannot handle every operator and every new use case in a fully automated way, so it would still be a combination of automated tests (TODO) and explicit tests (currently in this PR); see the sketch below.
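A minimal sketch of how that combination could look, with hypothetical helper and table names (not code from this PR):

# Hedged sketch: keep a table of default inputs per operator (the automated
# part could eventually populate this from the op registry) and let explicit
# tests override it. All names here are hypothetical.
DEFAULT_INPUTS = {
    "add": [{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
    "dot": [{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
}

def inputs_for(op_name, overrides=None):
    """Use explicit inputs when supplied, otherwise fall back to the defaults."""
    return overrides if overrides else DEFAULT_INPUTS.get(op_name, [])

# Explicit broadcasting test for add, automated defaults for dot.
print(inputs_for("add", overrides=[{"lhs": (10000, 1), "rhs": (10000, 100)}]))
print(inputs_for("dot"))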

@sandeep-krishnamurthy
Contributor Author

Could you include a utility for listing the operators that are not covered, instead of writing a TODO list in code?

I have a readme file maintaining it - https://github.com/apache/incubator-mxnet/pull/14977/files#diff-d7bc2931851dce319a4523bc3bb10ac7

@sandeep-krishnamurthy
Contributor Author

Updates:

  1. Add support for Random Sampling operators (~16)
    Total operators covered => ~100
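For context, a hedged usage sketch of one of these category runners; the function name and signature come from the diff further down in this thread, while the import path is an assumption:

# Hedged usage sketch; the import path below is assumed, not taken from the PR.
import mxnet as mx
from benchmark.opperf.nd_operations.random_sampling_operators import (  # path assumed
    run_mx_random_sampling_operators_benchmarks,
)

results = run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
                                                      warmup=10, runs=50)
print(results)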

@szha
Member

szha commented Jun 8, 2019

That should be maintained automatically instead of manually in a doc.

@sandeep-krishnamurthy
Contributor Author

That should be maintained automatically instead of manually in a doc.

I understand. I was planning to do that later, as I did not see it as a blocker for phase 1. I can do that.

Are we good to go post that?

@sandeep-krishnamurthy
Contributor Author

Updates:

  1. Cover automated tests for reduction operators [~10].
  2. Benchmarks for pooling, conv, and activation operators [~10].
     Total operators covered so far: ~130.
  3. Add a utility function that returns the list of operator names not yet covered by benchmarks (see the sketch below).
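A rough sketch of how such a coverage report could work, using a plain set difference; the helper name is hypothetical and the filtering is deliberately coarse:

# Hedged sketch: compare the callables registered under mx.nd against the set
# of operators that already have benchmark entries. dir() also returns
# non-operator helpers, so a real implementation should query the op registry.
import mxnet as mx

def operators_missing_benchmarks(benchmarked_ops):
    """Return registered NDArray callables that have no benchmark yet."""
    registered = {name for name in dir(mx.nd)
                  if callable(getattr(mx.nd, name)) and not name.startswith('_')}
    return sorted(registered - set(benchmarked_ops))

# Tiny illustrative subset of operators benchmarked so far.
covered = {'add', 'subtract', 'multiply', 'divide', 'Convolution', 'Pooling'}
print(operators_missing_benchmarks(covered)[:20])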

@larroy @szha @apeforest - Can you please take a look at this PR? I would like to work towards getting this merged, as I believe that in its current state this PR can add value to users and developers of MXNet. I will continue further operator coverage incrementally in new PRs.


1. **output-format** : `json` or `md` for markdown file output.

2. **ctx** : `cpu` or `gpu`. By default, cpu on CPU machine, gpu(0) on GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
Contributor


What if there are multiple GPUs? Does the profiler generate results per device?

Contributor Author


Since these are single-operator benchmarks, each run uses only one device, and the profiler output is only for that device.

run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Contributor


Would it be more convenient to define this shape as a global constant?

Contributor


Now I see you have benchmark/opperf/rules/default_params.py. Why not just use DEFAULT_SHAPE from there?

Contributor


I think we should actually autodiscover the shapes.

run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Contributor


Same for all these hyperparameters. It would be less error-prone to define global constants.

Contributor


why global?
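For illustration, a hedged sketch of the global-constant idea raised here, reusing the run_performance_test helper shown earlier in the thread; the constant names and the import path are assumptions (the PR's benchmark/opperf/rules/default_params.py plays this role):

# Hedged sketch: centralise commonly used benchmark settings as module-level
# constants instead of repeating literal shapes in every benchmark.
# Constant names are hypothetical; the import path is assumed.
import mxnet as mx
import mxnet.ndarray as nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test  # path assumed

DEFAULT_DTYPE = 'float32'
DEFAULT_LARGE_SHAPE = (1024, 1024)
DEFAULT_WARMUP = 10
DEFAULT_RUNS = 50

add_res = run_performance_test(nd.add, run_backward=True,
                               dtype=DEFAULT_DTYPE, ctx=mx.cpu(),
                               inputs=[{"lhs": DEFAULT_LARGE_SHAPE,
                                        "rhs": DEFAULT_LARGE_SHAPE}],
                               warmup=DEFAULT_WARMUP, runs=DEFAULT_RUNS)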

from benchmark.opperf.utils.op_registry_utils import get_all_random_sampling_operators


def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', warmup=10, runs=50):
Contributor


For benchmarking random operators, do we want to set a fixed seed?

Member


is there any case where the random operator changes its runtime based on seed?
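For completeness, a minimal sketch of pinning MXNet's RNG seed before timing a random sampling operator, in case deterministic sampled values are ever wanted; the seed value is arbitrary:

# Hedged sketch: fixing the seed makes the sampled values reproducible; it is
# not expected to change the operator's runtime, which is the point above.
import mxnet as mx

mx.random.seed(2019)                                   # arbitrary fixed seed
out = mx.nd.random.uniform(low=0, high=1, shape=(1024, 1024), ctx=mx.cpu())
out.wait_to_read()                                     # force execution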

Contributor

@apeforest left a comment


Thanks a lot for this contribution. It's a very desirable feature for MXNet developers. Since MXNet is going to add numpy operators, it will be very useful to profile numpy operators against the existing ndarray operators. Can you add this as a TODO item for a future contribution, to extend this utility to numpy operators?

@szha
Member

szha commented Jun 12, 2019

Thanks for making the changes. The automated information from this PR would be very useful as a dashboard in our wiki (along with the public nightly test results for example).

@szha szha merged commit 0be6d7e into apache:master Jun 12, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Initial end to end working skeleton

* Add skeleton for all other operator benchmarks

* Add Gluon Conv2D benchmarks

* Add readme and user guide, example result

* Add licence headers to all files

* fix RAT licence check issues

* Add ability to group list of operators with same inputs to benchmark. Update README

* Add comparison operator tests and more arithmetic operators

* Remove Gluon block and instead use only low level NDArray operators

* Add GEMM operators

* Add logical operations

* Add support to export results as markdown

* Add ability to query MXNet operator registry for operators and run benchmarks

* Delete duplicate arithmetic, logical, comparison operator benchmarks. Update ReadMe and main driver

* Add binary elementwise operator benchmarks

* Adding basic logging mechanisms

* Address review comments

* Few formatting issues resolved

* Add unary operators. Remove stale todo files

* Fix sanity tests

* Remove mention of hypothesis

* Add random sampling operator benchmarks.

* Add all activation operator benchmarks

* Add Pooling operator benchmarks

* Add Convolution operator benchmarks

* Add Reduction operator benchmarks

* Add an utility to get list of operator not benchmarked

* Autogenerate list of operators to cover

* Add basic nn operators - FC, dropout, batchnorm

* Add CPU result file