Add a utility for operator benchmarks #14977
Conversation
Thanks for initiating this. Besides the questions below, I have some high-level questions and comments:
- Why is it necessary to implement all the calls by hand? This approach seems rather inefficient. Is there any way to implement this more concisely?
- What happens when people implement new operators? Must they implement profiling logic here too?
- I see the PR is marked as complete despite the many TODOs in the code. I don't think the code can be checked in in this state.
The code per operator looks something like the snippet below.

The run_performance_test function provides all the tooling needed to run a benchmark and collect results. The user is expected to specify two things: the operator and the inputs for that operator. Providing inputs based on what needs to be tested is a crucial part, and it is made explicit. Each operator can have different criteria that need to be covered in performance tests, for example broadcasting shapes for arithmetic operators, or small vs. large tensors. If we automate this fully - for example, automatically fetch all registered operators, fetch the required inputs, and infer the input semantics (e.g. the lhs shape is equal or broadcastable to the rhs shape) - such a concise solution may mean less code, but it may also hide too many details and make it hard to use this tool generally or to integrate it with systems like PR/nightly benchmark dashboards. Having said that, there is certainly room for improvement: all binary operators like add, sub, and mul have similar expectations and could be handled concisely. But I felt that might lead to over-engineering a simple utility whose goal is to make it easy to run a benchmark test for an operator.
add_res = run_performance_test(nd.add, run_backward=True, dtype=dtype, ctx=ctx,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=warmup, runs=runs)
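For reference, a self-contained version of that snippet might look like the following; the import path is assumed from the PR's benchmark/opperf layout and the concrete values are illustrative:

```python
import mxnet as mx
import mxnet.ndarray as nd

# Import path assumed from the PR's directory layout; adjust if the module lives elsewhere.
from benchmark.opperf.utils.benchmark_utils import run_performance_test

# Benchmark elementwise add on 1024x1024 tensors, including the backward pass.
add_res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024)}],
                               warmup=10, runs=50)
print(add_res)  # timing (and memory) statistics collected by the utility
```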
Latest updates:
Ping @szha @apeforest @nswamy @access2rohit @Zha0q1 for review.
The registry has both the argument names and their types, along with the constraints, so it should be feasible to automatically generate the benchmark code for them. The current implementation requires additional code every time a new operator is added, which is not scalable.
@sandeep-krishnamurthy I think it would be good to provide some type hints in this code, especially in the functions. Since we seem to be moving MXNet to Python 3 anyway, this should make the code much easier to read and maintain. What do you think?
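As a rough illustration of that suggestion, a type-hinted signature could look like the sketch below; the parameter list is hypothetical and only mirrors the example call earlier in this thread, not the PR's actual function definition:

```python
from typing import Any, Callable, Dict, List, Tuple, Union

import mxnet as mx

# Hypothetical, illustrative signature; the real run_performance_test in the PR may differ.
def run_performance_test(op: Callable,
                         inputs: List[Dict[str, Union[Tuple[int, ...], int, float, str]]],
                         run_backward: bool = True,
                         dtype: str = 'float32',
                         ctx: mx.Context = mx.cpu(),
                         warmup: int = 10,
                         runs: int = 50) -> List[Dict[str, Any]]:
    ...
```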
@szha This is a valid point that we have raised internally before; the question is whether that could be implemented on top of this. To me this looks like a good level of abstraction, and the automated generation could be a layer on top - I think that's the fundamental question we need to ask. An operator benchmark would be very beneficial, and it's also possible to get it done in stages. Rome wasn't built in a day.
This is a nice addition. I will add it to the main user-facing functions to start with.
I just had a discussion with @sandeep-krishnamurthy and we agreed that: 1. the ideal state is the automated approach that uses the data types from the op registry to generate default values for performance testing, while also allowing users to override and test specific settings; 2. the current approach won't help towards that goal, since the automated approach would render the approach in this PR mostly obsolete.
The next step I suggested is to start with the simplest cases, such as elementwise operators, and build the automated approach for those.
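For reference, enumerating the registered operators is already possible through MXNet's C API, so an automated approach could start from something like the sketch below; MXListAllOpNames is the existing backend call, while the helper around it is illustrative:

```python
import ctypes

from mxnet.base import _LIB, check_call, py_str

def list_all_op_names():
    """Return the names of all operators registered with the MXNet backend."""
    size = ctypes.c_uint()
    names = ctypes.POINTER(ctypes.c_char_p)()
    check_call(_LIB.MXListAllOpNames(ctypes.byref(size), ctypes.byref(names)))
    return [py_str(names[i]) for i in range(size.value)]

# An automated benchmark generator could walk this list, read each operator's argument
# names and types from the registry, and synthesize default inputs instead of hand-writing them.
print(len(list_all_op_names()), "operators registered")
```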
Thanks @szha for the discussion and for putting the summary here. I wanted to add a few more points we discussed in the context of this PR:
I have a README file maintaining it - https://github.com/apache/incubator-mxnet/pull/14977/files#diff-d7bc2931851dce319a4523bc3bb10ac7
Updates:
That should be maintained automatically instead of manually in a doc.
I understand; I was planning to do that later as I did not see it as a blocker in phase 1. I can do that. Are we good to go after that?
Updates:
@larroy @szha @apeforest - Can you please take a look at this PR? I would like to work towards getting this merged, as I believe that in its current state this PR can add value to users and developers of MXNet. I will continue further operator coverage incrementally in new PRs.
1. **output-format** : `json` or `md` for markdown file output.
2. **ctx** : `cpu` or `gpu`. By default, `cpu` on a CPU machine and `gpu(0)` on a GPU machine. You can override this and set the global context for all operator benchmarks, for example `--ctx gpu(2)`.
What if there are multiple GPUs? Does the profiler generate results per device?
Since this is a single-operator benchmark, it runs on only one device and the profiler output is only for that device.
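If per-device numbers are wanted on a multi-GPU machine, the same single-operator benchmark can simply be repeated once per context; a sketch, reusing the import path assumed earlier:

```python
import mxnet as mx
import mxnet.ndarray as nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test  # assumed path, as above

# Repeat the benchmark on each visible GPU to get per-device results.
for gpu_id in range(mx.context.num_gpus()):
    res = run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.gpu(gpu_id),
                               inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024)}],
                               warmup=10, runs=50)
    print("gpu(%d):" % gpu_id, res)
```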
run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Would it be more convenient to define this shape as a global constant?
Now I see you have benchmark/opperf/rules/default_params.py. Why not just use the DEFAULT_SHAPE from there?
I think we should actually auto-discover the shapes.
run_backward=True,
dtype=dtype,
ctx=ctx,
inputs=[{"data": (32, 3, 256, 256),
Same for all these hyperparameters. It would be less error-prone to define global constants.
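To illustrate the suggestion, the repeated literals could live next to the existing defaults; the constant names below are made up, and the PR's default_params.py may organize this differently:

```python
# benchmark/opperf/rules/default_params.py (illustrative constant names only)
DEFAULT_DATA_SHAPE = (32, 3, 256, 256)   # NCHW: batch of 32 RGB images of size 256x256
DEFAULT_GEMM_SHAPE = (1024, 1024)

# Individual benchmarks would then reference the constants instead of repeating literals, e.g.
# inputs=[{"data": DEFAULT_DATA_SHAPE}]
```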
Why global?
from benchmark.opperf.utils.op_registry_utils import get_all_random_sampling_operators


def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', warmup=10, runs=50):
For benchmarking random operators, do we want to set a fixed seed?
Is there any case where a random operator changes its runtime based on the seed?
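If reproducibility of the sampled values ever matters for these benchmarks (the runtime itself is not expected to depend on the seed), a fixed seed can be set up front; a minimal sketch using MXNet's global RNG:

```python
import mxnet as mx

# Fix MXNet's global random seed so random-sampling benchmarks draw the same values on every run.
mx.random.seed(2019)
```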
Thanks a lot for this contribution. It's a very desirable feature for MXNet developers. Since MXNet is going to add numpy operators, it will be very useful to profile numpy operators against existing ndarray operators. Can we add a TODO item for a future contribution to extend this utility to numpy operators?
Thanks for making the changes. The automated information from this PR would be very useful as a dashboard in our wiki (along with the public nightly test results, for example).
* Initial end to end working skeleton
* Add skeleton for all other operator benchmarks
* Add Gluon Conv2D benchmarks
* Add readme and user guide, example result
* Add licence headers to all files
* fix RAT licence check issues
* Add ability to group list of operators with same inputs to benchmark. Update README
* Add comparison operator tests and more arithmetic operators
* Remove Gluon block and instead use only low level NDArray operators
* Add GEMM operators
* Add logical operations
* Add support to export results as markdown
* Add ability to query MXNet operator registry for operators and run benchmarks
* Delete duplicate arithmetic, logical, comparison operator benchmarks. Update ReadMe and main driver
* Add binary elementwise operator benchmarks
* Adding basic logging mechanisms
* Address review comments
* Few formatting issues resolved
* Add unary operators. Remove stale todo files
* Fix sanity tests
* Remove mention of hypothesis
* Add random sampling operator benchmarks.
* Add all activation operator benchmarks
* Add Pooling operator benchmarks
* Add Convolution operator benchmarks
* Add Reduction operator benchmarks
* Add an utility to get list of operator not benchmarked
* Autogenerate list of operators to cover
* Add basic nn operators - FC, dropout, batchnorm
* Add CPU result file
Description
Add a tool to easily run operator benchmarks. This tool will be useful for MXNet developers and power users who want to know more about individual operator performance. It can be integrated with CI/CD and nightly tests to catch operator performance regressions. Proposal on cwiki - https://cwiki.apache.org/confluence/display/MXNET/MXNet+Operator+Benchmarks
- 3 NDArray operator (add, sub, mul) and 1 Gluon (Conv2D) operator benchmarks
- 19 NDArray operator benchmark tests
- 80 operator benchmark tests - code and results markdown file autogenerated by the tool, to showcase how it can be used.

Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
@apeforest @nswamy @access2rohit @pengzhao-intel @Zha0q1