
[Large Tensor] Implemented LT flag for OpPerf testing #17449

Merged: 59 commits merged into apache:master on Feb 29, 2020

Conversation

connorgoggins
Contributor

@connorgoggins connorgoggins commented Jan 27, 2020

Description

Completely reworked this PR to establish compatibility with the current master. In the weeks since this PR was originally created, over 100 ops have been added to OpPerf, so I added functionality for testing each one with large tensor (dimension >= 2**32) data while ensuring that the suite still worked properly on standard data.

I tested my changes extensively, merging my remaining PRs into this branch during testing to ensure that the full test suite worked with int64 tensor data on every op once all my kernel-level fixes were included.

This PR adds a flag (int64-tensor) and relevant default data to OpPerf for every supported op, thereby allowing users to run the entire suite of opperf tests with int64 tensor data after they build MXNet with int64 tensor support.

Please note that the full suite takes an extremely long time (over one day) to run to completion on a machine with 748 GB of RAM, even with warmup=1 and runs=1.
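For reference, a full-suite run with the new flag looks roughly like the following (a sketch only; the exact flag spelling --int64-tensor ON is assumed here, mirroring the --large-tensor ON invocation shown later in this thread):

python incubator-mxnet/benchmark/opperf/opperf.py --output-format json --output-file mxnet_operator_benchmark_results.json --int64-tensor ON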

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M benchmark/opperf/nd_operations/array_rearrange.py
  • M benchmark/opperf/nd_operations/binary_operators.py
  • M benchmark/opperf/nd_operations/gemm_operators.py
  • M benchmark/opperf/nd_operations/indexing_routines.py
  • M benchmark/opperf/nd_operations/linalg_operators.py
  • M benchmark/opperf/nd_operations/misc_operators.py
  • M benchmark/opperf/nd_operations/nn_activation_operators.py
  • M benchmark/opperf/nd_operations/nn_basic_operators.py
  • M benchmark/opperf/nd_operations/nn_conv_operators.py
  • M benchmark/opperf/nd_operations/nn_loss_operators.py
  • M benchmark/opperf/nd_operations/nn_optimizer_operators.py
  • M benchmark/opperf/nd_operations/random_sampling_operators.py
  • M benchmark/opperf/nd_operations/reduction_operators.py
  • M benchmark/opperf/nd_operations/sorting_searching_operators.py
  • M benchmark/opperf/nd_operations/unary_operators.py
  • M benchmark/opperf/opperf.py
  • M benchmark/opperf/rules/default_params.py
  • M benchmark/opperf/utils/benchmark_utils.py
  • M benchmark/opperf/utils/op_registry_utils.py

Results

Full OpPerf Suite (CPU) - Small Tensor
Full OpPerf Suite (CPU) - Int64 Tensor w/changes from cumsum, multi_lars, and RNN PRs

Contributor

@ChaiBapchya ChaiBapchya left a comment

minor changes. Rest looks good. Good job!

benchmark/opperf/nd_operations/gemm_operators.py (review comment, resolved)
benchmark/opperf/nd_operations/nn_activation_operators.py (review comment, resolved)
@ChaiBapchya
Contributor

Also, let's wait for #17445 and #17444 to merge, so that adding the large tensor flag will not break the existing opperf utility.

Contributor

@apeforest apeforest left a comment


Thanks for the contribution!

opperf (just by name) indicates this utility is used to test the performance of operators. We could leverage its implementation to test large tensor correctness, but I am not sure if we should add this as a parameter to the utility. What value does it bring to users, and who is going to use it?

If we are the only users who just need it to test large tensor correctness, we should keep this in a private branch. If we want to expose this functionality to users (again, please think about who the customers are and how they would use it), it would be better to extract it into a separate function such as run_large_tensor_test or similar.
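A minimal sketch of such a wrapper (the name run_large_tensor_test is hypothetical; it simply forwards to the existing full-suite runner, using the int64_tensor keyword and 'on'/'off' values that the merged code ends up exposing):

import mxnet as mx
from benchmark.opperf.opperf import run_all_mxnet_operator_benchmarks

def run_large_tensor_test(ctx=mx.cpu(), dtype='float32', profiler='native', warmup=1, runs=1):
    # hypothetical wrapper: run the whole opperf suite with int64/large tensor inputs enabled
    return run_all_mxnet_operator_benchmarks(ctx=ctx, dtype=dtype, profiler=profiler,
                                             int64_tensor='on', warmup=warmup, runs=runs)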

@connorgoggins
Contributor Author

@apeforest thanks for your feedback! The purpose of this flag would not only be to test operator functionality on large tensor data, but also to test the actual performance of each operator on large tensor data (which falls within the mission of opperf). With this in mind, I believe it makes sense to add this as a parameter to the utility.

This would be valuable to users who are interested in debugging their models' performance at the operator level on large tensor data, thereby helping users create more efficient models when handling high-dimensional data.

I can refactor this into a general run_large_tensor_test function if you would prefer, but I think users may sometimes want to test specific ops and categories of ops on large tensor data instead of being forced to test all ops at the same time.
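For example, with the changes in this PR a user could enable it for just one category (a sketch; the keyword appears as large_tensor='on'/'off' in the docstring diffs below and was later renamed int64_tensor):

import mxnet as mx
from benchmark.opperf.nd_operations.gemm_operators import run_gemm_operators_benchmarks

# benchmark only the GEMM category with large (>= 2**32 element) tensor inputs
run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32', large_tensor='on', warmup=1, runs=1)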

If the consensus is that this would be better as a private branch, I can move in that direction instead.

@apeforest
Contributor

Can users specify custom shapes to test the performance of large tensors instead of using a param? That gives users more freedom.

@ChaiBapchya
Contributor

@apeforest
Actually, if MXNet is built with LTS ON, then the user can just give a shape larger than 2**32 as a custom shape and use the opperf utility.

import mxnet as mx
from mxnet import nd

from benchmark.opperf.utils.benchmark_utils import run_performance_test
run_performance_test(nd.add, run_backward=True, dtype='float32', ctx=mx.cpu(),
                     inputs=[{"lhs": (2**32 + 1, 1),
                              "rhs": (2**32 + 1, 1)}],
                     warmup=0, runs=1)

This flag serves as a quick way of testing large tensor ops. For example, if the user doesn't want to add custom shapes for each operator and just wants to see perf times for all operators, then this flag comes in handy.

python incubator-mxnet/benchmark/opperf/opperf.py --output-format json --output-file mxnet_operator_benchmark_results.json --large-tensor ON

So yes, both are separate use cases and both are possible, with the obvious assumption that MXNet is built with USE_INT64_TENSOR_SIZE = ON.

@apeforest
Contributor

This flag serves as a quick way of testing large tensor ops.

Can you think of a use case where a customer wants such a quick way instead of specifying a custom shape to test an operator? If I were a customer and wanted to know whether an operator would meet the requirements of my input tensor (which could be large), I would just specify the shape and test it. Using a flag --large_tensor is rather vague to me. What does it mean, and how large is LARGE?

@connorgoggins
Contributor Author

With this flag, users could effectively avoid having to create their own custom inputs for each operator, potentially saving them a significant amount of time and effort if they are testing multiple ops. The flag wouldn't be particularly useful if the customer has a specific input tensor shape in mind, but there must also be cases when customers want a quick way of obtaining a more general outlook on the performance of operators under large tensor conditions (e.g. for evaluating op performance differences across different machines and different input sizes).

Would changing the name to int64_tensor introduce more clarity?

@@ -39,6 +39,8 @@ def run_rearrange_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Please specify explicitly here that the tensor size is over 2^32.
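For example, the parameter description could read (one possible wording, not the final text):

large_tensor: str, default 'off'
    If 'on', use tensor inputs with dimension >= 2^32 for the benchmarks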

@@ -48,6 +48,8 @@ def run_mx_binary_broadcast_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Same here.

@@ -75,6 +77,8 @@ def run_mx_binary_element_wise_operators_benchmarks(ctx=mx.cpu(), dtype='float32
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Same here.

@@ -44,6 +44,8 @@ def run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='nativ
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Same here.

"transpose_a": True,
"transpose_b": True}],
warmup=warmup, runs=runs, profiler=profiler)
if large_tensor == "on":
Contributor

What happens if this flag is ON and the user also specifies custom shapes (which are small tensors)?

Contributor Author

The purpose of this flag wouldn't be for use on user-specified shapes, it would be for general category and full suite testing of operator performance on input data with dimensions >= 2^32. If the user wanted to test individual operators with custom shapes, they would use run_performance_test() and add their custom data as input - they wouldn't use the flag in that case, as the run_performance_test() function doesn't take in the large_tensor flag as an argument.

@@ -45,6 +45,8 @@ def run_activation_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler=
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Same here.

"transpose_a": True,
"transpose_b": True}],
warmup=warmup, runs=runs, profiler=profiler)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
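A rough sketch of what that refactor could look like for a single op (hypothetical shapes and variable names; run_performance_test is the existing helper shown earlier in this thread):

if large_tensor == "on":
    inputs = [{"lhs": (2**32 + 1, 1), "rhs": (2**32 + 1, 1)}]
else:
    inputs = [{"lhs": (1024, 1024), "rhs": (1024, 1024)}]
# single call site shared by both branches; only the inputs differ
run_performance_test(nd.add, run_backward=True, dtype=dtype, ctx=ctx,
                     inputs=inputs, warmup=warmup, runs=runs, profiler=profiler)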

],
warmup=warmup,
runs=runs)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?

"moving_var": (3,)}],
warmup=warmup,
runs=runs)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?

],
warmup=warmup,
runs=runs)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?

],
warmup=warmup,
runs=runs)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?

],
warmup=warmup,
runs=runs)
else:
Contributor

It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?

@@ -46,6 +46,8 @@ def run_optimizer_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Be more specific please.

@@ -44,6 +44,8 @@ def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', p
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Be more specific please.

@@ -41,6 +41,8 @@ def run_mx_reduction_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profile
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Be more specific please.

@@ -39,6 +39,8 @@ def run_sorting_searching_operators_benchmarks(ctx=mx.cpu(), dtype='float32', pr
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Be more specific please.

@@ -45,6 +45,8 @@ def run_mx_unary_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='n
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Contributor

Be more specific please.

@connorgoggins connorgoggins force-pushed the opperf_large_tensor_flag branch 3 times, most recently from 11a8eb9 to 87c18fc on February 13, 2020 00:31
@connorgoggins connorgoggins changed the title Implemented large tensor flag for opperf testing [Large Tensor] Implemented LT flag for OpPerf testing Feb 24, 2020
@apeforest apeforest merged commit 95c5189 into apache:master Feb 29, 2020
@ChaiBapchya
Contributor

ChaiBapchya commented Mar 1, 2020

While the full opperf suite was run initially (and is linked in the description), was the full opperf suite run again after the subsequent commits, e.g. the newly added ops and merges?

Could you paste the opperf results after commit 256ad70?

Because right now, with master (CUDA, cuDNN ON), the full opperf suite runs into errors for various ops:

  1. Optimizer update ops
  2. BatchNorm, because of a cuDNN error:
     MXNetError: Check failed: param.eps >= 1e-5 (1e-08 vs. 1e-05) : CuDNN requires eps to be no less than 1e-05
  3. lamb_update_phase1 & 2:
Traceback (most recent call last):
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 213, in <module>
    sys.exit(main())
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 193, in main
    benchmark_results = run_all_mxnet_operator_benchmarks(ctx=ctx, dtype=dtype, profiler=profiler, int64_tensor=int64_tensor, warmup=warmup, runs=runs)
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 111, in run_all_mxnet_operator_benchmarks
    mxnet_operator_benchmark_results.append(run_optimizer_operators_benchmarks(ctx=ctx, dtype=dtype, profiler=profiler, int64_tensor=int64_tensor, warmup=warmup, runs=runs))
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/nd_operations/nn_optimizer_operators.py", line 142, in run_optimizer_operators_benchmarks
    mx_optimizer_op_results = run_op_benchmarks(mx_optimizer_ops, dtype, ctx, profiler, int64_tensor, warmup, runs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 210, in run_op_benchmarks
    warmup=warmup, runs=runs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 177, in run_performance_test
    benchmark_result = _run_nd_operator_performance_test(op, inputs, run_backward, warmup, runs, kwargs_list, profiler)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 114, in _run_nd_operator_performance_test
    _, _ = benchmark_helper_func(op, warmup, **kwargs_list[0])
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/profiler_utils.py", line 200, in cpp_profile_it
    res = func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/ndarray_utils.py", line 97, in nd_forward_and_profile
    res = op(**kwargs_new)
  File "<string>", line 113, in lamb_update_phase1
  File "/home/ubuntu/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 91, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: MXNetError: Required parameter wd of float is not presented, in operator lamb_update_phase1(name="", t="1", rescale_grad="0.4", epsilon="1e-08", beta2="0.1", beta1="0.1")
*** Error in `python': corrupted double-linked list: 0x000055b58a93f6c0 ***

The PR which introduced lamb_update_phase1 to opperf (#17542) worked with CUDA/cuDNN ON, but now it doesn't.

@ChaiBapchya ChaiBapchya mentioned this pull request Mar 2, 2020
@connorgoggins
Contributor Author

@ChaiBapchya thanks for pointing this out. When I ran my tests with this PR on Friday, #17400 hadn't been merged into master yet so the conflicts did not appear. I believe your PR will fix these issues - thanks for your contribution!

MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
* Passing large_tensor parameter down

* Adding large tensor testing functionality for convolutional operators

* Added large tensor test functionality for conv ops

* Fixing sizing for conv ops

* Added gemm large tensor, print on conv

* Updated input for gemm ops and print statements

* Fixed deconv large tensor test

* Added bias for deconv

* Added test functionality for nn_activation and nn_basic ops

* Fixed deconv bias, implemented large tensor test logic for general ops, added default data for large tensor test

* Dropped unnecessary print statements

* Fixed lint errors

* Added large_tensor parameter to existing function descriptions, added descriptions for functions missing descriptions

* Adding docs, changed large_tensor to int64_tensor for clarity

* Added warmup/runs to gemm ops, debugging process failure

* Resolved merge conflicts, added default params and input switching functionality

* Dynamic input handling for default inputs, additional custom data for int64

* Fixed RPD issue

* Everything through reduction ops working

* Random sampling & loss ops working

* Added indices, depth, ravel_data in default_params

* Added indexing ops - waiting for merge on ravel

* Added optimizer ops

* All misc ops working

* All NN Basic ops working

* Fixed LT input for ROIPooling

* Refactored NN Conv tests

* Added test for inline optimizer ops

* Dropping extra tests to decrease execution time

* Switching to inline tests for RNN to support additional modes

* Added state_cell as NDArray param, removed linalg testing for int64 tensor

* Cleaned up styling

* Fixed conv and deconv tests

* Retrigger CI for continuous build

* Cleaned up GEMM op inputs

* Dropped unused param from default_params
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020