[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331
Inference Benchmarks comparing LT_MKL with just MKL Enabled.
Average percentage change over all numbers:
Training Benchmarks comparing LT_MKL with just MKL Enabled. Note: samples/second works in the opposite direction, so I have multiplied those percentages by -1. A quick explanation: that number should be going higher, so a positive raw percentage change means there are now fewer samples/second, and a negative raw percentage change means there are more samples/second.
Average percentage change:
@jonatan1626 thanks for the update. Does ...
@eric-haibin-lin Yes, I am calculating this as: 1 - (LT_MKL value / MKL value).
In your description, "A negative percentage change means there are more samples/second." Doesn't that mean a negative percentage is faster?
@apeforest Oh sorry, I'm multiplying by -1 only for the samples/second column, to keep the meaning consistent with everything else. The rest of the columns already depict improvement as a positive percentage and degradation as a negative percentage. For example, if MKL_LT gives 66 samples/sec and MKL gives 70 samples/sec, the raw value is 1 - 66/70 ≈ +5.7% even though that is a slowdown. On the other hand, if MKL_LT gives 74 samples/sec and MKL gives 70 samples/sec, the raw value is 1 - 74/70 ≈ -5.7% even though that is a speedup. So I multiply by -1 to give it the same meaning as the rest of the percentages, where positive is better and negative is worse.
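For concreteness, here is a minimal sketch of that sign-flip calculation; the function name and the 66/70 and 74/70 samples are only illustrations of this comment, not part of the benchmark scripts:

```cpp
#include <cstdio>

// Percentage change as described above: 1 - (LT_MKL value / MKL value).
// For throughput columns (samples/sec), the result is multiplied by -1 so
// that a positive number always means "better with large tensor support".
double pct_change(double lt_mkl, double mkl, bool higher_is_better) {
    double change = 1.0 - lt_mkl / mkl;
    return higher_is_better ? -change : change;
}

int main() {
    // Samples/sec examples from the comment above.
    std::printf("%.2f%%\n", 100 * pct_change(66.0, 70.0, /*higher_is_better=*/true));  // ~ -5.71% (worse)
    std::printf("%.2f%%\n", 100 * pct_change(74.0, 70.0, /*higher_is_better=*/true));  // ~ +5.71% (better)
    return 0;
}
```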
The slowdown for BERT (-22.98%) is quite significant. We will need to mitigate this before moving forward.
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation in the broadcast_axis operator. Running the operator-level profiler we could also identify the performance drop in broadcast_axis without vs. with the USE_INT64_TENSOR_SIZE flag (roughly 2.8 ms vs. 6.3 ms average forward time; see the profiler output in the description). Also, as I look into the implementation of the broadcast_axis operator, many modulo and multiplication operations on the indices are involved. The next step will be to find an optimal implementation of broadcast_axis that reduces the index ALU work in the kernel.
@szha @eric-haibin-lin @apeforest With current master and the new broadcast_axis changes, on a p3.16xl single-GPU training run. BERT run command:
Results:
"new" refers to mxnet code with optimized broadcast_axis. |
@access2rohit This result is a little surprising. In the earlier benchmark results provided by @JonTanS, there is a ~18% degradation in BERT training when the large tensor (LT) compiler flag is turned on:
However, from your result, even without your latest speedup in the broadcast_axis operator, there is very little difference with the LT flag on:
Could you provide more insights?
@apeforest The profiling by @JonTanS was done a while back, using mxnet-1.6 in November. These results use the current master branch of MXNet, and the BERT scripts have changed too. If there are newer settings for running BERT on a single node, they are not available on the Gluon NLP site. If @eric-haibin-lin or @szhengac can verify whether my BERT setup is correct and provide proper tuning params for running BERT on a single node, I will re-run the benchmarks and update the results here.
PR #17882 fixes the regression in SSD. Following are the new results for the SSD run:
@apeforest @sandeep-krishnamurthy @szha @zheng-da The PRs to enable Large Tensor Support as the default in master are divided into two stages. Once the above 2 PRs are merged, MXNet will support large tensors on CPU/GPU (depending on global memory) on master.
Currently Large Tensor Support works with all operators implemented in MXNet, and MKLDNN also supports int64. Kernels written inside MXNet, both generic (cpu/gpu) and CUDA-specific (gpu only), support large tensors depending on device memory. BLAS and LAPACK libs were not considered while defining the scope of the project. MXNet currently supports several BLAS and LAPACK implementations, with openBLAS as the default. Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t maps to long int* as opposed to the long long int* expected by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t. Initially openBLAS can be supported, since it is used by default and in the pypi wheels as well, thus not breaking any default behaviour for users. Users on other BLAS and LAPACK implementations won't face issues as long as they don't use large tensors. Additional error messages will be added for the case where a large tensor is used but the BLAS implementation is not openBLAS, until that BLAS library is made to work with MXNet's large tensor support. NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, and hence will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a larger script).
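As a rough illustration of the pointer-casting issue mentioned above, here is a minimal sketch. The ilp64_saxpy prototype is a hypothetical stand-in for an ILP64 BLAS routine whose integer arguments are declared as long long int (as in MKL's ILP64 interface); it is not the actual MKL or MXNet wrapper code:

```cpp
#include <cstdint>

// Hypothetical ILP64 BLAS prototype: integer arguments are 'long long int',
// not 'int64_t' (which is 'long int' on 64-bit Linux). Name and parameters
// are placeholders, not a real library signature.
extern "C" void ilp64_saxpy(const long long int* n, const float* alpha,
                            const float* x, const long long int* incx,
                            float* y, const long long int* incy);

// MXNet-side wrapper keeping indices as int64_t. Even though both types are
// 64 bits wide, int64_t* and long long int* are distinct pointer types, so a
// reinterpret_cast is needed when forwarding the arguments.
inline void saxpy_wrapper(int64_t n, float alpha, const float* x, int64_t incx,
                          float* y, int64_t incy) {
  ilp64_saxpy(reinterpret_cast<const long long int*>(&n), &alpha,
              x, reinterpret_cast<const long long int*>(&incx),
              y, reinterpret_cast<const long long int*>(&incy));
}
```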
Thanks @access2rohit for the summary. Is the plan to enable Large Tensor Support in the following order?
Does this order of execution look okay to you, @access2rohit @leezu @szha @zheng-da?
Has large tensor support for numpy arrays been added?
@access2rohit can correct me, but a few of them are supported, as they use the same kernels under the hood. The scope of this issue was mainly NDArray when it got started. After these are done, the remaining numpy ops will also be supported.
yes
Upon inspecting the numpy files inside MXNet: they use index_t for iterating over elements in their own kernels, and use the NDArray kernels for the rest, where we made sure index_t is used where required. For kernels using BLAS, I will update them in the same PR that makes the MXNet openBLAS wrappers int64 compatible.
I'm a little concerned that we don't have a correct integration of BLAS and LAPACK: users may hit BLAS kernels and get potential crashes or corrupt results. But I think @sandeep-krishnamurthy's point
refers to fixing this? If so, I'm fine with the order of execution. Thank you @access2rohit for the hard work on this feature.
@leezu yes, that's what I meant.
I think the numpy frontend doesn't support large tensors yet. I started working on it here: #18368, but I haven't found the time to finish migrating all the tests. @access2rohit would you be able to help out and take that over?
Description
Currently, MXNet only supports tensors with fewer than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are often used in applications such as recommendation systems with sparse embedding matrices, and in graph neural networks such as DGL.
To provide a better user experience, we would like to turn on this compiler flag by default so that MXNet binary release will support large tensors.
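Roughly, the flag works by switching the element-index type used throughout the backend; a simplified sketch of that mechanism is below. The macro and typedef names follow the mshadow/MXNet convention but are paraphrased here, not copied from the source tree:

```cpp
#include <cstdint>

// Simplified sketch: when the build flag is ON, tensor sizes and indices use
// a 64-bit type; otherwise a 32-bit type, which caps tensor size at 2^31 - 1.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;   // large tensor support enabled
#else
typedef int32_t index_t;   // historical default: sizes must stay below 2^31
#endif
```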
RFC: https://lists.apache.org/thread.html/df53b8c26e9e0433378dd803baba9fec4dd922728a5ce9135dc164b3@%3Cdev.mxnet.apache.org%3E
Current Status:
Large tensor support is already implemented in MXNet backend and C API. Over 80 operators have been tested and more are being tested.
There was a performance degradation in a few operators such as transpose; it has since been fixed (#16104).
Model Inference Performance
int64/int32 P50 records the 50th-percentile inference runtime.
% Diff: runtime speedup of the int64 build vs the int32 build.
Thus a positive value means inference time is reduced when using int64 as the tensor index type.
Model Training Performance
* measures speed instead of throughput
What Caused the Performance Drop in BERT
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation in the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could identify the 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
Why is the broadcast_axis Operator Affected
Too many div/mul/mod ALU operations on the indices, which changed from int32 to int64 (see the sketch below).
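To sketch why that hurts, below is a simplified stand-in for the per-element index arithmetic a broadcast kernel performs (not the actual MXNet kernel). Every output element pays for a division, a modulo and a multiplication on the index type, and all of these become 64-bit ALU operations once the index type widens to int64:

```cpp
#include <cstdint>

// Simplified broadcast along one axis: input shape (leading, 1, trailing) is
// expanded to (leading, bcast, trailing). Each flat output index is mapped
// back to a flat input index. With 64-bit indices, every '/', '%' and '*'
// below is a 64-bit ALU operation, noticeably slower than 32-bit on GPUs.
template <typename IndexT>
void broadcast_axis_1d(const float* in, float* out,
                       IndexT leading,   // product of dims before the axis
                       IndexT bcast,     // broadcast size along the axis
                       IndexT trailing)  // product of dims after the axis
{
  const IndexT out_size = leading * bcast * trailing;
  for (IndexT i = 0; i < out_size; ++i) {
    IndexT lead_idx  = i / (bcast * trailing);   // div on IndexT
    IndexT trail_idx = i % trailing;             // mod on IndexT
    out[i] = in[lead_idx * trailing + trail_idx];  // mul on IndexT
  }
}

// With the flag OFF, IndexT is effectively int32_t; with it ON, int64_t.
```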
TODO (@ChaiBapchya)
- Skipping tests that cannot fit in nightly CI machine #17450
- Re-Enabling Large Tensor and Vector Nightly on GPU #16164
- Enabling build stage gpu_int64 to enable large tensor nightly runs #17546
- Implement remaining nn_basic ops in opperf #17456
- Updated PartialSortSmallK for LT support #17462