This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1446] Quantization: intgemm matrix multiply wrappers (#17559)
This pull request adds wrappers for the intgemm matrix multiplication library: https://github.com/kpu/intgemm.

A performance comparison with DNNL aka MKL-DNN is at kpu/intgemm#59

The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf. The purpose of this pull request is to enable adding similar functionality to Sockeye: awslabs/sockeye#771.

Quantized Sockeye runs 2.95x as fast as the unquantized baseline. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.
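The prepare-once flow can be sketched in plain NumPy. This is an illustration only: it ignores intgemm's interleaved memory layout and uses hypothetical helper names, showing just the idea of quantizing weights up front and reusing them across inference calls.

```python
import numpy as np

def quantize(x):
    """Map a float32 matrix to int8, scaling the largest |value| to 127.
    (Hypothetical helper; intgemm's real prepare step also reorders memory.)"""
    max_abs = np.abs(x).max()
    q = np.round(x / max_abs * 127.0).astype(np.int8)
    return q, max_abs / 127.0  # int8 matrix plus its dequantization factor

def int8_matmul(a_q, w_q, deq_a, deq_w):
    """Multiply int8 operands, accumulate in int32, then dequantize."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * (deq_a * deq_w)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, (64, 32)).astype(np.float32)
w_q, deq_w = quantize(w)   # done once when the model is loaded
a = rng.normal(0.0, 0.1, (4, 64)).astype(np.float32)
a_q, deq_a = quantize(a)   # activations are quantized per call
out = int8_matmul(a_q, w_q, deq_a, deq_w)  # approximates a @ w
```

In the real operators, the layout conversion happens inside the weight-preparation step, so the multiply itself reads memory sequentially.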

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction every multiply by exploiting the fact that most neural network parameters are near 0.
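The trade-off can be modeled with scalar code (a sketch of the accumulation behavior, not intgemm's SIMD kernel): int8 values near zero produce small products, so a running int16 sum rarely saturates, while large values clamp instead of wrapping around.

```python
import numpy as np

def saturating_dot(a, b):
    """Dot product with a saturating 16-bit accumulator, as on pre-VNNI CPUs.
    Each int8*int8 product fits in int16; the running sum clamps at +/-32767."""
    acc = np.int64(0)
    for x, y in zip(a.astype(np.int64), b.astype(np.int64)):
        acc = np.clip(acc + x * y, -32768, 32767)
    return int(acc)

rng = np.random.default_rng(0)
# Values near zero, as typical of quantized NN parameters: no saturation.
a = rng.integers(-8, 9, 256).astype(np.int8)
b = rng.integers(-8, 9, 256).astype(np.int8)
exact = int(a.astype(np.int64) @ b.astype(np.int64))
print(saturating_dot(a, b) == exact)   # prints True
# Large values clamp rather than wrapping around:
big = np.full(256, 127, dtype=np.int8)
print(saturating_dot(big, big))        # prints 32767
```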

Because x86 only offers an unsigned * signed multiply instruction and most people want signed * signed, there are two strategies one can take:

1. Add 128 to the data so it becomes unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during the GEMM. intgemm calculates this bias in advance, so it can be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it is less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
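The algebra behind strategy 1's precomputed bias is exact and easy to verify (a NumPy sketch, not intgemm's code): (A + 128)·B = A·B + 128·colsum(B), so subtracting the precomputed column sums of the weights recovers the signed product.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, (4, 64)).astype(np.int32)  # signed activations
B = rng.integers(-128, 128, (64, 8)).astype(np.int32)  # signed weights

# Strategy 1: shift the activations by +128 so they fit the unsigned operand,
# then subtract the precomputed bias 128 * (column sums of B).
A_unsigned = A + 128               # values now in [0, 255]
bias = 128 * B.sum(axis=0)         # computed once, at weight-preparation time
result = A_unsigned @ B - bias

print(np.array_equal(result, A @ B))  # prints True
```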

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
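A minimal sketch of the selection order (illustration only; the real dispatch lives in intgemm's C++ and queries CPUID directly, and actual /proc/cpuinfo flag spellings differ, e.g. avx512_vnni):

```python
# Best backend listed first; selection falls through to the oldest supported ISA.
BACKENDS = ["avx512vnni", "avx512bw", "avx2", "ssse3"]

def select_backend(cpu_flags):
    """Return the best backend whose feature flag the CPU reports."""
    for backend in BACKENDS:
        if backend in cpu_flags:
            return backend
    raise RuntimeError("no supported x86 SIMD backend found")

print(select_backend({"ssse3", "avx2"}))              # prints avx2
print(select_backend({"ssse3", "avx2", "avx512bw"}))  # prints avx512bw
```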
kpuatamazon committed Aug 31, 2020
1 parent e2aacce commit 1393602
Showing 11 changed files with 1,157 additions and 3 deletions.
26 changes: 26 additions & 0 deletions CMakeLists.txt
@@ -64,6 +64,7 @@ if(USE_MKL_IF_AVAILABLE AND (NOT APPLE) AND (NOT MSVC) AND (CMAKE_HOST_SYSTEM_PR
else()
option(USE_MKLDNN "Build with MKL-DNN support" OFF)
endif()
cmake_dependent_option(USE_INTGEMM "Build with x86_64 intgemm library for low-precision multiplication" ON "CMAKE_SYSTEM_PROCESSOR STREQUAL x86_64" OFF)
if(NOT MSVC)
option(USE_OPERATOR_TUNING "Enable auto-tuning of operators" ON)
else()
@@ -278,6 +279,22 @@ if(USE_MKLDNN)
set_target_properties(dnnl PROPERTIES CXX_CLANG_TIDY "") # don't lint 3rdparty dependency
endif()

if(USE_INTGEMM)
message(STATUS "Using intgemm")
include(FetchContent)
FetchContent_Declare(
intgemm
GIT_REPOSITORY https://github.com/kpu/intgemm.git
GIT_TAG 02f671cf537fdbc818cf8111d1d9e557a8650d7a
)
FetchContent_GetProperties(intgemm)
if(NOT intgemm_POPULATED)
FetchContent_Populate(intgemm)
endif()
add_subdirectory(${intgemm_SOURCE_DIR} ${intgemm_BINARY_DIR} EXCLUDE_FROM_ALL)
add_definitions(-DMXNET_USE_INTGEMM=1)
endif()

# Allow Cuda compiles outside of src tree to find things in 'src' and 'include'
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/include)
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/src)
@@ -474,6 +491,11 @@ endif()
FILE(GLOB_RECURSE SOURCE "src/*.cc" "src/*.h" "include/*.h")
FILE(GLOB_RECURSE CUDA "src/*.cu" "src/*.cuh")

if(NOT USE_INTGEMM)
FILE(GLOB_RECURSE INTGEMM_OPERATOR_SOURCE "src/operator/contrib/intgemm/*.cc" "src/operator/contrib/intgemm/*.h")
list(REMOVE_ITEM SOURCE ${INTGEMM_OPERATOR_SOURCE})
endif()

# add nnvm to source
FILE(GLOB_RECURSE NNVMSOURCE
3rdparty/tvm/nnvm/src/c_api/*.cc
@@ -750,6 +772,10 @@ if(USE_MKLDNN)
${CMAKE_BINARY_DIR}/3rdparty/mkldnn/include/dnnl_version.h ${CMAKE_SOURCE_DIR}/include/mkldnn/)
endif()

if(USE_INTGEMM)
target_link_libraries(mxnet PRIVATE intgemm)
endif()

function(BuildTVMOP)
# scope the variables in BuildTVM.cmake to avoid conflict
include(cmake/BuildTVM.cmake)
2 changes: 2 additions & 0 deletions LICENSE
@@ -309,6 +309,8 @@
Licensed MIT © Zeno Rocha
11. mx-theme - For details, see docs/python_docs/themes/mx-theme/LICENSE
Copyright (c) 2016 myyasuda
12. intgemm - Refer to 3rdparty/intgemm/LICENSE
Copyright (c) 2017--2019 University of Edinburgh, Nikolay Bogoychev, Mateusz Chudyk, Kenneth Heafield, and Microsoft Corporation


=======================================================================================
2 changes: 1 addition & 1 deletion include/mxnet/base.h
@@ -539,7 +539,7 @@ inline std::ostream& operator<<(std::ostream &out, const Context &ctx) {
#define ADD_FILELINE "\n\nDefined in " __FILE__ ":L" STRINGIZE(__LINE__)


#if MXNET_USE_MKLDNN == 1
#if MXNET_USE_MKLDNN == 1 || MXNET_USE_INTGEMM == 1
constexpr size_t kMKLDNNAlign = 64;
#endif


9 comments on commit 1393602

@mseth10
Contributor

@kpuatamazon @leezu Ever since this commit got merged, the job time for CI centos-cpu has increased by over 45 mins as can be seen from this trend. Build 2207 (this commit) took 107 mins whereas Build 2206 (parent commit) took only 45 mins.

The increase in time comes from the Tests stage Python3: CentOS 7 CPU. Looking at the logs, it is clear that the same tests are taking more time to complete after this commit. Do you think it is because of the newly introduced wrappers to the intgemm library?

Before this commit:

[2020-08-31T05:43:44.771Z] ========================== slowest 50 test durations ===========================
[2020-08-31T05:43:44.771Z] 105.61s call     tests/python/unittest/test_numpy_op.py::test_np_randint
[2020-08-31T05:43:44.771Z] 97.57s call     tests/python/unittest/test_optimizer.py::test_sparse_adam
[2020-08-31T05:43:44.771Z] 48.10s call     tests/python/unittest/test_optimizer.py::test_ftml
[2020-08-31T05:43:44.771Z] 39.72s call     tests/python/unittest/test_optimizer.py::test_signum
[2020-08-31T05:43:44.771Z] 38.13s call     tests/python/unittest/test_numpy_op.py::test_np_interp
[2020-08-31T05:43:44.771Z] 36.06s call     tests/python/unittest/test_numpy_op.py::test_np_trace
[2020-08-31T05:43:44.771Z] 35.53s call     tests/python/unittest/test_optimizer.py::test_nadam
...

After this commit:

[2020-08-31T18:04:25.521Z] ========================== slowest 50 test durations ===========================
[2020-08-31T18:04:25.521Z] 334.12s call     tests/python/unittest/test_numpy_op.py::test_np_randint
[2020-08-31T18:04:25.521Z] 166.87s call     tests/python/unittest/test_optimizer.py::test_sgd
[2020-08-31T18:04:25.521Z] 152.18s call     tests/python/unittest/test_optimizer.py::test_sparse_sgd
[2020-08-31T18:04:25.521Z] 127.80s call     tests/python/unittest/test_numpy_op.py::test_np_interp
[2020-08-31T18:04:25.521Z] 97.68s call     tests/python/unittest/test_sparse_ndarray.py::test_sparse_nd_broadcast
[2020-08-31T18:04:25.521Z] 97.14s call     tests/python/unittest/test_optimizer.py::test_sparse_adam
[2020-08-31T18:04:25.521Z] 87.47s call     tests/python/unittest/test_gluon_model_zoo.py::test_models[vgg19_bn]
...

@kpuatamazon
Contributor Author

Interesting. None of these tests should be running intgemm.

I don't see a corresponding change on unix-cpu: https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-cpu/job/master/buildTimeTrend where 2213 was the change (which failed embarrassingly due to a non-deterministic test that I fixed). That suggests something weird about CentOS 7.

Nor is there a corresponding change in the v1.x branch: https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/unix-cpu/job/v1.x/buildTimeTrend around builds 97 or 103. So it seems peculiar to the master branch. There is a difference in how I'm testing: @pytest.mark.parametrize in master sweeps over sizes and tells the framework I have many small tests, whereas v1.x uses for loops that comprise one small test.

One hypothesis could be that the intgemm test is running in contention with these other tests, causing them to take longer, but the intgemm test doesn't run long to begin with: about 11s if left alone.

I'm setting up a CentOS environment to test on, but I can be slow to respond because this is part-time for me.

@leezu
Contributor

@leezu leezu commented on 1393602 Oct 19, 2020

@kpuatamazon you can reuse the CI CentOS environment via python ci/build.py --run-only --platform centos7_cpu /work/runtime_functions.sh unittest_centos7_cpu

If you start making changes to runtime_functions.sh, you'll need to remove --run-only. If you do that, you can improve Docker caching by adding the --no-pull --cache-intermediate options to avoid pulling the CI cache and to enable the local intermediate Docker build cache.

@mseth10
Contributor

@mseth10 mseth10 commented on 1393602 Oct 19, 2020

@kpuatamazon I tried to reproduce these numbers locally on a c5.18xl EC2 instance (same as used by CI), but did not see any regression. Following are the three main testing modules and their latency numbers before/after this commit on EC2 vs CI.

NOT Serial NOT OPERATOR tests

OMP_NUM_THREADS=18 python -m pytest -m 'not serial' -k 'not test_operator' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest

After commit

  • Local
9246 passed, 151 skipped, 5 xfailed, 50 xpassed, 67043 warnings in 1071.02s (0:17:51)
Slowest:
  1. 202.30s call     tests/python/unittest/test_numpy_op.py::test_np_randint
  2. 96.10s call     tests/python/unittest/test_optimizer.py::test_sparse_adam
  • CI
9246 passed, 151 skipped, 5 xfailed, 50 xpassed, 66614 warnings in 1469.39s (0:24:29)
Slowest:
  1. 334.12s call     tests/python/unittest/test_numpy_op.py::test_np_randint
  2. 166.87s call     tests/python/unittest/test_optimizer.py::test_sgd

Before commit

  • Local
10182 passed, 151 skipped, 5 xfailed, 49 xpassed, 66835 warnings in 1172.70s (0:19:32)
Slowest:
  1. 253.12s call     tests/python/unittest/test_numpy_op.py::test_np_randint
  2. 101.01s call     tests/python/unittest/test_optimizer.py::test_sparse_adam
  • CI
9029 passed, 151 skipped, 5 xfailed, 50 xpassed, 67038 warnings in 658.97s (0:10:58)
Slowest:
  1. 105.61s call     tests/python/unittest/test_numpy_op.py::test_np_randint
  2. 97.57s call     tests/python/unittest/test_optimizer.py::test_sparse_adam

NOT Serial OPERATOR tests

MXNET_ENGINE_TYPE=NaiveEngine OMP_NUM_THREADS=18 python -m pytest -m 'not serial' -k 'test_operator' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest

After commit

  • Local
254 passed, 7 skipped, 66 xpassed, 35 warnings in 470.29s (0:07:50)
Slowest:
  1. 186.12s call     tests/python/unittest/test_operator.py::test_broadcast_binary_op
  2. 146.65s call     tests/python/unittest/test_operator.py::test_order
  • CI
254 passed, 7 skipped, 66 xpassed, 35 warnings in 1238.59s (0:20:38)
Slowest:
  1. 763.25s call     tests/python/unittest/test_operator.py::test_order
  2. 396.39s call     tests/python/unittest/test_operator.py::test_psroipooling

Before commit

  • Local
253 passed, 7 skipped, 66 xpassed, 35 warnings in 771.08s (0:12:51)
Slowest:
  1. 234.78s call     tests/python/unittest/test_operator.py::test_broadcast_binary_op
  2. 176.57s call     tests/python/unittest/test_operator.py::test_psroipooling
  • CI
254 passed, 7 skipped, 66 xpassed, 35 warnings in 164.01s (0:02:44)
Slowest:
  1. 89.86s call     tests/python/unittest/test_operator.py::test_psroipooling
  2. 69.08s call     tests/python/unittest/test_operator.py::test_order

Serial ALL tests

python -m pytest -m serial --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest

After commit

  • Local
146 passed, 9 skipped, 9779 deselected, 2 xpassed, 792 warnings in 1937.07s (0:32:17)
Slowest:
  1. 561.89s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d
  2. 239.44s call     tests/python/unittest/test_random.py::test_randint_generator
  • CI
146 passed, 9 skipped, 9779 deselected, 2 xpassed, 792 warnings in 2006.16s (0:33:26)
Slowest:
  1. 523.17s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d
  2. 283.95s call     tests/python/unittest/test_random.py::test_randint_generator

Before commit

  • Local
145 passed, 9 skipped, 10713 deselected, 2 xpassed, 794 warnings in 2014.50s (0:33:34)
Slowest:
  1. 597.66s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d
  2. 252.59s call     tests/python/unittest/test_random.py::test_randint_generator
  • CI
146 passed, 9 skipped, 9562 deselected, 2 xpassed, 792 warnings in 531.57s (0:08:51)
Slowest:
  1. 96.67s call     tests/python/unittest/test_random.py::test_random
  2. 86.70s call     tests/python/unittest/test_random.py::test_randint_generator

@access2rohit
Contributor

As per @mseth10's findings, it is evident that something is different in the CI setup. So I tried two local builds, with and without the USE_INTGEMM flag.

Without the USE_INTGEMM flag, the total time taken is 13:22 min:

-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================== slowest 50 test durations ===========================
140.61s call     tests/python/unittest/test_random.py::test_random
124.86s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d
103.80s call     tests/python/unittest/test_random.py::test_randint_generator
75.67s call     tests/python/unittest/test_random.py::test_shuffle
26.81s call     tests/python/unittest/test_operator.py::test_image_normalize
25.23s call     tests/python/unittest/test_operator.py::test_big_transpose
24.76s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d
22.68s call     tests/python/unittest/test_random.py::test_negative_binomial_generator
19.30s call     tests/python/unittest/test_operator.py::test_lstm_dropout
17.06s call     tests/python/unittest/test_random.py::test_normal_generator
16.25s call     tests/python/unittest/test_random.py::test_gamma_generator
14.49s call     tests/python/unittest/test_sparse_operator.py::test_sparse_mathematical_core
13.54s call     tests/python/unittest/test_operator.py::test_pseudo2dtranspose
13.54s call     tests/python/unittest/test_gluon.py::test_slice_batchnorm
12.65s call     tests/python/unittest/test_sparse_operator.py::test_sparse_square_sum
12.62s call     tests/python/unittest/test_operator.py::test_gru_dropout
12.12s call     tests/python/unittest/test_gluon.py::test_slice_batchnorm_reshape_batchnorm
10.26s call     tests/python/unittest/test_random.py::test_poisson_generator
9.66s call     tests/python/unittest/test_sparse_operator.py::test_cast_storage_ex
7.73s call     tests/python/unittest/test_random.py::test_exponential_generator
7.23s call     tests/python/unittest/test_ndarray.py::test_update_ops_mutation
6.10s call     tests/python/unittest/test_ndarray.py::test_update_ops_mutation_failed_seed
5.20s call     tests/python/unittest/test_operator.py::test_deconvolution
4.98s call     tests/python/unittest/test_random.py::test_uniform_generator
4.82s call     tests/python/unittest/test_ndarray.py::test_order
4.55s call     tests/python/unittest/test_operator.py::test_rnntanh_dropout
3.88s call     tests/python/unittest/test_sparse_operator.py::test_sparse_dot
3.84s call     tests/python/unittest/test_numpy_ndarray.py::test_np_ndarray_indexing
3.50s call     tests/python/unittest/test_operator.py::test_index_array
3.29s call     tests/python/unittest/test_numpy_interoperability.py::test_np_array_function_protocol
3.04s call     tests/python/unittest/test_random.py::test_dirichlet
2.82s call     tests/python/unittest/test_random.py::test_parallel_random_seed_setting
2.79s call     tests/python/unittest/test_sparse_operator.py::test_sparse_retain
2.69s call     tests/python/unittest/test_operator.py::test_rnnrelu_dropout
2.44s call     tests/python/unittest/test_ndarray.py::test_ndarray_indexing
2.44s call     tests/python/unittest/test_operator.py::test_op_roi_align
2.41s call     tests/python/unittest/test_subgraph.py::test_make_subgraph
2.08s call     tests/python/unittest/test_sparse_operator.py::test_elemwise_add_ex
1.96s call     tests/python/unittest/test_numpy_ndarray.py::test_np_ndarray_binary_element_wise_ops
1.90s call     tests/python/unittest/test_numpy_interoperability.py::test_np_fallback_ops
1.72s call     tests/python/unittest/test_ndarray.py::test_broadcast
1.69s call     tests/python/unittest/test_numpy_interoperability.py::test_np_array_ufunc_protocol
1.53s call     tests/python/unittest/test_sparse_operator.py::test_sparse_storage_fallback
1.52s call     tests/python/unittest/test_gluon.py::test_slice_activation_reshape_activation
1.46s call     tests/python/unittest/test_ndarray.py::test_reduce
1.44s call     tests/python/unittest/test_random.py::test_multinomial_generator
1.39s call     tests/python/unittest/test_gluon.py::test_slice_activation_slice_activation
1.15s call     tests/python/unittest/test_random.py::test_parallel_random_seed_setting_for_context
1.02s call     tests/python/unittest/test_ndarray.py::test_broadcast_binary
0.82s call     tests/python/unittest/test_ndarray.py::test_basic_indexing_is_contiguous
= 145 passed, 9 skipped, 10714 deselected, 2 xpassed, 794 warnings in 802.24s (0:13:22) =

And with the USE_INTGEMM flag, the total time taken is 38:50 min:

-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================== slowest 50 test durations ===========================
587.04s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d
297.80s call     tests/python/unittest/test_random.py::test_randint_generator
242.53s call     tests/python/unittest/test_random.py::test_random
219.57s call     tests/python/unittest/test_gluon.py::test_slice_pooling2d
111.43s call     tests/python/unittest/test_random.py::test_negative_binomial_generator
92.50s call     tests/python/unittest/test_random.py::test_poisson_generator
82.57s call     tests/python/unittest/test_random.py::test_shuffle
73.80s call     tests/python/unittest/test_gluon.py::test_slice_batchnorm
71.64s call     tests/python/unittest/test_gluon.py::test_slice_batchnorm_reshape_batchnorm
59.64s call     tests/python/unittest/test_random.py::test_normal_generator
54.36s call     tests/python/unittest/test_operator.py::test_image_normalize
49.02s call     tests/python/unittest/test_random.py::test_gamma_generator
39.77s call     tests/python/unittest/test_operator.py::test_big_transpose
32.32s call     tests/python/unittest/test_operator.py::test_lstm_dropout
28.42s call     tests/python/unittest/test_operator.py::test_pseudo2dtranspose
21.54s call     tests/python/unittest/test_sparse_operator.py::test_sparse_mathematical_core
21.00s call     tests/python/unittest/test_operator.py::test_deconvolution
20.06s call     tests/python/unittest/test_random.py::test_exponential_generator
19.59s call     tests/python/unittest/test_random.py::test_uniform_generator
14.02s call     tests/python/unittest/test_subgraph.py::test_make_subgraph
13.68s call     tests/python/unittest/test_sparse_operator.py::test_cast_storage_ex
12.82s call     tests/python/unittest/test_ndarray.py::test_update_ops_mutation
11.35s call     tests/python/unittest/test_operator.py::test_gru_dropout
10.65s call     tests/python/unittest/test_operator.py::test_index_array
10.14s call     tests/python/unittest/test_gluon.py::test_slice_activation_reshape_activation
9.93s call     tests/python/unittest/test_gluon.py::test_slice_activation_slice_activation
7.71s call     tests/python/unittest/test_sparse_operator.py::test_sparse_storage_fallback
7.69s call     tests/python/unittest/test_numpy_interoperability.py::test_np_array_function_protocol
7.55s call     tests/python/unittest/test_ndarray.py::test_order
6.64s call     tests/python/unittest/test_random.py::test_parallel_random_seed_setting
6.17s call     tests/python/unittest/test_ndarray.py::test_update_ops_mutation_failed_seed
6.08s call     tests/python/unittest/test_sparse_operator.py::test_sparse_square_sum
5.67s call     tests/python/unittest/test_operator.py::test_rnnrelu_dropout
5.51s call     tests/python/unittest/test_operator.py::test_rnntanh_dropout
5.40s call     tests/python/unittest/test_random.py::test_multinomial_generator
5.30s call     tests/python/unittest/test_numpy_ndarray.py::test_np_ndarray_indexing
5.15s call     tests/python/unittest/test_sparse_operator.py::test_sparse_dot
4.40s call     tests/python/unittest/test_sparse_operator.py::test_sparse_retain
4.15s call     tests/python/unittest/test_random.py::test_dirichlet
3.51s call     tests/python/unittest/test_ndarray.py::test_ndarray_indexing
3.11s call     tests/python/unittest/test_operator.py::test_op_roi_align
2.87s call     tests/python/unittest/test_numpy_interoperability.py::test_np_array_ufunc_protocol
2.80s call     tests/python/unittest/test_numpy_interoperability.py::test_np_fallback_ops
2.80s call     tests/python/unittest/test_random.py::test_parallel_random_seed_setting_for_context
2.41s call     tests/python/unittest/test_gluon.py::test_slice_activation
2.23s call     tests/python/unittest/test_ndarray.py::test_broadcast
2.15s call     tests/python/unittest/test_numpy_ndarray.py::test_np_ndarray_binary_element_wise_ops
1.76s call     tests/python/unittest/test_ndarray.py::test_reduce
1.68s call     tests/python/unittest/test_numpy_ndarray.py::test_np_multinomial
1.24s call     tests/python/unittest/test_ndarray.py::test_broadcast_binary
= 145 passed, 9 skipped, 10714 deselected, 2 xpassed, 794 warnings in 2330.87s (0:38:50) =

Perhaps the build with USE_INTGEMM is slowing down test runs in CentOS-CPU.

I ran the above tests inside a CentOS-CPU Docker container identical to the one used in our CI.

@leezu @kpuatamazon any thoughts as to why enabling the USE_INTGEMM flag would cause a slowdown?

@leezu
Contributor

@leezu leezu commented on 1393602 Oct 28, 2020

Can you reproduce the slowdown while only looking at a single test? Or does the slowdown only occur when running the whole testsuite?

@access2rohit
Contributor

@leezu
I ran the test test_gluon.py::test_slice_pooling2d_slice_pooling2d:
with USE_INTGEMM = ON, time taken = 00:09:46
with USE_INTGEMM = OFF, time taken = 00:02:06

The results are consistent with both devtoolset-7 and devtoolset-8.

@kpuatamazon
Contributor Author

I've a few hypotheses about this:

  1. Something about aligned memory allocation https://github.com/apache/incubator-mxnet/blob/master/src/storage/cpu_device_storage.h#L53 since apparently MKLDNN is not enabled for this test, so I'm the only one turning the 64-byte alignment on. This can be tested by commenting out /* || MXNET_USE_INTGEMM == 1 */. It will break the intgemm tests, but if the other tests go faster, then we know what's up. This is the only thing I've touched in core MXNet.
  2. The test is running in parallel with other tests and making them slow. Or it's triggering AVX-512 downclocking (but that shouldn't be this bad). This seems unlikely since the runtime of the test is lower than the reported slowdowns.
  3. I do have some static constructors that call CPUID. I could disable these constructors, which would break intgemm but help track down a cause. But this also seems unlikely, because it would add only a constant overhead if mxnet is loaded once per test (unless it isn't?).

@access2rohit has been coaching me on getting a CI environment setup.

As you may know, I'm part-time on this and expect to be in on Friday to look further.

@kpuatamazon
Contributor Author

I believe it's OpenMP. Let's move to #19502
