
[MXNET-1446] Quantization: intgemm matrix multiply wrappers #17559

Merged · 81 commits · Aug 31, 2020

Conversation

kpuatamazon (Contributor)

Description

This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm .

A performance comparison with DNNL (formerly MKL-DNN) is at kpu/intgemm#59.

The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The goal of this pull request is to enable similar functionality in Sockeye: awslabs/sockeye#771 .

Quantized Sockeye is 2.95x as fast. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0, so the 16-bit accumulator rarely saturates.
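As a rough illustration of the idea, here is a NumPy sketch (not intgemm's actual kernel, which does this with SIMD saturating adds; all names here are mine):

import numpy as np

# Toy model of saturating 16-bit accumulation of int8 products.
a = np.random.randint(-4, 5, size=64)  # small values, typical of NN parameters
b = np.random.randint(-4, 5, size=64)
acc = 0
for x, y in zip(a, b):
    # Saturating add: clamp to the int16 range instead of widening to 32-bit.
    acc = min(max(acc + int(x) * int(y), -32768), 32767)
# Matches the exact 32-bit dot product unless saturation occurred.
print(acc, int(np.dot(a.astype(np.int64), b)))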

Because x86 only offers an unsigned * signed multiply instruction and most people want signed * signed, there are two strategies one can take.

1. Add 128 to the data so it is unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during the GEMM. intgemm calculates this bias in advance, where it can be folded into the layer's bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it is less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
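A NumPy sketch of strategy 1's arithmetic (illustrative only; intgemm and DNNL implement this with SIMD integer instructions):

import numpy as np

# x86 multiplies unsigned * signed, so shift the signed data up by 128 and
# subtract the induced bias, which depends only on the weights and can be
# precomputed (or folded into the layer's bias term).
A = np.random.randint(-128, 128, size=(5, 64))   # data, signed int8 range
B = np.random.randint(-128, 128, size=(64, 8))   # weights, signed int8 range
bias_correction = 128 * B.sum(axis=0)            # computed once per weight matrix
C = (A + 128) @ B - bias_correction              # unsigned * signed, bias removed
assert (C == A @ B).all()                        # identical to the signed product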

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR).
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • submodule for intgemm
  • intgemm_prepare_data and intgemm_prepare_weight operators to convert operands from fp32
  • intgemm_take_weight for taking weights already in intgemm's weight format, which is useful for vocabulary shortlists in Sockeye (see the sketch after this list).
  • intgemm_fully_connected for matrix multiply
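For the shortlist case, usage would look roughly like the following sketch. Argument names and order for intgemm_prepare_weight and intgemm_take_weight are assumptions here; see the operator documentation added in this PR for the real API.

import mxnet as mx

# Hypothetical vocabulary-shortlist example; argument names are assumptions.
weight = mx.nd.random_uniform(low=-1.0, high=1.0, shape=[32000, 512])  # full vocab
prepared = mx.nd.contrib.intgemm_prepare_weight(
    weight, mx.nd.contrib.intgemm_maxabsolute(weight))
# Select only the rows for a per-sentence shortlist, staying in intgemm's
# weight format so no re-preparation is needed.
shortlist = mx.nd.array([0, 1, 2, 17, 2029], dtype='int32')
small_weight = mx.nd.contrib.intgemm_take_weight(prepared, shortlist)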

Comments

Backward compatible.
intgemm requires the inner dimension to be a multiple of 64 for efficiency and alignment reasons. Currently the output dimension must be a multiple of 8, but there is in-progress code in intgemm to remove that restriction.
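A quick sanity check of those constraints (a sketch; the helper name is mine, not part of the PR):

# Hypothetical helper expressing the current shape constraints.
def intgemm_shapes_ok(inner_dim, output_dim):
    # Inner dimension: multiple of 64; output dimension: multiple of 8.
    return inner_dim % 64 == 0 and output_dim % 8 == 0

assert intgemm_shapes_ok(64, 8)        # the shapes used in the example below
assert not intgemm_shapes_ok(96, 8)    # 96 is not a multiple of 64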

Kenneth Heafield added 29 commits November 28, 2019 15:41
import mxnet as mx
a = mx.nd.random_uniform(low=-1.0, high=1.0, shape=[5, 64])
b = mx.nd.random_uniform(low=-1.0, high=1.0, shape=[8, 64])
# Scale so the largest-magnitude weight maps to the int8 maximum, 127.
b_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(b).asscalar()
# Quantize b into intgemm's weight layout. (This early snippet predates the
# rename to intgemm_prepare_weight in the final operator set of this PR.)
b_prepared = mx.nd.contrib.intgemm_prepareb(b, multiplier=b_scale)
# fp32 reference and its quantized equivalent; the outputs should be close.
mx.nd.FullyConnected(a, b, num_hidden=8, no_bias=True, flatten=False)
mx.nd.contrib.intgemm_fully_connected(a, b_prepared, out_float_multiplier=1.0/b_scale, num_hidden=8, no_bias=True, flatten=False)
@szha (Member) commented Aug 20, 2020:

cc @leezu to review the build logic.

@szha (Member) commented Aug 20, 2020:

Otherwise LGTM. I reviewed the tests and the op implementation.

@szha dismissed their stale review August 28, 2020 16:30:

addressed concerns. thanks @kpuatamazon!

@kpuatamazon (Contributor, Author):

@leezu Ready?

@kpuatamazon (Contributor, Author):

@mxnet-bot run ci [unix-gpu]
Looks like an unrelated test failure.

@mxnet-bot:

Jenkins CI successfully triggered: [unix-gpu]

@leezu (Contributor) commented Aug 31, 2020:

Thank you @kpuatamazon

@leezu leezu merged commit 1393602 into apache:master Aug 31, 2020
samskalicky pushed a commit that referenced this pull request Sep 16, 2020
* cherry-pick intgemm from master, fix build

* Fix test to conform to 1.x

* Makefile supporting intgemm compilation

* Stricter dependencies on git checkout of intgemm

* Operators depend on mkldnn

* Don't compile intgemm with gcc older than 5

* Fix intgemm test for windows on 1.x by not using pytest

* Update intgemm to use template arguments for integer immediates

* Try to fix clang3.6

* Ban gcc < 5 in cmake

* Update intgemm with gcc 5.5 debug workaround