
[WIP] 8-bit quantization for inference #771

Merged

Conversation

kpuatamazon
Contributor

@kpuatamazon kpuatamazon commented Jan 16, 2020

Add support for 8-bit quantized matrix multiplication in inference.

This code depends on the intgemm branch in my fork of MXNet: https://github.com/kpuatamazon/incubator-mxnet/tree/intgemm . This will turn into a pull request against MXNet.

Quantized inference on one thread runs 2.95x as fast as the float32 baseline on one thread, and 1.28x as fast as the baseline on four threads. Results were measured on an AWS c5.12xlarge.

Baseline 1 thread:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=1
[INFO:__main__] Processed 2969 lines. Total time: 934.8082, sec/sent: 0.3149, sent/sec: 3.1761
real    15m38.468s
user    15m37.986s
sys    0m5.215s

Baseline 4 threads:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=4
[INFO:__main__] Processed 2969 lines. Total time: 406.0341, sec/sent: 0.1368, sent/sec: 7.3122
real	6m48.662s
user	26m46.845s
sys	0m28.490s

Quantized 1 thread:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=1
[INFO:__main__] Processed 2969 lines. Total time: 314.9344, sec/sent: 0.1061, sent/sec: 9.4274
real    5m17.408s
user    5m17.297s
sys    0m4.476s

BLEU: 42.6 quantized, 42.5 baseline float32. No significant change.

Note that the on-disk format of the int8 file is dependent on the CPU architecture. A fix for this is pending a change to intgemm to separate the quantization and rearrangement steps.

The model is converted to 8-bit offline. Here's a program that converts a model from fp32 to int8; you should also change the config file's dtype to int8. I'm soliciting suggestions on how to do this cleanly, probably as another command-line program.

#!/usr/bin/env python3
import mxnet as mx

model = mx.nd.load("params.best")
# Find all weight tensors except for the source embeddings. This includes output_layer.weight.
dense = [k[0:-7] for k in model.keys() if k.endswith('.weight') and not k.startswith("embedding_source.")]
# The positional embeddings are not quantized yet.
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
    name = param + ".weight"
    b = model[name]
    # Scale so that the largest absolute value maps to 127.
    b_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(b)
    b_prepared = mx.nd.contrib.intgemm_prepare_weight(b, multiplier=b_scale.asscalar())
    model[name] = b_prepared
    # Store the inverse scale so inference can recover the original magnitudes.
    model[param + ".scaling"] = 1.0 / b_scale
mx.nd.save("params.best.quant", model)

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • Were system tests modified? If so did you run these at least 5 times to account for the variation across runs? (Not modified)
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Kenneth Heafield added 5 commits January 16, 2020 19:13
Works with this quantization program (TODO integrate):
import mxnet as mx
model = mx.nd.load("/home/ubuntu/idid-enus/model.amt.sf-concat/params.best")
dense = [k[0:-7] for k in model.keys() if k.endswith('.weight') and not k.startswith("embedding_source.")]
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
  name = param + ".weight"
  b = model[name]
  b_max = mx.nd.contrib.intgemm_maxabsolute(b)
  # The disk format just quantizes.
  b_prepared = mx.nd.contrib.intgemm_prepare_data(b, b_max)
  model[name] = b_prepared
  model[param + ".scaling"] = b_max / 127.0
mx.nd.save("/home/ubuntu/idid-enus/model.amt.sf-concat.quant/params.best", model)
But it doesn't check that all parameters are present in the provided model.
@kpuatamazon
Contributor Author

Updated:

  • CPU-independent disk format.
  • If you've got a float32 model, just add --dtype int8 on the command line and it will quantize on the fly.
  • To quantize and save to disk, do this:
import sockeye.model
model = sockeye.model.load_model("/path/to/float32_model", dtype = 'int8', for_disk_saving = True)
a = model[0]
a.save_parameters("/path/to/int8_model/params.best")
a.save_config("/path/to/int8_model")

You'll also need to ln -s /path/to/float32_model/{version,*.json} /path/to/int8_model/. Opinions welcome on a nice command-line version of this. Should it just copy the vocabs?
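For comparison, a minimal sketch of the on-the-fly path (assuming load_model quantizes in memory when dtype='int8' is passed without for_disk_saving, mirroring what --dtype int8 does on the command line):

import sockeye.model

# Hypothetical usage: quantize an existing float32 model in memory at load time;
# nothing new is written to disk.
model = sockeye.model.load_model("/path/to/float32_model", dtype='int8')
sockeye_model = model[0]  # the SockeyeModel object, as in the snippet above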

Contributor

@fhieber fhieber left a comment

Really looking forward to the corresponding mxnet change to get this merged!
Left a few comments, mostly minor style comments.

I think it would be nice to test int8 quantization in the system tests. This would entail quantizing the model in the test suite and adding another decoding pass that allows asserting on output similarity and/or BLEU. It would also clarify the workflow with int8 quantization.

Resolved review threads on sockeye/layers.py and sockeye/quantization.py.
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
class QuantizableDense(mx.gluon.HybridBlock):
Contributor

Couldn't you inherit from mx.gluon.nn.basic_layers.Dense directly and only override cast() and hybrid_forward()?

Contributor Author

I agree. I tried to do this; it will need a consultation with a Gluon expert.

Contributor

What was the issue?

Contributor

I guess we need to carefully set the prefix for the inheriting class to make sure the parameter names match.

sockeye/model.py (resolved)
model.cast(model_config.dtype)

if quantizing:
    logger.info("Model dtype: quantizing from float32 to int8")
Contributor

We could potentially quantize from FP16, right? Or is everything on disk FP32?

Contributor Author

There isn't a kernel to quantize from FP16 to INT8. CPUs aren't so great at FP16 anyway; they only have instructions to convert to/from FP32 and then do all the math in FP32.

Contributor

So this means that being able to quantize to int8 for inference requires having trained an FP32 model?

Contributor Author

Do you have stable training in FP16? I guess I could add a code path to convert FP16 -> FP32 -> int8, which, sadly, is how the CPU would do it anyway.

sockeye/model.py (resolved)
@kpuatamazon
Contributor Author

kpuatamazon commented May 18, 2020

Now supports three disk formats:

  1. Regular Sockeye float32. Can be used for float32 inference (obviously), which is the default; run int8 with --dtype int8.
  2. float32 + scaling factors. Can be used for float32 inference, which is the default; run int8 with --dtype int8.
  3. int8 + scaling factors. Cannot be used for float32 inference (it won't wastefully reconstitute the matrices); the default is int8.

Adding scaling factors (transition 1 -> 2):

import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='float32', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
#Warning: do not use the loaded model for inference.  Load from disk.

Adding scaling factors and quantizing (transition 1 -> 3):

import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='int8', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
#Warning: do not use the loaded model for inference.  Load from disk.

In both cases you'll need the *.json files and the version file copied over manually.
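A rough sketch of automating that copy (a hypothetical helper, not part of this PR; it assumes the only extra files needed are the top-level *.json files and the version file):

import glob
import os
import shutil

def copy_model_metadata(float32_dir, int8_dir):
    # Copy the vocab/config JSON files and the version file that
    # save_parameters/save_config do not write, so the int8 directory
    # becomes a complete, loadable model.
    extras = glob.glob(os.path.join(float32_dir, "*.json"))
    extras.append(os.path.join(float32_dir, "version"))
    for path in extras:
        shutil.copy(path, int8_dir)

copy_model_metadata("model", "model.annotated")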

Contributor

@mjdenkowski mjdenkowski left a comment

Approved for merge into an intermediate branch for final cleanup.

@mjdenkowski mjdenkowski changed the base branch from sockeye_2 to sockeye_2_heafield_quantize May 20, 2020 15:14
@mjdenkowski mjdenkowski merged commit e4553d3 into awslabs:sockeye_2_heafield_quantize May 20, 2020
@mjdenkowski mjdenkowski mentioned this pull request May 20, 2020
leezu pushed a commit to apache/mxnet that referenced this pull request Aug 31, 2020
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm .

A performance comparison with DNNL aka MKL-DNN is at kpu/intgemm#59

The library targets thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 .

Quantized Sockeye inference is 2.95x as fast as float32. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.
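For illustration, a minimal sketch of the weight-preparation step using the operator names from the Sockeye conversion script above (requires an MXNet build with the intgemm operators; exact semantics may differ):

import mxnet as mx

# A float32 weight matrix standing in for a trained parameter.
w = mx.nd.random.uniform(-0.5, 0.5, shape=(256, 256))
# Map the largest absolute value to 127.
w_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(w)
# Quantize and rearrange into intgemm's CPU-specific layout; done once at model load time.
w_prepared = mx.nd.contrib.intgemm_prepare_weight(w, multiplier=w_scale.asscalar())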

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction every multiply by exploiting the fact that most neural network parameters are near 0.

Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:

  1. Add 128 to the data so it becomes unsigned. But that biases the output. DNNL calculates this bias on the fly by summing the weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
  2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
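A small NumPy sketch of the bookkeeping behind strategy 1 (illustration only, not intgemm code): shifting A by 128 makes it unsigned, and the resulting bias is 128 times the column sums of B, which can be precomputed and folded into the bias term.

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 8)).astype(np.int32)  # signed activations
B = rng.integers(-128, 128, size=(8, 5)).astype(np.int32)  # signed weights

shifted = (A + 128) @ B            # what the unsigned * signed hardware path computes
bias = 128 * B.sum(axis=0)         # precomputable correction: 128 * column sums of B
assert np.array_equal(shifted - bias, A @ B)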

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.