
[WIP] 8-bit quantization for inference #771

Merged

Conversation

kpuatamazon
Contributor

@kpuatamazon kpuatamazon commented Jan 16, 2020

Add support for 8-bit quantized matrix multiplication in inference.

This code depends on the intgemm branch in my fork of MXNet: https://github.com/kpuatamazon/incubator-mxnet/tree/intgemm . This will turn into a pull request against MXNet.

Quantized inference on one thread runs 2.95x as fast as the float32 baseline on one thread, and 1.28x as fast as the baseline on four threads. Results were measured on an AWS c5.12xlarge.

Baseline 1 thread:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=1
[INFO:__main__] Processed 2969 lines. Total time: 934.8082, sec/sent: 0.3149, sent/sec: 3.1761
real    15m38.468s
user    15m37.986s
sys    0m5.215s

Baseline 4 threads:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=4
[INFO:__main__] Processed 2969 lines. Total time: 406.0341, sec/sent: 0.1368, sent/sec: 7.3122
real	6m48.662s
user	26m46.845s
sys	0m28.490s

Quantized 1 thread:
export MXNET_ENGINE_TYPE=NaiveEngine
export OMP_NUM_THREADS=1
[INFO:__main__] Processed 2969 lines. Total time: 314.9344, sec/sent: 0.1061, sent/sec: 9.4274
real    5m17.408s
user    5m17.297s
sys    0m4.476s

BLEU: 42.6 quantized, 42.5 baseline float32. No significant change.

Note that the on-disk format of the int8 file is dependent on the CPU architecture. A fix for this is pending a change to intgemm to separate the quantization and rearrangement steps.

The model is converted to 8-bit offline. Here's a program that converts a model from fp32 to int8; you should also change the config file's dtype to int8. I'm soliciting suggestions on how to do this cleanly, probably as another command-line program.

#!/usr/bin/env python3
import mxnet as mx

model = mx.nd.load("params.best")
# Find all weight tensors except for the source embeddings. This includes output_layer.weight.
dense = [k[0:-7] for k in model.keys() if k.endswith('.weight') and not k.startswith("embedding_source.")]
# The positional embeddings are not quantized yet.
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
    name = param + ".weight"
    b = model[name]
    # Scale so that the largest absolute value maps to 127.
    b_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(b)
    b_prepared = mx.nd.contrib.intgemm_prepare_weight(b, multiplier=b_scale.asscalar())
    model[name] = b_prepared
    # Store the inverse scale so inference can recover the original magnitudes.
    model[param + ".scaling"] = 1.0 / b_scale
mx.nd.save("params.best.quant", model)

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • Were system tests modified? If so did you run these at least 5 times to account for the variation across runs? (Not modified)
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Kenneth Heafield added 5 commits January 16, 2020 19:13
Works with this quantization program (TODO integrate):
import mxnet as mx
model = mx.nd.load("/home/ubuntu/idid-enus/model.amt.sf-concat/params.best")
dense = [k[0:-7] for k in model.keys() if k.endswith('.weight') and not k.startswith("embedding_source.")]
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
  name = param + ".weight"
  b = model[name]
  b_max = mx.nd.contrib.intgemm_maxabsolute(b)
  # The disk format just quantizes.
  b_prepared = mx.nd.contrib.intgemm_prepare_data(b, b_max)
  model[name] = b_prepared
  model[param + ".scaling"] = b_max / 127.0
mx.nd.save("/home/ubuntu/idid-enus/model.amt.sf-concat.quant/params.best", model)
But it doesn't check that all parameters are present in the provided model.
@kpuatamazon
Contributor Author

Updated:

  • CPU-independent disk format.
  • If you've got a float32 model, just add --dtype int8 on the command line and it will quantize on the fly.
  • To quantize and save to disk, do this:
import sockeye.model
model = sockeye.model.load_model("/path/to/float32_model", dtype = 'int8', for_disk_saving = True)
a = model[0]
a.save_parameters("/path/to/int8_model/params.best")
a.save_config("/path/to/int8_model")

You'll also need to ln -s /path/to/float32_model/{version,*.json} /path/to/int8_model/. Opinions welcome on a nice command-line version of this. Should it just copy the vocabs?
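For comparison, a minimal sketch of the on-the-fly path (assuming load_model quantizes in memory when dtype='int8' is passed without for_disk_saving, mirroring what --dtype int8 does on the command line):

import sockeye.model

# Hypothetical usage: quantize an existing float32 model in memory at load time;
# nothing new is written to disk.
model = sockeye.model.load_model("/path/to/float32_model", dtype='int8')
sockeye_model = model[0]  # the SockeyeModel object, as in the snippet above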

Contributor

@fhieber fhieber left a comment

Really looking forward to the corresponding mxnet change to get this merged!
Left a few comments, mostly minor style comments.

I think it would be nice to test int8 quantization in the system tests. This would entail quantizing the model in the test suite and adding another decoding pass that allows asserting on output similarity and/or BLEU. It would also clarify the workflow with int8 quantization.

Resolved review threads on sockeye/layers.py and sockeye/quantization.py.
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
class QuantizableDense(mx.gluon.HybridBlock):
Contributor

Couldn't you inherit from mx.gluon.nn.basic_layers.Dense directly and only override cast() and hybrid_forward()?

Contributor Author

I agree. I tried to do this; it will need a consultation with a Gluon expert.

Contributor

What was the issue?

Contributor

I guess we need to carefully set the prefix for the inheriting class to make sure the parameter names match.

sockeye/model.py (resolved)
model.cast(model_config.dtype)

if quantizing:
    logger.info("Model dtype: quantizing from float32 to int8")
Contributor

We could potentially quantize from FP16, right? Or is everything on disk FP32?

Contributor Author

There isn't a kernel to quantize from FP16 to INT8. CPUs aren't so great at FP16 anyway; they only have instructions to convert to/from FP32 and then do all the math in FP32.

Contributor

So this means that being able to quantize to int8 for inference requires having trained an FP32 model?

Contributor Author

Do you have stable training in FP16? I guess I could add a code path to convert FP16 -> FP32 -> int8, which, sadly, is how the CPU would do it anyway.

sockeye/model.py (resolved)
@kpuatamazon
Contributor Author

kpuatamazon commented May 18, 2020

Now supports three disk formats:

  1. Regular Sockeye float32. Can be used for float32 inference (obviously), which is the default; run int8 with --dtype int8.
  2. float32 + scaling factors. Can be used for float32 inference, which is the default; run int8 with --dtype int8.
  3. int8 + scaling factors. Cannot be used for float32 inference (it won't wastefully reconstitute the matrices); the default is int8.

Adding scaling factors (transition 1 -> 2):

import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='float32', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
#Warning: do not use the loaded model for inference.  Load from disk.

Adding scaling factors and quantizing (transition 1 -> 3):

import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='int8', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
#Warning: do not use the loaded model for inference.  Load from disk.

In both cases you'll need the *.json files and the version file copied over manually.
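A rough sketch of automating that copy (a hypothetical helper, not part of this PR; it assumes the only extra files needed are the top-level *.json files and the version file):

import glob
import os
import shutil

def copy_model_metadata(float32_dir, int8_dir):
    # Copy the vocab/config JSON files and the version file that
    # save_parameters/save_config do not write, so the int8 directory
    # becomes a complete, loadable model.
    extras = glob.glob(os.path.join(float32_dir, "*.json"))
    extras.append(os.path.join(float32_dir, "version"))
    for path in extras:
        shutil.copy(path, int8_dir)

copy_model_metadata("model", "model.annotated")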

Contributor

@mjdenkowski mjdenkowski left a comment

Approved for merge into an intermediate branch for final cleanup.

@mjdenkowski mjdenkowski changed the base branch from sockeye_2 to sockeye_2_heafield_quantize May 20, 2020 15:14
@mjdenkowski mjdenkowski merged commit e4553d3 into awslabs:sockeye_2_heafield_quantize May 20, 2020
@mjdenkowski mjdenkowski mentioned this pull request May 20, 2020
leezu pushed a commit to apache/mxnet that referenced this pull request Aug 31, 2020
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm .

A performance comparison with DNNL aka MKL-DNN is at kpu/intgemm#59

The library targets thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 .

Quantized Sockeye inference is 2.95x as fast as float32. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.
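For illustration, a minimal sketch of the weight-preparation step using the operator names from the Sockeye conversion script above (requires an MXNet build with the intgemm operators; exact semantics may differ):

import mxnet as mx

# A float32 weight matrix standing in for a trained parameter.
w = mx.nd.random.uniform(-0.5, 0.5, shape=(256, 256))
# Map the largest absolute value to 127.
w_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(w)
# Quantize and rearrange into intgemm's CPU-specific layout; done once at model load time.
w_prepared = mx.nd.contrib.intgemm_prepare_weight(w, multiplier=w_scale.asscalar())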

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction every multiply by exploiting the fact that most neural network parameters are near 0.

Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:

  1. Add 128 to the data so it becomes unsigned. But that biases the output. DNNL calculates this bias on the fly by summing the weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
  2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
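A small NumPy sketch of the bookkeeping behind strategy 1 (illustration only, not intgemm code): shifting A by 128 makes it unsigned, and the resulting bias is 128 times the column sums of B, which can be precomputed and folded into the bias term.

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 8)).astype(np.int32)  # signed activations
B = rng.integers(-128, 128, size=(8, 5)).astype(np.int32)  # signed weights

shifted = (A + 128) @ B            # what the unsigned * signed hardware path computes
bias = 128 * B.sum(axis=0)         # precomputable correction: 128 * column sums of B
assert np.array_equal(shifted - bias, A @ B)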

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.