
[MXNET-133] Model Quantization with Calibration #9552

Merged
merged 16 commits into apache:master from merge_quantization_to_master on Mar 26, 2018

Conversation

reminisce
Contributor

@reminisce reminisce commented Jan 25, 2018

Description

This PR implements model quantization following TensorFlow's quantization approach, with a calibration stage inspired by Nvidia's TensorRT. The focus of this work is on keeping the inference accuracy loss of quantized models (ConvNets for now) under control when compared to their corresponding FP32 models. It also provides a framework in MXNet for easily plugging in high-performance operators for low-bit operations generated using TVM.

This is a joint work of @ZihengJiang and @reminisce.

  • @ZihengJiang implemented the model quantization flow and quantized operators by calling cuDNN APIs for convolution, pooling, and fully-connected operators.
  • @reminisce implemented the calibration flow, refactored the operator implementation onto nnvm interfaces, wrote unit tests and examples, designed the user-level API, conducted benchmarks, and fixed bugs to make the code mergeable into the MXNet master branch.

Details

Please see the following slides for more details on implementation and benchmark results.
quantization_github.pptx

Code Structure

  • Backend: src/operator/quantization/ contains quantized operators, quantization and calibration flow, and quantization util functions.
  • Frontend: python/mxnet/quantization.py contains the user-level API for generating quantized models from FP32 models.
  • Examples: example/quantization/ contains examples of generating quantized models and using quantized models for inference.
  • Unit tests: tests/python/quantization/ contains unit tests for quantization.

Notes

  • Since the quantized operators are implemented with cuDNN, the quantized models generated in the examples of this PR can only run inference on Nvidia GPUs that support the dp4a instruction. We performed our benchmarks on AWS P3 instances.
  • The inference speed of the quantized models is about 50% slower than that of the FP32 models. This is mainly caused by three transpose operations in the quantized convolution operator, which convert data layouts between NCHW and NHWC in order to call cudnnConvolutionForward (see the layout-conversion sketch after this list). In addition, we have noticed that even without the layout transposes, INT8 convolution in NHWC is slower than FP32 convolution in NCHW for large inputs such as (64, 56, 56). In the future, we hope to leverage the strength of TVM to generate high-performance INT8 operators to replace the current cuDNN-based implementation of quantized convolution.
  • The quantization unit tests are put under tests/python/quantization because they need a P3 instance to run. @marcoabreu is working on setting up the testing environment. Once it's done, the unit tests under that folder will be submitted to a different CI label than the commonly used one.
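For context, the following is a minimal NumPy sketch of the layout conversions described above: presumably the input, the weight, and the output each need a transpose around an NHWC-only int8 convolution. This is an illustration, not the operator code in this PR; the function name and the nhwc_int8_conv callable are hypothetical.

import numpy as np

def nchw_conv_via_nhwc_sketch(data_nchw, weight_nchw, nhwc_int8_conv):
    """Illustrates the three extra transposes around an NHWC-only int8 conv."""
    data_nhwc = np.transpose(data_nchw, (0, 2, 3, 1))      # transpose 1: data NCHW -> NHWC
    weight_nhwc = np.transpose(weight_nchw, (0, 2, 3, 1))  # transpose 2: weight NCHW -> NHWC
    out_nhwc = nhwc_int8_conv(data_nhwc, weight_nhwc)      # the layout the int8 conv expects
    return np.transpose(out_nhwc, (0, 3, 1, 2))            # transpose 3: output NHWC -> NCHW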

We would like to thank all the following people for discussions, suggestions, providing datasets, and guidance on configuring the examples. @mli @piiswrong @zhreshold @astonzhang @szha @eric-haibin-lin @srochel @madjam @bhavinthaker @marcoabreu

We would appreciate everyone's effort in reviewing this PR.

@cjolivier01 @anirudh2290 @rahul003

@reminisce reminisce requested a review from szha as a code owner January 25, 2018 00:31
@wentingj
Contributor

Hi @ZihengJiang, which layers use the quantized version to obtain the accuracy numbers in quantization_github.pptx? And have you analyzed the time spent in the quantize, dequantize, and requantize ops? Thank you.

return hist


def _get_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
Member

@anirudh2290 anirudh2290 Jan 26, 2018

Can you provide pointers that explain the significance of num_bins and num_quantized_bins? How are they used to compute thresholds?

Contributor Author

  • num_quantized_bins represents the number of quantized values in the int8 range (2^8 - 1 = 255). If we wanted to use 4 bits for the quantized values, num_quantized_bins would be 2^4 - 1 = 15.
  • num_bins: I tried different numbers of bins from 500 to 40,000, and it had little effect on the optimal thresholds, so I picked a value in between. Too few bins might not be suitable given how large the tensors are, while too many bins increase the compute time of the KL divergence (see the sketch below). Here is a good article explaining the rules for choosing the number of bins: http://www.statisticshowto.com/choose-bin-sizes-statistics/
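To make the roles of these two parameters concrete, here is a minimal, simplified Python sketch of entropy (KL-divergence) calibration in the spirit of TensorRT's approach. It is not the implementation in this PR; the function name, the coarse search stride, and the use of scipy.stats.entropy are assumptions made for illustration only.

import numpy as np
from scipy import stats  # assumed available; used only for the KL divergence

def get_optimal_threshold_sketch(arr, num_bins=8001, num_quantized_bins=255):
    """Search for a clipping threshold that minimizes KL divergence."""
    arr = np.abs(np.asarray(arr).ravel())
    max_val = arr.max()
    hist, hist_edges = np.histogram(arr, bins=num_bins, range=(0, max_val))

    best_divergence, best_threshold = np.inf, max_val
    # Slide the clipping point from num_quantized_bins up to num_bins
    # (a coarse stride keeps this sketch fast; the real search is denser).
    for i in range(num_quantized_bins, num_bins + 1, 100):
        # Reference distribution P: clip at bin i, fold the tail into the last kept bin.
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()

        # Candidate distribution Q: merge the first i bins into num_quantized_bins
        # bins, then expand back to length i for comparison with P.
        num_merged = i // num_quantized_bins
        q = np.zeros(i, dtype=np.float64)
        for j in range(num_quantized_bins):
            start = j * num_merged
            stop = i if j == num_quantized_bins - 1 else (j + 1) * num_merged
            chunk = p[start:stop]
            nonzero = chunk != 0
            if nonzero.any():
                q[start:stop][nonzero] = chunk.sum() / nonzero.sum()

        # Smooth to avoid zero probabilities, then compare the two distributions.
        divergence = stats.entropy(p + 1e-10, q + 1e-10)
        if divergence < best_divergence:
            best_divergence, best_threshold = divergence, hist_edges[i]
    return best_threshold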

Member

Thanks for the explanation!

@jinhuang415
Contributor

Hi @reminisce, may I ask a few questions:
(1) Do we always need to compute the min/max of the weight parameters at run time (or is there any consideration of pre-calculating the min/max range of the weights to improve performance)? If it is needed, do we have any tests/statistics on how much overhead it adds?
(2) From "quantization_github.pptx", it looks like the model accuracy drops a little once the number of calibration batches grows beyond a certain point. Intuitively, more calibration batches should cover more accurate ranges and therefore give better accuracy. Do we have any insight into why accuracy drops as the number of calibration batches increases?

@reminisce
Contributor Author

@wentingj The quantized ops used in the benchmarks are convolution, fully-connected, avg_pooling, max_pooling, and flatten. The quantize, dequantize, and requantize ops each take up about 5-10% of the runtime per epoch.

@marcoabreu
Contributor

marcoabreu commented Jan 27, 2018 via email

@reminisce
Contributor Author

@jinhuang415

  1. The parameters are quantized offline, which means their min/max values are pre-calculated before inference (see the sketch below).
  2. In theory, if the calibration dataset is representative enough of the real inference images, using more examples for calibration should lead to less accuracy loss. The purpose of using entropy calibration is to keep the accuracy loss stable with respect to the number of examples used for calibration. The naive calibration approach suffers from bigger accuracy loss as more calibration examples are used, as you can see from the trend in the last two tables. My guess is that if the calibration dataset contains examples that are not similar to the real inference images, the quantization thresholds may be biased by those examples and cause a small drop in accuracy.
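As an illustration of what quantizing parameters offline means, here is a minimal sketch of mapping an FP32 weight tensor to int8 with a pre-calculated range. This is not the PR's code; the helper name and the symmetric int8 mapping are assumptions made for illustration.

import numpy as np

def quantize_weight_offline_sketch(weight_fp32):
    """Quantize an FP32 weight tensor to int8 once, before inference."""
    threshold = np.abs(weight_fp32).max()               # range pre-calculated offline
    scale = 127.0 / threshold if threshold > 0 else 1.0
    weight_int8 = np.clip(np.round(weight_fp32 * scale), -127, 127).astype(np.int8)
    # Store (weight_int8, threshold) with the model; at run time no per-batch
    # min/max computation is needed, since w ~ weight_int8 / scale.
    return weight_int8, threshold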

@reminisce
Contributor Author

@marcoabreu The optimal thresholds are determined by the calibration dataset, so they are independent of the platform. As long as a platform supports basic int8 addition and multiplication, it can run quantized models; we would of course need to write dedicated int8 operators for each specific platform. The current implementation only works on Nvidia GPUs with the dp4a instruction.

@marcoabreu
Contributor

Oh sorry, this question was targeted at the num_bins question - the email reply messed up the threading.

@reminisce
Contributor Author

@marcoabreu Oh, I see. Since calibration is conducted offline, it's not constrained by the hardware resources of edge devices. I believe there is an optimal value of num_bins for each layer. It could become a hyperparameter for users to tune.

@marcoabreu
Contributor

Ah, that sounds great! Thanks for the explanation

@@ -0,0 +1,467 @@
# Licensed to the Apache Software Foundation (ASF) under one
Contributor

Put this in contrib

Contributor Author

Good catch. I will do that.

@marcoabreu
Contributor

marcoabreu commented Feb 2, 2018

Hello Jun, I have created a slave at http://jenkins.mxnet-ci.amazon-ml.com/computer/mxnet-linux-p3-gpu10/. The label is 'mxnetlinux-gpu-p3'. You can create a job in the Jenkinsfile and set node('mxnetlinux-gpu-p3') in order to schedule the job on that slave.

Note: This slave is entirely experimental and I had no chance to validate it, but feel free to play around with it.

Reviewers: Please do NOT merge this PR as long as it contains node('mxnetlinux-gpu-p3') in the Jenkinsfile as this slave-type is experimental and officially not supported.

@reminisce
Contributor Author

Thanks @marcoabreu for setting up the testing environment for the PR. I will try to run the tests on it.

assert cond == 0

check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), False)
check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), True)
Contributor

When global_pool is set to 'True', why check stride > 1?

Contributor Author

stride is not used for shape inference in global pooling. It's just a dummy parameter.

[](const NodeAttrs& attrs) {
return std::vector<ResourceRequest>(1, ResourceRequest::kTempSpace);
})
.set_attr<FNeedRequantize>("FNeedRequantize", [](const NodeAttrs& attrs) { return true; })
Contributor

The MKL-DNN int8 convolution API supports s8 and u8 output besides s32; it can shrink the range inside the API, so we may add a switch here after adding CPU support.

Contributor Author

@reminisce reminisce Feb 2, 2018

That might be difficult to do, since when quantizing the FP32 version of an op we don't know whether the quantized op's output type will be int8 or int32, nor whether the op is going to run on a GPU or a CPU. We currently assume the output is int32 and perform the requantization later. We should think about how to distinguish quantized ops for CPU and GPU.

I have two questions regarding the shrinking MKL does internally.

  1. How does it choose the thresholds for shrinking? We find the thresholds are essential for the final inference accuracy. That's why we introduced the calibration stage.
  2. What's the time difference between a quantized conv with int8 output and a quantized conv with int32 output followed by a requantize to int8? (A sketch of the requantize step appears after this comment.)

We can first focus on keeping the inference accuracy under control for CPU and then think about optimizing the flow.
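For reference, the requantize step discussed above maps an int32 result (e.g. a quantized convolution output) back into int8 using calibrated thresholds. The following is a minimal sketch under the assumption of a symmetric mapping and full use of the int32 range; it illustrates the idea only and is not the operator implementation in this PR.

import numpy as np

def requantize_sketch(data_int32, min_calib, max_calib):
    """Requantize an int32 tensor to int8 given calibrated real-valued thresholds."""
    real_range = max(abs(min_calib), abs(max_calib))
    int32_scale = real_range / float(2 ** 31 - 1)   # int32 representation -> real values
    int8_scale = 127.0 / real_range                 # real values -> int8 representation
    data_int8 = np.clip(np.round(data_int32 * (int32_scale * int8_scale)),
                        -127, 127).astype(np.int8)
    return data_int8, -real_range, real_range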

Contributor

I agree we can focus on the accuracy first and then optimize the flow :)
@wentingj will answer your other questions later.

Contributor

@wentingj ping

@KellenSunderland
Contributor

Hey @reminisce, looking forward to this one on the edge team (if you can't tell). If you're going to test on the p3 instance I recommend cherry-picking this commit: #9684.

@pengzhao-intel
Contributor

@reminisce and all, This is an awesome PR 👍

Our team (@wentingj @jinhuang415) is also working on an INT8 solution based on the MKL-DNN library,
and we plan to contribute our code along with this PR.

I have uploaded a slide deck that introduces the overview of our solution and its current status.
I think we can align our solutions at a high level first and then go into the technical details :)

Intel INT8 Solution for MXNet.pptx

Feel free to let us know your questions, comments and suggestions.

@reminisce
Contributor Author

reminisce commented Feb 2, 2018

@KellenSunderland Thank you for the note. I will either cherry pick the PR or rebase with the master once your PR is merged.

@reminisce
Contributor Author

@pengzhao-intel Thank you guys for implementing quantized ops for CPU computing. We look forward to seeing and benchmarking the implementation.

I propose that your team work on top of this PR and submit your work as a separate PR after this one is merged. This is already a big PR (>3,000 lines of code), and adding more code would make the review process overwhelming. Please also note that we still need to wait for the P3 instances in the CI to be officially ready before this PR can be fully tested.

@reminisce
Contributor Author

@marcoabreu It looks like the cuDNN version (5.0) is too low for building the quantization implementation. Do we have a plan to upgrade the library?

@marcoabreu
Contributor

marcoabreu commented Feb 3, 2018 via email

@pengzhao-intel
Contributor

@reminisce It makes sense. We will submit a new PR for CPU implementation.

If there are big design or code changes before this PR is merged, please kindly let us know (maybe with a short summary) so we can adjust our local code.

We will share more info on CPU accuracy and performance later.

@reminisce
Contributor Author

@pengzhao-intel I will definitely let you know if there are breaking changes.

For testing inference, you can use the script example/quantization/imagenet_gen_qsym.py to generate quantized models (resnet-152 and inception w/ bn) and run inference using example/quantization/imagenet_inference.py. Remember to change the ctx to mx.cpu since it currently defaults to mx.gpu(0).
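A minimal illustration of the context change mentioned above (assuming the example script binds its device to a variable named ctx):

import mxnet as mx

# In example/quantization/imagenet_inference.py, switch the context from GPU to CPU:
ctx = mx.cpu()   # the example currently defaults to mx.gpu(0)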

@pengzhao-intel
Contributor

@reminisce btw, is there a schedule for merging this PR?

@marcoabreu
Contributor

marcoabreu commented Feb 4, 2018 via email

@@ -0,0 +1 @@
../image-classification/common
Member

Is this a symlink? Does it work on Windows?

Contributor

Has this been addressed?

if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Generate a calibrated quantized model from a FP32 model')
parser.add_argument('--model', type=str, required=True,
help='currently only supports imagenet1k-resnet-152 or imagenet1k-inception-bn')
Member

Consider using the choices option for argparse: https://docs.python.org/2/library/argparse.html#choices
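A minimal sketch of how the --model argument above could use choices (illustrative only; the listed model names come from the help string in the diff):

import argparse

parser = argparse.ArgumentParser(
    description='Generate a calibrated quantized model from a FP32 model')
# choices makes argparse reject unsupported model names automatically
parser.add_argument('--model', type=str, required=True,
                    choices=['imagenet1k-resnet-152', 'imagenet1k-inception-bn'],
                    help='model to quantize')
args = parser.parse_args()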

@@ -1237,8 +1237,28 @@ MXNET_DLL int MXSymbolInferType(SymbolHandle sym,
const int **aux_type_data,
int *complete);



MXNET_DLL int MXQuantizeSymbol(SymbolHandle sym_handle,
Member

Missing documentation for this function?

@@ -261,6 +260,10 @@ using FInferStorageType = std::function<bool (const NodeAttrs& attrs,
std::vector<int>* in_attrs,
std::vector<int>* out_attrs)>;

using FQuantizedOp = std::function<nnvm::NodePtr (const NodeAttrs& attrs)>;
Member

Missing Doc?


def _get_optimal_thresholds(nd_dict, num_bins=8001, num_quantized_bins=255, logger=None):
"""Given a ndarray dict, find the optimal threshold for quantizing each value of the key."""
if stats is None:
Member

Is it better to put this check inside _get_optimal_threshold?

label_name : str
Label name required for creating a Module object to run forward propagation on the
calibration dataset.
logger : Object
Member

Add doc for the output?

    Returns
    -------
    xxx
        xxx

return Min(Abs(static_cast<float>(a)), Abs(static_cast<float>(b)));
}

#if 0
Member

Is this not used?

@reminisce reminisce force-pushed the merge_quantization_to_master branch from ecc8466 to 3087be6 Compare March 23, 2018 20:28
ZihengJiang and others added 16 commits March 24, 2018 19:29
[Quantization] CuDNN 8bit quantized relu v0.1

[Quantization] CuDNN 8bit quantized max_pool v0.1

[Quantization] CuDNN 8bit quantized lrn v0.1

[Quantization] CuDNN 8bit quantized convolution v0.1

[Quantization] CuDNN 8bit quantized fully connected v0.1

[Quantization] Small fix

[Quantization] Implement backward method

[Quantization] Convolution backward method

[Quantization] Add range for matmul and conv

[Quantization] New types in ndarray.py

[Quantization] 8bit conv works

[Quantization] conv support multiple type

[Quantization] matmul works now

[Quantization] matmul works well

[Quantization] efactor quantization operators

[Quantization] Op: quantize_down_and_shrink_range

[Quantization] Complete quantize_graph_pass

[Quantization] Add example

[Quantization] Take zero-center quantize, accuracy fixed

[Quantization] Multiple layers MLP pass

[Quantization] Make quantized_conv same as Convolution

[Quantization] quantized_conv works

[Quantization] Fix bug

[Quantization] lenet works now

[Quantization] Add quantized_flatten

[Quantization] Quantized max pool works well

[Quantization] Make quantized_conv support NHWC

[Quantization] add max_pool

[Quantization] add ignore_symbols

[Quantization] Save change

[Quantization] Reorganize tests, 8 layers resnet works on cifar

[Quantization] Support for 'NHWC' max pool

[Quantization] Support for 'NHWC' quantized max pool

[Quantization] Fix speed of quantize_down_and_shrink_range

[Quantization] script for resnet on imagenet

[Quantization] refactor for quantize offline

[Quantization] Fix infershape

[Quantization] Update test

[Quantization] Update example

[Quantization] Fix build error
Rebase with dmlc/master

Add quantize_down_and_shrink by threshold

Don't assign resource when threshold is available for quantize_down_and_shrink

Fix quantize_down_and_shrink saturation

Implement pass for setting calib table to node attrs

Rebase with upstream master

Change threshold to min/max quantized params

Add c-api for setting calib table to graph

Add calibration front end function

Bug fixes and add unit test

Add data iter type to calibration

Fix bug in calibrate_quantized_model

Bug fix and add example

Add the second calibration approach and benchmark

Fix

Fix infer error and add benchmark for conv

Add benchmark script

Change output names and argument names

Remove commented out code

Change name

Add layout to benchmark_convolution

Remove redundant comment

Remove common and add soft link

More fix and benchmark

Add scripts to plot images

Minor fix

More fix

More fix and util tools

Tools and support bias in quantized_conv2d

Add script for getting the optimal thresholds using kl divergence

Add kl divergence for optimizing thresholds

Add benchmark scripts

Fix compile after rebasing on master

Allocate temp space only once for quantized_conv2d

Change quantize_down_and_shrink_range to allocate temp space once

No temp space for calib model

Refactor quantize_down_and_shrink_range into requantize

Refactor quantized convolution using nnvm interfaces

Fix quantized_conv bug

Use ConvolutionParam for QuantizedCuDNNConvOp

Refactor quantized fc using nnvm interfaces

Change TQuantizationNeedShrink to FNeedRequantize

Refactor quantized_pooling

Simplify FQuantizedOp interface

Better naming

Fix shape and type inference for quantized_flatten

Clean up quantization frontend APIs and examples

Delete quantized lrn and relu

Add python script for generating quantized models

Add script for running inference

Add inference example

Remove redundant files from example/quantization

Simplify user-level python APIs

Add logger

Improve user-level python api

Fix coding style

Add unit test for quantized_conv

Fix bugs in quantized_fully_connected and add unit test

Add unit test for requantize

Fix a bug and add python api unit tests

Import test_quantization in test_operator_gpu.py

Rebase with master

Remove redundant files

Fix test case for python3 and fix doc

Fix unit tests

Fix unit tests for python3

Release used ndarrays in calibration for saving memory usage

Simplify releasing memory of used ndarrays for calibration

Fix a bug

Revert "Fix a bug"

This reverts commit f7853f2.

Revert "Simplify releasing memory of used ndarrays for calibration"

This reverts commit 70b9e38.

Clean up benchmark script and improve example

Add API and example documentation and fix bugs

Remove redundant test file and improve error message

Merge quantize and dequantize with master impl

Remove commented code

Hide monitor interface from users

Remove interface from Module

Add license header

Move quantization unittests to a separate folder so that it can be only run on P3 instances

Remove quantization unittests from test_operator_gpu.py

Move quantization to contrib

Fix lint

Add mxnetlinux-gpu-p3 to jenkins

Fix jenkins

Fix CI build

Fix CI

Update jenkins file

Use cudnn7 for ci

Add docker file for quantization unit test only

Correctly skip build with cudnn < 6

Add doc for quantize symbol api

Fix lint

Fix python3 and add doc

Try to fix cudnn build problem
@reminisce reminisce force-pushed the merge_quantization_to_master branch from 83a7041 to 7be4936 Compare March 25, 2018 02:30
Contributor

@marcoabreu marcoabreu left a comment

LGTM, awesome work, Jun! Thanks a lot for the great collaboration

@marcoabreu marcoabreu merged commit 66c6dda into apache:master Mar 26, 2018
ashokei pushed a commit to ashokei/incubator-mxnet that referenced this pull request Mar 27, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
@BUG1989

BUG1989 commented Apr 24, 2018

Thank you very much for sharing the int8 quantization implementation. Choosing the right calibration dataset is indeed very important. In our project we use entropy calibration to find the thresholds. We have two detection models: one uses a suitable calibration dataset and the other an unsuitable ("ugly") one, and the accuracy loss is quite different.
Suitable dataset:
Float32:
0.987543[1364,291,1408,335]
0.863533[1610,46,1650,86]
0.704229[869,142,923,196]
0.703651[765,108,808,151]
Int8:
0.985061[1365,292,1410,336]
0.834675[1611,46,1651,87]
0.698077[868,142,923,197]
0.687184[765,109,808,152]

Ugly dataset:
Float32:
0.997095 [609,225,834,406]
0.970455 [95,760,278,1079]
0.899680 [594,397,702,697]
0.833142 [1043,176,1244,299]
0.809374 [254,363,342,620]
Int8:
0.992615 [610,226,837,407]
0.886571 [578,394,705,701]
0.813775 [1041,176,1244,298]
0.728242 [106,720,279,1061]
0.705122 [257,364,344,623]

@JingrenChen

Could you please add a quantized version of depthwise convolution so that MobileNet V1 and V2 can be quantized? Maybe using cuDNN's grouped convolution? Thank you.

@reminisce
Contributor Author

@JingrenChen Thanks for the proposal. Currently, we don't have plans to add more ops using cuDNN because it was found to lack a performance advantage over the FP32 version. In the long term, we may consider adding optimized op kernels generated by TVM. Nevertheless, we still welcome community contributions adding more quantized ops using cuDNN and MKL-DNN. Please feel free to submit a PR if you would like to.

@reminisce
Contributor Author

@BUG1989 Sorry I didn't notice your message earlier. Thanks for sharing the results. Could you clarify the meaning of the numbers in each row and what the suitable/ugly datasets are? If you are interested in further discussion, shall we create an issue and continue there to avoid spamming other subscribers?

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018