Add quantization API doc and oneDNN to migration guide #20813

Merged · 8 commits · Feb 5, 2022
5 changes: 5 additions & 0 deletions docs/python_docs/python/api/contrib/index.rst
@@ -67,6 +67,11 @@ Contributed modules

Functions for manipulating text data.

.. card::
:title: contrib.quantization
:link: quantization/index.html

Functions for precision reduction.

.. toctree::
:hidden:
23 changes: 23 additions & 0 deletions docs/python_docs/python/api/contrib/quantization/index.rst
@@ -0,0 +1,23 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

contrib.quantization
====================

.. automodule:: mxnet.contrib.quantization
:members:
:autosummary:
@@ -432,6 +432,67 @@ A new module called `mxnet.gluon.probability` has been introduced in Gluon 2.0.

3. [Transformation](https://github.com/apache/incubator-mxnet/tree/master/python/mxnet/gluon/probability/transformation): implement invertible transformation with computable log det jacobians.

## oneDNN Integration
### Operator Fusion
In MXNet 1.x, pattern fusion in the execution graph was enabled by default when MXNet was built with oneDNN library support, and it could be disabled by setting the `MXNET_SUBGRAPH_BACKEND` environment variable to `None`. MXNet 2.0 introduced changes in the forward inference flow which led to a refactor of the fusion mechanism. To fuse a model in MXNet 2.0, two requirements must be met:

- the model must be defined as a subclass of HybridBlock or Symbol,

- the model must have specific operator patterns which can be fused.

Both the HybridBlock and Symbol classes provide an API to easily run operator fusion. Only one line of code is needed to run the fusion passes on a model:
```{.python}
# on HybridBlock
net.optimize_for(data, backend='ONEDNN')
# on Symbol
optimized_symbol = sym.optimize_for(backend='ONEDNN')
```
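For a HybridBlock the fusion pass is run against a sample input. Below is a minimal end-to-end sketch: the tiny Conv → BatchNorm → ReLU network, its input shape, and the use of `mx.nd` arrays are illustrative assumptions rather than part of the original guide, and the `ONEDNN` backend is only available when MXNet was built with oneDNN support.
```{.python}
import mxnet as mx
from mxnet.gluon import nn

# toy network containing a Conv -> BatchNorm -> ReLU pattern,
# one of the patterns the ONEDNN backend can fuse
net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=3, padding=1),
        nn.BatchNorm(),
        nn.Activation('relu'))
net.initialize()

data = mx.nd.random.uniform(-1.0, 1.0, (1, 3, 224, 224))

# run the oneDNN fusion passes; subsequent forward calls use the fused graph
net.optimize_for(data, backend='ONEDNN')
out = net(data)
```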

Controlling which patterns should be fused can still be done by setting the proper environment variables. See [**oneDNN Environment Variables**](#oneDNN-Environment-Variables).

### INT8 Quantization / Precision reduction
The quantization API was also refactored to be consistent with other new features and mechanisms. Compared to the MXNet 1.x releases, the `quantize_net_v2` function has been removed in MXNet 2.0 and development has focused mainly on the `quantize_net` function, to make it easier to use and ultimately give the end user more flexibility.
Quantization can be performed either on a subclass of HybridBlock with `quantize_net`, or on a Symbol with the deprecated `quantize_model` (`quantize_model` is kept only for backward compatibility and its usage is strongly discouraged).

```{.python}
import mxnet as mx
from mxnet.contrib.quantization import quantize_net
from mxnet.gluon.model_zoo.vision import resnet50_v1

# load model
net = resnet50_v1(pretrained=True)

# prepare calibration data
batch_size = 16  # example calibration batch size; any reasonable value works
dummy_data = mx.nd.random.uniform(-1.0, 1.0, (batch_size, 3, 224, 224))
calib_data_loader = mx.gluon.data.DataLoader(dummy_data, batch_size=batch_size)

# quantization
qnet = quantize_net(net, calib_mode='naive', calib_data=calib_data_loader)
```
`quantize_net` offers many more options - all of its parameters are described in the [API documentation](../../api/contrib/quantization/index.rst).
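The network returned by `quantize_net` is a regular Gluon block, so it can be called directly for inference. A minimal continuation of the snippet above, reusing the dummy calibration data; the output shape comment assumes the ResNet-50 classifier from the example:
```{.python}
# run inference with the quantized network on the same dummy input
out = qnet(dummy_data)
print(out.shape)  # (batch_size, 1000) classification scores for ResNet-50
```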

### oneDNN Environment Variables
In MXNet 2.0, all references to MKLDNN (the former name of oneDNN) were replaced with ONEDNN. The table below lists all of the affected environment variables:

| MXNet 1.x | MXNet 2.0 |
| ------------------------------------ | ---------------------------------------|
| MXNET_MKLDNN_ENABLED | MXNET_ONEDNN_ENABLED |
| MXNET_MKLDNN_CACHE_NUM | MXNET_ONEDNN_CACHE_NUM |
| MXNET_MKLDNN_FORCE_FC_AB_FORMAT | MXNET_ONEDNN_FORCE_FC_AB_FORMAT |
| MXNET_MKLDNN_DEBUG | MXNET_ONEDNN_DEBUG |
| MXNET_USE_MKLDNN_RNN | MXNET_USE_ONEDNN_RNN |
| MXNET_DISABLE_MKLDNN_CONV_OPT | MXNET_DISABLE_ONEDNN_CONV_OPT |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_BN | MXNET_DISABLE_ONEDNN_FUSE_CONV_BN |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_RELU | MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU |
| MXNET_DISABLE_MKLDNN_FUSE_CONV_SUM | MXNET_DISABLE_ONEDNN_FUSE_CONV_SUM |
| MXNET_DISABLE_MKLDNN_FC_OPT | MXNET_DISABLE_ONEDNN_FC_OPT |
| MXNET_DISABLE_MKLDNN_FUSE_FC_ELTWISE | MXNET_DISABLE_ONEDNN_FUSE_FC_ELTWISE |
| MXNET_DISABLE_MKLDNN_TRANSFORMER_OPT | MXNET_DISABLE_ONEDNN_TRANSFORMER_OPT |
| n/a | MXNET_DISABLE_ONEDNN_BATCH_DOT_FUSE |
| n/a | MXNET_ONEDNN_FUSE_REQUANTIZE |
| n/a | MXNET_ONEDNN_FUSE_DEQUANTIZE |
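These variables are read from the process environment, so they can be exported in the shell before launching a script or, as a sketch, set from Python. The flag name comes from the table above; setting it as early as possible, before importing mxnet, is the safe choice:
```{.python}
import os

# disable fusing Convolution + ReLU patterns (the 2.0 name of the former
# MXNET_DISABLE_MKLDNN_FUSE_CONV_RELU flag from the table above)
os.environ['MXNET_DISABLE_ONEDNN_FUSE_CONV_RELU'] = '1'

import mxnet as mx  # import MXNet only after the environment is configured
```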

## Appendix
### NumPy Array Deprecated Attributes
| Deprecated Attributes | NumPy ndarray Equivalent |
51 changes: 27 additions & 24 deletions python/mxnet/contrib/quantization.py
@@ -46,7 +46,7 @@ def _quantize_params(qsym, params, min_max_dict):
qsym : Symbol
Quantized symbol from FP32 symbol.
params : dict of str->NDArray
min_max_dict: dict of min/max pairs of layers' output
min_max_dict : dict of min/max pairs of layers' output
"""
inputs_name = qsym.list_arguments()
quantized_params = {}
@@ -110,11 +110,11 @@ def _quantize_symbol(sym, device, excluded_symbols=None, excluded_operators=None
Names of the parameters that users want to quantize offline. It's always recommended to
quantize parameters offline so that quantizing parameters during the inference can be
avoided.
quantized_dtype: str
quantized_dtype : str
The quantized destination type for input data.
quantize_mode: str
quantize_mode : str
The mode that quantization pass to apply.
quantize_granularity: str
quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
"""
@@ -174,15 +174,16 @@ def __init__(self):
def collect(self, name, op_name, arr):
"""Function which is registered to Block as monitor callback. Names of layers
requiring calibration are stored in `self.include_layers` variable.
Parameters
----------
name : str
Node name from which collected data comes from
op_name : str
Operator name from which collected data comes from. Single operator
can have multiple inputs/ouputs nodes - each should have different name
arr : NDArray
NDArray containing data of monitored node

Parameters
----------
name : str
Node name from which collected data comes from.
op_name : str
Operator name from which collected data comes from. Single operator
can have multiple input/output nodes - each should have a different name.
arr : NDArray
NDArray containing data of monitored node.
"""

def post_collect(self):
@@ -227,8 +228,7 @@ def post_collect(self):

@staticmethod
def combine_histogram(old_hist, arr, new_min, new_max, new_th):
""" Collect layer histogram for arr and combine it with old histogram.
"""
"""Collect layer histogram for arr and combine it with old histogram."""
(old_hist, old_hist_edges, old_min, old_max, old_th) = old_hist
if new_th <= old_th:
hist, _ = np.histogram(arr, bins=len(old_hist), range=(-old_th, old_th))
@@ -392,21 +392,22 @@ def quantize_model(sym, arg_params, aux_params, data_names=('data',),
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.
The quantization implementation adopts the TensorFlow's approach:
https://www.tensorflow.org/performance/quantization.
https://www.tensorflow.org/lite/performance/post_training_quantization.
The calibration implementation borrows the idea of Nvidia's 8-bit Inference with TensorRT:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
and adapts the method to MXNet.

.. _`quantize_model_params`:

Parameters
----------
sym : str or Symbol
sym : Symbol
Defines the structure of a neural network for FP32 data types.
arg_params : dict
Dictionary of name to `NDArray`.
aux_params : dict
Dictionary of name to `NDArray`.
data_names : a list of strs
data_names : list of strings
Data names required for creating a Module object to run forward propagation on the
calibration dataset.
device : Device
@@ -441,15 +442,15 @@
The mode that quantization pass to apply. Support 'full' and 'smart'.
'full' means quantize all operator if possible.
'smart' means quantization pass will smartly choice which operator should be quantized.
quantize_granularity: str
quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
logger : Object
A logging object for printing information during the process of quantization.

Returns
-------
quantized_model: tuple
quantized_model : tuple
A tuple of quantized symbol, quantized arg_params, and aux_params.
"""
warnings.warn('WARNING: This will be deprecated please use quantize_net with Gluon models')
@@ -582,9 +583,10 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
and a collector for naive or entropy calibration.
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.

Parameters
----------
sym : str or Symbol
sym : Symbol
Defines the structure of a neural network for FP32 data types.
device : Device
Defines the device that users want to run forward propagation on the calibration
@@ -616,7 +618,7 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
The mode that quantization pass to apply. Support 'full' and 'smart'.
'full' means quantize all operator if possible.
'smart' means quantization pass will smartly choice which operator should be quantized.
quantize_granularity: str
quantize_granularity : str
The granularity of quantization, currently supports 'tensor-wise' and 'channel-wise'
quantization. The default value is 'tensor-wise'.
LayerOutputCollector : subclass of CalibrationCollector
@@ -700,13 +702,14 @@ def quantize_graph(sym, arg_params, aux_params, device=cpu(),
return qsym, qarg_params, aux_params, collector, calib_layers

def calib_graph(qsym, arg_params, aux_params, collector,
calib_mode='entropy', logger=logging):
calib_mode='entropy', logger=None):
"""User-level API for calibrating a quantized model using a filled collector.
The backend quantized operators are only enabled for Linux systems. Please do not run
inference using the quantized models on Windows for now.

Parameters
----------
qsym : str or Symbol
qsym : Symbol
Defines the structure of a neural network for INT8 data types.
arg_params : dict
Dictionary of name to `NDArray`.