
Commit 0c5b5b1

[v1.9.x] Port #20889 from v1.x (#20923)
* Add inc to quantization documentation
* Minor fixes
* review fixes
* fix
* fix2
* Add BERT example with results, review fixes
* Add results from aws machine (with VNNI instructions)
* Small fix
* Fix mxnet installation instruction
* Review fixes
* Review fixes
1 parent ae7a104 commit 0c5b5b1

4 files changed: +718 −4 lines

docs/python_docs/python/tutorials/performance/backend/mkldnn/mkldnn_quantization.md (+298 −4)
@@ -27,9 +27,10 @@ Installing MXNet with MKLDNN backend is an easy and essential process. You can f

```
# release version
-pip install mxnet-mkl
-# nightly version
-pip install mxnet-mkl --pre
+pip install mxnet
+
+# latest nightly development version
+pip install --pre "mxnet<2" -f https://dist.mxnet.io/python
```

## Image Classification Demo
@@ -155,7 +156,7 @@ cqsym, cqarg_params, aux_params, collector = quantize_graph(sym=sym, arg_params=
quantized_dtype=quantized_dtype, logger=logger)

# download imagenet validation dataset
-mx.test_utils.download('https://data.mxnet.io/data/val_256_q90.rec', 'dataset.rec')
+mx.test_utils.download('http://data.mxnet.io/data/val_256_q90.rec', 'dataset.rec')
# set rgb info for data
mean_std = {'mean_r': 123.68, 'mean_g': 116.779, 'mean_b': 103.939, 'std_r': 58.393, 'std_g': 57.12, 'std_b': 57.375}
# set batch size
@@ -243,6 +244,8 @@ BTW, You can also modify the `min_calib_range` and `max_calib_range` in the JSON

- Change calibration dataset by setting different `num_calib_batches` or shuffle your validation dataset;

+- Use Intel® Neural Compressor ([see below](#Improving-accuracy-with-Intel-Neural-Compressor))
+
#### Performance Tuning

- Make sure to perform graph fusion before quantization;
@@ -255,4 +258,295 @@ BTW, You can also modify the `min_calib_range` and `max_calib_range` in the JSON

MXNet also supports deploying quantized models with C++. Refer to [MXNet C++ Package](https://github.com/apache/incubator-mxnet/blob/master/cpp-package/README.md) for more details.

# Improving accuracy with Intel® Neural Compressor

The accuracy of a model can decrease as a result of quantization. When the accuracy drop is significant, we can try to manually find a better quantization configuration (exclude some layers, try different calibration methods, etc.), but for bigger models this might prove to be a difficult and time-consuming task. [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) tries to automate this process using several tuning heuristics, which aim to find the quantization configuration that satisfies the specified accuracy requirement.

**NOTE:**

Most tuning strategies will try different configurations on an evaluation dataset in order to find out how each layer affects the accuracy of the model. This means that for larger models, it may take a long time to find a solution (as the tuning space is usually larger and the evaluation itself takes longer).

## Installation and Prerequisites

- Install MXNet with MKLDNN enabled as described in the [previous section](#Installation-and-Prerequisites).

- Install Intel® Neural Compressor:

Use one of the commands below to install INC (supported Python versions are: 3.6, 3.7, 3.8, 3.9):

```bash
# install stable version from pip
pip install neural-compressor

# install nightly version from pip
pip install -i https://test.pypi.org/simple/ neural-compressor

# install stable version from conda
conda install neural-compressor -c conda-forge -c intel
```

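To double-check that both packages are available before moving on, you can run a quick sanity check like the one below (a minimal sketch; the exact versions printed depend on your environment):

```python
# sanity check: verify that the MKLDNN-enabled MXNet build and INC can be imported
import mxnet as mx
import neural_compressor

print(mx.__version__)                              # MXNet version, e.g. 1.9.x
print(mx.runtime.Features().is_enabled('MKLDNN'))  # should print True for MKLDNN builds
print(neural_compressor.__version__)               # INC version
```
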
## Configuration file

The quantization tuning process can be customized in a YAML configuration file. Below is a simple example:

```yaml
# cnn.yaml

version: 1.0

model:
  name: cnn
  framework: mxnet

quantization:
  calibration:
    sampling_size: 160 # number of samples for calibration

tuning:
  strategy:
    name: basic
  accuracy_criterion:
    relative: 0.01
  exit_policy:
    timeout: 0
  random_seed: 9527
```

We are using the `basic` strategy, but you could also try out different ones. [Here](https://github.com/intel/neural-compressor/blob/master/docs/tuning_strategies.md) you can find a list of strategies available in INC and details of how they work. You can also add your own strategy if the existing ones do not suit your needs.

Since the value of `timeout` is 0, INC will run until it finds a configuration that satisfies the accuracy criterion and then exit. Depending on the strategy, this may not be ideal, as sometimes it would be better to further explore the tuning space to find a superior configuration both in terms of accuracy and speed. To achieve this, we can set a specific `timeout` value, which will tell INC how long (in seconds) it should run.

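For example (the values below are illustrative, not a recommendation), you could switch to another built-in strategy and give INC up to an hour to explore the tuning space:

```yaml
tuning:
  strategy:
    name: bayesian   # any strategy listed in the INC tuning documentation
  exit_policy:
    timeout: 3600    # keep tuning for up to one hour instead of stopping at the first solution
```
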
For more information about the configuration file, see the [template](https://github.com/intel/neural-compressor/blob/master/neural_compressor/template/ptq.yaml) from the official INC repo. Keep in mind that only `post training quantization` is currently supported for MXNet.

## Model quantization and tuning

In general, Intel® Neural Compressor requires 4 elements in order to run (a minimal skeleton follows this list):
1. Config file - like the example above
2. Model to be quantized
3. Calibration dataloader
4. Evaluation function - a function that takes a model as an argument and returns the accuracy it achieves on a certain evaluation dataset.

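In skeleton form (the names here are placeholders; the ResNet and BERT examples below fill them in), these four elements are wired together like this:

```python
# minimal INC flow (sketch): the four elements map onto four attributes of the quantizer
from neural_compressor.experimental import Quantization

quantizer = Quantization("config.yaml")   # 1. config file
quantizer.model = model_to_quantize       # 2. model to be quantized
quantizer.calib_dataloader = calib_data   # 3. calibration dataloader
quantizer.eval_func = eval_func           # 4. evaluation function
quantized_model = quantizer.fit().model   # run the tuning loop and get the best model found
```
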
### Quantizing ResNet

The previous sections described how to quantize ResNet using the native MXNet quantization. This example shows how we can achieve the same (with the auto-tuning) using INC.

1. Get the model:

```python
import logging
import mxnet as mx
from mxnet.gluon.model_zoo import vision

logging.basicConfig()
logger = logging.getLogger('logger')
logger.setLevel(logging.INFO)

batch_shape = (1, 3, 224, 224)
resnet18 = vision.resnet18_v1(pretrained=True)
```

2. Prepare the dataset:

```python
mx.test_utils.download('http://data.mxnet.io/data/val_256_q90.rec', 'data/val_256_q90.rec')

batch_size = 16
mean_std = {'mean_r': 123.68, 'mean_g': 116.779, 'mean_b': 103.939,
            'std_r': 58.393, 'std_g': 57.12, 'std_b': 57.375}

data = mx.io.ImageRecordIter(path_imgrec='data/val_256_q90.rec',
                             batch_size=batch_size,
                             data_shape=batch_shape[1:],
                             rand_crop=False,
                             rand_mirror=False,
                             shuffle=False,
                             **mean_std)
data.batch_size = batch_size
```

3. Prepare the evaluation function:

```python
eval_samples = batch_size*10

def eval_func(model):
    data.reset()
    metric = mx.metric.Accuracy()
    for i, batch in enumerate(data):
        if i * batch_size >= eval_samples:
            break
        x = batch.data[0].as_in_context(mx.cpu())
        label = batch.label[0].as_in_context(mx.cpu())
        outputs = model.forward(x)
        metric.update(label, outputs)
    return metric.get()[1]
```

4. Run Intel® Neural Compressor:

```python
from neural_compressor.experimental import Quantization
quantizer = Quantization("./cnn.yaml")
quantizer.model = resnet18
quantizer.calib_dataloader = data
quantizer.eval_func = eval_func
qnet = quantizer.fit().model
```
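
The `qnet` object returned by `quantizer.fit().model` is the quantized network. As a quick sanity check (a sketch; the export path below is illustrative), you can reuse `eval_func` from step 3 to measure its accuracy and, if you want to keep it, export it the same way the BERT example below does:

```python
# evaluate the tuned int8 model with the same evaluation function used during tuning
print('int8 accuracy:', eval_func(qnet))

# optionally serialize the quantized symbol and parameters (illustrative path)
qnet.export('./quantized_resnet18_v1')
```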

Since this model already achieves good accuracy using native quantization (less than 1% accuracy drop), for the given configuration file, INC will end on the first configuration, quantizing all layers using the `naive` calibration mode for each. To see the true potential of INC, we need a model which suffers from a larger accuracy drop after quantization.

### Quantizing BERT

This example shows how to use INC to quantize BERT-base for MRPC. In this case, the native MXNet quantization usually introduces a significant accuracy drop (2% - 5% using the `naive` calibration mode). To simplify the code, model- and task-specific boilerplate has been moved to the `details.py` file.

This is the configuration file for this example:
```yaml
version: 1.0

model:
  name: bert
  framework: mxnet

quantization:
  calibration:
    sampling_size: 320 # number of samples for calibration

tuning:
  strategy:
    name: basic
  accuracy_criterion:
    relative: 0.01
  exit_policy:
    timeout: 0
    max_trials: 9999 # default is 100
  random_seed: 9527
```

And here is the script:

```python
from pathlib import Path
from functools import partial

import details
from neural_compressor.experimental import Quantization, common

# constants
INC_CONFIG_PATH = Path('./bert.yaml').resolve()
PARAMS_PATH = Path('./bert_mrpc.params').resolve()
OUTPUT_DIR_PATH = Path('./output/').resolve()
OUTPUT_MODEL_PATH = OUTPUT_DIR_PATH/'quantized_model'
OUTPUT_DIR_PATH.mkdir(parents=True, exist_ok=True)

# Prepare the dataloaders (calib_dataloader is the same as train_dataloader but without shuffling)
train_dataloader, dev_dataloader, calib_dataloader = details.preprocess_data()

# Get the model
model = details.BERTModel(details.BACKBONE, dropout=0.1, num_classes=details.NUM_CLASSES)
model.hybridize(static_alloc=True)

# finetune or load the parameters of an already finetuned model
if not PARAMS_PATH.exists():
    model = details.finetune(model, train_dataloader, dev_dataloader, OUTPUT_DIR_PATH)
    model.save_parameters(str(PARAMS_PATH))
else:
    model.load_parameters(str(PARAMS_PATH), ctx=details.CTX, cast_dtype=True)

# run INC
calib_dataloader.batch_size = details.BATCH_SIZE
eval_func = partial(details.evaluate, dataloader=dev_dataloader)

quantizer = Quantization(str(INC_CONFIG_PATH))  # 1. Config file
quantizer.model = common.Model(model)           # 2. Model to be quantized
quantizer.calib_dataloader = calib_dataloader   # 3. Calibration dataloader
quantizer.eval_func = eval_func                 # 4. Evaluation function
quantized_model = quantizer.fit().model

# save the quantized model
quantized_model.export(str(OUTPUT_MODEL_PATH))
```

With the evaluation function hidden in the `details.py` file:

```python
def evaluate(model, dataloader):
    metric = METRIC()
    for batch in dataloader:
        input_ids, segment_ids, valid_length, label = batch
        input_ids = input_ids.as_in_context(CTX)
        segment_ids = segment_ids.as_in_context(CTX)
        valid_length = valid_length.as_in_context(CTX)
        label = label.as_in_context(CTX).reshape((-1))

        out = model(input_ids, segment_ids, valid_length)
        metric.update([label], [out])

    metric_name, metric_val = metric.get()
    return metric_val
```

For comparison, this is how one could quantize this model using MXNet native quantization (this function is also located in the `details.py` file):

```python
def native_quantization(model, calib_dataloader, dev_dataloader):
    quantized_model = quantize_net_v2(model,
                                      quantize_mode='smart',
                                      calib_data=calib_dataloader,
                                      calib_mode='naive',
                                      num_calib_examples=BATCH_SIZE*10)
    print('Native quantization results: {}'.format(evaluate(quantized_model, dev_dataloader)))
    return quantized_model
```

For complete code, see this example on the [official GitHub repository](https://github.com/apache/incubator-mxnet/tree/v1.x/example/quantization_inc/BERT_MRPC).

#### Results

Environment:
- c6i.16xlarge Amazon EC2 instance (Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz)
- Ubuntu 20.04 LTS
- MXNet 1.9
- INC 1.9.1

Results on the validation dataset:

| Quantization method          | Accuracy   | F1         | Relative accuracy loss [%] | Calibration/tuning time [s] | Speedup |
|:----------------------------:|:----------:|:----------:|:--------------------------:|:---------------------------:|:-------:|
| **No quantization (f32)**    | **0.8529** | **0.8956** | **0**                      | **0**                       | **1.0** |
| Native 'naive', 10 batches   | 0.8259     | 0.8775     | 3.1657                     | 31                          | 1.3811  |
| Native 'naive', 20 batches   | 0.8210     | 0.8731     | 3.7402                     | 58                          | 1.3866  |
| Native 'entropy', 10 batches | 0.8064     | 0.8557     | 5.4520                     | 37                          | 1.3789  |
| Native 'entropy', 20 batches | 0.8137     | 0.8624     | 4.5961                     | 67                          | 1.3460  |
| INC, 'basic'                 | 0.8456     | 0.8889     | 0.8559                     | 197                         | 1.4418  |
| INC, 'bayesian'              | 0.8529     | 0.8888     | 0                          | 129                         | 1.4275  |
| INC, 'mse'                   | 0.8480     | 0.8954     | 0.5745                     | 974                         | 0.9642  |

All INC strategies found configurations meeting the 1% relative accuracy loss criterion. Only the `mse` strategy struggled, taking the longest time and generating a configuration that is slower than the f32 model. Although these results may suggest that the `mse` strategy is the worst and the `bayesian` strategy is the best, different strategies may give better results for specific models and tasks. Usually the `basic` strategy is the most stable one.

Here is an example of a configuration generated by INC with the `basic` strategy:

- Layers quantized using min-max (`naive`) calibration algorithm:
```
{'bertclassifier0_dropout0_fwd', 'bertencoder0_layernorm0_layernorm0', 'bertencoder0_transformer0_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer0_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer0_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer0_layernorm0_layernorm0', 'bertencoder0_transformer0_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer10_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer10_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer10_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer10_layernorm0_layernorm0', 'bertencoder0_transformer10_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer11_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer11_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer11_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer11_layernorm0_layernorm0', 'bertencoder0_transformer1_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer1_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer1_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer1_layernorm0_layernorm0', 'bertencoder0_transformer1_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer2_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer2_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer2_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer2_layernorm0_layernorm0', 'bertencoder0_transformer2_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer3_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer3_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer3_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer3_layernorm0_layernorm0', 'bertencoder0_transformer3_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer4_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer4_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer4_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer4_layernorm0_layernorm0', 'bertencoder0_transformer4_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer5_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer5_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer5_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer5_layernorm0_layernorm0', 'bertencoder0_transformer5_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer6_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer6_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer6_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer6_layernorm0_layernorm0', 'bertencoder0_transformer6_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer7_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer7_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer7_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer7_layernorm0_layernorm0', 'bertencoder0_transformer7_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer8_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer8_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer8_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer8_layernorm0_layernorm0', 
'bertencoder0_transformer8_positionwiseffn0_layernorm0_layernorm0', 'bertencoder0_transformer9_dotproductselfattentioncell0_dropout0_fwd', 'bertencoder0_transformer9_dotproductselfattentioncell0_reshape3', 'bertencoder0_transformer9_dotproductselfattentioncell0_reshape7', 'bertencoder0_transformer9_layernorm0_layernorm0', 'bertencoder0_transformer9_positionwiseffn0_layernorm0_layernorm0', 'bertmodel0_reshape0', 'sg_mkldnn_fully_connected_0', 'sg_mkldnn_fully_connected_1', 'sg_mkldnn_fully_connected_11', 'sg_mkldnn_fully_connected_12', 'sg_mkldnn_fully_connected_13', 'sg_mkldnn_fully_connected_15', 'sg_mkldnn_fully_connected_16', 'sg_mkldnn_fully_connected_17', 'sg_mkldnn_fully_connected_19', 'sg_mkldnn_fully_connected_20', 'sg_mkldnn_fully_connected_21', 'sg_mkldnn_fully_connected_23', 'sg_mkldnn_fully_connected_24', 'sg_mkldnn_fully_connected_25', 'sg_mkldnn_fully_connected_27', 'sg_mkldnn_fully_connected_28', 'sg_mkldnn_fully_connected_29', 'sg_mkldnn_fully_connected_3', 'sg_mkldnn_fully_connected_31', 'sg_mkldnn_fully_connected_32', 'sg_mkldnn_fully_connected_33', 'sg_mkldnn_fully_connected_35', 'sg_mkldnn_fully_connected_36', 'sg_mkldnn_fully_connected_37', 'sg_mkldnn_fully_connected_39', 'sg_mkldnn_fully_connected_4', 'sg_mkldnn_fully_connected_40', 'sg_mkldnn_fully_connected_41', 'sg_mkldnn_fully_connected_43', 'sg_mkldnn_fully_connected_44', 'sg_mkldnn_fully_connected_45', 'sg_mkldnn_fully_connected_47', 'sg_mkldnn_fully_connected_48', 'sg_mkldnn_fully_connected_49', 'sg_mkldnn_fully_connected_5', 'sg_mkldnn_fully_connected_7', 'sg_mkldnn_fully_connected_8', 'sg_mkldnn_fully_connected_9', 'sg_mkldnn_fully_connected_eltwise_10', 'sg_mkldnn_fully_connected_eltwise_14', 'sg_mkldnn_fully_connected_eltwise_18', 'sg_mkldnn_fully_connected_eltwise_2', 'sg_mkldnn_fully_connected_eltwise_22', 'sg_mkldnn_fully_connected_eltwise_26', 'sg_mkldnn_fully_connected_eltwise_30', 'sg_mkldnn_fully_connected_eltwise_34', 'sg_mkldnn_fully_connected_eltwise_38', 'sg_mkldnn_fully_connected_eltwise_42', 'sg_mkldnn_fully_connected_eltwise_46', 'sg_mkldnn_fully_connected_eltwise_6'}
```

- Layers quantized using KL (`entropy`) calibration algorithm:
```
{'sg_mkldnn_selfatt_qk_0', 'sg_mkldnn_selfatt_qk_10', 'sg_mkldnn_selfatt_qk_12', 'sg_mkldnn_selfatt_qk_14', 'sg_mkldnn_selfatt_qk_16', 'sg_mkldnn_selfatt_qk_18', 'sg_mkldnn_selfatt_qk_2', 'sg_mkldnn_selfatt_qk_20', 'sg_mkldnn_selfatt_qk_22', 'sg_mkldnn_selfatt_qk_4', 'sg_mkldnn_selfatt_qk_6', 'sg_mkldnn_selfatt_qk_8', 'sg_mkldnn_selfatt_valatt_1', 'sg_mkldnn_selfatt_valatt_11', 'sg_mkldnn_selfatt_valatt_13', 'sg_mkldnn_selfatt_valatt_15', 'sg_mkldnn_selfatt_valatt_17', 'sg_mkldnn_selfatt_valatt_19', 'sg_mkldnn_selfatt_valatt_21', 'sg_mkldnn_selfatt_valatt_23', 'sg_mkldnn_selfatt_valatt_3', 'sg_mkldnn_selfatt_valatt_5', 'sg_mkldnn_selfatt_valatt_7', 'sg_mkldnn_selfatt_valatt_9'}
```

- Layers excluded from quantization:
```
{'sg_mkldnn_fully_connected_43'}
```

## Tips
- In order to get a solution that generalizes well, evaluate the model (in `eval_func`) on a representative dataset.
- With the `history.snapshot` file (generated by INC) you can recover any model that was generated during the tuning process:
```python
from neural_compressor.utils.utility import recover

quantized_model = recover(f32_model, 'nc_workspace/<tuning date>/history.snapshot', configuration_idx).model
```

<!-- INSERT SOURCE DOWNLOAD BUTTONS -->
