- Change calibration dataset by setting different `num_calib_batches` or shuffle your validation dataset;
- Use Intel® Neural Compressor ([see below](#Improving-accuracy-with-Intel-Neural-Compressor))
#### Performance Tuning
- Make sure to perform graph fusion before quantization;
MXNet also supports deploying quantized models with C++. Refer to the [MXNet C++ Package](https://github.com/apache/incubator-mxnet/blob/master/cpp-package/README.md) for more details.
# Improving accuracy with Intel® Neural Compressor
The accuracy of a model can decrease as a result of quantization. When the accuracy drop is significant, we can try to manually find a better quantization configuration (exclude some layers, try different calibration methods, etc.), but for bigger models this might prove to be a difficult and time-consuming task. [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) tries to automate this process using several tuning heuristics, which aim to find the quantization configuration that satisfies the specified accuracy requirement.
**NOTE:**
Most tuning strategies will try different configurations on an evaluation dataset in order to find out how each layer affects the accuracy of the model. This means that for larger models, it may take a long time to find a solution (as the tuning space is usually larger and the evaluation itself takes longer).
## Installation and Prerequisites
- Install MXNet with MKLDNN enabled as described in the [previous section](#Installation-and-Prerequisites).
- Install Intel® Neural Compressor:
Use one of the commands below to install INC (supported Python versions are: 3.6, 3.7, 3.8, 3.9):
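A minimal sketch, assuming the standard PyPI package name:

```bash
# Install Intel® Neural Compressor (standard PyPI package name).
pip install neural-compressor
```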
The quantization tuning process can be customized in a YAML configuration file. Below is a simple example:
```yaml
# cnn.yaml

version: 1.0

model:
  name: cnn
  framework: mxnet

quantization:
  calibration:
    sampling_size: 160 # number of samples for calibration

tuning:
  strategy:
    name: basic
  accuracy_criterion:
    relative: 0.01
  exit_policy:
    timeout: 0
  random_seed: 9527
```
We are using the `basic` strategy, but you could also try out different ones. [Here](https://github.com/intel/neural-compressor/blob/master/docs/tuning_strategies.md) you can find a list of strategies available in INC and details of how they work. You can also add your own strategy if the existing ones do not suit your needs.
Since the value of `timeout` is 0, INC will run until it finds a configuration that satisfies the accuracy criterion and then exit. Depending on the strategy, this may not be ideal, as sometimes it would be better to further explore the tuning space to find a superior configuration in terms of both accuracy and speed. To achieve this, we can set a specific `timeout` value, which tells INC how long (in seconds) it should run.
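For instance, a minimal sketch of such an `exit_policy` (the one-hour budget is an illustrative value, not a recommendation):

```yaml
exit_policy:
  timeout: 3600 # explore the tuning space for up to an hour
```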
For more information about the configuration file, see the [template](https://github.com/intel/neural-compressor/blob/master/neural_compressor/template/ptq.yaml) from the official INC repo. Keep in mind that only `post training quantization` is currently supported for MXNet.
## Model quantization and tuning
In general, Intel® Neural Compressor requires 4 elements in order to run:
1. Config file - like the example above
2. Model to be quantized
3. Calibration dataloader
4. Evaluation function - a function that takes a model as an argument and returns the accuracy it achieves on a certain evaluation dataset.
### Quantizing ResNet
The previous sections described how to quantize ResNet using native MXNet quantization. This example shows how to achieve the same result (with auto-tuning) using INC.
2. Prepare the dataset:

```python
import mxnet as mx

# NOTE: `batch_size`, `batch_shape` and `mean_std` come from the elided earlier
# part of this example; the values below are illustrative placeholders.
batch_size = 16
batch_shape = (batch_size, 3, 224, 224)
mean_std = {'mean_r': 123.68, 'mean_g': 116.779, 'mean_b': 103.939,
            'std_r': 58.393, 'std_g': 57.12, 'std_b': 57.375}

data = mx.io.ImageRecordIter(path_imgrec='data/val_256_q90.rec',
                             batch_size=batch_size,
                             data_shape=batch_shape[1:],
                             rand_crop=False,
                             rand_mirror=False,
                             shuffle=False,
                             **mean_std)
data.batch_size = batch_size
```
3. Prepare the evaluation function:
```python
eval_samples = batch_size*10

def eval_func(model):
    data.reset()
    metric = mx.metric.Accuracy()
    for i, batch in enumerate(data):
        if i * batch_size >= eval_samples:
            break
        x = batch.data[0].as_in_context(mx.cpu())
        label = batch.label[0].as_in_context(mx.cpu())
        outputs = model.forward(x)
        metric.update(label, outputs)
    return metric.get()[1]
```
4. Run Intel® Neural Compressor:
```python
from neural_compressor.experimental import Quantization

# `resnet18` is the float32 model prepared in the elided earlier step.
quantizer = Quantization("./cnn.yaml")
quantizer.model = resnet18
quantizer.calib_dataloader = data
quantizer.eval_func = eval_func
qnet = quantizer.fit().model
```
Since this model already achieves good accuracy with native quantization (less than 1% accuracy drop), for the given configuration file INC will stop at the first configuration, quantizing all layers using the `naive` calibration mode for each. To see the true potential of INC, we need a model which suffers from a larger accuracy drop after quantization.
### Quantizing BERT
399
+
400
+
This example shows how to use INC to quantize BERT-base for MRPC. In this case, native MXNet quantization usually introduces a significant accuracy drop (2% - 5% with the `naive` calibration mode). To simplify the code, model- and task-specific boilerplate has been moved to the `details.py` file.
This is the configuration file for this example:
```yaml
version: 1.0

model:
  name: bert
  framework: mxnet

quantization:
  calibration:
    sampling_size: 320 # number of samples for calibration

tuning:
  strategy:
    name: basic
  accuracy_criterion:
    relative: 0.01
  exit_policy:
    timeout: 0
    max_trials: 9999 # default is 100
  random_seed: 9527
```
And here is the script:
```python
from pathlib import Path
from functools import partial

import details
from neural_compressor.experimental import Quantization, common

# ... (the remainder of the script is omitted here)
```
For the complete code, see this example in the [official GitHub repository](https://github.com/apache/incubator-mxnet/tree/v1.x/example/quantization_inc/BERT_MRPC).
All INC strategies found configurations meeting the 1% relative accuracy loss criterion. Only the `mse` strategy struggled, taking the longest time and generating a configuration that is slower than the f32 model. Although these results may suggest that the `mse` strategy is the worst and the `bayesian` strategy is the best, different strategies may give better results for specific models and tasks. Usually the `basic` strategy is the most stable one.
Here is an example of a configuration generated by INC with the `basic` strategy:
- Layers quantized using the min-max (`naive`) calibration algorithm: