Commit

Cherry-pick openvinotoolkit#6419
* [Runtime] INT8 inference documentation update

* [Runtime] INT8 inference documentation: typo was fixed

* Update docs/IE_DG/Int8Inference.md

Co-authored-by: Anastasiya Ageeva <[email protected]>

* Update docs/IE_DG/Int8Inference.md

Co-authored-by: Anastasiya Ageeva <[email protected]>

* Update docs/IE_DG/Int8Inference.md

Co-authored-by: Anastasiya Ageeva <[email protected]>

* Update docs/IE_DG/Int8Inference.md

Co-authored-by: Anastasiya Ageeva <[email protected]>

* Update docs/IE_DG/Int8Inference.md

Co-authored-by: Anastasiya Ageeva <[email protected]>

* Table of Contents was removed

Co-authored-by: Anastasiya Ageeva <[email protected]>
# Conflicts:
#	docs/IE_DG/Int8Inference.md
#	thirdparty/ade
eshoguli authored and andrew-zaytsev committed Aug 13, 2021
1 parent 78e866f commit 6aa2751
Showing 1 changed file with 8 additions and 9 deletions: docs/IE_DG/Int8Inference.md
@@ -1,12 +1,5 @@
# Low-Precision 8-bit Integer Inference {#openvino_docs_IE_DG_Int8Inference}

## Table of Contents
1. [Supported devices](#supported-devices)
2. [Low-Precision 8-bit Integer Inference Workflow](#low-precision-8-bit-integer-inference-workflow)
3. [Prerequisites](#prerequisites)
4. [Inference](#inference)
5. [Results analysis](#results-analysis)

## Supported devices

Low-precision 8-bit inference is optimized for:
@@ -24,12 +17,18 @@ Low-precision 8-bit inference is optimized for:

## Low-Precision 8-bit Integer Inference Workflow

8-bit computations (referred to as `int8`) offer better performance compared to the results of inference in higher precision (for example, `fp32`), because they allow loading more data into a single processor instruction. Usually the cost for significant boost is a reduced accuracy. However, it is proved that an accuracy drop can be negligible and depends on task requirements, so that the application engineer can set up the maximum accuracy drop that is acceptable.
8-bit computation (referred to as `int8`) offers better performance than inference in higher precision (for example, `fp32`), because it allows loading more data into a single processor instruction. The usual cost of this significant boost is reduced accuracy. However, it has been proven that the drop in accuracy can be negligible and depends on task requirements, so the application engineer can configure the maximum accuracy drop that is acceptable.
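As a back-of-the-envelope illustration (an addition for clarity, assuming 512-bit SIMD registers rather than anything stated in the source), the headroom comes from `int8` values being a quarter of the size of `fp32` values, so one vector instruction can process four times as many of them:

```python
# Hypothetical sizing example: how many values fit into one 512-bit register.
REGISTER_BITS = 512                 # e.g. a single AVX-512 register (assumption)
fp32_lanes = REGISTER_BITS // 32    # 16 fp32 values per instruction
int8_lanes = REGISTER_BITS // 8     # 64 int8 values per instruction
print(f"fp32 lanes: {fp32_lanes}, int8 lanes: {int8_lanes}, "
      f"ratio: {int8_lanes // fp32_lanes}x")
```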

For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from [Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models](@ref omz_models_group_intel). If the model is not quantized, you can use the [Post-Training Optimization Tool](@ref pot_README) to quantize it. The quantization process adds [FakeQuantize](../ops/quantization/FakeQuantize_1.md) layers on activations and weights for most layers. Read more about the underlying math in [Uniform Quantization with Fine-Tuning](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md).
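As a minimal numpy sketch of the uniform quantization arithmetic that a `FakeQuantize`-style layer performs (parameter names such as `in_low` and `levels` are illustrative; the authoritative definition is the FakeQuantize specification linked above):

```python
import numpy as np

def fake_quantize(x, in_low, in_high, out_low, out_high, levels=256):
    """Clamp to [in_low, in_high], snap to `levels` discrete steps,
    then rescale the result into [out_low, out_high]."""
    x = np.clip(x, in_low, in_high)
    steps = np.round((x - in_low) / (in_high - in_low) * (levels - 1))
    return steps / (levels - 1) * (out_high - out_low) + out_low

activations = np.array([-1.3, -0.2, 0.0, 0.7, 2.5], dtype=np.float32)
print(fake_quantize(activations, in_low=-1.0, in_high=1.0, out_low=-1.0, out_high=1.0))
```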

When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in a precision that this plugin supports.
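For instance, loading a quantized IR looks exactly like loading any other IR; the sketch below assumes the 2021.x `openvino.inference_engine` Python API and a quantized `resnet-50-tf` IR in the working directory (the file names are illustrative):

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# The IR already contains FakeQuantize layers, so no extra flags are needed:
# the CPU plugin detects them and selects int8 kernels on its own.
net = ie.read_network(model="resnet-50-tf.xml", weights="resnet-50-tf.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
input_shape = net.input_info[input_name].input_data.shape
result = exec_net.infer({input_name: np.zeros(input_shape, dtype=np.float32)})
```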

In the *Runtime stage*, the quantized model is loaded to the plugin. The plugin uses the `Low Precision Transformation` component to update the model to infer it in low precision:
- Update `FakeQuantize` layers to have quantized output tensors in a low precision range and add dequantization layers to compensate for the update (see the sketch after this list). Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in the low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused into the next `FakeQuantize` layer.
- Quantize weights and store them in `Constant` layers.
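A conceptual numpy sketch of that split (illustrative only, not the actual transformation code): the `int8` tensor is what a low-precision kernel consumes, and the dequantization multiply restores the floating-point scale for layers that stay in `fp32`. The `scale` and `zero_point` values are hypothetical:

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Map float values onto the int8 grid, as a quantized FakeQuantize output would be stored.
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # The compensating dequantization layer: undo the mapping in fp32.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.51, 0.0, 0.24, 0.99], dtype=np.float32)
scale, zero_point = 0.01, 0                  # hypothetical quantization parameters
q = quantize(x, scale, zero_point)           # consumed by the int8 kernel
print(q, dequantize(q, scale, zero_point))   # dequantization restores ~x
```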

## Prerequisites

In *Runtime stage* stage, the quantized model is loaded to the plugin. The plugin uses `Low Precision Transformation` component to update the model to infer it in low precision:
- Update `FakeQuantize` layers to have quantized output tensors in low precision range and add dequantization layers to compensate the update. Dequantization layers are pushed through as many layers as possible to have more layers in low precision. After that, most layers have quantized input tensors in low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the next `FakeQuantize` layer.
- Weights are quantized and stored in `Constant` layers.
@@ -47,7 +46,7 @@ After that you should quantize model by the [Model Quantizer](@ref omz_tools_dow

## Inference

The simplest way to infer the model and collect performance counters is [C++ Benchmark Application](../../inference-engine/samples/benchmark_app/README.md).
The simplest way to infer the model and collect performance counters is the [C++ Benchmark Application](../../inference-engine/samples/benchmark_app/README.md).
```sh
./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir
```
