We recommend combining pruning with distillation training, or pruning with quantization, to compress detection models. The following uses YOLOv3 as an example to run pruning, distillation, and quantization experiments.
Experimental Environment
Python 3.7+
PaddlePaddle >= 2.1.0
PaddleSlim >= 2.1.0
CUDA 10.1+
cuDNN >= 7.6.5
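The Python-side requirements above can be checked before running any experiment. The following is a minimal sketch, not part of PaddleDetection: the helper names (`parse_version`, `meets_minimum`, `installed_version`, `check_environment`) are hypothetical, and it assumes the packages were installed with pip under the distribution names `paddlepaddle` or `paddlepaddle-gpu` and `paddleslim`. It does not check CUDA or cuDNN, and nightly builds may report a version string that this simple comparison does not handle.

```python
import re
import sys

# Minimum versions taken from the environment list above.
MIN_PYTHON = (3, 7)
MIN_VERSIONS = {"paddlepaddle": "2.1.0", "paddleslim": "2.1.0"}


def parse_version(text):
    """Turn '2.1.0' or '2.1.0.post101' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so (2, 2) compares as (2, 2, 0)
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_names):
    """Return the version of the first installed distribution, or None."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        return None  # Python 3.7 without a metadata backport
    for name in dist_names:
        try:
            return version(name)
        except PackageNotFoundError:
            continue
    return None


def check_environment():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.7+ is required")
    # Assumption: pip distribution names; the GPU build is tried first.
    candidates = {
        "paddlepaddle": ["paddlepaddle-gpu", "paddlepaddle"],
        "paddleslim": ["paddleslim"],
    }
    for key, names in candidates.items():
        found = installed_version(names)
        if found is None:
            problems.append(f"{key} not found")
        elif not meets_minimum(found, MIN_VERSIONS[key]):
            problems.append(f"{key} {found} is older than {MIN_VERSIONS[key]}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

Returning a list of problems instead of raising on the first failure lets a single run report every missing or outdated package.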
Version Dependencies among PaddleDetection, PaddlePaddle, and PaddleSlim
| PaddleDetection Version | PaddlePaddle Version | PaddleSlim Version | Note |
| --- | --- | --- | --- |
| release/2.1 | >= 2.1.0 | 2.1 | Quantized model export relies on the latest Paddle develop branch, available in the PaddlePaddle daily version |
| release/2.0 | >= 2.0.1 | 2.0 | Quantization depends on Paddle 2.1 and PaddleSlim 2.1 |
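The version pairings above can also be kept as data, so that an install script picks matching packages for a given PaddleDetection branch. This is only an illustrative sketch: `COMPATIBILITY` and `pip_requirements` are hypothetical names, and reading a bare PaddleSlim "2.1" as "the 2.1.x series" is an assumption, not something the table states. GPU users would substitute `paddlepaddle-gpu` for `paddlepaddle`.

```python
# Compatibility rows copied from the table above; keys are PaddleDetection branches.
COMPATIBILITY = {
    "release/2.1": {"paddlepaddle": ">= 2.1.0", "paddleslim": "2.1"},
    "release/2.0": {"paddlepaddle": ">= 2.0.1", "paddleslim": "2.0"},
}


def pip_requirements(branch):
    """Return pip-style requirement strings for a PaddleDetection branch."""
    row = COMPATIBILITY[branch]  # raises KeyError for branches not in the table
    paddle_spec = row["paddlepaddle"].replace(" ", "")
    # Assumption: a bare "2.1" in the table means any 2.1.x release.
    slim_spec = "==" + row["paddleslim"] + ".*"
    return ["paddlepaddle" + paddle_spec, "paddleslim" + slim_spec]


if __name__ == "__main__":
    print(pip_requirements("release/2.1"))
```

The returned strings can be passed to `pip install`. The notes column (for example, the dependency of quantized export on a develop build) is not encoded here and still needs to be checked by hand.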
The V100 inference latency above is measured with TensorRT FP32 for non-quantized models and with TensorRT INT8 for quantized models; both figures include NMS time.
The SD855 inference latency is measured with a Paddle Lite deployment on the ARMv8 architecture, using 4 threads.