Releases: alibaba/MNN
OpenCL Speed Up
Features:
- Support Buffer mode for the OpenCL backend; use numberThread / mode to control it: https://www.yuque.com/mnn/cn/create_session
- Remove unused code from the OpenCL backend
- Support tflite's Binary with Activation
Bugfix:
- Fix a bug in convolution when computing a zero-shape tensor
Add HIAI-NPU, While/Loop refactor
Features:
- Add HIAI Backend
- Refactor control flow support for Tensorflow; support dynamic LSTM from TF
- Add several ops for the TensorRT backend (GatherV2, Interp, ScatterND, DetectionPostProcess, BatchMatMul)
Speed up / Bugfix:
- Add Prelu fuse for Tensorflow
- Fix some op bugs for CUDA
- Fix some op bugs for Metal (Relu6 / ops not registered)
Add oneDNN, TensorArray support, Bugfix
Features:
- Add oneDNN for Convolution optimization (In Test)
- Add Codegen for Fuse Op (In Test)
- Add TensorArray support (In Test)
Speed up / Bugfix:
- Support more binary ops for ARMv8.2
- Optimize Int8 compute for AVX2 and Neon
- Fix a region fuse bug in a special case
- Fix a run error in Vulkan Convolution Winograd when the feature map is large
Bugfix and speed up
Bugfix:
- Fix a memory leak when creating a session for some models.
- Fix several bugs in the Metal backend.
- Small bugfix for the iOS demo
- Fix the 3D BatchNormal Module not working in MNNTrain
- Fix a memory leak in CPUInterp
- Fix a stack error in MNNPackedMatMulRemain.S
- Fix the SSE branch using AVX instructions
Optimize:
- Reduce buffer creation during Metal execution.
- Reduce memory copies in CPUBatchMatMul.
- Reduce memory copies for Module
- Use Neon to optimize CPUTopKV2
- Reduce memory usage for the CUDA backend by split and merge
Feature:
- Support multi-instance for Module.
- Several CI scripts.
- More ops for the ARM82 backend ()
- More ops for CUDA (ArgMax, BatchMatMul, GatherV2, LayerNorm ......)
MNN 1.1.0 Release Notes
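1. Framework Generality
1.1 Geometry Computing
Geometry computing is the large-scale framework refactoring in this release. It separates out the parts of most operators' computation that do not depend on the hardware backend (shape computation and geometry computation), which greatly reduces the cost of implementing operators on heterogeneous backends. MNN has rewritten all of its current hardware backends on top of geometry computing. Because geometry computing increases operator coverage on the GPU backends, MNN GPU backend performance improved by about 20% across Alibaba's internal business models.
1.2 New Backends
Based on the geometry computing mechanism, MNN adds TensorRT and CUDA backends. They currently support common CV models and RNN models.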
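1.3 ASR Model Support
Beyond the CV models widely used in business applications, this release adds support for Transformer-based ASR models. Such models require the inference engine to support Control Flow, Dynamic Shape, and Zero Shape. MNN has added and refined support for these features at the framework level:
- Refactored Control Flow support, providing a functional control flow implementation that is transparent to users, and supporting conversion of TF 1.x control-flow models.
- Added Dynamic Shape support. MNN splits the whole graph into multiple segmented subgraphs at dynamic-shape operators. In code, each subgraph corresponds to a Module, and Modules can be nested, so the whole graph is expressed as a call tree of Modules. The leaf nodes of the tree can be executed with a Session; before each execution the Session is resized, which reruns shape inference and preallocates memory.
- Zero Shape means that the shape of some tensors in the model contains a 0, for example (1, 0, 256). This mostly arises from providing initial values for loop variables in a while-loop. MNN supports Zero Shape in both shape inference and execution logic.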
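To make the Zero Shape case concrete, here is a minimal C++ sketch. It is not MNN code: `elementCount` and `sumAll` are hypothetical helpers written for illustration. It shows that a shape such as (1, 0, 256) describes a valid tensor with zero elements, so shape inference and kernels need to accept it rather than treat it as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the MNN API: number of elements in a
// tensor with the given shape. Any zero dimension yields zero elements.
std::size_t elementCount(const std::vector<int>& shape) {
    std::size_t count = 1;
    for (int dim : shape) {
        assert(dim >= 0);  // negative dimensions are not valid shapes here
        count *= static_cast<std::size_t>(dim);
    }
    return count;
}

// A kernel written as a loop over the element count handles zero-shape
// tensors naturally: the loop body never runs and the sum stays at 0.
float sumAll(const std::vector<float>& data, const std::vector<int>& shape) {
    assert(data.size() == elementCount(shape));
    float total = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];
    }
    return total;
}
```

With these helpers, `elementCount({1, 0, 256})` is 0, and a loop variable initialized with such a tensor can still flow through a while-loop body until a later iteration gives it a non-empty shape.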
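2. Model Compression
Added weight-only quantization for model compression (MNNConvert --weightQuantBits). This feature quantizes only the float32 weights of conv/matmul/LSTM and reduces only model size: after the model is loaded, the weights are decoded back to float32. The quantization bit width can be 2 to 8, and runtime speed is the same as the float32 model. In internal tests, accuracy is essentially lossless at 8 bits while model size shrinks by 4x.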
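The idea behind weight-only quantization can be sketched as follows. This is an illustrative min/max linear quantizer, chosen as an assumption for the example; the release notes do not describe the exact scheme MNNConvert uses, and `QuantizedWeights`, `quantizeWeights`, and `dequantizeWeights` are hypothetical names. Codes are stored one per byte for clarity, so only the 8-bit case reflects the 4x size reduction; narrower widths would need bit packing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative container, not an MNN type.
struct QuantizedWeights {
    std::vector<std::uint8_t> codes;  // one code per weight (unpacked)
    float minValue = 0.0f;            // value represented by code 0
    float step = 1.0f;                // value difference between adjacent codes
};

// Map each float32 weight to the nearest of 2^bits evenly spaced levels
// between the smallest and largest weight.
QuantizedWeights quantizeWeights(const std::vector<float>& weights, int bits) {
    assert(bits >= 2 && bits <= 8);  // bit-width range given in the release notes
    assert(!weights.empty());
    const auto minmax = std::minmax_element(weights.begin(), weights.end());
    const float lo = *minmax.first;
    const float hi = *minmax.second;
    const long maxCode = (1L << bits) - 1;
    QuantizedWeights q;
    q.minValue = lo;
    q.step = (hi > lo) ? (hi - lo) / static_cast<float>(maxCode) : 1.0f;
    q.codes.reserve(weights.size());
    for (float w : weights) {
        const long code = std::lround((w - lo) / q.step);
        q.codes.push_back(static_cast<std::uint8_t>(
            std::min<long>(std::max<long>(code, 0L), maxCode)));
    }
    return q;
}

// Decode back to float32 at load time. Inference then runs on floats,
// which is why speed matches the original float32 model.
std::vector<float> dequantizeWeights(const QuantizedWeights& q) {
    std::vector<float> out;
    out.reserve(q.codes.size());
    for (std::uint8_t code : q.codes) {
        out.push_back(q.minValue + q.step * static_cast<float>(code));
    }
    return out;
}
```

In this sketch the round-trip error per weight is bounded by half a step, which is small at 8 bits and grows as the bit width shrinks.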
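3. Performance Optimization
ARM backend
In May of this year, MNN's performance on ARM CPUs was already among the best in the industry (https://zhuanlan.zhihu.com/p/151666822). Since then, MNN has kept investing in ARM CPU optimization and gained a further 10%~20% across models and chips. The road to better performance never ends.
OpenCL backend
With AutoTuning and a series of other optimizations enabled, MNN is generally 20%~100% faster than 1.0.0. Detailed performance data follows:
x86 backend
Since May, the MNN team has kept investing in x86 backend optimization. Single-threaded floating-point performance is now roughly on par with the industry benchmark OpenVINO, and ahead of it in some cases (Squeezenet v1.0).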
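4. Framework Usability
Interpreter::setCacheFile API
Because OpenCL's new AutoTuning mechanism and the TensorRT backend make the first inference slow, MNN adds a setCacheFile API to Interpreter. It caches the model after the GPU backend has compiled and optimized it. Usage: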
std::shared_ptr<MNN::Interpreter> net =
    std::shared_ptr<MNN::Interpreter>(MNN::Interpreter::createFromFile(fileName));
if (nullptr == net) {
    return 0;
}
// Must call it before createSession.
// If ".tempcache" file does not exist, Interpreter will go through the
// regular initialization procedure, after which the compiled model files
// will be written to ".tempcache".
// If ".tempcache" file exists, the Interpreter will be created from the
// cache.
net->setCacheFile(".tempcache");
MNN::ScheduleConfig config;
// Creates the session after you've called setCacheFile.
MNN::Session* session = net->createSession(config);
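5. Bugfixes
Fixed bugs include, but are not limited to:
- SpaceToBatchND and BatchToSpaceND now accept block size / padding as inputs (this supports Tensorflow dilated convolution when the input shape is unknown)
- Fixed depthToSpace and spaceToDepth; added pixelshuffle support
- Fixed poor 1x1 convolution performance when batch is large and width / height are small
- Fixed ConvTranspose output padding support for Onnx
- Fixed Onnx Resize not being supported for certain numbers of inputs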
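6. One More Thing
MNN now has its own official website! There you can also download the MNN team's brand-new "MNN Workbench", which includes ready-to-use models, visual training, and other tools, and supports one-click deployment to multiple kinds of devices.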
Advancing Our Amazing DL Engine: MNN 1.0.0
As many of you MNN fans may know, MNN was initially released in May 2019 by a team of enthusiastic Alibaba engineers who thought existing open source DL frameworks at the time weren’t good enough to handle the particular and demanding requirements of a billion-user mega application such as Taobao and that it was better to roll their own.
And so we did it. We built an inference engine from the ground up and we supported a vast variety of applications inside Alibaba with it. It was many times faster (and smaller!) than other inference engines such as TFLite and NCNN. We open sourced it and you guys loved it. The 3.9k stars on the Github repo are 3.9k votes of confidence from our biggest fans: you. What’s more, we are thrilled to publish our paper on MNN design principles in MLSys 2020 (Read the paper here: https://proceedings.mlsys.org/static/paper_files/mlsys/2020/7-Paper.pdf).
A year has passed since the initial release and we’ve made a great number of new features, improvements and bug fixes. And today we’re delighted to announce the 1.0.0 official stable release of MNN. In this release, we add training capability to MNN, which means MNN is no longer an “inference” engine, but a fully-fledged DL engine capable of training and inference, with numerous hardware backends including Arm (32bit, 64bit, v8.2), x86, Metal, OpenCL, Vulkan and OpenGL.
The following are some release highlights.
Highlights
There are three major new features in this MNN stable release: training and quantization, performance improvements and MNN Python API (currently in BETA).
Training and Quantization With the Express API
We added a new set of C++ APIs (located in the express/ directory) to dynamically construct a graph, run training and inference, quantize models, etc. These APIs stand side-by-side with the existing APIs to use MNN and will be the officially recommended API surface once we iron out more details.
This new set of C++ APIs allows you to:
- train a model from scratch using MNN,
- finetune a model trained by other DL frameworks,
- and perform quantization-aware training (a.k.a QAT).
The documentation has been updated with instructions (see "Quantization-aware Training").
The following table is a comparison of model accuracy and size between MNN QAT and Tensorflow QAT using MobileNet V2.
Note 1
Both training and eval use ImageNet. Training batchsize is 32, with 100 iterations, i.e. 3200 images are used in training. Eval phase uses 50000 images.
Note 2
The original tensorflow model has an accuracy of 71.8%. Due to minor differences in image preprocessing, the original model in MNN format has a higher accuracy than the TF original.
On top of the improved accuracy with MNN QAT, MNN has great training speed on-device and on a PC. The following graph shows the time (in secs) to train a Lenet model with MNIST from scratch.
As you can see, MNN training performance on a mobile phone (MI6 or Mate 20) is comparable to the performance of PyTorch or Caffe on a 2015 Macbook Pro. What's more, MNN training speed on a Mac is more than 2 times as fast as PyTorch.
Performance Improvements
We have been hard at work pushing the limits of Arm CPU and x86 CPU, and bringing every bit of performance improvement into MNN.
Arm V8.2 Performance Improvements
We used the asimdhp (Advanced SIMD half precision) and asimddp (Advanced SIMD dot product) extensions in Armv8.2 to increase inference speed by roughly 100%.
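For readers unfamiliar with the dot-product extension: its SDOT instruction multiplies groups of four signed 8-bit values and accumulates each group's sum into a 32-bit lane. The scalar C++ sketch below models that arithmetic for one 128-bit register pair and uses it for an int8 dot product. It is only a model of the computation, with hypothetical names (`sdotStep`, `dotInt8`); real kernels, including MNN's, would use NEON intrinsics or assembly instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of one 128-bit SDOT: for each of the 4 int32 lanes,
// acc[lane] += sum of 4 products of signed 8-bit elements.
// a and b must each point to at least 16 int8 values.
std::array<std::int32_t, 4> sdotStep(std::array<std::int32_t, 4> acc,
                                     const std::int8_t* a,
                                     const std::int8_t* b) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int j = 0; j < 4; ++j) {
            const int idx = lane * 4 + j;
            acc[lane] += static_cast<std::int32_t>(a[idx]) *
                         static_cast<std::int32_t>(b[idx]);
        }
    }
    return acc;
}

// Int8 dot product built from 16-element SDOT steps, then a final
// horizontal add of the four lanes.
std::int32_t dotInt8(const std::vector<std::int8_t>& a,
                     const std::vector<std::int8_t>& b) {
    assert(a.size() == b.size() && a.size() % 16 == 0);
    std::array<std::int32_t, 4> acc = {0, 0, 0, 0};
    for (std::size_t base = 0; base < a.size(); base += 16) {
        acc = sdotStep(acc, a.data() + base, b.data() + base);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Because one instruction replaces sixteen multiplies and the widening adds that would otherwise be needed, int8 kernels built on it can do far more work per cycle than the pre-v8.2 widening sequence.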
Inference time (ms) before and after half precision optimization:
MNN doubles the inference performance without any noticeable drop in accuracy. Details below:
Inference time (ms) before and after SIMD dot product optimization:
MNN doubles the speed with SIMD dot product optimization, even on devices whose int8 performance is worse than fp32. Details below:
Arm64 Performance Improvements
By rethinking matrix multiplication tiling, cacheline alignment and cache prefetch, we've made significant performance improvements on mid to low tier chipsets (Qualcomm 652 and 425), which are the types of devices where a sizable perf improvement could mean the difference between smooth 30 fps and a janky unusable experience.
Inference speed (ms) for MobileNetV1 before and after optimization.
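As a rough illustration of what matrix multiplication tiling means, the C++ sketch below processes the matrices in small blocks so that each block of A, B, and C is reused while it is still in cache. This is a generic cache-blocking example, not MNN's kernel: the function name `matmulTiled` and the tile size `kTile = 32` are assumptions made for the example, and production kernels also tune alignment and prefetching per chipset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative tile edge length; real kernels tune this per chipset.
constexpr std::size_t kTile = 32;

// C (M x N) += A (M x K) * B (K x N), all stored row-major.
void matmulTiled(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i0 = 0; i0 < M; i0 += kTile) {
        const std::size_t iEnd = std::min(i0 + kTile, M);
        for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
            const std::size_t kEnd = std::min(k0 + kTile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
                const std::size_t jEnd = std::min(j0 + kTile, N);
                // Work on one tile: the touched parts of A, B and C are
                // small enough to stay in cache while they are reused.
                for (std::size_t i = i0; i < iEnd; ++i) {
                    for (std::size_t k = k0; k < kEnd; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < jEnd; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

The innermost loop walks B and C contiguously, which also keeps accesses cacheline-friendly; on chips with small caches, this kind of locality is where most of the gain comes from.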
X86 Performance Improvements
Even though the Arm platform is MNN's "basecamp", we've been investing in the x86 platform so as to make MNN training on laptops faster.
By improving the weight matrix layout and enabling FMA instructions, we've improved single-threaded x86 performance by about 40% ~ 60%. (The performance improvement is 100% compared to MNN at initial release.)
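As a small illustration of what FMA provides, the sketch below accumulates a dot product with `std::fma`, which computes `x * w + acc` as one fused operation with a single rounding. The function name `dotFma` is hypothetical and this is not MNN's kernel; whether the call compiles to a hardware FMA instruction depends on the target and compiler flags (for example `-mfma` with GCC or Clang on x86).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product where each step is a fused multiply-add: the product and
// the addition are rounded once instead of twice.
float dotFma(const std::vector<float>& x, const std::vector<float>& w) {
    assert(x.size() == w.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc = std::fma(x[i], w[i], acc);
    }
    return acc;
}
```

Convolution and matrix multiplication are dominated by exactly this multiply-then-accumulate pattern, so halving the instruction count for it translates directly into throughput.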
The following table shows the inference performance before and after the optimization.
New Python Express API (BETA)
The aforementioned "Express API" in C++ allows you to train a model with MNN, finetune a model and perform quantization-aware training with MNN. We added a set of corresponding Python Express APIs for people with prior experience of Tensorflow or PyTorch. The new Python Express API documentation is here.
The new Python Express API is currently in beta as we strive to make it more pythonic and PyTorch-esque. In the meantime, please give us feedback by posting an issue on GitHub or leaving a message in the Dingtalk group.
X86-Optimize and Plugin support
- Speed up inference by ~30% on x86 CPUs; enable FMA support by setting MNN_FMA_ENABLE = ON
- Fix compile bugs in Vulkan / converter on Windows
- Add plugin support
- Add support for a few more Onnx ops
Appendix:
1-thread, MacBook Pro (Retina, 15-inch, Mid 2015) | Before optimization / ms | After optimization / ms |
---|---|---|
Resnet18 | 81 | 48 |
Mobilenetv1.10-224 | 37 | 26 |

4-thread, MacBook Pro (Retina, 15-inch, Mid 2015) | Before optimization / ms | After optimization / ms |
---|---|---|
resnet-v2-50.mnn | 54.042 | 39.547 |
MobileNetV2_224.mnn | 7.748 | 5.915 |
mobilenet-v1-1.0.mnn | 12.172 | 7.746 |
SqueezeNetV1.0.mnn | 20.729 | 11.133 |
inception-v3.mnn | 76.550 | 59.054 |
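When reading latency tables like the ones above, note that a "speed up" percentage can mean either the reduction in latency or the gain in throughput, and the two differ. The helpers below (hypothetical names, written for this note) compute both from a before/after pair in milliseconds.

```cpp
#include <cassert>

// Percentage of the original latency that was removed.
double latencyReductionPercent(double beforeMs, double afterMs) {
    assert(beforeMs > 0.0 && afterMs > 0.0);
    return (beforeMs - afterMs) / beforeMs * 100.0;
}

// Percentage increase in inferences per second.
double throughputGainPercent(double beforeMs, double afterMs) {
    assert(beforeMs > 0.0 && afterMs > 0.0);
    return (beforeMs / afterMs - 1.0) * 100.0;
}
```

For the single-thread Mobilenetv1.10-224 row (37 ms to 26 ms), this gives a latency reduction of about 29.7% and a throughput gain of about 42.3%; for Resnet18 (81 ms to 48 ms) the figures are about 40.7% and 68.75%.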
Bugfix and low level arm speed up.
1. Fix several bugs in the tflite converter
2. Speed up ARM v8a for A53 / A55 / A72 / A73

Xiaomi Mi 5, ARM64, single thread | Before optimization / ms | After optimization / ms |
---|---|---|
resnet-v2-50.mnn | 529.041 | 493.421 |
MobileNetV2_224.mnn | 65.550 | 64.877 |
inception-v3 | 826.439 | 778.218 |
SqueezeNetV1.0.mnn | 135.534 | 130.932 |
mobilenet-v1-1.0.mnn | 104.304 | 104.232 |
ARM v8.2+ support and Vulkan backend enhancements
1. Support CPU fp16 and sdot on ARM v8.2+ (Qualcomm Snapdragon 845+, Qualcomm Snapdragon 660+, Kirin 980+, Kirin 810+), for a speedup of about 100% on these chips. To enable it, turn on the MNN_ARM82 macro and load libMNN_Arm82.so.
2. Add a few ops for Vulkan and reduce the prepare cost of Vulkan.
Support quantization aware training
1. Support quantization-aware training. See https://www.yuque.com/mnn/cn/bhz5eu for details.
2. Support a Python API for MNN-Express. See pymnn/examples/MNNTrain for details.
3. Speed up several cases of binary ops.