2.6 version sync #2470

Merged
merged 2 commits into from
Jul 5, 2023


@jxt1234 jxt1234 commented Jul 5, 2023

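1. New features

  • New int8 quantized operator support:
    • Softmax
    • Interp
    • Binary
    • Unary
    • Scale
  • OpenCL supports specific cases of the Loop operator:
    • BatchMatMul
    • Gather
  • x86_64 supports Gelu-bf16;
  • CUDA supports bf16 model inference;
  • The benchmark tool can directly measure a model's performance after quantization (without first quantizing the model with the quantization tool);
  • Pymnn Tensor/Var supports mixed-type data when created from a Tuple;
  • Weight-quantized models support a low-memory inference mode that dequantizes weights at compute time:
    • ChatGLM-6B inference runs with a memory footprint of 3G;
    • A ChatGLM-MNN Android app has been built;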
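
The low-memory mode in the last item keeps weights in quantized form and dequantizes them only while computing. The pure-Python sketch below illustrates that general idea, not MNN's implementation: the symmetric per-row int8 scheme and the names `quantize_rows` and `matvec_dequant_on_the_fly` are assumptions made for illustration.

```python
def quantize_rows(weights):
    """Symmetric per-row int8 quantization (assumed scheme, for illustration).

    Returns the quantized rows and one float scale per row.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clamp to [-127, 127] so every value fits in a signed 8-bit integer.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def matvec_dequant_on_the_fly(q_rows, scales, x):
    """Compute W @ x without ever building a float copy of W."""
    out = []
    for q_row, scale in zip(q_rows, scales):
        acc = 0.0
        for q, xi in zip(q_row, x):
            # Dequantize one weight, use it immediately, then discard it.
            acc += (q * scale) * xi
        out.append(acc)
    return out


weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 4.0]

q_rows, scales = quantize_rows(weights)
approx = matvec_dequant_on_the_fly(q_rows, scales, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
print(approx, exact)
```

In this sketch only the int8 values and one scale per row stay resident, which is where the memory saving comes from; the cost is the extra multiply per weight during the computation.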

2. Optimizations

  • OpenCL supports the Qualcomm record queue, reducing the time needed to create GPU Command Buffers;

Benchmark results on a OnePlus 9:

Model                   unrecord   record
resnet-v2-50.mnn        21.254     20.160
MobileNetV2_224.mnn     4.853      4.186
mobilenet-v1-1.0.mnn    6.424      5.315
nasnet.mnn              46.751     20.260
SqueezeNetV1.0.mnn      7.35       6.832
squeezenetv1.1.mnn      3.936      3.693
mobilenetV3.mnn         14.201     6.743
inception-v3.mnn        33.111     32.032
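
To make the table easier to read, the sketch below computes the relative reduction from the unrecord column to the record column for each model. It assumes the values are timings where lower is better; the table does not state a unit, so none is assumed here. The helper name `reduction_percent` is hypothetical.

```python
# (unrecord, record) values copied from the benchmark table above.
results = {
    "resnet-v2-50.mnn": (21.254, 20.160),
    "MobileNetV2_224.mnn": (4.853, 4.186),
    "mobilenet-v1-1.0.mnn": (6.424, 5.315),
    "nasnet.mnn": (46.751, 20.260),
    "SqueezeNetV1.0.mnn": (7.35, 6.832),
    "squeezenetv1.1.mnn": (3.936, 3.693),
    "mobilenetV3.mnn": (14.201, 6.743),
    "inception-v3.mnn": (33.111, 32.032),
}


def reduction_percent(unrecord, record):
    """Relative reduction of the record value versus the unrecord value, in percent."""
    return (unrecord - record) / unrecord * 100.0


reductions = {name: reduction_percent(u, r) for name, (u, r) in results.items()}

# Print models from largest to smallest reduction.
for name, pct in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {pct:5.1f}%")
```

Under that assumption, nasnet.mnn and mobilenetV3.mnn drop by more than half, while the other models improve by roughly 3% to 17%.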
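  • Sparse convolution memory optimization, reducing memory usage;
  • Reduced constant memory usage when running MNN models heterogeneously (CPU low precision / GPU);
  • Optimized the performance of CUDA int8 operators;
  • Reduced the number of regions produced by Permute geometry computation;
  • Re-tuned the tile sizes of ConvolutionInt8 and im2col under AVX512-VNNI, improving performance by 20%-30%;
  • Added SIMD implementations of bilinear/nearest sample on x86, improving ImageProcess performance by about 50%;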
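
For readers unfamiliar with the two sampling modes named in the last item, here is a scalar reference sketch of nearest and bilinear sampling on a single-channel image. It shows the per-pixel arithmetic that a SIMD version would vectorize; it is not MNN's ImageProcess code, and the clamping and rounding conventions as well as the names `sample_nearest` and `sample_bilinear` are assumptions.

```python
import math


def sample_nearest(img, x, y):
    """Return the pixel closest to (x, y), clamped to the image bounds."""
    h, w = len(img), len(img[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return img[yi][xi]


def sample_bilinear(img, x, y):
    """Blend the four pixels around (x, y), weighted by distance."""
    h, w = len(img), len(img[0])
    # Clamp the coordinate so all four neighbours stay inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


img = [[0.0, 10.0],
       [20.0, 30.0]]
print(sample_nearest(img, 0.4, 0.6))
print(sample_bilinear(img, 0.5, 0.5))
```

Each output pixel depends only on a few neighbouring inputs and a handful of multiply-adds, which is why this kind of loop is a natural fit for SIMD.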

3. Bugfix

3.1 Fixes for linked GitHub issues

  • Fixed a CUDA Raster error that caused outputs to be 0; issue-2333
  • Fixed errors in the OpenCL Gather operator; issue-2424
  • Fixed errors in ImageProcess; issue-2386
  • OpenCL lets the user select the device id; issue-2343

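3.2 Other bugfixes

  • The CUDA CMakeList now reports an error for unsupported architectures;
  • The testMNNFromOnnx script no longer enables DEBUG mode when the model tests correctly;
  • shape_mutable in load_module_from_file now defaults to True (models with subgraphs cannot run when it is False);
  • When MNNConvert is used with the keepInputFormat option, the format of output Tensors is also converted back to the original format;
  • Fixed a crash when logging with an empty device;
  • Fixed the BinaryOp unit test failing to compile on Windows;
  • Fixed the MNN_SUPPORT_DEPRECATED_OP macro not controlling OptimizedComputer;
  • Fixed incorrect convolution results with fp16 when running multithreaded with tiling along the channel dimension;
  • Fixed out-of-bounds memory access in deconvolutionInt8;
  • Fixed TensorArrayWrite geometry computation producing zero regions;
  • Fixed errors in CUDA depthwise conv;
  • Fixed some formatting and content errors in the documentation;
  • Fixed errors in createRuntime and setGlobalConfig under multithreading;
  • Fixed a compilation failure caused by unused code in Vec.hpp;
  • Fixed an assert failure on gpuDevice in OpenCL;
  • Fixed errors in OpenCL binary mod;
  • Fixed errors in CUDA argmax;
  • Fixed incorrect shapes in pymnn/example/mnn_numpy_cv_demo.py;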


@jxt1234 jxt1234 merged commit c293f9e into master Jul 5, 2023
@jxt1234 jxt1234 deleted the feature/sync branch July 5, 2023 06:22