
MobileNetV2-YOLOv3-Nano: a detection network designed for mobile devices, 0.5 BFlops 🔥🔥🔥 HUAWEI P40: 6 ms & 3 MB!!! #6091

Open
dog-qiuqiu opened this issue Jun 29, 2020 · 18 comments


@dog-qiuqiu

Mobile inference framework benchmark (4× ARM CPU):

| Network | VOC mAP(0.5) | COCO mAP(0.5) | Resolution | Inference time (NCNN / Kirin 990) | Inference time (MNN arm82 / Kirin 990) | FLOPs | Weight size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNetV2-YOLOv3-Lite | 72.61 | 36.57 | 320 | 33 ms | 18 ms | 1.8 BFlops | 8.0 MB |
| MobileNetV2-YOLOv3-Nano | 65.27 | 30.13 | 320 | 13 ms | 5 ms | 0.5 BFlops | 3.0 MB |
| MobileNetV2-YOLOv3-Fastest | 33.19 | & | 320 | 8.2 ms | 3.67 ms | 0.13 BFlops | 0.4 MB |

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

@AlexeyAB (Owner)

@dog-qiuqiu Thanks!
Can you test and compare MobileNetV2-YOLOv3-Lite vs yolov3-tiny.cfg vs yolov3-tiny-prn.cfg vs yolov4.cfg, since they are already supported by NCNN?
It seems that tiny-prn is faster than tiny on GPU, while tiny is faster than tiny-prn on NPU.

Also test yolov4-tiny.cfg once it is implemented in NCNN: Tencent/ncnn#1885

You could also try to optimize yolov4-tiny.cfg for mobile CPUs.

@dog-qiuqiu (Author)

@AlexeyAB Hi,
here are the NCNN benchmark results on Huawei's Kirin 990, using its 4 high-performance cores:

```
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 31.58   max = 31.58   avg = 31.58
yolov3-tiny-prn               min = 36.60   max = 36.60   avg = 36.60
yolov3-tiny                   min = 51.36   max = 51.36   avg = 51.36
yolov4                        min = 733.67  max = 733.67  avg = 733.67
```

NCNN does not seem to support yolov4-tiny yet.
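(For reference, these settings look like runs of NCNN's benchncnn tool; assuming the stock binary, a CPU-only run matching the settings above would be roughly:

```
# benchncnn arguments: [loop count] [num threads] [powersave] [gpu device] [cooling down]
# gpu_device = -1 selects CPU-only inference
./benchncnn 1 4 0 -1 0
```
)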

@AlexeyAB (Owner)

Thanks!

> NCNN does not seem to support yolov4-tiny yet.

It was implemented 2 hours ago: Tencent/ncnn@0bc45ee

> loop_count = 1
> num_threads = 4
> powersave = 0
> gpu_device = -1
> cooling_down = 0

Did you try `gpu_device = 0`?

@dog-qiuqiu (Author)

dog-qiuqiu commented Jun 30, 2020

OK!

```
loop_count = 1
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 33.14   max = 33.14   avg = 33.14
yolov3-tiny-prn               min = 37.15   max = 37.15   avg = 37.15
yolov3-tiny                   min = 58.39   max = 58.39   avg = 58.39
yolov4                        min = 781.29  max = 781.29  avg = 781.29
```

As far as I know, the Mali GPU has no efficiency advantage over the ARM CPU, at least on my Kirin 990, though Qualcomm GPUs may give an improvement.
You could try MNN's arm82 backend; in theory it is about twice as fast as NCNN without arm82 support.

@AlexeyAB (Owner)

Yes, it seems this GPU doesn't improve speed.
Try yolov4-tiny.

@dog-qiuqiu (Author)

yolov4-tiny results:

```
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 0
MobileNetV2-YOLOv3-Lite-coco  min = 35.15   max = 35.65   avg = 35.43
yolov3-tiny-prn               min = 38.83   max = 39.16   avg = 38.96
yolov3-tiny                   min = 52.38   max = 53.01   avg = 52.74
yolov4-tiny                   min = 51.23   max = 51.64   avg = 51.42
yolov4                        min = 779.41  max = 791.94  avg = 785.52
```

@AlexeyAB (Owner)

@dog-qiuqiu
Thanks!
So that is about 20 FPS (51.42 ms avg) at 40.2% AP50 (COCO) for yolov4-tiny.cfg on the Kirin 990 CPU (ARM) in the Huawei P40.

So you could try to improve yolov4-tiny in the same way as MobileNetV2-YOLOv3-Lite/Nano/Fastest, or just add groups= to the [convolutional] layers and maybe SE-blocks.
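For illustration, here is a minimal sketch of that idea in darknet cfg syntax: a plain 3×3 convolution replaced by a depthwise 3×3 (groups equal to the input channels) plus a 1×1 pointwise convolution. The channel count of 128 is a made-up example, not taken from yolov4-tiny.cfg:

```
# hypothetical depthwise-separable replacement for one 3x3 conv
# (assumes the previous layer outputs 128 channels)

# 3x3 depthwise: groups == input channels
[convolutional]
batch_normalize=1
filters=128
groups=128
size=3
stride=1
pad=1
activation=leaky

# 1x1 pointwise conv mixes the channels
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
```

This is the same pattern MobileNetV2 is built on; whether it actually speeds things up depends on how well the target framework (NCNN, MNN, ...) optimizes grouped convolutions.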

@AlexeyAB (Owner)

AlexeyAB commented Jun 30, 2020

https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3

> Darknet group convolution is not well supported on some GPUs such as NVIDIA Pascal!!! The MobileNetV2-YOLOv3-SPP inference time is 100 ms on a GTX 1080 Ti, but the RTX 2080 inference time is 5 ms!!!

I think such a big difference (100 ms vs 5 ms) is due to different cuDNN versions or something else (e.g. one build compiled with CUDNN=1 and the other with CUDNN=0).

Also, about groups=:
Tensor Cores on Volta/RTX are used only if the conv layer has no groups parameter (or groups=1), so for groups > 1 the same regular CUDA cores (shaders) are used, at roughly the same speed:

```c
// darknet's condition for taking the cuDNN half-precision (Tensor Core) path:
// it requires groups <= 1, channels and filters divisible by 8, and kernel size > 1
if (state.index != 0 && state.net.cudnn_half && !l.xnor &&
    (!state.train || (iteration_num > 3 * state.net.burn_in) && state.net.loss_scale != 1) &&
    (l.c / l.groups) % 8 == 0 && l.n % 8 == 0 && l.groups <= 1 && l.size > 1)
```

Darknet/TF/PyTorch/... all rely on the same grouped-convolution implementation from the cuDNN library.

@dog-qiuqiu (Author)

I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!!!

@AlexeyAB (Owner)

AlexeyAB commented Jul 1, 2020

@dog-qiuqiu Hi, did you try to test yolov4-tiny.cfg and MobileNetV2-YOLOv3-Lite-coco on a Raspberry Pi 3/4?

@dog-qiuqiu (Author)

@AlexeyAB Okay, I have a Raspberry Pi 3B; I will run the timing benchmark on it.

@AlexeyAB (Owner)

@dog-qiuqiu

> I will try to improve yolov4-tiny with depthwise separable convolutions. Thank you for your work!!!

> Okay, I have a Raspberry Pi 3B; I will run the timing benchmark on it.

Hi, did you have any success with it?

@dog-qiuqiu (Author)

dog-qiuqiu commented Jul 10, 2020

@AlexeyAB Sorry, my Raspberry Pi 3 is missing an SD card, so I plan to buy one on Saturday and then run the Raspberry Pi 3 benchmark. In the meantime I can already run MobileNetV2-YOLOv3-Nano on Android in real time, and I plan to port yolov4-tiny to Android to run in real time as well. This is the Android project: https://github.com/dog-qiuqiu/MobileNetv2-YOLOV3#ncnn-android-sample

@dog-qiuqiu (Author)

@AlexeyAB Hi, this is a real-time detection Android project based on NCNN's yolov4-tiny: https://github.com/dog-qiuqiu/Android_NCNN_yolov4-tiny

@AlexeyAB (Owner)

@dog-qiuqiu Nice!

@AlexeyAB (Owner)

It seems a Raspberry Pi 4 (4 threads) can process yolov4-tiny (int8, 416x416) at about 4 FPS using TFLite: https://github.com/PINTO0309/PINTO_model_zoo#3-tflite-model-benchmark

RaspberryPi4 + Ubuntu 19.10 aarch64 + 4 threads + yolov4_tiny_voc_416x416_integer_quant.tflite benchmark (avg = 243522 µs ≈ 4.1 FPS):

```
Timings (microseconds): count=50 first=233307 curr=233318 min=232446 max=364068 avg=243522 std=33354
```
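That output format matches TFLite's benchmark_model tool; a typical invocation would look something like the sketch below (the binary location depends on how the tool was built; --graph and --num_threads are its standard flags):

```
# TFLite benchmark_model, run with the quantized model from the model zoo above
./benchmark_model \
    --graph=yolov4_tiny_voc_416x416_integer_quant.tflite \
    --num_threads=4
```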


Just interesting to compare TFLite with NCNN.

@Lowell-IC

[attached image: diagram comparing the original convolution block (left) with a modified one (right)]
@AlexeyAB @dog-qiuqiu
Hello! I am sorry to bother you.
I want to ask: if I change the left structure in the picture into the right one, is that a depthwise convolution?
The answer is very important to me.
Looking forward to your reply.
Thanks a lot.

@LYH-depth

@Lowell-IC Brother, did you get your answer?
