Performance YOLOv7 vs YOLOv9 Series using TensorRT engine #143

Closed
levipereira opened this issue Mar 1, 2024 · 19 comments

Comments

@levipereira

levipereira commented Mar 1, 2024

Performance test using an NVIDIA RTX 4090 GPU on an AMD Ryzen 7 3700X 8-core / 16 GB RAM

Model Performance using TensorRT engine

All models were sourced from the original repository and subsequently converted to ONNX format with dynamic batching enabled. Profiling was conducted using TensorRT Engine Explorer (TREx).

Detailed reports will be made available in the coming days, providing comprehensive insights into the performance metrics and optimizations achieved.

All models were converted (re-parameterized) and optimized for inference.
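
For context, a rough sketch of what an ONNX export with dynamic batching looks like. This is illustrative only; the actual exports were done with the repository's export script, and the checkpoint key, input size, and tensor names below are assumptions:

```python
import torch

# Load the re-parameterized checkpoint (assumes a yolov5-style dict with a "model" key;
# unpickling requires the YOLOv9 repo modules on the Python path).
ckpt = torch.load("yolov9-c-converted.pt", map_location="cpu")
model = ckpt["model"].float().eval()

dummy = torch.zeros(1, 3, 640, 640)  # NCHW dummy input at 640x640
torch.onnx.export(
    model,
    dummy,
    "yolov9-c-converted.onnx",
    opset_version=12,
    input_names=["images"],
    output_names=["output"],
    # Mark the batch dimension as dynamic so the TensorRT engine
    # can later be built with dynamic batching enabled.
    dynamic_axes={"images": {0: "batch"}, "output": {0: "batch"}},
)
```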

TensorRT version: 8.6.1

Device Properties:

  • Selected Device: NVIDIA GeForce RTX 4090
  • Compute Capability: 8.9
  • SMs: 128
  • Compute Clock Rate: 2.58 GHz
  • Device Global Memory: 24208 MiB
  • Shared Memory per SM: 100 KiB
  • Memory Bus Width: 384 bits
  • Memory Clock Rate: 10.501 GHz

YOLO v7 vs v9 Series Models Performance Results

  • Average time: Represents the total sum of layer latencies when profiling layers individually.
  • Latency: Refers to the minimum, maximum, mean, median, and 99th percentile of the engine latency measurements, captured without profiling layers.
  • Throughput: Measured in inferences per second (IPS).

Performance Summary Tables

Throughput and Average Time

| Model Name | Throughput (IPS) | Average Time (ms) |
| --- | --- | --- |
| YOLOv7   | 978 | 1.441 |
| YOLOv7x  | 609 | 2.065 |
| YOLOv9-c | 798 | 2.049 |
| YOLOv9-e | 353 | 4.261 |

Latency Summary

| Model Name | Min Latency (ms) | Max Latency (ms) | Mean Latency (ms) | Median Latency (ms) | 99th Percentile Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| YOLOv7   | 1.012 | 1.104 | 1.020 | 1.018 | 1.024 |
| YOLOv7x  | 1.613 | 1.751 | 1.640 | 1.636 | 1.664 |
| YOLOv9-c | 1.246 | 1.359 | 1.251 | 1.250 | 1.251 |
| YOLOv9-e | 2.807 | 3.032 | 2.823 | 2.814 | 2.817 |

Full Report
https://github.com/levipereira/triton-server-yolo/tree/master/perfomance

@WongKinYiu
Owner

WongKinYiu commented Mar 1, 2024

@levipereira
Thanks for providing the TRT performance reports.
I noticed that you used yolov9-c.pt for exporting and testing performance.
Actually, yolov9-c.pt contains the PGI auxiliary branch, which can be removed at the inference stage.
Could you help by using yolov9-c-converted.pt and yolov9-e-converted.pt to get more performance reports?
Their architectures are the same as those of gelan-c.pt and gelan-e.pt, respectively.

The converted weights are provided here:
yolov9-c-converted.pt
yolov9-e-converted.pt

@WongKinYiu
Owner

https://github.com/levipereira/yolov9/blob/main/models/experimental.py#L140

output = output[0] gets the output of the auxiliary branch.
output = output[1] gets the output of the main branch, which is the correct one.
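
To illustrate, a minimal sketch of keeping only the main branch at inference time. This is a hypothetical helper, not the repository's code; it only assumes the forward pass returns the auxiliary and main outputs in that order, as described above:

```python
import torch

def select_main_branch(model: torch.nn.Module, img: torch.Tensor):
    """Run a dual-branch YOLOv9 model and keep only the main-branch predictions."""
    output = model(img)
    # output[0] -> PGI auxiliary branch (discard at inference time)
    # output[1] -> main branch (the correct predictions to keep / export)
    return output[1]
```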

@levipereira
Author

https://github.com/levipereira/yolov9/blob/main/models/experimental.py#L140

output = output[0] gets the output of the auxiliary branch. output = output[1] gets the output of the main branch, which is the correct one.

#130 (comment)

@laugh12321

laugh12321 commented Mar 3, 2024

Performance test using an RTX 2080 Ti 2GB GPU on an AMD Ryzen 7 5700X 8-core / 128 GB RAM

All models were converted to ONNX models with the EfficientNMS plugin. The conversion was done using the TensorRT-YOLO tool, with the trtyolo CLI installed via `pip install tensorrt-yolo==3.0.1`. The batch size is 1 and the image size is 640.

Model Export and Performance Testing

Use the following commands to export the model and perform performance testing with trtexec:

```bash
trtyolo export -v yolov9 -w yolov9-converted.pt --imgsz 640 -o ./
trtexec --onnx=yolov9-converted.onnx --saveEngine=yolov9-converted.engine --fp16
trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov9-converted.engine
```

Performance testing was conducted with TensorRT-YOLO inference on the coco128 dataset, as sketched below.
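
As a reference, here is a minimal sketch of how a Python-side mean-latency figure like those in the tables below can be measured. The `infer` callable and the image directory are hypothetical stand-ins for the real inference entry point and dataset path (this is not the TensorRT-YOLO API):

```python
import time
from pathlib import Path

import cv2  # assumes opencv-python is installed

def mean_latency_ms(infer, image_dir: str, warmup: int = 10) -> float:
    """Average per-image inference latency in milliseconds over a directory of images."""
    images = [cv2.imread(str(p)) for p in sorted(Path(image_dir).glob("*.jpg"))]
    for img in images[:warmup]:
        infer(img)  # warm-up runs so CUDA initialization does not skew the timing
    start = time.perf_counter()
    for img in images:
        infer(img)
    return (time.perf_counter() - start) / len(images) * 1000.0
```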

YOLOv9 Series

All values below are mean latency in milliseconds.

| Tool | YOLOv9-T-Converted | YOLOv9-S-Converted | YOLOv9-M-Converted | YOLOv9-C-Converted | YOLOv9-E-Converted |
| --- | --- | --- | --- | --- | --- |
| trtexec (infer) | 3.51857 | 3.67899 | 4.19460 | 4.25964 | 8.95429 |
| TensorRT-YOLO Python (infer) | 10.19576 | 10.15226 | 9.29918 | 9.60093 | 21.85042 |
| TensorRT-YOLO C++ (pre + infer) | 3.44162 | 3.66080 | 4.10519 | 4.12471 | 8.98964 |

| Tool | Gelan-S2 | Gelan-S | Gelan-M | Gelan-C | Gelan-E |
| --- | --- | --- | --- | --- | --- |
| trtexec (infer) | 3.42082 | 3.78578 | 4.16447 | 4.27485 | 8.91479 |
| TensorRT-YOLO Python (infer) | 9.96435 | 10.35934 | 9.14044 | 9.33843 | 21.42764 |
| TensorRT-YOLO C++ (pre + infer) | 3.60857 | 3.93528 | 4.25084 | 4.35533 | 9.23654 |

YOLOv8 Series

All values are mean latency in milliseconds.

| Tool | YOLOv8n | YOLOv8s | YOLOv8m | YOLOv8l | YOLOv8x |
| --- | --- | --- | --- | --- | --- |
| trtexec (infer) | 1.90273 | 2.34166 | 3.58595 | 4.83306 | 7.12179 |
| TensorRT-YOLO Python (infer) | 7.03217 | 7.52751 | 8.75298 | 10.56914 | 12.45605 |
| TensorRT-YOLO C++ (pre + infer) | 2.02848 | 2.15021 | 3.57631 | 4.78318 | 6.96686 |

@levipereira
Author

Hi @WongKinYiu,
I apologize for the delay in responding; my work has been taking up a lot of my time. I'm deeply involved in assessing the performance of YOLOv9 and have gathered some valuable performance data comparing YOLOv9 to YOLOv7. I'll be sharing these findings, along with a more detailed report that highlights the differences accurately, in the next few days.

The original post included results influenced by variables that shouldn't have been part of measuring the model's performance, which is why I revised it.

@levipereira levipereira changed the title Performance YOLOv7 vs YOLOv9-C vs YOLOv9-E over TensorRT engine Performance YOLOv7 vs YOLOv9 TensorRT engine Mar 4, 2024
@levipereira levipereira changed the title Performance YOLOv7 vs YOLOv9 TensorRT engine Performance YOLOv7 vs YOLOv9 Series using TensorRT engine Mar 4, 2024
@WongKinYiu
Owner

@laugh12321

Could you help by testing the speed of yolov9-t-converted.pt, yolov9-s-converted.pt, and yolov9-m-converted.pt?

Thanks.

@laugh12321

laugh12321 commented Jun 6, 2024

@laugh12321

Could you help by testing the speed of yolov9-t-converted.pt, yolov9-s-converted.pt, and yolov9-m-converted.pt?

Thanks.

@WongKinYiu Should I use trtexec or TensorRT-YOLO to test the model speed with the NMS plugin?

@WongKinYiu
Owner

Same testing method as the table in #143 (comment).
Were those results tested with the NMS plugin?
Thanks.

@laugh12321

@WongKinYiu Yes, those results were tested with the NMS plugin. In #143 (comment), we performed the performance testing using the Python code of TensorRT-YOLO. We noticed that the Python results showed somewhat lower performance (higher latency) than the tests conducted with the C++ code and the trtexec tool. To provide a more comprehensive comparison, we will conduct separate performance tests using the TensorRT-YOLO Python API, the TensorRT-YOLO C++ API, and the trtexec tool.

@WongKinYiu
Owner

If it won't bother you too much, conducting performance tests using different protocols would be nice.

@laugh12321

@WongKinYiu Update at #143 (comment)

@WongKinYiu
Owner

Thanks.

It seems you got similar results to @levipereira:
yolov9-m has a similar speed to yolov9-c,
and yolov9 t/s/m are very slow with TensorRT-YOLO Python.

@WongKinYiu
Owner

Could you help by testing gelan-s2.pt too?
Thanks.

@laugh12321

Could you help by testing gelan-s2.pt too? Thanks.

@WongKinYiu Update at #143 (comment)

@WongKinYiu
Owner

Thanks.

By the way, gelan-s2.pt is different from gelan-s.pt:
gelan-s2 stacks 2 blocks in its CSP modules, while gelan-s stacks 3.

@laugh12321

Thanks.

By the way, gelan-s2.pt is different from gelan-s.pt: gelan-s2 stacks 2 blocks in its CSP modules, while gelan-s stacks 3.

@WongKinYiu Thank you very much for your reminder. I overlooked gelan-s2.pt and will update it shortly. Thanks again for your correction!

@laugh12321

Thanks.
By the way, gelan-s2.pt is different from gelan-s.pt: gelan-s2 stacks 2 blocks in its CSP modules, while gelan-s stacks 3.

@WongKinYiu Thank you very much for your reminder. I overlooked gelan-s2.pt and will update it shortly. Thanks again for your correction!

@WongKinYiu Update at #143 (comment)

@WongKinYiu
Owner

Thanks.

@agentfuzzy

Hi, I was able to run at ~36 FPS on an NVIDIA Xavier AGX using yolov9-c-converted exported to a TensorRT engine with FP16 inference and onnxsim. Very impressive!
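
For anyone wanting to reproduce a similar export, a rough sketch of simplifying the ONNX model and building an FP16 TensorRT engine via the Python API. The file paths are illustrative, this assumes yolov9-c-converted has already been exported to ONNX, and the deployment above may just as well have used trtexec:

```python
import onnx
import tensorrt as trt
from onnxsim import simplify

ONNX_PATH = "yolov9-c-converted.onnx"      # illustrative path
ENGINE_PATH = "yolov9-c-converted.engine"  # illustrative path

# 1. Simplify the ONNX graph (what onnxsim does).
model = onnx.load(ONNX_PATH)
model_simplified, ok = simplify(model)
assert ok, "onnxsim failed to validate the simplified model"
onnx.save(model_simplified, ONNX_PATH)

# 2. Build a TensorRT engine with FP16 enabled.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 inference
engine_bytes = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine_bytes)
```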
