
Inference Time Issue #66

Open
cathy-kim opened this issue Feb 11, 2019 · 7 comments
Comments


cathy-kim commented Feb 11, 2019

@Robert-JunWang Hi, thanks for your work.

With the merged Caffe model I only get 48 fps on a TX2 with TensorRT 4.1.4, which is slower than MobileNet-SSD (about 54 fps). I've already optimized my TX2 with jetson_clocks.sh, and I think I have already done what you suggested in issue #43.

Would you tell me how you reached 70+ fps?
Thanks

@Shreeyak

@Robert-JunWang Could you also tell us how you got ~100 fps with YOLOv3-tiny? I'm running YOLOv3-tiny-320 in TensorFlow, without TensorRT, and I'm only getting ~12 fps. Clocks are maxed on my TX2. I don't understand the 10x performance gap!

@Robert-JunWang
Owner

> @Robert-JunWang Hi, thanks for your work.
>
> With the merged Caffe model I only get 48 fps on a TX2 with TensorRT 4.1.4, which is slower than MobileNet-SSD (about 54 fps). I've already optimized my TX2 with jetson_clocks.sh, and I think I have already done what you suggested in issue #43.
>
> Would you tell me how you reached 70+ fps?
> Thanks

That speed does not include the post-processing part (decoding bounding boxes and NMS). The post-processing can be done on the CPU asynchronously, so the real end-to-end speed is almost the same as the one I reported. Both MobileNet-SSD and Pelee run at over 70 FPS in FP32 mode.
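
For anyone looking for the general pattern: below is a minimal sketch of overlapping CPU post-processing with GPU inference on a worker thread. `decodeAndNms`, the buffer size, and the commented-out TensorRT calls are placeholders, not code from this repository.

```cpp
// Sketch only: overlap CPU post-processing (box decoding + NMS) with GPU inference.
// decodeAndNms() and the raw output layout are hypothetical placeholders.
#include <future>
#include <utility>
#include <vector>

struct Detection { float x1, y1, x2, y2, score; int label; };

// Placeholder for the CPU-side decode + NMS step.
std::vector<Detection> decodeAndNms(std::vector<float> rawOutput) {
    std::vector<Detection> dets;
    // ... decode box offsets against priors, apply confidence threshold, run NMS ...
    return dets;
}

int main() {
    std::future<std::vector<Detection>> pending;  // post-processing of the previous frame

    for (int frame = 0; frame < 100; ++frame) {
        // 1. Enqueue inference for this frame on the GPU (asynchronous), e.g.
        //    context->enqueue(batchSize, buffers, stream, nullptr);
        //    then copy the raw network output to the host and sync the stream.
        std::vector<float> rawOutput(1000);  // size is model-dependent; placeholder

        // 2. While the GPU was busy, finish the previous frame's CPU work.
        if (pending.valid()) {
            std::vector<Detection> previous = pending.get();
            (void)previous;  // ... draw / report detections ...
        }

        // 3. Hand the current frame's raw output to a CPU worker thread.
        pending = std::async(std::launch::async, decodeAndNms, std::move(rawOutput));
    }
    if (pending.valid()) pending.get();
    return 0;
}
```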

@Robert-JunWang
Owner

> @Robert-JunWang Could you also tell us how you got ~100 fps with YOLOv3-tiny? I'm running YOLOv3-tiny-320 in TensorFlow, without TensorRT, and I'm only getting ~12 fps. Clocks are maxed on my TX2. I don't understand the 10x performance gap!

I created a Caffe model of Tiny YOLOv3 myself and tested its speed with random weights. That speed also does not include the post-processing part. The input dimension is 416, not 320. The only difference between my model and the original paper is that I use ReLU instead of leaky ReLU, but I do not think this makes much difference in speed. Tiny YOLOv3 benefits from FP16 inference as well: in FP16 mode the model is about 1.8 to 2 times faster than in FP32 mode.
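
For reference, here is a minimal sketch of enabling FP16 at engine build time, assuming a TensorRT 4/5-era `IBuilder` API; network parsing and construction are omitted.

```cpp
// Sketch: enable FP16 inference with the TensorRT 4/5-era IBuilder API.
// Network parsing/construction is omitted; `builder` and `network` are assumed to exist.
#include "NvInfer.h"

nvinfer1::ICudaEngine* buildEngine(nvinfer1::IBuilder* builder,
                                   nvinfer1::INetworkDefinition* network) {
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28);  // 256 MB; adjust to what the TX2 can spare

    // Only request FP16 if the platform (e.g. TX2) has fast FP16 support.
    if (builder->platformHasFastFp16()) {
        builder->setFp16Mode(true);
    }
    return builder->buildCudaEngine(*network);
}
```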

I have never compared the speed of TensorFlow and TensorRT, but I do not think there is a 10x gap between the two frameworks. You can remove the pre-processing and post-processing parts of your pipeline and see whether that accounts for the difference.

@Shreeyak

Oh, thank you for the explanations, that makes a lot more sense now! I should also look at how to do the post-processing asynchronously. Would you happen to have a repo/post/example of how to do that?

Would you happen to have any FPS benchmarks that include the post-processing?

@dbellan

dbellan commented Feb 28, 2019

@ginn24 Could you please tell me how you defined the detection_out layer plugin?

I populate the plugin factory with:

```cpp
mDetection_out = std::unique_ptr<INvPlugin, decltype(nvPluginDeleter)>(
    createSSDDetectionOutputPlugin(params), nvPluginDeleter);
```

but while building the engine I get the following error:

```
NvPluginSSD.cu:795 virtual void nvinfer1::plugin::DetectionOutput::configure(const nvinfer1::Dims*, int, const nvinfer1::Dims*, int, int): Assertion `numPriors*numLocClasses*4 == inputDims[param.inputOrder[0]].d[0]' failed.
```

Usually this error is due to a wrong layer name or a wrong params.inputOrder, but both look correct to me. I suspect it is related to how I created the plugin. May I ask how you did it?
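
For reference, the parameter struct that `createSSDDetectionOutputPlugin` takes in the TensorRT 4/5-era plugin API looks roughly like the sketch below. The concrete values (`numClasses`, thresholds, `inputOrder`) are illustrative assumptions for an SSD-style model, not settings confirmed in this thread; the assertion above typically fires when `inputOrder` or `numClasses` does not match the loc/conf/priorbox blobs the plugin actually receives.

```cpp
// Sketch: populating DetectionOutputParameters for the legacy INvPlugin
// factory (TensorRT 4/5-era NvInferPlugin API). All concrete values are
// illustrative assumptions, not taken from this repository.
#include "NvInferPlugin.h"

nvinfer1::plugin::DetectionOutputParameters makeDetectionOutputParams() {
    nvinfer1::plugin::DetectionOutputParameters params{};
    params.shareLocation = true;
    params.varianceEncodedInTarget = false;
    params.backgroundLabelId = 0;
    params.numClasses = 21;      // e.g. 20 classes + background (assumption)
    params.topK = 400;
    params.keepTopK = 200;
    params.confidenceThreshold = 0.01f;
    params.nmsThreshold = 0.45f;
    params.codeType = nvinfer1::plugin::CodeTypeSSD::CENTER_SIZE;
    // inputOrder must match the order in which the loc / conf / priorbox
    // tensors are wired into the plugin layer; a mismatch here (or a wrong
    // numClasses) is what triggers the NvPluginSSD.cu assertion.
    params.inputOrder[0] = 0;    // loc data
    params.inputOrder[1] = 1;    // conf data
    params.inputOrder[2] = 2;    // priorbox data
    params.confSigmoid = false;
    params.isNormalized = true;
    return params;
}
```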

@cathy-kim
Author

cathy-kim commented Mar 5, 2019

@dbellan I just uploaded my Pelee-TensorRT code. You can check it out here:
https://github.com/ginn24/Pelee-TensorRT

This version of the code visualizes the detection_out results; it does not include code for measuring inference time. If you need to measure inference time, you should add GPU timing yourself.
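
A minimal sketch of GPU-side timing with CUDA events, assuming an existing TensorRT execution context, device bindings, and stream; the names below are placeholders.

```cpp
// Sketch: measuring GPU inference time with CUDA events, assuming an
// existing TensorRT execution context, device bindings, and stream.
#include "NvInfer.h"
#include <cuda_runtime.h>

float timeInferenceMs(nvinfer1::IExecutionContext* context,
                      void** bindings, cudaStream_t stream, int batchSize) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    context->enqueue(batchSize, bindings, stream, nullptr);  // network forward pass only
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds between the two events

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  // excludes host-side pre/post-processing
}
```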

cathy-kim reopened this Mar 5, 2019
@dbellan

dbellan commented Mar 7, 2019

Thank you @ginn24, I'll have a look.
