Optimize NMS #14290

ptrendx · 2019-02-28T23:34:43Z

Description

The current Box NMS forward path is very slow on the GPU because it launches potentially thousands of very tiny kernels, each with exposed launch latency. This PR introduces a new kernel for this slow part of the NMS op, reducing the time significantly.

On a single GPU, training Faster RCNN from the GluonCV model zoo, the time to perform this portion of NMS improved by roughly 50x, from 100 ms to 2.1 ms.
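For readers unfamiliar with the op, here is a minimal plain-Python sketch of greedy box NMS. It is a reference illustration only, not the CUDA kernel from this PR. The function names (`iou`, `greedy_nms`), the corner box format `(x1, y1, x2, y2)`, and the default threshold are assumptions made for the example. The outer loop takes one step per candidate box, so an implementation that issues separate GPU work for each step can plausibly end up with a very large number of tiny launches.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, overlap_thresh=0.5):
    """Return indices of the kept boxes, highest score first."""
    # Visit candidates in order of descending score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    suppressed = [False] * len(boxes)
    keep = []
    for pos, i in enumerate(order):
        if suppressed[i]:
            continue
        keep.append(i)
        # Suppress every lower-scored box that overlaps the kept box too much.
        for j in order[pos + 1:]:
            if not suppressed[j] and iou(boxes[i], boxes[j]) > overlap_thresh:
                suppressed[j] = True
    return keep
```

Each outer iteration depends on the suppression flags written by earlier ones, which is why the procedure is awkward to parallelize naively.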

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

The GPU path of the Box NMS forward op now uses the new optimized kernel instead of the previous nms_impl kernel

Comments

While this was the biggest problem with the NMS op, there are others, now exposed because the biggest offender got significantly faster. Namely: using thrust for sorting introduces syncs, allocations, and deallocations in the op; and there are 2 Slice operations that, in the case of FRCNN, together take ~18 ms (20% of the end-to-end time in FP16 training). I don't currently have time to try to fix them, though :-(. @zhreshold FYI
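As a rough back-of-the-envelope reading of the Slice numbers, the sketch below applies Amdahl's law. It assumes the ~18 ms and 20% figures come from the same end-to-end measurement; the function names are hypothetical and nothing here was measured beyond the two figures quoted in the comment.

```python
def implied_total_ms(part_ms, fraction):
    """End-to-end time implied by a part that takes `fraction` of the total."""
    return part_ms / fraction


def amdahl_speedup(fraction, part_speedup):
    """Overall speedup when `fraction` of the runtime becomes `part_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


slice_ms = 18.0        # combined time of the 2 Slice operations (from the comment)
slice_fraction = 0.20  # their share of end-to-end time (from the comment)

# Implied end-to-end time: 18 / 0.20 = 90 ms.
total_ms = implied_total_ms(slice_ms, slice_fraction)

# Upper bound if the Slice operations cost nothing at all: 1 / 0.8 = 1.25x.
best_case = amdahl_speedup(slice_fraction, float("inf"))
```

Under these assumptions, removing the two Slice operations entirely would cap the end-to-end gain at about 1.25x.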

src/operator/contrib/bounding_box-inl.cuh

zhreshold · 2019-03-01T23:21:14Z

@pengzhao-intel @ZhennanQin This is the fix I mentioned for NMS perf on GPU

zhreshold · 2019-03-02T04:54:47Z

lgtm, I will leave it open for 24hr before merging this in case others have more comments

ptrendx · 2019-03-04T18:44:02Z

Just as an update - I trained FasterRCNN from GluonCV to completion with this new kernel and got the same accuracy as reported in GluonCV (37/57.8/39.9 mAP).

zhreshold · 2019-03-05T00:22:02Z

Awesome, I am merging this now

* Optimize NMS * Fix lint

ptrendx added 2 commits February 28, 2019 15:15

Optimize NMS

dc9d2f2

Fix lint

0cbf3fc

wkcn requested a review from zhreshold March 1, 2019 01:41

wkcn added the pr-awaiting-review (PR is waiting for code review) label Mar 1, 2019

zhreshold reviewed Mar 1, 2019

zhreshold approved these changes Mar 2, 2019

zhreshold merged commit 780bddc into apache:master Mar 5, 2019

ptrendx deleted the pr_nms_apply branch March 5, 2019 00:24

zhreshold mentioned this pull request Mar 7, 2019

add backgroud class in box_nms #14058

Merged

7 tasks

vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019

Optimize NMS (apache#14290)

cf74099

* Optimize NMS * Fix lint

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Optimize NMS (apache#14290)

fc1b5f5

* Optimize NMS * Fix lint
