Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Optimize NMS #14290

Merged
merged 2 commits into from
Mar 5, 2019
Merged

Optimize NMS #14290

merged 2 commits into from
Mar 5, 2019

Conversation

ptrendx
Copy link
Member

@ptrendx ptrendx commented Feb 28, 2019

Description

Current Box NMS forward path is very slow on the GPU, because it launches potentially thousands very tiny kernels with exposed launch latency. This PR introduces a new kernel for this slow part of the NMS op, reducing the time significantly.

On single GPU, training Faster RCNN from GluonCV model zoo, the time to perform this portion of NMS improved by 50x, from 100 ms to 2.1 ms.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • GPU path of Box NMS forward op changed from previous nms_impl kernel to the new optimized kernel

Comments

  • While this was the biggest problem with NMS op, there are other ones, now exposed because the biggest offender got significantly faster. Namely: using thrust for sorting introduces syncs, allocations and deallocations in the op; there are 2 Slice operations that in the case of FRCNN take together ~18ms (which is 20% of the end to end time in FP16 training). I don't have time currently to try to fix them though :-(. @zhreshold FYI

@wkcn wkcn requested a review from zhreshold March 1, 2019 01:41
@wkcn wkcn added the pr-awaiting-review PR is waiting for code review label Mar 1, 2019
@zhreshold
Copy link
Member

@pengzhao-intel @ZhennanQin This is the fix I mentioned to fix nms perf on GPU

@zhreshold
Copy link
Member

lgtm, I will leave it open for 24hr before merging this in case others have more comments

@ptrendx
Copy link
Member Author

ptrendx commented Mar 4, 2019

Just as an update - I trained FasterRCNN from GluonCV to completion with this new kernel and got the same accuracy as reported in GluonCV (37/57.8/39.9 mAP).

@zhreshold
Copy link
Member

Awesome, I am merging this now

@zhreshold zhreshold merged commit 780bddc into apache:master Mar 5, 2019
@ptrendx ptrendx deleted the pr_nms_apply branch March 5, 2019 00:24
@zhreshold zhreshold mentioned this pull request Mar 7, 2019
7 tasks
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Optimize NMS

* Fix lint
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Optimize NMS

* Fix lint
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants