
Vectorize RetinaNet's postprocessing #2828

Merged Oct 20, 2020 (8 commits)

Conversation

@datumbox (Contributor) commented Oct 16, 2020

This PR speeds up RetinaNet's postprocess_detections() method by vectorizing its operations (#2799). The implementation is based on @ppwwyyxx's great work in detectron2 and was possible thanks to @fmassa's guidance. Please note that there are breaking changes in the post-processing behaviour because of the way we clip the candidates before NMS.

Benchmark (100 iterations) across different images:

  • Before: 12.94 sec
  • After: 5.47 sec

To measure the speed we follow the same approach as in #2819.
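The benchmark script itself is not reproduced in this thread; as a rough illustration only, a minimal timing harness in the spirit of the approach above might look like this (the workload and names below are stand-ins, not the actual #2819 code):

```python
# Hypothetical timing sketch: time a post-processing-style function over
# repeated iterations across several inputs, as the benchmark above does.
import random
import time

def benchmark(fn, inputs, iterations=100):
    """Run fn over every input `iterations` times and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        for x in inputs:
            fn(x)
    return time.perf_counter() - start

# Stand-in workload: selecting the top-scoring 1000 entries from score lists.
inputs = [[random.random() for _ in range(10_000)] for _ in range(5)]
elapsed = benchmark(lambda scores: sorted(scores)[-1000:], inputs, iterations=10)
print(f"{elapsed:.2f} sec")
```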

To examine any effects on the accuracy and performance of the model, we compared master vs branch on the COCO dataset.

Complete output of the two runs:
Master:
Test:  [   0/2500]  eta: 0:49:18  model_time: 0.4481 (0.4481)  evaluator_time: 0.0077 (0.0077)  time: 1.1834  data: 0.7266  max mem: 532
Test:  [ 100/2500]  eta: 0:06:42  model_time: 0.1256 (0.1292)  evaluator_time: 0.0245 (0.0270)  time: 0.1594  data: 0.0029  max mem: 556
Test:  [ 200/2500]  eta: 0:06:16  model_time: 0.1308 (0.1291)  evaluator_time: 0.0136 (0.0268)  time: 0.1690  data: 0.0031  max mem: 557
Test:  [ 300/2500]  eta: 0:05:57  model_time: 0.1260 (0.1286)  evaluator_time: 0.0111 (0.0270)  time: 0.1497  data: 0.0030  max mem: 557
Test:  [ 400/2500]  eta: 0:05:40  model_time: 0.1199 (0.1286)  evaluator_time: 0.0208 (0.0275)  time: 0.1560  data: 0.0030  max mem: 557
Test:  [ 500/2500]  eta: 0:05:25  model_time: 0.1258 (0.1290)  evaluator_time: 0.0110 (0.0282)  time: 0.1703  data: 0.0030  max mem: 557
Test:  [ 600/2500]  eta: 0:05:11  model_time: 0.1288 (0.1296)  evaluator_time: 0.0285 (0.0287)  time: 0.1902  data: 0.0029  max mem: 557
Test:  [ 700/2500]  eta: 0:05:07  model_time: 0.1251 (0.1294)  evaluator_time: 0.0139 (0.0360)  time: 0.1552  data: 0.0031  max mem: 557
Test:  [ 800/2500]  eta: 0:04:48  model_time: 0.1280 (0.1295)  evaluator_time: 0.0168 (0.0347)  time: 0.1629  data: 0.0029  max mem: 557
Test:  [ 900/2500]  eta: 0:04:29  model_time: 0.1296 (0.1294)  evaluator_time: 0.0126 (0.0341)  time: 0.1640  data: 0.0029  max mem: 557
Test:  [1000/2500]  eta: 0:04:11  model_time: 0.1270 (0.1292)  evaluator_time: 0.0140 (0.0332)  time: 0.1569  data: 0.0031  max mem: 557
Test:  [1100/2500]  eta: 0:03:53  model_time: 0.1255 (0.1289)  evaluator_time: 0.0144 (0.0330)  time: 0.1648  data: 0.0029  max mem: 557
Test:  [1200/2500]  eta: 0:03:36  model_time: 0.1269 (0.1290)  evaluator_time: 0.0150 (0.0326)  time: 0.1580  data: 0.0029  max mem: 557
Test:  [1300/2500]  eta: 0:03:19  model_time: 0.1266 (0.1287)  evaluator_time: 0.0130 (0.0323)  time: 0.1539  data: 0.0030  max mem: 557
Test:  [1400/2500]  eta: 0:03:01  model_time: 0.1261 (0.1286)  evaluator_time: 0.0095 (0.0319)  time: 0.1493  data: 0.0032  max mem: 557
Test:  [1500/2500]  eta: 0:02:44  model_time: 0.1313 (0.1286)  evaluator_time: 0.0145 (0.0313)  time: 0.1747  data: 0.0030  max mem: 557
Test:  [1600/2500]  eta: 0:02:28  model_time: 0.1272 (0.1287)  evaluator_time: 0.0188 (0.0315)  time: 0.1795  data: 0.0030  max mem: 557
Test:  [1700/2500]  eta: 0:02:11  model_time: 0.1298 (0.1287)  evaluator_time: 0.0313 (0.0312)  time: 0.1781  data: 0.0030  max mem: 557
Test:  [1800/2500]  eta: 0:01:55  model_time: 0.1268 (0.1287)  evaluator_time: 0.0136 (0.0312)  time: 0.1607  data: 0.0029  max mem: 557
Test:  [1900/2500]  eta: 0:01:38  model_time: 0.1258 (0.1286)  evaluator_time: 0.0106 (0.0309)  time: 0.1488  data: 0.0028  max mem: 557
Test:  [2000/2500]  eta: 0:01:23  model_time: 0.1308 (0.1286)  evaluator_time: 0.0190 (0.0328)  time: 0.1726  data: 0.0030  max mem: 557
Test:  [2100/2500]  eta: 0:01:06  model_time: 0.1307 (0.1286)  evaluator_time: 0.0136 (0.0327)  time: 0.1755  data: 0.0031  max mem: 557
Test:  [2200/2500]  eta: 0:00:49  model_time: 0.1246 (0.1285)  evaluator_time: 0.0119 (0.0323)  time: 0.1521  data: 0.0030  max mem: 557
Test:  [2300/2500]  eta: 0:00:33  model_time: 0.1264 (0.1285)  evaluator_time: 0.0259 (0.0321)  time: 0.1619  data: 0.0029  max mem: 557
Test:  [2400/2500]  eta: 0:00:16  model_time: 0.1244 (0.1284)  evaluator_time: 0.0087 (0.0316)  time: 0.1432  data: 0.0029  max mem: 557
Test:  [2499/2500]  eta: 0:00:00  model_time: 0.1247 (0.1284)  evaluator_time: 0.0118 (0.0314)  time: 0.1563  data: 0.0028  max mem: 557
Test: Total time: 0:06:51 (0.1646 s / it)
Averaged stats: model_time: 0.1247 (0.1284)  evaluator_time: 0.0118 (0.0302)
Accumulating evaluation results...
DONE (t=31.67s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.364
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.558
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.383
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.193
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.490
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.315
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.558
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.386
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.595
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.699

Vectorized b639ec0 (enhancement/retina_vectorized):
Test:  [   0/2500]  eta: 0:42:00  model_time: 0.4020 (0.4020)  evaluator_time: 0.0055 (0.0055)  time: 1.0083  data: 0.5995  max mem: 532
Test:  [ 100/2500]  eta: 0:04:27  model_time: 0.0730 (0.0796)  evaluator_time: 0.0214 (0.0215)  time: 0.1020  data: 0.0031  max mem: 556
Test:  [ 200/2500]  eta: 0:04:03  model_time: 0.0790 (0.0780)  evaluator_time: 0.0136 (0.0205)  time: 0.1014  data: 0.0030  max mem: 557
Test:  [ 300/2500]  eta: 0:03:48  model_time: 0.0726 (0.0773)  evaluator_time: 0.0112 (0.0203)  time: 0.0945  data: 0.0030  max mem: 557
Test:  [ 400/2500]  eta: 0:03:37  model_time: 0.0730 (0.0770)  evaluator_time: 0.0181 (0.0206)  time: 0.0993  data: 0.0031  max mem: 557
Test:  [ 500/2500]  eta: 0:03:26  model_time: 0.0761 (0.0769)  evaluator_time: 0.0120 (0.0208)  time: 0.1050  data: 0.0031  max mem: 557
Test:  [ 600/2500]  eta: 0:03:16  model_time: 0.0774 (0.0769)  evaluator_time: 0.0249 (0.0209)  time: 0.1135  data: 0.0029  max mem: 557
Test:  [ 700/2500]  eta: 0:03:04  model_time: 0.0763 (0.0769)  evaluator_time: 0.0142 (0.0207)  time: 0.0973  data: 0.0030  max mem: 557
Test:  [ 800/2500]  eta: 0:02:54  model_time: 0.0744 (0.0769)  evaluator_time: 0.0157 (0.0204)  time: 0.1016  data: 0.0029  max mem: 557
Test:  [ 900/2500]  eta: 0:02:43  model_time: 0.0814 (0.0769)  evaluator_time: 0.0122 (0.0204)  time: 0.1034  data: 0.0029  max mem: 557
Test:  [1000/2500]  eta: 0:02:33  model_time: 0.0735 (0.0770)  evaluator_time: 0.0143 (0.0204)  time: 0.0996  data: 0.0030  max mem: 557
Test:  [1100/2500]  eta: 0:02:22  model_time: 0.0741 (0.0768)  evaluator_time: 0.0138 (0.0204)  time: 0.1016  data: 0.0030  max mem: 557
Test:  [1200/2500]  eta: 0:02:12  model_time: 0.0755 (0.0770)  evaluator_time: 0.0148 (0.0205)  time: 0.1004  data: 0.0030  max mem: 557
Test:  [1300/2500]  eta: 0:02:05  model_time: 0.0767 (0.0770)  evaluator_time: 0.0138 (0.0231)  time: 0.1010  data: 0.0031  max mem: 557
Test:  [1400/2500]  eta: 0:01:54  model_time: 0.0756 (0.0770)  evaluator_time: 0.0100 (0.0228)  time: 0.0966  data: 0.0028  max mem: 557
Test:  [1500/2500]  eta: 0:01:44  model_time: 0.0786 (0.0770)  evaluator_time: 0.0138 (0.0225)  time: 0.1056  data: 0.0030  max mem: 557
Test:  [1600/2500]  eta: 0:01:33  model_time: 0.0774 (0.0770)  evaluator_time: 0.0153 (0.0224)  time: 0.1025  data: 0.0029  max mem: 557
Test:  [1700/2500]  eta: 0:01:23  model_time: 0.0759 (0.0770)  evaluator_time: 0.0261 (0.0223)  time: 0.1090  data: 0.0029  max mem: 557
Test:  [1800/2500]  eta: 0:01:12  model_time: 0.0758 (0.0770)  evaluator_time: 0.0145 (0.0223)  time: 0.1045  data: 0.0030  max mem: 557
Test:  [1900/2500]  eta: 0:01:02  model_time: 0.0791 (0.0770)  evaluator_time: 0.0111 (0.0222)  time: 0.0997  data: 0.0029  max mem: 557
Test:  [2000/2500]  eta: 0:00:51  model_time: 0.0773 (0.0770)  evaluator_time: 0.0176 (0.0221)  time: 0.1058  data: 0.0030  max mem: 557
Test:  [2100/2500]  eta: 0:00:41  model_time: 0.0769 (0.0770)  evaluator_time: 0.0120 (0.0220)  time: 0.1034  data: 0.0029  max mem: 557
Test:  [2200/2500]  eta: 0:00:31  model_time: 0.0762 (0.0770)  evaluator_time: 0.0116 (0.0219)  time: 0.0997  data: 0.0029  max mem: 557
Test:  [2300/2500]  eta: 0:00:20  model_time: 0.0757 (0.0769)  evaluator_time: 0.0197 (0.0218)  time: 0.1010  data: 0.0029  max mem: 557
Test:  [2400/2500]  eta: 0:00:10  model_time: 0.0779 (0.0769)  evaluator_time: 0.0093 (0.0215)  time: 0.0941  data: 0.0029  max mem: 557
Test:  [2499/2500]  eta: 0:00:00  model_time: 0.0736 (0.0769)  evaluator_time: 0.0117 (0.0214)  time: 0.0960  data: 0.0029  max mem: 557
Test: Total time: 0:04:17 (0.1029 s / it)
Averaged stats: model_time: 0.0736 (0.0771)  evaluator_time: 0.0117 (0.0205)
Accumulating evaluation results...
DONE (t=15.87s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.364
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.557
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.382
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.490
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.314
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.539
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.339
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.581
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.696

Important Notes:

  • There is indeed an improvement in terms of speed (close to 2x).
  • The amount of memory used remains the same.
  • The accuracy metrics are very close to those of the previous approach, though in some cases they are marginally lower.
  • I believe the key reason for the marginal decrease is the more aggressive clipping of candidates before the NMS step. The original implementation maintained up to 91 x 300 (num_classes x detections_per_img) candidates, while the new one keeps only 5 x 1000 (num_feature_levels x topk_candidates). Moreover, the new implementation further reduces the final predictions to at most 300 per image.
  • The above can be confirmed experimentally by relaxing the constraints and rerunning the validation. By setting topk_candidates=3000, the accuracy metrics of this branch become indistinguishably close to master without reducing the speed gains too much.
  • The default values of the thresholds in the implementation were chosen to be close to the RetinaNet paper.
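To make the candidate-count arithmetic in the notes concrete, here is an illustrative stdlib-only sketch (not the actual torchvision code; all names are hypothetical) of the per-level clipping performed before NMS:

```python
# Illustrative sketch of clipping candidates per feature level before NMS.
# `scores_per_level` stands in for the flattened classification scores of one
# feature level after sigmoid; the real implementation uses torch.topk.
import heapq

def clip_candidates(scores_per_level, topk_candidates=1000, score_thresh=0.05):
    """Keep at most topk_candidates indices per level, after score thresholding."""
    kept = [(s, i) for i, s in enumerate(scores_per_level) if s > score_thresh]
    # Top-k by score, mirroring torch.topk on the thresholded scores.
    return [i for s, i in heapq.nlargest(topk_candidates, kept)]

# With 5 feature levels, the pre-NMS pool holds at most
# num_feature_levels * topk_candidates = 5 * 1000 candidates, versus up to
# num_classes * detections_per_img = 91 * 300 in the old per-class loop.
levels = [[0.01, 0.9, 0.4, 0.03, 0.7] for _ in range(5)]
pool = [idx for scores in levels for idx in clip_candidates(scores, topk_candidates=2)]
print(len(pool))  # at most num_levels * topk_candidates
```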

@datumbox datumbox force-pushed the enhancement/retina_vectorized branch from f131bbe to 42d2661 Compare October 18, 2020 12:52
@datumbox datumbox changed the title [WIP] Vectorize RetinaNet's postprocessing Vectorize RetinaNet's postprocessing Oct 19, 2020
@fmassa (Member) left a comment:

PR looks great, thanks a lot @datumbox !

I've left a few comments which are mostly aesthetic, let me know what you think

Review threads on torchvision/models/detection/retinanet.py (resolved)
@fmassa (Member) left a comment:
Looks great, thanks a lot!

@fmassa fmassa merged commit 0467c9d into pytorch:master Oct 20, 2020
@hgaiser (Contributor) left a comment:
Nice speedup, very impressive! Wouldn't expect it to make that much of a difference ^^

@datumbox datumbox deleted the enhancement/retina_vectorized branch October 20, 2020 09:37
@fmassa (Member) commented Oct 20, 2020

@hgaiser I think we could re-introduce it in a follow-up PR, but it would be good to have some stricter approaches for this.
For example, the pop from the dict is not ideal, as it changes the input dict in-place.
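The in-place concern above can be illustrated with a small sketch (names are hypothetical, not the actual torchvision code): dict.pop removes the key from the caller's dict, whereas a plain read plus a filtered copy leaves the input untouched.

```python
# Sketch of the in-place mutation concern with dict.pop.
def process_mutating(head_outputs):
    # pop removes "extra" from the caller's dict as a side effect
    return head_outputs.pop("extra")

def process_pure(head_outputs):
    # Read without mutating, and build a filtered copy for downstream use.
    extra = head_outputs["extra"]
    remaining = {k: v for k, v in head_outputs.items() if k != "extra"}
    return extra, remaining

outputs = {"cls_logits": 1, "bbox_regression": 2, "extra": 3}
process_pure(outputs)
assert "extra" in outputs  # the pure version leaves the input intact
```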

bryant1410 pushed a commit to bryant1410/vision-1 that referenced this pull request Nov 22, 2020
* Vectorize operations, across all feature levels.

* Remove unnecessary other_outputs variable.

* Split per feature level.

* Perform batched_nms across feature levels.

* Add extra parameter for limiting detections before and after nms.

* Restoring default threshold.

* Apply suggestions from code review

Co-authored-by: Francisco Massa <[email protected]>

* Renaming variable.

Co-authored-by: Francisco Massa <[email protected]>
vfdev-5 pushed a commit to Quansight/vision that referenced this pull request Dec 4, 2020
@fmassa fmassa mentioned this pull request Aug 12, 2021