This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

validation stucks when training gluoncv ssd model #14057

Closed
arcadiaphy opened this issue Feb 2, 2019 · 1 comment

Comments

@arcadiaphy
Member

arcadiaphy commented Feb 2, 2019

Description

When training a GluonCV SSD model, validation sometimes takes far longer than a training epoch. After debugging, I traced the problem to the box_nms operator, which accounts for most of the validation time.

Environment info (Required)

    CentOS 7
    CUDA: 9.0
    cuDNN: 7
    mxnet: 1.4.0.rc1
    gluon-cv: latest

Minimum reproducible example

The following snippet shows that box_nms takes a very long time when processing a large number of prior boxes:

import mxnet as mx
import numpy as np

np.random.seed(0)

batch_size = 32
prior_number = 100000
# Each box is laid out as [class_id, score, xmin, ymin, xmax, ymax].
data = np.zeros((batch_size, prior_number, 6))
# Class ids are drawn from {-1, 0}; -1 stands for background.
data[:, :, 0] = np.random.randint(-1, 1, (batch_size, prior_number))
# Scores are uniform in [0, 1), so almost all of them exceed valid_thresh.
data[:, :, 1] = np.random.random((batch_size, prior_number))

# Random corner plus random size, so that xmax >= xmin and ymax >= ymin.
xmin = np.random.random((batch_size, prior_number))
ymin = np.random.random((batch_size, prior_number))
width = np.random.random((batch_size, prior_number))
height = np.random.random((batch_size, prior_number))
data[:, :, 2] = xmin
data[:, :, 3] = ymin
data[:, :, 4] = xmin + width
data[:, :, 5] = ymin + height

mx_data = mx.nd.array(data, ctx=mx.gpu(0))
rv = mx.nd.contrib.box_nms(mx_data, overlap_thresh=0.5, valid_thresh=0.01, topk=400, score_index=1, id_index=0)
# Block until the asynchronous GPU computation has finished.
mx.nd.waitall()

What I have found out

  1. The GPU version of the stable sort in the SortByKey function slows down badly as the sort length grows.
  2. The box_nms operator doesn't remove background boxes during valid-box filtering, which leads to a large sort length.

What I have done

  1. Added the SORT_WITH_THRUST compile definition in the Makefile: validation is still very slow.
  2. Added background-box filtering in box_nms: validation speeds up dramatically, since most boxes are classified as background.

I will post a PR for the second solution.
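
To illustrate why the second change helps, here is a minimal pure-Python sketch of how many boxes reach the sort step with and without background filtering. It is not the operator's actual implementation: the helper name count_sort_candidates and its drop_background and background_id arguments are hypothetical, and it assumes a box counts as valid when its score is strictly greater than valid_thresh.

```python
import random


def count_sort_candidates(boxes, valid_thresh, id_index=0, score_index=1,
                          background_id=-1, drop_background=False):
    """Count the boxes that would be handed to the sort step.

    Hypothetical helper for illustration only, not part of MXNet.
    """
    count = 0
    for box in boxes:
        # Assumed validity rule: keep scores strictly above the threshold.
        if box[score_index] <= valid_thresh:
            continue
        # Proposed extra filter: skip boxes labelled as background.
        if drop_background and box[id_index] == background_id:
            continue
        count += 1
    return count


# Same shape of data as the repro above, for a single image:
# class id in {-1, 0}, score uniform in [0, 1).
rng = random.Random(0)
boxes = [[rng.randint(-1, 0), rng.random()] for _ in range(100000)]

before = count_sort_candidates(boxes, valid_thresh=0.01)
after = count_sort_candidates(boxes, valid_thresh=0.01, drop_background=True)
print("boxes sorted without background filtering:", before)
print("boxes sorted with background filtering:   ", after)
```

In this synthetic data only about half of the boxes are background. In real SSD validation most boxes are, so the reduction in sort length would be correspondingly larger.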

@andrewfayres
Contributor

@mxnet-label-bot add [operator, performance]

Thanks for the issue and PR!
