
Improving inference time #321

Closed
tushardhadiwal opened this issue Jan 7, 2021 · 2 comments

@tushardhadiwal

Hi,

I came across this trick for improving inference time: opencv/opencv#14827 (comment)
When converting yolov4.cfg to a TensorRT engine file, the cfg file I used did not have nms-threshold=0 set in all of the [yolo] layers. I do see some code in this repo for NMS boxes, etc.

Will I get any speedup in yolov4 inference time if I add those values to the cfg file? Or is this already taken care of while building the TRT engine?

Thanks
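For reference, the tweak from the linked comment amounts to adding a line like the following to each [yolo] section of the cfg (sketched from that thread; I haven't verified the exact effect here):

```
[yolo]
...
nms-threshold=0
```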

@jkjung-avt
Owner

Thanks for raising the question, but that trick does not apply to the code in this repo.

If you can read CUDA code, you'd see that I don't do NMS in the yolo_layer plugin:

// CalDetection(): This kernel processes 1 yolo layer calculation. It
// distributes calculations so that 1 GPU thread would be responsible
// for each grid/anchor combination.
// NOTE: The output (x, y, w, h) are between 0.0 and 1.0
// (relative to original image width and height).
__global__ void CalDetection(const float *input, float *output,
                             int batch_size,
                             int yolo_width, int yolo_height,
                             int num_anchors, const float *anchors,
                             int num_classes, int input_w, int input_h,
                             float scale_x_y)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    Detection* det = ((Detection*) output) + idx;
    int total_grids = yolo_width * yolo_height;
    if (idx >= batch_size * total_grids * num_anchors) return;

    int info_len = 5 + num_classes;
    //int batch_idx = idx / (total_grids * num_anchors);
    int group_idx = idx / total_grids;
    int anchor_idx = group_idx % num_anchors;
    const float* cur_input = input + group_idx * (info_len * total_grids) + (idx % total_grids);

    int class_id;
    float max_cls_logit = -CUDART_INF_F;  // minus infinity
    for (int i = 5; i < info_len; ++i) {
        float l = *(cur_input + i * total_grids);
        if (l > max_cls_logit) {
            max_cls_logit = l;
            class_id = i - 5;
        }
    }
    float max_cls_prob = sigmoidGPU(max_cls_logit);
    float box_prob = sigmoidGPU(*(cur_input + 4 * total_grids));
    //if (max_cls_prob < IGNORE_THRESH || box_prob < IGNORE_THRESH)
    //    return;

    int row = (idx % total_grids) / yolo_width;
    int col = (idx % total_grids) % yolo_width;

    det->bbox[0] = (col + scale_sigmoidGPU(*(cur_input + 0 * total_grids), scale_x_y)) / yolo_width;   // [0, 1]
    det->bbox[1] = (row + scale_sigmoidGPU(*(cur_input + 1 * total_grids), scale_x_y)) / yolo_height;  // [0, 1]
    det->bbox[2] = __expf(*(cur_input + 2 * total_grids)) * *(anchors + 2 * anchor_idx + 0) / input_w; // [0, 1]
    det->bbox[3] = __expf(*(cur_input + 3 * total_grids)) * *(anchors + 2 * anchor_idx + 1) / input_h; // [0, 1]
    det->bbox[0] -= det->bbox[2] / 2;  // shift from center to top-left corner
    det->bbox[1] -= det->bbox[3] / 2;

    det->det_confidence = box_prob;
    det->class_id = class_id;
    det->class_confidence = max_cls_prob;
}

Instead, I do NMS in Python, as shown below. The NMS code is written in pure Python/NumPy and can indeed be slow. You might improve FPS by optimizing this part (for example, by replacing it with C++ code).

# NMS
nms_detections = np.zeros((0, 7), dtype=detections.dtype)
for class_id in set(detections[:, 5]):
    idxs = np.where(detections[:, 5] == class_id)
    cls_detections = detections[idxs]
    keep = _nms_boxes(cls_detections, nms_threshold)
    nms_detections = np.concatenate(
        [nms_detections, cls_detections[keep]], axis=0)
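The helper `_nms_boxes` isn't shown in the snippet, but per-class NMS like this is typically the classic greedy IoU suppression. A minimal NumPy sketch of such a helper (the name `nms_boxes`, the sort-by-box-confidence choice, and the 7-column [x, y, w, h, box_conf, class_id, class_conf] layout with top-left (x, y) are assumptions here, not necessarily what the repo does):

```python
import numpy as np

def nms_boxes(detections, nms_threshold):
    """Greedy IoU-based NMS over detections of a single class.

    Each row: [x, y, w, h, box_conf, class_id, class_conf],
    with (x, y) the top-left corner. Returns indices to keep.
    """
    x, y = detections[:, 0], detections[:, 1]
    w, h = detections[:, 2], detections[:, 3]
    areas = w * h
    # Process boxes in order of decreasing box confidence
    order = detections[:, 4].argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with all remaining boxes
        xx1 = np.maximum(x[i], x[order[1:]])
        yy1 = np.maximum(y[i], y[order[1:]])
        xx2 = np.minimum(x[i] + w[i], x[order[1:]] + w[order[1:]])
        yy2 = np.minimum(y[i] + h[i], y[order[1:]] + h[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap box i too much; keep the rest
        order = order[1:][iou <= nms_threshold]
    return np.array(keep, dtype=np.int64)
```

A C++ port of this loop is straightforward since it only needs sorting plus pairwise IoU, which is why moving it out of Python can help FPS when there are many candidate boxes.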

@BigJoon
Contributor

BigJoon commented Jan 10, 2021

Thanks for the detailed explanation. I'll think about optimizing the NMS part with C++ code.
I think it will be fun and rewarding work.
