Sensitivity Effects Near Grid Boundaries (Experimental Results) ~+1 AP@[.5, .95] #3293
Comments
@glenn-jocher Hi, this is a very interesting note. I have never visualized the distribution of detections in such a way, and believed that detections worsen near the grid boundaries, but not by too much. But I have encountered the problem of flickering detections (blinking issue) during detection on video. And this problem can't be completely solved even by using recurrent LSTM-networks #3114 (comment), apparently because they also use the same grid at the end. The solution may be to use two offset grids for each scale, even if each of the grids has fewer cells in order to maintain the same processing speed - I will think about it. Can you show the distribution of detection frequency on a graph, where X is the same as the X coordinate in the figure (the red line on the image) and Y is the number of object detections?
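A rough sketch of the kind of plot being asked for (standalone Python; the detections file, array layout and grid size are placeholders, not anything from either repo):

import numpy as np
import matplotlib.pyplot as plt

# det_xywh: N x 4 array of detections, x/y centres normalized to [0, 1]
det_xywh = np.load('detections.npy')   # hypothetical dump of detections
grid_w = 20                            # number of cells across the yolo layer of interest

# fractional position of each detection centre inside its grid cell
frac_x = (det_xywh[:, 0] * grid_w) % 1.0

plt.hist(frac_x, bins=50)
plt.xlabel('x position within grid cell (0 and 1 are the cell borders)')
plt.ylabel('number of detections')
plt.show()

If the grid-boundary effect is real, this histogram should dip toward 0 and 1.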
Yes, for larger bboxes a [yolo]-layer with a lower number of cells is used.
Can you disable this quantization temporarily for testing? Yes, the lower the precision I use (FP32 -> FP16 -> INT8 -> BIT1), the greater the problem of blinks.
Yes, maybe there is no flickering.
Do you mean to add there: Lines 269 to 270 in 55dcd1b
something like this? {
for (t = 0; t < l.max_boxes; ++t) {
box truth = float_to_box_stride(state.truth + t*(4 + 1) + b*l.truths, 1);
if (truth.x < 0 || truth.y < 0 || truth.x > 1 || truth.y > 1 || truth.w < 0 || truth.h < 0) {
printf(" Wrong label: truth.x = %f, truth.y = %f, truth.w = %f, truth.h = %f \n", truth.x, truth.y, truth.w, truth.h);
}
int class_id = state.truth[t*(4 + 1) + b*l.truths + 4];
if (l.map) class_id = l.map[class_id];
if (class_id >= l.classes) continue; // if label contains class_id more than number of classes in the cfg-file
float obj_i = (truth.x * l.w);
float obj_j = (truth.y * l.h);
float best_iou = 0;
int best_n = 0;
box truth_shift = truth;
truth_shift.x = truth_shift.y = 0;
for (n = 0; n < l.total; ++n) {
box pred = { 0 };
pred.w = l.biases[2 * n] / state.net.w;
pred.h = l.biases[2 * n + 1] / state.net.h;
float iou = box_iou(pred, truth_shift);
if (iou > best_iou) {
best_iou = iou;
best_n = n;
}
}
int mask_n = int_index(l.mask, best_n, l.n);
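// proposed change: give positive objectness/class/box targets to every cell whose index is within 1.5 cells of the object centre, not only the single cell that contains it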
if (mask_n >= 0 && (fabs(i - obj_i) < 1.5) && (fabs(j - obj_j) < 1.5)) {
int obj_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4);
l.delta[obj_index] = l.cls_normalizer * (1 - l.output[obj_index]);
int class_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4 + 1);
delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, 0, l.focal_loss);
int box_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 0);
delta_yolo_box(truth, l.output, l.biases, best_n, box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2 - truth.w*truth.h), l.w*l.h, l.iou_normalizer, l.iou_loss);
}
}
}
I got +2% mAP@0.5 on yolov3-spp-pan-xnor.cfg (without online-SVR) on several classes from the Cityscapes dataset:
So maybe it is a good solution, will test more.
Oh wow that's huge!! That's a 10% increase! 2 * logistic - 0.5 will allow the detections to span the gridspace from -0.5 to 1.5. Yes, that seems perfect. It is odd, as you note, that the gap is constant in pixels rather than grid units. I realized I should be able to repeat this test very simply on 5k.val using yolov3-spp.weights and plotting the same 2D histogram. This will remove the whole chain of uncertainty created by the quantizing and exporting to CoreML. I'll do that now.
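As a quick sanity check (a standalone sketch, not code from either repo), the 2 * sigmoid - 0.5 form maps any finite logit into (-0.5, 1.5), so a cell can place a centre on or beyond its own borders:

import numpy as np

def offset(t, scale=2.0):
    # scaled sigmoid used for the xy offsets: sigmoid(t)*scale - (scale - 1)/2
    s = 1.0 / (1.0 + np.exp(-t))
    return s * scale - (scale - 1.0) / 2.0

for t in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(t, offset(t))            # approaches -0.5 and 1.5 at the extremes
print(offset(10.0, scale=1.0))     # plain sigmoid: capped just below 1.0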
Ok, I have the 5k.val test results here at 416 resolution. The plot below shows the output of python3 test.py --save-json --img-size 416
Namespace(batch_size=16, cfg='cfg/yolov3-spp.cfg', conf_thres=0.001, data_cfg='data/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3-spp.weights')
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
Class Images Targets P R mAP F1
Computing mAP: 100%|█████████████████████████████████████████████████████████████| 313/313 [06:59<00:00, 1.14s/it]
all 5e+03 3.58e+04 0.104 0.747 0.552 0.178
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.335
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.563
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.347
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.151
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.359
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.493
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.280
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.432
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.459
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.254
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.496
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.620
@glenn-jocher
Ah, the top two are the predicted x and y for 5k.val during testing, which produces the 0.563 mAP. The bottom two are the ground truth x and y distributions. Ideally our top 2 charts would match the bottom 2.
Since, in a more general task, by using such a function: eac2622#diff-180a7a56172e12a8b79e41ec95ae569dR557 I got +1.9% mAP.
@AlexeyAB very interesting. So I was wondering about the possibility of extending the sensitivity itself (not just the activation function range) of a grid cell, so for example training on objects with xy centers in the range of -0.1 to 1.1, with an activation function range extending perhaps a bit past that, say -0.2 to 1.2. This might further ease the 'handoff' between one grid cell and the next by producing redundant duplicate detections at the border, which NMS would sort out. But I'm unclear if that's a viable possibility, i.e. whether a neighbouring cell has any observability into an object not centered in it.
Yes, you can use it. I haven't tested much yet, so maybe it helps only some models and only on some datasets, or preferably at low precision (INT8/BIT1), or maybe it's just fluctuation of the mAP.
To actively use this logic, you must add this code: #3293 (comment) For passive logic, you can simply reduce it: Lines 786 to 787 in abba310
@glenn-jocher Hi, did you try to train the yolov3-spp model with scale_x_y?
@AlexeyAB I tried a few tricks, but I stopped because I realized I need to establish a baseline training comparison to AlexeyAB/darknet first, which I still haven't been able to do on coco2014. The ultralytics/yolov3 repo has 3 main parts:
So I've simply been trying to reproduce the AlexeyAB/darknet training settings on yolov3-spp first, but I am still about 10% below on both mAP@0.5 and mAP@0.5:0.95. The last training I ran ended with the results below, and prototyping changes to the training is very slow because GCP keeps killing my preemptible V100 instances after only a few hours. I think I should raise a new issue here so you could help me try and understand the differences, though there are a few things I simply haven't had time to implement yet.
BUT, I did have two ideas related to this:
So by using ultralytics/yolov3 you can't achieve the same mAP@0.5 / mAP@0.5:0.95 as by using AlexeyAB/darknet or https://github.com/pjreddie/darknet? Did you try to compare mAP@0.5 / mAP@0.5:0.95 by training on ultralytics/yolov3 with and without the Logistic-scale? For reference: Lines 291 to 296 in 8c80ba6
You should use Line 126 in 8c80ba6
Do you think that we should use it?
@AlexeyAB I haven't tried to fully train COCO from darknet53.conv.74 on AlexeyAB/darknet, but I've trained for a couple of epochs with good results. I've trained ultralytics/yolov3 a few times, but so far not reaching the same performance. I think this is simply my fault for not duplicating the AlexeyAB/darknet loss function correctly yet. I will put a comparison in a new issue to clear up the differences. One big question I had was regarding the total training time. How many full passes through the COCO2014 training set of 117264 images are done to reach 500200 batches, assuming the cfg here? At first I thought it was 273, then more recently I came to believe it was 68.25, but I'm still confused. Lines 5 to 7 in 9e9b2c4
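For what it's worth, the arithmetic behind the two candidate numbers (a sketch; it only depends on whether one darknet iteration is counted as 64 images, per batch=64, or as 16 images):

images = 117264      # COCO2014 train set size
batches = 500200     # max_batches in the cfg

print(batches * 64 / images)   # ~273 epochs if each iteration consumes 64 images
print(batches * 16 / images)   # ~68.25 epochs if each iteration only counts 16 images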
About the Logistic-scale scale_x_y = 1.05, 1.1, 1.2, I have not tried to recreate it yet, because there are still other unresolved issues in my loss function implementation compared to AlexeyAB/darknet I think. I can describe to you the implementation here:
# reject below threshold ious (OPTIONAL, increases P, lowers R)
reject = True
if reject:
j = iou > iou_thres
t, a, gwh = targets[j], a[j], gwh[j] # targets, anchors, wh
if giou_loss:
pbox = torch.cat((torch.sigmoid(pi[..., 0:2]), torch.exp(pi[..., 2:4]) * anchor_vec[i]), 1) # predicted
giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True) # giou computation
lxy += (k * h['giou']) * (1.0 - giou).mean() # giou loss
else:
lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i]) # xy loss
lwh += (k * h['wh']) * MSE(pi[..., 2:4], twh[i]) # wh yolo loss
lconf += (k * h['conf']) * BCE(pi0[..., 4], tconf) # obj_conf loss
lcls += (k * h['cls']) * CE(pi[..., 5:], tcls[i]) # class_conf loss
loss = lxy + lwh + lconf + lcls
hyp = {'giou': .035, # giou loss gain
'xy': 0.20, # xy loss gain
'wh': 0.10, # wh loss gain
'cls': 0.035, # cls loss gain
'conf': 1.61, # conf loss gain
'conf_bpw': 3.53, # conf BCELoss positive_weight
'iou_t': 0.29, # iou target-anchor training threshold
'lr0': 0.001, # initial learning rate
'momentum': 0.90, # SGD momentum
'weight_decay': 0.0005} # optimizer weight decay
Burn-in is correctly implemented in ultralytics/yolov3 I believe. In this plot one batch is really a minibatch of 16 images.
Wow, the cornernet-squeeze and cornernet-saccade results are super impressive! I was not aware of those. I'll have to read the paper.
This is correct, the same as it is done in this repo.
Oh my goodness. So AlexeyAB/darknet does not letterbox images when resizing? I thought it was important to maintain the aspect ratio of the shapes. Well this is an easy change I can implement and observe the effect after epoch 0.
Ah ok, got it. That's unfortunate, that means about 1 week of training time on a V100 :(
Yes this is the crux of the experiments I have been running. I've been doing hyperparameter tuning and other testing using epoch 0 results, under the assumption that an improvement after epoch 0 would also mean an improvement after epoch 273, but it's not clear to me that that's always the case. How many COCO epochs would you say is the minimum to run to experiment with changes?
Yes, I've seen that (at least in PyTorch) CE outperforms BCE for single-label classification, not only in YOLOv3 but also on pure classification tasks like MNIST (I tested this myself). PyTorch CE loss "combines log_softmax and nll_loss in a single function," whereas BCE loss is a sigmoid layer combined with binary cross entropy. I think the different strategies for formulating the regression problem are interesting (lrtb, xywh, cornernet, centernet etc.), and obviously very important, but one of the things I saw that worked best in regression networks seems strangely absent in object detection, which is normalizing the regression network targets to zero mean and unit variance (and then applying calibration curves later during testing and detection to bring the network outputs back to the range you want them in). So I think irrespective of what the regression space is (xywh, lrtb etc.), we also want the regression targets to produce a statistical distribution that has zero mean and a variance as close to 1.0 as possible. In shallow regression networks I've worked on in the past, like https://github.com/ultralytics/wave, arxiv, this simple change has had huge impacts on network performance. I think the current anchor system is partly accomplishing this, and these other methods may also deal with the issue in indirect ways, but I think there is great room for improvement left. I don't have a specific recommendation right now, but I think this is an extremely important concept to keep in mind.
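A minimal sketch of that normalization idea (placeholder names, not from either repo): standardize the regression targets with statistics from the training set, train against the standardized values, and invert the transform at inference:

import numpy as np

class TargetScaler:
    # standardize regression targets to zero mean / unit variance
    def fit(self, y):                        # y: N x D array of raw targets
        self.mean = y.mean(axis=0)
        self.std = y.std(axis=0) + 1e-8
        return self

    def transform(self, y):                  # used when building training targets
        return (y - self.mean) / self.std

    def inverse(self, y_hat):                # used on network outputs at test time
        return y_hat * self.std + self.mean

raw_targets = np.random.rand(1000, 4)        # stand-in for xywh regression targets
scaler = TargetScaler().fit(raw_targets)
normalized = scaler.transform(raw_targets)   # feed these to the loss
recovered = scaler.inverse(normalized)       # map predictions back to box space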
Ok, I've added a scaleFill resize option (I borrowed the term from Apple's vocabulary for resizing options https://developer.apple.com/documentation/vision/vnimagecropandscaleoption). The original letterboxing (scaleFit) is what I've been doing, similar to original darknet apparently. I will test each for one epoch.
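For illustration, a rough sketch of the two resize modes (OpenCV-based; the function names follow Apple's scaleFill/scaleFit terms, everything else here is made up):

import cv2
import numpy as np

def scale_fill(img, size=416):
    # stretch to a square of side `size`, ignoring aspect ratio
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)

def scale_fit(img, size=416, color=(127, 127, 127)):
    # letterbox: keep aspect ratio, pad the remainder with a constant border
    h, w = img.shape[:2]
    r = min(size / h, size / w)
    new_h, new_w = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    out = np.full((size, size, 3), color, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out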
Yes, it is important for most competitions. It would be interesting to see which approach is better, scaleFit or scaleFill.
I think 1 epoch for preliminary tests is enough.
BCE (sigmoid for multi-label classification) was introduced mainly for the OpenImages dataset, where many objects are placed close to each other, so one anchor can detect several objects - it increases accuracy. Joseph tried to make one universal, optimal model for most datasets, even if it is at the expense of decreased accuracy on other datasets.
What is the difference between this approach and Batch-normalization?
I added the param.
@AlexeyAB I tested out a few changes, but did not observe improvements. The results are a bit hard to read because the effects of the changes are getting lost in the noise, I think. These are all after 1 epoch using img-size 320, batch 64.
# Default training command:
python3 train.py --data data/coco.data --img-size 320 --single-scale --batch-size 64 --accumulate 1 --epochs 1 # 0.449hr FP32 P100, 0.279/0.324hr V100 FP32/FP16
# P R mAP F1
0.111 0.268 0.122 0.144 # default
0.087 0.281 0.109 0.121 # default mixed precision with nvidia apex
0.131 0.261 0.119 0.157 # scaleFill
0.110 0.285 0.129 0.140 # scale_xy 1.2
0.104 0.276 0.123 0.141 # scale_xy 1.5
0.109 0.286 0.124 0.132 # scale_xy 2.0
0.053 0.229 0.064 0.0768 # iou threshold = 0.0
0.114 0.28 0.125 0.139 # giou ** 2
To study the regression targets I ran 1 epoch and collected all the values that are actually passed to the MSE loss function as ground truths (including augmentation etc). The results do look pretty normalized. Clearly the anchors are doing a good job of centering the targets in my implementation of YOLOv3.
Yes, sure, both Delta and summarized-loss are calculated after sigmoid (for: x,y, objectness, all classes): Lines 244 to 247 in 5ec3592
I didn't train yolov3 on MS COCO with giou, I just checked a trained model from https://github.com/generalized-iou/g-darknet, and it gives good mAP@75 and mAP@0.5.
Can you clarify a little bit more? I don't understand - do you mean scale_xy as a hyperparameter?
@AlexeyAB I'm not sure that the differences between the runs are statistically reliable. Let me think. I can reproduce the same results on the same hardware+environment, so as long as the experiments don't change the number (or order) of the random numbers generated between cuda, pytorch, numpy and python (all seeds are set to 0 before each training), then the results should be directly comparable. Hmm, ok so yes, then your observations are valid. Note that I was lazy in my scale_xy implementation.
Another surprise I found was that mixed precision training (using https://github.com/NVIDIA/apex) worsened the results significantly, to 0.109 mAP. I updated my little table in my previous comment with this. All the other results use full FP32, the PyTorch default.
The loss-balancing hyperparameters I was referring to are defined here. I have to use these to balance out the contribution from each loss term, i.e. the total loss is the weighted sum of the individual terms:
hyp = {'giou': .035, # giou loss gain
'xy': 0.20, # xy loss gain
'wh': 0.10, # wh loss gain
'cls': 0.035, # cls loss gain
'conf': 1.61, # conf loss gain
'conf_bpw': 3.53, # conf BCELoss positive_weight
'iou_t': 0.29, # iou target-anchor training threshold
'lr0': 0.001, # initial learning rate
'momentum': 0.90, # SGD momentum
'weight_decay': 0.0005} # optimizer weight decay
@AlexeyAB I was thinking about this topic today, because now I'm not really sure if the dark regions in the xy histogram are caused by simple xy prediction misalignments (which your scale_x_y parameter aims to help), or by actual failures in detection. Failures in detection would much better explain the blinking objects in video. The 'handoff' between one grid point and another for a moving object may be failing at the boundary. What do you think? A partial solution is of course multi-scale inference as we already do with FPN, but for small objects this will not help, as they will only pair with P3 anchors, and coincidentally small objects show the worst COCO mAP. In this video you can see the motorcycles blink significantly, and the cars almost not at all. But I can't tell how much of this is simply due to the cars having more pixels, or whether overlapping p3,p4,p5 anchors also helps them avoid blinking:
Do you use multi-scale inference?
Yes, sure. There are at least two 'handoffs':
Maybe this "blinking small bike" is due to that.
@AlexeyAB yes you are right, multi-scale inference is only useful for paper metrics, competition etc. Yes there are these two handoffs, but the anchor handoff enjoys significant overlap, whereas the grid cell handoff does not overlap in many cases. And yes, by 'only' I mean that P4 objects may enjoy grid overlap with P3 and P5, partly eliminating the grid issue for medium objects, but tiny P3 objects, smaller than 4x the smallest P4 anchor, do not enjoy any help from the P4 grid. In any case though, even if P3 objects did match to P4 anchors, 50% of the P3 grid boundaries are still also P4 grid boundaries, and the problem would remain in those cases. I think a solution may be to allow both the box regressions to vary past the boundaries a little (as scale_x_y already does), but also to allow objectness to remain high even if the object is a little past the cell boundaries. Then there would be true objectness overlap. I don't think you are doing this already, are you?
@AlexeyAB one really interesting thought I had for solving the objectness handoff would be if even-numbered output layers (P4, P6) were shifted by 0.5 grid cells in both x and y. Then output layers near each other would share no grid borders (i.e. P3-P4 and P4-P5 share no more borders), and perhaps then the problem would be much reduced. Implementation would be tricky though, and would depend on linear interpolation to shift the image +0.5 grid points in xy on the affected P layers (i.e. P4 and P6), and then shift all P4, P6 predictions -0.5 points in xy.
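A very rough sketch of that half-cell shift (pure illustration, made-up shapes; one of several possible sign conventions): bilinearly resample the P4 map at half-cell offsets, run the head on the shifted map, and move the decoded xy back by half a cell:

import torch
import torch.nn.functional as F

def shift_half_cell(feat):
    # feat: (N, C, H, W). Output cell (j, i) holds the bilinear sample of the
    # input at (j - 0.5, i - 0.5), i.e. the mean of the 2x2 neighbourhood;
    # replication padding on the top/left keeps the spatial size unchanged.
    padded = F.pad(feat, (1, 0, 1, 0), mode='replicate')
    return 0.25 * (padded[..., :-1, :-1] + padded[..., 1:, :-1] +
                   padded[..., :-1, 1:] + padded[..., 1:, 1:])

p4 = torch.randn(1, 256, 26, 26)       # hypothetical P4 feature map
p4_shifted = shift_half_cell(p4)
# head(p4_shifted) would then produce xy in shifted-grid coordinates;
# subtracting 0.5 from the decoded cell offsets maps them back to the original grid.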
Yes, in the current implementation.
Yes.
// old code: x,y = output of conv layer for x,y
x_temp = sigmoid(x)*scale_x_y - (scale_x_y - 1) / 2;
y_temp = sigmoid(y)*scale_x_y - (scale_x_y - 1) / 2;
x_real = (i + x_temp) * w;
y_real = (j + y_temp) * h;
// + new code
// objectness_truth = 1;
// class_prob_truth = 1;
x_d = sigmoid(x)*(1-sigmoid(x))*scale_x_y;
y_d = sigmoid(y)*(1-sigmoid(y))*scale_x_y;
objectness_truth = x_d * y_d; // instead of 1
class_prob_truth = x_d * y_d; // instead of 1
I believe the sensitivity effect you guys are seeing here may be somewhat impacted by the anchor boxes, but it could also be related to a more fundamental issue with information loss during downsampling steps in the network. You may not have come across this nice paper about antialiased downsampling, at some extra computational cost, to make networks more translationally invariant: https://richzhang.github.io/antialiased-cnns/ This seems to help some, though the best solution (but halving frame rates) is to simply use test-time augmentation: send a copy of the input image offset in x and y, run inference across both, and then combine the detections.
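A bare-bones sketch of that test-time trick (placeholder detector interface, PyTorch-style; the 8-pixel shift is arbitrary): run the model on the original frame and a translated copy, map the shifted boxes back, and merge with NMS:

import torch
import torchvision

def detect_with_offset_tta(model, img, dx=8, dy=8, iou_thr=0.5):
    # img: (1, 3, H, W) tensor; model is assumed to return (boxes[N, 4] in xyxy, scores[N])
    boxes1, scores1 = model(img)

    shifted = torch.roll(img, shifts=(dy, dx), dims=(2, 3))   # translate (wraps at the border)
    boxes2, scores2 = model(shifted)
    boxes2 = boxes2.clone()
    boxes2[:, [0, 2]] -= dx                                   # undo the shift on x1, x2
    boxes2[:, [1, 3]] -= dy                                   # undo the shift on y1, y2

    boxes = torch.cat([boxes1, boxes2])
    scores = torch.cat([scores1, scores2])
    keep = torchvision.ops.nms(boxes, scores, iou_thr)        # merge duplicate detections
    return boxes[keep], scores[keep]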
This was tried, seemed to give slightly worse results |
@LukeAI - Ah, thanks. Yes, I also had mixed results implementing it on a different architecture (CenterNet), though the idea behind the paper seems sound and it felt like it should have helped.
@HamsterHuey It helps only for small datasets and without shift data augmentation. |
@AlexeyAB, I've been working with a custom object detector based on YOLOv3, and I am facing the same issue. The model gives good detections with very high confidence when tested on images, but I've observed a flickering effect when testing is done on video sequences. On further inspection, I observed that my model suffers from the same sensitivity near grid boundaries that is being discussed in this thread.
Details regarding model training
I've gone through the whole thread but I can't seem to figure out how you ended up with the given scale_xy values for the different branches. For achieving this, you used the modified offset equation rather than the original YOLOv3 equation (both forms are sketched at the end of this comment). I tried the modified equation and analyzed the results post-training.
Model trained using original YOLOv3 equations
Model trained using modified equations as mentioned in this comment
Can you shed more light on how you selected the scale_x_y values?
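For reference, a sketch of the two centre-offset formulations being compared (consistent with the pseudocode earlier in this thread; variable names are illustrative):

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bx_original(t_x, c_x):
    # original YOLOv3: b_x = sigmoid(t_x) + c_x, so the offset stays strictly inside (0, 1)
    return sigmoid(t_x) + c_x

def bx_scaled(t_x, c_x, scale_x_y=1.1):
    # modified form: the offset range is expanded symmetrically around the cell
    return sigmoid(t_x) * scale_x_y - (scale_x_y - 1) / 2 + c_x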
@viplix3 Attach both cfg-files.
Yes. |
Sorry, I didn't use https://github.com/AlexeyAB/darknet, so I cannot provide any cfg-files. Is my understanding correct regarding the modified equations?
I'm providing the relevant figures below. All the predictions are done on the same test set with a confidence threshold of 0.1 and an NMS threshold of 0.3.
Model trained with the original YOLOv3 offset equation: x_mid_offset histograms and scatter plots at 512x512 resolution for the Big (16, 16, 18), Medium (32, 32, 18) and Small (64, 64, 18) object branches, plus a normalized, combined x_mid_offset scatter plot.
Model trained with the modified offset equations: the same set of x_mid_offset histograms and scatter plots.
Please note that the prediction scatter plot shape is like a parabola because the GT distribution is like that; it has nothing to do with incorrect model training. It is pretty evident from the histograms and scatter plots that the model is not able to predict boxes near the grid boundaries.
It seems something is wrong with your code.
I can assure you my code is fine, as the model trained using the said code has been tested exhaustively on over 100k frames for detection performance, and no unusually wrong detections have been observed so far.
@viplix3 the general idea is that it is impossible to generate an output of 0 or 1 from a sigmoid, as the input neuron would need to output -inf or inf. This is the discovery I made and the effect you are seeing in your plots. The solution is to expand the output space past 0-1 (i.e. -0.2 to 1.2) while keeping the targets in a smaller space, allowing model outputs to more easily spread across the grid. The specifics of how you do this probably don't matter much, so you should experiment and see what works best for your case. If you arrive at any innovative solutions please update here!
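A small numerical illustration of the point (standalone sketch; the 1.4 scale is arbitrary): with a plain sigmoid the raw output must go to +/- infinity to place a centre exactly on a cell border, while with an expanded range a small finite value suffices:

import math

def logit(p):
    # inverse sigmoid: the raw output needed to make the sigmoid emit p
    return math.log(p / (1.0 - p))

# plain sigmoid: even getting within 0.001 of a border needs a logit of about +/- 6.9,
# and hitting 0 or 1 exactly is impossible
print(logit(0.001), logit(0.999))

# expanded range, e.g. offset = sigmoid(t) * 1.4 - 0.2 spanning -0.2 .. 1.2:
# the borders (offset 0 and 1) now correspond to sigmoid outputs 0.2/1.4 and 1.2/1.4
scale = 1.4
print(logit(0.2 / scale), logit(1.2 / scale))   # about -1.79 and +1.79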
@glenn-jocher Thanks for the clarification. |
@viplix3 sure, there are all sorts of creative ways you could circumvent or mitigate this. You can expand your sigmoid scaling to stay away from the edges, or, for example, you could use an FCOS-style box regression, which doesn't even require the model to output a centerpoint. That would completely negate the issue.
@AlexeyAB I ran an interesting experiment recently. Using the iDetection app, I set up an iPhone to view a street for one hour and record all the detections for later analysis. This recorded about 400,000 detections over 100,000 video frames (3600s at an average of 20 FPS). The model used was https://pjreddie.com/media/files/yolov3-spp.weights, exported to PyTorch > ONNX > CoreML in a 192 x 320 width-height shape, with 6x10, 12x20, 24x40 grids. The results worked amazingly well, but it also uncovered an effect I'd never noticed before.
In my overlay below, you can actually see the YOLOv3 grids, because for some reason there are no detections near the grid boundaries (those histogram cells are all zero).
Also, equally fascinating, you can see that the middle grid is used for up-close pedestrians: I can count 12 grid cells across nearest to the camera, while you can actually visualize the transition to the largest 24-across grid far away.
My question to you is: have you ever seen any issues at the grid intersection areas before, such as reduced recall? There are a lot of effects at play here, so the cause may not be in the yolov3-spp.weights themselves, but perhaps in the PyTorch inference model or the CoreML export. I'm fairly confident that the PyTorch and Darknet inference is practically identical, however, due to identical test mAPs.