The training loss is nan in SSD #142

I use the default training script (train_ssd.py) to train SSD300. However, the training loss seems to be large and does not converge. The log file is shown below; what could be the problem? Thanks!

Comments
The smoothl1 loss is nan at the first batch, which suggests something is wrong.
I am using mxnet version 1.2.0.
I just re-ran with the latest mxnet 1.2.0 and saw no problem.
@zhreshold
@1292765944 I will investigate this problem.
@zhreshold I think my GPU is fine. I use a Maxwell TITAN X in my experiments; the running temperature is 85°C and the idle temperature is 42°C (per nvidia-smi).
@zhreshold Were you able to reproduce my error? What could be causing it? Is it a model initialization problem?
I tried multiple times on EC2 and cannot reproduce the error yet. Since you are getting an exploding loss after the first update, I suspect your pretrained model or the initialized tail layers are abnormal. Let me think about where this could go wrong.
@zhreshold any ideas about this problem?
@1292765944 Try removing the related pretrained models in ~/.mxnet/models, updating mxnet/gluon-cv, and running with the default parameters again.
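For reference, a minimal sketch of that cleanup, assuming the default cache directory ~/.mxnet/models and the GluonCV model_zoo API (the ssd_300_vgg16_atrous_voc name below is just an example):

```python
# Sketch: delete cached pretrained weights so they are re-downloaded fresh.
# Assumes the default cache directory ~/.mxnet/models used by MXNet/GluonCV.
import os
import glob

cache_dir = os.path.expanduser('~/.mxnet/models')
for params_file in glob.glob(os.path.join(cache_dir, '*.params')):
    os.remove(params_file)

# Requesting a pretrained base model again triggers a fresh download.
from gluoncv import model_zoo
net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=True)
```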
Hello, we have reinstalled mxnet and reloaded the initial parameters, but it still doesn't work.
We fixed a minor problem with COCO, so the COCO behavior is confirmed. I will try some instances with CUDA 8.0 and the latest mxnet; please let me know if you have new findings.
I also encountered this problem at first; decreasing the lr was my solution.
@zqburde How did you set the lr? Does it hurt the final mAP?
@1292765944 I changed the lr from 0.001 to 0.0008 and increased the number of epochs; it doesn't affect the final mAP.
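For example, using the training script's own flags (the exact values below are illustrative, not necessarily the ones used above):

python3 train_ssd.py --network vgg16_atrous --data-shape 300 --dataset voc --gpus 0 --lr 0.0008 --epochs 280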
I am working with a GTX 1080 Ti and I had the same issue on the previous gluoncv release (0.1): the loss was rapidly going to nan. Unfortunately I can no longer reproduce it on gluoncv 0.2. For better understanding I also implemented my own SSD, inspired by amdegroot's ssd.pytorch, and had no issues. Then I tried to improve my implementation using some gluoncv concepts and the training loss went to nan too. In my experiments it is linked to the initializer used in the VGGAtrousBase class; removing this initializer solved my nan issue.
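If anyone wants to experiment along those lines, here is a rough, illustrative sketch of overriding the initialization of the layers that are not covered by the pretrained base; the Xavier settings below are an assumption chosen for illustration, not the exact initializer inside VGGAtrousBase:

```python
# Illustrative only: give the not-yet-initialized layers a conservative Xavier
# initializer while keeping the pretrained VGG base weights untouched.
import mxnet as mx
from gluoncv import model_zoo

net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=True)
# Parameters already loaded from the pretrained base are skipped (with a
# warning); only deferred-initialization parameters pick up this initializer.
net.initialize(mx.init.Xavier(magnitude=2))
net.collect_params().reset_ctx(mx.gpu(0))
```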
When I train SSD on COCO, the training curve looks normal, but at validation the AP[0.5:0.95] is always nearly zero. What could the problem be? Here are my parameter settings and my training curve:
What I got after one epoch with: python3 ../gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3 -j 32 --network vgg16_atrous --data-shape 300 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240
Today I found that training with python3 works fine, but not with python2. Can you have a try? @zhreshold
@zhreshold OK, I will convert orig_height and orig_width to float and have a try, thanks!
Closing this, let me know if it is still a problem.
Hi all. Recently I also encountered this problem (loss = nan). Specifically, when I trained ssd512_vgg16_atrous on a GTX 1080 for face detection with batch size 8, both the SmoothL1Loss and the CrossEntropy loss were always nan. Then I commented out the net.hybridize() line in the train and validate functions, the loss became normal, and training succeeded. Finally, with net.hybridize() enabled, I changed the batch size from 8 to 16 and the lr from 0.001 to 0.00001 on VOC2012 and a self-designed dataset, and observed the following: the conclusion may be that a small batch size tends to produce a nan loss when net.hybridize() is used. A possible solution is to comment it out, or to use a larger batch size if the GPU supports it. Alternatively, you can switch to ssd300_vgg16, for which a GTX 1080 also supports batch sizes >= 16.
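For anyone trying the same workaround, here is a minimal sketch of the relevant part of a Gluon training loop with hybridization left off; the names net, trainer, mbox_loss, and train_data are assumed to be set up as in train_ssd.py:

```python
# Sketch of the training step with hybridization disabled, assuming the usual
# train_ssd.py objects: net (SSD model), trainer (gluon.Trainer),
# mbox_loss (SSDMultiBoxLoss), and train_data (the training DataLoader).
from mxnet import autograd

# net.hybridize()  # the workaround described above: leave hybridization off

for batch in train_data:
    data, cls_targets, box_targets = batch[0], batch[1], batch[2]
    with autograd.record():
        cls_preds, box_preds, _ = net(data)
        sum_loss, cls_loss, box_loss = mbox_loss(
            [cls_preds], [box_preds], [cls_targets], [box_targets])
        autograd.backward(sum_loss)
    trainer.step(1)
```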
@zhreshold CentOS 6.4, command:
ps:
@Feywell Try reducing the lr slightly.
Hi,
1st: python3 train_ssd.py --batch-size 12 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
2nd: python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
@Intellige Have you solved the problem? How? Hello, I have a similar problem; I also tried python 3.6, but I got the same result.
Hi, sorry for the late reply.
@Intellige Thanks. It's really confusing...
Please reduce the learning rate a little bit in case you meet a sudden nan loss.
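A hedged sketch of what that can look like in the training loop, assuming a gluon.Trainer named trainer and the summed loss from the sketch above:

```python
# Sketch: skip the update and back off the learning rate when the loss
# suddenly becomes non-finite. `trainer` is a mxnet.gluon.Trainer and
# `sum_loss` a list of per-device loss NDArrays, as in train_ssd.py.
import math

loss_value = sum(l.mean().asscalar() for l in sum_loss)
if math.isfinite(loss_value):
    trainer.step(1)
else:
    trainer.set_learning_rate(trainer.learning_rate * 0.5)
    print('non-finite loss, lowering lr to', trainer.learning_rate)
```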
Just an update: the root cause has been found and the fix has been merged to master: apache/mxnet#14209. By using a master/nightly-built pip package you hopefully won't meet the same problem any more.
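For reference, installing the pre-release build at that time looked roughly like this (the exact package name depends on your CUDA version; mxnet-cu90 is just an example):

pip install --pre --upgrade mxnet-cu90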