The training loss is nan in SSD #142

I use the default training script (train_ssd.py) to train SSD300. However, the training loss seems to be large and does not converge. The log file is shown below; what could be the problem? Thanks!

Comments
The smoothl1 loss is nan at the first batch, which suggests something is wrong.
I am using mxnet version 1.2.0.
I just re-ran with the latest mxnet 1.2.0 and saw no problem.
@zhreshold
@1292765944 I will investigate this problem.
@zhreshold I think my GPU is fine. I use a Maxwell TITAN X in my experiments; the running temperature is 85°C and the idle temperature is 42°C (per nvidia-smi).
@zhreshold Were you able to reproduce my error? What could be causing it? Is it a model initialization problem?
I tried multiple times on EC2 and cannot reproduce the error yet. Since you are getting an exploding loss after the first update, I suspect your pretrained model or the initialized tail layers are abnormal. Let me think about where this could go wrong.
@zhreshold any ideas about this problem?
@1292765944 Try removing the related pretrained models in ~/.mxnet/models, updating mxnet/gluon-cv, and running with the default parameters again.
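For reference, a minimal sketch of that cleanup, assuming the default cache directory ~/.mxnet/models and the GluonCV model_zoo API (the ssd_300_vgg16_atrous_voc name below is just an example):

```python
# Sketch: delete cached pretrained weights so they are re-downloaded fresh.
# Assumes the default cache directory ~/.mxnet/models used by MXNet/GluonCV.
import os
import glob

cache_dir = os.path.expanduser('~/.mxnet/models')
for params_file in glob.glob(os.path.join(cache_dir, '*.params')):
    os.remove(params_file)

# Requesting a pretrained base model again triggers a fresh download.
from gluoncv import model_zoo
net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=True)
```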
Hello, we have reinstalled mxnet and reloaded the initial parameters, but it still doesn't work.
We fixed a minor problem with COCO, so the COCO behavior is confirmed. I will try some instances with CUDA 8.0 and the latest mxnet; please let me know if you have new findings.
I also encountered this problem at first; decreasing the lr was my solution.
@zqburde How did you set the lr? Does it hurt the final mAP?
@1292765944 I changed the lr from 0.001 to 0.0008 and increased the number of epochs; it doesn't affect the final mAP.
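For example, using the training script's own flags (the exact values below are illustrative, not necessarily the ones used above):

python3 train_ssd.py --network vgg16_atrous --data-shape 300 --dataset voc --gpus 0 --lr 0.0008 --epochs 280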
I am working with a GTX 1080 Ti and I had the same issue on the previous gluoncv release (0.1): the loss was rapidly going to nan. Unfortunately I can no longer reproduce it on gluoncv 0.2. For better understanding I also implemented my own SSD, inspired by amdegroot's ssd.pytorch, and had no issues. Then I tried to improve my implementation using some gluoncv concepts and the training loss went to nan too. In my experiments it is linked to the initializer used in the VGGAtrousBase class; removing this initializer solved my nan issue.
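If anyone wants to experiment along those lines, here is a rough, illustrative sketch of overriding the initialization of the layers that are not covered by the pretrained base; the Xavier settings below are an assumption chosen for illustration, not the exact initializer inside VGGAtrousBase:

```python
# Illustrative only: give the not-yet-initialized layers a conservative Xavier
# initializer while keeping the pretrained VGG base weights untouched.
import mxnet as mx
from gluoncv import model_zoo

net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=True)
# Parameters already loaded from the pretrained base are skipped (with a
# warning); only deferred-initialization parameters pick up this initializer.
net.initialize(mx.init.Xavier(magnitude=2))
net.collect_params().reset_ctx(mx.gpu(0))
```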
When I train SSD on COCO, the training curve looks normal, but at validation the AP[0.5:0.95] is always nearly zero. What could the problem be? Here are my parameter settings and my training curve:
What I got after one epoch with: python3 ../gluon-cv/scripts/detection/ssd/train_ssd.py --gpus 0,1,2,3 -j 32 --network vgg16_atrous --data-shape 300 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240
Today I found that training with python3 works fine, but not with python2. Can you have a try? @zhreshold
@zhreshold OK, I will convert orig_height and orig_width to float and have a try, thanks!
Closing this, let me know if it is still a problem.
Hi all. Recently I also encountered this problem (loss = nan). Specifically, when I trained ssd512_vgg16_atrous on a GTX 1080 for face detection with batch size 8, both the SmoothL1Loss and the CrossEntropy loss were always nan. Then I commented out the net.hybridize() line in the train and validate functions, the loss became normal, and training succeeded. Finally, with net.hybridize() enabled, I changed the batch size from 8 to 16 and the lr from 0.001 to 0.00001 on VOC2012 and a self-designed dataset, and observed the following: the conclusion may be that a small batch size tends to produce a nan loss when net.hybridize() is used. A possible solution is to comment it out, or to use a larger batch size if the GPU supports it. Alternatively, you can switch to ssd300_vgg16, for which a GTX 1080 also supports batch sizes >= 16.
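For anyone trying the same workaround, here is a minimal sketch of the relevant part of a Gluon training loop with hybridization left off; the names net, trainer, mbox_loss, and train_data are assumed to be set up as in train_ssd.py:

```python
# Sketch of the training step with hybridization disabled, assuming the usual
# train_ssd.py objects: net (SSD model), trainer (gluon.Trainer),
# mbox_loss (SSDMultiBoxLoss), and train_data (the training DataLoader).
from mxnet import autograd

# net.hybridize()  # the workaround described above: leave hybridization off

for batch in train_data:
    data, cls_targets, box_targets = batch[0], batch[1], batch[2]
    with autograd.record():
        cls_preds, box_preds, _ = net(data)
        sum_loss, cls_loss, box_loss = mbox_loss(
            [cls_preds], [box_preds], [cls_targets], [box_targets])
        autograd.backward(sum_loss)
    trainer.step(1)
```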
@zhreshold CentOS 6.4, command:
ps:
@Feywell Try reducing the lr slightly.
Hi,
1st: python3 train_ssd.py --batch-size 12 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
2nd: python3 train_ssd.py --batch-size 8 --num-workers 10 --gpus 0 --log-interval 1 --lr 0.0005
@Intellige Have you solved the problem? How? Hello, I have a similar problem; I also tried python 3.6, but I got the same result.
Hi, sorry for the late reply.
@Intellige Thanks. It's really confusing...
Please reduce the learning rate a little bit in case you meet a sudden nan loss.
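A hedged sketch of what that can look like in the training loop, assuming a gluon.Trainer named trainer and the summed loss from the sketch above:

```python
# Sketch: skip the update and back off the learning rate when the loss
# suddenly becomes non-finite. `trainer` is a mxnet.gluon.Trainer and
# `sum_loss` a list of per-device loss NDArrays, as in train_ssd.py.
import math

loss_value = sum(l.mean().asscalar() for l in sum_loss)
if math.isfinite(loss_value):
    trainer.step(1)
else:
    trainer.set_learning_rate(trainer.learning_rate * 0.5)
    print('non-finite loss, lowering lr to', trainer.learning_rate)
```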
Just an update: the root cause has been found and the fix has been merged to master: apache/mxnet#14209. By using a master/nightly-built pip package you hopefully won't meet the same problem any more.
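For reference, installing the pre-release build at that time looked roughly like this (the exact package name depends on your CUDA version; mxnet-cu90 is just an example):

pip install --pre --upgrade mxnet-cu90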