Cudnn error when training faster rcnn #4656

ghost · 2017-01-13T02:26:29Z

I'm trying to train Faster R-CNN on my own dataset. My code is based on the example at https://github.com/dmlc/mxnet/tree/master/example/rcnn; I only modified the labels to fit my dataset. I compiled MXNet with CUDA 8.0 and cuDNN.

The command I use to train is:

    python train_alternate.py

The parameters (all left at their defaults) are as follows:

    import argparse
    import os

    def parse_args():
        # Defaults below are the values used for the failing run.
        parser = argparse.ArgumentParser(description='Train Faster R-CNN Network')
        parser.add_argument('--image_set', dest='image_set', help='can be trainval or train',
                            default='trainval', type=str)
        parser.add_argument('--test_image_set', dest='test_image_set', help='can be test or val',
                            default='test', type=str)
        parser.add_argument('--year', dest='year', help='can be 2007, 2010, 2012',
                            default='2007', type=str)
        parser.add_argument('--root_path', dest='root_path', help='output data folder',
                            default=os.path.join(os.getcwd(), 'data'), type=str)
        parser.add_argument('--devkit_path', dest='devkit_path', help='VOCdevkit path',
                            default=os.path.join(os.getcwd(), 'data', 'VOCdevkit'), type=str)
        parser.add_argument('--pretrained', dest='pretrained', help='pretrained model prefix',
                            default=os.path.join(os.getcwd(), 'model', 'vgg16'), type=str)
        parser.add_argument('--epoch', dest='epoch', help='epoch of pretrained model',
                            default=1, type=int)
        parser.add_argument('--gpus', dest='gpu_ids', help='GPU device to train with',
                            default='1', type=str)
        parser.add_argument('--begin_epoch', dest='begin_epoch', help='begin epoch of training',
                            default=0, type=int)
        parser.add_argument('--rpn_epoch', dest='rpn_epoch', help='end epoch of rpn training',
                            default=8, type=int)
        parser.add_argument('--rcnn_epoch', dest='rcnn_epoch', help='end epoch of rcnn training',
                            default=8, type=int)
        parser.add_argument('--frequent', dest='frequent', help='frequency of logging',
                            default=20, type=int)
        parser.add_argument('--kv_store', dest='kv_store', help='the kv-store type',
                            default='device', type=str)
        parser.add_argument('--work_load_list', dest='work_load_list', help='work load for different devices',
                            default=None, type=list)
        args = parser.parse_args()
        return args
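
Note that `--gpus` defaults to the string `'1'`, which selects the second GPU rather than the first. A minimal sketch of how such a value is usually turned into device ids, assuming the training script splits it on commas (the helper name `parse_gpu_ids` is hypothetical and is not taken from the example code):

```python
def parse_gpu_ids(gpu_ids):
    """Turn a comma-separated string such as '0,1' into a list of ints."""
    # Skip empty fragments so a trailing comma does not raise ValueError.
    return [int(i) for i in gpu_ids.split(',') if i.strip()]

print(parse_gpu_ids('1'))    # the default used in this run
print(parse_gpu_ids('0,1'))  # two devices
```

If another process is already using GPU 1, it may be worth checking its free memory before training.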

Error Message:

INFO:root:Epoch[0] Train-Accuracy=0.939022
INFO:root:Epoch[0] Train-LogLoss=0.207042
INFO:root:Epoch[0] Train-SmoothL1Loss=0.151098
INFO:root:Epoch[0] Time cost=607.882
INFO:root:Saved checkpoint to "model/rcnn2-0001.params"
INFO:root:Epoch[1] Batch [20]	Speed: 4.77 samples/sec	Train-Accuracy=0.944196,	LogLoss=0.193479,	SmoothL1Loss=0.160497
INFO:root:Epoch[1] Batch [40]	Speed: 4.62 samples/sec	Train-Accuracy=0.949695,	LogLoss=0.169040,	SmoothL1Loss=0.150361
INFO:root:Epoch[1] Batch [60]	Speed: 4.96 samples/sec	Train-Accuracy=0.952741,	LogLoss=0.158988,	SmoothL1Loss=0.149162
INFO:root:Epoch[1] Batch [80]	Speed: 5.07 samples/sec	Train-Accuracy=0.951100,	LogLoss=0.154413,	SmoothL1Loss=0.144357
...
INFO:root:Epoch[1] Batch [840]	Speed: 4.97 samples/sec	Train-Accuracy=0.953246,	LogLoss=0.141077,	SmoothL1Loss=0.131647
INFO:root:Epoch[1] Batch [860]	Speed: 5.10 samples/sec	Train-Accuracy=0.953252,	LogLoss=0.141045,	SmoothL1Loss=0.131450
INFO:root:Epoch[1] Batch [880]	Speed: 5.15 samples/sec	Train-Accuracy=0.953098,	LogLoss=0.141462,	SmoothL1Loss=0.132056
[01:21:26] /home/eva/mxnet/dmlc-core/include/dmlc/./logging.h:300: [01:21:26] src/operator/./cudnn_convolution-inl.h:517: Check failed: cudnnFindConvolutionForwardAlgorithm(s->dnn_handle_, in_desc_, filter_desc_, conv_desc_, out_desc_, kMaxAlgos, &nalgo, fwd_algo) == CUDNN_STATUS_SUCCESS (2 vs. 0) 

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b90a4329c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZZN5mxnet2op18CuDNNConvolutionOpIfE10SelectAlgoERKNS_7ContextERKSt6vectorIN4nnvm6TShapeESaIS8_EESC_ENKUlNS_10RunContextEE_clESD_+0x42a) [0x7f0b91ce9c7a]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f0b91203453]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8c) [0x7f0b9124fd4c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f0b91253120]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f0b83eeea60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f0bbac29182]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0bba95647d]
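
The `(2 vs. 0)` in the failed check compares the status returned by `cudnnFindConvolutionForwardAlgorithm` with `CUDNN_STATUS_SUCCESS`. A minimal sketch for decoding it, assuming the `cudnnStatus_t` numbering from `cudnn.h` (verify against the header MXNet was built with; the helper name `decode_cudnn_check` is hypothetical):

```python
import re

# cudnnStatus_t values as listed in cudnn.h (assumption: unchanged in the
# cuDNN release used for this build).
CUDNN_STATUS = {
    0: "CUDNN_STATUS_SUCCESS",
    1: "CUDNN_STATUS_NOT_INITIALIZED",
    2: "CUDNN_STATUS_ALLOC_FAILED",
    3: "CUDNN_STATUS_BAD_PARAM",
    4: "CUDNN_STATUS_INTERNAL_ERROR",
    5: "CUDNN_STATUS_INVALID_VALUE",
    6: "CUDNN_STATUS_ARCH_MISMATCH",
    7: "CUDNN_STATUS_MAPPING_ERROR",
    8: "CUDNN_STATUS_EXECUTION_FAILED",
    9: "CUDNN_STATUS_NOT_SUPPORTED",
    10: "CUDNN_STATUS_LICENSE_ERROR",
}

def decode_cudnn_check(line):
    """Return the cuDNN status name from an MXNet 'Check failed' log line."""
    m = re.search(r"== CUDNN_STATUS_SUCCESS \((\d+) vs\. (\d+)\)", line)
    if m is None:
        return None  # not a cuDNN status check
    code = int(m.group(1))
    return CUDNN_STATUS.get(code, "unknown status %d" % code)

# Shortened copy of the failing line above (arguments elided here).
line = ("Check failed: cudnnFindConvolutionForwardAlgorithm(...) "
        "== CUDNN_STATUS_SUCCESS (2 vs. 0)")
print(decode_cudnn_check(line))
```

Under that numbering, status 2 is `CUDNN_STATUS_ALLOC_FAILED`, which suggests the algorithm search could not allocate GPU memory for its workspace. That reading is an inference from the status code alone and is not confirmed anywhere in this thread. If it is the cause, disabling autotuning with the `MXNET_CUDNN_AUTOTUNE_DEFAULT=0` environment variable may be worth trying, assuming the installed build supports it.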

Environment info

Operating System: Ubuntu 14.04

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.9.1

MXNet commit hash (git rev-parse HEAD): fbb68859699861bc104f4a692c660d74cff72f66

Python version and distribution: Python 2.7


yajiedesign · 2017-09-28T07:58:40Z

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!

ijkguo mentioned this issue Jan 18, 2017

rcnn example issue collection #4713

Closed

yajiedesign closed this as completed Sep 28, 2017
