This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Cudnn error when training faster rcnn #4656

Closed
ghost opened this issue Jan 13, 2017 · 1 comment

ghost commented Jan 13, 2017

I'm trying to train Faster R-CNN on my own dataset. My code is based on the example at https://github.com/dmlc/mxnet/tree/master/example/rcnn, with the labels modified to fit my dataset. I compiled MXNet with CUDA 8.0 and cuDNN.

The command I use to train is:

python train_alternate.py

The parameters are as follows:

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser(description='Train Faster R-CNN Network')
    parser.add_argument('--image_set', dest='image_set', help='can be trainval or train',
                        default='trainval', type=str)
    parser.add_argument('--test_image_set', dest='test_image_set', help='can be test or val',
                        default='test', type=str)
    parser.add_argument('--year', dest='year', help='can be 2007, 2010, 2012',
                        default='2007', type=str)
    parser.add_argument('--root_path', dest='root_path', help='output data folder',
                        default=os.path.join(os.getcwd(), 'data'), type=str)
    parser.add_argument('--devkit_path', dest='devkit_path', help='VOCdevkit path',
                        default=os.path.join(os.getcwd(), 'data', 'VOCdevkit'), type=str)
    parser.add_argument('--pretrained', dest='pretrained', help='pretrained model prefix',
                        default=os.path.join(os.getcwd(), 'model', 'vgg16'), type=str)
    parser.add_argument('--epoch', dest='epoch', help='epoch of pretrained model',
                        default=1, type=int)
    parser.add_argument('--gpus', dest='gpu_ids', help='GPU device to train with',
                        default='1', type=str)
    parser.add_argument('--begin_epoch', dest='begin_epoch', help='begin epoch of training',
                        default=0, type=int)
    parser.add_argument('--rpn_epoch', dest='rpn_epoch', help='end epoch of rpn training',
                        default=8, type=int)
    parser.add_argument('--rcnn_epoch', dest='rcnn_epoch', help='end epoch of rcnn training',
                        default=8, type=int)
    parser.add_argument('--frequent', dest='frequent', help='frequency of logging',
                        default=20, type=int)
    parser.add_argument('--kv_store', dest='kv_store', help='the kv-store type',
                        default='device', type=str)
    parser.add_argument('--work_load_list', dest='work_load_list', help='work load for different devices',
                        default=None, type=list)
    args = parser.parse_args()
    return args
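
Two details in the argument list above are easy to misread, so here is a minimal sketch of how they behave. `--gpus` defaults to the string `'1'`, which selects the second GPU rather than the first, and `type=list` makes argparse split the raw string into single characters. The helper `parse_int_list` and the flag `--naive_list` are hypothetical names used only for this illustration; they are not part of the example script.

```python
import argparse


def parse_int_list(text):
    """Hypothetical helper: turn a string such as '0,1' into [0, 1]."""
    return [int(part) for part in text.split(',') if part.strip()]


parser = argparse.ArgumentParser(description='argparse list handling sketch')
# Same style as the script: GPU ids arrive as one comma-separated string.
parser.add_argument('--gpus', dest='gpu_ids', default='1', type=str)
# type=list calls list() on the raw string, which yields single characters.
parser.add_argument('--naive_list', dest='naive_list', default=None, type=list)
# A callable that splits on commas gives the list of integers one would expect.
parser.add_argument('--work_load_list', dest='work_load_list',
                    default=None, type=parse_int_list)

args = parser.parse_args(['--gpus', '0,1',
                          '--naive_list', '1,2',
                          '--work_load_list', '1,2'])
gpu_ids = parse_int_list(args.gpu_ids)
default_gpu_ids = parse_int_list(parser.parse_args([]).gpu_ids)

print(gpu_ids)              # device ids parsed from '--gpus 0,1'
print(args.naive_list)      # characters, including the comma
print(args.work_load_list)  # integers
print(default_gpu_ids)      # ids selected when --gpus is not given
```

With the defaults shown in the issue, training would run on GPU 1 only, so any memory check should look at that device.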

Error Message:

INFO:root:Epoch[0] Train-Accuracy=0.939022
INFO:root:Epoch[0] Train-LogLoss=0.207042
INFO:root:Epoch[0] Train-SmoothL1Loss=0.151098
INFO:root:Epoch[0] Time cost=607.882
INFO:root:Saved checkpoint to "model/rcnn2-0001.params"
INFO:root:Epoch[1] Batch [20]	Speed: 4.77 samples/sec	Train-Accuracy=0.944196,	LogLoss=0.193479,	SmoothL1Loss=0.160497
INFO:root:Epoch[1] Batch [40]	Speed: 4.62 samples/sec	Train-Accuracy=0.949695,	LogLoss=0.169040,	SmoothL1Loss=0.150361
INFO:root:Epoch[1] Batch [60]	Speed: 4.96 samples/sec	Train-Accuracy=0.952741,	LogLoss=0.158988,	SmoothL1Loss=0.149162
INFO:root:Epoch[1] Batch [80]	Speed: 5.07 samples/sec	Train-Accuracy=0.951100,	LogLoss=0.154413,	SmoothL1Loss=0.144357
...
INFO:root:Epoch[1] Batch [840]	Speed: 4.97 samples/sec	Train-Accuracy=0.953246,	LogLoss=0.141077,	SmoothL1Loss=0.131647
INFO:root:Epoch[1] Batch [860]	Speed: 5.10 samples/sec	Train-Accuracy=0.953252,	LogLoss=0.141045,	SmoothL1Loss=0.131450
INFO:root:Epoch[1] Batch [880]	Speed: 5.15 samples/sec	Train-Accuracy=0.953098,	LogLoss=0.141462,	SmoothL1Loss=0.132056
[01:21:26] /home/eva/mxnet/dmlc-core/include/dmlc/./logging.h:300: [01:21:26] src/operator/./cudnn_convolution-inl.h:517: Check failed: cudnnFindConvolutionForwardAlgorithm(s->dnn_handle_, in_desc_, filter_desc_, conv_desc_, out_desc_, kMaxAlgos, &nalgo, fwd_algo) == CUDNN_STATUS_SUCCESS (2 vs. 0) 

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f0b90a4329c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZZN5mxnet2op18CuDNNConvolutionOpIfE10SelectAlgoERKNS_7ContextERKSt6vectorIN4nnvm6TShapeESaIS8_EESC_ENKUlNS_10RunContextEE_clESD_+0x42a) [0x7f0b91ce9c7a]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f0b91203453]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x8c) [0x7f0b9124fd4c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.1-py2.7-linux-x86_64.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7f0b91253120]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f0b83eeea60]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f0bbac29182]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0bba95647d]
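
The `(2 vs. 0)` at the end of the check failure is the returned `cudnnStatus_t` value compared against `CUDNN_STATUS_SUCCESS`. The sketch below decodes it, assuming the enum values from the `cudnn.h` headers of that period; the function name `decode_check_failure` is hypothetical and the log line is shortened for the example.

```python
import re

# Leading entries of cuDNN's cudnnStatus_t enum (assumed from cudnn.h).
CUDNN_STATUS_NAMES = {
    0: "CUDNN_STATUS_SUCCESS",
    1: "CUDNN_STATUS_NOT_INITIALIZED",
    2: "CUDNN_STATUS_ALLOC_FAILED",
    3: "CUDNN_STATUS_BAD_PARAM",
    4: "CUDNN_STATUS_INTERNAL_ERROR",
    5: "CUDNN_STATUS_INVALID_VALUE",
    6: "CUDNN_STATUS_ARCH_MISMATCH",
    7: "CUDNN_STATUS_MAPPING_ERROR",
    8: "CUDNN_STATUS_EXECUTION_FAILED",
    9: "CUDNN_STATUS_NOT_SUPPORTED",
    10: "CUDNN_STATUS_LICENSE_ERROR",
}

# Matches the "(actual vs. expected)" pair that dmlc CHECK_EQ appends.
CHECK_FAILED_RE = re.compile(r"Check failed: .*\((\d+) vs\. (\d+)\)")


def decode_check_failure(log_line):
    """Return (actual_name, expected_name), or None if the line does not match."""
    match = CHECK_FAILED_RE.search(log_line)
    if match is None:
        return None
    actual, expected = int(match.group(1)), int(match.group(2))
    return (CUDNN_STATUS_NAMES.get(actual, "unknown status %d" % actual),
            CUDNN_STATUS_NAMES.get(expected, "unknown status %d" % expected))


# Shortened copy of the failing line from the log above.
line = ("src/operator/./cudnn_convolution-inl.h:517: Check failed: "
        "cudnnFindConvolutionForwardAlgorithm(s->dnn_handle_, in_desc_) "
        "== CUDNN_STATUS_SUCCESS (2 vs. 0) ")
decoded = decode_check_failure(line)
print(decoded)
```

Under that assumption the status is `CUDNN_STATUS_ALLOC_FAILED`, meaning a resource allocation inside cuDNN failed. That often points to GPU memory pressure, although the log alone does not confirm the cause.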

Environment info

Operating System: Ubuntu 14.04

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.9.1

MXNet commit hash (git rev-parse HEAD): fbb68859699861bc104f4a692c660d74cff72f66

Python version and distribution: Python 2.7
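
For anyone reproducing this report, the fields above can be collected programmatically. This is a rough sketch using only the standard library; `collect_env_info` is a hypothetical helper, and it assumes the package exposes a `__version__` attribute, falling back to a placeholder otherwise.

```python
import importlib
import platform
import sys


def collect_env_info(package_name="mxnet"):
    """Gather OS, Python version, and the version of one package if importable."""
    info = {
        "os": platform.platform(),
        "python": "%d.%d.%d" % sys.version_info[:3],
    }
    try:
        module = importlib.import_module(package_name)
        # Assumption: the package stores its version in __version__.
        info["package_version"] = getattr(module, "__version__", "unknown")
    except ImportError:
        info["package_version"] = "not installed"
    return info


env_info = collect_env_info()
for key in sorted(env_info):
    print("%s: %s" % (key, env_info[key]))
```

The cuDNN version and the commit hash are not covered here; they still have to be read from the build configuration and `git rev-parse HEAD`.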

@yajiedesign
Contributor

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
