rcnn example issue collection #4713
Comments
Regarding the cudnn_auto_tune problem: are there any other ways to fix it? I find MXNET_CUDNN_AUTOTUNE_DEFAULT very useful.
You can keep it if you like.
@precedenceguo It is not only CustomOp CPU training that freezes; GPU training does too. When I train FCIS and deformable convolution networks on GPU, asnumpy() halts after a random number of iterations.
I am trying to train Faster R-CNN with ResNet-101 as the backbone network and OHEM (online hard example mining) on 4 GPUs, and I ran into #4224 even though I have 12GB K80 GPUs. Is this just a memory shortage, or is there some other issue I don't know about?
It seems I cannot train with a batch size larger than 1 on each GPU. Will there be a fix for this? (I guess this is a feature request.)
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
The rcnn example has been adapted to be compatible with the nnvm branch in https://github.com/precedenceguo/mx-rcnn. I am waiting for results to confirm that everything works. I invite you to try this new version right now.
I have collected some issues with the rcnn example.
4GB of memory is not enough for VGG e2e training.
cudaMalloc failed: out of memory #4224, Memory allocation failed #2913
This issue arises because the Caffe version, py-faster-rcnn, only needs 3GB of memory: it uses cuDNN v3 and allocates dynamically. My experiment with cuDNN 5.1 uses 4GB. Backward mirror will not help.
Python 3 / Windows compatibility.
rcnn train_end2end.py error: Cannot find custom operator type proposal #4007, Can't run example code with Python 3 #3995, import mxnet error #2601
Except for print(), what is really incompatible? Is customop compatible?
Relevant to Making mxnet/examples compatible for python2/3 #4071.
The following will be fixed soon.
CustomOp CPU training freezes or blocks.
block in Batch[2150] #4297, The program get stuck in check_call func when run rcnn demo.py on macbook Pro #4244, asnumpy() of NDArray @cpu halted #3724 (with a strange solution)
I don't know how to solve this.
UPDATE: Confirmed solved in fix custom op #4528 for the CPU demo. I won't test CPU training.
UPDATE: Multi-GPU alternate training is broken. Will be fixed.
workspace is not enough.
Error Message: GPU is not enabled #4431, MxNet/example/rcnn/demo.py error #2694
Set workspace=2048, as in https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/symbol/symbol_vgg.py. Will be fixed.
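As a rough illustration of the workspace setting: in MXNet's symbol API, `workspace` is a per-operator argument on `Convolution` that caps the operator's temporary memory, in MB. The sketch below is an assumption about how one VGG-style layer could be declared, not a copy of symbol_vgg.py. The layer name, the filter count, and the helper `build_conv` are hypothetical.

```python
# Keyword arguments for one VGG-style 3x3 convolution.
# workspace is the temporary memory limit for the operator, in MB.
CONV_KWARGS = {
    "kernel": (3, 3),
    "pad": (1, 1),
    "num_filter": 64,   # illustrative value
    "workspace": 2048,  # the value suggested above
}


def build_conv(data, name="conv1_1"):
    """Hypothetical helper: declare a convolution with the larger workspace."""
    # Imported lazily so the settings can be inspected without MXNet installed.
    import mxnet as mx
    return mx.symbol.Convolution(data=data, name=name, **CONV_KWARGS)
```

Every convolution in the network would need the same argument, so keeping it in one shared dictionary avoids missing a layer.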
BN networks cannot train stably / ResNet support.
training faster-rcnn using resnet but the test detection seems not right #3852
Set use_global_stats=True. ResNet will be added soon. Will be fixed.
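A likely reason use_global_stats=True helps: batch normalization normally normalizes with the statistics of the current mini-batch, and those are noisy when each GPU sees only one or two images. With use_global_stats=True, the stored moving mean and variance are used instead. Here is a minimal pure-Python sketch of the two modes for a single channel. The function name and the eps value are illustrative, and the scale and shift terms are omitted.

```python
import math


def batch_norm_1d(values, moving_mean, moving_var, use_global_stats, eps=1e-5):
    """Normalize one channel; gamma and beta are omitted for brevity."""
    if use_global_stats:
        # Use the stored (moving) statistics, independent of this batch.
        mean, var = moving_mean, moving_var
    else:
        # Use the statistics of the current batch only.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Extreme case of a single activation: batch statistics collapse it,
# while the stored statistics keep the information.
print(batch_norm_1d([3.0], 1.0, 4.0, use_global_stats=False))
print(batch_norm_1d([3.0], 1.0, 4.0, use_global_stats=True))
```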
cudnn_auto_tune problem.
Cudnn error when training faster rcnn #4656
Set env MXNET_CUDNN_AUTOTUNE_DEFAULT=0. Will be fixed as default.
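One way to apply the environment variable from inside a Python script is sketched below. The helper name is hypothetical, and setting the variable before importing MXNet is a cautious assumption rather than a documented requirement.

```python
import os


def disable_cudnn_autotune():
    """Turn off cuDNN autotuning for this process."""
    os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"


disable_cudnn_autotune()
# import mxnet as mx  # import MXNet only after the variable is set
```

Setting the variable in the shell that launches training has the same effect and does not require touching the script.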
multi-GPU e2e training
train faster-rcnn use two gpus, error occured like this #3836, multi-gpu support of joint training of rcnn example is broken #3639, Can I train the faster_RCNN with multi-GPU with multi-machine #3517
Will be fixed.
misc mistakes
very small mistake in mx-rcnn debug mode #4633, a question about rcnn exmaple, why set aux to zero when testing? #2975
Will be fixed.
testing memory explodes
Cannot run rcnn training #3321
Will be fixed by module testing.
performance issues
Python custom layer is extremely slow #3139
GPU testing will be 2x faster than caffe soon.
There are some interesting things:
#3704, the group symbol behavior is changed with nnvm?
#3542, I don't see anything wrong with layout?
#2214, the converter issue is good for reference.