rcnn example issue collection #4713

ijkguo · 2017-01-18T07:16:55Z

The rcnn example has been adapted to be compatible with the nnvm branch in https://github.com/precedenceguo/mx-rcnn. I am waiting for results to confirm that everything works. I invite you to try this new version right now.

I have collected some issues with the rcnn example.

4G memory is not enough for VGG e2e training.
cudaMalloc failed: out of memory #4224, Memory allocation failed #2913
This issue comes up because the py-faster-rcnn Caffe version needs only 3GB of memory. It uses cuDNN v3 and allocates memory dynamically. My experiment with cuDNN 5.1 uses 4GB. Backward mirroring will not help.
Python 3 / Windows compatibility.
rcnn train_end2end.py error: Cannot find custom operator type proposal #4007, Can't run example code with Python 3 #3995, import mxnet error #2601
Apart from print(), what is really incompatible? Is CustomOp compatible?
Related to Making mxnet/examples compatible for python2/3 #4071.

The following will be fixed soon.

CustomOp cpu training freezes or blocks.
block in Batch[2150] #4297, The program get stuck in check_call func when run rcnn demo.py on macbook Pro #4244, asnumpy() of NDArray @cpu halted #3724(with strange solution)
I don't know how to solve this.
UPDATE: Confirmed solved by fix custom op #4528 for the cpu demo. I won't test cpu training.
UPDATE: multi-gpu alternate training is broken. Will be fixed.
workspace is not enough.
Error Message: GPU is not enabled #4431, MxNet/example/rcnn/demo.py error #2694
Set it to workspace=2048 like https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/symbol/symbol_vgg.py. Will be fixed.
BN networks cannot train stably / ResNet support.
training faster-rcnn using resnet but the test detection seems not right #3852
Set use_global_stats=True. ResNet will be added soon. Will be fixed.
cudnn_auto_tune problem.
Cudnn error when training faster rcnn #4656
Set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT=0. This will be made the default.
multigpu e2e training
train faster-rcnn use two gpus, error occured like this #3836, multi-gpu support of joint training of rcnn example is broken #3639, Can I train the faster_RCNN with multi-GPU with multi-machine #3517
Will be fixed.
misc mistakes
very small mistake in mx-rcnn debug mode #4633, a question about rcnn exmaple, why set aux to zero when testing? #2975
Will be fixed.
testing memory explodes
Cannot run rcnn training #3321
Will be fixed by module testing.
performance issues
Python custom layer is extremely slow #3139
GPU testing will be 2x faster than caffe soon.
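
For reference, the three workarounds named in the list above (MXNET_CUDNN_AUTOTUNE_DEFAULT=0, workspace=2048 and use_global_stats=True) could be applied roughly as follows. This is only a sketch: the helper conv_bn_relu, the dict names and the layer shapes are made up for illustration and are not taken from the example code, and setting the variable before importing mxnet is simply the conservative choice.

```python
import os

# Disable cuDNN autotuning. Set this before `import mxnet` so the
# library is certain to see it.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# Keyword arguments passed to the symbol constructors below.
# The dict names are illustrative only.
CONV_KWARGS = {"workspace": 2048}        # larger temporary workspace for convolution (MB)
BN_KWARGS = {"use_global_stats": True}   # use stored moving statistics, not batch statistics


def conv_bn_relu(data, name, num_filter):
    """Hypothetical conv + BN + ReLU block that applies both settings."""
    import mxnet as mx  # imported lazily, after the env var is set

    conv = mx.sym.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                              num_filter=num_filter,
                              name=name + "_conv", **CONV_KWARGS)
    bn = mx.sym.BatchNorm(data=conv, name=name + "_bn", **BN_KWARGS)
    return mx.sym.Activation(data=bn, act_type="relu", name=name + "_relu")
```

With mxnet installed, conv_bn_relu(mx.sym.Variable("data"), "stage1", 64) would return a symbol. The same variable can also be set in the shell when launching a training script instead of in Python.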

There are some interesting things:
#3704, did the group symbol behavior change with nnvm?
#3542, I don't see anything wrong with the layout?
#2214, the converter issue is good for reference.

piiswrong · 2017-01-18T17:36:15Z

CustomOp cpu training freezes or blocks.
This should have been solved. Please pull.

4G memory is not enough for VGG e2e training.
We seem to have a lot of memory regressions recently. @precedenceguo Was this working on 0.8? @tqchen Could you take a look?

ijkguo · 2017-01-19T03:40:22Z

CustomOp cpu training freezes or blocks.
Confirmed solved.

4G memory is not enough for VGG e2e training.
The behavior is the same as in v0.8. I was wondering why Caffe needs even less memory. Is it related to cuDNN v3?

morusu · 2017-02-24T03:15:56Z

cudnn_auto_tune problem.
#4656
Set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT=0. This will be made the default.

Are there any other ways? I find MXNET_CUDNN_AUTOTUNE_DEFAULT very useful.

ijkguo · 2017-02-24T09:49:15Z

You can keep it if you like.

realwill · 2017-05-23T08:30:12Z

@precedenceguo It is not only customop cpu training that freezes; gpu training does too. When I train FCIS and deformable convolution networks on gpu, asnumpy() halts after a random number of iterations.

hzh8311 · 2017-06-08T07:31:46Z

I tried to train faster rcnn with resnet-101 as the backbone network and OHEM (online hard example mining) on 4 gpus, and ran into #4224 even though I have 12GB K80 GPUs. Is this just a memory shortage problem, or is it some other issue I don't know about?

Jerryzcn · 2017-07-12T22:59:25Z

It seems I cannot train with a batch size of more than 1 on each GPU. Will there be a fix for this? (I guess this would be a feature request.)

szha · 2017-10-29T00:26:45Z

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, please do check out our forum (and Chinese version) for general "how-to" questions.

ijkguo mentioned this issue Jan 19, 2017

Update rcnn example with accleration, module, resnet, coco and nnvm #4730

Merged

ijkguo mentioned this issue Apr 20, 2017

asnumpy() of NDArray @cpu halted #3724

Closed

szha closed this as completed Oct 29, 2017
