This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

mx.sym.WarpCTC cuda memcpy or memset failed issue #6121

Closed

xinq2016 opened this issue May 5, 2017 · 19 comments

Comments

@xinq2016

xinq2016 commented May 5, 2017

Environment info

Operating System: Ubuntu 14.04
GPU: GTX 1080

Compiler:

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.9.5 in python

Or if installed from source: git clone https://github.com/dmlc/mxnet.git ~/mxnet --recursive

Error Message:

[16:36:11] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[ INFO][2017/05/05 16:36:25.768] ---------train---------
terminate called after throwing an instance of 'std::runtime_error'
what(): Error: compute_ctc_loss, stat = cuda memcpy or memset failed

Minimum reproducible example

cd ~
git clone https://github.com/baidu-research/warp-ctc
cd warp-ctc
mkdir build
cd build
cmake ..
make
sudo make install

git clone https://github.com/dmlc/mxnet.git ~/mxnet --recursive
cd ~/mxnet
cp make/config.mk .

modify the config.mk as follows:
USE_BLAS = openblas
USE_CUDA = 1
USE_CUDA_PATH = /usr/local/cuda
USE_CUDNN = 1
WARPCTC_PATH = /home/nd/warp-ctc   # the path where my warp-ctc is installed
MXNET_PLUGINS += plugin/warpctc/warpctc.mk
CUDA_ARCH := -gencode arch=compute_30,code=sm_30 \
             -gencode arch=compute_35,code=sm_35 \
             -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_60,code=sm_60 \
             -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_61,code=compute_61

make
cd python
sudo python setup.py install

Steps to reproduce

python main.py --configfile default.cfg

the ctc layer code:
net = mx.sym.WarpCTC(data=net, label=label, label_length=num_label, input_length=seq_len)
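
For context, the WarpCTC layer is typically the last node of the symbol graph; a minimal sketch is shown below, where the sizes, variable names and the FullyConnected projection are placeholders for illustration and are not taken from main.py:

import mxnet as mx

# Placeholder sizes for illustration only.
seq_len, num_label, num_classes = 80, 10, 11

data = mx.sym.Variable('data')    # assumed shape: (seq_len * batch_size, num_hidden)
label = mx.sym.Variable('label')  # assumed shape: (batch_size, num_label)

# Hypothetical projection onto the alphabet (plus the CTC blank) before the loss layer.
net = mx.sym.FullyConnected(data=data, num_hidden=num_classes, name='pred_fc')
net = mx.sym.WarpCTC(data=net, label=label, label_length=num_label, input_length=seq_len)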

How can I fix it?

Many thanks
Xin.q.

@piiswrong
Contributor

@sbodenstein Could you look into this?
We can:

  1. revert the ctc_loss pr
  2. disable ctc_loss when using WarpCTC
  3. fix the incompatibility.

@sbodenstein
Contributor

Looking at this now. I think we should aim for option 2 or 3.

@sbodenstein
Contributor

@piiswrong: I'm having difficulties trying to reproduce this due to #6032 (and I'm using a Mac).

@piiswrong
Contributor

piiswrong commented May 5, 2017

OK, then could you first disable ctc_loss when the warp-ctc plugin is used? We don't want to break existing features. Also make sure none of the copied ctc/modern GPU code is in the include path.
Could you also rename ctc_loss to _contrib_ctc_loss, so that it goes into mx.sym.contrib?
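
For reference, a rough sketch of what the two call sites would look like after such a rename; the argument list of the built-in loss is an assumption here (only data and label are shown) and the lengths are placeholders:

import mxnet as mx

data = mx.sym.Variable('data')
label = mx.sym.Variable('label')

# The warp-ctc plugin operator stays where it is:
plugin_loss = mx.sym.WarpCTC(data=data, label=label, label_length=10, input_length=80)

# After the rename, the built-in loss would be reached through the contrib namespace
# (argument names here are an assumption, not the final API):
builtin_loss = mx.sym.contrib.ctc_loss(data=data, label=label)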

@sbodenstein
Contributor

Sure, I will do those things.

For disabling: how would you recommend I do this?

@piiswrong
Contributor

I'm actually not sure.
One way I can think of is making this another plugin. Otherwise the .cc files will always be compiled unless you guard everything with macros.

I think the problem here is that you have two implementations of the same function (one from the original Baidu ctc .so, one from the absorbed code). One solution I can think of is to wrap all of the absorbed ctc code in a different namespace.

@sbodenstein
Contributor

I do think it shouldn't be too hard to fix (e.g. with namespaces). But it's annoying that I can't build MXNet with GPU support at the moment due to the OS X bug.

@piiswrong
Contributor

You can work around it temporarily by deleting the quantize/dequantize ops under contrib.

@xinq2016
Author

xinq2016 commented May 9, 2017

@piiswrong
Is it OK to use the v0.9.3 official release to run the example correctly?
How can I set up MXNet from the source code? I found that some folders are missing when I install MXNet from the source code download, whereas installing via git clone gets the latest version.

Many thanks
xin.q.

@sbodenstein
Contributor

@xinq2016, @piiswrong: I can finally reproduce this problem. I have confirmed that this example works when you delete all the new ctc_loss files (the folder ctc_include, ctc_loss.cc, ctc_loss.cu and ctc_loss-inl.h), but breaks when you include them.

@xinq2016
Author

xinq2016 commented May 9, 2017

@sbodenstein
OK, thanks a lot.
It works now.

Many thanks

@dmas-at-wiris

dmas-at-wiris commented May 24, 2017

I found the same problem. @sbodenstein or @xinq2016, can you provide more details on how to fix this?

Edit: Found them! Removing them works. They are located in ~/mxnet/src/operator/contrib

@KeyKy
Contributor

KeyKy commented May 27, 2017

@sbodenstein @piiswrong I deleted ctc_include, ctc_loss.cc, ctc_loss.cu and ctc_loss-inl.h under ~/mxnet/src/operator/contrib, but I still get the error message:

terminate called after throwing an instance of 'std::runtime_error'
 what():  Error: compute_ctc_loss, stat = cuda memcpy or memset failed

@sbodenstein
Contributor

@KeyKy: did you do make clean, and then make? Also, I should have a fix for this by the end of the weekend.
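
One quick sanity check after make clean and a rebuild is to see from Python which of the two operators actually got registered; the expected values below assume the contrib ctc_loss files were removed and the warp-ctc plugin was enabled in config.mk:

import mxnet as mx

# The plugin operator from plugin/warpctc should be registered when the plugin is compiled in.
print(hasattr(mx.sym, 'WarpCTC'))   # expected: True

# The built-in loss should be absent once the contrib ctc_loss files are deleted.
print(hasattr(mx.sym, 'ctc_loss'))  # expected: False on such a build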

@tobechao

@KeyKy @sbodenstein is right, it works now, thank you.

@FreemanX

Is this fixed yet? I got the same error with MXNet 0.10.1.

@jcftang

jcftang commented Aug 2, 2017

This doesn't appear to be fixed, it does not work for me either.

@sbodenstein
Contributor

@jcftang: this is strange. It's fixed for some (like myself), but not for others.

Could you give info about the GPU you are using?

@jcftang

jcftang commented Aug 5, 2017

@sbodenstein I have NVIDIA 1080 Ti's (Founders Edition).
