
Custom-Op Bug when using multiple custom-ops #4521

Closed
lyttonhao opened this issue Jan 4, 2017 · 4 comments

Comments

@lyttonhao
Contributor

I found that when using multiple custom ops, the program gets stuck; the engine appears to be deadlocking. The problem occurs when a custom op contains code like `mx.nd.xx(xx).asnumpy()`, and it does not occur when using NaiveEngine.

I have written an example to reproduce this bug. You can put the file under 'example/numpy-ops' and run it from there. With line 15 added, the program gets stuck; otherwise it works fine.

MXNet version: tested on two versions.

  1. the newest master: ceb9f01
  2. an older master: 01cde15
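
For context, here is a minimal sketch of the pattern described above. This is not the original attached script; the op name `square_asnumpy` is illustrative, and it assumes the mx.operator.CustomOp API:

```python
import mxnet as mx

class SquareAsNumpy(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # The mx.nd.xx(xx).asnumpy() pattern from the report: asnumpy()
        # synchronously waits on the engine from inside the worker thread
        # that is executing this op.
        y = mx.nd.square(in_data[0]).asnumpy()
        self.assign(out_data[0], req[0], mx.nd.array(y, ctx=in_data[0].context))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # d(x^2)/dx = 2x
        self.assign(in_grad[0], req[0], 2 * in_data[0] * out_grad[0])

@mx.operator.register("square_asnumpy")
class SquareAsNumpyProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(SquareAsNumpyProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return SquareAsNumpy()
```

Placing two or more such ops in one symbol, e.g. `mx.sym.Custom(data, op_type='square_asnumpy')` on separate branches, exercises the multiple-custom-op case; each blocked asnumpy() call ties up a CPU worker thread, which appears to be why the worker-thread workaround below helps.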
@sxjscience
Member

sxjscience commented Jan 4, 2017

I've tried it and the script runs well. I'm using the latest dmlc/master, compiled with the master versions of nnvm + mshadow + dmlc-core.
The script runs fine on my Windows build but gets stuck on my Linux build...
I've also tried both ThreadedEngine and ThreadedEnginePerDevice.
Log

(C:\Anaconda2) D:\HKUST\mxnet\example\numpy-ops>python nnvm_customop_bug.py
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
[23:19:24] D:\HKUST\mxnet\src\engine\engine.cc:36: MXNet start using engine: ThreadedEnginePerDevice
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(100,784)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_0=0.534000
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_1=0.534000
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_0=0.850400
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_1=0.850400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_0=0.887400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_1=0.887400
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.894000
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.894000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_0=0.905000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_1=0.905000
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.909400
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.909400
INFO:root:Epoch[0] Batch [350]  Speed: 31446.56 samples/sec     Train-multi-accuracy_0=0.916000
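
For reference, the engines compared above (including the NaiveEngine mentioned in the report) can be selected through the MXNET_ENGINE_TYPE environment variable; a minimal sketch, assuming it is set before mxnet is imported:

```python
import os
# Pick the dispatch engine before importing mxnet. Valid values include
# "NaiveEngine" (synchronous), "ThreadedEngine", and
# "ThreadedEnginePerDevice" (the engine shown in the log above).
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
import mxnet as mx
```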

@piiswrong
Contributor

piiswrong commented Jan 4, 2017

`os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"`
Add this at the beginning of the script, before importing mxnet.
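
A minimal sketch of applying this workaround; the reproduction script itself stays unchanged:

```python
import os
# Raise the CPU worker thread count so one worker can block inside a custom
# op's asnumpy() call while other workers keep making progress. This must be
# set before mxnet is imported.
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"
import mxnet as mx
```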

@lyttonhao
Contributor Author

It has been fixed by #4528

@coconutyao

> `os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"`
> Add this at the beginning of the script, before importing mxnet.

I need some help, thank you! The deadlock happened while calling MXNDArraySyncCopyToCPU()?
