
Custom-Op Bug when using multiple custom-ops #4521

Closed
lyttonhao opened this issue Jan 4, 2017 · 4 comments

Comments

@lyttonhao
Contributor

I found that when using multiple custom ops, the program gets stuck; the engine appears to be deadlocking. The problem occurs when a custom op contains code like `mx.nd.xx(xx).asnumpy()`, and it does not occur when using NaiveEngine.

I have written an example to reproduce this bug. You can put the file under 'example/numpy-ops' and run it from there. With line 15 added, the program gets stuck; otherwise it works fine.

MXNet version: tested on two versions.

  1. the newest master: ceb9f01
  2. an older master: 01cde15
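
For context, here is a minimal sketch of the pattern described above. This is not the original attached script; the op name `square_asnumpy` is illustrative, and it assumes the mx.operator.CustomOp API:

```python
import mxnet as mx

class SquareAsNumpy(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # The mx.nd.xx(xx).asnumpy() pattern from the report: asnumpy()
        # synchronously waits on the engine from inside the worker thread
        # that is executing this op.
        y = mx.nd.square(in_data[0]).asnumpy()
        self.assign(out_data[0], req[0], mx.nd.array(y, ctx=in_data[0].context))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # d(x^2)/dx = 2x
        self.assign(in_grad[0], req[0], 2 * in_data[0] * out_grad[0])

@mx.operator.register("square_asnumpy")
class SquareAsNumpyProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(SquareAsNumpyProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return SquareAsNumpy()
```

Placing two or more such ops in one symbol, e.g. `mx.sym.Custom(data, op_type='square_asnumpy')` on separate branches, exercises the multiple-custom-op case; each blocked asnumpy() call ties up a CPU worker thread, which appears to be why the worker-thread workaround below helps.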
@sxjscience
Member

sxjscience commented Jan 4, 2017

I've tried it and the script runs well. I'm using the latest dmlc/master, compiled with the master versions of nnvm + mshadow + dmlc-core.
The script runs fine on my Windows build but gets stuck on my Linux build...
I've also tried both ThreadedEngine and ThreadedEnginePerDevice.
Log

(C:\Anaconda2) D:\HKUST\mxnet\example\numpy-ops>python nnvm_customop_bug.py
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
[23:19:24] D:\HKUST\mxnet\src\engine\engine.cc:36: MXNet start using engine: ThreadedEnginePerDevice
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(100,784)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_0=0.534000
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_1=0.534000
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_0=0.850400
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_1=0.850400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_0=0.887400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_1=0.887400
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.894000
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.894000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_0=0.905000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_1=0.905000
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.909400
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.909400
INFO:root:Epoch[0] Batch [350]  Speed: 31446.56 samples/sec     Train-multi-accuracy_0=0.916000
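
For reference, the engines compared above (including the NaiveEngine mentioned in the report) can be selected through the MXNET_ENGINE_TYPE environment variable; a minimal sketch, assuming it is set before mxnet is imported:

```python
import os
# Pick the dispatch engine before importing mxnet. Valid values include
# "NaiveEngine" (synchronous), "ThreadedEngine", and
# "ThreadedEnginePerDevice" (the engine shown in the log above).
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
import mxnet as mx
```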

@piiswrong
Contributor

piiswrong commented Jan 4, 2017

`os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"`
Add this at the beginning of the script, before importing mxnet.
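
A minimal sketch of applying this workaround; the reproduction script itself stays unchanged:

```python
import os
# Raise the CPU worker thread count so one worker can block inside a custom
# op's asnumpy() call while other workers keep making progress. This must be
# set before mxnet is imported.
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"
import mxnet as mx
```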

@lyttonhao
Contributor Author

It has been fixed by #4528

@coconutyao

> `os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"`
> Add this at the beginning of the script, before importing mxnet.

I need some help, thank you! The deadlock happened while calling MXNDArraySyncCopyToCPU()?
