This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

'Segmentation fault' when using simple_bind operation #865

Closed
lyttonhao opened this issue Dec 8, 2015 · 5 comments

Comments

@lyttonhao
Contributor

Hi, I get a 'Segmentation fault' when I use the simple_bind operation. At first it occurred while testing layers I had written myself, but I then found it also occurs with other code. The snippet below is taken almost directly from the simple_bind example.

```python
import mxnet as mx
import numpy as np
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# mx.sym is shorthand for mx.symbol
data = mx.sym.Variable("data")
fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name="fc1")
bn1 = mx.sym.BatchNorm(data=fc1, name="bn1")
act1 = mx.sym.Activation(data=bn1, name="act1", act_type="tanh")
fc2 = mx.sym.FullyConnected(data=act1, name="fc2", num_hidden=10)
softmax = mx.sym.Softmax(data=fc2, name="softmax")

# input shape: a batch of 100 flattened 28x28 images
batch_size = 100
data_shape = (batch_size, 784)

ctx = mx.gpu(1)
executor = softmax.simple_bind(ctx=ctx, data=data_shape, grad_req='write')
```

The program then gets a 'Segmentation fault' on some runs and is fine on others; it is quite strange that the error appears at random. When I switch to mx.cpu(), the program runs fine. However, with my own network the error always appears, regardless of the device.

@piiswrong
Contributor

Try updating to the newest version, run make clean, make sure you have CUDA enabled in config.mk, and build again.

@lyttonhao
Contributor Author

@piiswrong I've updated the code and reinstalled, but the problem is still there. If I run the complete 'simple_bind' example, there is no error, so I suspect there is a bug somewhere in memory allocation or freeing.

@winstywang
Contributor

The gdb backtrace:

```
#0  __GI___libc_free (mem=0x746867696577) at malloc.c:2929
#1  0x00007fffe9081a49 in cudnnDestroyTensorDescriptor ()
    from /usr/local/cuda/lib64/libcudnn.so.7.0
#2  0x00007ffff2d6d6be in mxnet::op::CuDNNActivationOp::~CuDNNActivationOp (
    this=0x1b9bca0, __in_chrg=<optimized out>)
    at src/operator/./cudnn_activation-inl.h:39
#3  0x00007ffff2d6dbb9 in mxnet::op::CuDNNActivationOp::~CuDNNActivationOp (
    this=0x1b9bca0, __in_chrg=<optimized out>)
    at src/operator/./cudnn_activation-inl.h:40
```

@winstywang
Contributor

I think it is closely related to #817

@piiswrong
Contributor

OK, now I get it. This is a harmless error: it happens because mxnet exits before it is fully initialized. Do something after bind and you'll be fine, e.g. start training/predicting or simply time.sleep(10).

@tqchen this is the singleton destruction bug we talked about. Could you fix it or at least hide it?
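For readers following along, the shutdown-ordering hazard mentioned above can be pictured with a small self-contained sketch. This is illustrative Python only, not mxnet internals: the `Library` and `Engine` classes below are hypothetical stand-ins for a shared library (such as cuDNN) and mxnet's singleton engine, invented here to show why the order of teardown matters.

```python
class Library:
    """Hypothetical stand-in for a shared library (e.g. cuDNN)."""
    def __init__(self):
        self.loaded = True

    def unload(self):
        self.loaded = False


class Engine:
    """Hypothetical stand-in for a singleton that holds library handles."""
    def __init__(self, lib):
        self.lib = lib

    def destroy(self):
        # If the library was torn down first, freeing its handles is
        # invalid -- this models the crash in cudnnDestroyTensorDescriptor.
        if not self.lib.loaded:
            raise RuntimeError("destroying handle after library teardown")
        self.lib = None


# Correct order: destroy the engine first, then unload the library.
lib = Library()
engine = Engine(lib)
engine.destroy()
lib.unload()

# Wrong order: library unloaded before the singleton engine is destroyed.
lib2 = Library()
engine2 = Engine(lib2)
lib2.unload()
try:
    engine2.destroy()
except RuntimeError as e:
    print("crash:", e)
```

In C++ the equivalent failure happens silently as a segfault rather than a catchable exception, since static-destruction order across translation units and shared libraries is not guaranteed, which is why doing some work (or sleeping) after bind sidesteps it.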
