This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Segmentation fault when grouping with numpy op #876

Closed
lyttonhao opened this issue Dec 9, 2015 · 5 comments

@lyttonhao
Contributor

Hi, I get a segmentation fault when grouping a NumpyOp with another layer. I am using the NumpySoftmax code. Here is my test code:

import mxnet as mx
import numpy as np
import logging

class NumpySoftmax(mx.operator.NumpyOp):
    def __init__(self):
        super(NumpySoftmax, self).__init__(False)

    def list_arguments(self):
        return ['data', 'label']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        label_shape = (in_shape[0][0],)
        output_shape = in_shape[0]
        return [data_shape, label_shape], [output_shape]

    def forward(self, in_data, out_data):
        x = in_data[0]
        y = out_data[0]
        y[:] = np.exp(x - x.max(axis=1).reshape((x.shape[0], 1)))
        y /= y.sum(axis=1).reshape((x.shape[0], 1))

    def backward(self, out_grad, in_data, out_data, in_grad):
        l = in_data[1]
        l = l.reshape((l.size,)).astype(int)  # np.int was removed in newer NumPy; the builtin int is equivalent
        y = out_data[0]
        dx = in_grad[0]
        dx[:] = y
        dx[np.arange(l.shape[0]), l] -= 1.0

def build_network():
    flatten = mx.symbol.Variable(name="data")
    flatten = mx.symbol.Flatten(data=flatten, name='flatten')
    fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=2, name='car_fc')
    fc2 = mx.symbol.FullyConnected(data=flatten, num_hidden=4, name='bb_fc')

    mysoftmax = NumpySoftmax()
    softmax = mysoftmax(data=fc1, name='softmax')

    output = mx.symbol.Group([softmax, fc2])

    return output

net = build_network()
executor = net.simple_bind(ctx=mx.cpu(), data=(16,3,224,224), grad_req='write')

If I return only 'softmax' from the 'build_network' function, it works fine. However, when I group it with another layer, the error occurs.

The backtrace info from gdb:

(gdb) bt
#0  0x00007fffffffcc20 in ?? ()
#1  0x00007ffff2c2d018 in mxnet::op::NativeOpProp::ListArguments() const ()
   from /usr/local/lib/python2.7/dist-packages/mxnet-0.5.0-py2.7.egg/mxnet/libmxnet.so
#2  0x00007ffff2c855a9 in mxnet::GraphExecutor::InitGraph(mxnet::Symbol const&, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, bool) ()
   from /usr/local/lib/python2.7/dist-packages/mxnet-0.5.0-py2.7.egg/mxnet/libmxnet.so
#3  0x00007ffff2c9ad41 in mxnet::GraphExecutor::Init(mxnet::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /usr/local/lib/python2.7/dist-packages/mxnet-0.5.0-py2.7.egg/mxnet/libmxnet.so
#4  0x00007ffff2c85b54 in mxnet::Executor::Bind(mxnet::Symbol, mxnet::Context const&, std::map<std::string, mxnet::Context, std::less<std::string>, std::allocator<std::pair<std::string const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&) () from /usr/local/lib/python2.7/dist-packages/mxnet-0.5.0-py2.7.egg/mxnet/libmxnet.so
#5  0x00007ffff2b00a7e in MXExecutorBindX ()
   from /usr/local/lib/python2.7/dist-packages/mxnet-0.5.0-py2.7.egg/mxnet/libmxnet.so
@lyttonhao
Contributor Author

Also, it cannot be solved with steps like 'start training/predicting or simply time.sleep(10)' from #865.

@piiswrong
Contributor

Actually, this is because mysoftmax gets garbage collected. Keep a reference to it by returning mysoftmax together with output, and it will work.
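To illustrate the lifetime problem described above, here is a minimal sketch in plain Python that does not use mxnet at all. It assumes the native side holds only a non-owning handle to the Python operator object, which is modelled here with `weakref.ref`; `FakeNumpyOp`, `build_without_reference`, and `build_with_reference` are hypothetical names made up for this example, not part of the library.

```python
import gc
import weakref


class FakeNumpyOp:
    """Hypothetical stand-in for a Python-side operator object (not mxnet's class)."""

    def list_arguments(self):
        return ['data', 'label']


def build_without_reference():
    op = FakeNumpyOp()
    # Assumption: the "graph" keeps only a non-owning handle to the operator.
    return weakref.ref(op)


def build_with_reference():
    op = FakeNumpyOp()
    # Return the operator alongside the handle so the caller keeps it alive.
    return weakref.ref(op), op


# The operator's last strong reference disappears when the function returns.
handle = build_without_reference()
gc.collect()
lost = handle() is None

# Holding the returned operator keeps the handle valid.
handle2, keepalive = build_with_reference()
gc.collect()
alive = handle2() is not None
args = handle2().list_arguments()
```

In the code from this issue, the equivalent workaround would be to have `build_network` return `mysoftmax` together with `output` and to keep both in scope for as long as the executor is in use.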

@winstywang
Contributor

Is there a better way to fix it? We could submit a PR if needed.

@piiswrong
Contributor

It should be fixed after my latest PR is merged.

@winstywang
Contributor

Great. Will wait for that.
