This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

mx.nd.Custom conflicts with memory management #14522

Open
YutingZhang opened this issue Mar 26, 2019 · 26 comments
Labels
Backend, Bug, C++, Exception Handling

Comments

@YutingZhang
Contributor

When training/running a large neural network with a CustomOp, MXNet can get stuck. My speculation is that if memory management (e.g., releasing/reallocating GPU memory or raising an "out of memory" error) is needed while a CustomOp is running, MXNet can deadlock.

Below is a minimal piece of code showing that a CustomOp can deadlock with memory management (out of memory, in this case):
As expected, main(1) works.
main(100) should raise an "out of GPU memory" error; instead, it just gets stuck.

The real-world problem I hit is not just the inability to raise an "out of memory" error. MXNet can release/reallocate memory dynamically, but this apparently also deadlocks with the CustomOp, so my program, which should fit into GPU memory, can get stuck as well.

import mxnet as mx


class MyMulMax(mx.operator.CustomOp):

    def __init__(self):
        super().__init__()

    def forward(self, is_train, req, in_data, out_data, aux):
        a, b = in_data[0:2]
        c = mx.nd.batch_dot(a, b)
        d = mx.nd.max(c, axis=-1, keepdims=True)
        self.assign(out_data[0], req[0], d)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], 0)
        self.assign(in_grad[1], req[1], 0)


@mx.operator.register("MyMulMax")
class MyMulMaxProp(mx.operator.CustomOpProp):

    def __init__(self):
        super().__init__()

    def list_arguments(self):
        return ['a', 'b']

    def list_outputs(self):
        return ['d']

    def infer_shape(self, in_shape):
        return in_shape, [list(in_shape[0][:-1] + [1])]

    def create_operator(self, ctx, shapes, dtypes):
        return MyMulMax()


def main(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        d = mx.nd.Custom(a, b, op_type="MyMulMax")
        d_np = d.asnumpy()


if __name__ == "__main__":
    main(1)    # small case: works as expected
    print("DONE -- 1")
    main(100)  # large case: should raise an out-of-GPU-memory error, but hangs
    print("DONE -- 2")

Tested with a nightly build of mxnet-cu90mkl on Python 3.6.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@YutingZhang
Contributor Author

@arcadiaphy @wkcn @anirudh2290 Interested in looking into this?

@wkcn
Member

wkcn commented Mar 26, 2019

Does the bug reproduce in an older version of MXNet?

@YutingZhang
Contributor Author

@wkcn "the bug in old version": Do you mean the deadlock in subprocess? I thought #14451 solved this problem (I have not tried it out myself).

This thread is about a different problem. It happens in the main process on a single GPU.

@wkcn
Member

wkcn commented Mar 26, 2019

I mean: did this issue exist before PR #14363?

@wkcn
Member

wkcn commented Mar 26, 2019

I reproduced the bug in a CPU environment; it gets stuck there too.

I used GDB to print the backtrace:

#0  0x00007ffff7f6aafc in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0                                      
#1  0x00007fffe28cacd1 in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>)                              
    at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:864         
#2  0x00007fffe28cacd1 in std::condition_variable::wait(std::unique_lock<std::mutex>&) (this=<optimized out>, __lock=...)    
    at /build/gcc/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53                                                    
#3  0x00007fffead27065 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) ()                                  
    at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so                                                    
#4  0x00007fffea739052 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const ()                                       
    at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so                                                    
#5  0x00007fffead95697 in MXNDArraySyncCopyToCPU () at /home/wkcn/proj/incubator-mxnet/python/mxnet/../../lib/libmxnet.so    
#6  0x00007ffff6b386d0 in ffi_call_unix64 () at /usr/lib/libffi.so.6                                                         
#7  0x00007ffff6b380a0 in ffi_call () at /usr/lib/libffi.so.6
#8  0x00007ffff6b9e615 in _ctypes_callproc () at /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so      
#9  0x00007ffff6b9efa0 in  () at /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff7ba163c in _PyObject_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#11 0x00007ffff7beccca in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#12 0x00007ffff7b752eb in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#13 0x00007ffff7be7c42 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#14 0x00007ffff7b752eb in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#15 0x00007ffff7be7dfa in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#16 0x00007ffff7b2eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#17 0x00007ffff7b2fab4 in PyEval_EvalCodeEx () at /usr/lib/libpython3.7m.so.1.0
#18 0x00007ffff7b2fadc in PyEval_EvalCode () at /usr/lib/libpython3.7m.so.1.0
#19 0x00007ffff7c59c94 in  () at /usr/lib/libpython3.7m.so.1.0
#20 0x00007ffff7c5b8be in PyRun_FileExFlags () at /usr/lib/libpython3.7m.so.1.0
#21 0x00007ffff7c5cc75 in PyRun_SimpleFileExFlags () at /usr/lib/libpython3.7m.so.1.0
#22 0x00007ffff7c5eeb7 in  () at /usr/lib/libpython3.7m.so.1.0
#23 0x00007ffff7c5f0fc in _Py_UnixMain () at /usr/lib/libpython3.7m.so.1.0
#24 0x00007ffff7dbd223 in __libc_start_main () at /usr/lib/libc.so.6
#25 0x000055555555505e in _start ()

@wkcn
Member

wkcn commented Mar 26, 2019

I see. In python/mxnet/ndarray/ndarray.py, the function def copyto(self, other): calls _internal._copyto(self, out=other), which gets stuck.

@YutingZhang
Contributor Author

I mean: did this issue exist before PR #14363?

Both before and after. I believe it is a different issue.

@arcadiaphy
Member

arcadiaphy commented Mar 27, 2019

MXNET_ENGINE_TYPE=ThreadedEngine gdb python -ex 'b src/storage/pooled_storage_manager.h:145' -ex 'r reproduce.py'

[screenshot of the resulting gdb backtrace]

Pretty interesting bug: it looks like the CUDA out-of-memory error gets stuck and cannot be raised from inside the custom op.

@wkcn
Member

wkcn commented Mar 27, 2019

OOM on CPU gets stuck too.

@arcadiaphy
Member

@wkcn I think it's not related to OOM; it looks more like LOG(FATAL) getting stuck.

@YutingZhang
Contributor Author

@arcadiaphy If LOG(FATAL) gets stuck, then there may be more things that can get stuck, which would be harder to track down.

@anirudh2290
Member

I think it would be worth checking whether it still gets stuck when cudaGetErrorString is not called. Since the custom op uses its own threads, it may not handle exceptions very well, but std::terminate should still be called for an unhandled exception. I am surprised that is not happening here.
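(For reference, a minimal standalone sketch of the standard behavior referred to here, not MXNet code: an exception that escapes a thread's entry function triggers std::terminate and aborts the whole process.)

#include <stdexcept>
#include <thread>

int main() {
  // The exception is not caught inside the thread function, so the C++
  // runtime calls std::terminate() in the worker and the process aborts.
  std::thread worker([] {
    throw std::runtime_error("unhandled exception in worker thread");
  });
  worker.join();
  return 0;
}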

@larroy
Contributor

larroy commented Mar 28, 2019

Hi

I'm investigating this one. With the provided example, I believe the cause of the error is the following, at least in a CPU context (changing gpu to cpu in the example above).

I believe the current example crashes due to an integer overflow in the batched GEMM loop in the following code in dot_engine-inl.h:

   
    for (int i = 0; i < batch_count; ++i) {
      gemm(stream, transa, transb, m, n, k, alpha,
           A + i * m * k, lda, B + i * k * n, ldb,
           beta, C + i * m * n, ldc);
    }

With a GEMM of batch size 100 and dimensions 6000 x 7000, the offsets i * m * n grow to roughly 4.16e9, which overflows a signed 32-bit int and leads to the crash.
I would suggest checking the matrix dimensions for overflow before dispatching to GEMM, as sketched below.
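For illustration, a hypothetical check of this kind (the helper name and the 32-bit index assumption are illustrative, not the actual dot_engine-inl.h interface):

#include <algorithm>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: before dispatching the batched GEMM loop, verify that
// the largest pointer offset it will compute (i * m * k, i * k * n, i * m * n)
// still fits in a signed 32-bit index.
inline void CheckBatchGemmOffsets(int64_t batch_count, int64_t m,
                                  int64_t n, int64_t k) {
  const int64_t max_index = std::numeric_limits<int32_t>::max();
  const int64_t max_stride = std::max({m * k, k * n, m * n});
  if (batch_count > 0 && (batch_count - 1) * max_stride > max_index) {
    throw std::range_error("batched GEMM offsets overflow 32-bit indexing");
  }
}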

@arcadiaphy
Member

arcadiaphy commented Mar 29, 2019

@anirudh2290 @wkcn @YutingZhang I finally figured out the reason:

Normally, when an exception is thrown in a spawned thread, it should be captured in a std::exception_ptr and re-thrown in the main thread, to ensure proper exception handling and avoid std::terminate. This mechanism was introduced in #9681 to handle exceptions in operators.
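(A minimal sketch of that capture-and-rethrow pattern in plain C++, not the MXNet implementation itself:)

#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

int main() {
  std::exception_ptr eptr;
  std::thread worker([&eptr] {
    try {
      throw std::runtime_error("simulated failure in worker");
    } catch (...) {
      // Capture instead of letting the exception escape the thread.
      eptr = std::current_exception();
    }
  });
  worker.join();
  // Re-throw in the calling thread, where it can be handled normally.
  try {
    if (eptr) std::rethrow_exception(eptr);
  } catch (const std::exception& e) {
    std::cout << "caught in main thread: " << e.what() << "\n";
  }
  return 0;
}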

But there are still two problems in the exception handling of custom ops:

  1. The exception thrown in the custom thread cannot be caught in the main thread, causing the program to crash.
  2. When an exception happens in the custom op, it is caught and should be re-thrown in the sync function WaitForVar. But WaitForVar deadlocks, because the pushed CustomOperator operation is skipped, so the write dependency of the waited-on ndarray is never completed.
    https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L128
    By adding c.wait_to_read(), the exception is forced to be re-thrown, but the program still crashes due to Problem 1.
import mxnet as mx


class MyMulMax(mx.operator.CustomOp):
    def __init__(self):
        super(MyMulMax, self).__init__()

    def forward(self, is_train, req, in_data, out_data, aux):
        a, b = in_data[0:2]
        c = mx.nd.batch_dot(a, b)
        # force re-throw exception, program still crash due to uncaught exception
        # c.wait_to_read()
        d = mx.nd.max(c, axis=-1, keepdims=True)
        self.assign(out_data[0], req[0], d)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        self.assign(in_grad[0], req[0], 0)
        self.assign(in_grad[1], req[1], 0)


@mx.operator.register("MyMulMax")
class MyMulMaxProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MyMulMaxProp, self).__init__()

    def list_arguments(self):
        return ['a', 'b']

    def list_outputs(self):
        return ['d']

    def infer_shape(self, in_shape):
        return in_shape, [list(in_shape[0][:-1] + [1])]

    def create_operator(self, ctx, shapes, dtypes):
        return MyMulMax()


def custom(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        d = mx.nd.Custom(a, b, op_type="MyMulMax")
        # deadlocks: the CustomOperator callback is skipped on exception,
        # so the write dependency on d is never completed (Problem 2)
        d.wait_to_read()


def direct(n):
    with mx.Context(mx.gpu(0)):
        a = mx.nd.random.uniform(shape=(n, 6000, 1))
        b = mx.nd.random.uniform(shape=(n, 1, 7000))
        c = mx.nd.batch_dot(a, b)
        d = mx.nd.max(c, axis=-1, keepdims=True)
        # same computation without the custom op, for comparison
        d.wait_to_read()


if __name__ == "__main__":
    custom(100)

@arcadiaphy
Member

arcadiaphy commented Mar 29, 2019

@wkcn @anirudh2290 As I've mentioned before, adding an on_complete callback in CustomOperator is not a good design, since the call may be skipped when an exception happens:
https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L128

Any suggestions on how to avoid it?
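(For illustration only: a generic RAII scope-guard sketch of the kind of guarantee under discussion. OnCompleteGuard is hypothetical and not the CustomOperator API; it just shows a callback that still fires when the guarded body throws.)

#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical RAII guard: the stored callback runs when the guard goes out
// of scope, whether the guarded body returns normally or throws.
class OnCompleteGuard {
 public:
  explicit OnCompleteGuard(std::function<void()> cb) : cb_(std::move(cb)) {}
  ~OnCompleteGuard() { if (cb_) cb_(); }
  OnCompleteGuard(const OnCompleteGuard&) = delete;
  OnCompleteGuard& operator=(const OnCompleteGuard&) = delete;
 private:
  std::function<void()> cb_;
};

void RunCustomBody() {
  OnCompleteGuard guard([] { /* mark the write dependency complete */ });
  throw std::runtime_error("body failed");  // guard still fires during unwinding
}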

@larroy
Contributor

larroy commented Mar 29, 2019

On CPU I see a crash in sgemm; I don't see how it relates to exceptions there.

(lldb) r repro2.py
Process 6491 launched: '/Users/pllarroy/devel/mxnet/py3_venv/bin/python' (x86_64)
Process 6491 stopped
* thread #4, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c4800)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
  thread #10, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c6c78)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
  thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff326c90ec)
    frame #0: 0x0000000113fb3b4d libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL + 107
libopenblasp-r0.3.5.dylib`sgemm_beta_HASWELL:
->  0x113fb3b4d <+107>: movl   $0x0, (%rdx,%rax,4)
    0x113fb3b54 <+114>: incq   %rax
    0x113fb3b57 <+117>: jmp    0x113fb3b42               ; <+96>
    0x113fb3b59 <+119>: decq   %rsi
Target 0: (python) stopped.
(lldb)

@larroy
Contributor

larroy commented Mar 29, 2019

On GPU, this is the segmentation fault I get. Are we talking about the same problem?

Thread 297 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffea9371700 (LWP 3786)]
0x00007fff31cdb240 in sgemm_beta_HASWELL () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
(gdb) bt 3
#0  0x00007fff31cdb240 in sgemm_beta_HASWELL () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
#1  0x00007fff30821d95 in ?? () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
#2  0x00007fff30969b52 in ?? () from /usr/lib/x86_64-linux-gnu/libopenblas.so.0
(More stack frames follow...)
(gdb)

@larroy
Contributor

larroy commented Mar 29, 2019

I talked to @anirudh2290 and I understand we are talking about two separate issues here:

  • Exception handling in custom ops
  • The integer overflow problem that I described.

@arcadiaphy do you plan to send a fix for the handling of exceptions in custom ops?

I can work on the integer overflow problem.

@larroy
Contributor

larroy commented Mar 29, 2019

This repro in imperative mode isolates the crash from the exception-handling problem in custom ops:

import mxnet as mx
n = 100
a = mx.nd.random.uniform(shape=(n, 6000, 1))
b = mx.nd.random.uniform(shape=(n, 1, 7000))
c = mx.nd.batch_dot(a, b)
c.wait_to_read()  # crashes in sgemm (32-bit offset overflow in the batched GEMM)
print(c.shape)

@arcadiaphy
Member

@larroy Yes, two different issues here.

Reproduced the gemm bug too, but I'm focusing on the exception handling issue.

@anirudh2290
Member

Thanks @arcadiaphy! Nice work! We can do this incrementally. Would you be willing to start with a PR for problem 1?

  1. The exception thrown in custom thread cannot be caught in main thread, causing program crash.

@arcadiaphy
Member

@anirudh2290 I have started a PR for problem 1, but the really tricky one is problem 2.

@wkcn
Member

wkcn commented Mar 31, 2019

Great! Thanks for your analysis!
I will read the comments and review the PR. :)

@larroy
Contributor

larroy commented Apr 1, 2019

Nice! I will work on a fix for the tensor shape overflow in the BLAS engine, unless somebody else has a strong desire to help there.

@larroy
Contributor

larroy commented Apr 1, 2019

@mxnet-label-bot add [C++, Backend, Bug, Exception Handling]

@marcoabreu added the Backend, Bug, and C++ labels on Apr 1, 2019