
core dump in macosx using big model #12438

Closed
loadwiki opened this issue Sep 3, 2018 · 3 comments
Labels: C++, Memory

Comments

@loadwiki

loadwiki commented Sep 3, 2018

Description
My C++ program sometimes core dumps inside libmxnet.so when the model is around 200 MB;
there is no core dump with a small model.

Environment info (Required)

iMac, macOS 10.13.6
CPU
Compiler:
clang -v
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Build info (Required if built from source)

git diff make/config.mk
@@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
 # whether use opencv during compilation
 # you can disable it, however, you will not able to use
 # imbin iterator
-USE_OPENCV = 1
+USE_OPENCV = 0

 # whether use libjpeg-turbo for image decode without OpenCV wrapper
 USE_LIBJPEG_TURBO = 0
@@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
 USE_LIBJPEG_TURBO_PATH = NONE

 # use openmp for parallelization
-USE_OPENMP = 1
+USE_OPENMP = 0

Error Message:

lldb main -c /cores/core.97762
(lldb) target create "main" --core "/cores/core.97762"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
    import weakref
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
    from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Core file '/cores/core.97762' (x86_64) was loaded.
(lldb) bt
warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
    frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
    frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
    frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
    frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
    frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
    frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP

Minimum reproducible example

There is no obvious condition that triggers the core dump.
I did manually send a SIGSTOP signal to my main program, and it stopped as usual.
I'm curious that when the core dump occurs there is no segmentation fault, abort, or other signal, only a SIGSTOP.
At first I compiled the mxnet master branch; then I switched to the release tag '1.2.1.rc1', and the same thing happens.

@loadwiki
Author

loadwiki commented Sep 3, 2018

Maybe the model size is not relevant; the model version is more likely the trigger.
Core-dumping model:
/Users/load/code/python/model/model-r100-gg/model-symbol.json ... 287521 bytes
/Users/load/code/python/model/model-r100-gg/model-0000.params ... 260958682 bytes
[11:34:04] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[11:34:04] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!

@vrakesh
Contributor

vrakesh commented Sep 3, 2018

@loadwiki Thank you for reporting the issue

@mxnet-label-bot [C++, Memory]

@marcoabreu added the C++ and Memory labels Sep 3, 2018
@loadwiki
Author

This core dump issue was caused by a bug in other code. But I did find another core dump bug that can be reproduced very easily: create an infer handle by loading a model, then let the thread exit immediately. A core dump then happens, something like this:
// Create Predictor
MXPredCreate(static_cast<const char*>(json_data.GetBuffer()),
             static_cast<const char*>(param_data.GetBuffer()),
             static_cast<int>(param_data.GetLength()),
             dev_type, dev_id,
             num_input_nodes, input_keys,
             input_shape_indptr, input_shape_data,
             &pred_hnd);
assert(pred_hnd);
exit(0);
However, if I insert a sleep() statement before exit(), the issue does not occur.
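For reference, a minimal sketch of that reproducer with the sleep() workaround applied, assuming json_data, param_data, and the input description are prepared as in the snippet above; the one-second duration, the header path, and the MXPredFree call are illustrative additions, not part of the original report:

#include <cassert>
#include <cstdlib>
#include <unistd.h>                   // sleep()
#include "mxnet/c_predict_api.h"      // MXNet C predict API (path may differ per install)

// ... json_data, param_data, dev_type, dev_id, num_input_nodes,
// input_keys, input_shape_indptr, input_shape_data set up as above ...

PredictorHandle pred_hnd = nullptr;
MXPredCreate(static_cast<const char*>(json_data.GetBuffer()),
             static_cast<const char*>(param_data.GetBuffer()),
             static_cast<int>(param_data.GetLength()),
             dev_type, dev_id,
             num_input_nodes, input_keys,
             input_shape_indptr, input_shape_data,
             &pred_hnd);
assert(pred_hnd);

// Giving the engine's worker threads a moment before the process exits
// avoids the crash described above; the exact duration is arbitrary.
sleep(1);

// Hypothetical tidy-up: release the handle before exiting (standard predict-API
// cleanup, not verified to be sufficient on its own for this bug).
MXPredFree(pred_hnd);
exit(0);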
