
Segfault of test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #17341

TaoLv opened this issue Jan 16, 2020 · 1 comment
TaoLv commented Jan 16, 2020

Description

This may not be just a flaky test. I hit the crash in my MKL-DNN upgrade PR (#17313), which does not appear to be related to this test.
I'm filing this issue to see whether anyone else runs into the same problem, and I hope someone familiar with the threaded engine can take a look.

Occurrences

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17313/2/pipeline/299

What have you tried to solve it?

Backtrace:

(gdb) bt
#0  0x00007f68610b898d in pthread_join (threadid=140079671015168, thread_return=0x0) at pthread_join.c:90
#1  0x00007f68575f4793 in std::thread::join() () from target:/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f6853460407 in mxnet::engine::ThreadPool::~ThreadPool (this=0x20b8ce0, __in_chrg=<optimized out>) at src/engine/./thread_pool.h:84
#3  std::default_delete<mxnet::engine::ThreadPool>::operator() (this=<optimized out>, __ptr=0x20b8ce0) at /usr/include/c++/5/bits/unique_ptr.h:76
#4  std::unique_ptr<mxnet::engine::ThreadPool, std::default_delete<mxnet::engine::ThreadPool> >::~unique_ptr (this=0x2c20bf8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/unique_ptr.h:236
#5  mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>::~ThreadWorkerBlock (this=0x2c20b30, __in_chrg=<optimized out>) at src/engine/threaded_engine_perdevice.cc:214
#6  std::_Sp_counted_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
    at /usr/include/c++/5/bits/shared_ptr_base.h:374
#7  0x00007f684f65a3ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x21e2ce0) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#8  0x00007f685345c50b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#9  std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/include/c++/5/bits/shared_ptr_base.h:925
#10 std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:1000
#11 std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::operator=(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr.h:294
#12 mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Clear (this=this@entry=0x1f230f8) at src/engine/../common/lazy_alloc_array.h:149
#13 0x00007f685345fb2c in mxnet::engine::ThreadedEnginePerDevice::StopNoWait (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:67
#14 mxnet::engine::ThreadedEnginePerDevice::Stop (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:74
#15 0x00007f685357dfb6 in mxnet::LibraryInitializer::atfork_prepare (this=<optimized out>) at src/initialize.cc:196
  1. Adding DEBUG=1 to the make command makes the problem go away.
  2. The problem was not observed when running the single test, or the single test file test_gluon_data.py, in isolation (see the reproduction sketch after this list).
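
For context, here is a minimal Python sketch of the pattern the failing test exercises: a Gluon DataLoader with num_workers > 0 forks worker processes after the threaded engine already has live worker threads, so each fork runs the atfork prepare handler seen at the bottom of the backtrace (LibraryInitializer::atfork_prepare -> ThreadedEnginePerDevice::Stop). This is not the exact CI test; the ArrayDataset, sizes, and batch settings below are placeholders rather than the real record image dataset.

```python
import mxnet as mx
import numpy as np
from mxnet.gluon.data import ArrayDataset, DataLoader

# Warm up the threaded engine so it has live worker threads before the
# DataLoader forks its worker processes.
a = mx.nd.ones((1024, 1024))
(a * 2).wait_to_read()

# Placeholder dataset; the real test uses an image record (.rec) dataset.
data = np.random.rand(64, 3, 32, 32).astype('float32')
label = np.random.randint(0, 10, size=(64,)).astype('float32')
dataset = ArrayDataset(data, label)

# num_workers > 0 makes the DataLoader fork worker processes. Each fork runs
# the registered pthread_atfork prepare handler, which stops the threaded
# engine and joins its thread pools (the join in frame #0 of the backtrace).
loader = DataLoader(dataset, batch_size=8, num_workers=2)
for batch_data, batch_label in loader:
    pass
```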
TaoLv added the Flaky label Jan 16, 2020
TaoLv commented Feb 26, 2020

This should already be addressed in the MKL-DNN upgrade PR.

TaoLv closed this as completed Feb 26, 2020