
Coredump at ps::Meta::~Meta called in ReleaseFirstMsg #211

Closed
ghost opened this issue Feb 25, 2020 · 10 comments
Comments

ghost commented Feb 25, 2020

Describe the bug
There is a coredump in work_007 when running with gdb -ex=r -ex=bt --args python3 $@:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff7338b700 (LWP 21722)]
0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#0 0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#1 0x00007ffff6bf7a28 in abort () from /usr/lib64/libc.so.6
#2 0x00007ffff6c38e87 in __libc_message () from /usr/lib64/libc.so.6
#3 0x00007ffff6c3f7c4 in malloc_printerr () from /usr/lib64/libc.so.6
#4 0x00007fffb8c7bd3a in _M_dispose (__a=..., this=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3260
#5 ~basic_string (this=0x7ffe54075540, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3630
#6 ps::Meta::~Meta (this=0x7ffe54075520, __in_chrg=<optimized out>)
at 3rdparty/ps-lite/include/ps/internal/message.h:136
#7 0x00007fffb8ce757a in ~Message (this=<optimized out>,
__in_chrg=<optimized out>) at ./include/ps/internal/message.h:210
#8 ~pair (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/stl_pair.h:208
#9 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
this=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/ext/new_allocator.h:140
#10 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
__a=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/alloc_traits.h:487
#11 _M_deallocate_node (this=0x7ffe54003258, __n=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/hashtable_policy.h:2084
#12 _M_erase (__n=<optimized out>, __prev_n=<optimized out>,
__bkt=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1890
#13 _M_erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1916
#14 erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:759
#15 erase (__x=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/unordered_map.h:814
#16 ReleaseFirstMsg (msg_buf=<optimized out>, this=0x7ffe54002d70)
at src/./rdma_van.h:498
#17 ps::RDMAVan::PollCQ (this=0x7ffe54002d70) at src/./rdma_van.h:704
#18 0x00007ffff31b1def in execute_native_thread_routine ()
from /usr/local/lib64/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#19 0x00007ffff769ee65 in start_thread () from /usr/lib64/libpthread.so.0
#20 0x00007ffff6cbe88d in clone () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install python3-3.6.8-10.el7.x86_64
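
For readers less familiar with the ps-lite internals: frames #6 to #16 correspond to erasing a cached Message from an unordered_map keyed by MessageBuffer*, which destroys the Message's Meta and the std::string fields inside it. Below is a minimal sketch of that erase path with simplified stand-in types (not the real ps-lite definitions):

```cpp
// Simplified stand-ins for ps::Meta / ps::Message / ps::MessageBuffer; the real
// definitions live in ps-lite's include/ps/internal/message.h. This only mirrors
// the destruction order seen in frames #6-#16 of the backtrace above.
#include <string>
#include <unordered_map>

struct Meta {
  std::string body;  // ~Meta() runs ~basic_string(), which frees the heap buffer (frames #4-#6)
};

struct Message {
  Meta meta;         // ~Message() runs ~Meta() (frame #7)
};

struct MessageBuffer {};

int main() {
  std::unordered_map<MessageBuffer*, Message> msgbuf_cache;
  MessageBuffer buf;
  msgbuf_cache[&buf].meta.body = "some metadata";

  // ReleaseFirstMsg(msg_buf) ends in an erase like this one; erasing the node
  // destroys the pair<MessageBuffer* const, Message> and everything inside it
  // (frames #8-#16).
  msgbuf_cache.erase(&buf);

  // The abort in the report happens when the string being destroyed does not own
  // a valid heap buffer, e.g. because the Meta was populated by code built
  // against an incompatible libstdc++ std::string ABI.
  return 0;
}
```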

To Reproduce
Steps to reproduce the behavior:

  1. Run BytePS with 1 scheduler, 2 servers, and 2 workers across two nodes: one node runs 1 scheduler, 1 server, and 1 worker; the other runs 1 server and 1 worker.
  2. Run with RDMA enabled; the ps-lite-byteps test_benchmark works with the same configuration.
  3. Run the PyTorch 1.4 example; the coredump appears.


Environment (please complete the following information):

  • OS:
    CentOS 7, MLNX_OFED_LINUX-4.6-1.0.1.1, 8x V100
  • GCC version:
    tested with gcc 7.4 and gcc 7.5
  • CUDA and NCCL version:
    CUDA 10.1, NCCL 2.5.7
  • Framework (TF, PyTorch, MXNet):
    PyTorch 1.4


ymjiang (Member) commented Feb 25, 2020

This is probably due to a gcc incompatibility between PyTorch and BytePS. Before you install BytePS, please pin gcc to 4.9 (example).
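
To make the gcc-incompatibility hypothesis concrete (this is an illustration of the suspected mechanism, not a confirmed root-cause analysis): libstdc++ ships two std::string ABIs with different object layouts, selected by _GLIBCXX_USE_CXX11_ABI. If PyTorch and BytePS are built with different settings, a string written by one side and destroyed by the other can hand a garbage pointer to operator delete, which is exactly an abort inside ~basic_string. A small probe (the file name abi_probe.cc is made up):

```cpp
// Build this twice and compare the output:
//   g++ -D_GLIBCXX_USE_CXX11_ABI=0 abi_probe.cc -o probe0 && ./probe0   # old COW string
//   g++ -D_GLIBCXX_USE_CXX11_ABI=1 abi_probe.cc -o probe1 && ./probe1   # new SSO string
// On x86-64 with GCC 7 the old-ABI string is a single 8-byte pointer, while the
// new-ABI string is 32 bytes (pointer + size + in-place buffer). Code compiled
// against one layout but handed a string built with the other reads a bogus
// "heap pointer" and passes it to operator delete.
#include <cstdio>
#include <string>

int main() {
  std::printf("_GLIBCXX_USE_CXX11_ABI = %d\n", _GLIBCXX_USE_CXX11_ABI);
  std::printf("sizeof(std::string)    = %zu\n", sizeof(std::string));
  return 0;
}
```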

bobzhuyb (Member) commented:

Which example are you running? Do all PyTorch examples fail?

ghost (Author) commented Feb 25, 2020

The example is https://github.com/bytedance/byteps/blob/master/example/pytorch/train_imagenet_resnet50_byteps.py

PyTorch 1.4 needs gcc 5.0+; have you tested with PyTorch 1.4?
I also tested rebuilding PyTorch 1.4 with gcc 7.5, but the coredump still occurred. I checked with valgrind (memcheck and helgrind) and only found the error message below; there is no other information related to this coredump.

[23:51:31] src/./rdma_van.h:220: Connect to Node 1 with Transport=IPC
==21234== Thread 13:
==21234== Invalid free() / delete / delete[] / realloc()
==21234== at 0x4C2B508: operator delete(void*) (vg_replace_malloc.c:586)
==21234== by 0x5EED2D39: _M_dispose (basic_string.h:3260)
==21234== by 0x5EED2D39: ~basic_string (basic_string.h:3630)
==21234== by 0x5EED2D39: ps::Meta::~Meta() (message.h:136)
==21234== by 0x5EF244B3: ~Message (message.h:210)
==21234== by 0x5EF244B3: ps::Van::Start(int) (van.cc:407)
==21234== by 0x5EF3F978: ps::RDMAVan::Start(int) (rdma_van.h:56)
==21234== by 0x5EF1F7F4: ps::Postoffice::Start(int, char const*, bool) (postoffice.cc:76)
==21234== by 0x5EEE55A4: StartAsync (ps.h:47)
==21234== by 0x5EEE55A4: byteps::common::BytePSGlobal::GetOrInitPS() (global.cc:258)
==21234== by 0x5EED16EF: byteps::common::InitTensor(byteps::common::BytePSContext&, unsigned long, int, void*) (operations.cc:291)
==21234== by 0x5EF00A30: byteps::torch::StartTask(at::Tensor, at::Tensor, int, std::string, int, int, int) (ops.cc:64)
==21234== by 0x5EF03DB8: __invoke_impl<void, void (*)(at::Tensor, at::Tensor, int, std::basic_string, int, int, int), at::Tensor, at::Tensor, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int> (invoke.h:60)
==21234== by 0x5EF03DB8: __invoke<void (*)(at::Tensor, at::Tensor, int, std::basic_string, int, int, int), at::Tensor, at::Tensor, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int> (invoke.h:95)
==21234== by 0x5EF03DB8: _M_invoke<0, 1, 2, 3, 4, 5, 6, 7> (thread:234)
==21234== by 0x5EF03DB8: operator() (thread:243)
==21234== by 0x5EF03DB8: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(at::Tensor, at::Tensor, int, std::string, int, int, int), at::Tensor, at::Tensor, int, std::string, int, int, int> > >::_M_run() (thread:186)
==21234== by 0xA03BDEE: execute_native_thread_routine (in /usr/local/lib64/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
==21234== by 0x5366E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==21234== by 0x5D8288C: clone (in /usr/lib64/libc-2.17.so)
==21234== Address 0xa078085f8 is on thread 13's stack
==21234== in frame #2, created by ps::Van::Start(int) (message.h:343)
==21234==
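
One way to read the last two valgrind lines (an assumption, not something confirmed in this thread): the pointer handed to operator delete lies inside a stack-allocated Message created in ps::Van::Start, i.e. a buffer that was never heap-allocated is being freed. A trimmed-down sketch of that pattern, with simplified stand-in types:

```cpp
// Hypothetical sketch of the pattern valgrind points at. In the real code,
// ps::Van::Start builds a Message on its own stack (van.cc:407 / message.h:343)
// and the Meta strings inside it are destroyed when Start returns. Nothing is
// wrong with this by itself; it aborts only if the string layout is misread, so
// that the "heap pointer" actually points into the stack object (short strings
// under the new ABI live in the in-object SSO buffer, i.e. on the stack here)
// and that stack address gets passed to free/delete.
#include <string>

struct Meta {
  std::string body;   // destroyed when the enclosing Message leaves scope
};

struct Message {
  Meta meta;
};

void Start() {
  Message msg;               // stack-allocated, like the Message in ps::Van::Start
  msg.meta.body = "hello";   // short string: its data lives inside msg itself
}                            // ~Message -> ~Meta -> ~basic_string run here

int main() {
  Start();
  return 0;
}
```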

ymjiang (Member) commented Feb 25, 2020

> I also tested rebuilding PyTorch 1.4 with gcc 7.5, but the coredump still occurred. I checked with valgrind (memcheck and helgrind) and only found the error message; there is no other information related to this coredump.

Did you build PyTorch 1.4 from source using gcc 7.5? If so, you also need to build BytePS with gcc 7.5.

ghost (Author) commented Feb 25, 2020

Yes, I tested rebuilding PyTorch, BytePS, and NCCL with gcc 7.5, and the coredump still occurs.
The ps-lite-byteps test_benchmark works.

Maybe there is a thread conflict somewhere, but I did not find anything in the valgrind helgrind report.

@ymjiang could you provide a working PyTorch 1.4 docker image?

bobzhuyb (Member) commented:

We are investigating the issue with PyTorch 1.4 internally.

bobzhuyb (Member) commented:

@tanguofu We have identified the problem and fixed it internally. The fix is coming to this public repo very soon.

ghost (Author) commented Feb 29, 2020

That’s very nice! Many thanks!

ymjiang (Member) commented Mar 1, 2020

@tanguofu Please see our fix in #212. PyTorch 1.4 runs well with the new dockerfile and setup.py. We will also update the official image bytepsimage/pytorch soon.

If you are eager to try it now, you can check out that commit and run:
docker build -t byteps.pytorch . -f Dockerfile --build-arg FRAMEWORK=pytorch

ghost (Author) commented Mar 1, 2020

Thanks for the very efficient fix! The example now runs normally. Although it is slower than Horovod, I will post a new issue to ask for help with that.
This coredump confused me for several days; I am grateful that you found the problem and fixed it.

ghost closed this as completed Mar 1, 2020