
Coredump at ps::Meta::~Meta called in ReleaseFirstMsg #211

Closed
ghost opened this issue Feb 25, 2020 · 10 comments
Comments

ghost commented Feb 25, 2020

Describe the bug
There is a coredump in work_007 when running with gdb -ex=r -ex=bt --args python3 $@:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff7338b700 (LWP 21722)]
0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#0 0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#1 0x00007ffff6bf7a28 in abort () from /usr/lib64/libc.so.6
#2 0x00007ffff6c38e87 in __libc_message () from /usr/lib64/libc.so.6
#3 0x00007ffff6c3f7c4 in malloc_printerr () from /usr/lib64/libc.so.6
#4 0x00007fffb8c7bd3a in _M_dispose (__a=..., this=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3260
#5 ~basic_string (this=0x7ffe54075540, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3630
#6 ps::Meta::~Meta (this=0x7ffe54075520, __in_chrg=<optimized out>)
at 3rdparty/ps-lite/include/ps/internal/message.h:136
#7 0x00007fffb8ce757a in ~Message (this=<optimized out>,
__in_chrg=<optimized out>) at ./include/ps/internal/message.h:210
#8 ~pair (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/stl_pair.h:208
#9 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
this=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/ext/new_allocator.h:140
#10 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
__a=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/alloc_traits.h:487
#11 _M_deallocate_node (this=0x7ffe54003258, __n=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/hashtable_policy.h:2084
#12 _M_erase (__n=<optimized out>, __prev_n=<optimized out>,
__bkt=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1890
#13 _M_erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1916
#14 erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:759
#15 erase (__x=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/unordered_map.h:814
#16 ReleaseFirstMsg (msg_buf=<optimized out>, this=0x7ffe54002d70)
at src/./rdma_van.h:498
#17 ps::RDMAVan::PollCQ (this=0x7ffe54002d70) at src/./rdma_van.h:704
#18 0x00007ffff31b1def in execute_native_thread_routine ()
from /usr/local/lib64/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#19 0x00007ffff769ee65 in start_thread () from /usr/lib64/libpthread.so.0
#20 0x00007ffff6cbe88d in clone () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install python3-3.6.8-10.el7.x86_64
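
For readers less familiar with the ps-lite internals: frames #6 to #16 correspond to erasing a cached Message from an unordered_map keyed by MessageBuffer*, which destroys the Message's Meta and the std::string fields inside it. Below is a minimal sketch of that erase path with simplified stand-in types (not the real ps-lite definitions):

```cpp
// Simplified stand-ins for ps::Meta / ps::Message / ps::MessageBuffer; the real
// definitions live in ps-lite's include/ps/internal/message.h. This only mirrors
// the destruction order seen in frames #6-#16 of the backtrace above.
#include <string>
#include <unordered_map>

struct Meta {
  std::string body;  // ~Meta() runs ~basic_string(), which frees the heap buffer (frames #4-#6)
};

struct Message {
  Meta meta;         // ~Message() runs ~Meta() (frame #7)
};

struct MessageBuffer {};

int main() {
  std::unordered_map<MessageBuffer*, Message> msgbuf_cache;
  MessageBuffer buf;
  msgbuf_cache[&buf].meta.body = "some metadata";

  // ReleaseFirstMsg(msg_buf) ends in an erase like this one; erasing the node
  // destroys the pair<MessageBuffer* const, Message> and everything inside it
  // (frames #8-#16).
  msgbuf_cache.erase(&buf);

  // The abort in the report happens when the string being destroyed does not own
  // a valid heap buffer, e.g. because the Meta was populated by code built
  // against an incompatible libstdc++ std::string ABI.
  return 0;
}
```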

To Reproduce
Steps to reproduce the behavior:

  1. Run BytePS with 1 scheduler, 2 servers, and 2 workers across two nodes: one node runs 1 scheduler, 1 server, and 1 worker; the other runs 1 server and 1 worker.
  2. Run with RDMA enabled; the ps-lite-byteps test_benchmark works with the same configuration.
  3. Run the PyTorch 1.4 example; the coredump appears.


Environment (please complete the following information):

  • OS:
    CentOS 7, MLNX_OFED_LINUX-4.6-1.0.1.1, 8x V100
  • GCC version:
    tested with gcc 7.4 and gcc 7.5
  • CUDA and NCCL version:
    CUDA 10.1, NCCL 2.5.7
  • Framework (TF, PyTorch, MXNet):
    PyTorch 1.4


ymjiang (Member) commented Feb 25, 2020

This is probably due to a gcc incompatibility between PyTorch and BytePS. Before you install BytePS, please pin gcc to 4.9 (example).
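
To make the gcc-incompatibility hypothesis concrete (this is an illustration of the suspected mechanism, not a confirmed root-cause analysis): libstdc++ ships two std::string ABIs with different object layouts, selected by _GLIBCXX_USE_CXX11_ABI. If PyTorch and BytePS are built with different settings, a string written by one side and destroyed by the other can hand a garbage pointer to operator delete, which is exactly an abort inside ~basic_string. A small probe (the file name abi_probe.cc is made up):

```cpp
// Build this twice and compare the output:
//   g++ -D_GLIBCXX_USE_CXX11_ABI=0 abi_probe.cc -o probe0 && ./probe0   # old COW string
//   g++ -D_GLIBCXX_USE_CXX11_ABI=1 abi_probe.cc -o probe1 && ./probe1   # new SSO string
// On x86-64 with GCC 7 the old-ABI string is a single 8-byte pointer, while the
// new-ABI string is 32 bytes (pointer + size + in-place buffer). Code compiled
// against one layout but handed a string built with the other reads a bogus
// "heap pointer" and passes it to operator delete.
#include <cstdio>
#include <string>

int main() {
  std::printf("_GLIBCXX_USE_CXX11_ABI = %d\n", _GLIBCXX_USE_CXX11_ABI);
  std::printf("sizeof(std::string)    = %zu\n", sizeof(std::string));
  return 0;
}
```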

bobzhuyb (Member) commented:

Which example are you running? Do all PyTorch examples fail?

ghost (Author) commented Feb 25, 2020

The example is https://github.com/bytedance/byteps/blob/master/example/pytorch/train_imagenet_resnet50_byteps.py

PyTorch 1.4 needs gcc 5.0+; have you tested with PyTorch 1.4?
I also tested rebuilding PyTorch 1.4 with gcc 7.5, but the coredump still occurred. I checked with valgrind (memcheck and helgrind) and only found the error message below; there is no other information related to this coredump.

[23:51:31] src/./rdma_van.h:220: Connect to Node 1 with Transport=IPC
==21234== Thread 13:
==21234== Invalid free() / delete / delete[] / realloc()
==21234== at 0x4C2B508: operator delete(void*) (vg_replace_malloc.c:586)
==21234== by 0x5EED2D39: _M_dispose (basic_string.h:3260)
==21234== by 0x5EED2D39: ~basic_string (basic_string.h:3630)
==21234== by 0x5EED2D39: ps::Meta::~Meta() (message.h:136)
==21234== by 0x5EF244B3: ~Message (message.h:210)
==21234== by 0x5EF244B3: ps::Van::Start(int) (van.cc:407)
==21234== by 0x5EF3F978: ps::RDMAVan::Start(int) (rdma_van.h:56)
==21234== by 0x5EF1F7F4: ps::Postoffice::Start(int, char const*, bool) (postoffice.cc:76)
==21234== by 0x5EEE55A4: StartAsync (ps.h:47)
==21234== by 0x5EEE55A4: byteps::common::BytePSGlobal::GetOrInitPS() (global.cc:258)
==21234== by 0x5EED16EF: byteps::common::InitTensor(byteps::common::BytePSContext&, unsigned long, int, void*) (operations.cc:291)
==21234== by 0x5EF00A30: byteps::torch::StartTask(at::Tensor, at::Tensor, int, std::string, int, int, int) (ops.cc:64)
==21234== by 0x5EF03DB8: __invoke_impl<void, void (*)(at::Tensor, at::Tensor, int, std::basic_string, int, int, int), at::Tensor, at::Tensor, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int> (invoke.h:60)
==21234== by 0x5EF03DB8: __invoke<void (*)(at::Tensor, at::Tensor, int, std::basic_string, int, int, int), at::Tensor, at::Tensor, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int> (invoke.h:95)
==21234== by 0x5EF03DB8: _M_invoke<0, 1, 2, 3, 4, 5, 6, 7> (thread:234)
==21234== by 0x5EF03DB8: operator() (thread:243)
==21234== by 0x5EF03DB8: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(at::Tensor, at::Tensor, int, std::string, int, int, int), at::Tensor, at::Tensor, int, std::string, int, int, int> > >::_M_run() (thread:186)
==21234== by 0xA03BDEE: execute_native_thread_routine (in /usr/local/lib64/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
==21234== by 0x5366E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==21234== by 0x5D8288C: clone (in /usr/lib64/libc-2.17.so)
==21234== Address 0xa078085f8 is on thread 13's stack
==21234== in frame #2, created by ps::Van::Start(int) (message.h:343)
==21234==
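
One way to read the last two valgrind lines (an assumption, not something confirmed in this thread): the pointer handed to operator delete lies inside a stack-allocated Message created in ps::Van::Start, i.e. a buffer that was never heap-allocated is being freed. A trimmed-down sketch of that pattern, with simplified stand-in types:

```cpp
// Hypothetical sketch of the pattern valgrind points at. In the real code,
// ps::Van::Start builds a Message on its own stack (van.cc:407 / message.h:343)
// and the Meta strings inside it are destroyed when Start returns. Nothing is
// wrong with this by itself; it aborts only if the string layout is misread, so
// that the "heap pointer" actually points into the stack object (short strings
// under the new ABI live in the in-object SSO buffer, i.e. on the stack here)
// and that stack address gets passed to free/delete.
#include <string>

struct Meta {
  std::string body;   // destroyed when the enclosing Message leaves scope
};

struct Message {
  Meta meta;
};

void Start() {
  Message msg;               // stack-allocated, like the Message in ps::Van::Start
  msg.meta.body = "hello";   // short string: its data lives inside msg itself
}                            // ~Message -> ~Meta -> ~basic_string run here

int main() {
  Start();
  return 0;
}
```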

ymjiang (Member) commented Feb 25, 2020

> I also tested rebuilding PyTorch 1.4 with gcc 7.5, but the coredump still occurred. I checked with valgrind (memcheck and helgrind) and only found the error message; there is no other information related to this coredump.

Did you build PyTorch 1.4 from source using gcc 7.5? If so, you also need to build BytePS with gcc 7.5.

ghost (Author) commented Feb 25, 2020

Yes, I tested rebuilding PyTorch, BytePS, and NCCL with gcc 7.5, and the coredump still occurs.
The ps-lite-byteps test_benchmark works.

Maybe there is a thread conflict somewhere, but I did not find anything in the valgrind helgrind report.

@ymjiang could you provide a working PyTorch 1.4 docker image?

bobzhuyb (Member) commented:

We are investigating the issue with PyTorch 1.4 internally.

bobzhuyb (Member) commented:

@tanguofu We have identified the problem and fixed it internally. The fix is coming to this public repo very soon.

ghost (Author) commented Feb 29, 2020

That’s very nice! Many thanks!

ymjiang (Member) commented Mar 1, 2020

@tanguofu Please see our fix in #212. PyTorch 1.4 runs well with the new dockerfile and setup.py. We will also update the official image bytepsimage/pytorch soon.

If you are eager to try it now, you can check out that commit and run:
docker build -t byteps.pytorch . -f Dockerfile --build-arg FRAMEWORK=pytorch

ghost (Author) commented Mar 1, 2020

Thanks for the very efficient fix! The example now runs normally. Although it is slower than Horovod, I will post a new issue to ask for help with that.
This coredump confused me for several days; I am grateful that you found the problem and fixed it.

ghost closed this as completed Mar 1, 2020