Coredump at ps::Meta::~Meta called in ReleaseFirstMsg #211
Comments
This is probably due to a gcc incompatibility between PyTorch and BytePS. Before you install BytePS, please pin gcc to 4.9 (example).
Which example are you running? Do all PyTorch examples fail?
The example is https://github.com/bytedance/byteps/blob/master/example/pytorch/train_imagenet_resnet50_byteps.py and PyTorch 1.4 needs gcc 5.0+. Have you tested with PyTorch 1.4?
[23:51:31] src/./rdma_van.h:220: Connect to Node 1 with Transport=IPC
Did you build PyTorch 1.4 from source using gcc-7.5? If so, you also need gcc-7.5 to build BytePS.
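The matching-compiler advice above can be checked before building. A minimal sketch, assuming the `gcc` on `PATH` is the compiler that will build BytePS and that you already know which gcc built PyTorch (passed in by hand here). The helper names are hypothetical and are not part of BytePS or PyTorch:

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version found in `text` as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def local_gcc_version(gcc="gcc"):
    """Version of the gcc found on PATH, or None if gcc is not installed."""
    if shutil.which(gcc) is None:
        return None
    result = subprocess.run(
        [gcc, "-dumpversion"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=False,
    )
    return parse_version(result.stdout)


def same_release(a, b, depth=2):
    """True when both versions agree on the leading components they share.

    Some gcc builds print only the major version for -dumpversion, so the
    comparison uses at most `depth` components and never more than both have.
    """
    if a is None or b is None:
        return False
    n = min(len(a), len(b), depth)
    return a[:n] == b[:n]


if __name__ == "__main__":
    # Assumption: PyTorch was built with gcc 7.5.0, as reported in this thread.
    pytorch_gcc = parse_version("7.5.0")
    local = local_gcc_version()
    print("local gcc:", local, "matches PyTorch build:", same_release(local, pytorch_gcc))
```

Comparing major and minor components is a deliberately strict choice for this sketch. It only flags a mismatch early and does not prove that two builds are ABI compatible.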
Yes, I rebuilt PyTorch, BytePS, and NCCL with gcc 7.5, and the coredump still occurs. Maybe there is a thread conflict somewhere, but I did not find one in the valgrind helgrind report. @ymjiang, could you provide a working Docker image for PyTorch 1.4?
We are investigating the issue with PyTorch 1.4 internally.
@tanguofu We have identified the problem and fixed it internally. The fix is coming to this public repo very soon.
It’s very nice! Many thanks!
@tanguofu Please see our fix in #212. PyTorch 1.4 runs well with the new dockerfile and setup.py. We will also update the official image. If you are eager to try now, you can check out that commit and do
Thanks for the very efficient fix! The example now runs normally. The speed is slower than Horovod, though, so I will open a new issue to ask for help.
Describe the bug
There is a coredump in work_007 when run with gdb -ex=r -ex=bt --args python3 $@
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff7338b700 (LWP 21722)]
0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#0 0x00007ffff6bf6337 in raise () from /usr/lib64/libc.so.6
#1 0x00007ffff6bf7a28 in abort () from /usr/lib64/libc.so.6
#2 0x00007ffff6c38e87 in __libc_message () from /usr/lib64/libc.so.6
#3 0x00007ffff6c3f7c4 in malloc_printerr () from /usr/lib64/libc.so.6
#4 0x00007fffb8c7bd3a in _M_dispose (__a=..., this=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3260
#5 ~basic_string (this=0x7ffe54075540, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3630
#6 ps::Meta::~Meta (this=0x7ffe54075520, __in_chrg=<optimized out>)
at 3rdparty/ps-lite/include/ps/internal/message.h:136
#7 0x00007fffb8ce757a in ~Message (this=<optimized out>,
__in_chrg=<optimized out>) at ./include/ps/internal/message.h:210
#8 ~pair (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/stl_pair.h:208
#9 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
this=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/ext/new_allocator.h:140
#10 destroy<std::pair<ps::MessageBuffer* const, ps::Message> > (
__a=<optimized out>, __p=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/alloc_traits.h:487
#11 _M_deallocate_node (this=0x7ffe54003258, __n=<optimized out>)
at /usr/local/include/c++/7.5.0/bits/hashtable_policy.h:2084
#12 _M_erase (__n=<optimized out>, __prev_n=<optimized out>,
__bkt=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1890
#13 _M_erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:1916
#14 erase (__k=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/hashtable.h:759
#15 erase (__x=<optimized out>, this=0x7ffe54003258)
at /usr/local/include/c++/7.5.0/bits/unordered_map.h:814
#16 ReleaseFirstMsg (msg_buf=<optimized out>, this=0x7ffe54002d70)
at src/./rdma_van.h:498
at src/./rdma_van.h:498
#17 ps::RDMAVan::PollCQ (this=0x7ffe54002d70) at src/./rdma_van.h:704
#18 0x00007ffff31b1def in execute_native_thread_routine ()
from /usr/local/lib64/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#19 0x00007ffff769ee65 in start_thread () from /usr/lib64/libpthread.so.0
#20 0x00007ffff6cbe88d in clone () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install python3-3.6.8-10.el7.x86_64
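When reading a trace like the one above, it can help to skip the libc and libstdc++ frames and find the first frame whose source file belongs to project code. A rough sketch, assuming gdb's usual `#N ... at file:line` layout and treating any path under `/usr/` as system code. The function names are illustrative, and the sample is a shortened copy of the trace with arguments abbreviated:

```python
import re

FRAME_START = re.compile(r"^#(\d+)\s+(.*)$")
LOCATION = re.compile(r"\bat\s+(\S+):(\d+)\s*$")


def parse_frames(trace):
    """Group a gdb backtrace into (frame number, text) pairs, joining wrapped lines."""
    frames = []
    for line in trace.splitlines():
        line = line.strip()
        start = FRAME_START.match(line)
        if start:
            frames.append([int(start.group(1)), start.group(2)])
        elif frames and line:
            # Continuation of the previous frame (gdb wraps long frames).
            frames[-1][1] += " " + line
    return [(number, text) for number, text in frames]


def first_project_frame(trace, system_prefixes=("/usr/",)):
    """Return (frame number, file, line) of the first frame with non-system source."""
    for number, text in parse_frames(trace):
        loc = LOCATION.search(text)
        if loc and not loc.group(1).startswith(system_prefixes):
            return number, loc.group(1), int(loc.group(2))
    return None


# Shortened sample; argument lists are abbreviated as (...).
SAMPLE = """\
#3 0x00007ffff6c3f7c4 in malloc_printerr () from /usr/lib64/libc.so.6
#4 0x00007fffb8c7bd3a in _M_dispose (...)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3260
#5 ~basic_string (...)
at /usr/local/include/c++/7.5.0/bits/basic_string.h:3630
#6 ps::Meta::~Meta (...)
at 3rdparty/ps-lite/include/ps/internal/message.h:136
"""

print(first_project_frame(SAMPLE))
```

On the sample this reports frame 6, the `ps::Meta::~Meta` destructor at `3rdparty/ps-lite/include/ps/internal/message.h:136`. That only locates where project code enters the failing call chain and does not by itself identify the root cause.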
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
CentOS 7, MLNX_OFED_LINUX-4.6-1.0.1.1, 8 x V100
Tested with gcc 7.4 and gcc 7.5
CUDA 10.1, NCCL 2.5.7
PyTorch 1.4
Additional context
Add any other context about the problem here.