test_nccl.py script causes a core dump on a p2.16xlarge instance when run against an NCCL-enabled MXNet build. #9004
Comments
NCCL is 2.1 with CUDA 9.
@leleamol Could you run with the environment variable NCCL_DEBUG=INFO and post the result?
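For anyone else trying to capture the same log, here is a minimal Python sketch of running the test with NCCL_DEBUG=INFO set only for the child process. The helper names `nccl_debug_env` and `run_test_with_nccl_debug` are hypothetical, and the default test path is assumed to be relative to the MXNet source root.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, level="INFO"):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


def run_test_with_nccl_debug(test_path="tests/python/gpu/test_nccl.py"):
    """Run the test script in a child process and return its exit code.

    On POSIX a negative code means the child was killed by a signal,
    for example -11 for a segmentation fault.
    """
    return subprocess.call([sys.executable, test_path], env=nccl_debug_env())
```

Calling `run_test_with_nccl_debug()` from the source root should print the NCCL debug lines to the console alongside the test output; the current shell environment is left unchanged.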
Following is the output of test_nccl.py when run with NCCL_DEBUG=INFO.
It created a core dump. The call stack is as follows:

We have the following two requests for this issue:

Any update on the two requests above?
Update from Nvidia: "Issue has been fixed, and will be part of the next release 2.2.1 or later". This issue can be kept open for verification in Nvidia NCCL 2.2.1.
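When verifying, it may help to confirm that the NCCL build under test is at or past the release named in the update. A small sketch, assuming the version string is supplied by hand as dot-separated integers; `parse_version` and `has_reported_fix` are hypothetical helpers, not part of MXNet or NCCL.

```python
def parse_version(text):
    """Turn a dotted version such as "2.2.1" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_reported_fix(nccl_version, fixed_in="2.2.1"):
    """True if nccl_version is at or after the release said to carry the fix."""
    return parse_version(nccl_version) >= parse_version(fixed_in)
```

With the versions from this thread, `has_reported_fix("2.1")` is false and `has_reported_fix("2.2.1")` is true. A version string with a non-numeric suffix would raise `ValueError` in this sketch.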
@leleamol Can you verify whether this is resolved?
@leleamol Bouncing this one for your feedback.
Description
The test_nccl.py script causes a core dump when run against an NCCL-enabled MXNet build.
Environment info (Required)
MXNet version v1.0.0 built with USE_NCCL=1 and USE_NCCL_PATH
NCCL 2.1
Instance type : p2.16xlarge
Package used (Python/R/Scala/Julia): Python
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): gcc
MXNet commit hash: 2b67436
Build config:
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_DIST_KVSTORE=1
USE_MKL2017=1
USE_BLAS=openblas
USE_S3=1
USE_NCCL=1
USE_NCCL_PATH=/usr/nccl/cuda-9
CUDA_ARCH := -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70
Error Message:
Core was generated by `/home/ec2-user/src/anaconda2/bin/python ./src/anaconda2/bin/nosetests /home/ec2'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
100 init.cu: No such file or directory.
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.12.amzn1.x86_64 krb5-libs-1.15.1-8.43.amzn1.x86_64 libcom_err-1.42.12-4.40.amzn1.x86_64 libjpeg-turbo-1.2.90-5.14.amzn1.x86_64 libselinux-2.1.10-3.22.amzn1.x86_64 libuuid-2.23.2-33.28.amzn1.x86_64 openssl-1.0.2k-8.106.amzn1.x86_64
(gdb) where
#0 0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
#1 0x00007f21e752edad in ncclCommInitAll (comms=<optimized out>, ndev=<optimized out>, devlist=<optimized out>) at init.cu:692
#2 0x00007f22294f7a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#3 0x00007f222950423a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#4 0x00007f22294baba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#5 0x00007f22294377fb in MXKVStorePushEx () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#6 0x00007f224231aec0 in ffi_call_unix64 () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
#7 0x00007f224231a87d in ffi_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
#8 0x00007f2242530736 in _ctypes_callproc () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
#9 0x00007f2242526a61 in PyCFuncPtr_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
#10 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#11 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#12 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#13 0x00007f224e0c2482 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#14 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#15 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#16 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#17 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#18 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#19 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#20 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#21 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#22 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#23 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#24 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#25 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#26 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#27 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#28 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#29 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#30 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#31 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#32 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#33 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#34 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#35 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#36 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#37 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#38 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#39 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#40 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#41 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#42 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#43 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#44 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#45 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#46 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#47 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#48 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#49 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#50 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#51 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#52 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#53 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#54 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#55 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#56 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#57 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#58 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#59 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#60 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#61 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#62 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#63 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#64 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#65 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#66 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#67 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#68 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#69 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#70 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#71 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#72 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#73 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#74 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#75 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#76 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#77 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#78 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#79 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#80 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#81 0x00007f224e082254 in slot_tp_init () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#82 0x00007f224e07eb0b in type_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#83 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#84 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#85 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#86 0x00007f224e0c570a in PyEval_EvalCode () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#87 0x00007f224e0de93d in run_mod () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#88 0x00007f224e0dfab8 in PyRun_FileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#89 0x00007f224e0e0cd8 in PyRun_SimpleFileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#90 0x00007f224e0f2d3c in Py_Main () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#91 0x00007f224d32fb05 in __libc_start_main (main=0x5567c5f66850 <main>, argc=3, argv=0x7ffe7888ded8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe7888dec8) at libc-start.c:269
#92 0x00005567c5f6687f in _start ()
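Most of the frames above belong to the Python interpreter; the ones that matter for this crash are the NCCL and libmxnet.so frames at the top. As a reading aid, here is a rough sketch that drops interpreter frames from a pasted gdb backtrace. The helper `native_frames` and the marker list are assumptions chosen for this particular trace, not a general-purpose gdb parser.

```python
import re

# Start of a gdb frame line: "#5 0x00007f... in MXKVStorePushEx () from ..."
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")

# Library name fragments treated as interpreter plumbing (assumed for this trace).
INTERPRETER_MARKERS = ("libpython", "libffi", "_ctypes")


def native_frames(backtrace):
    """Return (frame_number, function_name) for frames outside the interpreter."""
    frames = []
    for line in backtrace.splitlines():
        line = line.strip()
        match = FRAME_RE.match(line)
        if match is None:
            continue  # not a frame line
        if any(marker in line for marker in INTERPRETER_MARKERS):
            continue  # Python / ctypes / libffi frame
        frames.append((int(match.group(1)), match.group(2)))
    return frames


sample = """\
#0 0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
#5 0x00007f22294377fb in MXKVStorePushEx () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#6 0x00007f224231aec0 in ffi_call_unix64 () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
#10 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
"""

print(native_frames(sample))  # [(0, 'commFree'), (5, 'MXKVStorePushEx')]
```

Applied to the full trace, this leaves frames #0 to #5 plus the process start-up frames #91 and #92, which shows the path from MXKVStorePushEx through KVStoreNCCL::Reduce into ncclCommInitAll and commFree.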
Minimum reproducible example
mxnet/tests/python/gpu/test_nccl.py
Steps to reproduce
@unittest.skip("Test requires NCCL library installed and enabled during build")
python tests/python/gpu/test_nccl.py
What have you tried to solve it?