
ndarray.load crashes MXNet GPU builds on CPU machines #16399

Closed · leezu opened this issue Oct 8, 2019 · 4 comments · Fixed by #16432
leezu (Contributor) commented Oct 8, 2019

Description

For some parameter files, the crash shown below occurs; it does not happen for all parameter files.

Environment info (Required)

----------Python Info----------
Version      : 3.7.3
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Jun 13 2019 13:24:27')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /home/ubuntu/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.1
Directory    : /home/ubuntu/.local/lib/python3.7/site-packages/mxnet
Commit Hash   : c9818480680f84daa6e281a974ab263691302ba8
Library      : ['/home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so']
Build features:
✔ CUDA
✔ CUDNN
✔ NCCL
✔ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✔ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
----------System Info----------
Platform     : Linux-4.4.0-1095-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-84-3
release      : 4.4.0-1095-aws
version      : #106-Ubuntu SMP Wed Sep 18 13:33:48 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping:              4
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0010 sec, LOAD: 0.3120 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.3964 sec, LOAD: 0.0625 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0938 sec, LOAD: 0.3685 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0266 sec, LOAD: 0.1123 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0007 sec, LOAD: 0.0284 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0054 sec, LOAD: 0.0288 sec.
----------Environment----------
KMP_DUPLICATE_LIB_OK="True"

Package used (Python/R/Scala/Julia): Python (pip package mxnet-cu100)

Error Message:

what():  [21:44:06] src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) : GPU usage requires at least 1 GPU
Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7fab268524cb]
  [bt] (1) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25b50ec) [0x7fab289570ec]
  [bt] (2) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x25b5cd4) [0x7fab28957cd4]
  [bt] (3) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x3c2) [0x7fab28b57022]
  [bt] (4) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::~NDArray()+0x9a) [0x7fab2698422a]
  [bt] (5) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Copy(mxnet::Context) const+0x2cd) [0x7fab28b7651d]
  [bt] (6) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Load(dmlc::Stream*)+0xb41) [0x7fab28b78211]
  [bt] (7) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Load(dmlc::Stream*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >*, std::vector<std::string, std::allocator<std::string> >*)+0xc51) [0x7fab28b78eb1]
  [bt] (8) /home/ubuntu/.local/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArrayLoad+0x263) [0x7fab288cf613]
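
Reading the trace bottom-up suggests the following (a best-effort interpretation of the symbols, not a confirmed diagnosis): MXNDArrayLoad calls NDArray::Load, which deserializes each array and then calls NDArray::Copy(mxnet::Context) with the context recorded at save time. When that context is a GPU device and the CUDA runtime reports no devices, the engine's device_count_ > 0 check (threaded_engine.cc:328) fails with a fatal abort rather than a catchable Python exception. A minimal pre-flight guard sketch, assuming mx.context.num_gpus() itself returns 0 safely when no device is visible:

import mxnet as mx

# Sketch of a guard (assumption: mx.context.num_gpus() returns 0 rather
# than crashing when the CUDA runtime reports no devices). On affected
# mxnet-cu100 builds, mx.nd.load() on a file whose arrays were saved
# from a GPU context can abort the whole process, so skip the load.
if mx.context.num_gpus() == 0:
    print("No visible GPU; loading a GPU-context .params file may abort.")
else:
    params = mx.nd.load("awd_lstm_lm_1150_wikitext-2-f9562ed0.params")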

Minimum reproducible example

wget http://apache-mxnet.s3.amazonaws.com/gluon/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip
unzip awd_lstm_lm_1150_wikitext-2-f9562ed0.zip
python3 -c 'import mxnet as mx; mx.nd.load("awd_lstm_lm_1150_wikitext-2-f9562ed0.params")'

Steps to reproduce

  • pip install mxnet-cu100 on a CPU machine with CUDA installed
  • Run the example above (a possible workaround is sketched below)
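
Until a fix lands, one possible workaround (a sketch, under the assumption that you can run it on a machine where the load succeeds, e.g. one with a GPU) is to copy every array to the CPU context and re-save, so the file no longer references a GPU device:

import mxnet as mx

# Hypothetical conversion script: run it where mx.nd.load() succeeds
# (e.g. a machine with a GPU). mx.nd.load() returns a list or a dict of
# NDArrays; move each array to the CPU context and re-save so the new
# file no longer pins any array to a GPU device.
loaded = mx.nd.load("awd_lstm_lm_1150_wikitext-2-f9562ed0.params")
if isinstance(loaded, dict):
    on_cpu = {name: arr.as_in_context(mx.cpu()) for name, arr in loaded.items()}
else:
    on_cpu = [arr.as_in_context(mx.cpu()) for arr in loaded]
mx.nd.save("awd_lstm_lm_1150_wikitext-2-cpu.params", on_cpu)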
@mxnet-label-bot (Contributor)

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Bug

leezu (Contributor, Author) commented Oct 11, 2019

@szha

@ddavydenko (Contributor)

@mxnet-label-bot, add [Bug]

@roywei added the Bug label Oct 14, 2019
leezu (Contributor, Author) commented Oct 14, 2019

There is a simple fix at #16432; it just needs review.
