train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

pakmarkthub · 2018-03-15T17:21:03Z

Description

train_cifar10.py hangs before outputting validation-accuracy on the first epoch when using four P100 GPUs, compiled with debug=1 or MXNET_ENGINE_TYPE=NaiveEngine (recompile with debug=0 in the later case).

Environment info (Required)

----------Python Info----------
('Version :', '2.7.5')
('Compiler :', 'GCC 4.8.5 20150623 (Red Hat 4.8.5-16)')
('Build :', ('default', 'Aug 4 2017 00:39:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/usr/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.1.0')
('Directory :', '/home/cc/Projects/uvm-reduce/experiments/mxnet/original/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-3.10.0-514.6.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core')
('system :', 'Linux')
('node :', 'ftg-p100.novalocal')
('release :', '3.10.0-514.6.1.el7.x86_64')
('version :', '#1 SMP Wed Jan 18 13:06:36 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping: 1
CPU MHz: 2499.921
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 3999.88
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0129 sec, LOAD: 0.7237 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0127 sec, LOAD: 0.2568 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0640 sec, LOAD: 0.2466 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0161 sec, LOAD: 0.0744 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0646 sec, LOAD: 0.2254 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0478 sec, LOAD: 0.4038 sec.
-----------CUDA----------------
CUDA Driver Version / Runtime Version: 9.1 / 9.1
Driver version: 390.30
GPUs: four Tesla P100-SXM2-16GB GPUs
Peer access: Yes (all GPU pairs)
-----------OpenCv--------------
Version: 3.4.1

Package used (Python/R/Scala/Julia):
I'm using python2.7.5

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash:
07a83a0 (tag 1.1.0)
f7fa49e (master)

Build config:

#---------------------
# choice of compiler
#--------------------

export CC = gcc
export CXX = g++
export NVCC = nvcc

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 1

# whether compile with profiler
USE_PROFILER =

# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS = -lopencv_core -lopencv_imgcodecs -lopencv_imgproc

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 1

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE

# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1

# whether use CuDNN R3 library
USE_CUDNN = 1

#whether to use NCCL library
USE_NCCL = 1
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# whether use MKL-DNN library
USE_MKLDNN = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
USE_SSE=0
else
USE_SSE=1
endif

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 1

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# [email protected]:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

Error Message:

(No error message, just hang up, below is the out)
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0,1,2,3', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=300, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[16:30:36] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[16:30:42] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[16:30:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 629.17 samples/sec accuracy=0.120536
INFO:root:Epoch[0] Batch [40] Speed: 630.63 samples/sec accuracy=0.172656
INFO:root:Epoch[0] Batch [60] Speed: 645.72 samples/sec accuracy=0.205859
INFO:root:Epoch[0] Batch [80] Speed: 642.09 samples/sec accuracy=0.224219
INFO:root:Epoch[0] Batch [100] Speed: 634.74 samples/sec accuracy=0.250781
INFO:root:Epoch[0] Batch [120] Speed: 628.02 samples/sec accuracy=0.264453
INFO:root:Epoch[0] Batch [140] Speed: 629.10 samples/sec accuracy=0.288281
INFO:root:Epoch[0] Batch [160] Speed: 632.72 samples/sec accuracy=0.299609
INFO:root:Epoch[0] Batch [180] Speed: 636.01 samples/sec accuracy=0.320312
INFO:root:Epoch[0] Batch [200] Speed: 640.49 samples/sec accuracy=0.337891
INFO:root:Epoch[0] Batch [220] Speed: 640.68 samples/sec accuracy=0.353906
INFO:root:Epoch[0] Batch [240] Speed: 639.87 samples/sec accuracy=0.362109
INFO:root:Epoch[0] Batch [260] Speed: 637.37 samples/sec accuracy=0.370312
INFO:root:Epoch[0] Batch [280] Speed: 638.09 samples/sec accuracy=0.376953
INFO:root:Epoch[0] Batch [300] Speed: 635.81 samples/sec accuracy=0.395313
INFO:root:Epoch[0] Batch [320] Speed: 638.06 samples/sec accuracy=0.405078
INFO:root:Epoch[0] Batch [340] Speed: 638.94 samples/sec accuracy=0.417969
INFO:root:Epoch[0] Batch [360] Speed: 639.11 samples/sec accuracy=0.430469
INFO:root:Epoch[0] Batch [380] Speed: 635.67 samples/sec accuracy=0.446484
INFO:root:Epoch[0] Train-accuracy=0.450781
INFO:root:Epoch[0] Time cost=80.754
----------- Hang up here -------------------------

(Stack trace from gdb after hang)
#0 0x00007ffff77fe945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fffe8a29a6c in std::condition_variable::wait(std::unique_lockstd::mutex&) () from /lib64/libstdc++.so.6
#2 0x00007fff137c4435 in std::condition_variable::waitmxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda21(std::unique_lockstd::mutex &, mxnet::engine::ThreadedEngine::__lambda21) (this=0x7ffebc001198, __lock=..., __p=...)
at /usr/include/c++/4.8.2/condition_variable:93
#3 0x00007fff137c3fa2 in mxnet::engine::ThreadedEngine::WaitForVar (this=0x7ffebc001150, var=0xc9d9e4c8) at src/engine/threaded_engine.cc:384
#4 0x00007fff13034e5e in mxnet::NDArray::WaitToWrite (this=0xc9bc74f0) at include/mxnet/./ndarray.h:361
#5 0x00007fff131b5f50 in mxnet::NDArray::SyncCopyToCPU (this=0xc9bc74f0, data=0xc88bde50, size=32) at src/ndarray/ndarray.cc:1290
#6 0x00007fff13809ac9 in MXNDArraySyncCopyToCPU (handle=0xc9bc74f0, data=0xc88bde50, size=32) at src/c_api/c_api.cc:261
#7 0x00007fffeee8fdcc in ffi_call_unix64 () from /lib64/libffi.so.6
#8 0x00007fffeee8f6f5 in ffi_call () from /lib64/libffi.so.6
#9 0x00007fffef0a2c8b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00007fffef09ca85 in PyCFuncPtr_call () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#11 0x00007ffff7a5a9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#12 0x00007ffff7aef0f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#13 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#14 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#15 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#16 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#17 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#18 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#19 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#20 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#21 0x00007ffff7af357d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00007ffff7af357d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#23 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#24 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#25 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#26 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#27 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#28 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#29 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#30 0x00007ffff7af6002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#31 0x00007ffff7b0f43f in run_mod () from /lib64/libpython2.7.so.1.0
#32 0x00007ffff7b105fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#33 0x00007ffff7b11889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#34 0x00007ffff7b22a3f in Py_Main () from /lib64/libpython2.7.so.1.0
#35 0x00007ffff6d48c05 in __libc_start_main () from /lib64/libc.so.6
#36 0x000000000040071e in _start ()

Minimum reproducible example

/incubator-mxnet/example/image-classification/train_cifar10.py

Steps to reproduce

Case 1:

Compile mxnet with the above config.mk (DEBUG=1).
Run python train_cifar10.py --gpu 0,1,2,3
The program hanged up before printing out "Validation-accuracy" on epoch 0.

Case 2:

Compile mxnet with the above config.mk (DEBUG=0).
MXNET_ENGINE_TYPE=NaiveEngine Run python train_cifar10.py --gpu 0,1,2,3
The program hanged up before printing out "Validation-accuracy" on epoch 0.

What have you tried to solve it?

Changed kv-store to nccl by passing --kv-store=nccl to train_cifar10.py <== Not helpful
Tried checkout from tag 1.1.0 and master (see commit hashes above) <== Not helpful
Reinstalled CUDA 9.1 <== Not helpful
Downgraded NVIDIA driver to version 387.26 <== Randomly hanged after a few epochs
Searched through issues on GitHub. Found some similar problems but no solution (expected to solve by the master branch).

The text was updated successfully, but these errors were encountered:

hetong007 · 2018-05-04T19:13:30Z

Can you try the train_cifar10.py script at: http://gluon-cv.mxnet.io/model_zoo/index.html#image-classification

piyushghai · 2018-10-09T20:16:00Z

@pakmarkthub Were you able to try @hetong007 's suggestions ?
Do you still see the hanging issue while training Cifar10 model ?

kalyc · 2018-11-06T23:04:33Z

Hi @pakmarkthub I used MXNet v1.1.0 on my machine and am unable to reproduce the issue. Could you please verify if the issue is still persisting at your end? I was able to run train_cifar10.py file without any issue. I was able to see the validation accuracy on epoch 0 and the program didn't hang.

kalyc · 2018-11-07T18:37:00Z

@sandeep-krishnamurthy could you close this issue?
@pakmarkthub please feel free to re-open/ comment if the issue still persists at your end.

sandeep-krishnamurthy · 2018-11-07T19:25:55Z

@kalyc - Does this work with latest (v1.3)?

kalyc · 2018-11-09T19:20:13Z

This might be a related issue (RecordIO input split) - #12800 and it has been resolved

kalyc · 2018-11-10T00:24:38Z

On v1.3 here's what I got:

  File "example/image-classification/train_cifar10.py", line 76, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "/home/ubuntu/incubator-mxnet/example/image-classification/common/fit.py", line 220, in fit
    lr, lr_scheduler = _get_lr_scheduler(args, kv)
  File "/home/ubuntu/incubator-mxnet/example/image-classification/common/fit.py", line 53, in _get_lr_scheduler
    base_lr=args.lr))
TypeError: __init__() got an unexpected keyword argument 'base_lr'

pakmarkthub mentioned this issue Mar 15, 2018

Race Conditions on Pascal GPUs? #4858

Closed

nswamy added Bug Example CUDA labels May 1, 2018

sandeep-krishnamurthy closed this as completed Nov 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

pakmarkthub commented Mar 15, 2018 •

edited

Loading

hetong007 commented May 4, 2018

piyushghai commented Oct 9, 2018

kalyc commented Nov 6, 2018

kalyc commented Nov 7, 2018

sandeep-krishnamurthy commented Nov 7, 2018

kalyc commented Nov 9, 2018

kalyc commented Nov 10, 2018

train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

Comments

pakmarkthub commented Mar 15, 2018 • edited Loading

Description

Environment info (Required)

Build info (Required if built from source)

Error Message:

Minimum reproducible example

Steps to reproduce

What have you tried to solve it?

hetong007 commented May 4, 2018

piyushghai commented Oct 9, 2018

kalyc commented Nov 6, 2018

kalyc commented Nov 7, 2018

sandeep-krishnamurthy commented Nov 7, 2018

kalyc commented Nov 9, 2018

kalyc commented Nov 10, 2018

pakmarkthub commented Mar 15, 2018 •

edited

Loading