How to train model with multi machines #9186

Feywell · 2017-12-23T03:00:48Z

Description

I want to train a CNN across multiple machines.
Hardware: cluster (Tesla K20)
System: Red Hat Enterprise Linux Server release 6.4 (Santiago)
When I try to run example/image-classification/train_cifar10.py
using the command:

python ../../tools/launch.py -n 4 -H hosts \
python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 \
--kv-store dist_device_sync 2>&1 | tee train_cifar10_log

I get the following error:

Traceback (most recent call last):
File "train_cifar10.py", line 19, in <module>
import argparse
ImportError: No module named argparse
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/liyang/incubator-mxnet-master/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
subprocess.check_call(prog, shell = True)
File "/home/liyang/anaconda2-5.0/lib/python2.7/subprocess.py", line 186, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.16.1.182 -p 22 'export LD_LIBRARY_PATH=/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/mpirt/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/../compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/mkl/lib/intel64:/opt/intel/composer_xe_2013.3.163/tbb/lib/intel64/gcc4.1:/home/liyang/usr/lib:/home/liyang/usr/local/lib:/home/liyang/usr/libtool/lib:/home/liyang/usr/local/gcc-5.4.0/lib64:/home/liyang/usr/local/gcc-5.4.0/lib:/home/liyang/usr/local/perl/lib:/usr/local/cuda-8.0/lib64:/home/liyang/anaconda2-5.0/lib::/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/mic/lib; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_PS_ROOT_URI=172.16.1.183; export DMLC_NUM_SERVER=4; export DMLC_NUM_WORKER=4; cd /home/liyang/incubator-mxnet-master/example/image-classification/; python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 --kv-store dist_device_sync'' returned non-zero exit status 1

I checked my Python packages, and argparse is installed.
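One thing worth ruling out here (an assumption, not something this thread confirms) is that the `python` started by the non-interactive ssh session on a worker is a different interpreter from the Anaconda one used for the local check. `argparse` only became part of the standard library in Python 2.7, so an older system interpreter would fail exactly at that import. A minimal sketch of a check, with `describe_interpreter` being a hypothetical helper name:

```python
from __future__ import print_function

import sys


def describe_interpreter(modules=("argparse",)):
    """Return the running interpreter's path, version, and which modules import."""
    report = {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "importable": {},
    }
    for name in modules:
        try:
            __import__(name)
            report["importable"][name] = True
        except ImportError:
            report["importable"][name] = False
    return report


if __name__ == "__main__":
    info = describe_interpreter()
    print("executable:", info["executable"])
    print("version   :", info["version"])
    for name, ok in sorted(info["importable"].items()):
        print("%-10s: %s" % (name, "ok" if ok else "MISSING"))
```

Running the script through the same kind of `ssh <host> 'python ...'` call that appears in the CalledProcessError message would show which interpreter each worker actually starts.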

Environment info (Required)

----------Python Info----------
('Version :', '2.7.13')
('Compiler :', 'GCC 7.2.0')
('Build :', ('default', 'Sep 22 2017 00:47:24'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
('Version :', '1.0.0')
('Directory :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet')
Traceback (most recent call last):
File "diagnose.py", line 171, in <module>
check_mxnet()
File "diagnose.py", line 113, in check_mxnet
except FileNotFoundError:
NameError: global name 'FileNotFoundError' is not defined
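The `NameError` at the end of this output is separate from the training failure: `FileNotFoundError` is a Python 3 built-in and does not exist under Python 2.7, so `diagnose.py` fails at that `except` clause. A common compatibility pattern, shown here as a generic sketch rather than the actual change made to `diagnose.py`, is to alias the name to `IOError` when it is missing. The `read_first_line` helper is hypothetical and only illustrates the pattern:

```python
try:
    FileNotFoundError  # built in on Python 3
except NameError:
    # Python 2 has no FileNotFoundError; IOError is what open() raises there.
    # Note that this alias is broader: it also matches other I/O errors.
    FileNotFoundError = IOError


def read_first_line(path):
    """Return the first line of `path`, or None if the file cannot be found."""
    try:
        with open(path) as handle:
            return handle.readline().rstrip("\n")
    except FileNotFoundError:
        return None
```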

Package used (Python/R/Scala/Julia):
(I'm using Python)

Build info (Required if built from source)

Compiler: GCC (5.4.0)

Build config:
command:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1

I don't know what is causing this error. How can I fix it?
Thank you!


rahul003 · 2018-02-07T02:27:38Z

Have you installed mxnet by running sudo python setup.py install from incubator-mxnet/python/ on each of these machines?
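A quick way to answer that question for every machine at once is to try importing `mxnet` over ssh on each host. The sketch below assumes the `hosts` file passed to `launch.py` lists one host per line, treats lines starting with `#` as comments, and assumes passwordless ssh is already set up; `read_hosts`, `build_check_command`, and `check_hosts` are hypothetical helper names, not part of MXNet.

```python
import subprocess


def read_hosts(path):
    """Return the non-empty, non-comment lines of a hosts file."""
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    return [line for line in lines if line and not line.startswith("#")]


def build_check_command(host, python="python"):
    """Build the ssh argument list that imports mxnet on `host`."""
    remote = python + ' -c "import mxnet; print(mxnet.__version__)"'
    return ["ssh", "-o", "StrictHostKeyChecking=no", host, remote]


def check_hosts(hosts):
    """Run the import check on every host and map host -> True/False."""
    results = {}
    for host in hosts:
        # A zero exit status means the remote interpreter imported mxnet.
        results[host] = subprocess.call(build_check_command(host)) == 0
    return results
```

Calling `check_hosts(read_hosts("hosts"))` from the launching machine would then report which workers can import the package with their default `python`.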

thomelane · 2018-07-12T17:10:20Z

@Feywell if you're still interested, there's a tutorial being worked on at the moment that documents the steps required for distributed training. Check out the PR here; the tutorial file is mxnet/example/distributed_training/README.md.

Feywell · 2018-07-13T01:43:21Z

@thomelane Thank you! I will read it carefully.

Ishitori · 2018-07-30T20:34:09Z

@Feywell, did it help? Is the problem resolved now?

ThomasDelteil · 2018-08-21T23:10:58Z

@yzhliu can you please close the issue?
@Feywell if you have issues setting up distributed training, please create a post on https://discuss.mxnet.io. Thanks!

marcoabreu added the Question, Distributed, and Example labels Apr 10, 2018

yzhliu added the Pending Requester Info label Jul 30, 2018

yzhliu closed this as completed Aug 22, 2018
