This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

How to train model with multi machines #9186

Closed
Feywell opened this issue Dec 23, 2017 · 5 comments

Comments

@Feywell

Feywell commented Dec 23, 2017

Description

I want to train a CNN on multiple machines.
Hardware: cluster (Tesla K20)
System: Red Hat Enterprise Linux Server release 6.4 (Santiago)
When I try to run example/image-classification/train_cifar10.py
using the command:

python ../../tools/launch.py -n 4 -H hosts \
    python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 \
    --kv-store dist_device_sync 2>&1 | tee train_cifar10_log

I get the following error:

Traceback (most recent call last):
  File "train_cifar10.py", line 19, in <module>
    import argparse
ImportError: No module named argparse
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/liyang/anaconda2-5.0/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/liyang/incubator-mxnet-master/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
    subprocess.check_call(prog, shell = True)
  File "/home/liyang/anaconda2-5.0/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no 172.16.1.182 -p 22 'export LD_LIBRARY_PATH=/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/mpirt/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/../compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/ipp/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013.3.163/compiler/lib/intel64:/opt/intel/composer_xe_2013.3.163/mkl/lib/intel64:/opt/intel/composer_xe_2013.3.163/tbb/lib/intel64/gcc4.1:/home/liyang/usr/lib:/home/liyang/usr/local/lib:/home/liyang/usr/libtool/lib:/home/liyang/usr/local/gcc-5.4.0/lib64:/home/liyang/usr/local/gcc-5.4.0/lib:/home/liyang/usr/local/perl/lib:/usr/local/cuda-8.0/lib64:/home/liyang/anaconda2-5.0/lib::/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/mic/lib; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_PS_ROOT_URI=172.16.1.183; export DMLC_NUM_SERVER=4; export DMLC_NUM_WORKER=4; cd /home/liyang/incubator-mxnet-master/example/image-classification/; python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 --kv-store dist_device_sync'' returned non-zero exit status 1

I checked my Python packages, and argparse is installed.

Environment info (Required)

----------Python Info----------
('Version :', '2.7.13')
('Compiler :', 'GCC 7.2.0')
('Build :', ('default', 'Sep 22 2017 00:47:24'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
('Version :', '1.0.0')
('Directory :', '/home/liyang/anaconda2-5.0/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet')
Traceback (most recent call last):
  File "diagnose.py", line 171, in <module>
    check_mxnet()
  File "diagnose.py", line 113, in check_mxnet
    except FileNotFoundError:
NameError: global name 'FileNotFoundError' is not defined
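
A note on the NameError that ends this output: it comes from diagnose.py itself, not from MXNet. FileNotFoundError is a Python 3 builtin, and this environment runs Python 2.7.13. A minimal sketch of a version-tolerant check is shown here; the helper name `read_first_line` is hypothetical and is not taken from diagnose.py.

```python
import io

# Python 2 has no FileNotFoundError builtin. Falling back to IOError is an
# assumption made for this sketch; on Python 2 it also catches other I/O errors.
try:
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError


def read_first_line(path):
    """Return the first line of `path`, or None if the file does not exist."""
    try:
        with io.open(path, "r") as f:
            return f.readline().rstrip("\n")
    except FileNotFoundError:
        return None
```

With this fallback in place, the same `except FileNotFoundError:` clause is valid on both Python 2.7 and Python 3.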

Package used (Python/R/Scala/Julia):
(I'm using Python)

Build info (Required if built from source)

Compiler: GCC (5.4.0)

Build config:
command:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_DIST_KVSTORE=1

I don't know what is causing this. How can I fix this error?
Thank you!

@rahul003
Member

rahul003 commented Feb 7, 2018

Have you installed mxnet by running sudo python setup.py install from incubator-mxnet/python/ on each of these machines?
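
One way to check this on every machine is to run a short probe over ssh, the same way the launcher starts the workers. The sketch below is illustrative only: it assumes the hosts file lists one address per line, and the names `PROBE`, `read_hosts`, `build_check_command`, and `check_hosts` are hypothetical helpers, not part of MXNet or dmlc-core. It is worth running because a non-interactive ssh session may pick up a different `python` than a login shell does. RHEL 6 ships Python 2.6 as its system interpreter, and argparse only joined the standard library in Python 2.7, so a remote worker that falls back to the system Python would fail with exactly this ImportError. Whether that is what happened here is a guess.

```python
import subprocess

# Remote one-liner: print the interpreter path and version, then try the imports.
# It contains no single quotes, so it can be wrapped in single quotes for ssh.
PROBE = (
    "import sys; print(sys.executable); print(sys.version_info[:3]); "
    "import argparse; import mxnet; print(mxnet.__version__)"
)


def read_hosts(path):
    """Return the non-empty lines of a hosts file (assumed: one host per line)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def build_check_command(host, python="python"):
    """Build the ssh argv that runs PROBE with the remote `python`."""
    remote = "{} -c '{}'".format(python, PROBE)
    return ["ssh", "-o", "StrictHostKeyChecking=no", host, remote]


def check_hosts(path):
    """Run the probe on every host and report which ones fail."""
    for host in read_hosts(path):
        returncode = subprocess.call(build_check_command(host))
        status = "ok" if returncode == 0 else "FAILED (exit {})".format(returncode)
        print("{}: {}".format(host, status))
```

Calling `check_hosts("hosts")` from the directory that holds the hosts file prints, for each machine, the interpreter that ssh actually uses. If a host reports a system Python instead of the Anaconda one, passing the full interpreter path as the `python` argument of `build_check_command` confirms whether that interpreter can import argparse and mxnet.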

@thomelane
Contributor

@Feywell if you're still interested, there's a tutorial being worked on at the moment that clearly documents the steps required for distributed training. Check out the PR here, and the tutorial file is mxnet/example/distributed_training/README.md.

@Feywell
Author

Feywell commented Jul 13, 2018

@thomelane Thank you! I will read it carefully.

@Ishitori
Contributor

@Feywell, did it help? Is the problem resolved now?

@ThomasDelteil
Contributor

@yzhliu can you please close the issue?
@Feywell if you have issues setting up distributed training, please create a post on https://discuss.mxnet.io. Thanks!

@yzhliu yzhliu closed this as completed Aug 22, 2018