
Does MXNET support to train on multi CPU cluster ? #5094

Closed
sampathchanda opened this issue Feb 21, 2017 · 6 comments

Comments

@sampathchanda

Issue #185 asks the same question, but it looks like it was never resolved. The link pointed to in that thread is broken:
https://mxnet.readthedocs.org/en/latest/distributed_training.html

However, I see a similar article at:
http://newdocs.readthedocs.io/en/latest/distributed_training.html

Following that article, I created an AWS EC2 cluster using the deeplearning.template, modified to launch t2.micro (CPU-only) instances.

However, I hit the following error when trying to run distributed training:

Traceback (most recent call last):
  File "train_mnist.py", line 153, in <module>
    train_model.fit(args, net, get_iterator(data_shape))
  File "/home/ec2-user/src/mxnet/example/image-classification/train_model.py", line 104, in fit
    epoch_end_callback = checkpoint)
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/model.py", line 787, in fit
    sym_gen=self.sym_gen)
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/model.py", line 202, in _train_multi_device
    update_on_kvstore=update_on_kvstore)
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/model.py", line 80, in _initialize_kvstore
    kvstore.init(idx, arg_params[param_names[idx]])
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/kvstore.py", line 100, in init
    self.handle, mx_uint(len(ckeys)), ckeys, cvals))
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:15:03] src/storage/storage.cc:38: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: no CUDA-capable device is detected

Steps to Reproduce:

  1. Create a CloudFormation stack on AWS EC2 with CPU-only instances, using the deeplearning.template
  2. cd ~/src/mxnet/example/image-classification
  3. ../../tools/launch.py -n 2 -H hosts-orig python train_mnist.py --kv-store dist_sync

It seems that MXNET looks for a GPU whenever distributed training is attempted.
Can anyone please confirm whether distributed training on multiple CPUs is supported at all in MXNET?

Thanks!!

@glingyan
Contributor

@zhenlinluo

@qiyuangong
Contributor

Actually, MXNet does support training on a multi-CPU cluster (we tried it on MNIST two months ago). To my knowledge, MXNet falls back to CPU when no GPUs are set; you can find this code in model.py (line 455 in the current version).
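
For reference, a minimal sketch of that fallback in the style of train_model.py (the tiny symbol is a hypothetical stand-in for the real network):

import mxnet as mx

# hypothetical stand-in for the network built in train_mnist.py
data = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=data, num_hidden=10)
net = mx.symbol.SoftmaxOutput(data=net, name='softmax')

gpus = None  # value of the --gpus flag; None means train on CPU
devs = mx.cpu() if gpus is None else [mx.gpu(int(i)) for i in gpus.split(',')]

model = mx.model.FeedForward(ctx=devs, symbol=net, num_epoch=10)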

I think you encountered this bug because your MXNet was compiled with USE_CUDA = 1 in config.mk (line 47 in the current version). In that case, MXNet checks for a GPU and CUDA first.
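
If so, rebuilding without CUDA should avoid the GPU check. The relevant config.mk settings (run make clean before rebuilding):

USE_CUDA = 0
USE_CUDNN = 0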

By the way, your MXNet (0.7) and train_mnist.py are out of date. Please update them to the latest versions.

@sampathchanda
Author

@qiyuangong Thanks for the reply.

I have actually been trying multiple AWS AMIs (on Amazon Linux): both community AMIs and ones I built myself by installing MXNET from the latest GitHub source, following the installation steps on the website. The run I posted above used the deeplearning.template that is linked from the mxnet repository.

When I install MXNET from scratch on an AWS CPU instance (t2.micro/t2.medium) running Amazon Linux, training does not start at all (nothing is printed) when the following command is launched:
../../tools/launch.py -n 2 -H hosts python train_mnist.py --kv-store dist_sync

Also, while installing, I made sure that USE_CUDA=0 and USE_DIST_KVSTORE=1 (required for distributed training). In fact, I used the script at $MXNET_HOME/setup-utils/install-mxnet-amz-linux.sh.

These are the options my installation script populates into config.mk:
echo "USE_CUDA=0" >>config.mk
echo "USE_CUDNN=0" >>config.mk
echo "USE_BLAS=openblas" >>config.mk
echo "USE_DIST_KVSTORE=1" >>config.mk
echo "ADD_CFLAGS += -I/usr/include/openblas" >>config.mk
echo "ADD_LDFLAGS += -lopencv_core -lopencv_imgproc -lopencv_imgcodecs" >>config.mk

Since you have tried distributed training on a multi-CPU cluster with MXNET, could you please share more details:

  1. What kind of nodes did you use? If AWS instances, what instance type?
  2. Did you use Amazon Linux or Ubuntu as the OS?
  3. Any specific version of MXNET, or the latest one available?

Since I have been using the standard installation script for Amazon Linux, I am not sure what could be going wrong. With training not even starting and no log printed, I am not sure how to even begin debugging. Please help.
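
One way to surface more logging from a stuck launch is ps-lite's PS_VERBOSE environment variable (1 logs connection information, 2 logs all data communication); whether launch.py forwards it to the remote workers is an assumption worth verifying:

PS_VERBOSE=1 ../../tools/launch.py -n 2 -H hosts python train_mnist.py --kv-store dist_sync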

@glingyan
Contributor

Please use the latest mxnet master + #5094 (comment).
I tested multi-node training on AMZ.

@qiyuangong
Contributor

qiyuangong commented Feb 22, 2017

@sampathchanda

Maybe it gets stuck when downloading the data file (22 MB); that URL is always unreachable in my env. I suggest downloading the data files by hand and loading them locally in train_mnist.py (this requires modifying the script).
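
A minimal sketch of loading the files locally instead (the ./data/ paths are assumptions; in train_mnist.py the kvstore comes from the --kv-store flag):

import mxnet as mx

kv = mx.kvstore.create('local')  # train_mnist.py builds this from --kv-store

# assumes the four MNIST files were fetched by hand into ./data/
train = mx.io.MNISTIter(
    image      = "data/train-images-idx3-ubyte",
    label      = "data/train-labels-idx1-ubyte",
    batch_size = 128,
    flat       = True,             # flatten 28x28 images for the MLP
    shuffle    = True,
    num_parts  = kv.num_workers,   # shard the data across workers
    part_index = kv.rank)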

Here are the details about our env:

  1. Ubuntu 16.04 or CentOS 7. Personally, I suggest Ubuntu nodes on VMs (you can launch multiple nodes with VirtualBox, which is free).
  2. A client with Python and MXNet installed, and the MXNet source code in $HOME$, so we can use launch.py to submit tasks.
  3. Passwordless SSH login is configured for the remote nodes (I am not sure if this step is necessary); see the sketch at the end of this comment.
  4. MXNet itself is not necessary on the remote nodes when using --sync-dst-dir, but its dependencies are, because launch.py syncs mxnet/python/mxnet (the Python interface) and libmxnet.so to the remote nodes. Check it in Run MXNet on Multiple CPU/GPUs with Data Parallel.
  5. Create a dir (e.g., /tmp/mxnet) on the remote nodes if you want to use --sync-dst-dir.
  6. We put all the files in a single dir ($HOME$/train_mnist) like this:
    __init__.py (empty)
    mxnet/ (from mxnet/python/mxnet)
        libmxnet.so (built from src)
    train_mnist.py
    hosts
    common/ (from example/image-classification/common)
    symbols/ (from example/image-classification/symbols)
    data files (t10k-images.......)
  7. We use the following cmd:
    ../mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync

In this case, MXNet will sync the train_mnist dir (with the data files) to /tmp/mxnet on all remote nodes and then launch the training command.
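
For reference, the hosts file is just one reachable machine per line (hypothetical private IPs):

172.31.10.11
172.31.10.12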

Your command is also valid, but then you need to install MXNet on each node and make sure your network is fine (for both the client and the nodes).
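
For step 3 above, a typical passwordless-SSH setup from the client looks like this (hypothetical user and address):

ssh-keygen -t rsa
ssh-copy-id ubuntu@172.31.10.11    # repeat for every node in hosts
ssh ubuntu@172.31.10.11 'echo ok'  # should not ask for a password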

@sampathchanda
Author

@qiyuangong Thanks for providing me with your configuration details and steps.

Installing MXNET on Ubuntu 16.04 hits an internal g++ compiler error. However, with Ubuntu 14.04 on AWS instances, I am able to install and run MXNET on a multi-CPU cluster.
