PyTorch example failed #7

Lyken17 · 2019-06-27T17:49:52Z

Describe the bug
Following the instructions in step-by-step-tutorials.md, I could not run the example.

To Reproduce
Steps to reproduce the behavior:

docker pull bytepsimage/worker_pytorch

nvidia-docker run --shm-size=32768m -it bytepsimage/worker_pytorch bash

# now you are in docker environment
export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # say you have 4 GPUs 
export DMLC_WORKER_ID=0 # your worker id
export DMLC_NUM_WORKER=1 # you only have one worker
export DMLC_ROLE=worker # your role is worker

# the following value does not matter for non-distributed jobs 
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.1 
export DMLC_PS_ROOT_PORT=1234 

export EVAL_TYPE=benchmark 
python /usr/local/byteps/launcher/launch.py \
       /usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
       --model resnet50 --num-iters 1000

The error messages are attached below

root@265e564096d1:~# python /usr/local/byteps/launcher/launch.py \
>        /usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
>        --model resnet50 --num-iters 1000
BytePS launching worker
running benchmark...
running benchmark...
running benchmark...
running benchmark...
[2019-06-27 17:46:54.407767: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.428154: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.437652: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.453323: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   220 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   218 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134


/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   216 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134

/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   219 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134
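
A note on the exit status 134 in the log above: shells report "killed by signal N" as 128 + N, so 134 corresponds to signal 6 (SIGABRT), which matches the "Aborted (core dumped)" lines. A minimal sketch of decoding such a status follows; the helper name `describe_exit_status` is made up for illustration and is not part of BytePS.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status.

    By shell convention, a status above 128 means the command was
    terminated by signal number (status - 128).
    """
    if status > 128:
        try:
            return "killed by " + signal.Signals(status - 128).name
        except ValueError:
            # Not a valid signal number on this platform.
            return "exited with code %d" % status
    return "exited with code %d" % status


if __name__ == "__main__":
    print(describe_exit_status(134))
```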

Environment (please complete the following information):

Docker version 18.09.1, build 4c52b90
8 1080Ti GPUs.


changlan · 2019-06-27T17:59:08Z

Thanks for reporting. Just to confirm: were you trying to do single machine training?

Lyken17 · 2019-06-27T18:03:19Z

Yes, I am starting with the single-machine example.

bobzhuyb · 2019-06-27T19:24:48Z

Did you set "export DMLC_NUM_SERVER=1" or not? I know the comment says the value does not matter. I just want to confirm.

Lyken17 · 2019-06-27T20:01:35Z

@bobzhuyb Thanks. After I set the environment variable, the program runs normally!

It seems the value does matter for the example. The documentation needs to be updated.
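
Since a missing variable only shows up as a fatal check inside each worker process, it may help to verify the environment before calling the launcher. The sketch below is an assumption-laden illustration: the list of names is just the DMLC_* variables exported in the reproduction steps, not an authoritative list of what BytePS checks, and `missing_env_vars` is a hypothetical helper.

```python
import os

# Variables exported in the reproduction steps of this issue. This list is an
# assumption for illustration, not the authoritative set that BytePS requires.
REQUIRED_VARS = [
    "DMLC_WORKER_ID",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_NUM_SERVER",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("not set: " + ", ".join(missing))
    else:
        print("all listed variables are set")
```

Running this inside the container before launch.py would have flagged DMLC_NUM_SERVER immediately instead of after four workers aborted.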

Lyken17 · 2019-06-27T20:04:47Z

Another issue has come up: how do I exit the training process cleanly? Ctrl+C only kills the program on one GPU and leaves the other GPUs occupied.

bobzhuyb · 2019-06-27T20:22:15Z

Thanks for letting us know. We will update the documentation. (Actually, I think we should update the code: non-distributed mode should not check that value at all.)

Ctrl+C kills the main process. I am not sure why the child processes are not killed. You can run "ps -ef", find those child processes, and kill them.

We have always killed the whole Docker container, so we never encountered this problem.
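
One general way to avoid orphaned children is to start each worker in its own process group and signal the whole group on Ctrl+C. The sketch below shows that pattern only; it is not the BytePS launcher's code. It assumes Python 3 on a POSIX system (`start_new_session` does not exist in the Python 2.7 used in the traceback above), and the function names are hypothetical.

```python
import os
import signal
import subprocess


def terminate_groups(procs, sig=signal.SIGTERM):
    """Send `sig` to the process group of every child that is still running."""
    for proc in procs:
        if proc.poll() is None:
            try:
                # With start_new_session=True the child leads its own group,
                # so its process group id equals its pid.
                os.killpg(proc.pid, sig)
            except ProcessLookupError:
                pass  # the child exited between poll() and killpg()


def run_in_process_groups(commands):
    """Run shell commands in parallel; on Ctrl+C, terminate all of them."""
    procs = [
        subprocess.Popen(cmd, shell=True, start_new_session=True)
        for cmd in commands
    ]
    try:
        return [proc.wait() for proc in procs]
    except KeyboardInterrupt:
        terminate_groups(procs)
        return [proc.wait() for proc in procs]


if __name__ == "__main__":
    print(run_in_process_groups(["exit 0", "exit 3"]))
```

Because the signal goes to the group rather than to a single pid, processes spawned by the shell script (such as the Python benchmark started from the .sh wrapper) are terminated as well.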

ymjiang · 2019-06-28T02:37:57Z

Thank you for the report.

@bobzhuyb I will remove the irrelevant env check.

changlan added the bug Something isn't working label Jun 27, 2019

changlan assigned ymjiang Jun 27, 2019

Lyken17 closed this as completed Jun 27, 2019

haoxintong mentioned this issue Jun 28, 2019

example: Add MXNet Gluon training example of MNIST. #22

Merged

ghost mentioned this issue Feb 25, 2020

Coredump at ps::Meta::~Meta called in ReleaseFirstMsg #211

Closed

pleasantrabbit pushed a commit that referenced this issue Jul 13, 2020

hotfix: use default num_threads (#7)

b261e4b

DeruiLiu mentioned this issue Aug 5, 2020

some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282

Closed

pleasantrabbit pushed a commit that referenced this issue Nov 3, 2020

rdma: allow binding to given interface (#7)

862f8a6

* rdma: allow binding to given interface * tests: add key log * tests: add knob for log frequency
