PyTorch example failed #7

Closed
Lyken17 opened this issue Jun 27, 2019 · 8 comments
Labels: bug (Something isn't working)

Comments

Lyken17 commented Jun 27, 2019

Describe the bug
Following the instructions in step-by-step-tutorials.md, I could not get the example to run.

To Reproduce
Steps to reproduce the behavior:

docker pull bytepsimage/worker_pytorch

nvidia-docker run --shm-size=32768m -it bytepsimage/worker_pytorch bash

# now you are in docker environment
export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # say you have 4 GPUs 
export DMLC_WORKER_ID=0 # your worker id
export DMLC_NUM_WORKER=1 # you only have one worker
export DMLC_ROLE=worker # your role is worker

# the following value does not matter for non-distributed jobs 
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=10.0.0.1 
export DMLC_PS_ROOT_PORT=1234 

export EVAL_TYPE=benchmark 
python /usr/local/byteps/launcher/launch.py \
       /usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
       --model resnet50 --num-iters 1000      

The error messages are attached below:

root@265e564096d1:~# python /usr/local/byteps/launcher/launch.py \
>        /usr/local/byteps/example/pytorch/start_pytorch_byteps.sh \
>        --model resnet50 --num-iters 1000
BytePS launching worker
running benchmark...
running benchmark...
running benchmark...
running benchmark...
[2019-06-27 17:46:54.407767: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.428154: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.437652: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
[2019-06-27 17:46:54.453323: F byteps/common/global.cc:101] Check failed: getenv("DMLC_NUM_SERVER") error: env DMLC_NUM_SERVER not set
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   220 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   218 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134


/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   216 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134

/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh: line 20:   219 Aborted                 (core dumped) python $path/benchmark_byteps.py $@
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/byteps/example/pytorch/start_pytorch_byteps.sh --model resnet50 --num-iters 1000' returned non-zero exit status 134
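
For readers decoding the log above: an exit status of 134 is 128 + 6, which is how a shell reports a child killed by signal 6 (SIGABRT). That is consistent with the "Aborted (core dumped)" lines. A minimal sketch of the arithmetic, assuming bash; decode_exit_status is a hypothetical helper name used only for illustration:

```shell
#!/usr/bin/env bash
# Decode a shell exit status into the signal that ended the process, if any.
# decode_exit_status is a hypothetical helper, not part of BytePS.
decode_exit_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        # The shell reports "killed by signal N" as 128 + N;
        # kill -l N prints the name of signal N.
        echo "signal $(kill -l $((status - 128)))"
    else
        echo "exit code $status"
    fi
}

decode_exit_status 134
```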

Environment (please complete the following information):

  • Docker version 18.09.1, build 4c52b90
  • 8 1080Ti GPUs.
changlan added the bug label Jun 27, 2019

changlan (Contributor) commented

Thanks for reporting. Just to confirm: were you trying to do single machine training?

Lyken17 (Author) commented Jun 27, 2019

Yes, I am starting with the single-machine example.

bobzhuyb (Member) commented

Did you run "export DMLC_NUM_SERVER=1" or not? I know the comment says the value does not matter, but I just want to confirm.

Lyken17 (Author) commented Jun 27, 2019

@bobzhuyb Thanks, after I set the environment variable, the program runs normally!

It seems the variable does have to be set for the example, even for a single-machine run. The documentation needs to be updated.
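
A pre-flight check along these lines would surface a missing variable before the workers abort. This is only a sketch: check_byteps_env is a made-up helper name, and the variable list is simply the set of exports from the reproduction steps above, not an authoritative list of what BytePS reads.

```shell
#!/usr/bin/env bash
# Report every launcher variable that is missing from the environment.
# check_byteps_env is a hypothetical helper; the variable list is an assumption
# copied from the exports in the reproduction steps.
check_byteps_env() {
    local missing=0 name
    for name in DMLC_WORKER_ID DMLC_NUM_WORKER DMLC_ROLE \
                DMLC_NUM_SERVER DMLC_PS_ROOT_URI DMLC_PS_ROOT_PORT; do
        # printenv only sees exported variables, which is also all that a
        # child process calling getenv() will see.
        if ! printenv "$name" >/dev/null; then
            echo "env $name not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_byteps_env before the launch command (for example, joined to it with &&) would stop the launch whenever a variable was assigned but never exported.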

Lyken17 closed this as completed Jun 27, 2019

Lyken17 (Author) commented Jun 27, 2019

Hmm, another issue has come up: how do I exit the training process cleanly? Ctrl+C only kills the process on one GPU and leaves the other GPUs occupied.

Lyken17 (Author) commented Jun 27, 2019

[image attachment]

bobzhuyb (Member) commented

Thanks for letting us know. We will update the documentation. (Actually, I think we should update the code: non-distributed mode should not check that value at all.)

Ctrl-C kills the main process. I am not sure why the child processes are not killed. You can run "ps -ef", find those child processes, and kill them.

We have always killed the whole Docker container, so we never ran into this problem.
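
The "ps -ef" and kill suggestion above can be scripted roughly as follows. This is a sketch under assumptions: kill_stray is a hypothetical helper, pgrep (from procps) is assumed to be installed in the container, and the pattern has to be specific enough that it does not match unrelated processes.

```shell
#!/usr/bin/env bash
# Terminate leftover processes whose command line matches a pattern.
# Sends SIGTERM first, then SIGKILL to anything still alive after a grace period.
# kill_stray is a hypothetical helper, not part of BytePS.
kill_stray() {
    local pattern=$1 grace=${2:-5} pid pids
    # pgrep -f matches against the full command line; drop this shell's own PID.
    pids=$(pgrep -f "$pattern" | grep -vx "$$" || true)
    if [ -z "$pids" ]; then
        return 0
    fi
    # $pids is intentionally unquoted so each PID becomes its own argument.
    kill $pids 2>/dev/null || true
    sleep "$grace"
    for pid in $pids; do
        # kill -0 only tests whether the process still exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill -9 "$pid" 2>/dev/null || true
        fi
    done
    return 0
}
```

Inside the container, something like "kill_stray benchmark_byteps.py" would target the worker script named in the error log. Running "pgrep -f" with the same pattern first shows which PIDs would be affected.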

ymjiang (Member) commented Jun 28, 2019

Thank you for the report.

@bobzhuyb I will remove the irrelevant env check.
