Check failed: getenv("BYTEPS_LOCAL_RANK") error: env BYTEPS_LOCAL_RANK not set Aborted #3

Closed
rabintang opened this issue Jun 27, 2019 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@rabintang

When I run the tensorflow_mnist.py example, I get the error shown in the title.

P.S. There is too little documentation on how to run programs for the other platforms!

@changlan
Contributor

Sorry for the lack of clarity. Did you follow these instructions (https://github.com/bytedance/byteps/blob/master/docs/env.md) to set up the environment variables?

@rabintang
Author

> Sorry for the lack of clarity. Did you follow this instruction (https://github.com/bytedance/byteps/blob/master/docs/env.md) to set up environment variables?

I have read this doc, but it covers the MXNet platform. I didn't see any instructions for the TensorFlow platform.

I ran the command directly: python example/tensorflow/tensorflow_mnist.py, as shown in (https://github.com/bytedance/byteps/blob/master/example/tensorflow/run_tensorflow_byteps.sh).

@bobzhuyb
Member

Regardless of the framework (TF or MXNet or PyTorch), you need to set the same set of environment variables. The reason is that we reuse the DMLC stuff for worker/server/scheduler bootstrapping.

In short, just follow the document to set the environment variables.
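
A small pre-flight check can catch an unset variable before anything is launched. The sketch below is not part of BytePS: `missing_dmlc_vars` is a hypothetical helper, and the variable list is assumed from the names that appear in this thread for a worker, so docs/env.md remains the authoritative source.

```python
import os

# Names collected from this thread (worker role); treat docs/env.md as authoritative.
REQUIRED_VARS = (
    "DMLC_ROLE",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
    "DMLC_NUM_WORKER",
    "DMLC_NUM_SERVER",
    "DMLC_WORKER_ID",
)


def missing_dmlc_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_dmlc_vars()` with no argument inspects the current process environment; an empty list means every variable in the assumed list is set.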

@ymjiang
Member

ymjiang commented Jun 27, 2019

@rabintang
In your case, the correct way to run it is as follows (make sure you have set the environment variables correctly):
python byteps/launcher/launch.py byteps/example/tensorflow/tensorflow_mnist.py

The launcher will allocate BYTEPS_LOCAL_RANK automatically.

We are also actively updating the tutorials. Stay tuned.
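
To illustrate what "allocating BYTEPS_LOCAL_RANK" could look like, here is a minimal sketch, not the actual launch.py code. `per_gpu_envs` is a hypothetical helper, and both the use of `NVIDIA_VISIBLE_DEVICES` to count GPUs and the `BYTEPS_LOCAL_SIZE` variable are assumptions made for illustration.

```python
import os


def per_gpu_envs(base_env=None):
    """Build one environment dict per visible GPU, each with its own local rank."""
    base_env = dict(os.environ if base_env is None else base_env)
    # Assumption: the GPU list comes from NVIDIA_VISIBLE_DEVICES, e.g. "0,1,2,3".
    visible = base_env.get("NVIDIA_VISIBLE_DEVICES", "")
    devices = [d.strip() for d in visible.split(",") if d.strip()]

    envs = []
    for local_rank in range(len(devices)):
        env = dict(base_env)
        env["BYTEPS_LOCAL_RANK"] = str(local_rank)
        # Assumption: the total number of local processes is exported as well.
        env["BYTEPS_LOCAL_SIZE"] = str(len(devices))
        envs.append(env)
    return envs
```

A launcher could then start one training process per dict, for example with `subprocess.check_call(command, env=env, shell=True)`, which is why running the training script directly leaves `BYTEPS_LOCAL_RANK` unset.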

@bobzhuyb bobzhuyb added the documentation Improvements or additions to documentation label Jun 27, 2019
@rabintang
Author

> @rabintang
> In your case, the correct way to run is: (and make sure you have set the env correctly)
> python byteps/launcher/launch.py byteps/example/tensorflow/tensorflow_mnist.py
>
> The launcher will allocate BYTEPS_LOCAL_RANK automatically.
>
> Besides, we are updating the tutorials actively. Stay tuned.

Thanks, but what is the correct env if I just run on a single machine with 4 GPUs?

I set the env like this:
export DMLC_ROLE=worker
export DMLC_PS_ROOT_URI=192.168.144.133
export DMLC_PS_ROOT_PORT=9100
export DMLC_NUM_WORKER=2
export DMLC_WORKER_ID=0
export NVIDIA_VISIBLE_DEVICES=0,1

and ran this command: python launcher/launch.py example/tensorflow/tensorflow_mnist.py

However, I then got the following errors:
`BytePS launching worker
/bin/sh: 1: example/tensorflow/tensorflow_mnist.py: Permission denied
/bin/sh: 1: example/tensorflow/tensorflow_mnist.py: Permission denied
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'example/tensorflow/tensorflow_mnist.py' returned non-zero exit status 126

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "launcher/launch.py", line 18, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'example/tensorflow/tensorflow_mnist.py' returned non-zero exit status 126`

@ymjiang
Member

ymjiang commented Jun 27, 2019

@rabintang
For your case, try this:

export DMLC_ROLE=worker
export DMLC_PS_ROOT_URI=192.168.144.133
export DMLC_PS_ROOT_PORT=9100
export DMLC_NUM_SERVER=1   # this value does not matter when you only have 1 worker
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1     # you only have 1 worker
export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # you have 4 GPUs on one worker

# to solve the permission problem
chmod 755 example/tensorflow/tensorflow_mnist.py

python launcher/launch.py example/tensorflow/tensorflow_mnist.py
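
Some background on the permission fix: the traceback shows the launcher handing the script path to `/bin/sh` as the command itself, and exit status 126 is the shell's way of saying the file was found but could not be executed. The sketch below shows the same check and fix from Python. The helper names are hypothetical, and falling back to running the script through the interpreter is an assumption that was not tried in this thread.

```python
import os
import shlex
import stat
import sys


def is_executable(path):
    """True if any execute bit (user, group, or other) is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def make_executable(path):
    """Python equivalent of `chmod 755 path`."""
    os.chmod(path, 0o755)


def runnable_command(script_path):
    """Return a shell command string that should run the script.

    If the script lacks execute permission, run it through the current
    Python interpreter instead of relying on its mode bits (assumption:
    the launcher passes the command string to the shell unchanged).
    """
    quoted = shlex.quote(script_path)
    if is_executable(script_path):
        return quoted
    return "{} {}".format(shlex.quote(sys.executable), quoted)
```

Note that an executable script also needs a valid shebang line (such as `#!/usr/bin/env python`) for the shell to know which interpreter to use.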

@rabintang
Author

thx, it works!

@ymjiang
Member

ymjiang commented Jun 27, 2019

Glad that it works, closing this issue.

4 participants