
"set_mempolicy: Operation not permitted" and performance degradation in 8GPU with single machine #17

Closed
burness opened this issue Jun 28, 2019 · 15 comments
Labels: bug (Something isn't working)

burness commented Jun 28, 2019

Describe the bug
I use 4 GPUs (1080 Ti) on a single machine and it performs well, but when I use 8 GPUs, BytePS suffers a severe performance degradation: from 161.8 img/sec per GPU down to 17.4 img/sec per GPU, along with warnings like "set_mempolicy: Operation not permitted".

To Reproduce
Steps to reproduce the behavior:
Just change the GPU number in the step-by-step tutorial.
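For context, a rough sketch of what that change looks like, assuming the tutorial exposes GPUs through NVIDIA_VISIBLE_DEVICES (that variable name is an assumption here, not quoted from the tutorial):

  # Inside the worker container: expose all 8 GPUs instead of 4
  export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # was 0,1,2,3 for the 4-GPU run
  # ...then start the worker exactly as the step-by-step tutorial describes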

Environment (please complete the following information):

  • OS: Ubuntu
  • GCC version: 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
  • CUDA and NCCL version: CUDA 9.0.176
  • Framework (TF, PyTorch, MXNet): TensorFlow
  • Using the BytePS Docker images


changlan (Contributor) commented Jun 28, 2019

Looks like Docker prevented numactl from setting mempolicy. We will need to confirm if this persists on our testbeds @ymjiang.

As a quick workaround, perhaps try docker run with --security-opt seccomp=seccomp.json, using a custom seccomp.json file like this one: https://gist.github.com/w1ndy/4aee49aa3a608c977a858542ed5f1ee5?
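For reference, a minimal sketch of that workaround (the image name is the one used later in this thread; the profile from the gist is assumed to be saved locally as seccomp.json):

  # Save the custom profile from the gist above as ./seccomp.json, then start the container with it:
  nvidia-docker run -it --shm-size=32768m \
      --security-opt seccomp=seccomp.json \
      bytepsimage/worker_tensorflow bash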

changlan added the "bug" label on Jun 28, 2019
burness (Author) commented Jun 28, 2019

@changlan I tried docker run with --security-opt seccomp=seccomp.json; the warning "set_mempolicy: Operation not permitted" disappears, but the speed is still low.
[Screenshot from 2019-06-28 14-01-28]

ymjiang (Member) commented Jun 28, 2019

It may have something to do with NUMA. Can you try export BYTEPS_PCIE_SWITCH_SIZE=8 before running? This may help us figure out the problem.

As far as we have tested, there is no such problem on our testbed. @changlan
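A minimal sketch of the suggested check (where to set the variable and how it is used are my reading of the suggestion, not details confirmed in this thread):

  # Set before launching the worker, per the suggestion above
  export BYTEPS_PCIE_SWITCH_SIZE=8
  # ...then rerun the tutorial benchmark and compare img/sec against the 4-GPU run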

burness (Author) commented Jun 28, 2019

@ymjiang export BYTEPS_PCIE_SWITCH_SIZE=8 solves the problem. Could you make BytePS set this value automatically according to the GPU count?

bobzhuyb (Member) commented:

@ymjiang Let's change the default value to 8. Only very experienced BytePS users may know whether a <8 value is better for their environment.
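For anyone later wondering whether a value below 8 suits their machine, the PCIe/NUMA topology can be inspected with standard tools (not BytePS-specific; just one way to ground that decision):

  # Show how the GPUs hang off PCIe switches and NUMA nodes
  nvidia-smi topo -m
  # Show the NUMA layout that numactl sees
  numactl --hardware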

ymjiang (Member) commented Jun 28, 2019

Default value changed to 8 now (7c4dd67). Closing this issue.

ymjiang closed this as completed on Jun 28, 2019
burness (Author) commented Jun 28, 2019

@ymjiang @changlan It seems that "/usr/local/byteps/launcher/launch.py" is missing from the server image "bytepsimage/byteps_server". Could anyone help push the correct image?

ymjiang (Member) commented Jun 28, 2019

@burness Can you clean the cached image and try again now? We just pushed a new image.
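A quick sketch of refreshing the cached image (standard docker commands; the image name comes from the comment above):

  docker rmi bytepsimage/byteps_server    # drop the stale local copy
  docker pull bytepsimage/byteps_server   # fetch the newly pushed image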

burness (Author) commented Jun 29, 2019

@ymjiang @changlan I have pulled the new image, but when I run nvidia-docker run --shm-size=32768m --security-opt seccomp=seccomp.json -it bytepsimage/worker_tensorflow bash for the scheduler docker container, it fails with the following error log:

BytePS launching scheduler
[01:53:31] src/./zmq_van.h:285: Start ZMQ recv thread
Traceback (most recent call last):
  File "/usr/local/byteps/launcher/launch.py", line 45, in <module>
    import mxnet
  File "/root/incubator-mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/root/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [01:53:31] src/van.cc:358: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /root/incubator-mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f400f4d1e5c]
[bt] (1) /root/incubator-mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f400f4d31d8]
[bt] (2) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::Start(int)+0x962) [0x7f40129bb362]
[bt] (3) /root/incubator-mxnet/lib/libmxnet.so(ps::ZMQVan::Start(int)+0x1a6) [0x7f40129c7406]
[bt] (4) /root/incubator-mxnet/lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f40129b5fcc]
[bt] (5) /root/incubator-mxnet/lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x11e) [0x7f4012946c7e]
[bt] (6) /root/incubator-mxnet/lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f40128aef15]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f40a1503e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f40a15038ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f40a17133df]

It seems the port binding causes the error.

burness (Author) commented Jun 29, 2019

@ymjiang Another question: I see the backtrace ([bt]) error info in the log. How can I get the core dump file? I set ulimit -c unlimited in your container, but I can't find any core dump files.
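This question is not answered later in the thread; for completeness, a hedged sketch of what is typically needed for core dumps under Docker (ulimit alone is often not enough):

  # 1) Allow unlimited core files when starting the container (add to the docker run command above)
  #    --ulimit core=-1
  # 2) core_pattern is global (not per-container), so set the dump filename pattern on the host
  echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
  # 3) Inside the container, confirm the soft limit
  ulimit -c unlimited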

bobzhuyb (Member) commented:

@ymjiang Did you miss "--net=host" in all your commands in the tutorial?

ymjiang (Member) commented Jun 29, 2019

@burness Can you add --net=host when you run docker? We will modify the tutorials.
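Putting the pieces together, the command quoted earlier in this thread would then look roughly like this with --net=host added (a sketch, not the exact line from the fixed tutorial):

  nvidia-docker run --net=host --shm-size=32768m \
      --security-opt seccomp=seccomp.json \
      -it bytepsimage/worker_tensorflow bash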

ymjiang (Member) commented Jun 29, 2019

Fixed the tutorial: ec55073

burness (Author) commented Jun 29, 2019

Adding --net=host solves this problem. Another suggestion: you could delete the inline comment words from the export commands in the tutorial, because export may set the value to 10.0.0.1 # the scheduler IP, as the attached screenshot shows.
[screenshot]
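To illustrate the concern, a small sketch (the variable name DMLC_PS_ROOT_URI is an assumption about which tutorial line is meant; the value and comment text come from the report above):

  # If the tutorial line is copied with the trailing comment inside the quotes,
  # the variable literally becomes "10.0.0.1 # the scheduler IP":
  export DMLC_PS_ROOT_URI="10.0.0.1 # the scheduler IP"
  # Without the quotes, bash drops the trailing " # ..." as a comment and the value is just the IP:
  export DMLC_PS_ROOT_URI=10.0.0.1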

ymjiang (Member) commented Jun 29, 2019

@burness Thank you for the reminder. On our platform there seems to be no such problem with export. Did you observe that on your OS?

Anyway, we will fix this just in case.
