
"set_mempolicy: Operation not permitted" and performance degradation in 8GPU with single machine #17

Closed
burness opened this issue Jun 28, 2019 · 15 comments
Labels: bug (Something isn't working)

burness commented Jun 28, 2019

Describe the bug
I use 4 GPUs (1080 Ti) on a single machine and it performs well, but when I use 8 GPUs, BytePS suffers a severe performance degradation: from 161.8 img/sec per GPU down to 17.4 img/sec per GPU, along with warnings like "set_mempolicy: Operation not permitted".

To Reproduce
Steps to reproduce the behavior:
Just change the GPU number in the step-by-step tutorial.
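For context, a rough sketch of what that change looks like, assuming the tutorial exposes GPUs through NVIDIA_VISIBLE_DEVICES (that variable name is an assumption here, not quoted from the tutorial):

  # Inside the worker container: expose all 8 GPUs instead of 4
  export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # was 0,1,2,3 for the 4-GPU run
  # ...then start the worker exactly as the step-by-step tutorial describes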

Environment (please complete the following information):

  • OS: Ubuntu
  • GCC version: 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
  • CUDA and NCCL version: CUDA 9.0.176
  • Framework (TF, PyTorch, MXNet): TensorFlow
  • Using the BytePS Docker images


changlan (Contributor) commented Jun 28, 2019

Looks like Docker prevented numactl from setting mempolicy. We will need to confirm if this persists on our testbeds @ymjiang.

As a quick workaround, perhaps try docker run with --security-opt seccomp=seccomp.json, using a custom seccomp.json file like this one: https://gist.github.com/w1ndy/4aee49aa3a608c977a858542ed5f1ee5?
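For reference, a minimal sketch of that workaround (the image name is the one used later in this thread; the profile from the gist is assumed to be saved locally as seccomp.json):

  # Save the custom profile from the gist above as ./seccomp.json, then start the container with it:
  nvidia-docker run -it --shm-size=32768m \
      --security-opt seccomp=seccomp.json \
      bytepsimage/worker_tensorflow bash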

changlan added the "bug" label on Jun 28, 2019
burness (Author) commented Jun 28, 2019

@changlan I tried docker run with --security-opt seccomp=seccomp.json; the warning "set_mempolicy: Operation not permitted" disappears, but the speed is still low.
[Screenshot from 2019-06-28 14-01-28]

ymjiang (Member) commented Jun 28, 2019

It may have something to do with NUMA. Can you try export BYTEPS_PCIE_SWITCH_SIZE=8 before running? This may help us figure out the problem.

As far as we have tested, there is no such problem on our testbed. @changlan
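A minimal sketch of the suggested check (where to set the variable and how it is used are my reading of the suggestion, not details confirmed in this thread):

  # Set before launching the worker, per the suggestion above
  export BYTEPS_PCIE_SWITCH_SIZE=8
  # ...then rerun the tutorial benchmark and compare img/sec against the 4-GPU run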

burness (Author) commented Jun 28, 2019

@ymjiang export BYTEPS_PCIE_SWITCH_SIZE=8 solves the problem. Could you make BytePS set this value automatically according to the GPU count?

bobzhuyb (Member) commented:

@ymjiang Let's change the default value to 8. Only very experienced BytePS users may know whether a <8 value is better for their environment.
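For anyone later wondering whether a value below 8 suits their machine, the PCIe/NUMA topology can be inspected with standard tools (not BytePS-specific; just one way to ground that decision):

  # Show how the GPUs hang off PCIe switches and NUMA nodes
  nvidia-smi topo -m
  # Show the NUMA layout that numactl sees
  numactl --hardware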

ymjiang (Member) commented Jun 28, 2019

Default value changed to 8 now (7c4dd67). Closing this issue.

ymjiang closed this as completed on Jun 28, 2019
burness (Author) commented Jun 28, 2019

@ymjiang @changlan It seems that "/usr/local/byteps/launcher/launch.py" is missing from the server image "bytepsimage/byteps_server". Could anyone help push the correct image?

ymjiang (Member) commented Jun 28, 2019

@burness Can you clean the cached image and try again now? We just pushed a new image.
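A quick sketch of refreshing the cached image (standard docker commands; the image name comes from the comment above):

  docker rmi bytepsimage/byteps_server    # drop the stale local copy
  docker pull bytepsimage/byteps_server   # fetch the newly pushed image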

burness (Author) commented Jun 29, 2019

@ymjiang @changlan I have pulled the new image, but when I run nvidia-docker run --shm-size=32768m --security-opt seccomp=seccomp.json -it bytepsimage/worker_tensorflow bash for the scheduler docker container, it fails with the following error log:

BytePS launching scheduler
[01:53:31] src/./zmq_van.h:285: Start ZMQ recv thread
Traceback (most recent call last):
  File "/usr/local/byteps/launcher/launch.py", line 45, in <module>
    import mxnet
  File "/root/incubator-mxnet/python/mxnet/__init__.py", line 91, in <module>
    from . import kvstore_server
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/root/incubator-mxnet/python/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/root/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [01:53:31] src/van.cc:358: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /root/incubator-mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f400f4d1e5c]
[bt] (1) /root/incubator-mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f400f4d31d8]
[bt] (2) /root/incubator-mxnet/lib/libmxnet.so(ps::Van::Start(int)+0x962) [0x7f40129bb362]
[bt] (3) /root/incubator-mxnet/lib/libmxnet.so(ps::ZMQVan::Start(int)+0x1a6) [0x7f40129c7406]
[bt] (4) /root/incubator-mxnet/lib/libmxnet.so(ps::Postoffice::Start(int, char const*, bool)+0x7c) [0x7f40129b5fcc]
[bt] (5) /root/incubator-mxnet/lib/libmxnet.so(mxnet::kvstore::KVStoreDist::RunServer(std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)+0x11e) [0x7f4012946c7e]
[bt] (6) /root/incubator-mxnet/lib/libmxnet.so(MXKVStoreRunServer+0x65) [0x7f40128aef15]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f40a1503e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f40a15038ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f40a17133df]

It seems the port binding causes the error.

burness (Author) commented Jun 29, 2019

@ymjiang Another question: I see the backtrace ([bt]) error info in the log. How can I get the core dump file? I set ulimit -c unlimited in your container, but I can't find any core dump files.
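This question is not answered later in the thread; for completeness, a hedged sketch of what is typically needed for core dumps under Docker (ulimit alone is often not enough):

  # 1) Allow unlimited core files when starting the container (add to the docker run command above)
  #    --ulimit core=-1
  # 2) core_pattern is global (not per-container), so set the dump filename pattern on the host
  echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
  # 3) Inside the container, confirm the soft limit
  ulimit -c unlimited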

bobzhuyb (Member) commented:

@ymjiang Did you miss "--net=host" in all your commands in the tutorial?

ymjiang (Member) commented Jun 29, 2019

@burness Can you add --net=host when you run docker? We will modify the tutorials.
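Putting the pieces together, the command quoted earlier in this thread would then look roughly like this with --net=host added (a sketch, not the exact line from the fixed tutorial):

  nvidia-docker run --net=host --shm-size=32768m \
      --security-opt seccomp=seccomp.json \
      -it bytepsimage/worker_tensorflow bash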

ymjiang (Member) commented Jun 29, 2019

Fixed the tutorial: ec55073

burness (Author) commented Jun 29, 2019

Adding --net=host solves this problem. Another suggestion: you could delete the inline comment words from the export commands in the tutorial, because export may set the value to 10.0.0.1 # the scheduler IP, as the attached screenshot shows.
[screenshot]
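To illustrate the concern, a small sketch (the variable name DMLC_PS_ROOT_URI is an assumption about which tutorial line is meant; the value and comment text come from the report above):

  # If the tutorial line is copied with the trailing comment inside the quotes,
  # the variable literally becomes "10.0.0.1 # the scheduler IP":
  export DMLC_PS_ROOT_URI="10.0.0.1 # the scheduler IP"
  # Without the quotes, bash drops the trailing " # ..." as a comment and the value is just the IP:
  export DMLC_PS_ROOT_URI=10.0.0.1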

ymjiang (Member) commented Jun 29, 2019

@burness Thank you for the reminder. On our platform there seems to be no such problem with export. Did you observe that on your OS?

Anyway, we will fix this just in case.
