RDMA: Check failed: mr ibv_reg_mr failed: Cannot allocate memory #372

Closed
Ruinhuang opened this issue Mar 16, 2021 · 1 comment

Ruinhuang commented Mar 16, 2021

Describe the bug
I tried the following scenario with bpslaunch, and an error occurs:
Run resnet50 on 2 nodes, each node with 8 GPUs, NO additional CPU servers
https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA

To Reproduce
Steps to reproduce the behavior:
The steps are exactly the same as in the step-by-step tutorial (https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA)

I set up 2 workers, 2 servers, and 1 scheduler in this scenario:
Node1:
1. Start the scheduler
2. Start the server
3. Start the worker

Node2:
1. Start the server
2. Start the worker
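
For reference, the per-role environment for this layout could be sketched roughly as below. The DMLC_* variable names are the ones used in the tutorial; the IP address, the port, and the helper names set_common_env and set_role are hypothetical placeholders, not part of BytePS. The sketch only exports variables and does not launch anything.

```shell
#!/usr/bin/env bash
# Sketch only: exports the topology variables for one process of the job.

set_common_env() {
  export DMLC_ENABLE_RDMA=ibverbs   # the scheduler log reports that "1" will be deprecated
  export DMLC_NUM_WORKER=2
  export DMLC_NUM_SERVER=2
  export DMLC_PS_ROOT_URI="${SCHEDULER_IP:-10.0.0.1}"   # placeholder: address of Node1
  export DMLC_PS_ROOT_PORT="${SCHEDULER_PORT:-1234}"    # placeholder port
}

# usage: set_role scheduler | set_role server | set_role worker <worker_id>
set_role() {
  set_common_env
  export DMLC_ROLE="$1"
  if [ "$1" = "worker" ]; then
    export DMLC_WORKER_ID="$2"   # 0 on Node1, 1 on Node2
  else
    unset DMLC_WORKER_ID         # only workers carry an id
  fi
}
```

Each process would call set_role in its own shell (for example set_role worker 1 on Node2) and then run the bpslaunch command from the tutorial.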

I started the processes in this order:
Node1 scheduler -> server -> worker -> Node2 server -> worker
After I started the worker on Node2, the error occurred. The scheduler's log shows:

BytePS launching scheduler
Command: python3 -c 'import byteps.server'

[02:19:48] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[02:19:48] src/postoffice.cc:25: Creating Van: 1
[02:19:48] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[02:19:48] src/./rdma_van.h:44: Shared memory IPC has been disabled
[02:19:48] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[02:19:48] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[02:21:08] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:23:25] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:25:36] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:34] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:35] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[02:27:35] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]


Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
    launch_bps()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 216, in launch_bps
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

Environment (please complete the following information):
I used the PyTorch Dockerfile (byteps/docker/Dockerfile) to build the container.

Additional context
Do you have any suggestions?
My ulimit -l result is unlimited.
I set BYTEPS_RDMA_START_DEPTH=16 and BYTEPS_RDMA_RX_DEPTH=32, but the same error still occurs.
I am using the latest BytePS code.
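
To rule out the environment, a quick check like the following can be run inside the container right before launching each process. It is only a sketch: report_rdma_env is a hypothetical helper name, and it merely prints the locked-memory limit and the BYTEPS_RDMA_* values that child processes would inherit. A stray space around = in an export line, for instance, would not set the variable as intended and would show up here.

```shell
#!/usr/bin/env bash
# Sketch only: prints what a process started from this shell would see.

report_rdma_env() {
  local limit name
  limit="$(ulimit -l)"   # max locked memory; bash reports KiB or "unlimited"
  echo "max locked memory: ${limit}"
  for name in BYTEPS_RDMA_START_DEPTH BYTEPS_RDMA_RX_DEPTH; do
    # printenv only sees exported variables, i.e. what bpslaunch would inherit
    echo "${name}=$(printenv "$name" || echo '<unset>')"
  done
}

report_rdma_env
```

If a depth variable prints as <unset>, the server falls back to the defaults named in the error message.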

Does the start sequence matter?

@Ruinhuang (Author)

I solved this issue by setting:

export BYTEPS_RDMA_RX_DEPTH=1024
export BYTEPS_RDMA_START_DEPTH=64

But how should I choose these values?
What conditions determine the right settings?
