Check failed: mr ibv_reg_mr failed: Cannot allocate memory #295

Closed
ChenYuHo opened this issue Sep 16, 2020 · 3 comments

Comments

@ChenYuHo

ChenYuHo commented Sep 16, 2020

I'm using a Docker image built with byteps at master (2141033f6c9a09ab0393bdde77f9fa86a5d057de) and ps-lite at the byteps branch (51f7248dd4d02d6182a4f93e6109f69a3a04d49b). I hit this error when trying to use RDMA (with TCP it works). My setup:
4 physical machines: 2 running workers, 1 running a server, and 1 running a server plus the scheduler.

1 scheduler:

docker run \
-e DMLC_ROLE=scheduler \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch

2 servers:

docker run \
-e DMLC_ROLE=server \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch

2 workers (DMLC_WORKER_ID=0 on one, DMLC_WORKER_ID=1 on the other):

docker run \
-e DMLC_ROLE=worker \
-e DMLC_WORKER_ID=0 \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --runtime=nvidia --net=host byteps/pytorch:latest \
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 20

The logs:

BytePS launching scheduler
[05:42:27] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:42:27] src/postoffice.cc:19: Creating Van: ibverbs
[05:42:28] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=IPC
[05:42:28] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:43:11] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:43:30] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:07] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[05:45:07] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]

server 0

BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

server 1

BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

worker 0

BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1

worker 1

BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1

I checked #282 and followed the instructions there.
I tried reducing BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH, but that didn't help.
ulimit -l reports unlimited inside the container.

Any suggestions?
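
For anyone reproducing this, the locked-memory check mentioned above can be scripted so it is easy to repeat in each container. This is only a minimal sketch: `check_memlock` is a hypothetical helper name, not part of BytePS or ps-lite, and it only reports what `ulimit -l` returns in the current shell.

```shell
# Hypothetical helper (not part of BytePS): report the locked-memory limit.
# With no argument it reads the limit of the current shell via `ulimit -l`;
# an explicit argument can be passed to check a value captured elsewhere.
check_memlock() {
  limit="${1:-$(ulimit -l)}"
  if [ "$limit" = "unlimited" ]; then
    echo "memlock ok: unlimited"
  else
    echo "memlock limited: $limit"
  fi
}
```

Running `check_memlock` inside each container (scheduler, servers, workers) would show whether `--ulimit memlock=-1` actually took effect there. An unlimited value only rules out the memlock limit as the cause; it does not explain why `ibv_reg_mr` fails.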

@bobzhuyb
Member

What values did you reduce BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH to? Can you try setting them even lower?
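
As an illustration of what lowering them could look like with the docker commands in this issue, here is a minimal sketch. The two variable names come from the error message; `rdma_depth_flags` is a hypothetical helper, the values are placeholders, and whether the variables need to be set for every role is an assumption rather than something confirmed in this thread.

```shell
# Hypothetical helper: print the extra `-e` flags for docker run.
# Arguments: RX depth, then start depth. Both must be non-negative integers.
rdma_depth_flags() {
  for value in "$1" "$2"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "depth must be a non-negative integer: '$value'" >&2
        return 1
        ;;
    esac
  done
  printf '%s\n' "-e BYTEPS_RDMA_RX_DEPTH=$1 -e BYTEPS_RDMA_START_DEPTH=$2"
}

# Example (placeholder values), added to the existing docker run flags:
#   docker run $(rdma_depth_flags 32 16) -e DMLC_ROLE=server ...
```

Generating the flags in one place keeps the scheduler, server, and worker commands consistent when trying several values.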

@Ruinhuang

@ChenYuHo can you tell me how you solved this issue?

@Ruinhuang

I have the same issue:
ulimit -l is unlimited inside the container
BYTEPS_RDMA_RX_DEPTH is set to 32
BYTEPS_RDMA_START_DEPTH is set to 16
It still doesn't solve the problem.
@bobzhuyb
