some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282

Closed
DeruiLiu opened this issue Aug 4, 2020 · 17 comments

Comments

@DeruiLiu

DeruiLiu commented Aug 4, 2020

I want to run 1 worker and 1 server, but when I use the following commands to start the server I get an error. Has anyone met the same error?

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1

export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9000

bpslaunch

The error is as below:
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:23:05] src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=941, kMempoolChunkSize=56

I would rather not use Docker.
Do I have to use docker pull bytepsimage/tensorflow to start correctly?
Can I start BytePS without Docker?

@DeruiLiu DeruiLiu changed the title some question about to start server. some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory Aug 4, 2020
@ymjiang
Member

ymjiang commented Aug 4, 2020

Similar issue: #216.

We have fixed this (bytedance/ps-lite#30) but it is not merged into master yet. For now you can update the 3rdparty/ps-lite submodule, recompile ps-lite and BytePS, and then try to reduce BYTEPS_RDMA_RX_DEPTH (default 2048) to a smaller value (256 for example).
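
For example, a rough way to apply the smaller RX depth when launching the server (256 here is just the illustrative value mentioned above):

export BYTEPS_RDMA_RX_DEPTH=256   # default is 2048
bpslaunch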

@DeruiLiu
Author

DeruiLiu commented Aug 4, 2020

Similar issue: #216.

We have fixed this (bytedance/ps-lite#30) but it is not merged into master yet. For now you can update the 3rdparty/ps-lite submodule, recompile ps-lite and BytePS, and then try to reduce BYTEPS_RDMA_RX_DEPTH (default 2048) to a smaller value (256 for example).

Yeah. But I have 2 machines: one machine runs 1 server and 1 worker, and the other runs 1 worker and 1 scheduler.
And I set BYTEPS_RDMA_RX_DEPTH = 128; I still get the error
Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56.

Maybe I cannot run it this way? But I think in theory this should work.

@ymjiang
Member

ymjiang commented Aug 4, 2020

And I set BYTEPS_RDMA_RX_DEPTH = 128; I still get the error
Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56

From your log, it seems that you were still using the old ps-lite. Can you update your ps-lite submodule to 7e4800feb, and then do the following to start over:

cd 3rdparty/ps-lite
make clean && make -j USE_RDMA=1
cd ../..
python3 setup.py install

This should make sure that you can update ps-lite correctly.
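
If the submodule is not already at that commit, a rough sequence to get it there first (the remote name may differ in your checkout):

cd 3rdparty/ps-lite
git fetch origin
git checkout 7e4800feb
cd ../..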

@DeruiLiu
Author

DeruiLiu commented Aug 5, 2020

And I set BYTEPS_RDMA_RX_DEPTH = 128; I still get the error
Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56

From your log, it seems that you were still using the old ps-lite. Can you update your ps-lite submodule to 7e4800feb, and then do the following to start over:

cd 3rdparty/ps-lite
make clean && make -j USE_RDMA=1
cd ../..
python3 setup.py install

This should make sure that you can update ps-lite correctly.

I followed it and updated my ps-lite, but I still get the error in the scheduler.
src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56
My startup commands are as below:
One machine:
1 scheduler:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9005

bpslaunch
1 worker:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=1
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9005

bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10

Another machine:
1 server:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9005

bpslaunch
1 worker:
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=1
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9005

bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10

@ymjiang
Member

ymjiang commented Aug 5, 2020

I followed it and updated my ps-lite, but I still get the error in the scheduler.
src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56
My startup commands are as below:

Can you double check? In 7e4800feb, line 130 is not the line associated with that check in your log...
https://github.com/bytedance/ps-lite/blob/7e4800febf9f3203c41f7451f191361270a11b7d/src/rdma_transport.h#L130

@DeruiLiu
Author

DeruiLiu commented Aug 5, 2020

I followed it and updated my ps-lite, but I still get the error in the scheduler.
src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=914, kMempoolChunkSize=56
My startup commands are as below:

Can you double check? In 7e4800feb, line 130 is not the line associated with that check in your log...
https://github.com/bytedance/ps-lite/blob/7e4800febf9f3203c41f7451f191361270a11b7d/src/rdma_transport.h#L130

Sorry to bother you again. I checked it, and I get a similar error. The error is as below:
[16:19:51] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

I set kRxDepth = 256 in rdma_transport.h.
I also tried setting kStartDepth = 64, kRxDepth = 128, and kReplyDepth = kRxDepth, and I get the same error as above.
But the machine's memory is big enough:

free -m
              total        used        free      shared  buff/cache   available
Mem:         192011        2879      185532         282        3599      186808
Swap:           976           0         976

@ymjiang
Member

ymjiang commented Aug 5, 2020

Can you show the output of ulimit -l on your machine?
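
For context, ibv_reg_mr pins physical memory, so a small locked-memory limit can also cause this "Cannot allocate memory" failure. A rough way to check it, and to raise it persistently (the exact file and values depend on your system):

ulimit -l                      # prints the current limit in KB, or "unlimited"
# to raise it, add lines like these to /etc/security/limits.conf and log in again:
# *    soft    memlock    unlimited
# *    hard    memlock    unlimited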

@DeruiLiu
Author

DeruiLiu commented Aug 5, 2020

Can you show the output of ulimit -l on your machine?

Yeah, thank you! I have solved it now; maybe I need the correct order of workers and servers when starting.
I started them in the following order and the problem was solved:
1 worker -> 1 server -> 1 worker -> 1 server -> 1 scheduler.
But the performance is not as good as expected. The distributed performance is far lower than single-machine training, and it is not stable. The output is as below:
First run:
Iter #0: 73.0 img/sec per GPU
Iter #1: 124.5 img/sec per GPU
Iter #2: 110.0 img/sec per GPU
Iter #3: 141.1 img/sec per GPU
Iter #4: 146.4 img/sec per GPU
Iter #5: 118.2 img/sec per GPU
Iter #6: 146.7 img/sec per GPU
Iter #7: 146.6 img/sec per GPU
Iter #8: 128.4 img/sec per GPU
Iter #9: 127.1 img/sec per GPU
Img/sec per GPU: 126.2 +-42.2
Total img/sec on 2 GPU(s): 252.4 +-84.4

Second run:
Running benchmark...
Iter #0: 33.1 img/sec per GPU
Iter #1: 36.5 img/sec per GPU
Iter #2: 139.7 img/sec per GPU
Iter #3: 146.3 img/sec per GPU
Iter #4: 17.7 img/sec per GPU
Iter #5: 19.2 img/sec per GPU
Iter #6: 37.4 img/sec per GPU
Iter #7: 55.1 img/sec per GPU
Iter #8: 42.7 img/sec per GPU
Iter #9: 25.8 img/sec per GPU
Img/sec per GPU: 55.4 +-88.4
Total img/sec on 2 GPU(s): 110.7 +-176.7

But a single machine gets about 149.
Does this value impact the performance?
My worker launch command is as below; synthetic_benchmark.py is from BytePS, and I also use IPC.
bpslaunch python2 example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10

@ymjiang
Member

ymjiang commented Aug 6, 2020

There might be some resource contention that causes the unstable performance. Is there any other process running on your machines? And can you check the CPU utilization?

@DeruiLiu
Author

DeruiLiu commented Aug 6, 2020

There might be some resource contention that causes the unstable performance. Is there any other process running on your machines? And can you check the CPU utilization?

There is no other process running on my machines, but the CPU utilization is also unstable. Have you ever had a problem like this?
I find that when the CPU utilization is high, the performance is better. When I start, the max CPU utilization is about 26% (us) on machine-1, and the CPU utilization of the other machine is not that high.
machine-1: 1 server, 1 worker
machine-2: 1 scheduler, 1 server, 1 worker
I also tried setting export MXNET_CPU_WORKER_NTHREADS=32 on the server; the output is similar.
Is there any parameter that will keep the CPU utilization high all the time?
I ran with num-iters=100:
bpslaunch python2 example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 100 --batch-size 64
Some of the output is as below; it is very unstable.
Running benchmark...
Iter #0: 85.4 img/sec per GPU
Iter #1: 100.1 img/sec per GPU
Iter #2: 87.8 img/sec per GPU
Iter #3: 57.5 img/sec per GPU
Iter #4: 44.9 img/sec per GPU
Iter #5: 57.8 img/sec per GPU
Iter #6: 152.3 img/sec per GPU
Iter #7: 67.1 img/sec per GPU
Iter #8: 151.7 img/sec per GPU
Iter #9: 77.5 img/sec per GPU
Iter #10: 152.0 img/sec per GPU
Iter #11: 152.0 img/sec per GPU
Iter #12: 152.0 img/sec per GPU
Iter #13: 84.3 img/sec per GPU
Iter #14: 82.1 img/sec per GPU
Iter #15: 53.4 img/sec per GPU
Iter #16: 78.6 img/sec per GPU
Iter #17: 93.6 img/sec per GPU
Iter #18: 152.1 img/sec per GPU
Iter #19: 152.2 img/sec per GPU
Iter #20: 103.0 img/sec per GPU
Iter #21: 92.2 img/sec per GPU

@ymjiang
Member

ymjiang commented Aug 6, 2020

MXNET_CPU_WORKER_NTHREADS no longer works since BytePS does not rely on MXNet now. You can try tuning BYTEPS_SERVER_ENGINE_THREAD.

We never saw this problem on our platform. Can you also try to bind the processes to specific cores using taskset? Make sure to bind the worker and server processes to different cores to avoid contention.
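
For example, something along these lines (the core ranges and the thread count are only illustrative; adjust them to your machine):

# in the server's shell on the colocated machine:
export BYTEPS_SERVER_ENGINE_THREAD=8   # illustrative value
taskset -c 0-7 bpslaunch

# in the worker's shell on the same machine, using different cores:
taskset -c 8-15 bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10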

@DeruiLiu
Author

DeruiLiu commented Aug 6, 2020

MXNET_CPU_WORKER_NTHREADS no longer works since BytePS does not rely on MXNet now. You can try tuning BYTEPS_SERVER_ENGINE_THREAD.

We never saw this problem on our platform. Can you also try to bind the processes to specific cores using taskset? Make sure to bind the worker and server processes to different cores to avoid contention.

I used taskset to bind the worker and server processes to different cores, but the output is still unstable. Have you ever tested the co-hosted case, i.e. putting a server and a worker on one machine?
Can you tell me which mechanism IPC uses? Message queues? Shared memory? Or some other way?

@ymjiang
Member

ymjiang commented Aug 6, 2020

The colocated mode uses shared memory. It works very well in our environment.

If you are sure this problem does not happen in single-machine training, would you test the ps-lite IPC benchmark and paste the performance log? That may tell us whether the problem is in the network.

@DeruiLiu
Author

DeruiLiu commented Aug 6, 2020

The colocated mode uses shared memory. It works very well in our environment.

If you are sure this problem does not happen in single-machine training, would you test the ps-lite IPC benchmark and paste the performance log? That may tell us whether the problem is in the network.

I tried to run /tests/test_ipc_benchmark; a representative part of the output is below, and in this case I don't use taskset.
Most of the time it's more than 90 Gbps, and a few times it is only around 40-50 Gbps.
When I use taskset, the output is similar.
My bandwidth is 100 Gbps, but it can maybe only reach about 64 Gbps because of PCIe. Does that affect the performance?

[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.3967 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4521 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.3792 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4312 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4287 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4219 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.386 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.424 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.3929 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4209 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.3378 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.4146 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 91.9032 Gbps
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 48.952 Gbps
[20:11:21] tests/test_ipc_benchmark.cc:136: Application goodput: 0.439915 Gbps
[20:11:21] tests/test_ipc_benchmark.cc:136: Application goodput: 49.7426 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 0.191788 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9264 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9167 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.8637 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9351 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.771 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.8977 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9385 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9555 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9064 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.842 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.926 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.8681 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9038 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9272 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9307 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9204 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.925 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9244 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.9277 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 50.0223 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.8759 Gbps
[20:11:30] tests/test_ipc_benchmark.cc:136: Application goodput: 49.8612 Gbps

@ymjiang
Member

ymjiang commented Aug 6, 2020

This is strange. The goodput should not drop that much. If your network is based on RoCE, is PFC (priority flow control) enabled? (For example, have you checked the RDMA bandwidth with ib_write_bw?)

BTW, the log period you show is quite short. You can reduce the log frequency by setting LOG_DURATION=100 or larger. Maybe we should look at a longer period.
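
For the ib_write_bw check, a typical two-machine run looks roughly like this (the device name and the target IP are placeholders for your setup):

# on machine A (acts as the server side of the test):
ib_write_bw -d mlx5_0
# on machine B, pointing at machine A's IP:
ib_write_bw -d mlx5_0 172.168.30.25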

@DeruiLiu
Author

DeruiLiu commented Aug 7, 2020

This is strange. The goodput should not drop that much. If your network is based on RoCE, is PFC (priority flow control) enabled? (For example, have you checked the RDMA bandwidth with ib_write_bw?)

BTW, the log period you show is quite short. You can reduce the log frequency by setting LOG_DURATION=100 or larger. Maybe we should look at a longer period.

Yeah, thank you. I have solved it now; the reason was PFC. I had not enabled PFC, which was a silly mistake on my part. The performance is great now! Thank you very much!!

@DeruiLiu DeruiLiu closed this as completed Aug 7, 2020
@ChenYuHo

Hello @ymjiang,
I'm using a Docker image built with BytePS at master branch 2141033f6c9a09ab0393bdde77f9fa86a5d057de and ps-lite at the byteps branch 51f7248dd4d02d6182a4f93e6109f69a3a04d49b, and I encountered the same issue when trying to use RDMA (with TCP it works). My settings are:
4 physical machines: 2 running workers, 1 running a server, 1 running a server and a scheduler.

1 scheduler:

docker run \
-e DMLC_ROLE=scheduler \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch

2 servers:

docker run \
-e DMLC_ROLE=server \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch

2 workers:

docker run \
-e DMLC_ROLE=worker \
-e DMLC_WORKER_ID=0 \  # and 1 for the other
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --runtime=nvidia --net=host byteps/pytorch:latest \
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters
20

The logs:

BytePS launching scheduler
[05:42:27] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:42:27] src/postoffice.cc:19: Creating Van: ibverbs
[05:42:28] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=IPC
[05:42:28] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:43:11] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:43:30] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:07] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[05:45:07] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]

server 0

BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

server 1

BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

worker 0

BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1

worker 1

BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1

I tried reducing BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH, but it didn't help.
ulimit -l is unlimited inside the container.

Any suggestions?
