I'm using a Docker image built with BytePS at master commit 2141033f6c9a09ab0393bdde77f9fa86a5d057de and ps-lite at byteps-branch commit 51f7248dd4d02d6182a4f93e6109f69a3a04d49b. I hit the same issue when trying to use RDMA (with TCP it works). My setup:
4 physical machines: 2 running workers, 1 running a server, and 1 running a server and the scheduler
BytePS launching scheduler
[05:42:27] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:42:27] src/postoffice.cc:19: Creating Van: ibverbs
[05:42:28] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=IPC
[05:42:28] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:43:11] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:43:30] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:07] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[05:45:07] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]
terminate called after throwing an instance of 'dmlc::Error'
  what(): [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]
server 0
BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1
server 1
BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1
worker 0
BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
worker 1
BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
I checked #282 and followed the instructions there.
I tried reducing BYTEPS_RDMA_RX_DEPTH and BYTEPS_RDMA_START_DEPTH, but it didn't help. ulimit -l reports unlimited inside the container.
Any suggestions?
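One thing that may be worth double-checking: `ulimit -l` in an interactive shell reports the limit of that shell, which is not necessarily the limit of the process that ends up calling ibv_reg_mr. The following is a minimal sketch, using only the Python standard library, that prints the locked-memory limit as seen by the Python process itself. The helper names `memlock_limit` and `describe` are made up for illustration and are not part of BytePS or ps-lite.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values for the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Render a limit value the way `ulimit -l` would, but in bytes."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limit()
    # If either value is not "unlimited", large RDMA memory registrations
    # may fail with ENOMEM ("Cannot allocate memory").
    print(f"RLIMIT_MEMLOCK soft={describe(soft)} hard={describe(hard)}")
```

Running this from the same entry point that launches the server or worker (rather than from a separate shell) should show whether the limit really is unlimited for that process.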
I have the same issue.
ulimit -l is unlimited inside the container
BYTEPS_RDMA_RX_DEPTH is set to 32
BYTEPS_RDMA_START_DEPTH is set to 16
This still doesn't solve the problem. @bobzhuyb
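To rule out the possibility that the reduced values never reach the launched process, the settings above can be applied explicitly to the child environment. This is only a sketch under assumptions: the two variable names come from the error message, while `rdma_env` and `launch_cmd` are hypothetical names used for illustration, and the actual launch command depends on your setup.

```python
import os


def rdma_env(rx_depth, start_depth, base=None):
    """Return a copy of the environment with the two RDMA depth variables set."""
    env = dict(os.environ if base is None else base)
    env["BYTEPS_RDMA_RX_DEPTH"] = str(rx_depth)
    env["BYTEPS_RDMA_START_DEPTH"] = str(start_depth)
    return env


if __name__ == "__main__":
    env = rdma_env(rx_depth=32, start_depth=16)
    for name in ("BYTEPS_RDMA_RX_DEPTH", "BYTEPS_RDMA_START_DEPTH"):
        print(f"{name}={env[name]}")
    # To launch with this environment (launch_cmd is a hypothetical placeholder):
    # import subprocess
    # subprocess.run(launch_cmd, env=env, check=True)
```

Printing the values right before launch makes it easy to confirm that every node (scheduler, servers, workers) was started with the same settings.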