I am trying to run 2 workers, 2 servers, and 1 scheduler in this scenario:
Node1:
1. start scheduler
2. start server
3. start worker
Node2:
1. start server
2. start worker
I start the processes in this sequence:
Node1 scheduler -> server -> worker -> Node2 server -> worker
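For reference, the topology above can be written down as a small Python sketch that prints the environment each process would be launched with. It uses the DMLC_* variables from the BytePS tutorial; the scheduler address and port are placeholders (assumptions, not values from my setup), and DMLC_ENABLE_RDMA=ibverbs follows the deprecation hint in the log. Every process should see the same DMLC_NUM_WORKER and DMLC_NUM_SERVER values.

```python
# Sketch only: SCHEDULER_IP and SCHEDULER_PORT are hypothetical placeholders.
SCHEDULER_IP = "10.0.0.1"
SCHEDULER_PORT = "1234"

# Launch order described above: (node, role, worker id or None).
LAUNCH_ORDER = [
    ("Node1", "scheduler", None),
    ("Node1", "server", None),
    ("Node1", "worker", 0),
    ("Node2", "server", None),
    ("Node2", "worker", 1),
]


def role_env(role, worker_id=None):
    """Return the environment variables for one BytePS process."""
    env = {
        "DMLC_ROLE": role,
        "DMLC_NUM_WORKER": "2",
        "DMLC_NUM_SERVER": "2",
        "DMLC_PS_ROOT_URI": SCHEDULER_IP,
        "DMLC_PS_ROOT_PORT": SCHEDULER_PORT,
        "DMLC_ENABLE_RDMA": "ibverbs",
    }
    if role == "worker":
        # Only workers carry an ID; it must be unique per worker.
        env["DMLC_WORKER_ID"] = str(worker_id)
    return env


def describe_launch_plan():
    """Return one line per process, in launch order."""
    lines = []
    for node, role, worker_id in LAUNCH_ORDER:
        env = role_env(role, worker_id)
        exports = " ".join(f"{key}={value}" for key, value in sorted(env.items()))
        lines.append(f"[{node}] {exports} bpslaunch")
    return lines


if __name__ == "__main__":
    for line in describe_launch_plan():
        print(line)
```

For the workers, bpslaunch is followed by the actual training command, which is omitted here.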
After I start the worker on Node2, an error occurs. The error log shown on the scheduler is:
BytePS launching scheduler
Command: python3 -c 'import byteps.server'
[02:19:48] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[02:19:48] src/postoffice.cc:25: Creating Van: 1
[02:19:48] src/van.cc:84: DMLC_ENABLE_RDMA=1 will be deprecated. Please use DMLC_ENABLE_RDMA=ibverbs instead.
[02:19:48] src/./rdma_van.h:44: Shared memory IPC has been disabled
[02:19:48] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[02:19:48] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[02:21:08] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:23:25] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:25:36] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:34] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[02:27:35] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[02:27:35] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [02:27:35] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22a8c) [0x7f4970e27a8c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ecd) [0x7f4970e27ecd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x787f0) [0x7f4970e7d7f0]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7991b) [0x7f4970e7e91b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f49705056df]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f49721376db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f497247071f]
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
launch_bps()
File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 216, in launch_bps
stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.
Environment (please complete the following information):
I built the container from the PyTorch Dockerfile (byteps/docker/Dockerfile).
Additional context
Do you have any suggestions?
The result of ulimit -l is unlimited.
I set BYTEPS_RDMA_START_DEPTH=16 and BYTEPS_RDMA_RX_DEPTH=32, but the same error still appears.
My BytePS code is the latest version.
Is the start sequence the problem?
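Since ibv_reg_mr pins memory, the locked-memory limit that matters is the one seen by the process that actually runs the server, which may differ from what ulimit -l reports in an interactive shell (for example, inside a container). A minimal sketch for checking it from the same Python environment is below. It assumes Linux and uses only the standard resource module; it is a sanity check, not a confirmed diagnosis of this error.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK of this process as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY is what "ulimit -l" prints as "unlimited".
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

Running this with the same python3 inside the container on both nodes shows whether the limit is really unlimited where byteps.server is imported.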
Describe the bug
I tried the following scenario with bpslaunch, and an error occurs:
Run resnet50 on 2 nodes, each node with 8 GPUs, NO additional CPU servers
https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA
To Reproduce
Steps to reproduce the behavior:
The steps are exactly the same as the instruction manual (https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#:~:text=Distributed%20Training,with%20RDMA)