Cannot run Distributed Training with RDMA #216
Seems like it hits the limit of memory registration. We will post a fix soon. Before that, could you please manually change the value of
It works! Thank you, @ymjiang.
@mengkai94 256 should be enough in most cases.
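Until the fix lands, lowering the constant means editing ps-lite and rebuilding. A rough sketch (the grep target and build flag are assumptions based on this thread, not verified against the source tree):

```shell
cd ps-lite                       # the byteps branch of bytedance/ps-lite
grep -rn "kReplyDepth" src/      # locate where the reply-queue depth is set
# edit the value down (e.g. to 256, per the comment above), then rebuild:
make clean
make -j USE_RDMA=1
```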
Thank you.
Using the same settings in k8s does not work. The scheduler, server, and workers are on the same physical machine.
DMLC_PS_ROOT_PORT does not need to be the same as the SSH port. Instead, it should be a different port, otherwise it will conflict with SSH. However, you must make sure DMLC_PS_ROOT_PORT is reachable from all containers. Are you using --net=host with k8s? i.e., do all containers share the host network namespace and can they see the RDMA NIC?
In all containers, can you set DMLC_NODE_HOST to be the IP of your RDMA NIC?
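For example (the interface name and address below are placeholders, not values from this issue):

```shell
# Find the IPv4 address bound to the RDMA NIC (replace eth1 with your device)
ip -4 addr show dev eth1 | awk '/inet /{print $2}' | cut -d/ -f1
# Then, in every container, before bpslaunch:
export DMLC_NODE_HOST=10.0.0.2   # example: the RDMA NIC's IP
```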
hostNetwork is enabled in k8s, which is equivalent to --net=host in Docker.
We can try a few things --
These will help us isolate potential problems.
I tried other models, and they work. Only ResNet50 with RDMA in TF fails; ResNet50 with TCP in TF is OK.
@ymjiang Can you check the TF+ResNet50+RDMA issue? It does not make sense to me. Maybe ResNet requires a deep queue in the RDMA transport? @mengkai94 Regarding the VGG-16 issue, is it possible that your RDMA network configuration is not correct? Can you use BytePS's ps-lite benchmark to make sure the network works as expected? Follow the commands in the README here: https://github.com/bytedance/ps-lite/tree/byteps. You just need to run
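A minimal run of that benchmark might look like the following (the build flag and binary name are assumptions from memory of the linked README; defer to the README itself if they differ):

```shell
git clone -b byteps https://github.com/bytedance/ps-lite
cd ps-lite
make -j USE_RDMA=1
# launch one scheduler, one server, and one worker (using the same DMLC_*
# environment variables as elsewhere in this issue), each running the
# bundled benchmark binary:
./tests/test_benchmark
```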
Looks like ResNet indeed needs large queue depth. @mengkai94 You can try one of the following to solve that:
I indeed use 2 servers for 2 workers, but I can only set kReplyDepth=128 at maximum. So BytePS cannot run ResNet50 on my machine?
@mengkai94 The kReplyDepth limitation is due to your machine being configured with a lower memory registration limit than usual. Would you check this article about how to increase the limit (if you have root access to configure it)? https://community.mellanox.com/s/article/howto-increase-memory-size-used-by-mellanox-adapters You can look at the example in that article and set your parameters accordingly.
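As a sanity check, the registrable memory implied by the mlx4 module parameters can be computed as 2^log_num_mtt × 2^log_mtts_per_seg × page_size. The parameter values below are example assumptions in the spirit of that article, not a recommendation:

```shell
# Registrable memory = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
# Example (assumed) values:
log_num_mtt=24
log_mtts_per_seg=3
page_size=4096
max_reg_mem=$(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
echo "$max_reg_mem"   # bytes; should comfortably exceed total RAM
```

With these values the result is 512 GiB of registrable memory; pick parameters so this exceeds your physical RAM.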
@mengkai94 Are you saying that the performance of PyTorch+VGG19+BytePS is stable? If so, the problem is unlikely inside ps-lite or anything below it.
@mengkai94 The pytorch result does not look stable, either. Can you check the CPU utilization of the two workers and two servers?
@mengkai94 Sorry, I misread. So the case is that VGG19 is stable while VGG16 is not. Can you also check the CPU utilization? BTW, what if you use just one GPU on each worker?
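One way to sample CPU utilization on each worker and server while training is running (mpstat needs the sysstat package; this is just one option):

```shell
# Sample per-CPU utilization once per second, five times
mpstat -P ALL 1 5
# Or, without sysstat, take a one-shot snapshot:
top -bn1 | head -n 15
```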
@zhouxhao Have you tried this? #216 (comment)
@zhouxhao Can you try to reduce
@ymjiang Maybe we should make it read from an environment variable.
Will allow an optional value in bytedance/ps-lite#30.
@ymjiang Thanks. It works now. But I found it goes much slower with two servers than with one server. The scheduler, servers, and workers are running on five different physical machines. How can I find the cause?
@zhouxhao Can you run the ps-lite benchmark (2 workers and 2 servers) to check if the problem is in networking? You can follow the tutorial here: https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark.
Closing due to inactivity.
Describe the bug
I used BytePS following the Distributed Training with RDMA section of byteps/docs/step-by-step-tutorial.
I used the latest image, bytepsimage/tensorflow, and got an error.
But using bytepsimage/tensorflow_rdma and bytepsimage/server_rdma succeeds.
Setup: one scheduler, one server, and two workers, each with one GPU, on one physical machine.
To Reproduce
Steps to reproduce the behavior:
For the scheduler:
docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
export DMLC_PS_ROOT_PORT=9008
bpslaunch
For the server:
docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
export DMLC_PS_ROOT_PORT=9008
bpslaunch
For worker-0:
nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=1
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
export DMLC_PS_ROOT_PORT=9008
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
For worker-1:
nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
export NVIDIA_VISIBLE_DEVICES=7
export DMLC_ENABLE_RDMA=1
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=eth0
export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
export DMLC_PS_ROOT_PORT=9008
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
The scheduler reports this error:
BytePS launching scheduler
[02:34:49] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[02:34:49] src/postoffice.cc:20: enable RDMA for networking
[02:34:49] src/./rdma_van.h:40: Shared memory IPC has been disabled
[02:34:49] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA
[02:34:49] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA
[02:34:58] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[02:35:23] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[02:35:38] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
[02:35:38] src/./rdma_van.h:207: Connect to Node 9 with Transport=RDMA
[02:35:38] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA
[02:35:38] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:35:38] src/./rdma_transport.h:130: Check failed: mr
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f788672398c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f7886723dad]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f7886748fb8]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f788675fdbb]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f7885e0866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f78891356db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f788946e88f]
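"Check failed: mr" means the RDMA memory registration call (ibv_reg_mr) returned NULL. Besides the adapter's MTT limit mentioned elsewhere in this thread, a common cause inside containers is a low locked-memory ulimit. A quick check (the docker flag below is a general suggestion, not something prescribed in this thread):

```shell
# The locked-memory limit inside the container; "unlimited" is ideal
ulimit -l
# If it is small, relaunch the container with an explicit memlock limit,
# e.g.: docker run --ulimit memlock=-1 ... (alongside --cap-add IPC_LOCK)
```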
Expected behavior
Runs successfully.
Screenshots
(scheduler, server, worker0, and worker1 screenshots attached)
Environment: physical machine