Cannot run Distributed Training with RDMA #216

Closed
mengkai94 opened this issue Mar 9, 2020 · 27 comments

mengkai94 commented Mar 9, 2020

Describe the bug
I run BytePS following the "Distributed Training with RDMA" section of byteps/docs/step-by-step-tutorial.
With the latest image, bytepsimage/tensorflow, I get an error, but the older bytepsimage/tensorflow_rdma and bytepsimage/server_rdma images work.
Setup: one scheduler, one server, and two workers, each worker using one GPU on a physical machine.

To Reproduce
Steps to reproduce the behavior:

  1. For the scheduler:
    docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export DMLC_ENABLE_RDMA=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=scheduler
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch

  2. For the server:
    docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export DMLC_ENABLE_RDMA=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch

  3. For worker-0:
    nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export NVIDIA_VISIBLE_DEVICES=0
    export DMLC_ENABLE_RDMA=1
    export DMLC_WORKER_ID=0
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

  4. For worker-1:
    nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export NVIDIA_VISIBLE_DEVICES=7
    export DMLC_ENABLE_RDMA=1
    export DMLC_WORKER_ID=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

  5. The scheduler reports the following error:
    BytePS launching scheduler
    [02:34:49] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [02:34:49] src/postoffice.cc:20: enable RDMA for networking
    [02:34:49] src/./rdma_van.h:40: Shared memory IPC has been disabled
    [02:34:49] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA
    [02:34:49] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA
    [02:34:58] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:23] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:207: Connect to Node 9 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA
    [02:35:38] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:35:38] src/./rdma_transport.h:130: Check failed: mr
    Stack trace returned 7 entries:
    [bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f788672398c]
    [bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f7886723dad]
    [bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f7886748fb8]
    [bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f788675fdbb]
    [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f7885e0866f]
    [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f78891356db]
    [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f788946e88f]

Expected behavior
The training runs successfully.

Screenshots
(scheduler, server, worker-0, and worker-1 screenshots omitted)

Environment (physical machine):

  • OS: Ubuntu 16.04.2
  • GCC version: 5.4.0
  • CUDA and NCCL version: 10.1
  • Framework (TF, PyTorch, MXNet): TF
ymjiang (Member) commented Mar 9, 2020

Seems like it hits the limit of memory registration. We will post a fix soon. Before that, could you please manually change the value of kReplyDepth (in byteps/3rdparty/ps-lite/src/rdma_utils.h) to 256 (or even smaller), and then recompile with the following commands?

cd byteps/3rdparty/ps-lite && make clean  && cd -
pip3 uninstall -y byteps
python3 setup.py clean --all
python3 setup.py install
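
In case it helps, here is a minimal sketch of the constant change itself. The exact declaration in rdma_utils.h may differ, so treat the sed pattern as an assumption and double-check the file after editing:

# assumes the constant is declared like "static const int kReplyDepth = <N>;"
sed -i 's/\(kReplyDepth *= *\)[0-9]\+/\1256/' byteps/3rdparty/ps-lite/src/rdma_utils.h
grep kReplyDepth byteps/3rdparty/ps-lite/src/rdma_utils.h   # confirm the new value before recompiling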

mengkai94 (Author):

It works! Thank you, @ymjiang.
Does the value of kReplyDepth affect BytePS's performance?

bobzhuyb (Member) commented Mar 9, 2020

@mengkai94 256 should be enough in most cases.

mengkai94 (Author):

Thank you.

mengkai94 reopened this Mar 9, 2020
mengkai94 (Author):

The same settings do not work in Kubernetes (scheduler, server, and workers are on the same physical machine).
The workers never start training, but the bytepsimage/tensorflow_rdma and bytepsimage/server_rdma images do work in Kubernetes.
BTW, does BytePS require the SSH port to be the same as DMLC_PS_ROOT_PORT?

Screenshots
(scheduler, server, worker-0, and worker-1 screenshots omitted)

bobzhuyb (Member) commented Mar 9, 2020

DMLC_PS_ROOT_PORT does not need to be the same as the SSH port. In fact, it should be a different port, otherwise it will conflict with SSH. However, you must make sure DMLC_PS_ROOT_PORT is reachable from all containers.

Are you using --net=host with k8s? I.e., do all containers share the host network namespace and can they see the RDMA NIC?
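
A quick sanity check from inside each container (a sketch, not from the tutorial; xxx.xx.xx.xx stands for your DMLC_PS_ROOT_URI):

ibv_devices              # the RDMA NIC should be listed inside the container
ping -c 3 xxx.xx.xx.xx   # the scheduler address should be reachable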

bobzhuyb (Member) commented Mar 9, 2020

In all containers, can you set DMLC_NODE_HOST to be the IP of your RDMA NIC?
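
For example (a sketch; ib0 and the address below are assumptions, substitute your actual RDMA interface and IP):

ip addr show ib0                  # find the IP assigned to the RDMA NIC
export DMLC_NODE_HOST=10.0.0.11   # hypothetical RDMA NIC IP; set this in every container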

mengkai94 (Author):

hostNetwork is used in Kubernetes, which is equivalent to --net=host in Docker.
I added DMLC_NODE_HOST, but it still does not work.
Kubernetes on 4 physical machines has the same problem (two workers on two GPU machines, one scheduler and one server on two CPU-only machines).

bobzhuyb (Member) commented Mar 9, 2020

We can try a few things:

  1. Try starting only one worker (i.e., non-distributed mode). Does the training start normally?
  2. Try setting DMLC_ENABLE_RDMA=0. Does the training run normally with TCP?

These will help us isolate potential problems (see the sketch below for the corresponding settings).
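
A minimal sketch of the two checks, reusing the launch steps from the issue description (same placeholder paths and addresses as above):

# (1) non-distributed check: a single worker on one machine
#     (in BytePS single-machine mode no scheduler/server or DMLC_* variables should be needed)
export NVIDIA_VISIBLE_DEVICES=0
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

# (2) distributed check over TCP: repeat the original launch steps on every node, but with
export DMLC_ENABLE_RDMA=0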

mengkai94 (Author):

I tried other models and they work. Only ResNet50 with RDMA in TF fails to run; ResNet50 with TCP in TF is OK.
I also find the performance of VGG16 is unstable, with high variance (two workers on two physical machines, one scheduler and one server on two CPU-only machines).

bobzhuyb (Member) commented Mar 10, 2020

@ymjiang Can you check the TF+ResNet50+RDMA issue? It does not make sense to me. Maybe ResNet requires a deep queue in the RDMA transport?

@mengkai94 Regarding the VGG-16 issue, is it possible that your RDMA network configuration is not correct? Can you use BytePS's ps-lite benchmark to make sure the network works as expected? Follow the commands in the README here: https://github.com/bytedance/ps-lite/tree/byteps. You just need to run ./tests/test_benchmark in the ps-lite folder instead of your TF script.
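
A rough sketch of that check with the same placeholders as above (see the linked README for the authoritative commands and any extra arguments):

# on the scheduler, each server, and each worker, export the same DMLC_* variables as in the
# launch steps above (DMLC_ROLE, DMLC_NUM_WORKER, DMLC_NUM_SERVER, DMLC_PS_ROOT_URI,
# DMLC_PS_ROOT_PORT, DMLC_ENABLE_RDMA=1), then run:
cd <path-to>/ps-lite && ./tests/test_benchmark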

ymjiang (Member) commented Mar 10, 2020

Looks like ResNet indeed needs a large queue depth. kReplyDepth=256 causes a hanging issue when using 1 server + 2 workers.

@mengkai94 You can try one of the following to solve that:

  • Use kReplyDepth=512 if it is still a valid number on your machines;
  • Use 2 servers for 2 workers so that the RDMA load will be divided and balanced (see the sketch below).
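
A sketch of the second option, reusing step 2 ("For the server") of the reproduction; the second server runs in another container, ideally on another machine:

# on every node (scheduler, both servers, both workers)
export DMLC_NUM_SERVER=2
# then launch one more container exactly as in step 2, with DMLC_ROLE=server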

mengkai94 (Author):

I am indeed using 2 servers for 2 workers, but I can only set kReplyDepth=128 at most. So BytePS cannot run ResNet50 on my machines?
The results of test_benchmark are 89.6 Gbps and 81 Gbps with two workers and two servers.
In PyTorch, VGG19 is stable at 164 imgs/sec/GPU; VGG16 reaches 178 imgs/sec/GPU at maximum, but averages 145 imgs/sec/GPU over 10 runs.

bobzhuyb (Member) commented Mar 10, 2020

@mengkai94 The kReplyDepth limitation is because your machine seems to be configured with a lower memory registration limit than usual. Would you check this article on how to increase the limit (if you have root access to configure it)? https://community.mellanox.com/s/article/howto-increase-memory-size-used-by-mellanox-adapters

You can look at the example in that article and set your parameters accordingly:

For example, if the physical memory on the server is 64 GB, it is recommended to have twice this size (2 x 64 GB = 128 GB) for max_reg_mem:

max_reg_mem = (2^log_num_mtt) * (2^1) * (4 KB)
     128 GB = (2^log_num_mtt) * (2^1) * (4 KB)
       2^37 = (2^log_num_mtt) * (2^1) * (2^12)
       2^24 = (2^log_num_mtt)
log_num_mtt = 24
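
For reference, a minimal sketch of how that limit is usually raised (assuming an mlx4-based Mellanox NIC; the config file name and the openibd restart are assumptions, follow the article for your exact driver):

# persist a larger MTT table for mlx4_core (2^24 MTT entries, 2 entries per segment)
echo "options mlx4_core log_num_mtt=24 log_mtts_per_seg=1" | sudo tee /etc/modprobe.d/mlx4_core.conf
# reload the driver (or reboot) so the new limit takes effect
sudo /etc/init.d/openibd restart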

bobzhuyb (Member):

@mengkai94 Are you saying that the performance of PyTorch+VGG19+BytePS is stable? If so, the problem is unlikely to be inside ps-lite or anything below it.

mengkai94 (Author):

Yes, the performance of PyTorch+VGG19+BytePS is stable.
The performance of PyTorch+VGG16+BytePS:
(throughput screenshot omitted)

ymjiang (Member) commented Mar 10, 2020

@mengkai94 The PyTorch result does not look stable either. Can you check the CPU utilization of the two workers and two servers?

ymjiang (Member) commented Mar 11, 2020

@mengkai94 Sorry, I misread. So the case is that VGG19 is stable while VGG16 is not. Can you also check the CPU utilization?

BTW, what if you use just one GPU for each worker?

zhouxhao commented Mar 27, 2020

Seems like it hits the limit of memory registration. We will post a fix soon. Before that, could you please manually change the value of kReplyDepth (in byteps/3rdparty/ps-lite/src/rdma_utils.h) to 256 (or even smaller), and then recompile with the following commands?

cd byteps/3rdparty/ps-lite && make clean  && cd -
pip3 uninstall -y byteps
python3 setup.py clean --all
python3 setup.py install

When I run the TensorFlow example synthetic_benchmark.py with ResNet50 + RDMA + two servers + two workers, it does not work. But it does work with one server and two workers.
(scheduler screenshot omitted)

ymjiang (Member) commented Mar 27, 2020

@zhouxhao Have you tried this? #216 (comment)

zhouxhao commented Mar 27, 2020

@ymjiang
(screenshots of the code change, the working case, and the failing case omitted)

ymjiang (Member) commented Mar 27, 2020

@zhouxhao Can you try to reduce kRxDepth only for the scheduler? Do not change it for the workers/servers if they don't run into the "ibv_reg_mr failed" error.
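
In case it helps, a sketch of that scheduler-only change (same caveat as before: the declaration form in rdma_utils.h is an assumption, and 256 is only an example value):

# run this only where the scheduler is built, then recompile as in the earlier comment
sed -i 's/\(kRxDepth *= *\)[0-9]\+/\1256/' byteps/3rdparty/ps-lite/src/rdma_utils.h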

bobzhuyb (Member):

@ymjiang Maybe we should make it read from an environment variable.

ymjiang (Member) commented Mar 28, 2020

We will allow an optional value in bytedance/ps-lite#30.

zhouxhao commented Mar 28, 2020

@ymjiang Thanks. It works now. But I found it runs much slower with two servers than with one server. The scheduler, servers, and workers run on five different physical machines. How can I find the reason?
(screenshots of the one-server and two-server runs omitted)

ymjiang (Member) commented Mar 28, 2020

@zhouxhao Can you run the ps-lite benchmark (2 workers and 2 servers) to check if the problem is in networking? You can follow the tutorial here: https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark.

bobzhuyb (Member):

Closing due to inactivity.
