Cannot run Distributed Training with RDMA #216

Closed
mengkai94 opened this issue Mar 9, 2020 · 27 comments

mengkai94 commented Mar 9, 2020

Describe the bug
I run BytePS following the "Distributed Training with RDMA" section of byteps/docs/step-by-step-tutorial.
With the latest image, bytepsimage/tensorflow, I get an error, but the older bytepsimage/tensorflow_rdma and bytepsimage/server_rdma images work.
Setup: one scheduler, one server, and two workers, each worker using one GPU on a physical machine.

To Reproduce
Steps to reproduce the behavior:

  1. For the scheduler:
    docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export DMLC_ENABLE_RDMA=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=scheduler
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch

  2. For the server:
    docker run -it --net=host --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export DMLC_ENABLE_RDMA=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch

  3. For worker-0:
    nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export NVIDIA_VISIBLE_DEVICES=0
    export DMLC_ENABLE_RDMA=1
    export DMLC_WORKER_ID=0
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

  4. For worker-1:
    nvidia-docker run -it --net=host --shm-size=32768m --device /dev/infiniband/rdma_cm --device /dev/infiniband/issm0 --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 --cap-add IPC_LOCK byteps:tensorflow-0.2 bash
    export NVIDIA_VISIBLE_DEVICES=7
    export DMLC_ENABLE_RDMA=1
    export DMLC_WORKER_ID=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=1
    export DMLC_INTERFACE=eth0
    export DMLC_PS_ROOT_URI=xxx.xx.xx.xx
    export DMLC_PS_ROOT_PORT=9008
    bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

  5. The scheduler reports the following error:
    BytePS launching scheduler
    [02:34:49] byteps/server/server.cc:339: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
    [02:34:49] src/postoffice.cc:20: enable RDMA for networking
    [02:34:49] src/./rdma_van.h:40: Shared memory IPC has been disabled
    [02:34:49] src/./rdma_van.h:801: OnConnect to Node 1 with Transport=RDMA
    [02:34:49] src/./rdma_van.h:207: Connect to Node 1 with Transport=RDMA
    [02:34:58] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:23] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:801: OnConnect to Node 2147483647 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:207: Connect to Node 9 with Transport=RDMA
    [02:35:38] src/./rdma_van.h:207: Connect to Node 8 with Transport=RDMA
    [02:35:38] 3rdparty/ps-lite/include/dmlc/logging.h:276: [02:35:38] src/./rdma_transport.h:130: Check failed: mr
    Stack trace returned 7 entries:
    [bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1b98c) [0x7f788672398c]
    [bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x1bdad) [0x7f7886723dad]
    [bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x40fb8) [0x7f7886748fb8]
    [bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.0-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x57dbb) [0x7f788675fdbb]
    [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f7885e0866f]
    [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f78891356db]
    [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f788946e88f]

Expected behavior
The training runs successfully.

Screenshots
(scheduler, server, worker-0, and worker-1 screenshots omitted)

Environment (physical machine):

  • OS: Ubuntu 16.04.2
  • GCC version: 5.4.0
  • CUDA and NCCL version: 10.1
  • Framework (TF, PyTorch, MXNet): TF
ymjiang (Member) commented Mar 9, 2020

Seems like it hits the limit of memory registration. We will post a fix soon. Before that, could you please manually change the value of kReplyDepth (in byteps/3rdparty/ps-lite/src/rdma_utils.h) to 256 (or even smaller), and then recompile with the following commands?

cd byteps/3rdparty/ps-lite && make clean  && cd -
pip3 uninstall -y byteps
python3 setup.py clean --all
python3 setup.py install
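
In case it helps, here is a minimal sketch of the constant change itself. The exact declaration in rdma_utils.h may differ, so treat the sed pattern as an assumption and double-check the file after editing:

# assumes the constant is declared like "static const int kReplyDepth = <N>;"
sed -i 's/\(kReplyDepth *= *\)[0-9]\+/\1256/' byteps/3rdparty/ps-lite/src/rdma_utils.h
grep kReplyDepth byteps/3rdparty/ps-lite/src/rdma_utils.h   # confirm the new value before recompiling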

mengkai94 (Author):

It works! Thank you, @ymjiang.
Does the value of kReplyDepth affect BytePS's performance?

bobzhuyb (Member) commented Mar 9, 2020

@mengkai94 256 should be enough in most cases.

mengkai94 (Author):

Thank you.

mengkai94 reopened this Mar 9, 2020
mengkai94 (Author):

The same settings do not work in Kubernetes (scheduler, server, and workers are on the same physical machine).
The workers never start training, but the bytepsimage/tensorflow_rdma and bytepsimage/server_rdma images do work in Kubernetes.
BTW, does BytePS require the SSH port to be the same as DMLC_PS_ROOT_PORT?

Screenshots
(scheduler, server, worker-0, and worker-1 screenshots omitted)

bobzhuyb (Member) commented Mar 9, 2020

DMLC_PS_ROOT_PORT does not need to be the same as the SSH port. In fact, it should be a different port, otherwise it will conflict with SSH. However, you must make sure DMLC_PS_ROOT_PORT is reachable from all containers.

Are you using --net=host with k8s? I.e., do all containers share the host network namespace and can they see the RDMA NIC?
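
A quick sanity check from inside each container (a sketch, not from the tutorial; xxx.xx.xx.xx stands for your DMLC_PS_ROOT_URI):

ibv_devices              # the RDMA NIC should be listed inside the container
ping -c 3 xxx.xx.xx.xx   # the scheduler address should be reachable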

bobzhuyb (Member) commented Mar 9, 2020

In all containers, can you set DMLC_NODE_HOST to be the IP of your RDMA NIC?
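
For example (a sketch; ib0 and the address below are assumptions, substitute your actual RDMA interface and IP):

ip addr show ib0                  # find the IP assigned to the RDMA NIC
export DMLC_NODE_HOST=10.0.0.11   # hypothetical RDMA NIC IP; set this in every container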

mengkai94 (Author):

hostNetwork is used in Kubernetes, which is equivalent to --net=host in Docker.
I added DMLC_NODE_HOST, but it still does not work.
Kubernetes on 4 physical machines has the same problem (two workers on two GPU machines, one scheduler and one server on two CPU-only machines).

bobzhuyb (Member) commented Mar 9, 2020

We can try a few things:

  1. Try starting only one worker (i.e., non-distributed mode). Does the training start normally?
  2. Try setting DMLC_ENABLE_RDMA=0. Does the training run normally with TCP?

These will help us isolate potential problems (see the sketch below for the corresponding settings).
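
A minimal sketch of the two checks, reusing the launch steps from the issue description (same placeholder paths and addresses as above):

# (1) non-distributed check: a single worker on one machine
#     (in BytePS single-machine mode no scheduler/server or DMLC_* variables should be needed)
export NVIDIA_VISIBLE_DEVICES=0
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000

# (2) distributed check over TCP: repeat the original launch steps on every node, but with
export DMLC_ENABLE_RDMA=0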

mengkai94 (Author):

I tried other models and they work. Only ResNet50 with RDMA in TF fails to run; ResNet50 with TCP in TF is OK.
I also find the performance of VGG16 is unstable, with high variance (two workers on two physical machines, one scheduler and one server on two CPU-only machines).

bobzhuyb (Member) commented Mar 10, 2020

@ymjiang Can you check the TF+ResNet50+RDMA issue? It does not make sense to me. Maybe ResNet requires a deep queue in the RDMA transport?

@mengkai94 Regarding the VGG-16 issue, is it possible that your RDMA network configuration is not correct? Can you use BytePS's ps-lite benchmark to make sure the network works as expected? Follow the commands in the README here: https://github.com/bytedance/ps-lite/tree/byteps. You just need to run ./tests/test_benchmark in the ps-lite folder instead of your TF script.
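
A rough sketch of that check with the same placeholders as above (see the linked README for the authoritative commands and any extra arguments):

# on the scheduler, each server, and each worker, export the same DMLC_* variables as in the
# launch steps above (DMLC_ROLE, DMLC_NUM_WORKER, DMLC_NUM_SERVER, DMLC_PS_ROOT_URI,
# DMLC_PS_ROOT_PORT, DMLC_ENABLE_RDMA=1), then run:
cd <path-to>/ps-lite && ./tests/test_benchmark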

ymjiang (Member) commented Mar 10, 2020

Looks like ResNet indeed needs a large queue depth. kReplyDepth=256 causes a hanging issue when using 1 server + 2 workers.

@mengkai94 You can try one of the following to solve that:

  • Use kReplyDepth=512 if it is still a valid number on your machines;
  • Use 2 servers for 2 workers so that the RDMA load will be divided and balanced (see the sketch below).
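
A sketch of the second option, reusing step 2 ("For the server") of the reproduction; the second server runs in another container, ideally on another machine:

# on every node (scheduler, both servers, both workers)
export DMLC_NUM_SERVER=2
# then launch one more container exactly as in step 2, with DMLC_ROLE=server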

mengkai94 (Author):

I am indeed using 2 servers for 2 workers, but I can only set kReplyDepth=128 at most. So BytePS cannot run ResNet50 on my machines?
The results of test_benchmark are 89.6 Gbps and 81 Gbps with two workers and two servers.
In PyTorch, VGG19 is stable at 164 imgs/sec/GPU; VGG16 reaches 178 imgs/sec/GPU at maximum, but averages 145 imgs/sec/GPU over 10 runs.

bobzhuyb (Member) commented Mar 10, 2020

@mengkai94 The kReplyDepth limitation is because your machine seems to be configured with a lower memory registration limit than usual. Would you check this article on how to increase the limit (if you have root access to configure it)? https://community.mellanox.com/s/article/howto-increase-memory-size-used-by-mellanox-adapters

You can look at the example in that article and set your parameters accordingly:

For example, if the physical memory on the server is 64 GB, it is recommended to have twice this size (2 x 64 GB = 128 GB) for max_reg_mem:

max_reg_mem = (2^log_num_mtt) * (2^1) * (4 KB)
     128 GB = (2^log_num_mtt) * (2^1) * (4 KB)
       2^37 = (2^log_num_mtt) * (2^1) * (2^12)
       2^24 = (2^log_num_mtt)
log_num_mtt = 24
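
For reference, a minimal sketch of how that limit is usually raised (assuming an mlx4-based Mellanox NIC; the config file name and the openibd restart are assumptions, follow the article for your exact driver):

# persist a larger MTT table for mlx4_core (2^24 MTT entries, 2 entries per segment)
echo "options mlx4_core log_num_mtt=24 log_mtts_per_seg=1" | sudo tee /etc/modprobe.d/mlx4_core.conf
# reload the driver (or reboot) so the new limit takes effect
sudo /etc/init.d/openibd restart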

bobzhuyb (Member):

@mengkai94 Are you saying that the performance of PyTorch+VGG19+BytePS is stable? If so, the problem is unlikely to be inside ps-lite or anything below it.

mengkai94 (Author):

Yes, the performance of PyTorch+VGG19+BytePS is stable.
The performance of PyTorch+VGG16+BytePS:
(throughput screenshot omitted)

ymjiang (Member) commented Mar 10, 2020

@mengkai94 The PyTorch result does not look stable either. Can you check the CPU utilization of the two workers and two servers?

ymjiang (Member) commented Mar 11, 2020

@mengkai94 Sorry, I misread. So the case is that VGG19 is stable while VGG16 is not. Can you also check the CPU utilization?

BTW, what if you use just one GPU for each worker?

zhouxhao commented Mar 27, 2020

Seems like it hits the limit of memory registration. We will post a fix soon. Before that, could you please manually change the value of kReplyDepth (in byteps/3rdparty/ps-lite/src/rdma_utils.h) to 256 (or even smaller), and then recompile with the following commands?

cd byteps/3rdparty/ps-lite && make clean  && cd -
pip3 uninstall -y byteps
python3 setup.py clean --all
python3 setup.py install

When I run the TensorFlow example synthetic_benchmark.py with ResNet50 + RDMA + two servers + two workers, it does not work. But it does work with one server and two workers.
(scheduler screenshot omitted)

ymjiang (Member) commented Mar 27, 2020

@zhouxhao Have you tried this? #216 (comment)

zhouxhao commented Mar 27, 2020

@ymjiang
(screenshots of the code change, the working case, and the failing case omitted)

ymjiang (Member) commented Mar 27, 2020

@zhouxhao Can you try to reduce kRxDepth only for the scheduler? Do not change it for the workers/servers if they don't run into the "ibv_reg_mr failed" error.
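
In case it helps, a sketch of that scheduler-only change (same caveat as before: the declaration form in rdma_utils.h is an assumption, and 256 is only an example value):

# run this only where the scheduler is built, then recompile as in the earlier comment
sed -i 's/\(kRxDepth *= *\)[0-9]\+/\1256/' byteps/3rdparty/ps-lite/src/rdma_utils.h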

bobzhuyb (Member):

@ymjiang Maybe we should make it read from an environment variable.

ymjiang (Member) commented Mar 28, 2020

We will allow an optional value in bytedance/ps-lite#30.

zhouxhao commented Mar 28, 2020

@ymjiang Thanks. It works now. But I found it runs much slower with two servers than with one server. The scheduler, servers, and workers run on five different physical machines. How can I find the reason?
(screenshots of the one-server and two-server runs omitted)

ymjiang (Member) commented Mar 28, 2020

@zhouxhao Can you run the ps-lite benchmark (2 workers and 2 servers) to check if the problem is in networking? You can follow the tutorial here: https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark.

bobzhuyb (Member):

Closing due to inactivity.
