
Performance regression with multi-node running #365

Open

MichaelHsu170 opened this issue Feb 9, 2021 · 14 comments

@MichaelHsu170
MichaelHsu170 commented Feb 9, 2021

Describe the bug
I've tried the following 2 scenarios and compared their performance:

  1. Run VGG16 on a single node with 8 GPUs using bpslaunch.
  2. Run VGG16 on 2 nodes with 8 GPUs each using bpslaunch.

Performance regressed badly in scenario 2 (roughly 1/100 of scenario 1):
    scenario 1: 300 img/sec per GPU
    scenario 2: 3.4 img/sec per GPU

To Reproduce
Steps to reproduce the behavior:

  1. git clone https://github.com/bytedance/byteps.git
  2. python3 setup.py install
  3. dpkg -i nccl-local-repo-ubuntu1804-2.8.4-cuda11.0_1.0-1_amd64.deb
  4. apt install libnccl2 libnccl-dev
  5. Prepare the running script for scenario 1:
    run_worker.sh
    #!/bin/bash
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_ROLE=worker
    export DMLC_NUM_WORKER=1
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
    export DMLC_PS_ROOT_PORT=yyyy
    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
  6. Run ./run_worker.sh on 1 node
  7. Prepare the running scripts for scenario 2:
    run_scheduler.sh
    #!/bin/bash
    export DMLC_ROLE=scheduler
    export DMLC_NUM_WORKER=2
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
    export DMLC_PS_ROOT_PORT=yyyy
    python3 ./bin/bpslaunch

run_server.sh
#!/bin/bash
export DMLC_ROLE=server
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch

run_worker.sh
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
  8. Run ./run_scheduler.sh, ./run_server.sh and ./run_worker.sh on one node, then run ./run_server.sh and ./run_worker.sh on the other node.
  9. Performance:
scenario 1:
    Model: vgg16
    Batch size: 32
    Number of GPUs: 8
    Running warmup...
    Running benchmark...
    300 img/sec per GPU
scenario 2:
    Model: vgg16
    Batch size: 32
    Number of GPUs: 16
    Running warmup...
    Running benchmark...
    3.4 img/sec per GPU

Expected behavior
The two-node run should achieve per-GPU throughput comparable to the single-node run; there should not be a roughly 100x gap.


Environment:

  • OS: Ubuntu 18.04
  • GCC version: 7.5
  • CUDA and NCCL version: 11.0, 2.8.4
  • Framework (TF, PyTorch, MXNet): PyTorch 1.7.1


@ymjiang
Member

ymjiang commented Feb 9, 2021

What is the bandwidth between these two nodes?

@MichaelHsu170
Author

MichaelHsu170 commented Feb 10, 2021

They are 200Gb NICs on both nodes. With iftop (sudo iftop -n) I measured an average speed of around 450Mb/s.
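For a quick point-to-point sanity check outside of BytePS, something like iperf3 can also be used (a minimal sketch, assuming iperf3 is installed on both nodes; the address is a placeholder):

    # on node A (server side)
    iperf3 -s
    # on node B (client side): 8 parallel TCP streams for 30 seconds
    iperf3 -c <node A IP> -P 8 -t 30

Over IPoIB this only measures TCP throughput rather than RDMA, but it gives a rough upper bound to compare the 450Mb/s reading against.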

@MichaelHsu170
Author

MichaelHsu170 commented Feb 10, 2021

By the way, are there any recommended configurations for running data-parallel training with VGG16 on 2 nodes?
For example, how many workers and how many servers should we start? Do we need a separate machine for the scheduler?

@ymjiang
Member

ymjiang commented Feb 11, 2021

  1. I am confused by your log. You mentioned

     "for scenario 2: 3.4 img/sec per GPU"

     and

     "scenario 2: Model: vgg16 Batch size: 32 Number of GPUs: 16 Running warmup... Running benchmark... 300 img/sec per GPU"

     So what exactly is the performance for scenario 2?

  2. Here are a few tips on recommended configs: https://github.com/bytedance/byteps/blob/master/docs/best-practice.md

  3. Also, your throughput is less than 0.5Gbps, which is not expected with 200Gbps NICs. Can you use this benchmark to test the networking performance? https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark
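For reference, a rough sketch of how that basic benchmark is typically launched across two machines, based on my reading of that README (the exact test binary and the RDMA switch should be double-checked there; IP/port are placeholders):

    # build ps-lite (byteps branch) with RDMA support
    git clone -b byteps https://github.com/bytedance/ps-lite
    cd ps-lite && make -j USE_RDMA=1

    # common environment for every process
    export DMLC_NUM_WORKER=1
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx   # scheduler machine's IP
    export DMLC_PS_ROOT_PORT=yyyy
    export DMLC_ENABLE_RDMA=1                 # assumed RDMA switch; verify against the README

    # machine A
    DMLC_ROLE=scheduler ./tests/test_benchmark &
    DMLC_ROLE=server ./tests/test_benchmark &
    # machine B
    DMLC_ROLE=worker ./tests/test_benchmark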

@MichaelHsu170
Author

Hi @ymjiang , Happy Chinese New Year!!!
Sorry, for scenario 2 it is 3.4 img/sec.
I'll try the benchmark tool for networking performance measurement.
Thank you.

@MichaelHsu170
Author

MichaelHsu170 commented Mar 1, 2021

Hi @ymjiang ,
We tried the basic benchmark mentioned in https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark, but got failures. Could you suggest how we can get it working? Thank you.
We ran 2 scenarios:

  1. 1 scheduler, 1 server and 2 workers across 2 machines. On machine A, 1 scheduler and 1 worker were executed. On machine B, 1 server and 1 worker were executed. This scenario failed with "what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)". We tried setting these 2 environment variables even to single-digit numbers, but this error always showed up.
  2. 1 scheduler, 1 server and 2 workers were run on a single machine. The scheduler crashed with the error "what(): [08:43:06] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)" at the moment the last client (server or worker) was launched.
  • ib_send_bw works correctly on both machines.
  • We used the IP address of the ib0 port as the scheduler address.
    $ ibdev2netdev
    mlx5_0 port 1 ==> ib0 (Up)
    mlx5_1 port 1 ==> ib1 (Up)
    mlx5_2 port 1 ==> ib2 (Up)
    mlx5_3 port 1 ==> ib3 (Up)
    mlx5_4 port 1 ==> ib4 (Up)
    mlx5_5 port 1 ==> ib5 (Up)
    mlx5_6 port 1 ==> ib6 (Up)
    mlx5_7 port 1 ==> ib7 (Up)
    mlx5_8 port 1 ==> enp225s0f0 (Down)
    mlx5_9 port 1 ==> enp225s0f1 (Down)
    $ ibv_devinfo
    hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cd3e sys_image_guid: 0c42:a103:000b:cd3e vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 7 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_1 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cbc6 sys_image_guid: 0c42:a103:000b:cbc6 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 17 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_2 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cc1e sys_image_guid: 0c42:a103:000b:cc1e vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 5 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_3 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:024c sys_image_guid: 0c42:a103:000c:024c vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 8 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_4 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cbc2 sys_image_guid: 0c42:a103:000b:cbc2 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 16 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_5 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cd2a sys_image_guid: 0c42:a103:000b:cd2a vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 6 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_6 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:0478 sys_image_guid: 0c42:a103:000c:0478 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 11 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_7 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:0488 sys_image_guid: 0c42:a103:000c:0488 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 12 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_8 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000a:37da sys_image_guid: 0c42:a103:000a:37da vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000225 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 4096 (5) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet
    hca_id: mlx5_9 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000a:37db sys_image_guid: 0c42:a103:000a:37da vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000225 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 4096 (5) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet
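One more thing that might be worth pinning explicitly, given the number of active IB devices: to my understanding ps-lite selects its NIC from the DMLC_INTERFACE environment variable, so forcing it to the same port whose IP is used as the scheduler address could rule out the wrong device being picked (treat the variable name as an assumption to verify against the ps-lite documentation):

    # hypothetical pinning of the IPoIB port used by ps-lite/BytePS
    export DMLC_INTERFACE=ib0
    export DMLC_PS_ROOT_URI=<ib0 IP of the scheduler machine>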

@ymjiang
Member

ymjiang commented Mar 1, 2021

Can you show the output of ulimit -l?

@MichaelHsu170
Author

MichaelHsu170 commented Mar 1, 2021

It shows unlimited.

$ ulimit -l
unlimited

@ymjiang
Member

ymjiang commented Mar 2, 2021

There was a similar issue before: #282. Can you try this setup: 1 scheduler + 2 servers + 2 workers? It may have better load balance than using one server.

@MichaelHsu170
Author

I tried this scenario on 2 machines:
machine A: scheduler, server, worker
machine B: server, worker

But processes on machine B still crashed with the error message what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048). Reducing BYTEPS_RDMA_START_DEPTH and BYTEPS_RDMA_RX_DEPTH yields the same error.
The ticket you mentioned seems to be related to PFC. Do you think this error could be caused by PFC being disabled?

@ymjiang
Member

ymjiang commented Mar 3, 2021

PFC is not related to this problem. However, I am not sure about the possible cause. Perhaps some hardware configuration on your machines is limited, but I have no idea for now.

Does using 1 worker and 1 server work?

@MichaelHsu170
Author

If the scheduler, 1 server and 1 worker run on the same machine, the scheduler crashed with the error terminate called after throwing an instance of 'dmlc::Error' what(): [09:45:09] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR).
Running them on 2 machines:
machine A: scheduler, server
machine B: worker
the worker on machine B crashed with the error what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048).
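For reference, the knobs tried so far, collected into one launch sketch (the environment variables come straight from the error message and the ulimit check above; the values are only illustrative):

    # locked-memory limit that ibv_reg_mr depends on; should be unlimited or very large
    ulimit -l

    # queue depths smaller than the defaults (128 / 2048) reported in the error
    export BYTEPS_RDMA_START_DEPTH=32
    export BYTEPS_RDMA_RX_DEPTH=256
    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20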

@MichaelHsu170
Author

Hi @ymjiang , any recommendation would be appreciated. Thank you.

@ymjiang
Member

ymjiang commented Mar 26, 2021

Would you check these similar issues -- #371 and #372?
