
Performance regression with multi-node running #365

Open

MichaelHsu170 opened this issue Feb 9, 2021 · 14 comments

@MichaelHsu170
MichaelHsu170 commented Feb 9, 2021

Describe the bug
I've tried the following 2 scenarios and compared their performance:

  1. Run VGG16 on a single node with 8 GPUs using bpslaunch.
  2. Run VGG16 on 2 nodes with 8 GPUs each using bpslaunch.

Performance regressed badly in scenario 2 (roughly 1/100 of scenario 1):
    scenario 1: 300 img/sec per GPU
    scenario 2: 3.4 img/sec per GPU

To Reproduce
Steps to reproduce the behavior:

  1. git clone https://github.com/bytedance/byteps.git
  2. python3 setup.py install
  3. dpkg -i nccl-local-repo-ubuntu1804-2.8.4-cuda11.0_1.0-1_amd64.deb
  4. apt install libnccl2 libnccl-dev
  5. Prepare the running script for scenario 1:
    run_worker.sh
    #!/bin/bash
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_ROLE=worker
    export DMLC_NUM_WORKER=1
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
    export DMLC_PS_ROOT_PORT=yyyy
    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
  6. Run ./run_worker.sh on 1 node
  7. Prepare the running scripts for scenario 2:
    run_scheduler.sh
    #!/bin/bash
    export DMLC_ROLE=scheduler
    export DMLC_NUM_WORKER=2
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
    export DMLC_PS_ROOT_PORT=yyyy
    python3 ./bin/bpslaunch

run_server.sh
#!/bin/bash
export DMLC_ROLE=server
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch

run_worker.sh
#!/bin/bash
export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export DMLC_ROLE=worker
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
export DMLC_PS_ROOT_PORT=yyyy
python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
  8. Run ./run_scheduler.sh, ./run_server.sh and ./run_worker.sh on one node, then run ./run_server.sh and ./run_worker.sh on the other node.
  9. Performance:
scenario 1:
    Model: vgg16
    Batch size: 32
    Number of GPUs: 8
    Running warmup...
    Running benchmark...
    300 img/sec per GPU
scenario 2:
    Model: vgg16
    Batch size: 32
    Number of GPUs: 16
    Running warmup...
    Running benchmark...
    3.4 img/sec per GPU

Expected behavior
The two-node run should achieve per-GPU throughput comparable to the single-node run; there should not be a roughly 100x gap.


Environment:

  • OS: Ubuntu 18.04
  • GCC version: 7.5
  • CUDA and NCCL version: 11.0, 2.8.4
  • Framework (TF, PyTorch, MXNet): PyTorch 1.7.1


@ymjiang
Member

ymjiang commented Feb 9, 2021

What is the bandwidth between these two nodes?

@MichaelHsu170
Author

MichaelHsu170 commented Feb 10, 2021

They are 200Gb NICs on both nodes. With iftop (sudo iftop -n) I measured an average speed of around 450Mb/s.
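For a quick point-to-point sanity check outside of BytePS, something like iperf3 can also be used (a minimal sketch, assuming iperf3 is installed on both nodes; the address is a placeholder):

    # on node A (server side)
    iperf3 -s
    # on node B (client side): 8 parallel TCP streams for 30 seconds
    iperf3 -c <node A IP> -P 8 -t 30

Over IPoIB this only measures TCP throughput rather than RDMA, but it gives a rough upper bound to compare the 450Mb/s reading against.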

@MichaelHsu170
Author

MichaelHsu170 commented Feb 10, 2021

By the way, are there any recommended configurations for running data-parallel training with VGG16 on 2 nodes?
For example, how many workers and how many servers should we start? Do we need a separate machine for the scheduler?

@ymjiang
Member

ymjiang commented Feb 11, 2021

  1. I am confused by your log. You mentioned

     "for scenario 2: 3.4 img/sec per GPU"

     and

     "scenario 2: Model: vgg16 Batch size: 32 Number of GPUs: 16 Running warmup... Running benchmark... 300 img/sec per GPU"

     So what exactly is the performance for scenario 2?

  2. Here are a few tips on recommended configs: https://github.com/bytedance/byteps/blob/master/docs/best-practice.md

  3. Also, your throughput is less than 0.5Gbps, which is not expected with 200Gbps NICs. Can you use this benchmark to test the networking performance? https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark
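For reference, a rough sketch of how that basic benchmark is typically launched across two machines, based on my reading of that README (the exact test binary and the RDMA switch should be double-checked there; IP/port are placeholders):

    # build ps-lite (byteps branch) with RDMA support
    git clone -b byteps https://github.com/bytedance/ps-lite
    cd ps-lite && make -j USE_RDMA=1

    # common environment for every process
    export DMLC_NUM_WORKER=1
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx   # scheduler machine's IP
    export DMLC_PS_ROOT_PORT=yyyy
    export DMLC_ENABLE_RDMA=1                 # assumed RDMA switch; verify against the README

    # machine A
    DMLC_ROLE=scheduler ./tests/test_benchmark &
    DMLC_ROLE=server ./tests/test_benchmark &
    # machine B
    DMLC_ROLE=worker ./tests/test_benchmark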

@MichaelHsu170
Author

Hi @ymjiang , Happy Chinese New Year!!!
Sorry, for scenario 2 it is 3.4 img/sec.
I'll try the benchmark tool for networking performance measurement.
Thank you.

@MichaelHsu170
Author

MichaelHsu170 commented Mar 1, 2021

Hi @ymjiang ,
We tried the basic benchmark mentioned in https://github.com/bytedance/ps-lite/tree/byteps#1-basic-benchmark, but got failures. Could you suggest how we can get it working? Thank you.
We ran 2 scenarios:

  1. 1 scheduler, 1 server and 2 workers across 2 machines. On machine A, 1 scheduler and 1 worker were executed. On machine B, 1 server and 1 worker were executed. This scenario failed with "what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)". We tried setting these 2 environment variables even to single-digit numbers, but this error always showed up.
  2. 1 scheduler, 1 server and 2 workers were run on a single machine. The scheduler crashed with the error "what(): [08:43:06] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR)" at the moment the last client (server or worker) was launched.
  • ib_send_bw works correctly on both machines.
  • We used the IP address of the ib0 port as the scheduler address.
    $ ibdev2netdev
    mlx5_0 port 1 ==> ib0 (Up)
    mlx5_1 port 1 ==> ib1 (Up)
    mlx5_2 port 1 ==> ib2 (Up)
    mlx5_3 port 1 ==> ib3 (Up)
    mlx5_4 port 1 ==> ib4 (Up)
    mlx5_5 port 1 ==> ib5 (Up)
    mlx5_6 port 1 ==> ib6 (Up)
    mlx5_7 port 1 ==> ib7 (Up)
    mlx5_8 port 1 ==> enp225s0f0 (Down)
    mlx5_9 port 1 ==> enp225s0f1 (Down)
    $ ibv_devinfo
    hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cd3e sys_image_guid: 0c42:a103:000b:cd3e vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 7 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_1 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cbc6 sys_image_guid: 0c42:a103:000b:cbc6 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 17 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_2 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cc1e sys_image_guid: 0c42:a103:000b:cc1e vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 5 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_3 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:024c sys_image_guid: 0c42:a103:000c:024c vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 8 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_4 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cbc2 sys_image_guid: 0c42:a103:000b:cbc2 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 16 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_5 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000b:cd2a sys_image_guid: 0c42:a103:000b:cd2a vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 6 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_6 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:0478 sys_image_guid: 0c42:a103:000c:0478 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 11 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_7 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000c:0488 sys_image_guid: 0c42:a103:000c:0488 vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000223 phys_port_cnt: 1 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 12 port_lmc: 0x00 link_layer: InfiniBand
    hca_id: mlx5_8 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000a:37da sys_image_guid: 0c42:a103:000a:37da vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000225 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 4096 (5) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet
    hca_id: mlx5_9 transport: InfiniBand (0) fw_ver: 20.28.1002 node_guid: 0c42:a103:000a:37db sys_image_guid: 0c42:a103:000a:37da vendor_id: 0x02c9 vendor_part_id: 4123 hw_ver: 0x0 board_id: MT_0000000225 phys_port_cnt: 1 port: 1 state: PORT_DOWN (1) max_mtu: 4096 (5) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet
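One more thing that might be worth pinning explicitly, given the number of active IB devices: to my understanding ps-lite selects its NIC from the DMLC_INTERFACE environment variable, so forcing it to the same port whose IP is used as the scheduler address could rule out the wrong device being picked (treat the variable name as an assumption to verify against the ps-lite documentation):

    # hypothetical pinning of the IPoIB port used by ps-lite/BytePS
    export DMLC_INTERFACE=ib0
    export DMLC_PS_ROOT_URI=<ib0 IP of the scheduler machine>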

@ymjiang
Member

ymjiang commented Mar 1, 2021

Can you show the output of ulimit -l?

@MichaelHsu170
Author

MichaelHsu170 commented Mar 1, 2021

It shows unlimited.

$ ulimit -l
unlimited

@ymjiang
Member

ymjiang commented Mar 2, 2021

There was a similar issue before: #282. Can you try this setup: 1 scheduler + 2 servers + 2 workers? It may have better load balance than using one server.

@MichaelHsu170
Author

I tried this scenario on 2 machines:
machine A: scheduler, server, worker
machine B: server, worker

But processes on machine B still crashed with the error message what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048). Reducing BYTEPS_RDMA_START_DEPTH and BYTEPS_RDMA_RX_DEPTH yields the same error.
The ticket you mentioned seems to be related to PFC. Do you think this error could be caused by PFC being disabled?

@ymjiang
Member

ymjiang commented Mar 3, 2021

PFC is not related to this problem. However, I am not sure about the possible cause. Perhaps some hardware configuration on your machines is limited, but I have no idea for now.

Does using 1 worker and 1 server work?

@MichaelHsu170
Author

If the scheduler, 1 server and 1 worker run on the same machine, the scheduler crashed with the error terminate called after throwing an instance of 'dmlc::Error' what(): [09:45:09] src/./rdma_van.h:747: Check failed: 0 OnEvent: unknown event 1 (RDMA_CM_EVENT_ADDR_ERROR).
Running them on 2 machines:
machine A: scheduler, server
machine B: worker
the worker on machine B crashed with the error what(): [08:40:36] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory. You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048).
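For reference, the knobs tried so far, collected into one launch sketch (the environment variables come straight from the error message and the ulimit check above; the values are only illustrative):

    # locked-memory limit that ibv_reg_mr depends on; should be unlimited or very large
    ulimit -l

    # queue depths smaller than the defaults (128 / 2048) reported in the error
    export BYTEPS_RDMA_START_DEPTH=32
    export BYTEPS_RDMA_RX_DEPTH=256
    python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20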

@MichaelHsu170
Author

Hi @ymjiang , any recommendation would be appreciated. Thank you.

@ymjiang
Member

ymjiang commented Mar 26, 2021

Would you check these similar issues -- #371 and #372?
