Performance regression with multi-node running #365
Comments
How much is the bandwidth between these two nodes?
They are 200Gb NICs on both nodes. I measured an average speed of around 450Mb/s with iftop.
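To tell whether the ~450Mb/s seen in iftop is all the link can actually deliver or just what BytePS happens to generate, a raw point-to-point test between the two nodes may help. A minimal sketch using iperf3 (assuming it is installed on both nodes; the address is a placeholder):

```bash
# On node A: start an iperf3 server
iperf3 -s

# On node B: measure TCP throughput to node A using 4 parallel streams for 30 seconds
# (replace xxx.xxx.xxx.xxx with node A's address on the 200Gb NIC)
iperf3 -c xxx.xxx.xxx.xxx -P 4 -t 30
```

If iperf3 also reports far below the nominal 200Gb/s, the bottleneck is in the network path or NIC configuration rather than in BytePS itself.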
By the way, are there any recommended configurations to run data parallel training with VGG16 on 2 nodes?
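Not an official answer, but for reference these are the knobs commonly tuned for cross-node BytePS runs; the variable names are ones I believe exist in BytePS/ps-lite, and the values are assumptions to experiment with:

```bash
# Possible additions to the scenario-2 scripts -- starting points, not verified settings.

# Pin ps-lite to the high-speed NIC instead of whatever interface it picks by default.
export DMLC_INTERFACE=eth0            # replace with the actual 200Gb interface name

# If BytePS/ps-lite was built with RDMA support, bypass TCP entirely.
export DMLC_ENABLE_RDMA=1             # the expected value may differ by version

# Split large tensors (e.g. VGG16's fc layers) into smaller partitions so
# push/pull can overlap and spread across servers.
export BYTEPS_PARTITION_BYTES=4096000
```

As far as I recall, the BytePS docs also recommend running at least as many server instances as workers for multi-machine training, which is what the 2-server suggestion later in this thread is getting at.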
So what exactly is the performance for scenario 2?
Hi @ymjiang, Happy Chinese New Year!!!
Hi @ymjiang,
Can you show the output of
It shows `unlimited`.
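Since the exact command is not shown above, here is a sketch of the limits that typically matter for multi-node ps-lite/BytePS runs (whether these are the ones being asked about is my assumption):

```bash
# Max open file descriptors -- each worker/server holds many sockets.
ulimit -n

# Max locked memory -- relevant if RDMA is used, since buffers must be pinned.
ulimit -l

# Raise the open-files limit for the current shell if it is low (e.g. 1024).
ulimit -n 65536
```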
There was a similar issue before: #282. Can you try this setup: 1 scheduler + 2 servers + 2 workers? It may have better load balance than using one server.
I tried this scenario on 2 machines, but the processes on machine B still crashed with an error message.
PFC is not related to this problem. However, I am not sure about the possible reasons. Perhaps some hardware configurations on your machines are limited, but I have no idea right now. Does using 1 worker and 1 server work?
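In case it helps, a minimal sketch of what that 1 worker + 1 server test could look like across the two machines, derived from the scenario-1 script above (the placement is my assumption; IPs and ports stay as placeholders):

```bash
# Machine A: scheduler + server
DMLC_ROLE=scheduler DMLC_NUM_WORKER=1 DMLC_NUM_SERVER=1 \
  DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx DMLC_PS_ROOT_PORT=yyyy python3 ./bin/bpslaunch &
DMLC_ROLE=server DMLC_NUM_WORKER=1 DMLC_NUM_SERVER=1 \
  DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx DMLC_PS_ROOT_PORT=yyyy python3 ./bin/bpslaunch &

# Machine B: worker only, so every push/pull has to cross the network
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 DMLC_ROLE=worker DMLC_NUM_WORKER=1 DMLC_NUM_SERVER=1 \
  DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx DMLC_PS_ROOT_PORT=yyyy \
  python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20
```

If this already drops to a few img/sec, the problem is isolated to the cross-node communication path.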
If the scheduler, 1 server, and 1 worker run on the same machine, the scheduler crashed with an error.
Hi @ymjiang, any recommendation would be appreciated. Thank you.
Describe the bug
I've tried the following 2 scenarios and compared their performance. Performance regressed dramatically in scenario 2 (about 1/100 of scenario 1):
for scenario 1: 300 img/sec per GPU
for scenario 2: 3.4 img/sec per GPU
To Reproduce
Steps to reproduce the behavior:
1. Create run_worker.sh for scenario 1 (single node):

   #!/bin/bash
   export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   export DMLC_ROLE=worker
   export DMLC_NUM_WORKER=1
   export DMLC_NUM_SERVER=1
   export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
   export DMLC_PS_ROOT_PORT=yyyy
   python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20

2. Run ./run_worker.sh on 1 node. Then create the scripts for scenario 2 (two nodes):

   run_scheduler.sh

   #!/bin/bash
   export DMLC_ROLE=scheduler
   export DMLC_NUM_WORKER=2
   export DMLC_NUM_SERVER=2
   export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
   export DMLC_PS_ROOT_PORT=yyyy
   python3 ./bin/bpslaunch

   run_server.sh

   #!/bin/bash
   export DMLC_ROLE=server
   export DMLC_NUM_WORKER=2
   export DMLC_NUM_SERVER=2
   export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
   export DMLC_PS_ROOT_PORT=yyyy
   python3 ./bin/bpslaunch

   run_worker.sh

   #!/bin/bash
   export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   export DMLC_ROLE=worker
   export DMLC_NUM_WORKER=2
   export DMLC_NUM_SERVER=2
   export DMLC_PS_ROOT_URI=xxx.xxx.xxx.xxx
   export DMLC_PS_ROOT_PORT=yyyy
   python3 ./bin/bpslaunch python3 ./example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 20

3. Run ./run_scheduler.sh, ./run_server.sh and ./run_worker.sh on 1 node, and then run ./run_server.sh and ./run_worker.sh on another node.

4. Performance:
scenario 1:
   Model: vgg16
   Batch size: 32
   Number of GPUs: 8
   Running warmup...
   Running benchmark...
   300 img/sec per GPU
scenario 2:
   Model: vgg16
   Batch size: 32
   Number of GPUs: 16
   Running warmup...
   Running benchmark...
   3.4 img/sec per GPU
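A back-of-envelope consistency check (my own estimate, not from the thread): assuming VGG16 has roughly 138M FP32 parameters (~550 MB of gradients) and that with 2 servers about half of each worker's push/pull traffic goes to the remote server, the observed 3.4 img/sec implies an effective cross-node throughput of only a few hundred Mbit/s:

```bash
# Rough estimate under the assumptions stated above (not a measurement).
python3 - <<'EOF'
params_bytes = 138e6 * 4        # ~552 MB of FP32 gradients/parameters
iter_time_s  = 32 / 3.4         # ~9.4 s per iteration at 3.4 img/sec with batch size 32
cross_node   = params_bytes     # ~276 MB pushed to + ~276 MB pulled from the remote server
print("effective cross-node throughput ~ %.0f Mbit/s" % (cross_node * 8 / iter_time_s / 1e6))
EOF
```

That works out to roughly 470 Mbit/s, which matches the ~450Mb/s observed with iftop above, so the slowdown looks like a pure network-throughput bottleneck: the 200Gb NICs are delivering well under 1% of their nominal bandwidth to this workload.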
Expected behavior
There should be no such large performance gap between the two scenarios.