Run the distributed training on Kubernetes #61

After a successful single-machine run on Kubernetes with the workaround, I tried to run distributed training with 2 workers on Kubernetes. However, only one worker runs, and the other always hangs. I assigned just 1 device (with 0 as the device tag), but the running worker reported benchmarking with 2 GPUs. The running worker has 2 GPUs, and the hanging worker has only 1 GPU.

Comments
Is it mandatory for the scheduler, server, and worker to be located on different physical machines? In k8s, Pod-level isolation behaves like physical-machine isolation, so could we put the scheduler and server in different Pods on the same machine? Is it possible to use the MXNet operator to drive BytePS? I see the runtime structures are the same.
For now BytePS assumes a homogeneous setup, i.e., each worker must have the same number of GPUs. Otherwise you will run into trouble.
We usually use Docker containers. Running on bare metal should be fine, but you need to be careful with the environment (gcc, CUDA driver, etc.). We have never actually tested with Kubernetes.
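For instance, a minimal per-node sanity check might look like the following (a rough sketch; the exact version requirements depend on your BytePS build and framework):

```bash
# Verify the toolchain and driver are consistent across all nodes.
gcc --version        # BytePS builds a native extension; gcc must match its requirements
nvidia-smi           # confirms the CUDA driver is loaded and lists the visible GPUs
nvcc --version       # CUDA toolkit version, if compiling from source
```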
Yes, it is OK if each worker only has 1 GPU. Regarding the GPU model, we have tested with 1080Ti and V100.
Unfortunately we don't have experience with Kubernetes.
It is not mandatory. You can put them on the same physical machine.
I am not sure what "operator" means here. Can you please clarify?
Please give each worker the same number of GPUs and then try again. If you still run into trouble, feel free to let us know.
We require that each worker has the same number of GPUs only because we need a way to correctly calculate the total number of workers and the global rank.
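As an illustration of why homogeneity matters (a hypothetical sketch, not BytePS's actual internals), the global rank of each GPU can only be derived from the worker ID when every worker contributes the same GPU count:

```bash
# Hypothetical rank arithmetic; only valid when every worker
# has exactly GPUS_PER_WORKER GPUs.
DMLC_NUM_WORKER=2
GPUS_PER_WORKER=4
DMLC_WORKER_ID=1          # 0-based index of this worker machine
LOCAL_RANK=2              # GPU index within this machine
GLOBAL_RANK=$((DMLC_WORKER_ID * GPUS_PER_WORKER + LOCAL_RANK))
TOTAL_GPUS=$((DMLC_NUM_WORKER * GPUS_PER_WORKER))
echo "global rank $GLOBAL_RANK of $TOTAL_GPUS"
```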
"Operator" is a concept of Kubernetes. Please refer to https://coreos.com/operators/. I'd like to verify the performance firstly, and then have an "operator" to let it run on k8s much easier. |
Thanks for the notes on k8s. Also, if you want high performance, you need to put workers and servers on different physical machines. We have addressed this in the README.
How did you benchmark with 16 GPUs? How many servers did you run?
@compete369 Did the single-machine case work as expected for you? Your setup is pretty similar to ours. However, there are a few details I want to understand.
It's very possible to drive BytePS using the MXNet K8s operator with minor modifications -- all the DMLC_* environment variables are the same.
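For reference, that shared DMLC_* contract looks roughly like this (a sketch with placeholder values; the launcher entry point may differ across BytePS versions):

```bash
# Scheduler node:
export DMLC_ROLE=scheduler
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1   # scheduler's reachable IP (placeholder)
export DMLC_PS_ROOT_PORT=1234
python3 launcher/launch.py

# Each worker additionally sets its own 0-based ID and wraps the training command:
# export DMLC_ROLE=worker
# export DMLC_WORKER_ID=0
# python3 launcher/launch.py python3 benchmark.py
```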
The single-machine run is very strange in that NVLink shows no traffic. Could you give me a pointer?
@compete369 The single-machine case is very strange. On a single machine, BytePS uses NCCL, which should use NVLink. Can you set NCCL_DEBUG=INFO when you run the test and check NCCL's output? What is the single-machine training speed?
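For example (assuming your usual single-machine launch command; the trailing command is a placeholder):

```bash
# Ask NCCL to report the transports it selects (look for NVLink/P2P lines).
export NCCL_DEBUG=INFO
# To compare against the no-P2P path, additionally set:
# export NCCL_P2P_DISABLE=1
python3 launcher/launch.py python3 benchmark.py   # placeholder command
```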
```
worker-pytorch-single1:25:25 [0] NCCL INFO Using internal Network Socket
worker-pytorch-single1:25:25 [0] NCCL INFO Ring 00 : 0 1 2 3
Iter #0: 141.0 img/sec per GPU
```

When I set NCCL_P2P_DISABLE=1:

```
Iter #0: 107.7 img/sec per GPU
```
@compete369 Thanks for the log. "P2P/IPC" is the correct status. BTW, did you try Horovod in the single-machine case? Is the performance similar? As long as it is, you are good for the single-machine case. You may also try MXNet's own NCCL implementation; these should all be very similar.

Based on your NCCL_P2P_DISABLE=1 results, I think you should expect BytePS to get similar results in the multi-worker distributed case (100~140 img/sec per GPU). I believe you can get that with two separate parameter servers. Yes, it's much slower than our results, but I think that's due to your machine setup. In the end, BytePS can't beat the performance of local NCCL using direct shared memory. You can compare the data paths in the two cases; with local NCCL and direct shared memory it is GPU -> CPU memory -> GPU. The best you can expect from BytePS is that it completely hides the latency of push and pull and matches the performance of local NCCL. The same goes for all the alternative distributed options you have; for example, Horovod can't get you any better results either.
Thanks very much. I finally got the expected speed on a single machine. However, I failed when moving on to 2-machine training. I had 1 scheduler, 2 servers, and 2 workers, each on a different machine.

```
[10:03:24] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
Stack trace returned 9 entries:
```

Could you help me with this? Thanks!
This is strange. It looks like Node11 (a worker) does not send messages to the scheduler at all, yet the scheduler still behaves as if all workers and servers are ready. Can you please show the complete scripts you use to launch the scheduler, servers, and workers? Thanks. Another guess: did you set
I am sorry, I had the wrong setting DMLC_NUM_WORKER=1 for the scheduler; it should be DMLC_NUM_WORKER=2.
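In other words, the scheduler's worker count has to match the real number of worker machines, something like:

```bash
# On the scheduler: must equal the number of worker machines in the job.
export DMLC_NUM_WORKER=2   # was mistakenly set to 1
export DMLC_NUM_SERVER=2
```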
I had 1 scheduler, 2 servers (8U16G), and 2 workers (64U, 256G, 8 V100 GPUs). It runs successfully, but the training speed is around 95 img/sec per GPU. On a single machine, it could run 250 img/sec per GPU. Any clue on this?

```
BytePS launching worker
worker-pytorch-0:33:33 [3] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:24:24 [1] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:31:31 [2] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:27:27 [0] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:46:46 [7] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:40:40 [4] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:43:43 [5] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
worker-pytorch-0:45:45 [6] misc/ibvwrap.cu:63 NCCL WARN Failed to open libibverbs.so[.1]
```

And GPU utilization is very low.
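As a side note on the libibverbs warnings (my reading, not something the maintainers stated here): NCCL prints them when the RDMA verbs library is missing and then falls back to TCP sockets, so they are harmless unless the nodes actually have InfiniBand/RoCE NICs. You can check like so:

```bash
# Is the RDMA verbs library installed?
ldconfig -p | grep libibverbs
# On Debian/Ubuntu images it can be installed with (assumption: apt-based image):
# apt-get install -y libibverbs1
```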
Let's continue the discussion in your performance issue. #68