release: v0.2.0 (#206)
* fix manifest.in

* remove tools dir

* __version__.py: 0.2.0

* update readme

* fix docs

* fix best-practice.md

* readme: nccl path

* readme: improve

* readme: improve

* add changelog.rst

* fix changelog

* readme: add pypi badge and news

* improve readme and changelog
ymjiang committed Feb 19, 2020
1 parent af6fd58 commit 536dd85
Showing 11 changed files with 81 additions and 207 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.rst
@@ -0,0 +1,18 @@
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Changelog for BytePS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

0.2.0 (2020-02)
------------------
* Significantly improve RDMA performance by enforcing page-aligned memory.
* Add IPC support for RDMA. Servers and workers can now be co-located without sacrificing much performance.
* Fix a hanging bug in the BytePS server.
* Fix an RDMA-related segmentation fault during fork() (e.g., as used by the PyTorch data loader).
* New feature: enable mixed use of co-located and non-co-located servers, along with a smart tensor allocation strategy.
* New feature: Add ``bpslaunch`` as the command to launch tasks.
* Add support for pip install: ``pip3 install byteps``


0.1.0 (2019-12)
------------------
* First official release.
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,3 +1,4 @@
include */*
include */* LICENSE byteps.lds byteps.exp
exclude .git/*
recursive-include * *.cc *.h
graft 3rdparty/ps-lite
28 changes: 21 additions & 7 deletions README.md
@@ -2,13 +2,18 @@

[![Build Status](https://travis-ci.org/bytedance/byteps.svg?branch=master)](https://travis-ci.org/bytedance/byteps)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
![Pypi](https://img.shields.io/pypi/v/byteps.svg)

BytePS is a high-performance, general-purpose distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run over either TCP or RDMA networks.

BytePS outperforms existing open-source distributed training frameworks by a large margin. For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than [Horovod](https://github.com/horovod/horovod)+[NCCL](https://github.com/NVIDIA/nccl). In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL.

## News

- [BytePS-0.2.0](CHANGELOG.rst) has been released.
- pip install is now available; refer to the [install tutorial](https://github.com/bytedance/byteps#quick-start).
- [Greatly improved RDMA performance](https://github.com/bytedance/byteps/pull/184). Co-located servers and workers are now supported with high performance.
- Fixed the [RDMA fork problem](https://github.com/bytedance/byteps/pull/192) caused by multi-processing.
- [New Server](https://github.com/bytedance/byteps/pull/151): server performance is improved by a large margin and is now independent of MXNet KVStore. Try our [new docker images](docker/).
- Use [the ssh launcher](launcher/) to launch your distributed jobs.
- [Improved key distribution strategy for better load-balancing](https://github.com/bytedance/byteps/pull/116).
@@ -41,21 +46,30 @@ BytePS also incorporates many acceleration techniques such as hierarchical strat

We provide a [step-by-step tutorial](docs/step-by-step-tutorial.md) for you to run benchmark training tasks. The simplest way to start is to use our [docker images](docker). Refer to the [documentation](docs) for how to [launch distributed jobs](docs/running.md) and for more [detailed configurations](docs/env.md). Once you can run BytePS, read the [best practice](docs/best-practice.md) guide to get the best performance.

Below, we explain how to build and run BytePS by yourself. BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet. BytePS depends on CUDA and NCCL, and requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else.
Below, we explain how to install BytePS by yourself. There are two options.

### Install by pip

```
pip3 install byteps
```

### Build from source code

If the above does not contain your desired wheel resource, or you want to try building from source code:
You can try out the latest features by installing directly from the master branch:

```
git clone --recurse-submodules https://github.com/bytedance/byteps
git clone --recursive https://github.com/bytedance/byteps
cd byteps
python setup.py install
python3 setup.py install
```

Notes:
- For best compatibility, please pin your gcc to 4.9 before building, [here](https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.pytorch#L72-L80) is an example.
- You may set `BYTEPS_USE_RDMA=1` to install with RDMA support. Before this, make sure your RDMA drivers have been properly installed and tested.
Notes for the above two options:
- BytePS assumes that you have already installed one or more of the following frameworks: TensorFlow / PyTorch / MXNet.
- BytePS depends on CUDA and NCCL. You should specify the NCCL path with `export BYTEPS_NCCL_HOME=/path/to/nccl`. By default it points to `/usr/local/nccl`.
- The installation requires gcc>=4.9. If you are working on CentOS/Redhat and have gcc<4.9, you can try `yum install devtoolset-7` before everything else. In general, we recommend using gcc 4.9 for best compatibility ([an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L72-L80) to pin gcc).
- RDMA support: during setup, the script automatically detects the RDMA header file. If you want to use RDMA, make sure your RDMA environment has been properly installed and tested before installing ([an example](https://github.com/bytedance/byteps/blob/3fba75def0d81c1d3225f8f397cc985200f57de7/docker/Dockerfile.mxnet#L29-L33) for Ubuntu-18.04).
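
As a concrete illustration, a from-source install might look like the following (a minimal sketch; the NCCL location is an assumed path, adjust it to your system):

```
# assumed NCCL location -- adjust to where NCCL is installed on your system
export BYTEPS_NCCL_HOME=/usr/local/nccl

git clone --recursive https://github.com/bytedance/byteps
cd byteps
python3 setup.py install
```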


## Use BytePS in Your Code

2 changes: 1 addition & 1 deletion byteps/__version__.py
@@ -1,3 +1,3 @@
VERSION = (0, 1, 0)
VERSION = (0, 2, 0)

__version__ = '.'.join(map(str, VERSION))
12 changes: 10 additions & 2 deletions docs/best-practice.md
@@ -13,15 +13,23 @@ If you have NVLinks, leave `BYTEPS_PCIE_SWITCH_SIZE` unmodified. If you don't kn

## Multi-machine (distributed mode)

This mode requires at least **4** physical machines, otherwise you won't see any benefits of BytePS. Two of the machines should have GPUs and run as workers. The other two run as servers and do not need GPUs. The scheduler can run on any machine.
### With additional CPU servers

This mode requires at least **4** physical machines. Two of the machines should have GPUs and run as workers. The other two run as CPU servers and do not need GPUs. The scheduler can run on any machine.

The key here is to make sure the following:
* Servers must be on different physical machines from workers.
* The total bandwidth of the servers must be equal to or larger than the total bandwidth of the workers.

If you are using RDMA, this should be sufficient. However, with TCP and >=25Gbps networks, it's possible that BytePS cannot fully utilize the bandwidth, because a single TCP connection usually cannot saturate 25Gbps.

To address this, you can try running more BytePS server instances on the server machines. For example, you can try running two server instances per server machines. This effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on. When doing this, you probably need to set `MXNET_OMP_MAX_THREADS` as: your CPU cores number divided by number of server instances per machine. For example, one machine has 32 cores and you put 4 server instances on it, then you need to `export MXNET_OMP_MAX_THREADS=8`. The idea is to reduce the CPU contention of different server instances.
To address this, you can try running more BytePS server instances on the server machines. For example, you can try running two server instances per server machine. This effectively doubles the number of TCP connections and should be sufficient for 25Gbps networks. For 40Gbps/50Gbps networks, you need three server instances per server machine, and so on. A rough sketch of this setup is shown below.
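
As a rough sketch (assuming two worker machines and two CPU server machines with two server instances each, plus a scheduler at an assumed address), each server machine would start two server processes:

```
# sketch: two server instances on one CPU machine
# DMLC_NUM_SERVER counts all server instances across machines -- here 2 machines x 2 instances = 4
export DMLC_ROLE=server
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=4
export DMLC_PS_ROOT_URI=10.0.0.1   # assumed scheduler IP
export DMLC_PS_ROOT_PORT=9000      # assumed scheduler port

bpslaunch &   # first server instance
bpslaunch &   # second server instance
wait
```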

### No additional CPU servers

When you don't have additional CPU servers, you should launch a worker process and a server process on each physical machine. We call this *co-locate* mode; its resource consumption is the same as Horovod's (no additional servers).

If you are using TCP, you will probably get near-identical performance to Horovod-TCP. However, if you are using RDMA, you can set `BYTEPS_ENABLE_IPC=1` to enable IPC communication between the co-located worker and server, and you should eventually get higher end-to-end performance than Horovod.
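
For example, on each physical machine in co-locate mode you might launch the two processes roughly like this (a sketch with an assumed scheduler address; `YOUR_COMMAND` stands for your training command as in the running guide):

```
# sketch: co-locate mode on one of two machines (assumed scheduler at 10.0.0.1:9000)
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=9000
export BYTEPS_ENABLE_IPC=1   # RDMA only: use IPC between the co-located worker and server

DMLC_ROLE=server bpslaunch &                               # server process on this machine
DMLC_ROLE=worker DMLC_WORKER_ID=0 bpslaunch YOUR_COMMAND   # worker process on this machine
```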

## The expected performance

8 changes: 4 additions & 4 deletions docs/env.md
@@ -107,16 +107,16 @@ export BYTEPS_NCCL_GROUP_SIZE=w
```

Servers can also be the performance bottleneck, e.g., when there is only one server but multiple workers.
You can try to increase the number of push threads on the servers (default is 1):
You can try to increase the number of processing threads on the servers (default is 4):

```
export SERVER_PUSH_NTHREADS=v
export BYTEPS_SERVER_ENGINE_THREAD=v
```

Increasing the number of engine CPU threads may also improves server performance:
Or enable scheduling at the server side to prioritize tensors with higher priority:

```
export MXNET_CPU_WORKER_NTHREADS=p
export BYTEPS_SERVER_ENABLE_SCHEDULE=1
```
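
For instance, a server-side tuning sketch could combine both settings (the thread count below is an assumed value; tune it to your hardware):

```
export BYTEPS_SERVER_ENGINE_THREAD=8     # assumed value; default is 4
export BYTEPS_SERVER_ENABLE_SCHEDULE=1   # prioritize tensors with higher priority
```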

## Asynchronous training
8 changes: 4 additions & 4 deletions docs/running.md
@@ -11,15 +11,15 @@ On worker 0, run:
```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=0 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
bpslaunch YOUR_COMMAND
```

On worker 1, run (only DMLC_WORKER_ID is different from above):

```
DMLC_ROLE=worker DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=1 DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 \
python launcher/launcher.py YOUR_COMMAND
bpslaunch YOUR_COMMAND
```

**For servers and schedulers, we highly recommend you use the docker image we build:**
@@ -32,14 +32,14 @@ Start server and scheduler docker instances with this image. In the server, run

```
DMLC_ROLE=server DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
```

On the scheduler, run (we also remove DMLC_WORKER_ID, and set role to scheduler):

```
DMLC_ROLE=scheduler DMLC_PS_ROOT_URI=10.0.0.1 DMLC_PS_ROOT_PORT=9000 \
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 python launcher/launcher.py
DMLC_NUM_WORKER=2 DMLC_NUM_SERVER=1 bpslaunch
```

In this example, your scheduler must be able to bind to `10.0.0.1:9000`.
52 changes: 15 additions & 37 deletions docs/step-by-step-tutorial.md
@@ -23,9 +23,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

### PyTorch
@@ -47,9 +45,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```

### MXNet
@@ -70,9 +66,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```

## Distributed Training (TCP)
@@ -95,7 +89,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For the server:
@@ -111,7 +105,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```


@@ -129,9 +123,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

For worker-1:
@@ -149,26 +141,20 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1 # the scheduler IP
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```


If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with

```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```


If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```

## Distributed Training with RDMA
@@ -198,7 +184,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For the server:
@@ -222,7 +208,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py
bpslaunch
```

For worker-0:
@@ -250,9 +236,7 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```

For worker-1:
@@ -281,25 +265,19 @@ export DMLC_PS_ROOT_URI=10.0.0.100
export DMLC_PS_ROOT_PORT=9000
# launch the job
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py \
--model ResNet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/tensorflow/synthetic_benchmark.py --model ResNet50 --num-iters 1000000
```



If your workers use PyTorch, you need to change the image name to `bytepsimage/pytorch`, and replace the python script of the workers with

```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py \
--model resnet50 --num-iters 1000000
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000
```


If your workers use MXNet, you need to change the image name to `bytepsimage/mxnet`, and replace the python script of the workers with
```
python3 /usr/local/byteps/launcher/launch.py \
python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py \
--benchmark 1 --batch-size=32
bpslaunch python3 /usr/local/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
```
12 changes: 6 additions & 6 deletions docs/troubleshooting.md
@@ -11,7 +11,7 @@ When launching distributed jobs, if you see hanging at the beginning, one possib
Install ps-lite:

```
git clone --branch byteps https://github.com/bytedance/ps-lite.git
git clone -b byteps https://github.com/bytedance/ps-lite.git
cd ps-lite
make -j
```
@@ -25,7 +25,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark
./ps-lite/tests/test_benchmark
```

For the server
@@ -36,7 +36,7 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark
./ps-lite/tests/test_benchmark
```

For the worker:
@@ -47,13 +47,13 @@ export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=[YOUR_SCHEDULER_IP]
export DMLC_PS_ROOT_PORT=[YOUR_SCHEDULER_PORT]
export DMLC_INTERFACE=eth0
./ps-lite/tests/test_kv_app_benchmark 1024000 100 0
./ps-lite/tests/test_benchmark 1024000 100 0
```

If it succeeds, you should be able to see something like this on the worker.
```
tests/test_kv_app_benchmark.cc:77: push_byte=4096000, repeat=100, total_time=128.842ms
tests/test_kv_app_benchmark.cc:91: pull_byte=4096000, repeat=100, total_time=353.38ms
push_byte=4096000, repeat=100, total_time=128.842ms
pull_byte=4096000, repeat=100, total_time=353.38ms
```

(Note: for RDMA networks, use `make -j USE_RDMA=1` to build, and `export DMLC_ENABLE_RDMA=1` for running the scheduler / server / worker)