-
Notifications
You must be signed in to change notification settings - Fork 487
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
105 additions
and
97 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,112 +1,120 @@ | ||
<img src="http://parameterserver.org/images/parameterserver.png" width=400 /> | ||
|
||
[![Build Status](https://travis-ci.org/dmlc/ps-lite.svg?branch=master)](https://travis-ci.org/dmlc/ps-lite) | ||
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE) | ||
|
||
A light and efficient implementation of the parameter server | ||
framework. It provides clean yet powerful APIs. For example, a worker node can | ||
communicate with the server nodes by | ||
- `Push(keys, values)`: push a list of (key, value) pairs to the server nodes | ||
- `Pull(keys)`: pull the values from servers for a list of keys | ||
- `Wait`: wait untill a push or pull finished. | ||
|
||
A simple example: | ||
|
||
```c++ | ||
std::vector<uint64_t> key = {1, 3, 5}; | ||
std::vector<float> val = {1, 1, 1}; | ||
std::vector<float> recv_val; | ||
ps::KVWorker<float> w; | ||
w.Wait(w.Push(key, val)); | ||
w.Wait(w.Pull(key, &recv_val)); | ||
|
||
|
||
This is the communication library for [BytePS](https://github.com/bytedance/byteps). It is designed for high performance RDMA. However, it also supports TCP. | ||
|
||
## Build | ||
|
||
```bash | ||
git clone -b byteps https://github.com/bytedance/ps-lite | ||
cd ps-lite | ||
make -j USE_RDMA=1 | ||
``` | ||
|
||
More features: | ||
Remove `USE_RDMA=1` if you don't want to build with RDMA support. | ||
|
||
## Concepts | ||
|
||
In ps-lite, there are three roles: worker, server and scheduler. Each role is an independent process. | ||
|
||
The scheduler is responsible for setting up the connections between workers and servers at initialization. There should be only 1 scheduler process. | ||
|
||
- Flexible and high-performance communication: zero-copy push/pull, supporting | ||
dynamic length values, user-defined filters for communication compression | ||
- Server-side programming: supporting user-defined handles on server nodes | ||
A worker process only communicates with server processes, and vice versa. There won't be any traffic between worker-to-worker, and server-to-server. | ||
|
||
### Build | ||
|
||
`ps-lite` requires a C++11 compiler such as `g++ >= 4.8`. On Ubuntu >= 13.10, we | ||
can install it by | ||
## Tutorial | ||
|
||
After build, you will have two testing applications under `tests/` dir, namely `test_benchmark` and `test_ipc_benchmark`. | ||
Below we elaborate how you can run with them. | ||
|
||
To debug, set `PS_VERBOSE=1` to see important log during connection setup, and `PS_VERBOSE=2` to see each message log. | ||
|
||
### 1. Basic benchmark | ||
|
||
Suppose you want to run with 1 worker and 1 server on different machines. Therefore, we need to launch 3 processes in total (including the scheduler). You can launch the scheduler process at any machine as it does not affect the performance. | ||
|
||
For the scheduler: | ||
|
||
``` | ||
sudo apt-get update && sudo apt-get install -y build-essential git | ||
# common setup | ||
export DMLC_ENABLE_RDMA=1 | ||
export DMLC_NUM_WORKER=1 | ||
export DMLC_NUM_SERVER=1 | ||
export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP | ||
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can random choose) | ||
export DMLC_INTERFACE=eth5 # my RDMA interface | ||
# launch scheduler | ||
DMLC_ROLE=scheduler test_benchmark | ||
``` | ||
Instructions for | ||
[older Ubuntu](http://ubuntuhandbook.org/index.php/2013/08/install-gcc-4-8-via-ppa-in-ubuntu-12-04-13-04/), | ||
[Centos](http://linux.web.cern.ch/linux/devtoolset/), | ||
and | ||
[Mac Os X](http://hpc.sourceforge.net/). | ||
|
||
Then clone and build | ||
|
||
```bash | ||
git clone https://github.com/dmlc/ps-lite | ||
cd ps-lite && make -j4 | ||
For the server: | ||
``` | ||
# common setup | ||
export DMLC_ENABLE_RDMA=1 | ||
export DMLC_NUM_WORKER=1 | ||
export DMLC_NUM_SERVER=1 | ||
export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP | ||
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can random choose) | ||
export DMLC_INTERFACE=eth5 # my RDMA interface | ||
# launch server | ||
DMLC_ROLE=server test_benchmark | ||
``` | ||
|
||
### Build with RDMA support | ||
For the worker: | ||
``` | ||
# common setup | ||
export DMLC_ENABLE_RDMA=1 | ||
export DMLC_NUM_WORKER=1 | ||
export DMLC_NUM_SERVER=1 | ||
export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP | ||
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can random choose) | ||
export DMLC_INTERFACE=eth5 # my RDMA interface | ||
# launch worker | ||
DMLC_ROLE=worker test_benchmark | ||
``` | ||
|
||
You can add `USE_RDMA=1` to enable RDMA support. | ||
If you just want to use TCP, make sure to set `DMLC_ENABLE_RDMA=0` for all processes. | ||
|
||
```bash | ||
make -j $(nproc) USE_RDMA=1 | ||
### 2. Benchmark with IPC support | ||
|
||
The `test_ipc_benchmark` demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker. | ||
|
||
Suppose you have two machines. Each machine should launch a worker and a server process. | ||
|
||
For the scheduler: | ||
(you can launch it on either machine-0 or machine-1) | ||
``` | ||
# common setup | ||
export DMLC_ENABLE_RDMA=1 | ||
export DMLC_NUM_WORKER=2 | ||
export DMLC_NUM_SERVER=2 | ||
export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP | ||
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can random choose) | ||
export DMLC_INTERFACE=eth5 # my RDMA interface | ||
# launch scheduler | ||
DMLC_ROLE=scheduler test_ipc_benchmark | ||
``` | ||
|
||
For machine-0 and machine-1: | ||
|
||
``` | ||
# common setup | ||
export DMLC_ENABLE_RDMA=1 | ||
export DMLC_NUM_WORKER=2 | ||
export DMLC_NUM_SERVER=2 | ||
export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP | ||
export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can random choose) | ||
export DMLC_INTERFACE=eth5 # my RDMA interface | ||
# launch server and worker | ||
DMLC_ROLE=server test_ipc_benchmark & | ||
DMLC_ROLE=worker test_ipc_benchmark | ||
``` | ||
|
||
|
||
Note: This benchmark is only valid for RDMA. | ||
|
||
### Enable new features for higher performance | ||
|
||
To avoid head-of-line blocking, use `USE_MULTI_THREAD_FOR_RECEIVING=1` to enable multi-threading | ||
in the receiving functions of `customer.cc` | ||
(this flag must be set explicitly to enable multi-threading). | ||
You can also set `NUM_RECEIVE_THREAD` to change the number of threads of the thread pool | ||
(default is 2). | ||
|
||
|
||
### How to use | ||
|
||
`ps-lite` provides asynchronous communication for other projects: | ||
- Distributed deep neural networks: | ||
[MXNet](https://github.com/dmlc/mxnet), | ||
[CXXNET](https://github.com/dmlc/cxxnet) and | ||
[Minverva](https://github.com/minerva-developers/minerva) | ||
- Distributed high dimensional inference, such as sparse logistic regression, | ||
factorization machines: | ||
[DiFacto](https://github.com/dmlc/difacto) | ||
[Wormhole](https://github.com/dmlc/wormhole) | ||
|
||
### History | ||
|
||
We started to work on the parameter server framework since 2010. | ||
|
||
1. The first generation was | ||
designed and optimized for specific algorithms, such as logistic regression and | ||
LDA, to serve the sheer size industrial machine learning tasks (hundreds billions of | ||
examples and features with 10-100TB data size) . | ||
|
||
2. Later we tried to build a open-source general purpose framework for machine learning | ||
algorithms. The project is available at [dmlc/parameter_server](https://github.com/dmlc/parameter_server). | ||
|
||
3. Given the growing demands from other projects, we created `ps-lite`, which provides a clean data communication API and a | ||
lightweight implementation. The implementation is based on `dmlc/parameter_server`, but we refactored the job launchers, file I/O and machine | ||
learning algorithms codes into different projects such as `dmlc-core` and | ||
`wormhole`. | ||
|
||
4. From the experience we learned during developing | ||
[dmlc/mxnet](https://github.com/dmlc/mxnet), we further refactored the API and implementation from [v1](https://github.com/dmlc/ps-lite/releases/tag/v1). The main | ||
changes include | ||
- less library dependencies | ||
- more flexible user-defined callbacks, which facilitate other language | ||
bindings | ||
- let the users, such as the dependency | ||
engine of mxnet, manage the data consistency | ||
|
||
### Research papers | ||
1. Mu Li, Dave Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski, | ||
James Long, Eugene Shekita, Bor-Yiing | ||
Su. [Scaling Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf). In | ||
Operating Systems Design and Implementation (OSDI), 2014 | ||
2. Mu Li, Dave Andersen, Alex Smola, and Kai | ||
Yu. [Communication Efficient Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_nips14.pdf). In | ||
Neural Information Processing Systems (NIPS), 2014 |