<img src="http://parameterserver.org/images/parameterserver.png" width=400 />

[![Build Status](https://travis-ci.org/dmlc/ps-lite.svg?branch=master)](https://travis-ci.org/dmlc/ps-lite)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)

A light and efficient implementation of the parameter server
framework. It provides clean yet powerful APIs. For example, a worker node can
communicate with the server nodes by
- `Push(keys, values)`: push a list of (key, value) pairs to the server nodes
- `Pull(keys)`: pull the values from servers for a list of keys
- `Wait`: wait until a push or pull has finished.

A simple example:

```c++
#include "ps/ps.h"

std::vector<uint64_t> key = {1, 3, 5};
std::vector<float> val = {1, 1, 1};
std::vector<float> recv_val;
ps::KVWorker<float> w;
w.Wait(w.Push(key, val));
w.Wait(w.Pull(key, &recv_val));
```
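
For completeness, here is a rough sketch of a server-side counterpart. It assumes the `KVServer` API and default request handle from ps-lite's `kv_app.h`; the exact `Start`/`Finalize` signatures differ between ps-lite versions, so treat it as illustrative rather than definitive:

```c++
#include "ps/ps.h"

int main(int argc, char* argv[]) {
  // Recent ps-lite versions take a customer id in Start/Finalize;
  // older versions omit it. Adjust to the version you build against.
  ps::Start(0);
  if (ps::IsServer()) {
    // KVServerDefaultHandle keeps a key -> value store and sums pushed values.
    auto* server = new ps::KVServer<float>(0);
    server->set_request_handle(ps::KVServerDefaultHandle<float>());
    ps::RegisterExitCallback([server]() { delete server; });
  }
  ps::Finalize(0, true);
  return 0;
}
```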

This is the communication library for [BytePS](https://github.com/bytedance/byteps). It is designed for high-performance RDMA, but it also supports TCP. Other features include zero-copy push/pull, dynamic length values, user-defined filters for communication compression, and user-defined request handles on the server side.

## Build

```bash
git clone -b byteps https://github.com/bytedance/ps-lite
cd ps-lite
make -j USE_RDMA=1
```

`ps-lite` requires a C++11 compiler such as `g++ >= 4.8`. On Ubuntu you can install one with `sudo apt-get update && sudo apt-get install -y build-essential git`.

Remove `USE_RDMA=1` if you don't want to build with RDMA support.
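
For example, a TCP-only build is simply:

```bash
make -j $(nproc)
```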

## Concepts

In ps-lite, there are three roles: worker, server and scheduler. Each role is an independent process.

The scheduler is responsible for setting up the connections between workers and servers at initialization. There should be only 1 scheduler process.

A worker process only communicates with server processes, and vice versa. There is no direct worker-to-worker or server-to-server traffic.
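
To make these roles concrete, the sketch below launches all three processes on a single machine over TCP. This is a toy setup for illustration only; it assumes `test_benchmark` has been built under `tests/` as described in the tutorial below:

```bash
# one scheduler, one server, and one worker, all on localhost over TCP
export DMLC_ENABLE_RDMA=0
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=127.0.0.1  # the scheduler's IP
export DMLC_PS_ROOT_PORT=8123      # the scheduler's port
DMLC_ROLE=scheduler ./tests/test_benchmark &
DMLC_ROLE=server ./tests/test_benchmark &
DMLC_ROLE=worker ./tests/test_benchmark
```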

## Tutorial

After the build, you will find two test applications under the `tests/` directory: `test_benchmark` and `test_ipc_benchmark`. Below we explain how to run them.

To debug, set `PS_VERBOSE=1` to see the important logs during connection setup, or `PS_VERBOSE=2` to additionally log every message.
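
For example, to trace the connection setup of one of the processes launched below:

```bash
# PS_VERBOSE=2 would additionally log every message
PS_VERBOSE=1 DMLC_ROLE=scheduler ./test_benchmark
```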

### 1. Basic benchmark

Suppose you want to run with 1 worker and 1 server on different machines. You then need to launch 3 processes in total (including the scheduler). The scheduler process can run on any machine; its placement does not affect performance.

For the scheduler:

```bash
# common setup
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port (any free port works)
export DMLC_INTERFACE=eth5        # this machine's RDMA interface
# launch the scheduler
DMLC_ROLE=scheduler ./test_benchmark
```
For the server:
```bash
# common setup
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port (any free port works)
export DMLC_INTERFACE=eth5        # this machine's RDMA interface
# launch the server
DMLC_ROLE=server ./test_benchmark
```

For the worker:
```bash
# common setup
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port (any free port works)
export DMLC_INTERFACE=eth5        # this machine's RDMA interface
# launch the worker
DMLC_ROLE=worker ./test_benchmark
```

If you just want to use TCP, make sure to set `DMLC_ENABLE_RDMA=0` for all processes.
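
For example, the worker launch above becomes the following under TCP (the same change applies to the scheduler and the server):

```bash
# common setup (RDMA disabled, everything else as before)
export DMLC_ENABLE_RDMA=0
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port
# launch the worker
DMLC_ROLE=worker ./test_benchmark
```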

### 2. Benchmark with IPC support

The `test_ipc_benchmark` demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker.

Suppose you have two machines. Each machine should launch a worker and a server process.

For the scheduler (you can launch it on either machine-0 or machine-1):
```bash
# common setup
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port (any free port works)
export DMLC_INTERFACE=eth5        # this machine's RDMA interface
# launch the scheduler
DMLC_ROLE=scheduler ./test_ipc_benchmark
```

For machine-0 and machine-1:

```bash
# common setup
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.2  # the scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=8123     # the scheduler's port (any free port works)
export DMLC_INTERFACE=eth5        # this machine's RDMA interface
# launch a server and a worker on this machine
DMLC_ROLE=server ./test_ipc_benchmark &
DMLC_ROLE=worker ./test_ipc_benchmark
```

Note: This benchmark is only valid for RDMA.

### Enable new features for higher performance

To avoid head-of-line blocking, set `USE_MULTI_THREAD_FOR_RECEIVING=1` to enable multi-threaded processing in the receiving functions of `customer.cc` (multi-threading stays off unless this flag is set explicitly). You can also set `NUM_RECEIVE_THREAD` to change the size of the receiving thread pool (the default is 2).
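
As a sketch, assuming these two settings are read as environment variables at process start (in the same way as the `DMLC_*` variables above), a worker could be launched with them as follows:

```bash
# Assumption: USE_MULTI_THREAD_FOR_RECEIVING and NUM_RECEIVE_THREAD are runtime
# environment variables; check customer.cc for how they are actually read.
export USE_MULTI_THREAD_FOR_RECEIVING=1
export NUM_RECEIVE_THREAD=4  # size of the receiving thread pool (default: 2)
DMLC_ROLE=worker ./test_benchmark
```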


## How to use

`ps-lite` provides asynchronous communication for other projects:
- Distributed deep neural networks:
[MXNet](https://github.com/dmlc/mxnet),
[CXXNET](https://github.com/dmlc/cxxnet) and
[Minerva](https://github.com/minerva-developers/minerva)
- Distributed high-dimensional inference, such as sparse logistic regression and
factorization machines:
[DiFacto](https://github.com/dmlc/difacto) and
[Wormhole](https://github.com/dmlc/wormhole)

## History

We have been working on the parameter server framework since 2010.

1. The first generation was
designed and optimized for specific algorithms, such as logistic regression and
LDA, to serve industrial-scale machine learning tasks (hundreds of billions of
examples and features, 10-100 TB of data).

2. Later we tried to build an open-source, general-purpose framework for machine learning
algorithms. The project is available at [dmlc/parameter_server](https://github.com/dmlc/parameter_server).

3. Given the growing demands from other projects, we created `ps-lite`, which provides a clean data communication API and a
lightweight implementation. The implementation is based on `dmlc/parameter_server`, but we refactored the job launchers, file I/O and machine
learning algorithm code into different projects such as `dmlc-core` and
`wormhole`.

4. From the experience gained while developing
[dmlc/mxnet](https://github.com/dmlc/mxnet), we further refactored the API and implementation from [v1](https://github.com/dmlc/ps-lite/releases/tag/v1). The main
changes include:
   - fewer library dependencies
   - more flexible user-defined callbacks, which facilitate other language bindings
   - letting the users (such as the dependency engine of mxnet) manage data consistency

## Research papers
1. Mu Li, Dave Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski,
James Long, Eugene Shekita, Bor-Yiing
Su. [Scaling Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf). In
Operating Systems Design and Implementation (OSDI), 2014
2. Mu Li, Dave Andersen, Alex Smola, and Kai
Yu. [Communication Efficient Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_nips14.pdf). In
Neural Information Processing Systems (NIPS), 2014
