diff --git a/README.md b/README.md
index dfe59845e..4436e6608 100644
--- a/README.md
+++ b/README.md
@@ -1,112 +1,120 @@
-
-
-[![Build Status](https://travis-ci.org/dmlc/ps-lite.svg?branch=master)](https://travis-ci.org/dmlc/ps-lite)
-[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
-
-A light and efficient implementation of the parameter server
-framework. It provides clean yet powerful APIs. For example, a worker node can
-communicate with the server nodes by
-- `Push(keys, values)`: push a list of (key, value) pairs to the server nodes
-- `Pull(keys)`: pull the values from servers for a list of keys
-- `Wait`: wait untill a push or pull finished.
-
-A simple example:
-
-```c++
-  std::vector<uint64_t> key = {1, 3, 5};
-  std::vector<float> val = {1, 1, 1};
-  std::vector<float> recv_val;
-  ps::KVWorker<float> w;
-  w.Wait(w.Push(key, val));
-  w.Wait(w.Pull(key, &recv_val));
+
+
+This is the communication library for [BytePS](https://github.com/bytedance/byteps). It is designed for high-performance RDMA, but it also supports TCP.
+
+## Build
+
+```bash
+git clone -b byteps https://github.com/bytedance/ps-lite
+cd ps-lite
+make -j USE_RDMA=1
```
-More features:
+Remove `USE_RDMA=1` if you don't want to build with RDMA support.
+
+## Concepts
+
+In ps-lite, there are three roles: worker, server, and scheduler. Each role is an independent process.
+
+The scheduler is responsible for setting up the connections between workers and servers at initialization. There must be exactly one scheduler process.
 
-- Flexible and high-performance communication: zero-copy push/pull, supporting
-  dynamic length values, user-defined filters for communication compression
-- Server-side programming: supporting user-defined handles on server nodes
+A worker process communicates only with server processes, and vice versa: there is no worker-to-worker or server-to-server traffic.
 
-### Build
-`ps-lite` requires a C++11 compiler such as `g++ >= 4.8`. On Ubuntu >= 13.10, we
-can install it by
+## Tutorial
+
+After building, you will find two test applications under the `tests/` directory, namely `test_benchmark` and `test_ipc_benchmark`.
+Below we explain how to run them.
+
+To debug, set `PS_VERBOSE=1` to see important logs during connection setup, and `PS_VERBOSE=2` to additionally log every message.
+
+### 1. Basic benchmark
+
+Suppose you want to run 1 worker and 1 server on different machines. You then need to launch 3 processes in total (including the scheduler). The scheduler can run on any machine; its placement does not affect performance.
+
+For the scheduler:
+
```
-sudo apt-get update && sudo apt-get install -y build-essential git
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen freely)
+export DMLC_INTERFACE=eth5 # my RDMA interface
+
+# launch scheduler
+DMLC_ROLE=scheduler test_benchmark
```
-Instructions for
-[older Ubuntu](http://ubuntuhandbook.org/index.php/2013/08/install-gcc-4-8-via-ppa-in-ubuntu-12-04-13-04/),
-[Centos](http://linux.web.cern.ch/linux/devtoolset/),
-and
-[Mac Os X](http://hpc.sourceforge.net/).
-Then clone and build
-```bash
-git clone https://github.com/dmlc/ps-lite
-cd ps-lite && make -j4
+For the server:
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen freely)
+export DMLC_INTERFACE=eth5 # my RDMA interface
+
+# launch server
+DMLC_ROLE=server test_benchmark
```
-### Build with RDMA support
+For the worker:
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen freely)
+export DMLC_INTERFACE=eth5 # my RDMA interface
+
+# launch worker
+DMLC_ROLE=worker test_benchmark
+```
 
-You can add `USE_RDMA=1` to enable RDMA support.
+If you want to use TCP instead, make sure to set `DMLC_ENABLE_RDMA=0` for all processes.
 
-```bash
-make -j $(nproc) USE_RDMA=1
+### 2. Benchmark with IPC support
+
+The `test_ipc_benchmark` demonstrates how inter-process communication (IPC) helps improve RDMA performance when the server is co-located with the worker.
+
+Suppose you have two machines. Each machine should launch a worker process and a server process.
+
+For the scheduler
+(you can launch it on either machine-0 or machine-1):
```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=2
+export DMLC_NUM_SERVER=2
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen freely)
+export DMLC_INTERFACE=eth5 # my RDMA interface
+
+# launch scheduler
+DMLC_ROLE=scheduler test_ipc_benchmark
+```
+
+For machine-0 and machine-1:
+
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=2
+export DMLC_NUM_SERVER=2
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (can be chosen freely)
+export DMLC_INTERFACE=eth5 # my RDMA interface
+
+# launch server and worker
+DMLC_ROLE=server test_ipc_benchmark &
+DMLC_ROLE=worker test_ipc_benchmark
+```
+
+
+Note: This benchmark is only valid for RDMA.
-
-### Enable new features for higher performance
-
-To avoid head-of-line blocking, use `USE_MULTI_THREAD_FOR_RECEIVING=1` to enable multi-threading
-in the receiving functions of `customer.cc`
-(this flag must be set explicitly to enable multi-threading).
-You can also set `NUM_RECEIVE_THREAD` to change the number of threads of the thread pool
-(default is 2).
-
-
-### How to use
-
-`ps-lite` provides asynchronous communication for other projects:
-- Distributed deep neural networks:
-  [MXNet](https://github.com/dmlc/mxnet),
-  [CXXNET](https://github.com/dmlc/cxxnet) and
-  [Minverva](https://github.com/minerva-developers/minerva)
-- Distributed high dimensional inference, such as sparse logistic regression,
-  factorization machines:
-  [DiFacto](https://github.com/dmlc/difacto)
-  [Wormhole](https://github.com/dmlc/wormhole)
-
-### History
-
-We started to work on the parameter server framework since 2010.
-
-1. The first generation was
-designed and optimized for specific algorithms, such as logistic regression and
-LDA, to serve the sheer size industrial machine learning tasks (hundreds billions of
-examples and features with 10-100TB data size) .
-
-2. Later we tried to build a open-source general purpose framework for machine learning
-algorithms. The project is available at [dmlc/parameter_server](https://github.com/dmlc/parameter_server).
-
-3. Given the growing demands from other projects, we created `ps-lite`, which provides a clean data communication API and a
-lightweight implementation. The implementation is based on `dmlc/parameter_server`, but we refactored the job launchers, file I/O and machine
-learning algorithms codes into different projects such as `dmlc-core` and
-`wormhole`.
-
-4. From the experience we learned during developing
-   [dmlc/mxnet](https://github.com/dmlc/mxnet), we further refactored the API and implementation from [v1](https://github.com/dmlc/ps-lite/releases/tag/v1). The main
-   changes include
-   - less library dependencies
-   - more flexible user-defined callbacks, which facilitate other language
-     bindings
-   - let the users, such as the dependency
-     engine of mxnet, manage the data consistency
-
-### Research papers
-1. Mu Li, Dave Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski,
-   James Long, Eugene Shekita, Bor-Yiing Su.
-   [Scaling Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf).
-   In Operating Systems Design and Implementation (OSDI), 2014
-2. Mu Li, Dave Andersen, Alex Smola, and Kai Yu.
-   [Communication Efficient Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_nips14.pdf).
-   In Neural Information Processing Systems (NIPS), 2014
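
The per-role environment setup from the tutorial above can be collected into a single launcher for a quick smoke test of all three roles on one machine over TCP. This is only a sketch: the loopback address, port, interface name, and binary path below are illustrative placeholders, and it assumes the repository was already built with `make -j` from the source root.

```shell
#!/usr/bin/env bash
# Sketch: single-machine TCP smoke test for test_benchmark.
# Assumptions: run from the ps-lite source root after building;
# the IP, port, and interface values are placeholders.
set -euo pipefail

# common setup, shared by all three roles on this host
export DMLC_ENABLE_RDMA=0          # TCP instead of RDMA
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=127.0.0.1  # loopback, since all roles share one host
export DMLC_PS_ROOT_PORT=8123      # any free port
export DMLC_INTERFACE=lo           # loopback interface

# scheduler and server in the background, worker in the foreground
DMLC_ROLE=scheduler ./tests/test_benchmark &
DMLC_ROLE=server    ./tests/test_benchmark &
DMLC_ROLE=worker    ./tests/test_benchmark
wait
```

If the processes hang during startup, rerun with `PS_VERBOSE=1` set to check whether all three roles actually reach the scheduler.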