diff --git a/README.md b/README.md
index dfe59845e..4436e6608 100644
--- a/README.md
+++ b/README.md
@@ -1,112 +1,120 @@
-
-
-[![Build Status](https://travis-ci.org/dmlc/ps-lite.svg?branch=master)](https://travis-ci.org/dmlc/ps-lite)
-[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
-
-A light and efficient implementation of the parameter server
-framework. It provides clean yet powerful APIs. For example, a worker node can
-communicate with the server nodes by
-- `Push(keys, values)`: push a list of (key, value) pairs to the server nodes
-- `Pull(keys)`: pull the values from servers for a list of keys
-- `Wait`: wait untill a push or pull finished.
-
-A simple example:
-
-```c++
- std::vector key = {1, 3, 5};
- std::vector val = {1, 1, 1};
- std::vector recv_val;
- ps::KVWorker w;
- w.Wait(w.Push(key, val));
- w.Wait(w.Pull(key, &recv_val));
+
+
+This is the communication library for [BytePS](https://github.com/bytedance/byteps). It is designed for high-performance RDMA, but it also supports TCP.
+
+## Build
+
+```bash
+git clone -b byteps https://github.com/bytedance/ps-lite
+cd ps-lite
+make -j USE_RDMA=1
```
-More features:
+Remove `USE_RDMA=1` if you don't want to build with RDMA support.
+
+## Concepts
+
+In ps-lite, there are three roles: worker, server and scheduler. Each role is an independent process.
+
+The scheduler is responsible for setting up the connections between workers and servers at initialization. There must be exactly one scheduler process.
-- Flexible and high-performance communication: zero-copy push/pull, supporting
- dynamic length values, user-defined filters for communication compression
-- Server-side programming: supporting user-defined handles on server nodes
+A worker process only communicates with server processes, and vice versa. There is no worker-to-worker or server-to-server traffic.
-### Build
-`ps-lite` requires a C++11 compiler such as `g++ >= 4.8`. On Ubuntu >= 13.10, we
-can install it by
+## Tutorial
+
+After building, you will find two test applications under the `tests/` directory: `test_benchmark` and `test_ipc_benchmark`.
+Below we explain how to run them.
+
+To debug, set `PS_VERBOSE=1` to log important events during connection setup, or `PS_VERBOSE=2` to additionally log every message.
+
+### 1. Basic benchmark
+
+Suppose you want to run 1 worker and 1 server on different machines. You then need to launch 3 processes in total (including the scheduler). The scheduler process can run on any machine, as it does not affect performance.
+
+For the scheduler:
+
```
-sudo apt-get update && sudo apt-get install -y build-essential git
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (any free port works)
+export DMLC_INTERFACE=eth5 # this machine's RDMA interface
+
+# launch scheduler
+DMLC_ROLE=scheduler ./tests/test_benchmark
```
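Two of these values are machine-specific. As a sketch (not part of the benchmark itself), here is one way to look up the IP of a network interface and pick a port; `lo` is used as a runnable stand-in for a real RDMA interface name such as `eth5`, and the port is simply drawn from the dynamic range:

```shell
# Look up the IPv4 address of a network interface ("lo" is a stand-in;
# substitute your RDMA interface, e.g. eth5).
IFACE=lo
SCHED_IP=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1)

# Any free port works for the scheduler; here we just pick one at random.
SCHED_PORT=$(( (RANDOM % 20000) + 10000 ))

echo "DMLC_PS_ROOT_URI=$SCHED_IP DMLC_PS_ROOT_PORT=$SCHED_PORT"
```

Use the printed values for `DMLC_PS_ROOT_URI` and `DMLC_PS_ROOT_PORT` on every process, since all of them must agree on where the scheduler listens.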
-Instructions for
-[older Ubuntu](http://ubuntuhandbook.org/index.php/2013/08/install-gcc-4-8-via-ppa-in-ubuntu-12-04-13-04/),
-[Centos](http://linux.web.cern.ch/linux/devtoolset/),
-and
-[Mac Os X](http://hpc.sourceforge.net/).
-Then clone and build
-```bash
-git clone https://github.com/dmlc/ps-lite
-cd ps-lite && make -j4
+For the server:
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (any free port works)
+export DMLC_INTERFACE=eth5 # this machine's RDMA interface
+
+# launch server
+DMLC_ROLE=server ./tests/test_benchmark
```
-### Build with RDMA support
+For the worker:
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=1
+export DMLC_NUM_SERVER=1
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (any free port works)
+export DMLC_INTERFACE=eth5 # this machine's RDMA interface
+
+# launch worker
+DMLC_ROLE=worker ./tests/test_benchmark
+```
-You can add `USE_RDMA=1` to enable RDMA support.
+To use TCP instead, set `DMLC_ENABLE_RDMA=0` for all processes.
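To sanity-check the environment setup before touching multiple machines, the following hypothetical dry-run helper prints the three launch commands for a single-machine TCP run. It only echoes the commands; the `./tests/test_benchmark` path assumes you build in the repository root:

```shell
# Environment shared by all three roles (TCP transport, single machine).
common="DMLC_ENABLE_RDMA=0 DMLC_NUM_WORKER=1 DMLC_NUM_SERVER=1"
common="$common DMLC_PS_ROOT_URI=127.0.0.1 DMLC_PS_ROOT_PORT=8123"

# Print one launch command per role for review. When running for real,
# put the scheduler and server in the background and the worker in the
# foreground.
cmds=""
for role in scheduler server worker; do
  cmds="${cmds}env $common DMLC_ROLE=$role ./tests/test_benchmark
"
done
printf '%s' "$cmds"
```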
-```bash
-make -j $(nproc) USE_RDMA=1
+### 2. Benchmark with IPC support
+
+The `test_ipc_benchmark` application demonstrates how inter-process communication (IPC) improves RDMA performance when a server is co-located with a worker.
+
+Suppose you have two machines. Each machine should launch a worker and a server process.
+
+For the scheduler (you can launch it on either machine-0 or machine-1):
```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=2
+export DMLC_NUM_SERVER=2
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (any free port works)
+export DMLC_INTERFACE=eth5 # this machine's RDMA interface
+
+# launch scheduler
+DMLC_ROLE=scheduler ./tests/test_ipc_benchmark
+```
+
+For machine-0 and machine-1:
+
+```
+# common setup
+export DMLC_ENABLE_RDMA=1
+export DMLC_NUM_WORKER=2
+export DMLC_NUM_SERVER=2
+export DMLC_PS_ROOT_URI=10.0.0.2 # scheduler's RDMA interface IP
+export DMLC_PS_ROOT_PORT=8123 # scheduler's port (any free port works)
+export DMLC_INTERFACE=eth5 # this machine's RDMA interface
+
+# launch server and worker
+DMLC_ROLE=server ./tests/test_ipc_benchmark &
+DMLC_ROLE=worker ./tests/test_ipc_benchmark
+```
+
+Note: this benchmark only works with RDMA.
-### Enable new features for higher performance
-
-To avoid head-of-line blocking, use `USE_MULTI_THREAD_FOR_RECEIVING=1` to enable multi-threading
-in the receiving functions of `customer.cc`
-(this flag must be set explicitly to enable multi-threading).
- You can also set `NUM_RECEIVE_THREAD` to change the number of threads of the thread pool
-(default is 2).
-
-
-### How to use
-
-`ps-lite` provides asynchronous communication for other projects:
- - Distributed deep neural networks:
- [MXNet](https://github.com/dmlc/mxnet),
- [CXXNET](https://github.com/dmlc/cxxnet) and
- [Minverva](https://github.com/minerva-developers/minerva)
- - Distributed high dimensional inference, such as sparse logistic regression,
- factorization machines:
- [DiFacto](https://github.com/dmlc/difacto)
- [Wormhole](https://github.com/dmlc/wormhole)
-
-### History
-
-We started to work on the parameter server framework since 2010.
-
-1. The first generation was
-designed and optimized for specific algorithms, such as logistic regression and
-LDA, to serve the sheer size industrial machine learning tasks (hundreds billions of
-examples and features with 10-100TB data size) .
-
-2. Later we tried to build a open-source general purpose framework for machine learning
-algorithms. The project is available at [dmlc/parameter_server](https://github.com/dmlc/parameter_server).
-
-3. Given the growing demands from other projects, we created `ps-lite`, which provides a clean data communication API and a
-lightweight implementation. The implementation is based on `dmlc/parameter_server`, but we refactored the job launchers, file I/O and machine
-learning algorithms codes into different projects such as `dmlc-core` and
-`wormhole`.
-
-4. From the experience we learned during developing
- [dmlc/mxnet](https://github.com/dmlc/mxnet), we further refactored the API and implementation from [v1](https://github.com/dmlc/ps-lite/releases/tag/v1). The main
- changes include
- - less library dependencies
- - more flexible user-defined callbacks, which facilitate other language
- bindings
- - let the users, such as the dependency
- engine of mxnet, manage the data consistency
-
-### Research papers
- 1. Mu Li, Dave Andersen, Alex Smola, Junwoo Park, Amr Ahmed, Vanja Josifovski,
- James Long, Eugene Shekita, Bor-Yiing
- Su. [Scaling Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf). In
- Operating Systems Design and Implementation (OSDI), 2014
- 2. Mu Li, Dave Andersen, Alex Smola, and Kai
- Yu. [Communication Efficient Distributed Machine Learning with the Parameter Server](http://www.cs.cmu.edu/~muli/file/parameter_server_nips14.pdf). In
- Neural Information Processing Systems (NIPS), 2014