Benchmark Update #10

Merged: 8 commits, May 17, 2016
README.md (150 changes: 114 additions, 36 deletions)
# rnn-benchmarks

All benchmarks are reported for a host with the following specifications:
* NVIDIA GeForce GTX TITAN X GPU
* Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
* CUDA 7.5, cuDNN v5

These benchmarks compare the running time of various recurrent neural networks on different deep-learning libraries.
The networks (RNN or LSTM) take as input a 3D tensor of shape `batch_size x seq_length x hidden_size`,
output the last hidden state, compute an MSE loss, backpropagate the errors through the network, and apply a simple parameter update (`params = params - lr*gradParams`).
The sequence length is always set to `30`.
`hidden_size` sets the size of both the input and output layers of the network.
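
For concreteness, here is a minimal NumPy sketch of one such training step for the vanilla RNN case (illustrative only: the benchmarks use each library's own implementation, and all names below are ours):

```python
import numpy as np

# Illustrative dimensions matching one benchmark configuration.
batch_size, seq_length, hidden_size = 32, 30, 128
lr = 0.01  # assumed learning rate; the actual scripts set their own

x = np.random.randn(batch_size, seq_length, hidden_size)  # input tensor
target = np.random.randn(batch_size, hidden_size)          # MSE target

# Vanilla RNN parameters (input and output layers share hidden_size).
W = 0.01 * np.random.randn(hidden_size, hidden_size)
U = 0.01 * np.random.randn(hidden_size, hidden_size)
b = np.zeros(hidden_size)

# Forward: run the recurrence, keeping every hidden state for backprop.
hs = [np.zeros((batch_size, hidden_size))]
for t in range(seq_length):
    hs.append(np.tanh(x[:, t, :].dot(W) + hs[-1].dot(U) + b))
loss = np.mean((hs[-1] - target) ** 2)

# Backward: backpropagate the MSE error through the 30 time steps.
dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
dh = 2.0 * (hs[-1] - target) / hs[-1].size
for t in reversed(range(seq_length)):
    dpre = dh * (1.0 - hs[t + 1] ** 2)   # gradient through tanh
    dW += x[:, t, :].T.dot(dpre)
    dU += hs[t].T.dot(dpre)
    db += dpre.sum(axis=0)
    dh = dpre.dot(U.T)

# Simple update, as in the benchmarks: params = params - lr * gradParams
for p, g in ((W, dW), (U, dU), (b, db)):
    p -= lr * g
```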

The code for the scripts we ran is available.
For each model and library, we use the fastest implementation we were able to find.
If you are aware of a faster one, please let us know.
We've reported results on Theano, Torch and TensorFlow so far, and we will try to include many more libraries in the future (including cuDNN very soon).

The reported `Train` time is the average time needed to run (forward, backward, and update) a single training example (not a training batch), so smaller is better.
We also report `Compile` time, which includes symbolic graph optimization (Theano and TensorFlow compilation) as well as one forward and backward pass (to allocate memory).
While compilation time isn't really a factor in production, it does increase debugging time, which is why we report it here.
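
The per-sample numbers in the tables can be recomputed from the raw logs in each library's subdirectory. A quick check against the TensorFlow log quoted further down:

```python
# "--- 32000 samples in 26.2391839027 seconds ... 0.0008200 s/sample ---"
samples, seconds = 32000, 26.2391839027
print(seconds / samples * 1e6)  # ~820.0 microseconds: the TensorFlow `Train` entry
```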

## LSTM

The LSTM implementation used for these benchmarks does not use peephole connections between the cell and the gates.
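
For reference, a standard non-peephole LSTM computes the following (our notation; a peephole variant would additionally feed the previous cell state into the gates):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```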

### Batch Size 32

#### Hidden Size 128

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 7.46 | 289.6 | 99.1 |
| Torch | 0.03 | 434.4 | 99.9 |
| TensorFlow | 3.91 | 820.0 | 266.7 |


#### Hidden Size 512

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 7.59 | 619.4 | 200.9 |
| Torch | 0.19 | 610.7 | 201.7 |
| TensorFlow | 3.97 | 886.9 | 324.9 |


#### Hidden Size 1024

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 9.62 | 1013.5 | 324.1 |
| Torch | 0.69 | 1139.8 | 346.3 |
| TensorFlow | 3.81 | 1329.2 | 562.7 |


### Batch Size 128

#### Hidden Size 128

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 7.38 | 102.9 | 25.6 |
| Torch | 0.03 | 109.8 | 25.2 |
| TensorFlow | 3.68 | 188.6 | 65.0 |

#### Hidden Size 512

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 7.50 | 256.0 | 62.8 |
| Torch | 0.20 | 214.3 | 51.4 |
| TensorFlow | 3.73 | 255.2 | 114.2 |

#### Hidden Size 1024

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 7.45 | 583.4 | 160.2 |
| Torch | 0.75 | 558.1 | 112.4 |
| TensorFlow | 3.84 | 592.2 | 238.1 |


## RNN

This section benchmarks a simple RNN implementation.
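
The recurrence being timed is the usual tanh RNN (notation as in the training-step sketch above):

```latex
h_t = \tanh(W x_t + U h_{t-1} + b)
```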

### Batch Size 32

#### Hidden Size 128

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.31 | 104.6 | 30.9 |
| Torch | 0.05 | 259.53 | 103.06 |
| TensorFlow | 1.64 | 278.4 | 111.5 |

#### Hidden Size 512

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.36 | 275.2 | 102.2 |
| Torch | 0.05 | 288.2 | 114.6 |
| TensorFlow | 1.62 | 349.7 | 218.4 |

#### Hidden Size 1024

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.44 | 443.8 | 179.5 |
| Torch | 0.09 | 381.4 | 118.8 |
| TensorFlow | 1.72 | 530.0 | 241.7 |

### Batch Size 128

#### Hidden Size 128

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.48 | 45.4 | 13.7 |
| Torch | 0.08 | 67.7 | 32.7 |
| TensorFlow | 1.70 | 75.5 | 33.6 |

#### Hidden Size 512

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.40 | 79.0 | 23.8 |
| Torch | 0.09 | 73.5 | 34.2 |
| TensorFlow | 1.63 | 125.6 | 86.8 |

#### Hidden Size 1024

| Library | Compile (s) | Train (µs) | Forward only (µs) |
| ------------- | ------------- | ------------- | ------------- |
| Theano | 4.38 | 147.8 | 50.3 |
| Torch | 0.13 | 150.2 | 64.7 |
| TensorFlow | 1.70 | 222.5 | 137.8 |
tensorflow/README.md (168 changes: 146 additions, 22 deletions)
# TensorFlow benchmarks

Provided by Maarten Bosma.

I used the built-in RNN library. `basic_lstm` is the TensorFlow equivalent of FastLSTM.
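
As a rough sketch of the kind of graph such a script builds (assuming the TF 0.8-era API, `tf.nn.rnn_cell.BasicLSTMCell` and `tf.nn.rnn`; the actual rnn.py may differ):

```python
import tensorflow as tf

seq_length, batch_size, hidden_size = 30, 32, 128

# tf.nn.rnn takes a Python list of 2D tensors, one per time step.
inputs = [tf.placeholder(tf.float32, [batch_size, hidden_size])
          for _ in range(seq_length)]
target = tf.placeholder(tf.float32, [batch_size, hidden_size])

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
outputs, state = tf.nn.rnn(cell, inputs, dtype=tf.float32)

# MSE on the last hidden state, plain SGD update (learning rate assumed).
loss = tf.reduce_mean(tf.square(outputs[-1] - target))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```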

These results were produced using TensorFlow 0.8, CUDA 7.5, and cuDNN v5, with the ondemand CPU governor turned off [1], on an Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz and a Titan X.
The headings below follow `seq_length x batch_size x hidden_size`.

To install TensorFlow from source:
* https://www.tensorflow.org/versions/r0.8/get_started/os_setup.html#installing-from-sources
* http://stackoverflow.com/questions/34239537/how-to-update-tensorflow-from-source

## Fast LSTM



### 30 x 32 x 128

```
python rnn.py -n basic_lstm -b 32 -l 128 -s 30
Setup : compile + forward/backward x 1
--- 3.91482686996 seconds
Forward:
--- 32000 samples in 8.53500294685 seconds (3749.266427 samples/s, 0.0002667 s/sample) ---
Forward + Backward:
--- 32000 samples in 26.2391839027 seconds (1219.550125 samples/s, 0.0008200 s/sample) ---
```

### 30 x 32 x 512

```
python rnn.py -n basic_lstm -b 32 -l 512 -s 30
Setup : compile + forward/backward x 1
--- 3.97159981728 seconds
Forward:
--- 32000 samples in 10.3965659142 seconds (3077.939414 samples/s, 0.0003249 s/sample) ---
Forward + Backward:
--- 32000 samples in 28.3808200359 seconds (1127.522036 samples/s, 0.0008869 s/sample) ---
```

### 30 x 32 x 1024


```
python rnn.py -n basic_lstm -b 32 -l 1024 -s 30
Setup : compile + forward/backward x 1
--- 3.81890392303 seconds
Forward:
--- 32000 samples in 18.0062820911 seconds (1777.157541 samples/s, 0.0005627 s/sample) ---
Forward + Backward:
--- 32000 samples in 42.533454895 seconds (752.348947 samples/s, 0.0013292 s/sample) ---
```


### 30 x 128 x 128

```
python rnn.py -n basic_lstm -b 128 -l 128 -s 30
Setup : compile + forward/backward x 1
--- 3.68258690834 seconds
Forward:
--- 128000 samples in 8.3175599575 seconds (15389.128621 samples/s, 0.0000650 s/sample) ---
Forward + Backward:
--- 128000 samples in 24.1425020695 seconds (5301.853123 samples/s, 0.0001886 s/sample) ---

```

### 30 x 128 x 512

```
python rnn.py -n basic_lstm -b 128 -l 512 -s 30
Setup : compile + forward/backward x 1
--- 3.72586607933 seconds
Forward:
--- 128000 samples in 14.6179850101 seconds (8756.336794 samples/s, 0.0001142 s/sample) ---
Forward + Backward:
--- 128000 samples in 32.6627261639 seconds (3918.840067 samples/s, 0.0002552 s/sample) ---

```

### 30 x 128 x 1024

```
python rnn.py -n basic_lstm -b 128 -l 1024 -s 30
Setup : compile + forward/backward x 1
--- 3.84206986427 seconds
Forward:
--- 128000 samples in 30.4814198017 seconds (4199.279457 samples/s, 0.0002381 s/sample) ---
Forward + Backward:
--- 128000 samples in 75.8014390469 seconds (1688.622295 samples/s, 0.0005922 s/sample) ---
```

## RNN

### 30 x 32 x 128

```
python rnn.py -n rnn -b 32 -l 128 -s 30
Setup : compile + forward/backward x 1
--- 1.6487121582 seconds
Forward:
--- 32000 samples in 3.56794595718 seconds (8968.745711 samples/s, 0.0001115 s/sample) ---
Forward + Backward:
--- 32000 samples in 8.91037988663 seconds (3591.317139 samples/s, 0.0002784 s/sample) ---
```

### 30 x 32 x 512

```
python rnn.py -n rnn -b 32 -l 512 -s 30
Setup : compile + forward/backward x 1
--- 1.62368106842 seconds
Forward:
--- 32000 samples in 6.98823904991 seconds (4579.122118 samples/s, 0.0002184 s/sample) ---
Forward + Backward:
--- 32000 samples in 11.1912858486 seconds (2859.367586 samples/s, 0.0003497 s/sample) ---
```

### 30 x 32 x 1024

```
python rnn.py -n rnn -b 32 -l 1024 -s 30
Setup : compile + forward/backward x 1
--- 1.72744393349 seconds
Forward:
--- 32000 samples in 7.73560094833 seconds (4136.718041 samples/s, 0.0002417 s/sample) ---
Forward + Backward:
--- 32000 samples in 16.9597899914 seconds (1886.815816 samples/s, 0.0005300 s/sample) ---
```

### 30 x 128 x 128

```
python rnn.py -n rnn -b 128 -l 128 -s 30
Setup : compile + forward/backward x 1
--- 1.698335886 seconds
Forward:
--- 128000 samples in 4.29631710052 seconds (29792.959180 samples/s, 0.0000336 s/sample) ---
Forward + Backward:
--- 128000 samples in 9.66468191147 seconds (13244.098582 samples/s, 0.0000755 s/sample) ---
```

### 30 x 128 x 512

```
python rnn.py -n rnn -b 128 -l 512 -s 30
Setup : compile + forward/backward x 1
--- 1.63733696938 seconds
Forward:
--- 128000 samples in 11.1102721691 seconds (11520.869881 samples/s, 0.0000868 s/sample) ---
Forward + Backward:
--- 128000 samples in 16.0786859989 seconds (7960.849538 samples/s, 0.0001256 s/sample) ---
```

### 30 x 128 x 1024

```
python rnn.py -n rnn -b 128 -l 1024 -s 30
Setup : compile + forward/backward x 1
--- 1.7014939785 seconds
Forward:
--- 128000 samples in 17.6321749687 seconds (7259.456092 samples/s, 0.0001378 s/sample) ---
Forward + Backward:
--- 128000 samples in 28.4844169617 seconds (4493.685097 samples/s, 0.0002225 s/sample) ---

```


[1] Turning on the performance governor: `sudo bash -c 'for i in /sys/devices/system/cpu/*/cpufreq/scaling_governor; do echo performance > $i; done'`
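
To verify the active governor on a given core: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`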