added benchmark results of keccak256 based binary merklization on multiple platforms ( cpu, gpu etc. )
itzmeanjan committed Mar 14, 2022
1 parent f03707f commit 3a5fa29
Showing 4 changed files with 165 additions and 2 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -4,7 +4,7 @@ SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3

## Motivation

-After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
+After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions ( along with `keccak256` ). BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 ( or keccak256 ) families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.

```bash
((a, b), (c, d)) < --- [Level 1] [Root]
@@ -25,7 +25,7 @@ input = [a, b, c, d]
output = [0, ((a, b), (c, d)), (a, b), (c, d)]
```
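
The level-by-level merging described above, and the `[0, root, (a, b), (c, d)]` output layout, can be sketched sequentially in Python. This is only an illustration under stated assumptions: the 2-to-1 hash is taken to be SHA2-256 over the 64-byte concatenation of two digests, and the names `merge` and `merklize` are hypothetical, not the SYCL kernels kept in this repository. Within each level every loop iteration is independent, which is the parallelism the kernels exploit.

```python
import hashlib
from typing import List

DIGEST_LEN = 32  # width of one SHA2-256 digest in bytes


def merge(left: bytes, right: bytes) -> bytes:
    """2-to-1 hash (assumed form): digest of the two concatenated child digests."""
    return hashlib.sha256(left + right).digest()


def merklize(leaves: List[bytes]) -> List[bytes]:
    """Return all intermediate nodes of the tree for N = 2^i leaves.

    Output has N slots: slot 0 is unused (zeroed), slot 1 is the root and
    slot k holds the parent of slots 2k and 2k + 1.
    """
    n = len(leaves)
    if n < 2 or n & (n - 1):
        raise ValueError("leaf count must be a power of two, at least 2")

    nodes = [bytes(DIGEST_LEN)] * n
    half = n // 2

    # Level just above the leaves: N/2 independent merges.
    for i in range(half):
        nodes[half + i] = merge(leaves[2 * i], leaves[2 * i + 1])

    # Remaining levels: N/4, N/8, ... merges, ending with the single root.
    width = half // 2
    while width >= 1:
        for i in range(width, 2 * width):
            nodes[i] = merge(nodes[2 * i], nodes[2 * i + 1])
        width //= 2

    return nodes
```

With four leaves `a, b, c, d`, `merklize([a, b, c, d])` fills slots 2 and 3 first and then slot 1, matching the depiction above.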

-Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function, which one to use is compile-time choice using pre-processor directive.
+Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function ( along with keccak256 ), which one to use is compile-time choice using pre-processor directive.

If you happen to be interested in Binary Merklization using Rescue Prime Hash/ BLAKE3, consider seeing following links.

@@ -36,6 +36,8 @@ If you happen to be interested in Binary Merklization using Rescue Prime Hash/ B
> During SHA3 implementations, I've followed SHA-3 Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.202).

+> During Keccak256 implementation, I took some inspiration from [here](https://keccak.team/files/Keccak-implementation-3.2.pdf); though note that, keccak256 & sha3-256 are very much similar, except input message padding rule; see https://github.com/itzmeanjan/merklize-sha/pull/10 PR description.
+
> Using SHA1 for binary merklization may not be a good choice these days, see [here](https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions). But still I'm keeping SHA1 implementation, just as a reference.

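
The padding difference between keccak256 and sha3-256 mentioned in the note above can be illustrated for a single 64-byte input (two concatenated 32-byte digests). This is a minimal sketch, assuming the usual byte-level convention: original Keccak starts the padding with byte `0x01`, FIPS 202 SHA3 starts it with `0x06`, and both set the top bit of the last byte of the 136-byte rate block. The helper names are illustrative and are not taken from this repository.

```python
RATE_BYTES = 136  # (1600 - 2 * 256) / 8, rate of the 256-bit variants


def pad_block(message: bytes, first_pad_byte: int) -> bytes:
    """Pad a short message into exactly one rate-sized block."""
    if len(message) >= RATE_BYTES:
        raise ValueError("this sketch only handles single-block messages")

    block = bytearray(RATE_BYTES)
    block[: len(message)] = message
    block[len(message)] ^= first_pad_byte  # start of padding
    block[-1] ^= 0x80                      # end of padding
    return bytes(block)


def sha3_256_pad(message: bytes) -> bytes:
    return pad_block(message, 0x06)


def keccak256_pad(message: bytes) -> bytes:
    return pad_block(message, 0x01)
```

For the same 64-byte input, the two padded blocks differ in exactly one byte, at offset 64. Everything after padding (absorbing the block, the permutation, squeezing 32 bytes) is shared by both functions.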
## Prerequisites
@@ -153,5 +155,9 @@ I'm keeping binary merklization benchmark results of
  - [Nvidia GPU(s)](results/sha3-512/nvidia_gpu.md)
  - [Intel CPU(s)](results/sha3-512/intel_cpu.md)
  - [Intel GPU(s)](results/sha3-512/intel_gpu.md)
+- KECCAK-256
+  - [Nvidia GPU(s)](results/keccak-256/nvidia_gpu.md)
+  - [Intel CPU(s)](results/keccak-256/intel_cpu.md)
+  - [Intel GPU(s)](results/keccak-256/intel_gpu.md)

obtained after executing them on multiple accelerators.
109 changes: 109 additions & 0 deletions results/keccak-256/intel_cpu.md
@@ -0,0 +1,109 @@
### Binary Merklization using KECCAK-256 on Intel CPU(s)

Compiling with

```bash
SHA=keccak_256 make aot_cpu
```

### On `Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz`

```bash
$ lscpu | grep -i cpu\(s\)

CPU(s): 4
On-line CPU(s) list: 0-3
NUMA node0 CPU(s): 0-3
```

```bash
running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20         466.478477 ms               3.288778 ms               3.442020 ms
2 ^ 21         898.963977 ms               6.508914 ms               6.558546 ms
2 ^ 22            1.797621 s              13.061319 ms              13.172746 ms
2 ^ 23            3.591501 s              27.324937 ms              27.123078 ms
2 ^ 24            7.186666 s              54.148528 ms              54.237210 ms
2 ^ 25           14.404052 s             123.865217 ms             108.246855 ms
```
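
To compare rows across devices, it can help to convert a row into 2-to-1 hashes per second. The sketch below assumes that the reported execution time covers all `N - 1` merges needed for `N` leaves (`N/2 + N/4 + ... + 1`), which follows from the README's description but is not stated in the benchmark output itself. The function names are hypothetical.

```python
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def parse_row(row: str):
    """Parse a row such as '2 ^ 20 466.478477 ms 3.288778 ms 3.442020 ms'.

    Returns (leaf_count, exec_seconds, host_to_device_seconds,
    device_to_host_seconds). Extra whitespace between columns is ignored.
    """
    tok = row.split()
    if len(tok) != 9 or tok[0] != "2" or tok[1] != "^":
        raise ValueError(f"unexpected row format: {row!r}")

    leaf_count = 1 << int(tok[2])
    # Each timing is a (value, unit) pair of tokens.
    exec_s, h2d_s, d2h_s = (
        float(tok[i]) * UNIT_TO_SECONDS[tok[i + 1]] for i in (3, 5, 7)
    )
    return leaf_count, exec_s, h2d_s, d2h_s


def merges_per_second(row: str) -> float:
    """2-to-1 hash invocations per second, assuming N - 1 merges per tree."""
    leaf_count, exec_s, _, _ = parse_row(row)
    return (leaf_count - 1) / exec_s
```

For the first row above, `merges_per_second("2 ^ 20 466.478477 ms 3.288778 ms 3.442020 ms")` comes out at roughly 2.2 million merges per second. Transfer times are excluded from that figure.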

### On `Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz`

```bash
$ lscpu | grep -i cpu\(s\)

CPU(s): 128
On-line CPU(s) list: 0-127
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
```

```bash
running on Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20          13.362355 ms               1.821476 ms               1.326708 ms
2 ^ 21          20.922397 ms               3.589614 ms               2.430955 ms
2 ^ 22          33.674320 ms               6.493885 ms               4.294246 ms
2 ^ 23         106.859444 ms              11.947260 ms               8.593155 ms
2 ^ 24         117.165222 ms              23.851139 ms               8.417020 ms
2 ^ 25         233.647003 ms              25.051263 ms              16.673447 ms
```

### On `Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz`

```bash
$ lscpu | grep -i cpu\(s\)

CPU(s): 24
On-line CPU(s) list: 0-23
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
```

```bash
running on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20          34.571529 ms               1.809763 ms             897.616875 us
2 ^ 21          61.404680 ms               3.326612 ms               1.588368 ms
2 ^ 22         117.968746 ms               5.674248 ms               7.157974 ms
2 ^ 23         231.852088 ms               9.238144 ms              13.273680 ms
2 ^ 24         462.241001 ms              20.315251 ms              12.602417 ms
2 ^ 25         924.972606 ms              31.446401 ms              24.707977 ms
```

### On `Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz`

```bash
$ lscpu | grep -i cpu\(s\)

CPU(s): 12
On-line CPU(s) list: 0-11
NUMA node0 CPU(s): 0-11
```

```bash
running on Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20          73.894415 ms             932.138625 us             850.445250 us
2 ^ 21         109.423621 ms               1.782943 ms               1.715456 ms
2 ^ 22         218.244072 ms               3.493360 ms               3.446031 ms
2 ^ 23         436.918616 ms               6.905427 ms               6.842661 ms
2 ^ 24         883.594877 ms              13.812258 ms              13.749230 ms
2 ^ 25            1.930962 s              27.554382 ms              27.591307 ms
```
24 changes: 24 additions & 0 deletions results/keccak-256/intel_gpu.md
@@ -0,0 +1,24 @@
### Binary Merklization using KECCAK-256 on Intel GPU(s)

Compiling with

```bash
SHA=keccak_256 make aot_gpu
```

### On `Intel(R) UHD Graphics P630 [0x3e96]`

```bash
running on Intel(R) UHD Graphics P630 [0x3e96]


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20         108.488926 ms               1.332275 ms             745.381500 us
2 ^ 21         212.384799 ms               1.497735 ms               1.454533 ms
2 ^ 22         422.459127 ms               5.289694 ms               2.832562 ms
2 ^ 23         841.035348 ms               5.684048 ms               5.597084 ms
2 ^ 24            1.679276 s              11.176738 ms              11.080438 ms
2 ^ 25            3.355854 s              22.150604 ms              22.356589 ms
```
24 changes: 24 additions & 0 deletions results/keccak-256/nvidia_gpu.md
@@ -0,0 +1,24 @@
### Binary Merklization using KECCAK-256 on Nvidia GPU(s)

Compile with

```bash
SHA=keccak_256 make cuda
```

### On `Tesla V100-SXM2-16GB`

```bash
running on Tesla V100-SXM2-16GB


Benchmarking Binary Merklization using KECCAK-256

leaf count    execution time    host-to-device tx time    device-to-host tx time
2 ^ 20         751.924875 us               1.167792 ms               1.005363 ms
2 ^ 21           1.344910 ms               2.304931 ms               2.016678 ms
2 ^ 22           2.517974 ms               4.593017 ms               4.025208 ms
2 ^ 23           4.864380 ms               9.128906 ms               8.053345 ms
2 ^ 24           8.179686 ms              18.250488 ms              16.049194 ms
2 ^ 25          16.144776 ms              36.534668 ms              32.099121 ms
```
