added benchmark results of keccak256 based binary merklization on multiple platforms ( cpu, gpu etc. )
itzmeanjan · Mar 14, 2022 · 3a5fa29
1 parent f03707f
commit 3a5fa29
Showing 4 changed files with 165 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@ SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3
 
 ## Motivation
 
-After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
+After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions ( along with `keccak256` ). BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 ( or keccak256 ) families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
 
 ```bash
   ((a, b), (c, d))          < --- [Level 1] [Root]
@@ -25,7 +25,7 @@ input   = [a, b, c, d]
 output  = [0, ((a, b), (c, d)), (a, b), (c, d)]
 ```
 
-Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function, which one to use is compile-time choice using pre-processor directive.
+Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function ( along with keccak256 ), which one to use is compile-time choice using pre-processor directive.
 
 If you happen to be interested in Binary Merklization using Rescue Prime Hash/ BLAKE3, consider seeing following links.
 
@@ -36,6 +36,8 @@ If you happen to be interested in Binary Merklization using Rescue Prime Hash/ B
 
 > During SHA3 implementations, I've followed SHA-3 Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.202).
 
+> During Keccak256 implementation, I took some inspiration from [here](https://keccak.team/files/Keccak-implementation-3.2.pdf). Note that keccak256 & sha3-256 are nearly identical, differing only in the input message padding rule; see the description of PR https://github.com/itzmeanjan/merklize-sha/pull/10.
+
 > Using SHA1 for binary merklization may not be a good choice these days, see [here](https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions). But still I'm keeping SHA1 implementation, just as a reference.
 
 ## Prerequisites
@@ -153,5 +155,9 @@ I'm keeping binary merklization benchmark results of
   - [Nvidia GPU(s)](results/sha3-512/nvidia_gpu.md)
   - [Intel CPU(s)](results/sha3-512/intel_cpu.md)
   - [Intel GPU(s)](results/sha3-512/intel_gpu.md)
+- KECCAK-256
+  - [Nvidia GPU(s)](results/keccak-256/nvidia_gpu.md)
+  - [Intel CPU(s)](results/keccak-256/intel_cpu.md)
+  - [Intel GPU(s)](results/keccak-256/intel_gpu.md)
 
 obtained after executing them on multiple accelerators.
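
The level-by-level merging described in the README's Motivation section, together with its `output = [0, ((a, b), (c, d)), (a, b), (c, d)]` layout, can be sketched sequentially as follows. This is a minimal illustration, not the repository's SYCL kernel: the names `two_to_one` and `merklize` are hypothetical, the placeholder slot at index 0 is assumed to be all zero bytes, and `hashlib.sha3_256` stands in for keccak256 (Python's standard library has no keccak256, which differs from SHA3-256 in message padding).

```python
import hashlib

DIGEST_LEN = 32  # bytes per node when a 256-bit hash is used


def two_to_one(left: bytes, right: bytes) -> bytes:
    """Merge two digests into one (stand-in: SHA3-256, not keccak256)."""
    return hashlib.sha3_256(left + right).digest()


def merklize(leaves):
    """Return all intermediate nodes; index 1 is the root, index 0 is unused.

    Node i has children at 2*i and 2*i + 1, so for N leaves the
    lowest intermediate level occupies indices N/2 .. N-1.
    """
    n = len(leaves)
    if n < 2 or n & (n - 1):
        raise ValueError("leaf count must be a power of two and at least 2")

    out = [bytes(DIGEST_LEN)] * n  # assumed zero placeholder at index 0

    # Level just above the leaves: every pair is independent of the others,
    # which is the parallelism the README refers to.
    for k in range(n // 2):
        out[n // 2 + k] = two_to_one(leaves[2 * k], leaves[2 * k + 1])

    # Remaining levels, each half as wide as the one below it.
    width = n // 4
    while width >= 1:
        for i in range(width, 2 * width):
            out[i] = two_to_one(out[2 * i], out[2 * i + 1])
        width //= 2

    return out
```

With four leaves `[a, b, c, d]`, `merklize` fills indices 2 and 3 first with `(a, b)` and `(c, d)`, then index 1 with the root. Each inner `for` loop is the part that could be dispatched as one parallel step per tree level.
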
diff --git a/results/keccak-256/intel_cpu.md b/results/keccak-256/intel_cpu.md
@@ -0,0 +1,109 @@
+### Binary Merklization using KECCAK-256 on Intel CPU(s)
+
+Compiling with
+
+```bash
+SHA=keccak_256 make aot_cpu
+```
+
+### On `Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz`
+
+```bash
+$ lscpu | grep -i cpu\(s\)
+
+CPU(s):                          4
+On-line CPU(s) list:             0-3
+NUMA node0 CPU(s):               0-3
+```
+
+```bash
+running on Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                   466.478477 ms                     3.288778 ms                     3.442020 ms
+        2 ^ 21                   898.963977 ms                     6.508914 ms                     6.558546 ms
+        2 ^ 22                      1.797621 s                    13.061319 ms                    13.172746 ms
+        2 ^ 23                      3.591501 s                    27.324937 ms                    27.123078 ms
+        2 ^ 24                      7.186666 s                    54.148528 ms                    54.237210 ms
+        2 ^ 25                     14.404052 s                   123.865217 ms                   108.246855 ms
+```
+
+### On `Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz`
+
+```bash
+$ lscpu | grep -i cpu\(s\)
+
+CPU(s):                          128
+On-line CPU(s) list:             0-127
+NUMA node0 CPU(s):               0-31,64-95
+NUMA node1 CPU(s):               32-63,96-127
+```
+
+```bash
+running on Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                    13.362355 ms                     1.821476 ms                     1.326708 ms
+        2 ^ 21                    20.922397 ms                     3.589614 ms                     2.430955 ms
+        2 ^ 22                    33.674320 ms                     6.493885 ms                     4.294246 ms
+        2 ^ 23                   106.859444 ms                    11.947260 ms                     8.593155 ms
+        2 ^ 24                   117.165222 ms                    23.851139 ms                     8.417020 ms
+        2 ^ 25                   233.647003 ms                    25.051263 ms                    16.673447 ms
+```
+
+### On `Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz`
+
+```bash
+$ lscpu | grep -i cpu\(s\)
+
+CPU(s):                          24
+On-line CPU(s) list:             0-23
+NUMA node0 CPU(s):               0-5,12-17
+NUMA node1 CPU(s):               6-11,18-23
+```
+
+```bash
+running on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                    34.571529 ms                     1.809763 ms                   897.616875 us
+        2 ^ 21                    61.404680 ms                     3.326612 ms                     1.588368 ms
+        2 ^ 22                   117.968746 ms                     5.674248 ms                     7.157974 ms
+        2 ^ 23                   231.852088 ms                     9.238144 ms                    13.273680 ms
+        2 ^ 24                   462.241001 ms                    20.315251 ms                    12.602417 ms
+        2 ^ 25                   924.972606 ms                    31.446401 ms                    24.707977 ms
+```
+
+### On `Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz`
+
+```bash
+$ lscpu | grep -i cpu\(s\)
+
+CPU(s):                          12
+On-line CPU(s) list:             0-11
+NUMA node0 CPU(s):               0-11
+```
+
+```bash
+running on Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                    73.894415 ms                   932.138625 us                   850.445250 us
+        2 ^ 21                   109.423621 ms                     1.782943 ms                     1.715456 ms
+        2 ^ 22                   218.244072 ms                     3.493360 ms                     3.446031 ms
+        2 ^ 23                   436.918616 ms                     6.905427 ms                     6.842661 ms
+        2 ^ 24                   883.594877 ms                    13.812258 ms                    13.749230 ms
+        2 ^ 25                      1.930962 s                    27.554382 ms                    27.591307 ms
+```
diff --git a/results/keccak-256/intel_gpu.md b/results/keccak-256/intel_gpu.md
@@ -0,0 +1,24 @@
+### Binary Merklization using KECCAK-256 on Intel GPU(s)
+
+Compiling with
+
+```bash
+SHA=keccak_256 make aot_gpu
+```
+
+### On `Intel(R) UHD Graphics P630 [0x3e96]`
+
+```bash
+running on Intel(R) UHD Graphics P630 [0x3e96]
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                   108.488926 ms                     1.332275 ms                   745.381500 us
+        2 ^ 21                   212.384799 ms                     1.497735 ms                     1.454533 ms
+        2 ^ 22                   422.459127 ms                     5.289694 ms                     2.832562 ms
+        2 ^ 23                   841.035348 ms                     5.684048 ms                     5.597084 ms
+        2 ^ 24                      1.679276 s                    11.176738 ms                    11.080438 ms
+        2 ^ 25                      3.355854 s                    22.150604 ms                    22.356589 ms
+```
diff --git a/results/keccak-256/nvidia_gpu.md b/results/keccak-256/nvidia_gpu.md
@@ -0,0 +1,24 @@
+### Binary Merklization using KECCAK-256 on Nvidia GPU(s)
+
+Compile with
+
+```bash
+SHA=keccak_256 make cuda
+```
+
+### On `Tesla V100-SXM2-16GB`
+
+```bash
+running on Tesla V100-SXM2-16GB
+
+
+Benchmarking Binary Merklization using KECCAK-256
+
+      leaf count                  execution time                host-to-device tx time          device-to-host tx time
+        2 ^ 20                   751.924875 us                     1.167792 ms                     1.005363 ms
+        2 ^ 21                     1.344910 ms                     2.304931 ms                     2.016678 ms
+        2 ^ 22                     2.517974 ms                     4.593017 ms                     4.025208 ms
+        2 ^ 23                     4.864380 ms                     9.128906 ms                     8.053345 ms
+        2 ^ 24                     8.179686 ms                    18.250488 ms                    16.049194 ms
+        2 ^ 25                    16.144776 ms                    36.534668 ms                    32.099121 ms
+```
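
To compare rows across these tables, it can help to convert the printed times to one unit and derive a hash rate. The sketch below assumes a tree with N leaves needs N - 1 invocations of the 2-to-1 hash (N/2 + N/4 + ... + 1), and uses the `2 ^ 25` row of the Tesla V100 table above as input. The helper names `to_seconds` and `hashes_per_second` are hypothetical and not part of the repository.

```python
# Units as they appear in the benchmark tables.
UNIT_TO_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text: str) -> float:
    """Convert a table cell such as '16.144776 ms' to seconds."""
    value, unit = text.split()
    return float(value) * UNIT_TO_SECONDS[unit]


def hashes_per_second(log2_leaves: int, exec_time: str) -> float:
    """2-to-1 hash invocations per second for a tree with 2^log2_leaves leaves."""
    invocations = (1 << log2_leaves) - 1  # N - 1 intermediate nodes incl. root
    return invocations / to_seconds(exec_time)


# Values copied from the Tesla V100 row for 2 ^ 25 leaves.
kernel = "16.144776 ms"
rate = hashes_per_second(25, kernel)
transfer = to_seconds("36.534668 ms") + to_seconds("32.099121 ms")
ratio = transfer / to_seconds(kernel)
print(f"{rate:.3e} hashes/s, transfers take {ratio:.2f}x the kernel time")
```

For that row the combined host-to-device and device-to-host transfer time exceeds the kernel execution time, so end-to-end comparisons between devices should account for the transfer columns and not only the execution time.
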