Merge pull request #10 from itzmeanjan/keccak256

Keccak256 based Binary Merklization
itzmeanjan · Mar 14, 2022 · ebcd785 · ebcd785
2 parents 8a2c006 + 35574c7
commit ebcd785
Show file tree

Hide file tree

Showing 13 changed files with 525 additions and 48 deletions.
diff --git a/README.md b/README.md
@@ -1,10 +1,10 @@
 # merklize-sha
 
-SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3
+SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3 ( along with keccak256 )
 
 ## Motivation
 
-After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
+After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions ( along with keccak256 ). BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 ( or keccak256 ) families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
 
 ```bash
   ((a, b), (c, d))          < --- [Level 1] [Root]
@@ -25,7 +25,7 @@ input   = [a, b, c, d]
 output  = [0, ((a, b), (c, d)), (a, b), (c, d)]
 ```
 
-Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function, which one to use is compile-time choice using pre-processor directive.
+Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function ( along with keccak256 ), which one to use is compile-time choice using pre-processor directive.
 
 If you happen to be interested in Binary Merklization using Rescue Prime Hash/ BLAKE3, consider seeing following links.
 
@@ -36,6 +36,8 @@ If you happen to be interested in Binary Merklization using Rescue Prime Hash/ B
 
 > During SHA3 implementations, I've followed SHA-3 Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.202).
 
+> During Keccak256 implementation, I took some inspiration from [here](https://keccak.team/files/Keccak-implementation-3.2.pdf); though note that, keccak256 & sha3-256 are very much similar, except input message padding rule; see https://github.com/itzmeanjan/merklize-sha/pull/10 PR description.
+
 > Using SHA1 for binary merklization may not be a good choice these days, see [here](https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions). But still I'm keeping SHA1 implementation, just as a reference.
 
 ## Prerequisites
@@ -88,12 +90,13 @@ If you happen to be interested in 2-to-1 hash implementation of
 - [SHA3-256](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_256.cpp)
 - [SHA3-384](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_384.cpp)
 - [SHA3-512](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_512.cpp)
+- [KECCAK-256](https://github.com/itzmeanjan/merklize-sha/blob/75dfd47/example/keccak_256.cpp)
 
 where two digests of respective hash functions are input, in byte concatenated form, to `hash( ... )` function, consider taking a look at above hyperlinked examples.
 
 > Compile above examples using `dpcpp -fsycl example/<file>.cpp -I./include`
 
-You will probably like to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/4aadd99/include/merklize.hpp)
+You will probably like to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/ddb7ac9/include/merklize.hpp)
 
 ## Tests
 
@@ -153,5 +156,9 @@ I'm keeping binary merklization benchmark results of
   - [Nvidia GPU(s)](results/sha3-512/nvidia_gpu.md)
   - [Intel CPU(s)](results/sha3-512/intel_cpu.md)
   - [Intel GPU(s)](results/sha3-512/intel_gpu.md)
+- KECCAK-256
+  - [Nvidia GPU(s)](results/keccak-256/nvidia_gpu.md)
+  - [Intel CPU(s)](results/keccak-256/intel_cpu.md)
+  - [Intel GPU(s)](results/keccak-256/intel_gpu.md)
 
 obtained after executing them on multiple accelerators.
diff --git a/bench/main.cpp b/bench/main.cpp
@@ -73,6 +73,10 @@ main(int argc, char** argv)
 #elif defined SHA3_512
   std::cout << "\nBenchmarking Binary Merklization using SHA3-512" << std::endl
             << std::endl;
+#elif defined KECCAK_256
+  std::cout << "\nBenchmarking Binary Merklization using KECCAK-256"
+            << std::endl
+            << std::endl;
 #endif
 
   std::cout << std::setw(16) << std::right << "leaf count"

diff --git a/example/keccak_256.cpp b/example/keccak_256.cpp
@@ -0,0 +1,68 @@
+#include "keccak_256.hpp"
+#include <cassert>
+
+// This example attempts to show how to use 2-to-1 KECCAK-256 hash function !
+int
+main(int argc, char** argv)
+{
+  // $ python3
+  // >>> a = [0xff] * 32
+  //
+  // first input digest
+  constexpr sycl::uchar digest_0[32] = {
+    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+    255, 255, 255, 255, 255, 255, 255, 255, 255, 255
+  };
+
+  // >>> b = [0x0f] * 32
+  //
+  // second input digest
+  constexpr sycl::uchar digest_1[32] = { 15, 15, 15, 15, 15, 15, 15, 15,
+                                         15, 15, 15, 15, 15, 15, 15, 15,
+                                         15, 15, 15, 15, 15, 15, 15, 15,
+                                         15, 15, 15, 15, 15, 15, 15, 15 };
+
+  // >>> c = a + b
+  // >>> import sha3
+  // >>> list(sha3.keccak_256(bytes(c)).digest())
+  //
+  // final output digest after merging two input digests
+  constexpr sycl::uchar digest_2[32] = { 134, 168, 210, 52, 189, 102, 98,  76,
+                                         116, 7,   56,  37, 176, 235, 150, 128,
+                                         138, 149, 200, 26, 32,  23,  79,  38,
+                                         139, 137, 109, 85, 195, 183, 230, 9 };
+
+  sycl::default_selector s{};
+  sycl::device d{ s };
+  sycl::context c{ d };
+  sycl::queue q{ c, d };
+
+  // so that input digests can be transferred from host to device ( by runtime )
+  sycl::uchar* in = static_cast<sycl::uchar*>(
+    sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));
+
+  // so that output digest can be transferred from device to host ( by runtime )
+  sycl::uchar* out =
+    static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));
+
+  // copy both input digests to device memory
+  q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
+  q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();
+
+  // compute 2-to-1 hash
+  q.single_task<class kernelExampleKECCAK_256>(
+    [=]() { keccak_256::hash(in, out); });
+  q.wait();
+
+  // finally assert !
+  for (size_t i = 0; i < sizeof(digest_2); i++) {
+    assert(*(out + i) == digest_2[i]);
+  }
+
+  // deallocate resources !
+  sycl::free(in, q);
+  sycl::free(out, q);
+
+  return EXIT_SUCCESS;
+}
diff --git a/include/bench_merklize.hpp b/include/bench_merklize.hpp
@@ -54,6 +54,9 @@ benchmark_merklize(sycl::queue& q,
 #elif defined SHA3_512
   const size_t i_size = leaf_cnt * sha3_512::OUT_LEN_BYTES; // in bytes
   const size_t o_size = leaf_cnt * sha3_512::OUT_LEN_BYTES; // in bytes
+#elif defined KECCAK_256
+  const size_t i_size = leaf_cnt * keccak_256::OUT_LEN_BYTES; // in bytes
+  const size_t o_size = leaf_cnt * keccak_256::OUT_LEN_BYTES; // in bytes
 #endif
 
 #if defined SHA1 || defined SHA2_224 || defined SHA2_256
@@ -70,7 +73,7 @@ benchmark_merklize(sycl::queue& q,
   sycl::ulong* i_d = static_cast<sycl::ulong*>(sycl::malloc_device(i_size, q));
   sycl::ulong* o_d = static_cast<sycl::ulong*>(sycl::malloc_device(o_size, q));
 #elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 ||              \
-  defined SHA3_512
+  defined SHA3_512 || defined KECCAK_256
   // allocate resources
   sycl::uchar* i_h = static_cast<sycl::uchar*>(sycl::malloc_host(i_size, q));
   sycl::uchar* o_h = static_cast<sycl::uchar*>(sycl::malloc_host(o_size, q));
@@ -135,6 +138,8 @@ benchmark_merklize(sycl::queue& q,
                      (sha3_384::OUT_LEN_BYTES)
 #elif defined SHA3_512
                      (sha3_512::OUT_LEN_BYTES)
+#elif defined KECCAK_256
+                     (keccak_256::OUT_LEN_BYTES)
 #endif
 
          ;

diff --git a/include/keccak_256.hpp b/include/keccak_256.hpp
@@ -0,0 +1,119 @@
+#pragma once
+#include "sha3.hpp"
+
+namespace keccak_256 {
+
+// Keccak-256 specific input/ output width constants
+constexpr size_t IN_LEN_BITS = 512;
+constexpr size_t IN_LEN_BYTES = IN_LEN_BITS >> 3;
+
+constexpr size_t OUT_LEN_BITS = IN_LEN_BITS >> 1;
+constexpr size_t OUT_LEN_BYTES = IN_LEN_BYTES >> 1;
+
+// From input byte array ( = 64 bytes ) preparing 5 x 5 x 64 keccak state array
+// as twenty five 64 -bit unsigned integers
+//
+// Combined techniques adapted from section 3.1.2 of
+// http://dx.doi.org/10.6028/NIST.FIPS.202; algorithm 10
+// defined in section B.1 of above linked document
+//
+// Note, in SHA3 specification padding rule is different than what it's for
+// keccak-256 i.e. to be very spcific, compare
+// https://github.com/itzmeanjan/merklize-sha/blob/e421134ea4c9b1a832458bc870c3e79dc2849ecc/include/sha3_256.hpp#L43
+// ( this is sha3-256 implementation ) and write to 9-th state lane ( at index 8
+// of state array ) in following function body
+//
+// I suggest you read https://keccak.team/files/Keccak-implementation-3.2.pdf 's
+// section 1.1 where padding rule is defined under `Keccak[r, c](M)` definition
+void
+to_state_array(const sycl::uchar* __restrict in,
+               sycl::ulong* const __restrict state)
+{
+#pragma unroll 8
+  for (size_t i = 0; i < 8; i++) {
+    state[i] = static_cast<sycl::ulong>(in[(i << 3) + 7]) << 56 |
+               static_cast<sycl::ulong>(in[(i << 3) + 6]) << 48 |
+               static_cast<sycl::ulong>(in[(i << 3) + 5]) << 40 |
+               static_cast<sycl::ulong>(in[(i << 3) + 4]) << 32 |
+               static_cast<sycl::ulong>(in[(i << 3) + 3]) << 24 |
+               static_cast<sycl::ulong>(in[(i << 3) + 2]) << 16 |
+               static_cast<sycl::ulong>(in[(i << 3) + 1]) << 8 |
+               static_cast<sycl::ulong>(in[(i << 3) + 0]) << 0;
+  }
+
+  // see how 0b01 is padded to input message; following keccak-256
+  // implementation guide
+  // https://keccak.team/files/Keccak-implementation-3.2.pdf 's section 1.1
+  // where `Keccak[r, c](M)` is defined ( spcifically padding rule block in
+  // pseudocode, at very end of mentioned section )
+  //
+  // ! read right to left !
+  state[8] = 0b1ull;
+
+#pragma unroll 7
+  for (size_t i = 9; i < 16; i++) {
+    state[i] = 0ull;
+  }
+
+  // this 1 is added to input message bits due to padding requirement
+  // defined in keccak-256 implementation guide
+  // https://keccak.team/files/Keccak-implementation-3.2.pdf 's section 1.1
+  // where `Keccak[r, c](M)` is defined ( spcifically padding rule block in
+  // pseudocode, at very end of mentioned section )
+  //
+  // ! read right to left, so it's actually 1 << 63 !
+  state[16] = 9223372036854775808ull;
+
+#pragma unroll 8
+  for (size_t i = 17; i < 25; i++) {
+    state[i] = 0ull;
+  }
+}
+
+// From absorbed hash state array of dimension 5 x 5 x 64, produces 32 -bytes
+// digest using method defined in section 3.1.3 of
+// http://dx.doi.org/10.6028/NIST.FIPS.202 and algorithm 11 defined in section
+// B.1 of above hyperlinked document
+//
+// Note, digest preparation method is same for both sha3-256 and keccak-256
+void
+to_digest_bytes(const sycl::ulong* __restrict in,
+                sycl::uchar* const __restrict digest)
+{
+#pragma unroll 4
+  for (size_t i = 0; i < 4; i++) {
+    const sycl::ulong lane = in[i];
+
+    digest[(i << 3) + 0] = static_cast<sycl::uchar>((lane >> 0) & 0xffull);
+    digest[(i << 3) + 1] = static_cast<sycl::uchar>((lane >> 8) & 0xffull);
+    digest[(i << 3) + 2] = static_cast<sycl::uchar>((lane >> 16) & 0xffull);
+    digest[(i << 3) + 3] = static_cast<sycl::uchar>((lane >> 24) & 0xffull);
+    digest[(i << 3) + 4] = static_cast<sycl::uchar>((lane >> 32) & 0xffull);
+    digest[(i << 3) + 5] = static_cast<sycl::uchar>((lane >> 40) & 0xffull);
+    digest[(i << 3) + 6] = static_cast<sycl::uchar>((lane >> 48) & 0xffull);
+    digest[(i << 3) + 7] = static_cast<sycl::uchar>((lane >> 56) & 0xffull);
+  }
+}
+
+// Keccak-256 2-to-1 hasher, where input is 64 contiguous bytes which is hashed
+// to produce 32 -bytes output
+//
+// This function itself doesn't do much instead of calling other functions
+// which actually
+// - prepares state bit array from input byte array ( difference lies here, when
+// comparing between sha3-256 & keccak-256 )
+// - permutes input using `keccak-p[b, n_r]`
+// - truncates first 256 -bits from state bit array
+//
+// See section 6.1 of http://dx.doi.org/10.6028/NIST.FIPS.202
+void
+hash(const sycl::uchar* __restrict in, sycl::uchar* const __restrict digest)
+{
+  sycl::ulong state[25];
+
+  to_state_array(in, state);
+  keccak_p(state);
+  to_digest_bytes(state, digest);
+}
+
+}
diff --git a/include/merklize.hpp b/include/merklize.hpp
@@ -3,7 +3,7 @@
 #if !(defined SHA1 || defined SHA2_224 || defined SHA2_256 ||                  \
       defined SHA2_384 || defined SHA2_512 || defined SHA2_512_224 ||          \
       defined SHA2_512_256 || defined SHA3_256 || defined SHA3_224 ||          \
-      defined SHA3_384 || defined SHA3_512)
+      defined SHA3_384 || defined SHA3_512 || defined KECCAK_256)
 #define SHA2_256
 #endif
 
@@ -40,6 +40,9 @@
 #elif defined SHA3_512
 #include "sha3_512.hpp"
 #pragma message "Choosing to compile Merklization with SHA3-512 !"
+#elif defined KECCAK_256
+#include "keccak_256.hpp"
+#pragma message "Choosing to compile Merklization with KECCAK-256 !"
 #endif
 
 // Binary merklization --- collects motivation from
@@ -56,7 +59,7 @@ merklize(sycl::queue& q,
   defined SHA2_512_256
          const sycl::ulong* __restrict leaf_nodes,
 #elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 ||              \
-  defined SHA3_512
+  defined SHA3_512 || defined KECCAK_256
          const sycl::uchar* __restrict leaf_nodes,
 #endif
 
@@ -69,7 +72,7 @@ merklize(sycl::queue& q,
   defined SHA2_512_256
          sycl::ulong* const __restrict intermediates,
 #elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 ||              \
-  defined SHA3_512
+  defined SHA3_512 || defined KECCAK_256
          sycl::uchar* const __restrict intermediates,
 #endif
 
@@ -116,12 +119,16 @@ merklize(sycl::queue& q,
 #elif defined SHA3_512
   assert(i_size == leaf_cnt * sha3_512::OUT_LEN_BYTES);
   assert(o_size == (itmd_cnt + 1) * sha3_512::OUT_LEN_BYTES);
+#elif defined KECCAK_256
+  assert(i_size == leaf_cnt * keccak_256::OUT_LEN_BYTES);
+  assert(o_size == (itmd_cnt + 1) * keccak_256::OUT_LEN_BYTES);
 #endif
 
   // both input and output allocation has same size
 #if defined SHA1 || defined SHA2_224 || defined SHA2_256 ||                    \
   defined SHA2_384 || defined SHA2_512 || defined SHA2_512_256 ||              \
-  defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || defined SHA3_512
+  defined SHA3_256 || defined SHA3_224 || defined SHA3_384 ||                  \
+  defined SHA3_512 || defined KECCAK_256
 
   assert(i_size == o_size);
 
@@ -160,7 +167,7 @@ merklize(sycl::queue& q,
   // note that `o_size` is in terms of bytes
   const size_t elm_cnt = o_size >> 3;
 #elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 ||              \
-  defined SHA3_512
+  defined SHA3_512 || defined KECCAK_256
   // # -of 8 -bit unsigned integers ( read a byte ), which can be contiguously
   // placed on output memory allocation
   //
@@ -227,6 +234,9 @@ merklize(sycl::queue& q,
 #elif defined SHA3_512
         const size_t in_idx = idx * sha3_512::IN_LEN_BYTES;
         const size_t out_idx = idx * sha3_512::OUT_LEN_BYTES;
+#elif defined KECCAK_256
+        const size_t in_idx = idx * keccak_256::IN_LEN_BYTES;
+        const size_t out_idx = idx * keccak_256::OUT_LEN_BYTES;
 #endif
 
 #if defined SHA1
@@ -270,6 +280,11 @@ merklize(sycl::queue& q,
         sycl::uchar* out = intermediates + o_offset + out_idx;
 
         sha3_512::hash(in, out);
+#elif defined KECCAK_256
+        const sycl::uchar* in = leaf_nodes + i_offset + in_idx;
+        sycl::uchar* out = intermediates + o_offset + out_idx;
+
+        keccak_256::hash(in, out);
 #endif
       });
   });
@@ -398,6 +413,9 @@ merklize(sycl::queue& q,
 #elif defined SHA3_512
           const size_t in_idx = idx * sha3_512::IN_LEN_BYTES;
           const size_t out_idx = idx * sha3_512::OUT_LEN_BYTES;
+#elif defined KECCAK_256
+          const size_t in_idx = idx * keccak_256::IN_LEN_BYTES;
+          const size_t out_idx = idx * keccak_256::OUT_LEN_BYTES;
 #endif
 
 #if defined SHA1
@@ -446,6 +464,11 @@ merklize(sycl::queue& q,
           sycl::uchar* out = intermediates + o_offset_ + out_idx;
 
           sha3_512::hash(in, out);
+#elif defined KECCAK_256
+          const sycl::uchar* in = intermediates + i_offset_ + in_idx;
+          sycl::uchar* out = intermediates + o_offset_ + out_idx;
+
+          keccak_256::hash(in, out);
 #endif
         });
     });