Skip to content

Commit

Permalink
Merge pull request #10 from itzmeanjan/keccak256
Browse files Browse the repository at this point in the history
Keccak256 based Binary Merklization
  • Loading branch information
itzmeanjan authored Mar 14, 2022
2 parents 8a2c006 + 35574c7 commit ebcd785
Show file tree
Hide file tree
Showing 13 changed files with 525 additions and 48 deletions.
15 changes: 11 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# merklize-sha

SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3
SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3 ( along with keccak256 )

## Motivation

After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.
After implementing BLAKE3 using SYCL, I decided to accelerate 2-to-1 hash implementation of all variants of SHA1, SHA2 & SHA3 families of cryptographic hash functions ( along with keccak256 ). BLAKE3 lends itself pretty well to parallelization efforts, due to its inherent data parallel friendly algorithmic construction, where each 1024 -bytes chunk can be compressed independently ( read parallelly ) and finally it's a binary merklization problem with compressed chunks as leaf nodes of binary merkle tree. But none of SHA1, SHA2 & SHA3 ( or keccak256 ) families of cryptographic hash functions are data parallel, requiring to process each message block ( can be 512 -bit/ 1024 -bit or padded to 1600 -bit in case of SHA3 family ) sequentially, which is why I only concentrated on accelerating Binary Merklization where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of tree when N -many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N -many leaf nodes are respective hash digests --- for example, when using SHA2-256 variant for computing all intermediate nodes of binary merkle tree, each of provided leaf node is 32 -bytes wide, representing a SHA2-256 digest. Now, N -many leaf digests are merged into N/ 2 -many digests which are intermediate nodes, living just above leaf nodes. Then in next phase, those N/ 2 -many intermediates are used for computing N/ 4 -many of intermediates which are living just above them. This process continues until root of merkle tree is computed. Notice, that in each level of tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel and once they're computed root of tree can be computed as a single task.

```bash
((a, b), (c, d)) < --- [Level 1] [Root]
Expand All @@ -25,7 +25,7 @@ input = [a, b, c, d]
output = [0, ((a, b), (c, d)), (a, b), (c, d)]
```

Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function, which one to use is compile-time choice using pre-processor directive.
Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, while using SHA1/ SHA2/ SHA3 variants as 2-to-1 hash function ( along with keccak256 ), which one to use is compile-time choice using pre-processor directive.

If you happen to be interested in Binary Merklization using Rescue Prime Hash/ BLAKE3, consider seeing following links.

Expand All @@ -36,6 +36,8 @@ If you happen to be interested in Binary Merklization using Rescue Prime Hash/ B
> During SHA3 implementations, I've followed SHA-3 Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.202).
> During Keccak256 implementation, I took some inspiration from [here](https://keccak.team/files/Keccak-implementation-3.2.pdf); though note that, keccak256 & sha3-256 are very much similar, except input message padding rule; see https://github.com/itzmeanjan/merklize-sha/pull/10 PR description.
> Using SHA1 for binary merklization may not be a good choice these days, see [here](https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions). But still I'm keeping SHA1 implementation, just as a reference.
## Prerequisites
Expand Down Expand Up @@ -88,12 +90,13 @@ If you happen to be interested in 2-to-1 hash implementation of
- [SHA3-256](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_256.cpp)
- [SHA3-384](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_384.cpp)
- [SHA3-512](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_512.cpp)
- [KECCAK-256](https://github.com/itzmeanjan/merklize-sha/blob/75dfd47/example/keccak_256.cpp)

where two digests of respective hash functions are input, in byte concatenated form, to `hash( ... )` function, consider taking a look at above hyperlinked examples.

> Compile above examples using `dpcpp -fsycl example/<file>.cpp -I./include`
You will probably like to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/4aadd99/include/merklize.hpp)
You will probably like to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/ddb7ac9/include/merklize.hpp)

## Tests

Expand Down Expand Up @@ -153,5 +156,9 @@ I'm keeping binary merklization benchmark results of
- [Nvidia GPU(s)](results/sha3-512/nvidia_gpu.md)
- [Intel CPU(s)](results/sha3-512/intel_cpu.md)
- [Intel GPU(s)](results/sha3-512/intel_gpu.md)
- KECCAK-256
- [Nvidia GPU(s)](results/keccak-256/nvidia_gpu.md)
- [Intel CPU(s)](results/keccak-256/intel_cpu.md)
- [Intel GPU(s)](results/keccak-256/intel_gpu.md)

obtained after executing them on multiple accelerators.
4 changes: 4 additions & 0 deletions bench/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,10 @@ main(int argc, char** argv)
#elif defined SHA3_512
std::cout << "\nBenchmarking Binary Merklization using SHA3-512" << std::endl
<< std::endl;
#elif defined KECCAK_256
std::cout << "\nBenchmarking Binary Merklization using KECCAK-256"
<< std::endl
<< std::endl;
#endif

std::cout << std::setw(16) << std::right << "leaf count"
Expand Down
68 changes: 68 additions & 0 deletions example/keccak_256.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#include "keccak_256.hpp"
#include <cassert>

// This example attempts to show how to use 2-to-1 KECCAK-256 hash function !
int
main(int argc, char** argv)
{
// $ python3
// >>> a = [0xff] * 32
//
// first input digest
constexpr sycl::uchar digest_0[32] = {
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255
};

// >>> b = [0x0f] * 32
//
// second input digest
constexpr sycl::uchar digest_1[32] = { 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15, 15 };

// >>> c = a + b
// >>> import sha3
// >>> list(sha3.keccak_256(bytes(c)).digest())
//
// final output digest after merging two input digests
constexpr sycl::uchar digest_2[32] = { 134, 168, 210, 52, 189, 102, 98, 76,
116, 7, 56, 37, 176, 235, 150, 128,
138, 149, 200, 26, 32, 23, 79, 38,
139, 137, 109, 85, 195, 183, 230, 9 };

sycl::default_selector s{};
sycl::device d{ s };
sycl::context c{ d };
sycl::queue q{ c, d };

// so that input digests can be transferred from host to device ( by runtime )
sycl::uchar* in = static_cast<sycl::uchar*>(
sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));

// so that output digest can be transferred from device to host ( by runtime )
sycl::uchar* out =
static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));

// copy both input digests to device memory
q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();

// compute 2-to-1 hash
q.single_task<class kernelExampleKECCAK_256>(
[=]() { keccak_256::hash(in, out); });
q.wait();

// finally assert !
for (size_t i = 0; i < sizeof(digest_2); i++) {
assert(*(out + i) == digest_2[i]);
}

// deallocate resources !
sycl::free(in, q);
sycl::free(out, q);

return EXIT_SUCCESS;
}
7 changes: 6 additions & 1 deletion include/bench_merklize.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,9 @@ benchmark_merklize(sycl::queue& q,
#elif defined SHA3_512
const size_t i_size = leaf_cnt * sha3_512::OUT_LEN_BYTES; // in bytes
const size_t o_size = leaf_cnt * sha3_512::OUT_LEN_BYTES; // in bytes
#elif defined KECCAK_256
const size_t i_size = leaf_cnt * keccak_256::OUT_LEN_BYTES; // in bytes
const size_t o_size = leaf_cnt * keccak_256::OUT_LEN_BYTES; // in bytes
#endif

#if defined SHA1 || defined SHA2_224 || defined SHA2_256
Expand All @@ -70,7 +73,7 @@ benchmark_merklize(sycl::queue& q,
sycl::ulong* i_d = static_cast<sycl::ulong*>(sycl::malloc_device(i_size, q));
sycl::ulong* o_d = static_cast<sycl::ulong*>(sycl::malloc_device(o_size, q));
#elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || \
defined SHA3_512
defined SHA3_512 || defined KECCAK_256
// allocate resources
sycl::uchar* i_h = static_cast<sycl::uchar*>(sycl::malloc_host(i_size, q));
sycl::uchar* o_h = static_cast<sycl::uchar*>(sycl::malloc_host(o_size, q));
Expand Down Expand Up @@ -135,6 +138,8 @@ benchmark_merklize(sycl::queue& q,
(sha3_384::OUT_LEN_BYTES)
#elif defined SHA3_512
(sha3_512::OUT_LEN_BYTES)
#elif defined KECCAK_256
(keccak_256::OUT_LEN_BYTES)
#endif

;
Expand Down
119 changes: 119 additions & 0 deletions include/keccak_256.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
#pragma once
#include "sha3.hpp"

namespace keccak_256 {

// Keccak-256 specific input/ output width constants
constexpr size_t IN_LEN_BITS = 512;
constexpr size_t IN_LEN_BYTES = IN_LEN_BITS >> 3;

constexpr size_t OUT_LEN_BITS = IN_LEN_BITS >> 1;
constexpr size_t OUT_LEN_BYTES = IN_LEN_BYTES >> 1;

// From input byte array ( = 64 bytes ) preparing 5 x 5 x 64 keccak state array
// as twenty five 64 -bit unsigned integers
//
// Combined techniques adapted from section 3.1.2 of
// http://dx.doi.org/10.6028/NIST.FIPS.202; algorithm 10
// defined in section B.1 of above linked document
//
// Note, in SHA3 specification padding rule is different than what it's for
// keccak-256 i.e. to be very spcific, compare
// https://github.com/itzmeanjan/merklize-sha/blob/e421134ea4c9b1a832458bc870c3e79dc2849ecc/include/sha3_256.hpp#L43
// ( this is sha3-256 implementation ) and write to 9-th state lane ( at index 8
// of state array ) in following function body
//
// I suggest you read https://keccak.team/files/Keccak-implementation-3.2.pdf 's
// section 1.1 where padding rule is defined under `Keccak[r, c](M)` definition
void
to_state_array(const sycl::uchar* __restrict in,
sycl::ulong* const __restrict state)
{
#pragma unroll 8
for (size_t i = 0; i < 8; i++) {
state[i] = static_cast<sycl::ulong>(in[(i << 3) + 7]) << 56 |
static_cast<sycl::ulong>(in[(i << 3) + 6]) << 48 |
static_cast<sycl::ulong>(in[(i << 3) + 5]) << 40 |
static_cast<sycl::ulong>(in[(i << 3) + 4]) << 32 |
static_cast<sycl::ulong>(in[(i << 3) + 3]) << 24 |
static_cast<sycl::ulong>(in[(i << 3) + 2]) << 16 |
static_cast<sycl::ulong>(in[(i << 3) + 1]) << 8 |
static_cast<sycl::ulong>(in[(i << 3) + 0]) << 0;
}

// see how 0b01 is padded to input message; following keccak-256
// implementation guide
// https://keccak.team/files/Keccak-implementation-3.2.pdf 's section 1.1
// where `Keccak[r, c](M)` is defined ( spcifically padding rule block in
// pseudocode, at very end of mentioned section )
//
// ! read right to left !
state[8] = 0b1ull;

#pragma unroll 7
for (size_t i = 9; i < 16; i++) {
state[i] = 0ull;
}

// this 1 is added to input message bits due to padding requirement
// defined in keccak-256 implementation guide
// https://keccak.team/files/Keccak-implementation-3.2.pdf 's section 1.1
// where `Keccak[r, c](M)` is defined ( spcifically padding rule block in
// pseudocode, at very end of mentioned section )
//
// ! read right to left, so it's actually 1 << 63 !
state[16] = 9223372036854775808ull;

#pragma unroll 8
for (size_t i = 17; i < 25; i++) {
state[i] = 0ull;
}
}

// From absorbed hash state array of dimension 5 x 5 x 64, produces 32 -bytes
// digest using method defined in section 3.1.3 of
// http://dx.doi.org/10.6028/NIST.FIPS.202 and algorithm 11 defined in section
// B.1 of above hyperlinked document
//
// Note, digest preparation method is same for both sha3-256 and keccak-256
void
to_digest_bytes(const sycl::ulong* __restrict in,
sycl::uchar* const __restrict digest)
{
#pragma unroll 4
for (size_t i = 0; i < 4; i++) {
const sycl::ulong lane = in[i];

digest[(i << 3) + 0] = static_cast<sycl::uchar>((lane >> 0) & 0xffull);
digest[(i << 3) + 1] = static_cast<sycl::uchar>((lane >> 8) & 0xffull);
digest[(i << 3) + 2] = static_cast<sycl::uchar>((lane >> 16) & 0xffull);
digest[(i << 3) + 3] = static_cast<sycl::uchar>((lane >> 24) & 0xffull);
digest[(i << 3) + 4] = static_cast<sycl::uchar>((lane >> 32) & 0xffull);
digest[(i << 3) + 5] = static_cast<sycl::uchar>((lane >> 40) & 0xffull);
digest[(i << 3) + 6] = static_cast<sycl::uchar>((lane >> 48) & 0xffull);
digest[(i << 3) + 7] = static_cast<sycl::uchar>((lane >> 56) & 0xffull);
}
}

// Keccak-256 2-to-1 hasher, where input is 64 contiguous bytes which is hashed
// to produce 32 -bytes output
//
// This function itself doesn't do much instead of calling other functions
// which actually
// - prepares state bit array from input byte array ( difference lies here, when
// comparing between sha3-256 & keccak-256 )
// - permutes input using `keccak-p[b, n_r]`
// - truncates first 256 -bits from state bit array
//
// See section 6.1 of http://dx.doi.org/10.6028/NIST.FIPS.202
void
hash(const sycl::uchar* __restrict in, sycl::uchar* const __restrict digest)
{
sycl::ulong state[25];

to_state_array(in, state);
keccak_p(state);
to_digest_bytes(state, digest);
}

}
33 changes: 28 additions & 5 deletions include/merklize.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
#if !(defined SHA1 || defined SHA2_224 || defined SHA2_256 || \
defined SHA2_384 || defined SHA2_512 || defined SHA2_512_224 || \
defined SHA2_512_256 || defined SHA3_256 || defined SHA3_224 || \
defined SHA3_384 || defined SHA3_512)
defined SHA3_384 || defined SHA3_512 || defined KECCAK_256)
#define SHA2_256
#endif

Expand Down Expand Up @@ -40,6 +40,9 @@
#elif defined SHA3_512
#include "sha3_512.hpp"
#pragma message "Choosing to compile Merklization with SHA3-512 !"
#elif defined KECCAK_256
#include "keccak_256.hpp"
#pragma message "Choosing to compile Merklization with KECCAK-256 !"
#endif

// Binary merklization --- collects motivation from
Expand All @@ -56,7 +59,7 @@ merklize(sycl::queue& q,
defined SHA2_512_256
const sycl::ulong* __restrict leaf_nodes,
#elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || \
defined SHA3_512
defined SHA3_512 || defined KECCAK_256
const sycl::uchar* __restrict leaf_nodes,
#endif

Expand All @@ -69,7 +72,7 @@ merklize(sycl::queue& q,
defined SHA2_512_256
sycl::ulong* const __restrict intermediates,
#elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || \
defined SHA3_512
defined SHA3_512 || defined KECCAK_256
sycl::uchar* const __restrict intermediates,
#endif

Expand Down Expand Up @@ -116,12 +119,16 @@ merklize(sycl::queue& q,
#elif defined SHA3_512
assert(i_size == leaf_cnt * sha3_512::OUT_LEN_BYTES);
assert(o_size == (itmd_cnt + 1) * sha3_512::OUT_LEN_BYTES);
#elif defined KECCAK_256
assert(i_size == leaf_cnt * keccak_256::OUT_LEN_BYTES);
assert(o_size == (itmd_cnt + 1) * keccak_256::OUT_LEN_BYTES);
#endif

// both input and output allocation has same size
#if defined SHA1 || defined SHA2_224 || defined SHA2_256 || \
defined SHA2_384 || defined SHA2_512 || defined SHA2_512_256 || \
defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || defined SHA3_512
defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || \
defined SHA3_512 || defined KECCAK_256

assert(i_size == o_size);

Expand Down Expand Up @@ -160,7 +167,7 @@ merklize(sycl::queue& q,
// note that `o_size` is in terms of bytes
const size_t elm_cnt = o_size >> 3;
#elif defined SHA3_256 || defined SHA3_224 || defined SHA3_384 || \
defined SHA3_512
defined SHA3_512 || defined KECCAK_256
// # -of 8 -bit unsigned integers ( read a byte ), which can be contiguously
// placed on output memory allocation
//
Expand Down Expand Up @@ -227,6 +234,9 @@ merklize(sycl::queue& q,
#elif defined SHA3_512
const size_t in_idx = idx * sha3_512::IN_LEN_BYTES;
const size_t out_idx = idx * sha3_512::OUT_LEN_BYTES;
#elif defined KECCAK_256
const size_t in_idx = idx * keccak_256::IN_LEN_BYTES;
const size_t out_idx = idx * keccak_256::OUT_LEN_BYTES;
#endif

#if defined SHA1
Expand Down Expand Up @@ -270,6 +280,11 @@ merklize(sycl::queue& q,
sycl::uchar* out = intermediates + o_offset + out_idx;

sha3_512::hash(in, out);
#elif defined KECCAK_256
const sycl::uchar* in = leaf_nodes + i_offset + in_idx;
sycl::uchar* out = intermediates + o_offset + out_idx;

keccak_256::hash(in, out);
#endif
});
});
Expand Down Expand Up @@ -398,6 +413,9 @@ merklize(sycl::queue& q,
#elif defined SHA3_512
const size_t in_idx = idx * sha3_512::IN_LEN_BYTES;
const size_t out_idx = idx * sha3_512::OUT_LEN_BYTES;
#elif defined KECCAK_256
const size_t in_idx = idx * keccak_256::IN_LEN_BYTES;
const size_t out_idx = idx * keccak_256::OUT_LEN_BYTES;
#endif

#if defined SHA1
Expand Down Expand Up @@ -446,6 +464,11 @@ merklize(sycl::queue& q,
sycl::uchar* out = intermediates + o_offset_ + out_idx;

sha3_512::hash(in, out);
#elif defined KECCAK_256
const sycl::uchar* in = intermediates + i_offset_ + in_idx;
sycl::uchar* out = intermediates + o_offset_ + out_idx;

keccak_256::hash(in, out);
#endif
});
});
Expand Down
Loading

0 comments on commit ebcd785

Please sign in to comment.