Merge pull request #6 from itzmeanjan/sha3-256
Implementing SHA3-256 based Binary Merklization
itzmeanjan authored Feb 2, 2022
2 parents 761a97d + 948c03b commit 8a2c006
Showing 32 changed files with 2,311 additions and 7 deletions.
30 changes: 26 additions & 4 deletions README.md
@@ -4,7 +4,7 @@ SYCL accelerated Binary Merklization using SHA1, SHA2 & SHA3

## Motivation

After implementing BLAKE3 using SYCL, I decided to accelerate the 2-to-1 hash implementation of all variants of the SHA1, SHA2 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization, due to its inherently data-parallel construction: each 1024-byte chunk can be compressed independently ( read: in parallel ), after which it's a binary merklization problem with the compressed chunks as leaf nodes of a binary merkle tree. But none of the SHA1, SHA2 families of cryptographic hash functions are data parallel; they require each message block ( 512-bit/ 1024-bit ) to be processed sequentially. That is why I only concentrated on accelerating Binary Merklization, where SHA1/ SHA2 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of the tree when N-many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N-many leaf nodes is a hash digest --- for example, when using the SHA2-256 variant for computing all intermediate nodes of the binary merkle tree, each provided leaf node is 32 bytes wide, representing a SHA2-256 digest. Now, the N-many leaf digests are merged into N/2-many digests, which are the intermediate nodes living just above the leaf nodes. Then in the next phase, those N/2-many intermediates are used for computing the N/4-many intermediates living just above them. This process continues until the root of the merkle tree is computed. Notice that in each level of the tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In the following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel, and once they're computed, the root of the tree can be computed as a single task.
After implementing BLAKE3 using SYCL, I decided to accelerate the 2-to-1 hash implementation of all variants of the SHA1, SHA2 & SHA3 families of cryptographic hash functions. BLAKE3 lends itself pretty well to parallelization, due to its inherently data-parallel construction: each 1024-byte chunk can be compressed independently ( read: in parallel ), after which it's a binary merklization problem with the compressed chunks as leaf nodes of a binary merkle tree. But none of the SHA1, SHA2 & SHA3 families of cryptographic hash functions are data parallel; they require each message block ( 512-bit/ 1024-bit, or padded to 1600-bit in case of SHA3 family ) to be processed sequentially. That is why I only concentrated on accelerating Binary Merklization, where SHA1/ SHA2/ SHA3 families of cryptographic ( 2-to-1 ) hash functions are used for computing all intermediate nodes of the tree when N-many leaf nodes are provided, where `N = 2 ^ i | i = {1, 2, 3 ...}`. Each of these N-many leaf nodes is a hash digest --- for example, when using the SHA2-256 variant for computing all intermediate nodes of the binary merkle tree, each provided leaf node is 32 bytes wide, representing a SHA2-256 digest. Now, the N-many leaf digests are merged into N/2-many digests, which are the intermediate nodes living just above the leaf nodes. Then in the next phase, those N/2-many intermediates are used for computing the N/4-many intermediates living just above them. This process continues until the root of the merkle tree is computed. Notice that in each level of the tree, each consecutive pair of digests can be hashed independently --- and that's the scope of parallelism I'd like to make use of during binary merklization. In the following depiction, when N ( = 4 ) nodes are provided as input, two intermediates can be computed in parallel, and once they're computed, the root of the tree can be computed as a single task.

```bash
((a, b), (c, d)) < --- [Level 1] [Root]
@@ -25,14 +25,16 @@ input = [a, b, c, d]
output = [0, ((a, b), (c, d)), (a, b), (c, d)]
```
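
The level-by-level procedure described above can be sketched in plain C++ ( no SYCL ), under a few loudly stated assumptions: `digest_t` and `merge` are hypothetical stand-ins ( a toy mixing function, not SHA1/ SHA2/ SHA3 ), and the output layout is assumed to generalize the `N = 4` depiction, i.e. index 0 unused, root at index 1, children of node `i` at `2i` and `2i + 1`. It is an illustration of the idea, not the kernel kept in this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical digest type; a real digest would be 20 to 64 bytes wide.
using digest_t = std::uint64_t;

// Hypothetical 2-to-1 "hash": a toy mixing function, for illustration only.
// A real build would hash the byte concatenation of two digests instead.
inline digest_t
merge(digest_t left, digest_t right)
{
  return (left * 31u) ^ (right + 0x9e3779b97f4a7c15ull);
}

// Computes all intermediate nodes for N leaves, N being a power of two >= 2.
// Assumed output layout: out[0] unused, out[1] root,
// children of out[i] live at out[2 * i] and out[2 * i + 1].
std::vector<digest_t>
merklize(const std::vector<digest_t>& leaves)
{
  const std::size_t n = leaves.size();
  assert(n >= 2 && (n & (n - 1)) == 0);

  std::vector<digest_t> out(n, 0);

  // Intermediates living just above the leaves: N/2 independent merges,
  // written to indices [N/2, N).
  for (std::size_t i = n / 2; i < n; i++) {
    const std::size_t j = 2 * (i - n / 2);
    out[i] = merge(leaves[j], leaves[j + 1]);
  }

  // Remaining levels, bottom-up. Iterations of the inner loop do not depend
  // on each other, so each level could be dispatched as one parallel task
  // group; levels themselves must run one after another.
  for (std::size_t width = n / 4; width >= 1; width /= 2) {
    for (std::size_t i = width; i < 2 * width; i++) {
      out[i] = merge(out[2 * i], out[2 * i + 1]);
    }
  }

  return out;
}
```

With `leaves = {a, b, c, d}` this fills `out[2]` and `out[3]` first ( independently ), then `out[1]` from those two, matching the `[0, ((a, b), (c, d)), (a, b), (c, d)]` layout shown above.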

Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, which use SHA1/ SHA2 variants as the 2-to-1 hash function; which one to use is a compile-time choice made using a pre-processor directive.
Here in this repository, I'm keeping binary merklization kernels, implemented in SYCL, which use SHA1/ SHA2/ SHA3 variants as the 2-to-1 hash function; which one to use is a compile-time choice made using a pre-processor directive.

If you happen to be interested in Binary Merklization using Rescue Prime Hash/ BLAKE3, see the following links.

- [Binary Merklization using Rescue Prime Hash](https://github.com/itzmeanjan/ff-gpu)
- [Binary Merklization using BLAKE3](https://github.com/itzmeanjan/blake3)

> During SHA1, SHA2 implementations, I've followed Secure Hash Standard [specification](https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf).
> During SHA1, SHA2 implementations, I've followed Secure Hash Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.180-4).
> During SHA3 implementations, I've followed SHA-3 Standard [specification](http://dx.doi.org/10.6028/NIST.FIPS.202).
> Using SHA1 for binary merklization may not be a good choice these days, see [here](https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions). But still I'm keeping SHA1 implementation, just as a reference.
@@ -82,12 +84,16 @@ If you happen to be interested in 2-to-1 hash implementation of
- [SHA2-512](https://github.com/itzmeanjan/merklize-sha/blob/fd76b7a/example/sha2_512.cpp)
- [SHA2-512/224](https://github.com/itzmeanjan/merklize-sha/blob/fd76b7a/example/sha2_512_224.cpp)
- [SHA2-512/256](https://github.com/itzmeanjan/merklize-sha/blob/fd76b7a/example/sha2_512_256.cpp)
- [SHA3-224](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_224.cpp)
- [SHA3-256](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_256.cpp)
- [SHA3-384](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_384.cpp)
- [SHA3-512](https://github.com/itzmeanjan/merklize-sha/blob/8f9b168/example/sha3_512.cpp)

where two digests of respective hash functions are input, in byte concatenated form, to `hash( ... )` function, consider taking a look at above hyperlinked examples.
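
For the SHA3 variants, it can help to see how large the byte-concatenated input is relative to the sponge's rate. The following is a rough sketch, assuming the FIPS 202 parameters ( 1600-bit state, capacity of twice the digest size, and at least 4 bits of domain suffix plus padding ); the helper names are hypothetical and not taken from this repository.

```cpp
#include <cstddef>

// FIPS 202: capacity = 2 * digest bits, rate = 1600 - capacity.
constexpr std::size_t
rate_bits(std::size_t digest_bits)
{
  return 1600 - 2 * digest_bits;
}

// Input of a 2-to-1 hash: two digests in byte concatenated form.
constexpr std::size_t
input_bits(std::size_t digest_bits)
{
  return 2 * digest_bits;
}

// Number of rate-sized blocks absorbed: message bits, plus the 2-bit SHA3
// domain suffix, plus at least 2 bits of pad10*1, rounded up to the rate.
constexpr std::size_t
blocks_absorbed(std::size_t digest_bits)
{
  return (input_bits(digest_bits) + 4 + rate_bits(digest_bits) - 1) /
         rate_bits(digest_bits);
}

static_assert(rate_bits(224) == 1152, "SHA3-224 rate");
static_assert(rate_bits(256) == 1088, "SHA3-256 rate");
static_assert(rate_bits(384) == 832, "SHA3-384 rate");
static_assert(rate_bits(512) == 576, "SHA3-512 rate");
```

Under these assumptions, two concatenated digests fit in a single rate-sized block for SHA3-224, SHA3-256 and SHA3-384, while the 1024-bit input of SHA3-512 exceeds its 576-bit rate and spans two blocks.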

> Compile above examples using `dpcpp -fsycl example/<file>.cpp -I./include`
You may also want to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/fd76b7a/include/merklize.hpp)
You may also want to see how binary merklization kernels use these 2-to-1 hash functions; see [here](https://github.com/itzmeanjan/merklize-sha/blob/4aadd99/include/merklize.hpp)

## Tests

@@ -131,5 +137,21 @@ I'm keeping binary merklization benchmark results of
- [Nvidia GPU(s)](results/sha2-512-256/nvidia_gpu.md)
- [Intel CPU(s)](results/sha2-512-256/intel_cpu.md)
- [Intel GPU(s)](results/sha2-512-256/intel_gpu.md)
- SHA3-256
- [Nvidia GPU(s)](results/sha3-256/nvidia_gpu.md)
- [Intel CPU(s)](results/sha3-256/intel_cpu.md)
- [Intel GPU(s)](results/sha3-256/intel_gpu.md)
- SHA3-224
- [Nvidia GPU(s)](results/sha3-224/nvidia_gpu.md)
- [Intel CPU(s)](results/sha3-224/intel_cpu.md)
- [Intel GPU(s)](results/sha3-224/intel_gpu.md)
- SHA3-384
- [Nvidia GPU(s)](results/sha3-384/nvidia_gpu.md)
- [Intel CPU(s)](results/sha3-384/intel_cpu.md)
- [Intel GPU(s)](results/sha3-384/intel_gpu.md)
- SHA3-512
- [Nvidia GPU(s)](results/sha3-512/nvidia_gpu.md)
- [Intel CPU(s)](results/sha3-512/intel_cpu.md)
- [Intel GPU(s)](results/sha3-512/intel_gpu.md)

obtained after executing them on multiple accelerators.
20 changes: 20 additions & 0 deletions bench/main.cpp
@@ -53,6 +53,26 @@ main(int argc, char** argv)
#elif defined SHA2_512
  std::cout << "\nBenchmarking Binary Merklization using SHA2-512" << std::endl
            << std::endl;
#elif defined SHA2_512_224
  std::cout << "\nBenchmarking Binary Merklization using SHA2-512/224"
            << std::endl
            << std::endl;
#elif defined SHA2_512_256
  std::cout << "\nBenchmarking Binary Merklization using SHA2-512/256"
            << std::endl
            << std::endl;
#elif defined SHA3_256
  std::cout << "\nBenchmarking Binary Merklization using SHA3-256" << std::endl
            << std::endl;
#elif defined SHA3_224
  std::cout << "\nBenchmarking Binary Merklization using SHA3-224" << std::endl
            << std::endl;
#elif defined SHA3_384
  std::cout << "\nBenchmarking Binary Merklization using SHA3-384" << std::endl
            << std::endl;
#elif defined SHA3_512
  std::cout << "\nBenchmarking Binary Merklization using SHA3-512" << std::endl
            << std::endl;
#endif

  std::cout << std::setw(16) << std::right << "leaf count"
67 changes: 67 additions & 0 deletions example/sha3_224.cpp
@@ -0,0 +1,67 @@
#include "sha3_224.hpp"
#include <cassert>

// This example attempts to show how to use 2-to-1 SHA3-224 hash function !
int
main(int argc, char** argv)
{
  // $ python3
  // >>> a = [0xff] * 28
  //
  // first input digest
  constexpr sycl::uchar digest_0[28] = {
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
  };

  // >>> b = [0x0f] * 28
  //
  // second input digest
  constexpr sycl::uchar digest_1[28] = {
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
  };

  // >>> c = a + b
  // >>> import hashlib
  // >>> list(hashlib.sha3_224(bytes(c)).digest())
  //
  // final output digest after merging two input digests
  constexpr sycl::uchar digest_2[28] = { 248, 78, 65, 145, 150, 163, 134,
                                         170, 248, 171, 53, 130, 25, 148,
                                         39, 171, 106, 173, 59, 81, 105,
                                         249, 104, 41, 206, 226, 62, 243 };

  sycl::default_selector s{};
  sycl::device d{ s };
  sycl::context c{ d };
  sycl::queue q{ c, d };

  // so that input digests can be transferred from host to device ( by runtime )
  sycl::uchar* in = static_cast<sycl::uchar*>(
    sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));

  // so that output digest can be transferred from device to host ( by runtime )
  sycl::uchar* out =
    static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));

  // copy both input digests to device memory
  q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
  q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();

  // compute 2-to-1 hash
  q.single_task<class kernelExampleSHA3_224>(
    [=]() { sha3_224::hash(in, out); });
  q.wait();

  // finally assert !
  for (size_t i = 0; i < sizeof(digest_2); i++) {
    assert(*(out + i) == digest_2[i]);
  }

  // deallocate resources !
  sycl::free(in, q);
  sycl::free(out, q);

  return EXIT_SUCCESS;
}
68 changes: 68 additions & 0 deletions example/sha3_256.cpp
@@ -0,0 +1,68 @@
#include "sha3_256.hpp"
#include <cassert>

// This example attempts to show how to use 2-to-1 SHA3-256 hash function !
int
main(int argc, char** argv)
{
  // $ python3
  // >>> a = [0xff] * 32
  //
  // first input digest
  constexpr sycl::uchar digest_0[32] = {
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255
  };

  // >>> b = [0x0f] * 32
  //
  // second input digest
  constexpr sycl::uchar digest_1[32] = { 15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15 };

  // >>> c = a + b
  // >>> import hashlib
  // >>> list(hashlib.sha3_256(bytes(c)).digest())
  //
  // final output digest after merging two input digests
  constexpr sycl::uchar digest_2[32] = {
    121, 136, 237, 222, 17, 197, 60, 82, 161, 87, 52, 66, 251, 235, 8, 125,
    1, 95, 88, 134, 1, 235, 132, 182, 114, 55, 207, 202, 17, 104, 74, 95
  };

  sycl::default_selector s{};
  sycl::device d{ s };
  sycl::context c{ d };
  sycl::queue q{ c, d };

  // so that input digests can be transferred from host to device ( by runtime )
  sycl::uchar* in = static_cast<sycl::uchar*>(
    sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));

  // so that output digest can be transferred from device to host ( by runtime )
  sycl::uchar* out =
    static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));

  // copy both input digests to device memory
  q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
  q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();

  // compute 2-to-1 hash
  q.single_task<class kernelExampleSHA3_256>(
    [=]() { sha3_256::hash(in, out); });
  q.wait();

  // finally assert !
  for (size_t i = 0; i < sizeof(digest_2); i++) {
    assert(*(out + i) == digest_2[i]);
  }

  // deallocate resources !
  sycl::free(in, q);
  sycl::free(out, q);

  return EXIT_SUCCESS;
}
72 changes: 72 additions & 0 deletions example/sha3_384.cpp
@@ -0,0 +1,72 @@
#include "sha3_384.hpp"
#include <cassert>

// This example attempts to show how to use 2-to-1 SHA3-384 hash function !
int
main(int argc, char** argv)
{
  // $ python3
  // >>> a = [0xff] * 48
  //
  // first input digest
  constexpr sycl::uchar digest_0[48] = {
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
  };

  // >>> b = [0x0f] * 48
  //
  // second input digest
  constexpr sycl::uchar digest_1[48] = { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
                                         15, 15, 15, 15, 15, 15, 15, 15 };

  // >>> c = a + b
  // >>> import hashlib
  // >>> list(hashlib.sha3_384(bytes(c)).digest())
  //
  // final output digest after merging two input digests
  constexpr sycl::uchar digest_2[48] = {
    25, 254, 93, 230, 2, 191, 78, 51, 238, 228, 239, 160,
    231, 101, 38, 216, 38, 8, 135, 59, 34, 169, 154, 20,
    221, 245, 50, 59, 27, 9, 21, 234, 249, 223, 45, 73,
    214, 0, 146, 51, 25, 83, 0, 0, 111, 210, 47, 206
  };

  sycl::default_selector s{};
  sycl::device d{ s };
  sycl::context c{ d };
  sycl::queue q{ c, d };

  // so that input digests can be transferred from host to device ( by runtime )
  sycl::uchar* in = static_cast<sycl::uchar*>(
    sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));

  // so that output digest can be transferred from device to host ( by runtime )
  sycl::uchar* out =
    static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));

  // copy both input digests to device memory
  q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
  q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();

  // compute 2-to-1 hash
  q.single_task<class kernelExampleSHA3_384>(
    [=]() { sha3_384::hash(in, out); });
  q.wait();

  // finally assert !
  for (size_t i = 0; i < sizeof(digest_2); i++) {
    assert(*(out + i) == digest_2[i]);
  }

  // deallocate resources !
  sycl::free(in, q);
  sycl::free(out, q);

  return EXIT_SUCCESS;
}
75 changes: 75 additions & 0 deletions example/sha3_512.cpp
@@ -0,0 +1,75 @@
#include "sha3_512.hpp"
#include <cassert>

// This example attempts to show how to use 2-to-1 SHA3-512 hash function !
int
main(int argc, char** argv)
{
  // $ python3
  // >>> a = [0xff] * 64
  //
  // first input digest
  constexpr sycl::uchar digest_0[64] = {
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
  };

  // >>> b = [0x0f] * 64
  //
  // second input digest
  constexpr sycl::uchar digest_1[64] = {
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15
  };

  // >>> c = a + b
  // >>> import hashlib
  // >>> list(hashlib.sha3_512(bytes(c)).digest())
  //
  // final output digest after merging two input digests
  constexpr sycl::uchar digest_2[64] = {
    73, 228, 11, 92, 59, 196, 139, 212, 163, 66, 229, 66, 106,
    155, 168, 55, 241, 215, 241, 253, 75, 61, 91, 215, 172, 186,
    250, 212, 10, 12, 61, 253, 80, 236, 57, 238, 27, 53, 53,
    20, 81, 55, 63, 196, 104, 93, 94, 74, 19, 36, 181, 15,
    41, 21, 198, 35, 60, 3, 65, 232, 15, 78, 220, 61
  };

  sycl::default_selector s{};
  sycl::device d{ s };
  sycl::context c{ d };
  sycl::queue q{ c, d };

  // so that input digests can be transferred from host to device ( by runtime )
  sycl::uchar* in = static_cast<sycl::uchar*>(
    sycl::malloc_shared(sizeof(digest_0) + sizeof(digest_1), q));

  // so that output digest can be transferred from device to host ( by runtime )
  sycl::uchar* out =
    static_cast<sycl::uchar*>(sycl::malloc_shared(sizeof(digest_2), q));

  // copy both input digests to device memory
  q.memcpy(in + 0, digest_0, sizeof(digest_0)).wait();
  q.memcpy(in + sizeof(digest_0), digest_1, sizeof(digest_1)).wait();

  // compute 2-to-1 hash
  q.single_task<class kernelExampleSHA3_512>(
    [=]() { sha3_512::hash(in, out); });
  q.wait();

  // finally assert !
  for (size_t i = 0; i < sizeof(digest_2); i++) {
    assert(*(out + i) == digest_2[i]);
  }

  // deallocate resources !
  sycl::free(in, q);
  sycl::free(out, q);

  return EXIT_SUCCESS;
}