
SHA-2 Maj, SHA-1 H, MD4 G trick #4727

Open
solardiz opened this issue May 11, 2021 · 11 comments
Labels
RFC / discussion Help or comments wanted

Comments

@solardiz (Member)

I just found this old message:

https://groups.google.com/g/crypto-optimization/c/ZjtgKWg8AnA

Wei Dai
Mar 28, 2009, 3:15:43 AM
to [email protected]
[This is a repost that was originally written for the NIST SHA-3 forum.]

Dan Bernstein wrote:
> There was some discussion of SHA-256 performance on sci.crypt in 2003.
> Eric Young said he had code using 2891 instructions/block and running at
> 15.53 cycles/byte on an Opteron. The consensus was that 2824 arithmetic
> instructions were necessary; that means at least 14.7 cycles/byte on any
> current CPU.

Thanks for pointing to that discussion. After reading it, I thought of a
possible optimization that improves the 2824 instructions bound by 2*63 to
2698, and the theoretical speed limit to 14.1 cpb. From the newsgroup post
(http://groups.google.com/group/sci.crypt/browse_thread/thread/84bfde56e8ac6047/):

> Maj(x,y,z)
>
> We can write Maj(x,y,z) = x ^ ((x ^ y) & (x ^ z))
>
> Maj() then requires 3 XOR and 1 AND. We also need to make
> a copy of either x or y and a copy of either x or z.

We can instead write Maj(x,y,z) = y ^ ((x ^ y) & (y ^ z)). Then after
computing x^y, cache it until the next round, where it can be reused as y^z.
This reduces the number of XOR by 1, and the number of copies by 1, for
every round except the first.

So we can write Maj in 4 basic operations (though with a total circuit depth, or latency, of 3), and moreover, with Wei Dai's trick, there's a common subexpression that amortizes this to 3 basic operations per round.

When we have bitselect() or the like, we don't need this, but when we don't, we could give this a try.

Right now, when we don't use bitselect(), etc., we use this in C:

sha2.c: maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:      maj = (b & c) ^ (b & d) ^ (c & d);
simd-intrinsics.c:#define Maj(x,y,z) vor(vand(x, y), vand(vor(x, y), z))

This is either 5 operations (but with a depth of only 2, and there might be common subexpressions between steps - I didn't check) or 4 operations (with a depth of 3, and again I didn't check for common subexpressions). Maybe these can be reduced to an amortized 3 operations (unless that's already achieved).

We use this in OpenCL:

opencl_sha2.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
opencl_sha2_common.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
pwsafe_kernel.cl:#define Maj(x, y, z) ((x & y) | (z & (x | y)))

This is the same as the 4 operations version in simd-intrinsics.c.

Why do we choose to write this differently in scalar vs. SIMD/OpenCL code? Do we prioritize instruction-level parallelism over operation count in scalar code (since we have multiple execution units yet only one data stream)? That could actually make sense, but it's worth testing both ways, along with the alternatives from the posting I referenced, and worth adding source code comments (mentioning the alternatives).

@solardiz solardiz added the RFC / discussion Help or comments wanted label May 11, 2021
@solardiz (Member Author)

Oh, found that we actually mostly use the same 4 operations version in scalar code as well:

sha2.c:#define F0(x,y,z) ((x & y) | (z & (x | y)))

The 5-operation one is in sha512_reverse(), so it was probably copy-pasted from closer-to-reference code. We may want to update it and see the effect, even though it only affects load time.

So my main point is that we may want to try these two alternative 4-operation versions from the posting I quoted.

@solardiz (Member Author)

So this gives me slightly smaller code and about 1% speedup at sha256crypt, sha512crypt, Bitcoin on AVX2:

+++ b/src/simd-intrinsics.c
@@ -1783,7 +1783,7 @@ void SIMDSHA1body(vtype* _data, uint32_t *out, uint32_t *reload_state,
 #elif !VCMOV_EMULATED
 #define Maj(x,y,z) vcmov(x, y, vxor(z, y))
 #else
-#define Maj(x,y,z) vor(vand(x, y), vand(vor(x, y), z))
+#define Maj(x,y,z) vxor(y, vand(vxor(x, y), vxor(y, z)))
 #endif
 
 #define Ch(x,y,z) vcmov(y, z, x)
@@ -2418,6 +2418,7 @@ void SIMDSHA256body(vtype *data, uint32_t *out, uint32_t *reload_state, unsigned
 )
 #endif
 
+#if 0
 #ifdef vternarylogic
 #define Maj(x,y,z) vternarylogic(x, y, z, 0xE8)
 #elif !VCMOV_EMULATED
@@ -2427,6 +2428,7 @@ void SIMDSHA256body(vtype *data, uint32_t *out, uint32_t *reload_state, unsigned
 #endif
 
 #define Ch(x,y,z) vcmov(y, z, x)
+#endif
 
 #define SHA512_PARA_DO(x) for (x = 0; x < SIMD_PARA_SHA512; ++x)

As you can see, I simply #if 0 out the repetition of these defines - why did we even have that?

Need to also update OpenCL and scalar code, and test this more.

@solardiz solardiz added this to the Definitely 1.9.0-jumbo-2 milestone May 14, 2021
@magnumripper (Member)

As you can see, I simply #if 0 out the repetition of these defines - why did we even have that?

Looks like two files were merged into one. Perhaps the optimal SHA-512 macros could end up different from the SHA-256 ones on some arch, but then they'd need to be undefined first.

@solardiz (Member Author)

Maybe there's a reason why Wei Dai called this "SHA Maj trick", not "SHA-2 Maj trick": SHA-1's H() is the same as SHA-2's Maj(). However, trying to apply this trick to our SHA-1 for AVX2 actually results in a slight code size increase without an obvious effect on speed on Haswell. This is without explicit caching/reuse - letting the compiler possibly do it. We may want to retest this across several compiler versions.

BTW, the last time we seriously discussed related issues appears to be this thread: https://www.openwall.com/lists/john-dev/2015/09/02/5

What I don't understand is why we choose to explicitly save a temporary value (that we only use once), e.g. in this macro (and in many other such macros):

#define SHA1_H(x,y,z)                                       \
    tmp[i] = vand((x[i]),(y[i]));                           \
    tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));

BTW, my attempted alternatives to it were:

#define SHA1_H(x,y,z)                                       \
    tmp[i] = vxor((y[i]), vand(vxor((x[i]), (y[i])), vxor((y[i]), z[i])));

and:

#define SHA1_H(x,y,z) \
    tmp[i] = vxor((x[i]), (y[i])); \
    tmp[i] = vxor((y[i]), vand((tmp[i]), vxor((y[i]), z[i])));

I also don't understand why we put the extra parentheses around x[i], etc. It'd make more sense to have parentheses around x if we worry that it could be a complicated expression.

As mentioned in that john-dev thread I referenced, we can/should also try re-ordering the arguments to Maj() and H() in all 6 possible ways to pick the fastest ordering.

@solardiz solardiz changed the title SHA-2 Maj tricks SHA-2 Maj and SHA-1 H tricks May 17, 2021
@solardiz (Member Author)

trying to apply this trick to our SHA-1 for AVX2 actually results in slight code size increase

The code size increase is avoided by also reordering the arguments (e.g. to SHA1_H(y,z,x)), but there's no decrease either.

@solardiz (Member Author)

This is without explicit caching/reuse - letting the compiler possibly do it.

Looks like explicit caching is sometimes needed. Experimenting with yescrypt/sha256.c, I get 1944 lines of objdump -d sha256.o output for it unmodified, 1925 with Maj() updated to use the trick, and 1912 with explicit caching (which also requires a few additions at upper levels). This is with RHEL7's gcc building for x86_64 with make inside the yescrypt/ directory.

@magnumripper (Member) commented May 17, 2021

BTW, the last time we seriously discussed related issues appears to be this thread: https://www.openwall.com/lists/john-dev/2015/09/02/5

That thread led to #1727

Maybe there's a reason why Wei Dai called this "SHA Maj trick", not "SHA-2 Maj trick": SHA-1's H() is the same as SHA-2's Maj(). (...)

...and revisiting #1727 I was reminded that MD4's G() too is that same function. (EDIT: MD4, not MD5)

@solardiz solardiz changed the title SHA-2 Maj and SHA-1 H tricks SHA-2 Maj, SHA-1 H, MD4 G trick May 18, 2021
@solardiz (Member Author)

MD4's G() too is that same function.

Thanks, @magnumripper! I've just tried patching my generic md4.c accordingly. On i386 (does anyone still care?), this results in a code size increase in bytes and almost no change in instruction count. On x86_64, it reduces the instruction count, and with recent gcc also substantially reduces code size in bytes. Curiously, I'm getting the exact same code generated for implicit/potential vs. explicit caching/reuse. Tried gcc 4.4.7, 4.6.3, 10.2.0.

Committed SHA-256 changes into yescrypt (not yet updated in JtR) and yespower:

+++ sha256.c
@@ -101,7 +101,11 @@ static const uint32_t Krnd[64] = {
 
 /* Elementary functions used by SHA256 */
 #define Ch(x, y, z)	((x & (y ^ z)) ^ z)
-#define Maj(x, y, z)	((x & (y | z)) | (y & z))
+#if 1 /* Explicit caching/reuse of common subexpression between rounds */
+#define Maj(x, y, z)	(y ^ ((x_xor_y = x ^ y) & y_xor_z))
+#else /* Let the compiler cache/reuse or not */
+#define Maj(x, y, z)	(y ^ ((x ^ y) & (y ^ z)))
+#endif
 #define SHR(x, n)	(x >> n)
 #define ROTR(x, n)	((x >> n) | (x << (32 - n)))
 #define S0(x)		(ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
@@ -113,7 +117,8 @@ static const uint32_t Krnd[64] = {
 #define RND(a, b, c, d, e, f, g, h, k)			\
 	h += S1(e) + Ch(e, f, g) + k;			\
 	d += h;						\
-	h += S0(a) + Maj(a, b, c);
+	h += S0(a) + Maj(a, b, c);			\
+	y_xor_z = x_xor_y;
 
 /* Adjusted round function for rotating state */
 #define RNDr(S, W, i, ii)			\
@@ -146,6 +151,7 @@ SHA256_Transform(uint32_t state[static restrict 8],
 
 	/* 3. Mix. */
 	for (i = 0; i < 64; i += 16) {
+		uint32_t x_xor_y, y_xor_z = S[(65 - i) % 8] ^ S[(66 - i) % 8];
 		RNDr(S, W, 0, i);
 		RNDr(S, W, 1, i);
 		RNDr(S, W, 2, i);

Experimental MD4 changes:

+++ md4.c	2021-05-18 21:14:11 +0200
@@ -48,7 +48,11 @@
  * optimization for F borrowed from Colin Plumb's MD5 implementation.
  */
 #define F(x, y, z)			((z) ^ ((x) & ((y) ^ (z))))
-#define G(x, y, z)			(((x) & ((y) | (z))) | ((y) & (z)))
+#if 1 /* Explicit caching/reuse of common subexpression between rounds */
+#define G(x, y, z)			(r = ((y) ^ ((n = (x) ^ (y)) & p)), p = n, r)
+#else /* Let the compiler cache/reuse or not */
+#define G(x, y, z)			((y) ^ (((x) ^ (y)) & ((y) ^ (z))))
+#endif
 #define H(x, y, z)			((x) ^ (y) ^ (z))
 
 /*
@@ -96,7 +100,7 @@
 static const void *body(MD4_CTX *ctx, const void *data, unsigned long size)
 {
 	const unsigned char *ptr;
-	MD4_u32plus a, b, c, d;
+	MD4_u32plus a, b, c, d, n, p, r;
 	MD4_u32plus saved_a, saved_b, saved_c, saved_d;
 	const MD4_u32plus ac1 = 0x5a827999, ac2 = 0x6ed9eba1;
 
@@ -132,6 +136,7 @@
 		STEP(F, b, c, d, a, SET(15), 19)
 
 /* Round 2 */
+		p = c ^ d;
 		STEP(G, a, b, c, d, GET(0) + ac1, 3)
 		STEP(G, d, a, b, c, GET(4) + ac1, 5)
 		STEP(G, c, d, a, b, GET(8) + ac1, 9)

For MD4, I think we'll need to keep the old G expression on i386, and not bother with explicit caching/reuse on x86_64 since the same code is generated anyway (something that wasn't happening for SHA-256 for me, hence needing the explicitness there).

solardiz referenced this issue in besser82/libxcrypt Dec 7, 2022
magnumripper added a commit to magnumripper/john that referenced this issue Apr 12, 2023
This commit just adds the code; does not enable it. But it does add LUT3
for pwsafe-opencl on nvidias - that was missing!

See openwall#4727
magnumripper added a commit that referenced this issue Apr 27, 2023
This commit just adds the code; does not enable it. But it does add LUT3
for pwsafe-opencl on nvidias - that was missing!

See #4727
@solardiz (Member Author) commented Jan 8, 2024

Committed SHA-256 changes into yescrypt (not yet updated in JtR)

It is now. We should also make similar changes to sha2.c, which we use in --without-openssl builds, or switch it to using yescrypt/sha256.c (which we already use in eigrp_fmt_plug.c even when building with OpenSSL, maybe wrongly) and an equivalent implementation for SHA-512 (which exists in FreeBSD and in libxcrypt).

@solardiz (Member Author)

For md4.c:

On x86_64, this reduces instruction count, and with recent gcc also substantially reduces code size in bytes. Also curiously, I am getting exact same code generated for implicit/potential vs. explicit caching/reuse. Tried gcc 4.4.7, 4.6.3, 10.2.0.

Also confirmed for gcc 11.3.1, but despite the code size decrease there's also a speed drop on an Intel Tiger Lake core, from approx. 15M to 14M for our --format=raw-md4 when built --without-openssl --disable-simd. I guess we end up reducing parallelism more than we reduce computation. If so, we're probably not going to enable this for MD4 on CPU.

@solardiz (Member Author) commented Dec 1, 2024

we're probably not going to enable this for MD4 on CPU.

I meant for scalar MD4 code on CPU. Need separate testing for the SIMD code.
