Optimize MD4's G(), SHA-1 H() and SHA-2 Maj() #5279

magnumripper · 2023-04-12T15:16:11Z

This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing!

See #4727
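
For context, MD4's G() and SHA-2's Maj() are both the bitwise majority function, and as far as I can tell the SHA-1 H() named in the title is the same function. A minimal C sketch of the kind of rewrite involved, assuming the optimized form is the XOR/AND variant usually credited to Wei Dai (the function names are illustrative, not the ones used in the tree):

```c
#include <assert.h>
#include <stdint.h>

/* Textbook majority: a result bit is set when at least two input bits are set. */
static uint32_t maj_classic(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (x & z) | (y & z);
}

/* XOR/AND rewrite (assumed here to be "Wei Dai's trick"): only 4 operations. */
static uint32_t maj_xor(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ ((x ^ y) & (y ^ z));
}

/*
 * 0xF0, 0xCC and 0xAA together enumerate all 8 per-bit input combinations,
 * so one comparison checks the whole truth table.
 */
static void maj_self_check(void)
{
	assert(maj_classic(0xF0, 0xCC, 0xAA) == maj_xor(0xF0, 0xCC, 0xAA));
}
```

If x and y agree in a bit, the AND term is 0 and the result is y; otherwise the result is y ^ (y ^ z) = z, which is the majority in that case.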

magnumripper · 2023-04-12T15:31:00Z

> it does add LUT3 for pwsafe-opencl on nvidias - that was missing!

The binary size went down from 331400 to 314858, but I see no difference in speed. BTW the so-called "binary" is actually the interim PTX file. The number of lop3 instructions in it (using DUMP_BINARY) prior to this patch was actually 0, and with this patch it's 512. I find it a bit strange that it does absolutely nothing for speed.
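
As background on the lop3 count: PTX's `lop3.b32` takes an 8-bit lookup table that can describe any three-input bitwise function. The sketch below follows the convention given in the PTX ISA documentation (evaluate the function on the constants 0xF0, 0xCC and 0xAA) to derive the table for majority. The helper names are hypothetical and this is not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Majority on 8-bit truth-table columns. */
static uint8_t maj8(uint8_t a, uint8_t b, uint8_t c)
{
	return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* LUT immediate for majority: the function applied to the three table constants. */
static uint8_t lut3_imm_maj(void)
{
	return maj8(0xF0, 0xCC, 0xAA);
}

static void lut3_self_check(void)
{
	/* 0xC0 | 0xA0 | 0x88 */
	assert(lut3_imm_maj() == 0xE8);
}
```

With such a table, a whole Maj() evaluation can map to a single instruction on hardware that supports it.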

solardiz · 2023-04-14T01:14:27Z

> This commit just adds the code; does not enable it.

Did you test its effect anywhere? What were the results? I assume on recent GPUs we have at least bitselect, so this isn't helpful, but on older NVIDIA and on CPU it might be.
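
For reference, OpenCL's `bitselect(a, b, c)` takes each result bit from `b` where `c` has a 1 bit and from `a` elsewhere, so majority needs only one XOR plus one select. A C emulation as a sketch: `bitselect32` is a stand-in for the OpenCL built-in, and the argument arrangement shown is one possible choice, not necessarily the one used in the tree:

```c
#include <assert.h>
#include <stdint.h>

/* Emulates OpenCL bitselect(): bits of b where c is 1, bits of a where c is 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/* Where x and z differ, y decides the majority; where they agree, x does. */
static uint32_t maj_bitselect(uint32_t x, uint32_t y, uint32_t z)
{
	return bitselect32(x, y, z ^ x);
}

static void bitselect_self_check(void)
{
	/* Full truth table in one go; 0xE8 is the majority of these columns. */
	assert(maj_bitselect(0xF0, 0xCC, 0xAA) == 0xE8);
}
```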

solardiz · 2023-04-14T01:45:50Z

Before:

$ rm -r ~/.nv
$ ./john -test -form=nt-opencl,raw-md4-opencl,raw-sha1-opencl,raw-sha256-opencl,raw-sha512-opencl,sha256crypt-opencl,sha512crypt-opencl,bitcoin-opencl,pwsafe-opencl
Device 1: GeForce GTX 570
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2035M c/s real, 2035M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1776M c/s real, 1776M c/s virtual

Benchmarking: raw-SHA1-opencl [SHA1 OpenCL/mask accel]... LWS=128 GWS=32768 x2600 DONE
Raw:    602807K c/s real, 602807K c/s virtual

Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... LWS=128 GWS=7680 (60 blocks) x2600 DONE
Raw:    176188K c/s real, 175328K c/s virtual

Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=64 GWS=7680 (120 blocks) x2600 DONE
Raw:    71956K c/s real, 71956K c/s virtual

Benchmarking: sha256crypt-opencl, crypt(3) $5$ (rounds=5000) [SHA256 OpenCL]... LWS=128 GWS=61440 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    19787 c/s real, 75851 c/s virtual

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=64 GWS=122880 (1920 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    14611 c/s real, 99497 c/s virtual

Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=32 GWS=3840 (120 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    394 c/s real, 7245 c/s virtual

Benchmarking: pwsafe-opencl, Password Safe [SHA256 OpenCL]... LWS=64 GWS=7680 (120 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Raw:    126757 c/s real, 932571 c/s virtual

After:

$ sed -i "s,if 0 /\* Wei Dai's trick,if 1 /* Wei Dai's trick," `fgrep -rl "Wei Dai's trick" opencl`
$ rm -r ~/.nv
$ ./john -test -form=nt-opencl,raw-md4-opencl,raw-sha1-opencl,raw-sha256-opencl,raw-sha512-opencl,sha256crypt-opencl,sha512crypt-opencl,bitcoin-opencl,pwsafe-opencl
Device 1: GeForce GTX 570
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2114M c/s real, 2103M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1768M c/s real, 1768M c/s virtual

Benchmarking: raw-SHA1-opencl [SHA1 OpenCL/mask accel]... LWS=128 GWS=32768 x2600 DONE
Raw:    599977K c/s real, 602807K c/s virtual

Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... LWS=128 GWS=7680 (60 blocks) x2600 DONE
Raw:    177932K c/s real, 177932K c/s virtual

Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=64 GWS=7680 (120 blocks) x2600 DONE
Raw:    73955K c/s real, 73955K c/s virtual

Benchmarking: sha256crypt-opencl, crypt(3) $5$ (rounds=5000) [SHA256 OpenCL]... LWS=128 GWS=61440 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    19883 c/s real, 76800 c/s virtual

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=64 GWS=30720 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    14733 c/s real, 111709 c/s virtual

Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=32 GWS=3840 (120 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    399 c/s real, 6981 c/s virtual

Benchmarking: pwsafe-opencl, Password Safe [SHA256 OpenCL]... LWS=64 GWS=7680 (120 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Raw:    129910 c/s real, 1135K c/s virtual

So this does appear to speed up some of these, especially nt-opencl.
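
To put rough numbers on that, the relative change can be computed from the "real" c/s figures of the two runs above. A throwaway C helper (hypothetical name) applied to a few of the pairs:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from 'before' to 'after', in percent. */
static double pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Figures copied from the benchmark output above, in the units printed there. */
static void print_changes(void)
{
	printf("NT-opencl:         %+.1f%%\n", pct_change(2035.0, 2114.0));
	printf("raw-MD4-opencl:    %+.1f%%\n", pct_change(1776.0, 1768.0));
	printf("raw-SHA512-opencl: %+.1f%%\n", pct_change(71956.0, 73955.0));
	printf("pwsafe-opencl:     %+.1f%%\n", pct_change(126757.0, 129910.0));
}
```

For NT-opencl this works out to a little under 4%.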

solardiz · 2023-04-14T01:51:47Z

Also tried enabling the explicit caching for MD4, got slightly higher speed for nt-opencl on one occasion:

Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2124M c/s real, 2114M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1776M c/s real, 1768M c/s virtual

but it's not reliably reproducible; other times it's 2114M, the same as when the caching opportunity is left implicit for the compiler.

Why is this speedup limited to nt-opencl and not seen for raw-MD4-opencl? Is this code somehow not used for the latter? Oh, indeed it is not - that's something to fix!
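
A sketch of what explicit caching could look like, under the assumption that it refers to reusing one round's (x ^ y) as the next round's (y ^ z) once the working registers rotate. The round function here is a toy, not MD4, and all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: recompute majority from scratch every round. */
static uint32_t toy_rounds_ref(uint32_t a, uint32_t b, uint32_t c, int n)
{
	for (int i = 0; i < n; i++) {
		uint32_t t = ((a & b) | (c & (a | b))) + (uint32_t)i * 0x9E3779B9u;
		c = b; b = a; a = t;	/* registers rotate: a -> b -> c */
	}
	return a;
}

/* Explicit caching: carry (y ^ z) over from the previous round. */
static uint32_t toy_rounds_cached(uint32_t a, uint32_t b, uint32_t c, int n)
{
	uint32_t yz = b ^ c;	/* only the first round computes this */

	for (int i = 0; i < n; i++) {
		uint32_t xy = a ^ b;
		uint32_t t = (b ^ (xy & yz)) + (uint32_t)i * 0x9E3779B9u;
		yz = xy;	/* after rotation, new (b ^ c) equals old (a ^ b) */
		b = a; a = t;
	}
	return a;
}

static void caching_self_check(void)
{
	assert(toy_rounds_ref(1, 2, 3, 64) == toy_rounds_cached(1, 2, 3, 64));
}
```

The implicit variant would keep the plain XOR/AND expression and leave it to the compiler to notice the common subexpression across rounds.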

solardiz · 2023-04-14T03:05:59Z

Testing AVX build with gcc 10.2.0, this significantly hurts MD4, but either slightly improves (by 1% or so) or doesn't hurt performance at SHA* (varies by format, including for raw vs. iterated). So I think let's enable it for SHA* SIMD.

I haven't benchmarked scalar yet.

solardiz · 2023-04-14T03:14:46Z

> Testing AVX build with gcc 10.2.0, this significantly hurts MD4

Actually, even with this optimization disabled for MD4 (but enabled for SHA* nearby), there's a (smaller) performance regression at MD4 - so it's something to do with code layout in the program as a whole, and might not be representative of these specific changes.

Enabling the optimization for MD4 reduces code size slightly, so maybe the optimization on its own is good even for MD4 and would have positive effect in another build.

magnumripper · 2023-04-14T18:22:15Z

> Did you test its effect anywhere? What were the results? I assume on recent GPUs we have at least bitselect, so this isn't helpful, but on older NVIDIA and on CPU it might be.

I merely tested that it seemed to build and run at all. Now that it's in there, I might get around to playing with it more some day - but given that even CPUs often have cmov or even ternarylogic nowadays, I mostly wanted it in there for "completeness".

solardiz

There's more work to do on this as per the comments made here, but should we merge it as-is first?

magnumripper · 2023-04-27T00:32:27Z

> There's more work to do on this as per the comments made here, but should we merge it as-is first?

Yes I think we can. I'm doing it.

Optimize MD4's G(), SHA-1 H() and SHA-2 Maj()

55da8f6

This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing! See openwall#4727

magnumripper requested a review from solardiz April 12, 2023 15:16

solardiz approved these changes Apr 26, 2023

magnumripper merged commit a58aa91 into openwall:bleeding-jumbo Apr 27, 2023

magnumripper deleted the Maj-optimization branch April 27, 2023 00:32
