Optimize MD4's G(), SHA-1 H() and SHA-2 Maj() #5279

Merged (1 commit) on Apr 27, 2023


magnumripper
Member

This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing!

See #4727
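For readers who want the underlying identity: in their standard definitions, MD4's G(), the majority round function of SHA-1 and SHA-2's Maj() are all the same bitwise majority function. The sketch below compares the textbook form with an XOR-based form. It is an assumption that this XOR form is what the PR means by "Wei Dai's trick"; the function names are made up for illustration and are not the project's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook majority: an output bit is 1 when at least two input bits are 1. */
static uint32_t maj_textbook(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (x & z) | (y & z);
}

/* XOR form (assumed to be the "Wei Dai's trick" variant):
 * if x == y the result is y, otherwise it is z. */
static uint32_t maj_xor(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ ((x ^ y) & (y ^ z));
}

/* Both forms are purely bitwise, so checking the 8 possible combinations
 * of one bit (broadcast to all 32 bits) covers every input. */
static void maj_self_check(void)
{
    for (unsigned i = 0; i < 8; i++) {
        uint32_t x = (i & 4) ? 0xFFFFFFFFu : 0;
        uint32_t y = (i & 2) ? 0xFFFFFFFFu : 0;
        uint32_t z = (i & 1) ? 0xFFFFFFFFu : 0;
        assert(maj_textbook(x, y, z) == maj_xor(x, y, z));
    }
}
```

Calling `maj_self_check()` from a test harness confirms the two forms agree. Taken alone, the XOR form is no shorter; its appeal is that `x ^ y` can be reused by a neighbouring round.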

@magnumripper magnumripper requested a review from solardiz April 12, 2023 15:16
@magnumripper
Member Author

magnumripper commented Apr 12, 2023

> it does add LUT3 for pwsafe-opencl on nvidias - that was missing!

The binary size went down from 331400 to 314858, but I see no difference in speed. BTW the so-called "binary" is actually the interim PTX file. The number of lop3 instructions in it (using DUMP_BINARY) was 0 prior to this patch, and with this patch it's 512. I find it a bit strange that this has absolutely no effect on speed.
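As background on what a LUT3/lop3 operation computes: it evaluates an arbitrary three-input bitwise function selected by an 8-bit truth table. In NVIDIA's PTX convention, the table value is obtained by applying the desired function to the constants 0xF0, 0xCC and 0xAA. The following is a plain C emulation for illustration only (the function names are hypothetical, and this is not how the kernels invoke the instruction):

```c
#include <assert.h>
#include <stdint.h>

/* Software emulation of a 3-input LUT operation: for every bit position,
 * the bits of a, b, c form an index 0..7 into the 8-bit truth table. */
static uint32_t lut3_emulate(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                        ((c >> bit) & 1u);
        r |= (uint32_t)((lut >> idx) & 1u) << bit;
    }
    return r;
}

/* Truth table for majority, derived by applying the function to the
 * three selector constants. */
static uint8_t lut3_imm_for_maj(void)
{
    const uint8_t ta = 0xF0, tb = 0xCC, tc = 0xAA;
    return (uint8_t)((ta & tb) | (ta & tc) | (tb & tc));
}

/* Compare the emulated LUT against the textbook majority on sample words. */
static void lut3_self_check(void)
{
    const uint32_t v[] = { 0x00000000u, 0xFFFFFFFFu, 0x12345678u,
                           0x9ABCDEF0u, 0x0F1E2D3Cu };
    const int n = (int)(sizeof(v) / sizeof(v[0]));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                uint32_t want = (v[i] & v[j]) | (v[i] & v[k]) | (v[j] & v[k]);
                assert(lut3_emulate(v[i], v[j], v[k], lut3_imm_for_maj()) == want);
            }
}
```

With such an instruction, the whole majority function is a single operation, which is consistent with the smaller PTX file reported above.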

@solardiz
Member

> This commit just adds the code; does not enable it.

Did you test its effect anywhere? What were the results? I assume on recent GPUs we have at least bitselect, so this isn't helpful, but on older NVIDIA and on CPU it might be.
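To illustrate why bitselect makes the trick unnecessary: OpenCL's `bitselect(a, b, c)` takes each result bit from `b` where the corresponding bit of `c` is 1 and from `a` otherwise, so majority becomes one select plus one XOR. A minimal C emulation follows; the helper names are hypothetical, and the exact macro used in the kernels is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* C stand-in for OpenCL's bitselect(): bits of b where c is 1, else bits of a. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Majority via select: where x and z agree the answer is x,
 * where they differ y breaks the tie. */
static uint32_t maj_via_bitselect(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_u32(x, y, z ^ x);
}

/* Exhaustive per-bit check against the textbook definition. */
static void bitselect_self_check(void)
{
    for (unsigned i = 0; i < 8; i++) {
        uint32_t x = (i & 4) ? 0xFFFFFFFFu : 0;
        uint32_t y = (i & 2) ? 0xFFFFFFFFu : 0;
        uint32_t z = (i & 1) ? 0xFFFFFFFFu : 0;
        assert(maj_via_bitselect(x, y, z) == ((x & y) | (x & z) | (y & z)));
    }
}
```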

@solardiz
Member

Before:

$ rm -r ~/.nv
$ ./john -test -form=nt-opencl,raw-md4-opencl,raw-sha1-opencl,raw-sha256-opencl,raw-sha512-opencl,sha256crypt-opencl,sha512crypt-opencl,bitcoin-opencl,pwsafe-opencl
Device 1: GeForce GTX 570
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2035M c/s real, 2035M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1776M c/s real, 1776M c/s virtual

Benchmarking: raw-SHA1-opencl [SHA1 OpenCL/mask accel]... LWS=128 GWS=32768 x2600 DONE
Raw:    602807K c/s real, 602807K c/s virtual

Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... LWS=128 GWS=7680 (60 blocks) x2600 DONE
Raw:    176188K c/s real, 175328K c/s virtual

Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=64 GWS=7680 (120 blocks) x2600 DONE
Raw:    71956K c/s real, 71956K c/s virtual

Benchmarking: sha256crypt-opencl, crypt(3) $5$ (rounds=5000) [SHA256 OpenCL]... LWS=128 GWS=61440 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    19787 c/s real, 75851 c/s virtual

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=64 GWS=122880 (1920 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    14611 c/s real, 99497 c/s virtual

Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=32 GWS=3840 (120 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    394 c/s real, 7245 c/s virtual

Benchmarking: pwsafe-opencl, Password Safe [SHA256 OpenCL]... LWS=64 GWS=7680 (120 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Raw:    126757 c/s real, 932571 c/s virtual

After:

$ sed -i "s,if 0 /\* Wei Dai's trick,if 1 /* Wei Dai's trick," `fgrep -rl "Wei Dai's trick" opencl`
$ rm -r ~/.nv
$ ./john -test -form=nt-opencl,raw-md4-opencl,raw-sha1-opencl,raw-sha256-opencl,raw-sha512-opencl,sha256crypt-opencl,sha512crypt-opencl,bitcoin-opencl,pwsafe-opencl
Device 1: GeForce GTX 570
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2114M c/s real, 2103M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1768M c/s real, 1768M c/s virtual

Benchmarking: raw-SHA1-opencl [SHA1 OpenCL/mask accel]... LWS=128 GWS=32768 x2600 DONE
Raw:    599977K c/s real, 602807K c/s virtual

Benchmarking: raw-SHA256-opencl [SHA256 OpenCL/mask accel]... LWS=128 GWS=7680 (60 blocks) x2600 DONE
Raw:    177932K c/s real, 177932K c/s virtual

Benchmarking: raw-SHA512-opencl [SHA512 OpenCL/mask accel]... LWS=64 GWS=7680 (120 blocks) x2600 DONE
Raw:    73955K c/s real, 73955K c/s virtual

Benchmarking: sha256crypt-opencl, crypt(3) $5$ (rounds=5000) [SHA256 OpenCL]... LWS=128 GWS=61440 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    19883 c/s real, 76800 c/s virtual

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=64 GWS=30720 (480 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    14733 c/s real, 111709 c/s virtual

Benchmarking: Bitcoin-opencl, Bitcoin Core [SHA512 AES OpenCL]... LWS=32 GWS=3840 (120 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    399 c/s real, 6981 c/s virtual

Benchmarking: pwsafe-opencl, Password Safe [SHA256 OpenCL]... LWS=64 GWS=7680 (120 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Raw:    129910 c/s real, 1135K c/s virtual

So this does appear to speed up some of these, especially nt-opencl.
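To put numbers on the comparison, the relative change in "real" c/s can be computed from the two logs above. This is only a sketch: the array name is made up, the figures are copied from the single before/after run shown here, and run-to-run noise is not accounted for.

```c
#include <assert.h>
#include <stddef.h>

struct bench_pair {
    const char *format;
    double before;   /* c/s real, before the change */
    double after;    /* c/s real, after the change */
};

/* Values transcribed from the GTX 570 logs above (hypothetical array name). */
static const struct bench_pair gtx570_runs[] = {
    { "nt-opencl",          2035e6,   2114e6 },
    { "raw-md4-opencl",     1776e6,   1768e6 },
    { "raw-sha1-opencl",    602807e3, 599977e3 },
    { "raw-sha256-opencl",  176188e3, 177932e3 },
    { "raw-sha512-opencl",  71956e3,  73955e3 },
    { "sha256crypt-opencl", 19787,    19883 },
    { "sha512crypt-opencl", 14611,    14733 },
    { "bitcoin-opencl",     394,      399 },
    { "pwsafe-opencl",      126757,   129910 },
};

/* Relative change in percent; positive means faster after the change. */
static double pct_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Index of the format with the largest relative gain. */
static size_t largest_gain_index(void)
{
    size_t best = 0;
    size_t n = sizeof(gtx570_runs) / sizeof(gtx570_runs[0]);
    for (size_t i = 1; i < n; i++)
        if (pct_change(gtx570_runs[i].before, gtx570_runs[i].after) >
            pct_change(gtx570_runs[best].before, gtx570_runs[best].after))
            best = i;
    return best;
}
```

On these figures nt-opencl shows the largest gain, a little under 4%, while raw-md4-opencl and raw-sha1-opencl are slightly lower than before.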

@solardiz
Member

Also tried enabling the explicit caching for MD4, got slightly higher speed for nt-opencl on one occasion:

Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=61440 (240 blocks) x2600 DONE
Raw:    2124M c/s real, 2114M c/s virtual

Benchmarking: raw-MD4-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=131072 x2600 DONE
Raw:    1776M c/s real, 1768M c/s virtual

but it's not reliably reproducible; other times it's 2114M, the same as when the compiler is merely given the opportunity to cache implicitly.

Why is this speedup limited to nt-opencl and not seen for raw-MD4-opencl? Is this code somehow not used for the latter? Oh, indeed it is not - that's something to fix!
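The difference between implicit and explicit caching can be sketched with a toy round function. The assumption here is that the cached value is the `a ^ b` term of the XOR-form majority, which becomes the next round's `b ^ c` once the state words rotate. The round below is deliberately simplified and is neither MD4 nor SHA-2; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Implicit variant: both XORs are written out every round, and it is up to
 * the compiler to notice that one of them was already computed. */
static uint32_t toy_rounds_plain(uint32_t a, uint32_t b, uint32_t c, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        uint32_t t = (b ^ ((a ^ b) & (b ^ c))) + c + (uint32_t)i;
        c = b; b = a; a = t;              /* rotate the state words */
    }
    return a ^ b ^ c;
}

/* Explicit variant: carry a ^ b into the next round, where it equals b ^ c. */
static uint32_t toy_rounds_cached(uint32_t a, uint32_t b, uint32_t c, int rounds)
{
    uint32_t b_xor_c = b ^ c;             /* computed once up front */
    for (int i = 0; i < rounds; i++) {
        uint32_t a_xor_b = a ^ b;
        uint32_t t = (b ^ (a_xor_b & b_xor_c)) + c + (uint32_t)i;
        c = b; b = a; a = t;
        b_xor_c = a_xor_b;                /* new b ^ c is the old a ^ b */
    }
    return a ^ b ^ c;
}

/* The two variants must produce identical results. */
static void toy_rounds_self_check(void)
{
    assert(toy_rounds_plain(0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 64) ==
           toy_rounds_cached(0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 64));
}
```

In the explicit variant each majority costs three bitwise operations instead of four; whether that beats what the compiler already does on its own is exactly what the unreliable benchmark difference above leaves open.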

@solardiz
Member

Testing AVX build with gcc 10.2.0, this significantly hurts MD4, but for SHA* it either slightly improves performance (by 1% or so) or at least doesn't hurt it (this varies by format, including raw vs. iterated). So I think we should enable it for SHA* SIMD.

I haven't benchmarked scalar yet.

@solardiz
Member

> Testing AVX build with gcc 10.2.0, this significantly hurts MD4

Actually, even with this optimization disabled for MD4 (but enabled for SHA* nearby), there's a (smaller) performance regression at MD4 - so it's something to do with code layout in the program as a whole, and might not be representative of these specific changes.

Enabling the optimization for MD4 reduces code size slightly, so maybe the optimization on its own is good even for MD4 and would have positive effect in another build.

@magnumripper
Member Author

> Did you test its effect anywhere? What were the results? I assume on recent GPUs we have at least bitselect, so this isn't helpful, but on older NVIDIA and on CPU it might be.

I merely tested that it builds and runs at all. Now that it's in there, I might play with it more some day - but given that even CPUs often have cmov or even ternarylogic nowadays, I mostly wanted it in there for "completeness".

Member

@solardiz solardiz left a comment


There's more work to do on this as per the comments made here, but should we merge it as-is first?

@magnumripper
Member Author

> There's more work to do on this as per the comments made here, but should we merge it as-is first?

Yes I think we can. I'm doing it.

@magnumripper magnumripper merged commit a58aa91 into openwall:bleeding-jumbo Apr 27, 2023
@magnumripper magnumripper deleted the Maj-optimization branch April 27, 2023 00:32