-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize MD4's G(), SHA-1 H() and SHA-2 Maj() #5279
Optimize MD4's G(), SHA-1 H() and SHA-2 Maj() #5279
Conversation
This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing! See openwall#4727
The binary size went down from 331400 to 314858, but I see no difference in speed. BTW the so called "binary" is actually the interim PTX file. The number of lop3 instructions in it (using DUMP_BINARY) prior to this patch was actually 0, and with this patch it's 512. I find it a bit strange it does absolutely nothing to speed. |
Did you test its effect anywhere? What were the results? I assume on recent GPUs we have at least |
Before:
After:
So this does appear to speed up some of these, especially |
Also tried enabling the explicit caching for MD4, got slightly higher speed for
but it's not reliably reproducible, other times it's 2114M like with implicit caching opportunity for the compiler. Why is this speedup limited to |
Testing AVX build with gcc 10.2.0, this significantly hurts MD4, but either slightly improves (by 1% or so) or doesn't hurt performance at SHA* (varies by format, including for raw vs. iterated). So I think let's enable it for SHA* SIMD. I haven't benchmarked scalar yet. |
Actually, even with this optimization disabled for MD4 (but enabled for SHA* nearby), there's a (smaller) performance regression at MD4 - so it's something to do with code layout in the program as a whole, and might not be representative of these specific changes. Enabling the optimization for MD4 reduces code size slightly, so maybe the optimization on its own is good even for MD4 and would have positive effect in another build. |
I merely tested that it seemed to build and run at all. Now that it's in there, I could get the idea to play more with it some day - but given that even CPUs often have cmov or even ternarylogic nowadays, I mostly wanted it in there for "completeness". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's more work to do on this as per the comments made here, but should we merge it as-is first?
Yes I think we can. I'm doing it. |
This commit just adds the code; does not enable it. But it does add LUT3 for pwsafe-opencl on nvidias - that was missing!
See #4727