-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NT-long-opencl #5245
Comments
I just now realized we can use 64 bits of non-reversed binary and 64 bits of reversed, for the hot code... so it could actually be done without storing double binaries. Maybe I'll try it just for kicks. |
@magnumripper This is great news.
I think it's better to have the separate
Actually, for ease of use and lower risk/rate of user errors, it'd be great if passwords of all lengths could be attacked at once without much efficiency loss when testing short candidate passwords. So I think your work on this wasn't overkill. However, this gets more complicated under the hood, which means more effort to finish this work and potentially more bugs left?
That would be great. I was just going to say that a drawback would be memory usage when loading a huge number of NT hashes, which is a realistic concern given HIBP. Also, higher memory usage means lower cache hit rate, so could hurt performance. |
I think we can get this into the tree in two steps/PRs: first add the new format, then arrive at a unified format replacing the two. I think we want to have the two-format approach preserved in our git history e.g. for later troubleshooting or recommending to people in case the more complicated unified format ever fails. |
On a similar note, md5crypt (both CPU and OpenCL) is really in need for longer support. I looked at it several times but can't understand the optimized code! >.< |
Yes. We do have Curiously, |
I should probably benchmark support for even longer passwords, the change in code would be absolutely minimal. But it would obviously increase register pressure. Perhaps that could be left as supported by the code but not enabled by default. So we'd ship nt-long-opencl with length 59 but bumpable all the way to 125 (or 123 which is at even block length with no extra empty block). |
1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. Closes openwall#5245
1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
Two separate formats done now.
The kernel duration is 7 ms longer at a KPC of about 3 billion (accelerated):
vs.
|
Going back to a single format with full speed at single-block is tricky. A quick and dirty way to do it is to sacrifice speed for multi-block crypts in order to keep it for single-block ones, by adding a "bogus" I think if 32 bits of result would suffice we could be fine, but with mask mode on a good GPU we'd dwelve too far in the compare function and produce possible (false) hits on every crypt, that had to be sorted in |
We could have two separate kernels, and move that decision to host side. This way there'd be NO penalty at all. |
I don't see where
Makes sense, but I think you had simple enough changes implementing |
Oh, a copy-paste error. I ran a lot of tests and if I compare fastest against fastest, the difference is about 5% I think. But it doesn't matter, I'll improve it anyway.
Doing the above is a major change, changing all the bitmap and 'perfect hash table' stuff. We'll see how insane it gets.
Yes I think the first two commits will stay more or less as is, just adding commits. |
But the decision is per candidate password length, whereas GWS for a single kernel invocation is huge. By having the decision in the kernel, we can have it vary between groups of 64 candidates tested (post-mask). |
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
Current bleeding:
Current #5246:
Current #5250:
Apparently "stack frame" is simply the plaintext buffer (after on-device UTF-16 conversion), it follows its exact size. And I seem to have somehow shaved 2 registers in #5250 compared to bleeding. |
(Edited due to git mistake) Yet another branch
|
That last alternative seems to be the winner. The current version lose 6.5% performance on AMD and less than 3% on nvidia compared to bleeding-jumbo, with 100% single-block crypts. I'm surprised the price is so high for some code never processed but anyway I'll continue from there. |
I guess this price is for the reads at index 14 and 15, which were previously a register and a constant 0, but are now part of an array spilled to global memory. Maybe it'll be cheaper to pre-check the length and avoid those reads for <=27 (store the right values in two registers, and then use those). Maybe I was wrong in suggesting you try to reduce the divergence. This was a nice exercise, but those percentages do feel high as a price for rarely-used functionality. Maybe the 3x+ (vs. 2x+) hit with divergence was more acceptable. However, as I recall you also had unacceptable hit for performance on AMD there? Maybe simply adding an |
I'll keep on experimenting. Meanwhile I tried using local memory for the plaintext buffer which got rid of the register spill, but then LWS was limited to 64 and speed dropped on 2080ti with 75%, to below 7500M instead of over 30G. We can go higher than LWS=64 with a max length of 59, but speed didn't increase at all. So I'm abandoning that idea.
Using local we can no longer use an initializer so need to zero the buffer. I would have thought it to be faster anyway, at least with mask acceleration where we amortize the zeroing in the x22308 loop (BTW I only zero as much as actually needed) but the zeroing isn't even the problem! Not sure what is - it just isn't fast.
The nvidia speed is very similar among the various versions (although some would diverge more with varying lengths). The main problem is the AMD, it only runs well with the last version, and not perfect at that either. I used Also, I'm not quite happy we don't know the AMD speed with recent drivers. It would be sad to ditch good stuff because we only test it on a driver from 2019. |
On a side note I noticed a strange thing: If I replace the (added in while loop) single-line macro I can spot one single difference: the one-line macro protects
Theoretically that could be the reason for ending up slightly different so I tried removing it but it made no difference. For the life of me I can't see why there would be a difference in binary size of 443 bytes. I looked a little at the corresponding PTX files but didn't get wiser (except I could confirm the difference is not debug comments or some such). |
With all versions of your changes, |
BTW, we also still have this pending MD4 G() optimization: #4727 (comment) |
Yes, for length 59 it grows from 56 bytes to 128, and for length 125 it becomes 320 bytes. And you're probably right: Indeed the AMD reacts well on limiting the new code to length 59 (~2.7% performance drop instead of ~6.4%) whereas on nvidia it doesn't matter the slightest.
I can't see any obvious way to do so that wouldn't introduce complexity. I'll ponder it though.
I should get that in while at it, but it shouldn't matter for nvidia (using lut3) nor AMD (using bitselect). OTOH I've had the idea to revert to basic macros (neither lut3 nor bitselect) and see if the current nvidia optimizer can do better nowadays. I'm pretty sure someone (you?) said that inline assembly can ruin things for the optimizers and I can picture why. |
Just tried this (with current bleeding-jumbo), no change in speed. Then changed to that optimization in #4727 and still no change in speed, or perhaps a sub percent drop. |
I tried your suggestion and it seemed to help a little. AMD regression now is "only" 1.8% (best of five runs) or 1.5% (average of same five runs) as long as I also limit it to max. length of 59. EDIT I dropped that idea later on, after first trying to do it with all 16 elements of (one MD block of) nt_buffer, which had no effect at all on nvidia (I hoped it would), and also confirmed with more tests that neither change had any effect on AMD. I really thought the "original code last" version would solve all problems. It didn't need tricks like the above - the original code works fine for the last block. Sure, we had two branches instead of one, but the impact on AMD was silly. I think I'll revisit that branch and stare at it a little. |
I found the solution to any performance regression. Like I said, I noticed Sayantan used the full 128-bit binary - which not only wastes memory and hash table/bitmap space (and hot compare code!), but also requires two more complete MD4 steps. Even with the birthday paradox against us, we ought to do fine with 64 bits. Fixing that wasn't trivial, but here's single-hash performance on 2080ti now (normal length 27 version):
Current bleeding-jumbo consistently does 33655Mp/s so the above is a 50% boost. EDIT While the above is repeatable, it seems to be inflated because of the very short run - probably using max. GPU clocking. Now hashcat is 50% faster than that at the exact same task, at 75866.9 MH/s, but it has optimizations for single-hash that we never bothered with (although I'm tempted to implement it in nt-opencl). And already at attacking two hashes (of which one is faked and uncrackable), we beat hashcat with 40265Mp/s against 35746.9 MH/s. On a side note, hashcat benchmark claims 93781.9 MH/s. Not sure how you'd get that figure in real cracking when their real speed attacking a single hash with mask is just 75866.9 MH/s? My current changes can easily be adopted to raw MD4/MD5 formats, they use the same shared code. And there are seven other format that can be fixed as well, but they don't (yet) share any code. |
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask. Also adds an entry in doc/NEWS. Closes openwall#5245.
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask. Also adds an entry in doc/NEWS. Closes openwall#5245.
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask. Also adds an entry in doc/NEWS. Closes openwall#5245.
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask. Also adds an entry in doc/NEWS. Closes openwall#5245.
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
This is well tested code in other formats. About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask. Also adds an entry in doc/NEWS. Closes openwall#5245.
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes openwall#5245
This way we still got optimal speed for the original nt-opencl format that support lengths up to 27. 1 block of MD4 is up to 27 characters. 2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max). As long as we bump it over 27 we don't seem to gain any speed by limiting it to less than 125 characters. See #5245
Not only do we save memory, we can reverse much more as well, and reject early. We check the remaining bits in cold host code, for good measure. Closes #5245
@solardiz, I have a trivial patch to NT-opencl that bumps its max length to 59 (or more). Due to the extra overhead and perhaps more because we can no longer reverse MD4 steps, it makes the format about 9% slower even at single-block lengths. I'm pondering two alternatives:
a) The changes are
#if PLAINTEXT_LENGTH > 27
. I could commit the changes except the actual change ofPLAINTEXT_LENGTH
. This is a no-op but makes it fairly easy to enable the support for longer passwords and rebuild.b) I could create a separate NT-long-opencl (using same files but two format structs).
I'm not quite sure which way to go - what do you think?
BTW I also played with the idea to actually store two sets of binaries, with and without reversed steps, and really try to minimize overhead. But I think it's overkill - passwords this long are definitely used but they are not very common so they can be attacked separately anyway.
The text was updated successfully, but these errors were encountered: