
NT-long-opencl #5245

Closed
magnumripper opened this issue Mar 9, 2023 · 24 comments · Fixed by #5246

magnumripper commented Mar 9, 2023

@solardiz, I have a trivial patch to NT-opencl that bumps its max length to 59 (or more). Due to the extra overhead and perhaps more because we can no longer reverse MD4 steps, it makes the format about 9% slower even at single-block lengths. I'm pondering two alternatives:

a) The changes are guarded by #if PLAINTEXT_LENGTH > 27. I could commit the changes except the actual change of PLAINTEXT_LENGTH. This is a no-op but makes it fairly easy to enable the support for longer passwords and rebuild.

b) I could create a separate NT-long-opencl (using the same files but two format structs).

I'm not quite sure which way to go - what do you think?

BTW I also played with the idea to actually store two sets of binaries, with and without reversed steps, and really try to minimize overhead. But I think it's overkill - passwords this long are definitely used but they are not very common so they can be attacked separately anyway.

@magnumripper

> BTW I also played with the idea to actually store two sets of binaries, with and without reversed steps, and really try to minimize overhead. But I think it's overkill - passwords this long are definitely used but they are not very common so they can be attacked separately anyway.

I just now realized we can use 64 bits of non-reversed binary and 64 bits of reversed, for the hot code... so it could actually be done without storing double binaries. Maybe I'll try it just for kicks.


solardiz commented Mar 9, 2023

@magnumripper This is great news.

> I'm not quite sure which way to go - what do you think?

I think it's better to have the separate NT-long-opencl format than to have the user edit files back and forth.

> passwords this long are definitely used but they are not very common so they can be attacked separately anyway.

Actually, for ease of use and lower risk/rate of user errors, it'd be great if passwords of all lengths could be attacked at once without much efficiency loss when testing short candidate passwords. So I think your work on this wasn't overkill. However, this gets more complicated under the hood, which means more effort to finish this work and potentially more bugs left?

> so it could actually be done without storing double binaries. Maybe I'll try it just for kicks.

That would be great. I was just going to say that a drawback would be memory usage when loading a huge number of NT hashes, which is a realistic concern given HIBP. Also, higher memory usage means lower cache hit rate, so could hurt performance.


solardiz commented Mar 9, 2023

I think we can get this into the tree in two steps/PRs: first add the new format, then arrive at a unified format replacing the two. I think we want to have the two-format approach preserved in our git history e.g. for later troubleshooting or recommending to people in case the more complicated unified format ever fails.

@magnumripper

On a similar note, md5crypt (both CPU and OpenCL) is really in need of support for longer passwords. I looked at it several times but can't understand the optimized code! >.<

magnumripper self-assigned this Mar 9, 2023

solardiz commented Mar 9, 2023

> On a similar note, md5crypt (both CPU and OpenCL) is really in need for longer support.

Yes. We do have md5crypt-long on CPU, but it lacks SIMD support. It is definitely possible to have something with medium performance by using SIMD yet supporting lengths beyond 15.

Curiously, md5crypt-long is now above 1M c/s on AWS c6a.48xlarge... but the non-long one is above 10M there.

@magnumripper

I should probably benchmark support for even longer passwords; the change in code would be absolutely minimal. But it would obviously increase register pressure. Perhaps that could be left as supported by the code but not enabled by default. So we'd ship nt-long-opencl with length 59 but bumpable all the way to 125 (or 123, which fills an even number of blocks with no extra, nearly empty block).

magnumripper added a commit to magnumripper/john that referenced this issue Mar 9, 2023
1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).

As long as we bump it over 27 we don't seem to gain any speed by limiting
it to less than 125 characters.

Closes openwall#5245
magnumripper added a commit to magnumripper/john that referenced this issue Mar 9, 2023
1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).

As long as we bump it over 27 we don't seem to gain any speed by limiting
it to less than 125 characters.

See openwall#5245

magnumripper commented Mar 10, 2023

Two separate formats done now. Less than 5% speed penalty.

$ ../run/john -test -form:nt-*opencl 
Device 2: NVIDIA GeForce RTX 2080 Ti
Benchmarking: NT-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=69632 (272 blocks) x22308 DONE
Raw:    32458M c/s real, 32297M c/s virtual, Dev#2 util: 98%

Benchmarking: NT-long-opencl [MD4 OpenCL/mask accel]... LWS=256 GWS=139264 (544 blocks) x22308 DONE
Raw:    30309M c/s real, 30162M c/s virtual, Dev#2 util: 98%

2 formats benchmarked.

$ bc -l <<< 3030900/32458
93.37913611436317702877

The kernel duration is 7 ms longer at a KPC of about 3 billion (accelerated):

key xfer: 177.280 us, idx xfer: 89.728 us, crypt: 90.052 ms, res xfer: 1.984 us
gws:    139264   34395M c/s 34395995110 rounds/s   90.321 ms per crypt_all()

vs.

key xfer: 177.120 us, idx xfer: 89.728 us, crypt: 97.105 ms, res xfer: 1.984 us
gws:    139264   31904M c/s 31904807850 rounds/s   97.374 ms per crypt_all()

@magnumripper

Going back to a single format with full speed at single-block is tricky. A quick and dirty way to do it is to sacrifice speed for multi-block crypts in order to keep it for single-block ones, by adding a "bogus" md4_reverse() after the multi-block crypt (same stuff as is done in binary()). This way we don't need to touch anything else. Then the only penalty for single-block crypts is the extra branch needed on password length.

I think if 32 bits of result would suffice we could be fine, but with mask mode on a good GPU we'd delve too far into the compare function and produce possible (false) hits on every crypt, which would have to be sorted out in cmp_exact - we'd lose a lot of speed. So we need 64 bits. Actually the current code seems to use all 128 bits, I can't see why. I think we could speed it up further by just doing 64 bits, dropping two more steps in the end of the MD4! We'd also get half the memory footprint for millions of hashes... Perhaps I should start with focusing on that, ignoring nt-long-opencl in the meantime.

@magnumripper

> Then the only penalty for single-block crypts is the extra branch needed on password length.

We could have two separate kernels, and move that decision to host side. This way there'd be NO penalty at all.

@solardiz

> $ bc -l <<< 3030900/31824

I don't see where 31824 came from.

> I think we could speed it up further by just doing 64 bits, dropping two more steps in the end of the MD4! We'd also get half the memory footprint for millions of hashes... Perhaps I should start with focusing on that, ignoring nt-long-opencl in the meantime.

Makes sense, but I think you had simple enough changes implementing nt-opencl-long already that we should keep that code version at some commit in our tree as well.

@magnumripper

> $ bc -l <<< 3030900/31824

> I don't see where 31824 came from.

Oh, a copy-paste error. I ran a lot of tests and if I compare fastest against fastest, the difference is about 5% I think. But it doesn't matter, I'll improve it anyway.

> I think we could speed it up further by just doing 64 bits, dropping two more steps in the end of the MD4! We'd also get half the memory footprint for millions of hashes... Perhaps I should start with focusing on that, ignoring nt-long-opencl in the meantime.

Doing the above is a major change, changing all the bitmap and 'perfect hash table' stuff. We'll see how insane it gets.

> Makes sense, but I think you had simple enough changes implementing nt-opencl-long already that we should keep that code version at some commit in our tree as well.

Yes, I think the first two commits will stay more or less as-is, with further commits just added on top.

@solardiz

> Then the only penalty for single-block crypts is the extra branch needed on password length.

> We could have two separate kernels, and move that decision to host side. This way there'd be NO penalty at all.

But the decision is per candidate password length, whereas GWS for a single kernel invocation is huge. By having the decision in the kernel, we can have it vary between groups of 64 candidates tested (post-mask).

magnumripper added a commit to magnumripper/john that referenced this issue Mar 12, 2023
This way we still got optimal speed for the original nt-opencl format that
support lengths up to 27.

1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).

As long as we bump it over 27 we don't seem to gain any speed by limiting
it to less than 125 characters.

See openwall#5245

magnumripper commented Mar 13, 2023

Current bleeding:

ptxas info    : 0 bytes gmem, 104 bytes cmem[3]
ptxas info    : Compiling entry function 'nt' for 'sm_75'
ptxas info    : Function properties for nt
ptxas         .     56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 16388 bytes smem, 432 bytes cmem[0], 8 bytes cmem[2]
binary size 22012

Current #5246:

ptxas info    : 0 bytes gmem, 104 bytes cmem[3]
ptxas info    : Compiling entry function 'nt' for 'sm_75'
ptxas info    : Function properties for nt
ptxas         .     312 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 50 registers, 16388 bytes smem, 432 bytes cmem[0], 8 bytes cmem[2]
binary size 44381

Current #5250:

ptxas info    : 0 bytes gmem, 104 bytes cmem[3]
ptxas info    : Compiling entry function 'nt' for 'sm_75'
ptxas info    : Function properties for nt
ptxas         .     312 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 44 registers, 16388 bytes smem, 432 bytes cmem[0], 8 bytes cmem[2]
binary size 52217

Apparently "stack frame" is simply the plaintext buffer (after on-device UTF-16 conversion); it tracks its exact size. And I seem to have somehow shaved 2 registers in #5250 compared to bleeding.


magnumripper commented Mar 13, 2023

(Edited due to git mistake)

Yet another branch, singlebranch #5251, where I get away with a single branch at the very end of the function:

ptxas info    : 0 bytes gmem, 104 bytes cmem[3]
ptxas info    : Compiling entry function 'nt' for 'sm_75'
ptxas info    : Function properties for nt
ptxas         .     320 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 44 registers, 16388 bytes smem, 432 bytes cmem[0], 8 bytes cmem[2]
binary size 43711

@magnumripper

That last alternative seems to be the winner. The current version loses 6.5% performance on AMD and less than 3% on nvidia compared to bleeding-jumbo, with 100% single-block crypts. I'm surprised the price is so high for code that is never executed, but anyway I'll continue from there.

@solardiz

> The current version lose 6.5% performance on AMD and less than 3% on nvidia compared to bleeding-jumbo, with 100% single-block crypts. I'm surprised the price is so high for some code never processed but anyway I'll continue from there.

I guess this price is for the reads at index 14 and 15, which were previously a register and a constant 0, but are now part of an array spilled to global memory. Maybe it'll be cheaper to pre-check the length and avoid those reads for <=27 (store the right values in two registers, and then use those).

Maybe I was wrong in suggesting you try to reduce the divergence. This was a nice exercise, but those percentages do feel high as a price for rarely-used functionality. Maybe the 3x+ (vs. 2x+) hit with divergence was more acceptable. However, as I recall you also had an unacceptable performance hit on AMD there? Maybe simply adding an unlikely() there would do the trick?


magnumripper commented Mar 13, 2023

> The current version lose 6.5% performance on AMD and less than 3% on nvidia compared to bleeding-jumbo, with 100% single-block crypts. I'm surprised the price is so high for some code never processed but anyway I'll continue from there.

> I guess this price is for the reads at index 14 and 15, which were previously a register and a constant 0, but are now part of an array spilled to global memory. Maybe it'll be cheaper to pre-check the length and avoid those reads for <=27 (store the right values in two registers, and then use those).

I'll keep on experimenting. Meanwhile I tried using local memory for the plaintext buffer, which got rid of the register spill, but then LWS was limited to 64 and speed dropped by 75% on the 2080ti, to below 7500M from over 30G. We can go higher than LWS=64 with a max length of 59, but speed didn't increase at all. So I'm abandoning that idea.

ptxas info    : 0 bytes gmem, 104 bytes cmem[3]
ptxas info    : Compiling entry function 'nt' for 'sm_75'
ptxas info    : Function properties for nt
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 36868 bytes smem, 432 bytes cmem[0], 16 bytes cmem[2]
binary size 36130

Using local memory we can no longer use an initializer, so we need to zero the buffer. I would have thought it to be faster anyway, at least with mask acceleration where we amortize the zeroing in the x22308 loop (BTW I only zero as much as actually needed), but the zeroing isn't even the problem! Not sure what is - it just isn't fast.

> Maybe I was wrong in suggesting you try to reduce the divergence. This was a nice exercise, but those percentages do feel high as a price for rarely-used functionality. Maybe the 3x+ (vs. 2x+) hit with divergence was more acceptable. However, as I recall you also had unacceptable hit for performance on AMD there? Maybe simply adding an unlikely() to there would do the trick?

The nvidia speed is very similar among the various versions (although some would diverge more with varying lengths). The main problem is AMD: it only runs well with the last version, and not perfectly at that either. I used unlikely() in all versions.

Also, I'm not quite happy we don't know the AMD speed with recent drivers. It would be sad to ditch good stuff because we only test it on a driver from 2019.

@magnumripper

On a side note I noticed a strange thing: if I replace the single-line macro MD4(uint, nt_buffer, hash) (from opencl_md4.h, added in the while loop) with Sayantan's many-line version, it "should" end up exactly the same: both use the same MD4_x() basic functions, and the single-line MD4 macro should expand to code identical to each line in the other version. Yet, without the single-line macro: binary size 53963. With it: binary size 54406. A difference of 443 bytes. From what!?

I can spot one single difference: the one-line macro protects (x) with parens here:

#define MD4STEP(f, a, b, c, d, x, s)  \
	(a) += f((b), (c), (d)) + (x); \
	    (a) = rotate((a), (uint)(s))

Theoretically that could be the reason for ending up slightly different, so I tried removing it, but it made no difference. For the life of me I can't see why there would be a difference in binary size of 443 bytes. I looked a little at the corresponding PTX files but didn't get any wiser (except I could confirm the difference is not debug comments or some such).

@solardiz

With all versions of your changes, nt_buffer becomes much larger, right? Maybe that's what causes the slowdown on AMD, e.g. through affecting whether it's allocated in registers or on the stack, and/or through the placement of its hot (first/only limb) portions in caches (which become non-contiguous). Maybe you can split this array in two - one for the first/only limb and one for the rest? I don't know if that's a sufficient hint to the compiler or not. Is there a way to give an explicit hint that one of the buffers is hot and the other is cold? Maybe by using the latter only inside if (unlikely()) blocks. I also don't know whether those decisions by the compiler are per-variable or effectively only per-element anyway (if so, my suggested changes wouldn't make a difference). In fact, since compilers typically use SSA, the decisions are probably even finer-grained than per-element - they may vary between the lifetimes of values written to an element - but maybe there's some higher-level layer at play as well.

@solardiz

BTW, we also still have this pending MD4 G() optimization: #4727 (comment)

@magnumripper

> With all versions of your changes, nt_buffer becomes much larger, right? Maybe that's what causes the slowdown on AMD

Yes, for length 59 it grows from 56 bytes to 128, and for length 125 it becomes 320 bytes. And you're probably right: indeed, AMD responds well to limiting the new code to length 59 (~2.7% performance drop instead of ~6.4%), whereas on nvidia it doesn't matter in the slightest.

> Maybe you can split this array in two - for first/only limb and the rest?

I can't see any obvious way to do so that wouldn't introduce complexity. I'll ponder it though.

> BTW, we also still have this pending MD4 G() optimization

I should get that in while at it, but it shouldn't matter for nvidia (using lut3) or AMD (using bitselect). OTOH I've had the idea to revert to basic macros (neither lut3 nor bitselect) and see if the current nvidia optimizer can do better nowadays. I'm pretty sure someone (you?) said that inline assembly can ruin things for the optimizers, and I can picture why.


magnumripper commented Mar 13, 2023

> I've had the idea to revert to basic macros (neither lut3 nor bitselect) and see if the current nvidia optimizer can do better nowadays.

Just tried this (with current bleeding-jumbo): no change in speed. Then changed to that optimization in #4727 and still no change in speed, or perhaps a sub-percent drop.


magnumripper commented Mar 13, 2023

> I guess this price is for the reads at index 14 and 15, which were previously a register and a constant 0, but are now part of an array spilled to global memory. Maybe it'll be cheaper to pre-check the length and avoid those reads for <=27 (store the right values in two registers, and then use those).

I tried your suggestion and it seemed to help a little. AMD regression is now "only" 1.8% (best of five runs) or 1.5% (average of the same five runs), as long as I also limit it to a max length of 59. EDIT: I dropped that idea later on, after first trying to do it with all 16 elements of (one MD block of) nt_buffer, which had no effect at all on nvidia (I had hoped it would), and after more tests confirmed that neither change had any effect on AMD.

I really thought the "original code last" version would solve all problems. It didn't need tricks like the above - the original code works fine for the last block. Sure, we had two branches instead of one, but the impact on AMD was silly. I think I'll revisit that branch and stare at it a little.

solardiz added this to the Definitely 2.0.0 milestone Mar 14, 2023

magnumripper commented Mar 15, 2023

I found the solution to any performance regression. Like I said, I noticed Sayantan used the full 128-bit binary - which not only wastes memory and hash table/bitmap space (and hot compare code!), but also requires two more complete MD4 steps. Even with the birthday paradox against us, we ought to do fine with 64 bits.

Fixing that wasn't trivial, but here's single-hash performance on 2080ti now (normal length 27 version):

Device 2: NVIDIA GeForce RTX 2080 Ti
Loaded 1 password hash (NT-opencl [MD4 OpenCL])
LWS=128 GWS=139264 (1088 blocks) x22308 
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
MAgN11!8         (test-MAgN11!8)     
1g 0:00:00:02 DONE (2023-03-15 15:32) 0.3355g/s 51260Mp/s 51260Mc/s 9223372TC/s Dev#2:50°C util:98% MAgN11!8
No remaining hashes

Current bleeding-jumbo consistently does 33655Mp/s, so the above is a 50% boost. EDIT: While the above is repeatable, it seems to be inflated because of the very short run - probably using max. GPU clocking.

Now hashcat is 50% faster than that at the exact same task, at 75866.9 MH/s, but it has optimizations for single-hash that we never bothered with (although I'm tempted to implement them in nt-opencl). And already when attacking two hashes (of which one is faked and uncrackable), we beat hashcat, with 40265Mp/s against 35746.9 MH/s.

On a side note, hashcat benchmark claims 93781.9 MH/s. Not sure how you'd get that figure in real cracking when their real speed attacking a single hash with mask is just 75866.9 MH/s?

My current changes can easily be adapted to the raw MD4/MD5 formats; they use the same shared code. And there are seven other formats that can be fixed as well, but they don't (yet) share any code.

magnumripper added a commit to magnumripper/john that referenced this issue Mar 16, 2023
This is well tested code in other formats.
About 10% boost on 2080ti, against 5300 hashes and pure wordlist, no mask.

Also adds an entry in doc/NEWS.  Closes openwall#5245.
magnumripper added a commit to magnumripper/john that referenced this issue Mar 21, 2023
Not only do we save memory, we can reverse much more as well, and reject
early.  We check the remaining bits in cold host code, for good measure.

Closes openwall#5245
magnumripper added a commit that referenced this issue Mar 31, 2023
This way we still got optimal speed for the original nt-opencl format that
support lengths up to 27.

1 block of MD4 is up to 27 characters.
2 blocks is 59, 3 is 91, 4 is 123 and 5 is 125 (due to core max).

As long as we bump it over 27 we don't seem to gain any speed by limiting
it to less than 125 characters.

See #5245
magnumripper added a commit that referenced this issue Mar 31, 2023
Not only do we save memory, we can reverse much more as well, and reject
early.  We check the remaining bits in cold host code, for good measure.

Closes #5245