Experiment: porting the improvements from rapidhash #32
base: master
Conversation
…ly run the rapidhash benchmarks somehow
…ovement, englishword -5%, url -6%
Heads up: I probably won't really have time to look at this until the weekend. However, a quick note: I'd personally prefer to go about this a bit more scientifically and do some ablation studies to see which changes actually have which effect, isolating the effect of each change where possible instead of trying 10 different changes at once. That doesn't have to mean you'll have to make separate PRs, just letting you know that I'll probably be copying the changes and analyzing them one by one.
There's a good reason I used the rotation on the seed: it's very dangerous to mix length information in a location an attacker can control. Your rapidhash crate has (almost) seed-independent collisions because of the addition scheme you used:

```rust
use std::hash::{BuildHasher, Hasher};
fn main() {
let a: [u8; 15] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
let b: [u8; 16] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0];
let bh = rapidhash::fast::RandomState::new();
let mut h = bh.build_hasher();
h.write(&a);
dbg!(h.finish());
let mut h = bh.build_hasher();
h.write(&b);
dbg!(h.finish());
}
```

Half of the time this program will print the same hash twice. It's technically fine because the current Rust hasher implementation will always write a length prefix for slices (I had to explicitly call `Hasher::write` here to bypass it).

One particular benchmark I'll write which I'd be interested in is a graph that goes byte-per-byte in size to see what the speed is for both hashers at each size. In fact, two such benchmarks: one where the size is the same each iteration, and one where the sizes are shuffled so that you get branch mispredictions on the size (an often overlooked aspect in benchmarks).
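For context, a rough sketch (my own illustration, not std's exact code) of how slice hashing through the `Hash` trait mixes in the length, which is why the explicit `write` calls above are needed to exhibit the collision:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// When a slice is hashed via `Hash` rather than `Hasher::write`, std
// also feeds the length into the hasher. `write_usize` stands in here
// for std's length-prefix mechanism, so the 15- vs 16-byte inputs
// above would no longer collide this way.
fn hash_slice_like_std<H: Hasher>(data: &[u8], state: &mut H) {
    state.write_usize(data.len()); // length prefix
    state.write(data);
}

fn main() {
    let mut h = DefaultHasher::new();
    hash_slice_like_std(&[0u8; 15], &mut h);
    println!("{:x}", h.finish());
}
```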
Not a problem, and I fully understand why you'd like to; I sadly don't have the time to offer this myself though, apologies! I'll aim to maintain that list of each change and commit the smaller changes individually. Happy to discuss any of them further, or to go through it on a call with you if you have questions. For now I think I'll aim to get this PR as close to rapidhash's performance as possible, and then I'm happy to unpick, critique, and benchmark the changes separately with you afterwards?
Agreed, it would be interesting. There are definitely obvious cases where rapidhash suddenly pays a significantly higher cost for hashing one extra byte: at 16->17 bytes the medium-input hashing cost jumps with the function call and extra if statements, and at 48->49 bytes it pays some extra setup and teardown for the 3 independent execution paths (effectively wyhash, or the C++ rapidhash V1 long hashing).
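As a sketch of the byte-per-byte benchmark idea from the earlier comment (this is not the actual foldhash bench harness; `foldhash::fast::RandomState`, the 64-byte cap, and the iteration counts are my assumptions):

```rust
use std::hash::{BuildHasher, Hasher};
use std::time::{Duration, Instant};

// Time hashing one input per entry in `lens`, each `lens[i]` bytes long.
fn bench<B: BuildHasher>(bh: &B, lens: &[usize], buf: &[u8]) -> Duration {
    let start = Instant::now();
    let mut sink = 0u64;
    for &len in lens {
        let mut h = bh.build_hasher();
        h.write(&buf[..len]);
        sink = sink.wrapping_add(h.finish());
    }
    std::hint::black_box(sink); // keep the work from being optimised out
    start.elapsed()
}

fn main() {
    let bh = foldhash::fast::RandomState::default();
    let buf = [0u8; 64];

    // Variant 1: a fixed size per run, so the size branches predict perfectly.
    for len in 0..=64usize {
        let fixed = vec![len; 100_000];
        println!("len {len:2} fixed: {:?}", bench(&bh, &fixed, &buf));
    }

    // Variant 2: mixed sizes within one run, so the size branches mispredict.
    // (A crude deterministic mix; a real harness would shuffle properly.)
    let mixed: Vec<usize> = (0..100_000).map(|i| (i * 37 + 11) % 65).collect();
    println!("mixed sizes: {:?}", bench(&bh, &mixed, &buf));
}
```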
It's why I like the english words and url examples in your benchmarks. I had my own half-baked version before incorporating the foldhash benchmark, and kept one for email distributions that I could add to your benchmarks, but IIRC the results aren't too dissimilar, and I didn't want to skew the foldhash benchmarks "in rapidhash's favour".
Interesting, is this something where we can only generate pairs, or is it a more significant weakness? Happy to leave it as a rotation and accept the performance penalty. My other worry was that it could break down if there are any 64-byte cycles that could be found in foldhash, especially with …
We can generate more than simply pairs, but I don't see a path to arbitrary multi-collisions. I think 8-way collisions should be possible.
I don't think there's a performance penalty: the rotation is a 1-cycle latency op on all the platforms I care about, and the register-register …
I don't see a reasonable path to a cycle.

```rust
state = folded_multiply(input0 ^ state, input1 ^ seed);
```

If you can somehow make …
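For reference, `folded_multiply` is foldhash's folding multiply. A minimal sketch consistent with the line above (my reconstruction of the 64-bit path, not code copied from the crate):

```rust
// Widen to a 128-bit product, then fold the halves together so every
// bit of both inputs influences the 64-bit result.
#[inline(always)]
fn folded_multiply(a: u64, b: u64) -> u64 {
    let full = (a as u128) * (b as u128);
    (full as u64) ^ ((full >> 64) as u64)
}
```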
… can omit the bounds checks for
Just benchmarked it and you're absolutely right. I've removed the …

Some updates that I'll include in the main post:
This gives us a reference for what foldhash looks like when it matches rapidhash's short string performance (4x higher throughput in some benchmarks).

Next steps
Let me know if you're happy with the direction of the next steps? Cheers!
Ouch... In my previous testing it removed the bounds check, but maybe this regressed? Or perhaps it only eliminates it in tiny functions, and further inlining / larger code size blocks the optimization?
You're moving a bit fast here for my usual pace. I'll probably end up splitting this into several different PRs, each individually verified to be an incremental improvement over the last, plus a PR or two for the benchmarks. Again, this is not something I expect you to do, and I'm already more than happy that you're willing to create a known-good baseline to compare to. So I do have to warn you upfront that I'm unlikely to merge a large generic "improves performance" PR which combines a bunch of optimizations. Don't worry though, if I end up merging even one thing from this (which I almost surely will) I'll add an acknowledgement. Ultimately my goal is to see exactly what it is that made rapidhash's short string performance better compared to foldhash, and to see if we can do even better still. One thing I'm curious about is whether it matters whether the …
Actually, there are some very strange things going on with the …
EDIT: I can't reproduce the faster numbers now, even with an older nightly Rust version. The only thing I can think of that changed in the meantime is my linker...
I believe this would always need instructions to load 4x u64 seeds on any hashed string, regardless of whether the expand seeds were used. I think it would be asking a lot for the compiler to implicitly move those load instructions inside specialisations of …

The rapidhash long method also uses 7 secrets, so it becomes a bit more effective to pass around a reference. Maybe less of a concern at 4 secrets.
I've just tested it and you're correct: there are no bounds check issues on master when I port only the `rapid_u64` method to compare.
Good point! That's very weird. I can't think of any recent changes to …

One thing that surprised me in my benchmarks is how the foldhash and rapidhash quality variants' avalanche step is "free" on Intel x86 chips in the charts and in the foldhash bench results, yet randomly foldhash or rapidhash fast will drop from 0.77 to 0.66, but not consistently. As if there's some optimisation the compiler only sometimes manages to apply. I should inspect it further with Xcode Instruments, likewise for foldhash, to see what goes wrong.
Yeah, it's difficult to break them up into separate PRs because of the way the changes stack together. The `read_u64` change is a no-op without some of the rapidhash code, for example. I also appreciate that it's hard to "take you on the journey" in a single PR, and that you'd like to truly understand the magnitude of each change for yourself. Part of me wonders whether it's easier to test them backwards rather than forwards? Working forwards, the way I'd recommend ordering the changes: …
I'm also happy to create and work through each PR with you at whatever pace you'd like over the next month; there's no rush on my side. I also understand if you'd prefer to do so yourself and loop me in if you have questions.
This is what I meant, or with a sub-struct and passing a reference to that around.
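A hedged sketch of the two layouts under discussion (field names are hypothetical, not foldhash's actual definitions):

```rust
// Seeds stored inline: 32 bytes of expand seeds travel with the hasher
// and may be loaded even when a short string never uses them.
struct HasherInlineSeeds {
    accumulator: u64,
    fold_seed: u64,
    expand_seeds: [u64; 4],
}

// Seeds behind a 'static reference (or a reference to a sub-struct):
// the seeds cost one pointer in the hasher, and the short-string path
// never touches them.
struct HasherSeedRef {
    accumulator: u64,
    fold_seed: u64,
    expand_seeds: &'static [u64; 4],
}
```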
Considering I can't reproduce the better numbers even with the old version of Rust and for some very mysterious reason foldhash-q does better than foldhash-f (which literally should be impossible), I think there's some serious linker shenanigans going on.
Ah... Do you want to know what most of the rapidhash performance gain comes from? Mark …

Need to benchmark what that does to 17, 18 etc. length strings, but all string hashing is faster for it. Not inlining …

MR with results: #33

Following on from our discussion in rust-lang/hashbrown#633, this draft PR demonstrates a number of the rapidhash optimisations. I'm leaving in notes and old functions to make testing and comparing easier, and will do the clean up once we're happy!
Current changes:
- Reduced the number of arguments passed from `Hash::write` for medium + long inputs. This means passing the `&FoldHasher` directly to hash medium/long.
- Moved `self.accumulator` outside of the if statement, since the compiler doesn't seem to be able to optimise this itself. (This turned out to be codegen-units related, with no performance impact. I've left the change in because I think it's clearer; happy to remove if requested.)
- The benchmark `Cargo.toml` did not have `codegen-units = 1` and `incremental = false`. This has been fixed for more consistent benchmarking.
- Passed the seeds as a `seeds: &'static [u64; 4]` reference. This is compatible with the various `State` implementations, reduces the size of `FoldHasher`, and avoids loading the expand seeds when hashing smaller strings. Previously it would have had to load 4x u64 seeds on any string hashing, regardless of whether the expand seeds were used, as the string lengths are opaque.
- New `read_u64` and `read_u32` implementations that let the compiler correctly omit the bounds checks (see the sketch below). This bumps the MSRV to 1.77; the implementation was necessary in rapidhash to keep the function `const`, but we might be able to find an alternative with a lower MSRV that also lets the bounds check be omitted.
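A sketch of the bounds-check-friendly read consistent with the MSRV 1.77 note above (my guess at the shape via `slice::first_chunk`, which was stabilised as a `const fn` in Rust 1.77; not necessarily the PR's exact code):

```rust
// `first_chunk` hands the optimiser a fixed-size array, so when the
// caller has already established `bytes.len() >= 8` the `None` branch
// folds away and no bounds check is emitted.
#[inline(always)]
const fn read_u64(bytes: &[u8]) -> u64 {
    match bytes.first_chunk::<8>() {
        Some(chunk) => u64::from_le_bytes(*chunk),
        None => 0, // callers are expected to uphold bytes.len() >= 8
    }
}
```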
Ideas

- Reduce the `FoldHasher` size by storing the seeds as a `&[u64; 4]`. RapidHasher uses a `&'static`, which might not play nicely with some of the other `State` types. Will benchmark it with `'static` and comment out `FixedState` to test the hypothesis. (It worked fine with `FixedState`; I was worried `FoldHasher` allowed setting non-static secrets.)
- Different `write_num` implementations. I couldn't spot a significant performance difference when using the rapidhash macro version; there's a tiny chance it improves strurl by 100ms.
- Removing `hash_long` would sacrifice some top-end throughput, but would improve small string hashing to beat or match rapidhash up to 288 bytes. In rapidhash, the if gate for >288-byte inputs seems to be almost free somehow... 🤔
- Judging by the benchmarks, LLVM doesn't seem to specialise the rapidhash medium function to omit some of the bounds checks. LTO is "thin", rapidhash's specialisation works, the if short/medium/long layout is the same, and it doesn't seem to be const-related. Replacing the `read_u64` and `read_u32` implementations with `read_unaligned` matches the rapidhash performance (sketch below).
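A hedged sketch of the `read_unaligned` variant mentioned in the last idea; unlike the `first_chunk` version it sidesteps the bounds check entirely rather than relying on the optimiser, at the cost of an `unsafe` contract:

```rust
/// Safety: the caller must guarantee `bytes.len() >= 8`.
#[inline(always)]
unsafe fn read_u64_unaligned(bytes: &[u8]) -> u64 {
    debug_assert!(bytes.len() >= 8);
    // Unaligned little-endian 8-byte load; no bounds check is emitted.
    let v = unsafe { (bytes.as_ptr() as *const u64).read_unaligned() };
    u64::from_le(v)
}
```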
Current Performance Improvement

Performance Goal
I've added rapidhash to the foldhash benchmarks to make sure I hadn't done something unfair with the rapidhash crate. We're still a small way off even with the above improvements, enough that it's making me paranoid I've done something wrong in rapidhash!
Other notes
Reasoning behind reducing function arguments
Regarding the function arguments, reducing them means less logic in the `Hash::write` method to load seeds from memory and force them into specific registers. If they're not passed in registers, then they end up as load instructions inside the called function instead, and the compiler is free to choose which registers to load them into and when to load them.
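To illustrate the point, a hedged sketch with hypothetical signatures (not the PR's actual code, whose snippet and godbolt output aren't preserved in this thread):

```rust
struct FoldHasherSketch {
    fold_seed: u64, // stand-in for the real hasher's fields
}

// Seeds passed individually: every call site must first load all of
// them into the specific argument registers the ABI dictates.
fn hash_medium_args(data: &[u8], s0: u64, s1: u64, s2: u64, s3: u64) -> u64 {
    data.len() as u64 ^ s0 ^ s1 ^ s2 ^ s3 // body elided; sketch only
}

// Hasher passed by reference: the callee loads seeds from memory
// itself, only when needed, into registers of the compiler's choosing.
fn hash_medium_ref(hasher: &FoldHasherSketch, data: &[u8]) -> u64 {
    data.len() as u64 ^ hasher.fold_seed // body elided; sketch only
}
```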
Compiles to the following with `-O` on Rust 1.63 on godbolt: …

Mixing the length: wrapping_add vs rotate_right
Wrapping add uses the full 64-bit length (admittedly, rarely relevant), but should also only take a single instruction on all platforms. Our SMHasher tests with rapidhash suggest XOR and wrapping add were sufficient to still produce a high-quality hash function, and it has the benefit of not being limited to only 64 possible rotations.
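A minimal sketch of the two options being compared (hypothetical helper names):

```rust
// Mix the length by addition: one instruction everywhere and it uses
// the full 64-bit length, but an attacker-controlled additive input
// difference can cancel a length difference (see the collision example
// earlier in the thread).
fn mix_len_add(seed: u64, len: usize) -> u64 {
    seed.wrapping_add(len as u64)
}

// Mix the length by rotation: also a single cheap op on common
// platforms, and a permutation of the seed that additive differences
// can't cancel, but only 64 distinct rotations exist.
fn mix_len_rotate(seed: u64, len: usize) -> u64 {
    seed.rotate_right(len as u32)
}
```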
Compiles to the following with `-O` on Rust 1.63 on godbolt: …