
Conversation

hoxxep commented Aug 5, 2025

Following on from our discussion in rust-lang/hashbrown#633, this draft PR demonstrates a number of the rapidhash optimisations. I'm leaving in notes and old functions to make testing and comparing easier, and will do the clean up once we're happy!

Current changes:

  • Reduce the function-call assembly in Hash::write for medium + long inputs by reducing the number of passed arguments. This means passing the &FoldHasher directly to the medium/long hashing functions.
  • Extract self.accumulator outside of the if statement, since the compiler didn't seem to be able to optimise this itself. This turned out to be codegen-units related and has no performance impact; I've left the change in because I think it's clearer, but I'm happy to remove it if requested.
  • Reduce code size by removing the XORs from the small hashing. The compiler likely optimised this away already.
  • Replace medium-input hashing with the rapidhash version. I'll tidy up the function naming and remove the unused function when we're happy; I've left both in for easier testing and to remember the rapidhash bounds.
  • Cargo.toml did not have codegen-units = 1 and incremental = false. This has been fixed for more consistent benchmarking.
  • Replace the fold_seed and expand_seeds with a single seeds: &'static [u64; 4] reference (sketched below). This is compatible with the various State implementations, reduces the size of FoldHasher, and avoids loading the expand seeds when hashing smaller strings. Previously the hasher had to load all 4x u64 seeds for any string hashing, regardless of whether the expand seeds were used, because string lengths are opaque.
  • Replace the old slice-to-array conversion when reading u64 and u32 with specific read_u64 and read_u32 implementations that let the compiler correctly omit the bounds checks. This bumps the MSRV to 1.77; this implementation was necessary in rapidhash to keep the function const, but we might be able to find an alternative with a lower MSRV that still omits the bounds checks.
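
As a rough sketch of the seed-reference change above (field names are approximate, not foldhash's actual definitions):

// Before (approximate): four seed fields embedded in the hasher, all of which
// get loaded whenever a string is hashed, because the length is opaque.
struct FoldHasherBefore {
    accumulator: u64,
    fold_seed: u64,
    expand_seed: u64,
    expand_seed2: u64,
    expand_seed3: u64,
}

// After (approximate): a single &'static reference; only seeds[0] is read on
// the short path, and the expand seeds are dereferenced inside medium/long.
struct FoldHasherAfter {
    accumulator: u64,
    seeds: &'static [u64; 4],
}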

Ideas

  • Experiment with reducing FoldHasher size by storing the seeds as a &[u64; 4]. RapidHasher uses a &'static, which might not play nicely with some of the other State types, so I benchmarked it with 'static and commented out FixedState to test the hypothesis. Update: it worked fine with FixedState; I had been worried that FoldHasher allowed setting non-'static secrets.
  • Compare write_num implementations. I couldn't spot a significant performance difference when using the rapidhash macro version; there's a tiny chance it improves strurl by 100ms.
  • Removing hash_long would sacrifice some top-end throughput, but would improve small string hashing to beat or match rapidhash up to 288 bytes. In rapidhash, the if gate for >288 inputs seems to be almost free somehow... 🤔 Update: Cargo.toml did not have codegen-units = 1 and incremental = false.
  • Judging by the benchmarks, LLVM doesn't seem to specialise the rapidhash medium function to omit some of the bounds checks. LTO is "thin", rapidhash's specialisation works, the if short/medium/long layout is the same, and it doesn't seem to be const-related. Update: Cargo.toml did not have codegen-units = 1 and incremental = false.
  • The remainder of the performance gap comes from bounds checks not being optimised away. Replacing the read_u64 and read_u32 implementations with read_unaligned matches the rapidhash performance.
  • Replace the foldhash long-input hashing with rapidhash's 288+ method to improve peak throughput, if you'd want to include it.

Current Performance Improvement

StrUuid/hashonly-struuid-foldhash-fast                                                                             
                        time:   [2.8227 ns 2.8245 ns 2.8263 ns]
                        change: [-49.107% -49.024% -48.956%] (p = 0.00 < 0.05)
                        Performance has improved.

StrDate/hashonly-strdate-foldhash-fast                                                                             
                        time:   [1.3091 ns 1.3117 ns 1.3148 ns]
                        change: [-75.774% -75.719% -75.658%] (p = 0.00 < 0.05)
                        Performance has improved.

StrEnglishWord/hashonly-strenglishword-foldhash-fast                                                                             
                        time:   [1.4575 ns 1.4589 ns 1.4605 ns]
                        change: [-73.756% -73.716% -73.674%] (p = 0.00 < 0.05)
                        Performance has improved.

StrUrl/hashonly-strurl-foldhash-fast                                                                             
                        time:   [5.1459 ns 5.1917 ns 5.2371 ns]
                        change: [-31.611% -30.927% -30.211%] (p = 0.00 < 0.05)
                        Performance has improved.

Performance Goal

I've added rapidhash to the foldhash benchmarks to make sure I hadn't done something unfair with the rapidhash crate. Even with the above improvements we're still a small way off, which is making me paranoid that I've done something wrong in rapidhash!

realworld/StrUuid/hashonly-struuid-rapidhash-f                                                                             
                        time:   [2.5651 ns 2.5673 ns 2.5697 ns]

realworld/StrDate/hashonly-strdate-rapidhash-f                                                                             
                        time:   [1.4701 ns 1.4880 ns 1.5093 ns]

realworld/StrEnglishWord/hashonly-strenglishword-rapidhash-f                                                                             
                        time:   [1.4776 ns 1.4812 ns 1.4851 ns]

realworld/StrUrl/hashonly-strurl-rapidhash-f                                                                             
                        time:   [5.0637 ns 5.1026 ns 5.1410 ns]

Other notes

Reasoning behind reducing function arguments

Regarding the function arguments, reducing them means less logic in the Hash::write method for loading seeds from memory and forcing them into specific registers. If they're not passed in registers, they end up as load instructions inside the callee instead, and the compiler is free to choose which registers to load them into, and when to load them.

pub fn many_args(x: u64, a: u64, b: u64, c: u64, d: u64, e: u64) -> u64 {
    x ^ a ^ b ^ c ^ d ^ e
}

pub fn few_args(x: u64, a: &[u64; 5]) -> u64 {
    x ^ a[0] ^ a[1] ^ a[2] ^ a[3] ^ a[4]
}

Compiles to the following with -O on Rust 1.63 on godbolt:

example::many_args::h3745b95ab12b4ef4:
        mov     rax, r8
        xor     rdi, rsi
        xor     rdx, rcx
        xor     rdx, rdi
        xor     rax, r9
        xor     rax, rdx
        ret

example::few_args::h6cff59e4f5c20fe7:
        mov     rax, rdi
        xor     rax, qword ptr [rsi]
        xor     rax, qword ptr [rsi + 8]
        xor     rax, qword ptr [rsi + 16]
        xor     rax, qword ptr [rsi + 24]
        xor     rax, qword ptr [rsi + 32]
        ret

Mixing the length: wrapping_add vs rotate_right

Wrapping add uses the full 64-bit length (admittedly, lengths that large are rare), but it should also only take a single instruction on all platforms. Our SMHasher tests with rapidhash suggest XOR and wrapping add were sufficient to still produce a high-quality hash function, and the approach has the benefit of not being limited to only 64 possible rotations.

pub fn rot_right(x: u64, value: u64) -> u64 {
    x.rotate_right(value as u32)
}

pub fn wrapping_add(x: u64, value: u64) -> u64 {
    x.wrapping_add(value)
}

Compiles to the following with -O on Rust 1.63 on godbolt:

example::rot_right::h0cde95e92bd15e6b:
        mov     rcx, rsi
        mov     rax, rdi
        ror     rax, cl
        ret

example::wrapping_add::ha6cb2ed9a0360b12:
        lea     rax, [rdi + rsi]
        ret

hoxxep changed the title from "Experiments hashing improvements from rapidhash" to "Experiments on the improvements from rapidhash" on Aug 5, 2025
hoxxep changed the title from "Experiments on the improvements from rapidhash" to "Experiment: porting the improvements from rapidhash" on Aug 5, 2025

orlp commented Aug 5, 2025

Heads up: I probably won't really have time to look at this until the weekend.

However, a quick note: I'd personally prefer to go about this a bit more scientifically and do some ablation studies to see which changes actually have which effect: isolating the effects per change where possible, instead of trying 10 different changes at once. That doesn't have to mean you'll have to make separate PRs, but just letting you know that I'll probably be copying them and analyzing them one by one.

Replace the accumulator.rotate_right(len) with an accumulator.wrapping_add(len), rapidhash suggests it still provides strong hash quality, and it only takes a single instruction on all platforms.

There's a good reason I used the rotation on the seed: it's very dangerous to mix length information in a location an attacker can control. Your rapidhash crate has (almost) seed-independent collisions because of this addition scheme you used:

use std::hash::{BuildHasher, Hasher};

fn main() {
    let a: [u8; 15] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    let b: [u8; 16] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0];

    let bh = rapidhash::fast::RandomState::new();
    let mut h = bh.build_hasher();
    h.write(&a);
    dbg!(h.finish());
    
    let mut h = bh.build_hasher();
    h.write(&b);
    dbg!(h.finish());
}

Half of the time this program will print the same hash twice. It's technically fine because the current Rust hasher implementation will always write a length prefix for slices (I had to explicitly call write), but I personally would prefer it if the hash function itself took care of length-based attacks like this (which is why we mix in the length in the first place).


One particular benchmark I'd be interested in writing is a graph that goes byte-by-byte in size to see what the speed of both hashers is at each size. In fact, two such benchmarks: one where the size is the same each iteration, and one where the sizes are shuffled so that you get branch mispredictions on the size (an often overlooked aspect in benchmarks).

hoxxep commented Aug 5, 2025

I'd personally prefer to go about this a bit more scientifically and do some ablation studies to see which changes actually have which effect.

Not a problem, and I fully understand why you'd like to; I sadly don't have the time to offer this myself though – apologies!

I'll aim to maintain that list of each change and commit smaller changes individually. Happy to discuss any of them further or go through it on a call with you if you have questions. For now I think I'll aim to get this PR as close to rapidhash's performance as possible, and then I'm happy to unpick, critique, and benchmark changes separately with you afterwards?

one where the size is the same each iteration

Agreed it would be interesting. There are definitely obvious cases where rapidhash suddenly has a significantly higher cost for hashing +1 byte – at 16->17 bytes the medium input hashing cost jumps with the function call and extra if statements, and at 48->49 bytes there's some extra setup and teardown for the 3 independent execution paths that it has to pay for (effectively wyhash or the C++ rapidhash V1 long hashing).

and one where the sizes are shuffled so that you get branch mispredictions on the size (an often overlooked aspect in benchmarks).

It's why I like the English words and URL examples in your benchmarks. I had my own half-baked version before incorporating the foldhash benchmark, and kept one for email distributions that I could add into your benchmarks, but IIRC the results aren't too dissimilar, so I didn't want to skew the foldhash benchmarks "in rapidhash's favour".

it's very dangerous to mix length information in a location an attacker can control

Interesting, is this something where we can only generate pairs, or is it a more significant weakness? Happy to leave it as a rotation and accept the performance penalty. My other worry was that it could break down if there are any 64-byte cycles to be found in foldhash, especially with hash_bytes_long taking 64-byte steps. I don't have a good way of reasoning about whether a cycle is impossible or not?

orlp commented Aug 5, 2025

Interesting, is this something where we can only generate pairs, or is it a more significant weakness?

We can generate more than simply pairs, but I don't see a path to arbitrary multi-collisions. I think 8-way collisions should be possible.

Happy to leave it as a rotation and accept the performance penalty

I don't think there's a performance penalty: the rotation is a 1-cycle-latency op on all the platforms I care about, and the register-register movs should be free. But I'll benchmark it.

my other worry was it could break down if there are any 64 byte cycles that could be found in foldhash, especially with hash_bytes_long being a 64 byte step. I don't have a good way of reasoning if a cycle is impossible or not?

I don't see a reasonable path to a cycle.

state = folded_multiply(input0 ^ state, input1 ^ seed);

If you can somehow make input1 ^ seed == 1 then you've got a cycle, but if you can do that you might as well make input1 ^ seed == 0 for an immediate multi-collision. Technically speaking there are also other fixed points than input1 ^ seed == 1, but I would find it somewhat difficult to find them even if I knew state and seed, let alone if they're a complete mystery to me.
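
For reference, folded_multiply here is (roughly) a widening 64x64 -> 128-bit multiply whose halves are XORed back together. This is a sketch of the 128-bit path only, not foldhash's exact implementation (which also has a fallback for targets without a cheap 128-bit multiply):

// Sketch: multiply to 128 bits, then fold the high half into the low half.
fn folded_multiply(a: u64, b: u64) -> u64 {
    let full = (a as u128) * (b as u128);
    (full as u64) ^ ((full >> 64) as u64)
}

// If input1 ^ seed == 1, the product is just input0 ^ state with a zero high
// half, so the new state is input0 ^ state with no mixing – the degenerate
// case described above.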

hoxxep commented Aug 5, 2025

I don't think there's a performance penalty: the rotation is a 1-cycle-latency op on all the platforms I care about, and the register-register movs should be free. But I'll benchmark it.

Just benchmarked it and you're absolutely right. I've reverted the wrapping_add back to a rotate_right, but in the same code position where rapidhash performs its wrapping_add.

Some updates that I'll include in the main post:

  • I've added codegen-units = 1 and incremental = false to the release profile as we were getting inconsistent benchmarks before. This makes me doubt whether some of the previous changes really had an impact, or were just getting luckier with codegen units in the build step, such as moving self.accumulator outside of the if statement.
  • Bounds checking was the final gap to rapidhash. For some reason slice[..].try_into().unwrap() wasn't having its bounds checks omitted, spotted by temporarily using unaligned pointer reads.
    • Old: u64::from_ne_bytes(slice[offset..offset + 8].try_into().unwrap())
    • New (roughly): u64::from_ne_bytes(*slice.split_at(offset).1.first_chunk::<8>().unwrap()) (see the sketch below). This bumps the MSRV to 1.77, so it might be worth experimenting with alternatives.
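
A minimal sketch of the new read as a standalone helper (the function name is illustrative; it assumes the caller guarantees offset + 8 <= slice.len(), and first_chunk requires Rust 1.77):

// split_at panics if offset > len, and first_chunk returns None if fewer than
// 8 bytes remain; when the caller's own length check already dominates, LLVM
// can eliminate both, which is what removes the bounds checks here.
#[inline(always)]
fn read_u64(slice: &[u8], offset: usize) -> u64 {
    u64::from_ne_bytes(*slice.split_at(offset).1.first_chunk::<8>().unwrap())
}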

This gives us a reference for what foldhash looks like when it matches rapidhash's short string performance (4x higher throughput in some benchmarks).

Next steps

  • Throughput benchmarks: I'll add rapidhash's throughput benchmarks, as that'll let us profile each input length one by one.
  • Clean up: I'd like to clean up the notes, function naming, and a few other pieces to match the existing foldhash code.
  • Replacing hash_bytes_long: I can reduce the level of unrolling (equivalent to COMPACT=true on rapidhash) so we aren't fully unrolled into 244 steps.
  • Ablation study: I can possibly do some work later this week, working backwards for your ablation study and undoing the changes made. This might help identify the spurious codegen-units-related changes; I can undo those to stay close to the original foldhash, and I'll summarise how impactful each change is.

Let me know if you're happy with the direction of the next steps? Cheers!

orlp commented Aug 5, 2025

For some reason slice[..].try_into().unwrap() wasn't having its bounds checks omitted, spotted by temporarily using unaligned pointer reads.

Ouch... In my previous testing it removed the bounds check but maybe this regressed? Or perhaps it only eliminates it in tiny functions and further inlining/larger codesize blocks the optimization?

Let me know if you're happy with the direction of the next steps? Cheers!

You're moving a bit fast here for my usual pace. I'll probably end up splitting this into several different PRs, each individually verified to be an incremental improvement over the previous, plus a PR or two for the benchmarks. Again, this is not something I expect you to do, and I'm already more than happy that you're willing to create a known-good baseline to compare to.

So I do have to warn you upfront that I'm unlikely to merge a large generic "improves performance" PR which combines a bunch of optimizations. Don't worry though, if I end up merging even one thing from this (which I almost surely will) I'll add an acknowledgement.

Ultimately my goal is to see exactly what it is that made rapidhash's short string performance better compared to foldhash, and to see if we can do even better still.


One thing I'm curious about is whether it matters whether the seeds struct is an array, or whether using named fields (such as fold_seed and expand_seed) performs the same. I'd prefer the names if the performance is the same.

orlp commented Aug 5, 2025

Actually, I've just noticed some very strange things going on with the strenglishword benchmark.

  1. The performance numbers have seriously regressed from when I last ran them. As you can see in my results, strenglishword reported a time of 1.84 ns; now, on the same machine, it's 5.27 ns. I will investigate whether this regressed with some specific version of Rust.

  2. There are some serious issues with the new numbers. For example, on both your and my Apple ARM machines we see that foldhash-q is significantly faster than foldhash-f for strenglishword. Check your own numbers: https://github.com/hoxxep/rapidhash/blob/master/docs/bench_hash_aarch64_apple_m1_max.txt. This makes no sense, since foldhash-q does strictly more work than foldhash-f.

EDIT: I can't reproduce the faster numbers now, even with an older nightly Rust version. The only thing I can think of that changed in the meantime is my linker...

hoxxep commented Aug 5, 2025

One thing I'm curious about is whether it matters whether the seeds struct is an array, or whether using named fields (such as fold_seed and expand_seed) performs the same. I'd prefer the names if the performance is the same.

I believe this would always need instructions to load 4x u64 seeds for any hashed string, regardless of whether the expand seeds were used. I think it would be asking a lot for the compiler to implicitly move those load instructions inside specialisations of hash_bytes_medium and hash_bytes_long; at least, I couldn't get it to work. An alternative is to change SharedSeeds to be a struct with named fields, and pass around a &SharedSeeds instead of the inner array?

The rapidhash long method also uses 7 secrets, so it becomes a bit more effective to pass around a reference. Maybe less of a concern at 4 secrets.
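
Roughly what I mean by that alternative, as a hypothetical sketch (names illustrative, hashing body elided):

// The hasher stores one reference; medium/long take &SharedSeeds so the expand
// seeds are only loaded inside those out-of-line functions.
struct SharedSeeds {
    fold_seed: u64,
    expand_seed: u64,
    expand_seed2: u64,
    expand_seed3: u64,
}

fn hash_bytes_medium(bytes: &[u8], accumulator: u64, seeds: &SharedSeeds) -> u64 {
    // ...medium-input hashing elided; seeds.expand_seed etc. are read only here...
    accumulator ^ seeds.fold_seed ^ bytes.len() as u64
}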

Ouch... In my previous testing it removed the bounds check but maybe this regressed? Or perhaps it only eliminates it in tiny functions and further inlining/larger codesize blocks the optimization?

I've just tested it and you're correct: there are no bounds-check issues on master when I port only the rapid_u64 method over to compare.

The performance numbers have seriously regressed from when I last ran them.

Good point! That's very weird. I can't think of any recent changes to Hash or Hasher, at least that I'm aware of. Alternatively, it could have been getting lucky with codegen units? I had a lot of inconsistencies benchmarking what should have been no-op changes to rapidhash a year ago, before I realised how unreliable benchmarking is without codegen-units = 1.

One thing that surprised me in my benchmarks is how the foldhash and rapidhash quality variants' avalanche step is "free" on Intel x86 chips in the charts and in the foldhash bench results, and how foldhash or rapidhash fast will randomly drop from 0.77 to 0.66, but not consistently, as if there's some optimisation the compiler only sometimes manages to apply. I should inspect it further with Xcode Instruments, and likewise for foldhash, to see what goes wrong.

You're moving a bit fast here for my usual pace. I'll probably end up splitting this into several different PR's, each individually verified to be an incremental improvement over the other, plus a PR or two for the benchmarks

Yeah it's difficult to break them up into separate PRs because of the way the changes stack together. The read_u64 change is a no-op without some of the rapidhash code for example. I also appreciate it's hard to "take you on the journey" in a single PR and you'd like to truly understand the magnitude of each change for yourself. Part of me wonders whether it's easier to test them backwards rather than forwards?

Working forwards, the way I'd recommend ordering the changes:

  1. Refactor to use read_u64 and read_u32, and set codegen-units = 1. It's a noop on master, but it makes the benchmarks reliable and it's easier to test the bounds checks going forwards.
  2. Reduce the size of FoldHasher to store a &SharedSeed. This should reduce the set up cost marginally when hashing larger types where not everything is inlined. It may or may not be a noop on the benchmark suite.
  3. Reduce the work done before, and the arguments passed to, hash_bytes_medium and hash_bytes_long. I've had success with (accumulator, seeds, bytes), but experimenting with (hasher: &FoldHasher, bytes) might also be of interest if it can be equally fast. My aim was to produce as little LLVM-IR and as few resulting instructions for Hash::write as possible, so it's inlined more often and potentially uses fewer registers.
  4. Replace hash_bytes_medium with the rapidhash flavour.
  5. Mark hash_bytes_medium as #[inline(never)]. It may be a noop, but I just switched from #[cold] to #[inline(never)] and it appeared to improve rapidhash on a preliminary benchmark run.
  6. Replace hash_bytes_long with the rapidhash flavour, keeping #[inline(never)] and #[cold].

I'm also happy to create and work through each PR with you at a pace you'd like over the next month, there's no rush on my side. Also understand if you'd prefer to do so yourself and loop me in if you have questions.

orlp commented Aug 5, 2025

An alternative is to change SharedSeeds to be a struct with named fields, and pass around a &SharedSeeds instead of the inner array?

This is what I meant, or with a sub-struct and passing a reference to that around.

Good point! That's very weird. I can't think of any recent changes to Hash or Hasher that I'm aware of at least.

Considering I can't reproduce the better numbers even with the old version of Rust and for some very mysterious reason foldhash-q does better than foldhash-f (which literally should be impossible), I think there's some serious linker shenanigans going on.

hoxxep commented Aug 5, 2025

I've just looked at it using cargo-instruments: it's not inlining the Hash::hash(string) part of BuildHasher::hash_one, so it's not able to optimise across write_len_prefix and finish.

Instruments is having one of its tantrums and not showing me the raw assembly, but the highlighted line is not marked as "inline". What's sufficient to get inlined vs not is a mystery to me, but from fighting against it while building rapidhash, I just know the less LLVM-IR and the simpler the function, the better. Conveniently, Rust's intermediate representation removes const if statements, which is why rapidhash is littered with them...

[Instruments screenshot: the highlighted Hash::hash call is not marked as inlined]

I'm certainly not qualified to speak on the matter, but I would love to lower the inlining cost (or raise the limit) for methods related to the Rust hashing traits. We can't mark Hash::hash #[inline(always)] for good reason, but even with big optimisations to be had for a 4-5x speed-up, LLVM is really hesitant to aggressively inline across it, even on simple types with minimal logic – like foldhash string hashing in this example. The original version of RapidHasher used separate a, b state instead of a single accumulator, but that was enough to trip over the inlining threshold when also using an integer sponge.

hoxxep commented Aug 5, 2025

Ah... Do you want to know what most of the rapidhash performance gain comes from?

Mark hash_bytes_medium as #[inline(never)] and it should fix the inlining!

I need to benchmark what that does to strings of length 17, 18, etc., but all string hashing is faster for it. Not inlining write_len_prefix and finish seemingly has a bigger overall cost than never inlining hash_bytes_medium.
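
A minimal sketch of the change (the signature is illustrative, not foldhash's exact one):

// The attribute is the whole change: keeping the medium path out of line
// shrinks Hash::write enough that LLVM will inline Hash::hash into hash_one,
// and with it write_len_prefix and finish.
#[inline(never)]
fn hash_bytes_medium(bytes: &[u8], accumulator: u64, fold_seed: u64, expand_seed: u64) -> u64 {
    // ...medium-input hashing body elided...
    accumulator ^ fold_seed ^ expand_seed ^ bytes.len() as u64
}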

MR with results: #33

StrUuid/hashonly-struuid-foldhash-fast                                                                             
                        time:   [3.8693 ns 3.8759 ns 3.8845 ns]
                        change: [-30.347% -30.207% -30.075%] (p = 0.00 < 0.05)
                        Performance has improved.

StrDate/hashonly-strdate-foldhash-fast                                                                             
                        time:   [1.3790 ns 1.3990 ns 1.4221 ns]
                        change: [-74.467% -74.116% -73.656%] (p = 0.00 < 0.05)
                        Performance has improved.

StrEnglishWord/hashonly-strenglishword-foldhash-fast                                                                             
                        time:   [1.4834 ns 1.4860 ns 1.4887 ns]
                        change: [-73.295% -73.243% -73.190%] (p = 0.00 < 0.05)
                        Performance has improved.

StrUrl/hashonly-strurl-foldhash-fast                                                                             
                        time:   [6.2293 ns 6.2748 ns 6.3197 ns]
                        change: [-17.006% -16.186% -15.326%] (p = 0.00 < 0.05)
                        Performance has improved.
