Tweak `EntityHasher` for more speed #10605

scottmcm · 2023-11-17T04:31:00Z

Objective

#9903 did a great job on EntityHasher, but seeing it in the 0.12 release notes nerd-sniped me and I found some μoptimizations to it.

(This was also the source of #10519.)

See also some Discord conversation in https://discord.com/channels/691052431525675048/1172033156845674507/1174887022914187264.

Solution

EntityHasher does two things today that are great and I keep here:

being pass-through for H1 (assuming fewer than ≈3,758,096,384 entities in the table)
using multiplication to populate H2

(For H1 & H2, see https://youtu.be/ncHmEUmJZf4?t=1540)

But there's two things that can tweak:

Skip generation entirely

Today, the hasher bit-ors the generation into the high 32 bits of the hash. Unfortunately, H2 only cares about the high 7 bits of the result, so unless you get to generation 33,554,432 that doesn't actually do anything.

As a good demonstration of that, note that the benchmark that just looks up extant entities with the (probably-)wrong generation is actually faster with this change, despite being the one where including the generation in the hash should matter the most:

(Note how the miss on generation is ≈25% slower than the miss on index, since the hash has a high probability of noticing the miss from the index before getting to == on the entities.)

And render-world clears things every frame, so if I'm understanding correctly it'll almost never have generation conflicts anyway.

By skipping the generation, the optimized codegen for the hash doesn't even bother to load it from the &Entity.

Phrase the hash with less shifting

Today, the hash does a 64-bit multiplication of the input with a 64-bit constant, but then it shifts it left by 32 bits, which means it actually threw away most of the result. That's not actually a problem, really, since LLVM is clever and figures out what's going on, but had me puzzled for a while.

That optimization inspired the approach here: rather than shifting and masking, just put that in the multiplication constant directly. And thanks to skipping the generation, it's a 32-bit input. So on both 64-bit and 32-bit, it only needs a single multiplication. (LLVM sees the particular form of the constant and lowers it in a smart way using a slightly different constant in the multiplication: https://rust.godbolt.org/z/nGWzhfcfe)

This leaves the hash, once some intermediate parts optimize away, as just

u64::from(e.index).wrapping_mul(0x9e37_79b9_0000_0001)

That's It

And that's about the simplest possible thing that we could do while still trying to avoid H2 collisions.

This produces identical results (Proof: https://alive2.llvm.org/ce/z/orQoq9) as multiply-shift-or, just LLVM seemingly doesn't want to do it by itself (https://rust.godbolt.org/z/58avPcG1o).

Macrobenchmarks

TBH there's not much to see here. Some slightly better, some slightly worse, but I don't think any are particularly interesting -- I don't think I have consistent enough frame timings to see differences of a couple of instructions in hashing, especially when those instructions are fast bitops.

Yellow (this trace) is the new code and red (external trace) is the baseline.

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d

cargo run --example many_cubes --release --features bevy/trace_tracy -- --benchmark

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite --ordered-z

cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite

cargo run --example many_buttons --release --features bevy/trace_tracy

Changelog

Tweaked EntityHasher for a bit more speed in usual cases.

Migration Guide

(No breaking changes.)

scottmcm · 2023-11-18T08:24:34Z

crates/bevy_utils/src/lib.rs

@@ -285,6 +285,20 @@ impl Hasher for EntityHasher {

    #[inline]
    fn write_u64(&mut self, i: u64) {


Demonstration of the overall assembly diff with this change: https://rust.godbolt.org/z/9zxfMdEfv

scottmcm · 2023-11-18T08:27:12Z

crates/bevy_utils/src/lib.rs

+        // of a conflict in the index despite a good hash function.
+        //
+        // This masking actually ends up with negative cost after optimization,
+        // since it saves needing to do the shift-and-or between the fields.


Demonstration of this by comparing the codegen without the masking to having the masking: https://rust.godbolt.org/z/3vvsT38ar

scottmcm · 2023-11-18T08:58:55Z

There was a good conversation on Discord about whether it's ok to drop the generation. I've put some text in the comment about that, but to summarize the part that's specific to the previous implementation:

If you work through some simplifications on

bevy/crates/bevy_utils/src/lib.rs

Lines 286 to 292 in 48d10e6

    
           fn write_u64(&mut self, i: u64) { 
        
               // Apparently hashbrown's hashmap uses the upper 7 bits for some SIMD 
        
               // optimisation that uses those bits for binning. This hash function 
        
               // was faster than i | (i << (64 - 7)) in the worst cases, and was 
        
               // faster than PassHasher for all cases tested. 
        
               self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32); 
        
           }

you'll find out that it's the same as u64::from(index.wrapping_mul(MAGIC) | generation) << 32 | u64::from(index). Or, further removing unnecessary detail, it's u64::from(generation) << 32 | some_function_of(index).

So the generation is technically used, but since it's only in the high bits, what does it take for it to matter?

H1 looks at low bits, so the generation wouldn't matter until you get to 2³³ buckets. hashbrown aims for a load factor of 87.5% or less, so it reallocates to that after 2³² buckets are 87.5% filled, and thus you need to have over 2³² × 87.5% = 3,758,096,384 items in the hash table for the generation to affect H1, which seems mostly unlikely.
H2 looks at the high 7 bits, so the generation doesn't affect that until you get to generation 1 << (32 - 7) = 33,554,432.

To put those in perspective, that's either

a table spending 31.5 GiB of RAM on just the Entity keys (including the 1 byte of stored H2 metadata for each), or
a program that, assuming it bumps the generation on a particular index once per frame at 144 FPS, has been running for 2.7 days

Thus it's fine for this change to ignore generation, since the previous one effectively was too.

scottmcm · 2023-11-18T20:29:09Z

I need to re-run numbers for this after #10558

Looks like this'll need a new OP, since the new repr changes things. Closing this one; will open a new one with all new perf stuff and description.

@superdump

# Objective Keep essentially the same structure of `EntityHasher` from #9903, but rephrase the multiplication slightly to save an instruction. cc @superdump Discord thread: https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756 ## Solution Today, the hash is ```rust self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32); ``` with `i` being `(generation << 32) | index`. Expanding things out, we get ```rust i | ( (i * CONST) << 32 ) = (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32) = (generation << 32) | index | ((index * CONST) << 32) // because the generation overflowed = (index * CONST | generation) << 32 | index ``` What if we do the same thing, but with `+` instead of `|`? That's almost the same thing, except that it has carries, which are actually often better in a hash function anyway, since it doesn't saturate. (`|` can be dangerous, since once something becomes `-1` it'll stay that, and there's no mixing available.) ```rust (index * CONST + generation) << 32 + index = (CONST << 32 + 1) * index + generation << 32 = (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything = (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient = (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32 = (CONST << 32 + 1) * index +((CONST << 32 + 1) * (generation << 32) = (CONST << 32 + 1) * (index + generation << 32) = (CONST << 32 + 1) * (generation << 32 | index) = (CONST << 32 + 1) * i ``` So we can do essentially the same thing using a single multiplication instead of doing multiply-shift-or. LLVM was already smart enough to merge the shifting into a multiplication, but this saves the extra `or`: ![image](https://github.com/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace) <https://rust.godbolt.org/z/MEvbz4eo4> It's a very small change, and often will disappear in load latency anyway, but it's a couple percent faster in lookups: ![image](https://github.com/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75) (There was more of an improvement here before #10558, but with `to_bits` being a single `qword` load now, keeping things mostly as it is turned out to be better than the bigger changes I'd tried in #10605.) --- ## Changelog (Probably skip it) ## Migration Guide (none needed)

scottmcm added 2 commits November 16, 2023 18:21

Some simple quality tests for EntityHasher

222a342

Speed up EntityHasher slightly

5c48fd7

scottmcm force-pushed the faster-hasher branch from 5aa522d to 5c48fd7 Compare November 17, 2023 04:43

ItsDoot added C-Performance A change motivated by improving speed, memory usage or compile times A-Utils Utility functions and types labels Nov 17, 2023

IceSentry requested a review from superdump November 17, 2023 06:37

More comments on the approach

65d7b88

scottmcm force-pushed the faster-hasher branch from 2eb0f56 to 65d7b88 Compare November 18, 2023 08:18

scottmcm commented Nov 18, 2023

View reviewed changes

scottmcm marked this pull request as ready for review November 18, 2023 09:01

scottmcm marked this pull request as draft November 18, 2023 20:27

scottmcm closed this Nov 18, 2023

scottmcm deleted the faster-hasher branch November 18, 2023 22:51

scottmcm mentioned this pull request Nov 19, 2023

Save an instruction in EntityHasher #10648

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tweak `EntityHasher` for more speed #10605

Tweak `EntityHasher` for more speed #10605

scottmcm commented Nov 17, 2023 •

edited

Loading

scottmcm Nov 18, 2023

scottmcm Nov 18, 2023

scottmcm commented Nov 18, 2023

scottmcm commented Nov 18, 2023 •

edited

Loading

		@@ -285,6 +285,20 @@ impl Hasher for EntityHasher {

		#[inline]
		fn write_u64(&mut self, i: u64) {

Tweak EntityHasher for more speed #10605

Tweak EntityHasher for more speed #10605

Conversation

scottmcm commented Nov 17, 2023 • edited Loading

Objective

Solution

Skip generation entirely

Phrase the hash with less shifting

Macrobenchmarks

Changelog

Migration Guide

scottmcm Nov 18, 2023

Choose a reason for hiding this comment

scottmcm Nov 18, 2023

Choose a reason for hiding this comment

scottmcm commented Nov 18, 2023

scottmcm commented Nov 18, 2023 • edited Loading

Tweak `EntityHasher` for more speed #10605

Tweak `EntityHasher` for more speed #10605

scottmcm commented Nov 17, 2023 •

edited

Loading

scottmcm commented Nov 18, 2023 •

edited

Loading