-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweak EntityHasher
for more speed
#10605
Conversation
5aa522d
to
5c48fd7
Compare
2eb0f56
to
65d7b88
Compare
@@ -285,6 +285,20 @@ impl Hasher for EntityHasher { | |||
|
|||
#[inline] | |||
fn write_u64(&mut self, i: u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Demonstration of the overall assembly diff with this change: https://rust.godbolt.org/z/9zxfMdEfv
// of a conflict in the index despite a good hash function. | ||
// | ||
// This masking actually ends up with negative cost after optimization, | ||
// since it saves needing to do the shift-and-or between the fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Demonstration of this by comparing the codegen without the masking to having the masking: https://rust.godbolt.org/z/3vvsT38ar
There was a good conversation on Discord about whether it's ok to drop the generation. I've put some text in the comment about that, but to summarize the part that's specific to the previous implementation: If you work through some simplifications on bevy/crates/bevy_utils/src/lib.rs Lines 286 to 292 in 48d10e6
you'll find out that it's the same as So the generation is technically used, but since it's only in the high bits, what does it take for it to matter?
To put those in perspective, that's either
Thus it's fine for this change to ignore generation, since the previous one effectively was too. |
I need to re-run numbers for this after #10558 Looks like this'll need a new OP, since the new repr changes things. Closing this one; will open a new one with all new perf stuff and description. |
# Objective Keep essentially the same structure of `EntityHasher` from #9903, but rephrase the multiplication slightly to save an instruction. cc @superdump Discord thread: https://discord.com/channels/691052431525675048/1172033156845674507/1174969772522356756 ## Solution Today, the hash is ```rust self.hash = i | (i.wrapping_mul(FRAC_U64MAX_PI) << 32); ``` with `i` being `(generation << 32) | index`. Expanding things out, we get ```rust i | ( (i * CONST) << 32 ) = (generation << 32) | index | ((((generation << 32) | index) * CONST) << 32) = (generation << 32) | index | ((index * CONST) << 32) // because the generation overflowed = (index * CONST | generation) << 32 | index ``` What if we do the same thing, but with `+` instead of `|`? That's almost the same thing, except that it has carries, which are actually often better in a hash function anyway, since it doesn't saturate. (`|` can be dangerous, since once something becomes `-1` it'll stay that, and there's no mixing available.) ```rust (index * CONST + generation) << 32 + index = (CONST << 32 + 1) * index + generation << 32 = (CONST << 32 + 1) * index + (WHATEVER << 32 + generation) << 32 // because the extra overflows and thus can be anything = (CONST << 32 + 1) * index + ((CONST * generation) << 32 + generation) << 32 // pick "whatever" to be something convenient = (CONST << 32 + 1) * index + ((CONST << 32 + 1) * generation) << 32 = (CONST << 32 + 1) * index +((CONST << 32 + 1) * (generation << 32) = (CONST << 32 + 1) * (index + generation << 32) = (CONST << 32 + 1) * (generation << 32 | index) = (CONST << 32 + 1) * i ``` So we can do essentially the same thing using a single multiplication instead of doing multiply-shift-or. LLVM was already smart enough to merge the shifting into a multiplication, but this saves the extra `or`: ![image](https://github.com/bevyengine/bevy/assets/18526288/d9396614-2326-4730-abbe-4908c01b5ace) <https://rust.godbolt.org/z/MEvbz4eo4> It's a very small change, and often will disappear in load latency anyway, but it's a couple percent faster in lookups: ![image](https://github.com/bevyengine/bevy/assets/18526288/c365ec85-6adc-4f6d-8fa6-a65146f55a75) (There was more of an improvement here before #10558, but with `to_bits` being a single `qword` load now, keeping things mostly as it is turned out to be better than the bigger changes I'd tried in #10605.) --- ## Changelog (Probably skip it) ## Migration Guide (none needed)
Objective
#9903 did a great job on
EntityHasher
, but seeing it in the 0.12 release notes nerd-sniped me and I found some μoptimizations to it.(This was also the source of #10519.)
See also some Discord conversation in https://discord.com/channels/691052431525675048/1172033156845674507/1174887022914187264.
Solution
EntityHasher
does two things today that are great and I keep here:(For H1 & H2, see https://youtu.be/ncHmEUmJZf4?t=1540)
But there's two things that can tweak:
Skip generation entirely
Today, the hasher bit-ors the generation into the high 32 bits of the hash. Unfortunately, H2 only cares about the high 7 bits of the result, so unless you get to generation 33,554,432 that doesn't actually do anything.
As a good demonstration of that, note that the benchmark that just looks up extant entities with the (probably-)wrong generation is actually faster with this change, despite being the one where including the generation in the hash should matter the most:
(Note how the miss on generation is ≈25% slower than the miss on index, since the hash has a high probability of noticing the miss from the index before getting to
==
on the entities.)And render-world clears things every frame, so if I'm understanding correctly it'll almost never have generation conflicts anyway.
By skipping the generation, the optimized codegen for the hash doesn't even bother to
load
it from the&Entity
.Phrase the hash with less shifting
Today, the hash does a 64-bit multiplication of the input with a 64-bit constant, but then it shifts it left by 32 bits, which means it actually threw away most of the result. That's not actually a problem, really, since LLVM is clever and figures out what's going on, but had me puzzled for a while.
That optimization inspired the approach here: rather than shifting and masking, just put that in the multiplication constant directly. And thanks to skipping the generation, it's a 32-bit input. So on both 64-bit and 32-bit, it only needs a single multiplication. (LLVM sees the particular form of the constant and lowers it in a smart way using a slightly different constant in the multiplication: https://rust.godbolt.org/z/nGWzhfcfe)
This leaves the hash, once some intermediate parts optimize away, as just
That's It
And that's about the simplest possible thing that we could do while still trying to avoid H2 collisions.
This produces identical results (Proof: https://alive2.llvm.org/ce/z/orQoq9) as multiply-shift-or, just LLVM seemingly doesn't want to do it by itself (https://rust.godbolt.org/z/58avPcG1o).
Macrobenchmarks
TBH there's not much to see here. Some slightly better, some slightly worse, but I don't think any are particularly interesting -- I don't think I have consistent enough frame timings to see differences of a couple of instructions in hashing, especially when those instructions are fast bitops.
Yellow (this trace) is the new code and red (external trace) is the baseline.
cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d --ordered-z
cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode mesh2d
cargo run --example many_cubes --release --features bevy/trace_tracy -- --benchmark
cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite --ordered-z
cargo run --example bevymark --release --features bevy/trace_tracy -- --benchmark --waves 160 --per-wave 1000 --mode sprite
cargo run --example many_buttons --release --features bevy/trace_tracy
Changelog
Tweaked
EntityHasher
for a bit more speed in usual cases.Migration Guide
(No breaking changes.)