Optimize impl_read_unsigned_leb128
#92604
Conversation
impl_read_unsigned_leb128
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit ac6e14fe68386ccb4988623104c2f36f2fdcdbb9 with merge 2817d16b0dd82139c66514d1ed9609a37366c53d...
Force-pushed from ac6e14f to 16869c2
Force-pushed from 16869c2 to a18ddee
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit a18ddee2ead17a25fd616314fd0e30524b5fd04e with merge 7775b1f299a2e55b29c22ee1c04e4b31a08e34f8...
☀️ Try build successful - checks-actions
Queued 7775b1f299a2e55b29c22ee1c04e4b31a08e34f8 with parent f1ce0e6, future comparison URL.
Finished benchmarking commit (7775b1f299a2e55b29c22ee1c04e4b31a08e34f8): comparison url. Summary: This change led to moderate relevant mixed results 🤷 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
// The first iteration of the loop is unpeeled because this code is
// hot and integers less than 128 are very common, occurring 80%+
// of the time.
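For readers skimming the thread, here is a minimal sketch of the unpeeled pattern that comment describes. The function name and the slice-plus-position interface are illustrative assumptions, not rustc's exact code:

```rust
// Sketch of reading an unsigned LEB128 value with the first loop
// iteration unpeeled: values below 128 fit in a single byte (the
// common case), so they are handled without entering the loop.
fn read_unsigned_leb128(data: &[u8], position: &mut usize) -> u64 {
    let byte = data[*position];
    *position += 1;
    if byte & 0x80 == 0 {
        // One-byte fast path: no shifting, no loop setup.
        return byte as u64;
    }
    let mut result = (byte & 0x7f) as u64;
    let mut shift = 7;
    loop {
        let byte = data[*position];
        *position += 1;
        if byte & 0x80 == 0 {
            result |= (byte as u64) << shift;
            return result;
        }
        result |= ((byte & 0x7f) as u64) << shift;
        shift += 7;
    }
}

fn main() {
    let buf = [0x7fu8]; // 127 encodes in one byte
    let mut pos = 0;
    assert_eq!(read_unsigned_leb128(&buf, &mut pos), 127);

    let buf = [0x80u8, 0x01]; // 128 needs two bytes
    let mut pos = 0;
    assert_eq!(read_unsigned_leb128(&buf, &mut pos), 128);
}
```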
Is that true for all types? IIRC, that's very common for `usize`, but much less so for `u64`, for example. Maybe it would make sense to have two implementations, one optimized for small values and another optimized for larger values?
Indeed, it could be a good idea to generalize the impl for various "categories" of numbers (often small, often large); a similar approach helped recently in hashing: #92103.
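To make the "two categories" idea concrete, here is a hedged sketch of what the counterpart decoder might look like: a plain tight loop with no unpeeled first byte, the shape one would expect to prefer when values are usually large. The function name and interface are invented for illustration:

```rust
// Hypothetical decoder biased toward usually-large values: no
// single-byte fast path, just a uniform loop over all bytes.
fn read_leb128_large_biased(data: &[u8], position: &mut usize) -> u64 {
    let mut result: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = data[*position];
        *position += 1;
        result |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

fn main() {
    // u64::MAX takes the full 10 bytes: nine 0xff continuation
    // bytes followed by 0x01.
    let buf = [0xffu8, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01];
    let mut pos = 0;
    assert_eq!(read_leb128_large_biased(&buf, &mut pos), u64::MAX);
}
```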
Good point! I did some follow-up investigation, and saw that this accounts for the regressions seen in some cases. For example, the ctfe-stress-4 check incr-unchanged regression seen on the CI run has this distribution of leb128 lengths for read `u64` values:
2361790 counts:
( 1) 1311716 (55.5%, 55.5%): u64 1
( 2) 1049356 (44.4%,100.0%): u64 10
( 3) 574 ( 0.0%,100.0%): u64 9
( 4) 115 ( 0.0%,100.0%): u64 2
( 5) 20 ( 0.0%,100.0%): u64 4
( 6) 5 ( 0.0%,100.0%): u64 3
( 7) 3 ( 0.0%,100.0%): u64 8
( 8) 1 ( 0.0%,100.0%): u64 7
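Those two buckets are what the encoding itself predicts: values below 128 fit in one byte, while `u64` values with their high bits set (e.g. `-1i64` stored as a `u64`) need ceil(64/7) = 10 bytes. A small sketch of the length calculation:

```rust
// Number of bytes needed to LEB128-encode a u64: each byte carries
// 7 value bits, so the length is ceil(significant_bits / 7),
// clamped to at least 1 for the value 0.
fn leb128_len(value: u64) -> usize {
    let bits = 64 - value.leading_zeros() as usize;
    std::cmp::max(1, (bits + 6) / 7)
}

fn main() {
    assert_eq!(leb128_len(0), 1);
    assert_eq!(leb128_len(127), 1);       // largest 1-byte value
    assert_eq!(leb128_len(128), 2);
    assert_eq!(leb128_len(u64::MAX), 10); // e.g. -1i64 as u64
}
```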
FWIW, I looked at reading/writing LEB128 recently. I tried two things:
etc. It helped for some benchmarks, but in the end it was mostly a regression, probably because the code was much larger. But I think that if a good compromise on the number of branches were found, it could be an improvement, because right now the loop is quite tight and the repeated increment can slow it down.
It's a small but clear performance win.
I have split out the second commit ("Modify the buffer position directly when reading leb128 values") into its own PR (#92631) because it's a straightforward win for all types. Once that is merged I'll return here and work through the more subtle changes that are type-dependent. Thanks for the feedback.
Force-pushed from a18ddee to facba24
I did some more investigation.
So I now have two commits in this PR, and if the CI run works well I think we should land them both here, and drop #92631. Sorry for any confusion!
@bors try @rust-timer queue
📌 Commit facba24 has been approved by
https://crates.io/crates/varint-simd |
It seems like getting any benefit from that crate requires compiling rustc with at least the ssse3 target feature. Rustc is currently compiled to be suitable for all x86_64 CPUs.
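As a rough illustration of that constraint: since rustc targets baseline x86_64, any SIMD path would have to be selected at runtime rather than at compile time. This sketch shows only the feature check; the dispatch described in the comments is hypothetical and is not varint-simd's actual API:

```rust
// Minimal sketch: runtime SSSE3 detection on x86_64, which any
// varint-simd-style fast path inside a baseline-x86_64 rustc would
// need to perform before it could be taken. (Illustrative only.)
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // Here one could dispatch to a SIMD LEB128 decoder.
            println!("SSSE3 available: SIMD fast path possible");
        } else {
            // Otherwise, fall back to the scalar loop.
            println!("SSSE3 unavailable: scalar fallback required");
        }
    }
}
```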
⌛ Testing commit facba24 with merge e7fc933df9c0bfd4e10cc827cdcf6bb44afb4769... |
💥 Test timed out |
@bors retry Test timed out
⌛ Testing commit facba24 with merge 1078f1d405bf2eacdfc6efb40f0ef7aa797cbabf... |
💔 Test failed - checks-actions |
@bors retry Some apple builder is having network issues. "unable to access 'https://github.com/rust-lang-ci/rust/': Could not resolve host: github.com"
☀️ Test successful - checks-actions |
Finished benchmarking commit (38c22af): comparison url. Summary: This change led to large relevant improvements 🎉 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
@rustbot label: -perf-regression
I see instruction count improvements of up to 3.5% locally with these changes, mostly on the smaller benchmarks.
r? @michaelwoerister