slice::Iter::fold optimizes poorly for some niche optimized types. #106288
Comments
Looks like slice::Iter uses Iterator's default impl. Writing a custom one with a counted loop instead optimizes better. I'll make a PR and see what perf says.
Not on its own at least. LLVM is sensitive to details here.
Sounds like LLVM is at least partially to blame here, right? It might be worth trying to find some minimized IR that should optimize out the loop but doesn't, and opening an issue on the LLVM repo.
IR: https://rust.godbolt.org/z/59dPqob8E
Without runtime unrolling: https://rust.godbolt.org/z/hss67focY
fold_val() has a reassociation failure, with something like this:
fold_ptr() has a minor optimization failure, which should be fixed by #106294:
LLVM can derive the final value of the primary IV, but not of the result, which is
Edited: Cleaned up base IR for future reference: https://llvm.godbolt.org/z/nG8aMEnq3
So the issue is specific to the case where one writes a pointless fold where all previous iterations are disregarded?
@Sp00ph was this reduced from real code where using slice::back() wasn't obvious? If not then maybe it's too uncommon to be worth optimizing for. #106343 doesn't show much of an impact.
It's just a little synthetic test I wrote, nothing from real code. I was just experimenting with how much LLVM can optimize, and mainly opened the issue because of the unintuitive discrepancy between the different cases.
🤔 peeling the first loop iteration should solve this... but it turns out it's even simpler than that. Making the len == 0 case explicit and then turning the while into a do-while loop fixes it too.
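A minimal sketch of that loop shape, written with safe indexing rather than the raw-pointer code the real slice::Iter uses. `fold_counted` is a hypothetical name for illustration, not a std API, and this is not the code from the PR:

```rust
/// Hypothetical counted fold over a slice (illustration only).
/// The empty case is handled up front, so the loop body always runs
/// at least once and the exit test sits at the bottom (do-while shape).
pub fn fold_counted<'a, T, B, F>(s: &'a [T], init: B, mut f: F) -> B
where
    F: FnMut(B, &'a T) -> B,
{
    let len = s.len();
    // Explicit len == 0 case.
    if len == 0 {
        return init;
    }
    let mut acc = init;
    let mut i = 0;
    loop {
        acc = f(acc, &s[i]);
        i += 1;
        // Exit condition checked after the body, like `do { .. } while (i != len)`.
        if i == len {
            break;
        }
    }
    acc
}
```

For example, `fold_counted(&v, None, |_, x| Some(x))` returns a reference to the last element of `v`, or `None` if `v` is empty, which is the "pointless fold" pattern discussed above.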
I think this is the same root issue as #76746 (comment)
Basically, the point of
For example, if you write https://rust.godbolt.org/z/jrbzPExWG

    pub fn fold_ref(s: &[i32]) -> Option<&i32> {
        let mut r = None;
        s.iter().for_each(|i| r = Some(i));
        r
    }

then it optimizes down well already:

    example::fold_ref:
        test rsi, rsi
        lea rax, [rdi + 4*rsi - 4]
        cmove rax, rsi
        ret
optimize slice::Iter::fold
Fixes 2 of 4 cases from rust-lang#106288
```
OLD: test slice::fold_to_last ... bench: 248 ns/iter (+/- 3)
NEW: test slice::fold_to_last ... bench: 0 ns/iter (+/- 0)
```
Fixed on nightly by #106343
I tried this code:
(Never mind the fact that these could obviously just use slice::back.)
(godbolt link: https://rust.godbolt.org/z/6fjzo4faW )
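The snippet itself is not reproduced in this copy of the thread. Going only by the function names used in the discussion (fold_val, fold_ptr, fold_nonnull, fold_ref) and the fold_ref signature quoted in a later comment, the four functions may have looked roughly like the following. The bodies and the `i32` element type are assumptions, not the original code:

```rust
use std::ptr::NonNull;

// Assumed reconstruction: each function folds over the slice and keeps
// only the most recent element, discarding the accumulator every time.

pub fn fold_val(s: &[i32]) -> Option<i32> {
    s.iter().fold(None, |_, i| Some(*i))
}

pub fn fold_ptr(s: &[i32]) -> Option<*const i32> {
    s.iter().fold(None, |_, i| Some(i as *const i32))
}

// Option<NonNull<T>> uses the null pointer as its niche for None.
pub fn fold_nonnull(s: &[i32]) -> Option<NonNull<i32>> {
    s.iter().fold(None, |_, i| Some(NonNull::from(i)))
}

// Option<&T> likewise uses the null niche.
pub fn fold_ref(s: &[i32]) -> Option<&i32> {
    s.iter().fold(None, |_, i| Some(i))
}
```

All four return the last element (by value, raw pointer, NonNull, or reference) or None for an empty slice, so semantically only the final iteration matters.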
I expected that all of these functions produce more or less similar assembly, as all of them just need to peel the last loop iteration to be able to optimize away the whole loop body. Indeed, the first two functions optimize just fine:
The fold_{nonnull,ref} functions however don't optimize away the loop:
I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the differences between *const T and NonNull<T>. It doesn't seem to be happening with all niche optimized types though, as functions like these do optimize away the loop:
This is using nightly rustc on godbolt, which currently is: