slice::Iter::fold optimizes poorly for some niche optimized types. #106288

Closed
Sp00ph opened this issue Dec 30, 2022 · 9 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

Sp00ph (Member) commented Dec 30, 2022

I tried this code:

pub fn fold_val(s: &[i32]) -> Option<i32> {
    s.iter().fold(None, |_, i| Some(*i))
}

pub fn fold_ptr(s: &[i32]) -> Option<*const i32> {
    s.iter().fold(None, |_, i| Some(<*const i32>::from(i)))
}

pub fn fold_nonnull(s: &[i32]) -> Option<std::ptr::NonNull<i32>> {
    s.iter().fold(None, |_, i| Some(From::from(i)))
}

pub fn fold_ref(s: &[i32]) -> Option<&i32> {
    s.iter().fold(None, |_, i| Some(i))
}

(nevermind the fact that these could obviously just use slice::back)
(godbolt link: https://rust.godbolt.org/z/6fjzo4faW )

I expected all of these functions to produce more or less similar assembly, as each of them only needs the last loop iteration peeled to be able to optimize away the whole loop body. Indeed, the first two functions optimize just fine:

example::fold_val:
        test    rsi, rsi
        je      .LBB0_1
        mov     edx, dword ptr [rdi + 4*rsi - 4]
        mov     eax, 1
        ret
.LBB0_1:
        xor     eax, eax
        ret

example::fold_ptr:
        xor     eax, eax
        test    rsi, rsi
        setne   al
        lea     rdx, [rdi + 4*rsi]
        add     rdx, -4
        ret

The fold_{nonnull,ref} functions, however, don't optimize away the loop:

example::fold_nonnull:
        movabs  r8, 4611686018427387903
        and     r8, rsi
        lea     ecx, [rsi + 1]
        and     rcx, 7
        je      .LBB2_1
        xor     r9d, r9d
        mov     rdx, rdi
.LBB2_3:
        mov     rax, r9
        mov     r9, rdx
        add     rdx, 4
        dec     rcx
        jne     .LBB2_3
        cmp     r8, 7
        jae     .LBB2_5
        jmp     .LBB2_7
.LBB2_1:
        mov     rdx, rdi
        cmp     r8, 7
        jb      .LBB2_7
.LBB2_5:
        lea     rcx, [rdi + 4*rsi]
        add     rdx, -8
.LBB2_6:
        lea     rax, [rdx + 32]
        add     rdx, 36
        cmp     rdx, rcx
        mov     rdx, rax
        jne     .LBB2_6
.LBB2_7:
        ret

example::fold_ref:
        movabs  r8, 4611686018427387903
        and     r8, rsi
        lea     ecx, [rsi + 1]
        and     rcx, 7
        je      .LBB3_1
        xor     r9d, r9d
        mov     rdx, rdi
.LBB3_3:
        mov     rax, r9
        mov     r9, rdx
        add     rdx, 4
        dec     rcx
        jne     .LBB3_3
        cmp     r8, 7
        jae     .LBB3_5
        jmp     .LBB3_7
.LBB3_1:
        mov     rdx, rdi
        cmp     r8, 7
        jb      .LBB3_7
.LBB3_5:
        lea     rcx, [rdi + 4*rsi]
        add     rdx, -8
.LBB3_6:
        lea     rax, [rdx + 32]
        add     rdx, 36
        cmp     rdx, rcx
        mov     rdx, rax
        jne     .LBB3_6
.LBB3_7:
        ret

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the difference between *const T and NonNull<T>. It doesn't seem to happen with all niche-optimized types though, as functions like these do optimize away the loop:

pub fn fold_bool(s: &[bool]) -> Option<bool> {
    s.iter().fold(None, |_, i| Some(*i))
}

pub fn fold_nz(s: &[NonZeroUsize]) -> Option<NonZeroUsize> {
    s.iter().fold(None, |_, i| Some(*i))
}

This is using nightly rustc on godbolt, which currently is:

rustc 1.68.0-nightly (ad8ae0504 2022-12-29)
binary: rustc
commit-hash: ad8ae0504c54bc2bd8306abfcfe8546c1bb16a49
commit-date: 2022-12-29
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6
@nikic added the I-slow and A-LLVM labels Dec 30, 2022
the8472 (Member) commented Dec 31, 2022

Looks like slice::Iter uses Iterator's default fold impl. Writing a custom one with a counted loop instead optimizes better. I'll make a PR and see what perf says.

https://rust.godbolt.org/z/sneT8853n
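
For reference, a minimal sketch of what such a counted-loop fold could look like (an illustrative assumption, not the actual change that later landed in the standard library; the name fold_counted is hypothetical):

```rust
// Hypothetical sketch: drive the fold with a counted loop over indices rather
// than by repeatedly calling next() until the start pointer meets the end pointer.
fn fold_counted<T, B, F>(slice: &[T], init: B, mut f: F) -> B
where
    F: FnMut(B, &T) -> B,
{
    let mut acc = init;
    // The trip count is the slice length, so LLVM sees a simple counted loop.
    for idx in 0..slice.len() {
        acc = f(acc, &slice[idx]);
    }
    acc
}
```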

the8472 (Member) commented Jan 1, 2023

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the differences between *const T and NonNull.

Not on its own at least. LLVM is sensitive to details here.
https://rust.godbolt.org/z/xx6MfKnK7

fold_nonnull_ptr_std is the current impl with the bad assembly

fold_nonnull_ptr_ne is that but with the niche-handling removed (by not going through intermediate Options from next()). Same bad results.

fold_nonnull_ptr_le is the same except != was replaced with < in the loop condition. It results in less assembly but it still loops unnecessarily, instead of optimizing all but the last iteration away. This is weird because afaik != was chosen intentionally in the past for being easier on the optimizer. Is that no longer true?

fold_nonnull_idx finally switches from direct pointer increments to index increments plus index-based offsets. This eliminates the loop. Also weird, because it means LLVM somehow can't derive the last pointer value?
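
Roughly, the index-based variant has this shape (a reconstruction of the godbolt example under stated assumptions, not the exact code from the link):

```rust
use std::ptr::NonNull;

// Approximate shape of fold_nonnull_idx: advance an integer index and rebuild
// the pointer from the slice base each iteration, instead of carrying an
// incremented pointer across iterations.
pub fn fold_nonnull_idx(s: &[i32]) -> Option<NonNull<i32>> {
    let mut acc = None;
    for idx in 0..s.len() {
        // Index-based offset from the base; bounds-checked by indexing.
        acc = Some(NonNull::from(&s[idx]));
    }
    acc
}
```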

Sp00ph (Member, Author) commented Jan 1, 2023

Sounds like LLVM is at least partially to blame here, right? Might be worth trying to find some minimized IR that should optimize out the loop but doesn't, and opening an issue on the LLVM repo.

nikic (Contributor) commented Jan 1, 2023

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the differences between *const T and NonNull.

Not on its own at least. LLVM is sensitive to details here. https://rust.godbolt.org/z/xx6MfKnK7

IR: https://rust.godbolt.org/z/59dPqob8E
Without runtime unrolling: https://rust.godbolt.org/z/hss67focY

fold_val() has a reassociation failure, with something like this:

  %0 = getelementptr inbounds i32, ptr %s.0, i64 %s.1
  %1 = ptrtoint ptr %0 to i64
  %2 = ptrtoint ptr %s.0 to i64
  %3 = sub nuw nsw i64 -4, %2
  %4 = add i64 %3, %1

%1 - %2 is 4 * %s.1, so %4 ought to simplify to 4 * %s.1 - 4, but this does not fold due to missing or undesirable reassociation.

fold_ptr() has a minor optimization failure, which should be fixed by #106294:

%accum.sroa.4.0.lcssa.i = select i1 %_10.i.peel.i, ptr undef, ptr %uglygep

fold_nonnull_ptr_std is the current impl with the bad assembly

fold_nonnull_ptr_ne is that but with the niche-handling removed (by not going through intermediate Options from next()). Same bad results.

fold_nonnull_ptr_le is the same except != was replaced with < in the loop condition. It results in less assembly but it still loops unnecessarily, instead of optimizing all but the last iteration away. This is weird because afaik != was chosen intentionally in the past for being easier on the optimizer. Is that no longer true?

!= is better. What you're seeing here is the loop being unrolled because the trip count is known and sufficiently simple. Compare: https://llvm.godbolt.org/z/YTcznesv9

fold_nonnull_idx finally switches from direct pointer increments to index increments plus index-based offsets. This eliminates the loop. Also weird, because it means LLVM somehow can't derive the last pointer value?

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.
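
In Rust terms, the closed form LLVM would have to synthesize for the accumulator looks something like this (a hypothetical illustration with a made-up function name, not anything the optimizer emits today):

```rust
// Closed-form equivalent of the fold_ref accumulator: None for an empty slice,
// otherwise a reference to the last element (the value the induction variable
// held going into the final iteration).
pub fn fold_ref_closed_form(s: &[i32]) -> Option<&i32> {
    if s.is_empty() {
        None
    } else {
        Some(&s[s.len() - 1])
    }
}
```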

Edited: Cleaned up base IR for future reference: https://llvm.godbolt.org/z/nG8aMEnq3

the8472 (Member) commented Jan 1, 2023

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.

So the issue is specific to the case where one writes a pointless fold in which all previous iterations are disregarded?

(nevermind the fact that these could obviously just use slice::back)

@Sp00ph was this reduced from real code where using slice::back() wasn't obvious? If not, then maybe it's too uncommon to be worth optimizing for. #106343 doesn't show much of an impact.

Sp00ph (Member, Author) commented Jan 1, 2023

It's just a little synthetic test I wrote, nothing from real code. I was just experimenting with how much LLVM can optimize, and mainly opened the issue because of the unintuitive discrepancy between the different cases.

the8472 (Member) commented Jan 2, 2023

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.

🤔 Peeling the first loop iteration should solve this... but it turns out it's even simpler than that. Making the len == 0 case explicit and then turning the while loop into a do-while loop fixes it too.

https://rust.godbolt.org/z/xGfYfMz55
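
The do-while shape from that link is roughly the following (a sketch under stated assumptions, with a hypothetical name fold_do_while, not the exact code that ended up in the PR):

```rust
// Handle the empty case up front, then run a loop whose body executes before
// the counter is tested, i.e. a do-while: every pass through the loop performs
// at least one iteration.
fn fold_do_while<T, B, F>(slice: &[T], init: B, mut f: F) -> B
where
    F: FnMut(B, &T) -> B,
{
    let mut acc = init;
    if slice.is_empty() {
        return acc;
    }
    let mut idx = 0;
    loop {
        acc = f(acc, &slice[idx]);
        idx += 1;
        // Test the condition after the body, matching a do-while structure.
        if idx == slice.len() {
            break;
        }
    }
    acc
}
```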

scottmcm (Member) commented Jan 2, 2023

I think this is the same root issue as #76746 (comment)

Basically, the point of fold is to get ownership access to the accumulator. If you don't need that, then you're better off using for_each with a mutable closure instead.

For example, if you write the following (https://rust.godbolt.org/z/jrbzPExWG):

pub fn fold_ref(s: &[i32]) -> Option<&i32> {
    let mut r = None;
    s.iter().for_each(|i| r = Some(i));
    r
}

then it already optimizes down well:

example::fold_ref:
        test    rsi, rsi
        lea     rax, [rdi + 4*rsi - 4]
        cmove   rax, rsi
        ret

@Noratrieb added the T-compiler label Apr 5, 2023
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 15, 2023
optimize slice::Iter::fold

Fixes 2 of 4 cases from rust-lang#106288

```
OLD: test slice::fold_to_last                                           ... bench:         248 ns/iter (+/- 3)
NEW: test slice::fold_to_last                                           ... bench:           0 ns/iter (+/- 0)
```
the8472 (Member) commented Jun 20, 2023

Fixed on nightly by #106343

@the8472 closed this as completed Jun 20, 2023