slice::Iter::fold optimizes poorly for some niche optimized types. #106288

Closed
Sp00ph opened this issue Dec 30, 2022 · 9 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

Sp00ph (Member) commented Dec 30, 2022

I tried this code:

pub fn fold_val(s: &[i32]) -> Option<i32> {
    s.iter().fold(None, |_, i| Some(*i))
}

pub fn fold_ptr(s: &[i32]) -> Option<*const i32> {
    s.iter().fold(None, |_, i| Some(<*const i32>::from(i)))
}

pub fn fold_nonnull(s: &[i32]) -> Option<std::ptr::NonNull<i32>> {
    s.iter().fold(None, |_, i| Some(From::from(i)))
}

pub fn fold_ref(s: &[i32]) -> Option<&i32> {
    s.iter().fold(None, |_, i| Some(i))
}

(nevermind the fact that these could obviously just use slice::back)
(godbolt link: https://rust.godbolt.org/z/6fjzo4faW )

I expected all of these functions to produce more or less similar assembly, as each of them only needs the last loop iteration peeled to be able to optimize away the whole loop body. Indeed, the first two functions optimize just fine:

example::fold_val:
        test    rsi, rsi
        je      .LBB0_1
        mov     edx, dword ptr [rdi + 4*rsi - 4]
        mov     eax, 1
        ret
.LBB0_1:
        xor     eax, eax
        ret

example::fold_ptr:
        xor     eax, eax
        test    rsi, rsi
        setne   al
        lea     rdx, [rdi + 4*rsi]
        add     rdx, -4
        ret

The fold_{nonnull,ref} functions, however, don't optimize away the loop:

example::fold_nonnull:
        movabs  r8, 4611686018427387903
        and     r8, rsi
        lea     ecx, [rsi + 1]
        and     rcx, 7
        je      .LBB2_1
        xor     r9d, r9d
        mov     rdx, rdi
.LBB2_3:
        mov     rax, r9
        mov     r9, rdx
        add     rdx, 4
        dec     rcx
        jne     .LBB2_3
        cmp     r8, 7
        jae     .LBB2_5
        jmp     .LBB2_7
.LBB2_1:
        mov     rdx, rdi
        cmp     r8, 7
        jb      .LBB2_7
.LBB2_5:
        lea     rcx, [rdi + 4*rsi]
        add     rdx, -8
.LBB2_6:
        lea     rax, [rdx + 32]
        add     rdx, 36
        cmp     rdx, rcx
        mov     rdx, rax
        jne     .LBB2_6
.LBB2_7:
        ret

example::fold_ref:
        movabs  r8, 4611686018427387903
        and     r8, rsi
        lea     ecx, [rsi + 1]
        and     rcx, 7
        je      .LBB3_1
        xor     r9d, r9d
        mov     rdx, rdi
.LBB3_3:
        mov     rax, r9
        mov     r9, rdx
        add     rdx, 4
        dec     rcx
        jne     .LBB3_3
        cmp     r8, 7
        jae     .LBB3_5
        jmp     .LBB3_7
.LBB3_1:
        mov     rdx, rdi
        cmp     r8, 7
        jb      .LBB3_7
.LBB3_5:
        lea     rcx, [rdi + 4*rsi]
        add     rdx, -8
.LBB3_6:
        lea     rax, [rdx + 32]
        add     rdx, 36
        cmp     rdx, rcx
        mov     rdx, rax
        jne     .LBB3_6
.LBB3_7:
        ret

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the difference between *const T and NonNull<T>. It doesn't seem to happen with all niche-optimized types though, as functions like these do optimize away the loop:

pub fn fold_bool(s: &[bool]) -> Option<bool> {
    s.iter().fold(None, |_, i| Some(*i))
}

pub fn fold_nz(s: &[NonZeroUsize]) -> Option<NonZeroUsize> {
    s.iter().fold(None, |_, i| Some(*i))
}

This is using nightly rustc on godbolt, which currently is:

rustc 1.68.0-nightly (ad8ae0504 2022-12-29)
binary: rustc
commit-hash: ad8ae0504c54bc2bd8306abfcfe8546c1bb16a49
commit-date: 2022-12-29
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6
@nikic added the I-slow and A-LLVM labels Dec 30, 2022
the8472 (Member) commented Dec 31, 2022

Looks like slice::Iter uses Iterator's default fold impl. Writing a custom one with a counted loop instead optimizes better. I'll make a PR and see what perf says.

https://rust.godbolt.org/z/sneT8853n
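
For reference, a minimal sketch of what such a counted-loop fold could look like (an illustrative assumption, not the actual change that later landed in the standard library; the name fold_counted is hypothetical):

```rust
// Hypothetical sketch: drive the fold with a counted loop over indices rather
// than by repeatedly calling next() until the start pointer meets the end pointer.
fn fold_counted<T, B, F>(slice: &[T], init: B, mut f: F) -> B
where
    F: FnMut(B, &T) -> B,
{
    let mut acc = init;
    // The trip count is the slice length, so LLVM sees a simple counted loop.
    for idx in 0..slice.len() {
        acc = f(acc, &slice[idx]);
    }
    acc
}
```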

the8472 (Member) commented Jan 1, 2023

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the differences between *const T and NonNull.

Not on its own at least. LLVM is sensitive to details here.
https://rust.godbolt.org/z/xx6MfKnK7

fold_nonnull_ptr_std is the current impl with the bad assembly

fold_nonnull_ptr_ne is that but with the niche-handling removed (by not going through intermediate Options from next()). Same bad results.

fold_nonnull_ptr_le is the same except != was replaced with < in the loop condition. It results in less assembly but it still loops unnecessarily, instead of optimizing all but the last iteration away. This is weird because afaik != was chosen intentionally in the past for being easier on the optimizer. Is that no longer true?

fold_nonnull_idx finally switches from direct pointer increments to index increments plus index-based offsets. This eliminates the loop. Also weird, because it means LLVM somehow can't derive the last pointer value?
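
Roughly, the index-based variant has this shape (a reconstruction of the godbolt example under stated assumptions, not the exact code from the link):

```rust
use std::ptr::NonNull;

// Approximate shape of fold_nonnull_idx: advance an integer index and rebuild
// the pointer from the slice base each iteration, instead of carrying an
// incremented pointer across iterations.
pub fn fold_nonnull_idx(s: &[i32]) -> Option<NonNull<i32>> {
    let mut acc = None;
    for idx in 0..s.len() {
        // Index-based offset from the base; bounds-checked by indexing.
        acc = Some(NonNull::from(&s[idx]));
    }
    acc
}
```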

Sp00ph (Member, Author) commented Jan 1, 2023

Sounds like LLVM is at least partially to blame here, right? Might be worth trying to find some minimized IR that should optimize out the loop but doesn't, and opening an issue on the LLVM repo.

nikic (Contributor) commented Jan 1, 2023

I'm assuming this somehow has to do with NonNull and &T having the null niche value, as I don't see any other reason for the differences between *const T and NonNull.

Not on its own at least. LLVM is sensitive to details here. https://rust.godbolt.org/z/xx6MfKnK7

IR: https://rust.godbolt.org/z/59dPqob8E
Without runtime unrolling: https://rust.godbolt.org/z/hss67focY

fold_val() has a reassociation failure, with something like this:

  %0 = getelementptr inbounds i32, ptr %s.0, i64 %s.1
  %1 = ptrtoint ptr %0 to i64
  %2 = ptrtoint ptr %s.0 to i64
  %3 = sub nuw nsw i64 -4, %2
  %4 = add i64 %3, %1

%1 - %2 is 4 * %s.1, so %4 ought to simplify to 4 * %s.1 - 4, but this does not fold due to missing or undesirable reassociation.

fold_ptr() has a minor optimization failure, which should be fixed by #106294:

%accum.sroa.4.0.lcssa.i = select i1 %_10.i.peel.i, ptr undef, ptr %uglygep

fold_nonnull_ptr_std is the current impl with the bad assembly

fold_nonnull_ptr_ne is that but with the niche-handling removed (by not going through intermediate Options from next()). Same bad results.

fold_nonnull_ptr_le is the same except != was replaced with < in the loop condition. It results in less assembly but it still loops unnecessarily, instead of optimizing all but the last iteration away. This is weird because afaik != was chosen intentionally in the past for being easier on the optimizer. Is that no longer true?

!= is better. What you're seeing here is the loop being unrolled because the trip count is known and sufficiently simple. Compare: https://llvm.godbolt.org/z/YTcznesv9

fold_nonnull_idx finally switches from direct pointer increments to index increments plus index-based offsets. This eliminates the loop. Also weird, because it means LLVM somehow can't derive the last pointer value?

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.
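
In Rust terms, the closed form LLVM would have to synthesize for the accumulator looks something like this (a hypothetical illustration with a made-up function name, not anything the optimizer emits today):

```rust
// Closed-form equivalent of the fold_ref accumulator: None for an empty slice,
// otherwise a reference to the last element (the value the induction variable
// held going into the final iteration).
pub fn fold_ref_closed_form(s: &[i32]) -> Option<&i32> {
    if s.is_empty() {
        None
    } else {
        Some(&s[s.len() - 1])
    }
}
```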

Edited: Cleaned up base IR for future reference: https://llvm.godbolt.org/z/nG8aMEnq3

the8472 (Member) commented Jan 1, 2023

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.

So the issue is specific to the case where one writes a pointless fold in which all previous iterations are disregarded?

(nevermind the fact that these could obviously just use slice::back)

@Sp00ph was this reduced from real code where using slice::back() wasn't obvious? If not, then maybe it's too uncommon to be worth optimizing for. #106343 doesn't show much of an impact.

Sp00ph (Member, Author) commented Jan 1, 2023

It's just a little synthetic test I wrote, nothing from real code. I was just experimenting with how much LLVM can optimize, and mainly opened the issue because of the unintuitive discrepancy between the different cases.

the8472 (Member) commented Jan 2, 2023

LLVM can derive the final value of the primary IV, but not of the result, which is phi ptr [ null, %start ], [ %p.0.i, %bb1.i ], i.e. either null or the IV from the next-to-last iteration. This would need dedicated support.

🤔 Peeling the first loop iteration should solve this... but it turns out it's even simpler than that. Making the len == 0 case explicit and then turning the while loop into a do-while loop fixes it too.

https://rust.godbolt.org/z/xGfYfMz55
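
The do-while shape from that link is roughly the following (a sketch under stated assumptions, with a hypothetical name fold_do_while, not the exact code that ended up in the PR):

```rust
// Handle the empty case up front, then run a loop whose body executes before
// the counter is tested, i.e. a do-while: every pass through the loop performs
// at least one iteration.
fn fold_do_while<T, B, F>(slice: &[T], init: B, mut f: F) -> B
where
    F: FnMut(B, &T) -> B,
{
    let mut acc = init;
    if slice.is_empty() {
        return acc;
    }
    let mut idx = 0;
    loop {
        acc = f(acc, &slice[idx]);
        idx += 1;
        // Test the condition after the body, matching a do-while structure.
        if idx == slice.len() {
            break;
        }
    }
    acc
}
```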

scottmcm (Member) commented Jan 2, 2023

I think this is the same root issue as #76746 (comment)

Basically, the point of fold is to get ownership access to the accumulator. If you don't need that, then you're better off using for_each with a mutable closure instead.

For example, if you write the following (https://rust.godbolt.org/z/jrbzPExWG):

pub fn fold_ref(s: &[i32]) -> Option<&i32> {
    let mut r = None;
    s.iter().for_each(|i| r = Some(i));
    r
}

then it already optimizes down well:

example::fold_ref:
        test    rsi, rsi
        lea     rax, [rdi + 4*rsi - 4]
        cmove   rax, rsi
        ret

@Noratrieb added the T-compiler label Apr 5, 2023
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 15, 2023
optimize slice::Iter::fold

Fixes 2 of 4 cases from rust-lang#106288

```
OLD: test slice::fold_to_last                                           ... bench:         248 ns/iter (+/- 3)
NEW: test slice::fold_to_last                                           ... bench:           0 ns/iter (+/- 0)
```
the8472 (Member) commented Jun 20, 2023

Fixed on nightly by #106343

@the8472 closed this as completed Jun 20, 2023