Iterator Default Implementation for position() is slow #119551
That said, they should be equally fast for slices. Going through the generic default, the call-tree is deeper, so this might affect optimizations. Have you tried compiling with 1CGU or LTO to see if that makes a difference? Benchmarks are fickle.
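For reference, the 1CGU and LTO settings mentioned above are enabled through the release profile in Cargo.toml (this assumes a Cargo project; both keys are standard Cargo profile options):

```toml
[profile.release]
codegen-units = 1   # 1CGU: optimize the whole crate as a single unit
lto = "fat"         # full link-time optimization across crates
```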
The godbolt example does show that the try_fold version results in a few more instructions. The LLVM IR looks like it's trying to do some branchless version of the accumulator increments, while in the other version it is an unconditional increment sitting in its own block. Maybe ControlFlow is not well-suited for carrying the value and the loop state at the same time? CC @scottmcm
I did some quick measurements for a chain of vecs:
So the current impl does indeed work better for chains.
Yes, the difference is the ControlFlow abstraction.
Yes:
No real differences though; the results were with these settings.
This made me retest an alternative I used previously (reducing the ControlFlow type from ControlFlow<usize, usize> to ControlFlow<usize, ()>):
And this one actually beats the current one for the chain test:
The chain_alt is even faster; it's an attempt to keep the current style, updated to keep the counter outside the fold:
It isn't pretty, but it works 😅 I am interested in testing these more thoroughly; is there a nice resource with some real-world data to test them on?
Edit: Went digging in recent projects, and I had a solution for Advent of Code Day 6 using .rposition() and .position(), with the relevant part being:
And changing to the alt version (with check(predicate, acc)) improved runtime from 8.8 ms to 6.2 ms, a ~40% speedup! (I know there is an algebraic solution to the problem, but still.)
You could make a PR and we can do a perf run, though the perf suite doesn't have dedicated benchmarks that stress position().
Rewrite Iterator::position default impl
Storing the accumulating value outside the fold in an attempt to improve code generation has shown speedups on various handwritten benchmarks; see the discussion at rust-lang#119551.
@the8472 Yes, we learned in #76746 (comment) that if you don't need move access to the accumulator, it can be better to keep it outside the fold. (Maybe this'll be better once LLVM does better with 2-variant enums and stops losing information about them, but for now we're stuck with this.)
The compiler struggles to optimize the current default implementation of position() on iterators.
Simplifying the implementation has increased efficiency in various scenarios on my machine:
Old position():
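The code block wasn't preserved in this copy of the thread; the old default in core looked roughly like the following (a standalone sketch, with core-internal attributes such as #[rustc_inherit_overflow_checks] omitted so it compiles as a free function):

```rust
use std::ops::ControlFlow;

// Sketch of the old default: the index travels inside the try_fold
// accumulator, so every ControlFlow::Continue carries the updated count.
fn position_old<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
where
    I: Iterator,
    P: FnMut(I::Item) -> bool,
{
    let flow = iter.try_fold(0usize, |i, x| {
        if predicate(x) {
            ControlFlow::Break(i) // found: the accumulator is the index
        } else {
            ControlFlow::Continue(i + 1) // not found: thread i + 1 onward
        }
    });
    match flow {
        ControlFlow::Break(i) => Some(i),
        ControlFlow::Continue(_) => None,
    }
}

fn main() {
    let v = [10, 20, 30, 40];
    assert_eq!(position_old(&mut v.iter(), |&x| x == 30), Some(2));
    assert_eq!(position_old(&mut v.iter(), |&x| x == 5), None);
    println!("ok");
}
```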
Simplified version (there are several alternative implementations producing similar results):
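The simplified version keeps the counter in a local outside the fold, so Continue carries only (); again, a standalone sketch of the idea rather than the exact patch:

```rust
use std::ops::ControlFlow;

// Sketch of the rewrite: the counter lives in a local captured by the
// closure; ControlFlow::Continue carries no payload at all.
fn position_new<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
where
    I: Iterator,
    P: FnMut(I::Item) -> bool,
{
    let mut acc = 0usize;
    let flow = iter.try_fold((), |(), x| {
        if predicate(x) {
            ControlFlow::Break(acc) // found: report the external counter
        } else {
            acc += 1; // a plain increment, no value threaded through Continue
            ControlFlow::Continue(())
        }
    });
    match flow {
        ControlFlow::Break(i) => Some(i),
        ControlFlow::Continue(()) => None,
    }
}

fn main() {
    let v = [10, 20, 30, 40];
    assert_eq!(position_new(&mut v.iter(), |&x| x == 30), Some(2));
    assert_eq!(position_new(&mut v.iter(), |&x| x == 5), None);
    println!("ok");
}
```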
Bench results:
bench.rs
Benchmarks are named after the iterator and were run using Divan.
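The bench.rs attachment isn't reproduced here; as a rough stand-in that needs no external crates, a plain std::time harness for the range case might look like this (illustrative only, not the author's Divan setup):

```rust
use std::time::Instant;

fn main() {
    // Search near the end of a large range so the whole range is walked.
    let start = Instant::now();
    let pos = (0u64..10_000_000).position(|i| i == 9_999_999);
    let elapsed = start.elapsed();
    assert_eq!(pos, Some(9_999_999));
    println!("found at {:?} in {:?}", pos, elapsed);
}
```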
Ranges benefit massively (big ones by multiple orders of magnitude), while for other types, speedups are usually between 0% and 100%.
A specialized implementation for ranges that calculates the position directly would of course be even faster, but I don't think position() is called that often on ranges.
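For the equality-search case, such a range specialization would be pure arithmetic; a hypothetical sketch (range_position is not a real std API):

```rust
use std::ops::Range;

// Hypothetical direct computation: in start..end, a contained value v
// sits at index v - start, with no iteration needed.
fn range_position(range: &Range<u64>, v: u64) -> Option<usize> {
    if range.contains(&v) {
        Some((v - range.start) as usize)
    } else {
        None
    }
}

fn main() {
    assert_eq!(range_position(&(10..20), 15), Some(5));
    assert_eq!(range_position(&(10..20), 25), None);
    // Agrees with the iterator version:
    assert_eq!((10u64..20).position(|i| i == 15), Some(5));
    println!("ok");
}
```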
The implementation is similar to the slice::iter::Iter one, which was introduced to reduce compile times (#72166), so perhaps this would help in that regard as well.
However, the compiler appears to occasionally forget how to optimize the function after editing unrelated code. This also happens with the current implementation, so I don't know how reproducible the speedups are (FWIW Ranges have never failed to improve, though).
I did not attempt to benchmark compilation speed, but according to godbolt the memory usage is down from 11.8 MB to 8.3 MB, which looks promising.
I was motivated by a thread on Reddit, where some context and speculation can be found.