Add fold_mut alternative to Iterator trait #76746

mlodato517 · 2020-09-15T15:44:59Z

This PR
Adds a fold_mut alternative to fold for the Iterator trait
Closes #76725

Why?
Most of the background is in #76725 but the TL;DR is that it might be faster and it might be ergonomic in some cases.

Questions:

I've never contributed library code here before - once I added unstable the tests started failing. Do I need to use nightly rust? Do I need to specify the feature config somewhere? Do I need that specified for the tests and benchmarks?
Are those docs are sufficient? I was offloading everything to fold() which is a little lazy maybe
Do we want all those benchmarks? Is it bloating the library/causing unnecessary time increase in the benchmarks?
Should this be implemented for other structs that have custom fold implementations like VecDeque, etc.?
Should we track a separate issue to improve the performance of fold itself? I imagine that, in general, the runtimes should be the same

rust-highfive · 2020-09-15T15:45:05Z

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

workingjubilee · 2020-09-15T15:52:25Z

Needs to be updated to pass x.py check / x.py test tidy.

warning: build failed, waiting for other jobs to finish...
error[E0658]: use of unstable library feature 'iterator_fold_mut'
--> library/core/tests/iter.rs:3229:30
|
3229 | let result = nums.iter().fold_mut(Vec::new(), |v, i| {
| ^^^^^^^^
|
= note: see issue #76725 #76725 for more information
= help: add #![feature(iterator_fold_mut)] to the crate attributes to enable

Yes, Rust is built with the nightly compiler (actually, the full bootstrap process initiated by x.py starts with the beta, usually).
You may want to read https://rustc-dev-guide.rust-lang.org/implementing_new_features.html

mlodato517 · 2020-09-15T16:38:04Z

Sorry, I'm not 100% sure what to do here - I rustup default nightly'ed, added #![feature(iterator_fold_mut)] in library/core/tests/iter.rs, and ran ./x.py c -i library/core/ and I see

error[E0658]: use of unstable library feature 'iterator_fold_mut'
    --> library/core/tests/iter.rs:3230:30
     |
3230 |     let result = nums.iter().fold_mut(Vec::new(), |v, i| {
     |                              ^^^^^^^^
     |
     = note: see issue #76725 <https://github.com/rust-lang/rust/issues/76725> for more information
     = help: add `#![feature(iterator_fold_mut)]` to the crate attributes to enable

What's the right way to set this up so that feature is enabled?

Also, I'm seeing

error: an inner attribute is not permitted in this context
    --> library/core/tests/iter.rs:3226:1
     |
3226 | #![feature(iterator_fold_mut)]
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |
     = note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files. Outer attributes, like `#[test]`, annotate the item following them.

in the doctest. I copied that from fold_first and it doesn't seem to obviously complain about that one - do you know what I'm missing there?

I wasn't sure if I needed to add it to the list in rustc_feature/src/active.rs but I tried it (I hope with the correct syntax) and that didn't help either, nor did adding it to 🤔

Is there a good commit/PR for me to track the right way to do things? I was originally following #65222 but I don't see anything particularly sneaky.

leonardo-m · 2020-09-15T16:54:56Z

Currently the stdlib lacks some essential iterators (present in Itertools). Every function added to iterators increases cognitive load, and time to find the right function, and memory load to remember its purpose. I am not saying this shouldn't fit in stdlib, but please consider the costs.

cuviper · 2020-09-15T16:56:10Z

library/core/src/iter/traits/iterator.rs

+        let mut accum = init;
+        while let Some(x) = self.next() {
+            f(&mut accum, x);
+        }


I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

self.fold(init, |mut acc, x| { f(&mut acc, x); acc })

Even better if that isolates the closure generics like #62429, so it only depends on Self::Item rather than all of Self (plus B and F of course).

Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

I like this idea! I tried it here and the results seem to indicate that this brought back the performance "issue" (assuming the benchmark is built right, Criterion is doing it's thing correctly, and I'm interpreting the results correctly):

Should I push forward with that? I was originally making fold_mut for the performance but we could instead make it about the shape of the closure and just wait on improvements to fold if we think that's valuable

Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

Good call, I'll whip up a few of these when I get the chance!

Eureka! I added some benchmarks (with terrible names and with terrible duplication that should be replaced with a macro) in 2921101 and it appears that fold_mut is much slower when using chain and flat_map!

Unfortunately they're a little bit out of order - I will definitely pick better names if we decide to go through with this PR but basically for chain and flat_map, fold_mut is like 50%-100% slower.

For a non-chain-non-flat-map fold on a hashmap, they're about the same (except fold_mut has a huge variance so I'm not sure about that there).

For a non-chain-non-flat-map fold on a number, fold_mut is slower now (interesting that has now switched. I guess that means they're basically "the same within precision"?).

For a non-chain-non-flat-map fold on a Vec, fold_mut is faster by some huge margin that is suspicious.

I think maybe I'll port these benchmarks back to my other repo so I can use criterion for (maybe) more consistent data? Or should we just pull the plug on this? Or redirect the effort towards the ergonomics and not worry about performance?

it appears that fold_mut is much slower when using chain and flat_map!

That's expected because of the while let (essentially a for loop), as cuviper said/implied. If you just replace the (conceptual) for loop with a .for_each( I suspect you'd recover the performance:

let mut accum = init; - while let Some(x) = self.next() { - f(&mut accum, x); - } + self.for_each(|x| f(&mut accum, x));

(insert comment about #62429 here and how that shouldn't be the implementation in core, but it'd be fine for benchmarking)

Hmm that went a little over my head, let me see if I understand:

Reduce the genericity of closures in the iterator traits #62429 introduced some specialized iterator methods (fold among them) for certain iterators. This reduced the number of types the iterator closure was generic over

By implementing a separate fold_mut that doesn't use fold, fold_mut misses out on the specific fold implementations for those iterators

By switching it to a for_each, fold_mut may recover the performance because, I assume, there are similarly specialized for_each implementations for several iterators

Interested to know if any of those are in the right ballpark!

And for expedience (maybe), assuming those points are in the correct ballpark, may I ask:

how does the custom implementation, that reduces the generic types on the closure, improve performance? I can understand the comments about reducing the number of monomorphizations and then code size, but it's not immediately obvious how this plays into runtime performance? Or is it just smaller = faster here? Or is it that, with fewer trait bounds, the compiler is able to do some additional optimization?

how should this be reflected in this PR? My original hope was to have a fold that was "closer" to a "zero cost abstraction". It's seeming more and more like that isn't super possible (except with maybe the for_each construction above?). Should I bail on the "performance" of fold_mut and double down on the ergonomics of the closure by defining fold_mut in terms of fold and then fold_mut will maybe become more of a zero cost abstraction as fold is specialized on other iterators like Vec::IntoIter or something?

The concerns of #62429 are mostly separate from the concerns of fold specialization. The overlap is just that using fold or for_each often introduces a generic closure.

The point of #62429 is that closures inherit all of the generic type parameters of the surrounding scope. In your example that would be B, F, and Self -- but the closure doesn't really need the whole iterator type, just Self::Item. So you can end up generating the same closure for <B, F, vec::IntoIter<i32>>, <B, F, Cloned<hash_set::Iter<'_, i32>>>, <B, F, Map<Chain<Range<i32>, Once<i32>>, map-closure{...}>>, etc., when our closure could be just <B, F, Item = i32>. This is a compile-time concern for excessive monomorphizations, and can also make it exceed type-length limits. There has been some compiler work to trim unused parameters, but it still doesn't know how to reduce that Self to Self::Item. So the trick in #62429 is to use a "closure factory" function with tighter control over the generic parameters.

The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second. The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.

Ahh, thank you for that explanation!

The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second.

Awesome, this makes a ton of sense

The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.

Woah, did not see that coming! So I originally read that thinking, "Great, I'll rewrite fold_mut using for_each as mentioned above and I'll get the benefits of fold improvements plus maybe Rust can tell in for_each that moving and reassigning () is a no-op and we'll keep the performance improvements on 'simple' iterators".

I'm running out of benchmark steam (is there a better way to do this than running ./x.py bench -i library/core/ --test-args mark_bench_fold? - it takes about a half hour on my machine) but gave this a shot - I defined three methods

fold_mut which uses while let

fold_mut_fold which uses fold under the hood

fold_mut_each which uses for_each under the hood

And ran 8 benchmarks. 4 were operating on (0..100000).map(black_box) (named _simple) and 4 were operating on (0i64..1000000).chain(0..1000000).map(black_box) (named _chain). Each test was (hopefully) calculating the sum of all the even numbers:

test iter::mark_bench_fold_chain ... bench: 2,287,252 ns/iter (+/- 160,890) test iter::mark_bench_fold_chain_mut ... bench: 1,969,051 ns/iter (+/- 75,847) test iter::mark_bench_fold_chain_mut_each ... bench: 2,516,875 ns/iter (+/- 123,425) test iter::mark_bench_fold_chain_mut_fold ... bench: 2,363,194 ns/iter (+/- 123,658) test iter::mark_bench_fold_simple ... bench: 57,271 ns/iter (+/- 3,691) test iter::mark_bench_fold_simple_mut ... bench: 58,071 ns/iter (+/- 3,410) test iter::mark_bench_fold_simple_mut_each ... bench: 540,887 ns/iter (+/- 5,221) test iter::mark_bench_fold_simple_mut_fold ... bench: 57,627 ns/iter (+/- 4,896)

so I think based on all these shifty benchmarks moving around so much ... maybe these're all within statistical uncertainy (combined with whatever my computer is doing at any given time). This was super fun to play around with but I think I'm going to bow out of the "maybe fold_mut will be faster!" argument :-D

maybe Rust can tell in for_each that moving and reassigning () is a no-op

It's a zero sized type (ZST), so there's literally nothing to do in codegen.

I'm really surprised that it did poorly in your benchmark, but my guess is that there was some unlucky inlining (or lack thereof), especially if parts of that benchmark got split across codegen-units.

cuviper · 2020-09-15T16:59:03Z

library/core/src/iter/traits/iterator.rs

+    /// assert_eq!(counts[&'a'], 5);
+    /// ```
+    #[inline]
+    #[unstable(feature = "iterator_fold_mut", issue = "76725")]


This issue needs to be a real tracking issue, not the feature request.
See other instances of issues labeled C-tracking-issue and T-libs.

Hmm I created #76751 but I don't obviously see how to add T-libs to it 🤔

It got tagged but in the future you can use @rustbot modify labels: +(label), -(label)

scottmcm · 2020-09-15T18:08:00Z

It's not obvious to me that this is materially different from a for_each with a FnMut closure.

To look at the example in the issue,

iter.fold(Vec::new(), |v, x| {
  v.push(x);
  v // this line is a little weird
})

How does the proposed fold_mut compare to just doing something like this instead?

let mut v = Vec::new();
iter.for_each(|x| v.push(x));
let v = v;

EDIT after thinking some more:

To me, the point of fold is that the accumulator is owned. Emulating a fold with a normal .for_each is annoying because you're stuck doing an Option::take dance (or similar) to get ownership of the original item. But when one only needs by-reference access to the accumulator (as in fold_mut), then I'd say it's better to just write it by capturing it in the closure, rather than manually threading it down through all the calls and always returning it.

Now I wonder if this could be made a clippy lint, especially given the optimization gap that you've identified here...

workingjubilee · 2020-09-15T19:03:30Z

It is different because it enables my predilection of doing peculiar things with method chains. 😉

Perhaps that is not enough to justify the addition of a new method, but I would argue the included benchmarks should land, and the relevant code moved next to the benchmarks if necessary, as something to test against (and a future FIXME I guess?), because this overhead should in fact be optimized away... I write code in similar patterns fairly often, and I know I have seen many others do so as well, even without my prompting.

mlodato517 · 2020-09-15T20:02:36Z

It's not obvious to me that this is materially different from a for_each with a FnMut closure.

To look at the example in the issue,
iter.fold(Vec::new(), |v, x| {
  v.push(x);
  v // this line is a little weird
})
How does the proposed fold_mut compare to just doing something like this instead?
let mut v = Vec::new();
iter.for_each(|x| v.push(x));
let v = v;

I assume the performance is the same (because I assume for_each is the same as for which I've already benched). I think the main benefit is hiding the mutability inside the closure along with fold being a common idiom to reach for. I imagine most folds could be rewritten with for_eachs but I could be missing some ownership nuance there.

Generally, I'd prefer a faster fold over a fold_mut and I didn't know if someone would look at fold and fold_mut and go, "Aha! We must not be doing X when compiling fold! We can just turn this knob here and look at that, fold is twice as fast across the board!".

mlodato517 · 2020-09-15T20:06:08Z

So would the plan forward be to add benchmarks and add the fold_mut implementation directly in the benchmark to show that fold needs to be changed in some way and we leave the issue open for optimizations on fold?

cuviper · 2020-09-15T20:08:00Z

I imagine most folds could be rewritten with for_eachs but I could be missing some ownership nuance there.

If the operation needs to consume the accumulator to produce the next one, you need fold. You can't consume captured values in a for_each closure because that makes it FnOnce, or else you have to play ugly games with Option::take.

mlodato517 · 2020-09-15T20:13:50Z

Yeah I'd be interested in seeing how people generally use fold. I haven't rusted much but I've only ever written inline closures. I imagine there are cases where the user wants to leverage a pre-existing function that doesn't take references and already returns the folded value and I'd be interested in seeing how prevalent each style is!

scottmcm · 2020-09-15T21:37:50Z

Well, it wasn't hard to find a fold-a-string example in the compiler. I don't know what it says, though, since it also turned out to be a place where rewriting to the mutable closure was simple and (negligibly) shorter:

         fn tokens_to_string(tokens: &[TokenType]) -> String {
             let mut i = tokens.iter();
             // This might be a sign we need a connect method on `Iterator`.
-            let b = i.next().map_or(String::new(), |t| t.to_string());
-            i.enumerate().fold(b, |mut b, (i, a)| {
+            let mut b = i.next().map_or(String::new(), |t| t.to_string());
+            i.enumerate().for_each(|(i, a)| {
                 if tokens.len() > 2 && i == tokens.len() - 2 {
                     b.push_str(", or ");
                 } else if tokens.len() == 2 && i == tokens.len() - 2 {
                     b.push_str(" or ");
                 } else {
                     b.push_str(", ");
                 }
                 b.push_str(&a.to_string());
-                b
-            })
+            });
+            b
         }

         let mut expected = edible

Passing methods directly seems pretty rare; it seems more common to end up putting them in a closure anyway, like the following few examples I found quickly:

self.attrs.iter().fold(self.span, |acc, attr| acc.to(attr.span))
.fold(Fingerprint::ZERO, |a, b| a.combine_commutative(b));
exprs.iter().rev().fold(succ, |succ, expr| self.propagate_through_expr(&expr, succ))

mlodato517 · 2020-09-15T21:43:04Z

Ah, thanks for doing the digging! Definitely looks cleaner with the method in there. So maybe fold_mut isn't particularly useful? Do we care to add it for the performance difference or just punt and leave this as a feature request to improve fold? Or in the docs we could note that if the closure doesn't require ownership of the accumulator, it could be faster to use for_each? Or is that kind of a gross "suggest the wrong thing for the (questionably) right reason"?

I guess if anyone has a performance issue with fold they might discover this issue/PR and know what to do up until fold itself is improved?

mlodato517 · 2020-09-17T21:30:45Z

After lots of benchmarking and discussing (and maybe learning 🤞), I'm feeling that fold_mut probably isn't faster than fold in any statistically relevant way. I'm still happy to push this through to offer an optional method for when a consuming closure is "awkward" but I'm having a hard time deciding if that's valuable enough to the standard library/community to add.

Looking for input - do we want this alternate method in the standard library?

cuviper · 2020-09-17T21:48:08Z

do we want this alternate method in the standard library?

I'm leaning toward "no", given that the current options are pretty short and reasonable.
Maybe it would fit the itertools crate?

mlodato517 · 2020-09-18T18:50:07Z

I was trying to figure out why the Vec benchmarks were consistently "in favor of" fold_mut while the number and HashMap benchmarks didn't care so much.

I began to wonder if it has to do with the size of the accumulator. I don't know how HashMap is represented but Vec is roughly three usizes which, on a 64 bit machine, is three times the size of an i64 so maybe the moving that happens in each fold iteration has to move more data and so it slows down.

Did a super quick 'n dirty test here and it seems to hold up. It looks like fold gets slower as the accumulator gets larger and the movement in the closure gets more cumbersome (or some other reason).

Add fold_mut alternative

b6ddb29

rust-highfive assigned joshtriplett Sep 15, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 15, 2020

mlodato517 changed the title ~~Add fold_mut alternative~~ Add fold_mut alternative to Iterator trait Sep 15, 2020

jyn514 added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-iterators Area: Iterators labels Sep 15, 2020

cuviper reviewed Sep 15, 2020

View reviewed changes

mlodato517 mentioned this pull request Sep 15, 2020

Tracking Issue for Iterator fold_mut #76751

Closed

3 tasks

This comment has been minimized.

Sign in to view

Mark Lodato added 2 commits September 16, 2020 16:55

Add benchmarks for chain/flat_map. Needs macro

2921101

Use tracking issue, not feature request in stability marker

d7cc6be

mlodato517 closed this Sep 18, 2020

mlodato517 deleted the ml-fold-mut branch September 18, 2020 18:22

mlodato517 mentioned this pull request Sep 21, 2020

Add fold_mut alternative to fold rust-itertools/itertools#481

Closed

scottmcm mentioned this pull request Oct 21, 2020

Make ControlFlow #[must_use] #78202

Closed

scottmcm mentioned this pull request Jan 2, 2023

slice::Iter::fold optimizes poorly for some niche optimized types. #106288

Closed

scottmcm mentioned this pull request Jan 7, 2024

Iterator Default Implementation for position() is slow #119551

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fold_mut alternative to Iterator trait #76746

Add fold_mut alternative to Iterator trait #76746

mlodato517 commented Sep 15, 2020

rust-highfive commented Sep 15, 2020

workingjubilee commented Sep 15, 2020 •

edited

Loading

mlodato517 commented Sep 15, 2020

leonardo-m commented Sep 15, 2020

cuviper Sep 15, 2020

cuviper Sep 15, 2020

mlodato517 Sep 15, 2020

mlodato517 Sep 16, 2020 •

edited

Loading

scottmcm Sep 16, 2020

mlodato517 Sep 17, 2020

cuviper Sep 17, 2020 •

edited

Loading

mlodato517 Sep 17, 2020

cuviper Sep 17, 2020

cuviper Sep 15, 2020

mlodato517 Sep 15, 2020

workingjubilee Sep 15, 2020

mlodato517 Sep 16, 2020

scottmcm commented Sep 15, 2020 •

edited

Loading

This comment has been minimized.

workingjubilee commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

cuviper commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

scottmcm commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

mlodato517 commented Sep 17, 2020

cuviper commented Sep 17, 2020

mlodato517 commented Sep 18, 2020 •

edited

Loading

Add fold_mut alternative to Iterator trait #76746

Add fold_mut alternative to Iterator trait #76746

Conversation

mlodato517 commented Sep 15, 2020

rust-highfive commented Sep 15, 2020

workingjubilee commented Sep 15, 2020 • edited Loading

mlodato517 commented Sep 15, 2020

leonardo-m commented Sep 15, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlodato517 Sep 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cuviper Sep 17, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scottmcm commented Sep 15, 2020 • edited Loading

This comment has been minimized.

workingjubilee commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

cuviper commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

scottmcm commented Sep 15, 2020

mlodato517 commented Sep 15, 2020

mlodato517 commented Sep 17, 2020

cuviper commented Sep 17, 2020

mlodato517 commented Sep 18, 2020 • edited Loading

workingjubilee commented Sep 15, 2020 •

edited

Loading

mlodato517 Sep 16, 2020 •

edited

Loading

cuviper Sep 17, 2020 •

edited

Loading

scottmcm commented Sep 15, 2020 •

edited

Loading

mlodato517 commented Sep 18, 2020 •

edited

Loading