Add fold_mut alternative to Iterator trait #76746
(rust_highfive has picked a reviewer for you, use `r?` to override)

Needs to be updated to pass [...]

Yes, Rust is built with the nightly compiler (actually, the full bootstrap process initiated by `./x.py`).

Sorry, I'm not 100% sure what to do here - I [...] What's the right way to set this up so that the feature is enabled? Also, I'm seeing [...] in the doctest. I copied that from [...]; I wasn't sure if I needed to add it to the list in [...]. Is there a good commit/PR for me to track the right way to do things? I was originally following #65222 but I don't see anything particularly sneaky.

Currently the stdlib lacks some essential iterators (present in Itertools). Every function added to iterators increases cognitive load, the time to find the right function, and the memory load of remembering its purpose. I am not saying this shouldn't fit in the stdlib, but please consider the costs.
```rust
let mut accum = init;
while let Some(x) = self.next() {
    f(&mut accum, x);
}
```
I think the default should be implemented in terms of `fold`, so it automatically takes advantage of all the iterators that already specialize this path.

```rust
self.fold(init, |mut acc, x| { f(&mut acc, x); acc })
```

Even better if that isolates the closure generics like #62429, so it only depends on `Self::Item` rather than all of `Self` (plus `B` and `F`, of course).
Try with inputs like `Chain` or `FlatMap` to see the performance difference of custom `fold`.
> I think the default should be implemented in terms of `fold`, so it automatically takes advantage of all the iterators that already specialize this path.

I like this idea! I tried it here, and the results seem to indicate that this brought back the performance "issue" (assuming the benchmark is built right, Criterion is doing its thing correctly, and I'm interpreting the results correctly).

Should I push forward with that? I was originally making `fold_mut` for the performance, but we could instead make it about the shape of the closure and just wait on improvements to `fold`, if we think that's valuable.
> Try with inputs like `Chain` or `FlatMap` to see the performance difference of custom `fold`.

Good call, I'll whip up a few of these when I get the chance!
Eureka! I added some benchmarks (with terrible names and with terrible duplication that should be replaced with a macro) in 2921101, and it appears that `fold_mut` is much slower when using `chain` and `flat_map`!

Unfortunately they're a little bit out of order - I will definitely pick better names if we decide to go through with this PR - but basically:

- For `chain` and `flat_map`, `fold_mut` is something like 50%-100% slower.
- For a non-chain-non-flat-map fold on a hashmap, they're about the same (except `fold_mut` has a huge variance, so I'm not sure about that one).
- For a non-chain-non-flat-map fold on a number, `fold_mut` is slower now (interesting that this has switched; I guess that means they're basically "the same within precision"?).
- For a non-chain-non-flat-map fold on a `Vec`, `fold_mut` is faster by some huge margin that is suspicious.

I think maybe I'll port these benchmarks back to my other repo so I can use Criterion for (maybe) more consistent data? Or should we just pull the plug on this? Or redirect the effort towards the ergonomics and not worry about performance?
> it appears that `fold_mut` is much slower when using `chain` and `flat_map`!

That's expected because of the `while let` (essentially a `for` loop), as cuviper said/implied. If you just replace the (conceptual) `for` loop with a `.for_each(...)`, I suspect you'd recover the performance:

```diff
  let mut accum = init;
- while let Some(x) = self.next() {
-     f(&mut accum, x);
- }
+ self.for_each(|x| f(&mut accum, x));
```

(insert comment about #62429 here and how that shouldn't be the implementation in `core`, but it'd be fine for benchmarking)
Hmm, that went a little over my head, let me see if I understand:

- #62429 ("Reduce the genericity of closures in the iterator traits") introduced some specialized iterator methods (`fold` among them) for certain iterators. This reduced the number of types the iterator closure was generic over.
- By implementing a separate `fold_mut` that doesn't use `fold`, `fold_mut` misses out on the specific `fold` implementations for those iterators.
- By switching it to a `for_each`, `fold_mut` may recover the performance because, I assume, there are similarly specialized `for_each` implementations for several iterators.

Interested to know if any of those are in the right ballpark!
And for expedience (maybe), assuming those points are in the correct ballpark, may I ask:

- How does the custom implementation, which reduces the generic types on the closure, improve performance? I can understand the comments about reducing the number of monomorphizations and then code size, but it's not immediately obvious how this plays into runtime performance. Or is it just smaller = faster here? Or is it that, with fewer trait bounds, the compiler is able to do some additional optimization?
- How should this be reflected in this PR? My original hope was to have a `fold` that was "closer" to a "zero cost abstraction". It's seeming more and more like that isn't super possible (except with maybe the `for_each` construction above?). Should I bail on the "performance" of `fold_mut` and double down on the ergonomics of the closure by defining `fold_mut` in terms of `fold`? Then `fold_mut` will maybe become more of a zero cost abstraction as `fold` is specialized on other iterators like `Vec::IntoIter` or something.
The concerns of #62429 are mostly separate from the concerns of `fold` specialization. The overlap is just that using `fold` or `for_each` often introduces a generic closure.

The point of #62429 is that closures inherit all of the generic type parameters of the surrounding scope. In your example that would be `B`, `F`, and `Self` -- but the closure doesn't really need the whole iterator type, just `Self::Item`. So you can end up generating the same closure for `<B, F, vec::IntoIter<i32>>`, `<B, F, Cloned<hash_set::Iter<'_, i32>>>`, `<B, F, Map<Chain<Range<i32>, Once<i32>>, map-closure{...}>>`, etc., when our closure could be just `<B, F, Item = i32>`. This is a compile-time concern for excessive monomorphizations, and can also make it exceed type-length limits. There has been some compiler work to trim unused parameters, but it still doesn't know how to reduce that `Self` to `Self::Item`. So the trick in #62429 is to use a "closure factory" function with tighter control over the generic parameters.
The `fold` specialization is more about runtime performance, which is why it changes your benchmarks. For example, `Chain::next()` has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas `Chain::fold()` can just fold all of the first iterator and then all of the second. The default `for_each` is just a `fold` with an empty `()` accumulator, which is why it also benefits here.
Ahh, thank you for that explanation!
> The `fold` specialization is more about runtime performance, which is why it changes your benchmarks. For example, `Chain::next()` has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas `Chain::fold()` can just fold all of the first iterator and then all of the second.
Awesome, this makes a ton of sense
> The default `for_each` is just a `fold` with an empty `()` accumulator, which is why it also benefits here.
Woah, did not see that coming! So I originally read that thinking, "Great, I'll rewrite `fold_mut` using `for_each` as mentioned above, and I'll get the benefits of `fold` improvements, plus maybe Rust can tell in `for_each` that moving and reassigning `()` is a no-op, and we'll keep the performance improvements on 'simple' iterators."

I'm running out of benchmark steam (is there a better way to do this than running `./x.py bench -i library/core/ --test-args mark_bench_fold`? - it takes about a half hour on my machine), but I gave this a shot. I defined three methods:

- `fold_mut`, which uses `while let`
- `fold_mut_fold`, which uses `fold` under the hood
- `fold_mut_each`, which uses `for_each` under the hood

And ran 8 benchmarks. 4 were operating on `(0..100000).map(black_box)` (named `_simple`) and 4 were operating on `(0i64..1000000).chain(0..1000000).map(black_box)` (named `_chain`). Each test was (hopefully) calculating the sum of all the even numbers:
```text
test iter::mark_bench_fold_chain          ... bench:   2,287,252 ns/iter (+/- 160,890)
test iter::mark_bench_fold_chain_mut      ... bench:   1,969,051 ns/iter (+/-  75,847)
test iter::mark_bench_fold_chain_mut_each ... bench:   2,516,875 ns/iter (+/- 123,425)
test iter::mark_bench_fold_chain_mut_fold ... bench:   2,363,194 ns/iter (+/- 123,658)
test iter::mark_bench_fold_simple         ... bench:      57,271 ns/iter (+/-   3,691)
test iter::mark_bench_fold_simple_mut     ... bench:      58,071 ns/iter (+/-   3,410)
test iter::mark_bench_fold_simple_mut_each ... bench:    540,887 ns/iter (+/-   5,221)
test iter::mark_bench_fold_simple_mut_fold ... bench:     57,627 ns/iter (+/-   4,896)
```
So I think, based on all these shifty benchmarks moving around so much... maybe these are all within statistical uncertainty (combined with whatever my computer is doing at any given time). This was super fun to play around with, but I think I'm going to bow out of the "maybe `fold_mut` will be faster!" argument :-D
> maybe Rust can tell in `for_each` that moving and reassigning `()` is a no-op
It's a zero-sized type (ZST), so there's literally nothing to do in codegen.
I'm really surprised that it did poorly in your benchmark, but my guess is that there was some unlucky inlining (or lack thereof), especially if parts of that benchmark got split across codegen-units.
````rust
/// assert_eq!(counts[&'a'], 5);
/// ```
#[inline]
#[unstable(feature = "iterator_fold_mut", issue = "76725")]
````
This `issue` needs to be a real tracking issue, not the feature request. See other instances of issues labeled C-tracking-issue and T-libs.
Hmm I created #76751 but I don't obviously see how to add T-libs to it 🤔
It got tagged, but in the future you can use `@rustbot modify labels: +(label), -(label)`.
It's not obvious to me that this is materially different from a [...]. To look at the example in the issue:

```rust
iter.fold(Vec::new(), |mut v, x| {
    v.push(x);
    v // this line is a little weird
})
```

How does the proposed `fold_mut` improve on this?

EDIT after thinking some more: To me, the point of [...]. Now I wonder if this could be made a clippy lint, especially given the optimization gap that you've identified here...
It is different because it enables my predilection for doing peculiar things with method chains. 😉 Perhaps that is not enough to justify the addition of a new method, but I would argue the included benchmarks should land (and the relevant code moved next to the benchmarks if necessary) as something to test against (and a future FIXME, I guess?), because this overhead should in fact be optimized away... I write code in similar patterns fairly often, and I know I have seen many others do so as well, even without my prompting.
I assume the performance is the same (because I assume [...]). Generally, I'd prefer a faster [...]
So would the plan forward be to add benchmarks and add the [...]?
If the operation needs to consume the accumulator to produce the next one, you need `fold`.
Yeah, I'd be interested in seeing how people generally use `fold`.
Well, it wasn't hard to find a fold-a-string example in the compiler. I don't know what it says, though, since it also turned out to be a place where rewriting to the mutable closure was simple and (negligibly) shorter:

```diff
 fn tokens_to_string(tokens: &[TokenType]) -> String {
     let mut i = tokens.iter();
     // This might be a sign we need a connect method on `Iterator`.
-    let b = i.next().map_or(String::new(), |t| t.to_string());
-    i.enumerate().fold(b, |mut b, (i, a)| {
+    let mut b = i.next().map_or(String::new(), |t| t.to_string());
+    i.enumerate().for_each(|(i, a)| {
         if tokens.len() > 2 && i == tokens.len() - 2 {
             b.push_str(", or ");
         } else if tokens.len() == 2 && i == tokens.len() - 2 {
             b.push_str(" or ");
         } else {
             b.push_str(", ");
         }
         b.push_str(&a.to_string());
-        b
-    })
+    });
+    b
 }
```
`let mut expected = edible` [...]

Passing methods directly seems pretty rare; it seems more common to end up putting them in a closure anyway, like the following few examples I found quickly:

```rust
self.attrs.iter().fold(self.span, |acc, attr| acc.to(attr.span))

.fold(Fingerprint::ZERO, |a, b| a.combine_commutative(b));

exprs.iter().rev().fold(succ, |succ, expr| self.propagate_through_expr(&expr, succ))
```
Ah, thanks for doing the digging! Definitely looks cleaner with the method in there. So maybe I guess if anyone has a performance issue with [...]
After lots of benchmarking and discussing (and maybe learning 🤞), I'm feeling that [...]

Looking for input - do we want this alternate method in the standard library?
I'm leaning toward "no", given that the current options are pretty short and reasonable.
I was trying to figure out why the [...]. I began to wonder if it has to do with the size of the accumulator. I don't know how [...]. Did a super quick 'n dirty test here and it seems to hold up. It looks like [...]
This PR adds a `fold_mut` alternative to `fold` for the `Iterator` trait.

Closes #76725

Why?

Most of the background is in #76725, but the TL;DR is that it might be faster and it might be ergonomic in some cases.

Questions:

- When the feature is marked `unstable`, the tests started failing. Do I need to use nightly Rust? Do I need to specify the feature config somewhere? Do I need that specified for the tests and benchmarks?
- The docs are mostly copied from `fold()`, which is a little lazy maybe.
- Should there be specialized `fold_mut` implementations for iterators that customize `fold`, like `VecDeque`, etc.?
- Should this be benchmarked against `fold` itself? I imagine that, in general, the runtimes should be the same.