Iterator Default Implementation for position() is slow #119551
That said, they should be equally fast for slices. Going through the generic default, the call-tree is deeper, so this might affect optimizations. Have you tried compiling with 1CGU or LTO to see if that makes a difference? Benchmarks are fickle.
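For reference, the 1CGU and LTO settings mentioned above are enabled through the release profile in Cargo.toml (this assumes a Cargo project; both keys are standard Cargo profile options):

```toml
[profile.release]
codegen-units = 1   # 1CGU: optimize the whole crate as a single unit
lto = "fat"         # full link-time optimization across crates
```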
The godbolt example does show that the try_fold version results in a few more instructions. The LLVM IR looks like it's trying to do some branchless version of the accumulator increments, while in the other version it is an unconditional increment sitting in its own block. Maybe ControlFlow is not well-suited for carrying the value and the loop state at the same time? CC @scottmcm
I did some quick measurements for a chain of vecs:
So the current impl does indeed work better for chains.
Yes, the difference is the ControlFlow abstraction.
Yes:
No real differences though; the results were with these settings.
This made me retest an alternative I used previously (reducing the ControlFlow type from ControlFlow<usize, usize> to ControlFlow<usize, ()>):
And this one actually beats the current one for the chain test:
The chain_alt is even faster; it's an attempt to keep the current style, updated to keep the counter outside the fold:
It isn't pretty, but it works 😅 I am interested in testing these more thoroughly; is there a nice resource with some real-world data to test them on?
Edit: Went digging in recent projects, and I had a solution for Advent of Code Day 6 using .rposition() and .position(), with the relevant part being:
And changing to the alt version (with check(predicate, acc)) improved runtime from 8.8 ms to 6.2 ms, a ~40% speedup! (I know there is an algebraic solution to the problem, but still.)
You could make a PR and we can do a perf run, though the perf suite doesn't have dedicated benchmarks that stress position().
Rewrite Iterator::position default impl
Storing the accumulating value outside the fold in an attempt to improve code generation has shown speedups on various handwritten benchmarks; see the discussion at rust-lang#119551.
@the8472 Yes, we learned in #76746 (comment) that if you don't need move access to the accumulator, it can be better to keep it outside the fold. (Maybe this'll be better once LLVM does better with 2-variant enums and stops losing information about them, but for now we're stuck with this.)
The compiler struggles to optimize the current default implementation of position() on iterators.
Simplifying the implementation has increased efficiency in various scenarios on my machine:
Old position():
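The code block wasn't preserved in this copy of the thread; the old default in core looked roughly like the following (a standalone sketch, with core-internal attributes such as #[rustc_inherit_overflow_checks] omitted so it compiles as a free function):

```rust
use std::ops::ControlFlow;

// Sketch of the old default: the index travels inside the try_fold
// accumulator, so every ControlFlow::Continue carries the updated count.
fn position_old<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
where
    I: Iterator,
    P: FnMut(I::Item) -> bool,
{
    let flow = iter.try_fold(0usize, |i, x| {
        if predicate(x) {
            ControlFlow::Break(i) // found: the accumulator is the index
        } else {
            ControlFlow::Continue(i + 1) // not found: thread i + 1 onward
        }
    });
    match flow {
        ControlFlow::Break(i) => Some(i),
        ControlFlow::Continue(_) => None,
    }
}

fn main() {
    let v = [10, 20, 30, 40];
    assert_eq!(position_old(&mut v.iter(), |&x| x == 30), Some(2));
    assert_eq!(position_old(&mut v.iter(), |&x| x == 5), None);
    println!("ok");
}
```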
Simplified version (there are several alternative implementations producing similar results):
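The simplified version keeps the counter in a local outside the fold, so Continue carries only (); again, a standalone sketch of the idea rather than the exact patch:

```rust
use std::ops::ControlFlow;

// Sketch of the rewrite: the counter lives in a local captured by the
// closure; ControlFlow::Continue carries no payload at all.
fn position_new<I, P>(iter: &mut I, mut predicate: P) -> Option<usize>
where
    I: Iterator,
    P: FnMut(I::Item) -> bool,
{
    let mut acc = 0usize;
    let flow = iter.try_fold((), |(), x| {
        if predicate(x) {
            ControlFlow::Break(acc) // found: report the external counter
        } else {
            acc += 1; // a plain increment, no value threaded through Continue
            ControlFlow::Continue(())
        }
    });
    match flow {
        ControlFlow::Break(i) => Some(i),
        ControlFlow::Continue(()) => None,
    }
}

fn main() {
    let v = [10, 20, 30, 40];
    assert_eq!(position_new(&mut v.iter(), |&x| x == 30), Some(2));
    assert_eq!(position_new(&mut v.iter(), |&x| x == 5), None);
    println!("ok");
}
```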
Bench results:
bench.rs
Benchmarks are named after the iterator and were run using Divan.
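The bench.rs attachment isn't reproduced here; as a rough stand-in that needs no external crates, a plain std::time harness for the range case might look like this (illustrative only, not the author's Divan setup):

```rust
use std::time::Instant;

fn main() {
    // Search near the end of a large range so the whole range is walked.
    let start = Instant::now();
    let pos = (0u64..10_000_000).position(|i| i == 9_999_999);
    let elapsed = start.elapsed();
    assert_eq!(pos, Some(9_999_999));
    println!("found at {:?} in {:?}", pos, elapsed);
}
```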
Ranges benefit massively (big ones by multiple orders of magnitude), while for other types, speedups are usually between 0% and 100%.
A specialized implementation for ranges that calculates the position directly would of course be even faster, but I don't think position() is called that often on ranges.
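For the equality-search case, such a range specialization would be pure arithmetic; a hypothetical sketch (range_position is not a real std API):

```rust
use std::ops::Range;

// Hypothetical direct computation: in start..end, a contained value v
// sits at index v - start, with no iteration needed.
fn range_position(range: &Range<u64>, v: u64) -> Option<usize> {
    if range.contains(&v) {
        Some((v - range.start) as usize)
    } else {
        None
    }
}

fn main() {
    assert_eq!(range_position(&(10..20), 15), Some(5));
    assert_eq!(range_position(&(10..20), 25), None);
    // Agrees with the iterator version:
    assert_eq!((10u64..20).position(|i| i == 15), Some(5));
    println!("ok");
}
```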
The implementation is similar to the slice::iter::Iter one, which was introduced to reduce compile times (#72166), so perhaps this would help in that regard as well.
However, the compiler appears to occasionally forget how to optimize the function after editing unrelated code. This also happens with the current implementation, so I don't know how reproducible the speedups are (FWIW Ranges have never failed to improve, though).
I did not attempt to benchmark compilation speed, but according to godbolt the memory usage is down from 11.8 MB to 8.3 MB, which looks promising.
I was motivated by a thread on Reddit, where some context and speculation can be found.