rust doesn't optimize closure in scan iterator #11084
This patch changes `result::collect` (and adds a new `option::collect`) from creating a `~[T]` to taking an `Iterator`. This makes the function much more flexible, and may replace the need for #10989. This patch is a little more complicated than it needs to be because of #11084. Once that is fixed we can replace the `CollectIterator` with a `Scan` iterator. It also fixes a test warning.
The bug rust-lang#11084 causes these collect functions to run about twice as slowly as they should because llvm is having trouble optimizing away the closure for some reason. This patch works around that performance bug by using a simple adapter iterator that explicitly captures whether the outer iterator returned an error.
The bug #11084 causes `option::collect` and `result::collect` to run about twice as slowly as they should because llvm is having some trouble optimizing away the scan closure. This gets rid of it, so now those functions perform equivalently to a hand-written version. This also adds an impl of `Default` for `Rc` along the way.
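As a rough illustration of the adapter-iterator approach these patches describe, here is a minimal sketch assuming the shape of the workaround; it is not the actual std source, and all names are invented:

```rust
use std::iter::FromIterator;

// Adapter that yields the `Ok` values of an inner iterator and stashes
// the first `Err` it sees, so no `Scan` closure is involved.
struct Adapter<'a, I, E> {
    iter: I,
    err: &'a mut Option<E>,
}

impl<'a, T, E, I: Iterator<Item = Result<T, E>>> Iterator for Adapter<'a, I, E> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        match self.iter.next() {
            Some(Ok(value)) => Some(value),
            Some(Err(e)) => {
                // Record the error and end the iteration early.
                *self.err = Some(e);
                None
            }
            None => None,
        }
    }
}

// A hand-written `result::collect` equivalent built on the adapter.
fn collect_results<T, E, V, I>(iter: I) -> Result<V, E>
where
    I: Iterator<Item = Result<T, E>>,
    V: FromIterator<T>,
{
    let mut err = None;
    let v: V = Adapter { iter, err: &mut err }.collect();
    match err {
        Some(e) => Err(e),
        None => Ok(v),
    }
}
```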
Do unboxed closures fix this?
Not completely, it seems. I rewrote the original tests provided. The implementation in the standard library uses the adapter iterator, so I only compared it to a version using `scan`.
So it seems like the adapter iterator is still the most performant.
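For contrast with the adapter sketch above, here is roughly what the `scan`-based version under discussion could look like (a hedged reconstruction, not the benchmarked gist's code):

```rust
use std::iter::FromIterator;

// Collect via `Iterator::scan`, threading the error slot through the
// closure's state. This is the shape that LLVM reportedly failed to
// optimize as well as the closure-free adapter.
fn collect_via_scan<T, E, V, I>(iter: I) -> Result<V, E>
where
    I: Iterator<Item = Result<T, E>>,
    V: FromIterator<T>,
{
    let mut err = None;
    let v: V = iter
        .scan(&mut err, |slot, item| match item {
            Ok(value) => Some(value),
            Err(e) => {
                // Stash the error and stop at the first failure.
                **slot = Some(e);
                None
            }
        })
        .collect();
    match err {
        Some(e) => Err(e),
        None => Ok(v),
    }
}
```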
@shepmaster Thanks for updating the benchmarks! Unfortunately, they primarily measure the time it takes to create the input vector. Improved benchmark at https://gist.github.com/dotdash/5ce3f0bfa2e652ca58b5 -- I simply moved the vector creation out of the loop. Results:
Yeah, I should have noted that. I left it that way just to be consistent with the original version, but I also figured that since each was doing the same over-work they would come out consistent. A small number (~5) of runs had the same winners for me, so I shipped it! :-) Thanks for improving it even more!
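To make the methodology concrete, here is a sketch of the kind of `#[bench]` harness being discussed, with the input vector built once outside the timed closure (the function name and input size are made up for illustration):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn bench_collect_ok(b: &mut Bencher) {
    // Build the input once; only the collect itself is timed.
    let v: Vec<Result<u32, ()>> = (0..1024).map(Ok).collect();
    b.iter(|| {
        let r: Result<Vec<u32>, ()> = black_box(&v).iter().cloned().collect();
        r
    });
}
```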
Results with the improved benchmark on nightly:
There is a FIXME related to this issue, …
This issue can be closed, as the … However, since …
I think I see a way to rewrite them using …
Update: reading over the previous comments more carefully, @erickt, @shepmaster and @dotdash have all provided benchmarks that illustrate how to implement …
Interestingly, for me I do observe a 2x slowdown on the @dotdash-provided benchmark. (And now I need to see if I can reproduce the previous results that claimed there was no slowdown...)
Okay, I think I see what happened here, at least procedurally. (I do not yet have an explanation for the performance difference I am observing, but I want to first document the details of the observations that I have made, because the historical trajectory is interesting.) First, here's what I did (in …)
Observations (in …)
So: that is a series of steps that sort of masks the bigger story here: these benchmarks went from this under 1.13:
to this under 1.27:
Which does not seem like a good trajectory over time, for any of the variants, in both absolute and relative terms. (Note that there is a non-zero chance that some other confounding factor in my benchmark apparatus is actually to blame for the gradual performance shift over time. I have thoughts on this but I am too tired to get into them here.)
Making this a P-high issue to at least diagnose what has happened to our performance in the cases noted in the previous comment. (It might be better to fork such investigation off into its own separate issue, but again, I am tired and am just trying to do the minimal amount necessary to keep this on my radar in the short term with respect to tagging and assigning myself.)
@pnkfelix Could you share the code for your benchmark?
… which seems pretty hilarious.
Compiled with …
Compiled with …
@dotdash here is the file I'm compiling for my benchmark: https://gist.github.com/pnkfelix/7cb66d32c3cd9e1f2ea1a61ef15a0a3a |
Okay; I had forgotten about the change to the defaults for …
Results for my Linux machine, with …
One additional caveat I will add: while I remain concerned about some of the performance differences observed here, I also worry about controlling for factors introduced by the benchmark harness itself. In particular, I made two variants of …
Note in particular that the timing for … That means there may be cache effects or memory-allocator issues that are causing the effects of running one benchmark to bleed into the execution of another. (That in turn means I need to be very careful about which executions I consider worrisome with respect to performance. That, or I need to be more careful about how I run the benchmarks.)
(one way to control for some of the aforementioned effects would be to rewrite this benchmark such that it does not accumulate into a …)
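For instance, a hypothetical variant (not the gist's code) that accumulates into a sum keeps the allocator out of the timed region entirely:

```rust
// Summing the Ok values exercises the same error-shunting machinery as
// collecting into a Vec, but performs no heap allocation while iterating.
fn sum_results<I: Iterator<Item = Result<u64, ()>>>(iter: I) -> Result<u64, ()> {
    iter.sum()
}
```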
Okay, so here is a revised version of the code that adds a new target type: https://gist.github.com/80fe57277434ef31f018c08fa8e21f84 Hopefully this can provide additional focus on the … With this, I get the following results:
So: there is still some regression when you compare the …
Here's a graph of @pnkfelix's original data showing the variance (creation scripts). Then someone went and changed the testing methods, so this may not be useful anymore...
(well, all I did was add new variants; the original benchmarks are still there...) Update: I will admit that one needs to figure out some sort of normalization function before attempting to compare the …
Ah, here is one last run: a note from @eddyb elsewhere pointed out that using codegen-units=1 disables link-time optimizations. Let's explicitly control link-time optimizations via lto=off (same as the default with codegen-units=1)
lto=fat
lto=thin
Notably, using lto (fat or thin) brings … I think this has convinced me that there is not much we should do about this case in particular (apart from trying to resolve broader issues with link-time optimizations and/or codegen-unit partitioning). And I am going to go ahead and put up a PR with the simplified implementations of …
Well, okay, to be fair, I realized I should do a similar experiment, but this time showing the results when you vary the number of codegen units.
lto=off codegen-units=16
lto=fat codegen-units=16
lto=thin codegen-units=16
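For reference, the configuration matrix above presumably corresponds to invocations along these lines (`bench.rs` is a placeholder file name):

```sh
# codegen-units=1 (the earlier runs)
rustc -O -C codegen-units=1 -C lto=off bench.rs
rustc -O -C codegen-units=1 -C lto=fat bench.rs
rustc -O -C codegen-units=1 -C lto=thin bench.rs

# codegen-units=16 (this run)
rustc -O -C codegen-units=16 -C lto=off bench.rs
rustc -O -C codegen-units=16 -C lto=fat bench.rs
rustc -O -C codegen-units=16 -C lto=thin bench.rs
```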
The regressions for the …
Update: @michaelwoerister has informed me that the default for LTO when you do not specify anything is not necessarily LTO off.
There is no way to explicitly opt into "local ThinLTO" via the …
Okay, I am satisfied with the results of my investigation. So I'm immediately going to mark this as P-low for now. But furthermore, I do not think there is much else that is actionable on this specific issue. I have posted a PR that demonstrates the change to … And I do wonder whether we need to revise some of our defaults for the compiler command UI, in terms of how the various LTO settings are specified or at least documented, and also in terms of changing what … Therefore, I am inclined to actually close this issue. But I think I'll hold off on doing so until I get some feedback on PR #59605.
It's possible that we might be able to do something about this phenomenon. It might be low-hanging fruit for optimization.
(This is an attempt to ensure we do not regress performance here, since the performance of this operation varied pretty wildly over the course of rust-lang#11084.)
Based on my summary in the previous comment, and the inconclusive results from PR #59605, I'm just going to close this.
…scottmcm Refactoring: use common code between option, result, and accum. `Option` and `Result` have almost exactly the same code in `accum.rs` implementing `Sum` and `Product`. This PR just moves some code so that they all share a single implementation; I believe it is better not to implement this `Iterator` feature twice. I'm not very familiar with pub visibility; I hope I didn't make them public. However, maybe these adapters could be useful, and we could consider making them pub. rust-lang#59605 rust-lang#11084 r? @pnkfelix
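A minimal sketch of the shape such sharing might take, with invented names rather than the actual std internals: the same error-shunting adapter can back both `collect` and `Sum`:

```rust
use std::iter::Sum;

// Error-shunting adapter, repeated here so the example stands alone.
struct Shunt<'a, I, E> {
    iter: I,
    err: &'a mut Option<E>,
}

impl<'a, T, E, I: Iterator<Item = Result<T, E>>> Iterator for Shunt<'a, I, E> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        match self.iter.next() {
            Some(Ok(v)) => Some(v),
            Some(Err(e)) => {
                // Capture the first error and stop yielding values.
                *self.err = Some(e);
                None
            }
            None => None,
        }
    }
}

// `Sum` for `Result` expressed through the shared adapter: sum the Ok
// values, then re-attach any captured error.
fn sum_results_shared<T, U, E, I>(iter: I) -> Result<U, E>
where
    I: Iterator<Item = Result<T, E>>,
    U: Sum<T>,
{
    let mut err = None;
    let total = U::sum(Shunt { iter, err: &mut err });
    match err {
        Some(e) => Err(e),
        None => Ok(total),
    }
}
```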
Does closing PR #59605 imply reopening this issue? Or am I wrong?
[`useless_vec`]: add more tests and don't lint inside of macros

Closes rust-lang#11084.

I realized that the fix I added in rust-lang#11081 itself also causes an error in a suggestion when inside of a macro. Example:
```rs
macro_rules! x {
    () => {
        for _ in vec![1, 2] {}
    }
}
x!();
```
Here it would suggest replacing `vec![1, 2]` with `[x!()]`, because that's what the source callsite is (reminder: it does this to get the correct span of `x!()` for code like `for _ in vec![x!()]`), but that's wrong when *inside* macros, so I decided to make it not lint if the whole loop construct is inside a macro to avoid this issue.

changelog: [`useless_vec`]: add more tests and don't lint inside of macros

r? `@Alexendoo` since these were your tests, I figured it makes most sense to assign you
Good evening. I ran across a missed optimization. I was trying to convert `std::result::collect` into using `FromIterator`. My initial version used `iter::Scan`, but it proved to be 2 times slower at `--opt-level=3` than the original implementation. I also created a custom iterator that doesn't have the closure, and it compiles down to the same speed as the original `std::result::collect`. I'm guessing llvm isn't able to inline the closure for some reason. I've gathered up this test in a gist. Note that in this example, the `Scan1::size_hint()` is different than the `std::iter::Scan::size_hint()` method.

cc @thestinger
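On the `size_hint` point, here is one plausible definition for a collect-specific iterator and its hint; this `Scan1` is a guess at the gist's shape, not its actual code:

```rust
struct Scan1<I, E> {
    iter: I,
    err: Option<E>,
}

impl<T, E, I: Iterator<Item = Result<T, E>>> Iterator for Scan1<I, E> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        match self.iter.next() {
            Some(Ok(v)) => Some(v),
            Some(Err(e)) => {
                self.err = Some(e);
                None
            }
            None => None,
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // An `Err` truncates the stream early, so 0 is the only safe
        // lower bound, while the inner iterator's upper bound still holds.
        let (_, upper) = self.iter.size_hint();
        (0, upper)
    }
}
```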