Binary search performance regressed in 1.25 #53823
The likely culprit is the upgrade from LLVM 4 to 6. The upgrade was merged on the 10th of February; the binary search was fast on the 9th and slow on the 12th. @estebank The optimization did hit stable (if only for 2 releases), so does it not need the label?
@Emerentius what do you mean by "for only 2 releases"? Isn't the slowdown present from stable through nightly? Reading the assembly on godbolt, it seems to me that there have been some changes, but the code causing the slowdown presented in this issue is still there. If that is the case, then it would need that label; but if it was fixed in current stable, then we can close the ticket, as I do not believe we cut point releases for previous stable releases.
@estebank I mean it was fast in 1.23 stable and slow from 1.25 stable on, so fast for 2 stable releases, then regressed.
triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. But I'm also self-assigning, since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.
Discussed in the T-compiler meeting. Re-prioritizing as P-high, just for evaluating the scope of the regression (though some good work on that is already evident in this comment thread).
Not sure why I previously removed I-slow; I'm putting it back now.
Hey @Emerentius, what is the distinction you are making in the columns with the prefix "old_" and "new_" in the tables you present? That is, I would have thought that "old_" somehow corresponds to the state prior to the changes from PR #45333 and "new_" corresponds to after PR #45333 landed; but you already established that PR #45333 landed in between the two versions that each have "old_" and "new_" labels applied to them. So I don't understand what difference has been applied to either the code or the benchmark between the two columns there. (Hypothesis: is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333?)
(In any case, I am definitely able to reproduce a regression on my Linux desktop in benches/slice between nightly-2018-02-09 and nightly-2018-02-12, as anticipated in #53823 (comment); there's a fair amount of noise between runs when looking at binary_search_l3 on my own machine, but binary_search_l1 is much more consistent in illustrating a 2x regression for me.)
Okay, so I've done some preliminary analysis of the scope of the problem. On my desktop Linux machine, I built the benchmark suite in …
Update: on the LLVM ticket that was subsequently filed for this issue, someone noted that I didn't indicate the processor model where I gathered this data. Here that is, in a details block.
Here is the raw data from that exercise: https://gist.github.com/8f511af7012c750928a6bdbf8efa3e6f. Then I put it into a spreadsheet, took the min (of the three runs) of the reported ns/iter for each benchmark, and computed the ratio (i.e. slowdown, where 2.0 is a 2x slowdown, 1.0 is no change, and 0.5 is a 2x speedup) comparing the min for 2018-02-09 to the min for 2018-02-12.
Finally, when looking at the slowdown ratio, I asked for values > 1.2 to be highlighted in red (so that we'd pay special attention to anything with a 1.2x slowdown or worse) and values < 0.9 to be highlighted in green (so that we'd also notice reasonable speedups). Here is that spreadsheet: https://www.dropbox.com/s/qznzchqhibi5o67/three-runs.xlsx?dl=0
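To make the arithmetic concrete, here is a small sketch of the ratio computed in that spreadsheet; the function name and the sample numbers are mine, purely for illustration:

```rust
// Slowdown ratio as used above: best (minimum) ns/iter across repeated
// runs of a benchmark, new divided by old, so 2.0 means a 2x slowdown,
// 1.0 means no change, and 0.5 means a 2x speedup.
fn slowdown_ratio(old_runs: &[f64], new_runs: &[f64]) -> f64 {
    let best = |runs: &[f64]| runs.iter().copied().fold(f64::INFINITY, f64::min);
    best(new_runs) / best(old_runs)
}

fn main() {
    // Made-up ns/iter numbers standing in for three runs each of
    // nightly-2018-02-09 vs nightly-2018-02-12.
    let ratio = slowdown_ratio(&[60.0, 62.0, 61.0], &[120.0, 124.0, 122.0]);
    println!("slowdown: {ratio:.2}x"); // prints "slowdown: 2.00x" (> 1.2, flagged red)
}
```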
A number of benchmarks improved significantly (i.e. the slowdown ratio is <0.9) over the time period specified. I'm not going to list them here, but there are ~44 of them. But here are the 9 cases that regressed:
Obviously the latter six cases are the ones that were already being pointed out by this bug. (The other three entries in the table may or may not have the same core cause.) Obviously the libcore/benches suite should not be the sole decider of whether a regression is significant, but based on these results, the impact does seem to be somewhat restricted in scope...
Also, there are some pretty interesting discussions among LLVM developers about when they turn cmov into branches. It sounds like there are a number of scenarios where branches exhibit better performance, due to (I think) either 1. the processor's branch prediction (when the condition is mostly predictable, a correctly predicted branch is nearly free, whereas a cmov always extends the data dependency chain) or 2. an expensive test condition that you'd rather evaluate just once.
In particular, in the LLVM developer discussions linked above, I saw mention of a flag controlling this conversion. I gave that a whirl, and the results seem promising:
To be clear: I am not suggesting that we should turn the conversion off. But:
Just checked: that patch is in our LLVM, and this still reproduces with current LLVM (running the generated IR through llc).
I've filed an LLVM issue: https://bugs.llvm.org/show_bug.cgi?id=40027
Okay, at this point I think I've convinced myself that the scope of this problem is somewhat limited, so I'm going to downgrade this to P-medium. It also isn't really going to be addressed on our end (except perhaps via cherry-picking a patch for the LLVM issue). So I'm tempted to close it, but I'll wait to let other members of @rust-lang/compiler weigh in before I do that.
I'll leave myself assigned for now, but mostly as a reminder to close it in the future. If someone else wants to take this issue and run with it, feel free to assign yourself and unassign me.
I think the main actionable point here is to try to change the binary search implementation in a way that does not run into this LLVM issue. I'm not sure what that would look like, but something that interchanges the order of the compare and the load (i.e. already loading for the next iteration) might work, because that is a pattern LLVM specifically checks for. A minimal sketch of one possible shape follows.
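Here is one guess at that shape, under stated assumptions: this is my own reconstruction (the function name and the `u32`-slice specialization are mine), not libcore code, and whether it actually pacifies the heuristic would need to be verified against the generated assembly. The idea is that the freshly selected index immediately feeds a load, mimicking the pointer-chasing pattern that the heuristic reportedly exempts.

```rust
/// Sketch: binary search whose trip count depends only on `s.len()` and
/// where the comparison result feeds the address of the next load.
pub fn search_load_first(s: &[u32], x: u32) -> Option<usize> {
    if s.is_empty() {
        return None;
    }
    let mut base = 0usize;
    let mut size = s.len();
    let mut probe = s[0];
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        let mid_val = s[mid];
        // Branch-free update of the search window...
        base = if mid_val <= x { mid } else { base };
        // ...followed immediately by a load through the selected index:
        // the select result becomes a load address, the pattern LLVM's
        // cmov-conversion heuristic is said to leave alone.
        probe = s[base];
        size -= half;
    }
    if probe == x { Some(base) } else { None }
}
```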
I have tested several older revisions of the binary_search and partition_point implementations. All of them had their conditional moves converted to cmp + branch due to this LLVM change. This LLVM issue comment suggests that the primary heuristic which turns off this optimization only looks for binary tree searches (unpredictable pointer chasing), but not binary searches (unpredictable pointer arithmetic). Another heuristic related to cost modelling has been added. Modifying its threshold via … The options I currently see are:
Only option 2 would be a pure libs thing. Some of those options would still require changes to the current implementation and benchmark runs. The old …
We discussed this issue in our last libs team meeting and concluded that there's nothing left for libs to do, and that we need to wait for the fix to happen upstream in https://bugs.llvm.org/show_bug.cgi?id=40027. Our plan is to hand this over to the compiler team for tracking, as discussed in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/tracking.20an.20old.20regression. For now we've tagged both teams and nominated it so that the compiler team can confirm the handover in their next meeting.
@yaahc @joshtriplett I'm happy to confirm handover to T-compiler, but the prioritization of this (currently P-low; P-medium as of 8 days ago) essentially means that we're never going to prioritize it for discussion in a meeting. It will just be a task hanging out there for interested people to look at and discuss. So, I just wanted to confirm: is that an acceptable outcome from the point of view of you all? Or do you think this should be assigned a higher priority?
@pnkfelix That's roughly the outcome we were expecting. We do feel that this would be worthwhile to have on the radar of a team whose primary purview is LLVM-related fixes that would benefit Rust, but we also recognize that that's a "somebody should", and we don't have a somebody to offer.
What Josh said. I've gone ahead and removed the nomination label.
Could something like https://docs.rs/cmov/latest/cmov/ be used in the interim to force the usage of cmov and recover the performance?
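For context, the crate linked above guarantees a cmov by emitting inline assembly (as I understand it). As a dependency-free sketch of the same interim idea, an index update can also be written branch-free by hand with masking; note that nothing stops LLVM from recognizing this as a select and branching it anyway, so unlike inline asm this is a hint, not a guarantee. The names and shape here are mine, for illustration only:

```rust
/// Branch-free select between two indices: returns `if_true` when `cond`
/// holds, otherwise `if_false`.
#[inline]
fn select_index(cond: bool, if_true: usize, if_false: usize) -> usize {
    // `wrapping_neg` maps 1 -> all-ones mask and 0 -> all-zeros mask.
    let mask = (cond as usize).wrapping_neg();
    (if_true & mask) | (if_false & !mask)
}

fn main() {
    assert_eq!(select_index(true, 7, 3), 7);
    assert_eq!(select_index(false, 7, 3), 3);
}
```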
Is profile-guided optimization maybe able to do anything in this case?
The x86 backend should now respect …
There's a discussion of what to do about …
This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches. Fixes rust-lang#53823
See #128254 for a potential fix.
Rewrite binary search implementation This PR builds on top of rust-lang#128250, which should be merged first. This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches. Fixes rust-lang#53823 Fixes rust-lang#115271
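A compact sketch of the combined shape this PR describes, assuming the `select_unpredictable` hint from rust-lang#128250 (exposed as `core::hint::select_unpredictable` on sufficiently recent toolchains; earlier it existed only as an unstable intrinsic). This is my reconstruction for a `u32` slice, not the actual std source: the hint tells LLVM the condition is unpredictable, so the select stays a cmov, and the trip count depends only on the slice length, so the loop itself is perfectly predictable.

```rust
use core::hint::select_unpredictable;

/// Sketch of a branchless binary search over a sorted slice. On success
/// returns the index of a matching element; on failure, the insertion point.
pub fn search(s: &[u32], x: u32) -> Result<usize, usize> {
    let mut base = 0usize;
    let mut size = s.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // Branch-free window update; the hint keeps this as a cmov.
        base = select_unpredictable(s[mid] <= x, mid, base);
        size -= half;
    }
    match s.get(base) {
        Some(&v) if v == x => Ok(base),
        Some(&v) => Err(base + (v < x) as usize),
        None => Err(0), // empty slice
    }
}
```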
The speedup obtained by the optimizations from #45333, which hit stable in 1.23, was mostly lost again from 1.25 onwards, because the compiler started emitting a branch instead of a cmov.
1.24
1.25
From godbolt: https://godbolt.org/z/DUDl-T
Bench comparison with 1.23 nightly (2017-11-20, 9 days after PR merge)
and status quo with 1.30 nightly (2018-08-15)