
Compile time regression introduced in r322401 #37436

Open
alexcrichton opened this issue Jul 7, 2018 · 7 comments
@alexcrichton (Contributor)

Bugzilla Link 38088
Version trunk
OS All
CC @aelovikov-intel, @bmrzycki, @dwblaikie, @orivej, @sebpop

Extended Description

We're currently attempting to upgrade LLVM in the Rust project [1], but we're unfortunately losing performance from this upgrade on a number of our benchmarks [2]. I've been doing some bisection, and it's difficult to pin the loss on any single commit because each regression to track down is smaller and smaller.

In any case, I've tracked down one regression to r322401, i.e. https://reviews.llvm.org/D40146. The IR file [3] takes a while to pass through the opt tool: before this commit it took 16.8s locally, and afterwards it takes 17.9s.

I haven't dug much further into the patch, but let me know if I can help out!

@sebpop (mannequin) commented Jul 8, 2018

Thanks for reporting this issue.

Is this performance degradation due to a slower compiler, or due to the quality of the code generated by LLVM?
In other words, is Rust using LLVM during the execution of Rust programs, or is Rust using LLVM ahead of time?

@alexcrichton (Contributor, Author)

Ah yes, a good clarification to make! The site at perf.rust-lang.org tracks the compile time of various benchmarks; that is, it tests how long it takes rustc to compile various crates. This execution time, however, reflects both LLVM's generated code and LLVM's own execution time. The Rust compiler itself is compiled by LLVM, so some of the runtime is code generated by LLVM's master branch (the Rust compiler pieces) and some of the runtime is LLVM itself (optimizations and whatnot).

In this particular case, though, I believe the regression is specifically in LLVM's execution time, not in the generated code. In the timings I was looking at, the Rust-specific pieces were comparable both before and after (i.e., testing LLVM's code generation); what varied was the time spent in LLVM's optimization passes.

@bmrzycki (Contributor) commented Jul 9, 2018

Hello Alex, thank you for the detailed report.

The commit you referenced adds a feature where one kind of analysis (dominators) is now kept up-to-date across one of the larger passes within LLVM. Due to the design of LLVM, this analysis information is kept separate from the actual IR, which means that whenever one changes, the other must be updated to match.
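To illustrate the coupling being described (a toy sketch, not LLVM's implementation): dominator information is derived from the CFG, so any transform that rewires edges either invalidates it or must update it. The classic iterative data-flow computation of dominator sets makes the dependency visible:

```python
def dominators(cfg, entry):
    """cfg: dict node -> list of successors. Returns node -> set of dominators."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Diamond CFG: A -> {B, C}, both -> D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
assert dominators(cfg, "A")["D"] == {"A", "D"}

# A transform that deletes the edge C -> D changes D's dominators:
cfg["C"].remove("D")
assert dominators(cfg, "A")["D"] == {"A", "B", "D"}
```

Here removing one edge changes D's dominator set, which is exactly the kind of side effect a pass like JumpThreading must propagate into the analysis rather than recomputing from scratch.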

The initial application of this patch was much worse, and we were forced to batch up the Dominator updates to minimize the performance impact. I am a bit surprised by your loss of around 6.5%, though. It may be that rustc emits IR that is highly branchy and requires many more iterations inside JumpThreading to stabilize than code emitted by clang. In my testing of the batched mode, most compile-time-sensitive tests were around the 1% mark, with 3 outliers:
https://reviews.llvm.org/feed/6519861976962257235/

There is work ongoing to revise this interface again:
https://reviews.llvm.org/D48967

It is unclear if this patch will improve performance, or possibly degrade further since it's attempting to preserve the analysis of two data structures instead of just one (PDT and DT).

With all of this said, have you tried the latest tip of the tree? There is a new patch in JumpThreading that improves its performance and may help offset some of the loss you're seeing in keeping this analysis up-to-date. The patch is the following:

https://reviews.llvm.org/D44282

@alexcrichton (Contributor, Author)

OK, thanks for the information! Unfortunately, the patch you mentioned was actually the source of an even bigger regression [1] in our testing, which I reported as bug 38047 and which was reverted in r336393.

I'd definitely believe though that rustc's LLVM IR generation is subpar in various areas, and this is also just one particular benchmark in rustc rather than a massive regression across everything, so it's certainly not the end of the world.

Thanks for taking the time to review!

@bmrzycki (Contributor) commented Jul 9, 2018

> Ok thanks for the information! Unfortunately though the patch you mentioned was actually a source of an even bigger regression [1] in our testing which I reported as bug 38047 and was reverted in r336393.

Interesting, I had not seen that bug. Thanks for pointing it out and I'll watch it too.

It's possible you're seeing the regression here due to that change: it added a synchronization point (flush()) to every call where JumpThreading rewires the edges of the IR, which is similar to the immediate-update method I originally abandoned. See here:

https://reviews.llvm.org/D44282?id=138097#inline-392430

I was hoping that, with the change, the time saved by updating the IR directly would offset the cost and result in a net win.
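The trade-off being described can be sketched in a few lines (the names here are illustrative; this is not LLVM's DomTreeUpdater API). An eager scheme pays the expensive synchronization on every CFG edit, while a lazy scheme queues the edits and pays once:

```python
# Sketch of lazy/batched analysis updates: queue edge changes and apply
# them in one flush instead of synchronizing after every single edit.
class LazyUpdater:
    def __init__(self):
        self.pending = []
        self.flush_count = 0

    def insert_edge(self, a, b):
        self.pending.append(("insert", a, b))

    def delete_edge(self, a, b):
        self.pending.append(("delete", a, b))

    def flush(self):
        # In LLVM this is where the dominator tree would actually be
        # brought up to date; here we just count how often the
        # expensive step runs.
        self.flush_count += 1
        applied, self.pending = self.pending, []
        return applied

updater = LazyUpdater()
for i in range(100):           # many small CFG edits from one transform
    updater.insert_edge(i, i + 1)
applied = updater.flush()      # one expensive update instead of 100
assert len(applied) == 100
assert updater.flush_count == 1
```

A flush() on every rewiring call, as described above, collapses this back to one expensive step per edit, which is the behavior the batching was introduced to avoid.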

> I'd definitely believe though that rustc's LLVM IR generation is subpar in various areas, and this is also just one particular benchmark in rustc rather than a massive regression across everything, so it's certainly not the end of the world.

Yes, there are some input IR cases that unfortunately take more time, and I doubt they can be easily refactored. The way JumpThreading works is the following:

```
for BB in Function:
    do:
        Changed = Process(BB)
    while Changed
```

Process(BB) is a complex function that attempts several changes and transforms to the IR, especially to a block's incoming and outgoing edges. Every change to these needs to be tracked and synchronized.

The only way I can see this becoming more efficient is to overhaul Process(BB) to be more intelligent about the opportunities it pursues, rather than just retrying until everything stabilizes.
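A runnable version of that driver loop (a sketch: `process` here is a stand-in for JumpThreading's per-block transform) shows why the cost scales with how often a block keeps reporting changes:

```python
def run_to_fixed_point(blocks, process):
    """Reprocess each block until it stops changing; return total calls."""
    iterations = 0
    for bb in blocks:
        changed = True
        while changed:
            changed = process(bb)
            iterations += 1
    return iterations

# Toy transform: each block needs `work` successful steps before it
# stabilizes, mimicking a pass that keeps finding opportunities.
def process(bb):
    if bb["work"] > 0:
        bb["work"] -= 1
        return True
    return False

blocks = [{"work": 3}, {"work": 0}, {"work": 5}]
total = run_to_fixed_point(blocks, process)
assert total == 11          # (3+1) + (0+1) + (5+1) calls to process
assert all(bb["work"] == 0 for bb in blocks)
```

Even a stable block costs one call just to confirm the fixed point, and "bouncy" blocks that keep changing state multiply the work, which is why block count in a single function matters so much.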

That foo.ll file is huge at 60MB! Keep in mind that JumpThreading is a function-level scalar transform: the more basic blocks you have in a single function, the more work JumpThreading has to do. One way you can reduce this in the code generated from Rust is to either reduce the number of basic blocks or strike a balance between calling external functions and forcing all the code inline.

This code reminds me quite a bit of the amalgamated SQLite source, which also performed worse with the change adding dominator preservation.

@alexcrichton (Contributor, Author)

Ah yeah, and sorry, I don't mean this as a "hair on fire" bug at all! I figured it would basically be good hygiene to report it and make sure there was no obvious mistake (which it sounds like there definitely isn't). In that sense I think we're likely to merge our LLVM upgrade in Rust with this hit to performance on a few benchmarks.

FWIW, I didn't try to reduce the huge IR file at all; it was basically just the first one that showed a perceptible-to-me difference in performance that I could bisect on. AFAIK we haven't spent a ton of time in Rust improving the LLVM IR we hand to the backend, but that's definitely a possible optimization for us if it comes to it!

Y'all can definitely feel free to do what you'd like with this bug, if you'd prefer to close it as "working as intended for now" that's totally fine!

@bmrzycki (Contributor) commented Jul 9, 2018

> Ah yeah and sorry I don't mean this as a "hair on fire" bug at all! I figured it would basically be good hygiene to report this and make sure there was no obvious mistake (which it sounds like there definitely isn't). In that sense I think we're likely to merge our LLVM upgrade in Rust with this hit to perf on a few benchmarks.

I definitely appreciate the bug report; thank you for taking the time to file it. At the very least it helps gauge how common large functions are in the real world. Dominator analysis is valuable for analyzing further run-time enhancements to JumpThreading, and that was the original reason for adding the code.

> FWIW I didn't try to reduce the huge IR file at all; it was basically just the first one that showed a perceptible-by-me difference in performance that I could bisect on. AFAIK we haven't spent a ton of time in Rust improving our LLVM IR generation that we hand to the backend, but that's definitely a possible optimization for us if it comes to it!

No worries. :) I just saw the size, and it didn't surprise me that JumpThreading slowed down on such a large file. The larger the function, the greater the tendency for JumpThreading to "bounce" between a few intermediate states, adding and then removing a potential IR change to improve the chance of finding a jump-threading opportunity. It's not the most elegant pass.

Any attempt by Rust to reduce the number of blocks emitted in a single function will help reduce the JumpThreading processing time.

There are also a few tuning parameters to JumpThreading you can play with (each cl::opt below is exposed as an opt flag, e.g. -jump-threading-threshold=N) to see if it helps strike a compromise between the quality of the emitted code and the time it takes to compile:

```cpp
static cl::opt<unsigned>
    BBDuplicateThreshold("jump-threading-threshold",
                         cl::desc("Max block size to duplicate for jump threading"),
                         cl::init(6), cl::Hidden);

static cl::opt<unsigned> ImplicationSearchThreshold(
    "jump-threading-implication-search-threshold",
    cl::desc("The number of predecessors to search for a stronger "
             "condition to use to thread over a weaker condition"),
    cl::init(3), cl::Hidden);
```

> Y'all can definitely feel free to do what you'd like with this bug, if you'd prefer to close it as "working as intended for now" that's totally fine!

This ticket is similar to bug 37929. I plan on leaving both open to see if others with more clout in the community decide whether the performance loss is more important than the added functionality of having dominance analysis within JumpThreading.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021