Arm64: Use csel and ccmp for conditional moves #67894
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details: Early prototype.
This is a continuation of #67286.
Current version now uses GT_COND_EQ etc. nodes. The following code gets correctly compiled:
However, when multiple blocks are merged together (e.g. "if (op1 > 0 && op2 > 5)"), the resulting code is wrong. I'm also increasingly aware that the lowering phase is the wrong place for this code, so I'm now in the process of writing a new "if conversion" phase. Given that the nodes will be in graph order at that point, I'm hoping my issues will be easier to fix.
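As a rough illustration of the two shapes being discussed (the method names and bodies here are my own sketch, not code from this PR), a single condition versus a merged condition:

```csharp
// Hypothetical sketch: the single-condition case compiles correctly with the
// GT_COND_EQ-style nodes, becoming a compare plus a conditional select.
static int Single(int op1)
{
    if (op1 > 0) { op1 = 5; }
    return op1;
}

// Hypothetical sketch: the merged-condition case ("op1 > 0 && op2 > 5")
// spans multiple basic blocks, which is where the lowering-based prototype
// produced wrong code.
static int Merged(int op1, int op2)
{
    if (op1 > 0 && op2 > 5) { op1 = 5; }
    return op1;
}
```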
Yes, it definitely has to be a higher-level phase; I'd introduce a general
Force-pushed from e057bce to a8546c4.
Reworked this to use an if conversion phase which plants a GT_SELECT node. A single condition can be optimised:
Second statement in an And is optimised:
Or conditions are not optimised:
Else cases are not optimised:
Still need to:
Also, the cases above that don't optimise will need CCMP instructions. My next step is to make sure that would work with the code I've added.
With the latest version, I've added some support for CCMP too. The reasoning for doing this (other than better code generation / performance) was that I wanted to be sure my SELECT code was extendable to CCMP. In the end, there was a lot of overlap in the code (e.g. with all the switches), so I ended up combining the patches. I can now fully build all the system libraries too. With this patch... A single condition can be optimised:
Multiple AND statements are optimised:
Or conditions are not optimised:
Else cases are not optimised:
Looking at the generated IR, it isn't that far away from being usable by the if convert pass. It's probably a good follow-on task from this.
We'll also need the profile data in order to do CSEL nodes inside loops (roughly: inside loops, CSEL is slower than branches if the branch is predictable; outside loops, CSEL is faster). Again, this feels like a good follow-on task from this.
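To make that heuristic concrete, here is a hand-written sketch (my own example, not from the PR) of a loop where a predictable branch can beat csel:

```csharp
// Hypothetical illustration: inside a hot loop, if the branch is almost always
// taken the same way, the branch predictor makes the branchy form cheap, while
// csel always pays for evaluating both sides and the data dependency.
static int SumPositives(int[] values)
{
    int sum = 0;
    foreach (int v in values)
    {
        // If 'values' is mostly positive, this branch is highly predictable,
        // so keeping the branch can be faster than converting to csel.
        if (v > 0)
        {
            sum += v;
        }
    }
    return sum;
}
```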
Note - there are a bunch of assertion failures using superpmi, so I'm working through those.
Fixed a bunch of assert failures running superpmi. Three unique failures still remain: The assert in lower is due to a compare being optimised into a const int.
// If conversion
//
DoPhase(this, PHASE_IF_CONVERSION, &Compiler::optIfConversion);
I am curious why you chose this rather early stage for this.
On the face of it, doing this early has the significant disadvantage of decanonicalizing the IR w.r.t. relops.
This needs doing after loop information is available. The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.
The phase before is loop unrolling. If conversion only modifies code outside of loops (for now), so it made sense to do this after loop unrolling to maximise the number of changes.
As far as I'm aware, other compilers do if conversion fairly early too.
The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.
Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.
I see you are using the "inside a loop" information as a performance heuristic currently; that should be in "good enough" shape for that even after the optimizer. We're using the loop table for a similar purpose much later for loop alignment.
Ultimately, though, the question is whether there are actual problems because of this, i.e. whether there are any regressions in the diffs. I suppose there aren't any?
Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.
ah, ok, maybe it can move down a bit then.
From a quick scan through the passes, I'm thinking this should be done before loop hoisting too.
if there are any regressions in the diffs. I suppose there aren't any?
I'm not certain either way yet. Certainly, the tests I've tried look good.
I've got three asserts left from "superpmi.py replay" to fix up (there's a comment elsewhere on here about that). Next step after that would be to do a full test build and run.
ah, ok, maybe it can move down a bit then.
Note my thinking with this is to avoid adding all the front-end support required (you've hit this problem with constant folding).
A quick scan through the passes, I'm thinking this should be done before loop hoisting too.
Well, it is currently done "before hoisting". It does not make sense to interject it between SSA and the optimizer phases I think (since it alters the flow graph). I was hoping we could slot it just after the optimizer (whether it'd be before lowering or after range check does not matter much I think, though making it LIR-only will make some things (morph support) unnecessary).
After a full test run of 2000+ tests, I get 11 failures, of which 3 are unique: 1) the constant folding issue I already know about, 2) a "pThread" assert, which I get without my patch (will ignore for now), and 3) a segfault - I'll debug this.
Question still remains if this is right place to have the phase. Looking at other compilers:
* LLVM if converts very early, and optimises every occurrence to select nodes. This is because removing the if/then branches makes later phases, especially vectorisation, much easier. Then at the end of the compilation, it restores if/then branches for certain cases.
* GCC if converts loops early to allow vectorisation. Scalar code is if converted at the end of compilation.
* OpenJDK if converts early. It does everything outside of loops, and for inside loops it uses historical branch taken / not taken counts.
Back to dotnet...
My original patch added if conversion after lowering, but the ability to reshape the graph was limited and didn't work for multiple conditions. @EgorBo agreed with this too.
As a test, I tried pushing if conversion down to just before lowering, and this seems to work in theory - I then get some later errors, which just require some debugging.
On AArch64 Ubuntu 18.04:
I'm not sure about:
Removing the draft status so this can get a review.
Force-pushed from 014d0d5 to 2340be1.
For some reason, I am seeing this pattern with your changes. (Left is
You can see the asmdiffs at https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=ms.vss-build-web.run-extensions-tab and can download a subset of the asm files from https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=artifacts&pathAsName=false&type=publishedArtifacts
A single condition can be optimised:

```
if (op1 > 0) { op1 = 5; }
```

```
IN0001: 000008 mov w2, #5
IN0002: 00000C cmp w0, #0
IN0003: 000010 csel w0, w0, w2, eq
```

Multiple AND statements are optimised:

```
if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
```

```
IN0001: 000008 cmp w2, #7
IN0002: 00000C ccmp w1, #10, z, lt
IN0003: 000010 ccmp w0, #3, nzc, ne
IN0004: 000014 mov w2, #5
IN0005: 000018 csel w0, w2, w0, gt
```

Or conditions are not optimised:

```
if (op1 > 3 || op2 == 10) { op1 = 9; }
```

```
IN0001: 000008 cmp w0, #3
IN0002: 00000C bhi G_M41752_IG04
G_M41752_IG03:
IN0003: 000010 cmp w1, #10
IN0004: 000014 bne G_M41752_IG05
G_M41752_IG04:
IN0005: 000018 mov w0, #9
G_M41752_IG05:
```

Else cases are not optimised:

```
if (op1 > 0) { op1 = 5; } else { op1 = 3; }
```

```
IN0001: 000008 cbz w0, G_M64063_IG04
G_M64063_IG03:
IN0002: 00000C mov w0, #5
IN0003: 000010 b G_M64063_IG05
G_M64063_IG04:
IN0004: 000014 mov w0, #3
G_M64063_IG05:
```
Force-pushed from 42d6833 to 84f2ef4.
Reverting the last "fix" as it was causing build failures. The PR is now latest HEAD + the code from a month ago (i.e. from here: #67894 (comment)).
Thanks! I will take a look once CI comes back with the results.
Looks like a lot of failures because of AV.
Sadly the new patch I just posted doesn't fix that, but it does fix an assert later on.
This patch should fix the crossgen2 build failures.
"Allow select nodes to constant propagate" - this commit fixes up the failures to optimise away compares where the condition is constant. However, the way the patch does it is bad. The problem was that the constant propagation was being skipped select nodes. This was due to setting GTF_DONT_CSE on the connected compare node. Once this flag is removed, if constant, the compare gets optimised to 1 or 0, and then the select node gets optimised to the true or false path. That's good. That change then introduces a new issue: removing GTF_DONT_CSE enables full CSE across the select nodes. What can then happen is a compare node is replaced with a different expression tree. In a later pass, the select will assert due to it's compare node not being a compare. For example:
becomes:
And node 138 causes an assert. The correct way to fix it would be for constant propagation to skip based off a new flag (GTF_DONT_CONST_PROP) instead of GTF_DONT_CSE. However, there are 105 instances of GTF_DONT_CSE in the code which would then potentially need fixing up. To keep things simple, for now I introduced GTF_DO_CONST_PROP, which acts as an override for GTF_DONT_CSE when constant propagating. I fully expect to have to tidy this up before merging. If this PR gets split into multiple smaller pieces, then this is a definite candidate for its own PR. For now it's here so that we can see any performance implications of the full PR.
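As a conceptual sketch of the folding this enables (C#-flavoured pseudocode using my own assumed node shapes, not the JIT's actual GenTree code): once the compare feeding a select has been constant-propagated to 0 or 1, the select can collapse to one of its arms.

```csharp
// Conceptual sketch only: fold select(cond, whenTrue, whenFalse) once the
// condition has become a known constant.
abstract record Node;
record ConstNode(int Value) : Node;
record SelectNode(Node Cond, Node WhenTrue, Node WhenFalse) : Node;

static class SelectFolding
{
    public static Node Fold(Node cond, Node whenTrue, Node whenFalse)
    {
        // After constant propagation the compare may have become a 0/1
        // constant, in which case the select is replaced by the taken arm.
        if (cond is ConstNode c)
        {
            return c.Value != 0 ? whenTrue : whenFalse;
        }

        // Otherwise keep the select as-is.
        return new SelectNode(cond, whenTrue, whenFalse);
    }
}
```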
I ran superpmi asmdiff on this patch and went through the results. 916 tests had code gen differences.
In general, each use of csel or ccmp will reduce the size of the code by 1 instruction. 👍 Why the increases in code? Well, if that code was a compare against 0 then it was already optimised with cbnz. Switching to csel causes one additional instruction. Before my patch:
with my patch:
Although the second block is longer, it will perform better (assuming the chance of the branch being taken is random). My next steps are to fix the failing tests in CI (not sure yet if any are caused by my patch). Are there any obvious performance tests that can be run on this patch?
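As a hedged illustration of the compare-against-zero case described above (my own example, not the actual diff):

```csharp
// Hypothetical example: before the patch this shape compiles to a cbz/cbnz
// over a single mov (two instructions); with the patch it becomes
// cmp + mov + csel (three instructions), one longer but with no branch.
static int CompareAgainstZero(int op1)
{
    if (op1 != 0) { op1 = 5; }
    return op1;
}
```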
Thanks for doing this. I do see diffs in x64...can you double check why that is the case?
That's fine with me given that C++ does it too. https://godbolt.org/z/jTa14qTnf

windows/arm64 benchmarks collection: One thing that concerns me is the extra instructions we would end up executing for one of the branches. E.g. below, we would not execute

windows/arm64 libraries-crossgen2

linux/arm64 libraries-pmi

Here is another one.

windows/arm64 libraries-crossgen2

Any idea why there are extra instructions here?

linux/arm64 libraries-pmi

Same here:
I tried checking in the benchmarks collection, but I don't see any benchmarks that show an effect from this change. Just try to come up with a sample program and see if it is improved with this change.
At the moment this patch doesn't do that - it always assumes csel/ccmp is better. Generally, for a case with a random chance of the branch being taken, cmp, ldr, csel is better than cbz, ldr. So for the first case posted, I'd definitely go with the new csel version. For the other two cases, it looks like the cases are independent of each other. Essentially:
Every run of the code has to check every condition regardless of the previous result. So, in these cases again I'd go with the csel version. Where slowdowns will happen is if we get large chains of csel with multiple ccmps, e.g.:
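A hypothetical example of the kind of chain meant here (my own illustration, not the example from the original comment):

```csharp
// Hypothetical illustration: with csel/ccmp every one of these conditions is
// evaluated on every run, whereas the branchy form can stop at the first
// failing test.
static int Chain(int a, int b, int c, int d, int e)
{
    int result = 0;
    if (a > 1 && b > 2 && c > 3 && d > 4 && e > 5)
    {
        result = 7;
    }
    return result;
}
```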
If we're still unsure about the chains, then maybe remove generation of ccmps from this patch (but keep all the ccmp IR nodes and later phase checks, as it's all mostly common code with csel). Then later patches can slowly start adding uses of ccmp. Also, I'll check those other 2 cases - yes, it looks like there is another constant elimination case I'm missing.
"Redundant branch opts" is parsing through all the branches in the code. In the current HEAD, it gets to the JTRUE/NE node and decides that the branch must always happen due to matching dominators and VNs (I'm a little vague on the exact reasoning for the decision). So it deletes the block that gets jumped over. In my patch, the branch doesn't exist and the blocks have been merged together. So there's nothing for Redundant branch opts to do. To optimise the code away in my patch, we'd need something similar to the Redundant branch opts patch, but instead of iterating over the list of branches, it'd have to iterate over compare nodes (or select nodes). Maybe constant propagation could do that. For reference, this is JTRUE node 000309 being removed:
I wrote some benchmarks:
This was on an Altra. For "Single" it'll take the branch roughly 50% of the time, and the test runs 50% quicker with csel. For "And" the branch will be taken 25% of the time, dropping further for the other tests. Performance drops off, but is still quicker. The other tests do not yet optimise with this patch, but I've included them for the future.
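For reference, a minimal BenchmarkDotNet-style sketch of what a "Single" benchmark of this kind could look like (my reconstruction under assumptions; not the actual benchmark code used here):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Assumed shape: the branch is taken roughly 50% of the time on random data,
// which is the predictor-hostile case where csel should beat a branch.
public class CselBenchmarks
{
    private int[] _data = System.Array.Empty<int>();

    [GlobalSetup]
    public void Setup()
    {
        var rng = new System.Random(42);
        _data = new int[10_000];
        for (int i = 0; i < _data.Length; i++)
        {
            _data[i] = rng.Next(-100, 100);
        }
    }

    [Benchmark]
    public int Single()
    {
        int result = 0;
        foreach (int op1 in _data)
        {
            int x = op1;
            if (x > 0) { x = 5; }   // candidate for cmp + csel
            result += x;
        }
        return result;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<CselBenchmarks>();
}
```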
It just occurred to me that all these benchmarks validate csel performance in a loop. We have extracted it in
Split the lower parts of this code into a new PR: #71616
Since this is no longer in development, closing it.
Early prototype.
Fixes: #55364