
Arm64: Use csel and ccmp for conditional moves #67894

Closed
wants to merge 6 commits

Conversation

@a74nh (Contributor) commented Apr 12, 2022

Early prototype.

Fixes: #55364

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 12, 2022
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 12, 2022
@ghost commented Apr 12, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Early prototype.

Author: a74nh
Assignees: -
Labels: area-CodeGen-coreclr, community-contribution
Milestone: -

@a74nh a74nh marked this pull request as draft April 12, 2022 08:41
@a74nh (Contributor Author) commented Apr 12, 2022

This is a continuation of #67286.
I created a new pull request because the older one used main as the branch, which was causing me issues, and as I understand it there is no way to change it.

@a74nh (Contributor Author) commented Apr 12, 2022

The current version now uses GT_COND_EQ etc. nodes.

The following code gets correctly compiled:

static void TransformsGtSingle(uint op1, uint op2) {
    if (op1 > 0) {
        op1 = 5;
    }
    Consume(op1, op2);
}

IN0001: 000008                    cmp     w0, #0
IN0002: 00000C                    mov     w2, #5
IN0003: 000010                    csel    w0, w2, w0, ne
IN0004: 000014                    movz    x2, #0xbcd8
IN0005: 000018                    movk    x2, #0x7e7e LSL #16
IN0006: 00001C                    movk    x2, #0xffff LSL #32
IN0007: 000020                    ldr     x2, [x2]
IN0008: 000024                    blr     x2

However, when multiple blocks are merged together (e.g. "if (op1 > 0 && op2 > 5)"), the resulting code is wrong.

However, I'm increasingly aware that the lowering phase is the wrong place for this code, so I'm now in the process of writing a new "if conversion" phase. Given that the nodes will be in graph order at that point, I'm hoping my issues will be easier to fix.

@EgorBo (Member) commented Apr 12, 2022

Yes, it definitely has to be a higher-level phase. I'd introduce a general GT_SELECT node (similar to LLVM's select IR op) that we could also handle for x64 via cmov. Its operands should not have execution-order-related side effects.
Yet another phase that does a full BB walk has to provide good benefits in order to justify the JIT throughput regression (or at least the impact has to be measured).
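
As a sketch of the idea (mine, not an actual node definition from the JIT): GT_SELECT(cond, x, y) would carry the value semantics of C#'s conditional operator, with both value operands safe to evaluate unconditionally:

    // Branch-free selection: a single IR node that codegen can turn into
    // csel on Arm64 or cmov on x64. Because x and y are both evaluated,
    // neither may fault or have side effects.
    static uint Select(bool cond, uint x, uint y) => cond ? x : y;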
Some notes:

  1. We might need to fix "JIT: Expand GT_RETURN condition to GT_JTRUE" (#65370) in order to focus only on BBJ_JUMP blocks
  2. We don't need branchless operations when we have profile data (PGO) that states that one of the branches is always taken

@a74nh (Contributor Author) commented Apr 25, 2022

Reworked this to use an if conversion phase which plants a GT_SELECT node.

A single condition can be optimised:

    if (op1 > 0) {
        op1 = 5;
    }
    IN0001: 000008                    mov     w2, #5
    IN0002: 00000C                    cmp     w0, #0
    IN0003: 000010                    csel    w0, w0, w2, eq
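
For reference (my annotation, not part of the dump): csel keeps its first source when the condition holds and takes the second otherwise, so the sequence above computes:

    // cmp w0, #0; csel w0, w0, w2, eq  =>  w0 = (op1 == 0) ? op1 : 5,
    // which, for an unsigned op1, is exactly "if (op1 > 0) op1 = 5".
    static uint CselEquivalent(uint op1) => (op1 == 0) ? op1 : 5u;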

Second statement in an And is optimised:

    if (op1 > 0 && op2 > 5) {
        op1 = 5;
    }
    IN0001: 000008                    cbz     w0, G_M2158_IG04
    IN0002: 00000C                    mov     w2, #5
    IN0003: 000010                    cmp     w1, #5
    IN0004: 000014                    csel    w0, w0, w2, le
    G_M2158_IG04:

Or conditions are not optimised:

    if (op1 > 3 || op2 == 10) {
        op1 = 9;
    }
    IN0001: 000008                    cmp     w0, #3
    IN0002: 00000C                    bhi     G_M41752_IG04
    G_M41752_IG03:
    IN0003: 000010                    cmp     w1, #10
    IN0004: 000014                    bne     G_M41752_IG05
    G_M41752_IG04:
    IN0005: 000018                    mov     w0, #9
    G_M41752_IG05:

Else cases are not optimised:

    if (op1 > 0) {
        op1 = 5;
    } else {
        op1 = 3;
    }
    IN0001: 000008                    cbz     w0, G_M64063_IG04
    G_M64063_IG03:
    IN0002: 00000C                    mov     w0, #5
    IN0003: 000010                    b       G_M64063_IG05
    G_M64063_IG04:
    IN0004: 000014                    mov     w0, #3
    G_M64063_IG05:

@a74nh (Contributor Author) commented Apr 25, 2022

Still need to:

  • Fix some asserts when building the C# libraries
  • Add some proper unit tests and check performance
  • Check in more detail the two notes Egor added

Also, the cases above that don't optimise will need CCMP instructions. My next step is to make sure that would work with the code I've added.

@a74nh (Contributor Author) commented May 5, 2022

With the latest version, I've added some support for CCMP too. The reasoning for doing this (other than better code generation/performance) was that I wanted to be sure my SELECT code was extendable to CCMP. In the end, there was a lot of overlap in the code (e.g. in all the switches), so I ended up combining the patches.

I can now fully build all the system libraries too.

With this patch:

A single condition can be optimised:

if (op1 > 0) { op1 = 5; }
IN0001: 000008                    mov     w2, #5
IN0002: 00000C                    cmp     w0, #0
IN0003: 000010                    csel    w0, w0, w2, eq

Multiple AND statements are optimised:

if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
IN0001: 000008                    cmp     w2, #7
IN0002: 00000C                    ccmp    w1, #10, z, lt
IN0003: 000010                    ccmp    w0, #3, nzc, ne
IN0004: 000014                    mov     w2, #5
IN0005: 000018                    csel    w0, w2, w0, gt
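
For reference (my annotation): ccmp only performs its compare when its trailing condition holds; otherwise it loads the flags from the supplied immediate (z and nzc here), chosen so the final gt test fails. The chain above is therefore equivalent to:

    // cmp  w2, #7            flags = compare(op3, 7)
    // ccmp w1, #10, z, lt    if lt held: flags = compare(op2, 10); else Z=1, so gt fails
    // ccmp w0, #3, nzc, ne   if ne held: flags = compare(op1, 3);  else N=Z=C=1, so gt fails
    // csel w0, w2, w0, gt    op1 = gt ? 5 : op1
    static uint CcmpEquivalent(uint op1, uint op2, uint op3)
        => (op1 > 3 && op2 != 10 && op3 < 7) ? 5u : op1;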

Or conditions are not optimised:

if (op1 > 3 || op2 == 10) {
    op1 = 9;
}
IN0001: 000008                    cmp     w0, #3
IN0002: 00000C                    bhi     G_M41752_IG04
G_M41752_IG03:
IN0003: 000010                    cmp     w1, #10
IN0004: 000014                    bne     G_M41752_IG05
G_M41752_IG04:
IN0005: 000018                    mov     w0, #9
G_M41752_IG05:

Else cases are not optimised:

if (op1 > 0) {
    op1 = 5;
} else {
    op1 = 3;
}
IN0001: 000008                    cbz     w0, G_M64063_IG04
G_M64063_IG03:
IN0002: 00000C                    mov     w0, #5
IN0003: 000010                    b       G_M64063_IG05
G_M64063_IG04:
IN0004: 000014                    mov     w0, #3
G_M64063_IG05:

@a74nh (Contributor Author) commented May 5, 2022

  1. We might need to fix "JIT: Expand GT_RETURN condition to GT_JTRUE" (#65370) in order to focus only on BBJ_JUMP blocks

Looking at the generated IR, it isn't that far away from being usable by the if-conversion pass. It's probably a good follow-on task from this.

@a74nh (Contributor Author) commented May 5, 2022

2. We don't need branchless operations when we have profile data (PGO) that states that one of the branches is always taken

We'll also need the profile data in order to do CSEL nodes inside loops (roughly: inside loops, CSEL is slower than branches if the branch is predictable; outside loops, CSEL is faster). Again, this feels like a good follow-on task from this.
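
An illustration of the heuristic (my example, not from the PR):

    static int SumPositives(int[] data)
    {
        int sum = 0;
        foreach (int x in data)
        {
            // If data is mostly positive, the predictor learns this branch
            // and it becomes effectively free; a csel here would instead put
            // the compare-to-select latency on the loop's critical path
            // every iteration.
            if (x > 0)
                sum += x;
        }
        return sum;
    }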

@a74nh (Contributor Author) commented May 5, 2022

Note - there are a bunch of assertion failures using superpmi, so I'm working through those.

@a74nh (Contributor Author) commented May 6, 2022

Fixed a bunch of assert failures running superpmi.

Three unique failures still remain:

  • 3 × /home/alahay01/dotnet/runtime/src/coreclr/jit/gentree.h (2000)
  • 2 × /home/alahay01/dotnet/runtime/src/coreclr/jit/gentree.h (2904)
  • 11 × /home/alahay01/dotnet/runtime/src/coreclr/jit/lower.cpp (7034)

The assert in lower is due to a compare being optimised into a const int.
See the comments in morph.cpp.
I think there is a good argument to be made for GenTreeConditional nodes being a new GTK_CONDOP node type, and then GTK_SMPOP changed to be (GTK_UNOP | GTK_BINOP | GTK_CONDOP).
This would allow all the optimisations that work on standard compare nodes (GT_EQ etc) to be applied to conditional nodes (GT_CEQ etc).
I'm holding off on this change until someone else agrees it's a good idea, as it'll be quite a big change.

Comment on lines +4819 to +4826
// If conversion
//
DoPhase(this, PHASE_IF_CONVERSION, &Compiler::optIfConversion);
@SingleAccretion (Contributor) commented May 6, 2022

I am curious why you chose this rather early stage for this.

On the face of it, doing this early has the significant disadvantage of decanonicalizing the IR w.r.t. relops.

@a74nh (Contributor Author) replied:

This needs doing after loop information is available. The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.
The phase before is loop unrolling. If conversion only modifies code outside of loops (for now), so it made sense to do this after loop unrolling, to maximise the number of changes.
As far as I'm aware, other compilers also do if conversion fairly early.

@SingleAccretion (Contributor) replied:

The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.

Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.

I see you are using the "inside a loop" information as a performance heuristic currently, that should be in a "good enough" shape for that even after the optimizer. We're using the loop table for a similar purpose much later for loop alignment.

Ultimately, though, the question is if there are actual problems because of this, i.e. if there are any regressions in the diffs. I suppose there aren't any?

@a74nh (Contributor Author) replied:

Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.

Ah, ok, maybe it can move down a bit then.
From a quick scan through the passes, I'm thinking this should be done before loop hoisting too.

if there are any regressions in the diffs. I suppose there aren't any?

I'm not certain either way yet. Certainly, the tests I've tried look good.
I've got three asserts left from "superpmi.py replay" to fix up (there's a comment elsewhere on here about that). The next step after that is a full test build and run.

@SingleAccretion (Contributor) replied:

Ah, ok, maybe it can move down a bit then.

Note my thinking with this is to avoid adding all the front-end support required (you've hit this problem with constant folding).

From a quick scan through the passes, I'm thinking this should be done before loop hoisting too.

Well, it is currently done "before hoisting". It does not make sense to interject it between SSA and the optimizer phases, I think, since it alters the flow graph. I was hoping we could slot it just after the optimizer (whether that'd be before lowering or after range check does not matter much, I think, though making it LIR-only will make some things (morph support) unnecessary).

@a74nh (Contributor Author) replied:

After a full test run of 2000+ tests, I get 11 failures, of which 3 are unique: 1) the constant folding issue I already know about, 2) a "pThread" assert, which I get without my patch (will ignore for now), and 3) a segfault, which I'll debug.

The question still remains whether this is the right place for the phase. Looking at other compilers:

  • LLVM if converts very early, and optimises every occurrence to select nodes. This is because removing the if/then branches makes later phases, especially vectorisation, much easier. Then, at the end of compilation, it restores if/then branches for certain cases.
  • GCC if converts loops early to allow vectorisation. Scalar code is if converted at the end of compilation.
  • OpenJDK if converts early. It does everything outside of loops, and for inside loops uses historical branch taken/not-taken counts.

Back to dotnet...
My original patch added if conversion after lowering, but the ability to reshape the graph was limited and didn't work for multiple conditions. @EgorBo agreed with this too.
As a test, I tried pushing if conversion down to just before lowering, and this seems to work in theory; I then get some later errors, which just require some debugging.

@a74nh (Contributor Author) commented May 18, 2022

On AArch64 Ubuntu 18.04:

  • "Superpmi.py replay" passes
  • "./src/tests/run.sh Checked" passes (except for the 5 failures I get without my patch - asserts with "pThread")

I'm not sure about:

  • whether there are additional tests I should be running.
  • what performance tests can be run.
  • what new tests should be added, and where they should be added.

Removing the draft status so this can get a review.

@a74nh a74nh changed the title WIP: Use csel for conditional moves AArch64: Use csel and ccmp for conditional moves May 18, 2022
@a74nh a74nh changed the title AArch64: Use csel and ccmp for conditional moves Arm64: Use csel and ccmp for conditional moves May 18, 2022
@a74nh a74nh marked this pull request as ready for review May 18, 2022 08:10
@a74nh a74nh force-pushed the github_a74nh_csel2 branch 3 times, most recently from 014d0d5 to 2340be1 on May 19, 2022 10:44
@kunalspathak (Member) commented:

For some reason, I am seeing this pattern with your changes (left is main and right is the PR branch). Could it be that assertion prop is not eliminating them?

[screenshot: asm diff]

You can see the asmdiffs at https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=ms.vss-build-web.run-extensions-tab and can download subset of the asm files from https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=artifacts&pathAsName=false&type=publishedArtifacts

[screenshot: asm diff]

@a74nh a74nh force-pushed the github_a74nh_csel2 branch 2 times, most recently from 42d6833 to 84f2ef4 on June 21, 2022 13:48
@a74nh (Contributor Author) commented Jun 21, 2022

Reverting the last "fix" as it was causing build failures.

The PR is now latest HEAD + the code from a month ago (i.e. from here: #67894 (comment))

@kunalspathak (Member) commented:

Reverting the last "fix" as it was causing build failures.

PR is now latest HEAD + the code from a month ago (ie from here #67894 (comment))

Thanks! I will take a look once CI comes back with the results.

@kunalspathak (Member) commented:

Looks like a lot of failures because of AV.

@a74nh (Contributor Author) commented Jun 23, 2022

Looks like a lot of failures because of AV.

Sadly, the new patch I just posted doesn't fix that, but it does fix an assert later on.

@a74nh (Contributor Author) commented Jun 23, 2022

This patch should fix the crossgen2 build failures.

@a74nh (Contributor Author) commented Jun 27, 2022

"Allow select nodes to constant propagate" - this commit fixes up the failures to optimise away compares where the condition is constant. However, the way the patch does it is bad.

The problem was that constant propagation was skipping select nodes. This was due to GTF_DONT_CSE being set on the connected compare node. Once this flag is removed, if the condition is constant, the compare gets optimised to 1 or 0, and then the select node gets optimised to the true or false path. That's good.

That change then introduces a new issue: removing GTF_DONT_CSE enables full CSE across select nodes. What can then happen is that a compare node is replaced with a different expression tree. In a later pass, the select will assert due to its compare node not being a compare. For example:

N006 ( 10,  8)              [000122] -----------                         \--*  SELECT    int    $181
N003 (  8,  5) CSE #03 (use)[000119] N----------                            +--*  NE        int    <l:$209, c:$20a>
N001 (  3,  2)              [000120] -----------                            |  +--*  LCL_VAR   int    V01 loc0         u:2 <l:$209, c:$20a>
N002 (  1,  2)              [000121] -----------                            |  \--*  CNS_INT   int    0 $c0
N004 (  1,  2)              [000088] -----------                            +--*  CNS_INT   int    255 $c6
N005 (  1,  1)              [000118] -----------                            \--*  LCL_VAR   int    V02 loc1         u:2 (last use) $c0

becomes:

N004 (  5,  5) [000122] -----------                         \--*  SELECT    int    $181
N001 (  3,  2) [000138] -----------                            +--*  LCL_VAR   int    V05 cse0         u:1 <l:$209, c:$20a>
N002 (  1,  2) [000088] -----------                            +--*  CNS_INT   int    255 $c6
               [000160] -----------                            \--*  CNS_INT   int    0 $c0

And node 138 causes an assert.

The correct way to fix it would be for constant propagation to skip nodes based on a new flag (GTF_DONT_CONST_PROP) instead of GTF_DONT_CSE. However, there are 105 instances of GTF_DONT_CSE in the code which would then potentially need fixing up.

To keep things simple, for now I introduced GTF_DO_CONST_PROP, which acts as an override for GTF_DONT_CSE when constant propagating.

I fully expect to have to tidy this up before merging. If this PR gets split into multiple smaller pieces, then this is a definite candidate for its own PR.

For now it's here so that we can see any performance implications of the full PR.
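
As a source-level sketch of the kind of folding this enables (illustrative only; the failing trees above come from library code, and Clamp/Caller are hypothetical names):

    using System.Runtime.CompilerServices;

    static class Example
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        static int Clamp(int v, bool saturate) => saturate ? 255 : v;

        // After inlining, 'saturate' is the constant true: the compare
        // feeding the SELECT folds to a constant, and the select should
        // collapse to its true arm (255) rather than emitting a dead csel.
        static int Caller(int v) => Clamp(v, true);
    }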

@a74nh (Contributor Author) commented Jun 28, 2022

I ran superpmi asmdiff on this patch and went through the results.

916 tests had code gen differences.
Of those tests there are:

  • 177 tests with fewer instructions generated. 👍
  • 395 tests had the same number of instructions
  • 344 tests had more instructions generated 👎
  • In total across all tests, there were 178 additional instructions. 👎
  • Changes broke down into:
    • 1039 uses of csel
    • 1 use of ccmp eq
    • 16 uses of ccmp ne
    • 40 uses of ccmp lt
    • No uses of ccmp {le,ge,gt}

In general, each use of csel or ccmp will reduce the size of the code by 1 instruction. 👍

Why the increase in code size? If the code was a compare against 0, it was already optimised with cbnz; switching to csel costs one additional instruction. Before my patch:

            cbnz    x0, G_M55128_IG04
            ldr     x0, [x0]
G_M55128_IG04: 

with my patch:

            cmp     x0, #0
            ldr     x1, [x1]
            csel    x0, x1, x0, eq

Although the second block is longer, it will perform better (assuming the chance of the branch being taken is random).
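
As rough arithmetic (my numbers, not measured here): a mispredicted branch typically costs on the order of 10-20 cycles on a modern out-of-order core, so a 50/50-random branch averages perhaps 5-10 cycles of penalty per execution, while cmp + csel adds only a cycle or two of dependent latency. The one extra instruction is cheap by comparison; the trade-off only reverses when the branch is well predicted.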

My next steps are to fix the failing tests in CI (not sure yet if any are caused by my patch).

Are there any obvious performance tests that can be run on this patch?

@kunalspathak (Member) commented:

I ran superpmi asmdiff on this patch and went through the results.

Thanks for doing this. I do see diffs in x64... can you double-check why that is the case?

Although the second block is longer, it will perform better (assuming the chance of the branch being taken is random).

That's fine with me, given that C++ does it too: https://godbolt.org/z/jTa14qTnf

windows/arm64 benchmarks collection : 4316.dasm

One thing that concerns me is the extra instructions we would end up executing for one of the branches. E.g. below, we would not execute the ldrh if w0 == 0, but now we would. Did you take into account how expensive it is to compute the values for both branches when deciding whether to use csel?

[screenshot: asm diff]

windows/arm64 libraries-crossgen2 174906.dasm

[screenshot: asm diff]

linux/arm64 libraries-pmi 188539.dasm

Here is another one.

[screenshot: asm diff]

windows/arm64 libraries-crossgen2 126535.dasm

Any idea why there are extra instructions here?

[screenshot: asm diff]

linux/arm64 libraries-pmi 158360.dasm

Same here:

[screenshot: asm diff]

Are there any obvious performance tests that can be run on this patch?

I tried checking the benchmarks collection, but I don't see any benchmarks affected by this change. Just try to come up with a sample program and see if it is improved with this change.

@a74nh (Contributor Author) commented Jun 29, 2022

One thing that concerns me is the extra instructions we would end up executing for one of the branches. E.g. below, we would not execute the ldrh if w0 == 0, but now we would. Did you take into account how expensive it is to compute the values for both branches when deciding whether to use csel?

At the moment this patch doesn't do that - it always assumes csel/ccmp is better.

Generally, for a case with a random chance of the branch being taken, cmp; ldr; csel is better than cbz; ldr. So for the first case posted, I'd definitely go with the new csel version.

For the other two cases, it looks like the conditions are independent of each other. Essentially:

if (x==5) { y&=n1; }
if (x==6) { y&=n2; }
if (x==7) { y&=n3; }
if (x==8) { y&=n4; }  etc

Every run of the code has to check every condition regardless of the previous result. So, in these cases again I'd go with the csel version.

Where slowdowns will happen is with large chains of csel and multiple ccmps, e.g.:
if (a[0] && b[0] && c[0] && d[0] && e[0] && f[0]) {y[0]=x[0]}
would produce something like:
ldr, cmp, ldr, ccmp, ldr, ccmp, ldr, ccmp, ldr, ccmp, ldr, ldr, ldr, csel
As the chain gets longer, the additional loads will eventually start slowing things down, and it becomes better to use branches. I think LLVM puts at most three items in a chain (but it's probably a little more subtle than that). However, we're not really getting many chains yet - probably because we only chain && statements. This'll become more important if/when I add || statements and else statements into the chaining.

@a74nh (Contributor Author) commented Jun 29, 2022

If we're still unsure about the chains, then maybe remove generation of ccmps from this patch (but keep all the ccmp IR nodes and later phase checks, as it's mostly common code with csel). Later patches can then slowly start adding uses of ccmp.

Also, I'll check those other 2 cases - yes, it looks like there is another constant elimination case I'm missing.

@a74nh (Contributor Author) commented Jun 29, 2022

Any idea, why there are extra instructions here?

"Redundant branch opts" is parsing through all the branches in the code. In the current HEAD, it gets to the JTRUE/NE node and decides that the branch must always happen due to matching dominators and VNs (I'm a little vague on the exact reasoning for the decision). So it deletes the block that gets jumped over.

In my patch, the branch doesn't exist and the blocks have been merged together. So there's nothing for Redundant branch opts to do.

To optimise the code away in my patch, we'd need something similar to the redundant branch opts pass, but instead of iterating over the list of branches it'd have to iterate over compare nodes (or select nodes). Maybe constant propagation could do that.

For reference, this is JTRUE node 000309 being removed:

Dominator BB01 of BB07 has relop with reversed liberal VN
N009 ( 15, 12) [000022] J------N---                         *  NE        int    <l:$246, c:$247>
N007 ( 13,  9) [000020] -----------                         +--*  OR        int    <l:$244, c:$245>
N003 (  6,  4) [000016] -----------                         |  +--*  EQ        int    <l:$240, c:$241>
N001 (  1,  1) [000014] -----------                         |  |  +--*  LCL_VAR   ref    V01 loc0         u:1 <l:$208, c:$83>
N002 (  1,  2) [000015] -----------                         |  |  \--*  CNS_INT   ref    null $VN.Null
N006 (  6,  4) [000019] -----------                         |  \--*  EQ        int    <l:$242, c:$243>
N004 (  1,  1) [000017] -----------                         |     +--*  LCL_VAR   ref    V02 loc1         u:1 <l:$216, c:$87>
N005 (  1,  2) [000018] -----------                         |     \--*  CNS_INT   ref    null $VN.Null
N008 (  1,  2) [000021] -----------                         \--*  CNS_INT   int    0 $40
 Redundant compare; current relop:
N003 (  3,  4) [000050] J------N---                         *  NE        int    <l:$248, c:$249>
N001 (  1,  1) [000048] -----------                         +--*  LCL_VAR   ref    V10 tmp7         u:1 (last use) <l:$208, c:$83>
N002 (  1,  2) [000049] -----------                         \--*  CNS_INT   ref    null $VN.Null

optRedundantRelop in BB04; jump tree is
N004 (  5,  6) [000309] -----------                         *  JTRUE     void   $VN.Void
N003 (  3,  4) [000308] J------N---                         \--*  NE        int    <l:$24b, c:$24c>
N001 (  1,  1) [000161] -----------                            +--*  LCL_VAR   ref    V02 loc1         u:1 <l:$216, c:$87>
N002 (  1,  2) [000307] -----------                            \--*  CNS_INT   ref    null $VN.Null
 ... checking previous tree
N006 (  4,  3) [000322] -A-XG---R--                         *  ASG       ref    $233
N005 (  1,  1) [000321] D------N---                         +--*  LCL_VAR   ref    V28 tmp25        d:1 $VN.Void
N004 (  4,  3) [000297] ---XG------                         \--*  IND       ref    <l:$40d, c:$40e>
N003 (  3,  4) [000506] -------N---                            \--*  ADD       byref  $151
N001 (  1,  1) [000162] -----------                               +--*  LCL_VAR   ref    V00 arg0         u:1 $80
N002 (  1,  2) [000505] -----------                               \--*  CNS_INT   long   8 field offset Fseq[_ats] $104
 -- prev tree VN is not related
 ... checking previous tree
N004 ( 17, 15) [000168] -AC-----R--                         *  ASG       ref    $VN.Void
N003 (  1,  1) [000167] D------N---                         +--*  LCL_VAR   ref    V17 tmp14        d:1 $VN.Void
N002 ( 17, 15) [000166] --C--------                         \--*  CALL help ref    HELPER.CORINFO_HELP_NEWSFAST $40b
N001 (  3, 12) [000165] H---------- arg0 in x0                 \--*  CNS_INT(h) long   0x7f5877cc00 class $18b
 -- prev tree has side effects and is not next to jumpTree
Inferring predicate value from OR

Dominator BB01 of BB04 has relop with reversed liberal VN
N009 ( 15, 12) [000022] J------N---                         *  NE        int    <l:$246, c:$247>
N007 ( 13,  9) [000020] -----------                         +--*  OR        int    <l:$244, c:$245>
N003 (  6,  4) [000016] -----------                         |  +--*  EQ        int    <l:$240, c:$241>
N001 (  1,  1) [000014] -----------                         |  |  +--*  LCL_VAR   ref    V01 loc0         u:1 <l:$208, c:$83>
N002 (  1,  2) [000015] -----------                         |  |  \--*  CNS_INT   ref    null $VN.Null
N006 (  6,  4) [000019] -----------                         |  \--*  EQ        int    <l:$242, c:$243>
N004 (  1,  1) [000017] -----------                         |     +--*  LCL_VAR   ref    V02 loc1         u:1 <l:$216, c:$87>
N005 (  1,  2) [000018] -----------                         |     \--*  CNS_INT   ref    null $VN.Null
N008 (  1,  2) [000021] -----------                         \--*  CNS_INT   int    0 $40
 Redundant compare; current relop:
N003 (  3,  4) [000308] J------N---                         *  NE        int    <l:$24b, c:$24c>
N001 (  1,  1) [000161] -----------                         +--*  LCL_VAR   ref    V02 loc1         u:1 <l:$216, c:$87>
N002 (  1,  2) [000307] -----------                         \--*  CNS_INT   ref    null $VN.Null
Fall through successor BB02 of BB01 reaches, relop [000308] must be true

Redundant branch opt in BB04:

removing useless STMT00081 ( INL21 @ 0x000[E-] ... ??? ) <- INL19 @ 0x006[E-] <- INLRT @ ???
N004 (  5,  6) [000309] -----------                         *  JTRUE     void   $VN.Void
N003 (  3,  4) [000308] -----------                         \--*  CNS_INT   int    1
 from BB04

Conditional folded at BB04
BB04 becomes a BBJ_ALWAYS to BB06
optRedundantBranch removed tree:
N004 (  5,  6) [000309] -----------                         *  JTRUE     void   $VN.Void
N003 (  3,  4) [000308] -----------                         \--*  CNS_INT   int    1

@a74nh (Contributor Author) commented Jul 1, 2022

I wrote some benchmarks:

|      Method |        Job |                                                                                                 Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|------------ |----------- |---------------------------------------------------------------------------------------------------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
|      Single | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 66.65 us | 0.019 us | 0.015 us | 66.65 us | 66.63 us | 66.69 us |  1.00 |            Base |         - |          NA |
|      Single | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 34.01 us | 0.010 us | 0.008 us | 34.01 us | 34.00 us | 34.03 us |  0.51 |          Faster |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|         And | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 59.77 us | 0.021 us | 0.018 us | 59.76 us | 59.74 us | 59.81 us |  1.00 |            Base |         - |          NA |
|         And | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 39.42 us | 0.221 us | 0.172 us | 39.36 us | 39.35 us | 39.95 us |  0.66 |          Faster |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|      AndAnd | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 53.99 us | 0.034 us | 0.030 us | 53.99 us | 53.96 us | 54.06 us |  1.00 |            Base |         - |          NA |
|      AndAnd | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 46.49 us | 0.027 us | 0.021 us | 46.48 us | 46.45 us | 46.53 us |  0.86 |          Faster |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|   AndAndAnd | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 54.13 us | 0.034 us | 0.028 us | 54.12 us | 54.10 us | 54.18 us |  1.00 |            Base |         - |          NA |
|   AndAndAnd | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 50.96 us | 0.007 us | 0.006 us | 50.96 us | 50.95 us | 50.97 us |  0.94 |          Faster |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|          Or | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 73.78 us | 0.017 us | 0.013 us | 73.78 us | 73.77 us | 73.81 us |  1.00 |            Base |         - |          NA |
|          Or | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 73.29 us | 0.104 us | 0.082 us | 73.27 us | 73.24 us | 73.54 us |  0.99 |            Same |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|        OrOr | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 84.34 us | 0.026 us | 0.022 us | 84.34 us | 84.32 us | 84.39 us |  1.00 |            Base |         - |          NA |
|        OrOr | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 84.26 us | 0.025 us | 0.019 us | 84.26 us | 84.23 us | 84.29 us |  1.00 |            Same |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|       AndOr | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 81.88 us | 0.138 us | 0.115 us | 81.89 us | 81.71 us | 82.14 us |  1.00 |            Base |         - |          NA |
|       AndOr | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 78.53 us | 0.038 us | 0.032 us | 78.53 us | 78.48 us | 78.60 us |  0.96 |          Faster |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
| SingleArray | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 80.16 us | 0.049 us | 0.038 us | 80.15 us | 80.11 us | 80.23 us |  1.00 |            Base |         - |          NA |
| SingleArray | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 80.43 us | 0.025 us | 0.020 us | 80.43 us | 80.40 us | 80.48 us |  1.00 |            Same |         - |          NA |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|    AndArray | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 78.59 us | 0.020 us | 0.016 us | 78.59 us | 78.57 us | 78.63 us |  1.00 |            Base |       1 B |        1.00 |
|    AndArray | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 78.94 us | 0.144 us | 0.113 us | 78.91 us | 78.89 us | 79.30 us |  1.00 |            Same |         - |        0.00 |
|             |            |                                                                                                           |          |          |          |          |          |          |       |                 |           |             |
|     OrArray | Job-MLHDCM |    /base_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 77.67 us | 0.014 us | 0.012 us | 77.67 us | 77.65 us | 77.69 us |  1.00 |            Base |         - |          NA |
|     OrArray | Job-XQDIND | /runtime_src/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 77.65 us | 0.147 us | 0.115 us | 77.62 us | 77.59 us | 78.01 us |  1.00 |            Same |         - |          NA |

This was on an Altra.

For "Single" it'll take the branch roughly 50% of the time, and the test runs 50% quicker with csel.

For "And" the branch will be take 25% of the time, and then dropping further for the other tests. Performance drops off, but is still quicker.

The other tests do not yet optimise with this patch, but I've included them for the future.

@kunalspathak (Member) commented:

and the test runs 50% quicker with csel.

It just occurred to me that all these benchmarks validate csel performance in a loop. We have extracted the work into SingleInner and marked it NoInline, so the logic sees the usage as outside of a loop and we generate csel. But in a real-world scenario we might not get this (or any) performance improvement, because we are not optimizing cases that are inside loops. Of course, if the if conditions are in a hot method, then csel will improve its performance, so it is still a good thing to have.
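
For context, the benchmark shape being described is roughly this (my reconstruction, assuming BenchmarkDotNet; apart from SingleInner and NoInlining, the names and bodies are illustrative):

    using System;
    using System.Runtime.CompilerServices;
    using BenchmarkDotNet.Attributes;

    public class CselBenchmarks
    {
        private uint[] _data;

        [GlobalSetup]
        public void Setup()
        {
            var rng = new Random(42);
            _data = new uint[10_000];
            for (int i = 0; i < _data.Length; i++)
                _data[i] = (uint)rng.Next();
        }

        // NoInlining hoists the 'if' out of the benchmark's loop body, so
        // the if-conversion phase sees it as non-loop code and emits csel.
        [MethodImpl(MethodImplOptions.NoInlining)]
        private static uint SingleInner(uint op1)
        {
            if (op1 % 2 == 0)   // taken ~50% of the time with random data
                op1 = 5;
            return op1;
        }

        [Benchmark]
        public uint Single()
        {
            uint sum = 0;
            for (int i = 0; i < _data.Length; i++)
                sum += SingleInner(_data[i]);
            return sum;
        }
    }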

@a74nh (Contributor Author) commented Jul 4, 2022

Split the lower parts of this code into a new PR: #71616

@kunalspathak (Member) commented:

Since this is no longer in development, closing it.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 31, 2022