Arm64: Use csel and ccmp for conditional moves #67894
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details: Early prototype.
This is a continuation of #67286.
Current version now uses GT_COND_EQ etc. nodes. The following code gets correctly compiled:
However, when multiple blocks are merged together (e.g. "if (op1 > 0 && op2 > 5)"), the resulting code is wrong. I'm also increasingly aware that the lowering phase is the wrong place for this code, so I'm now in the process of writing a new "if conversion" phase. Given that the nodes will be in graph order at that point, I'm hoping my issues will be easier to fix.
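As a rough illustration of the two shapes being discussed (the method names and bodies here are my own sketch, not code from this PR), a single condition versus a merged condition:

```csharp
// Hypothetical sketch: the single-condition case compiles correctly with the
// GT_COND_EQ-style nodes, becoming a compare plus a conditional select.
static int Single(int op1)
{
    if (op1 > 0) { op1 = 5; }
    return op1;
}

// Hypothetical sketch: the merged-condition case ("op1 > 0 && op2 > 5")
// spans multiple basic blocks, which is where the lowering-based prototype
// produced wrong code.
static int Merged(int op1, int op2)
{
    if (op1 > 0 && op2 > 5) { op1 = 5; }
    return op1;
}
```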
Yes, it definitely has to be a higher-level phase; I'd introduce a general
Force-pushed from e057bce to a8546c4.
Reworked this to use an if conversion phase which plants a GT_SELECT node. A single condition can be optimised:
Second statement in an And is optimised:
Or conditions are not optimised:
Else cases are not optimised:
Still need to:
Also, the cases above that don't optimise will need CCMP instructions. My next step is to make sure that would work with the code I've added.
With the latest version, I've added some support for CCMP too. The reasoning for doing this (other than better code generation / performance) was that I wanted to be sure my SELECT code was extendable to CCMP. In the end, there was a lot of overlap in the code (e.g. with all the switches), so I ended up combining the patches. I can now fully build all the system libraries too. With this patch... A single condition can be optimised:
Multiple AND statements are optimised:
Or conditions are not optimised:
Else cases are not optimised:
Looking at the generated IR, it isn't that far away from being usable by the if convert pass. It's probably a good follow-on task from this.
We'll also need the profile data in order to do CSEL nodes inside loops (roughly: inside loops, CSEL is slower than branches if the branch is predictable; outside loops, CSEL is faster). Again, this feels like a good follow-on task from this.
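To make that heuristic concrete, here is a hand-written sketch (my own example, not from the PR) of a loop where a predictable branch can beat csel:

```csharp
// Hypothetical illustration: inside a hot loop, if the branch is almost always
// taken the same way, the branch predictor makes the branchy form cheap, while
// csel always pays for evaluating both sides and the data dependency.
static int SumPositives(int[] values)
{
    int sum = 0;
    foreach (int v in values)
    {
        // If 'values' is mostly positive, this branch is highly predictable,
        // so keeping the branch can be faster than converting to csel.
        if (v > 0)
        {
            sum += v;
        }
    }
    return sum;
}
```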
Note - there are a bunch of assertion failures using superpmi, so I'm working through those.
Fixed a bunch of assert failures running superpmi. Three unique failures still remain: The assert in lower is due to a compare being optimised into a const int.
// If conversion
//
DoPhase(this, PHASE_IF_CONVERSION, &Compiler::optIfConversion);
I am curious why you chose this rather early stage for this.
On the face of it, doing this early has the significant disadvantage of decanonicalizing the IR w.r.t. relops.
This needs doing after loop information is available. The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.
The phase before is loop unrolling. If conversion only modifies code outside of loops (for now), so it made sense to do this after loop unrolling to maximise the number of changes.
As far as I'm aware, other compilers do if conversion fairly early too.
The next phase is PHASE_CLEAR_LOOP_INFO, so I wanted to go before that.
Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.
I see you are using the "inside a loop" information as a performance heuristic currently; that should be in "good enough" shape for that even after the optimizer. We're using the loop table for a similar purpose much later for loop alignment.
Ultimately, though, the question is whether there are actual problems because of this, i.e. whether there are any regressions in the diffs. I suppose there aren't any?
Note that PHASE_CLEAR_LOOP_INFO doesn't actually clear much, only information used by cloning.
ah, ok, maybe it can move down a bit then.
From a quick scan through the passes, I'm thinking this should be done before loop hoisting too.
if there are any regressions in the diffs. I suppose there aren't any?
I'm not certain either way yet. Certainly, the tests I've tried look good.
I've got three asserts left from "superpmi.py replay" to fix up (there's a comment elsewhere on here about that). Next step after that would be to do a full test build and run.
ah, ok, maybe it can move down a bit then.
Note my thinking with this is to avoid adding all the front-end support required (you've hit this problem with constant folding).
A quick scan through the passes, I'm thinking this should be done before loop hoisting too.
Well, it is currently done "before hoisting". It does not make sense to interject it between SSA and the optimizer phases I think (since it alters the flow graph). I was hoping we could slot it just after the optimizer (whether it'd be before lowering or after range check does not matter much I think, though making it LIR-only will make some things (morph support) unnecessary).
After a full test run of 2000+ tests, I get 11 failures, of which 3 are unique: 1) the constant folding issue I already know about, 2) a "pThread" assert, which I get without my patch (will ignore for now), and 3) a segfault - I'll debug this.
Question still remains if this is right place to have the phase. Looking at other compilers:
* LLVM if converts very early, and optimises every occurrence to select nodes. This is because removing the if/then branches makes later phases, especially vectorisation, much easier. Then at the end of the compilation, it restores if/then branches for certain cases.
* GCC if converts loops early to allow vectorisation. Scalar code is if converted at the end of compilation.
* OpenJDK if converts early. It does everything outside of loops, and for inside loops it uses historical branch taken / not taken counts.
Back to dotnet...
My original patch added if conversion after lowering, but the ability to reshape the graph was limited and didn't work for multiple conditions. @EgorBo agreed with this too.
As a test, I tried pushing if conversion down to just before lowering, and this seems to work in theory - I then get some later errors, which just require some debugging.
On AArch64 Ubuntu 18.04:
I'm not sure about:
Removing the draft status so this can get a review.
Force-pushed from 014d0d5 to 2340be1.
For some reason, I am seeing this pattern with your changes. (Left is
You can see the asmdiffs at https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=ms.vss-build-web.run-extensions-tab and can download a subset of the asm files from https://dev.azure.com/dnceng/public/_build/results?buildId=1779370&view=artifacts&pathAsName=false&type=publishedArtifacts
A single condition can be optimised:

```
if (op1 > 0) { op1 = 5; }
```

```
IN0001: 000008 mov w2, #5
IN0002: 00000C cmp w0, #0
IN0003: 000010 csel w0, w0, w2, eq
```

Multiple AND statements are optimised:

```
if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
```

```
IN0001: 000008 cmp w2, #7
IN0002: 00000C ccmp w1, #10, z, lt
IN0003: 000010 ccmp w0, #3, nzc, ne
IN0004: 000014 mov w2, #5
IN0005: 000018 csel w0, w2, w0, gt
```

Or conditions are not optimised:

```
if (op1 > 3 || op2 == 10) { op1 = 9; }
```

```
IN0001: 000008 cmp w0, #3
IN0002: 00000C bhi G_M41752_IG04
G_M41752_IG03:
IN0003: 000010 cmp w1, #10
IN0004: 000014 bne G_M41752_IG05
G_M41752_IG04:
IN0005: 000018 mov w0, #9
G_M41752_IG05:
```

Else cases are not optimised:

```
if (op1 > 0) { op1 = 5; } else { op1 = 3; }
```

```
IN0001: 000008 cbz w0, G_M64063_IG04
G_M64063_IG03:
IN0002: 00000C mov w0, #5
IN0003: 000010 b G_M64063_IG05
G_M64063_IG04:
IN0004: 000014 mov w0, #3
G_M64063_IG05:
```
Force-pushed from 42d6833 to 84f2ef4.
Reverting the last "fix" as it was causing build failures. The PR is now latest HEAD + the code from a month ago (i.e. from here: #67894 (comment)).
Thanks! I will take a look once CI comes back with the results.
Looks like a lot of failures because of AV.
Sadly the new patch I just posted doesn't fix that, but it does fix an assert later on.
This patch should fix the crossgen2 build failures.
"Allow select nodes to constant propagate" - this commit fixes up the failures to optimise away compares where the condition is constant. However, the way the patch does it is bad. The problem was that the constant propagation was being skipped select nodes. This was due to setting GTF_DONT_CSE on the connected compare node. Once this flag is removed, if constant, the compare gets optimised to 1 or 0, and then the select node gets optimised to the true or false path. That's good. That change then introduces a new issue: removing GTF_DONT_CSE enables full CSE across the select nodes. What can then happen is a compare node is replaced with a different expression tree. In a later pass, the select will assert due to it's compare node not being a compare. For example:
becomes:
And node 138 causes an assert. The correct way to fix it would be for constant propagation to skip based off a new flag (GTF_DONT_CONST_PROP) instead of GTF_DONT_CSE. However, there are 105 instances of GTF_DONT_CSE in the code which would then potentially need fixing up. To keep things simple, for now I introduced GTF_DO_CONST_PROP, which acts as an override for GTF_DONT_CSE when constant propagating. I fully expect to have to tidy this up before merging. If this PR gets split into multiple smaller pieces, then this is a definite candidate for its own PR. For now it's here so that we can see any performance implications of the full PR.
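As a conceptual sketch of the folding this enables (C#-flavoured pseudocode using my own assumed node shapes, not the JIT's actual GenTree code): once the compare feeding a select has been constant-propagated to 0 or 1, the select can collapse to one of its arms.

```csharp
// Conceptual sketch only: fold select(cond, whenTrue, whenFalse) once the
// condition has become a known constant.
abstract record Node;
record ConstNode(int Value) : Node;
record SelectNode(Node Cond, Node WhenTrue, Node WhenFalse) : Node;

static class SelectFolding
{
    public static Node Fold(Node cond, Node whenTrue, Node whenFalse)
    {
        // After constant propagation the compare may have become a 0/1
        // constant, in which case the select is replaced by the taken arm.
        if (cond is ConstNode c)
        {
            return c.Value != 0 ? whenTrue : whenFalse;
        }

        // Otherwise keep the select as-is.
        return new SelectNode(cond, whenTrue, whenFalse);
    }
}
```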
I ran superpmi asmdiff on this patch and went through the results. 916 tests had code gen differences.
In general, each use of csel or ccmp will reduce the size of the code by 1 instruction. 👍 Why the increases in code? Well, if that code was a compare against 0 then it was already optimised with cbnz. Switching to csel causes one additional instruction. Before my patch:
with my patch:
Although the second block is longer, it will perform better (assuming the chance of the branch being taken is random). My next steps are to fix the failing tests in CI (not sure yet if any are caused by my patch). Are there any obvious performance tests that can be run on this patch?
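As a hedged illustration of the compare-against-zero case described above (my own example, not the actual diff):

```csharp
// Hypothetical example: before the patch this shape compiles to a cbz/cbnz
// over a single mov (two instructions); with the patch it becomes
// cmp + mov + csel (three instructions), one longer but with no branch.
static int CompareAgainstZero(int op1)
{
    if (op1 != 0) { op1 = 5; }
    return op1;
}
```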
Thanks for doing this. I do see diffs in x64...can you double check why that is the case?
That's fine with me given that C++ does it too. https://godbolt.org/z/jTa14qTnf

windows/arm64 benchmarks collection: One thing that concerns me is the extra instructions we would end up executing for one of the branches. E.g. below, we would not execute

windows/arm64 libraries-crossgen2

linux/arm64 libraries-pmi

Here is another one.

windows/arm64 libraries-crossgen2

Any idea why there are extra instructions here?

linux/arm64 libraries-pmi

Same here:
I tried checking in the benchmarks collection, but I don't see any benchmarks that show an effect from this change. Just try to come up with a sample program and see if it is improved with this change.
At the moment this patch doesn't do that - it always assumes csel/ccmp is better. Generally, for a case with a random chance of the branch being taken, cmp, ldr, csel is better than cbz, ldr. So for the first case posted, I'd definitely go with the new csel version. For the other two cases, it looks like the cases are independent of each other. Essentially:
Every run of the code has to check every condition regardless of the previous result. So, in these cases again I'd go with the csel version. Where slowdowns will happen is if we get large chains of csel with multiple ccmps, e.g.:
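A hypothetical example of the kind of chain meant here (my own illustration, not the example from the original comment):

```csharp
// Hypothetical illustration: with csel/ccmp every one of these conditions is
// evaluated on every run, whereas the branchy form can stop at the first
// failing test.
static int Chain(int a, int b, int c, int d, int e)
{
    int result = 0;
    if (a > 1 && b > 2 && c > 3 && d > 4 && e > 5)
    {
        result = 7;
    }
    return result;
}
```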
If we're still unsure about the chains, then maybe remove generation of ccmps from this patch (but keep all the ccmp IR nodes and later phase checks, as it's all mostly common code with csel). Then later patches can slowly start adding uses of ccmp. Also, I'll check those other 2 cases - yes, it looks like there is another constant elimination case I'm missing.
"Redundant branch opts" is parsing through all the branches in the code. In the current HEAD, it gets to the JTRUE/NE node and decides that the branch must always happen due to matching dominators and VNs (I'm a little vague on the exact reasoning for the decision). So it deletes the block that gets jumped over. In my patch, the branch doesn't exist and the blocks have been merged together. So there's nothing for Redundant branch opts to do. To optimise the code away in my patch, we'd need something similar to the Redundant branch opts patch, but instead of iterating over the list of branches, it'd have to iterate over compare nodes (or select nodes). Maybe constant propagation could do that. For reference, this is JTRUE node 000309 being removed:
I wrote some benchmarks:
This was on an Altra. For "Single" it'll take the branch roughly 50% of the time, and the test runs 50% quicker with csel. For "And" the branch will be taken 25% of the time, dropping further for the other tests. Performance drops off, but is still quicker. The other tests do not yet optimise with this patch, but I've included them for the future.
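For reference, a minimal BenchmarkDotNet-style sketch of what a "Single" benchmark of this kind could look like (my reconstruction under assumptions; not the actual benchmark code used here):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Assumed shape: the branch is taken roughly 50% of the time on random data,
// which is the predictor-hostile case where csel should beat a branch.
public class CselBenchmarks
{
    private int[] _data = System.Array.Empty<int>();

    [GlobalSetup]
    public void Setup()
    {
        var rng = new System.Random(42);
        _data = new int[10_000];
        for (int i = 0; i < _data.Length; i++)
        {
            _data[i] = rng.Next(-100, 100);
        }
    }

    [Benchmark]
    public int Single()
    {
        int result = 0;
        foreach (int op1 in _data)
        {
            int x = op1;
            if (x > 0) { x = 5; }   // candidate for cmp + csel
            result += x;
        }
        return result;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<CselBenchmarks>();
}
```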
It just occurred to me that all these benchmarks validate csel performance in a loop. We have extracted it in
Split the lower parts of this code into a new PR: #71616
Since this is no longer in development, closing it.
Early prototype.
Fixes: #55364