[JIT] More ARM64 comparison instruction optimizations with Vector.Zero #64783

TIHan · 2022-02-04T02:30:13Z

Description
Addresses most of the instructions listed in #33972 (comment), except for cmle, cmlt, fcmle, and fcmlt: we do not emit them anywhere yet because we do not have hardware intrinsic APIs that correspond to those instructions.

ARM64 diffs based on the tests

-            movi    v16.2s, #0x00
-            fcmgt   v16.2s, v0.2s, v16.2s
+            fcmgt   v16.2s, v0.2s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmgt   v16.4s, v0.4s, v16.4s
+            fcmgt   v16.4s, v0.4s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmgt   v16.2d, v0.2d, v16.2d
+            fcmgt   v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
 ; ============================================================
-            movi    v16.4s, #0x00
-            cmgt    v16.2d, v0.2d, v16.2d
+            cmgt    v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmgt   d16, d0, d16
+            fcmgt   d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
 ; ============================================================
-            movi    v16.2s, #0x00
-            cmgt    d16, d0, d16
+            cmgt    d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmge   v16.2s, v0.2s, v16.2s
+            fcmge   v16.2s, v0.2s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmge   v16.4s, v0.4s, v16.4s
+            fcmge   v16.4s, v0.4s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmge   v16.2d, v0.2d, v16.2d
+            fcmge   v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
 ; ============================================================
-            movi    v16.4s, #0x00
-            cmge    v16.2d, v0.2d, v16.2d
+            cmge    v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmge   d16, d0, d16
+            fcmge   d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
 ; ============================================================
-            movi    v16.2s, #0x00
-            cmge    d16, d0, d16
+            cmge    d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
 ; ============================================================
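
As a rough illustration of why the `movi` can be dropped in the diffs above, here is a small lane-wise model in plain C++. It is not JIT code: `Float4`, `Mask4`, and both function names are hypothetical, and the sketch only assumes the usual semantics of `fcmgt`, where each lane becomes all ones when the comparison holds and zero otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 4 x float vector register; not a JIT type.
using Float4 = std::array<float, 4>;
using Mask4  = std::array<std::uint32_t, 4>;

// Models "fcmgt vd.4s, vn.4s, vm.4s": a lane is all ones if n > m, else zero.
inline Mask4 CompareGreaterThanLanes(const Float4& n, const Float4& m)
{
    Mask4 result{};
    for (std::size_t i = 0; i < n.size(); i++)
    {
        result[i] = (n[i] > m[i]) ? 0xFFFFFFFFu : 0u;
    }
    return result;
}

// Models "fcmgt vd.4s, vn.4s, #0": same per-lane result, no zero register needed.
inline Mask4 CompareGreaterThanZeroLanes(const Float4& n)
{
    Mask4 result{};
    for (std::size_t i = 0; i < n.size(); i++)
    {
        result[i] = (n[i] > 0.0f) ? 0xFFFFFFFFu : 0u;
    }
    return result;
}
```

Both functions return the same mask whenever the second operand of the first one is all zeros, which is the equivalence the optimization relies on. Since every A64 instruction is 4 bytes, removing the single `movi` matches the 28 to 24 byte change reported for each method.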

Acceptance Criteria

Add Tests

ghost · 2022-02-04T02:30:20Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Author:	TIHan
Assignees:	TIHan
Labels:	`area-CodeGen-coreclr`
Milestone:	-

TIHan · 2022-02-04T02:32:12Z

src/coreclr/jit/hwintrinsiccodegenarm64.cpp

@@ -499,29 +537,6 @@ void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)
                GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op1Reg, op2Reg, opt);
                break;

-            case NI_AdvSimd_CompareEqual:


This part was removed in favor of the table-driven approach, as it was simpler to handle the instructions explicitly rather than the intrinsics.

TIHan · 2022-02-04T02:37:05Z

@dotnet/jit-contrib This is ready.

tannergooding · 2022-02-04T02:59:57Z

src/coreclr/jit/emitarm64.cpp

@@ -12995,7 +12995,7 @@ void emitter::emitDispIns(
                emitDispVectorReg(id->idReg1(), id->idInsOpt(), true);
                emitDispVectorReg(id->idReg2(), id->idInsOpt(), false);
            }
-            if (ins == INS_cmeq)
+            if (ins == INS_cmeq || ins == INS_cmge || ins == INS_cmgt || ins == INS_cmle || ins == INS_cmlt)


I wonder if it's worth a small helper to cover these 5 instructions since they get grouped together a few times

I agree, having the helper would be nice.
But this can be done in a follow up PR.

FYI, this is the jit coding convention around subexpressions (including Boolean expressions)
https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#11.1

There should be parentheses around all unary and binary expressions if they are contained within other expressions.
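
A minimal sketch of what such a helper could look like, written to follow the parenthesization rule quoted above. The helper name `isIntegerCompareWithZeroIns` is hypothetical, and the `instruction` enum here is a cut-down stand-in for the JIT's own enum, listing only a few opcodes for illustration.

```cpp
#include <cassert>

// Stand-in for the JIT's instruction enum; only a handful of opcodes are listed.
enum instruction
{
    INS_cmeq,
    INS_cmge,
    INS_cmgt,
    INS_cmle,
    INS_cmlt,
    INS_cmhi, // unsigned compare, no 'with zero' form
    INS_add,
};

// Hypothetical helper: true for the five integer compares that emitDispIns groups together.
// Each binary subexpression is parenthesized, per the JIT coding conventions.
inline bool isIntegerCompareWithZeroIns(instruction ins)
{
    return ((ins == INS_cmeq) || (ins == INS_cmge) || (ins == INS_cmgt) || (ins == INS_cmle) || (ins == INS_cmlt));
}
```

With a helper along these lines, each place that currently repeats the five-way check could become a single call such as `if (isIntegerCompareWithZeroIns(ins))`.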

tannergooding · 2022-02-04T03:03:40Z

src/coreclr/jit/hwintrinsiccodegenarm64.cpp

+            if ((numOperands == 2) && ((intrin.op2->IsVectorZero() && intrin.op2->isContained()) ||
+                                       (intrin.op1->IsVectorZero() && intrin.op1->isContained() &&
+                                        HWIntrinsicInfo::IsCommutative(intrin.id))))
+            {
+                assert(HWIntrinsicInfo::SupportsContainment(intrin.id));
+
+                if (intrin.op1->IsVectorZero() && intrin.op1->isContained() &&
+                    HWIntrinsicInfo::IsCommutative(intrin.id))
+                {
+                    // The intrinsic is commutative, swap the registers.
+                    assert(op1Reg == REG_NA);
+                    op1Reg = op2Reg;
+                    op2Reg = REG_NA;
+                }


For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.

For example:
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6113-L6119

Which then means in codegen we only ever have to consider intrin.op2->isContained()

It might be good to do that here too, but I'll defer to @echesakovMSFT

I don't mind where we do the swap, but the one positive thing about doing it in codegen is that it isn't dependent on any specific intrinsic.

I think moving the logic to be table-driven complicates things.
I would rather leave it as a special handling.

> For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.

I would do this too - this would slightly simplify the codegen part.

> I think moving the logic to be table-driven complicates things.

I actually think the opposite. I find that it greatly simplifies things here and makes it a lot more trivial to light up instructions with similar behavior.

Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.

This has allowed a lot of code and special handling to not exist for x86/x64 and to a similar extent Arm64.

Moving the swap to lowering is fine. It will simplify the logic in codegen, though it makes the swap more dependent on specific intrinsics. Since it only benefits the CompareEqual* intrinsics, it makes sense to do it in lowering.
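
To make the trade-off concrete, here is a toy model of doing the swap in lowering so that codegen only ever inspects op2. None of these types exist in the runtime sources: `Operand`, `Intrinsic`, `LowerCompareWithZero`, and `UsesImmediateZeroForm` are all hypothetical, and the real containment checks (for example, excluding unsigned base types) are left out.

```cpp
#include <cassert>
#include <utility>

// Toy stand-ins for JIT nodes; hypothetical types for illustration only.
struct Operand
{
    bool isVectorZero = false;
    bool isContained  = false;
};

struct Intrinsic
{
    bool    isCommutative = false;
    Operand op1;
    Operand op2;
};

// Sketch of the lowering step: move a zero op1 into the op2 slot when the
// operation is commutative, then mark a zero op2 as contained.
inline void LowerCompareWithZero(Intrinsic& node)
{
    if (node.op1.isVectorZero && !node.op2.isVectorZero && node.isCommutative)
    {
        std::swap(node.op1, node.op2);
    }

    if (node.op2.isVectorZero)
    {
        node.op2.isContained = true;
    }
}

// After lowering, codegen only has to ask about op2.
inline bool UsesImmediateZeroForm(const Intrinsic& node)
{
    return node.op2.isContained;
}
```

In this model a commutative compare with zero as op1 ends up using the immediate form, while a non-commutative one with zero as op1 is left alone and still needs both registers.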

> I think moving the logic to be table-driven complicates things.

Can you explain why this complicates things? It makes it easier to focus on the specific instruction rather than the intrinsic.

@tannergooding

> Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.

This is not what I was objecting to. Having a flag is fine. Moving the intrinsics to a dedicated category SIMDCompare would also be fine. Checking for a specific instruction opcode in a code (that was supposed to be general) is not.

@echesakovMSFT, you're suggesting that something like:

    switch (numOperands)
    {
        // ... existing code
        case 2:
        {
            if (intrin.SupportsContainmentZero() && intrin.op2->isContained())
            {
                assert(intrin.op2->isVectorZero());
            }
            else if (isRMW)
            // ... existingCode
        }
        // ... existing code
    }

would be better here, right?

(not necessarily exactly that, but something similar or along those lines)

but now it depends on ins (i.e. on a particular instruction opcode). This is not what I would think of as table-driven implementation.

That's fair, and even in the xarch path there is no branching on specific instructions either.
I added a flag called "SupportsContainmentZero" and used it on the intrinsics we care about, so there is no more acting on specific instructions.
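
A minimal sketch of the flag-driven idea discussed at this point in the thread, assuming a small table indexed by intrinsic id. Every name below (`IntrinsicId`, `IntrinsicFlags`, `g_table`, `SupportsContainmentZero`) is hypothetical and does not mirror the runtime's actual tables. Note also that the last commit in this PR is titled "Removing flag and simplifying codegen for containment with zeros", so the merged code may not look like this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of a flag-annotated intrinsic table.
enum IntrinsicId
{
    NI_CompareEqual,
    NI_CompareGreaterThan,
    NI_Add,
    NI_Count
};

enum IntrinsicFlags : std::uint32_t
{
    Flag_None                    = 0,
    Flag_Commutative             = 1u << 0,
    Flag_SupportsContainmentZero = 1u << 1,
};

struct IntrinsicInfo
{
    const char*   name;
    std::uint32_t flags;
};

// Indexed by IntrinsicId; the flags column is the only place an intrinsic opts in.
static const IntrinsicInfo g_table[NI_Count] = {
    {"CompareEqual", Flag_Commutative | Flag_SupportsContainmentZero},
    {"CompareGreaterThan", Flag_SupportsContainmentZero},
    {"Add", Flag_Commutative},
};

// Lowering and codegen query the flag instead of naming specific instructions.
inline bool SupportsContainmentZero(IntrinsicId id)
{
    return ((g_table[id].flags & Flag_SupportsContainmentZero) != 0);
}
```

The design point is that lighting up another intrinsic means editing one table row, rather than adding matching cases to both lowering and codegen.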

> @echesakovMSFT, you're suggesting that something like:

@tannergooding Yes.

Actually, I wouldn't focus so much on the contained operand being 'zero'. So something like

    switch (numOperands)
    {
        // ... existing code
        case 2:
        {
            // Lowering ensures that only op2 need to be checked for containment and will swap operands if needed.
            if (intrin.SupportsContainment() && intrin.op2->isContained())
            {
            }
            else if (isRMW)
            // ... existingCode
        }
        // ... existing code
    }

would also work.

> That's fair, and even in the xarch path there is no branching on specific instructions either.
> I added a flag called "SupportsContainmentZero" and used it on the intrinsics we care about, so there is no more acting on specific instructions.

@TIHan Thanks!

tannergooding · 2022-02-04T03:07:11Z

src/coreclr/jit/lowerarmarch.cpp

+                //    - cmhi
+                //    - cmhs
+                // require both operands; they do not have a 'with zero'.
+                if (intrin.op2->IsVectorZero() && !varTypeIsUnsigned(intrin.baseType))


When you look at #64785, there is probably an opportunity to also handle the op1->IsVectorZero(). For example: CompareGreaterThan(0, x) can be CompareLessThan(x, 0)
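
The rewrite suggested here can be checked with a small lane-wise model. This is plain C++ rather than JIT code: `Long2`, `Mask2`, and both function names are hypothetical, and the sketch assumes signed 64-bit lanes, since the unsigned compares have no 'with zero' form.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 2 x int64 vector register; not a JIT type.
using Long2 = std::array<std::int64_t, 2>;
using Mask2 = std::array<std::uint64_t, 2>;

constexpr std::uint64_t AllBitsSet = ~std::uint64_t{0};

// Models cmgt with two register operands: a lane is all ones if left > right.
inline Mask2 CompareGreaterThan(const Long2& left, const Long2& right)
{
    Mask2 result{};
    for (std::size_t i = 0; i < left.size(); i++)
    {
        result[i] = (left[i] > right[i]) ? AllBitsSet : 0;
    }
    return result;
}

// Models "cmlt vd.2d, vn.2d, #0": a lane is all ones if x < 0.
inline Mask2 CompareLessThanZero(const Long2& x)
{
    Mask2 result{};
    for (std::size_t i = 0; i < x.size(); i++)
    {
        result[i] = (x[i] < 0) ? AllBitsSet : 0;
    }
    return result;
}
```

For any input `x`, `CompareGreaterThan(zero, x)` and `CompareLessThanZero(x)` produce the same mask, which is why a zero in op1 could be folded by switching to the mirrored comparison.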

src/coreclr/jit/hwintrinsiccodegenarm64.cpp

tannergooding · 2022-02-04T03:10:49Z

Looks good/correct to me. Just a couple small comments/suggestions.

echesakov

Looks good overall - left some comments/suggestions.

TIHan added 8 commits February 3, 2022 16:05

Normalizing instructions with an implicit vector zero as the second o…

45611c1

…perand

Checking number of operands before looking at opernads

fcabd8f

Remove assert

b46af4b

Check commutative flag

9fe70b1

Fixed commutative check

1c58088

Handling more HW intrinsics

ea9c1f8

Finishing up

e6dfa7e

Finishing up

f9f8dd2

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 4, 2022

ghost assigned TIHan Feb 4, 2022

TIHan commented Feb 4, 2022

TIHan mentioned this pull request Feb 4, 2022

[JIT] ARM64 instructions cmle cmlt fcmle fcmlt are not emitted anywhere #64785

Open

Formatting

7d47671

tannergooding reviewed Feb 4, 2022

numOperands = 1

cdf3d60

echesakov self-requested a review February 5, 2022 19:52

tannergooding self-requested a review February 8, 2022 05:52

tannergooding approved these changes Feb 8, 2022

JulieLeeMSFT added this to the 7.0.0 milestone Feb 14, 2022

echesakov reviewed Feb 15, 2022

TIHan added 4 commits February 15, 2022 18:19

Feedback

a9c2989

Added HW_Flag_SupportsContainmentZero

9877252

Added extra assert

d7e31d4

Merged with main

15f2b09

echesakov approved these changes Feb 16, 2022

Removing flag and simplifying codegen for containment with zeros

1567e1c

TIHan merged commit c3f5727 into dotnet:main Feb 17, 2022

TIHan deleted the arm64-compare-instrs-opts-part2 branch February 17, 2022 02:32

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
