[JIT] More ARM64 comparison instruction optimizations with Vector.Zero #64783

Merged
merged 15 commits into from
Feb 17, 2022

@TIHan TIHan commented Feb 4, 2022

Description
Addresses most of the instructions listed in #33972 (comment), except for cmle, cmlt, fcmle, and fcmlt. We do not emit those anywhere yet because we do not have hardware intrinsic APIs that correspond to them.

ARM64 diffs based on the tests

-            movi    v16.2s, #0x00
-            fcmgt   v16.2s, v0.2s, v16.2s
+            fcmgt   v16.2s, v0.2s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmgt   v16.4s, v0.4s, v16.4s
+            fcmgt   v16.4s, v0.4s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmgt   v16.2d, v0.2d, v16.2d
+            fcmgt   v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
 ; ============================================================
-            movi    v16.4s, #0x00
-            cmgt    v16.2d, v0.2d, v16.2d
+            cmgt    v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmgt   d16, d0, d16
+            fcmgt   d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
 ; ============================================================
-            movi    v16.2s, #0x00
-            cmgt    d16, d0, d16
+            cmgt    d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmge   v16.2s, v0.2s, v16.2s
+            fcmge   v16.2s, v0.2s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmge   v16.4s, v0.4s, v16.4s
+            fcmge   v16.4s, v0.4s, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
 ; ============================================================
-            movi    v16.4s, #0x00
-            fcmge   v16.2d, v0.2d, v16.2d
+            fcmge   v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
 ; ============================================================
-            movi    v16.4s, #0x00
-            cmge    v16.2d, v0.2d, v16.2d
+            cmge    v16.2d, v0.2d, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
 ; ============================================================
-            movi    v16.2s, #0x00
-            fcmge   d16, d0, d16
+            fcmge   d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
 ; ============================================================
-            movi    v16.2s, #0x00
-            cmge    d16, d0, d16
+            cmge    d16, d0, #0
-						;; bbWeight=1    PerfScore 2.00
+						;; bbWeight=1    PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
 ; ============================================================

Acceptance Criteria

  • Add Tests

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 4, 2022
@ghost ghost assigned TIHan Feb 4, 2022
ghost commented Feb 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@@ -499,29 +537,6 @@ void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)
GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op1Reg, op2Reg, opt);
break;

case NI_AdvSimd_CompareEqual:

Contributor Author

This part was removed in favor of the table-driven approach, as it was simpler to handle the instructions explicitly rather than the intrinsics.

TIHan commented Feb 4, 2022

@dotnet/jit-contrib This is ready.

@@ -12995,7 +12995,7 @@ void emitter::emitDispIns(
emitDispVectorReg(id->idReg1(), id->idInsOpt(), true);
emitDispVectorReg(id->idReg2(), id->idInsOpt(), false);
}
- if (ins == INS_cmeq)
+ if (ins == INS_cmeq || ins == INS_cmge || ins == INS_cmgt || ins == INS_cmle || ins == INS_cmlt)

Member

I wonder if it's worth adding a small helper to cover these 5 instructions, since they get grouped together a few times.

Contributor

I agree, having the helper would be nice.
But this can be done in a follow up PR.

Contributor

FYI, this is the JIT coding convention around subexpressions (including Boolean expressions):
https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#11.1

There should be parentheses around all unary and binary expressions if they are contained within other expressions.
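
The helper suggested in this thread could look roughly like the sketch below, written with the parenthesization the convention asks for. The `instruction` enum here is a cut-down stand-in for the JIT's own, and the helper name `insIsIntegerCompareGroup` is hypothetical; nothing in the PR confirms that such a helper was added.

```cpp
// Cut-down stand-in for the JIT's instruction enum (assumption): only the
// five opcodes grouped in the diff above, plus one unrelated opcode.
enum instruction
{
    INS_add,
    INS_cmeq,
    INS_cmge,
    INS_cmgt,
    INS_cmle,
    INS_cmlt,
};

// Hypothetical helper: true for the five integer compare instructions that
// the display code groups together. Each comparison is parenthesized because
// it is a subexpression of the larger Boolean expression.
inline bool insIsIntegerCompareGroup(instruction ins)
{
    return (ins == INS_cmeq) || (ins == INS_cmge) || (ins == INS_cmgt) || (ins == INS_cmle) || (ins == INS_cmlt);
}
```

With a helper like this, each call site would shrink to `if (insIsIntegerCompareGroup(ins))`, so the list of opcodes lives in one place.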

Comment on lines 371 to 384
if ((numOperands == 2) && ((intrin.op2->IsVectorZero() && intrin.op2->isContained()) ||
                           (intrin.op1->IsVectorZero() && intrin.op1->isContained() &&
                            HWIntrinsicInfo::IsCommutative(intrin.id))))
{
    assert(HWIntrinsicInfo::SupportsContainment(intrin.id));

    if (intrin.op1->IsVectorZero() && intrin.op1->isContained() &&
        HWIntrinsicInfo::IsCommutative(intrin.id))
    {
        // The intrinsic is commutative, swap the registers.
        assert(op1Reg == REG_NA);
        op1Reg = op2Reg;
        op2Reg = REG_NA;
    }

Member

For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.

For example:
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6113-L6119

Which then means in codegen we only ever have to consider intrin.op2->isContained()

It might be good to do that here too, but I'll defer to @echesakovMSFT
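
To make the suggestion concrete, here is a minimal sketch of the "swap in lowering" idea. It assumes heavily simplified stand-in types (`Operand`, `Intrinsic`) and hypothetical function names; it is not the real GenTree or lowering code, and it ignores the check for whether the intrinsic supports containment at all.

```cpp
#include <cassert>
#include <utility>

// Simplified stand-ins for JIT nodes (assumption, not the real GenTree types).
struct Operand
{
    bool isVectorZero = false;
    bool isContained  = false;
};

struct Intrinsic
{
    bool    isCommutative = false;
    Operand op1;
    Operand op2;
};

// "Lowering" step: for a commutative intrinsic whose zero vector is op1,
// swap the operands so the zero always ends up as op2, then contain it.
void LowerContainZero(Intrinsic& node)
{
    if (node.isCommutative && node.op1.isVectorZero && !node.op2.isVectorZero)
    {
        std::swap(node.op1, node.op2);
    }

    if (node.op2.isVectorZero)
    {
        node.op2.isContained = true;
    }
}

// "Codegen" step: because lowering normalized the operand order, only op2
// has to be inspected to decide whether the compare-with-zero form applies.
bool CodegenUsesZeroForm(const Intrinsic& node)
{
    assert(!node.op1.isContained);
    return node.op2.isContained;
}
```

The trade-off discussed in this thread is visible here: codegen gets simpler, while the knowledge of which intrinsics may be swapped moves into lowering.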

Contributor Author

I don't mind where we do the swap, but the one positive thing about doing it in codegen is that it isn't dependent on any specific intrinsic.

Contributor

I think moving the logic to be table-driven complicates things.
I would rather leave it as a special handling.

> For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.

I would do this too - this would slightly simplify the codegen part.

Member

> I think moving the logic to be table-driven complicates things.

I actually think the opposite. I find that it greatly simplifies things here and makes it a lot more trivial to light up instructions with similar behavior.

Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.

This has allowed a lot of code and special handling to not exist for x86/x64 and to a similar extent Arm64.
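
A minimal sketch of what "annotate with a flag" can mean in a table-driven design follows. The intrinsic list, flag names, and table layout are assumptions for illustration only and do not reproduce the JIT's actual intrinsic tables.

```cpp
#include <cstdint>

// Illustrative intrinsic ids (assumption): a few compares plus one non-compare.
enum IntrinsicId
{
    CompareEqual,
    CompareGreaterThan,
    CompareGreaterThanOrEqual,
    Add,
    IntrinsicCount
};

// Illustrative per-intrinsic flags (assumption).
enum IntrinsicFlags : std::uint32_t
{
    Flag_None                    = 0,
    Flag_Commutative             = 1u << 0,
    Flag_SupportsContainmentZero = 1u << 1,
};

// One table entry per intrinsic. Lighting up a new intrinsic means adding a
// flag here rather than adding a new case to lowering and codegen.
constexpr std::uint32_t kIntrinsicFlags[IntrinsicCount] = {
    Flag_Commutative | Flag_SupportsContainmentZero, // CompareEqual
    Flag_SupportsContainmentZero,                    // CompareGreaterThan
    Flag_SupportsContainmentZero,                    // CompareGreaterThanOrEqual
    Flag_Commutative,                                // Add
};

constexpr bool SupportsContainmentZero(IntrinsicId id)
{
    return (kIntrinsicFlags[id] & Flag_SupportsContainmentZero) != 0;
}

constexpr bool IsCommutative(IntrinsicId id)
{
    return (kIntrinsicFlags[id] & Flag_Commutative) != 0;
}
```

Lowering and codegen would then query `SupportsContainmentZero(id)` instead of listing intrinsics or opcodes, which is what keeps the two phases in sync.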

Contributor Author

Moving the swap to lowering is fine. It will simplify the logic in codegen, though it makes the swap more dependent on specific intrinsics. Since it only benefits the CompareEqual* intrinsics, it makes sense to do it in lowering.

> I think moving the logic to be table-driven complicates things.

Can you explain why this complicates things? It makes it easier to focus on the specific instruction rather than the intrinsic.

Contributor

@tannergooding

> Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.

This is not what I was objecting to. Having a flag is fine. Moving the intrinsics to a dedicated category SIMDCompare would also be fine. Checking for a specific instruction opcode in code that was supposed to be general is not.

Member

@echesakovMSFT, you're suggesting that something like:

switch (numOperands)
{
   // ... existing code

    case 2:
    {
        if (intrin.SupportsContainmentZero() && intrin.op2->isContained())
        {
            assert(intrin.op2->IsVectorZero());
        }
        else if (isRMW)
        // ... existingCode
    }

   // ... existing code
}

would be better here, right?

Member

(not necessarily exactly that, but something similar or along those lines)

Contributor Author

@TIHan TIHan Feb 16, 2022

> but now it depends on ins (i.e. on a particular instruction opcode). This is not what I would think of as table-driven implementation.

That's fair; even in the xarch path, there is no branching on specific instructions either.
I added a flag called "SupportsContainmentZero" and applied it to the intrinsics we care about, so there is no more acting on specific instructions.

Contributor

> @echesakovMSFT, you're suggesting that something like:

@tannergooding Yes.

Actually, I wouldn't focus that much on the contained operand being 'zero'. So something like

switch (numOperands)
{
   // ... existing code

    case 2:
    {
        // Lowering ensures that only op2 needs to be checked for containment and will swap operands if needed.
        if (intrin.SupportsContainment() && intrin.op2->isContained())
        {

        }
        else if (isRMW)
        // ... existingCode
    }

   // ... existing code
}

would also work.

> That's fair; even in the xarch path, there is no branching on specific instructions either.
> I added a flag called "SupportsContainmentZero" and applied it to the intrinsics we care about, so there is no more acting on specific instructions.

@TIHan Thanks!

// - cmhi
// - cmhs
// require both operands; they do not have a 'with zero'.
if (intrin.op2->IsVectorZero() && !varTypeIsUnsigned(intrin.baseType))

Member

When you look at #64785, there is probably an opportunity to also handle the op1->IsVectorZero(). For example: CompareGreaterThan(0, x) can be CompareLessThan(x, 0)
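
The rewrite suggested here relies on the identity "0 > x is the same as x < 0" holding lane by lane. Below is a minimal scalar model of that identity, assuming four signed 32-bit lanes and all-ones/all-zeros result masks. The `Lanes` and `Mask` types and both functions are illustrative stand-ins, not the AdvSimd API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::int32_t, 4>;
using Mask  = std::array<std::uint32_t, 4>;

// Lane-wise signed "greater than": all bits set where a[i] > b[i].
Mask CompareGreaterThan(const Lanes& a, const Lanes& b)
{
    Mask result{};
    for (std::size_t i = 0; i < a.size(); i++)
    {
        result[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    }
    return result;
}

// Lane-wise signed "less than": all bits set where a[i] < b[i].
Mask CompareLessThan(const Lanes& a, const Lanes& b)
{
    Mask result{};
    for (std::size_t i = 0; i < a.size(); i++)
    {
        result[i] = (a[i] < b[i]) ? 0xFFFFFFFFu : 0u;
    }
    return result;
}
```

For any input `x` and an all-zero `zero`, `CompareGreaterThan(zero, x)` and `CompareLessThan(x, zero)` produce the same mask, so a zero in the first operand can be moved to the second by flipping the comparison.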

@tannergooding
Member

Looks good/correct to me. Just a couple small comments/suggestions.

@echesakov echesakov self-requested a review February 5, 2022 19:52
@tannergooding tannergooding self-requested a review February 8, 2022 05:52
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Feb 14, 2022
Contributor

@echesakov echesakov left a comment

Looks good overall - left some comments/suggestions.

@TIHan TIHan merged commit c3f5727 into dotnet:main Feb 17, 2022
@TIHan TIHan deleted the arm64-compare-instrs-opts-part2 branch February 17, 2022 02:32
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022