Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics #37139

kunalspathak · 2020-05-29T00:08:44Z

Optimizes the following APIs using hardware intrinsics:

Vector128.WithLower()
WithLower_before.txt
WithLower_after.txt
Vector128.WithUpper
WithUpper_before.txt
WithUpper_after.txt
WithUpper_after_jit.txt
Vector64.AsDouble
AsDouble_before.txt
AsDouble_after.txt
Vector64.AsInt64
AsInt64_before.txt
AsInt64_after.txt
Vector64.AsUInt64
AsUint64_before.txt
AsUint64_after.txt
Vector128.Create(Vector64, Vector64)
Create_before.txt
Create_after.txt

Update:

After talking to Tanner and Egor, we decided not to optimize the Insert+Extract combination in the JIT in this PR; that will be done in a separate PR once the InsertSelectedScalar intrinsic is implemented. For now, WithUpper, WithLower and Create are optimized in managed code. Here is the code we generate after this change:

withlower_after_nojit.txt
withupper_after_nojit.txt
create_after_nojit.txt
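
To make the semantics concrete, here is a minimal portable C++ model of what the three managed APIs compute. The Vec64 and Vec128 types and the function names are hypothetical stand-ins chosen for illustration; this is not the C# implementation or the JIT's representation. The only assumption is that a 128-bit vector is a lower 64-bit half followed by an upper 64-bit half.

```cpp
#include <cstdint>

// Hypothetical stand-ins: a 64-bit vector is modeled as its raw bits,
// a 128-bit vector as a lower and an upper 64-bit half.
struct Vec64  { std::uint64_t bits; };
struct Vec128 { std::uint64_t lower; std::uint64_t upper; };

// Models Vector128.Create(Vector64, Vector64):
// `lower` becomes the low half, `upper` becomes the high half.
inline Vec128 create(Vec64 lower, Vec64 upper) {
    return Vec128{lower.bits, upper.bits};
}

// Models Vector128.WithLower: replaces the low half, keeps the high half.
inline Vec128 withLower(Vec128 v, Vec64 lower) {
    return Vec128{lower.bits, v.upper};
}

// Models Vector128.WithUpper: keeps the low half, replaces the high half.
inline Vec128 withUpper(Vec128 v, Vec64 upper) {
    return Vec128{v.lower, upper.bits};
}
```

In this model each operation touches at most one 64-bit half, which is why a single element insert per call is the natural target for the generated code.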

Contributes to #33308 and #33496.

kunalspathak · 2020-05-29T00:09:19Z

@echesakovMSFT , @tannergooding , @TamarChristinaArm , @BruceForstall

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

TamarChristinaArm · 2020-05-29T10:10:28Z

Thanks @kunalspathak!

These look great. For the .AsXX variants you've only posted the Vector64 codegen; I assume the Vector128 ones are the same except that they use a q register?

As a general question, these .AsXX are re-interpret casts, right? I.e. the bits are just re-interpreted, so why is any code generated at all? I'm guessing it's because the JIT doesn't track live ranges across basic blocks (BB)?

Also, the Create variants have some unneeded moves:

        0EB01E10          mov     v16.8b, v16.8b
        6E180630          mov     v16.d[1], v17.d[0]
        4EB01E00          mov     v0.16b, v16.16b

The first one is a no-op, and the last one can be avoided by allocating the result to v0. But that's a general issue, from the looks of the dumps.

Unrelated question to your change, but why does it allocate so much stack space?

A9BD7BFD stp fp, lr, [sp,#-48]! seems to allocate 48 bytes and only stores 16.

There's also a weird gap in between the stores

        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]

Could it be thinking that internally all vectors are 16 bytes?

If they were placed next to each other you could optimize these with stp and ldp. To be exact, your code in the second BB could be a single ldr q0, ... if the stores are ordered correctly, and you wouldn't need the inserts (though tracking the live ranges would fix all of this).

src/coreclr/src/jit/codegenlinear.cpp

src/coreclr/src/jit/hwintrinsiccodegenarm64.cpp

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

tannergooding · 2020-05-29T14:50:41Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1)
        /// <returns>A new <see cref="Vector128{Byte}" /> initialized from <paramref name="lower" /> and <paramref name="upper" />.</returns>
        public static unsafe Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper)


These methods should be aggressively inlined.
It might also be good to just treat them as intrinsic and to create the appropriate nodes in importation or lowering so these don't impact inlining of other methods due to increased IL size or additional locals introduced.

kunalspathak · 2020-06-01

I changed the implementation to import these as the appropriate instructions. Curious: what is the preferable way to do these things, importation or lowering, and what are the advantages? One advantage I see in doing it early is that the nodes pass through other optimizations (if applicable).

tannergooding · 2020-05-29T14:58:25Z

> These look great. For the .AsXX variants you've only posted the Vector64 codegen; I assume the Vector128 ones are the same except that they use a q register?

That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, such as via a delegate). It is due to the fallback using return Unsafe.As<Vector64<T>, Vector64<U>>(ref vector). In the intrinsic case these are fully elided in importation, so they don't even create nodes and can't generate additional code.
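
As a rough illustration of why a re-interpret cast needs no instructions of its own, the following portable C++ sketch models an AsUInt64/AsDouble pair on a single 64-bit lane. The function names are hypothetical and this is not how the C# fallback or the JIT is implemented; it assumes IEEE 754 doubles and only shows that the bit pattern is carried over unchanged.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit double");

// Hypothetical model of AsUInt64 on one 64-bit lane:
// the 8 bytes are copied as-is, no numeric conversion takes place.
inline std::uint64_t asUInt64(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Hypothetical model of AsDouble on one 64-bit lane (the inverse direction).
inline double asDouble(std::uint64_t bits) {
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Optimizing compilers commonly reduce such a fixed-size memcpy to a plain register move or to nothing at all, which is analogous to the elision described above.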

src/coreclr/src/jit/gentree.cpp

src/coreclr/src/jit/hwintrinsic.cpp

src/coreclr/src/jit/hwintrinsiccodegenarm64.cpp

src/coreclr/src/jit/lowerarmarch.cpp

src/coreclr/src/jit/codegenlinear.cpp

echesakov

Changes look good

kunalspathak · 2020-06-03T21:25:59Z

Sorry @TamarChristinaArm , I forgot to reply.

> Thanks @kunalspathak!
>
> These look great. For the .AsXX variants you've only posted the Vector64 codegen; I assume the Vector128 ones are the same except that they use a q register?

Correct. We already optimize Vector128.AsXX() while this PR optimizes Vector64.AsXX().

> As a general question, these .AsXX are re-interpret casts, right? I.e. the bits are just re-interpreted, so why is any code generated at all? I'm guessing it's because the JIT doesn't track live ranges across basic blocks (BB)?

I believe so. There is an overall problem where arguments are pushed to the stack and then retrieved. Related: #35635. Perhaps @CarolEidt knows the exact reasons.

> Also, the Create variants have some unneeded moves:
>
>         0EB01E10          mov     v16.8b, v16.8b
>         6E180630          mov     v16.d[1], v17.d[0]
>         4EB01E00          mov     v0.16b, v16.16b
>
> The first one is a no-op, and the last one can be avoided by allocating the result to v0. But that's a general issue, from the looks of the dumps.

This has changed in this PR. See my updated comments in PR description.

        4E083E20          umov    x0, v17.d[0]
        4E181C10          ins     v16.d[1], x0
        4EB01E00          mov     v0.16b, v16.16b

> Unrelated question to your change, but why does it allocate so much stack space?
>
> A9BD7BFD stp fp, lr, [sp,#-48]! seems to allocate 48 bytes and only stores 16.
>
> There's also a weird gap in between the stores
>
>         FD0017A0          str     d0, [fp,#40]
>         FD000FA1          str     d1, [fp,#24]
>
> Could it be thinking that internally all vectors are 16 bytes?
>
> If they were placed next to each other you could optimize these with stp and ldp. To be exact, your code in the second BB could be a single ldr q0, ... if the stores are ordered correctly, and you wouldn't need the inserts (though tracking the live ranges would fix all of this).

It looks like genTotalFrameSize returns 48 in most cases. Again, @CarolEidt, can you confirm why there is a gap?

CarolEidt · 2020-06-03T22:43:07Z

> It looks like genTotalFrameSize returns 48 in most cases. Again, @CarolEidt, can you confirm why there is a gap?

It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8 byte vectors. It would take some investigation to figure out why.

CarolEidt · 2020-06-03T22:54:39Z

I think the problem is that Compiler::getSIMDTypeAlignment, which is called by Compiler::lvaAllocLocalAndSetVirtualOffset, always returns 16 for TARGET_ARM64.
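
The effect of that alignment choice can be sketched with a small, hypothetical layout routine. This is not the JIT's frame allocator: Local, alignUp and layoutLocals are illustrative names, and offsets here grow upward from zero rather than following the real frame layout. It only shows how a 16-byte alignment requirement on 8-byte locals leaves a gap between them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of a stack local: its size and required alignment in bytes.
struct Local {
    std::size_t size;
    std::size_t alignment;  // assumed to be a power of two
};

// Rounds `value` up to the next multiple of `alignment` (a power of two).
inline std::size_t alignUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Assigns each local the next suitably aligned offset, in order.
// Fills `offsets` with one entry per local and returns the bytes consumed.
inline std::size_t layoutLocals(const std::vector<Local>& locals,
                                std::vector<std::size_t>& offsets) {
    std::size_t cursor = 0;
    offsets.clear();
    for (const Local& local : locals) {
        cursor = alignUp(cursor, local.alignment);
        offsets.push_back(cursor);
        cursor += local.size;
    }
    return cursor;
}
```

With two 8-byte locals, a 16-byte alignment places them 16 bytes apart in this model, the same spacing as the [fp,#24] and [fp,#40] stores quoted earlier, while an 8-byte alignment would make them adjacent.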

TamarChristinaArm · 2020-06-04T02:13:03Z

Thanks @kunalspathak !

> This has changed in this PR. See my updated comments in PR description.
>
>         4E083E20          umov    x0, v17.d[0]
>         4E181C10          ins     v16.d[1], x0
>         4EB01E00          mov     v0.16b, v16.16b

Hmm, why is it moving it between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

kunalspathak · 2020-06-04T17:24:31Z

> Hmm, why is it moving it between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

Yes. Initially I was doing that optimization of generating mov dstReg[index1], srcReg[index2] in the JIT, but we decided to hold off on that for now and instead do it once we implement InsertSelectedScalar (hopefully sometime soon).
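
For readers comparing the two sequences, here is a small portable C++ model of both. It is only a sketch: Lanes and the function names are hypothetical, each 128-bit register is modeled as two 64-bit lanes, and the general-purpose register x0 is modeled as a local variable. It shows that the results are identical and that the difference is only the extra hop through the general-purpose register file.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of a 128-bit SIMD register as two 64-bit lanes.
using Lanes = std::array<std::uint64_t, 2>;

// Models the current sequence:
//   umov x0, v17.d[0]
//   ins  v16.d[1], x0
inline Lanes insertViaGeneralRegister(Lanes dst, const Lanes& src) {
    std::uint64_t x0 = src[0];  // extract lane 0 into a general-purpose register
    dst[1] = x0;                // insert it into lane 1 of the destination
    return dst;
}

// Models the element-to-element form:
//   mov v16.d[1], v17.d[0]
inline Lanes insertElementToElement(Lanes dst, const Lanes& src) {
    dst[1] = src[0];            // copy lane 0 of src directly into lane 1 of dst
    return dst;
}
```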

kunalspathak · 2020-06-04T17:33:06Z

Opened #37429 to track the alignment questions that @TamarChristinaArm had.

kunalspathak added 7 commits May 28, 2020 16:19

Optimize Vector64.AsDouble(), Vector64.AsInt64(), Vector64.AsUInt64()

2c78526

Optimize Vector128.WithUpper()

5e856ba

Inline GetElement() as paramter

7ee86cf

Combine Insert/GetElement in JIT

957f4cc

Optimize Vector128.WithLower()

a78219d

Optimize Vector128.Create(Vector64, Vector64)

b5761f0

minor fix

48f09b7

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 29, 2020

danmoseley reviewed May 29, 2020

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

tannergooding reviewed May 29, 2020

src/coreclr/src/jit/codegenlinear.cpp Outdated

tannergooding reviewed May 29, 2020

src/coreclr/src/jit/hwintrinsiccodegenarm64.cpp Outdated

tannergooding reviewed May 29, 2020

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Outdated

tannergooding reviewed May 29, 2020


jaredpar mentioned this pull request May 29, 2020

OSX machines are de-provisioned during CI / PR runs leading to failures #34472

Closed

echesakov reviewed May 29, 2020


kunalspathak added 5 commits June 1, 2020 16:17

review comments

7d582db

Change the ifdef from ARMARCH to ARM64

e6b0e8b

Mark optimized methods to be AggresiveInlining

fddba3e

Revert optimization in JIT

c480f42

Use ToVector128Unsafe()

b5dcb16

echesakov reviewed Jun 3, 2020


echesakov approved these changes Jun 3, 2020


kunalspathak merged commit 9c5d406 into dotnet:master Jun 3, 2020

kunalspathak mentioned this pull request Jun 4, 2020

ARM64: Investigate why more stack space is allocated than needed and why they are not aligned #37429

Closed

kunalspathak mentioned this pull request Jun 18, 2020

Optimize WithUpper/WithLower with InsertSelectedScalar, SpanHelpers.Sequence APIs #38075

Merged

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
