-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics #37139
Conversation
@echesakovMSFT , @tannergooding , @TamarChristinaArm , @BruceForstall |
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Show resolved
Hide resolved
Thanks @kunalspathak! These look great, for the As a general question, these Also the
First one is a Unrelated question to your change, but why does it allocate so much stack space?
There's also a weird gap in between the stores
Could it be thinking that internally all vectors are If they were placed next to eachother you could optimize these with |
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Outdated
Show resolved
Hide resolved
@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1) | |||
/// <returns>A new <see cref="Vector128{Byte}" /> initialized from <paramref name="lower" /> and <paramref name="upper" />.</returns> | |||
public static unsafe Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods should be aggressively inlined.
It might also be good to just treat them as intrinsic and to create the appropriate nodes in importation or lowering so these don't impact inlining of other methods due to increased IL size or additional locals introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the implementation to import to appropriate instructions. Curious, what is preferable way to do these things? Importation or lowering and what are the advantages? One advantage I see doing it early is so the nodes are passed through other optimizations (if applicable).
That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, like via a delegate) and is due to the fallback using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good
Sorry @TamarChristinaArm , I forgot to reply.
Correct. We already optimize
I believe so. There is a overall problem where arguments are pushed to stack and retrieved. Related #35635. Perhaps @CarolEidt might know exact reasons.
This has changed in this PR. See my updated comments in PR description.
Looks like genTotalFrameSize in most of the cases returns 48. Again, @CarolEidt, can confirm why there is a gap? |
It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8 byte vectors. It would take some investigation to figure out why. |
I think the problem is that |
Thanks @kunalspathak !
hmm why is it moving it between register files now though? I would have expected the same |
Yes, Initially I was doing that optimization of generating |
Opened #37429 to track the alignment questions that @TamarChristinaArm had. |
Optimizes following APIs using hardware intrinsics:
Vector128.WithLower()
WithLower_before.txt
WithLower_after.txt
Vector128.WithUpper
WithUpper_before.txt
WithUpper_after.txt
WithUpper_after_jit.txt
Vector64.AsDouble
AsDouble_before.txt
AsDouble_after.txt
Vector64.AsInt64
AsInt64_before.txt
AsInt64_after.txt
Vector.AsUInt64
AsUint64_before.txt
AsUint64_after.txt
Vector128.Create(Vector64, Vector64)
Create_before.txt
Create_after.txt
Update:
After talking to Tanner and Egor, we decided to not optimize the
Insert+Extract
combination in JIT, but do it in separate PR when we implementInsertSelectedScalar
intrinsic. So currently, optimizeWithUpper
,WithLower
andCreate
in managed code. Here is the code we generate after this change:withlower_after_nojit.txt
withupper_after_nojit.txt
create_after_nojit.txt
Contributes to #33308 and #33496.