Conversation

@tannergooding (Member)

Unlike xarch, where this could be handled a bit more trivially in codegen, Arm64 isn't as flexible, so we get better and more correct codegen by explicitly lowering to the correct sequence instead.
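To illustrate what that means concretely (a rough sketch, not code from this PR; the registers, lane index, and stack offset are made up), extracting a single float lane on Arm64 is either a one-instruction lane move when the vector is already in a register, or a plain scalar load at the lane's offset when the vector lives on the stack:

            ; vector in a register: extract lane 1 directly into a scalar register
            mov     s0, v16.s[1]        ; alias of dup s0, v16.s[1]

            ; vector spilled to the stack: load only the 4-byte lane instead of the full vector
            ldr     s0, [fp, #-0x64]    ; base of the vector plus (lane index * 4)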

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 2, 2024
@dotnet-policy-service (Contributor)

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding force-pushed the simd-getelement branch 3 times, most recently from a3b0300 to 95fdd44 on July 2, 2024 07:49
@tannergooding (Member, Author)

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding tannergooding force-pushed the simd-getelement branch 3 times, most recently from 263de58 to 244c8da on July 3, 2024 05:56
@tannergooding (Member, Author)

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding tannergooding marked this pull request as ready for review July 3, 2024 18:45
@tannergooding (Member, Author)

CC @dotnet/jit-contrib. This is the "proper" fix for #104232, which was worked around in #104264.

No TP diff, but good asm diffs, such as for Linux Arm64:

Overall (-10,228 bytes)
MinOpts (-4,844 bytes)
FullOpts (-5,384 bytes)

It looks like there are more improvements to be had, but I think those can be handled separately. For example, we're currently generating:

-            ldr     q16, [fp, #-0x68]	// [V98 tmp78]
-            dup     s16, v16.s[1]
-            ldr     q18, [fp, #-0x68]	// [V98 tmp78]
-            dup     s18, v18.s[2]
-            ldr     q19, [fp, #-0x68]	// [V98 tmp78]
-            dup     s19, v19.s[3]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ; byrRegs +[x1]
+            ldr     s16, [x1, #0x04]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ldr     s18, [x1, #0x08]
+            sub     x1, fp, #104	// [V98 tmp78]
+            ldr     s19, [x1, #0x0C]

This is a net improvement, since we're only loading scalars rather than repeatedly loading full vectors, but it's also a case that probably should have been generated as:

ldr     s16, [fp, #-0x64]
ldr     s18, [fp, #-0x60]
ldr     s19, [fp, #-0x5C]

I think there's a general optimization missing here for when the LclAddr itself has an offset and we're adding a constant offset on top of it (so the result is still representable as a single addressing mode).
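To spell out the arithmetic for the first load above (the other two are analogous): the local's address is fp - 0x68 (i.e. fp - 104), so folding in the +0x04 element offset yields a single immediate of -0x64, which fits directly in an Arm64 load addressing mode:

            sub     x1, fp, #104        ; x1 = fp - 0x68, the address of V98 tmp78
            ldr     s16, [x1, #0x04]    ; loads from fp - 0x68 + 0x04 = fp - 0x64
            ; the pair above is representable as one load with the folded offset:
            ldr     s16, [fp, #-0x64]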

@kunalspathak (Contributor) left a comment

LGTM

@tannergooding tannergooding merged commit b9673cb into dotnet:main Jul 5, 2024
@tannergooding tannergooding deleted the simd-getelement branch July 5, 2024 17:09
@github-actions github-actions bot locked and limited conversation to collaborators Aug 5, 2024