@tannergooding
Member

This should resolve one of the issues found in #86033 by ensuring that ToScalar and GetElement are handled more consistently.

In particular, we import GetElement(0) as ToScalar where feasible to avoid carrying around the unnecessary constant (minimally helping throughput). This also allows us to more trivially handle the special case semantics for getting the 0th element.
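The import-time rewrite described here can be pictured with a small standalone model. This is only a sketch: `Op`, `Node`, and `importGetElement` are hypothetical names invented for illustration and are not the JIT's actual types or functions.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for an intrinsic node (not the JIT's GenTree).
enum class Op { GetElement, ToScalar };

struct Node
{
    Op                     op;
    std::optional<int32_t> constIndex; // set only when the index operand is a constant
};

// Sketch of the import step: GetElement with a constant index of 0 becomes
// ToScalar, so no index constant has to be carried around afterwards.
Node importGetElement(std::optional<int32_t> index)
{
    if (index.has_value() && *index == 0)
    {
        return Node{Op::ToScalar, std::nullopt};
    }
    return Node{Op::GetElement, index};
}
```

In this model, a non-zero or non-constant index keeps the GetElement form, so only the 0th-element case takes the ToScalar path.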

Likewise, we ensure that ToScalar supports long on x86, where it needs to be lowered to AsUInt32() + GetElement(0) + GetElement(1); that it supports constant folding in VN; and that it supports containment as part of a store.
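As a rough illustration of the long-on-x86 lowering, the sketch below models a 128-bit vector as four 32-bit lanes and rebuilds the 64-bit scalar from lanes 0 and 1. `Vec128U32`, `fromInt64`, and `toScalarInt64` are hypothetical helpers, and the little-endian lane order is an assumption of the model; this is not the JIT's lowering code.

```cpp
#include <cstdint>

// Hypothetical model of a 128-bit vector viewed as four 32-bit lanes,
// i.e. the result of AsUInt32() on a vector of two 64-bit elements.
struct Vec128U32
{
    uint32_t lane[4];
};

// Build the model from two 64-bit elements, using little-endian lane order
// (lane 0 holds the low half of element 0), as on x86.
Vec128U32 fromInt64(int64_t e0, int64_t e1)
{
    uint64_t u0 = static_cast<uint64_t>(e0);
    uint64_t u1 = static_cast<uint64_t>(e1);
    return Vec128U32{{static_cast<uint32_t>(u0), static_cast<uint32_t>(u0 >> 32),
                      static_cast<uint32_t>(u1), static_cast<uint32_t>(u1 >> 32)}};
}

// ToScalar for long, expressed as GetElement(0) and GetElement(1) on the
// 32-bit view; the two halves form the low and high parts of the result.
int64_t toScalarInt64(const Vec128U32& v)
{
    uint32_t lo = v.lane[0]; // GetElement(0)
    uint32_t hi = v.lane[1]; // GetElement(1)
    return static_cast<int64_t>((static_cast<uint64_t>(hi) << 32) | lo);
}
```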

This helps separate the two distinct considerations of the APIs and ensures we get the best possible codegen for each case.

@ghost ghost assigned tannergooding May 13, 2023
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 13, 2023
@ghost

ghost commented May 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@AndyAyersMS
Member

cc @dotnet/jit-contrib -- need a reviewer

@tannergooding
Member Author

There are some bad diffs I'm investigating here. I think I'm missing a bit of handling somewhere.

@tannergooding
Member Author

> There are some bad diffs I'm investigating here. I think I'm missing a bit of handling somewhere.

Should be fixed now. I had put the ToScalar VN handling in FunBinary when it should've been in FunUnary; killing constant folding support for the most prevalent GetElement operation was not good 😆
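To make the unary-versus-binary point concrete, here is a minimal sketch of folding ToScalar as a one-operand function over a constant vector. `ConstVec`, `UnaryFunc`, and `evalUnary` are hypothetical names for illustration only; the real value-numbering code is structured differently.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical model: an operand either has a known constant vector or not.
using ConstVec = std::array<int32_t, 4>;

enum class UnaryFunc { ToScalar };

// Sketch of unary folding: ToScalar takes a single operand, so it is folded
// where unary functions are evaluated. Returns nullopt when nothing folds.
std::optional<int32_t> evalUnary(UnaryFunc func, const std::optional<ConstVec>& arg)
{
    if (!arg.has_value())
    {
        return std::nullopt; // operand is not a constant
    }
    switch (func)
    {
        case UnaryFunc::ToScalar:
            return (*arg)[0]; // element 0 of the constant vector
    }
    return std::nullopt;
}
```

If the ToScalar case were only reachable from a two-operand evaluator, a one-operand ToScalar would never hit it and nothing would fold, which matches the regression described above.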

break;
}

case NI_VectorT128_ToScalar:
Contributor

I believe this is just for TP (throughput) reasons, right?

Member Author

It's unnecessary, since we're now handling decomposition for ToScalar rather than GetElement(0) (it has to be lowered to GetElement(0) and GetElement(1)).

GenTree* op2 = node->Op(2);

if (op2->IsIntegralConst(0))
{
Contributor

Don't we need a similar change for Arm64?

Member Author

No. Arm64 doesn't have as many considerations, since it can't really do containment, so it's already doing all the right things as far as I saw.

{
MakeSrcContained(node, src);

if (intrinsicId == NI_Vector128_GetElement)
Contributor

Not sure if I follow this change.

Member Author

@tannergooding tannergooding May 15, 2023

Previously we were favoring containing the store over containing the load, which leads to larger and slightly slower codegen.

This change favors preserving the original containment (the load), which gives the smaller and slightly faster codegen.

Member Author

@tannergooding tannergooding May 15, 2023

I did need to keep part of the handling to ensure regOptional remained handled and cleared, since we want to prefer the contained store over a non-contained store plus a spilled local.

@tannergooding
Member Author

Diffs are better now.

We see -2868 bytes on Arm64, largely from constant folding.

We see -1711 bytes on x64 with a couple small regressions. These largely look to be due to different CSE decisions thanks to the different VN or register selection and most often present as movaps reg, reg being inserted somewhere it didn't exist previously.

@tannergooding
Member Author

Any other questions or feedback on this @kunalspathak ?

bool foundUse = BlockRange().TryGetUse(node, &use);
bool fromUnsigned = false;

GenTreeCast* cast = comp->gtNewCastNode(TYP_INT, node, fromUnsigned, simdBaseType);
Contributor

fromUnsigned is false here?

Member Author

Yes. pextrb/pextrw already zero-extend into the upper bits, so we only need to insert a cast for sign-extension (e.g. when the source type is TYP_BYTE or TYP_SHORT).
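The zero-extension versus sign-extension distinction can be shown with a portable model of a byte extract. The helper names below are hypothetical and the 16-byte array merely stands in for a vector register; no actual intrinsics are used.

```cpp
#include <cstdint>

// Models what a zero-extending byte extract (such as pextrb) leaves in a
// 32-bit register: the lane value in the low 8 bits, zeros above.
uint32_t extractByteZeroExtended(const uint8_t (&lanes)[16], int index)
{
    return static_cast<uint32_t>(lanes[index]);
}

// For an unsigned element type the zero-extended value is already correct.
uint32_t getElementUInt8(const uint8_t (&lanes)[16], int index)
{
    return extractByteZeroExtended(lanes, index);
}

// For a signed element type an extra narrowing cast is needed so that the
// upper bits are a sign extension instead of zeros.
int32_t getElementInt8(const uint8_t (&lanes)[16], int index)
{
    uint32_t raw = extractByteZeroExtended(lanes, index);
    return static_cast<int32_t>(static_cast<int8_t>(raw));
}
```

Only the signed variant needs the extra cast, which mirrors why the cast node is inserted just for the signed small types.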

Member Author

There's a short comment a couple of lines up that explains why it's just for those and not for all small types.

Contributor

@kunalspathak kunalspathak left a comment

LGTM

@tannergooding tannergooding merged commit 81d039e into dotnet:main May 16, 2023
@tannergooding tannergooding deleted the getelement-0 branch May 16, 2023 18:48
@ghost ghost locked as resolved and limited conversation to collaborators Jun 15, 2023