Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use push for 8/12 byte struct args on x86 #65387

Merged
merged 4 commits into from
Feb 17, 2022

Conversation

SingleAccretion
Copy link
Contributor

@SingleAccretion SingleAccretion commented Feb 15, 2022

It is the case that currently for code like the following:

private static void Problem(StructWithIndex* p)
{
    CallForStructWithIndex(*p);
}

struct StructWithIndex
{
    public int Index;
    public int Value;
}

On x86 we generate the following:

       vmovq    xmm0, qword ptr [ecx]
       sub      esp, 8
       vmovq    qword ptr [esp], xmm0
       call     RyuJitReproduction.Program:CallForStructWithIndex(StructWithIndex)

While could generate:

       push     dword ptr [ecx+4]
       push     dword ptr [ecx]

Which is both a little faster (ah-hoc benchmarking on my machine shows ~10% improvement) and smaller. This change does just that, using push [addr] for 8 and 12 byte structs (the latter made the code simple).

To achieve that, some code refactoring was undertaken: we already had a path that did what we needed, but it was only used for structs with GC pointers. I deleted the double meaning of Kind::Push - for TYP_STRUCT it now means only one thing, that is genStructPutArgPush, while before this change it could also go down the "unroll" path (for the very specific cases of 8 and 12 byte structs that we are moving to using pushes).

As a cleanup item, PartialRepInstr kind was added, to represent the algorithm we use for storing structs with GC pointers on x64 Unix.

There will be two types of diffs with this change: first the one showed above, second - removal of unnecessary vzeroupper instructions that were being generated because LSRA was allocating an unnecessary XMM register for Kind::Push PUTARG_STKs with GC pointers.

Diffs - x86 only (no diffs on other targets as expected). All regressions are alignment artifacts.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 15, 2022
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 15, 2022
@ghost
Copy link

ghost commented Feb 15, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Testing CI.

Highlight: benchmarks.run: Total bytes of delta: -11344 (-0.10 % of base).

Author: SingleAccretion
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@SingleAccretion SingleAccretion force-pushed the Push-8-Byte-Args branch 5 times, most recently from d386313 to a1a8988 Compare February 15, 2022 19:08
Two changes:

 1) Outline cases for GC pointers from "genPutStructArgStk" into
    their own functions. Helps with readability, will be used in
    the substative portion of the change.
 2) Do not use "Kind::Push" for a case of a struct less than 16
    bytes in size, that does not have GC pointers, on x86. With
    this, we will now always call "genStructPutArgUnroll" for
    the "Unroll" kind. This means "genAdjustStackForPutArgStk"
    has to special-case that. This will go away in the final
    version of the change.

No diffs on x86 or Unix-x64 (the only two affected platforms).
We're not using XMMs on the "push" path.
@SingleAccretion SingleAccretion changed the title Some x86 CQ improvements Use push for 8/12 byte struct args on x86 Feb 15, 2022
It is smaller and a little faster than using an XMM
register to first load and then store the struct.
@SingleAccretion
Copy link
Contributor Author

Not sure what's up with SPMI in CI. Will assume the Windows x64 timeout is not related (it does not use PUTARG_STK).

@dotnet/jit-contrib some nice CQ improvements for x86 (as always, because I need them for another change).

@SingleAccretion SingleAccretion marked this pull request as ready for review February 16, 2022 08:32
@kunalspathak
Copy link
Member

From unix-arm logs, seems some errors:

ERROR: Unexpected exception was thrown.

ERROR: Unexpected exception was thrown.

ERROR: Method 354 of size 7557059 failed to load and compile correctly by JIT2.

ERROR: main method 354 of size 7557059 failed to load and compile correctly.

ERROR: Couldn't load base metrics summary created by child process

Can you double check on your local machine?

For x86, they likely have timed out while producing .dasm files.

@kunalspathak
Copy link
Member

second - removal of unnecessary vzeroupper instructions that were being generated because LSRA was allocating an unnecessary XMM register for Kind::Push PUTARG_STKs with GC pointers.

Could you paste the sample diff?

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactoring. Changes LGTM except the failure on superpmi linux-arm that you want to double check.

src/coreclr/jit/codegenxarch.cpp Outdated Show resolved Hide resolved
@SingleAccretion
Copy link
Contributor Author

Can you double check on your local machine?

@kunalspathak Thank you for pointing that failure out, I did unfortunately miss it in the logs. It does not seem related: no ARM code paths were modified, and I can reproduce it both with and without this change.

It sure does look curious though: we're OOMing the Jit (2GB+ memory consumption and then an error in the CRT is hit).

Could you paste the sample diff?

They're all are like this:

        sub      esp, 20
-       vzeroupper
        xor      eax, eax

I suppose the better wording for this would have been that "LSRA was over-reporting the use of XMM registers".

@kunalspathak
Copy link
Member

It sure does look curious though: we're OOMing the Jit (2GB+ memory consumption and then an error in the CRT is hit).

Where do you see that? in the logs?

@SingleAccretion
Copy link
Contributor Author

Where do you see that? in the logs?

No, from locally replaying the coreclr_tests.pmi.Linux.arm.checked collection (no .mc because SPMI itself crashes).

@kunalspathak
Copy link
Member

No, from locally replaying the coreclr_tests.pmi.Linux.arm.checked collection (no .mc because SPMI itself crashes).

Does it repro without your changes too?

@SingleAccretion
Copy link
Contributor Author

SingleAccretion commented Feb 16, 2022

As I note, yes.

Edit: I suppose in the CI log it also says both the baseline and diff Jits encountered the same error:

ERROR: Unexpected exception was thrown. <-- Base JIt

ERROR: Unexpected exception was thrown. <-- Diff Jit

ERROR: Method 354 of size 7557059 failed to load and compile correctly by JIT2. <-- Diff Jit

ERROR: main method 354 of size 7557059 failed to load and compile correctly. <-- Base Jit

@kunalspathak
Copy link
Member

/azp run Fuzzlyn, Antigen

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@SingleAccretion
Copy link
Contributor Author

Categorizing fuzzer failures.

It seems Fuzzlyn is timing out pretty consistently on Linux ARM, and we have one failure there:

Found example with seed 16498413861848824582 that hits error
JIT assert failed:
Assertion failed 'inVarToRegMap[varIndex] == REG_STK' in 'S1:M50(int):long:this' during 'LSRA allocate' (IL size 4270)

File: /__w/1/s/src/coreclr/jit/lsra.cpp Line: 4050

That's pre-existing - seen in an earlier run (logs link).

Antigen on different targets has quite a few "uncategorized" issues (guessing these are for "wrong result" cases?), but the only x86 failure looks like #64787, and is an inlining assert (that this backend-only change could not have caused, I suppose).

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@kunalspathak kunalspathak merged commit 87d6d58 into dotnet:main Feb 17, 2022
@kunalspathak
Copy link
Member

@SingleAccretion - Keep an eye on the superpmi-replay run that will kick off on your commit to make sure there are no errors.

@kunalspathak
Copy link
Member

Improvements - dotnet/perf-autofiling-issues#3651

@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants