Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: emit vzeroupper before calls to the bulk write barrier helper #106908

Merged
merged 3 commits into from
Aug 25, 2024

Conversation

AndyAyersMS
Copy link
Member

The helper uses SSE2, so we need to take care to avoid AVX-SSE transition penalties.

Fixes #106679.

The helper uses SSE2, so we need to take care to avoid AVX-SSE
transition penalties.

Fixes dotnet#106679.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 23, 2024
@AndyAyersMS
Copy link
Member Author

@tannergooding @EgorBo PTAL
cc @dotnet/jit-contrib

Will backport to .Net 9.

Will also look for other helpers that need similar treatement.

@@ -2050,6 +2050,13 @@ bool GenTreeCall::NeedsVzeroupper(Compiler* comp)
}
}

// Other special cases
//
if (!needsVzeroupper && IsHelperCall(comp, CORINFO_HELP_BULK_WRITEBARRIER))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be moved under CT_HELPER since IsHelperCall(...) requires gtCallType == CT_HELPER?

-- Not necessary, just makes it slightly more "pay for play"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That switch is guarded with IsPInvoke but this helper call is not a pinvoke -- it has no managed counterpart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, that makes sense.

I think we may need to ensure this is guarded under if (comp->canUseVexEncoding()) then; otherwise lsraxarch will mark us as having a call needing vzeroupper (see SetContainsCallNeedingVzeroupper) and that will cause a vzeroupper to be emitted when preserving callee saved float registers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this meaning this new bit, or the whole method...?

If we can't use vex then it seems like we can just immediately return false?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't use vex then it seems like we can just immediately return false?

Right.

this meaning this new bit, or the whole method...?

The new bit in particular, but we already have the existing if (IsPInvoke() && comp->canUseVexEncoding()), so we can probably just do something like this:

if (!comp->canUseVexEncoding())
{
    return false;
}

if (IsPInvoke())
{
   // ...
}

if (!needsVzeroupper && IsHelperCall(comp, CORINFO_HELP_BULK_WRITEBARRIER))
{
   // ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI diffs show a lot more additions than I'd expect, will need to generate them locally and review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These all look expected at a glance. vzeroupper is 3 bytes for any method that uses such a CORINFO_HELP_BULK_WRITEBARRIER

In the case that the method contains a call needing vzeroupper (i.e. contains a CORINFO_HELP_BULK_WRITEBARRIER call) then we have two scenarios:
A) The method uses 256-bit or greater AVX instructions
B) The method uses 128-bit AVX or no AVX instructions

For scenario A, we need to emit a vzeroupper prior to each CORINFO_HELP_BULK_WRITEBARRIER call
For scenario B, we optimize to only emitting a single vzeroupper call in the function prologue

Looking at the diffs, the vast majority of cases are those with just 1 new vzeroupper. There's a few larger functions that have multiple bulk write barriers and use floating-point/simd, and so have many new vzeroupper. -- 17 write barriers is the most for a non-test scenario, 67 is then an outlier for a ComInterfaceGenerator.Unit.Tests.CompileFails iterator method

For .NET 9, this is probably the "right" fix and while it increases size, vzeroupper is supposed to be "zero-cost" and handled by the register renamer. So it shouldn't negatively impact things.

In .NET 10, we could look at adding an explicit AVX path to CORINFO_HELP_BULK_WRITEBARRIER in order to avoid the need to emit vzeroupper, which may be desirable.

@AndyAyersMS AndyAyersMS merged commit 5fd3f22 into dotnet:main Aug 25, 2024
108 checks passed
@AndyAyersMS
Copy link
Member Author

/backport to release/9.0

Copy link
Contributor

@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref
3 participants