@EgorBo
Member

@EgorBo EgorBo commented May 25, 2025

Closes #114274

 vxorps   xmm4, xmm4, xmm4
 vmovdqu32 zmmword ptr [rsp+0x20], zmm4
-vmovdqa  xmmword ptr [rsp+0x60], xmm4
-vmovdqa  xmmword ptr [rsp+0x70], xmm4
+vmovdqu  ymmword ptr [rsp+0x60], ymm4
 vmovdqu  xmmword ptr [rsp+0x80], xmm4

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 25, 2025
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member Author

EgorBo commented May 26, 2025

PTAL @dotnet/jit-contrib: small change with a couple of diffs (only reproduces on AVX-512)

@EgorBo EgorBo marked this pull request as ready for review May 26, 2025 11:04
Copilot AI review requested due to automatic review settings May 26, 2025 11:04
Contributor

Copilot AI left a comment

Pull Request Overview

This PR refactors the zero-initialization of stack frames so that a YMM store can follow the ZMM stores, consolidates the two loops into a single while loop, and replaces aligned vmovdqa with unaligned vmovdqu for stores wider than XMM.

  • Replaces two for-loops with a while-loop driven by lenRemaining
  • Computes regSize dynamically via roundDownSIMDSize and chooses aligned vs. unaligned moves
  • Introduces ALIGN_UP(blkSize, 16) to drive the loop and switches mov instructions
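The loop described in the bullets above can be sketched roughly as follows. This is a standalone model, not the JIT's code: `Store`, `planZeroing`, and `roundDownSimdSizeSketch` are hypothetical names introduced for illustration, the block size is assumed to be a non-zero multiple of 16 bytes, and the real `roundDownSIMDSize` helper may behave differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one emitted store; not a JIT data structure.
struct Store
{
    unsigned offset;    // byte offset from the start of the block
    unsigned size;      // 16 (XMM), 32 (YMM) or 64 (ZMM)
    bool     unaligned; // true when an unaligned move would be chosen
};

// Assumed stand-in for roundDownSIMDSize: the widest register size
// (halving from maxSimd down to 16) that does not exceed len.
unsigned roundDownSimdSizeSketch(unsigned len, unsigned maxSimd)
{
    unsigned size = maxSimd;
    while (size > 16 && size > len)
    {
        size /= 2;
    }
    return size;
}

// Plans the stores for a block whose size is assumed to be a
// non-zero multiple of 16 bytes.
std::vector<Store> planZeroing(unsigned blkSize, unsigned maxSimd)
{
    assert(blkSize > 0 && blkSize % 16 == 0);

    std::vector<Store> plan;
    unsigned           offset       = 0;
    unsigned           lenRemaining = blkSize;

    while (lenRemaining > 0)
    {
        unsigned regSize = roundDownSimdSizeSketch(lenRemaining, maxSimd);
        // Only 16-byte alignment is assumed, so wider stores are unaligned.
        bool unaligned = regSize > 16;
        plan.push_back({offset, regSize, unaligned});
        offset += regSize;
        lenRemaining -= regSize;
    }

    assert(offset == blkSize);
    return plan;
}
```

Under these assumptions, a 96-byte block with 64-byte registers available yields one 64-byte store followed by one 32-byte store, which has the same shape as the zmm/ymm pair in the diff at the top of the PR.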
Comments suppressed due to low confidence (1)

src/coreclr/jit/codegenxarch.cpp:11261

  • Add tests for block sizes not divisible by SIMD widths (e.g., 1–15 or 17–31 bytes) to verify that remainders are handled correctly by this loop.
while (lenRemaining > 0)


assert(i == blkSize);
// frameReg is definitely not known to be 32B/64B aligned -> switch to unaligned movs
instruction ins = regSize > XMM_REGSIZE_BYTES ? simdUnalignedMovIns() : simdMov;
Member

Does simdUnalignedMovIns() get hoisted out of the loop, or will it be a lookup each time?

Is there a reason to not just always use the unaligned instruction since they're the same perf for accesses that are actually aligned?

Member Author

I presume it acts as a validation of the assumption that the frame pointer is 16-byte aligned, but I'm not sure; I just copied it from the previous logic.

@EgorBo EgorBo enabled auto-merge (squash) May 27, 2025 17:17
@EgorBo
Member Author

EgorBo commented May 27, 2025

/ba-g "windows-x86 Debug Libraries_CheckedCoreCLR is stuck"

@EgorBo EgorBo merged commit 97d1fc2 into dotnet:main May 27, 2025
106 of 108 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 27, 2025

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Development

Successfully merging this pull request may close these issues.

Suboptimal stack zeroing on AVX512

3 participants