
Conversation

@kunalspathak (Contributor) commented Jun 12, 2023

  • In a few places, iterate over the set bits of the regMaskTP directly instead of walking all the registers and testing each one against the mask. This removes the impact of adding more registers, because with the changes in this PR we only iterate over the registers of interest.
  • Updated the pattern we use to extract a regNumber from the mask and toggle its bit off (see the sketch below).
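For illustration, here is a minimal standalone sketch of that iteration pattern. The typedefs and the helper body are assumptions modeled on names that appear in this PR (regMaskTP, regNumber, genFirstRegNumFromMask), not the actual JIT source:

#include <bit>      // std::countr_zero (C++20)
#include <cstdint>

using regMaskTP = uint64_t;
using regNumber = unsigned;

regNumber genFirstRegNumFromMask(regMaskTP mask)
{
    return static_cast<regNumber>(std::countr_zero(mask)); // index of the lowest set bit
}

void processCandidates(regMaskTP candidates)
{
    // Visit only the set bits, clearing each one as we go, instead of
    // testing every register number against the mask.
    while (candidates != 0)
    {
        regNumber regNum = genFirstRegNumFromMask(candidates);
        candidates ^= (regMaskTP(1) << regNum); // toggle the bit off
        // ... use regNum ...
    }
}

Compared with looping over every regNumber and testing it against the mask, the loop body here runs once per set bit, so adding more registers to the architecture does not slow it down.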

Fixes: #87337

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 12, 2023
@ghost ghost assigned kunalspathak Jun 12, 2023
@ghost commented Jun 12, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak (Contributor, Author)

Results are very encouraging:

[two screenshots: throughput diff results]

However, the Linux native compiler is not happy:

[screenshot: throughput regressions with the Linux native compiler]

@kunalspathak kunalspathak marked this pull request as ready for review June 13, 2023 22:20
@kunalspathak (Contributor, Author)

@dotnet/jit-contrib @BruceForstall

@tannergooding (Member)

It's interesting that it's worse for Linux x64 on Linux x64.

I think that means that Clang is generating more instructions now than it was before (Linux x64 on Windows x64, where it's an improvement, is on MSVC).

Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but uses more instructions?

Comment on lines +300 to +302
regNumber regNum = genFirstRegNumFromMask(candidates);
regMaskTP candidateBit = genRegMask(regNum);
candidates ^= candidateBit;
Member

Can this one not be?

Suggested change:
- regNumber regNum = genFirstRegNumFromMask(candidates);
- regMaskTP candidateBit = genRegMask(regNum);
- candidates ^= candidateBit;
+ regNumber regNum = genFirstRegNumFromMaskAndToggle(candidates);

Contributor Author

No, because we need candidateBit a few lines below; that's why I cannot use genFirstRegNumFromMaskAndToggle() here.
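For context, a hedged sketch of what genFirstRegNumFromMaskAndToggle plausibly does (an assumption based on the names in this thread, not the actual source), which shows why the toggled bit is not available to the caller:

using regMaskTP = unsigned long long;
using regNumber = unsigned;

// Assumed helpers, mirroring the names used in this PR.
regNumber genFirstRegNumFromMask(regMaskTP mask);   // index of the lowest set bit
regMaskTP genRegMask(regNumber regNum);             // single-bit mask for regNum

regNumber genFirstRegNumFromMaskAndToggle(regMaskTP& mask)
{
    regNumber regNum = genFirstRegNumFromMask(mask);
    mask ^= genRegMask(regNum); // the bit is cleared in place...
    return regNum;              // ...and only the register number is returned
}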

Comment on lines +2756 to +2764
if (availableRegCount < (sizeof(regMaskTP) * 8))
{
// Mask out the bits between availableRegCount and 64
actualRegistersMask = (1ULL << availableRegCount) - 1;
}
else
{
actualRegistersMask = ~RBM_NONE;
}
Member

Why not have this always be actualRegistersMask = (1ULL << availableRegCount) - 1?

That way it's always exactly the bitmask of the actual registers available. No more, no less.

Contributor Author

Yes, that's ideally how it should be, but for arm64, availableRegCount == 65 (it includes REG_STK, etc.). So (1ULL << 65) evaluates to 0x2, and after the - 1, actualRegistersMask becomes 1. The debugger, however, shows the expected value.

[screenshot: debugger watch window]

I am a little confused about why that happens.
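For illustration, a tiny standalone repro of the overshift surprise (a sketch added here for context; the printed value is what x64 hardware typically produces, but the behavior is undefined and varies by compiler):

#include <cstdint>
#include <cstdio>

int main(int argc, char** argv)
{
    (void)argv;
    unsigned count = 64 + argc;              // 65 at runtime; defeats constant folding
    uint64_t mask = (1ULL << count) - 1;     // shift count >= 64: undefined behavior
    // On x64 the shl instruction masks the count (65 & 63 == 1), so this
    // typically prints 1 rather than the all-bits-set value one might expect.
    printf("%llx\n", (unsigned long long)mask);
    return 0;
}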

Member

regMaskTP is unsigned __int64 for Arm64, so we can represent at most 64 registers; therefore 1 << 63 is the highest shift we can safely do, because 1 << 64 is overshifting and therefore undefined behavior.

Some compilers are going to do overshifting as if we had infinite bits and then truncated. This would make it (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers are going to instead mask the shift count, as C# and x86/x64 do, giving (1 << (65 % 64)) == (1 << 1) == 2 and then 2 - 1 == 1, and others still as something completely different.

It looks like this isn't an "issue" today because the register allocator cannot allocate REG_SP itself; it is only used manually by codegenarm64, so it doesn't need to be included in actualRegistersMask. That makes working around this "simpler", since it's effectively a "special" register like REG_STK.

Short term, we probably want to add an assert validating that the tracked registers don't exceed 64 bits (that is, ACTUAL_REG_CNT <= 64), and to special-case when it is exactly 64 bits.
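A minimal sketch of that short-term guard, assuming the count fits in a 64-bit regMaskTP (illustrative only, not the PR's actual code):

#include <cassert>
#include <cstdint>

using regMaskTP = uint64_t;

regMaskTP computeActualRegistersMask(unsigned availableRegCount)
{
    assert(availableRegCount <= 64); // tracked registers must fit in the 64-bit mask
    if (availableRegCount == 64)
    {
        return ~regMaskTP(0);        // all bits set; avoids the UB of 1ULL << 64
    }
    return (regMaskTP(1) << availableRegCount) - 1;
}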

Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs General/Integer vs Special/Other) is one way. That may also help in other areas where some Integer registers are actually Special registers and cannot be used "generally" (i.e. REG_ZR is effectively reserved and cannot be assigned, just consumed). It would also reduce the cost for various operations in the case where only one register type is being used.

Contributor Author

> Some compilers are going to do overshifting as if we had infinite bits and then truncated. This would make it (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers are going to instead mask the shift count, as C# and x86/x64 do, giving (1 << (65 % 64)) == (1 << 1) == 2 and then 2 - 1 == 1, and others still as something completely different.

That's exactly my understanding. What confuses me is that the compiler chooses different behavior during execution vs. in the "watch window" while debugging.

Contributor Author

While I agree with your suggestion, for this PR I will keep the code I currently have to handle the arm64 case.

@tannergooding (Member)

Changes overall look good/correct. I had a few open questions on certain parts, and on whether we could apply the same optimizations we were doing elsewhere.

@kunalspathak (Contributor, Author)

> Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but uses more instructions?

I tried my best to check the disassembly, but I was not able to get it reliably. I tried objdump -d libclrjit.so on the Release bits, but it doesn't even print the function name before the start of each disassembly section, so I couldn't locate the code. I ended up debugging with gdb, putting breakpoints around the area, and copying the disassembly from the asm window.

Assembly code for processBlockStartLocations:

before:

0x7fffe7675c21 <LinearScan::processBlockStartLocations(BasicBlock*)+2961>  cmpl   $0x0,0x1368(%r15)
0x7fffe7675c29 <LinearScan::processBlockStartLocations(BasicBlock*)+2969>  je     0x7fffe7675dbd <LinearScan::processBlockStartLocations(BasicBlock*)+3373>
0x7fffe7675c2f <LinearScan::processBlockStartLocations(BasicBlock*)+2975>  lea    0x110(%r15),%rax
0x7fffe7675c36 <LinearScan::processBlockStartLocations(BasicBlock*)+2982>  xor    %ecx,%ecx
0x7fffe7675c38 <LinearScan::processBlockStartLocations(BasicBlock*)+2984>  xorpd  %xmm0,%xmm0
0x7fffe7675c3c <LinearScan::processBlockStartLocations(BasicBlock*)+2988>  jmp    0x7fffe7675c88 <LinearScan::processBlockStartLocations(BasicBlock*)+3064>
0x7fffe7675c3e <LinearScan::processBlockStartLocations(BasicBlock*)+2990>  movq   $0x0,0x18(%rax)
0x7fffe7675c46 <LinearScan::processBlockStartLocations(BasicBlock*)+2998>  mov    0x28(%rax),%edx
0x7fffe7675c49 <LinearScan::processBlockStartLocations(BasicBlock*)+3001>  movl   $0xffffffff,0x1034(%r15,%rdx,4)
0x7fffe7675c55 <LinearScan::processBlockStartLocations(BasicBlock*)+3013>  movq   $0x0,0x1118(%r15,%rdx,8)
0x7fffe7675c61 <LinearScan::processBlockStartLocations(BasicBlock*)+3025>  nopw   %cs:0x0(%rax,%rax,1)
0x7fffe7675c6b <LinearScan::processBlockStartLocations(BasicBlock*)+3035>  nopl   0x0(%rax,%rax,1)
0x7fffe7675c70 <LinearScan::processBlockStartLocations(BasicBlock*)+3040>  add    $0x1,%rcx
0x7fffe7675c74 <LinearScan::processBlockStartLocations(BasicBlock*)+3044>  mov    0x1368(%r15),%edx
0x7fffe7675c7b <LinearScan::processBlockStartLocations(BasicBlock*)+3051>  add    $0x30,%rax
0x7fffe7675c7f <LinearScan::processBlockStartLocations(BasicBlock*)+3055>  cmp    %rdx,%rcx
0x7fffe7675c82 <LinearScan::processBlockStartLocations(BasicBlock*)+3058>  jae    0x7fffe7675dbd <LinearScan::processBlockStartLocations(BasicBlock*)+3373>

after:

0x7fffe7675c2f <LinearScan::processBlockStartLocations(BasicBlock*)+3039>  callq  0x7fffe76d6e40 <BitOperations::BitScanForward(unsigned long)>
0x7fffe7675c34 <LinearScan::processBlockStartLocations(BasicBlock*)+3044>  mov    %eax,%ecx
0x7fffe7675c36 <LinearScan::processBlockStartLocations(BasicBlock*)+3046>  btc    %rax,%r12
0x7fffe7675c3a <LinearScan::processBlockStartLocations(BasicBlock*)+3050>  lea    (%rcx,%rcx,2),%rcx
0x7fffe7675c3e <LinearScan::processBlockStartLocations(BasicBlock*)+3054>  shl    $0x4,%rcx
0x7fffe7675c42 <LinearScan::processBlockStartLocations(BasicBlock*)+3058>  mov    0xf40(%r15),%rbx
0x7fffe7675c49 <LinearScan::processBlockStartLocations(BasicBlock*)+3065>  bts    %rax,%rbx
0x7fffe7675c4d <LinearScan::processBlockStartLocations(BasicBlock*)+3069>  mov    %rbx,0xf40(%r15)
0x7fffe7675c54 <LinearScan::processBlockStartLocations(BasicBlock*)+3076>  mov    0x128(%r15,%rcx,1),%rax
0x7fffe7675c5c <LinearScan::processBlockStartLocations(BasicBlock*)+3084>  test   %rax,%rax
0x7fffe7675c5f <LinearScan::processBlockStartLocations(BasicBlock*)+3087>  je     0x7fffe7675c27 <LinearScan::processBlockStartLocations(BasicBlock*)+3031>
0x7fffe7675c61 <LinearScan::processBlockStartLocations(BasicBlock*)+3089>  lea    (%r15,%rcx,1),%rdx
0x7fffe7675c65 <LinearScan::processBlockStartLocations(BasicBlock*)+3093>  add    $0x128,%rdx

I was able to get something reliable for my second change, and I don't see anything suspicious here.

[screenshot: disassembly comparison]

@tannergooding (Member)

> I don't see anything suspicious here.

Yeah, it looks like the same code just shuffled around a bit, with different registers. However, it's very odd/unexpected that BitScanForward is not being inlined here. We should annotate the method with __forceinline, since it's just abstracting an intrinsic call.
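For illustration, a hedged sketch of such an annotation; the macro name and wrapper body are assumptions for this example, not the actual JIT source:

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Force-inlined so the wrapper collapses to the single underlying instruction
// (tzcnt/bsf) instead of the out-of-line call seen in the disassembly above.
FORCE_INLINE uint32_t BitScanForward64(uint64_t value)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward64(&index, value);
    return static_cast<uint32_t>(index);
#else
    return static_cast<uint32_t>(__builtin_ctzll(value));
#endif
}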

@kunalspathak (Contributor, Author)

It is worth pasting the TP gains here before the results get deleted:

[seven screenshots: throughput gains across platforms and configurations]

- gtRsvdRegs &= ~tempRegMask;
- return genRegNumFromMask(tempRegMask);
+ regNumber tempReg = genFirstRegNumFromMask(availableSet);
+ gtRsvdRegs ^= genRegMask(tempReg);
Contributor

Is this actually faster than the previous code? It needs to do either a left shift (on amd64) or a memory lookup (non-amd64). The same question applies to all the places where you introduced genRegMask.

It seems like you're saying b = genRegMask(...) followed by a ^= b is faster than a &= ~b?

The genFirstRegNumFromMaskAndToggle cases seem like a clear win, but I'm not as sure about these.
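For context, a hedged sketch of the two genRegMask strategies described above (the bodies are assumptions based on the shift-vs-lookup description, not the actual JIT source):

#include <cstdint>

using regMaskTP = uint64_t;
using regNumber = unsigned;

#if defined(HOST_IS_AMD64) // hypothetical switch, purely for illustration
inline regMaskTP genRegMask(regNumber reg)
{
    return regMaskTP(1) << reg; // a single shift
}
#else
extern const regMaskTP regMasks[]; // hypothetical per-register lookup table
inline regMaskTP genRegMask(regNumber reg)
{
    return regMasks[reg]; // a memory load
}
#endif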

Member

a ^= (1 << …) is specially recognized and transformed into btc on xarch. There are sometimes special optimizations possible on Arm64 as well, but in the worst case it's the same number of instructions and the same execution cost (and often slightly shorter).
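A minimal illustration of the pattern being described (a sketch; actual compiler output depends on version and flags):

#include <cstdint>

// On x64, mainstream compilers typically lower this xor-with-a-single-bit
// pattern to one btc (bit test and complement) instruction.
uint64_t toggleBit(uint64_t mask, unsigned index)
{
    mask ^= (1ULL << index);
    return mask;
}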

@kunalspathak kunalspathak merged commit 60d00ec into dotnet:main Jun 19, 2023
@kunalspathak kunalspathak deleted the reg-for-loop branch June 19, 2023 16:40
@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2023
Merging this pull request closed: Optimize the iteration over all registers in various places in LSRA