
Conversation

@kunalspathak (Contributor) commented Jun 12, 2023

  • In a few places, iterate over the set bits of the regMaskTP directly instead of walking all the registers and testing each one against the mask. This removes the impact of adding more registers, because with the changes in this PR we only iterate over the registers of interest.
  • Updated the pattern we use to extract a regNumber from the mask and toggle its bit off (see the sketch below).
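For illustration, here is a minimal standalone sketch of that iteration pattern. The typedefs and the helper body are assumptions modeled on names that appear in this PR (regMaskTP, regNumber, genFirstRegNumFromMask), not the actual JIT source:

#include <bit>      // std::countr_zero (C++20)
#include <cstdint>

using regMaskTP = uint64_t;
using regNumber = unsigned;

regNumber genFirstRegNumFromMask(regMaskTP mask)
{
    return static_cast<regNumber>(std::countr_zero(mask)); // index of the lowest set bit
}

void processCandidates(regMaskTP candidates)
{
    // Visit only the set bits, clearing each one as we go, instead of
    // testing every register number against the mask.
    while (candidates != 0)
    {
        regNumber regNum = genFirstRegNumFromMask(candidates);
        candidates ^= (regMaskTP(1) << regNum); // toggle the bit off
        // ... use regNum ...
    }
}

Compared with looping over every regNumber and testing it against the mask, the loop body here runs once per set bit, so adding more registers to the architecture does not slow it down.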

Fixes: #87337

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 12, 2023
@ghost ghost assigned kunalspathak Jun 12, 2023
@ghost commented Jun 12, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak (Contributor, Author)

Results are very encouraging:

[two screenshots: throughput diff results]

However, the Linux native compiler is not happy:

[screenshot: throughput regressions with the Linux native compiler]

@kunalspathak kunalspathak marked this pull request as ready for review June 13, 2023 22:20
@kunalspathak (Contributor, Author)

@dotnet/jit-contrib @BruceForstall

@tannergooding (Member)

It's interesting that it's worse for Linux x64 on Linux x64.

I think that means that Clang is generating more instructions now than it was before (Linux x64 on Windows x64, where it's an improvement, is on MSVC).

Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but uses more instructions?

Comment on lines +300 to +302
regNumber regNum = genFirstRegNumFromMask(candidates);
regMaskTP candidateBit = genRegMask(regNum);
candidates ^= candidateBit;
Member

Can this one not be?

Suggested change:
- regNumber regNum = genFirstRegNumFromMask(candidates);
- regMaskTP candidateBit = genRegMask(regNum);
- candidates ^= candidateBit;
+ regNumber regNum = genFirstRegNumFromMaskAndToggle(candidates);

Contributor Author

No, because we need candidateBit a few lines below; that's why I cannot use genFirstRegNumFromMaskAndToggle() here.
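For context, a hedged sketch of what genFirstRegNumFromMaskAndToggle plausibly does (an assumption based on the names in this thread, not the actual source), which shows why the toggled bit is not available to the caller:

using regMaskTP = unsigned long long;
using regNumber = unsigned;

// Assumed helpers, mirroring the names used in this PR.
regNumber genFirstRegNumFromMask(regMaskTP mask);   // index of the lowest set bit
regMaskTP genRegMask(regNumber regNum);             // single-bit mask for regNum

regNumber genFirstRegNumFromMaskAndToggle(regMaskTP& mask)
{
    regNumber regNum = genFirstRegNumFromMask(mask);
    mask ^= genRegMask(regNum); // the bit is cleared in place...
    return regNum;              // ...and only the register number is returned
}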

Comment on lines +2756 to +2764
if (availableRegCount < (sizeof(regMaskTP) * 8))
{
// Mask out the bits between availableRegCount and 64
actualRegistersMask = (1ULL << availableRegCount) - 1;
}
else
{
actualRegistersMask = ~RBM_NONE;
}
Member

Why not have this always be actualRegistersMask = (1ULL << availableRegCount) - 1?

That way it's always exactly the bitmask of the actual registers available. No more, no less.

Contributor Author

Yes, that's ideally how it should be, but for arm64, availableRegCount == 65 (it includes REG_STK, etc.). So (1ULL << 65) evaluates to 0x2, and after the - 1, actualRegistersMask becomes 1. The debugger, however, shows the expected value.

[screenshot: debugger watch window]

I am a little confused about why that happens.
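For illustration, a tiny standalone repro of the overshift surprise (a sketch added here for context; the printed value is what x64 hardware typically produces, but the behavior is undefined and varies by compiler):

#include <cstdint>
#include <cstdio>

int main(int argc, char** argv)
{
    (void)argv;
    unsigned count = 64 + argc;              // 65 at runtime; defeats constant folding
    uint64_t mask = (1ULL << count) - 1;     // shift count >= 64: undefined behavior
    // On x64 the shl instruction masks the count (65 & 63 == 1), so this
    // typically prints 1 rather than the all-bits-set value one might expect.
    printf("%llx\n", (unsigned long long)mask);
    return 0;
}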

Member

regMaskTP is unsigned __int64 for Arm64, so we can represent at most 64 registers; therefore 1 << 63 is the highest shift we can safely do, because 1 << 64 is overshifting and therefore undefined behavior.

Some compilers are going to do overshifting as if we had infinite bits and then truncated. This would make it (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers are going to instead mask the shift count, as C# and x86/x64 do, giving (1 << (65 % 64)) == (1 << 1) == 2 and then 2 - 1 == 1, and others still as something completely different.

It looks like this isn't an "issue" today because the register allocator cannot allocate REG_SP itself; it is only used manually by codegenarm64, so it doesn't need to be included in actualRegistersMask. That makes working around this "simpler", since it's effectively a "special" register like REG_STK.

Short term, we probably want to add an assert validating that the tracked registers don't exceed 64 bits (that is, ACTUAL_REG_CNT <= 64), and to special-case when it is exactly 64 bits.
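A minimal sketch of that short-term guard, assuming the count fits in a 64-bit regMaskTP (illustrative only, not the PR's actual code):

#include <cassert>
#include <cstdint>

using regMaskTP = uint64_t;

regMaskTP computeActualRegistersMask(unsigned availableRegCount)
{
    assert(availableRegCount <= 64); // tracked registers must fit in the 64-bit mask
    if (availableRegCount == 64)
    {
        return ~regMaskTP(0);        // all bits set; avoids the UB of 1ULL << 64
    }
    return (regMaskTP(1) << availableRegCount) - 1;
}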

Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs General/Integer vs Special/Other) is one way. That may also help in other areas where some Integer registers are actually Special registers and cannot be used "generally" (i.e. REG_ZR is effectively reserved and cannot be assigned, just consumed). It would also reduce the cost for various operations in the case where only one register type is being used.

Contributor Author

> Some compilers are going to do overshifting as if we had infinite bits and then truncated. This would make it (1 << 65) == 0, then 0 - 1 == -1, which is AllBitsSet. Other compilers are going to instead mask the shift count, as C# and x86/x64 do, giving (1 << (65 % 64)) == (1 << 1) == 2 and then 2 - 1 == 1, and others still as something completely different.

That's exactly my understanding. What confuses me is that the compiler chooses different behavior during execution vs. in the "watch window" while debugging.

Contributor Author

While I agree with your suggestion, for this PR I will keep the code I currently have to handle the arm64 case.

@tannergooding (Member)

Changes overall look good/correct. I had a few open questions on certain parts, and on whether we could apply the same optimizations we were doing elsewhere.

@kunalspathak (Contributor, Author)

> Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but uses more instructions?

I tried my best to check the disassembly, but I was not able to get it reliably. I tried objdump -d libclrjit.so on the Release bits, but it doesn't even print the function name before the start of each disassembly section, so I couldn't locate the code. I ended up debugging with gdb, putting breakpoints around the area, and copying the disassembly from the asm window.

Assembly code for processBlockStartLocations:

before:

0x7fffe7675c21 <LinearScan::processBlockStartLocations(BasicBlock*)+2961>  cmpl   $0x0,0x1368(%r15)
0x7fffe7675c29 <LinearScan::processBlockStartLocations(BasicBlock*)+2969>  je     0x7fffe7675dbd <LinearScan::processBlockStartLocations(BasicBlock*)+3373>
0x7fffe7675c2f <LinearScan::processBlockStartLocations(BasicBlock*)+2975>  lea    0x110(%r15),%rax
0x7fffe7675c36 <LinearScan::processBlockStartLocations(BasicBlock*)+2982>  xor    %ecx,%ecx
0x7fffe7675c38 <LinearScan::processBlockStartLocations(BasicBlock*)+2984>  xorpd  %xmm0,%xmm0
0x7fffe7675c3c <LinearScan::processBlockStartLocations(BasicBlock*)+2988>  jmp    0x7fffe7675c88 <LinearScan::processBlockStartLocations(BasicBlock*)+3064>
0x7fffe7675c3e <LinearScan::processBlockStartLocations(BasicBlock*)+2990>  movq   $0x0,0x18(%rax)
0x7fffe7675c46 <LinearScan::processBlockStartLocations(BasicBlock*)+2998>  mov    0x28(%rax),%edx
0x7fffe7675c49 <LinearScan::processBlockStartLocations(BasicBlock*)+3001>  movl   $0xffffffff,0x1034(%r15,%rdx,4)
0x7fffe7675c55 <LinearScan::processBlockStartLocations(BasicBlock*)+3013>  movq   $0x0,0x1118(%r15,%rdx,8)
0x7fffe7675c61 <LinearScan::processBlockStartLocations(BasicBlock*)+3025>  nopw   %cs:0x0(%rax,%rax,1)
0x7fffe7675c6b <LinearScan::processBlockStartLocations(BasicBlock*)+3035>  nopl   0x0(%rax,%rax,1)
0x7fffe7675c70 <LinearScan::processBlockStartLocations(BasicBlock*)+3040>  add    $0x1,%rcx
0x7fffe7675c74 <LinearScan::processBlockStartLocations(BasicBlock*)+3044>  mov    0x1368(%r15),%edx
0x7fffe7675c7b <LinearScan::processBlockStartLocations(BasicBlock*)+3051>  add    $0x30,%rax
0x7fffe7675c7f <LinearScan::processBlockStartLocations(BasicBlock*)+3055>  cmp    %rdx,%rcx
0x7fffe7675c82 <LinearScan::processBlockStartLocations(BasicBlock*)+3058>  jae    0x7fffe7675dbd <LinearScan::processBlockStartLocations(BasicBlock*)+3373>

after:

0x7fffe7675c2f <LinearScan::processBlockStartLocations(BasicBlock*)+3039>  callq  0x7fffe76d6e40 <BitOperations::BitScanForward(unsigned long)>
0x7fffe7675c34 <LinearScan::processBlockStartLocations(BasicBlock*)+3044>  mov    %eax,%ecx
0x7fffe7675c36 <LinearScan::processBlockStartLocations(BasicBlock*)+3046>  btc    %rax,%r12
0x7fffe7675c3a <LinearScan::processBlockStartLocations(BasicBlock*)+3050>  lea    (%rcx,%rcx,2),%rcx
0x7fffe7675c3e <LinearScan::processBlockStartLocations(BasicBlock*)+3054>  shl    $0x4,%rcx
0x7fffe7675c42 <LinearScan::processBlockStartLocations(BasicBlock*)+3058>  mov    0xf40(%r15),%rbx
0x7fffe7675c49 <LinearScan::processBlockStartLocations(BasicBlock*)+3065>  bts    %rax,%rbx
0x7fffe7675c4d <LinearScan::processBlockStartLocations(BasicBlock*)+3069>  mov    %rbx,0xf40(%r15)
0x7fffe7675c54 <LinearScan::processBlockStartLocations(BasicBlock*)+3076>  mov    0x128(%r15,%rcx,1),%rax
0x7fffe7675c5c <LinearScan::processBlockStartLocations(BasicBlock*)+3084>  test   %rax,%rax
0x7fffe7675c5f <LinearScan::processBlockStartLocations(BasicBlock*)+3087>  je     0x7fffe7675c27 <LinearScan::processBlockStartLocations(BasicBlock*)+3031>
0x7fffe7675c61 <LinearScan::processBlockStartLocations(BasicBlock*)+3089>  lea    (%r15,%rcx,1),%rdx
0x7fffe7675c65 <LinearScan::processBlockStartLocations(BasicBlock*)+3093>  add    $0x128,%rdx

I was able to get something reliable for my second change, and I don't see anything suspicious here.

[screenshot: disassembly comparison]

@tannergooding (Member)

> I don't see anything suspicious here.

Yeah, it looks like the same code just shuffled around a bit, with different registers. However, it's very odd/unexpected that BitScanForward is not being inlined here. We should annotate the method with __forceinline, since it's just abstracting an intrinsic call.
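For illustration, a hedged sketch of such an annotation; the macro name and wrapper body are assumptions for this example, not the actual JIT source:

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Force-inlined so the wrapper collapses to the single underlying instruction
// (tzcnt/bsf) instead of the out-of-line call seen in the disassembly above.
FORCE_INLINE uint32_t BitScanForward64(uint64_t value)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward64(&index, value);
    return static_cast<uint32_t>(index);
#else
    return static_cast<uint32_t>(__builtin_ctzll(value));
#endif
}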

@kunalspathak (Contributor, Author)

It is worth pasting the TP gains here before the results get deleted:

[seven screenshots: throughput gains across platforms and configurations]

- gtRsvdRegs &= ~tempRegMask;
- return genRegNumFromMask(tempRegMask);
+ regNumber tempReg = genFirstRegNumFromMask(availableSet);
+ gtRsvdRegs ^= genRegMask(tempReg);
Contributor

Is this actually faster than the previous code? It needs to do either a left shift (on amd64) or a memory lookup (non-amd64). The same question applies to all the places where you introduced genRegMask.

It seems like you're saying b = genRegMask(...) followed by a ^= b is faster than a &= ~b?

The genFirstRegNumFromMaskAndToggle cases seem like a clear win, but I'm not as sure about these.
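For context, a hedged sketch of the two genRegMask strategies described above (the bodies are assumptions based on the shift-vs-lookup description, not the actual JIT source):

#include <cstdint>

using regMaskTP = uint64_t;
using regNumber = unsigned;

#if defined(HOST_IS_AMD64) // hypothetical switch, purely for illustration
inline regMaskTP genRegMask(regNumber reg)
{
    return regMaskTP(1) << reg; // a single shift
}
#else
extern const regMaskTP regMasks[]; // hypothetical per-register lookup table
inline regMaskTP genRegMask(regNumber reg)
{
    return regMasks[reg]; // a memory load
}
#endif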

Member

a ^= (1 << …) is specially recognized and transformed into btc on xarch. There are sometimes special optimizations possible on Arm64 as well, but in the worst case it's the same number of instructions and the same execution cost (and often slightly shorter).
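A minimal illustration of the pattern being described (a sketch; actual compiler output depends on version and flags):

#include <cstdint>

// On x64, mainstream compilers typically lower this xor-with-a-single-bit
// pattern to one btc (bit test and complement) instruction.
uint64_t toggleBit(uint64_t mask, unsigned index)
{
    mask ^= (1ULL << index);
    return mask;
}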

@kunalspathak kunalspathak merged commit 60d00ec into dotnet:main Jun 19, 2023
@kunalspathak kunalspathak deleted the reg-for-loop branch June 19, 2023 16:40
@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2023
Merging this pull request closed: Optimize the iteration over all registers in various places in LSRA