LSRA-throughput: Iterate over the regMaskTP instead all registers #87424
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: Fixes #87337
@dotnet/jit-contrib @BruceForstall
It's interesting that it's worse for …. I think that means that Clang is generating more instructions now than it was before (…). Do you have an example of the diffs here, and whether it's just Clang/LLVM being extra "clever" and producing something that is faster but with more instructions?
```cpp
regNumber regNum       = genFirstRegNumFromMask(candidates);
regMaskTP candidateBit = genRegMask(regNum);
candidates ^= candidateBit;
```
Can this one not be?
```diff
- regNumber regNum       = genFirstRegNumFromMask(candidates);
- regMaskTP candidateBit = genRegMask(regNum);
- candidates ^= candidateBit;
+ regNumber regNum = genFirstRegNumFromMaskAndToggle(candidates);
```
No, because we need `candidateBit` a few lines below. That's why I cannot use `genFirstRegNumFromMaskAndToggle()`.
```cpp
if (availableRegCount < (sizeof(regMaskTP) * 8))
{
    // Mask out the bits that are between 64 ~ availableRegCount
    actualRegistersMask = (1ULL << availableRegCount) - 1;
}
else
{
    actualRegistersMask = ~RBM_NONE;
}
```
Why not have this always be `actualRegistersMask = (1ULL << availableRegCount) - 1`?
That way it's always exactly the bitmask of the actual registers available. No more, no less.
`regMaskTP` is `unsigned __int64` for Arm64, so we can represent at most 64 registers. Therefore `1 << 63` is the highest shift we can safely do, because `1 << 64` is overshifting and therefore undefined behavior.
Some compilers are going to do the overshift as if we had infinite bits and then truncated. That would make `(1 << 65) == 0`, then `0 - 1 == -1`, which is AllBitsSet. Other compilers are going to instead treat this as C# and x86/x64 do, masking the shift count mod 64, so it becomes `(1 << (65 % 64)) == (1 << 1) == 2` and then `2 - 1 == 1`. Others still will do something completely different.
It looks like this isn't an "issue" today because the register allocator cannot allocate REG_SP itself; it's only used manually by codegenarm64 instead, and so it doesn't need to be included in `actualRegistersMask`. That makes working around this "simpler", since it's effectively a "special" register like REG_STK.
Short term, we probably want to add an assert validating that the tracked registers don't exceed 64 bits (that is, `ACTUAL_REG_CNT <= 64`) and to special-case when it is exactly 64 bits.
Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs General/Integer vs Special/Other) is one way. That may also help in other areas where some Integer registers are actually Special registers and cannot be used "generally" (i.e. REG_ZR is effectively reserved and cannot be assigned, just consumed). It would also reduce the cost for various operations in the case where only one register type is being used.
> Some compilers are going to do the overshift as if we had infinite bits and then truncated. That would make `(1 << 65) == 0`, then `0 - 1 == -1`, which is AllBitsSet. Other compilers are going to instead treat this as C# and x86/x64 do, masking the shift count mod 64, so it becomes `(1 << (65 % 64)) == (1 << 1) == 2` and then `2 - 1 == 1`. Others still will do something completely different.
That's exactly how I understand it. What confuses me is that the compiler decides on different behavior during execution vs. in the "watch window" while debugging.
While I agree with your suggestion, for this PR I will keep the code I currently have to handle the Arm64 case.
Changes overall look good/correct. Had a few open questions on certain parts and whether we couldn't do the same optimizations we were doing elsewhere.
Yeah, it looks like the same code just shuffled around a bit, with different registers. However, it's very odd/unexpected that ….
```diff
- gtRsvdRegs &= ~tempRegMask;
- return genRegNumFromMask(tempRegMask);
+ regNumber tempReg = genFirstRegNumFromMask(availableSet);
+ gtRsvdRegs ^= genRegMask(tempReg);
```
Is this actually faster than the previous code? It needs to do either a left shift (on amd64) or a memory lookup (non-amd64). The same question applies to all the places where you introduced `genRegMask`.
It seems like you're saying `b = genRegMask(...)` followed by `a ^= b` is faster than `a &= ~b`?
The `genFirstRegNumFromMaskAndToggle` cases seem like a clear win, but I'm not as sure about these.
`a ^= (1 << …)` is specially recognized and transformed into `btc` on xarch. There are sometimes special optimizations possible on Arm64 as well, but at worst it's the same number of instructions and execution cost (and often slightly shorter).
This PR iterates over the `regMaskTP` instead of iterating over all the registers and checking them against the mask. This removes the impact of adding more registers because, with the changes in this PR, we will just iterate over the registers of interest. It also adds a helper for extracting the lowest `regNumber` from the mask and toggling that bit off in the mask.

Fixes: #87337