LSRA-throughput: Iterate over the regMaskTP instead all registers #87424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 11 commits
```diff
@@ -727,33 +727,30 @@ bool LinearScan::isContainableMemoryOp(GenTree* node)
 //
 void LinearScan::addRefsForPhysRegMask(regMaskTP mask, LsraLocation currentLoc, RefType refType, bool isLastUse)
 {
-    if (refType == RefTypeKill)
-    {
-        // The mask identifies a set of registers that will be used during
-        // codegen. Mark these as modified here, so when we do final frame
-        // layout, we'll know about all these registers. This is especially
-        // important if mask contains callee-saved registers, which affect the
-        // frame size since we need to save/restore them. In the case where we
-        // have a copyBlk with GC pointers, can need to call the
-        // CORINFO_HELP_ASSIGN_BYREF helper, which kills callee-saved RSI and
-        // RDI, if LSRA doesn't assign RSI/RDI, they wouldn't get marked as
-        // modified until codegen, which is too late.
-        compiler->codeGen->regSet.rsSetRegsModified(mask DEBUGARG(true));
-    }
+    assert(refType == RefTypeKill);
 
-    for (regNumber reg = REG_FIRST; mask; reg = REG_NEXT(reg), mask >>= 1)
+    // The mask identifies a set of registers that will be used during
+    // codegen. Mark these as modified here, so when we do final frame
+    // layout, we'll know about all these registers. This is especially
+    // important if mask contains callee-saved registers, which affect the
+    // frame size since we need to save/restore them. In the case where we
+    // have a copyBlk with GC pointers, can need to call the
+    // CORINFO_HELP_ASSIGN_BYREF helper, which kills callee-saved RSI and
+    // RDI, if LSRA doesn't assign RSI/RDI, they wouldn't get marked as
+    // modified until codegen, which is too late.
+    compiler->codeGen->regSet.rsSetRegsModified(mask DEBUGARG(true));
+
+    for (regMaskTP candidates = mask; candidates != RBM_NONE;)
     {
-        if (mask & 1)
-        {
-            // This assumes that these are all "special" RefTypes that
-            // don't need to be recorded on the tree (hence treeNode is nullptr)
-            RefPosition* pos = newRefPosition(reg, currentLoc, refType, nullptr,
-                                              genRegMask(reg)); // This MUST occupy the physical register (obviously)
+        regNumber reg = genFirstRegNumFromMaskAndToggle(candidates);
+        // This assumes that these are all "special" RefTypes that
+        // don't need to be recorded on the tree (hence treeNode is nullptr)
+        RefPosition* pos = newRefPosition(reg, currentLoc, refType, nullptr,
+                                          genRegMask(reg)); // This MUST occupy the physical register (obviously)
 
-            if (isLastUse)
-            {
-                pos->lastUse = true;
-            }
+        if (isLastUse)
+        {
+            pos->lastUse = true;
         }
     }
 }
```
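The rewritten loop visits one iteration per *set* bit instead of one per register position. A minimal standalone sketch of the two iteration strategies, assuming a 64-bit mask; `firstRegNumFromMaskAndToggle` here is a local stand-in that models `genFirstRegNumFromMaskAndToggle` with the GCC/Clang `__builtin_ctzll` intrinsic, not the JIT's real definition:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the JIT's types, for illustration only.
using regMaskTP = uint64_t;
constexpr regMaskTP RBM_NONE = 0;

// Models genFirstRegNumFromMaskAndToggle: returns the index of the lowest
// set bit and clears (toggles) it out of the mask. The caller must ensure
// mask is nonzero, since __builtin_ctzll is undefined for 0.
inline int firstRegNumFromMaskAndToggle(regMaskTP& mask)
{
    int reg = __builtin_ctzll(mask);  // position of the lowest set bit
    mask ^= (regMaskTP{1} << reg);    // toggle that bit off
    return reg;
}

// Old pattern: walk register positions one at a time, testing the low bit.
inline int countRegsLinear(regMaskTP mask)
{
    int n = 0;
    for (int reg = 0; mask != 0; ++reg, mask >>= 1)
    {
        if (mask & 1)
        {
            ++n;
        }
    }
    return n;
}

// New pattern: one iteration per set bit, as in the rewritten loop above.
inline int countRegsSparse(regMaskTP mask)
{
    int n = 0;
    for (regMaskTP candidates = mask; candidates != RBM_NONE;)
    {
        firstRegNumFromMaskAndToggle(candidates);
        ++n;
    }
    return n;
}
```

For a mask with three set bits, the sparse loop runs exactly three iterations regardless of how high those bits sit, while the linear walk pays one iteration per register position up to the highest set bit. Kill masks are typically sparse, which is where the throughput win comes from.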
```diff
@@ -2756,6 +2753,16 @@ void LinearScan::buildIntervals()
         availableRegCount = REG_INT_COUNT;
     }
 
+    if (availableRegCount < (sizeof(regMaskTP) * 8))
+    {
+        // Mask out the bits that are between 64 ~ availableRegCount
+        actualRegistersMask = (1ULL << availableRegCount) - 1;
+    }
+    else
+    {
+        actualRegistersMask = ~RBM_NONE;
+    }
+
```
Comment on lines +2756 to +2764

**Member:** Why not have this always be …? That way it's always exactly the bitmask of actual registers available. No more, no less.

**Contributor (Author):** …

**Member:** Some compilers are going to do overshifting as if we had infinite bits and then truncated. This would make it …. It looks like this isn't an "issue" today because the register allocator cannot allocate …. Short term, we probably want to add an assert to validate that the tracked registers don't exceed 64 bits (that is, …). Long term, I imagine we want to consider better ways to represent this so we can avoid the problem altogether. Having distinct register files for each category (SIMD/FP vs. General/Integer vs. Special/Other) is one way. That may also help in other areas where some …

**Contributor (Author):** That's exactly how I understand it. What confuses me is that the compiler decides on different behavior during execution vs. in the debugger's "watch windows".

**Contributor (Author):** While I agree with your suggestion, for this PR I will keep the code I currently have to handle the arm64 case.
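The guard in the diff exists because, for a 64-bit `regMaskTP`, shifting by 64 (`1ULL << 64`) is undefined behavior in C++: the shift count must be strictly less than the operand width, which is also why debugger watch windows and optimized code can disagree on the result. A small sketch of the same computation; the names mirror the diff but are local stand-ins, not the JIT's definitions:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the JIT's types.
using regMaskTP = uint64_t;
constexpr regMaskTP RBM_NONE = 0;

// Build a mask with the low `availableRegCount` bits set, without ever
// shifting a 64-bit value by 64 (which is undefined behavior in C++).
inline regMaskTP computeActualRegistersMask(unsigned availableRegCount)
{
    if (availableRegCount < sizeof(regMaskTP) * 8)
    {
        return (regMaskTP{1} << availableRegCount) - 1;
    }
    return ~RBM_NONE; // all bits set; avoids the out-of-range shift
}
```

On targets like arm64, where the register count can equal the full mask width, only the `~RBM_NONE` branch is well-defined; the shift-based branch covers every smaller count.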
```diff
 #ifdef DEBUG
     // Make sure we don't have any blocks that were not visited
     for (BasicBlock* const block : compiler->Blocks())
```

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually faster than the previous code? Since it needs to do either a left shift (on amd64) or memory lookup (non-amd64). The same question applies to all the places where you introduced
genRegMask.It seems like you're saying
b = genRegMask(...)+a ^= bis faster thana &= ~b?The
genFirstRegNumFromMaskAndTogglecases seem like a clear win, but I'm not as sure about these.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a ^= (1 << …)is specially recognized and transformed intobtcon xarch. There is sometimes special optimizations possible on Arm64 as well, but it’s worst case the same number of instructions and execution cost (but often slightly shorter)
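The equivalence this exchange relies on can be checked directly: when bit `reg` is known to be set in the mask, clearing it with XOR and clearing it with AND-NOT produce the same result, and the XOR form is the shape x86-64 compilers can lower to a single `btc` (bit test and complement). A minimal sketch; the helper names are illustrative, not from the JIT:

```cpp
#include <cassert>
#include <cstdint>

// Clear bit `reg` via XOR. Only equivalent to AND-NOT when the bit is
// already set; if it were clear, XOR would *set* it instead.
inline uint64_t clearViaXor(uint64_t mask, int reg)
{
    return mask ^ (uint64_t{1} << reg);
}

// Clear bit `reg` via AND-NOT, which clears it unconditionally.
inline uint64_t clearViaAndNot(uint64_t mask, int reg)
{
    return mask & ~(uint64_t{1} << reg);
}
```

In the toggle-based loop the bit is always set at the point it is cleared (it was just returned as the lowest set bit), so the cheaper XOR form is safe there.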