Optimize multi-dimensional array access #70271
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/jit/morph.cpp (Outdated)

// TODO: morph here? Or morph at the statement level if there are differences?

JITDUMP("fgMorphArrayOpsStmt (before remorph):\n");
DISPTREE(fullExpansion);

GenTree* morphedTree = m_compiler->fgMorphTree(fullExpansion);
DBEXEC(morphedTree != fullExpansion, morphedTree->gtDebugFlags &= ~GTF_DEBUG_NODE_MORPHED);

JITDUMP("fgMorphArrayOpsStmt (after remorph):\n");
DISPTREE(morphedTree);

*use = morphedTree;
JITDUMP("Morphing GT_ARR_ELEM (after)\n");
DISPTREE(*use);
I'm re-morphing the tree here, which seems like the most targeted thing to do. But I've introduced `GT_ASG` nodes, and the `GTF_ASG` flag needs to propagate to the root. As a result, I'm also (currently) re-morphing changed trees at the statement level, below. Should I just stop re-morphing here and let the statement-level re-morph do its thing? Or should I re-morph here, exactly what was changed, and then do something else to propagate flags up the tree?
I guess I'd just do it once, at the end, otherwise if there are multiple MD array accesses you are walking to the root multiple times.
[edit] Added a 2nd run to validate. The MDRomer regression evaporated; the MDMulMatrix regression was validated.

Some perf results from the MDBenchI/MDBenchF suite:

Run 1

Run 2
As seen above, MDRomer has a small regression that could be investigated. Almost all spmi asmdiffs are improvements. There are a few outlier regressions that should be investigated, all in …

@AndyAyersMS @dotnet/jit-contrib PTAL

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 3 pipeline(s).

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Azure Pipelines successfully started running 1 pipeline(s).
// GT_ARR_ELEM nodes are morphed to appropriate trees. Note that MD array `Get`, `Set`, or `Address`
// is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
// and replaced by IR nodes, especially GT_ARR_ELEM nodes, in impArrayAccessIntrinsic().
Not an immediate concern with this change, but I wonder if you have some ideas on how to approach adding VN support for this early expansion.

The SZ case uses a parser utility that tries to reconstruct whatever morph left behind; the (significantly) more complex MD trees look less amenable to that.
That's an excellent question, and needs more thought.
Any idea why the …

Yeah, I was looking at those too. Seems we have cases like …

MDMulMatrix also has a (big?) regression.

I need to investigate. Maybe related to the large size regressions in the cases I mentioned and Kunal pointed out. Over 7% on win-x64 is pretty extreme given that I doubt many tests even have MD arrays.
Looks good overall.

You might consider keeping a temp cache in fgMorphArrayOps and marking all temps as not in use after each statement. We do this in other places (e.g., for struct arg passing, and importer box temps) to try and keep the total number of temps reasonable. SSA will see these recycled temps as having many distinct lifetimes, so it should not inhibit opts.
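A minimal sketch of the kind of per-statement temp cache being suggested here — this is hypothetical illustration only; the class, names, and callback are not the JIT's actual helpers:

```cpp
#include <cstddef>
#include <vector>

// Sketch: temps are handed out while expanding one statement and all marked
// free again afterwards, so later statements reuse them instead of creating
// ever more locals.
class TempCache
{
    std::vector<unsigned> m_temps;    // local (temp) numbers created so far
    std::size_t           m_nextFree; // temps at index < m_nextFree are in use

public:
    TempCache() : m_nextFree(0) {}

    // Hand out a free temp if one exists; otherwise create a new one via the
    // caller-supplied callback (in the JIT this would be something like lvaGrabTemp).
    template <typename CreateTempFn>
    unsigned GrabTemp(CreateTempFn createTemp)
    {
        if (m_nextFree == m_temps.size())
        {
            m_temps.push_back(createTemp());
        }
        return m_temps[m_nextFree++];
    }

    // Call after processing each statement: every temp becomes reusable again.
    // SSA later treats each reuse as a distinct lifetime, so recycling should
    // not inhibit downstream optimizations.
    void ResetForNextStatement()
    {
        m_nextFree = 0;
    }
};
```

The idea is that the expansion calls GrabTemp while processing a statement and ResetForNextStatement once the statement is done, so a method with many MD accesses reuses a small, bounded set of temps.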
// This is only enabled when early MD expansion is set because it causes small
// asm diffs (only in some test cases) otherwise. The GT_ARR_ELEM lowering code "accidentally" does
// this cast, but the new code requires it to be explicit.
argVal = impImplicitIorI4Cast(argVal, TYP_INT);
Seems dangerous to only add the cast under DEBUG.
Oops; I moved the "enabling" code from DEBUG to all-flavor last-minute but didn't update this.
for (unsigned i = 0; i < arrElem->gtArrRank; i++)
{
    GenTree* idx = arrElem->gtArrInds[i];
    if ((idx->gtFlags & GTF_ALL_EFFECT) == 0)
If `idx` is side-effect free but nontrivial, you will want to use a temp too; otherwise you might duplicate a lot of stuff and force CSE to clean up after you.
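A minimal sketch of the kind of check being suggested here — simplified, hypothetical code that assumes the JIT's GenTree definitions, not the code in this PR:

```cpp
// Sketch: decide whether an index expression should be evaluated into a temp.
// An index with side effects must be evaluated exactly once into a temp; an
// index that is side-effect free but still a non-trivial tree is also worth a
// temp, so the expansion doesn't duplicate a large tree and lean on CSE to
// clean it up later.
bool ShouldUseIndexTemp(GenTree* idx)
{
    if ((idx->gtFlags & GTF_ALL_EFFECT) != 0)
    {
        return true; // has side effects
    }
    // Hypothetical "nontrivial" heuristic: anything other than a leaf
    // (e.g., a local variable or constant) gets a temp.
    return !idx->OperIsLeaf();
}
```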
In the case without a temp, I just use the `idx` tree directly, so there is no copy. Reminds me that I should DEBUG_DESTROY_NODE the GT_ARR_ELEM node.
Ah, they are single use; it's the effective index that is multiple use.
Changes LGTM.
We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.
The temp cache improved the CQ of MDMulMatrix (and maybe others), but it's still slower than before. So, I still need to investigate MDMulMatrix CQ, as well as check the big asm diffs regressions for the …
MDMulMatrix: This test has 1 doubly-nested and 6 triply-nested loops. If I split out all the loops into individual benchmarks, they are up to 25% better with this change (one has no perf change: the "lkj" loop). If I reduce the number of loop nests in the function, when there are 3 or more, this change is slower, up to 1.5x the baseline. Seems like there must be issues with the quantity of IR/temps.

[Edit] Various MDMulMatrix subset perf runs:
aca4585 to 2c63c4b (Compare)
Size regressions: With the temp cache implemented, all the …

Also, the improvements far outweigh the regressions, e.g., for coreclr_tests:

MDMulMatrix is an outlier for how large the size regression is:
TP diffs still show a significant regression on the coreclr_tests spmi run, but that's probably just because that's where we actually have MD array accesses and the most asm code diffs. E.g., for win-x64:
One thing you can consider is hacking SPMI to produce a table of functions with the # of instructions executed for each context. That might help narrow down whether it's expected.

… and diffMetrics.NumExecutedInstructions here: runtime/src/coreclr/tools/superpmi/superpmi/superpmi.cpp, lines 396 to 397 in 0d0fd75, and then post-process this into something.
Thanks for the suggestion. I was going to figure out which method contexts are affected (at all) by my change, extract them into a separate mch, and then use JitTimeLogCsv. A perfview diff of a spmi replay (of the full coreclr_tests collection) with/without my change shows a significant increase in fgMorphSmpOp and GenTreeVisitor::WalkTree, so I should look at total IR size before/after as well.

Didn't know about this, looks useful. Maybe I should add the precise instruction count data to this mechanism too.
Configuration variables:
1. `COMPlus_JitEarlyExpandMDArrays`: set to zero to disable early MD expansion. Default is 1 (enabled).
2. If `COMPlus_JitEarlyExpandMDArrays=0`, use `COMPlus_JitEarlyExpandMDArraysFilter` to selectively enable early MD expansion for a method set (e.g., syntax like `JitDump`).
b2203da to a0d0812 (Compare)
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 3 pipeline(s).

Improvements on Linux-x64: dotnet/perf-autofiling-issues#6721
Currently, multi-dimensional (MD) array access operations are treated as opaque to most of the JIT; they pass through the optimization pipeline untouched. Lowering expands the `GT_ARR_ELEM` node (representing an `a[i,j]` operation, for example) to `GT_ARR_OFFSET` and `GT_ARR_INDEX` trees, to expand the register requirements of the operation. These are then directly used to generate code.

This change moves the expansion of `GT_ARR_ELEM` to a new pass that follows loop optimization but precedes Value Numbering, CSE, and the rest of the optimizer. This placement allows for future improvement to loop cloning to support cloning loops with MD references, but allows the optimizer to kick in on the new expansion. One nice feature of this change: there is no machine-dependent code required; all the nodes get lowered to machine-independent nodes before code generation.

The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%.
`GT_ARR_ELEM` nodes are morphed to appropriate trees. Note that an MD array `Get`, `Set`, or `Address` operation is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic and replaced by IR nodes, especially `GT_ARR_ELEM` nodes, in `impArrayAccessIntrinsic()`.
For example, a simple 2-dimensional array access like `a[i,j]` looks like:

This is replaced by:

plus the appropriate `i` and `j` bounds checks. In IR, this is:

before being morphed by the usual morph transformations.
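As a rough, self-contained sketch of the address arithmetic the expansion performs for a 2-dimensional access — illustrative only; the function name and parameters are hypothetical and this is not the JIT's actual IR:

```cpp
#include <cstddef>
#include <stdexcept>

// Sketch of what an expanded a[i, j] access computes for a 2-D MD array.
// lowerBound0/1 and length0/1 play the roles of the per-dimension
// GT_MDARR_LOWER_BOUND and GT_MDARR_LENGTH values described in the list below;
// dataOffset is the offset from the array object to the start of the array data.
char* MdArrayElemAddr(char* arrayObj, int i, int j,
                      int lowerBound0, int length0,
                      int lowerBound1, int length1,
                      std::size_t elemSize, std::size_t dataOffset)
{
    // Effective index for dimension 0 is the index minus the lower bound,
    // bounds-checked (as an unsigned compare) against the dimension length.
    unsigned effIdx0 = static_cast<unsigned>(i - lowerBound0);
    if (effIdx0 >= static_cast<unsigned>(length0))
    {
        throw std::out_of_range("index out of range, dimension 0");
    }

    unsigned effIdx1 = static_cast<unsigned>(j - lowerBound1);
    if (effIdx1 >= static_cast<unsigned>(length1))
    {
        throw std::out_of_range("index out of range, dimension 1");
    }

    // Row-major flattening: dimension 0 strides by the length of dimension 1.
    std::size_t flatIndex =
        static_cast<std::size_t>(effIdx0) * static_cast<std::size_t>(length1) + effIdx1;
    return arrayObj + dataOffset + flatIndex * elemSize;
}
```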
Some things to consider:
- … lower bound)
- `GT_MDARR_LOWER_BOUND(dim)` represents the lower-bound value for a particular array dimension. The "effective index" for a dimension is the index minus the lower bound.
- `GT_MDARR_LENGTH(dim)` represents the length value (number of elements in a dimension) for a particular array dimension.
- … `TYP_INT`).
- The offset from the array object to the beginning of the array data is added.
- … to preserve exception ordering.
- The `address` is a `byref`. As usual, we need to be careful to not create an illegal byref by adding any partial index calculation.
- The `OMF_HAS_MDARRAYREF` method flag is set if there are any MD array expressions to expand. Also, the block flag `BBF_HAS_MDARRAYREF` is set on blocks where these exist, so only those blocks are processed (see the sketch after this list).
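As a rough illustration of the last point — hypothetical, simplified code that assumes the JIT's types and the flags named above, not the actual pass in this PR:

```cpp
// Sketch: only walk statements in blocks that were flagged as containing
// MD array references; skip everything else.
if ((comp->optMethodFlags & OMF_HAS_MDARRAYREF) == 0)
{
    return; // nothing to expand in this method
}

for (BasicBlock* const block : comp->Blocks())
{
    if ((block->bbFlags & BBF_HAS_MDARRAYREF) == 0)
    {
        continue; // no GT_ARR_ELEM nodes in this block
    }
    // ... expand the GT_ARR_ELEM nodes in this block's statements ...
}
```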
Remaining work:
- `optEarlyProp` support for MD arrays.
- Remove the `GT_ARR_OFFSET` and `GT_ARR_INDEX` nodes and related code, as well as `GT_ARR_ELEM` code used after the new expansion.
The new early expansion is enabled by default. It can be disabled (even in Release, currently) by setting `COMPlus_JitEarlyExpandMDArrays=0`. If disabled, it can be selectively enabled using `COMPlus_JitEarlyExpandMDArraysFilter=<method_set>` (e.g., as specified for `JitDump`).

Fixes #60785.