JIT: overview of proposed local var reference count changes #10702
Comments
This is a preparatory change for auditing and controlling how local variable ref counts are observed and manipulated. See #18969 for context. No diffs seen locally, and no TP impact is expected. There is a small chance we may see some asserts in broader testing, as there were places in the original code where local ref counts were incremented without checking for possible overflow. The new APIs will assert in overflow cases.
Currently local ref counts only become valid at a certain point in the jit's phase pipeline. Here are some of the references / updates to ref counts that happen before that point:
I'm curious how you came up with that estimate :). Also, do we even need ref counts in minopts?
Perf estimates are from #8715.
I have a reasonably up-to-date version of this so I can re-validate ... probably a good idea.
Yeah, the performance estimates still seem to hold up; e.g. PMI of Microsoft.CodeAnalysis.CSharp with minopts goes from 1334ms to 1275ms, around 4.5%, based on the just-updated prototype. The actual benefit may be higher because the prototype still leaves a lot of ref counting cruft behind.
Roughly speaking, the "invalid" ref count accesses in the current scheme (ref count reads, incs, decs, or sets before the jit officially starts ref counting) fall into a few categories:
The implicit cases can perhaps be better handled by new local var property bits, e.g. a bit marking locals that are implicitly referenced. The conveying cases can perhaps be tolerated by using a different set of APIs or a special parameter to permit temporary (short-lifetime) use of ref counts for some subset of locals. Struct last-use copy avoidance is arguably better handled by the work on first-class structs, as that should be more general, more capable, and less fragile.
Instead of relying on ref count bumps, add a new attribute bit to local vars to indicate that they may have implicit references (prolog, epilog, gc, eh) and may not have any IR references. Use this attribute bit to ensure that the ref count and weighted ref count for such variables are never reported as zero, and as a result that these variables end up being allocated and reportable. This is another preparatory step for #18969 and frees the jit to recompute explicit ref counts via an IR scan without having to special case the counts for these variables. The jit can no longer describe implicit counts other than 1 and implicit weights other than BB_UNITY_WEIGHT, but that currently doesn't seem to be very important. The new bit fits into an existing padding void, so LclVarDsc remains at 128 bytes (for Windows x64).
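For illustration, the attribute bit plus the count fields might look roughly like this on the local var descriptor (a minimal sketch, not the real `LclVarDsc` layout; `m_lvRefCntWtd` is an assumed field name):

```cpp
// Minimal sketch of the relevant LclVarDsc pieces (the real struct has many more fields).
struct LclVarDscSketch
{
    // Set for locals with prolog/epilog/GC/EH references that may have no IR
    // references; keeps recomputed counts from reporting zero for them.
    unsigned char  lvImplicitlyReferenced : 1;

    unsigned short m_lvRefCnt;    // explicit (unweighted) ref count
    unsigned       m_lvRefCntWtd; // weighted ref count (assumed name)
};
```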
dotnet/coreclr#19012 should clear up the implicit reference cases and the avoidable cases. For the other cases, one thought is to essentially make ref counts into something like a phased var, where the count fields are valid only for a certain time and a certain purpose. The various APIs would then be extended with an enum describing the accessing phase, and this would be error-checked against the current phase. This could extend later to describe intervals when the counts are valid or potentially stale. The error checking is a bit clunky with the current API shape, as we either also need to pass in the compiler instance or fetch it from TLS. But perhaps the latter is acceptable...?
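As a concrete illustration of the phase-tagging idea, the state enum and the compiler-side field might look something like this (the member names `RCS_INVALID`, `RCS_EARLY`, and `RCS_NORMAL` come up later in this thread; the exact shape here is a sketch):

```cpp
// Sketch: phases during which local ref counts may be read or written.
enum RefCountState
{
    RCS_INVALID, // ref counts are not valid and must not be consulted
    RCS_EARLY,   // limited early counting used by a few pre-morph optimizations
    RCS_NORMAL,  // full ref counting, once the jit officially starts counting
};

// The compiler would carry the current state, e.g.
//     RefCountState lvaRefCountState = RCS_INVALID;
// and the lvRefCnt/lvRefCntWtd accessors would assert that the caller-supplied
// state matches it (see the snippet in the next comment).
```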
It is fine for use in debug-only code, but we should avoid using it for retail code.
It would be something like this:

```cpp
inline unsigned short LclVarDsc::lvRefCnt(RefCountState state) const
{
#if defined(DEBUG)
    assert(state != RCS_INVALID);
    Compiler* compiler = JitTls::GetCompiler();
    assert(compiler->lvaRefCountState == state);
#endif

    if (lvImplicitlyReferenced && (m_lvRefCnt == 0))
    {
        return 1;
    }

    return m_lvRefCnt;
}
```
This rings a bell, but I don't recall exactly what's being done. Can you point me to this? Re: the error checking, I think it's fine to fetch the compiler instance from TLS for an assert.
Look at ...
Thanks @AndyAyersMS - I now recall that some time ago, when I was making the first round of first-class struct changes, I deleted this line: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L5297 but it caused diffs, which I was trying to avoid, and I wanted to analyze it further - so I created that "to do" and it's still there :-(
I can leave that as is for now -- I think the best thing for now is to tolerate the current "early" ref counting and just make sure the only early ref counting is related to these optimizations. Once this kind of stuff can be centralized, counts can be reliably recomputed...
I suppose another question is whether we should start discouraging independent manipulation of the ref count and the weighted ref count. Most of the time they should both be changed, with the weight coming from the block that contains the ref. A lot of the code does this already, but the early ref count stuff doesn't use weights and so might arguably be exempt. The only problematic cases for normal manipulation are the ones where there is no actual ref and so no block to use; these are the ones we'd like to call out specially anyway, as they are the implicit references that any recount algorithm will need to know about.
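A standalone sketch of what a coupled update API could look like, where the weight always comes from the referencing block (names and types here are illustrative, not the jit's actual API):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative block and local models.
struct BlockSketch
{
    uint32_t bbWeight; // e.g. BB_UNITY_WEIGHT or a profile-derived weight
};

struct LocalSketch
{
    uint16_t refCnt    = 0;
    uint32_t refCntWtd = 0;

    // Bump both counts together; the weight is taken from the block that
    // contains the reference, so the two values can't drift apart.
    void addRef(const BlockSketch& block)
    {
        assert(refCnt < std::numeric_limits<uint16_t>::max()); // assert on overflow
        refCnt++;
        refCntWtd += block.bbWeight;
    }
};
```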
Here's a more problematic case. During call args morphing, the jit invokes the eval-order / tree costing code (gtSetEvalOrder), which consults local ref counts and weights. But we have not yet set up ref counts and weights.
The upshot is that most locals will be costed during morph as if they are unlikely to be enregistered, since most ref counts will be zero. There may be some nonzero counts from early ref counting and/or implicit refs. It seems like the fix here is to ensure ref counts are valid, and if they're not, decide to be either optimistic or pessimistic. For now, probably the latter, as it will have fewer diffs. While we're in the neighborhood, this bit also looks problematic: the float exclusion is probably a holdover from x87? I'll open a separate issue for this.
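A sketch of the pessimistic choice (the actual check lives in the tree costing code; this just models the decision under the assumption that counts are only trustworthy once normal counting has started):

```cpp
// Sketch of the ref count phase states referenced in this thread.
enum RefCountStateSketch
{
    RCS_INVALID,
    RCS_EARLY,
    RCS_NORMAL
};

// Decide whether to cost a local as if it is likely to be enregistered.
// If ref counts aren't valid yet, answer pessimistically (assume stack),
// which minimizes diffs versus the current behavior.
bool isLikelyEnregistered(RefCountStateSketch state, unsigned refCntWtd, unsigned threshold)
{
    if (state != RCS_NORMAL)
    {
        return false; // pessimistic: counts may be zero or stale
    }
    return refCntWtd >= threshold;
}
```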
Left some notes on ...
I'm not sure - I might assume the reverse, since at least most of the reasons things become ineligible for enregistering are handled by morph. In any event, I would be surprised if the slight changes in eval order are both significant and not addressable in some other way (i.e. there may be some other factors being missed).
If there are only a handful of LclVars then you can be optimistic; if we have hundreds, then pessimistic would probably be the right choice. We could also call gtSetEvalOrder again after determining the wtdRefCnts.
Pessimistic didn't cause any diffs (at least in PMI/Crossgen x64), so I've done that in dotnet/coreclr#19068.
Started in on the first part of the minopts speedup: not traversing the IR during ref counting. This causes code size increases, as some methods have unreferenced locals, and with this change we think they are referenced (so we, for instance, spill params to the stack):
Despite the code size increase, throughput improves around 3-5%. I tried playing a bit with mixed strategies -- e.g. guessing that small methods (based on BB count) might be more likely to have unused params and locals -- but no luck so far; it led to bigger code and a smaller jitting win. It might make more sense to look at the total amount of IR created or something; in minopts there hopefully isn't much in the way of dead IR. I will post more detailed and carefully measured numbers when I get around to a PR. Trying to decide now if I should go through the pre-morph and post-morph explicit ref count updates and only do them when optimizing, or leave them be. If we skip them post-morph, then later in the compiler we need to do a second ref-count-setting pass for newly introduced locals (this is more or less what Pat did in his prototype). It looks like some of the early-ref-count-enabled optimizations may be active in minopts, which also needs some sorting out. I suspect perhaps these should be turned off.
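Roughly, the fast-jitting path could skip the IR walk and just flag everything as referenced; a simplified standalone sketch (the names here are illustrative, not the jit's actual code):

```cpp
#include <vector>

struct MinoptsLocalSketch
{
    bool implicitlyReferenced = false;
    int  refCnt               = 0;
};

// When not optimizing, skip the IR traversal entirely and mark every local as
// referenced so it still gets a frame home (this costs some code size for truly
// unreferenced locals, but saves jit time). When optimizing, counts are rebuilt
// by walking the IR instead.
void setMinoptsRefCounts(std::vector<MinoptsLocalSketch>& locals)
{
    for (MinoptsLocalSketch& lcl : locals)
    {
        lcl.implicitlyReferenced = true;
    }
}
```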
dotnet/coreclr#19103 has some work towards streamlining minopts. It still needs sorting out because of the many-to-many relationship between various concepts: opt levels, opt flags, enregistration of locals, tracking, sorting, keeping debug codegen more or less in sync with minopts, and possible platform or OS variances. Arguably debug codegen throughput is as important as or more important than minopts, since it impacts the dev inner loop and "F5" latency. But I see quite a few places where we only check for minopts when it really looks like we should check for both (e.g. ...).
Extract out the basic ref count computation into a method that we can conceptually call later on if we want to recompute counts. Move one existing RCS_EARLY count for promoted fields of register args into this recomputation since losing this count bump causes quite a few diffs. The hope is to eventually call this method again later in the jit phase pipeline and then possibly get rid of all the (non-early) incremental count maintenance we do currently. Part of #18969
Prototyping the next batch of changes after dotnet/coreclr#19240 where we now recompute counts from scratch after lowering (but still leave the incremental updates intact). Preliminary diffs are encouraging...
Still need to carefully look at CQ and TP impact, but diffs show lots of cases where inflated ref counts were causing the jit to make poor decisions.
Update `lvaComputeRefCounts` to encapsulate running ref counts post-lower and to also handle the fast jitting cases. Invoke this after lower to provide recomputed (and more accurate) counts. Part of #18969.
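Conceptually the recompute is just "zero everything, then walk the IR and bump count and weighted count per appearance"; here is a standalone sketch of that shape (types and names simplified, not the actual `lvaComputeRefCounts` code):

```cpp
#include <cstdint>
#include <vector>

struct LclSketch
{
    uint16_t refCnt    = 0;
    uint32_t refCntWtd = 0;
};

struct BlockSketch
{
    uint32_t              weight;  // block weight (BB_UNITY_WEIGHT or profile-derived)
    std::vector<unsigned> lclRefs; // local numbers referenced by this block's IR
};

// Batch recompute: throw away any stale incremental counts and rebuild from the IR.
void computeRefCountsSketch(std::vector<LclSketch>& locals, const std::vector<BlockSketch>& blocks)
{
    for (LclSketch& lcl : locals)
    {
        lcl.refCnt    = 0;
        lcl.refCntWtd = 0;
    }

    for (const BlockSketch& block : blocks)
    {
        for (unsigned lclNum : block.lclRefs)
        {
            locals[lclNum].refCnt++;
            locals[lclNum].refCntWtd += block.weight;
        }
    }

    // Implicitly referenced locals (#19012) never report zero; the accessor
    // returns at least 1 for them, so no special casing is needed here.
}
```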
Now that dotnet/coreclr#19325 is in, I've started prototyping the removal of the RCS_NORMAL incremental count updates. There is a lot of code to delete, and it is not easy to get at this part incrementally. Along with this we need to do something about tracking the need to sort the locals table. For now I've opted to sort it explicitly rather than triggering the sort via various ref count machinations, as we won't have those to guide us. The current prototype more or less undoes the size wins seen in dotnet/coreclr#19325, as we again lose track of some of the zero-ref cases. So one idea is to add an RCS_LATE option that permits incremental ref count updates after lower, where (presumably) they are easier to get right (as it is mostly dead code, I think). Another is to simply recompute the counts again before assigning frame offsets. I am going to pursue this "second recompute" idea initially and see how it goes.
Recomputing counts again after liveness fixes most of the diffs... it's somewhat surprising how little the incremental counts were used. Details shortly.
For details, see the notes in dotnet/coreclr#19345.
Playing around with removing the local table sort. I think removing it may also have minimal codegen impact; should have data shortly. Because of the way ...
If we are eliminating the sort, we may want to still consider https://github.com/dotnet/coreclr/issues/11339. I think it could improve throughput to make the set of lclVars that are live across blocks more dense, and it would also enable slightly different heuristics for the two types. Also, many, if not all, of the locals introduced in ...
... and if it is indeed the case that ...
Interesting. Since we walk the IR to recompute the ref counts, we should be able to fairly cheaply determine if a trackable local's lifetime is block-local, and split the tracking index assignment into two passes so the globally live locals all have low tracking indexes. |
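A standalone sketch of the two-pass index assignment (how "live across blocks" gets computed is assumed to fall out of the ref count IR walk):

```cpp
#include <vector>

struct TrackCandidate
{
    bool trackable        = false;
    bool liveAcrossBlocks = false; // lifetime spans more than one block
    int  trackedIndex     = -1;    // -1 means untracked
};

// Pass 1: give globally live (cross-block) trackable locals the low, dense indexes.
// Pass 2: fill the remaining slots with block-local trackable locals.
void assignTrackingIndexes(std::vector<TrackCandidate>& locals, int maxTracked)
{
    int next = 0;

    for (TrackCandidate& lcl : locals)
    {
        if ((next < maxTracked) && lcl.trackable && lcl.liveAcrossBlocks)
        {
            lcl.trackedIndex = next++;
        }
    }

    for (TrackCandidate& lcl : locals)
    {
        if ((next < maxTracked) && lcl.trackable && !lcl.liveAcrossBlocks)
        {
            lcl.trackedIndex = next++;
        }
    }
}
```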
Back to sorting for the moment -- very few methods (~10) in the pmi -f set have more than 512 trackable locals, and the issue I noted above doesn't seem to crop up for any of them, as we evidently do a good enough job sorting that the first 512 sorted candidates are all trackable. So one might expect a small number of diffs. However, the numbers say otherwise:
For the most part, not sorting would be a size wash or a win, except for the trace event methods -- they have huge numbers of temps. We can "fix" the problem there by less aggressively tracking structs, but that's not good in general. (The wins seem to come perhaps from changes in CSE behavior; I haven't looked at this too closely yet.) So we might consider some kind of two-pass tracking selection, where the first pass picks out higher-priority cases and the second pass then fills in anything else trackable until we run out of space. Or several passes, if we want to get globally live things densely clustered in the ID space. The second issue is that the jit is evidently sensitive to the ordering relationship induced by the IDs -- these have been effectively "permuted" by not sorting the table, and this is likely what causes the larger number of diffs. If we're sorting, the IDs reflect an underlying ordering metric based on weighted ref counts (and other things), so it seems likely we can track down where the ordering comparison happens with IDs and perhaps replace it with the predicate used to sort -- if it's not needed too often.
Worth pointing out that the methods with big regressions all have large numbers of trackable but untracked locals. Some rough accounting...
Considering that such large methods tend to be rare, would it make sense to increase the max number of tracked variables?
Maybe? The liveness time and space cost is proportional to BV size * number of BBs. So if a method has a lot of locals but few BBs then sure, we could up the limit. If it has a lot of locals and a lot of BBs, then perhaps we shouldn't be optimizing it at all. Divergence (at least in the cases I've explored so far) for fully trackable methods seems to come about in one particular bit of code; we can rewrite this bit to just sort the tracked reg params (usually just a handful) using the old predicate.
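A sketch of sorting just the tracked register params with the old weighted-ref-count predicate, rather than keeping the whole locals table sorted (standalone model; the real sort predicate considers more than just weights):

```cpp
#include <algorithm>
#include <vector>

struct RegParamSketch
{
    unsigned lclNum;
    unsigned refCntWtd;
};

// Order the (usually small) set of tracked register params by weighted ref
// count, heaviest first, mirroring what the full table sort used to provide.
void sortTrackedRegParams(std::vector<RegParamSketch>& regParams)
{
    std::sort(regParams.begin(), regParams.end(),
              [](const RegParamSketch& a, const RegParamSketch& b) {
                  return a.refCntWtd > b.refCntWtd;
              });
}
```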
Yep, it needs to be done with care; we probably don't need another JIT64. I was considering making the max number of tracked variables configurable after I got rid of the fixed-size arrays in #18504, just to see what impact increasing the limit has. But I've yet to get to it...
Hmm, I suppose that would allow making the global BB BVs smaller than the BVs used for local liveness. But such a change is probably rather tricky, and it would not help if it turns out that the number of globally live lclVars is high.
This is basically done, so will close. |
This issue proposes a series of changes to reduce the overhead and increase the fidelity of local variable reference counts in the jit. The main idea is to get rid of the current costly and buggy incremental count maintenance in favor of batch updates that are done just before accurate ref counts are needed.
See discussion in #8715 for background.
Expected impact is:
The more accurate reference counts and weighted counts are likely to cause widespread codegen diffs. Hopefully these will mostly be improvements, but some regressions are certainly possible.
Proposed steps are:
- Wrap `lvRefCnt` and `lvRefCntWtd` in an API and make the backing fields private (#18979)
- Sort out which parts of `lvaMarkRefs` are required for all compilations vs only needed when optimizing (#19103, coreclr#19240)
- Refactor `lvaMarkRefs` to split out ref counting from other activities

cc @dotnet/jit-contrib
category:implementation
theme:ir
skill-level:expert
cost:large