JIT: overview of proposed local var reference count changes #10702
Comments
This is a preparatory change for auditing and controlling how local variable ref counts are observed and manipulated. See #18969 for context. No diffs seen locally, and no TP impact is expected. There is a small chance we may see some asserts in broader testing, as there were places in the original code where local ref counts were incremented without checking for possible overflow. The new APIs will assert in overflow cases.
Currently local ref counts only become valid at a certain point in the jit's phase pipeline. Here are some of the references / updates to ref counts that happen before that point:
I'm curious how you came up with that estimate :). Also, do we even need ref counts in minopts?
Perf estimates are from #8715.
I have a reasonably up-to-date version of this so I can re-validate ... probably a good idea.
Yeah, the performance estimates still seem to hold up; e.g. PMI of Microsoft.CodeAnalysis.CSharp with minopts goes from 1334ms to 1275ms, around 4.5%, based on the just-updated prototype. The actual benefit may be higher because the prototype still leaves a lot of ref counting cruft behind.
Roughly speaking, the "invalid" ref count accesses in the current scheme (ref count reads, incs, decs, or sets before the jit officially starts ref counting) fall into a few categories:
The implicit cases can perhaps be better handled by new local var property bits, e.g. a bit marking locals that are implicitly referenced. The conveying cases can perhaps be tolerated by using a different set of APIs or a special parameter to permit temporary (short-lifetime) use of ref counts for some subset of locals. Struct last-use copy avoidance is arguably better handled by the work on first-class structs, as that should be more general, more capable, and less fragile.
Instead of relying on ref count bumps, add a new attribute bit to local vars to indicate that they may have implicit references (prolog, epilog, gc, eh) and may not have any IR references. Use this attribute bit to ensure that the ref count and weighted ref count for such variables are never reported as zero, and as a result that these variables end up being allocated and reportable. This is another preparatory step for #18969 and frees the jit to recompute explicit ref counts via an IR scan without having to special case the counts for these variables. The jit can no longer describe implicit counts other than 1 and implicit weights other than BB_UNITY_WEIGHT, but that currently doesn't seem to be very important. The new bit fits into an existing padding void, so LclVarDsc remains at 128 bytes (for Windows x64).
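For illustration, the attribute bit plus the count fields might look roughly like this on the local var descriptor (a minimal sketch, not the real `LclVarDsc` layout; `m_lvRefCntWtd` is an assumed field name):

```cpp
// Minimal sketch of the relevant LclVarDsc pieces (the real struct has many more fields).
struct LclVarDscSketch
{
    // Set for locals with prolog/epilog/GC/EH references that may have no IR
    // references; keeps recomputed counts from reporting zero for them.
    unsigned char  lvImplicitlyReferenced : 1;

    unsigned short m_lvRefCnt;    // explicit (unweighted) ref count
    unsigned       m_lvRefCntWtd; // weighted ref count (assumed name)
};
```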
dotnet/coreclr#19012 should clear up the implicit reference cases and the avoidable cases. For the other cases, one thought is to essentially make ref counts into something like a phased var, where the count fields are valid only for a certain time and a certain purpose. The various APIs would then be extended with an enum describing the accessing phase, and this would be error-checked against the current phase. This could extend later to describe intervals when the counts are valid or potentially stale. The error checking is a bit clunky with the current API shape, as we either also need to pass in the compiler instance or fetch it from TLS. But perhaps the latter is acceptable...?
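As a concrete illustration of the phase-tagging idea, the state enum and the compiler-side field might look something like this (the member names `RCS_INVALID`, `RCS_EARLY`, and `RCS_NORMAL` come up later in this thread; the exact shape here is a sketch):

```cpp
// Sketch: phases during which local ref counts may be read or written.
enum RefCountState
{
    RCS_INVALID, // ref counts are not valid and must not be consulted
    RCS_EARLY,   // limited early counting used by a few pre-morph optimizations
    RCS_NORMAL,  // full ref counting, once the jit officially starts counting
};

// The compiler would carry the current state, e.g.
//     RefCountState lvaRefCountState = RCS_INVALID;
// and the lvRefCnt/lvRefCntWtd accessors would assert that the caller-supplied
// state matches it (see the snippet in the next comment).
```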
It is fine for use in debug-only code, but we should avoid using it for retail code.
It would be something like this:

```cpp
inline unsigned short LclVarDsc::lvRefCnt(RefCountState state) const
{
#if defined(DEBUG)
    assert(state != RCS_INVALID);
    Compiler* compiler = JitTls::GetCompiler();
    assert(compiler->lvaRefCountState == state);
#endif

    if (lvImplicitlyReferenced && (m_lvRefCnt == 0))
    {
        return 1;
    }

    return m_lvRefCnt;
}
```
This rings a bell, but I don't recall exactly what's being done. Can you point me to this? Re: the error checking, I think it's fine to fetch the compiler instance from TLS for an assert.
Look at ...
Thanks @AndyAyersMS - I now recall that some time ago, when I was making the first round of first-class struct changes, I deleted this line: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L5297 but it caused diffs, which I was trying to avoid, and I wanted to analyze it further - so I created that "to do" and it's still there :-(
I can leave that as is for now -- I think the best thing for now is to tolerate the current "early" ref counting and just make sure the only early ref counting is related to these optimizations. Once this kind of stuff can be centralized, counts can be reliably recomputed...
I suppose another question is whether we should start discouraging independent manipulation of the ref count and the weighted ref count. Most of the time they should both be changed, with the weight coming from the block that contains the ref. A lot of the code does this already, but the early ref count stuff doesn't use weights and so might arguably be exempt. The only problematic cases for normal manipulation are the ones where there is no actual ref and so no block to use; these are the ones we'd like to call out specially anyway, as they are the implicit references that any recount algorithm will need to know about.
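A standalone sketch of what a coupled update API could look like, where the weight always comes from the referencing block (names and types here are illustrative, not the jit's actual API):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative block and local models.
struct BlockSketch
{
    uint32_t bbWeight; // e.g. BB_UNITY_WEIGHT or a profile-derived weight
};

struct LocalSketch
{
    uint16_t refCnt    = 0;
    uint32_t refCntWtd = 0;

    // Bump both counts together; the weight is taken from the block that
    // contains the reference, so the two values can't drift apart.
    void addRef(const BlockSketch& block)
    {
        assert(refCnt < std::numeric_limits<uint16_t>::max()); // assert on overflow
        refCnt++;
        refCntWtd += block.bbWeight;
    }
};
```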
Here's a more problematic case. During call args morphing, the jit invokes the eval-order / tree costing code (gtSetEvalOrder), which consults local ref counts and weights. But we have not yet set up ref counts and weights.
The upshot is that most locals will be costed during morph as if they are unlikely to be enregistered, since most ref counts will be zero. There may be some nonzero counts from early ref counting and/or implicit refs. It seems like the fix here is to ensure ref counts are valid, and if they're not, decide to be either optimistic or pessimistic. For now, probably the latter, as it will have fewer diffs. While we're in the neighborhood, this bit also looks problematic: the float exclusion is probably a holdover from x87? I'll open a separate issue for this.
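A sketch of the pessimistic choice (the actual check lives in the tree costing code; this just models the decision under the assumption that counts are only trustworthy once normal counting has started):

```cpp
// Sketch of the ref count phase states referenced in this thread.
enum RefCountStateSketch
{
    RCS_INVALID,
    RCS_EARLY,
    RCS_NORMAL
};

// Decide whether to cost a local as if it is likely to be enregistered.
// If ref counts aren't valid yet, answer pessimistically (assume stack),
// which minimizes diffs versus the current behavior.
bool isLikelyEnregistered(RefCountStateSketch state, unsigned refCntWtd, unsigned threshold)
{
    if (state != RCS_NORMAL)
    {
        return false; // pessimistic: counts may be zero or stale
    }
    return refCntWtd >= threshold;
}
```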
Left some notes on ...
I'm not sure - I might assume the reverse, since at least most of the reasons things become ineligible for enregistering are handled by morph. In any event, I would be surprised if the slight changes in eval order are both significant and not addressable in some other way (i.e. there may be some other factors being missed).
If there are only a handful of LclVars then you can be optimistic; if we have hundreds, then pessimistic would probably be the right choice. We could also call gtSetEvalOrder again after determining the wtdRefCnts.
Pessimistic didn't cause any diffs (at least in PMI/Crossgen x64), so I've done that in dotnet/coreclr#19068.
Started in on the first part of the minopts speedup: not traversing the IR during ref counting. This causes code size increases, as some methods have unreferenced locals, and with this change we think they are referenced (so we, for instance, spill params to the stack):
Despite the code size increase, throughput improves around 3-5%. I tried playing a bit with mixed strategies -- e.g. guessing that small methods (based on BB count) might be more likely to have unused params and locals -- but no luck so far; it led to bigger code and a smaller jitting win. It might make more sense to look at the total amount of IR created or something; in minopts there hopefully isn't much in the way of dead IR. I will post more detailed and carefully measured numbers when I get around to a PR. Trying to decide now if I should go through the pre-morph and post-morph explicit ref count updates and only do them when optimizing, or leave them be. If we skip them post-morph, then later in the compiler we need to do a second ref-count-setting pass for newly introduced locals (this is more or less what Pat did in his prototype). It looks like some of the early-ref-count-enabled optimizations may be active in minopts, which also needs some sorting out. I suspect perhaps these should be turned off.
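Roughly, the fast-jitting path could skip the IR walk and just flag everything as referenced; a simplified standalone sketch (the names here are illustrative, not the jit's actual code):

```cpp
#include <vector>

struct MinoptsLocalSketch
{
    bool implicitlyReferenced = false;
    int  refCnt               = 0;
};

// When not optimizing, skip the IR traversal entirely and mark every local as
// referenced so it still gets a frame home (this costs some code size for truly
// unreferenced locals, but saves jit time). When optimizing, counts are rebuilt
// by walking the IR instead.
void setMinoptsRefCounts(std::vector<MinoptsLocalSketch>& locals)
{
    for (MinoptsLocalSketch& lcl : locals)
    {
        lcl.implicitlyReferenced = true;
    }
}
```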
dotnet/coreclr#19103 has some work towards streamlining minopts. It still needs sorting out because of the many-to-many relationship between various concepts: opt levels, opt flags, enregistration of locals, tracking, sorting, keeping debug codegen more or less in sync with minopts, and possible platform or OS variances. Arguably debug codegen throughput is as important as or more important than minopts, since it impacts the dev inner loop and "F5" latency. But I see quite a few places where we only check for minopts when it really looks like we should check for both (e.g. ...).
Extract out the basic ref count computation into a method that we can conceptually call later on if we want to recompute counts. Move one existing RCS_EARLY count for promoted fields of register args into this recomputation since losing this count bump causes quite a few diffs. The hope is to eventually call this method again later in the jit phase pipeline and then possibly get rid of all the (non-early) incremental count maintenance we do currently. Part of #18969
Prototyping the next batch of changes after dotnet/coreclr#19240 where we now recompute counts from scratch after lowering (but still leave the incremental updates intact). Preliminary diffs are encouraging...
Still need to carefully look at CQ and TP impact, but diffs show lots of cases where inflated ref counts were causing the jit to make poor decisions.
Update `lvaComputeRefCounts` to encapsulate running ref counts post-lower and to also handle the fast jitting cases. Invoke this after lower to provide recomputed (and more accurate) counts. Part of #18969.
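Conceptually the recompute is just "zero everything, then walk the IR and bump count and weighted count per appearance"; here is a standalone sketch of that shape (types and names simplified, not the actual `lvaComputeRefCounts` code):

```cpp
#include <cstdint>
#include <vector>

struct LclSketch
{
    uint16_t refCnt    = 0;
    uint32_t refCntWtd = 0;
};

struct BlockSketch
{
    uint32_t              weight;  // block weight (BB_UNITY_WEIGHT or profile-derived)
    std::vector<unsigned> lclRefs; // local numbers referenced by this block's IR
};

// Batch recompute: throw away any stale incremental counts and rebuild from the IR.
void computeRefCountsSketch(std::vector<LclSketch>& locals, const std::vector<BlockSketch>& blocks)
{
    for (LclSketch& lcl : locals)
    {
        lcl.refCnt    = 0;
        lcl.refCntWtd = 0;
    }

    for (const BlockSketch& block : blocks)
    {
        for (unsigned lclNum : block.lclRefs)
        {
            locals[lclNum].refCnt++;
            locals[lclNum].refCntWtd += block.weight;
        }
    }

    // Implicitly referenced locals (#19012) never report zero; the accessor
    // returns at least 1 for them, so no special casing is needed here.
}
```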
Now that dotnet/coreclr#19325 is in, I've started prototyping the removal of the RCS_NORMAL incremental count updates. There is a lot of code to delete, and it is not easy to get at this part incrementally. Along with this we need to do something about tracking the need to sort the locals table. For now I've opted to sort it explicitly rather than triggering the sort via various ref count machinations, as we won't have those to guide us. The current prototype more or less undoes the size wins seen in dotnet/coreclr#19325, as we again lose track of some of the zero-ref cases. So one idea is to add an RCS_LATE option that permits incremental ref count updates after lower, where (presumably) they are easier to get right (as it is mostly dead code, I think). Another is to simply recompute the counts again before assigning frame offsets. I am going to pursue this "second recompute" idea initially and see how it goes.
Recomputing counts again after liveness fixes most of the diffs... it's somewhat surprising how little the incremental counts were used. Details shortly.
For details, see the notes in dotnet/coreclr#19345.
Playing around with removing the local table sort. I think removing it may also have minimal codegen impact; should have data shortly. Because of the way ...
If we are eliminating the sort, we may want to still consider https://github.com/dotnet/coreclr/issues/11339. I think it could improve throughput to make the set of lclVars that are live across blocks more dense, and it would also enable slightly different heuristics for the two types. Also, many, if not all, of the locals introduced in ...
... and if it is indeed the case that ...
Interesting. Since we walk the IR to recompute the ref counts, we should be able to fairly cheaply determine if a trackable local's lifetime is block-local, and split the tracking index assignment into two passes so the globally live locals all have low tracking indexes. |
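A standalone sketch of the two-pass index assignment (how "live across blocks" gets computed is assumed to fall out of the ref count IR walk):

```cpp
#include <vector>

struct TrackCandidate
{
    bool trackable        = false;
    bool liveAcrossBlocks = false; // lifetime spans more than one block
    int  trackedIndex     = -1;    // -1 means untracked
};

// Pass 1: give globally live (cross-block) trackable locals the low, dense indexes.
// Pass 2: fill the remaining slots with block-local trackable locals.
void assignTrackingIndexes(std::vector<TrackCandidate>& locals, int maxTracked)
{
    int next = 0;

    for (TrackCandidate& lcl : locals)
    {
        if ((next < maxTracked) && lcl.trackable && lcl.liveAcrossBlocks)
        {
            lcl.trackedIndex = next++;
        }
    }

    for (TrackCandidate& lcl : locals)
    {
        if ((next < maxTracked) && lcl.trackable && !lcl.liveAcrossBlocks)
        {
            lcl.trackedIndex = next++;
        }
    }
}
```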
Back to sorting for the moment -- very few methods (~10) in the pmi -f set have more than 512 trackable locals, and the issue I noted above doesn't seem to crop up for any of them, as we evidently do a good enough job sorting that the first 512 sorted candidates are all trackable. So one might expect a small number of diffs. However, the numbers say otherwise:
For the most part, not sorting would be a size wash or a win, except for the trace event methods -- they have huge numbers of temps. We can "fix" the problem there by less aggressively tracking structs, but that's not good in general. (The wins seem to come perhaps from changes in CSE behavior; I haven't looked at this too closely yet.) So we might consider some kind of two-pass tracking selection, where the first pass picks out higher-priority cases and the second pass then fills in anything else trackable until we run out of space. Or several passes, if we want to get globally live things densely clustered in the ID space. The second issue is that the jit is evidently sensitive to the ordering relationship induced by the IDs -- these have been effectively "permuted" by not sorting the table, and this is likely what causes the larger number of diffs. If we're sorting, the IDs reflect an underlying ordering metric based on weighted ref counts (and other things), so it seems likely we can track down where the ordering comparison happens with IDs and perhaps replace it with the predicate used to sort -- if it's not needed too often.
Worth pointing out that the methods with big regressions all have large numbers of trackable but untracked locals. Some rough accounting...
Considering that such large methods tend to be rare, would it make sense to increase the max number of tracked variables?
Maybe? The liveness time and space cost is proportional to BV size * number of BBs. So if a method has a lot of locals but few BBs then sure, we could up the limit. If it has a lot of locals and a lot of BBs, then perhaps we shouldn't be optimizing it at all. Divergence (at least in the cases I've explored so far) for fully trackable methods seems to come about in one particular bit of code; we can rewrite this bit to just sort the tracked reg params (usually just a handful) using the old predicate.
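A sketch of sorting just the tracked register params with the old weighted-ref-count predicate, rather than keeping the whole locals table sorted (standalone model; the real sort predicate considers more than just weights):

```cpp
#include <algorithm>
#include <vector>

struct RegParamSketch
{
    unsigned lclNum;
    unsigned refCntWtd;
};

// Order the (usually small) set of tracked register params by weighted ref
// count, heaviest first, mirroring what the full table sort used to provide.
void sortTrackedRegParams(std::vector<RegParamSketch>& regParams)
{
    std::sort(regParams.begin(), regParams.end(),
              [](const RegParamSketch& a, const RegParamSketch& b) {
                  return a.refCntWtd > b.refCntWtd;
              });
}
```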
Yep, it needs to be done with care; we probably don't need another JIT64. I was considering making the max number of tracked variables configurable after I got rid of the fixed-size arrays in #18504, just to see what impact increasing the limit has. But I've yet to get to it...
Hmm, I suppose that would allow making the global BB BVs smaller than the BVs used for local liveness. But such a change is probably rather tricky, and it would not help if it turns out that the number of globally live lclVars is high.
This is basically done, so will close. |
This issue proposes a series of changes to reduce the overhead and increase the fidelity of local variable reference counts in the jit. The main idea is to get rid of the current costly and buggy incremental count maintenance in favor of batch updates that are done just before accurate ref counts are needed.
See discussion in #8715 for background.
Expected impact is:
The more accurate reference counts and weighted counts are likely to cause widespread codegen diffs. Hopefully these will mostly be improvements, but some regressions are certainly possible.
Proposed steps are:
- Wrap `lvRefCnt` and `lvRefCntWtd` in an API and make the backing fields private (#18979)
- Sort out which parts of `lvaMarkRefs` are required for all compilations vs only needed when optimizing (#19103, coreclr#19240)
- Refactor `lvaMarkRefs` to split out ref counting from other activities

cc @dotnet/jit-contrib
category:implementation
theme:ir
skill-level:expert
cost:large