
JIT: overview of proposed local var reference count changes #10702

Closed · 11 of 12 tasks
AndyAyersMS opened this issue Jul 17, 2018 · 38 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), JitThroughput (CLR JIT issues regarding speed of JIT itself)

Comments

@AndyAyersMS (Member)

This issue proposes a series of changes to reduce the overhead and increase the fidelity of local variable reference counts in the jit. The main idea is to get rid of the current costly and buggy incremental count maintenance in favor of batch updates that are done just before accurate ref counts are needed.

See discussion in #8715 for background.

Expected impact is:

  • 4-5% improvement in tier0 / minopts / debug throughput
  • minimal impact to optimized jit throughput
  • smaller frame sizes in some cases where reference counts are currently inflated
  • removal of workarounds to artificially inflate reference counts to avoid asserts

The more accurate reference counts and weighted counts are likely to cause widespread codegen diffs. Hopefully these will mostly be improvements, but some regressions are certainly possible.

Proposed steps are:

cc @dotnet/jit-contrib

category:implementation
theme:ir
skill-level:expert
cost:large

@AndyAyersMS AndyAyersMS self-assigned this Jul 17, 2018
AndyAyersMS referenced this issue in AndyAyersMS/coreclr Jul 18, 2018
This is a preparatory change for auditing and controlling how local
variable ref counts are observed and manipulated.

See #18969 for context.

No diffs seen locally. No TP impact expected.

There is a small chance we may see some asserts in broader testing
as there were places in original code where local ref counts were
incremented without checking for possible overflows. The new APIs
will assert for overflow cases.
@AndyAyersMS (Member, Author)

AndyAyersMS commented Jul 18, 2018

Currently local ref counts become valid after lvaMarkLocalVars and are supposed to stay valid for the remainder of compilation (will update as I discover more)....

Here are some of the references / updates to ref counts that happen before that point:

  • lvaPromoteStructVar -- ref count bump for promoted reg arg fields for "prolog" init count
  • fgMarkImplicitByRefArgs -- zero ref counts for implicit by-ref parameters
  • fgMarkAddressExposedLocals -- track ref counts for implicit by-ref parameters
  • fgRetypeImplicitByRefArgs -- use ref counts to decide whether to promote implicit by-ref parameters
  • fgMakeOutgoingStructArgCopy -- use implicit by-ref parameter ref count to try and detect if the last use of the parameter is a call arg; if so try and avoid a copy
  • fgGrabTempWithImplicitUse -- eg for the inline pinvoke frame var, hidden stub parameter
  • lvaMarkLocalVars does some ref count manipulation before turning on "ref counts valid" mode

@mikedn (Contributor)

mikedn commented Jul 18, 2018

4-5% improvement in tier0 / minopts / debug throughput

I'm curious how you came up with that estimate :). Also, do we even need ref counts in minopts?

@AndyAyersMS (Member, Author)

Perf estimates are from #8715.

The improvement under MinOpts increases to about 4.7%.

I have a reasonably up to date version of this so can re-validate ... probably a good idea.

@AndyAyersMS (Member, Author)

Yeah, performance estimates still seem to hold up, eg PMI of Microsoft.CodeAnalysis.CSharp with minopts goes from 1334ms to 1275ms, around 4.5%, based on the just-updated prototype.

Actual benefit may be higher because the prototype still leaves a lot of ref counting cruft behind.

@AndyAyersMS (Member, Author)

Roughly speaking, the "invalid" ref count accesses in the current scheme (ref count reads, incs, decs, or sets before the jit officially starts ref counting) fall into categories:

  • setting up implicit ref counts for locals that will never have explicit IR references (referenced in the prolog, say, or referenced via EH or GC)
  • setting up transient implicit ref counts for locals whose references will appear eventually once constructs are lowered
  • conveying information during morph for the implicit byref promotion / unpromotion
  • conveying information during morph for optimization of last use struct args to avoid copies
  • count manipulations that seem easily avoidable

The implicit cases can perhaps better be handled by new local var property bits, eg lvHasImplicitReference or lvHasTemporaryImplicitReference. If these bits are set then the lvRefCnt() should always return nonzero counts (likely just adding 1 to the explicit ref count result is sufficient) and likewise for the weighted version. The distinction between always implicit and transiently implicit may not matter as it might be OK to lump them together into one category. But if not, then at some point when the appropriate lowering happens the transient implicit bit can be turned off and ref counts established for the variable.
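
In sketch form, the weighted accessor could look something like the following (field and type names are assumptions; the plain count accessor would be analogous):

// Sketch only: if the local is flagged as implicitly referenced and has no explicit
// weighted refs, report a nominal unit weight instead of zero.
inline BasicBlock::weight_t LclVarDsc::lvRefCntWtd() const
{
    if (lvHasImplicitReference && (m_lvRefCntWtd == 0))
    {
        return BB_UNITY_WEIGHT;
    }

    return m_lvRefCntWtd;
}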

The conveying cases can perhaps be tolerated by using a different set of APIs or a special parameter to permit temporary (short-lifetime) use of ref counts for some subset of locals.

Struct last use copy avoidance is arguably something better handled by the work on first class structs as that should be more general, more capable, and less fragile.

AndyAyersMS referenced this issue in dotnet/coreclr Jul 18, 2018
AndyAyersMS referenced this issue in AndyAyersMS/coreclr Jul 19, 2018
Instead of relying on ref count bumps, add a new attribute bit to local
vars to indicate that they may have implicit references (prolog, epilog,
gc, eh) and may not have any IR references.

Use this attribute bit to ensure that the ref count and weighted ref count for
such variables are never reported as zero, and as a result that these variables
end up being allocated and reportable.

This is another preparatory step for #18969 and frees the jit to recompute
explicit ref counts via an IR scan without having to special case the counts
for these variables.

The jit can no longer describe implicit counts other than 1 and implicit weights
other than BB_UNITY_WEIGHT, but that currently doesn't seem to be very important.

The new bit fits into an existing padding void so LclVarDsc remains at 128 bytes
(for windows x64).
@AndyAyersMS (Member, Author)

dotnet/coreclr#19012 should clear up the implicit reference cases and the avoidable cases.

For the other cases, one thought is to essentially make ref counts into something like a phased var, where the count fields are valid only for a certain time and certain purpose. The various APIs would then be extended with an enum describing the accessing phase, and this would be error-checked against the current phase. This could extend later to describe intervals when the counts are valid or potentially stale.

The error checking is a bit clunky with the current API shape as we either also need to pass in the compiler instance or fetch it from TLS. But perhaps the latter is acceptable....?
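
For concreteness, the phased state might be an enum along these lines (a sketch; the RCS_* names match the states referenced later in this thread, but the descriptions here are assumptions):

enum RefCountState
{
    RCS_INVALID, // ref counts may not be read or set
    RCS_EARLY,   // limited early counting, e.g. for implicit by-ref parameter decisions
    RCS_NORMAL,  // ref counts are valid for general use
};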

@briansull (Contributor)

The error checking is a bit clunky with the current API shape as we either also need to pass in the compiler instance or fetch it from TLS. But perhaps the latter is acceptable....?

It is fine for use in debug-only code, but we should avoid using it for retail code.

@AndyAyersMS (Member, Author)

It would be something like this:

inline unsigned short LclVarDsc::lvRefCnt(RefCountState state) const
{
    
#if defined(DEBUG)
    assert(state != RCS_INVALID);
    Compiler* compiler = JitTls::GetCompiler();
    assert(compiler->lvaRefCountState == state);
#endif
    
    if (lvImplicitlyReferenced && (m_lvRefCnt == 0))
    {
        return 1;
    }
    
    return m_lvRefCnt;
}

@CarolEidt (Contributor)

conveying information during morph for optimization of last use struct args to avoid copies

This rings a bell, but I don't recall exactly what's being done. Can you point me to this?

Re: the error checking, I think it's fine to fetch the compiler instance from TLS for an assert.

@AndyAyersMS (Member, Author)

Look at fgMakeOutgoingStructArgCopy

@CarolEidt (Contributor)

Thanks @AndyAyersMS - I now recall that some time ago, when I was making the first round of 1st class struct changes, I deleted this line: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L5297. It caused diffs, which I was trying to avoid, and I wanted to analyze it further - so I created that "to do" and it's still there :-(

@AndyAyersMS (Member, Author)

AndyAyersMS commented Jul 20, 2018

I can leave that as is for now -- I think the best thing is to tolerate the current "early" ref counting and just make sure the only early ref counting is related to these optimizations.

Then when we hit lvaMarkRefs we will start with a clean slate and recompute counts from scratch. The challenge is to track down cases where current ref counts are manipulated in ways that aren't reflected in the IR, eg:

https://github.com/dotnet/coreclr/blob/21ade50e57e1f61fb7a1a4970adba12d72ec72f7/src/jit/morph.cpp#L5295-L5297

and

https://github.com/dotnet/coreclr/blob/21ade50e57e1f61fb7a1a4970adba12d72ec72f7/src/jit/lclvars.cpp#L2027-L2029

and

https://github.com/dotnet/coreclr/blob/21ade50e57e1f61fb7a1a4970adba12d72ec72f7/src/jit/lclvars.cpp#L2959-L2962

and

https://github.com/dotnet/coreclr/blob/21ade50e57e1f61fb7a1a4970adba12d72ec72f7/src/jit/lclvars.cpp#L4120-L4141

Once this kind of stuff can be centralized then counts can be reliably recomputed....

@AndyAyersMS (Member, Author)

I suppose another question is whether we should start discouraging independent manipulation of the ref count and the weighted ref count. Most of the time they should both be changed, with the weight coming from the block that contains the ref. A lot of the code does this already, but the early ref count stuff doesn't use weights and so might arguably be exempt.

The only problematic cases for normal manipulation are the ones where there is no actual ref and so no block to use; these are the ones we'd like to call out specially anyways as they are the implicit references that any recount algorithm will need to know about.
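
As a sketch, a combined update helper might look like this (names are assumptions, modeled on the m_lvRefCnt field used in the accessor shown earlier):

// Sketch only: bump the count and the weighted count together, taking the weight
// from the block containing the reference; no separate weighted-only updates.
inline void LclVarDsc::incRefCntForRef(BasicBlock* block)
{
    assert(block != nullptr);

    if (m_lvRefCnt < USHRT_MAX) // saturate rather than overflow the narrow count
    {
        m_lvRefCnt++;
    }

    m_lvRefCntWtd += block->bbWeight;
}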

@AndyAyersMS (Member, Author)

Here's a more problematic case. During call args morphing, the jit invokes gtIsLikelyRegVar as it is trying to determine the best argument evaluation order. And this asks for the weighted ref count.

But we have not yet set up ref counts and weights.

@AndyAyersMS (Member, Author)

The upshot is that most locals will be costed during morph as if they are not likely to be enregistered, since most ref counts will be zero. There may be some nonzero counts from early ref counting and/or implicit refs.

https://github.com/dotnet/coreclr/blob/b896dd14830b600043a99c2626ea848ad679fb4f/src/jit/gentree.cpp#L2986-L2989

Seems like the fix here is to ensure ref counts are valid, and if they are not, decide to be either optimistic or pessimistic. For now probably the latter, as it will have fewer diffs.
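
In sketch form, the pessimistic version is just an early out in gtIsLikelyRegVar when counts aren't valid (assuming the lvaRefCountState phase tracking discussed above; this isn't the actual change):

// Sketch only: if ref counts have not been computed yet, report the local as
// unlikely to be enregistered (the pessimistic choice) rather than reading a
// zero or stale weighted count.
if (lvaRefCountState != RCS_NORMAL)
{
    return false;
}
// ... otherwise fall through to the existing lvRefCntWtd()-based check ...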

While we're in the neighborhood, this bit also looks problematic:

https://github.com/dotnet/coreclr/blob/b896dd14830b600043a99c2626ea848ad679fb4f/src/jit/gentree.cpp#L2991-L2996

Float exclusion is probably a holdover from x87?

I'll open a separate issue for this.

@AndyAyersMS (Member, Author)

Left some notes on gtIsLikelyRegVar and gtSetEvalOrder in dotnet/coreclr#19061.

@CarolEidt (Contributor)

Seems like the fix here is to ensure ref counts are valid and if not either decide to be optimistic or pessimistic. For now probably the latter as it will have fewer diffs.

I'm not sure - I might assume the reverse, since at least most of the reasons things become ineligible for enregistering are handled by morph. In any event, I would be surprised if the slight changes in eval order are both significant and not addressable in some other way (i.e. there may be some other factors being missed).

@briansull (Contributor)

briansull commented Jul 20, 2018

If there are only a handful of LclVars then you can be optimistic; if we have hundreds then pessimistic would probably be the right choice. We could also call gtSetEvalOrder again after determining the wtdRefCnts.
These size costs are just used to drive a couple of heuristics, like cloning the loop top test.

AndyAyersMS referenced this issue in dotnet/coreclr Jul 20, 2018
…#19012)

@AndyAyersMS (Member, Author)

Pessimistic didn't cause any diffs (at least in PMI/Crossgen x64) so I've done that in dotnet/coreclr#19068.

@AndyAyersMS (Member, Author)

Started in on the first part of the minopts speedup: not traversing the IR during lvaMarkRefs. Instead we just mark all locals with nominal count and weight.

This causes code size increases: some methods have unreferenced locals, and with this change we think they are referenced (so, for instance, we spill params to the stack):

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit (MINOPTS)
Summary:
(Lower is better)
Total bytes of diff: 82702 (0.22% of base)
    diff is a regression.
Top file regressions by size (bytes):
       18113 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.30% of base)
       12409 : Microsoft.CodeAnalysis.CSharp.dasm (0.26% of base)
        7367 : System.Private.Xml.dasm (0.17% of base)
        4841 : Microsoft.CodeAnalysis.dasm (0.32% of base)
        4805 : System.Security.Permissions.dasm (15.13% of base)
Top file improvements by size (bytes):
        -470 : System.Reflection.Metadata.dasm (-0.11% of base)
        -317 : System.Numerics.Vectors.dasm (-0.34% of base)
        -277 : System.Drawing.Primitives.dasm (-0.58% of base)
        -226 : System.Security.AccessControl.dasm (-0.23% of base)
        -188 : System.Private.Xml.Linq.dasm (-0.11% of base)
113 total files with size differences (18 improved, 95 regressed), 17 unchanged.
Top method regressions by size (bytes):
        1032 : Microsoft.CodeAnalysis.VisualBasic.dasm - _Closure$__:_Lambda$__0-0(ref):ref:this (258 methods)
         320 : System.Private.DataContractSerialization.dasm - XmlJsonWriter:WriteArray(ref,ref,ref,ref,int,int):this (20 methods)
         312 : xunit.runner.utility.dotnet.dasm - TestMessageVisitor:Visit(ref):bool:this (39 methods)
         296 : Microsoft.CodeAnalysis.VisualBasic.dasm - _Closure$__:_Lambda$__9-0(ref):ref:this (74 methods)
         276 : Microsoft.CodeAnalysis.VisualBasic.dasm - _Closure$__:_Lambda$__10-0(ref):ref:this (69 methods)
Top method improvements by size (bytes):
        -188 : Microsoft.CodeAnalysis.dasm - Enumerator:.ctor(ref):this (20 methods)
        -182 : System.Private.DataContractSerialization.dasm - XmlBinaryWriter:WriteArray(ref,ref,ref,ref,int,int):this (20 methods)
        -136 : System.Private.Xml.Linq.dasm - XAttribute:op_Explicit(ref):struct (17 methods)
        -136 : System.Private.Xml.Linq.dasm - XElement:op_Explicit(ref):struct (17 methods)
        -103 : System.Data.Common.dasm - RBTree`1:GetIndexByNode(int):int:this (4 methods)
20974 total methods with size differences (2551 improved, 18423 regressed), 149586 unchanged.

Despite the code size increase, throughput improves around 3-5%. Tried playing a bit with mixed strategies -- eg guessing that small methods (based on bb count) might be more likely to have unused params and locals -- but no luck so far; those led to bigger code and smaller jitting wins. Might make more sense to look at the total amount of IR created or something; in minopts hopefully there isn't much in the way of dead IR.

I will post more detailed and carefully measured numbers when I get around to a PR.

Trying to decide now if I should go through the pre-morph and post-morph explicit ref count updates and only do them when optimizing, or leave them be. If we skip over them post-morph then later in the compiler we need to do a second ref count setting pass for newly introduced locals (this is more or less what Pat did in his prototype).

It looks like some of the early ref count enabled optimizations may be active in minopts which also needs some sorting out. Suspect perhaps these should be turned off.
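
For reference, the "mark everything with a nominal count and weight" fast path described at the top of this comment is roughly the following (a sketch; the setter names are assumptions):

// Sketch only: for minopts/debug, skip the IR walk and treat every local as
// referenced once with unit weight.
for (unsigned lclNum = 0; lclNum < lvaCount; lclNum++)
{
    LclVarDsc* varDsc = lvaTable + lclNum;

    varDsc->setLvRefCnt(1);
    varDsc->setLvRefCntWtd(BB_UNITY_WEIGHT);
}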

@AndyAyersMS (Member, Author)

dotnet/coreclr#19103 has some work towards streamlining minopts.

Still needs sorting out because of the many-to-many relationship between various concepts: opt levels, opt flags, enregistration of locals, tracking, sorting, keeping debug codegen more or less in sync with minopts, and possible platform or OS variances.

Arguably debug codegen throughput is as important as, or more important than, minopts since it impacts the dev inner loop and "F5" latency. But I see quite a few places where we only check for minopts when it really looks like we should check for both (eg backendRequiresLocalVarLifetimes).

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Aug 2, 2018
Extract out the basic ref count computation into a method that we
can conceptually call later on if we want to recompute counts.

Move one existing RCS_EARLY count for promoted fields of register
args into this recomputation since losing this count bump causes
quite a few diffs.

The hope is to eventually call this method again later in the jit
phase pipeline and then possibly get rid of all the (non-early)
incremental count maintenance we do currently.

Part of #18969
@AndyAyersMS (Member, Author)

Here's a graphic showing some of the progress on jit throughput recently:
[chart omitted]

About 10% improvement in minopt, 6% or so in tiered.

@AndyAyersMS (Member, Author)

Likewise, here is one of the larger assemblies in crossgen, with minopts:

[chart omitted]

@AndyAyersMS (Member, Author)

Prototyping the next batch of changes after dotnet/coreclr#19240 where we now recompute counts from scratch after lowering (but still leave the incremental updates intact). Preliminary diffs are encouraging...

Total bytes of diff: -58014 (-0.16% of base)
    diff is an improvement.

Top file regressions by size (bytes):
          32 : System.Console.dasm (0.07% of base)
          28 : System.Diagnostics.TextWriterTraceListener.dasm (0.51% of base)

Top file improvements by size (bytes):
      -21483 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.73% of base)
       -5182 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.09% of base)
       -4751 : System.Data.Common.dasm (-0.36% of base)
       -3195 : System.Private.CoreLib.dasm (-0.10% of base)
       -2774 : System.Threading.Tasks.Parallel.dasm (-2.34% of base)

94 total files with size differences (92 improved, 2 regressed), 35 unchanged.
...
4925 total methods with size differences (4256 improved, 669 regressed), 185782 unchanged.

Still need to carefully look at CQ and TP impact, but diffs show lots of cases where inflated ref counts were causing the jit to make poor decisions.

AndyAyersMS referenced this issue in dotnet/coreclr Aug 7, 2018
…19240)

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Aug 7, 2018
Update `lvaComputeRefCounts` to encapsulate running ref counts post-lower
and to also handle the fast jitting cases.

Invoke this after lower to provide recomputed (and more accurate) counts.

Part of #18969.
@AndyAyersMS (Member, Author)

Now that dotnet/coreclr#19325 is in I've started prototyping removing the RCS_NORMAL incremental count updates. There is a lot of code to delete and it is not easy to get at this part incrementally.

Along with this we need to do something about tracking the need to sort the locals table. For now I've opted to sort it explicitly rather than triggering via various ref count machinations, as we won't have those to guide us.

Current prototype more or less undoes the size wins seen in dotnet/coreclr#19325 as we again lose track of some of the zero-ref cases.

So one idea is to add an RCS_LATE option that permits incremental ref count updates after lower, where (presumably) they are easier to get right (as it is mostly dead code I think). Another is to simply recompute the counts again before assigning frame offsets.

Am going to pursue this "second recompute" idea initially and see how it goes.

@AndyAyersMS (Member, Author)

Recomputing counts again after liveness fixes most of the diffs... somewhat surprising how little the incremental counts were used. Details shortly.

@AndyAyersMS (Member, Author)

For details: see notes for dotnet/coreclr#19345.

jashook referenced this issue in jashook/coreclr Aug 14, 2018
…dotnet#19012)

@AndyAyersMS (Member, Author)

AndyAyersMS commented Aug 16, 2018

Playing around with removing the local table sort. Think removing it may also have minimal codegen impact. Should have data shortly.

Because of the way lvaSortByRefCount is currently written, we may end up tracking fewer than lclMAX_TRACKED locals and still leave some trackable locals untracked (because some low-numbered locals are untrackable).

@CarolEidt (Contributor)

If we are eliminating the sort, we may want to still consider https://github.com/dotnet/coreclr/issues/11339. I think it could improve throughput to make the set of lclVars that are live across blocks more dense, and would also enable slightly different heuristics for the two types. Also, many, if not all, of the locals introduced in Rationalizer and Lowering are live only within a block (and usually single-def) - knowing those properties would be beneficial to register allocation.

@CarolEidt (Contributor)

.. and if it is indeed the case that Rationalizer and Lowering only ever introduce lclVars that are live only within a block, then perhaps we could avoid running liveness after those phases, and just set the lvTracked field and the GTF_VAR_DEATH flag appropriately as new lcls are introduced by the backend.

@AndyAyersMS (Member, Author)

Interesting.

Since we walk the IR to recompute the ref counts, we should be able to fairly cheaply determine if a trackable local's lifetime is block-local, and split the tracking index assignment into two passes so the globally live locals all have low tracking indexes.
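
A rough sketch of that two-pass index assignment (IsTrackable and IsLiveAcrossBlocks are hypothetical stand-ins for the real checks):

// Sketch only: pass 0 hands out tracking indexes to locals that are live across
// blocks, pass 1 fills in the remaining trackable (block-local) locals.
unsigned trackedCount = 0;
for (int pass = 0; pass < 2; pass++)
{
    const bool wantGloballyLive = (pass == 0);

    for (unsigned lclNum = 0; lclNum < lvaCount; lclNum++)
    {
        LclVarDsc* varDsc = lvaTable + lclNum;

        if (!IsTrackable(varDsc) || (IsLiveAcrossBlocks(varDsc) != wantGloballyLive))
        {
            continue;
        }

        if (trackedCount < lclMAX_TRACKED)
        {
            varDsc->lvTracked  = 1;
            varDsc->lvVarIndex = (unsigned short)trackedCount++;
        }
        else
        {
            varDsc->lvTracked = 0;
        }
    }
}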

@AndyAyersMS (Member, Author)

Back to sorting for the moment -- very few methods (~10) in the pmi -f set have more than 512 trackable locals, and the issue I noted above doesn't seem to crop up for any of them, as we evidently do a good enough job sorting that in the first 512 sorted candidates all are trackable.

So one might expect a small number of diffs.

However, the numbers say otherwise:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: 8277 (0.02% of base)
    diff is a regression.
Top file regressions by size (bytes):
       11170 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.38% of base)
        2488 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.04% of base)
        1470 : Microsoft.CodeAnalysis.CSharp.dasm (0.03% of base)
        1243 : System.Linq.Parallel.dasm (0.11% of base)
         677 : System.Collections.Immutable.dasm (0.08% of base)
Top file improvements by size (bytes):
       -1957 : System.Private.CoreLib.dasm (-0.05% of base)
       -1766 : System.Private.Xml.dasm (-0.05% of base)
       -1014 : System.Security.Cryptography.Algorithms.dasm (-0.30% of base)
        -936 : System.Security.Cryptography.X509Certificates.dasm (-0.42% of base)
        -813 : System.Net.Http.dasm (-0.15% of base)
98 total files with size differences (55 improved, 43 regressed), 31 unchanged.
Top method regressions by size (bytes):
        3699 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.KernelTraceEventParser:EnumerateTemplates(ref,ref):this
        3550 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.CtfTraceEventSource:InitEventMap():ref
        2680 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.ClrPrivateTraceEventParser:EnumerateTemplates(ref,ref):this
        1356 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Parsers.ClrTraceEventParser:EnumerateTemplates(ref,ref):this
        1277 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.BoundTreeVisitor`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:VisitInternal(ref,struct):long:this
Top method improvements by size (bytes):
        -502 : System.Private.Xml.dasm - System.Xml.Serialization.XmlSerializationWriterCodeGen:WriteElements(ref,ref,ref,ref,ref,ref,bool,bool):this
        -289 : System.Private.CoreLib.dasm - <ReadAsyncInternal>d__66:MoveNext():this
        -282 : System.Reflection.Metadata.dasm - System.Reflection.Metadata.MetadataReader:InitializeTableReaders(struct,ubyte,ref,ref):this
        -200 : System.Security.Cryptography.Algorithms.dasm - System.Security.Cryptography.KeyFormatHelper:ReadEncryptedPkcs8(ref,struct,struct,ref,byref,byref) (20 methods)
        -200 : System.Security.Cryptography.Algorithms.dasm - System.Security.Cryptography.KeyFormatHelper:ReadEncryptedPkcs8(ref,struct,struct,struct,ref,byref,byref) (10 methods)
5524 total methods with size differences (2589 improved, 2935 regressed), 219099 unchanged.

For the most part, not sorting would be a size wash or a win, except for the trace event methods -- they have huge numbers of temps. We can "fix" the problem there by less aggressively tracking structs, but that's not good in general. (The wins seem to come perhaps from changes in CSE behavior; I haven't looked at this too closely yet.)

So we might consider some kind of two-pass tracking selection where the first pass picks out higher priority cases and the second pass then fills in anything else trackable until we run out of space. Or several passes if we want to get globally live things densely clustered in the ID space.

The second issue is that the jit is evidently sensitive to the ordering relationship induced by the IDs -- these have been effectively "permuted" by not sorting the table, and this is likely what causes the larger number of diffs. When we sort, the IDs reflect an underlying ordering metric based on weighted ref counts (and other things); it seems likely we can track down where the ordering comparison happens with IDs and perhaps replace it with the predicate used to sort -- if it's not needed too often.

@AndyAyersMS (Member, Author)

Worth pointing out that the methods with big regressions all have large numbers of trackable but untracked locals. Some rough accounting...

Assembly | Method | Trackable but untracked (final count)
Microsoft.Diagnostics.Tracing.TraceEvent | ApplicationServerTraceEventParser:EnumerateTemplates(ref,ref):this | 1809
Microsoft.Diagnostics.Tracing.TraceEvent | CtfTraceEventSource:InitEventMap():ref | 705
Microsoft.Diagnostics.Tracing.TraceEvent | KernelTraceEventParser:EnumerateTemplates(ref,ref):this | 496
Microsoft.Diagnostics.Tracing.TraceEvent | ClrPrivateTraceEventParser:EnumerateTemplates(ref,ref):this | 233
Microsoft.CodeAnalysis.VisualBasic | BoundTreeVisitor`2:VisitInternal(ref,struct):long:this | 147
System.Private.Xml | XmlReflectionImporter:ImportAccessorMapping(ref,ref,ref,ref,ref,bool,bool,ref):this | 135
Microsoft.Diagnostics.Tracing.TraceEvent | ClrTraceEventParser:EnumerateTemplates(ref,ref):this | 118
System.Text.RegularExpressions | RegexInterpreter:Go():this | 67
Microsoft.CodeAnalysis.VisualBasic | Binder:ReportOverloadResolutionFailureForASingleCandidate(ref,ref,int,byref,struct,struct,bool,bool,bool,bool,ref,ref,bool,ref,ref):this | 37
Microsoft.CodeAnalysis.VisualBasic | VisualBasicCommandLineParser:Parse(ref,ref,ref,ref):ref:this | 4

@mikedn (Contributor)

mikedn commented Aug 17, 2018

Worth pointing out that the methods with big regressions all have large numbers of trackable but untracked locals. Some rough accounting...

Considering that such large methods tend to be rare, would it make sense to increase the max number of tracked variables?

@AndyAyersMS (Member, Author)

Maybe? The liveness time and space cost is proportional to BV size * number of BBs. So if a method has a lot of locals but few BBs then sure, we could up the limit. If it has a lot of locals and a lot of BBs then perhaps we shouldn't be optimizing it at all.

Divergence (at least in cases I've explored so far) for fully trackable methods seems to come about in LinearScan::buildIntervals() where we create intervals for reg params in lvVarIndex order.

We can rewrite this bit to just sort the tracked reg params (usually just a handful) using the old predicate.
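
Sketch of that rewrite (container and comparator are placeholders for whatever the jit would actually use, and the old predicate considers more than just the weighted count):

// Sketch only: collect the tracked register parameters (usually just a handful) and
// sort those by weighted ref count, instead of depending on whole-table sort order.
std::vector<unsigned> regParams;
for (unsigned lclNum = 0; lclNum < compiler->lvaCount; lclNum++)
{
    LclVarDsc* varDsc = compiler->lvaTable + lclNum;
    if (varDsc->lvIsRegArg && varDsc->lvTracked)
    {
        regParams.push_back(lclNum);
    }
}

std::sort(regParams.begin(), regParams.end(), [&](unsigned l1, unsigned l2) {
    // heavier weighted count first; the real predicate would also break ties
    return compiler->lvaTable[l1].lvRefCntWtd() > compiler->lvaTable[l2].lvRefCntWtd();
});

// then build intervals for reg params in regParams order rather than lvVarIndex order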

@mikedn (Contributor)

mikedn commented Aug 18, 2018

The liveness time and space cost is proportional to BV size * number of BBs. So if a method has a lot of locals but few BBs then sure, we could up the limit. If it has a lot of locals and a lot of BBs then perhaps we shouldn't be optimizing it at all.

Yep, it needs to be done with care; we probably don't need another JIT64. I was considering making the max number of tracked variables configurable after I got rid of the fixed size arrays in 18504, just to see what impact increasing the limit has. But I've yet to get to it...

Or several passes if we want to get globally live things densely clustered in the ID space.

Hmm, I suppose that would allow making the global BB BVs smaller than the BVs used for local liveness. But such a change is probably rather tricky and would not help if it turns out that the number of globally live lclvars is high.

@AndyAyersMS (Member, Author)

lvaMarkLclRefs still does a bunch of unrelated computation but I'm not planning on disentangling it any time soon. Moving to future.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@AndyAyersMS (Member, Author)

This is basically done, so will close.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020