
LclVars and throughput #8715

Open
pgavlin opened this issue Aug 9, 2017 · 13 comments
Labels
  • area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
  • enhancement: Product code improvement that does NOT require public API changes/additions
  • JitThroughput: CLR JIT issues regarding speed of JIT itself
  • tenet-performance: Performance related issue
Comments

@pgavlin
Contributor

pgavlin commented Aug 9, 2017

I'm filing this issue to capture some of the various issues that have popped up recently as I've been experimenting with LclVars and their relationship to JIT throughput. This issue is going to assume basic familiarity with the RyuJIT IR and its terminology as described in the overview. Apologies if I ramble a bit :)

Basics

The fundamental characteristics that distinguish a LclVar from an SDSU temp are that it may be multiply-defined, multiply-used, and live into or out of a block. The liveness of a LclVar is particularly important: the JIT uses the liveness information for blocks and LclVars in order to perform SSA numbering, register allocation, and precise GC reporting, as well as a handful of miscellaneous smaller optimizations. Accurate liveness information comes at a cost, however: in order to calculate the live-in and live-out sets for each block, the JIT must run live variable analysis on the function being compiled.

Tracked and Untracked LclVars

In order for the JIT to calculate liveness information for a LclVar, it must be able to correctly determine the points at which the LclVar is defined and used. A number of limitations prevent the JIT from doing so for certain LclVars, the most common of which is that the LclVar is address-exposed (with better alias analysis, this limitation would be surmountable). I will refer to the set of LclVars for which the JIT is capable of determining def and use information as "trackable LclVars".

The most direct costs of a tracked LclVar are the time spent in the liveness calculation and the space occupied in the dense bitsets used to represent the set of live tracked LclVars at various points (most importantly at block entries and exits). In order to manage both these direct costs and the indirect costs of the phases that consume liveness information, the JIT limits the number of trackable LclVars that are actually tracked. This limit is currently set at 512, which was found to be the highest achievable point that did not adversely affect throughput. When the JIT is not optimizing, it goes even further: rather than calculating accurate liveness information for tracked LclVars, it simply assumes that they are live at all points.

The JIT decides which of the trackable LclVars to track by sorting the LclVar table roughly by weighted appearance count (see the comparison functions for details) and then marking any LclVars after the 512th as untracked. The sorted LclVar table is also used to guide the dataflow CSE pass's register pressure heuristics, as well as to determine the order in which parameters are considered for enregistration by the register allocator.
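
The sorting-and-capping scheme described above can be sketched as follows. This is a simplified model, not RyuJIT's actual code: the type layout and the names `refCntWtd`, `isTrackable`, `assignTrackedIndices`, and `kMaxTracked` are all illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a LclVar descriptor; the fields here
// do not match the real LclVarDsc layout.
struct LclVarDsc {
    unsigned refCnt      = 0;     // unweighted appearance count
    uint64_t refCntWtd   = 0;     // appearance count weighted by block frequency
    bool     isTrackable = false; // e.g. false if the LclVar is address-exposed
    bool     isTracked   = false;
    int      trackedIndex = -1;   // index into the dense liveness bitsets
};

constexpr size_t kMaxTracked = 512; // the limit described above

// Sort trackable LclVars by descending weighted ref count and mark the
// first kMaxTracked of them as tracked; everything else stays untracked.
void assignTrackedIndices(std::vector<LclVarDsc*>& table) {
    std::stable_sort(table.begin(), table.end(),
        [](const LclVarDsc* a, const LclVarDsc* b) {
            if (a->isTrackable != b->isTrackable)
                return a->isTrackable; // trackable vars sort first
            return a->refCntWtd > b->refCntWtd;
        });

    size_t nextIndex = 0;
    for (LclVarDsc* v : table) {
        bool track      = v->isTrackable && (nextIndex < kMaxTracked);
        v->isTracked    = track;
        v->trackedIndex = track ? static_cast<int>(nextIndex++) : -1;
    }
}
```

Only tracked LclVars get a dense index, so the liveness bitsets stay at most `kMaxTracked` bits wide regardless of how many locals the function declares.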

General LclVar Costs

The description above briefly mentions two of the primary throughput costs of a LclVar, whether tracked or untracked:

  • The unweighted and weighted appearance counts of each LclVar (LclVarDsc::lvRefCnt and LclVarDsc::lvRefCntWtd, respectively) must be calculated and maintained
  • The LclVar table must be sorted

The former cost is paid first by lvaMarkLocalVars. This phase walks the function and bumps a LclVar's ref counts at each appearance. Once this phase has run, each successive phase must in principle ensure that it maintains accurate LclVar ref counts. In the general case, any time a tree is added to or removed from the function, it must be walked and LclVar reference counts adjusted accordingly. In practice, it is not always clear when this is necessary, so ref counts are often rather inaccurate. These inaccuracies can adversely affect the quality of the generated code, especially in cases where there are more than 512 LclVars, as LclVars that are important to track may be pushed out of the tracked set. Furthermore, as ref counts are used in the register allocator's spill heuristic, inaccurate counts may cause less-than-ideal allocation.
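
The maintenance invariant above amounts to a signed tree walk: +1 per appearance when a tree is added, -1 when it is removed. A minimal sketch, again with an illustrative IR shape rather than RyuJIT's actual GenTree:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical minimal IR node and LclVar descriptor for this sketch.
struct LclVarDsc {
    unsigned refCnt    = 0;
    uint64_t refCntWtd = 0;
};

struct GenTree {
    LclVarDsc* lclVar = nullptr;      // non-null iff this node references a LclVar
    std::vector<GenTree*> children;
};

// Walk a tree and adjust ref counts by delta (+1 when the tree is added to
// the function, -1 when it is removed), weighting by the containing block's
// execution weight. This is the invariant each phase must maintain.
void adjustRefCounts(GenTree* tree, int delta, uint64_t blockWeight) {
    if (tree->lclVar != nullptr) {
        LclVarDsc* v = tree->lclVar;
        if (delta < 0) {
            // Mirrors the JIT's assert that a count never drops below zero.
            assert(v->refCnt > 0);
        }
        v->refCnt = static_cast<unsigned>(static_cast<int>(v->refCnt) + delta);
        // Unsigned modular arithmetic makes this subtract when delta == -1.
        v->refCntWtd += static_cast<uint64_t>(delta) * blockWeight;
    }
    for (GenTree* child : tree->children) {
        adjustRefCounts(child, delta, blockWeight);
    }
}
```

The fragility described above is exactly that every site that deletes IR must remember to invoke something like the `delta = -1` walk; forgetting it leaves counts inflated rather than tripping an assert.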

The latter cost is paid first by lvaMarkLocalVars and then again any time the JIT deems it important to recalculate the set of tracked LclVars.

Decreasing the Throughput Cost of LclVars

Given the above, here are some ideas that seem worth exploring.

  • LclVar ref counting:
    • Do not ref count LclVars in MinOpts. This has been prototyped here, and gives about a 4.5% improvement in the number of instructions retired when precompiling System.Private.CoreLib.dll under MinOpts, at the cost of a <0.1% regression in instructions retired when optimizing. This prototype probably needs to be updated to ensure that unreferenced LclVars are discovered (perhaps during LSRA) s.t. the JIT can avoid allocating stack space for them. Aside from that, I think this is a clear win.
    • Instead of maintaining LclVar ref counts throughout the compiler, recalculate them immediately before they are needed (i.e. whenever the set of tracked LclVars is going to be recalculated). This has been prototyped here (note that this prototype includes the changes mentioned above w.r.t. not ref counting LclVars in MinOpts). This gives a very minor (<0.1%) improvement in the number of instructions retired when precompiling System.Private.CoreLib.dll with optimizations enabled. The improvement under MinOpts increases to about 4.7%. This is a tougher change to evaluate, as it causes a massive number of assembly diffs. That said, I believe that it is likely a win due to the minimal observed TP impact when optimizing and the improved clarity of the code, which no longer needs to maintain ref counts in the vast majority of cases.
  • LclVar sorting:
    • Rather than sorting all LclVars (including untrackable LclVars), the sort could only consider trackable LclVars. This has been prototyped here. It is primarily of interest due to its minimal impact on the rest of the JIT. I haven't measured its TP impact recently, but recall it being fairly small (probably ~0.25%).
    • If there are fewer than 512 LclVars, we could avoid sorting the LclVar table entirely. This has been prototyped here. This set of changes is somewhat incomplete, as it does not address the usage of the sorted table in CSE (though it does address LSRA's dependence by sorting only the parameters). However, its throughput impact is larger: avoiding the sorting gives about a 0.8% improvement in the number of instructions retired when precompiling S.P.C.dll with optimizations enabled. We could address the CSE issue by simply removing the register pressure heuristic. Though I have not measured the TP impact of doing so, I have gathered diffs: when the register pressure heuristic is disabled, overall code size as reported by SPMI diffs decreases by about 0.2%. This improvement is not universal: for example, mscorlib increases in size by 900 bytes while the frameworks decrease in size by 22kB.

category:throughput
theme:jit-coding-style
skill-level:expert
cost:large

@pgavlin
Contributor Author

pgavlin commented Aug 9, 2017

cc @AndyAyersMS @CarolEidt @dotnet/jit-contrib

@CarolEidt
Contributor

Thanks for writing this up! Overall, I agree with your thoughts and direction.

  • We could consider reusing lclVars that are created because we need a multi-use value, but whose lifetime is limited to a small region. This would effectively increase the number of trackable lclVars, at the cost of ensuring that we handle these in a way that leverages the fact that each use has a single def (in LSRA, at least, the presence of a large number of unrelated references could pessimize allocation). See [RyuJIT][CQ] LSRA: Investigate separating cross-block lclVars from intra-block lclVars #7992.

    • Until such time as we support conditional assignment within a block, the fact that a lclVar is not live across any blocks means it has this property.
    • Of course, if we had accurate SSA info during LSRA we could simply use that. But that has its own throughput issues, at least without significant re-engineering.
  • Ref counting on demand is appealing, not only for the apparent throughput improvement, but because it is so easy to make mistakes in the updates.

@pgavlin
Contributor Author

pgavlin commented Aug 9, 2017

We could consider reusing lclVars that are created because we need a multi-use value, but whose lifetime is limited to a small region. This would effectively increase the number of trackable lclVars, at the cost of ensuring that we handle these in a way that leverages the fact that each use has a single def (in LSRA, at least, the presence of a large number of unrelated references could pessimize allocation). See #7992.

I actually recently experimented with reusing the array and index LclVars created during GT_INDEX expansion. This turns out to be a net loss exactly because of some unfortunate preferencing behavior in LSRA. The pattern of the intervals for the shared reused LclVars is a set of block-local (usually, though I think "copy prop" can change this), always non-overlapping lifetimes, but LSRA effectively only allows for a single preference for the entire interval. The solution that immediately jumped to my mind was true interval splitting; @russellhadley may have some more thoughts on this.

FWIW, the more I work with the LclVar code, the less convinced I am that it is going to be worth the trouble to handle block-local LclVars as "tracked" LclVars that do not receive a name in the usual namespace. Though it's probably worth doing some experimentation to be sure, I think that we may be better off exploring ways to simply increase the tracked LclVar limit.

Ref counting on demand is appealing, not only for the apparent throughput improvement, but because it is so easy to make mistakes in the updates.

Yes. With the ref-count phase experiment referenced above, the only parts of the JIT that need to maintain ref counts are those that run between the ref count phase and liveness and those that run after the final liveness pass. This amounts to:

  1. optAddCopies, which may introduce new LclVars. This should be obviated by EH write-through.
  2. the DCE pass in fgLocalVarLiveness, which may remove LclVar references
  3. fgUpdateFlowGraph, which may remove LclVar references

We want to update ref counts in (2) or (3) s.t. we can identify and elide unreferenced LclVars when assigning frame offsets. An alternative approach (which we probably need anyway for MinOpts) would be to find unreferenced LclVars while building intervals for LSRA.

@AndyAyersMS
Member

I think I asked Brian this before, and forgot what he said, but it's not clear to me what the obligations are in the current jit when a tree is deemed unnecessary and is disposed. Certainly there are places that carefully go and decrement ref counts, but I'm sure this step gets missed in places and sometimes it is only conditionally needed.

In past compilers I've worked on there was a checkable discipline about disposing IR, basically by requiring some kind of dispose action that maintained invariants, kept track of the number of nodes disposed and modified the disposed IR in ways that made asserts or faults likely if they were somehow still accessed. Then at phase boundaries the "live IR" could be walked and some basic math (live-at-start + new - disposed) == currently-live? would tell you if some IR had not been cleaned up properly or leaked. (This was extendable in various interesting ways, eg the number of volatile ops in the live IR should generally be an invariant, with a few exceptions, so there can be special dispose/create disciplines around things like that....)

If disposing is conceptually a no-op then this at least serves as a check that IR is not leaking; if disposing carries some obligations then this is a way to ensure that those are being properly addressed and that incrementally maintained information is likely more accurate.

In our case it appears ref counts are sparsely read, so on demand makes more sense. If ref count maintenance is the only obligation when disposing a tree, then removing that and having no obligations on dispose would be a nice improvement. Whether we want to go to the trouble of having a placeholder dispose operation that operates in checked modes to give us leak detection and/or dead tree marking or other checking is then a separable question.
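
One shape the checked dispose discipline described above could take is sketched below. None of these names exist in RyuJIT; `NodeAccounting`, `disposeNode`, and the poison pattern are all hypothetical illustrations of the "live-at-start + new - disposed == currently-live" phase-boundary check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical bookkeeping for a checked dispose discipline.
struct NodeAccounting {
    size_t created = 0;
    size_t disposed = 0;
    size_t liveAtPhaseStart = 0;

    void beginPhase(size_t currentlyLive) {
        liveAtPhaseStart = currentlyLive;
        created = 0;
        disposed = 0;
    }

    // At a phase boundary, walk the live IR, count the nodes, and verify
    // that nothing leaked: live-at-start + new - disposed == currently-live.
    void endPhase(size_t currentlyLive) const {
        assert(liveAtPhaseStart + created - disposed == currentlyLive);
    }
};

struct GenTree; // opaque IR node type for this sketch

// Disposing poisons the node so stale accesses assert or fault quickly,
// and records the disposal for the phase-boundary check.
void disposeNode(NodeAccounting& acct, GenTree* node, size_t nodeSize) {
    std::memset(node, 0xDD, nodeSize); // recognizable poison pattern
    acct.disposed++;
}
```

In checked builds the accounting and poisoning run; in release builds `disposeNode` could compile down to nothing, preserving the "no obligations on dispose" throughput goal.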

@CarolEidt
Contributor

The pattern of the intervals for the shared reused LclVars is a set of block-local (usually, though I think "copy prop" can change this), always non-overlapping lifetimes, but LSRA effectively only allows for a single preference for the entire interval. The solution that immediately jumped to my mind was true interval splitting; @russellhadley may have some more thoughts on this.

Yes, that's why this is an outstanding issue (and not yet already done).

Once I've eliminated the separate TreeNodeInfoInit pass and have eliminated gtLsraInfo, I plan to work on building RefPositions on-the-fly for tree temps. From there it is (conceptually, at least) a short step to also incrementally building them for lclVars that aren't live across blocks. This would effectively split the intervals for those lclVars.

That still, however, leaves the remaining issue of splitting intervals for lclVars that are live across blocks, which will still have value.

@pgavlin
Contributor Author

pgavlin commented Aug 9, 2017

In our case it appears ref counts are sparsely read, so on demand makes more sense. If ref count maintenance is the only obligation when disposing a tree, then removing that and having no obligations on dispose would be a nice improvement.

Aside from ref count maintenance, we may want to correct side-effect flags for ancestor nodes when removing an HIR tree, but that is the only other obligation of which I am aware. Currently the ref-count maintenance and side-effect propagation are handled in separate walks if either is performed (which is not always the case). LIR is only obligated to perform ref-count maintenance, and attempts to encapsulate this in LIR::Delete and LIR::DeleteRange.

Whether we want to go to the trouble of having a placeholder dispose operation that operates in checked modes to give us leak detection and/or dead tree marking or other checking is then a separable question.

Another data point to consider here is that before trying the recount phase approach, I first wrote a ref count checker. I quickly abandoned this approach after it became clear that due to the open-coded nature of the IR modifications performed by the JIT, it was going to be a rather arduous task to locate all of the various sites that need to dispose trees. IMO an approach that allows us to have no obligations upon disposal is likely to be best for both throughput (no need for extra IR walks) and maintainability (no question about what to do when throwing away a tree).

Once I've eliminated the separate TreeNodeInfoInit pass and have eliminated gtLsraInfo, I plan to work on building RefPositions on-the-fly for tree temps. From there it is (conceptually, at least) a short step to also incrementally building them for lclVars that aren't live across blocks. This would effectively split the intervals for those lclVars.

Do you also plan on building Intervals on the fly? As best I could tell, the preferencing problem that I mentioned is rooted in the fact that a single Interval may currently only have a single preference for its entire lifetime (effectively).

@CarolEidt
Contributor

Do you also plan on building Intervals on the fly? As best I could tell, the preferencing problem that I mentioned is rooted in the fact that a single Interval may currently only have a single preference for its entire lifetime (effectively).

Yes, in fact what I was doing when I experimented with this previously is that for tree temps no intervals are built at all. For the not-live-across-blocks lclVars, one would presumably reset its preferencing info when starting a new block.

@briansull
Contributor

briansull commented Aug 9, 2017

The LclVar counts are there mainly for the legacy JIT backend. In the legacy JIT they are used to sort the LclVars and decide which locals are tracked and which are untracked. The coloring register allocator also allocates the LclVars with the highest weight first. There is an assert that fires if you ever decrement a count below zero, as that obviously indicates an error in ref counting. But there is nothing that prevents over-counting, and in fact we actually use over-counting to boost the internal short-lifetime temps so that they are more likely to be tracked and enregistered.

Since we have these ref counts I have suggested that they be used by LSRA when a heuristic needs to make a choice between two or more LclVars.

@mikedn
Contributor

mikedn commented Aug 14, 2017

I actually recently experimented with reusing the array and index LclVars created during GT_INDEX expansion

Wouldn't it be possible to reduce the number of lclvars used by GT_INDEX's expansion by making GT_ARR_BOUNDS_CHECK return the index it checks? This way the original index would be used only once and no new lclvar would be needed for it. This would also slightly simplify the IR by avoiding the need for a GT_COMMA node. Granted, it's not a trivial change...

@AndyAyersMS
Member

Consider for 2.1.

@AndyAyersMS
Member

Data from #7281 which has partially corrected ref counts indicates that there's also a CS win here, mainly from reducing the local offsets.

It should also give us a somewhat saner basis for using counts to help direct register allocation, since the counts today are somewhat inflated.

@AndyAyersMS
Member

Would love to get to this but don't see it wrapping up within the 2.1 timeframe as it is a fairly disruptive change (lots of codegen impact). So pushing it back to Future.

@AndyAyersMS
Member

I dusted off Pat's changes and squashed / rebased to a recent master commit. Updated bits here: master...AndyAyersMS:LclVarRefCountPhase2

Doesn't quite work out of the box as we try decrementing some ref count below zero. Likely some new count maintenance has popped into the code. Locally I've commented out that assert. But logically we should now never need to decrement ref counts.

Looking at the changes, I think perhaps more is needed. What we end up with (in optimized builds) has some odd characteristics, as the ref counts and weighted counts are still looked at before they're "correctly" computed, in a number of places (and arguably these early counts are now even less correct than before):

  • assertion prop: optAddCopies
  • gentree: gtIsLikelyRegVar
  • cse: CSE_Heuristic::Initialize
  • morph: fgRetypeImplicitByRefArgs
  • register promotion: lvaPromoteLongVars

and possibly more. Also, counts are still manipulated directly during morph in a number of places.

So if the goal is to not create, consume, or maintain counts early on in the jit, we have to look at all these cases and find alternatives.

The post computation count logic that looks at counts and weights is also sensitive and we see unexpected diffs. Typically the old counts and weights were somewhat inflated and the new counts and weights are lower. For example the RA is sensitive to the weights it sees for vars that represent register parameters. In the old code these vars appear to be specially handled by incrementing counts/weights in key places (or perhaps: there are implicit appearances of these vars at entry and at some calls that the current ref count recomputation doesn't account for...). So now they often have less weight and get spilled immediately on entry when it appears they did not need or deserve spilling.

X64 jit diffs on a checked corelib shows:

Total bytes of diff: 7393 (0.20% of base)
    diff is a regression.

Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions by size (bytes):
        7393 : System.Private.CoreLib.dasm (0.20% of base)

1 total files with size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions by size (bytes):
         395 : System.Private.CoreLib.dasm - System.Number:NumberToStringFormat(byref,byref,struct,ref)
         153 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]:ToString():ref:this
         153 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]:System.IValueTupleInternal.ToStringEnd():ref:this
         103 : System.Private.CoreLib.dasm - System.Globalization.IdnMapping:PunycodeEncode(ref):ref
         103 : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HMS_F_D(byref,ubyte,byref):bool

Top method improvements by size (bytes):
         -84 : System.Private.CoreLib.dasm - System.Reflection.Emit.MethodBuilder:CreateMethodBodyHelper(ref):this
         -63 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`7][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`7[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]]:System.IValueTupleInternal.ToStringEnd():ref:this
         -63 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`7][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`7[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]]:ToString():ref:this
         -63 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`8][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`8[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`7[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]]]:System.IValueTupleInternal.ToStringEnd():ref:this
         -63 : System.Private.CoreLib.dasm - System.ValueTuple`8[__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,__Canon,ValueTuple`8][System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`8[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.ValueTuple`7[System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon,System.__Canon]]]:ToString():ref:this

1867 total methods with size differences (472 improved, 1395 regressed), 24253 unchanged.

So there is a fairly long tail of regressions to sort through.

I haven't yet tried to verify the TP numbers or look at correctness or minopts impact. Will post more when I have it.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022