Add documents about JIT optimization planning #12956
Conversation
@dotnet/jit-contrib PTAL
situations, the better, as it both increases developer productivity and
increases the usefulness of abstractions provided by the language and
libraries. Finding a measurable metric to track this type of improvement
poses a challenge, but would be a big help toward prioritizing and validating
---
We use all of those, including also:
- Forcing inlining like crazy to avoid method calls
- Using gotos to build "sub-routines", avoiding both the cost of the call and the size cost of inlining multiple copies into big methods (see the sketch below)

Examples of the result: https://ayende.com/blog/177569/why-we-arent-publishing-benchmarks-for-ravendb-4-0-yet
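A minimal, contrived sketch of that goto-as-subroutine pattern (everything here is invented for illustration, not taken from the linked post): one copy of a hot code sequence serves two "call sites" in the same method, avoiding both call overhead and duplicated inlined bodies.

```csharp
static int SumWithSharedBlock(byte[] data)
{
    int acc = 0, i = 0, resume = 0;

    resume = 1;                     // first "call site"
    goto Accumulate;
Resume1:
    i = data.Length / 2;
    resume = 2;                     // second "call site"
    goto Accumulate;
Resume2:
    return acc;

Accumulate:                         // shared "sub-routine": a single copy of the code
    for (; i < data.Length; i++)
        acc += data[i];
    if (resume == 1) goto Resume1;  // manual "return" dispatch
    goto Resume2;
}
```

The usual trade-off applies: this saves code size and call overhead at the cost of readability and a manual dispatch branch, which is why it tends to show up only in very hot methods.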
---
Thanks for the examples, this is exactly the sort of list I'm hoping we can build/prioritize/address.
I like the idea of making COMPlus_JitDisasm available in release builds. Currently I have to build CoreCLR myself and can only do it at home.
Mulshift
--------

Replacing multiplication by constants with shift/add/lea sequences is a
---
Eh, the JIT does some of this already and I suspect it wouldn't be much trouble to make it do more.
That's a way of saying "do we really need a planning document to make it happen"? :)
---
Pull requests welcome :)
No, I'm not trying to modify our workflow, impose heavier process, or demand that changes get added to this document before (or after) getting implemented, or anything like that -- I'm just capturing a list of items we keep discussing in planning to avoid having to re-create the discussion.
---
> Pull requests welcome :)

Not anytime soon; I don't think it's a very useful optimization (beyond what we already have now).
---
From the perspective of a developer working on the core team, saying:

> That's a way of saying "do we really need a planning document to make it happen"? :)

is entirely understandable, since he/she is on the bleeding edge of project development, but from the perspective of potential community members such a document would be very helpful and welcome.
---
> From perspective of a developer working in core team saying ...

I'm not quite sure what you are trying to say. Just to be clear, I'm a community member, not a core team member :)
---
From my perspective, your knowledge of dotnet and your contributions suggest that you are a core team member :) - git blame does not lie, does it?
---
I ran some stats and yeah it looks like we're already getting nearly everything that makes sense, will put together a PR for the few stragglers that seem worthwhile.
---
> I ran some stats

Wow, that's a bit of work. It would be nice to know how you instrumented the JIT. Or to be more precise - how you got the numbers out of the JIT. Files, I presume?
I didn't know that morph does this, I only knew about codegen. Now that I see this I'm not so sure it's a good idea to have this in morph. For one thing it increases IR size and it's not likely to enable additional optimizations, quite the contrary.
But more importantly, this really belongs in codegen as it is a very target specific optimization. IMUL is quite fast these days - latency 3 and throughput 1. Replacing it with a single LEA or SHR is pretty much always a win but the moment you replace it with 2 LEA/SHR instructions things become complicated. Those 2 instructions will have at least 2 cycle latency so in the best case you're saving 1 cycle at the cost of adding an instruction.
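A source-level sketch of that trade-off (method names invented; the asm in the comments is the intended x64 codegen, not verified output):

```csharp
static int MulBy10_Imul(int x) => x * 10;   // imul eax, ecx, 10 -- one instruction, ~3 cycle latency

static int MulBy10_Decomposed(int x)
{
    int x5 = x + x * 4;                     // lea  eax, [rcx+rcx*4]
    return x5 + x5;                         // add  eax, eax -- two dependent instructions, ~2 cycle latency
}
```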
---
I added instance fields to `Compiler`, modified Morph and CodeGen to record the data in the new fields during processing, modified asm emission to print them in method headers, then ran jit-diff and pulled the data out of the dasm files and ultimately into Excel.
I agree it seems like something that should live in the backend. cc @russellhadley who had some reasons to prefer Lower to CodeGen.
I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.
---
> who had some reasons to prefer Lower to CodeGen

Yeah, I prefer Lower too. Doing this kind of stuff in CodeGen sometimes also requires adding logic in Lower or TreeNodeInfoInit, and that logic needs to be kept in sync, otherwise bugs or CQ issues show up. But if we do it in Lower we also need to add a new lclvar because the non-constant operand of MUL has multiple uses.

> I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.

I might take a look once I finish my pesky cast optimization attempt.
------------------------------

This hits big in microbenchmarks where it hits. There's some work in flight
on this (see #7447).
---
FWIW the "work" is in #10861; #7447 is just an issue with some discussion about this.
---
Thanks, updated.
expressions could be folded into memory operands' address expressions. This
would likely give a measurable codesize win. Needs some thought about where
to run in phase list and how aggressive to be about e.g. analyzing across
statements.
---
Isn't this related to forward substitution?
---
Yes, certainly. I suppose I mentioned it here simply thinking that if we tackle the address mode thing it might be worthwhile to add some simple forward propagation as part of that, which could then be refactored/subsumed if we add more general forward substitution subsequently.
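For concreteness, a tiny invented example (not from the doc) of the kind of case where simple forward propagation and address-mode folding interact:

```csharp
static int LoadAt(int[] a, int i)
{
    int j = i + 4;   // if j has this single use, forward-substituting it into the
    return a[j];     // indexing lets the add fold into the load's address mode,
                     // e.g. something like mov eax, [rax + rcx*4 + 0x20] on x64
}
```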
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
with or making easily available an alt jit that supports JitDisasm.
- Hardware companies maintain optimization/performance guides for their ISAs.
  Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a
---
ISAs are far more complicated than MSIL in this regard, so it makes sense that there are such guides. I don't think there's a lot that can be done here, but here are a few ideas:

- Is it better to rely more on the IL stack (via `dup`) or is it better to use IL local variables?
- Use of IL `switch` - when is it better to replace it with a search tree?
- Is there perhaps a "best" fg layout for IL? Like having multiple returns or having a single return?
collect profiles and correlate them with that machine code. This could
benefit any developers doing performance analysis of their own code.
The JIT team has discussed this, options include building something on top of
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
---
Having JitDisasm in release builds would certainly be nice but it may also be limiting (e.g. right now it outputs to the console so it can interfere with the application's own output). The current disassembler output is also a bit inaccurate at times, not a big problem usually but it can be confusing.
Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd party disassembler or perhaps find more creative uses.
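Nothing like this exists today; purely as a strawman, the suggested API might look something like the following (all names hypothetical):

```csharp
using System;
using System.Reflection;

// Hypothetical: a range of native code produced by the JIT for one method.
public readonly struct NativeCodeRange
{
    public IntPtr Start { get; }
    public int Length { get; }
    public NativeCodeRange(IntPtr start, int length) { Start = start; Length = length; }
}

public static class JitInspection
{
    // Hypothetical: report where the JIT put the code for an already-jitted
    // method, so a 3rd-party disassembler can be pointed at it.
    public static NativeCodeRange? TryGetNativeCode(MethodBase method)
        => throw new NotImplementedException(); // would be implemented by the runtime
}
```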
---
Yes, we'd need to have a way to send the disasm somewhere other than stdout. I believe there's already some functionality to send JIT output to a logfile; if we go this route, we'd of course need to make sure it works well with JitDisasm.
To my mind, the appeal of making JitDisasm available over disassembling the emitted code is that it would make it easy to bring along all the annotations we put in the disasm (method name, optimization flags, symbols and helper call names, annotated GC/EH tables, etc.), as well as things like DiffableDisasm.
> Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd party disassembler or perhaps find more creative uses.

There is CLR MD, which for example SharpLab is using for in-proc disassembly with a 3rd-party disassembler.
---
It would be very helpful to have a "side by side" very-high-resolution profiler. My suggestion would be to include, as one of the available profiling options, the code described in the paper "Computer Performance Microscopy with Shim", X. Yang, S. M. Blackburn, K. S. McKinley - ACM SIGARCH Computer Architecture News, 2016. This profiler allows for 15-processor-cycle resolution with overhead around 68%, and 1000-processor-cycle resolution with overhead at 2%, with no or very small observer effects. AFAIR the currently used code (thread cycle measurements in utilities) has significant overhead - in the range of 200 processor cycles for a single measurement, or 400 cycles for the two-point measurement necessary to determine a time interval (cpuid + rdtsc instructions or a similar serializing time-stamp-counter read). The last author of the paper, Kathryn S. McKinley, is at Microsoft Research, and the code is available at https://github.com/ShimProfiler/SHIM under GPLv2. The work was funded by NSF SHF-0910818 and ARC DP140103878 grants (the other co-authors are at the Australian National University), so it might be possible to license it from the US government on other terms.
It is quite common that I would like to know how long a performance-critical method executes in a real application, and yet it is often called only once during a typical application life cycle - e.g. image decompression, a coding algorithm for short data sequences, or some part of a multi-stage/multi-algorithm pipeline. If typical benchmarks are used, the method is isolated from its usual context, and its execution time can be very different from the execution time when the method runs once in application context. In my experiments in managed code on .NET 4.6 - 4.7, the difference could be as large as 3 - 5 times.
cc @dotnet/arm32-contrib
It's great that these documents will appear in the coreclr documentation. It is a prerequisite for better communication with the coreclr community and .NET Core users on one of the most important subjects - performance. And I would like to thank @JosephTremoulet for creating them - see my comment on Performance Improvements in RyuJIT in .NET Core and .NET Framework.
@4creators, thanks for the comments! Fully agree that active discussions around performance are vital.
improving if resources were unlimited. This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.
---
It would be very useful to have a description of existing optimisations with info on implemented algorithms and links to the code - Optimizer Codebase and Status. This would help in understanding the existing RyuJIT implementation.
---
> It would be very useful to have a description of existing optimisations with info on implemented algorithms and links to the code

This is more or less available in the existing documentation: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md
---
I know that document - I've already read it twice, and to my taste I would like to go deeper, with more detailed links to the code. My point is that documentation on the JIT, VM, and GC should let an experienced developer understand the implementation well enough that the so-called time-to-first-commit is as short as possible. The usual problem with documentation for developers is that it is best when written by the code authors, who have to write the code in the first place and do not have much time for documenting their work. Another aspect of the same problem is the barrier to contributing to the project, which has a major impact on the size of the community and the dynamics of open source project development. I would treat investment in documentation as an investment in the community supporting the project.
We've made note of the prevalence of async/await in modern code (and particularly
in web server code such as TechEmpower), but haven't really identified concrete
opportunities presented here, to my knowledge. Some sort of study of async
peanut butter is probably in order, but what would that look like?
---
When looking at pre-completed async, you'd generally see a ~10x difference per dip by using await vs checking completion and using .Result. As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await everything was in memory locations: https://github.com/dotnet/corefx/issues/18481
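For reference, a sketch of the two patterns being compared (method names invented; the pre-check version reads `.Result` only after observing completion):

```csharp
static async Task<int> ViaAwait(Task<int> task)
{
    int v = await task;    // pays the full state-machine cost even when
    return v + 1;          // 'task' has already completed
}

static Task<int> ViaPreCheck(Task<int> task)
{
    if (task.IsCompleted)                         // pre-completed fast path:
        return Task.FromResult(task.Result + 1);  // no state machine, locals stay in registers
    return ViaAwait(task);                        // genuinely-async slow path
}
```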
---
> As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await everything was in memory locations

Well, local variables that are live across await aren't really local variables anymore, they're fields of a struct.
---
Yep, and it means a function which has an `await` has a disproportionately higher min cost for sync completion; it can be coded out, so it would be nice if a compiler resolved it instead (C# or JIT) - haven't worked out a general solution though, so just highlighting it.
---
> so it would be nice if a compiler resolved it instead (C# or JIT)

Hmm, maybe some improvements could be done, but it's not going to be trivial. One problem is the memory model: all stores have release semantics, and that limits the JIT's ability to optimize such code.
---
> But how is the JIT supposed to know all that?

Until it boxes? At that point it's going async, and then memory vs register becomes a minimal factor. Though user-defined Task-likes could make it much more complicated.
---
> At that point it's going async, then memory vs register becomes a minimal factor.

Yes, that's the idea: move some loads/stores onto the "going async" path. Though it looks like Roslyn is in a better position to do this. For example:
```csharp
static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    await Task.CompletedTask;
    for (; i < 20; i++)
        Whatever(i);
}
```
In this example `i` becomes a struct field and every access to it turns into a memory access. To make things worse, all this is in a try/catch block. Accessing `i` requires `this`, and `this` is live in the catch block and thus cannot be enregistered. So it's actually 2 memory accesses for every access to `i`:
```asm
G_M33543_IG04:
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8B4904       mov      ecx, dword ptr [rcx+4]
       E82AFAFFFF   call     Program:Whatever(int)
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8B7104       mov      esi, dword ptr [rcx+4]
       FFC6         inc      esi
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       897104       mov      dword ptr [rcx+4], esi
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8379040A     cmp      dword ptr [rcx+4], 10
       7CDA         jl       SHORT G_M33543_IG04
```
Perhaps Roslyn could generate something like:
```csharp
static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    this.i = i; // "spill" i to a struct field
    await Task.CompletedTask;
    i = this.i; // "unspill" i from the struct field
    for (; i < 20; i++)
        Whatever(i);
}
```
---
cc: @VSadov
---
See also #7914 and some of the changes that went in to Roslyn as a result.
---
> See also #7914

Updated with link to that, thanks.
Isn't it for stores to fields of reference types only?
It doesn't matter if the type is a reference or a value type. What matters is whether the store goes to the stack (where it cannot be observed by any other thread) or to the heap (where it may be observed). And of course, a value type can end up on the heap by being boxed or as a field of a reference type. The struct used in async code does end up being boxed if the task is not already completed.
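A minimal sketch of that distinction (types invented for illustration):

```csharp
struct Counter { public int Value; }

class Holder { public Counter C; }

static int OnStack()
{
    Counter c = default(Counter);  // stack-only: no other thread can observe it,
    c.Value = 1;                   // so the JIT is free to enregister, reorder,
    c.Value = 2;                   // or eliminate these stores
    return c.Value;
}

static void OnHeap(Holder h)
{
    h.C.Value = 1;                 // heap stores: potentially visible to other
    h.C.Value = 2;                 // threads, so the JIT is far more constrained
}
```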
---
Looks good; I hope that this will be a living document that evolves to reflect new data, new ideas and new results. Thanks for writing this up!
------------------------

This is an area that has received recent attention, with the first-class structs
work and the struct promotion improvements that went in for `Span<T>`. Work here
---
You might want to add a reference here to https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, although I confess that it needs to be updated with current status. There are still a number of work items left to complete, many (most?) of which have associated issues.
---
Added.
key optimization here, and we are actively pursuing it. Other things we've
discussed include inlining methods with EH and computing funclet callee-save
register usage independently of main function callee-save register usage, but
I don't think we have any particular data pointing to either as a high priority.
---
While EH Write Thru is not strictly an optimization, the lack of it is an inhibitor to improving the performance of EH-intensive code (see https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md; I also submitted a PR with a suggested alternate approach).
---
Added link, rephrased to "optimization enabler".
opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now. Also, using SSA def as the proxy for value
number would require handling SSA renaming, so there's a big dependency chained
to this.
---
I'm not sure what you mean by "using SSA def as the proxy for value number". Could you clarify?
---
I mean eagerly replacing redundant expressions and thus being able to approximate "has same value" with "is use of same SSA def" (and re-casting the heap VN stuff as memory SSA) rather than dragging around side tables of value numbers in a separate expression language.
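Roughly, in source-level terms (invented example):

```csharp
static int Sum(int a, int b)
{
    int t = a * b;    // the single def: both occurrences of the redundant
    int x = t + 1;    // expression a * b become uses of t, so "has same
    int y = t + 2;    // value" reduces to "is a use of the same def"
    return x + y;
}
```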
assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting. We should get that benchmarking story sorted out and at least hook
up those two knobs.
---
Again, not really an optimization issue, but it's pretty clear that existing issues with register allocation (and in particular, issues with spill placement) are a current inhibitor to more aggressive optimization.
---
Could you elaborate? Are you saying we'd do more aggressive post-RA optimization with better-placed spills, or do more aggressive pre-RA optimization if we had better spill placement in the RA to rely on, or both/neither? And specifically is there something you think the doc should say about this under "High Tier Optimizations" (like that we could use a different RA algorithm)?
---
I was saying the latter, and I think that all the doc really needs to say is that, until the RA issues are mitigated, aggressive optimizations are likely to be pessimized by RA issues and/or potentially make performance worse. Whether or not we need a different RA algorithm, I think, remains to be seen, but I think there's a lot of potential improvement with the existing RA algorithm that has not yet been achieved.
---
Makes sense. Added a note to that effect.
This change adds two documents:

- JitOptimizerPlanningGuide.md discusses how we can/do/should go about
  identifying, prioritizing, and validating optimization improvement
  opportunities, as well as several ideas for how we might improve the
  process.
- JitOptimizerTodoAssessment.md lists several potential optimization
  improvements that always come up in planning discussions, with brief
  notes about each, to capture current thinking.