
Add documents about JIT optimization planning #12956

Merged Jul 31, 2017 (1 commit)

Conversation

@JosephTremoulet (Author):

This change adds two documents:

  • JitOptimizerPlanningGuide.md discusses how we can/do/should go about
    identifying, prioritizing, and validating optimization improvement
    opportunities, as well as several ideas for how we might improve the
    process.
  • JitOptimizerTodoAssessment.md lists several potential optimization
    improvements that always come up in planning discussions, with brief
    notes about each, to capture current thinking.

@JosephTremoulet (Author):

@dotnet/jit-contrib PTAL

situations, the better, as it both increases developer productivity and
increases the usefulness of abstractions provided by the language and
libraries. Finding a measurable metric to track this type of improvement
poses a challenge, but would be a big help toward prioritizing and validating


We use all of those, including also:

@JosephTremoulet (Author):

Thanks for the examples, this is exactly the sort of list I'm hoping we can build/prioritize/address.

@omariom (Jul 20, 2017):

I like the idea of making COMPlus_JitDisasm available in release builds. Currently I have to build CoreCLR myself, and can only do it at home.

Mulshift
--------

Replacing multiplication by constants with shift/add/lea sequences is a
@mikedn (Jul 20, 2017):

Eh, the JIT already does some of this, and I suspect it wouldn't be much trouble to make it do more.

That's a way of saying "do we really need a planning document to make it happen?" :)

@JosephTremoulet (Author):

Pull requests welcome :)

No, I'm not trying to modify our workflow, impose heavier process, or demand that changes get added to this document before (or after) getting implemented, or anything like that -- I'm just capturing a list of items we keep discussing in planning, to avoid having to re-create the discussion.

> Pull requests welcome :)

Not anytime soon; I don't think it's a very useful optimization (beyond what we already have now).

From the perspective of a developer working on the core team, saying:

> That's a way of saying "do we really need a planning document to make it happen?" :)

is entirely understandable, since he or she is on the bleeding edge of the project's development; but from the perspective of potential community members, such a document would be very helpful and welcome.

@mikedn (Jul 24, 2017):

> From the perspective of a developer working on the core team, saying ...

I'm not quite sure what you're trying to say. Just to be clear, I'm a community member, not a core team member :)

From my perspective, your knowledge of dotnet and your contributions suggest that you are a core team member :) -- git blame does not lie, does it?

@JosephTremoulet (Author):

I ran some stats, and yeah, it looks like we're already getting nearly everything that makes sense; I'll put together a PR for the few stragglers that seem worthwhile.

> I ran some stats

Wow, that's a bit of work. It would be nice to know how you instrumented the JIT -- or, more precisely, how you got the numbers out of the JIT. Files, I presume?

I didn't know that morph does this; I only knew about codegen. Now that I see it, I'm not so sure it's a good idea to have this in morph. For one thing, it increases IR size, and it's not likely to enable additional optimizations -- quite the contrary.

But more importantly, this really belongs in codegen, as it is a very target-specific optimization. IMUL is quite fast these days: latency 3, throughput 1. Replacing it with a single LEA or SHR is pretty much always a win, but the moment you replace it with 2 LEA/SHR instructions, things get complicated. Those 2 instructions will have at least 2 cycles of latency between them, so in the best case you're saving 1 cycle at the cost of adding an instruction.
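To make the cycle arithmetic concrete, here is a sketch of the two code shapes being compared for x * 10 (hypothetical x64 output for illustration, not necessarily what RyuJIT emits):

static int Times10(int x) => x * 10;

// Multiply form: one instruction, ~3 cycles latency, throughput 1.
//     imul eax, ecx, 10
//
// Shift/add form: two *dependent* instructions, so ~2 cycles total latency;
// best case saves ~1 cycle, at the cost of an extra instruction:
//     lea  eax, [rcx+rcx*4]   ; eax = x * 5
//     add  eax, eax           ; eax = x * 10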

@JosephTremoulet (Author):

I added instance fields to Compiler, modified Morph and CodeGen to record the data in the new fields during processing, modified asm emission to print them in method headers, then ran jit-diff and pulled the data out of the dasm files and ultimately into Excel.

I agree it seems like something that should live in the backend; cc @russellhadley, who had some reasons to prefer Lower to CodeGen.

I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.

> who had some reasons to prefer Lower to CodeGen

Yeah, I prefer Lower too. Doing this kind of thing in CodeGen sometimes also requires adding logic in Lower or TreeNodeInfoInit, and that logic needs to be kept in sync, otherwise bugs or CQ issues show up. But if we do it in Lower, we also need to add a new lclvar, because the non-constant operand of the MUL has multiple uses.

> I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.

I might take a look once I finish my pesky cast optimization attempt.

------------------------------

This hits big in microbenchmarks where it hits. There's some work in flight
on this (see #7447).

FWIW the "work" is in #10861; #7447 is just an issue with some discussion about this.

@JosephTremoulet (Author):

Thanks, updated.

expressions could be folded into memory operands' address expressions. This
would likely give a measurable codesize win. Needs some thought about where
to run in phase list and how aggressive to be about e.g. analyzing across
statements.

Isn't this related to forward substitution?

@JosephTremoulet (Author):

Yes, certainly. I suppose I mentioned it here simply because, if we tackle the address-mode thing, it might be worthwhile to add some simple forward propagation as part of that, which could then be refactored/subsumed if we add more general forward substitution later.
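As a sketch of the connection (hypothetical example; the folded address mode shown is illustrative, not RyuJIT's actual output):

static int LoadNext(int[] arr, int i)
{
    int j = i + 1;      // index computed in one statement...
    return arr[j];      // ...used in the next
}

// Forward-propagating `i + 1` into the use would let the load fold into a
// single address mode, e.g. on x64 something like:
//     mov eax, dword ptr [rax + rcx*4 + 0x14]
// instead of materializing j separately and then forming the address.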

the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
with or making easily available an alt jit that supports JitDisasm.
- Hardware companies maintain optimization/performance guides for their ISAs.
Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a

ISAs are far more complicated than MSIL in this regard, so it makes sense that such guides exist. I don't think there's a lot that can be done here, but here are a few ideas:

  • Is it better to rely more on the IL stack (via dup), or to use IL local variables?
  • Use of IL switch: when is it better to replace it with a search tree? (See the sketch after this list.)
  • Is there perhaps a "best" flowgraph layout for IL? Like having multiple returns, or having a single return?
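A hedged C# sketch of the switch question (typical lowering strategies; actual Roslyn/RyuJIT choices may differ):

// Dense case values: a natural fit for the IL switch opcode (a jump table).
static string Dense(int x)
{
    switch (x)
    {
        case 0: return "zero";
        case 1: return "one";
        case 2: return "two";
        default: return "other";
    }
}

// Sparse case values: a jump table would be mostly empty, so a search tree
// of compares (a log-depth if/else chain) is usually the better shape.
static string Sparse(int x)
{
    switch (x)
    {
        case 1: return "one";
        case 1000: return "thousand";
        case 1000000: return "million";
        default: return "other";
    }
}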

collect profiles and correlate them with that machine code. This could
benefit any developers doing performance analysis of their own code.
The JIT team has discussed this, options include building something on top of
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping

Having JitDisasm in release builds would certainly be nice, but it may also be limiting (e.g. right now it outputs to the console, so it can interfere with the application's own output). The current disassembler output is also a bit inaccurate at times -- usually not a big problem, but it can be confusing.

Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd-party disassembler, or perhaps find more creative uses.

@JosephTremoulet (Author):

Yes, we'd need a way to send the disasm somewhere other than stdout. I believe there's already some functionality to send JIT output to a logfile; if we do this, we'd of course need to make sure that it works, and works well, with JitDisasm.

To my mind, the appeal of making JitDisasm available over disassembling the emitted code is that it would make it easy to bring along all the annotations we put in the disasm (method name, optimization flags, symbols and helper call names, annotated GC/EH tables, etc.), as well as things like DiffableDisasm.

> Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd-party disassembler, or perhaps find more creative uses.

There is CLR MD, which, for example, SharpLab uses for in-proc disassembly with a 3rd-party disassembler.

@4creators (Jul 24, 2017):

It would be very helpful to have a "side by side" very-high-resolution profiler. My suggestion would be to include, as one of the available profiling options, the code described in the paper "Computer Performance Microscopy with Shim" by X. Yang, S. M. Blackburn, and K. S. McKinley (ACM SIGARCH Computer Architecture News, 2016). This profiler allows 15-processor-cycle resolution with overhead around 68%, and 1000-processor-cycle resolution with overhead at 2%, with no or very small observer effects. AFAIR, the currently used code (thread cycle measurements in utilities) has significant overhead: in the range of 200 processor cycles for a single measurement, or 400 cycles for the two-point measurement necessary to determine a time interval (cpuid + rdtsc instructions, or a similar serializing time-stamp-counter read). The last author, Kathryn S. McKinley, is at Microsoft Research, and the code is available at https://github.com/ShimProfiler/SHIM under GPLv2. The work was funded by NSF SHF-0910818 and ARC DP140103878 grants (the other co-authors are at the Australian National University), so it might be possible to license it from the US government under other terms.

Quite often I would like to know how long a performance-critical method executes in a real application, and yet it is often called only once during a typical application life cycle -- e.g. image decompression, a coding algorithm for short data sequences, or some part of a multi-stage / multi-algorithm pipeline. If typical benchmarks are used, the method is isolated from its usual context, and its execution time can be very different from the execution time when the method runs once in application context. In my experiments with managed code on .NET 4.6 - 4.7, the difference could be as large as 3 - 5 times.

@alpencolt:

cc @dotnet/arm32-contrib

@4creators (Jul 22, 2017):

It's great that these documents will appear in the coreclr documentation. It is a prerequisite for better communication with the coreclr community and .NET Core users on one of the most important subjects -- performance. And I would like to thank @JosephTremoulet for creating them -- see my comment on Performance Improvements in RyuJIT in .NET Core and .NET Framework.

@JosephTremoulet (Author):

@4creators, thanks for the comments! Fully agree that active discussions around performance are vital.

improving if resources were unlimited. This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.

@4creators (Jul 24, 2017):

It would be very useful to have a description of the existing optimizations, with info on the implemented algorithms and links to the code -- an "Optimizer Codebase and Status" document. This would help in understanding the existing RyuJIT implementation.

> It would be very useful to have a description of the existing optimizations, with info on the implemented algorithms and links to the code

This is more or less available in the existing documentation: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md

@4creators (Jul 24, 2017):

I know that document -- I've read it twice already, and for my taste I would like to go deeper, with more detailed links to the code. My intention is to indicate that the documentation on the JIT, VM, and GC should allow an experienced developer to understand the implementation well enough that the so-called time-to-first-commit is as short as possible. The usual problem with developer documentation is that it is best written by the code's authors, who had to write the code in the first place and do not have much time left for documenting their work. Another aspect of the same problem is the barrier to contributing to the project, which has a major impact on the size of the community and the dynamics of open-source development. I would treat investment in documentation as an investment in the community supporting the project.

We've made note of the prevalence of async/await in modern code (and particularly
in web server code such as TechEmpower), but haven't really identified concrete
opportunities presented here, to my knowledge. Some sort of study of async
peanut butter is probably in order, but what would that look like?
@benaadams (Member, Jul 25, 2017):

When looking at pre-completed async, you'd generally see a ~10x perf dip by using await vs. checking completion and using .Result. As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await, everything was in memory locations. https://github.com/dotnet/corefx/issues/18481
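For reference, a sketch of the pre-check pattern being compared against plain await (hypothetical method name; error handling elided):

static async Task<int> ReadPrechecked(Stream s, byte[] buf)
{
    Task<int> t = s.ReadAsync(buf, 0, buf.Length);
    if (t.Status == TaskStatus.RanToCompletion)
        return t.Result;   // synchronous fast path: no state-machine transition,
                           // locals can stay in registers
    return await t;        // asynchronous path
}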

> As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await, everything was in memory locations

Well, local variables that are live across an await aren't really local variables anymore; they're fields of a struct.

Member:

Yep, and it means a function which has an await has a disproportionately higher minimum cost for sync completion. It can be coded around by hand, so it would be nice if a compiler resolved it instead (C# or JIT) -- I haven't worked out a general solution though, so I'm just highlighting it.

> so it would be nice if a compiler resolved it instead (C# or JIT)

Hmm, maybe some improvements could be made, but it's not going to be trivial. One problem is the memory model: all stores have release semantics, and that limits the JIT's ability to optimize such code.

@mikedn, @omariom is right. The ECMA IL spec only requires release semantics for stores with the IL volatile prefix; certainly that is all arm64 does. However, there could easily be release semantics involved in cross-threading constructs like await.

Member:

> But how is the JIT supposed to know all that?

Until it boxes? At that point it's going async, and memory vs. register becomes a minimal factor. Though user-defined Task-likes could make it much more complicated.

@mikedn (Jul 26, 2017):

> At that point it's going async, and memory vs. register becomes a minimal factor.

Yes, that's the idea: move some of the loads/stores onto the "going async" path. Though it looks like Roslyn is in a better position to do this. For example:

static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    await Task.CompletedTask;
    for (; i < 20; i++)
        Whatever(i);
}

In this example `i` becomes a struct field, and every access to it turns into a memory access. To make things worse, all of this is inside a try/catch block. Accessing `i` requires `this`, and `this` is live in the catch block and thus cannot be enregistered. So it's actually 2 memory accesses for every access to `i`:

G_M33543_IG04:
       488B4D10             mov      rcx, bword ptr [rbp+10H]
       8B4904               mov      ecx, dword ptr [rcx+4]
       E82AFAFFFF           call     Program:Whatever(int)
       488B4D10             mov      rcx, bword ptr [rbp+10H]
       8B7104               mov      esi, dword ptr [rcx+4]
       FFC6                 inc      esi
       488B4D10             mov      rcx, bword ptr [rbp+10H]
       897104               mov      dword ptr [rcx+4], esi
       488B4D10             mov      rcx, bword ptr [rbp+10H]
       8379040A             cmp      dword ptr [rcx+4], 10
       7CDA                 jl       SHORT G_M33543_IG04

Perhaps Roslyn could generate something like the following (where `this.i` stands for the state machine's field):

static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    this.i = i; // "spill" i to a struct field
    await Task.CompletedTask;
    i = this.i; // "unspill" i from the struct field
    for (; i < 20; i++)
        Whatever(i);
}

Member:

cc: @VSadov

Member:

See also #7914 and some of the changes that went in to Roslyn as a result.

@JosephTremoulet (Author):

> See also #7914

Updated with a link to that, thanks.

@omariom (Jul 25, 2017):

> all stores have release semantics

Isn't it for stores to fields of reference types only?

@mikedn (Jul 26, 2017):

> Isn't it for stores to fields of reference types only?

It doesn't matter whether the type is a reference type or a value type. What matters is whether the store goes to the stack (where it cannot be observed by any other thread) or to the heap (where it may be observed). And of course, a value type can end up on the heap by being boxed or as a field of a reference type. The struct used in async code does end up being boxed if the task is not already completed.
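A minimal sketch of the stack/heap distinction (illustrative only):

class Shared { public int X; }

static int StackOnly()
{
    int i = 0;
    for (int k = 0; k < 10; k++)
        i += k;        // i can't be observed by another thread, so the JIT is
    return i;          // free to enregister it and reorder/eliminate the stores
}

static void HeapStores(Shared s)
{
    s.X = 1;           // heap stores may be observed by other threads; under the
    s.X = 2;           // memory model the runtime implements, they have release
                       // semantics, which constrains how they can be reordered
}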

@CarolEidt left a comment:

Looks good; I hope that this will be a living document that evolves to reflect new data, new ideas and new results. Thanks for writing this up!

------------------------

This is an area that has received recent attention, with the first-class structs
work and the struct promotion improvements that went in for `Span<T>`. Work here


You might want to add a reference here to https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, although I confess that it needs to be updated with current status. There are still a number of work items left to complete, many (most?) of which have associated issues.

@JosephTremoulet (Author):

Added.

key optimization here, and we are actively pursuing it. Other things we've
discussed include inlining methods with EH and computing funclet callee-save
register usage independently of main function callee-save register usage, but
I don't think we have any particular data pointing to either as a high priority.


While EH write-thru is not strictly an optimization, the lack of it is an inhibitor to improving the performance of EH-intensive code (see https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md; I also submitted a PR with a suggested alternate approach).
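To illustrate what EH write-thru enables (a hedged sketch; `Whatever` is a placeholder):

static void Whatever(int x) { }

static int Example()
{
    int i = 0;
    try
    {
        for (; i < 10; i++)       // i is live into the catch handler, so today the
            Whatever(i);          // JIT keeps i on the stack for its whole lifetime;
    }                             // with EH write-thru, i could live in a register
    catch (System.Exception)      // and merely be stored ("written through") to its
    {                             // stack home at each definition, so the handler
        return i;                 // still sees the up-to-date value
    }
    return i;
}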

@JosephTremoulet (Author):

Added link, rephrased to "optimization enabler".

opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now. Also, using SSA def as the proxy for value
number would require handling SSA renaming, so there's a big dependency chained
to this.


I'm not sure what you mean by using SSA def as the proxy for value number. Could you clarify?

@JosephTremoulet (Author):

I mean eagerly replacing redundant expressions, and thus being able to approximate "has the same value" with "is a use of the same SSA def" (and re-casting the heap VN stuff as memory SSA), rather than dragging around side tables of value numbers in a separate expression language.
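A source-level sketch of the idea (the real transformation would be on the JIT's IR, not C#):

static int Before(int a, int b)
{
    int t1 = a + b;    // value numbering: t1 gets VN1 in a side table
    int t2 = a + b;    // ...and t2 is proven to have VN1 as well
    return t1 * t2;
}

static int After(int a, int b)
{
    int t1 = a + b;    // a single SSA def
    int t2 = t1;       // redundancy replaced eagerly: "has the same value" becomes
                       // "is a use of the same SSA def", with no side table needed
    return t1 * t2;
}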

assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting. We should get that benchmarking story sorted out and at least hook
up those two knobs.


Again, not really an optimization issue, but it's pretty clear that existing issues with register allocation (and in particular, issues with spill placement) are a current inhibitor to more aggressive optimization.

@JosephTremoulet (Author):

Could you elaborate? Are you saying we'd do more aggressive post-RA optimization with better-placed spills, or do more aggressive pre-RA optimization if we had better spill placement in the RA to rely on, or both/neither? And specifically is there something you think the doc should say about this under "High Tier Optimizations" (like that we could use a different RA algorithm)?


I was saying the latter, and I think all the doc really needs to say is that, until the RA issues are mitigated, aggressive optimizations are likely to be pessimized by RA issues and/or to potentially make performance worse. Whether or not we need a different RA algorithm remains to be seen, but I think there's a lot of potential improvement in the existing RA algorithm that has not yet been achieved.

@JosephTremoulet (Author):

Makes sense. Added a note to that effect.
