Add documents about JIT optimization planning #12956
Conversation
@dotnet/jit-contrib PTAL
situations, the better, as it both increases developer productivity and
increases the usefulness of abstractions provided by the language and
libraries. Finding a measurable metric to track this type of improvement
poses a challenge, but would be a big help toward prioritizing and validating
---
We use all of those, including also:
- Forcing inlining like crazy to avoid method calls
- Using gotos to build "sub-routines", avoiding both the cost of the call and the size cost of inlining multiple copies into big methods (see the sketch below)

Examples of the result: https://ayende.com/blog/177569/why-we-arent-publishing-benchmarks-for-ravendb-4-0-yet
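A minimal, contrived sketch of that goto-as-subroutine pattern (everything here is invented for illustration, not taken from the linked post): one copy of a hot code sequence serves two "call sites" in the same method, avoiding both call overhead and duplicated inlined bodies.

```csharp
static int SumWithSharedBlock(byte[] data)
{
    int acc = 0, i = 0, resume = 0;

    resume = 1;                     // first "call site"
    goto Accumulate;
Resume1:
    i = data.Length / 2;
    resume = 2;                     // second "call site"
    goto Accumulate;
Resume2:
    return acc;

Accumulate:                         // shared "sub-routine": a single copy of the code
    for (; i < data.Length; i++)
        acc += data[i];
    if (resume == 1) goto Resume1;  // manual "return" dispatch
    goto Resume2;
}
```

The usual trade-off applies: this saves code size and call overhead at the cost of readability and a manual dispatch branch, which is why it tends to show up only in very hot methods.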
---
Thanks for the examples, this is exactly the sort of list I'm hoping we can build/prioritize/address.
I like the idea of making COMPlus_JitDisasm available in release builds. Currently I have to build CoreCLR myself and can only do it at home.
Mulshift
--------

Replacing multiplication by constants with shift/add/lea sequences is a
---
Eh, the JIT does some of this already and I suspect it wouldn't be much trouble to make it do more.
That's a way of saying "do we really need a planning document to make it happen"? :)
---
Pull requests welcome :)
No, I'm not trying to modify our workflow, impose heavier process, or demand that changes get added to this document before (or after) getting implemented, or anything like that -- I'm just capturing a list of items we keep discussing in planning to avoid having to re-create the discussion.
---
> Pull requests welcome :)

Not anytime soon; I don't think it's a very useful optimization (beyond what we already have now).
---
From the perspective of a developer working on the core team, saying:

> That's a way of saying "do we really need a planning document to make it happen"? :)

is entirely understandable, since he/she is on the bleeding edge of project development, but from the perspective of potential community members such a document would be very helpful and welcome.
---
> From perspective of a developer working in core team saying ...

I'm not quite sure what you are trying to say. Just to be clear, I'm a community member, not a core team member :)
---
From my perspective, your knowledge of dotnet and your contributions suggest that you are a core team member :) - git blame does not lie, does it?
---
I ran some stats and yeah it looks like we're already getting nearly everything that makes sense, will put together a PR for the few stragglers that seem worthwhile.
---
> I ran some stats

Wow, that's a bit of work. It would be nice to know how you instrumented the JIT. Or to be more precise - how you got the numbers out of the JIT. Files, I presume?
I didn't know that morph does this, I only knew about codegen. Now that I see this I'm not so sure it's a good idea to have this in morph. For one thing it increases IR size and it's not likely to enable additional optimizations, quite the contrary.
But more importantly, this really belongs in codegen as it is a very target specific optimization. IMUL is quite fast these days - latency 3 and throughput 1. Replacing it with a single LEA or SHR is pretty much always a win but the moment you replace it with 2 LEA/SHR instructions things become complicated. Those 2 instructions will have at least 2 cycle latency so in the best case you're saving 1 cycle at the cost of adding an instruction.
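A source-level sketch of that trade-off (method names invented; the asm in the comments is the intended x64 codegen, not verified output):

```csharp
static int MulBy10_Imul(int x) => x * 10;   // imul eax, ecx, 10 -- one instruction, ~3 cycle latency

static int MulBy10_Decomposed(int x)
{
    int x5 = x + x * 4;                     // lea  eax, [rcx+rcx*4]
    return x5 + x5;                         // add  eax, eax -- two dependent instructions, ~2 cycle latency
}
```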
---
I added instance fields to `Compiler`, modified Morph and CodeGen to record the data in the new fields during processing, modified asm emission to print them in method headers, then ran jit-diff and pulled the data out of the dasm files and ultimately into Excel.
I agree it seems like something that should live in the backend. cc @russellhadley who had some reasons to prefer Lower to CodeGen.
I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.
---
> who had some reasons to prefer Lower to CodeGen

Yeah, I prefer Lower too. Doing this kind of stuff in CodeGen sometimes also requires adding logic in Lower or TreeNodeInfoInit, and that logic needs to be kept in sync, otherwise bugs or CQ issues show up. But if we do it in Lower we also need to add a new lclvar because the non-constant operand of MUL has multiple uses.

> I'm not planning to stop and migrate it now (bigger fish to fry), but would be happy to see that happen.

I might take a look once I finish my pesky cast optimization attempt.
------------------------------

This hits big in microbenchmarks where it hits. There's some work in flight
on this (see #7447).
---
FWIW the "work" is in #10861; #7447 is just an issue with some discussion about this.
---
Thanks, updated.
expressions could be folded into memory operands' address expressions. This
would likely give a measurable codesize win. Needs some thought about where
to run in phase list and how aggressive to be about e.g. analyzing across
statements.
---
Isn't this related to forward substitution?
---
Yes, certainly. I suppose I mentioned it here simply thinking that if we tackle the address mode thing it might be worthwhile to add some simple forward propagation as part of that, which could then be refactored/subsumed if we add more general forward substitution subsequently.
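For concreteness, a tiny invented example (not from the doc) of the kind of case where simple forward propagation and address-mode folding interact:

```csharp
static int LoadAt(int[] a, int i)
{
    int j = i + 4;   // if j has this single use, forward-substituting it into the
    return a[j];     // indexing lets the add fold into the load's address mode,
                     // e.g. something like mov eax, [rax + rcx*4 + 0x20] on x64
}
```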
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
with or making easily available an alt jit that supports JitDisasm.
- Hardware companies maintain optimization/performance guides for their ISAs.
  Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a
---
ISAs are far more complicated than MSIL in this regard, so it makes sense that there are such guides. I don't think there's a lot that can be done here, but here are a few ideas:

- Is it better to rely more on the IL stack (via `dup`) or is it better to use IL local variables?
- Use of IL `switch` - when is it better to replace it with a search tree?
- Is there perhaps a "best" fg layout for IL? Like having multiple returns or having a single return?
collect profiles and correlate them with that machine code. This could
benefit any developers doing performance analysis of their own code.
The JIT team has discussed this, options include building something on top of
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
---
Having JitDisasm in release builds would certainly be nice but it may also be limiting (e.g. right now it outputs to the console so it can interfere with the application's own output). The current disassembler output is also a bit inaccurate at times, not a big problem usually but it can be confusing.
Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd party disassembler or perhaps find more creative uses.
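Nothing like this exists today; purely as a strawman, the suggested API might look something like the following (all names hypothetical):

```csharp
using System;
using System.Reflection;

// Hypothetical: a range of native code produced by the JIT for one method.
public readonly struct NativeCodeRange
{
    public IntPtr Start { get; }
    public int Length { get; }
    public NativeCodeRange(IntPtr start, int length) { Start = start; Length = length; }
}

public static class JitInspection
{
    // Hypothetical: report where the JIT put the code for an already-jitted
    // method, so a 3rd-party disassembler can be pointed at it.
    public static NativeCodeRange? TryGetNativeCode(MethodBase method)
        => throw new NotImplementedException(); // would be implemented by the runtime
}
```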
---
Yes, we'd need to have a way to send the disasm somewhere other than stdout. I believe there's already some functionality to send JIT output to a logfile; if we go this route, we'd of course need to make sure it works well with JitDisasm.
To my mind, the appeal of making JitDisasm available over disassembling the emitted code is that it would make it easy to bring along all the annotations we put in the disasm (method name, optimization flags, symbols and helper call names, annotated GC/EH tables, etc.), as well as things like DiffableDisasm.
> Another interesting option might be for the runtime to expose a managed API that offers information (e.g. code ranges) about JITed functions. That would allow people to use a 3rd party disassembler or perhaps find more creative uses.

There is CLR MD, which for example SharpLab is using for in-proc disassembly with a 3rd-party disassembler.
---
It would be very helpful to have a "side by side" very-high-resolution profiler. My suggestion would be to include, as one of the available profiling options, the code described in the paper "Computer Performance Microscopy with Shim", X. Yang, S. M. Blackburn, K. S. McKinley - ACM SIGARCH Computer Architecture News, 2016. This profiler allows for 15-processor-cycle resolution with overhead around 68%, and 1000-processor-cycle resolution with overhead at 2%, with no or very small observer effects. AFAIR the currently used code (thread cycle measurements in utilities) has significant overhead - in the range of 200 processor cycles for a single measurement, or 400 cycles for the two-point measurement necessary to determine a time interval (cpuid + rdtsc instructions or a similar serializing time-stamp-counter read). The last author of the paper, Kathryn S. McKinley, is at Microsoft Research, and the code is available at https://github.com/ShimProfiler/SHIM under GPLv2. The work was funded by NSF SHF-0910818 and ARC DP140103878 grants (the other co-authors are at the Australian National University), so it might be possible to license it from the US government on other terms.
It is quite common that I would like to know how long a performance-critical method executes in a real application, and yet it is often called only once during a typical application life cycle - e.g. image decompression, a coding algorithm for short data sequences, or some part of a multi-stage/multi-algorithm pipeline. If typical benchmarks are used, the method is isolated from its usual context, and its execution time can be very different from the execution time when the method runs once in application context. In my experiments in managed code on .NET 4.6 - 4.7, the difference could be as large as 3 - 5 times.
cc @dotnet/arm32-contrib
It's great that these documents will appear in the coreclr documentation. It is a prerequisite for better communication with the coreclr community and .NET Core users on one of the most important subjects - performance. And I would like to thank @JosephTremoulet for creating them - see my comment on Performance Improvements in RyuJIT in .NET Core and .NET Framework.
@4creators, thanks for the comments! Fully agree that active discussions around performance are vital.
improving if resources were unlimited. This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.
---
It would be very useful to have a description of existing optimisations with info on implemented algorithms and links to the code - Optimizer Codebase and Status. This would help in understanding the existing RyuJIT implementation.
---
> It would be very useful to have a description of existing optimisations with info on implemented algorithms and links to the code

This is more or less available in the existing documentation: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md
---
I know that document - I've already read it twice, and to my taste I would like to go deeper, with more detailed links to the code. My point is that documentation on the JIT, VM, and GC should let an experienced developer understand the implementation well enough that the so-called time-to-first-commit is as short as possible. The usual problem with documentation for developers is that it is best when written by the code authors, who have to write the code in the first place and do not have much time for documenting their work. Another aspect of the same problem is the barrier to contributing to the project, which has a major impact on the size of the community and the dynamics of open source project development. I would treat investment in documentation as an investment in the community supporting the project.
We've made note of the prevalence of async/await in modern code (and particularly
in web server code such as TechEmpower), but haven't really identified concrete
opportunities presented here, to my knowledge. Some sort of study of async
peanut butter is probably in order, but what would that look like?
---
When looking at pre-completed async, you'd generally see a ~10x difference per dip by using await vs checking completion and using .Result. As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await everything was in memory locations: https://github.com/dotnet/corefx/issues/18481
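For reference, a sketch of the two patterns being compared (method names invented; the pre-check version reads `.Result` only after observing completion):

```csharp
static async Task<int> ViaAwait(Task<int> task)
{
    int v = await task;    // pays the full state-machine cost even when
    return v + 1;          // 'task' has already completed
}

static Task<int> ViaPreCheck(Task<int> task)
{
    if (task.IsCompleted)                         // pre-completed fast path:
        return Task.FromResult(task.Result + 1);  // no state machine, locals stay in registers
    return ViaAwait(task);                        // genuinely-async slow path
}
```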
---
> As best I could tell from a light investigation, when using the pre-check everything was in registers; when using await everything was in memory locations

Well, local variables that are live across await aren't really local variables anymore, they're fields of a struct.
---
Yep, and it means a function which has an `await` has a disproportionately higher min cost for sync completion; it can be coded out, so it would be nice if a compiler resolved it instead (C# or JIT) - haven't worked out a general solution though, so just highlighting it.
---
> so it would be nice if a compiler resolved it instead (C# or JIT)

Hmm, maybe some improvements could be done, but it's not going to be trivial. One problem is the memory model: all stores have release semantics, and that limits the JIT's ability to optimize such code.
---
> But how is the JIT supposed to know all that?

Until it boxes? At that point it's going async, and then memory vs register becomes a minimal factor. Though user-defined Task-likes could make it much more complicated.
---
> At that point it's going async, then memory vs register becomes a minimal factor.

Yes, that's the idea: move some loads/stores onto the "going async" path. Though it looks like Roslyn is in a better position to do this. For example:
```csharp
static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    await Task.CompletedTask;
    for (; i < 20; i++)
        Whatever(i);
}
```
In this example `i` becomes a struct field and every access to it turns into a memory access. To make things worse, all this is in a try/catch block. Accessing `i` requires `this`, and `this` is live in the catch block and thus cannot be enregistered. So it's actually 2 memory accesses for every access to `i`:
```asm
G_M33543_IG04:
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8B4904       mov      ecx, dword ptr [rcx+4]
       E82AFAFFFF   call     Program:Whatever(int)
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8B7104       mov      esi, dword ptr [rcx+4]
       FFC6         inc      esi
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       897104       mov      dword ptr [rcx+4], esi
       488B4D10     mov      rcx, bword ptr [rbp+10H]
       8379040A     cmp      dword ptr [rcx+4], 10
       7CDA         jl       SHORT G_M33543_IG04
```
Perhaps Roslyn could generate something like:
```csharp
static async Task Test()
{
    int i;
    for (i = 0; i < 10; i++)
        Whatever(i);
    this.i = i; // "spill" i to a struct field
    await Task.CompletedTask;
    i = this.i; // "unspill" i from the struct field
    for (; i < 20; i++)
        Whatever(i);
}
```
---
cc: @VSadov
---
See also #7914 and some of the changes that went in to Roslyn as a result.
---
> See also #7914

Updated with link to that, thanks.
Isn't it for stores to fields of reference types only?
It doesn't matter if the type is a reference or a value type. What matters is whether the store goes to the stack (where it cannot be observed by any other thread) or to the heap (where it may be observed). And of course, a value type can end up on the heap by being boxed or as a field of a reference type. The struct used in async code does end up being boxed if the task is not already completed.
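A minimal sketch of that distinction (types invented for illustration):

```csharp
struct Counter { public int Value; }

class Holder { public Counter C; }

static int OnStack()
{
    Counter c = default(Counter);  // stack-only: no other thread can observe it,
    c.Value = 1;                   // so the JIT is free to enregister, reorder,
    c.Value = 2;                   // or eliminate these stores
    return c.Value;
}

static void OnHeap(Holder h)
{
    h.C.Value = 1;                 // heap stores: potentially visible to other
    h.C.Value = 2;                 // threads, so the JIT is far more constrained
}
```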
---
Looks good; I hope that this will be a living document that evolves to reflect new data, new ideas and new results. Thanks for writing this up!
------------------------

This is an area that has received recent attention, with the first-class structs
work and the struct promotion improvements that went in for `Span<T>`. Work here
---
You might want to add a reference here to https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, although I confess that it needs to be updated with current status. There are still a number of work items left to complete, many (most?) of which have associated issues.
---
Added.
key optimization here, and we are actively pursuing it. Other things we've
discussed include inlining methods with EH and computing funclet callee-save
register usage independently of main function callee-save register usage, but
I don't think we have any particular data pointing to either as a high priority.
---
While EH Write Thru is not strictly an optimization, the lack of it is an inhibitor to improving the performance of EH-intensive code (see https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md; I also submitted a PR with a suggested alternate approach).
---
Added link, rephrased to "optimization enabler".
opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now. Also, using SSA def as the proxy for value
number would require handling SSA renaming, so there's a big dependency chained
to this.
---
I'm not sure what you mean by "using SSA def as the proxy for value number". Could you clarify?
---
I mean eagerly replacing redundant expressions and thus being able to approximate "has same value" with "is use of same SSA def" (and re-casting the heap VN stuff as memory SSA) rather than dragging around side tables of value numbers in a separate expression language.
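Roughly, in source-level terms (invented example):

```csharp
static int Sum(int a, int b)
{
    int t = a * b;    // the single def: both occurrences of the redundant
    int x = t + 1;    // expression a * b become uses of t, so "has same
    int y = t + 2;    // value" reduces to "is a use of the same def"
    return x + y;
}
```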
assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting. We should get that benchmarking story sorted out and at least hook
up those two knobs.
---
Again, not really an optimization issue, but it's pretty clear that existing issues with register allocation (and in particular, issues with spill placement) are a current inhibitor to more aggressive optimization.
---
Could you elaborate? Are you saying we'd do more aggressive post-RA optimization with better-placed spills, or do more aggressive pre-RA optimization if we had better spill placement in the RA to rely on, or both/neither? And specifically is there something you think the doc should say about this under "High Tier Optimizations" (like that we could use a different RA algorithm)?
---
I was saying the latter, and I think that all the doc really needs to say is that, until the RA issues are mitigated, aggressive optimizations are likely to be pessimized by RA issues and/or potentially make performance worse. Whether or not we need a different RA algorithm, I think, remains to be seen, but I think there's a lot of potential improvement with the existing RA algorithm that has not yet been achieved.
---
Makes sense. Added a note to that effect.
This change adds two documents:

- JitOptimizerPlanningGuide.md discusses how we can/do/should go about
  identifying, prioritizing, and validating optimization improvement
  opportunities, as well as several ideas for how we might improve the
  process.
- JitOptimizerTodoAssessment.md lists several potential optimization
  improvements that always come up in planning discussions, with brief
  notes about each, to capture current thinking.