Add documents about JIT optimization planning
This change adds two documents:

 - JitOptimizerPlanningGuide.md discusses how we can/do/should go about
   identifying, prioritizing, and validating optimization improvement
   opportunities, as well as several ideas for how we might improve the
   process.
 - JitOptimizerTodoAssessment.md lists several potential optimization
   improvements that always come up in planning discussions, with brief
   notes about each, to capture current thinking.
JosephTremoulet committed Jul 31, 2017
1 parent f17fae2 commit 6b38dca
Showing 2 changed files with 261 additions and 0 deletions.
127 changes: 127 additions & 0 deletions Documentation/performance/JitOptimizerPlanningGuide.md
@@ -0,0 +1,127 @@
JIT Optimizer Planning Guide
============================

The goal of this document is to capture some thinking about the process used to
prioritize and validate optimizer investments. The overriding goal of such
investments is to help ensure that the dotnet platform satisfies developers'
performance needs.


Benchmarking
------------

There are a number of public benchmarks which evaluate different platforms'
relative performance, so naturally dotnet's scores on such benchmarks give
some indication of how well it satisfies developers' performance needs. The JIT
team has used some of these benchmarks, particularly [TechEmpower](https://www.techempower.com/benchmarks/)
and [Benchmarks Game](http://benchmarksgame.alioth.debian.org/), for scouting
out optimization opportunities and prioritizing optimization improvements.
While it is important to track scores on such benchmarks to validate performance
changes in the dotnet platform as a whole, when it comes to planning and
prioritizing JIT optimization improvements specifically, they aren't sufficient,
due to a few well-known issues:

- For macro-benchmarks, such as TechEmpower, compiler optimization is often not
the dominant factor in performance. The effects of individual optimizer
changes are most often in the sub-percent range, well below the noise level
of the measurements, which will usually be at least 3% or so even for the
most well-behaved macro-benchmarks.
- Source-level changes can be made much more rapidly than compiler optimization
changes. This means that on benchmarks where the whole team is effecting
changes in source, runtime, etc., any particular code sequence we may target
with optimization improvements may well be rewritten at the source level in
the interim, nullifying the measured benefit of the optimization change when
it is eventually merged. Source/library/runtime changes are in play for both
TechEmpower and Benchmarks Game.

Compiler micro-benchmarks (like those in our [test tree](https://github.com/dotnet/coreclr/tree/master/tests/src/JIT/Performance/CodeQuality))
don't share these issues, and adding them as optimizations are implemented is
critical for validation and regression prevention. However, micro-benchmarks
often aren't as representative of real-world code, and therefore are less
reflective of developers' performance needs, so they aren't well suited for
scouting out and prioritizing opportunities.
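
To make the shape of such a micro-benchmark concrete, here is a minimal
standalone sketch with a hand-rolled timing loop (the actual tests in the tree
run under a benchmark harness; all names here are hypothetical):

```csharp
using System;
using System.Diagnostics;

static class MicroBenchSketch
{
    const int Iterations = 100_000;

    // The code sequence under test; kept small so its generated machine
    // code can be inspected and tracked for regressions.
    static int SumOfSquares(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i] * data[i];
        return sum;
    }

    static void Main()
    {
        var data = new int[1000];
        for (int i = 0; i < data.Length; i++)
            data[i] = i;

        SumOfSquares(data);                  // warm up so the method is jitted

        var sw = Stopwatch.StartNew();
        int result = 0;
        for (int i = 0; i < Iterations; i++)
            result |= SumOfSquares(data);    // consume results to keep the JIT honest
        sw.Stop();
        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds:F1} ms (checksum {result})");
    }
}
```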


Benefits of JIT Optimization
----------------------------

While source changes can more rapidly and dramatically effect changes to
targeted hot code sequences in macro-benchmarks, compiler changes have the
advantage that they apply broadly to all compiled code. One of the best reasons
to invest in compiler optimization improvements is to capitalize on this. A few
specific benefits:

- Optimizer changes can effect "peanut-butter" improvements; by making an
improvement which is small in any particular instance to a code sequence that
is repeated thousands of times across a codebase, they can produce substantial
cumulative wins. These should accrue toward the standard metrics (benchmark
scores and code size), but identifying the most profitable "peanut-butter"
opportunities is difficult. Improving our methodology for identifying such
opportunities would be helpful; some ideas are below.
- Optimizer changes can unblock coding patterns that performance-sensitive
developers want to employ but consider prohibitively expensive. They may have
inelegant workarounds in their code, such as gotos for loop-exiting returns
to work around poor block layout, manually scalarized structs to work around
poor struct promotion (see the sketch following this list), manually unrolled
loops to work around lack of loop unrolling, limited use of lambdas to work
around inefficient access to heap-allocated closures, etc. The more the
optimizer can improve such
situations, the better, as it both increases developer productivity and
increases the usefulness of abstractions provided by the language and
libraries. Finding a measurable metric to track this type of improvement
poses a challenge, but would be a big help toward prioritizing and validating
optimization improvements; again, some ideas are below.
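
As one concrete illustration, here is a hedged sketch of the "manually
scalarized struct" workaround mentioned above (names are hypothetical; whether
the natural form actually suffers depends on how promotion handles the
particular struct):

```csharp
struct Point { public double X, Y; }

static class Workarounds
{
    // Natural style: pass the struct and let the JIT promote its fields.
    // When promotion fails, the fields may round-trip through memory.
    static double LengthSquared(Point p) => p.X * p.X + p.Y * p.Y;

    // Workaround style seen in performance-sensitive code: the author
    // scalarizes by hand so each field lives in its own local/register.
    static double LengthSquaredScalarized(double x, double y) => x * x + y * y;
}
```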


Brainstorm
----------

Listed here are several ideas for undertakings we might pursue to improve our
ability to identify opportunities and validate/track improvements that mesh
with the benefits discussed above. Thinking here is in the early stages, but
the hope is that with some thought/discussion some of these will surface as
worth investing in.

- Is there telemetry we can implement/analyze to identify "peanut-butter"
opportunities, or target "coding patterns"? It is probably easier to use this
to evaluate/prioritize patterns we're considering targeting than to identify
the patterns in the first place.
- Can we construct some sort of "peanut-butter profiler"? The idea would
roughly be to aggregate samples/counters under particular input constructs
rather than aggregate them under callstack. Might it be interesting to
group by MSIL opcode, or opcode pair, or opcode triplet... ?
- It might behoove us to build up some SPMI traces that could be data-mined
for any of these experiments.
- We should make it easy to view machine code emitted by the jit, and to
collect profiles and correlate them with that machine code. This could
benefit any developers doing performance analysis of their own code.
The JIT team has discussed this; options include building something on top of
the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
with (or making easily available) an alt jit that supports JitDisasm.
- Hardware companies maintain optimization/performance guides for their ISAs.
Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a
thing somewhere publicly votable, we could track which anti-patterns people
find most frustrating to avoid, and their subsequent removal. Does such a
guide already exist somewhere that we could use as a starting point?
Should we collate GitHub issues or Stack Overflow issues to create such a thing?
- Maybe we should expand our labels on GitHub so that there are sub-areas
within "optimization"? It could help prioritize by letting us compare the
relative sizes of those buckets.
- Can we more effectively leverage the legacy JIT codebases for comparative
analysis? We've compared micro-benchmark performance against Jit64 and
manually compared disassembly of hot code; what else can we do? One concrete
idea: run over some large corpus of code (SPMI?) and do a path-length
comparison, e.g. by looking at each sequence of k MSIL instructions (for some
small k), collecting statistics on the size of generated machine code for
each combination of k opcodes (maybe using debug line number info to do the
correlation?), and then looking for common sequences which are much longer
with RyuJIT (see the sketch following this list).
- Maybe hook RyuJIT up to some sort of superoptimizer to identify opportunities?
- Microsoft Research has done some experimenting that involved converting RyuJIT
IR to LLVM IR; perhaps we could use this to identify common expressions that
could be much better optimized.
- What's a practical way to establish a metric of "unblocked coding patterns"?
- How developers give feedback about patterns/performance could use some thought;
the GitHub issue list is open, but does it need to be publicized somehow?
Perhaps we should have some regular process where we pull issues over from other
places where people report/discuss dotnet performance issues, like
[Stack Overflow](https://stackoverflow.com/questions/tagged/performance+.net).
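
Below is a hedged sketch of how the path-length mining idea from the list
above might look in code. All names are hypothetical, the 1.5x threshold is
arbitrary, and the hard part (attributing machine code bytes back to MSIL
instructions, e.g. via debug info) is assumed to exist as an input:

```csharp
using System.Collections.Generic;
using System.Linq;

// Per-method input, assumed to be extracted from a corpus (e.g. SPMI traces):
// the MSIL opcode stream plus the machine code bytes attributed to each MSIL
// instruction. The attribution step is assumed.
record MethodSample(string[] MsilOpcodes, int[] BytesPerMsilInstr);

static class PathLengthMiner
{
    const int K = 3; // small k, per the proposal

    // Aggregate total machine-code size and occurrence count per k-gram of
    // consecutive MSIL opcodes.
    static Dictionary<string, (long Bytes, long Count)> Mine(
        IEnumerable<MethodSample> corpus)
    {
        var stats = new Dictionary<string, (long Bytes, long Count)>();
        foreach (var m in corpus)
        {
            for (int i = 0; i + K <= m.MsilOpcodes.Length; i++)
            {
                string gram = string.Join("+", m.MsilOpcodes.Skip(i).Take(K));
                int size = m.BytesPerMsilInstr.Skip(i).Take(K).Sum();
                stats.TryGetValue(gram, out var cur); // defaults to (0, 0)
                stats[gram] = (cur.Bytes + size, cur.Count + 1);
            }
        }
        return stats;
    }

    // Report k-grams whose average generated-code size is much larger under
    // one jit (e.g. RyuJIT) than another (e.g. Jit64).
    static IEnumerable<(string Gram, double Ratio)> Compare(
        Dictionary<string, (long Bytes, long Count)> ryu,
        Dictionary<string, (long Bytes, long Count)> legacy)
    {
        foreach (var kv in ryu)
        {
            if (!legacy.TryGetValue(kv.Key, out var other))
                continue;
            double r = (double)kv.Value.Bytes / kv.Value.Count;
            double l = (double)other.Bytes / other.Count;
            if (r > 1.5 * l)            // arbitrary "much longer" threshold
                yield return (kv.Key, r / l);
        }
    }
}
```
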
134 changes: 134 additions & 0 deletions Documentation/performance/JitOptimizerTodoAssessment.md
@@ -0,0 +1,134 @@
Optimizer Codebase Status/Investments
=====================================

There are a number of areas in the optimizer that we know we would invest in
improving if resources were unlimited. This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.


Improved Struct Handling
------------------------

This is an area that has received recent attention, with the [first-class structs](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md)
work and the struct promotion improvements that went in for `Span<T>`. Work here
is expected to continue and can happen incrementally. Possible next steps:

- Struct promotion stress mode (test mode to improve robustness/reliability)
- Promotion of more structs; relax limits on e.g. field count (should generally
help performance-sensitive code where structs are increasingly used to avoid
heap allocations)
- Improve handling of System V struct passing (I think we currently insert
some unnecessary round-trips through memory at call boundaries due to
internal representation issues)
- Implicit byref parameter promotion w/o shadow copy

We don't have specific benchmarks that we know would jump in response to any of
these. We may well be able to find some with some looking, though this may be an
area where current performance-sensitive code simply avoids structs.
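
As an illustration of the field-count point, consider the following sketch
(the actual promotion limit is an implementation detail and may differ;
whether `Large` gets promoted depends on the heuristics in a given build):

```csharp
struct Small { public int A, B; }   // the easy case: few fields

struct Large                        // more fields than the promoter may be
{                                   // willing to take on today
    public int F0, F1, F2, F3, F4, F5, F6, F7;
}

static class StructUse
{
    static int Sum(Small s) => s.A + s.B;    // fields can live in registers

    static int Sum(Large s) =>               // without promotion, field
        s.F0 + s.F1 + s.F2 + s.F3 +          // accesses may go through a
        s.F4 + s.F5 + s.F6 + s.F7;           // stack home
}
```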


Exception handling
------------------

This is increasingly important as C# language constructs like async/await and
certain `foreach` incantations are implemented with EH constructs, making them
difficult to avoid at source level. The recent work on finally cloning, empty
finally removal, and empty try removal targeted this. [Writethrough](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md)
is another key optimization enabler here, and we are actively pursuing it. Other
things we've discussed include inlining methods with EH and computing funclet
callee-save register usage independently of main function callee-save register
usage, but I don't think we have any particular data pointing to either as a
high priority.
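
To make the `foreach` point concrete, here is the approximate lowering the C#
compiler performs (simplified; the exact shape varies by enumerator type):

```csharp
using System.Collections.Generic;

static class ForeachLowering
{
    static int Sum(IEnumerable<int> xs)
    {
        int sum = 0;
        foreach (int x in xs)       // compiles to roughly the method below
            sum += x;
        return sum;
    }

    // Approximate lowering: the enumerator must be disposed on all paths,
    // so the compiler wraps the loop in a try/finally.
    static int SumLowered(IEnumerable<int> xs)
    {
        int sum = 0;
        IEnumerator<int> e = xs.GetEnumerator();
        try
        {
            while (e.MoveNext())
                sum += e.Current;
        }
        finally
        {
            if (e != null)
                e.Dispose();        // the finally region the JIT must handle
        }
        return sum;
    }
}
```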


Loop Optimizations
------------------

We haven't been targeting benchmarks that spend a lot of time doing computations
in an inner loop, and pursuing loop optimizations purely for the peanut-butter
effect would seem odd. So this simply hasn't bubbled up in priority yet, though
it's bound to eventually.


More Expression Optimizations
-----------------------------

We again don't have particular benchmarks pointing to key missing cases, and
balancing code quality (CQ) gains against throughput (TP) costs will be delicate
here, so it would really help to have an appropriate benchmark suite to evaluate
this work against.


Forward Substitution
--------------------

This too needs an appropriate benchmark suite that I don't think we have at
this time. The tradeoffs against register pressure increase and throughput
need to be evaluated. This also might make more sense to do if/when we can
handle SSA renames.
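
For reference, a minimal source-level illustration of the transformation (the
real optimization would operate on the JIT's IR, not on C# source):

```csharp
static class ForwardSubExample
{
    static int Before(int a, int b, int c)
    {
        int t = a * b;      // single-def, single-use temp
        return t + c;
    }

    // After forward substitution: the definition is folded into the use,
    // exposing the combined tree to later optimization (e.g. better
    // instruction selection), at the cost of potentially longer live ranges.
    static int After(int a, int b, int c)
    {
        return a * b + c;
    }
}
```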


Value Number Conservativism
---------------------------

We have some frustrating phase-ordering issues resulting from this, but the
opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now. Also, using SSA def as the proxy for value
number would require handling SSA renaming, so there's a big dependency chained
to this.
Maybe it's worth reconsidering the priority based on throughput?
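
A minimal sketch of the underlying issue as we understand it (whether this
exact case gets optimized depends on which value numbers, liberal or
conservative, a given phase consults):

```csharp
class Box { public int F; }

static class VnExample
{
    static int TwoLoads(Box b)
    {
        int x = b.F;
        int y = b.F;    // under conservative heap value numbering, not
                        // provably the same value as the first load: another
                        // thread could have written b.F in between
        return x + y;
    }
}
```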


High Tier Optimizations
-----------------------

We don't have that many knobs we can "crank up" (though we do have the tracked
assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting. We should get that benchmarking story sorted out and at least hook
up those two knobs.


Low Tier Back-Off
-----------------

We have some changes we know we want to make here: morph does more than it needs
to in minopts, and tier 0 should be doing throughput-improving inlines, as
opposed to minopts, which does no inlining. It would be nice to have the
benchmarking story set up to measure the effect of such changes when they go in;
we should do that.


Async
-----

We've made note of the prevalence of async/await in modern code (and particularly
in web server code such as TechEmpower), and have some opportunities listed in
[#7914](https://github.com/dotnet/coreclr/issues/7914). Some sort of study of
async peanut butter to find more opportunities is probably in order, but what
would that look like?


Address Mode Building
---------------------

One opportunity that's frequently visible in asm dumps is that more address
expressions could be folded into memory operands' address expressions. This
would likely give a measurable code size win. It needs some thought about where
it should run in the phase list and how aggressive to be about e.g. analyzing
across statements.
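
A hedged sketch of the kind of folding meant here (the assembly is
illustrative, not actual RyuJIT output; the bounds check is elided):

```csharp
static class AddrModeExample
{
    // x64 sketch (offsets illustrative):
    //
    //   without folding:   lea  rax, [rdx+1]          ; compute i + 1
    //                      mov  eax, [rcx+rax*4+16]   ; load the element
    //
    //   with folding:      mov  eax, [rcx+rdx*4+20]   ; "+1" absorbed into
    //                                                 ; the displacement
    static int LoadNext(int[] data, int i) => data[i + 1];
}
```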


If-Conversion (cmov formation)
------------------------------

This yields big wins in the microbenchmarks where it applies. There's some work
in flight on this (see #7447 and #10861).
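
A minimal example of the pattern, with an illustrative sketch of the two code
shapes (not actual RyuJIT output):

```csharp
static class IfConversionExample
{
    // Sketch (x64, illustrative):
    //
    //   branching:   cmp  ecx, edx        ; unpredictable branch in hot code
    //                jg   L1
    //                mov  ecx, edx
    //            L1: mov  eax, ecx
    //
    //   converted:   cmp   ecx, edx       ; straight-line, branch-free
    //                mov   eax, edx
    //                cmovg eax, ecx
    static int Max(int a, int b) => a > b ? a : b;
}
```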


Mulshift
--------

Replacing multiplication by constants with shift/add/lea sequences is a
classic optimization that keeps coming up in planning. An [analysis](https://gist.github.com/JosephTremoulet/c1246b17ea2803e93e203b9969ee5a25#file-mulshift-md)
indicates that RyuJIT is already capitalizing on most of the opportunity here.
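
For reference, a classic instance of the transformation (the assembly is an
illustrative sketch, not actual RyuJIT output):

```csharp
static class MulShiftExample
{
    // Sketch (x64, illustrative): x * 10 without an imul
    //
    //   lea  eax, [rcx+rcx*4]   ; x * 5
    //   add  eax, eax           ; (x * 5) * 2 == x * 10
    static int Times10(int x) => x * 10;
}
```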
