From 23fc9ad8cacf791c1360fd467f9b2e826682b6d8 Mon Sep 17 00:00:00 2001
From: Joseph Tremoulet
Date: Thu, 20 Jul 2017 13:28:53 -0400
Subject: [PATCH] Add documents about JIT optimization planning

This change adds two documents:

- JitOptimizerPlanningGuide.md discusses how we can/do/should go about
  identifying, prioritizing, and validating optimization improvement
  opportunities, as well as several ideas for how we might improve the
  process.

- JitOptimizerTodoAssessment.md lists several potential optimization
  improvements that always come up in planning discussions, with brief
  notes about each, to capture current thinking.
---
 .../performance/JitOptimizerPlanningGuide.md  | 127 +++++++++++++++++
 .../performance/JitOptimizerTodoAssessment.md | 132 ++++++++++++++++++
 2 files changed, 259 insertions(+)
 create mode 100644 Documentation/performance/JitOptimizerPlanningGuide.md
 create mode 100644 Documentation/performance/JitOptimizerTodoAssessment.md

diff --git a/Documentation/performance/JitOptimizerPlanningGuide.md b/Documentation/performance/JitOptimizerPlanningGuide.md
new file mode 100644
index 000000000000..6f65f146e0a2
--- /dev/null
+++ b/Documentation/performance/JitOptimizerPlanningGuide.md
@@ -0,0 +1,127 @@
+JIT Optimizer Planning Guide
+============================
+
+The goal of this document is to capture some thinking about the process used
+to prioritize and validate optimizer investments. The overriding goal of such
+investments is to help ensure that the dotnet platform satisfies developers'
+performance needs.
+
+
+Benchmarking
+------------
+
+There are a number of public benchmarks which evaluate different platforms'
+relative performance, so naturally dotnet's scores on such benchmarks give
+some indication of how well it satisfies developers' performance needs. The
+JIT team has used some of these benchmarks, particularly
+[TechEmpower](https://www.techempower.com/benchmarks/) and
+[Benchmarks Game](http://benchmarksgame.alioth.debian.org/), for scouting
+out optimization opportunities and prioritizing optimization improvements.
+While it is important to track scores on such benchmarks to validate
+performance changes in the dotnet platform as a whole, when it comes to
+planning and prioritizing JIT optimization improvements specifically, they
+aren't sufficient, due to a few well-known issues:
+
+ - For macro-benchmarks, such as TechEmpower, compiler optimization is often
+   not the dominant factor in performance. The effects of individual
+   optimizer changes are most often in the sub-percent range, well below the
+   noise level of the measurements, which will usually be at least 3% or so
+   even for the most well-behaved macro-benchmarks.
+ - Source-level changes can be made much more rapidly than compiler
+   optimization changes. This means that for anything we're trying to track
+   where the whole team is effecting changes in source, runtime, etc., any
+   particular code sequence we may target with optimization improvements may
+   well be targeted with source changes in the interim, nullifying the
+   measured benefit of the optimization change when it is eventually merged.
+   Source/library/runtime changes are in play for both TechEmpower and
+   Benchmarks Game.
+
+Compiler micro-benchmarks (like those in our [test tree](https://github.com/dotnet/coreclr/tree/master/tests/src/JIT/Performance/CodeQuality))
+don't share these issues, and adding them as optimizations are implemented is
+critical for validation and regression prevention; however, micro-benchmarks
+often aren't as representative of real-world code, and are therefore less
+reflective of developers' performance needs, so they aren't well suited for
+scouting out and prioritizing opportunities.
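+
+As a concrete illustration, here is a minimal sketch of the sort of test that
+lives in that tree, assuming the Microsoft.Xunit.Performance harness those
+tests use (the class name and measured pattern are invented for illustration,
+not taken from the tree):
+
+```csharp
+using Microsoft.Xunit.Performance;
+
+public class RangeCheckBench
+{
+    static int s_result; // sink, so the measured loop isn't dead code
+
+    // Isolates a single optimization (bounds-check elimination on the
+    // canonical for-loop); a regression in that optimization shows up
+    // directly in this test's measurements rather than being lost in
+    // macro-benchmark noise.
+    [Benchmark]
+    public void SumArray()
+    {
+        int[] data = new int[10000];
+        foreach (var iteration in Benchmark.Iterations)
+        {
+            using (iteration.StartMeasurement())
+            {
+                int sum = 0;
+                for (int i = 0; i < data.Length; i++)
+                    sum += data[i];
+                s_result = sum;
+            }
+        }
+    }
+}
+```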
+
+
+Benefits of JIT Optimization
+----------------------------
+
+While source changes can more rapidly and dramatically effect changes to
+targeted hot code sequences in macro-benchmarks, compiler changes have the
+advantage that they apply broadly to all compiled code. One of the best
+reasons to invest in compiler optimization improvements is to capitalize on
+this. A few specific benefits:
+
+ - Optimizer changes can effect "peanut-butter" improvements; by making a
+   small improvement to a code sequence that is repeated thousands of times
+   across a codebase, they can produce substantial cumulative wins. These
+   should accrue toward the standard metrics (benchmark scores and code
+   size), but identifying the most profitable "peanut-butter" opportunities
+   is difficult. Improving our methodology for identifying such
+   opportunities would be helpful; some ideas are below.
+ - Optimizer changes can unblock coding patterns that performance-sensitive
+   developers want to employ but consider prohibitively expensive. They may
+   have inelegant workarounds in their code, such as gotos for loop-exiting
+   returns to work around poor block layout, manually scalarized structs to
+   work around poor struct promotion, manually unrolled loops to work around
+   lack of loop unrolling, limited use of lambdas to work around inefficient
+   access to heap-allocated closures, etc. (see the sketch after this list).
+   The more the optimizer can improve such situations, the better, as it
+   both increases developer productivity and increases the usefulness of
+   abstractions provided by the language and libraries. Finding a measurable
+   metric to track this type of improvement poses a challenge, but would be
+   a big help toward prioritizing and validating optimization improvements;
+   again, some ideas are below.
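+
+As one concrete illustration of such a workaround, the following sketch
+(shapes and names invented for illustration) shows the straightforward code
+a developer would prefer to write next to the goto-based variant sometimes
+used to coax better block layout out of a loop-exiting return:
+
+```csharp
+static class LayoutWorkaround
+{
+    // The version developers want to write: return directly from the
+    // inner loop as soon as the value is found.
+    static int FindRow(int[][] rows, int value)
+    {
+        for (int i = 0; i < rows.Length; i++)
+            for (int j = 0; j < rows[i].Length; j++)
+                if (rows[i][j] == value)
+                    return i;
+        return -1;
+    }
+
+    // The workaround: funnel the early exit through a goto so the hot
+    // fall-through path stays straight-line when block layout is poor.
+    static int FindRowWorkaround(int[][] rows, int value)
+    {
+        int result = -1;
+        for (int i = 0; i < rows.Length; i++)
+            for (int j = 0; j < rows[i].Length; j++)
+                if (rows[i][j] == value)
+                {
+                    result = i;
+                    goto Done;
+                }
+    Done:
+        return result;
+    }
+}
+```
+
+The better block layout gets, the less reason there is for code like the
+second version to exist.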
+
+
+Brainstorm
+----------
+
+Listed here are several ideas for undertakings we might pursue to improve our
+ability to identify opportunities and validate/track improvements that mesh
+with the benefits discussed above. Thinking here is in the early stages, but
+the hope is that with some thought/discussion some of these will surface as
+worth investing in.
+
+ - Is there telemetry we can implement/analyze to identify "peanut-butter"
+   opportunities, or target "coding patterns"? It's probably easier to use
+   this to evaluate/prioritize patterns we're considering targeting than to
+   identify the patterns in the first place.
+ - Can we construct some sort of "peanut-butter profiler"? The idea would
+   roughly be to aggregate samples/counters under particular input
+   constructs rather than aggregate them under callstack. Might it be
+   interesting to group by MSIL opcode, or opcode pair, or opcode
+   triplet...?
+ - It might behoove us to build up some SPMI traces that could be data-mined
+   for any of these experiments.
+ - We should make it easy to view machine code emitted by the jit, and to
+   collect profiles and correlate them with that machine code. This could
+   benefit any developers doing performance analysis of their own code.
+   The JIT team has discussed this; options include building something on
+   top of the profiler APIs, enabling COMPlus_JitDisasm in release builds,
+   and shipping with (or making easily available) an alt jit that supports
+   JitDisasm.
+ - Hardware companies maintain optimization/performance guides for their
+   ISAs. Should we maintain one for MSIL and/or C# (and/or F#)? If we
+   hosted such a thing somewhere publicly votable, we could track which
+   anti-patterns people find most frustrating to avoid, and track their
+   subsequent removal. Does such a guide already exist somewhere, that we
+   could use as a starting point? Should we collate GitHub issues or Stack
+   Overflow issues to create such a thing?
+ - Maybe we should expand our labels on GitHub so that there are sub-areas
+   within "optimization"? It could help prioritize by letting us compare
+   the relative sizes of those buckets.
+ - Can we more effectively leverage the legacy JIT codebases for comparative
+   analysis? We've compared micro-benchmark performance against Jit64 and
+   manually compared disassembly of hot code; what else can we do? One
+   concrete idea: run over some large corpus of code (SPMI?), and do a
+   path-length comparison, e.g. by looking at each sequence of k MSIL
+   instructions (for some small k), and for each combination of k opcodes
+   collect statistics on the size of generated machine code (maybe using
+   debug line number info to do the correlation?), then look for common
+   sequences which are much longer with RyuJIT.
+ - Maybe hook RyuJIT up to some sort of superoptimizer to identify
+   opportunities?
+ - Microsoft Research has done some experimentation that involved converting
+   RyuJIT IR to LLVM IR; perhaps we could use this to identify common
+   expressions that could be much better optimized.
+ - What's a practical way to establish a metric of "unblocked coding
+   patterns"?
+ - How developers give feedback about patterns/performance could use some
+   thought; the GitHub issue list is open, but does it need to be publicized
+   somehow? We perhaps should have some regular process where we pull
+   issues over from other places where people report/discuss dotnet
+   performance issues, like
+   [Stack Overflow](https://stackoverflow.com/questions/tagged/performance+.net).

diff --git a/Documentation/performance/JitOptimizerTodoAssessment.md b/Documentation/performance/JitOptimizerTodoAssessment.md
new file mode 100644
index 000000000000..cca77b7cbae1
--- /dev/null
+++ b/Documentation/performance/JitOptimizerTodoAssessment.md
@@ -0,0 +1,132 @@
+Optimizer Codebase Status/Investments
+=====================================
+
+There are a number of areas in the optimizer that we know we would invest in
+improving if resources were unlimited. This document lists them and some
+thoughts about their current state and prioritization, in an effort to
+capture the thinking about them that comes up in planning discussions.
+
+
+Improved Struct Handling
+------------------------
+
+This is an area that has received recent attention, with the first-class
+structs work and the struct promotion improvements that went in for `Span`.
+Work here is expected to continue and can happen incrementally. Possible
+next steps:
+
+ - Struct promotion stress mode (test mode to improve
+   robustness/reliability)
+ - Promotion of more structs; relax limits on e.g. field count (should
+   generally help performance-sensitive code, where structs are increasingly
+   used to avoid heap allocations; see the sketch below)
+ - Improve handling of System V struct passing (I think we currently insert
+   some unnecessary round-trips through memory at call boundaries due to
+   internal representation issues)
+ - Implicit byref parameter promotion w/o shadow copy
+
+We don't have specific benchmarks that we know would jump in response to any
+of these. We may well be able to find some with some looking, though this
+may be an area where current performance-sensitive code avoids structs.
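+
+To make the promotion point concrete, here is a hedged sketch of the kind of
+code at stake; the promotion limits are internal JIT heuristics, so treating
+the larger struct below as "too big to promote" is an assumption for
+illustration, not a statement of the actual cutoff:
+
+```csharp
+// A small struct like this is a good promotion candidate: the JIT can break
+// it into its fields, keep X and Y in registers, and never touch memory.
+struct Point
+{
+    public float X, Y;
+    public float LengthSquared() => X * X + Y * Y;
+}
+
+// A struct with many fields can exceed the promotion heuristics' limits, so
+// its fields may round-trip through memory. Performance-sensitive developers
+// sometimes "manually scalarize" such structs into separate locals to win
+// back register allocation, which is exactly the workaround that relaxed
+// limits would make unnecessary.
+struct WideState
+{
+    public int A, B, C, D, E, F, G, H;
+}
+```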
+
+
+Exception handling
+------------------
+
+This is increasingly important as C# language constructs like async/await
+and certain `foreach` incantations are implemented with EH constructs,
+making them difficult to avoid at source level. The recent work on finally
+cloning, empty finally removal, and empty try removal targeted this.
+Writethrough is another key optimization here, and we are actively pursuing
+it. Other things we've discussed include inlining methods with EH and
+computing funclet callee-save register usage independently of main function
+callee-save register usage, but I don't think we have any particular data
+pointing to either as a high priority.
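+
+To see why such EH is hard to avoid at source level, consider the standard
+C# expansion of `foreach` over an `IEnumerable<T>`, shown here in simplified
+form (the containing method is invented for illustration):
+
+```csharp
+using System.Collections.Generic;
+
+static class EhExample
+{
+    static long Sum(IEnumerable<int> items)
+    {
+        // Source form:  foreach (int x in items) total += x;
+        // The compiler wraps the loop in a try/finally so the enumerator is
+        // disposed even on exceptional exit, putting EH on this hot path:
+        long total = 0;
+        IEnumerator<int> e = items.GetEnumerator();
+        try
+        {
+            while (e.MoveNext())
+                total += e.Current;
+        }
+        finally
+        {
+            if (e != null)
+                e.Dispose();
+        }
+        return total;
+    }
+}
+```
+
+The developer never wrote `try` or `finally`, but the JIT still has to
+contend with them.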
+
+
+Loop Optimizations
+------------------
+
+We haven't been targeting benchmarks that spend a lot of time doing
+computations in an inner loop. Pursuing loop optimizations for the
+peanut-butter effect would seem odd. So this simply hasn't bubbled up in
+priority yet, though it's bound to eventually.
+
+
+More Expression Optimizations
+-----------------------------
+
+We again don't have particular benchmarks pointing to key missing cases, and
+balancing code quality (CQ) against throughput (TP) will be delicate here,
+so it would really help to have an appropriate benchmark suite to evaluate
+this work against.
+
+
+Forward Substitution
+--------------------
+
+This too needs an appropriate benchmark suite that I don't think we have at
+this time. The tradeoffs against register pressure increase and throughput
+need to be evaluated. This also might make more sense to do if/when we can
+handle SSA renames.
+
+
+Value Number Conservatism
+-------------------------
+
+We have some frustrating phase-ordering issues resulting from this, but the
+opt-repeat experiment indicated that they're not prevalent enough to merit
+pursuing changing this right now. Also, using SSA def as the proxy for
+value number would require handling SSA renaming, so there's a big
+dependency chained to this. Maybe it's worth reconsidering the priority
+based on throughput?
+
+
+High Tier Optimizations
+-----------------------
+
+We don't have that many knobs we can "crank up" (though we do have the
+tracked assertion count and could switch inliner policies), nor do we have
+any sort of benchmarking story set up to validate whether tiering changes
+are helping or hurting. We should get that benchmarking story sorted out
+and at least hook up those two knobs.
+
+
+Low Tier Back-Off
+-----------------
+
+We have some changes we know we want to make here: morph does more than it
+needs to in minopts, and tier 0 should be doing throughput-improving
+inlines, as opposed to minopts, which does no inlining. It would be nice to
+have the benchmarking story set up to measure the effect of such changes
+when they go in; we should do that.
+
+
+Async
+-----
+
+We've made note of the prevalence of async/await in modern code (and
+particularly in web server code such as TechEmpower), but haven't really
+identified concrete opportunities presented there, to my knowledge. Some
+sort of study of async peanut butter is probably in order, but what would
+that look like?
+
+
+Address Mode Building
+---------------------
+
+One opportunity that's frequently visible in asm dumps is that more address
+computations could be folded into memory operands' address expressions.
+This would likely give a measurable code size win. Needs some thought about
+where it should run in the phase list and how aggressive to be about e.g.
+analyzing across statements.
+
+
+If-Conversion (cmov formation)
+------------------------------
+
+This hits big in the micro-benchmarks where it applies. There's some work
+in flight on this (see #7447).
+
+
+Mulshift
+--------
+
+Replacing multiplication by constants with shift/add/lea sequences is a
+classic opportunity that could make a good first step toward peephole
+optimization.
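+
+As a concrete example of the strength reduction meant here (the particular
+instruction sequences are illustrative assumptions, not RyuJIT's actual
+output):
+
+```csharp
+static class MulshiftExamples
+{
+    // Source:
+    static int Times9(int x) => x * 9;
+    static int Times10(int x) => x * 10;
+}
+
+// On x64, a peephole pass can avoid the imul entirely with sequences like:
+//
+//   Times9:   lea  eax, [rcx + rcx*8]   ; x + 8*x == 9*x
+//   Times10:  lea  eax, [rcx + rcx*4]   ; 5*x
+//             add  eax, eax             ; 2*(5*x) == 10*x
+```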