Description
We are working towards enabling OSR (On Stack Replacement) by default for .NET 7 for x64 and arm64. As part of this we will also modify the runtime so that quick jit for loops is enabled.
See for instance dotnet/runtime#63642.
This has performance implications for benchmarks that don't run enough iterations to reach Tier1. These are typically benchmarks that internally loop and so are currently eagerly optimized because quick jit for loops is disabled. A private benchmark run shows several hundred benchmarks impacted by this, with regressions outnumbering improvements by about 2 to 1.
[Upon further analysis the number of truly impacted benchmarks may be smaller, maybe ~100. It is hard to gauge from one-off runs as many benchmarks are noisy. But we can look at perf history in main and see that some of the "regressions" seen from the one-off OSR run are in noisy tests and the values are within the expected noise range.]
One such example is Burgers.Test3. With the current strategy, we end up running about 20 invocations total, and the main method is fully optimized from the start. When we turn on QuickJitForLoops and OSR, the main method is initially not optimized. OSR accelerates its performance, but OSR performance does not reach the same level as Tier1, and we don't run enough invocations to make it to Tier1.
While in this case the OSR version is slower, sometimes the OSR version runs faster. In general, we aspire to have the OSR perf be competitive with Tier1, but swings of +/- 20% are going to be common and cannot easily be addressed.
One way we can mitigate these effects is to always run (or selectively run, for some subset of benchmarks) at least 30 warmup iterations. For example:
default
| Method | Job | Ver | Mean |
|---|---|---|---|
| Burgers_0 | Job-SYCWNE | OSR | 192.68 ms |
| Burgers_0 | Job-ZXAOBL | DEF | 184.51 ms |
| Burgers_1 | Job-SYCWNE | OSR | 224.11 ms |
| Burgers_1 | Job-ZXAOBL | DEF | 155.64 ms |
| Burgers_2 | Job-SYCWNE | OSR | 178.63 ms |
| Burgers_2 | Job-ZXAOBL | DEF | 156.51 ms |
| Burgers_3 | Job-SYCWNE | OSR | 181.05 ms |
| Burgers_3 | Job-ZXAOBL | DEF | 85.63 ms |
default + `--warmupCount 30`
| Method | Job | Ver | Mean |
|---|---|---|---|
| Burgers_0 | Job-SOPVCH | OSR | 186.75 ms |
| Burgers_0 | Job-TMAUKP | DEF | 185.61 ms |
| Burgers_1 | Job-SOPVCH | OSR | 155.64 ms |
| Burgers_1 | Job-TMAUKP | DEF | 157.39 ms |
| Burgers_2 | Job-SOPVCH | OSR | 157.37 ms |
| Burgers_2 | Job-TMAUKP | DEF | 160.09 ms |
| Burgers_3 | Job-SOPVCH | OSR | 89.70 ms |
| Burgers_3 | Job-TMAUKP | DEF | 85.96 ms |
It is expected that if we can do this (or something equivalent) then OSR will not impact perf measurements.
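To make the effect of the extra warmup easier to read off the two tables, here is a small sketch that computes the OSR/DEF ratio of the mean times for each benchmark. The numbers are copied from the tables above; the variable and function names are made up for illustration and are not part of any benchmark tooling.

```python
# Mean times in ms, copied from the two tables above as (OSR, DEF) pairs.
default_run = {
    "Burgers_0": (192.68, 184.51),
    "Burgers_1": (224.11, 155.64),
    "Burgers_2": (178.63, 156.51),
    "Burgers_3": (181.05, 85.63),
}
warmup30_run = {
    "Burgers_0": (186.75, 185.61),
    "Burgers_1": (155.64, 157.39),
    "Burgers_2": (157.37, 160.09),
    "Burgers_3": (89.70, 85.96),
}


def osr_to_def_ratios(run):
    """Return OSR mean divided by DEF mean for each benchmark (1.0 means parity)."""
    return {name: osr / base for name, (osr, base) in run.items()}


def worst_ratio(run):
    """Return the (benchmark, ratio) pair with the largest OSR slowdown."""
    ratios = osr_to_def_ratios(run)
    name = max(ratios, key=ratios.get)
    return name, ratios[name]


for label, run in (("default", default_run),
                   ("default + --warmupCount 30", warmup30_run)):
    print(label)
    for name, ratio in osr_to_def_ratios(run).items():
        print(f"  {name}: OSR/DEF = {ratio:.2f}")
```

In the default run the largest gap is Burgers_3, where OSR takes a bit more than twice as long as DEF. With 30 warmup iterations every ratio is within about 5% of 1.0, which is likely within noise for these tests.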