Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't execute long operations more than once per iteration #760

Merged
merged 13 commits into from
May 27, 2018

Conversation

adamsitnik
Copy link
Member

Background: because of the default unroll factor = 16, every iteration was invoking benchmark at least 16 times.

So if the benchmark took 1s, then:

  1. jitting took 16x1s
  2. warmup took 6 (MinIterationCount for EngineWarmupStage) x 16 x 1s = 96 s
  3. every target iteration was taking 16s

Note: In the default mode, this was described in our docs and was possible to change with custom config. We just did not have the time to fix it.

This PR introduces following changes:

  1. EngineFactory is supposed to create an engine which is "ready to run", which means it's supposed to call GlobalSetup. (hence the method name change Create => CreateReadyToRun)
  2. Engine implements IDisposable, which calls GlobalCleanup
  3. When the strategy requires us to jit the code first (default mode) and to run the pilot (no explicit value for invocationCount), the engine factory does jitting and also checks how long does it take to run the benchmark once (unrol factor=invocation count=1).
    a. If it took longer than IterationTime (0.5s by default), the engine is created for job with one invocation per iteration
    b. otherwise, old rules applies
  4. Strategies which don't require jitting don't follow this path (by design)
  5. I had to duplicate both the C# and IL we generate to have possibility to choose the mode and keep the Engine immutable. It does not look as bad as I expected it to look ;)
  6. Because the changes applies mostly to EngineFactory, I was able to write some nice unit tests (not integration tests)
public class Sleeep
{
    [Benchmark] public void Sleep1s() => Thread.Sleep(TimeSpan.FromSeconds(1));
}

Before: Total time: 00:06:16 (376.01 sec)
After: Total time: 00:00:27 (27.88 sec)

This fixes #736, makes #730 easy and contributes to #550.

@AndreyAkinshin please take a look ;)
@KrzysztofCwalina this fixes the issue you have faced in ML.NET

@@ -17,6 +17,9 @@ public class CoreRtTests : BenchmarkTestExecutor
[Fact]
public void CoreRtIsSupported()
{
if (IntPtr.Size == sizeof(int)) // CoreRT does not support 32bit yet

This comment was marked as spam.

@AndreyAkinshin
Copy link
Member

Currently, I have the following output for a 50ms method:

IdleJit -1: 1 op, 109000.00 ns, 109.0000 us/op
Jit -1: 1 op, 53255800.00 ns, 53.2558 ms/op
IdleJit -1: 16 op, 212200.00 ns, 13.2625 us/op
Jit -1: 16 op, 848930200.00 ns, 53.0581 ms/op
Pilot  1: 16 op, 854972000.00 ns, 53.4358 ms/op

IdleWarmup  1: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  2: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  3: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  4: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  5: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  6: 16 op, 500.00 ns, 31.2500 ns/op

And the following output for a 5 sec method:

IdleJit -1: 1 op, 363300.00 ns, 363.3000 us/op
Jit -1: 1 op, 5004330100.00 ns, 5.0043 s/op
IdleWarmup  1: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  2: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  3: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  4: 1 op, 300.00 ns, 300.0000 ns/op
IdleWarmup  5: 1 op, 300.00 ns, 300.0000 ns/op
IdleWarmup  6: 1 op, 300.00 ns, 300.0000 ns/op

IdleTarget  1: 1 op, 200.00 ns, 200.0000 ns/op
IdleTarget  2: 1 op, 300.00 ns, 300.0000 ns/op
IdleTarget  3: 1 op, 300.00 ns, 300.0000 ns/op
IdleTarget  4: 1 op, 200.00 ns, 200.0000 ns/op

It works which is great! Awesome PR! However, I suggest to improve the output:

  • IterationIndex = -1 confuses me. Could we start it with 1?
  • Maybe it makes sense to rename IdleJit and Jit to IdleJitting and Jitting?
  • It would be great to align all output lines and form "pseudocolumns" for iteration name and index. In this case, the output will be much readable.

@AndreyAkinshin
Copy link
Member

I also tried to run a 50ms method with the following config:

private class Config : ManualConfig
{
    public Config()
    {
        Add(Job.Default.WithUnrollFactor(4));
    }
}

The output:

IdleJit -1: 1 op, 114500.00 ns, 114.5000 us/op
Jit -1: 1 op, 53745100.00 ns, 53.7451 ms/op
IdleJit -1: 4 op, 131000.00 ns, 32.7500 us/op
Jit -1: 4 op, 216355400.00 ns, 54.0889 ms/op
Pilot  1: 4 op, 212807600.00 ns, 53.2019 ms/op
Pilot  2: 8 op, 428189900.00 ns, 53.5237 ms/op
Pilot  3: 16 op, 861407400.00 ns, 53.8380 ms/op

IdleWarmup  1: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  2: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  3: 16 op, 400.00 ns, 25.0000 ns/op

If we specifically ask for UnrollFactor=4, this value should be used. The new auto strategy for UnrollFactor should be applied only for unspecified value. Also it would be nice to have a test for such case.

@adamsitnik
Copy link
Member Author

@AndreyAkinshin very good points! I totally missed the case with unrollFactor set in explicit way.

I have pushed a commit with fixes.

With the aligning it definitely looks better:

IdleJitting  1: 1 op, 361510.84 ns, 361.5108 us/op
MainJitting  1: 1 op, 1000483249.90 ns, 1.0005 s/op

Was:

IdleJitting -1: 1 op, 374340.49 ns, 374.3405 us/op
Jitting -1: 1 op, 1000731574.78 ns, 1.0007 s/op

@AndreyAkinshin
Copy link
Member

IdleJitting  1: 1 op, 234600.00 ns, 234.6000 us/op
MainJitting  1: 1 op, 51181500.00 ns, 51.1815 ms/op

IdleJitting  1: 16 op, 240400.00 ns, 15.0250 us/op
MainJitting  1: 16 op, 812670500.00 ns, 50.7919 ms/op

Pilot  1: 16 op, 813271600.00 ns, 50.8295 ms/op

IdleWarmup  1: 16 op, 1600.00 ns, 100.0000 ns/op
IdleWarmup  2: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  3: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  4: 16 op, 200.00 ns, 12.5000 ns/op
IdleWarmup  5: 16 op, 300.00 ns, 18.7500 ns/op
IdleWarmup  6: 16 op, 300.00 ns, 18.7500 ns/op

Now only the Pilot stage is unaligned.
@adamsitnik, could you align it as well?

@AndreyAkinshin
Copy link
Member

Also MainJitting and IdleWarmup have the different amount of symbols, so it would be nice to align them as well.

@adamsitnik
Copy link
Member Author

adamsitnik commented May 23, 2018

I aligned it, now it looks like this:

// Benchmark Process Environment Information:
// Runtime=.NET CoreRT 1.0.26523.01, 64bit AOT
// GC=Concurrent Workstation
// Job: CoreRT(Runtime=CoreRT)

IdleJitting  1: 1 op, 570.21 ns, 570.2064 ns/op
MainJitting  1: 1 op, 39059.14 ns, 39.0591 us/op

IdleJitting  1: 16 op, 570.21 ns, 35.6379 ns/op
MainJitting  1: 16 op, 34212.38 ns, 2.1383 us/op

Pilot        1: 16 op, 30791.14 ns, 1.9244 us/op
Pilot        2: 32 op, 38488.93 ns, 1.2028 us/op
Pilot        3: 64 op, 44476.10 ns, 694.9390 ns/op
Pilot        4: 128 op, 51603.68 ns, 403.1537 ns/op
Pilot        5: 256 op, 81539.51 ns, 318.5137 ns/op
Pilot        6: 512 op, 116892.31 ns, 228.3053 ns/op
Pilot        7: 1024 op, 202423.26 ns, 197.6790 ns/op
Pilot        8: 2048 op, 416250.65 ns, 203.2474 ns/op
Pilot        9: 4096 op, 925159.84 ns, 225.8691 ns/op
Pilot       10: 8192 op, 1741695.37 ns, 212.6093 ns/op
Pilot       11: 16384 op, 3537845.45 ns, 215.9329 ns/op
Pilot       12: 32768 op, 11446322.77 ns, 349.3140 ns/op
Pilot       13: 65536 op, 15517881.39 ns, 236.7841 ns/op
Pilot       14: 131072 op, 26722436.65 ns, 203.8760 ns/op
Pilot       15: 262144 op, 50601254.11 ns, 193.0285 ns/op
Pilot       16: 524288 op, 100716692.39 ns, 192.1018 ns/op
Pilot       17: 1048576 op, 197987912.77 ns, 188.8160 ns/op
Pilot       18: 2097152 op, 426441667.03 ns, 203.3432 ns/op
Pilot       19: 4194304 op, 829293326.13 ns, 197.7189 ns/op

IdleWarmup   1: 4194304 op, 16682242.80 ns, 3.9774 ns/op
IdleWarmup   2: 4194304 op, 9546395.13 ns, 2.2760 ns/op
IdleWarmup   3: 4194304 op, 8673124.06 ns, 2.0678 ns/op
IdleWarmup   4: 4194304 op, 8665141.17 ns, 2.0659 ns/op
IdleWarmup   5: 4194304 op, 8839054.12 ns, 2.1074 ns/op
IdleWarmup   6: 4194304 op, 9296644.73 ns, 2.2165 ns/op
IdleWarmup   7: 4194304 op, 9432068.75 ns, 2.2488 ns/op
IdleWarmup   8: 4194304 op, 8727578.77 ns, 2.0808 ns/op
IdleWarmup   9: 4194304 op, 8693936.60 ns, 2.0728 ns/op
IdleWarmup  10: 4194304 op, 8712468.30 ns, 2.0772 ns/op

IdleTarget   1: 4194304 op, 9002133.14 ns, 2.1463 ns/op
IdleTarget   2: 4194304 op, 8681106.95 ns, 2.0697 ns/op
IdleTarget   3: 4194304 op, 8636915.96 ns, 2.0592 ns/op
IdleTarget   4: 4194304 op, 8652596.63 ns, 2.0629 ns/op
IdleTarget   5: 4194304 op, 8727008.57 ns, 2.0807 ns/op
IdleTarget   6: 4194304 op, 8618669.35 ns, 2.0549 ns/op
IdleTarget   7: 4194304 op, 8693651.49 ns, 2.0727 ns/op
IdleTarget   8: 4194304 op, 8786310.03 ns, 2.0948 ns/op
IdleTarget   9: 4194304 op, 8670843.24 ns, 2.0673 ns/op
IdleTarget  10: 4194304 op, 8831641.44 ns, 2.1056 ns/op
IdleTarget  11: 4194304 op, 9313465.82 ns, 2.2205 ns/op
IdleTarget  12: 4194304 op, 8664856.07 ns, 2.0659 ns/op
IdleTarget  13: 4194304 op, 8687949.43 ns, 2.0714 ns/op
IdleTarget  14: 4194304 op, 8621520.39 ns, 2.0555 ns/op
IdleTarget  15: 4194304 op, 9069987.70 ns, 2.1625 ns/op

MainWarmup   1: 4194304 op, 812825480.93 ns, 193.7927 ns/op
MainWarmup   2: 4194304 op, 857425598.05 ns, 204.4262 ns/op
MainWarmup   3: 4194304 op, 793740388.46 ns, 189.2425 ns/op
MainWarmup   4: 4194304 op, 811990698.79 ns, 193.5937 ns/op
MainWarmup   5: 4194304 op, 756858299.72 ns, 180.4491 ns/op
MainWarmup   6: 4194304 op, 754029505.90 ns, 179.7746 ns/op

// BeforeMainRun
MainTarget   1: 4194304 op, 761732138.71 ns, 181.6111 ns/op
MainTarget   2: 4194304 op, 752033498.48 ns, 179.2988 ns/op
MainTarget   3: 4194304 op, 770667272.61 ns, 183.7414 ns/op
MainTarget   4: 4194304 op, 769454443.65 ns, 183.4522 ns/op
MainTarget   5: 4194304 op, 756888805.77 ns, 180.4564 ns/op
MainTarget   6: 4194304 op, 757020523.44 ns, 180.4878 ns/op
MainTarget   7: 4194304 op, 759448462.18 ns, 181.0666 ns/op
MainTarget   8: 4194304 op, 765128002.78 ns, 182.4207 ns/op
MainTarget   9: 4194304 op, 751105487.61 ns, 179.0775 ns/op
MainTarget  10: 4194304 op, 756828363.89 ns, 180.4419 ns/op
MainTarget  11: 4194304 op, 766402984.23 ns, 182.7247 ns/op
MainTarget  12: 4194304 op, 751867568.43 ns, 179.2592 ns/op
MainTarget  13: 4194304 op, 755829362.32 ns, 180.2038 ns/op
MainTarget  14: 4194304 op, 753383462.08 ns, 179.6206 ns/op
MainTarget  15: 4194304 op, 755330716.85 ns, 180.0849 ns/op

// AfterMainRun
Result       1: 4194304 op, 752954894.97 ns, 179.5184 ns/op
Result       2: 4194304 op, 743256254.74 ns, 177.2061 ns/op
Result       3: 4194304 op, 761890028.86 ns, 181.6487 ns/op
Result       4: 4194304 op, 760677199.90 ns, 181.3596 ns/op
Result       5: 4194304 op, 748111562.02 ns, 178.3637 ns/op
Result       6: 4194304 op, 748243279.69 ns, 178.3951 ns/op
Result       7: 4194304 op, 750671218.43 ns, 178.9740 ns/op
Result       8: 4194304 op, 756350759.03 ns, 180.3281 ns/op
Result       9: 4194304 op, 742328243.86 ns, 176.9848 ns/op
Result      10: 4194304 op, 748051120.14 ns, 178.3493 ns/op
Result      11: 4194304 op, 757625740.48 ns, 180.6321 ns/op
Result      12: 4194304 op, 743090324.68 ns, 177.1665 ns/op
Result      13: 4194304 op, 747052118.57 ns, 178.1111 ns/op
Result      14: 4194304 op, 744606218.33 ns, 177.5280 ns/op
Result      15: 4194304 op, 746553473.10 ns, 177.9922 ns/op
GC:  0 0 0 0 0

@adamsitnik
Copy link
Member Author

adamsitnik commented May 24, 2018

@AndreyAkinshin Tonight I have been running benchmark game benchmarks with this change and I have two more ideas for improvement:

  1. If the benchmark takes a lot of time to execute and we call it 1 per iteration and the user did not ask to measure overhead in an explicit way, don't measure the overhead. Why? If the benchmark takes more than 500 ms (500 * 1000000 ns) and the avg overhead is sth around 250 ns. then the overhead is 0,000025% of the final result.
  2. There is a lot of benchmarks, which should not be executed once, but 16 times time per iteration. Proposal: when Pilot condition is met for 16 per iteration, set the invocation count to 16 and don't run pilot.

An example:

IdleJitting  1: 1 op, 806842.02 ns, 806.8420 us/op
MainJitting  1: 1 op, 491133861.08 ns, 491.1339 ms/op

IdleJitting  1: 16 op, 2210975.22 ns, 138.1860 us/op
MainJitting  1: 16 op, 7509629645.26 ns, 469.3519 ms/op // we now know that the pilot is not needed

Pilot        1: 16 op, 7430266611.39 ns, 464.3917 ms/op // could have been avoided

// the next is IdleWarmup

@AndreyAkinshin What do you think?

@AndreyAkinshin
Copy link
Member

@adamsitnik, I like these ideas! However, we should discuss details and write tests for all corner cases.

P.S. The explicit unroll factor still doesn't work for me. Is it already fixed? Could you tell me the name of the corresponded test?

@adamsitnik
Copy link
Member Author

@AndreyAkinshin ok, I just pushed a commit with the changes. I have also cleaned up the code, I hope it's easier to read and maintain now.

For:

public class Tests
{
    [Benchmark] public void Sleep1s() => Thread.Sleep(TimeSpan.FromSeconds(1));
}

The output is:

IdleJitting  1: 1 op, 353813.06 ns, 353.8131 us/op
MainJitting  1: 1 op, 1000552244.87 ns, 1.0006 s/op

MainWarmup   1: 1 op, 1000413114.52 ns, 1.0004 s/op
MainWarmup   2: 1 op, 1000174768.25 ns, 1.0002 s/op
MainWarmup   3: 1 op, 1000517747.39 ns, 1.0005 s/op
MainWarmup   4: 1 op, 1000135138.91 ns, 1.0001 s/op
MainWarmup   5: 1 op, 1000249750.39 ns, 1.0002 s/op
MainWarmup   6: 1 op, 1000160513.09 ns, 1.0002 s/op

// BeforeMainRun
MainTarget   1: 1 op, 1000562508.59 ns, 1.0006 s/op
MainTarget   2: 1 op, 1000154525.93 ns, 1.0002 s/op
MainTarget   3: 1 op, 1000243478.12 ns, 1.0002 s/op
MainTarget   4: 1 op, 1000163079.02 ns, 1.0002 s/op
MainTarget   5: 1 op, 1000204418.99 ns, 1.0002 s/op
MainTarget   6: 1 op, 1000112330.66 ns, 1.0001 s/op
MainTarget   7: 1 op, 1000160227.99 ns, 1.0002 s/op
MainTarget   8: 1 op, 1000543976.88 ns, 1.0005 s/op
MainTarget   9: 1 op, 1000763506.34 ns, 1.0008 s/op
MainTarget  10: 1 op, 1000800284.65 ns, 1.0008 s/op
MainTarget  11: 1 op, 1000441909.94 ns, 1.0004 s/op
MainTarget  12: 1 op, 1000334711.14 ns, 1.0003 s/op
MainTarget  13: 1 op, 1000443620.56 ns, 1.0004 s/op
MainTarget  14: 1 op, 1000493513.62 ns, 1.0005 s/op
MainTarget  15: 1 op, 1001026941.68 ns, 1.0010 s/op

// AfterMainRun
Result       1: 1 op, 1000562508.59 ns, 1.0006 s/op
Result       2: 1 op, 1000154525.93 ns, 1.0002 s/op
Result       3: 1 op, 1000243478.12 ns, 1.0002 s/op
Result       4: 1 op, 1000163079.02 ns, 1.0002 s/op
Result       5: 1 op, 1000204418.99 ns, 1.0002 s/op
Result       6: 1 op, 1000112330.66 ns, 1.0001 s/op
Result       7: 1 op, 1000160227.99 ns, 1.0002 s/op
Result       8: 1 op, 1000543976.88 ns, 1.0005 s/op
Result       9: 1 op, 1000763506.34 ns, 1.0008 s/op
Result      10: 1 op, 1000800284.65 ns, 1.0008 s/op
Result      11: 1 op, 1000441909.94 ns, 1.0004 s/op
Result      12: 1 op, 1000334711.14 ns, 1.0003 s/op
Result      13: 1 op, 1000443620.56 ns, 1.0004 s/op
Result      14: 1 op, 1000493513.62 ns, 1.0005 s/op
Result      15: 1 op, 1001026941.68 ns, 1.0010 s/op

For

public class Tests
{
    [Benchmark] public void Sleep100ms() => Thread.Sleep(TimeSpan.FromMilliseconds(100));
}

The output is:

IdleJitting  1: 1 op, 348966.30 ns, 348.9663 us/op
MainJitting  1: 1 op, 101025744.25 ns, 101.0257 ms/op

IdleJitting  2: 16 op, 1038345.81 ns, 64.8966 us/op
MainJitting  2: 16 op, 1609747335.85 ns, 100.6092 ms/op

MainWarmup   1: 16 op, 1608774848.88 ns, 100.5484 ms/op
MainWarmup   2: 16 op, 1607150330.92 ns, 100.4469 ms/op
MainWarmup   3: 16 op, 1605935791.34 ns, 100.3710 ms/op
MainWarmup   4: 16 op, 1606883759.44 ns, 100.4302 ms/op
MainWarmup   5: 16 op, 1606966439.36 ns, 100.4354 ms/op
MainWarmup   6: 16 op, 1610310984.85 ns, 100.6444 ms/op
MainWarmup   7: 16 op, 1608051542.09 ns, 100.5032 ms/op
MainWarmup   8: 16 op, 1606959311.78 ns, 100.4350 ms/op
MainWarmup   9: 16 op, 1605103290.03 ns, 100.3190 ms/op
MainWarmup  10: 16 op, 1608219752.98 ns, 100.5137 ms/op
MainWarmup  11: 16 op, 1609136929.93 ns, 100.5711 ms/op
MainWarmup  12: 16 op, 1607567436.88 ns, 100.4730 ms/op

// BeforeMainRun
MainTarget   1: 16 op, 1605767865.56 ns, 100.3605 ms/op
MainTarget   2: 16 op, 1605735648.90 ns, 100.3585 ms/op
MainTarget   3: 16 op, 1607222747.13 ns, 100.4514 ms/op
MainTarget   4: 16 op, 1605089890.18 ns, 100.3181 ms/op
MainTarget   5: 16 op, 1606944771.52 ns, 100.4340 ms/op
MainTarget   6: 16 op, 1606836717.41 ns, 100.4273 ms/op
MainTarget   7: 16 op, 1605859668.79 ns, 100.3662 ms/op
MainTarget   8: 16 op, 1608290458.57 ns, 100.5182 ms/op
MainTarget   9: 16 op, 1607716830.95 ns, 100.4823 ms/op
MainTarget  10: 16 op, 1607361307.28 ns, 100.4601 ms/op
MainTarget  11: 16 op, 1606241992.16 ns, 100.3901 ms/op
MainTarget  12: 16 op, 1607062234.03 ns, 100.4414 ms/op
MainTarget  13: 16 op, 1607646410.47 ns, 100.4779 ms/op
MainTarget  14: 16 op, 1605199084.70 ns, 100.3249 ms/op
MainTarget  15: 16 op, 1606747480.12 ns, 100.4217 ms/op

// AfterMainRun
Result       1: 16 op, 1605767865.56 ns, 100.3605 ms/op
Result       2: 16 op, 1605735648.90 ns, 100.3585 ms/op
Result       3: 16 op, 1607222747.13 ns, 100.4514 ms/op
Result       4: 16 op, 1605089890.18 ns, 100.3181 ms/op
Result       5: 16 op, 1606944771.52 ns, 100.4340 ms/op
Result       6: 16 op, 1606836717.41 ns, 100.4273 ms/op
Result       7: 16 op, 1605859668.79 ns, 100.3662 ms/op
Result       8: 16 op, 1608290458.57 ns, 100.5182 ms/op
Result       9: 16 op, 1607716830.95 ns, 100.4823 ms/op
Result      10: 16 op, 1607361307.28 ns, 100.4601 ms/op
Result      11: 16 op, 1606241992.16 ns, 100.3901 ms/op
Result      12: 16 op, 1607062234.03 ns, 100.4414 ms/op
Result      13: 16 op, 1607646410.47 ns, 100.4779 ms/op
Result      14: 16 op, 1605199084.70 ns, 100.3249 ms/op
Result      15: 16 op, 1606747480.12 ns, 100.4217 ms/op
GC:  0 0 0 0 0

For

public class Tests
{
    [Benchmark] public void Sleep10ms() => Thread.Sleep(TimeSpan.FromMilliseconds(10));
}

The output is:

IdleJitting  1: 1 op, 380612.76 ns, 380.6128 us/op
MainJitting  1: 1 op, 10690514.22 ns, 10.6905 ms/op

IdleJitting  2: 16 op, 371774.56 ns, 23.2359 us/op
MainJitting  2: 16 op, 171752147.25 ns, 10.7345 ms/op

Pilot        1: 16 op, 169767258.86 ns, 10.6105 ms/op
Pilot        2: 32 op, 337642287.87 ns, 10.5513 ms/op
Pilot        3: 64 op, 676014725.01 ns, 10.5627 ms/op

IdleWarmup   1: 64 op, 5416.96 ns, 84.6400 ns/op
IdleWarmup   2: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   3: 64 op, 1140.41 ns, 17.8189 ns/op
IdleWarmup   4: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   5: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   6: 64 op, 855.31 ns, 13.3642 ns/op

IdleTarget   1: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget   2: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   3: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   4: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   5: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   6: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   7: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   8: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   9: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  10: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  11: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  12: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  13: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  14: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  15: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  16: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  17: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  18: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  19: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  20: 64 op, 285.10 ns, 4.4547 ns/op

MainWarmup   1: 64 op, 669696553.27 ns, 10.4640 ms/op
MainWarmup   2: 64 op, 673253785.74 ns, 10.5196 ms/op
MainWarmup   3: 64 op, 670058919.42 ns, 10.4697 ms/op
MainWarmup   4: 64 op, 671713943.43 ns, 10.4955 ms/op
MainWarmup   5: 64 op, 671377806.77 ns, 10.4903 ms/op
MainWarmup   6: 64 op, 673961696.96 ns, 10.5307 ms/op

// BeforeMainRun
MainTarget   1: 64 op, 670711520.62 ns, 10.4799 ms/op
MainTarget   2: 64 op, 671468184.48 ns, 10.4917 ms/op
MainTarget   3: 64 op, 673055924.13 ns, 10.5165 ms/op
MainTarget   4: 64 op, 671519217.95 ns, 10.4925 ms/op
MainTarget   5: 64 op, 668730623.67 ns, 10.4489 ms/op
MainTarget   6: 64 op, 670441527.90 ns, 10.4756 ms/op
MainTarget   7: 64 op, 669674030.12 ns, 10.4637 ms/op
MainTarget   8: 64 op, 672770250.74 ns, 10.5120 ms/op
MainTarget   9: 64 op, 671235540.28 ns, 10.4881 ms/op
MainTarget  10: 64 op, 671995055.17 ns, 10.4999 ms/op
MainTarget  11: 64 op, 676752572.06 ns, 10.5743 ms/op
MainTarget  12: 64 op, 674429266.18 ns, 10.5380 ms/op
MainTarget  13: 64 op, 668602612.34 ns, 10.4469 ms/op
MainTarget  14: 64 op, 671500971.35 ns, 10.4922 ms/op
MainTarget  15: 64 op, 673650364.28 ns, 10.5258 ms/op

// AfterMainRun
Result       1: 64 op, 670711050.20 ns, 10.4799 ms/op
Result       2: 64 op, 671467714.06 ns, 10.4917 ms/op
Result       3: 64 op, 673055453.71 ns, 10.5165 ms/op
Result       4: 64 op, 671518747.53 ns, 10.4925 ms/op
Result       5: 64 op, 668730153.25 ns, 10.4489 ms/op
Result       6: 64 op, 670441057.48 ns, 10.4756 ms/op
Result       7: 64 op, 669673559.70 ns, 10.4636 ms/op
Result       8: 64 op, 672769780.32 ns, 10.5120 ms/op
Result       9: 64 op, 671235069.86 ns, 10.4880 ms/op
Result      10: 64 op, 671994584.75 ns, 10.4999 ms/op
Result      11: 64 op, 676752101.64 ns, 10.5743 ms/op
Result      12: 64 op, 674428795.76 ns, 10.5379 ms/op
Result      13: 64 op, 668602141.92 ns, 10.4469 ms/op
Result      14: 64 op, 671500500.93 ns, 10.4922 ms/op
Result      15: 64 op, 673649893.86 ns, 10.5258 ms/op
GC:  0 0 0 0 0

For

public class Tests
{
    [Benchmark] public void Empty() { }
}

The output is:

IdleJitting  1: 1 op, 886100.71 ns, 886.1007 us/op
MainJitting  1: 1 op, 180470.32 ns, 180.4703 us/op

IdleJitting  2: 16 op, 375195.79 ns, 23.4497 us/op
MainJitting  2: 16 op, 350391.82 ns, 21.8995 us/op

Pilot        1: 16 op, 3706.34 ns, 231.6463 ns/op
Pilot        2: 32 op, 570.21 ns, 17.8189 ns/op
Pilot        3: 64 op, 285.10 ns, 4.4547 ns/op
Pilot        4: 128 op, 570.21 ns, 4.4547 ns/op
Pilot        5: 256 op, 570.21 ns, 2.2274 ns/op
Pilot        6: 512 op, 855.31 ns, 1.6705 ns/op
Pilot        7: 1024 op, 1710.62 ns, 1.6705 ns/op
Pilot        8: 2048 op, 3421.24 ns, 1.6705 ns/op
Pilot        9: 4096 op, 7127.58 ns, 1.7401 ns/op
Pilot       10: 8192 op, 13684.95 ns, 1.6705 ns/op
Pilot       11: 16384 op, 27084.80 ns, 1.6531 ns/op
Pilot       12: 32768 op, 51318.57 ns, 1.5661 ns/op
Pilot       13: 65536 op, 107483.90 ns, 1.6401 ns/op
Pilot       14: 131072 op, 214682.70 ns, 1.6379 ns/op
Pilot       15: 262144 op, 416535.76 ns, 1.5890 ns/op
Pilot       16: 524288 op, 866143.48 ns, 1.6520 ns/op
Pilot       17: 1048576 op, 1726584.90 ns, 1.6466 ns/op
Pilot       18: 2097152 op, 3454595.32 ns, 1.6473 ns/op
Pilot       19: 4194304 op, 6902918.37 ns, 1.6458 ns/op
Pilot       20: 8388608 op, 14281103.76 ns, 1.7024 ns/op
Pilot       21: 16777216 op, 27958358.97 ns, 1.6664 ns/op
Pilot       22: 33554432 op, 73826900.17 ns, 2.2002 ns/op
Pilot       23: 67108864 op, 109193950.57 ns, 1.6271 ns/op
Pilot       24: 134217728 op, 211685410.30 ns, 1.5772 ns/op
Pilot       25: 268435456 op, 421330051.99 ns, 1.5696 ns/op
Pilot       26: 536870912 op, 995081970.02 ns, 1.8535 ns/op

IdleWarmup   1: 536870912 op, 843111707.42 ns, 1.5704 ns/op
IdleWarmup   2: 536870912 op, 843451550.42 ns, 1.5711 ns/op
IdleWarmup   3: 536870912 op, 869994942.27 ns, 1.6205 ns/op
IdleWarmup   4: 536870912 op, 884145468.77 ns, 1.6468 ns/op
IdleWarmup   5: 536870912 op, 862387533.92 ns, 1.6063 ns/op
IdleWarmup   6: 536870912 op, 856060809.09 ns, 1.5945 ns/op
IdleWarmup   7: 536870912 op, 864332792.97 ns, 1.6099 ns/op
IdleWarmup   8: 536870912 op, 868011764.50 ns, 1.6168 ns/op
IdleWarmup   9: 536870912 op, 860173422.57 ns, 1.6022 ns/op

IdleTarget   1: 536870912 op, 868420887.57 ns, 1.6176 ns/op
IdleTarget   2: 536870912 op, 873615752.75 ns, 1.6272 ns/op
IdleTarget   3: 536870912 op, 868868499.58 ns, 1.6184 ns/op
IdleTarget   4: 536870912 op, 857457244.50 ns, 1.5971 ns/op
IdleTarget   5: 536870912 op, 865744909.05 ns, 1.6126 ns/op
IdleTarget   6: 536870912 op, 863927661.34 ns, 1.6092 ns/op
IdleTarget   7: 536870912 op, 856742775.91 ns, 1.5958 ns/op
IdleTarget   8: 536870912 op, 863842130.38 ns, 1.6090 ns/op
IdleTarget   9: 536870912 op, 855891457.80 ns, 1.5942 ns/op
IdleTarget  10: 536870912 op, 864548045.87 ns, 1.6103 ns/op
IdleTarget  11: 536870912 op, 858048263.41 ns, 1.5982 ns/op
IdleTarget  12: 536870912 op, 866360161.73 ns, 1.6137 ns/op
IdleTarget  13: 536870912 op, 859913693.56 ns, 1.6017 ns/op
IdleTarget  14: 536870912 op, 873673913.80 ns, 1.6273 ns/op
IdleTarget  15: 536870912 op, 875270776.75 ns, 1.6303 ns/op

MainWarmup   1: 536870912 op, 863368288.88 ns, 1.6081 ns/op
MainWarmup   2: 536870912 op, 868480759.24 ns, 1.6177 ns/op
MainWarmup   3: 536870912 op, 866211337.87 ns, 1.6134 ns/op
MainWarmup   4: 536870912 op, 901341467.52 ns, 1.6789 ns/op
MainWarmup   5: 536870912 op, 860632723.80 ns, 1.6031 ns/op
MainWarmup   6: 536870912 op, 881499996.29 ns, 1.6419 ns/op

// BeforeMainRun
MainTarget   1: 536870912 op, 857459240.22 ns, 1.5971 ns/op
MainTarget   2: 536870912 op, 861146764.85 ns, 1.6040 ns/op
MainTarget   3: 536870912 op, 866314260.12 ns, 1.6136 ns/op
MainTarget   4: 536870912 op, 876332786.13 ns, 1.6323 ns/op
MainTarget   5: 536870912 op, 863502002.28 ns, 1.6084 ns/op
MainTarget   6: 536870912 op, 857892311.96 ns, 1.5979 ns/op
MainTarget   7: 536870912 op, 861463229.39 ns, 1.6046 ns/op
MainTarget   8: 536870912 op, 856467081.13 ns, 1.5953 ns/op
MainTarget   9: 536870912 op, 855617758.74 ns, 1.5937 ns/op
MainTarget  10: 536870912 op, 865237710.48 ns, 1.6116 ns/op
MainTarget  11: 536870912 op, 855895449.24 ns, 1.5942 ns/op
MainTarget  12: 536870912 op, 878788950.08 ns, 1.6369 ns/op
MainTarget  13: 536870912 op, 869387387.38 ns, 1.6194 ns/op
MainTarget  14: 536870912 op, 865371138.78 ns, 1.6119 ns/op
MainTarget  15: 536870912 op, 853708422.69 ns, 1.5902 ns/op

// AfterMainRun
Result       1: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       2: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       3: 536870912 op, 1492515.19 ns, 0.0028 ns/op
Result       4: 536870912 op, 11511041.19 ns, 0.0214 ns/op
Result       5: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       6: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       7: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       8: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       9: 536870912 op, 0.00 ns, 0.0000 ns/op
Result      10: 536870912 op, 415965.55 ns, 0.0008 ns/op
Result      11: 536870912 op, 0.00 ns, 0.0000 ns/op
Result      12: 536870912 op, 13967205.15 ns, 0.0260 ns/op
Result      13: 536870912 op, 4565642.44 ns, 0.0085 ns/op
Result      14: 536870912 op, 549393.84 ns, 0.0010 ns/op
Result      15: 536870912 op, 0.00 ns, 0.0000 ns/op
GC:  0 0 0 0 0

// AfterAll

@adamsitnik
Copy link
Member Author

The explicit unroll factor still doesn't work for me. Is it already fixed? Could you tell me the name of the corresponded test?

The test is ForJobsWithExplicitUnrollFactorTheGlobalSetupIsCalledAndMultiActionCodeGetsJitted

The corresponding line of code: if engineParameters.HasUnrollFactor

@AndreyAkinshin
Copy link
Member

DontRunThePilotIfThePilotRequirementIsMetDuringWarmup is failed on Linux.

@adamsitnik
Copy link
Member Author

@AndreyAkinshin fixed.

var unrollFactor = Job.Default.ResolveValue(RunMode.UnrollFactorCharacteristic, DefaultResolver);
var mediumTime =  TimeSpan.FromMilliseconds(IterationTime.TotalMilliseconds / unrollFactor);

void MediumMultiple(long _)
{
	timesBenchmarkCalled += unrollFactor;

	for (int i = 0; i < unrollFactor; i++) // the real unroll factor obviously does not use loop ;)
		Thread.Sleep(mediumTime);
}

Executing MediumMultiple was for some reason taking less than IterationTime on Linux... I bet that Thread.Sleep is implemented in a different way and it was too close. I have increased the value, don't have time for proper investigation now ;(

@adamsitnik
Copy link
Member Author

I have just run some ML.NET benchmarks. Without this change: 24 minutes. With: 2 minutes ;)

@adamsitnik
Copy link
Member Author

@AndreyAkinshin I am going to merge it now to avoid conflicts with my incoming PR. If something is missing I will add it.

@adamsitnik adamsitnik merged commit a9664f4 into master May 27, 2018
@AndreyAkinshin
Copy link
Member

@adamsitnik, everything is OK.

@adamsitnik adamsitnik deleted the unrollFactor1 branch May 28, 2018 10:45
@adamsitnik adamsitnik added this to the v0.11.0 milestone Jul 18, 2018
alinasmirnova pushed a commit to alinasmirnova/BenchmarkDotNet that referenced this pull request Sep 22, 2018
…, fixes dotnet#736

* don't execute long operations more than once per iteration, dotnet#736

* generate the right C# code for dotnet#736

* EngineParameters.Resolver was always null or ignored ;), dotnet#736

* don't forget to JIT idle, dotnet#736

* do the math right for unroll factor for JIT, dotnet#736

* generate the right IL code, dotnet#736

* Setup and Cleanup are jitted together with benchmark, dotnet#736

* engine factory is now supposed to create an engine which is ready to run (hence the method name change), dotnet#736

* addressing PR feedback, dotnet#736

* bring back the calls to DummyActions, dotnet#736

* align iteration mode

* don't measure the overhead for time consuming benchmarks, don't run pilot if jitting gives the answer, dotnet#736

* fix the Linux build
@abelbraaksma
Copy link

abelbraaksma commented Dec 20, 2019

@adamsitnik, @AndreyAkinshin, this PR may have caused nano benchmarks to fail. It appears that the calculation mentioned in the OP above doesn't hold in practice. The case referenced in my issue, #1338, has 7ns per operation (when run correctly, when run once, it is 300-600ns per op) and 120ms as IterationTime. That's an order of magnitude out of reach of the threshold set by IterationTime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce number of initial pilot ops to 1 or make it configurable
3 participants