Draft: Experimental Tango support #25
base: main
Conversation
Obviously, I need to do some additional work on tooling and Windows support 😂
33d17c7 to 31ff8f2
Until tango benchmarks on the main
This is really cool, thanks for sharing!
A few thoughts:
- This feels like it's useful both for "compare ordsearch to btree" and for "compare ordsearch between branch A and branch B". My understanding from the CI file you added is that this only tests the latter for PRs, though it provides a way to run the former on-demand. Is that right?
- This is still susceptible to performance differences across machines, meaning we probably still wouldn't check in target/benchmarks, I assume. Just double-checking.
- Do you have a hypothesis about why criterion seemingly consistently reports a larger performance diff than tango does? 44.05% vs 43.35% is a decently-chunked difference, and seems pretty stable in the diagrams you provided.
@@ -0,0 +1,3 @@
fn main() {
    println!("cargo:rustc-link-arg-benches=-rdynamic");
nit: should also add cargo:rerun-if-changed=build.rs so the build script isn't re-run all the time
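A minimal sketch of what the combined build script could look like if the nit is applied (same file as in the diff above):

```rust
// build.rs
fn main() {
    // Only re-run this build script when build.rs itself changes.
    println!("cargo:rerun-if-changed=build.rs");
    // Link benchmark executables with -rdynamic so they can also be loaded as dynamic libraries.
    println!("cargo:rustc-link-arg-benches=-rdynamic");
}
```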
benches/search_comparison.rs (Outdated)
/// Because benchmarks are built with the linker flag -rdynamic, there should be a library entry point defined
/// in all benchmarks. This is only needed when two harnesses are used.
This feels like a shortcoming in tango — shouldn't it be able to recognize if a particular binary has no entry point for it?
Yes, the tango executable is the dynamic library (benchmarks) and the executable (the runner) at the same time. To achieve this, some linker shenanigans are required. But I was able to pull it off without requiring a dummy entry point. This code is already removed. Thank you.
}
}

impl<C: FromSortedVec> Generator for RandomCollection<C>
I'm confused about why haystack and needle are first-class primitives in Tango — shouldn't it only be thinking in terms of "inputs" and "benchmarks"? If it also delineates between haystacks and needles, does that mean it doesn't generalize to other kinds of benchmarks without such concepts (e.g., factorials)? Alternatively, if it does support "general" benchmarks, why have this specialization built into it as well?
They are more of second-class primitives. For paired tests to be sensitive, they need to operate on the same input. If the input is updated between iterations, it should be updated at the same time for both algorithms. So the harness needs a way to control this process. This is how the Generator trait was born.
The haystack/needle distinction is unusual; it's roughly the "big" and "small" parts of the input. The haystack is reused between benchmark samples and is assumed to be costly to generate (thus its generation time is not included in the benchmark time). The needle is assumed to be cheap; it is generated for each iteration, and its generation time is included in the benchmark time.
This scheme can be generalized to arbitrary input, although it's not pretty. But it mirrors a very important distinction present in almost any benchmarking harness, including criterion: under the hood we almost never time a single function call (an iteration), but a bunch of them (a sample). Some parts of the input can be cheaply generated per iteration, others cannot. This is the needle/haystack separation.
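To make the split concrete, here is an illustrative sketch (not Tango's actual API; the trait and type names below are made up for the example) of how a harness could separate the per-sample haystack from the per-iteration needle:

```rust
/// Illustrative only: a harness-side view of the haystack/needle split.
trait InputSource {
    type Haystack;
    type Needle;

    /// Called once per sample; may be expensive. Its cost is NOT measured.
    fn next_haystack(&mut self) -> Self::Haystack;

    /// Called once per iteration; should be cheap. Its cost IS measured.
    fn next_needle(&mut self, haystack: &Self::Haystack) -> Self::Needle;
}

/// Example: a sorted vector as the haystack, a pseudo-random element as the needle.
struct SortedVecSource {
    size: usize,
    seed: u64,
}

impl InputSource for SortedVecSource {
    type Haystack = Vec<u32>;
    type Needle = u32;

    fn next_haystack(&mut self) -> Vec<u32> {
        // Costly: build the whole collection once per sample.
        (0..self.size as u32).collect()
    }

    fn next_needle(&mut self, haystack: &Vec<u32>) -> u32 {
        // Cheap: pick an element per iteration with a toy LCG.
        self.seed = self.seed.wrapping_mul(6364136223846793005).wrapping_add(1);
        haystack[self.seed as usize % haystack.len()]
    }
}
```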
I'm still unsure how easy this concept is to understand. Maybe another metaphor should be chosen, so I'm open to any feedback.
jobs:
  bench:
    runs-on: ubuntu-22.04
    steps:
I think several of these could probably be improved upon by drawing from other helpful GitHub Actions. For example, installing Rust and installing crate binaries have pretty good (and more optimized) variants. The other workflow files may be able to provide some inspiration.
- migrated to dtolnay/rust-toolchain
- using lock files when building benchmarks
- added support for taiki-e/install-action to cargo-export
set -eo pipefail

target/benchmarks/search_ord --color=never compare target/benchmarks/search_ord \
  -ot 1 --fail-threshold 10 | tee target/benchmark.txt
What is -ot here? Also, what is the unit for --fail-threshold?
-t – duration of each test (in seconds)
-o – filter outliers. Because tango measures the difference directly, it can detect and remove severe observations that manifest themselves symmetrically (in both algorithms) and are therefore considered independent of the algorithms' performance.
--fail-threshold – exit with a non-zero exit code if the candidate is slower than the baseline by the given percentage. Tango also does a z-test and fails only if the difference is statistically significant (planning to move to bootstrap later on).
More on cli-arguments
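So in the CI invocation above, -ot 1 --fail-threshold 10 means: filter outliers (-o), run each test for 1 second (-t 1), and fail if the candidate is more than 10% slower than the baseline. Spelled out with the short flags separated (a sketch reusing the search_ord benchmark from this PR):

```sh
# same meaning as "-ot 1 --fail-threshold 10" in the workflow script
target/benchmarks/search_ord compare target/benchmarks/search_ord \
  -o -t 1 --fail-threshold 10
```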
Only semi-related, but https://github.com/bheisler/iai might also be an interesting thing for us to run in CI
Yes, you can compare
Correct. This is the idea. To conclude there is a difference between two algorithms (or lack thereof) you need to execute both of them on the same machine with the same input at the "same time". The last requirement is, strictly speaking, impossible, so Tango executes the algorithms in an alternating fashion, so that within any short span of time (~50ms) both algorithms are executed.
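A rough sketch of that idea (illustrative only, not Tango's actual implementation; all names here are made up):

```rust
use std::time::{Duration, Instant};

/// Alternate between baseline and candidate on the same input and record the
/// per-pair time difference, so machine-wide noise hits the pair, not the diff.
fn paired_samples(
    baseline: impl Fn(&[u32]) -> u64,
    candidate: impl Fn(&[u32]) -> u64,
    input: &[u32],
    budget: Duration,
) -> Vec<i128> {
    let mut diffs = Vec::new();
    let start = Instant::now();
    while start.elapsed() < budget {
        // Both algorithms see the same input within the same short time window.
        let t0 = Instant::now();
        std::hint::black_box(baseline(input));
        let base_ns = t0.elapsed().as_nanos() as i128;

        let t1 = Instant::now();
        std::hint::black_box(candidate(input));
        let cand_ns = t1.elapsed().as_nanos() as i128;

        diffs.push(cand_ns - base_ns);
    }
    diffs
}
```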
Hard to say for sure, but here are some of my thoughts:
I'll look into the matter closely.
Yes, indeed. Using performance metrics is another interesting school of thought. Intel/AMD have done a great job with PMC counters. I believe there is great potential in using PMCs for benchmarking. I have no idea how good the situation is on aarch64, though.
Tried using Tango for a thing (whose code I unfortunately can't share), and I'm sometimes seeing decently large fluctuations for very-fast operations even when comparing against the exact same binary (i.e., running
This was with
Can you please rerun the tests with the following options added:
In my experience, when an A/A test gives false positives it boils down to one of the following problems:
If you are able to gather dumps (
The earlier results were with 0.2. With

$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 15.0 us ... 14.9 us ] -0.55%*
array::delete [ 19.3 us ... 19.0 us ] -1.41%*
array::update [ 39.4 us ... 39.2 us ] -0.59%*
array::insert [ 20.0 us ... 19.8 us ] -0.83%*
array::push [ 15.1 us ... 15.0 us ] -0.59%*
map::insert [ 581.5 ns ... 583.3 ns ] +0.30%
map::remove [ 280.2 ns ... 281.4 ns ] +0.43%
map::update [ 634.8 ns ... 636.9 ns ] +0.35%
register::write [ 235.4 ns ... 234.0 ns ] -0.62%*
register::clear [ 76.5 ns ... 75.6 ns ] -1.17%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 15.0 us ... 15.1 us ] +0.21%
array::delete [ 19.3 us ... 19.3 us ] -0.13%
array::update [ 38.9 us ... 38.3 us ] -1.49%*
array::insert [ 19.8 us ... 19.6 us ] -0.92%*
array::push [ 15.1 us ... 15.1 us ] +0.40%
map::insert [ 574.8 ns ... 570.2 ns ] -0.79%*
map::remove [ 287.8 ns ... 283.1 ns ] -1.62%*
map::update [ 619.7 ns ... 636.9 ns ] +2.77%*
register::write [ 237.0 ns ... 236.4 ns ] -0.24%
register::clear [ 75.5 ns ... 76.2 ns ] +0.86%*

I got an error ("no such file or directory") when I did
Looking at
Given the wide range in execution times, I wondered if maybe this was caused by
Does this effect manifest itself when you test the binary against itself? You can omit the other executable, in which case tango will test against itself to ensure the same code is used in base/candidate, e.g.

$ cargo bench -q --bench=micro -- compare -ot 5
It certainly does look like the re-build changes the performance characteristics of the binary (even though it's the exact same environment, directory, code, compiler, and toolchain):

$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift [ 8.8 us ... 8.8 us ] -0.09%
array::delete [ 13.7 us ... 13.7 us ] -0.01%
array::update [ 26.6 us ... 26.7 us ] +0.27%
array::insert [ 13.3 us ... 13.3 us ] +0.09%
array::push [ 8.5 us ... 8.4 us ] -0.08%
map::insert [ 572.0 ns ... 572.3 ns ] +0.05%
map::remove [ 303.0 ns ... 303.1 ns ] +0.02%
map::update [ 606.7 ns ... 606.6 ns ] -0.02%
register::write [ 219.5 ns ... 219.4 ns ] -0.08%
register::clear [ 82.9 ns ... 82.9 ns ] -0.01%
$ cargo export target/benchmarks -- bench --bench=micro
Finished bench [optimized + debuginfo] target(s) in 0.66s
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 8.5 us ... 8.4 us ] -1.08%*
array::delete [ 13.2 us ... 13.1 us ] -0.84%*
array::update [ 26.4 us ... 26.2 us ] -0.87%*
array::insert [ 13.5 us ... 13.4 us ] -0.81%*
array::push [ 8.5 us ... 8.4 us ] -1.43%*
map::insert [ 572.9 ns ... 577.7 ns ] +0.85%*
map::remove [ 303.7 ns ... 300.8 ns ] -0.97%*
map::update [ 610.0 ns ... 607.0 ns ] -0.48%
register::write [ 225.0 ns ... 225.0 ns ] -0.02%
register::clear [ 84.7 ns ... 83.6 ns ] -1.29%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 8.6 us ... 8.6 us ] -0.18%
array::delete [ 13.7 us ... 13.6 us ] -0.09%
array::update [ 27.4 us ... 27.3 us ] -0.24%
array::insert [ 13.9 us ... 13.8 us ] -0.46%
array::push [ 8.7 us ... 8.7 us ] +0.11%
map::insert [ 574.9 ns ... 570.3 ns ] -0.80%*
map::remove [ 307.0 ns ... 309.3 ns ] +0.76%*
map::update [ 610.8 ns ... 608.4 ns ] -0.39%
register::write [ 225.6 ns ... 223.3 ns ] -1.03%*
register::clear [ 86.8 ns ... 82.7 ns ] -4.80%*
$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift [ 8.5 us ... 8.5 us ] +0.05%
array::delete [ 13.1 us ... 13.1 us ] +0.06%
array::update [ 26.3 us ... 26.3 us ] -0.01%
array::insert [ 13.4 us ... 13.4 us ] +0.07%
array::push [ 8.5 us ... 8.5 us ] +0.07%
map::insert [ 567.3 ns ... 567.5 ns ] +0.04%
map::remove [ 305.6 ns ... 305.4 ns ] -0.06%
map::update [ 611.6 ns ... 611.7 ns ] +0.02%
register::write [ 221.3 ns ... 221.4 ns ] +0.02%
register::clear [ 84.1 ns ... 84.1 ns ] -0.03%
I encounter such behavior on x86 occasionally. Usually it happens after some changes to code unrelated to the code being tested. But I have never faced a situation where two binaries built one after the other have different performance. 😕 Theoretically it might happen because the function linking order changes for whatever reason, which in turn may change the function layout and subsequently the effectiveness of different optimizations. There is at least one publication describing this effect – Producing Wrong Data Without Doing Anything Obviously Wrong.
Could you please do some sanity checks of the assembly and make sure
I'll do you one better:

$ shasum a
1cd583440bff0f5befae11ace8de8837ee7f0131 a
$ shasum b
1cd583440bff0f5befae11ace8de8837ee7f0131 b
$ ./a compare -ot 5
array::unshift [ 8.7 us ... 8.8 us ] +0.05%
array::delete [ 13.9 us ... 13.9 us ] +0.08%
array::update [ 27.3 us ... 27.3 us ] +0.03%
array::insert [ 13.8 us ... 13.8 us ] +0.02%
array::push [ 8.7 us ... 8.7 us ] -0.05%
map::insert [ 576.3 ns ... 576.4 ns ] +0.02%
map::remove [ 306.2 ns ... 306.1 ns ] -0.02%
map::update [ 624.0 ns ... 623.8 ns ] -0.02%
register::write [ 224.7 ns ... 224.7 ns ] -0.01%
register::clear [ 84.9 ns ... 84.9 ns ] +0.02%
$ ./a compare a -ot 5
array::unshift [ 8.8 us ... 8.8 us ] -0.23%
array::delete [ 13.7 us ... 13.7 us ] +0.14%
array::update [ 27.6 us ... 27.6 us ] +0.01%
array::insert [ 14.0 us ... 14.0 us ] -0.02%
array::push [ 8.6 us ... 8.6 us ] +0.06%
map::insert [ 583.6 ns ... 584.4 ns ] +0.14%
map::remove [ 311.2 ns ... 311.3 ns ] +0.02%
map::update [ 631.8 ns ... 631.5 ns ] -0.04%
register::write [ 231.6 ns ... 231.7 ns ] +0.04%
register::clear [ 87.0 ns ... 87.0 ns ] -0.01%
$ ./a compare b -ot 5
array::unshift [ 8.7 us ... 8.7 us ] -0.95%*
array::delete [ 13.9 us ... 14.0 us ] +1.02%*
array::update [ 27.8 us ... 28.2 us ] +1.40%*
array::insert [ 14.2 us ... 14.5 us ] +2.07%*
array::push [ 8.8 us ... 8.7 us ] -0.88%*
map::insert [ 575.6 ns ... 586.0 ns ] +1.81%*
map::remove [ 315.4 ns ... 302.0 ns ] -4.25%*
map::update [ 625.1 ns ... 636.7 ns ] +1.86%*
register::write [ 226.3 ns ... 227.3 ns ] +0.44%
register::clear [ 85.9 ns ... 85.8 ns ] -0.20%
Ok, I think I can explain that. In the first two cases we are executing the same code, although mapped twice in different parts of virtual memory, while in the third case we are executing different code (although identical). Does
Ah, you mean in the sense that the files are memory-mapped, so the page tables behind the scenes map them back to the same backing memory or something (for I would say

$ ./a compare b -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.5 us ] -0.27%
array::delete [ 13.6 us ... 13.4 us ] -1.11%*
array::update [ 27.3 us ... 27.1 us ] -0.77%*
array::insert [ 13.8 us ... 13.6 us ] -1.07%*
array::push [ 8.5 us ... 8.5 us ] -0.12%
map::insert [ 556.6 ns ... 567.9 ns ] +2.03%*
map::remove [ 299.4 ns ... 298.5 ns ] -0.30%
map::update [ 604.3 ns ... 601.2 ns ] -0.50%*
register::write [ 222.4 ns ... 217.8 ns ] -2.08%*
register::clear [ 82.6 ns ... 90.0 ns ] +8.99%*
Yes, exactly. After the TLB lookup, all memory-mapped copies of the same file should have the same physical address, linked to the OS buffer pool. At least this is how I conceptualize how the OS works. I'm trying to reproduce this effect on an M3 Max using some tango tests I have at hand, but still no luck.
With
Relatedly: any plans to ship
Actually this is a good idea. 👍 I will do it.
I will fix those minor issues you've reported and release 0.3 in a few days.
Also I think there is one more thing you can try. There is an implicit default limit on the number of iterations per sample. Last time when we tried

tango_main!(MeasurementSettings {
    max_iterations_per_sample: usize::max_value(),
    ..tango_bench::DEFAULT_SETTINGS
});
Tried that now, and it doesn't seem to have made too much of a difference. One interesting thing I just noticed, though, is that there's a bit of a "pattern" to the results. Notice how the +s and -s seem to kind of happen in waves? There's a sequence of +s, then a sequence of -s, then +s, etc., even across multiple invocations:

$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.6 us ... 8.6 us ] -0.72%*
array::delete [ 13.5 us ... 13.5 us ] +0.05%
array::update [ 26.9 us ... 26.9 us ] -0.08%
array::insert [ 13.7 us ... 13.7 us ] +0.26%
array::push [ 8.6 us ... 8.5 us ] -0.51%*
map::insert [ 562.1 ns ... 561.6 ns ] -0.08%
map::remove [ 299.1 ns ... 300.3 ns ] +0.39%
map::update [ 604.6 ns ... 613.1 ns ] +1.41%*
register::write [ 219.8 ns ... 225.9 ns ] +2.77%*
register::clear [ 85.0 ns ... 86.2 ns ] +1.38%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.6 us ... 8.7 us ] +1.82%*
array::delete [ 13.4 us ... 13.6 us ] +1.43%*
array::update [ 26.8 us ... 27.3 us ] +1.68%*
array::insert [ 13.6 us ... 13.9 us ] +1.70%*
array::push [ 8.5 us ... 8.7 us ] +1.92%*
map::insert [ 571.1 ns ... 573.5 ns ] +0.41%
map::remove [ 307.4 ns ... 296.9 ns ] -3.42%*
map::update [ 615.9 ns ... 607.5 ns ] -1.37%*
register::write [ 225.0 ns ... 224.2 ns ] -0.38%
register::clear [ 86.1 ns ... 82.3 ns ] -4.42%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.5 us ] +0.65%*
array::delete [ 13.3 us ... 13.3 us ] +0.29%
array::update [ 26.7 us ... 26.8 us ] +0.34%
array::insert [ 13.5 us ... 13.6 us ] +0.46%
array::push [ 8.5 us ... 8.6 us ] +0.58%*
map::insert [ 560.3 ns ... 562.7 ns ] +0.44%
map::remove [ 302.1 ns ... 298.7 ns ] -1.11%*
map::update [ 619.4 ns ... 611.1 ns ] -1.34%*
register::write [ 223.3 ns ... 220.0 ns ] -1.44%*
register::clear [ 85.5 ns ... 82.9 ns ] -3.05%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.6 us ] +0.75%*
array::delete [ 13.7 us ... 13.5 us ] -1.51%*
array::update [ 27.5 us ... 27.1 us ] -1.54%*
array::insert [ 13.9 us ... 13.8 us ] -1.18%*
array::push [ 8.6 us ... 8.6 us ] +0.65%*
map::insert [ 561.8 ns ... 570.0 ns ] +1.47%*
map::remove [ 302.3 ns ... 306.1 ns ] +1.26%*
map::update [ 613.5 ns ... 617.5 ns ] +0.66%*
register::write [ 220.0 ns ... 224.1 ns ] +1.89%*
register::clear [ 85.7 ns ... 83.6 ns ] -2.37%*
0.3 released
Another feature request: have all the benchmarks run even if one of the earlier results exceeds the threshold, and then only fail at the end!
Implemented in bazhenov/tango#15 (0.4.0 published)
It is even simpler than that, as it turns out. The macOS dynamic linker is capable of linking an executable file against itself and not loading the same executable twice into virtual memory. If you are testing 2 different executables you'll have distinct mappings for each of the binaries in virtual memory:

$ ./search_ord compare search_ord2
$ vmmap `pgrep search_ord` | grep search_ord
Virtual Memory Map of process 34763 (search_ord)
__TEXT 100130000-1002b4000 [ 1552K 1552K 0K 0K] r-x/r-x SM=COW /Users/USER/*/search_ord
__DATA_CONST 1002b4000-1002cc000 [ 96K 80K 0K 0K] r--/rw- SM=COW /Users/USER/*/search_ord
__LINKEDIT 1002d0000-1003c8000 [ 992K 992K 0K 0K] r--/r-- SM=COW /Users/USER/*/search_ord
__DATA 1002cc000-1002d0000 [ 16K 16K 16K 0K] rw-/rw- SM=COW /Users/USER/*/search_ord
__TEXT 10086c000-1009f0000 [ 1552K 608K 0K 0K] r-x/rwx SM=COW /Users/USER/*/search_ord2
__DATA_CONST 1009f0000-100a08000 [ 96K 96K 96K 0K] r--/rwx SM=COW /Users/USER/*/search_ord2
__LINKEDIT 100a0c000-100b04000 [ 992K 160K 0K 0K] r--/rwx SM=COW /Users/USER/*/search_ord2
__DATA 100a08000-100a0c000 [ 16K 16K 16K 0K] rw-/rwx SM=COW /Users/USER/*/search_ord2

But the dynamic linker is smart enough not to load the file twice if you test the binary against itself:

$ ./search_ord compare search_ord
$ vmmap `pgrep search_ord` | grep search_ord
Virtual Memory Map of process 34821 (search_ord)
__TEXT 10049c000-100620000 [ 1552K 1552K 0K 0K] r-x/r-x SM=COW /Users/USER/*/search_ord
__DATA_CONST 100620000-100638000 [ 96K 80K 0K 0K] r--/rw- SM=COW /Users/USER/*/search_ord
__LINKEDIT 10063c000-100734000 [ 992K 992K 0K 0K] r--/r-- SM=COW /Users/USER/*/search_ord
__DATA 100638000-10063c000 [ 16K 16K 16K 0K] rw-/rw- SM=COW /Users/USER/*/search_ord

Remarkably, this deduplication works even if you're using hard links on the filesystem (at least in the case of APFS). I was able to reproduce this effect at least partially, though not with the same magnitude as in your case, but still... Continuing to investigate the issue.
For quite some time I've been developing a benchmarking harness called Tango. It is based on a paired-testing methodology, and now I'm confident enough to publish it. Ordsearch was in fact one of the first projects I tried it on. So this PR is a benchmark implementation of ordsearch on Tango. The main purpose is to provide a performance regression toolkit using GHA.
The main promises of paired testing are:
Sensitivity
I've done some experiments on an AWS c2.medium instance. I ran a single test (u32/1024/nodup) in both criterion and tango in a loop 400 times, which took several hours. Here are 80% confidence intervals on the performance difference between OrderedCollection and Vec as measured by tango and criterion. Even using a fraction of a second, Tango can provide tighter intervals.
Here are the 100% CIs (whiskers show the minimum and maximum values).
Here the results are comparable between Tango and Criterion, but Tango requires much less time. Even 0.1s is enough to get a coarse estimate, which is very useful when experimenting during development.
Regression testing
Tango builds an executable benchmark which is at the same time a dynamically linked library. This way two versions of the same code can be loaded into the address space and benchmarked in a paired way. The algorithm is quite simple:
./main-bench compare ./branch-bench
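In terms of the commands used elsewhere in this thread, the CI flow roughly amounts to the following (a sketch assuming the benchmark target is named search_ord and that the baseline is exported from the main branch; branch names are placeholders):

```sh
# on main: build the benchmarks and export the binaries as the baseline
git checkout main
cargo export target/benchmarks -- bench --bench=search_ord

# on the PR branch: build the candidate and compare it against the exported baseline
git checkout feature-branch
cargo bench -q --bench=search_ord -- compare target/benchmarks/search_ord \
  -ot 1 --fail-threshold 10
```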
I've added a benchmarking workflow for GitHub, but it needs to be debugged and streamlined, which is possible only after the PR is open (GHA doesn't work in forks).