
Draft: Experimental Tango support #25

Draft · wants to merge 20 commits into main

Conversation

bazhenov
Contributor

@bazhenov bazhenov commented Dec 8, 2023

For quite some time I've been developing a benchmarking harness called Tango. It is based on the paired-testing methodology, and I'm now confident enough to publish it. Ordsearch was in fact one of the first projects I tried it on. So this PR is a benchmark implementation of ordsearch on Tango. The main purpose is to provide a performance regression toolkit using GHA.

The main promises of paired testing are:

  • convenience – we're estimating the difference in performance of two algorithms directly, which in most cases is the thing we actually need
  • sensitivity and speed – because the two algorithms are executed simultaneously, both are subject to the same biases the platform has at that moment. Therefore we can estimate performance differences more accurately, or spend less time achieving the same level of precision.
  • performance regression testing – using paired testing, Tango allows us to compare not only two algorithms in the same code base but also two versions of the same algorithm from different VCS commits/branches (ex.). This allows for quite sensitive performance regression testing. In my experience so far it can quite confidently find regressions as low as 5%.

Sensitivity

I've done some experiments on an AWS c2.medium instance. I ran a single test, u32/1024/nodup, in both criterion and tango in a loop 400 times, which took several hours. Here are the 80% confidence intervals on the performance difference between OrderedCollection and Vec as measured by tango and criterion.

[figure: ordsearch-80CI (80% confidence intervals, Tango vs. Criterion)]

Even using only a fraction of a second per run, Tango provides tighter intervals.

Here are the 100% CIs (whiskers show the minimum and maximum values):

[figure: ordsearch-min-max (intervals with min/max whiskers)]

Here the results are comparable between Tango and Criterion, but Tango requires much less time. Even 0.1 s is enough to get a coarse estimate, which is very useful when experimenting during development.

Regression testing

Tango builds the benchmark as an executable that is, at the same time, a dynamically linked library. This way two versions of the same code can be loaded into one address space and benchmarked in a paired way. The algorithm is quite simple:

  • checkout branch
  • build benchmarks and export them (I built cargo-export for that)
  • checkout main branch
  • build the benchmark and run it, passing the branch executable as an argument: ./main-bench compare ./branch-bench

I've added a benchmarking workflow for GitHub, but it needs to be debugged and streamlined, which is possible only after the PR is open (GHA doesn't work in forks).

@bazhenov bazhenov marked this pull request as draft December 8, 2023 04:28
@bazhenov
Contributor Author

bazhenov commented Dec 8, 2023

Obviously, I need to do some additional work on tooling and Windows support 😂

Owner

@jonhoo jonhoo left a comment

This is really cool, thanks for sharing!

A few thoughts:

  • This feels like it's useful both for "compare ordsearch to btree" and for "compare ordsearch between branch A and branch B". My understanding from the CI file you added is that this only tests the latter for PRs, though it provides a way to run the former on-demand. Is that right?
  • This is still susceptible to performance differences across machines, meaning we probably still wouldn't check in the target/benchmarks I assume. Just double-checking.
  • Do you have a hypothesis about why criterion seemingly consistently reports a larger performance diff than tango does? 44.05% vs 43.35% is a decently-chunked difference, and seems pretty stable in the diagrams you provided.

@@ -0,0 +1,3 @@
fn main() {
    println!("cargo:rustc-link-arg-benches=-rdynamic");
}
Owner

nit: should also add

cargo:rerun-if-changed=build.rs

so the build script isn't re-run all the time
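
For reference, a minimal build.rs combining the flag already used in this PR with the rerun directive suggested above might look like this (just a sketch):

fn main() {
    // only re-run the build script when build.rs itself changes
    println!("cargo:rerun-if-changed=build.rs");
    // link benchmark executables with -rdynamic (as done in this PR)
    println!("cargo:rustc-link-arg-benches=-rdynamic");
}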

Comment on lines 20 to 21
/// Because benchmarks are built with the linker flag -rdynamic, there should be a library entry point defined
/// in all benchmarks. This is only needed when two harnesses are used.
Owner

This feels like a shortcoming in tango: shouldn't it be able to recognize if a particular binary has no entry point for it?

Contributor Author

Yes, the tango executable is both the dynamic library (the benchmarks) and the executable (the runner) at the same time. Achieving this requires some linker shenanigans, but I was able to pull it off without requiring a dummy entry point. This code has already been removed. Thank you.

}
}

impl<C: FromSortedVec> Generator for RandomCollection<C>
Owner

I'm confused about why haystack and needle are first-class primitives in Tango: shouldn't it only be thinking in terms of "inputs" and "benchmarks"? If it also distinguishes between haystacks and needles, does that mean it doesn't generalize to other kinds of benchmarks without such concepts (e.g., factorials)? Alternatively, if it does support "general" benchmarks, why have this specialization built into it as well?

Contributor Author

They are more like second-class primitives. For paired tests to be sensitive they need to operate on the same input, and if the input is updated between iterations it should be updated for both algorithms at the same time. So the harness needs a way to control this process. This is how the Generator trait was born.

The haystack/needle distinction is unusual; it is roughly the "big" and "small" parts of the input. The haystack is reused between benchmark samples and is expected to be costly to generate (thus its generation time is not included in the benchmark time). The needle is expected to be cheap; it is generated for each iteration, and its generation time is included in the benchmark time.

This scheme can be generalized to arbitrary input, although it's not pretty. But it mirrors a very important distinction present in almost any benchmarking harness, including criterion: under the hood we are almost never timing a single function call (an iteration), but a bunch of them (a sample). Some parts of the input can be cheaply generated per iteration; others cannot. That is the needle/haystack separation.

I'm still unsure how easy this concept is to understand. Maybe another metaphor should be chosen, so I'm open to any feedback.
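
To make the split concrete, here is an illustrative sketch in plain Rust. This is not tango's actual Generator trait; the trait, type, and method names below are made up for illustration and only model the haystack/needle timing split described above.

trait InputGenerator {
    type Haystack;
    type Needle;
    // Costly: generated once per sample, excluded from the measured time.
    fn haystack(&mut self) -> Self::Haystack;
    // Cheap: generated for every iteration, included in the measured time.
    fn needle(&mut self) -> Self::Needle;
}

struct SortedU32 { size: usize, state: u64 }

impl SortedU32 {
    fn rand(&mut self) -> u32 {
        // tiny xorshift PRNG, enough for a sketch
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        self.state as u32
    }
}

impl InputGenerator for SortedU32 {
    type Haystack = Vec<u32>;
    type Needle = u32;
    fn haystack(&mut self) -> Vec<u32> {
        let mut v: Vec<u32> = (0..self.size).map(|_| self.rand()).collect();
        v.sort_unstable();
        v
    }
    fn needle(&mut self) -> u32 { self.rand() }
}

fn main() {
    use std::{hint::black_box, time::Instant};
    let mut g = SortedU32 { size: 1024, state: 0x2545F4914F6CDD1D };
    let haystack = g.haystack();     // not timed
    let start = Instant::now();
    for _ in 0..1_000 {
        let needle = g.needle();     // timed together with the search itself
        black_box(haystack.binary_search(&needle).is_ok());
    }
    println!("sample: {:?}", start.elapsed());
}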

jobs:
bench:
runs-on: ubuntu-22.04
steps:
Owner

I think several of these could probably be improved upon by drawing from other helpful GitHub actions. For example, installing Rust and installing crate-binaries have pretty good (and more optimized) variants. The other workflow files may be able to provide some inspiration.

Contributor Author

@bazhenov bazhenov Dec 20, 2023

  • migrated to dtolnay/rust-toolchain
  • using lock files when building benchmarks
  • added support for taiki-e/install-action to cargo-export

set -eo pipefail

target/benchmarks/search_ord --color=never compare target/benchmarks/search_ord \
-ot 1 --fail-threshold 10 | tee target/benchmark.txt
Owner

What is -ot here?
Also, what is the unit for --fail-threshold?

Contributor Author

  • -t – duration of each test (in seconds)
  • -o – filter outliers. Because tango measures the difference directly, it can detect and remove severe observations that manifest themselves symmetrically (in both algorithms) and are therefore considered independent of the algorithms' performance.
  • --fail-threshold – exit with a non-zero exit code if the candidate is slower than the baseline by the given percentage. Tango also performs a z-test and fails only if the difference is statistically significant (planning to move to a bootstrap later on).

More on cli-arguments

@jonhoo
Owner

jonhoo commented Dec 18, 2023

Only semi-related, but https://github.com/bheisler/iai might also be an interesting thing for us to run in CI

@bazhenov
Contributor Author

  • This feels like it's useful both for "compare ordsearch to btree" and for "compare ordsearch between branch A and branch B". My understanding from the CI file you added is that this only tests the latter for PRs, though it provides a way to run the former on-demand. Is that right?

Yes, you can compare vec/ord/btree in any pairing you'd like. ord (master) vs. ord (branch) was selected for the GHA workflow because in that case we have an obvious expectation of 0% as the performance regression target.

  • This is still susceptible to performance differences across machines, meaning we probably still wouldn't check in the target/benchmarks I assume. Just double-checking.

Correct, this is the idea. To conclude that there is a difference between two algorithms (or a lack thereof) you need to execute both of them on the same machine, with the same input, at the "same time". The last requirement is strictly speaking impossible, so Tango executes the algorithms in an alternating fashion, so that within any short window of time (~50 ms) both algorithms are executed.
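
For illustration, the pairing idea boils down to something like the following sketch. This is not tango's actual scheduler; the functions and numbers are made up, and it only shows why the per-sample difference is the statistic of interest.

use std::{hint::black_box, time::Instant};

// Run baseline and candidate back to back inside the same short window and
// keep only the difference; platform-wide noise largely cancels out in it.
fn paired_sample(baseline: impl Fn(), candidate: impl Fn(), iterations: u32) -> i128 {
    let t = Instant::now();
    for _ in 0..iterations { baseline(); }
    let base_ns = t.elapsed().as_nanos() as i128;

    let t = Instant::now();
    for _ in 0..iterations { candidate(); }
    let cand_ns = t.elapsed().as_nanos() as i128;

    cand_ns - base_ns
}

fn main() {
    let v: Vec<u64> = (0..1024).collect();
    let diffs: Vec<i128> = (0..100)
        .map(|_| paired_sample(
            || { black_box(v.iter().sum::<u64>()); },
            || { black_box(v.iter().rev().sum::<u64>()); },
            1_000,
        ))
        .collect();
    let mean = diffs.iter().sum::<i128>() / diffs.len() as i128;
    println!("mean Δ per sample: {mean} ns");
}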

  • Do you have a hypothesis about why criterion seemingly consistently reports a larger performance diff than tango does? 44.05% vs 43.35% is a decently-chunked difference, and seems pretty stable in the diagrams you provided.

Hard to say for sure, but here are some of my thoughts:

  • when executing two algorithms on the same core, each algorithm effectively gets half of the resources: memory throughput, data/instruction/TLB caches, uop-cache slots. This can bias the measurement slightly if one of the algorithms is more memory-demanding than the other, which is basically the case for the Eytzinger layout.
  • in my version of the benchmark I rotate the collection being searched, while in the criterion benchmarks the collection is generated once per test.

I'll look into the matter closely.

@bazhenov
Contributor Author

Only semi-related, but https://github.com/bheisler/iai might also be an interesting thing for us to run in CI

Yes, indeed. Using hardware performance counters is another interesting school of thought. Intel/AMD have done a great job with PMCs, and I believe there is great potential in using them for benchmarking. I have no idea how good the situation is on aarch64, though.

@jonhoo
Owner

jonhoo commented Jan 19, 2024

Tried using Tango for a thing (whose code I unfortunately can't share), and I'm sometimes seeing decently large fluctuations for very-fast operations even when comparing against the exact same binary (i.e., running export and then immediately bench -- compare). For example:

array::unshift                                     [  15.2 us ...  15.2 us ]      +0.11%
array::delete                                      [  19.1 us ...  19.1 us ]      +0.05%
array::update                                      [  39.3 us ...  38.7 us ]      -1.52%*
array::insert                                      [  19.5 us ...  19.6 us ]      +0.29%
array::push                                        [  15.1 us ...  15.1 us ]      +0.37%
map::insert                                        [ 582.9 ns ... 569.3 ns ]      -2.27%*
map::remove                                        [ 281.3 ns ... 276.3 ns ]      -1.66%*
map::update                                        [ 645.0 ns ... 638.0 ns ]      -1.05%*
register::write                                    [ 235.6 ns ... 232.6 ns ]      -1.17%*
register::clear                                    [  76.1 ns ...  74.6 ns ]      -1.57%*

This was with -ot 5 on an M2 Pro Mac. Ideas on why the observed variance might still be that high?

@bazhenov
Contributor Author

bazhenov commented Jan 20, 2024

Can you please rerun the tests with the following options added: -vd target/dumps (-v is verbose mode, -d dumps individual observations into the target/dumps directory)? And please make sure you're using tango from the dev branch, which has some important changes to the scheduler strategy that are not yet available in a release.

Ideas on why the observed variance might still be that high?

In my experience, when an A/A test gives false positives it boils down to one of the following problems:

  1. scheduling outliers. These are more frequent on laptops, where a lot of background activity competes for the CPU, than on servers.
  2. cache eviction fairness. I don't know of any way to ensure that both tested functions experience the same number of cache misses when the test payload exceeds the L2/L3 cache size. This is why I added the --cache-firewall option, which spoils the CPU cache between samples; e.g. --cache-firewall 1024 will perform a 1 MB read between samples. The L2 size is a good starting point (see the sketch after this list).
  3. code alignment issues. Sadly, even identical machine code can have different performance when located at different memory addresses. This is because of the intricacies of uop caching in modern CPUs (see Causes of Performance Instability due to Code Placement in X86). I made a very simple example some time ago that shows this problem for x86 processors. The difference usually lies within the 5-10% range. It becomes even sadder when you consider that modern OSes place the binary image at a random memory address each time for security reasons (ASLR). On x86 it helps to align functions on a 32-byte boundary. I got my hands on an M3 just a week ago, so I don't know whether this problem is relevant for Apple Silicon or not. To the best of my knowledge, there is no information on whether Apple Silicon uses a uop cache, but some ARM microarchitectures, such as the Cortex-A77, do.
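
As an illustration of item 2, a cache-firewall pass between samples is conceptually just a read over a buffer larger than the cache level you want to spoil (a sketch, not tango's implementation; the 1 MB size is the example value from above):

use std::hint::black_box;

// Touch one byte per cache line of a large buffer so that both functions
// start the next sample from a similarly cold cache.
fn cache_firewall(buf: &[u8]) {
    let mut acc = 0u64;
    for chunk in buf.chunks(64) {
        acc = acc.wrapping_add(chunk[0] as u64);
    }
    black_box(acc);
}

fn main() {
    let buf = vec![0u8; 1024 * 1024]; // roughly an L2-sized pass
    cache_firewall(&buf);             // would be called between samples
}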

If you are able to gather dumps (-d) of individual samples, those will help differentiate (1) from (2). When (1) is the case you will see that individual observations have a large difference but the overall performance is the same, while in the case of (2) you usually see a stable difference between the two copies of the same algorithm across all observations.

@jonhoo
Owner

jonhoo commented Jan 22, 2024

The earlier results were with 0.2. With dev I get (across two runs):

$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift                                     [  15.0 us ...  14.9 us ]      -0.55%*
array::delete                                      [  19.3 us ...  19.0 us ]      -1.41%*
array::update                                      [  39.4 us ...  39.2 us ]      -0.59%*
array::insert                                      [  20.0 us ...  19.8 us ]      -0.83%*
array::push                                        [  15.1 us ...  15.0 us ]      -0.59%*
map::insert                                        [ 581.5 ns ... 583.3 ns ]      +0.30%
map::remove                                        [ 280.2 ns ... 281.4 ns ]      +0.43%
map::update                                        [ 634.8 ns ... 636.9 ns ]      +0.35%
register::write                                    [ 235.4 ns ... 234.0 ns ]      -0.62%*
register::clear                                    [  76.5 ns ...  75.6 ns ]      -1.17%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift                                     [  15.0 us ...  15.1 us ]      +0.21%
array::delete                                      [  19.3 us ...  19.3 us ]      -0.13%
array::update                                      [  38.9 us ...  38.3 us ]      -1.49%*
array::insert                                      [  19.8 us ...  19.6 us ]      -0.92%*
array::push                                        [  15.1 us ...  15.1 us ]      +0.40%
map::insert                                        [ 574.8 ns ... 570.2 ns ]      -0.79%*
map::remove                                        [ 287.8 ns ... 283.1 ns ]      -1.62%*
map::update                                        [ 619.7 ns ... 636.9 ns ]      +2.77%*
register::write                                    [ 237.0 ns ... 236.4 ns ]      -0.24%
register::clear                                    [  75.5 ns ...  76.2 ns ]      +0.86%*

I got an error ("no such file or directory") when I did -vd target/dumps, so had to mkdir that myself. After that though, it did the thing:

array::unshift  (n: 86, outliers: 10)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         14.9 us         15.1 us        175.1 ns  +1.17%*
    min          │         14.7 us         14.9 us        157.0 ns
    max          │         15.7 us         15.9 us        142.1 ns
    std. dev.    │        220.4 ns        231.5 ns        102.8 ns

array::delete  (n: 212, outliers: 12)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         18.9 us         19.1 us        178.5 ns  +0.94%*
    min          │         18.7 us         18.9 us        206.4 ns
    max          │         20.0 us         20.1 us        145.4 ns
    std. dev.    │        208.4 ns        197.9 ns         94.5 ns

array::update  (n: 82, outliers: 22)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         38.4 us         39.2 us        763.3 ns  +1.99%*
    min          │         38.0 us         38.7 us        749.8 ns
    max          │         40.1 us         41.0 us        875.4 ns
    std. dev.    │        466.2 ns        507.3 ns        186.9 ns

array::insert  (n: 82, outliers: 22)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         19.6 us         19.7 us        121.1 ns  +0.62%*
    min          │         19.1 us         19.2 us         95.7 ns
    max          │         20.8 us         21.0 us        185.8 ns
    std. dev.    │        427.9 ns        432.6 ns        140.8 ns

array::push  (n: 78, outliers: 26)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         14.9 us         15.0 us         71.6 ns  +0.48%
    min          │         14.8 us         14.8 us         48.3 ns
    max          │         15.9 us         16.0 us         81.7 ns
    std. dev.    │        180.7 ns        176.1 ns         55.2 ns

map::insert  (n: 1598, outliers: 90)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │        589.7 ns        588.1 ns         -1.6 ns  -0.27%
    min          │        559.7 ns        561.7 ns          1.9 ns
    max          │        737.5 ns        724.4 ns        -13.1 ns
    std. dev.    │         12.4 ns         12.5 ns          7.8 ns

map::remove  (n: 3318, outliers: 266)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │        280.6 ns        278.9 ns         -1.7 ns  -0.62%*
    min          │        265.6 ns        267.0 ns          1.4 ns
    max          │        349.6 ns        347.6 ns         -2.0 ns
    std. dev.    │          6.1 ns          6.0 ns          4.6 ns

map::update  (n: 1460, outliers: 92)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │        647.1 ns        641.1 ns         -6.0 ns  -0.92%*
    min          │        611.9 ns        598.5 ns        -13.4 ns
    max          │        795.9 ns        799.8 ns          3.8 ns
    std. dev.    │         13.8 ns         13.6 ns          8.3 ns

register::write  (n: 3780, outliers: 476)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │        235.4 ns        235.3 ns         -0.1 ns  -0.05%
    min          │        225.4 ns        223.4 ns         -1.9 ns
    max          │        301.9 ns        294.3 ns         -7.6 ns
    std. dev.    │          4.4 ns          4.4 ns          3.2 ns

register::clear  (n: 11824, outliers: 1648)
                          baseline       candidate               ∆
                 ╭────────────────────────────────────────────────
    mean         │         74.4 ns         74.0 ns         -0.4 ns  -0.60%*
    min          │         68.8 ns         67.6 ns         -1.1 ns
    max          │        125.0 ns        125.0 ns            0 ns
    std. dev.    │          2.1 ns          2.2 ns          1.5 ns

Looking at target/dumps/map::update.csv, I see individual differences ranging from 0 to ~5000 (third column) in a fairly uniform manner. Looking across the runtime of a single column, I see a very wide range of absolute numbers, ranging from ~2000 to ~3324833.

Given the wide range in execution times, I wondered if maybe this was caused by HashMap randomness, but even after switching all the maps to have ahash::RandomState with a fixed seed, I observe the same kind of variance.

@bazhenov
Contributor Author

Does this effect manifest itself when you test the binary against itself? You can omit the other executable, in which case tango will test the binary against itself, ensuring the same code is used for base and candidate, e.g.

$ cargo bench -q --bench=micro -- compare -ot 5

@jonhoo
Owner

jonhoo commented Jan 22, 2024

It certainly does look like the re-build changes the performance characteristics of the binary (even though it's the exact same environment, directory, code, compiler, and toolchain):

$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift                                     [   8.8 us ...   8.8 us ]      -0.09%
array::delete                                      [  13.7 us ...  13.7 us ]      -0.01%
array::update                                      [  26.6 us ...  26.7 us ]      +0.27%
array::insert                                      [  13.3 us ...  13.3 us ]      +0.09%
array::push                                        [   8.5 us ...   8.4 us ]      -0.08%
map::insert                                        [ 572.0 ns ... 572.3 ns ]      +0.05%
map::remove                                        [ 303.0 ns ... 303.1 ns ]      +0.02%
map::update                                        [ 606.7 ns ... 606.6 ns ]      -0.02%
register::write                                    [ 219.5 ns ... 219.4 ns ]      -0.08%
register::clear                                    [  82.9 ns ...  82.9 ns ]      -0.01%
$ cargo export target/benchmarks -- bench --bench=micro
    Finished bench [optimized + debuginfo] target(s) in 0.66s
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift                                     [   8.5 us ...   8.4 us ]      -1.08%*
array::delete                                      [  13.2 us ...  13.1 us ]      -0.84%*
array::update                                      [  26.4 us ...  26.2 us ]      -0.87%*
array::insert                                      [  13.5 us ...  13.4 us ]      -0.81%*
array::push                                        [   8.5 us ...   8.4 us ]      -1.43%*
map::insert                                        [ 572.9 ns ... 577.7 ns ]      +0.85%*
map::remove                                        [ 303.7 ns ... 300.8 ns ]      -0.97%*
map::update                                        [ 610.0 ns ... 607.0 ns ]      -0.48%
register::write                                    [ 225.0 ns ... 225.0 ns ]      -0.02%
register::clear                                    [  84.7 ns ...  83.6 ns ]      -1.29%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift                                     [   8.6 us ...   8.6 us ]      -0.18%
array::delete                                      [  13.7 us ...  13.6 us ]      -0.09%
array::update                                      [  27.4 us ...  27.3 us ]      -0.24%
array::insert                                      [  13.9 us ...  13.8 us ]      -0.46%
array::push                                        [   8.7 us ...   8.7 us ]      +0.11%
map::insert                                        [ 574.9 ns ... 570.3 ns ]      -0.80%*
map::remove                                        [ 307.0 ns ... 309.3 ns ]      +0.76%*
map::update                                        [ 610.8 ns ... 608.4 ns ]      -0.39%
register::write                                    [ 225.6 ns ... 223.3 ns ]      -1.03%*
register::clear                                    [  86.8 ns ...  82.7 ns ]      -4.80%*
$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift                                     [   8.5 us ...   8.5 us ]      +0.05%
array::delete                                      [  13.1 us ...  13.1 us ]      +0.06%
array::update                                      [  26.3 us ...  26.3 us ]      -0.01%
array::insert                                      [  13.4 us ...  13.4 us ]      +0.07%
array::push                                        [   8.5 us ...   8.5 us ]      +0.07%
map::insert                                        [ 567.3 ns ... 567.5 ns ]      +0.04%
map::remove                                        [ 305.6 ns ... 305.4 ns ]      -0.06%
map::update                                        [ 611.6 ns ... 611.7 ns ]      +0.02%
register::write                                    [ 221.3 ns ... 221.4 ns ]      +0.02%
register::clear                                    [  84.1 ns ...  84.1 ns ]      -0.03%

@bazhenov
Contributor Author

bazhenov commented Jan 22, 2024

I encounter such behavior on x86 occasionally. Usually it happens after some change to code that is unrelated to the code being tested. But I have never faced a situation where two binaries built one after the other have different performance. 😕

Theoretically it might happen because the function linking order changes for whatever reason, which in turn may change the function layout and subsequently the effectiveness of different optimizations. There is at least one publication describing this effect: Producing Wrong Data Without Doing Anything Obviously Wrong

@bazhenov
Contributor Author

It certainly does look like the re-build changes the performance characteristics of the binary

Could you please do some sanity checks on the assembly and make sure cargo bench and cargo export produce the same assembly for the test functions? You could use cargo asm for that, but I prefer:

  • nm [BINARY] to get function name
  • objdump --disassemble-symbols=[FN_NAME] [BINARY]

@jonhoo
Owner

jonhoo commented Jan 22, 2024

I'll do you one better:

$ shasum a
1cd583440bff0f5befae11ace8de8837ee7f0131  a
$ shasum b
1cd583440bff0f5befae11ace8de8837ee7f0131  b
$ ./a compare -ot 5
array::unshift                                     [   8.7 us ...   8.8 us ]      +0.05%
array::delete                                      [  13.9 us ...  13.9 us ]      +0.08%
array::update                                      [  27.3 us ...  27.3 us ]      +0.03%
array::insert                                      [  13.8 us ...  13.8 us ]      +0.02%
array::push                                        [   8.7 us ...   8.7 us ]      -0.05%
map::insert                                        [ 576.3 ns ... 576.4 ns ]      +0.02%
map::remove                                        [ 306.2 ns ... 306.1 ns ]      -0.02%
map::update                                        [ 624.0 ns ... 623.8 ns ]      -0.02%
register::write                                    [ 224.7 ns ... 224.7 ns ]      -0.01%
register::clear                                    [  84.9 ns ...  84.9 ns ]      +0.02%
$ ./a compare a -ot 5
array::unshift                                     [   8.8 us ...   8.8 us ]      -0.23%
array::delete                                      [  13.7 us ...  13.7 us ]      +0.14%
array::update                                      [  27.6 us ...  27.6 us ]      +0.01%
array::insert                                      [  14.0 us ...  14.0 us ]      -0.02%
array::push                                        [   8.6 us ...   8.6 us ]      +0.06%
map::insert                                        [ 583.6 ns ... 584.4 ns ]      +0.14%
map::remove                                        [ 311.2 ns ... 311.3 ns ]      +0.02%
map::update                                        [ 631.8 ns ... 631.5 ns ]      -0.04%
register::write                                    [ 231.6 ns ... 231.7 ns ]      +0.04%
register::clear                                    [  87.0 ns ...  87.0 ns ]      -0.01%
$ ./a compare b -ot 5
array::unshift                                     [   8.7 us ...   8.7 us ]      -0.95%*
array::delete                                      [  13.9 us ...  14.0 us ]      +1.02%*
array::update                                      [  27.8 us ...  28.2 us ]      +1.40%*
array::insert                                      [  14.2 us ...  14.5 us ]      +2.07%*
array::push                                        [   8.8 us ...   8.7 us ]      -0.88%*
map::insert                                        [ 575.6 ns ... 586.0 ns ]      +1.81%*
map::remove                                        [ 315.4 ns ... 302.0 ns ]      -4.25%*
map::update                                        [ 625.1 ns ... 636.7 ns ]      +1.86%*
register::write                                    [ 226.3 ns ... 227.3 ns ]      +0.44%
register::clear                                    [  85.9 ns ...  85.8 ns ]      -0.20%

a and b here are just copies of target/benchmarks/micro.

@bazhenov
Contributor Author

bazhenov commented Jan 22, 2024

OK, I think I can explain that. In the first two cases we are executing the same code (although mapped twice into different parts of virtual memory), while in the third case we are executing different, although identical, code.

Does --sampler=flat help? By default each sample consists of a random number of iterations; with small iteration counts this forces the CPU to switch frequently between the code of the base and candidate functions. The flat sampler forces each sample to be a fixed number of iterations (approximately 50 ms of execution time).
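
In other words, the difference between the two sampling strategies is roughly the following (a sketch, not tango's code; the 50 ms target is the value mentioned above):

use std::time::Duration;

// Default (sketched): a varying iteration count per sample, so execution
// switches between baseline and candidate code more often.
fn random_iterations(state: &mut u64, max: usize) -> usize {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    1 + (*state as usize) % max
}

// Flat: a fixed iteration count sized so that one sample takes ~50 ms.
fn flat_iterations(time_per_iteration: Duration, target: Duration) -> usize {
    (target.as_nanos() / time_per_iteration.as_nanos().max(1)).max(1) as usize
}

fn main() {
    let mut state = 0x9E3779B97F4A7C15;
    println!("random: {} iterations", random_iterations(&mut state, 10_000));
    let per_iteration = Duration::from_nanos(600);
    println!("flat:   {} iterations", flat_iterations(per_iteration, Duration::from_millis(50)));
}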

@jonhoo
Owner

jonhoo commented Jan 22, 2024

Ah, you mean in the sense that the files are memory-mapped, so the page tables behind the scenes map them back to the same backing memory or something (for a vs a)?

I would say --sampler=flat does not help meaningfully:

$ ./a compare b -ot 5 --sampler=flat
array::unshift                                     [   8.5 us ...   8.5 us ]      -0.27%
array::delete                                      [  13.6 us ...  13.4 us ]      -1.11%*
array::update                                      [  27.3 us ...  27.1 us ]      -0.77%*
array::insert                                      [  13.8 us ...  13.6 us ]      -1.07%*
array::push                                        [   8.5 us ...   8.5 us ]      -0.12%
map::insert                                        [ 556.6 ns ... 567.9 ns ]      +2.03%*
map::remove                                        [ 299.4 ns ... 298.5 ns ]      -0.30%
map::update                                        [ 604.3 ns ... 601.2 ns ]      -0.50%*
register::write                                    [ 222.4 ns ... 217.8 ns ]      -2.08%*
register::clear                                    [  82.6 ns ...  90.0 ns ]      +8.99%*

@bazhenov
Contributor Author

Ah, you mean in the sense that the files are memory-mapped, so the page tables behind the scenes map them back to the same backing memory or something (for a vs a)?

Yes, exactly. After the TLB lookup, all memory-mapped copies of the same file should resolve to the same physical addresses in the OS buffer pool. At least that is how I conceptualize how the OS works.

I'm trying to reproduce this effect on M3 Max using some tango tests I have at hand, but still no luck.

@jonhoo
Owner

jonhoo commented Jan 24, 2024

With 0.2, I also notice that the output is sometimes entirely empty when the only difference between the binaries being compared is that one of them has had a dependency updated that is unrelated to the benchmark. Is that a known problem? It would maybe be good to print something to STDERR when benchmarks are skipped for whatever reason (e.g., if they're found to exist in only one of the binaries).

Relatedly: any plans to ship dev as 0.3?

@bazhenov
Contributor Author

With 0.2, I also notice that the output is sometimes entirely empty when the only difference between the binaries being compared is that one of them has had a dependency updated that is unrelated to the benchmark. Is that a known problem? It would maybe be good to print something to STDERR when benchmarks are skipped for whatever reason (e.g., if they're found to exist in only one of the binaries).

Actually, this is a good idea. 👍 I will do it.

Relatedly: any plans to ship dev as 0.3?

I will fix those minor issues you've reported and release 0.3 in a few days.

@bazhenov
Contributor Author

bazhenov commented Jan 24, 2024

Also, I think there is one more thing you can try. There is an implicit default limit on the number of iterations per sample. Last time, when we tried --sampler=flat, I forgot to tell you to disable it.

tango_main!(MeasurementSettings {
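    // lifts the implicit per-sample iteration cap described above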
    max_iterations_per_sample: usize::max_value(),
    ..tango_bench::DEFAULT_SETTINGS
});

@jonhoo
Owner

jonhoo commented Jan 24, 2024

Tried that now, and it doesn't seem to have made much of a difference. One interesting thing I just noticed, though, is that there's a bit of a "pattern" to the results. Notice how the +s and -s seem to happen in waves? There's a sequence of +s, then a sequence of -s, then +s, etc., even across multiple invocations:

$ ./c compare d -ot 5 --sampler=flat
array::unshift                                     [   8.6 us ...   8.6 us ]      -0.72%*
array::delete                                      [  13.5 us ...  13.5 us ]      +0.05%
array::update                                      [  26.9 us ...  26.9 us ]      -0.08%
array::insert                                      [  13.7 us ...  13.7 us ]      +0.26%
array::push                                        [   8.6 us ...   8.5 us ]      -0.51%*
map::insert                                        [ 562.1 ns ... 561.6 ns ]      -0.08%
map::remove                                        [ 299.1 ns ... 300.3 ns ]      +0.39%
map::update                                        [ 604.6 ns ... 613.1 ns ]      +1.41%*
register::write                                    [ 219.8 ns ... 225.9 ns ]      +2.77%*
register::clear                                    [  85.0 ns ...  86.2 ns ]      +1.38%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift                                     [   8.6 us ...   8.7 us ]      +1.82%*
array::delete                                      [  13.4 us ...  13.6 us ]      +1.43%*
array::update                                      [  26.8 us ...  27.3 us ]      +1.68%*
array::insert                                      [  13.6 us ...  13.9 us ]      +1.70%*
array::push                                        [   8.5 us ...   8.7 us ]      +1.92%*
map::insert                                        [ 571.1 ns ... 573.5 ns ]      +0.41%
map::remove                                        [ 307.4 ns ... 296.9 ns ]      -3.42%*
map::update                                        [ 615.9 ns ... 607.5 ns ]      -1.37%*
register::write                                    [ 225.0 ns ... 224.2 ns ]      -0.38%
register::clear                                    [  86.1 ns ...  82.3 ns ]      -4.42%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift                                     [   8.5 us ...   8.5 us ]      +0.65%*
array::delete                                      [  13.3 us ...  13.3 us ]      +0.29%
array::update                                      [  26.7 us ...  26.8 us ]      +0.34%
array::insert                                      [  13.5 us ...  13.6 us ]      +0.46%
array::push                                        [   8.5 us ...   8.6 us ]      +0.58%*
map::insert                                        [ 560.3 ns ... 562.7 ns ]      +0.44%
map::remove                                        [ 302.1 ns ... 298.7 ns ]      -1.11%*
map::update                                        [ 619.4 ns ... 611.1 ns ]      -1.34%*
register::write                                    [ 223.3 ns ... 220.0 ns ]      -1.44%*
register::clear                                    [  85.5 ns ...  82.9 ns ]      -3.05%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift                                     [   8.5 us ...   8.6 us ]      +0.75%*
array::delete                                      [  13.7 us ...  13.5 us ]      -1.51%*
array::update                                      [  27.5 us ...  27.1 us ]      -1.54%*
array::insert                                      [  13.9 us ...  13.8 us ]      -1.18%*
array::push                                        [   8.6 us ...   8.6 us ]      +0.65%*
map::insert                                        [ 561.8 ns ... 570.0 ns ]      +1.47%*
map::remove                                        [ 302.3 ns ... 306.1 ns ]      +1.26%*
map::update                                        [ 613.5 ns ... 617.5 ns ]      +0.66%*
register::write                                    [ 220.0 ns ... 224.1 ns ]      +1.89%*
register::clear                                    [  85.7 ns ...  83.6 ns ]      -2.37%*

@bazhenov
Contributor Author

0.3 released

@jonhoo
Owner

jonhoo commented Jan 29, 2024

Another feature request: have all the benchmarks run even if one of the earlier results exceeds the threshold, and only fail at the end!

@bazhenov
Contributor Author

bazhenov commented Feb 3, 2024

Another feature request: have all the benchmarks run even if one of the earlier results exceeds the threshold, and only fail at the end!

Implemented in bazhenov/tango#15 (0.4.0 published)

@bazhenov
Contributor Author

bazhenov commented Feb 13, 2024

Ah, you mean in the sense that the files are memory-mapped, so the page tables behind the scenes map them back to the same backing memory or something (for a vs a)?

It is even simpler than that, as it turns out. The macOS dynamic linker is capable of linking an executable against itself without loading the same executable twice into virtual memory.

If you are testing two different executables, you'll have distinct mappings for each binary in virtual memory:

$ ./search_ord compare search_ord2

$ vmmap `pgrep search_ord`  | grep search_ord
Virtual Memory Map of process 34763 (search_ord)
__TEXT                      100130000-1002b4000    [ 1552K  1552K     0K     0K] r-x/r-x SM=COW          /Users/USER/*/search_ord
__DATA_CONST                1002b4000-1002cc000    [   96K    80K     0K     0K] r--/rw- SM=COW          /Users/USER/*/search_ord
__LINKEDIT                  1002d0000-1003c8000    [  992K   992K     0K     0K] r--/r-- SM=COW          /Users/USER/*/search_ord
__DATA                      1002cc000-1002d0000    [   16K    16K    16K     0K] rw-/rw- SM=COW          /Users/USER/*/search_ord
__TEXT                      10086c000-1009f0000    [ 1552K   608K     0K     0K] r-x/rwx SM=COW          /Users/USER/*/search_ord2
__DATA_CONST                1009f0000-100a08000    [   96K    96K    96K     0K] r--/rwx SM=COW          /Users/USER/*/search_ord2
__LINKEDIT                  100a0c000-100b04000    [  992K   160K     0K     0K] r--/rwx SM=COW          /Users/USER/*/search_ord2
__DATA                      100a08000-100a0c000    [   16K    16K    16K     0K] rw-/rwx SM=COW          /Users/USER/*/search_ord2

But the dynamic linker is smart enough not to load the file twice if you test the binary against itself.

$ ./search_ord compare search_ord

$ vmmap `pgrep search_ord`  | grep search_ord
Virtual Memory Map of process 34821 (search_ord)
__TEXT                      10049c000-100620000    [ 1552K  1552K     0K     0K] r-x/r-x SM=COW          /Users/USER/*/search_ord
__DATA_CONST                100620000-100638000    [   96K    80K     0K     0K] r--/rw- SM=COW          /Users/USER/*/search_ord
__LINKEDIT                  10063c000-100734000    [  992K   992K     0K     0K] r--/r-- SM=COW          /Users/USER/*/search_ord
__DATA                      100638000-10063c000    [   16K    16K    16K     0K] rw-/rw- SM=COW          /Users/USER/*/search_ord

Remarkably, this deduplication works even if you're using hard links on the filesystem (at least in the case of APFS).

I was able to reproduce this effect at least partially, though not with the same magnitude as in your case, but still... I'll continue investigating the issue.
