Draft: Experimental Tango support #25
base: main
Conversation
Obviously, I need to do some additional work on tooling and Windows support 😂
33d17c7 to 31ff8f2
Until tango benchmarks on the main
This is really cool, thanks for sharing!
A few thoughts:
- This feels like it's useful both for "compare ordsearch to btree" and for "compare ordsearch between branch A and branch B". My understanding from the CI file you added is that this only tests the latter for PRs, though it provides a way to run the former on-demand. Is that right?
- This is still susceptible to performance differences across machines, meaning we probably still wouldn't check in target/benchmarks, I assume. Just double-checking.
- Do you have a hypothesis about why criterion seemingly consistently reports a larger performance diff than tango does? 44.05% vs 43.35% is a decently-chunked difference, and seems pretty stable in the diagrams you provided.
@@ -0,0 +1,3 @@
fn main() {
    println!("cargo:rustc-link-arg-benches=-rdynamic");
nit: should also add cargo:rerun-if-changed=build.rs so the build script isn't re-run all the time
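A minimal sketch of what the combined build script could look like if the nit is applied (same file as in the diff above):

```rust
// build.rs
fn main() {
    // Only re-run this build script when build.rs itself changes.
    println!("cargo:rerun-if-changed=build.rs");
    // Link benchmark executables with -rdynamic so they can also be loaded as dynamic libraries.
    println!("cargo:rustc-link-arg-benches=-rdynamic");
}
```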
benches/search_comparison.rs (Outdated)
/// Because benchmarks are built with the linker flag -rdynamic, there should be a library entry point defined
/// in all benchmarks. This is only needed when two harnesses are used.
This feels like a shortcoming in tango — shouldn't it be able to recognize if a particular binary has no entry point for it?
Yes, the tango executable is the dynamic library (benchmarks) and the executable (the runner) at the same time. To achieve this, some linker shenanigans are required. But I was able to pull it off without requiring a dummy entry point. This code is already removed. Thank you.
}
}

impl<C: FromSortedVec> Generator for RandomCollection<C>
I'm confused about why haystack and needle are first-class primitives in Tango — shouldn't it only be thinking in terms of "inputs" and "benchmarks"? If it also delineates between haystacks and needles, does that mean it doesn't generalize to other kinds of benchmarks without such concepts (e.g., factorials)? Alternatively, if it does support "general" benchmarks, why have this specialization built into it as well?
They are more of second-class primitives. For paired tests to be sensitive, they need to operate on the same input. If the input is updated between iterations, it should be updated at the same time for both algorithms. So the harness needs a way to control this process. This is how the Generator trait was born.
The haystack/needle distinction is unusual; it's roughly the "big" and "small" parts of the input. The haystack is reused between benchmark samples and is assumed to be costly to generate (thus its generation time is not included in the benchmark time). The needle is assumed to be cheap; it is generated for each iteration, and its generation time is included in the benchmark time.
This scheme can be generalized to arbitrary input, although it's not pretty. But it mirrors a very important distinction present in almost any benchmarking harness, including criterion: under the hood we almost never time a single function call (an iteration), but a bunch of them (a sample). Some parts of the input can be cheaply generated per iteration, others cannot. This is the needle/haystack separation.
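To make the split concrete, here is an illustrative sketch (not Tango's actual API; the trait and type names below are made up for the example) of how a harness could separate the per-sample haystack from the per-iteration needle:

```rust
/// Illustrative only: a harness-side view of the haystack/needle split.
trait InputSource {
    type Haystack;
    type Needle;

    /// Called once per sample; may be expensive. Its cost is NOT measured.
    fn next_haystack(&mut self) -> Self::Haystack;

    /// Called once per iteration; should be cheap. Its cost IS measured.
    fn next_needle(&mut self, haystack: &Self::Haystack) -> Self::Needle;
}

/// Example: a sorted vector as the haystack, a pseudo-random element as the needle.
struct SortedVecSource {
    size: usize,
    seed: u64,
}

impl InputSource for SortedVecSource {
    type Haystack = Vec<u32>;
    type Needle = u32;

    fn next_haystack(&mut self) -> Vec<u32> {
        // Costly: build the whole collection once per sample.
        (0..self.size as u32).collect()
    }

    fn next_needle(&mut self, haystack: &Vec<u32>) -> u32 {
        // Cheap: pick an element per iteration with a toy LCG.
        self.seed = self.seed.wrapping_mul(6364136223846793005).wrapping_add(1);
        haystack[self.seed as usize % haystack.len()]
    }
}
```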
I'm still unsure how easy this concept is to understand. Maybe another metaphor should be chosen, so I'm open to any feedback.
jobs:
  bench:
    runs-on: ubuntu-22.04
    steps:
I think several of these could probably be improved upon by drawing from other helpful GitHub Actions. For example, installing Rust and installing crate binaries have pretty good (and more optimized) variants. The other workflow files may be able to provide some inspiration.
- migrated to dtolnay/rust-toolchain
- using lock files when building benchmarks
- added support for taiki-e/install-action to cargo-export
set -eo pipefail

target/benchmarks/search_ord --color=never compare target/benchmarks/search_ord \
  -ot 1 --fail-threshold 10 | tee target/benchmark.txt
What is -ot here? Also, what is the unit for --fail-threshold?
-t – duration of each test (in seconds)
-o – filter outliers. Because tango measures the difference directly, it can detect and remove severe observations that manifest themselves symmetrically (in both algorithms) and are therefore considered independent of the algorithms' performance.
--fail-threshold – exit with a non-zero exit code if the candidate is slower than the baseline by the given percentage. Tango also does a z-test and fails only if the difference is statistically significant (planning to move to bootstrap later on).
More on cli-arguments
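So in the CI invocation above, -ot 1 --fail-threshold 10 means: filter outliers (-o), run each test for 1 second (-t 1), and fail if the candidate is more than 10% slower than the baseline. Spelled out with the short flags separated (a sketch reusing the search_ord benchmark from this PR):

```sh
# same meaning as "-ot 1 --fail-threshold 10" in the workflow script
target/benchmarks/search_ord compare target/benchmarks/search_ord \
  -o -t 1 --fail-threshold 10
```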
Only semi-related, but https://github.com/bheisler/iai might also be an interesting thing for us to run in CI
Yes, you can compare
Correct. This is the idea. To conclude there is a difference between two algorithms (or lack thereof) you need to execute both of them on the same machine with the same input at the "same time". The last requirement is, strictly speaking, impossible, so Tango executes the algorithms in an alternating fashion, so that within any short span of time (~50ms) both algorithms are executed.
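A rough sketch of that idea (illustrative only, not Tango's actual implementation; all names here are made up):

```rust
use std::time::{Duration, Instant};

/// Alternate between baseline and candidate on the same input and record the
/// per-pair time difference, so machine-wide noise hits the pair, not the diff.
fn paired_samples(
    baseline: impl Fn(&[u32]) -> u64,
    candidate: impl Fn(&[u32]) -> u64,
    input: &[u32],
    budget: Duration,
) -> Vec<i128> {
    let mut diffs = Vec::new();
    let start = Instant::now();
    while start.elapsed() < budget {
        // Both algorithms see the same input within the same short time window.
        let t0 = Instant::now();
        std::hint::black_box(baseline(input));
        let base_ns = t0.elapsed().as_nanos() as i128;

        let t1 = Instant::now();
        std::hint::black_box(candidate(input));
        let cand_ns = t1.elapsed().as_nanos() as i128;

        diffs.push(cand_ns - base_ns);
    }
    diffs
}
```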
Hard to say for sure, but here are some of my thoughts:
I'll look into the matter closely.
Yes, indeed. Using performance metrics is another interesting school of thought. Intel/AMD have done a great job with PMC counters. I believe there is great potential in using PMCs for benchmarking. I have no idea how good the situation is on aarch64, though.
Tried using Tango for a thing (whose code I unfortunately can't share), and I'm sometimes seeing decently large fluctuations for very-fast operations even when comparing against the exact same binary (i.e., running
This was with
Can you please rerun the tests with the following options added:
In my experience, when an A/A test gives false positives it boils down to one of the following problems:
If you are able to gather dumps (
The earlier results were with 0.2. With

$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 15.0 us ... 14.9 us ] -0.55%*
array::delete [ 19.3 us ... 19.0 us ] -1.41%*
array::update [ 39.4 us ... 39.2 us ] -0.59%*
array::insert [ 20.0 us ... 19.8 us ] -0.83%*
array::push [ 15.1 us ... 15.0 us ] -0.59%*
map::insert [ 581.5 ns ... 583.3 ns ] +0.30%
map::remove [ 280.2 ns ... 281.4 ns ] +0.43%
map::update [ 634.8 ns ... 636.9 ns ] +0.35%
register::write [ 235.4 ns ... 234.0 ns ] -0.62%*
register::clear [ 76.5 ns ... 75.6 ns ] -1.17%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 15.0 us ... 15.1 us ] +0.21%
array::delete [ 19.3 us ... 19.3 us ] -0.13%
array::update [ 38.9 us ... 38.3 us ] -1.49%*
array::insert [ 19.8 us ... 19.6 us ] -0.92%*
array::push [ 15.1 us ... 15.1 us ] +0.40%
map::insert [ 574.8 ns ... 570.2 ns ] -0.79%*
map::remove [ 287.8 ns ... 283.1 ns ] -1.62%*
map::update [ 619.7 ns ... 636.9 ns ] +2.77%*
register::write [ 237.0 ns ... 236.4 ns ] -0.24%
register::clear [ 75.5 ns ... 76.2 ns ] +0.86%*

I got an error ("no such file or directory") when I did
Looking at
Given the wide range in execution times, I wondered if maybe this was caused by
Does this effect manifest itself when you test the binary against itself? You can omit the other executable, in which case tango will test against itself to ensure the same code is used in base/candidate, e.g.

$ cargo bench -q --bench=micro -- compare -ot 5
It certainly does look like the re-build changes the performance characteristics of the binary (even though it's the exact same environment, directory, code, compiler, and toolchain):

$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift [ 8.8 us ... 8.8 us ] -0.09%
array::delete [ 13.7 us ... 13.7 us ] -0.01%
array::update [ 26.6 us ... 26.7 us ] +0.27%
array::insert [ 13.3 us ... 13.3 us ] +0.09%
array::push [ 8.5 us ... 8.4 us ] -0.08%
map::insert [ 572.0 ns ... 572.3 ns ] +0.05%
map::remove [ 303.0 ns ... 303.1 ns ] +0.02%
map::update [ 606.7 ns ... 606.6 ns ] -0.02%
register::write [ 219.5 ns ... 219.4 ns ] -0.08%
register::clear [ 82.9 ns ... 82.9 ns ] -0.01%
$ cargo export target/benchmarks -- bench --bench=micro
Finished bench [optimized + debuginfo] target(s) in 0.66s
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 8.5 us ... 8.4 us ] -1.08%*
array::delete [ 13.2 us ... 13.1 us ] -0.84%*
array::update [ 26.4 us ... 26.2 us ] -0.87%*
array::insert [ 13.5 us ... 13.4 us ] -0.81%*
array::push [ 8.5 us ... 8.4 us ] -1.43%*
map::insert [ 572.9 ns ... 577.7 ns ] +0.85%*
map::remove [ 303.7 ns ... 300.8 ns ] -0.97%*
map::update [ 610.0 ns ... 607.0 ns ] -0.48%
register::write [ 225.0 ns ... 225.0 ns ] -0.02%
register::clear [ 84.7 ns ... 83.6 ns ] -1.29%*
$ cargo bench -q --bench=micro -- compare target/benchmarks/micro -ot 5
array::unshift [ 8.6 us ... 8.6 us ] -0.18%
array::delete [ 13.7 us ... 13.6 us ] -0.09%
array::update [ 27.4 us ... 27.3 us ] -0.24%
array::insert [ 13.9 us ... 13.8 us ] -0.46%
array::push [ 8.7 us ... 8.7 us ] +0.11%
map::insert [ 574.9 ns ... 570.3 ns ] -0.80%*
map::remove [ 307.0 ns ... 309.3 ns ] +0.76%*
map::update [ 610.8 ns ... 608.4 ns ] -0.39%
register::write [ 225.6 ns ... 223.3 ns ] -1.03%*
register::clear [ 86.8 ns ... 82.7 ns ] -4.80%*
$ cargo bench -q --bench=micro -- compare -ot 5
array::unshift [ 8.5 us ... 8.5 us ] +0.05%
array::delete [ 13.1 us ... 13.1 us ] +0.06%
array::update [ 26.3 us ... 26.3 us ] -0.01%
array::insert [ 13.4 us ... 13.4 us ] +0.07%
array::push [ 8.5 us ... 8.5 us ] +0.07%
map::insert [ 567.3 ns ... 567.5 ns ] +0.04%
map::remove [ 305.6 ns ... 305.4 ns ] -0.06%
map::update [ 611.6 ns ... 611.7 ns ] +0.02%
register::write [ 221.3 ns ... 221.4 ns ] +0.02%
register::clear [ 84.1 ns ... 84.1 ns ] -0.03%
I encounter such behavior on x86 occasionally. Usually it happens after some changes to code unrelated to the code being tested. But I have never faced a situation where two binaries built one after the other have different performance. 😕 Theoretically it might happen because the function linking order changes for whatever reason, which in turn may change the function layout and subsequently the effectiveness of different optimizations. There is at least one publication describing this effect – Producing Wrong Data Without Doing Anything Obviously Wrong.
Could you please do some sanity checks of the assembly and make sure
I'll do you one better:

$ shasum a
1cd583440bff0f5befae11ace8de8837ee7f0131 a
$ shasum b
1cd583440bff0f5befae11ace8de8837ee7f0131 b
$ ./a compare -ot 5
array::unshift [ 8.7 us ... 8.8 us ] +0.05%
array::delete [ 13.9 us ... 13.9 us ] +0.08%
array::update [ 27.3 us ... 27.3 us ] +0.03%
array::insert [ 13.8 us ... 13.8 us ] +0.02%
array::push [ 8.7 us ... 8.7 us ] -0.05%
map::insert [ 576.3 ns ... 576.4 ns ] +0.02%
map::remove [ 306.2 ns ... 306.1 ns ] -0.02%
map::update [ 624.0 ns ... 623.8 ns ] -0.02%
register::write [ 224.7 ns ... 224.7 ns ] -0.01%
register::clear [ 84.9 ns ... 84.9 ns ] +0.02%
$ ./a compare a -ot 5
array::unshift [ 8.8 us ... 8.8 us ] -0.23%
array::delete [ 13.7 us ... 13.7 us ] +0.14%
array::update [ 27.6 us ... 27.6 us ] +0.01%
array::insert [ 14.0 us ... 14.0 us ] -0.02%
array::push [ 8.6 us ... 8.6 us ] +0.06%
map::insert [ 583.6 ns ... 584.4 ns ] +0.14%
map::remove [ 311.2 ns ... 311.3 ns ] +0.02%
map::update [ 631.8 ns ... 631.5 ns ] -0.04%
register::write [ 231.6 ns ... 231.7 ns ] +0.04%
register::clear [ 87.0 ns ... 87.0 ns ] -0.01%
$ ./a compare b -ot 5
array::unshift [ 8.7 us ... 8.7 us ] -0.95%*
array::delete [ 13.9 us ... 14.0 us ] +1.02%*
array::update [ 27.8 us ... 28.2 us ] +1.40%*
array::insert [ 14.2 us ... 14.5 us ] +2.07%*
array::push [ 8.8 us ... 8.7 us ] -0.88%*
map::insert [ 575.6 ns ... 586.0 ns ] +1.81%*
map::remove [ 315.4 ns ... 302.0 ns ] -4.25%*
map::update [ 625.1 ns ... 636.7 ns ] +1.86%*
register::write [ 226.3 ns ... 227.3 ns ] +0.44%
register::clear [ 85.9 ns ... 85.8 ns ] -0.20%
Ok, I think I can explain that. In the first two cases we are executing the same code, although mapped twice in different parts of virtual memory, while in the third case we are executing different code (although identical). Does
Ah, you mean in the sense that the files are memory-mapped, so the page tables behind the scenes map them back to the same backing memory or something (for I would say

$ ./a compare b -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.5 us ] -0.27%
array::delete [ 13.6 us ... 13.4 us ] -1.11%*
array::update [ 27.3 us ... 27.1 us ] -0.77%*
array::insert [ 13.8 us ... 13.6 us ] -1.07%*
array::push [ 8.5 us ... 8.5 us ] -0.12%
map::insert [ 556.6 ns ... 567.9 ns ] +2.03%*
map::remove [ 299.4 ns ... 298.5 ns ] -0.30%
map::update [ 604.3 ns ... 601.2 ns ] -0.50%*
register::write [ 222.4 ns ... 217.8 ns ] -2.08%*
register::clear [ 82.6 ns ... 90.0 ns ] +8.99%*
Yes, exactly. After the TLB lookup, all memory-mapped copies of the same file should have the same physical address, linked to the OS buffer pool. At least this is how I conceptualize how the OS works. I'm trying to reproduce this effect on an M3 Max using some tango tests I have at hand, but still no luck.
With
Relatedly: any plans to ship
Actually this is a good idea. 👍 I will do it.
I will fix those minor issues you've reported and release 0.3 in a few days.
Also I think there is one more thing you can try. There is an implicit default limit on the number of iterations per sample. Last time when we tried

tango_main!(MeasurementSettings {
    max_iterations_per_sample: usize::max_value(),
    ..tango_bench::DEFAULT_SETTINGS
});
Tried that now, and it doesn't seem to have made too much of a difference. One interesting thing I just noticed, though, is that there's a bit of a "pattern" to the results. Notice how the +s and -s seem to kind of happen in waves? There's a sequence of +s, then a sequence of -s, then +s, etc., even across multiple invocations:

$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.6 us ... 8.6 us ] -0.72%*
array::delete [ 13.5 us ... 13.5 us ] +0.05%
array::update [ 26.9 us ... 26.9 us ] -0.08%
array::insert [ 13.7 us ... 13.7 us ] +0.26%
array::push [ 8.6 us ... 8.5 us ] -0.51%*
map::insert [ 562.1 ns ... 561.6 ns ] -0.08%
map::remove [ 299.1 ns ... 300.3 ns ] +0.39%
map::update [ 604.6 ns ... 613.1 ns ] +1.41%*
register::write [ 219.8 ns ... 225.9 ns ] +2.77%*
register::clear [ 85.0 ns ... 86.2 ns ] +1.38%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.6 us ... 8.7 us ] +1.82%*
array::delete [ 13.4 us ... 13.6 us ] +1.43%*
array::update [ 26.8 us ... 27.3 us ] +1.68%*
array::insert [ 13.6 us ... 13.9 us ] +1.70%*
array::push [ 8.5 us ... 8.7 us ] +1.92%*
map::insert [ 571.1 ns ... 573.5 ns ] +0.41%
map::remove [ 307.4 ns ... 296.9 ns ] -3.42%*
map::update [ 615.9 ns ... 607.5 ns ] -1.37%*
register::write [ 225.0 ns ... 224.2 ns ] -0.38%
register::clear [ 86.1 ns ... 82.3 ns ] -4.42%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.5 us ] +0.65%*
array::delete [ 13.3 us ... 13.3 us ] +0.29%
array::update [ 26.7 us ... 26.8 us ] +0.34%
array::insert [ 13.5 us ... 13.6 us ] +0.46%
array::push [ 8.5 us ... 8.6 us ] +0.58%*
map::insert [ 560.3 ns ... 562.7 ns ] +0.44%
map::remove [ 302.1 ns ... 298.7 ns ] -1.11%*
map::update [ 619.4 ns ... 611.1 ns ] -1.34%*
register::write [ 223.3 ns ... 220.0 ns ] -1.44%*
register::clear [ 85.5 ns ... 82.9 ns ] -3.05%*
$ ./c compare d -ot 5 --sampler=flat
array::unshift [ 8.5 us ... 8.6 us ] +0.75%*
array::delete [ 13.7 us ... 13.5 us ] -1.51%*
array::update [ 27.5 us ... 27.1 us ] -1.54%*
array::insert [ 13.9 us ... 13.8 us ] -1.18%*
array::push [ 8.6 us ... 8.6 us ] +0.65%*
map::insert [ 561.8 ns ... 570.0 ns ] +1.47%*
map::remove [ 302.3 ns ... 306.1 ns ] +1.26%*
map::update [ 613.5 ns ... 617.5 ns ] +0.66%*
register::write [ 220.0 ns ... 224.1 ns ] +1.89%*
register::clear [ 85.7 ns ... 83.6 ns ] -2.37%*
0.3 released
Another feature request: have all the benchmarks run even if one of the earlier results exceeds the threshold, and then only fail at the end!
Implemented in bazhenov/tango#15 (0.4.0 published)
It is even simpler than that, as it turns out. The macOS dynamic linker is capable of linking an executable file against itself and not loading the same executable twice into virtual memory. If you are testing 2 different executables you'll have distinct mappings for each of the binaries in virtual memory:

$ ./search_ord compare search_ord2
$ vmmap `pgrep search_ord` | grep search_ord
Virtual Memory Map of process 34763 (search_ord)
__TEXT 100130000-1002b4000 [ 1552K 1552K 0K 0K] r-x/r-x SM=COW /Users/USER/*/search_ord
__DATA_CONST 1002b4000-1002cc000 [ 96K 80K 0K 0K] r--/rw- SM=COW /Users/USER/*/search_ord
__LINKEDIT 1002d0000-1003c8000 [ 992K 992K 0K 0K] r--/r-- SM=COW /Users/USER/*/search_ord
__DATA 1002cc000-1002d0000 [ 16K 16K 16K 0K] rw-/rw- SM=COW /Users/USER/*/search_ord
__TEXT 10086c000-1009f0000 [ 1552K 608K 0K 0K] r-x/rwx SM=COW /Users/USER/*/search_ord2
__DATA_CONST 1009f0000-100a08000 [ 96K 96K 96K 0K] r--/rwx SM=COW /Users/USER/*/search_ord2
__LINKEDIT 100a0c000-100b04000 [ 992K 160K 0K 0K] r--/rwx SM=COW /Users/USER/*/search_ord2
__DATA 100a08000-100a0c000 [ 16K 16K 16K 0K] rw-/rwx SM=COW /Users/USER/*/search_ord2

But the dynamic linker is smart enough not to load the file twice if you test the binary against itself:

$ ./search_ord compare search_ord
$ vmmap `pgrep search_ord` | grep search_ord
Virtual Memory Map of process 34821 (search_ord)
__TEXT 10049c000-100620000 [ 1552K 1552K 0K 0K] r-x/r-x SM=COW /Users/USER/*/search_ord
__DATA_CONST 100620000-100638000 [ 96K 80K 0K 0K] r--/rw- SM=COW /Users/USER/*/search_ord
__LINKEDIT 10063c000-100734000 [ 992K 992K 0K 0K] r--/r-- SM=COW /Users/USER/*/search_ord
__DATA 100638000-10063c000 [ 16K 16K 16K 0K] rw-/rw- SM=COW /Users/USER/*/search_ord

Remarkably, this deduplication works even if you're using hard links on the filesystem (at least in the case of APFS). I was able to reproduce this effect at least partially, though not with the same magnitude as in your case, but still... Continuing to investigate the issue.
For quite some time I've been developing a benchmarking harness called Tango. It is based on a paired-testing methodology, and now I'm confident enough to publish it. Ordsearch was in fact one of the first projects I tried it on. So this PR is a benchmark implementation of ordsearch on Tango. The main purpose is to provide a performance regression toolkit using GHA.
The main promises of paired testing are:
Sensitivity
I've done some experiments on an AWS c2.medium instance. I ran a single test (u32/1024/nodup) in both criterion and tango in a loop 400 times, which took several hours. Here are 80% confidence intervals on the performance difference between OrderedCollection and Vec as measured by tango and criterion. Even using a fraction of a second, Tango can provide tighter intervals.
Here are the 100% CIs (whiskers show the minimum and maximum values).
Here the results are comparable between Tango and Criterion, but Tango requires much less time. Even 0.1s is enough to get a coarse estimate, which is very useful when experimenting during development.
Regression testing
Tango builds an executable benchmark which is at the same time a dynamically linked library. This way two versions of the same code can be loaded into the address space and benchmarked in a paired way. The algorithm is quite simple:
./main-bench compare ./branch-bench
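In terms of the commands used elsewhere in this thread, the CI flow roughly amounts to the following (a sketch assuming the benchmark target is named search_ord and that the baseline is exported from the main branch; branch names are placeholders):

```sh
# on main: build the benchmarks and export the binaries as the baseline
git checkout main
cargo export target/benchmarks -- bench --bench=search_ord

# on the PR branch: build the candidate and compare it against the exported baseline
git checkout feature-branch
cargo bench -q --bench=search_ord -- compare target/benchmarks/search_ord \
  -ot 1 --fail-threshold 10
```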
I've added a benchmarking workflow for GitHub, but it needs to be debugged and streamlined, which is possible only after the PR is open (GHA doesn't work in forks).