ci/compare: Bring back nix stats comparison #395113
Conversation
Force-pushed from ac9b686 to e8c79ac
ci/eval/compare/cmp-stats.py (Outdated)
A number of "stats" are accurate counts with accurate differences, so I'd like to just preserve the number.
This way, readers can aggregate the original counts into multi-way comparisons. (I have done so in the past)
This is not true anymore, but I think you are right regarding the format.
Previously we used single-core evaluation, which yields the exact number of thunks.
Now we split evaluation into 36 chunks, so it is no longer guaranteed that every thunk is counted exactly once: there is some overlap, and the overlap may also change depending on the PR.
So what I agreed on with @infinisil is that we display the raw numbers but treat them as statistical values and add the standard error, e.g. 13193 ± 21 -> +Δ5.1% ± 0.2% (N=36),
with stderr = σ / √N (how confident we are in the delta).
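A minimal sketch of that display (Python, stdlib only; the function names and the list-of-per-chunk-counts input are illustrative assumptions, not the actual script's API):

```python
import math
import statistics

def mean_and_stderr(chunk_counts):
    """Mean per-chunk count and its standard error (sigma / sqrt(N))."""
    n = len(chunk_counts)
    mean = statistics.fmean(chunk_counts)
    stderr = statistics.stdev(chunk_counts) / math.sqrt(n)  # sigma / sqrt(N)
    return mean, stderr

def format_delta(before, after):
    """Relative change with a crude uncertainty, e.g. '+5.1% ± 0.2% N=36'."""
    m_b, se_b = mean_and_stderr(before)
    m_a, se_a = mean_and_stderr(after)
    delta = (m_a - m_b) / m_b * 100
    # combine the two standard errors in quadrature, scaled to percent
    err = math.hypot(se_a, se_b) / m_b * 100
    return f"{delta:+.1f}% ± {err:.1f}% N={len(after)}"
```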
After some hours of diving into the actual numbers, my conclusion is that in practice we are very unlikely to detect performance regressions of less than 10% with this method.
We should either increase N further (100 or even 200, i.e. 5 VMs * 4 cores * 9 threads = 180 chunks) or use other statistical methods to reduce the noise in our data. (Going back to single-core is not really an option.)
I think the better approach would be a paired t-test on the raw data (which means we need to expose the raw per-chunk numbers, NOT the consolidated means):
We cannot simply average over all chunks. We should compare the chunks pairwise, then run a one-sample t-test on the differences for a final result.
See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html#scipy.stats.ttest_ind_from_stats
This could give us something like:
t = 0.05 -> how big the difference between the "before" and "after" performance is; in this case, almost NO difference.
p = 0.9 -> the small difference observed is likely due to random chance.
We could still display the mean and stderr, but as said, that wouldn't help us detect changes in the number of thunks below ~10%, because the noise in our dataset is too high.
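The pairwise idea can be sketched like this (stdlib only; a normal approximation stands in for the exact t-distribution p-value, which is reasonable for N around 36 — `scipy.stats.ttest_rel` would compute it exactly):

```python
import math
import statistics

def paired_t_test(before, after):
    """One-sample t-test on the per-chunk differences after - before.

    Returns (t, p). The two-sided p-value uses the normal approximation
    2 * (1 - Phi(|t|)), close to the t-distribution for N >= ~30.
    Assumes the differences are not all identical (so se > 0).
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    t = mean / se
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal tail
    return t, p
```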
> Now we split evaluation into 36 chunks.
Surely if the set of packages remains the same, these chunks are decided deterministically?
For cases where performance matters, that's generally the case, i.e. changes to implementations, not additions or removals.
But yes, when the chunking is different, the comparison could be very distorted, and I would consider hiding the stats entirely when that's the case.
Tbh, I have little interest in these stats if they're going to be as distorted as you suggest. A dedicated benchmark job may make more sense then; certainly for NixOS or the module system. I don't know what we should do about performance of the package set, which is relevant for changes to mkDerivation and other frequently used Nixpkgs functions. A representative, fixed sampling of the package set?
> Surely if the set of packages remains the same, these chunks are decided deterministically?
Oh yeah that's a good point, so we should instead really compare each chunk individually, and get the mean/standard deviation among the differences.
Yes, that's what I meant. We should also make sure that the chunks contain the exact same packages; otherwise we can't compare them pairwise. That makes sense for changes to lib functions or some build helpers, but not when adding packages: that would distort the chunks and is probably very hard to handle.
Do I understand correctly that the sequence of inputs used for the p and t values comes from the chunking, and not from past evaluation stats?
If so, these p and t values are a measure of chunking consistency, which is not what we're interested in, and presented without context, they're highly misleading.
Past evaluation stats aren't even that usable. The standard deviation I think we should compare to is that of many runs of the same commit on the given GHA hardware.
A standard deviation over recent history could be an interesting factoid, but that is not useful for evaluating the performance of a change either, because the baseline should be no change, not arbitrary change.
Furthermore, this kind of analysis only really applies to the independent stochastic stat that is CPU time.
Some other stats, like the GC ones may be non-deterministic because of the conservative GC (and possibly because of interactions between incrementality and timing, but I'd have to confirm). We can't assume independence there, and the massive 16MB rounding error on heapSize has the potential to make subtle changes seem impossibly bad.
Other stats, like function calls, are not probabilistic at all, since we don't have speculative execution or other probabilistic or non-deterministic phenomena affecting them.
Unless I've completely misread the scripts, I'd suggest the following:
- Drop the t and p values from the table for now
- Measure and store stats by evaluating the same hardcoded commit on CI. Check if it correlates with time of day and/or day of week. (If so, we'd have to adjust our use of data from a different time of day)
- Only provide t and p values where the variance makes sense - that is to say: time stats
Yes, I think we need to distinguish between probabilistic and deterministic values. I'll try to make a PR for it.
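One possible shape for that distinction (the stat names and sets below are assumptions for illustration, not the actual Nix stats schema):

```python
# Hypothetical classification: only probabilistic (timing) stats get
# t/p values; deterministic counts are reported as exact deltas.
PROBABILISTIC = {"cpuTime", "gcTime"}            # noisy: test statistically
DETERMINISTIC = {"nrThunks", "nrFunctionCalls"}  # exact: report raw diff

def report(name, before, after):
    """Render one stat line depending on its class."""
    if name in PROBABILISTIC:
        return f"{name}: needs statistical test (mean/stderr over chunks)"
    delta = after - before
    return f"{name}: {before} -> {after} ({delta:+d})"
```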
roberth left a comment
Not clear what we're comparing.
Any chance we could put up reports for these?
1. evaluating (all?) packages
2. a realistic desktop configuration
Bonus points:
3. a basic server configuration
This would help with evaluating the performance of NixOS in a realistic way, instead of extrapolating from some sort of mixed run.
Absolutely, I think we should have that instead of these context-less numbers hanging around. I would incorporate your review points into this PR, but wouldn't pack in more features yet; let's see how it works with this basic setup. cc @infinisil: any way of putting longer performance reports somewhere? Maybe use artifacts and have a link collection then?
The details get tricky as to what exactly gets evaluated, but for now I think it's best to just link to https://github.com/NixOS/nixpkgs/tree/master/ci/eval#readme, which allows people to reproduce stats (though not the comparison) locally. I am in favor of updating the readme with better information in future work.
Force-pushed from c36d03c to fb26a9f
@roberth @infinisil I managed to spend some time today on the stats comparison. We should decide how we want to continue with this. Here is what I got done:
Example:
Performance comparison
This compares the performance of this branch against its pull request base branch (e.g., 'master')
For further help please refer to: ci/README.md
Detailed per-chunk differences
It would be very cool if we could include pictures somehow; that would make the dry numbers report more interactive. If someone detects a change in chunk 1_3, that is still a little abstract. We could also show what this chunk contains.
Force-pushed from c850652 to 08c267d
Displays stats table in the step-summary if there are no added/removed packages
Successfully created backport PR for
Had to revert unfortunately: #403448
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/a-look-at-nixos-nixpkgs-evaluation-times-over-the-years/65114/14

This brings back the comparison report.
We had something like this before the CI was migrated to GitHub Actions.
Currently, as lib maintainers we are completely blind as to whether a PR has a significant impact on performance.
Running two local evaluations of the checked-out branch against master is my painful workaround at the moment.
This PR solves this pain.
Example output:
Performance comparison
This compares the performance of this branch against its pull request base branch (e.g., 'master')
For further help please refer to: ci/README.md