runtime: bench stakes cache #10760
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master   #10760      +/-  ##
=========================================
+ Coverage    81.8%    83.0%    +1.2%
=========================================
  Files         835      839       +4
  Lines      306318   317174   +10856
=========================================
+ Hits       250617   263395   +12778
+ Misses      55701    53779    -1922
```
vadorovsky
left a comment
> as best as i can tell we don't have anything simple to measure stakes cache and rewards distribution
What I do personally is try to get a ledger a few minutes after crossing an epoch boundary, and then repeatedly use agave-ledger-tool to replay the epoch boundary. But that's pretty annoying to do (especially because I have to catch the moment of doing the epoch calculations; I was only able to do it by setting gdb/lldb breakpoints 😅), and having some bench would be nice.
That said, I think such a benchmark should at least match the mainnet load, see the inline comment.
```rust
const NUM_STAKE_ACCOUNTS: usize = 100_000;
const NUM_VOTE_ACCOUNTS: usize = 10;
```
These values are unfortunately way lower than what we see on mainnet, and I don't think they would be capable of reproducing some of the performance issues I've been fixing at the epoch boundary (#7742, #8065).
I think we should have two sets of values:
- Matching mainnet load: ~1_000 validators/vote accounts and ~1_000_000 stake accounts.
- Some even larger "stress" values: zillions of both vote and stake accounts, as close to OOMing a devbox as possible. 🙂 I think this would let us find even more performance issues and come up with some nice improvements.
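The two suggested parameter sets could be captured as constants alongside the existing ones; a minimal sketch (the names and the stress values are illustrative, not from the PR):

```rust
// Hypothetical constant sets for the bench. The mainnet numbers come from
// the review comment above; the "stress" numbers are made up and would be
// tuned to how much memory the devbox has.
const MAINNET_VOTE_ACCOUNTS: usize = 1_000;
const MAINNET_STAKE_ACCOUNTS: usize = 1_000_000;
const STRESS_VOTE_ACCOUNTS: usize = 10_000;
const STRESS_STAKE_ACCOUNTS: usize = 10_000_000;
```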
yea, i just chose these numbers as the largest that let the rust bench harness finish within single-digit minutes. switching to Criterion and cranking down the sample size, i can do 1m stake / 1k vote fine:
```
bench_epoch_turnover/HANA
                        time:   [160.48 ms 161.36 ms 161.99 ms]
                        change: [−0.7566% +0.0322% +0.7800%] (p = 0.95 > 0.05)
                        No change in performance detected.
Benchmarking bench_epoch_rewards_period/HANA: Warming up for 1.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 56.2s.
bench_epoch_rewards_period/HANA
                        time:   [4.1164 s 4.1551 s 4.1983 s]
                        change: [−3.1529% −1.8001% −0.3664%] (p = 0.03 < 0.05)
                        Change within noise threshold.
```
I did not examine the actual benchmark, but will butt my head in here anyway with an opinion some others may not share. We have added similar microbenchmarks in the past, and what I've tended to see is that we use the benchmark to rapidly iterate several improvements. Then the benchmark is rarely or never run again, but fairly consistently needs maintenance as interfaces change. I've even seen us functionally break benches and go months without noticing, since no one runs them. I would encourage thoughtfulness on whether this will be used as a one-off or longer-term when deciding to merge to main, or whether to just use it in near-term PRs to show improvement.
```rust
    std::sync::Arc,
    test::{Bencher, black_box},
};
```
Can we enable jemalloc in this bench?
```rust
#[cfg(not(any(target_env = "msvc", target_os = "freebsd")))]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;
```
You'll also need to add the following to the Cargo.toml of the local crate:

```toml
[target.'cfg(not(any(target_env = "msvc", target_os = "freebsd")))'.dependencies]
jemallocator = { workspace = true }
```

Without this, the bench will use the glibc allocator, which is way slower in such scenarios. Usually when I profile benches without jemalloc, all I see is page faults and drops taking most of the time. 😅
i've added jemalloc and changed the benches to use the product of trivial/full votes and trivial/full stakes. this makes it easy to add bigger cases when testing locally. tbh the complete case is already slow as hell tho; if anything, jemalloc may have made it slightly worse
Sorry for following up late!
> ive added jemalloc and changed the benches to use the product of trivial/full votes and trivial/full stakes. this makes it easy to add bigger cases when testing locally.
Thanks!
> tbh the complete case is already slow as hell tho
Yeah, the unfortunate thing with criterion is that there is no way to set fewer than 10 samples, and there is no way to control how many iterations are done per sample. TBH I think we shouldn't use criterion for this bench - its approach of "statistical correctness", hammering with lots of samples and iterations, is right for operations that take nano/microseconds (not milliseconds/seconds like the epoch boundary) and are very CPU-bound (while the epoch boundary is very heavy on memory operations).
I think our bench should allow being run with exactly one iteration, or a very small number like 5-10 (though the difference between iterations should be negligible). I see two options:
- Using some more lightweight bench runner that allows tweaking the exact number of iterations, e.g. brunch. The downside is that we would need to add a dependency.
- Having this bench not as a cargo bench, but rather as a standalone bin or example, like bench-tps or account-cluster-bench.
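The standalone-bin approach needs nothing beyond `std::time::Instant`; a hedged sketch of its core (the helper name is made up here, and the real example would set up an actual `Bank` inside the closures):

```rust
use std::time::{Duration, Instant};

// Time a single run of `f`, print the elapsed wall-clock time, and return
// it so the caller can aggregate a handful of iterations by hand. This is
// an illustrative helper, not code from the branch linked below.
fn time_once<F: FnOnce()>(label: &str, f: F) -> Duration {
    let start = Instant::now();
    f();
    let elapsed = start.elapsed();
    println!("{label}: {elapsed:?}");
    elapsed
}
```

A bin's `main` would then call something like `time_once("Bank setup", || { /* build the bank */ })` followed by `time_once("Epoch turnover", || { /* cross the boundary */ })`.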
I'm trying approach 2) on my branch right now:
https://github.com/vadorovsky/agave/tree/20260221_stakescachebench
vadorovsky@f2a8b8c
The initial, literal rewrite (your bench rewritten as an "example" bin with a single run + some printlns) still took a long time to execute, where most of the time (over 300s) goes into setting up the bank:
```
$ cargo run -p solana-runtime --example bench-epoch -- turnover
Benchmarking epoch turnover with 1000 vote accounts and 1000000 stake accounts
Bank setup: 302.825592795s
Epoch turnover: 487.747357ms

$ cargo run -p solana-runtime --example bench-epoch -- rewards-period
Benchmarking rewards period with 1000 vote accounts and 1000000 stake accounts
Bank setup: 304.401566589s
Rewards period: 24.963901635s
```
The time spent on bank setup is absurd, and I was able to confirm that around 304s (out of 306s) goes into `Bank::new_for_tests`. I'm definitely going to spend some time profiling and figuring out how we can improve it.
> if anything jemalloc may have made it slightly worse
I didn't believe it at first, but indeed it seems like jemalloc speeds up the bank initialization by ~6s but slows down the epoch boundary by ~25ms:

jemalloc (as above):

```
Bank setup: 302.825592795s
Epoch turnover: 487.747357ms
```

glibc:

```
Bank setup: 309.052845609s
Epoch turnover: 462.977385ms
```

🤔
I'm dumb and I forgot about a very important thing - running these benchmarks with `--profile release-with-debug`, so the binaries get optimized. Of course with the debug profile they take longer to execute.
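For reference, such a profile is typically declared in the workspace Cargo.toml roughly like this (a sketch; the actual agave profile definition may differ):

```toml
# An optimized build that keeps debug symbols, so profilers can still
# resolve stack traces.
[profile.release-with-debug]
inherits = "release"
debug = true
```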
The times look much better after that.
with jemalloc:

```
$ cargo run -p solana-runtime --profile release-with-debug --example bench-epoch -- turnover
Benchmarking epoch turnover with 1000 vote accounts and 1000000 stake accounts
Bank setup: 144.422557517s
Epoch turnover: 184.788629ms

$ cargo run -p solana-runtime --profile release-with-debug --example bench-epoch -- rewards-period
Benchmarking rewards period with 1000 vote accounts and 1000000 stake accounts
Bank setup: 145.206629666s
Rewards period: 5.402353169s
```

without jemalloc:

```
$ cargo run -p solana-runtime --profile release-with-debug --example bench-epoch -- turnover
Benchmarking epoch turnover with 1000 vote accounts and 1000000 stake accounts
Bank setup: 147.403266104s
Epoch turnover: 181.179222ms

$ cargo run -p solana-runtime --profile release-with-debug --example bench-epoch -- rewards-period
Benchmarking rewards period with 1000 vote accounts and 1000000 stake accounts
Bank setup: 148.393624658s
Rewards period: 5.415286154s
```
Still, this doesn't change the fact that the bank setup is unacceptably slow, and that jemalloc somehow slows down the epoch turnover (while the rewards period is faster). Sticking with jemalloc is still better IMO, since: a) that's what we're using in production; b) the setup is faster by seconds, while the turnover is slower only by milliseconds.
The epoch turnover time of 184ms is faster than the one on mainnet - the last time I profiled it, it was 337ms. I think that's because in this benchmark we are hard-coding the account sizes, while on mainnet there are some larger accounts. We could think of applying some size heuristics like I did in the read-only cache benchmarks:
agave/accounts-db/benches/read_only_accounts_cache.rs
Lines 23 to 36 in 9491071
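Such a size heuristic might look roughly like the sketch below (illustrative, not the code from `read_only_accounts_cache.rs`; the 200-byte standard stake account size and the 4008-byte outlier size are taken from later in this thread):

```rust
// Data sizes in bytes: the standard stake account size, and the larger
// outlier size observed on mainnet (per the discussion below).
const STANDARD_STAKE_ACCOUNT_SIZE: usize = 200;
const LARGE_STAKE_ACCOUNT_SIZE: usize = 4_008;

// Pick a data size for the account at `index`, deterministically
// spreading `large_count` oversized accounts across `total` accounts.
fn account_size(index: usize, total: usize, large_count: usize) -> usize {
    // `&&` short-circuits, so the division is never evaluated when
    // `large_count` is zero.
    if large_count > 0 && index % (total / large_count).max(1) == 0 {
        LARGE_STAKE_ACCOUNT_SIZE
    } else {
        STANDARD_STAKE_ACCOUNT_SIZE
    }
}
```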
I guess these times are now somewhat OK for occasional running, but I will still check whether I can come up with some quick fixes for `Bank::new_for_tests`.
i actually have an unrelated project that will need me to gather stats on all mainnet stake accounts, so i'll look at the actual sizes while i'm at it!
> but I will still check if I can come up with some quick fixes for `Bank::new_for_tests`.
Meh, I can't, at least for now...
I was able to narrow down that this loop takes pretty much the entire ~148s:
Lines 2764 to 2773 in 045045d
Two things look pretty sus to me:
- For each account from the genesis config, it calls `create_account_shared_data`, which clones the account's data.
- It calls `Bank::store_account` for each account separately (which involves refreshing bank hashes N times) instead of calling `Bank::store_accounts` for all accounts at once (which would refresh just once).
Point 1) could be fixed if we had an owned `GenesisConfig` instead of a `&GenesisConfig` reference available there - we could simply convert the owned `Account`s into `AccountSharedData`s without any cloning. But my attempt to do so wasted a solid few hours of my day, with the conclusion that I would need much more time. Not sure if it really makes sense at this point, unless such work would also bring some perf gain for loading snapshots.
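Point 2) - the repeated hash refresh - can be illustrated abstractly (hypothetical types; this is not the real `Bank` API, just a sketch of why batching helps):

```rust
// A toy stand-in for the bank that counts hash refreshes, to show the
// difference between per-account stores and one batched store.
struct FakeBank {
    hash_refreshes: usize,
}

impl FakeBank {
    // Per-account path: every store refreshes the hash.
    fn store_account(&mut self, _account: &[u8]) {
        self.hash_refreshes += 1;
    }

    // Batched path: write all accounts, refresh the hash once.
    fn store_accounts(&mut self, accounts: &[&[u8]]) {
        for _account in accounts {
            // ... write the account without touching the hash ...
        }
        self.hash_refreshes += 1;
    }
}
```

With N genesis accounts, the per-account path does N refreshes and the batched path does exactly one, which is the saving described above.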
ok, i checked, and i'm skeptical the perf difference is from nonstandard-size stakes. mainnet has 1.2m active stake accounts, 133 of which are 4008 bytes, with no other outliers
I'm usually on board with Alessandro and Trent in their crusade against benchmarks, but in this case I'm actually in favor of adding one, given that my comments above are addressed and we make it as close to mainnet behavior as possible. Usually the main reason benchmarks are inaccurate is pretty much what I commented on inline - numbers that are too low and the glibc allocator. Reasons I'm in favor of improving and merging this one:
And yes, I would use it pretty much daily.
vadorovsky
left a comment
Not gonna block this any further. The remaining open points to address are:
- Slight mismatch between the bench times and the mainnet times. I was suspecting the account sizes to be the reason, but they are not. I'm happy to take a deeper look whenever I have time.
- `Bank::new_for_tests`/`Bank::process_genesis_config` taking >300 seconds. That's a result of making a completely new bank with 1_000 validators and 1_000_000 delegations. But I think that having this bench merged will actually be helpful in rolling out future PRs that address that. The slowness of these functions is the main reason why most current tests create banks with only 10-100 validators.
But I think at this point, this bench is already way better and more accurate than most of the other benches in the repo. 😛
Problem
as best as i can tell we don't have anything simple to measure the stakes cache and rewards distribution
Summary of Changes
add two simple benches: one measuring how long going from one epoch to the next takes, and the other how long the rewards period takes. this captures the full cost of stakes cache-related work, including accounts-db access, which seems better than just measuring `StakesCache` functions in isolation.

however, these also seem to be extremely well-targeted: with 10 stake accounts, we take 150us/iter and 3.5ms/iter. with 100k stake accounts, we take 16ms/iter and 350ms/iter. `bench_epoch_rewards_period()` spends 95% of its measured time and `bench_epoch_turnover()` >99% inside the `update_epoch_time_us` metric, which contains `process_new_epoch()`, `update_epoch_stakes()`, and `distribute_partitioned_epoch_rewards()`.