runtime: Avoid locking during stake vote rewards calculation #6900

vadorovsky merged 5 commits into anza-xyz:master
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##           master    #6900    +/-   ##
=========================================
  Coverage    83.4%    83.4%
=========================================
  Files         813      813
  Lines      366220   366291     +71
=========================================
+ Hits       305691   305806    +115
+ Misses      60529    60485     -44
```
Pull Request Overview
This PR optimizes the stake vote rewards calculation by replacing the contention-heavy DashMap approach with a custom thread pool implementation that provides per-thread mutable states. This eliminates the need for locking during parallel reward calculations, improving performance from ~208ms to ~12ms.
Key Changes
- Introduces a custom scoped thread pool with per-thread worker states to avoid locking
- Replaces `DashMap` with thread-local `HashMap` collections for vote account rewards
- Removes `ThreadPool` dependency from reward calculation functions
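The "per-thread mutable state" idea from the overview can be sketched with standard library scoped threads: each worker owns a plain `HashMap` that no other thread can touch, and the partial maps are merged only after all workers have joined. This is an illustrative sketch, not the PR's actual implementation; the `Delegation` type and field names are stand-ins.

```rust
use std::collections::HashMap;
use std::thread;

// Stand-in for a stake delegation; `vote_account` models a vote account pubkey.
struct Delegation {
    vote_account: u64,
    reward: u64,
}

fn accumulate(delegations: &[Delegation], num_threads: usize) -> HashMap<u64, u64> {
    // Split the delegations into one contiguous chunk per worker.
    let chunk = ((delegations.len() + num_threads - 1) / num_threads).max(1);
    let partials: Vec<HashMap<u64, u64>> = thread::scope(|s| {
        delegations
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    // Per-thread mutable state: no locks, no sharing.
                    let mut local = HashMap::new();
                    for d in part {
                        *local.entry(d.vote_account).or_insert(0) += d.reward;
                    }
                    local
                })
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });
    // Single-threaded merge after the parallel phase.
    let mut total = HashMap::new();
    for partial in partials {
        for (k, v) in partial {
            *total.entry(k).or_insert(0) += v;
        }
    }
    total
}
```

The key property is that contention is impossible during the hot loop; the only cross-thread cost is the final merge over ~1,000 vote accounts, which is tiny compared to the 1,000,000+ delegations.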
Reviewed Changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| runtime/src/thread_pool.rs | New custom thread pool implementation with per-thread mutable states |
| runtime/src/lib.rs | Adds thread_pool module to public interface |
| runtime/src/bank/partitioned_epoch_rewards/calculation.rs | Refactors reward calculation to use custom thread pool and eliminates DashMap |
| runtime/src/bank.rs | Updates vote rewards type and calc_vote_accounts_to_store signature |
| runtime/src/bank/tests.rs | Updates tests to use HashMap instead of DashMap |
| runtime/Cargo.toml | Adds crossbeam-deque dependency |
| Cargo.toml | Adds crossbeam-deque to workspace dependencies |
`calculate_stake_vote_rewards` was storing accumulated rewards per vote account in a `DashMap`, which was then used in a parallel iterator over all stake delegations. There are over 1,000,000 stake delegations and around 1,000 validators. Each thread processes one of the stake delegations and tries to acquire the lock on the `DashMap` shard corresponding to a validator. Given that the number of validators is disproportionately small and each has thousands of delegations, such a solution results in high contention, with some threads spending most of their time waiting for locks.

The time spent on these calculations was ~208.47ms:

```
redeem_rewards_us=208475i
```

Fix that by:

* Removing the `DashMap` and instead using `fold` and `reduce` operations to build a regular `HashMap`.
* Pre-allocating the `stake_rewards` vector and passing `&mut [MaybeUninit<PartitionedStakeReward>]` to the thread pool.
* Pulling the optimization of `StakeHistory::get` into `solana-stake-interface`: solana-program/stake#81

The time spent goes down to ~48.78ms:

```
redeem_rewards_us=48781i
```
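The fold/reduce shape described in the first bullet can be illustrated single-threaded with plain iterators (rayon's parallel `fold`/`reduce` follows the same contract): each chunk is folded into a private `HashMap`, and `reduce` merges the per-chunk maps, so no shared, locked map is ever touched. This is a sketch; the `(vote_account, lamports)` tuples are illustrative, not the PR's real types.

```rust
use std::collections::HashMap;

fn fold_reduce(rewards: &[(u64, u64)], chunk_size: usize) -> HashMap<u64, u64> {
    rewards
        .chunks(chunk_size)
        // "fold": build one private map per chunk, with no synchronization.
        .map(|chunk| {
            let mut local = HashMap::new();
            for &(vote_account, lamports) in chunk {
                *local.entry(vote_account).or_insert(0) += lamports;
            }
            local
        })
        // "reduce": merge the private maps pairwise into the final result.
        .reduce(|mut a, b| {
            for (k, v) in b {
                *a.entry(k).or_insert(0) += v;
            }
            a
        })
        .unwrap_or_default()
}
```

Because addition is associative and commutative here, the merge order does not matter, which is exactly what makes the pattern safe to parallelize.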
@jstarry @HaoranYi It's ready for review.
I ended up doing exactly that. That said - aggregating
Which makes sense, there are over 1,000,000 delegations. That's why in 27519db, I ended up pre-allocating the `stake_rewards` vector.
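The pre-allocation trick mentioned above (and in the commit using `spare_capacity_mut`) can be sketched as: allocate the output vector once, expose its uninitialized tail as `&mut [MaybeUninit<T>]` for workers to write into, then mark the initialized prefix with `set_len`. This is an assumption-laden sketch, not the PR's code; `Reward` is a stand-in type, and a single loop stands in for the parallel writers.

```rust
use std::mem::MaybeUninit;

#[derive(Debug, PartialEq)]
struct Reward(u64);

fn fill_rewards(count: usize) -> Vec<Reward> {
    let mut out: Vec<Reward> = Vec::with_capacity(count);
    // Uninitialized capacity, writable in place without re-allocation.
    let slots: &mut [MaybeUninit<Reward>] = out.spare_capacity_mut();
    // In the real code each thread would write a disjoint sub-slice of `slots`;
    // here one loop stands in for the parallel writers.
    for (i, slot) in slots.iter_mut().take(count).enumerate() {
        slot.write(Reward(i as u64));
    }
    // SAFETY: all `count` slots were initialized in the loop above.
    unsafe { out.set_len(count) };
    out
}
```

Splitting the `MaybeUninit` slice into disjoint chunks gives each worker an exclusive destination, so results land directly in their final positions with no per-item allocation or locking.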
```rust
    ) {
        let epoch_rewards_sysvar = self.get_epoch_rewards_sysvar();
        if epoch_rewards_sysvar.active {
            let thread_pool = ThreadPoolBuilder::new()
```
why do we want to lower the thread pool creation? tests should be able to run on a smaller pool. there might even be an idle pool given we're between both slots and epochs
Reward calculation is the only place down from `Bank::new_from_fields` which needs a thread pool. And we calculate rewards only if the `epoch_rewards_sysvar` is active.
In the previous place (runtime/src/bank.rs:1828) there was even a TODO:
// TODO: Only create the thread pool if we need to recalculate rewards,
// i.e. epoch_reward_status is active. Currently, this thread pool is
// always created and used for recalculate_partitioned_rewards and
// lt_hash calculation. Once lt_hash feature is active, lt_hash won't
// need the thread pool. Thereby, after lt_hash feature activation, we
// can change to create the thread pool only when we need to recalculate
// rewards.
By moving this thread pool here, we make sure that if someone starts a validator during the time this sysvar is inactive, that validator doesn't waste time on spawning threads, which don't end up being used.
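The "only spawn threads when the sysvar is active" idea above amounts to gating pool construction on a flag. A minimal sketch, with a hypothetical `RewardsPool` type and thread count standing in for the real pool:

```rust
// Hypothetical pool type; the real code uses rayon's ThreadPoolBuilder.
struct RewardsPool {
    num_threads: usize,
}

// Build a pool only when epoch rewards are being distributed, so a validator
// started while the sysvar is inactive never spawns unused threads.
fn maybe_build_pool(epoch_rewards_active: bool) -> Option<RewardsPool> {
    epoch_rewards_active.then(|| RewardsPool { num_threads: 8 })
}
```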
sure. my contention is more that we lose control over pool configuration the lower it's instantiated. buried, ad hoc pools like this are how we got a billion pools in the first place, and the inability to configure them was a major CI slowdown
maybe just punt the change to be addressed in its own PR so we don't hold up the rest of the wins here?
Fair, I moved the pool back.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
(cherry picked from commit e752ae6)

# Conflicts:
#	Cargo.toml
#	programs/sbf/Cargo.toml
jstarry left a comment
Sorry but you'll have to revert this. Looks like the calculation in `get_reward_distribution_num_blocks` for `num_partitions` will be altered by this change because the length of `PartitionedStakeRewards` now includes items for stake accounts that do not have rewards because of the internal `Option`.
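The bug jstarry describes can be shown with a toy partition-count helper: if the rewards vector holds one entry per stake account (`Some` or `None`), its `len()` no longer equals the number of actual rewards, so any partition count derived from `len()` is inflated. The helper name and ceiling division are illustrative, not the exact runtime code.

```rust
// Hypothetical stand-in for get_reward_distribution_num_blocks:
// ceil(num_rewards / rewards_per_block).
fn num_partitions(num_rewards: usize, rewards_per_block: usize) -> usize {
    (num_rewards + rewards_per_block - 1) / rewards_per_block
}
```

With `vec![Some(1), None, None, Some(2)]` and 2 rewards per block, `len()` yields 2 partitions while the 2 real rewards need only 1, which is exactly the altered behavior the review flags.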
This reverts commit e752ae6 ("runtime: Avoid locking during stake vote rewards calculation (anza-xyz#6900)").


Problem
`calculate_stake_vote_rewards` was storing accumulated rewards per vote account in a `DashMap`, which was then used in a parallel iterator over all stake delegations. There are over 1,000,000 stake delegations and around 1,000 validators. Each thread processed one of the stake delegations and tried to acquire a lock on the `DashMap` shard corresponding to a validator. Given that the number of validators is disproportionately small and each has thousands of delegations, such a solution resulted in high contention, with some threads spending most of their time waiting for locks.

The time spent on these calculations was ~232.21ms. Threads spent 65% of their time waiting for locks.
Summary of Changes
Fix that by:

* Removing the `DashMap` and instead using `fold` and `reduce` operations to build a regular `HashMap`.
* Pre-allocating the `stake_rewards` vector and passing `&mut [MaybeUninit<PartitionedStakeReward>]` to the thread pool.
* Pulling in the optimization of `StakeHistory::get` in `solana-stake-interface`: Optimize the `StakeHistory::get` function solana-program/stake#81

The time spent on reward calculations goes down to ~48.78ms. Threads spend most of their time doing actual calculations.
Fixes #6899