
runtime: Avoid locking during stake vote rewards calculation #6900

Merged
vadorovsky merged 5 commits into anza-xyz:master from vadorovsky:epoch-threadpool
Aug 26, 2025

@vadorovsky vadorovsky commented Jul 9, 2025

Problem

calculate_stake_vote_rewards was storing accumulated rewards per vote account in a DashMap, which was then used in a parallel iterator over all stake delegations.

There are over 1,000,000 stake delegations and around 1,000 validators. Each thread processed one of the stake delegations and tried to acquire a lock on the DashMap shard corresponding to a validator. Given that the number of validators is disproportionately small and they have thousands of delegations, this approach resulted in high contention, with some threads spending most of their time waiting for a lock.
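
As a rough illustration of the pattern described above (not the actual Agave code), the sketch below uses a hypothetical `ShardedMap` built from `std::sync::Mutex` as a stand-in for `DashMap`, with plain `u64` values in place of vote account keys and reward amounts. Each worker takes a shard lock once per delegation, which is where the contention comes from when there are few distinct keys.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for a sharded concurrent map: every shard is a
/// `Mutex<HashMap>` and a given key always maps to the same shard.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<u64, u64>>>,
}

impl ShardedMap {
    fn new(num_shards: usize) -> Self {
        let shards = (0..num_shards.max(1))
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Adds `amount` to the total of `vote_account` while holding the shard lock.
    fn add(&self, vote_account: u64, amount: u64) {
        let shard = &self.shards[(vote_account as usize) % self.shards.len()];
        *shard.lock().unwrap().entry(vote_account).or_insert(0) += amount;
    }

    /// Consumes the shards and flattens them into one map.
    fn into_map(self) -> HashMap<u64, u64> {
        self.shards
            .into_iter()
            .flat_map(|shard| shard.into_inner().unwrap())
            .collect()
    }
}

/// Each delegation is a `(vote_account, reward)` pair. Workers lock a shard
/// once per delegation, so a handful of hot keys serializes them.
fn accumulate_with_locks(delegations: &[(u64, u64)], num_threads: usize) -> HashMap<u64, u64> {
    let map = ShardedMap::new(16);
    let chunk_len = delegations.len().div_ceil(num_threads.max(1)).max(1);
    thread::scope(|s| {
        for chunk in delegations.chunks(chunk_len) {
            let map = &map;
            s.spawn(move || {
                for &(vote_account, reward) in chunk {
                    map.add(vote_account, reward);
                }
            });
        }
    });
    map.into_map()
}
```

With a million delegations spread over roughly a thousand keys, almost every `add` call in this shape competes with another worker for the same shard.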

The time spent on these calculations was ~232.21ms:

redeem_rewards_us=232210i

Threads spent 65% of their time waiting for locks:

[Profiler screenshot: reward-calculation-before-1]

Summary of Changes

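Fix that by:

  • Removing the DashMap and instead using fold and reduce operations to build a regular HashMap.
  • Pre-allocating the stake_rewards vector and passing &mut [MaybeUninit<PartitionedStakeReward>] to the thread pool.
  • Pulling in the optimization of StakeHistory::get from solana-stake-interface (solana-program/stake#81).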

The time spent on reward calculations goes down to ~48.78ms:

redeem_rewards_us=48781i

Threads now spend most of their time doing actual calculations:

[Profiler screenshot: epoch-rewards-final]

Fixes #6899


codecov-commenter commented Jul 15, 2025

Codecov Report

❌ Patch coverage is 86.95652% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.4%. Comparing base (94eb488) to head (9e74fcc).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6900    +/-   ##
========================================
  Coverage    83.4%    83.4%            
========================================
  Files         813      813            
  Lines      366220   366291    +71     
========================================
+ Hits       305691   305806   +115     
+ Misses      60529    60485    -44     

@vadorovsky vadorovsky force-pushed the epoch-threadpool branch 12 times, most recently from 5230af7 to 74bf762 Compare July 18, 2025 16:50
@vadorovsky vadorovsky force-pushed the epoch-threadpool branch 4 times, most recently from 2d72aef to 3c9f292 Compare July 28, 2025 15:56
@vadorovsky vadorovsky requested a review from Copilot July 28, 2025 15:57

Copilot AI left a comment

Pull Request Overview

This PR optimizes the stake vote rewards calculation by replacing the contention-heavy DashMap approach with a custom thread pool implementation that provides per-thread mutable states. This eliminates the need for locking during parallel reward calculations, improving performance from ~208ms to ~12ms.

Key Changes

  • Introduces a custom scoped thread pool with per-thread worker states to avoid locking
  • Replaces DashMap with thread-local HashMap collections for vote account rewards
  • Removes ThreadPool dependency from reward calculation functions

Reviewed Changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| runtime/src/thread_pool.rs | New custom thread pool implementation with per-thread mutable states |
| runtime/src/lib.rs | Adds thread_pool module to public interface |
| runtime/src/bank/partitioned_epoch_rewards/calculation.rs | Refactors reward calculation to use custom thread pool and eliminates DashMap |
| runtime/src/bank.rs | Updates vote rewards type and calc_vote_accounts_to_store signature |
| runtime/src/bank/tests.rs | Updates tests to use HashMap instead of DashMap |
| runtime/Cargo.toml | Adds crossbeam-deque dependency |
| Cargo.toml | Adds crossbeam-deque to workspace dependencies |

Comment thread runtime/src/thread_pool.rs Outdated
Comment thread runtime/src/thread_pool.rs Outdated
Comment thread runtime/src/bank/partitioned_epoch_rewards/calculation.rs Outdated
Comment thread runtime/src/bank.rs Outdated
Comment thread runtime/src/thread_pool.rs Outdated
@vadorovsky vadorovsky force-pushed the epoch-threadpool branch 3 times, most recently from 97df3b0 to acc7d39 Compare July 29, 2025 07:11
@vadorovsky vadorovsky requested a review from alessandrod July 29, 2025 07:37
@vadorovsky vadorovsky marked this pull request as ready for review July 29, 2025 07:52
@vadorovsky vadorovsky requested a review from HaoranYi July 29, 2025 07:53
`calculate_stake_vote_rewards` was storing accumulated rewards per vote
account in a `DashMap`, which was then used in a parallel iterator over
all stake delegations.

There are over 1,000,000 stake delegations and around 1,000 validators.
Each thread processes one of the stake delegations and tries to acquire
the lock on a `DashMap` shard corresponding to a validator. Given that
the number of validators is disproportionately small and they have
thousands of delegations, this approach results in high contention,
with some threads spending most of their time waiting for a lock.

The time spent on these calculations was ~208.47ms:

```
redeem_rewards_us=208475i
```

Fix that by:

* Removing the `DashMap` and instead using `fold` and `reduce`
  operations to build a regular `HashMap`.
* Pre-allocating the `stake_rewards` vector and passing
  `&mut [MaybeUninit<PartitionedStakeReward>]` to the thread pool.
* Pulling in the optimization of `StakeHistory::get` from
  `solana-stake-interface`. solana-program/stake#81

The time spent on reward calculations goes down to ~48.78ms:

```
redeem_rewards_us=48781i
```
@vadorovsky vadorovsky marked this pull request as ready for review August 25, 2025 15:15

vadorovsky commented Aug 25, 2025

@jstarry @HaoranYi It's ready for review.

> Is this the simplest approach? It looks like we can remove the dashmap if we split stake reward calculation across multiple workers and then aggregate the results when all the workers are finished.

I ended up doing exactly that. The HashMap is now built using fold and reduce, which turned out to perform best.
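
A minimal sketch of that fold-and-reduce shape, assuming `std::thread::scope` as a dependency-free stand-in for rayon's `fold` and `reduce`. The `u64` keys and rewards and the names `merge` and `accumulate_without_locks` are hypothetical, not taken from the PR.

```rust
use std::collections::HashMap;
use std::thread;

/// "reduce" step: merges `src` into `dst`, summing rewards per vote account.
fn merge(mut dst: HashMap<u64, u64>, src: HashMap<u64, u64>) -> HashMap<u64, u64> {
    for (vote_account, reward) in src {
        *dst.entry(vote_account).or_insert(0) += reward;
    }
    dst
}

/// Each delegation is a `(vote_account, reward)` pair.
fn accumulate_without_locks(
    delegations: &[(u64, u64)],
    num_threads: usize,
) -> HashMap<u64, u64> {
    let chunk_len = delegations.len().div_ceil(num_threads.max(1)).max(1);
    thread::scope(|s| {
        // "fold" step: every worker owns its accumulator, so no lock is taken.
        let handles: Vec<_> = delegations
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local: HashMap<u64, u64> = HashMap::new();
                    for &(vote_account, reward) in chunk {
                        *local.entry(vote_account).or_insert(0) += reward;
                    }
                    local
                })
            })
            .collect();

        // Combine the per-worker maps once all workers have finished.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .fold(HashMap::new(), merge)
    })
}
```

The merge cost here scales with the number of distinct vote accounts per worker (around a thousand), not with the number of delegations, which is why this shape suits the vote rewards map.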

That said, aggregating PartitionedStakeRewards with fold and reduce turned out to be very slow. I tried that in 0fbf93e and the profile looked terrible, with allocations in reduce taking half of the time:

[Profiler screenshots: rayon-min-len-500, rayon-min-len-500-end]

Which makes sense, since there are over 1,000,000 delegations.

That's why, in 27519db, I ended up pre-allocating the stake_rewards Vec and passing it as &mut [MaybeUninit<T>] to rayon. The final, better-looking profiles are in the PR description.
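
For readers unfamiliar with the technique, here is a small sketch of writing into a pre-allocated buffer through `&mut [MaybeUninit<T>]`. It assumes `std::thread::scope` instead of rayon, and the `Reward` type, the `compute_rewards` name and the toy reward rule are all hypothetical.

```rust
use std::mem::MaybeUninit;
use std::thread;

/// Hypothetical per-delegation result; `None` means "no reward for this stake".
type Reward = Option<u64>;

fn compute_rewards(stakes: &[u64], num_threads: usize) -> Vec<Reward> {
    let len = stakes.len();
    // One allocation up front instead of growing and merging per-worker Vecs.
    let mut out: Vec<Reward> = Vec::with_capacity(len);
    let chunk_len = len.div_ceil(num_threads.max(1)).max(1);
    {
        // The spare capacity is exposed as uninitialized slots.
        let slots: &mut [MaybeUninit<Reward>] = &mut out.spare_capacity_mut()[..len];
        thread::scope(|s| {
            let pairs = slots.chunks_mut(chunk_len).zip(stakes.chunks(chunk_len));
            for (slot_chunk, stake_chunk) in pairs {
                // Every worker gets a disjoint sub-slice, so no locking is needed.
                s.spawn(move || {
                    for (slot, &stake) in slot_chunk.iter_mut().zip(stake_chunk) {
                        // Toy rule (assumption): zero stake earns nothing.
                        slot.write((stake > 0).then(|| stake / 10));
                    }
                });
            }
        });
    }
    // SAFETY: both chunk iterators cover the same `len` indices with the same
    // chunk length, so each of the first `len` slots was written exactly once,
    // and `thread::scope` has joined all workers before this point.
    unsafe { out.set_len(len) };
    out
}
```

Note that with this layout the output has one entry per input, including the `None` entries, so its length is no longer the number of rewarded stakes.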

@vadorovsky vadorovsky requested review from HaoranYi and jstarry August 25, 2025 15:26
Comment thread runtime/src/bank/partitioned_epoch_rewards/calculation.rs Outdated
Comment thread runtime/src/bank/partitioned_epoch_rewards/calculation.rs Outdated
Comment thread runtime/src/bank/partitioned_epoch_rewards/calculation.rs Outdated
) {
let epoch_rewards_sysvar = self.get_epoch_rewards_sysvar();
if epoch_rewards_sysvar.active {
let thread_pool = ThreadPoolBuilder::new()

why do we want to lower the thread pool creation? tests should be able to run on a smaller pool. there might even be an idle pool given we're between both slots and epochs

@vadorovsky vadorovsky Aug 26, 2025

Reward calculation is the only place reachable from Bank::new_from_fields that needs a thread pool, and we calculate rewards only if the epoch_rewards_sysvar is active.

In the previous place (runtime/src/bank.rs:1828) there was even a TODO:

        // TODO: Only create the thread pool if we need to recalculate rewards,
        // i.e. epoch_reward_status is active. Currently, this thread pool is
        // always created and used for recalculate_partitioned_rewards and
        // lt_hash calculation. Once lt_hash feature is active, lt_hash won't
        // need the thread pool. Thereby, after lt_hash feature activation, we
        // can change to create the thread pool only when we need to recalculate
        // rewards.

By moving the thread pool here, we make sure that a validator started while this sysvar is inactive doesn't waste time spawning threads that end up unused.

sure. my contention is more that we lose control over pool configuration the lower it's instantiated. buried, adhoc pools like this are how we got a billion pools in the first place and the inability to configure them was a major ci slow down

maybe just punt the change to be addressed in its own pr so we don't hold up the rest of the wins here?

Member Author

Fair, I moved the pool back.

HaoranYi previously approved these changes Aug 25, 2025

@HaoranYi HaoranYi left a comment

Thanks for fixing.
lgtm.

@vadorovsky vadorovsky merged commit e752ae6 into anza-xyz:master Aug 26, 2025
54 checks passed
@vadorovsky vadorovsky deleted the epoch-threadpool branch August 26, 2025 16:43

mergify Bot commented Aug 26, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Aug 26, 2025
(cherry picked from commit e752ae6)

# Conflicts:
#	Cargo.toml
#	programs/sbf/Cargo.toml

@jstarry jstarry left a comment

Sorry but you'll have to revert this. Looks like the calculation in get_reward_distribution_num_blocks for num_partitions will be altered by this change because the length of PartitionedStakeRewards now includes items for stake accounts that do not have rewards because of the internal Option.
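
To make the concern concrete, here is a hedged sketch. It assumes the partition count is a ceiling division of the number of reward entries by a per-block chunk size; `STORES_PER_BLOCK` and both function names are hypothetical and do not reflect the real constants or the signature of get_reward_distribution_num_blocks.

```rust
/// Hypothetical chunk size, chosen small so the effect is visible.
const STORES_PER_BLOCK: usize = 4;

/// Assumed shape of the calculation: ceil(entries / chunk size), at least 1.
fn num_partitions(num_entries: usize) -> usize {
    num_entries.div_ceil(STORES_PER_BLOCK).max(1)
}

/// Returns `(count using only rewarded stakes, count using the full length)`.
fn partitions_before_and_after(rewards: &[Option<u64>]) -> (usize, usize) {
    // Only the `Some` entries correspond to stake accounts with rewards.
    let rewarded = rewards.iter().flatten().count();
    (num_partitions(rewarded), num_partitions(rewards.len()))
}
```

For nine entries of which three are `Some`, this sketch yields one partition when counting rewarded stakes and three when counting the full length, which is the kind of divergence described above.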


Development

Successfully merging this pull request may close these issues.

runtime: dash map locks in calculate_stake_vote_rewards are heavily contended

7 participants