runtime: Avoid locking during stake vote rewards calculation #7742

Merged
vadorovsky merged 1 commit into anza-xyz:master from vadorovsky:epoch-threadpool-v2
Sep 11, 2025

@vadorovsky vadorovsky commented Aug 27, 2025

Problem

`calculate_stake_vote_rewards` was storing accumulated rewards per vote account in a `DashMap`, which was then used in a parallel iterator over all stake delegations.

There are over 1,000,000 stake delegations and around 1,000 validators. Each thread processes one of the stake delegations and tries to acquire the lock on the `DashMap` shard corresponding to a validator. Given that the number of validators is disproportionately small and they have thousands of delegations, this approach results in high contention, with some threads spending most of their time waiting for a lock.

The time spent on these calculations was ~208.47ms:

redeem_rewards_us=208475i

Threads spent 65% of their time waiting for locks:

[Image: reward-calculation-before-1]

Summary of Changes

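Fix that by:

* Removing the `DashMap` and instead using `fold` and `reduce` operations to build a regular `HashMap`.
* Pre-allocating the `stake_rewards` vector and passing `&mut [MaybeUninit<PartitionedStakeReward>]` to the thread pool.
* Pulling in the optimization of `StakeHistory::get` in `solana-stake-interface`. solana-program/stake#81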
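
For illustration, the commit message for this PR describes replacing the `DashMap` with `fold` and `reduce` operations that build a regular `HashMap`. The sketch below shows the general shape of that pattern, not the PR's actual code: it uses scoped threads from the standard library as a stand-in for the parallel iterator, and the `Delegation` tuple and the `accumulate_vote_rewards` function are hypothetical names invented for this example.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for a stake delegation:
/// (vote account id, reward in lamports).
type Delegation = (u32, u64);

/// Sums rewards per vote account without a shared, locked map.
/// Each worker folds its chunk into a private `HashMap`; the partial maps
/// are then reduced into one map on the calling thread.
fn accumulate_vote_rewards(delegations: &[Delegation], workers: usize) -> HashMap<u32, u64> {
    // Guard against a zero chunk size, which `chunks` does not accept.
    let chunk_size = delegations.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|scope| {
        // "fold": one private accumulator per chunk, so no locks are needed.
        let handles: Vec<_> = delegations
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local: HashMap<u32, u64> = HashMap::new();
                    for &(vote_account, reward) in chunk {
                        *local.entry(vote_account).or_default() += reward;
                    }
                    local
                })
            })
            .collect();

        // "reduce": merge the per-worker maps into the final result.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .fold(HashMap::new(), |mut total, partial| {
                for (vote_account, reward) in partial {
                    *total.entry(vote_account).or_default() += reward;
                }
                total
            })
    })
}
```

With around 1,000 distinct keys, each partial map stays small, so the merge step is cheap compared with the per-delegation work that no longer has to wait on a shard lock.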

The time spent on the calculation decreased to ~49ms:

redeem_rewards_us=48781i
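
As a quick check on the two `redeem_rewards_us` readings (assuming the `_us` suffix means microseconds, which matches the ~208.47ms and ~49ms figures quoted here), the helpers below convert a reading to milliseconds and compute the ratio between the two. Both function names are made up for this example.

```rust
/// Converts a microsecond reading to milliseconds.
fn us_to_ms(us: u64) -> f64 {
    us as f64 / 1_000.0
}

/// Ratio of the "before" time to the "after" time.
fn speedup(before_us: u64, after_us: u64) -> f64 {
    before_us as f64 / after_us as f64
}
```

Calling `speedup(208_475, 48_781)` gives roughly 4.27, so the calculation is a little over four times faster with these changes.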

Threads now spend most of their time doing actual calculations:

[Image: epoch-rewards-final]

Fixes #6899

// SAFETY: We initialized all the `stake_rewards` elements up to the capacity.
unsafe {
    stake_rewards.set_len(stake_rewards.capacity());
    stake_rewards.set_len_some(len_stake_rewards_some);
Member Author

This addresses #6900 (review)
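
The hunk above finalizes a vector whose slots were written through `MaybeUninit` before its length was set. A minimal sketch of that general technique follows, under simplifying assumptions: the element type is a plain `u64` rather than the PR's reward type, scoped threads from the standard library stand in for the thread pool, and `compute_rewards` and its `compute` parameter are hypothetical.

```rust
use std::mem::MaybeUninit;
use std::thread;

/// Fills a pre-allocated vector in parallel, one disjoint chunk per worker.
/// `compute` is a hypothetical per-index reward calculation.
fn compute_rewards(len: usize, workers: usize, compute: fn(usize) -> u64) -> Vec<u64> {
    let mut rewards: Vec<u64> = Vec::with_capacity(len);
    // Guard against a zero chunk size, which `chunks_mut` does not accept.
    let chunk_size = len.div_ceil(workers.max(1)).max(1);
    {
        // Take exactly `len` uninitialized slots; the real capacity may be larger.
        let slots: &mut [MaybeUninit<u64>] = &mut rewards.spare_capacity_mut()[..len];
        thread::scope(|scope| {
            for (chunk_index, chunk) in slots.chunks_mut(chunk_size).enumerate() {
                // Each worker owns a disjoint sub-slice, so no locking is needed.
                scope.spawn(move || {
                    for (offset, slot) in chunk.iter_mut().enumerate() {
                        slot.write(compute(chunk_index * chunk_size + offset));
                    }
                });
            }
        });
    }
    // SAFETY: every one of the first `len` slots was written above, and
    // `Vec::with_capacity(len)` guarantees `len <= capacity`.
    unsafe { rewards.set_len(len) };
    rewards
}
```

Setting the length to the number of slots actually written, rather than to `capacity()`, keeps the safety argument valid even when the allocator hands back more capacity than requested.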

}

/// Number of `Some` elements.
pub(crate) fn len_some(&self) -> usize {
Member Author

This method addresses #6900 (review)

I deliberately didn't implement `Deref` and `DerefMut`: this way, the `len` method of the inner `Vec` is not available, so consumers of `PartitionedStakeRewards` are forced to use `len_some`.
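
The design choice in this comment can be sketched with a small wrapper type. Everything here is hypothetical (`SparseRewards`, its fields, and `iter_some` are invented for the example, and `u64` stands in for the real element type); only the idea of exposing `len_some` while withholding `Deref` comes from the comment.

```rust
/// Hypothetical wrapper: slots may be `None`, and the wrapper tracks how
/// many of them are `Some`.
pub struct SparseRewards {
    slots: Vec<Option<u64>>,
    num_some: usize,
}

impl SparseRewards {
    pub fn new(slots: Vec<Option<u64>>) -> Self {
        let num_some = slots.iter().filter(|slot| slot.is_some()).count();
        Self { slots, num_some }
    }

    /// Number of `Some` elements.
    pub fn len_some(&self) -> usize {
        self.num_some
    }

    /// Iterates over the rewards that are present, skipping `None` slots.
    pub fn iter_some(&self) -> impl Iterator<Item = &u64> + '_ {
        self.slots.iter().flatten()
    }
}

// There is intentionally no `impl Deref<Target = Vec<Option<u64>>>`:
// with it, `rewards.len()` would compile and silently count `None` slots.
```

Because the inner `Vec` is private and not dereferenced, a call such as `rewards.len()` fails to compile, which pushes callers toward the count they actually want.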

/// * there is no payout or if any deserved payout is < 1 lamport
/// * corresponding vote account was not found in cache and accounts-db
#[test]
fn test_get_reward_distribution_num_blocks_none() {
Member Author

A test to make sure we don't break `get_reward_distribution_num_blocks`.

.0
.bank;
// Delegations with sufficient stake to get rewards (2 SOL).
let delegations_with_rewards = 100;
Member Author

I was tempted to put 1_000_000 to match the numbers we're seeing on mainnet, but unfortunately, even 1_000 causes create_reward_bank_with_specific_stakes to execute for over a minute... I will try to fix that and improve the test in a separate PR.

codecov-commenter commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 98.10811% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.0%. Comparing base (680bb32) to head (53c3121).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #7742    +/-   ##
========================================
  Coverage    83.0%    83.0%            
========================================
  Files         815      815            
  Lines      357726   357966   +240     
========================================
+ Hits       297171   297412   +241     
+ Misses      60555    60554     -1     

@vadorovsky vadorovsky marked this pull request as ready for review August 27, 2025 12:10
@vadorovsky vadorovsky force-pushed the epoch-threadpool-v2 branch from 0f44e4b to 9915bc5 Compare August 27, 2025 12:16
@vadorovsky vadorovsky force-pushed the epoch-threadpool-v2 branch 3 times, most recently from c49991e to 3d61dfb Compare September 5, 2025 12:49
@vadorovsky
Member Author

@jstarry all your comments are addressed, PTAL

Comment thread runtime/src/inflation_rewards/mod.rs Outdated
pub fn redeem_rewards(
    rewarded_epoch: Epoch,
    stake_state: &mut StakeStateV2,
    stake_account: &StakeAccount<Delegation>,

nit: looks like we only need to pass stake_state as before, not the full stake_account

Member Author

If we pass `stake_state` like before, we lose the win that you commented on below. 😅 We would need to call `stake_account.stake_state().unwrap()` in the caller (`redeem_delegation_rewards` in calculation.rs). I prefer matching on the whole stake account here and returning a copy of the stake.
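
The trade-off in this reply, matching inside the callee instead of unwrapping in the caller, can be shown with a toy model. The types below are simplified, hypothetical stand-ins and not the real `StakeAccount` or `StakeStateV2` definitions; `redeem` is likewise an invented name.

```rust
/// Hypothetical, simplified stake data.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Stake {
    delegated: u64,
}

/// Hypothetical, simplified account state.
enum StakeState {
    Uninitialized,
    Stake(Stake),
}

/// Hypothetical, simplified account wrapper.
struct StakeAccount {
    state: StakeState,
}

/// The callee matches on the account itself and returns an updated copy of
/// the stake, so the caller never has to unwrap the state first.
fn redeem(account: &StakeAccount, reward: u64) -> Option<Stake> {
    match &account.state {
        StakeState::Stake(stake) => {
            // Work on a copy; the account itself is left untouched.
            let mut updated = *stake;
            updated.delegated += reward;
            Some(updated)
        }
        StakeState::Uninitialized => None,
    }
}
```

In this shape the non-delegated case is handled once, in the function that needs the stake, rather than with an `unwrap()` at each call site.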

jstarry commented Sep 5, 2025

Looks good! Just a bunch of small things that you can take or leave.

@vadorovsky
Member Author

@jstarry All comments should be addressed now. I disagree with only one of them (#7742 (comment)); all the others are fixed the way you proposed.

@jstarry jstarry left a comment

Looks very solid. Sorry, I still have a few other suggestions, but the PR looks correct and is otherwise ready to go.

jstarry previously approved these changes Sep 11, 2025
`calculate_stake_vote_rewards` was storing accumulated rewards per vote
account in a `DashMap`, which was then used in a parallel iterator over
all stake delegations.

There are over 1,000,000 stake delegations and around 1,000 validators.
Each thread processes one of the stake delegations and tries to acquire
the lock on the `DashMap` shard corresponding to a validator. Given that
the number of validators is disproportionately small and they have
thousands of delegations, this approach results in high contention,
with some threads spending most of their time waiting for a lock.

The time spent on these calculations was ~208.47ms:

```
redeem_rewards_us=208475i
```

Fix that by:

* Removing the `DashMap` and instead using `fold` and `reduce`
  operations to build a regular `HashMap`.
* Pre-allocating the `stake_rewards` vector and passing
  `&mut [MaybeUninit<PartitionedStakeReward>]` to the thread pool.
* Pulling the optimization of `StakeHistory::get` in
  `solana-stake-interface`. solana-program/stake#81

```
redeem_rewards_us=48781i
```
@vadorovsky vadorovsky merged commit 8aa41ea into anza-xyz:master Sep 11, 2025
43 checks passed
mergify Bot commented Sep 11, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Sep 11, 2025
(cherry picked from commit 8aa41ea)

# Conflicts:
#	runtime/src/bank/partitioned_epoch_rewards/calculation.rs
vadorovsky added a commit to vadorovsky/agave that referenced this pull request Sep 11, 2025
…z#7742)

(cherry picked from commit 8aa41ea)
vadorovsky added a commit that referenced this pull request Sep 15, 2025
(cherry picked from commit 8aa41ea)
vadorovsky added a commit to vadorovsky/agave that referenced this pull request Sep 19, 2025
…z#7742)

(cherry picked from commit 8aa41ea)

Conflicts:
	runtime/src/bank/partitioned_epoch_rewards/calculation.rs
vadorovsky added a commit that referenced this pull request Sep 23, 2025
…ackport of #7742) (#8012)

(cherry picked from commit 8aa41ea)

Co-authored-by: Michal R <vad.sol@proton.me>
Development

Successfully merging this pull request may close these issues.

runtime: dash map locks in calculate_stake_vote_rewards are heavily contended
