Skip to content

Do not insert into accounts lt hash cache if freeze has started#6307

Merged
brooksprumo merged 3 commits intoanza-xyz:masterfrom
brooksprumo:lthash/freeze
May 27, 2025
Merged

Do not insert into accounts lt hash cache if freeze has started#6307
brooksprumo merged 3 commits intoanza-xyz:masterfrom
brooksprumo:lthash/freeze

Conversation

@brooksprumo
Copy link
Copy Markdown

@brooksprumo brooksprumo commented May 23, 2025

Problem

When activating the accounts lt hash feature (SIMD-215) on testnet, there was a bank hash mismatch caused by a bad lt hash.

After investigating with @jstarry1, it was concluded this is possible when the leader is still executing transactions after bank freeze has started. This would allow putting an entry into the accounts lt hash cache that was not the initial value of the account, and instead the value after bank freeze finishes up any deferred changes to account state. Then, then calculating the lt hash for the slot, we'd compute the wrong final lt hash value, since the initial state of this account was incorrect in the cache.

Here's an example:

  1. A bank calls freeze() and begins finishing up deferred changes to account state.
  2. One of ☝️ is distributing transaction fees, and that stores updates to account A.
  3. Concurrently, a transaction T is executing that includes an instruction that modifies account A.
  4. T loads A, and as part of loading A, we call inspect_account(). The value of A that gets loaded is after step (2). The modified A is added to the accounts lt hash cache.
  5. Transaction T finished executing and sees it is past the end of the block so it does not commit.
  6. When calculating the accounts lt hash, we see that A is a modified account in this slot because of step (2). As part of subtracting out the old state of account A, we look at the accounts lt hash cache and see there is an entry for A. However, this version of A is the same as the modified version, and not the old version of A. Thus, we calculate the accounts lt hash value.

Summary of Changes

Disallow adding entries to the accounts lt hash cache if freeze has started.

Note, I intend to backport this to v2.2.

Footnotes

  1. https://discord.com/channels/428295358100013066/838890116386521088/1375339076370300958

@brooksprumo brooksprumo self-assigned this May 23, 2025
@brooksprumo brooksprumo moved this to In Progress in Accounts Lt Hash May 23, 2025
@brooksprumo brooksprumo marked this pull request as ready for review May 23, 2025 22:05
@brooksprumo brooksprumo requested a review from jstarry May 23, 2025 22:05
@brooksprumo brooksprumo requested a review from HaoranYi May 23, 2025 22:06
@mergify
Copy link
Copy Markdown

mergify Bot commented May 23, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.7%. Comparing base (ae8fe77) to head (532b547).
⚠️ Report is 3319 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6307   +/-   ##
=======================================
  Coverage    82.7%    82.7%           
=======================================
  Files         845      845           
  Lines      377824   377847   +23     
=======================================
+ Hits       312782   312824   +42     
+ Misses      65042    65023   -19     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread runtime/src/bank/accounts_lt_hash.rs Outdated
Comment on lines +335 to +338
if self
.freeze_started_for_accounts_lt_hash
.load(Ordering::Acquire)
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, checking here isn't sufficient to guarantee freeze hasn't started before we modify the lt hash cache.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 013298a.

@brooksprumo brooksprumo marked this pull request as draft May 23, 2025 23:00
@brooksprumo brooksprumo removed the request for review from HaoranYi May 23, 2025 23:00
@HaoranYi
Copy link
Copy Markdown

HaoranYi commented May 24, 2025

image

since we don't do rent anymore, the buggy account is always self.collector_id, right? If so, can we just "always" remove it from the cache before we calculate the lthash?

diff --git a/runtime/src/bank/accounts_lt_hash.rs b/runtime/src/bank/accounts_lt_hash.rs
index 18450401eb..d7d28acee0 100644
--- a/runtime/src/bank/accounts_lt_hash.rs
+++ b/runtime/src/bank/accounts_lt_hash.rs
@@ -138,6 +138,9 @@ impl Bank {
             }
         }
 
+        // Remove the collector_id from the cache, as it is can be raced to put into the cache by uncommited transactions
+        self.cache_for_accounts_lt_hash.remove(&self.collector_id);
+
         let do_calculate_delta_lt_hash = || {
             // Work on chunks of 128 pubkeys, which is 4 KiB.
             // And 4 KiB is likely the smallest a real page size will be.

@HaoranYi
Copy link
Copy Markdown

The above patch can still have a race with TX when we compute the lthash.

Another idea is to blacklist self.collector_id from the cache.

diff --git a/runtime/src/bank/accounts_lt_hash.rs b/runtime/src/bank/accounts_lt_hash.rs
index 18450401eb..fb52612b10 100644
--- a/runtime/src/bank/accounts_lt_hash.rs
+++ b/runtime/src/bank/accounts_lt_hash.rs
@@ -152,10 +152,13 @@ impl Bank {
                     |mut accum, (pubkey, curr_account)| {
                         // load the initial state of the account
                         let (initial_state_of_account, measure_load) = meas_dur!({
-                            let cache_value = self
-                                .cache_for_accounts_lt_hash
-                                .get(pubkey)
-                                .map(|entry| entry.value().clone());
+                            let cache_value = if pubkey == &self.collector_id {
+                                None
+                            } else {
+                                self.cache_for_accounts_lt_hash
+                                    .get(pubkey)
+                                    .map(|entry| entry.value().clone())
+                            };

@HaoranYi HaoranYi self-requested a review May 24, 2025 01:42
@jstarry
Copy link
Copy Markdown

jstarry commented May 24, 2025

Another idea is to blacklist self.collector_id from the cache.

Not a bad idea but some future SIMD's that allow customizing where block fees are distributed will make this approach too complicated in my opinion

@brooksprumo
Copy link
Copy Markdown
Author

since we don't do rent anymore, the buggy account is always self.collector_id, right?

I'm not confident to say that self.collector_id is the only way to hit this issue. We'd have to audit all the functions that run during freeze and store accounts, and then ensure these never change and/or keep the lt hash code in sync. Seems prone to future errors/refactors 😬, so I'd prefer to not go that route 😸.

@brooksprumo
Copy link
Copy Markdown
Author

The above patch can still have a race with TX when we compute the lthash.

Correct. Fixed in 013298a.

@brooksprumo brooksprumo requested a review from jstarry May 24, 2025 17:44
@brooksprumo brooksprumo marked this pull request as ready for review May 24, 2025 17:47
HaoranYi
HaoranYi previously approved these changes May 25, 2025
Copy link
Copy Markdown

@HaoranYi HaoranYi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I like it.
holding the readlock for cache insert in tx execute prevent the race. it may slightly increase the time for tx execute. but since it is just read lock, no block for mutliple threads. shouldn't be much.

@brooksprumo
Copy link
Copy Markdown
Author

brooksprumo commented May 25, 2025

holding the readlock for cache insert in tx execute prevent the race. it may slightly increase the time for tx execute. but since it is just read lock, no block for mutliple threads. shouldn't be much.

Yeah, I think it should be fine grabbing the read lock here. We only ever grab the write lock during Bank::freeze(), so during inspect() the read lock atomic instructions should never have to do cache line invalidation; they should remain in a shared state, so should be fast.

jstarry
jstarry previously approved these changes May 26, 2025
Copy link
Copy Markdown

@jstarry jstarry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me as is, but we can do without the extra atomic

Comment thread runtime/src/bank/accounts_lt_hash.rs Outdated
Comment on lines +343 to +345
let has_freeze_started = self
.freeze_started_for_accounts_lt_hash
.load(Ordering::Relaxed);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need freeze_started_for_accounts_lt_hash if we are using self.freeze_lock() because we can check if the hash is not equal to Hash::default() to know if we read locked before or after freezing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. I could use freeze_started_for_accounts_lt_hash as an optimization: e.g. check the atomic bool first, and if true we can bail early, else grab the read lock and check again. I don't think that'll be a common case, so I don't think the added complexity is necessary.

I've removed freeze_started_for_accounts_lt_hash in 532b547.

@brooksprumo brooksprumo dismissed stale reviews from jstarry and HaoranYi via 532b547 May 26, 2025 21:29
@brooksprumo brooksprumo requested a review from HaoranYi May 26, 2025 21:32
@brooksprumo brooksprumo requested a review from jstarry May 26, 2025 21:32
Copy link
Copy Markdown

@HaoranYi HaoranYi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. The change is much simipler now. Change is contained in only one file. It should be much easier to backport.

Comment thread runtime/src/bank/accounts_lt_hash.rs Outdated
@brooksprumo brooksprumo merged commit 273f713 into anza-xyz:master May 27, 2025
47 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Accounts Lt Hash May 27, 2025
@brooksprumo brooksprumo deleted the lthash/freeze branch May 27, 2025 14:56
mergify Bot pushed a commit that referenced this pull request May 27, 2025
brooksprumo added a commit that referenced this pull request May 28, 2025
… (backport of #6307) (#6320)

* Do not insert into accounts lt hash cache if freeze has started (#6307)

(cherry picked from commit 273f713)

* fix build due to sdk dependency renames

---------

Co-authored-by: Brooks <brooks@anza.xyz>
nibty pushed a commit to x1-labs/tachyon that referenced this pull request Jun 11, 2025
… (backport of anza-xyz#6307) (anza-xyz#6320)

* Do not insert into accounts lt hash cache if freeze has started (anza-xyz#6307)

(cherry picked from commit 273f713)

* fix build due to sdk dependency renames

---------

Co-authored-by: Brooks <brooks@anza.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants